# ClimateHack.AI 2023: Data Exploration

Thank you for participating in ClimateHack.AI 2023! 

Your contributions could help cut carbon emissions by up to 100 kilotonnes per year in Great Britain alone. We look forward to seeing what you build over the course of the competition!

As with any machine learning task, the best place to start is by inspecting the data available, and for this competition, we are spoiled for choice!

You do not have to use all of the data for this challenge (and in fact, you probably shouldn't!). It is up to you to decide creatively which data sources you actually want to use and train on!

## Prerequisites

If you do not have the following Python packages installed, uncomment and run the line below to install them with `pip`.

In [None]:
# %pip install numpy matplotlib zarr xarray ipykernel gcsfs fsspec dask cartopy ocf-blosc2 doxa-cli

## Importing packages

In [None]:
from datetime import datetime, time, timedelta

import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
# Imported for its side effect: makes the Blosc2 codec available when decoding the Zarr data
from ocf_blosc2 import Blosc2

plt.rcParams["figure.figsize"] = (20, 12)

## HRV Satellite Imagery

One benefit of the Zarr format is that Zarr datasets can be streamed straight from the cloud. While streaming will most likely not be fast enough for training, it lets us do some initial data exploration without having to download entire months of data.

In [None]:
hrv = xr.open_dataset(
    "zip:///::https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main/satellite-hrv/2020/7.zarr.zip",
    engine="zarr",
    consolidated=True,
)

hrv
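
The URL in the cell above appears to follow a `<dataset>/<year>/<month>.zarr.zip` layout, which is also the layout used for the other datasets in this notebook. As a minimal sketch, assuming that layout holds for other months (this notebook only confirms July 2020), a small helper could build the URL for any month. The names `BASE_URL` and `month_url` are hypothetical and not part of any competition tooling.

```python
BASE_URL = "https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main"


def month_url(dataset: str, year: int, month: int) -> str:
    """Build the chained URL for one month of one dataset (assumed layout)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # The "zip:///::" prefix mirrors the URLs used elsewhere in this notebook
    return f"zip:///::{BASE_URL}/{dataset}/{year}/{month}.zarr.zip"


print(month_url("satellite-hrv", 2020, 7))
```

The result can be passed to `xr.open_dataset(..., engine="zarr", consolidated=True)` in the same way as the hard-coded URL.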

We can use the `.plot()` method to see what the HRV data looks like at a particular moment in time.

In [None]:
hrv["data"].sel(time="2020-07-20 10:00").plot()  # type: ignore
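
Selecting a single timestamp generalises to selecting a window of frames. The sketch below uses a small synthetic array in place of `hrv["data"]` (the 5-minute spacing and the grid size are assumptions made for illustration) to show label-based slicing and nearest-timestamp lookup with `.sel()`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the HRV array: 2 hours of 5-minute frames on a tiny grid
times = pd.date_range("2020-07-20 09:00", periods=24, freq="5min")
frames = xr.DataArray(
    np.random.default_rng(0).random((24, 4, 6)),
    dims=("time", "y_geostationary", "x_geostationary"),
    coords={"time": times},
    name="data",
)

# Label-based slices include both endpoints, so this selects 10:00 to 10:55
window = frames.sel(time=slice("2020-07-20 10:00", "2020-07-20 10:55"))

# method="nearest" picks the closest available timestamp to the requested one
nearest = frames.sel(time=pd.Timestamp("2020-07-20 10:02"), method="nearest")

print(window.sizes["time"])
print(nearest.time.values)
```

The same `.sel()` calls should apply to the real dataset, provided its `time` coordinate is a sorted datetime index.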

A slightly more advanced version of this allows us to draw coastlines on top of the data.

In [None]:
axes = plt.axes(projection=ccrs.Geostationary(central_longitude=9.5))

hrv["data"].sel(time="2020-07-20 10:00", channel="HRV").plot.pcolormesh(
    ax=axes,
    transform=ccrs.Geostationary(central_longitude=9.5),
    x="x_geostationary",
    y="y_geostationary",
    add_colorbar=False,
)  # type: ignore

axes.coastlines()

## Non-HRV Satellite Imagery

We can also perform something similar for the non-HRV satellite imagery data.

In [None]:
nonhrv = xr.open_dataset(
    "zip:///::https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main/satellite-nonhrv/2020/7.zarr.zip",
    engine="zarr",
    consolidated=True,
)

nonhrv

Notice how the non-HRV satellite imagery data is composed of 11 different channels:

In [None]:
nonhrv.channel

We can select one of these channels (in this case, an infrared one) and plot it in a similar way to the previous example involving HRV data.

In [None]:
nonhrv["data"].sel(time="2020-07-20 10:00", channel="IR_016").plot()  # type: ignore
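
Several channels can be selected at once by passing a list to `.sel()`. The following is a sketch on a tiny synthetic array rather than the real `nonhrv` data: the three channel names, the dimension order and the grid size are chosen for illustration only.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in: 3 channels on a 2x2 grid (channel names are illustrative)
channels = ["IR_016", "IR_039", "VIS006"]
image = xr.DataArray(
    np.arange(2 * 2 * 3, dtype=np.float32).reshape(2, 2, 3),
    dims=("y_geostationary", "x_geostationary", "channel"),
    coords={"channel": channels},
)

# Passing a list keeps the channel dimension, with channels in the requested order
subset = image.sel(channel=["VIS006", "IR_016"])

# Move channel first to get a (channel, y, x) NumPy array
stacked = subset.transpose("channel", "y_geostationary", "x_geostationary").values

print(stacked.shape)
```

A channel-first layout is one common choice for model inputs; whether it suits your model is up to you.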

## Weather Forecasts

We can also look at the weather forecast dataset by loading and visualising it in a very similar way!

As the dataset summary below shows, this dataset is composed of 38 different data variables (many of which correspond to different altitudes), covering ground temperatures, total precipitation and more. For further information on each of these data variables, check out the data section on the [ClimateHack.AI 2023 competition page](https://doxaai.com/competition/climatehackai-2023/overview).

In [None]:
nwp = xr.open_dataset(
    "zip:///::https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main/weather/2020/7.zarr.zip",
    engine="zarr",
    consolidated=True,
)

nwp

### Ground temperatures

Just as with the satellite imagery data, we can also plot individual data variables in the weather forecast dataset. Here, `t_g` corresponds to ground-level temperatures in Kelvin (which we convert to Celsius in the visualisation below).

In [None]:
axes = plt.axes(projection=ccrs.PlateCarree())

(nwp["t_g"].sel(time="2020-07-20 10:00") - 273.15).plot.pcolormesh(
    ax=axes,
    transform=ccrs.PlateCarree(),
    x="longitude",
    y="latitude",
    add_colorbar=True,
    cmap="coolwarm",
)  # type: ignore

axes.coastlines()

### Cloud cover

Similarly, we can also look at total cloud cover forecasts (`clct`).

In [None]:
axes = plt.axes(projection=ccrs.PlateCarree())

nwp["clct"].sel(time="2020-07-20 10:00").plot.pcolormesh(
    ax=axes,
    transform=ccrs.PlateCarree(),
    x="longitude",
    y="latitude",
    add_colorbar=True,
)  # type: ignore

axes.coastlines()

### All weather variables

Here are all the weather variables available in this dataset.

In [None]:
nrows = 8
ncols = 5

fig, axes = plt.subplots(
    nrows=nrows,
    ncols=ncols,
    figsize=(10, 20),
    subplot_kw={"projection": ccrs.PlateCarree()},
)

for i, var in enumerate(nwp.data_vars):
    # Pick the subplot for this variable, filling the grid row by row
    ax = axes[i // ncols][i % ncols]

    nwp[var].sel(time="2020-07-20 10:00").plot.pcolormesh(
        ax=ax,
        transform=ccrs.PlateCarree(),
        x="longitude",
        y="latitude",
        add_colorbar=False,
        # Diverging colours for variables prefixed t, v or u; viridis otherwise
        cmap="coolwarm" if var.split("_")[0] in ("t", "v", "u") else "viridis",
    )

    ax.coastlines()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    ax.set_title(var)

fig.tight_layout()
fig.subplots_adjust(wspace=0.1, hspace=0.1)
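
The grid above hard-codes 8 rows and 5 columns for 38 variables. As a small sketch, the row count and the number of leftover panels can instead be derived from the number of variables. The helper name `grid_shape` is hypothetical.

```python
import math


def grid_shape(n_items: int, ncols: int) -> tuple[int, int]:
    """Return (nrows, n_unused) for laying out n_items panels in ncols columns."""
    nrows = math.ceil(n_items / ncols)
    return nrows, nrows * ncols - n_items


nrows, n_unused = grid_shape(38, 5)
print(nrows, n_unused)
```

With 38 variables and 5 columns this gives 8 rows and 2 unused panels, which could be hidden by calling `set_visible(False)` on the corresponding axes.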

## Air Quality Forecasts

Finally, we can also explore the ECMWF CAMS air quality forecast dataset, which contains a number of data variables related to aerosols in the atmosphere at 8 different levels. There is a lot of aerosol data available, so if you are interested in using the aerosol data as part of your submission, it is worth spending some time to get familiar with the data and figure out which data variables are actually useful to you. For example, not all aerosol types are found in large concentrations over Great Britain, which is our area of interest. 

In [None]:
aerosols = xr.open_dataset(
    "zip:///::https://huggingface.co/datasets/climatehackai/climatehackai-2023/resolve/main/aerosols/2020/7.zarr.zip",
    engine="zarr",
    consolidated=True,
)

aerosols

In [None]:
aerosols.level

In [None]:
axes = plt.axes(projection=ccrs.PlateCarree())

aerosols["pm10_conc"].sel(time="2020-07-20 10:00", level=1000).plot.pcolormesh(
    ax=axes,
    transform=ccrs.PlateCarree(),
    x="longitude",
    y="latitude",
    add_colorbar=True,
)  # type: ignore

axes.coastlines()

In [None]:
fig, axes = plt.subplots(
    nrows=len(aerosols.data_vars),
    ncols=len(aerosols.level),
    figsize=(15, 28),
    subplot_kw={"projection": ccrs.PlateCarree()},
)

for i, var in enumerate(aerosols.data_vars):
    for j, level in enumerate(aerosols.level):
        aerosols[var].sel(time="2020-07-20 10:00", level=level).plot.pcolormesh(
            ax=axes[i][j],
            transform=ccrs.PlateCarree(),
            x="longitude",
            y="latitude",
            add_colorbar=False,
            cmap="viridis",
        )

        axes[i][j].coastlines()
        axes[i][j].get_xaxis().set_visible(False)
        axes[i][j].get_yaxis().set_visible(False)
        axes[i][j].set_title(f"{var} ({int(level)}m)")

fig.tight_layout()
fig.subplots_adjust(wspace=0.1, hspace=0.1)
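
One way to start narrowing down the aerosol variables is to summarise each one numerically before plotting. The sketch below does this on a synthetic dataset: the variable name `dust`, all of the values and the grid size are made up for illustration, and only `pm10_conc` is a name taken from this notebook.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the aerosol dataset; names and values are illustrative
rng = np.random.default_rng(0)
dims = ("level", "latitude", "longitude")
shape = (2, 3, 4)
toy = xr.Dataset(
    {
        "pm10_conc": (dims, rng.random(shape) * 20.0 + 5.0),
        "dust": (dims, rng.random(shape) * 0.1),
    },
    coords={"level": [0, 1000]},
)

# Mean and maximum of each variable over every dimension
summary = {
    name: (float(toy[name].mean()), float(toy[name].max()))
    for name in toy.data_vars
}

# List variables from highest to lowest mean value
for name, (mean, peak) in sorted(
    summary.items(), key=lambda kv: kv[1][0], reverse=True
):
    print(f"{name}: mean={mean:.3f}, max={peak:.3f}")
```

Raw means are only comparable between variables that share units, so a summary like this is a starting point for inspection rather than a ranking of usefulness.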