# CliMetLab ML Data Fetcher (Ireland-focused)
This notebook gives students simple, reusable functions to download weather/climate data using **CliMetLab** and prepare it for **PyTorch** workflows.

### What you can do
- **Time series at a single location** (weather station or grid point) for one or more variables.
- **Time series at multiple locations** for Graph Neural Networks (returns node features and graph edges).
- **Regional grids over Ireland** (bounding box or domain name) for quick experiments.

> **Note:** To retrieve ERA5 via the **Copernicus Climate Data Store (CDS)** you must have a CDS account and a `~/.cdsapirc` file set up on the machine that runs this notebook.


### References
- CliMetLab datasets & examples (ERA5, Ireland):  
  - *ERA5-based datasets* (includes **`domain="Ireland"`** and variable-specific datasets like `era5-precipitations`).  
  - *Retrieve ERA5 data from the CDS* (shows **`load_source("cds", "reanalysis-era5-single-levels", variable=[...], area=[N,W,S,E])`** and `.to_xarray()`).  
- Met Éireann station metadata (lat/lon) is openly available (CSV).  

These links are included in the teaching materials; you don't need to open them to run the notebook.


## 0) Setup
Run this once per environment. If you're on a lab machine, you may already have these.
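
Before the first ERA5 request, it can help to confirm that the CDS credentials file mentioned above is in place. The following is a minimal sketch, assuming the usual two-line `url:` / `key:` layout of `~/.cdsapirc`; the helper name `cds_credentials_present` is hypothetical and not part of CliMetLab.

```python
from pathlib import Path

def cds_credentials_present(path=None):
    """Return True if the CDS config file exists and has 'url' and 'key' entries."""
    rc = Path(path) if path is not None else Path.home() / ".cdsapirc"
    if not rc.is_file():
        return False
    # The file is a short list of 'name: value' lines
    names = {line.split(":", 1)[0].strip()
             for line in rc.read_text().splitlines() if ":" in line}
    return {"url", "key"} <= names

print("CDS credentials found:", cds_credentials_present())
```

This only checks that the file looks plausible; it does not verify that the key is valid.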


In [None]:
# If needed, uncomment to install
# %pip install -q climetlab xarray pandas numpy scikit-learn

import os
import math
import json
from datetime import datetime
from typing import List, Tuple, Dict, Optional

import numpy as np
import pandas as pd

import climetlab as cml
import xarray as xr

from math import radians, sin, cos, asin, sqrt

print("CliMetLab version:", cml.__version__)

## 1) Small geo helpers
We keep dependencies minimal and implement a tiny Haversine + kNN to avoid heavy GIS installs.

In [None]:
def haversine_km(lat1, lon1, lat2, lon2):
    '''Great-circle distance (kilometres) between two points.'''
    # Convert decimal degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    R = 6371.0088  # mean Earth radius in km
    return R * c

def bounding_box(points: List[Tuple[float, float]], pad_deg: float = 0.5):
    '''Return (N, W, S, E) with optional margin in degrees.'''
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    N = max(lats) + pad_deg
    S = min(lats) - pad_deg
    W = min(lons) - pad_deg
    E = max(lons) + pad_deg
    return [N, W, S, E]

def knn_edges(coords: List[Tuple[float, float]], k: int = 2):
    '''Return undirected edges (i,j) as a 2xM numpy array for PyTorch Geometric style graphs.'''
    n = len(coords)
    edges = set()
    for i in range(n):
        dists = []
        for j in range(n):
            if i == j: 
                continue
            d = haversine_km(coords[i][0], coords[i][1], coords[j][0], coords[j][1])
            dists.append((d, j))
        dists.sort(key=lambda x: x[0])
        for _, j in dists[:k]:
            edges.add((i, j))
            edges.add((j, i))  # make it undirected by doubling
    if not edges:
        return np.zeros((2,0), dtype=np.int64)
    edges_arr = np.array(sorted(edges), dtype=np.int64).T  # shape (2,M)
    return edges_arr
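
As a quick sanity check of the distance logic, the sketch below restates the Haversine formula in compact form (so the block runs on its own) and finds the nearest other airport for the three locations used later in this notebook. This is the `k=1` case of the kNN idea; the name `great_circle_km` is illustrative only.

```python
from math import radians, sin, cos, asin, sqrt

def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0088 * asin(sqrt(h))

airports = {
    "Dublin": (53.421, -6.270),
    "Shannon": (52.702, -8.924),
    "Cork": (51.841, -8.491),
}

# Nearest other airport for each airport (the k=1 case of kNN)
nearest = {
    name: min((other for other in airports if other != name),
              key=lambda other: great_circle_km(airports[name], airports[other]))
    for name in airports
}
print(nearest)
print(round(great_circle_km(airports["Dublin"], airports["Cork"])), "km Dublin to Cork")
```

Because nearest-neighbour relations are not always mutual, adding both `(i, j)` and `(j, i)` is what makes the resulting graph undirected.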

## 2) Core loaders (CliMetLab)
Two pathways:

1. **Dataset shortcuts** like `era5-temperature` or `era5-precipitations` with `domain="Ireland"` for quick demos.
2. **Direct CDS access** via `load_source("cds", "reanalysis-era5-single-levels", ...)` for multi-variable requests and precise spatial/temporal control.

All loaders return **xarray** objects, which are easy to convert to NumPy for PyTorch.


In [None]:
def load_era5_dataset_shortcut(dataset_name: str,
                               period: Tuple[int,int],
                               domain: str = "Ireland",
                               time: int = 12):
    '''Load ERA5-based dataset shortcut (e.g., 'era5-temperature', 'era5-precipitations') for a named domain (e.g., 'Ireland') and time-of-day (UTC hour). Returns an xarray.Dataset.'''
    ds = cml.load_dataset(dataset_name, period=period, domain=domain, time=time)
    return ds.to_xarray()  # unify to xarray

In [None]:
def load_era5_cds(variables: List[str],
                  date_from: str,
                  date_to: str,
                  hours: List[str],
                  area: Optional[List[float]] = None,
                  fmt: str = "netcdf"):
    '''Load ERA5 (single levels) from CDS via CliMetLab.
- variables: CDS variable short codes, e.g. ['2t','msl','10u','10v'].
- date_from/date_to: 'YYYY-MM-DD' strings.
- hours: list of 'HH:MM' strings in UTC.
- area: [N, W, S, E] bounding box. If None, global.
- fmt: usually 'netcdf' for easy xarray.

Returns xarray.Dataset.'''
    request = dict(
        variable=variables,
        product_type="reanalysis",
        date=f"{date_from}/to/{date_to}",
        time=hours,
        format=fmt,
    )
    # Only send 'area' when a bounding box is given; leaving it out requests the global grid
    if area is not None:
        request["area"] = area
    source = cml.load_source("cds", "reanalysis-era5-single-levels", **request)
    return source.to_xarray()
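
Since a CDS request can sit in a queue for a while, it may be worth checking the arguments locally first. The sketch below is one possible pre-flight check, using only the standard library; `check_era5_request` is a hypothetical helper and the Ireland-sized bounding box is an illustrative value, not one taken from CliMetLab.

```python
from datetime import datetime

def check_era5_request(date_from, date_to, hours, area=None):
    """Raise ValueError on obviously malformed arguments; return the number of time steps."""
    start = datetime.strptime(date_from, "%Y-%m-%d")
    end = datetime.strptime(date_to, "%Y-%m-%d")
    if end < start:
        raise ValueError("date_to is earlier than date_from")
    for h in hours:
        datetime.strptime(h, "%H:%M")  # raises ValueError on e.g. '25:00'
    if area is not None:
        north, west, south, east = area
        if not (-90 <= south < north <= 90):
            raise ValueError("area must be [N, W, S, E] with S < N")
        if not west < east:
            raise ValueError("area must be [N, W, S, E] with W < E")
    n_days = (end - start).days + 1
    return n_days * len(hours)

n_steps = check_era5_request(
    "2020-01-01", "2020-01-02",
    [f"{h:02d}:00" for h in range(0, 24, 6)],
    area=[55.5, -11.0, 51.0, -5.0],  # illustrative box around Ireland
)
print("time steps requested:", n_steps)
```

The returned count gives a rough feel for request size before anything is downloaded.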

## 3) Single-point time series (station / grid point)
Convenience wrapper that downloads a **small regional subset** and then samples the **nearest grid cell** to the target point.
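
To get a feel for what "nearest grid cell" means, here is a small sketch that snaps a coordinate to a 0.25° grid, the native resolution of ERA5 single-level fields. It assumes grid nodes sit on exact multiples of 0.25°; a subset returned for an arbitrary bounding box may be offset from that, so treat the result as indicative. `snap_to_grid` is a hypothetical helper.

```python
def snap_to_grid(value, step=0.25):
    """Round a coordinate in degrees to the nearest multiple of `step`."""
    return round(value / step) * step

dublin = (53.421, -6.270)
cell = (snap_to_grid(dublin[0]), snap_to_grid(dublin[1]))
print(cell)
```

In practice `ds.sel(..., method="nearest")` does this lookup against the dataset's real coordinates, so the sampled cell can be up to half a grid step away from the requested point in each direction.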


In [None]:
def era5_point_timeseries(variables: List[str],
                          lat: float, lon: float,
                          date_from: str, date_to: str,
                          hours: Optional[List[str]] = None,
                          pad_deg: float = 0.5):
    '''Return a tidy (time x features) DataFrame for a single (lat, lon).
- Downloads a small bounding box around the point for efficiency,
  then selects the nearest grid cell.
- hours defaults to all 24 UTC hours.
- Columns use the dataset's own variable names, which can differ from the
  request codes (ERA5 NetCDF files typically store '2t' as 't2m').'''
    if hours is None:
        hours = [f"{h:02d}:00" for h in range(24)]
    area = bounding_box([(lat, lon)], pad_deg=pad_deg)
    ds = load_era5_cds(variables, date_from, date_to, hours, area=area)
    # ERA5 coordinates are (time, latitude, longitude)
    p = ds.sel(latitude=lat, longitude=lon, method="nearest")
    # Convert to DataFrame with time index
    df = p.to_dataframe().reset_index().set_index("time").sort_index()
    # Keep only the data variables (drops coordinate columns such as latitude/longitude)
    return df[list(p.data_vars)]

## 4) Multi-point time series (for GNNs)
Fetches a bounding box that covers **all points**, samples each location, and returns:
- `features_np`: NumPy array shaped **[time, nodes, features]** (easy to convert to torch).
- `times`: the `pandas.DatetimeIndex` matching the first axis of `features_np`.
- `edge_index`: 2×M array of integer edges using simple **kNN** on great-circle distance.
- `coords`: the input (lat, lon) pairs in node order, not the coordinates of the sampled grid cells.

In [None]:
def era5_multipoint_timeseries(variables: List[str],
                               coords: List[Tuple[float, float]],
                               date_from: str, date_to: str,
                               hours: Optional[List[str]] = None,
                               pad_deg: float = 0.5,
                               knn_k: int = 2):
    '''For a list of (lat, lon) pairs, return:
  - features_np: numpy array [T, N, F]
  - times: pandas.DatetimeIndex
  - edge_index: numpy array [2, M] (undirected, doubled edges)
  - coords_out: list of coords in same order
The feature axis follows the dataset's own variable order and names, which can
differ from the request codes (ERA5 NetCDF files typically store '2t' as 't2m').'''
    if hours is None:
        hours = [f"{h:02d}:00" for h in range(24)]
    area = bounding_box(coords, pad_deg=pad_deg)
    ds = load_era5_cds(variables, date_from, date_to, hours, area=area)
    # Use the variable names actually present in the dataset
    cols = list(ds.data_vars)
    # Build one (time x features) frame per node
    node_frames = []
    for (lat, lon) in coords:
        p = ds.sel(latitude=lat, longitude=lon, method="nearest")
        df = p.to_dataframe().reset_index().set_index("time").sort_index()
        node_frames.append(df[cols])
    # Align on time; columns become a (node, variable) MultiIndex
    aligned = pd.concat(node_frames, axis=1, keys=range(len(coords)))
    times = aligned.index
    N = len(coords)
    F = len(cols)
    # Put columns in node-major order, then reshape to [T, N, F]
    col_order = [(n, c) for n in range(N) for c in cols]
    aligned = aligned[col_order]
    features_np = aligned.to_numpy().reshape(len(times), N, F)
    # kNN graph from input coords
    edge_index = knn_edges(coords, k=knn_k)
    return features_np, times, edge_index, coords
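
The reshape step relies on the flat table having its columns in node-major order. The toy sketch below, with made-up sizes and integer placeholders instead of weather values, shows how a `[T, N*F]` table maps onto a `[T, N, F]` array.

```python
import numpy as np

T, N, F = 2, 3, 2  # toy sizes: 2 time steps, 3 nodes, 2 features
# One row per time step; columns ordered (n0,f0), (n0,f1), (n1,f0), ...
flat = np.arange(T * N * F).reshape(T, N * F)
cube = flat.reshape(T, N, F)

# Feature f of node n at time t sits in column n * F + f of the flat table
t, n, f = 1, 2, 0
assert cube[t, n, f] == flat[t, n * F + f]
print(cube.shape)
```

If the columns were ordered feature-major instead, the same reshape would silently mix nodes and features, which is why the column order is fixed before reshaping.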

## 5) Quick regional grid over Ireland (dataset shortcut)
For fast demos, you can use premade dataset shortcuts (no variable lists), e.g. `era5-precipitations` with `domain="Ireland"`.

In [None]:
def load_ireland_precip_demo(period=(1979, 1982), time=12):
    '''Return an xarray.Dataset for ERA5 precipitation over Ireland at one UTC hour across years.'''
    x = load_era5_dataset_shortcut("era5-precipitations", period=period, domain="Ireland", time=time)
    return x  # xarray.Dataset

## 6) PyTorch-ready outputs
We avoid importing PyTorch in this notebook. Downstream model code can convert the arrays with:
```python
import torch
x = torch.from_numpy(features_np).float()  # shape [T, N, F]
edge_index = torch.as_tensor(edge_index, dtype=torch.long)  # shape [2, M]
```


## 7) Worked Irish examples
We use three well-known Irish airport locations for demonstration:
- Dublin Airport (53.421, -6.270)
- Shannon Airport (52.702, -8.924)
- Cork Airport (51.841, -8.491)

**Variables:** 2 m temperature (`2t`) and mean sea level pressure (`msl`).  
**Period:** a short two-day window to keep downloads small.


In [None]:
# Example 7.1: Single-point time series at Dublin Airport
variables = ["2t", "msl"]
dublin = (53.421, -6.270)
df_dub = era5_point_timeseries(variables, dublin[0], dublin[1],
                               date_from="2020-01-01", date_to="2020-01-02",
                               hours=[f"{h:02d}:00" for h in range(0, 24, 3)])  # 3-hourly
df_dub.head()

In [None]:
# Example 7.2: Multi-point time series for a tiny GNN over 3 Irish airports
coords = [dublin, (52.702, -8.924), (51.841, -8.491)]  # Dublin, Shannon, Cork
features_np, times, edge_index, coords_out = era5_multipoint_timeseries(
    variables=variables,
    coords=coords,
    date_from="2020-01-01",
    date_to="2020-01-02",
    hours=[f"{h:02d}:00" for h in range(0, 24, 6)],  # 6-hourly
    pad_deg=0.75,
    knn_k=2,
)

print("features_np shape [T,N,F]:", features_np.shape)
print("First timestamp:", times[0])
print("Edge index shape [2,M]:", edge_index.shape)
print(edge_index)
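
It is useful to know what shapes to expect before running the download. The sketch below works them out from the request alone, assuming every requested time step is returned; `expected_shapes` is a hypothetical helper. The edge count is given as a range because doubled kNN edges can coincide.

```python
from datetime import date

def expected_shapes(date_from, date_to, n_hours, n_nodes, n_features, k):
    """Predict the features shape and the possible range of edge counts."""
    n_days = (date.fromisoformat(date_to) - date.fromisoformat(date_from)).days + 1
    n_steps = n_days * n_hours
    k_eff = min(k, n_nodes - 1)   # a node has at most n_nodes - 1 neighbours
    fewest = n_nodes * k_eff      # every reverse edge was already present
    most = min(2 * n_nodes * k_eff, n_nodes * (n_nodes - 1))
    return (n_steps, n_nodes, n_features), (fewest, most)

shape, edge_range = expected_shapes("2020-01-01", "2020-01-02",
                                    n_hours=4, n_nodes=3, n_features=2, k=2)
print(shape, edge_range)
```

For Example 7.2 this gives 8 time steps (2 days at 6-hourly sampling) and exactly 6 directed edges, since with three nodes and `knn_k=2` every node is linked to both others.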

## 8) Optional: Get Irish station metadata (lat/lon)
If you wish to use **observational stations** rather than model/reanalysis grid points,
Met Éireann publish a CSV of station details (including latitude/longitude). Example:
```python
url = "http://cli.fusio.net/cli/climate_data/webdata/StationDetails.csv"
stations = pd.read_csv(url)
stations[['StationNumber','Name','Latitude','Longitude','County']].head()
```
You can then pass those coordinates into `era5_multipoint_timeseries` to extract **ERA5 at station locations**.
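
As a sketch of that last step, the block below turns station rows into the `(lat, lon)` list that `era5_multipoint_timeseries` expects. It uses a small made-up sample (reusing the airport coordinates from section 7) rather than the real file, and it assumes the `Latitude` and `Longitude` column names shown above; the real CSV layout may differ. `coords_from_rows` is a hypothetical helper.

```python
import csv
import io

# Made-up sample with the column names from the snippet above
sample_csv = """Name,Latitude,Longitude
Dublin Airport,53.421,-6.270
Shannon Airport,52.702,-8.924
Cork Airport,51.841,-8.491
"""

def coords_from_rows(rows, lat_col="Latitude", lon_col="Longitude"):
    """Turn CSV rows into a list of (lat, lon) float pairs, skipping bad rows."""
    out = []
    for row in rows:
        try:
            out.append((float(row[lat_col]), float(row[lon_col])))
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric coordinate
    return out

station_coords = coords_from_rows(csv.DictReader(io.StringIO(sample_csv)))
print(station_coords)
```

With a pandas DataFrame the equivalent is `list(zip(stations["Latitude"], stations["Longitude"]))` after dropping rows with missing values.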


## 9) Tips & troubleshooting
- Ensure your CDS key is configured in `~/.cdsapirc`. If you get authentication errors, visit the CDS website to create an API key.
- If downloads are slow, reduce the time window or number of variables; ERA5 files are large.
- Use `area=[N,W,S,E]` to keep spatial subsets small.
- Convert Kelvin to °C by subtracting 273.15 after download if needed.
- For graphs, adjust `knn_k` to control neighborhood size.
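
The Kelvin conversion from the tips above can be sketched as follows, using plain Python lists so the block runs without any download. The sample values are placeholders, not ERA5 output.

```python
KELVIN_OFFSET = 273.15

def kelvin_to_celsius(values):
    """Convert an iterable of temperatures in kelvin to degrees Celsius."""
    return [v - KELVIN_OFFSET for v in values]

print(kelvin_to_celsius([273.15, 283.15]))
```

On a DataFrame or NumPy array the same subtraction applies element-wise to the temperature column or feature slice.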
