# Case Study 4 - Validating gridded data products
## Description 
As a user of spatial data products (satellite or modelled), I want to compare high-quality ground-based data from multiple sites with the product, so that I can assess its precision and accuracy for estimating the same variables at other sites.
## Case Breakdown 
- **Actors:** Gridded Data User
- **Goals:** Finding correlations between gridded data and ground-based data
- **Scope:** National, point-based
## Generalised case
I want to compare measurements from a gridded data product with actual on-ground measurements from different sites and report mean and standard deviation for the error.
## Comparable cases
- I want to compare local weather station values at APPN and/or TERN sites with the associated daily measurements from national weather datasets (BOM).
## Stakeholders 
- **Name:** Donald Hobern
- **Contact:** donald.hobern@adelaide.edu.au


## Data Sources
The case study uses national weather data products from the Bureau of Meteorology (BOM) for daily mean maximum/minimum temperature, accessible from http://www.bom.gov.au/jsp/awap/temp/index.jsp. Seven daily maximum and seven daily minimum temperature grids were downloaded for the dates 7 to 13 April 2025 inclusive. These are available in the source_data folder in the downloaded ASCII grid format (\*.grid). The data are loaded into the data cube as WGS84 GeoTIFF files. To avoid extra dependencies in this notebook, the grids have already been converted using QGIS Desktop, and the converted files are also included in the source_data folder (\*.tiff).

Comparison data for maximum and minimum air temperature were downloaded for all public weather stations in Western Australia from https://weather.agric.wa.gov.au/ for the 10-day period 4 to 13 April 2025. These are included in source_data as CSV files. The CSV downloads do not include the coordinates of the weather stations, so the coordinates were downloaded separately via the https://api.agric.wa.gov.au/v2/weather/openapi/#/Stations/getStations API method and are included in source_data as DPIRD_weather_stations.json.

## Imports

In [None]:
import xarray as xr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path 
import shutil
import subprocess
import json
import os

from stac_generator.factory import StacGeneratorFactory
from stac_generator.core.base.generator import StacSerialiser
from stac_generator.core.base.schema import StacCollectionConfig, ColumnInfo
from stac_generator.core.raster.schema import RasterConfig, BandInfo
from stac_generator.core.vector.schema import VectorConfig
from stac_generator.core.point.schema import PointConfig

from mccn.client import MCCN

from xarray.groupers import TimeResampler

## Data Paths

In [None]:
# Paths to current folder and scratch folder for working files
current_folder = Path.cwd()
source_folder = current_folder / "source_data"
scratch_folder = current_folder/"scratch"
if not scratch_folder.exists():
    scratch_folder.mkdir()


# Paths to data from weather stations
weather_stations = source_folder/ "DPIRD_weather_stations.json"
weather_maxima_source = source_folder/ "10DAY_MAX_AIRTEMPERATURE_20250414162054.csv"
weather_minima_source = source_folder/ "10DAY_MIN_AIRTEMPERATURE_20250414162111.csv"

# Paths for outputs merging coordinates and CSV data for weather stations
weather_maxima = scratch_folder/ "weather_maximum_readings.csv"
weather_minima = scratch_folder/ "weather_minimum_readings.csv"

# Dictionaries mapping file name to path for the GeoTIFFs from BOM data
maxima_layers = {f.name: f for f in source_folder.iterdir() if not f.is_dir() and f.name.startswith("mean_max") and f.name.endswith(".tiff")}
minima_layers = {f.name: f for f in source_folder.iterdir() if not f.is_dir() and f.name.startswith("mean_min") and f.name.endswith(".tiff")}

## Prepare weather station data
Read coordinates from JSON weather station metadata. Join these coordinates with the maximum and minimum air temperature readings. Drop values for the three dates preceding the BOM layers. Reshape the data to one value per row. Save using the output paths.

In [None]:
# Get station coordinates as dataframe
station_columns = ["Name", "Latitude", "Longitude"]
stations = []
with open(weather_stations) as stations_file:
    stations_data = json.load(stations_file)
    for s in stations_data["collection"]:
        stations.append({"Name": f"{s['stationName']} ({s['stationCode']})", "Latitude": s["latitude"], "Longitude": s["longitude"]})
df_station = pd.DataFrame(columns=station_columns, data=stations)

# Process maximum air temperature data
df_maxima = pd.merge(df_station, pd.read_csv(weather_maxima_source), how="left", on="Name")
df_maxima = df_maxima.drop(columns=["04/04", "05/04", "06/04"])
df_maxima = df_maxima.rename(columns={c: f"2025-{c[3:]}-{c[0:2]}T12:00:00Z" for c in df_maxima.columns if c.endswith("04")})
df_maxima = pd.melt(df_maxima, id_vars=["Name", "Latitude", "Longitude"], var_name="Date", value_name="MaxTemp").reset_index()
df_maxima.to_csv(weather_maxima, index=False)

# Process minimum air temperature data
df_minima = pd.merge(df_station, pd.read_csv(weather_minima_source), how="left", on="Name")
df_minima = df_minima.drop(columns=["04/04", "05/04", "06/04"])
df_minima = df_minima.rename(columns={c: f"2025-{c[3:]}-{c[0:2]}T12:00:00Z" for c in df_minima.columns if c.endswith("04")})
df_minima = pd.melt(df_minima, id_vars=["Name", "Latitude", "Longitude"], var_name="Date", value_name="MinTemp").reset_index()
# Unavailable values are mostly represented by empty strings, but sometimes by a hyphen.
df_minima = df_minima.drop(df_minima[df_minima["MinTemp"] == "-"].index)
df_minima.to_csv(weather_minima, index=False)
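
The reshaping step above can be illustrated on a toy table. This is a minimal sketch rather than the notebook's own processing: the station names, coordinates and readings are invented, and missing readings are dropped with `dropna`, which is an assumption about how gaps should be handled.

```python
import pandas as pd

# Toy wide-format table: one row per station, one column per day (DD/MM).
# Station names, coordinates and readings are invented for illustration.
wide = pd.DataFrame(
    {
        "Name": ["Station A (AA)", "Station B (BB)"],
        "Latitude": [-31.9, -33.6],
        "Longitude": [115.9, 121.9],
        "07/04": [24.1, 21.3],
        "08/04": [25.0, None],
    }
)

# Rename DD/MM columns to ISO 8601 timestamps at midday UTC.
date_columns = [c for c in wide.columns if c.endswith("/04")]
wide = wide.rename(columns={c: f"2025-{c[3:]}-{c[:2]}T12:00:00Z" for c in date_columns})

# Reshape to long format: one row per station per date.
long = pd.melt(
    wide,
    id_vars=["Name", "Latitude", "Longitude"],
    var_name="Date",
    value_name="MaxTemp",
)

# Drop rows with no reading.
long = long.dropna(subset=["MaxTemp"]).reset_index(drop=True)
print(long)
```

Each remaining row is a single (station, date, value) observation, which is the shape the point configuration expects when `X`, `Y` and `T` each name one column.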


## Generate configuration files for STAC collection

Dynamically generate the STAC configuration for the collection as a whole, for each maximum and minimum GeoTIFF, and for the two point data files.

In [None]:
# Configuration for collection as a whole.
collection_config = StacCollectionConfig(
    id="TemperatureStudy",
    title="Datasets for national temperature data validation study",
    description="STAC records for accessing datasets to explore as part of the MCCN case study 4 relating to comparisons between national temperature data products and local weather stations",
    license="CC-BY-4.0",
)

# Configurations for:
# 1) seven gridded maximum temperature layers
# 2) seven gridded minimum temperature layers
# 3) point data and site names for maximum temperature
# 4) point data for minimum temperature
configurations = [
    RasterConfig(
        id=f.split(".")[0],
        location=p.as_posix(),
        collection_date=f"{f[9:13]}-{f[13:15]}-{f[15:17]}",
        collection_time="12:00:00",
        band_info=[
            BandInfo(name="max_temp_gridded", description=f)
        ]
    ) for f, p in maxima_layers.items()
] + [
    RasterConfig(
        id=f"Minimum {f.split('.')[0]}",
        location=p.as_posix(),
        collection_date=f"{f[9:13]}-{f[13:15]}-{f[15:17]}",
        collection_time="12:00:00",
        band_info=[
            BandInfo(name="min_temp_gridded", description=f)
        ]
    ) for f, p in minima_layers.items()
] + [
    PointConfig(
        id="WeatherStationMaxima",
        location=weather_maxima.as_posix(),
        collection_date="2024-12-31",
        collection_time="00:00:00",
        X="Longitude",
        Y="Latitude",
        T="Date",
        column_info=[
            ColumnInfo(name="MaxTemp", description="Weather station data"),
            ColumnInfo(name="Name", description="Weather station name and code"),
        ]
    ),
] + [
    PointConfig(
        id="WeatherStationMinima",
        location=weather_minima.as_posix(),
        collection_date="2024-12-31",
        collection_time="00:00:00",
        X="Longitude",
        Y="Latitude",
        T="Date",
        column_info=[
            ColumnInfo(name="MinTemp", description="Weather station data"),
        ]
    ),
]

# Build the generator using the configurations.
generator = StacGeneratorFactory.get_collection_generator(
    source_configs=configurations,
    collection_config=collection_config
)

# Serialise the STAC collection. This will generate the collection JSON file and item JSON files for each layer.
serialiser = StacSerialiser(generator, scratch_folder.as_posix())
serialiser()
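
The raster configurations derive `collection_date` by slicing fixed character positions out of each file name. The sketch below shows the same idea with validation added. The helper `collection_date_from_name` and the example file name are hypothetical. The sketch assumes that a `YYYYMMDD` stamp follows a nine-character prefix such as `mean_max_`.

```python
from datetime import datetime


def collection_date_from_name(filename: str, prefix_length: int = 9) -> str:
    """Return the ISO date encoded as YYYYMMDD right after a fixed-length prefix."""
    stamp = filename[prefix_length:prefix_length + 8]
    # strptime raises ValueError if the slice is not a valid date,
    # which catches files that do not follow the assumed naming pattern.
    return datetime.strptime(stamp, "%Y%m%d").date().isoformat()


# Hypothetical file name following the assumed pattern.
print(collection_date_from_name("mean_max_2025040720250407.tiff"))
```

Parsing the stamp rather than concatenating slices means a misnamed file fails early, instead of producing a malformed date in the STAC item.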

## Load data into data cube
Load the data into the data cube on a 1000*1000 grid. Group the layers by calendar day so that the seven days form the time dimension.

In [None]:
# Load using the locally generated collection
endpoint = os.path.join(scratch_folder, "collection.json")
client = MCCN(endpoint, shape=(1000,1000), nodata={"Name": 0}, nodata_fallback=np.nan)
ds = client.load()

# Mask the 9999 nodata value in the source data
ds["max_temp_gridded"] = ds["max_temp_gridded"].where(ds["max_temp_gridded"] < 100, np.nan)
ds["min_temp_gridded"] = ds["min_temp_gridded"].where(ds["min_temp_gridded"] < 100, np.nan)
ds = ds.where(ds > -99, np.nan)

# Group layers by calendar day (timestamps do not match exactly)
ds = ds.resample(time="1D").max()
# Optional: restrict to the target dates
#ds = ds.sel(time=slice("2025-04-07", "2025-04-13"))


# Display
ds
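
The masking and daily grouping can be illustrated with a pandas analogue. This is a sketch of the idea, not the xarray calls used above. The timestamps and values are invented, and 9999 stands in for the nodata sentinel.

```python
import numpy as np
import pandas as pd

# Two entries on the same calendar day with different timestamps,
# plus one sentinel value (9999) standing in for "no data".
times = pd.to_datetime(
    ["2025-04-07T00:00:00", "2025-04-07T12:00:00", "2025-04-08T12:00:00"]
)
values = pd.Series([np.nan, 24.5, 9999.0], index=times)

# Mask implausible values, mirroring the "< 100" test used for the gridded layers.
masked = values.where(values < 100, np.nan)

# Collapse to one value per calendar day; max() skips NaN within each day.
daily = masked.resample("1D").max()
print(daily)
```

Because `max()` ignores NaN, a day that is empty at one timestamp and populated at another collapses to the populated value. A day with only masked values stays NaN.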

## DataCube contents
Daily mean maximum temperatures from BOM.

In [None]:
ds["max_temp_gridded"].plot(x="x", y="y", col="time", col_wrap=7)

Daily mean minimum temperatures from BOM.

In [None]:
ds["min_temp_gridded"].plot(x="x", y="y", col="time", col_wrap=7)

Daily maximum air temperatures from weather stations (coarsened to 100*100 so values are visible).

In [None]:
ds["MaxTemp"].coarsen(x=10).mean().coarsen(y=10).mean().plot(x="x", y="y", col="time", col_wrap=7)

Daily minimum air temperatures from weather stations (coarsened to 100*100 so values are visible).

In [None]:
ds["MinTemp"].coarsen(x=10).mean().coarsen(y=10).mean().plot(x="x", y="y", col="time", col_wrap=7)
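
Coarsening replaces each block of cells with a single averaged cell, so sparse station values become large enough to see. The NumPy sketch below shows NaN-aware block averaging on an invented 4*4 grid. It is an analogue of the idea rather than the xarray `coarsen` call, and it averages each block in one step, whereas the cells above average along x and then along y.

```python
import warnings

import numpy as np

# 4x4 grid with two populated cells; everything else is NaN.
grid = np.full((4, 4), np.nan)
grid[0, 1] = 20.0
grid[3, 2] = 30.0

# Split into 2x2 blocks: axes become (block_row, row_in_block, block_col, col_in_block).
blocks = grid.reshape(2, 2, 2, 2)

with warnings.catch_warnings():
    # nanmean warns on all-NaN blocks; those blocks simply stay NaN.
    warnings.simplefilter("ignore", category=RuntimeWarning)
    coarse = np.nanmean(blocks, axis=(1, 3))
print(coarse)
```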

## Analyse errors
For all points and dates with weather station data, count the number of measured values over the week and calculate the absolute difference between the measured data and the gridded products.

For each point, calculate the maximum, minimum and mean absolute errors and the standard deviation of the absolute errors for the seven days.

In [None]:
# Calculations for maximum temperatures
ds["max_temp_count"] = ds["MaxTemp"].count(dim="time")
ds["max_temp_count"] = ds["max_temp_count"].where(ds["max_temp_count"] > 0, np.nan)
# Absolute error; NaN propagates wherever either input is missing
ds["error_max_temp"] = abs(ds["MaxTemp"] - ds["max_temp_gridded"])
ds["mean_error_max_temp"] = ds["error_max_temp"].mean(dim="time")
ds["max_error_max_temp"] = ds["error_max_temp"].max(dim="time")
ds["min_error_max_temp"] = ds["error_max_temp"].min(dim="time")
ds["std_error_max_temp"] = ds["error_max_temp"].std(dim="time")

# Calculations for minimum temperatures
ds["min_temp_count"] = ds["MinTemp"].count(dim="time")
ds["min_temp_count"] = ds["min_temp_count"].where(ds["min_temp_count"] > 0, np.nan)
# Absolute error; NaN propagates wherever either input is missing
ds["error_min_temp"] = abs(ds["MinTemp"] - ds["min_temp_gridded"])
ds["mean_error_min_temp"] = ds["error_min_temp"].mean(dim="time")
ds["max_error_min_temp"] = ds["error_min_temp"].max(dim="time")
ds["min_error_min_temp"] = ds["error_min_temp"].min(dim="time")
ds["std_error_min_temp"] = ds["error_min_temp"].std(dim="time")

# Generate pandas dataframe with values for analysis
computed_layers = [
    "max_temp_count",
    "mean_error_max_temp",
    "max_error_max_temp",
    "min_error_max_temp",
    "std_error_max_temp",
    "min_temp_count",
    "mean_error_min_temp",
    "max_error_min_temp",
    "min_error_min_temp",
    "std_error_min_temp"
]
da = ds[["Name"]+computed_layers].to_dataframe().drop(columns="spatial_ref").dropna(axis=0).reset_index()
da = da.loc[da["Name"] > 0].reset_index().drop(columns=["index","time"]).drop_duplicates()

# Restore actual station names
da["Name"] = da["Name"].map(lambda x: ds.attrs["Name"][int(x)])

# Display
da
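
The per-site statistics can be checked by hand on a single station. The sketch below uses invented readings for one station over seven days and computes the same summary values with NumPy. Missing days are detected with `np.isnan`, since NaN never compares equal to itself.

```python
import numpy as np

# Invented readings for one station over seven days; NaN marks a missing day.
station = np.array([24.0, 25.5, np.nan, 22.0, 23.5, 26.0, 21.0])
gridded = np.array([23.0, 26.0, 24.0, 22.5, 23.5, 24.0, 22.0])

# NaN propagates through subtraction, so missing days stay missing.
abs_error = np.abs(station - gridded)

count = int(np.count_nonzero(~np.isnan(abs_error)))
summary = {
    "count": count,
    "mean": float(np.nanmean(abs_error)),
    "max": float(np.nanmax(abs_error)),
    "min": float(np.nanmin(abs_error)),
    "std": float(np.nanstd(abs_error)),  # population std (ddof=0), NumPy's default
}
print(summary)
```

With six valid days and absolute errors of 1.0, 0.5, 0.5, 0.0, 2.0 and 1.0, the mean absolute error is 5.0 / 6, roughly 0.83.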

## Display results
Map the values for each calculated layer.

In [None]:
for v in computed_layers:
    da.plot(x="x", y="y", kind="scatter", c=v)

## Sites with high errors for mean maximum temperature
Show details for sites that rank in the top three for any of the maximum temperature error statistics. Since most sites show low error values, these sites may have miscalibrated or poorly positioned sensors.

In [None]:
extreme_cases = set()
max_temp_columns = [c for c in computed_layers if c.endswith("max_temp")]
for v in max_temp_columns:
    extreme_cases |= set(da.sort_values(v, ascending=False).iloc[0:3,].index)
# The collected values are index labels, so select with .loc rather than .iloc
extreme_sites = da.loc[sorted(extreme_cases)][["x","y","Name","max_temp_count"]+max_temp_columns]
pd.merge(extreme_sites, pd.read_csv(weather_maxima_source).drop(columns=["04/04","05/04","06/04"]), how="left", on="Name")
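
The selection pattern is the union of the top-ranked rows across several columns, and it can be sketched on a toy table. The site names, column names and values here are invented. The index is deliberately non-contiguous, as it would be after filtering or `drop_duplicates()`.

```python
import pandas as pd

# Toy results table with a non-contiguous index. Names and values are invented.
results = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D", "E"],
        "mean_error": [0.4, 3.1, 0.6, 2.2, 0.5],
        "std_error": [0.2, 0.3, 1.9, 0.1, 0.4],
    },
    index=[3, 10, 42, 57, 90],
)

# Union of the index labels of the two largest values in each column.
extreme_labels = set()
for column in ["mean_error", "std_error"]:
    extreme_labels |= set(results[column].nlargest(2).index)

# Labels must be looked up with .loc; .iloc would treat them as positions.
extreme = results.loc[sorted(extreme_labels)]
print(extreme)
```

`nlargest` is used here as a compact alternative to sorting and slicing. Either approach returns index labels, which is why the final lookup uses `.loc`.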

## Sites with high errors for mean minimum temperature
Show details for sites that rank in the top three for any of the minimum temperature error statistics. Since most sites show low error values, these sites may have miscalibrated or poorly positioned sensors.

In [None]:
extreme_cases = set()
min_temp_columns = [c for c in computed_layers if c.endswith("min_temp")]
for v in min_temp_columns:
    extreme_cases |= set(da.sort_values(v, ascending=False).iloc[0:3,].index)
# The collected values are index labels, so select with .loc rather than .iloc
extreme_sites = da.loc[sorted(extreme_cases)][["x","y","Name","min_temp_count"]+min_temp_columns]
pd.merge(extreme_sites, pd.read_csv(weather_minima_source).drop(columns=["04/04","05/04","06/04"]), how="left", on="Name")

## Sites with low errors
Show details for sites with low error values (mean absolute error below 0.7 °C) for both maximum and minimum temperature. These sites probably represent the best calibrated and positioned sensors and might be useful for future calibration.

In [None]:
da.loc[(da["mean_error_min_temp"] < 0.7) & (da["mean_error_max_temp"] < 0.7)]

## Cleanup
Beware - this will delete all generated files.

In [None]:
# Clean up scratch folder

shutil.rmtree(scratch_folder)