## 01c - acquiring forecast weather data
In this notebook, we will obtain medium-range forecast weather data, also produced by the [European Centre for Medium-Range Weather Forecasts (ECMWF)](https://www.ecmwf.int/) and available through their [ECMWF Data Store (ECPDS)](https://www.ecmwf.int/en/forecasts/datasets/open-data).

This data is freely available (without registration), but its publication is delayed. In general, the forecast model outputs are also at a lower spatial resolution than would be available to paying subscribers, but they will suffice for our example application.

You can read much more about the data products provided by ECMWF in their [open data documentation](https://confluence.ecmwf.int/display/DAC/ECMWF+open+data%3A+real-time+forecasts+from+IFS+and+AIFS) but to summarise, we will be using forecasts that are:
+ produced by the Integrated Forecasting System (IFS) (i.e. the 'physical' weather model) maintained by ECMWF;
+ 'operational' forecasts, i.e. the main high-resolution deterministic forecast produced by ECMWF;
+ valid for the next 15 days, initially in 3-hour timesteps (for the first five days), moving to 6-hour steps thereafter;
+ published as [GRIB messages](https://confluence.ecmwf.int/display/CKB/What+are+GRIB+files+and+how+can+I+read+them#WhatareGRIBfilesandhowcanIreadthem-HowtoreadGRIBfiles) with a large number of weather variables, which we will subset to just the five we looked at in the historical datasets.
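
The step schedule described in the list above can be sketched in a few lines of plain Python. The function name and its defaults (3-hourly up to hour 120, 6-hourly up to hour 360) are illustrative assumptions derived from the description in this notebook rather than from the ECMWF specification, so check the open data documentation for the exact schedule of the run you use.

```python
def forecast_steps(switch_hour=120, final_hour=360):
    """Return lead times in hours: 3-hourly up to switch_hour, then 6-hourly."""
    # Hypothetical helper; the default hours are assumptions taken from the text above.
    fine = list(range(0, switch_hour + 1, 3))
    coarse = list(range(switch_hour + 6, final_hour + 1, 6))
    return fine + coarse


steps = forecast_steps()
print(len(steps), steps[:3], steps[-1])
```

Under these assumptions, a single model run yields one GRIB file per lead time in `steps`.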

ECMWF also produces extended-range ensemble forecasts that go out to 46 days. These have lower spatial resolution and are typically output at daily intervals.

We're looking at the bottom row of this diagram:

<img src="../docs/imgs/energy-sa-ingest-flow.png" width="800">


#### Data licensing

As before, we are obliged to state that the ECMWF has not in any way endorsed this demo, and we remind you to read and comply with their [terms of use](https://apps.ecmwf.int/datasets/licences/general/) in full when accessing the data.

ECMWF Open Data is © 2025 European Centre for Medium-Range Weather Forecasts (ECMWF). 

This data is published under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence: https://creativecommons.org/licenses/by/4.0/

ECMWF does not accept any liability whatsoever for any error or omission in the data, their availability, or for any loss or damage arising from their use.

The material presented here has not been modified from its original form.

### Notebook set-up
Installation of dependencies and library imports.

In [0]:
%run ./includes/common_functions_and_imports

Reading GRIB files requires a native code dependency, which is installed through the `cfgrib` Python package. We use this command to check that it has been installed correctly.

In [0]:
%sh python -m cfgrib selfcheck

Rather than present the custom datasource reader for this dataset within this notebook, it has been packaged within `utils/datasource.py` and is imported and registered now.

The datasource obtains a list of URLs from the ECMWF site, downloads the corresponding GRIB files and then reads and extracts the data using the popular `xarray` package, designed for dealing with multidimensional datasets.
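
To make the first of those steps concrete, here is a minimal sketch of how such a URL list might be built. It is not the code used by `IFSDataSource`: the helper name `ifs_oper_url`, the root URL and the file-naming pattern are all assumptions based on our reading of the ECMWF open data documentation, and should be verified there before use.

```python
from datetime import date

# Assumed root and naming pattern; verify against the ECMWF open data documentation.
ROOT = "https://data.ecmwf.int/forecasts"


def ifs_oper_url(run_date: date, run_hour: int, step: int, resol: str = "0p25") -> str:
    """Build the (assumed) URL of one GRIB file for a given run and lead time."""
    stamp = f"{run_date:%Y%m%d}"
    return (
        f"{ROOT}/{stamp}/{run_hour:02d}z/ifs/{resol}/oper/"
        f"{stamp}{run_hour:02d}0000-{step}h-oper-fc.grib2"
    )


# One URL per lead time (in hours) for the midnight run on an example date.
urls = [ifs_oper_url(date(2025, 1, 1), 0, step) for step in (0, 3, 6)]
for url in urls:
    print(url)
```

Each downloaded file could then be opened with `xarray` using the `cfgrib` engine, which is the role the native dependency checked above plays.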

In [0]:
from datetime import datetime, UTC, timedelta

from pyspark.sql import functions as F

In [0]:
%run ./includes/custom_spark_data_source_ifs

In [0]:
spark.dataSource.register(IFSDataSource)

spark.conf.set("spark.databricks.delta.formatCheck.enabled", "false")

### Data filtering
#### Spatial filtering
Unlike with the historical data access through CDS, we can't filter our downloads by a spatial boundary. Instead, we will read all values and drop any out-of-bounds rows before writing into Delta.
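
As a plain-Python illustration of that filter (the notebook itself applies it in Spark further down), the sketch below keeps only points inside a bounding box. The helper `in_bbox` and the sample points are hypothetical; the bounds are inclusive, matching the behaviour of Spark's `between`.

```python
# Example bounds; the same values are used for the real filter later in this notebook.
example_box = {"xmin": -8.25, "xmax": 1.75, "ymin": 49.75, "ymax": 61.25}


def in_bbox(x, y, box=example_box):
    """Return True if (x, y) lies inside the box, edges included."""
    return box["xmin"] <= x <= box["xmax"] and box["ymin"] <= y <= box["ymax"]


# Hypothetical (longitude, latitude) grid points.
points = [(-0.25, 51.5), (2.0, 51.5), (-8.25, 61.25)]
kept = [p for p in points if in_bbox(*p)]
print(kept)
```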
#### Time horizon
By default, we download the forecast produced yesterday, since it is guaranteed to be available at any point during the day. Current-day forecasts from the 00 UTC run can take up to 9 hours to be published to the open data store.
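
If you would rather pick up today's run whenever it should already be published, the rule above can be sketched as follows. This is an optional alternative, not what the notebook does by default: the function name is hypothetical and the 9-hour delay is simply the upper bound quoted above, so treat it as an assumption.

```python
from datetime import datetime, timedelta, timezone


def safe_forecast_date(now_utc, publish_delay_hours=9):
    """Return today's date (YYYYMMDD) if the 00 UTC run should be out, else yesterday's."""
    todays_run = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    if now_utc - todays_run >= timedelta(hours=publish_delay_hours):
        return todays_run.strftime("%Y%m%d")
    return (todays_run - timedelta(days=1)).strftime("%Y%m%d")


# Two example clock times on the same day, either side of the assumed delay.
early = safe_forecast_date(datetime(2025, 6, 2, 5, 0, tzinfo=timezone.utc))
late = safe_forecast_date(datetime(2025, 6, 2, 10, 0, tzinfo=timezone.utc))
print(early, late)
```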

In [0]:
bbox = {
  "xmin": -8.25, "xmax": 1.75,
  "ymin": 49.75, "ymax": 61.25,
}

# Forecast runs are labelled in UTC, so take "yesterday" in UTC rather than local time.
forecast_date = (
  datetime.now(UTC) - timedelta(days=1)  # D - 1
).strftime('%Y%m%d')

We can now go ahead and read the data and store it in Delta.

In [0]:
raw_df = (
  spark.read.format("ifs")
  .option("variables", "t2m,u100,v100,ssrd,strd")
  .option("forecastDate", forecast_date)
  .option("forecastTime", 0) # midnight model run
  .load()
  )

In [0]:
target_table_name = f"{CONFIG.target_catalog}.{CONFIG.target_schema}.weather_forecast_raw"
print(f'Saving ifs data to: {target_table_name}')

(
  raw_df
  .where(F.col("x").between(bbox["xmin"], bbox["xmax"]))
  .where(F.col("y").between(bbox["ymin"], bbox["ymax"]))
  .write.saveAsTable(target_table_name, mode="overwrite")
)

forecast_df = spark.table(target_table_name)

What does our data look like? How many rows have we ingested?

In [0]:
print(f"{forecast_df.count()=:,}")

Here you can see that the rows are keyed by:
* `time` the datetime at which the forecast was produced;
* `valid_time` the time to which the forecast relates;
* `variable` the forecast variable; and
* spatial dimensions, where:
  * `x` represents longitude; and
  * `y` represents latitude
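
The gap between `valid_time` and `time` is the forecast lead time. A minimal sketch of that relationship, using plain `datetime` values and a hypothetical helper name rather than the Spark table:

```python
from datetime import datetime, timedelta


def lead_time_hours(time, valid_time):
    """Hours between the model run (`time`) and the moment being forecast (`valid_time`)."""
    return int((valid_time - time) / timedelta(hours=1))


# Example values only: a midnight run and a forecast valid two and a half days later.
run = datetime(2025, 6, 1, 0, 0)
valid = datetime(2025, 6, 3, 12, 0)
lead = lead_time_hours(run, valid)
print(lead)
```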

In [0]:
forecast_df.display()

How far into the future does the forecast extend?

In [0]:
timesteps = forecast_df.select("valid_time").distinct()
print(f"{timesteps.count()=:,}")

In [0]:
display(
  timesteps.groupBy().agg(
    F.min("valid_time").alias("first"),
    F.max("valid_time").alias("last"),
  )
)

In [0]:
dbutils.notebook.exit("0")