
# 01 - Acquiring historical weather data
In this notebook, we will obtain historical weather data for the period of interest (Jan 2024 until now) produced by the [European Centre for Medium-Range Weather Forecasts (ECMWF)](https://www.ecmwf.int/) through their [Climate Data Store (CDS) service](https://cds.climate.copernicus.eu/).

This data is freely available but subject to registration and limits on the size of each request for data submitted to the service. 

Register for an account at the [ECMWF site](https://accounts.ecmwf.int/auth/realms/ecmwf/login-actions/registration), log into the CDS site, and generate an API token under "Your Profile". Copy your token and store it as a [Databricks Secret](https://docs.databricks.com/aws/en/security/secrets/example-secret-workflow). Then update the `secret_scope` and `secret_key` variables in the workflow configuration to match the scope and key you used (see notebook `00_project_setup`).

We're looking at the middle row of this diagram:

<img src="../docs/imgs/energy-sa-ingest-flow.png" width="800">

#### What does this notebook do? / What will you learn?

Part 1: Obtain data from the CDS API
- Break down our time horizon and desired parameters into a number of smaller queries to avoid overloading the API.
- Use the `cdsapi` client to submit the requests and retrieve the data.
- Store the data in a volume.

Part 2: Load the downloaded data
- Learn about the NetCDF data format.
- Define a custom Spark data source to handle the new format.
- Load our data from a volume.

Part 3: Save the results
- Write the data to a table in Unity Catalog.

#### Data licensing
We are obliged to state that the ECMWF has not in any way endorsed this demo, and we remind you to read and comply with their [terms of use](https://apps.ecmwf.int/datasets/licences/general/) in full when accessing the data.

Generated using Copernicus Climate Change Service information 2025.

ECMWF Open Data is © 2025 European Centre for Medium-Range Weather Forecasts (ECMWF).

This data is published under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence. https://creativecommons.org/licenses/by/4.0/

ECMWF does not accept any liability whatsoever for any error or omission in the data, their availability, or for any loss or damage arising from their use.

The material presented here has not been modified from its original form.



In [0]:
%run ./includes/common_functions_and_imports

In [0]:
import cdsapi
import itertools
from copy import deepcopy
import os

In [0]:
target_table_name = f"{CONFIG.target_catalog}.{CONFIG.target_schema}.weather_data_raw"
target_volume_path = f"/Volumes/{CONFIG.target_catalog}/{CONFIG.target_schema}/unstructured_data"
download_root = os.path.join(target_volume_path, "weather")

if spark.catalog.tableExists(target_table_name) and (not CONFIG.overwrite_data):
  dbutils.notebook.exit('Target table exists and config is not set to overwrite, skipping notebook.')

## Part 1: Accessing the Climate Data Store

CDS comes with its own Python API (`cdsapi`) for coordinating requests and downloads, and we'll use it here to orchestrate the data acquisition process.

We'll use this bounding box to constrain our CDS query and limit the volume of data we need to handle. Each edge is rounded outwards to the nearest 0.25° so that the box errs on the generous side.

> United Kingdom bounding box:
> + North: 61.25°
> + East: 1.75°
> + South: 49.75°
> + West: -8.25°

<img src="../docs/imgs/ukbbox.png" width="600">
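
The outward rounding described above can be sketched as follows. This is a minimal illustration, not the method used to produce the figures in this notebook: the helper name `snap_bbox_outward` and the raw extents are hypothetical, chosen only so that the result matches the box listed above.

```python
import math


def snap_bbox_outward(north, west, south, east, step=0.25):
    """Expand a bounding box so that every edge sits on a multiple of `step` degrees."""

    def up(x):
        return math.ceil(x / step) * step

    def down(x):
        return math.floor(x / step) * step

    # North and East move up, South and West move down, so the box only ever grows.
    # The result is ordered [North, West, South, East].
    return [up(north), down(west), down(south), up(east)]


# Hypothetical raw extents, for illustration only.
raw_extent = {"north": 61.1, "west": -8.2, "south": 49.9, "east": 1.7}
snapped = snap_bbox_outward(**raw_extent)
print(snapped)  # [61.25, -8.25, 49.75, 1.75]
```

An extent that already sits on the 0.25° grid is returned unchanged.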


In [0]:
# Order expected by the CDS `area` parameter: [North, West, South, East]
query_bounding_box = [61.25, -8.25, 49.75, 1.75]

We need to instantiate the client with our token, then construct a series of requests in order to obtain the data we need.

In [0]:
c = cdsapi.Client(
    url="https://cds.climate.copernicus.eu/api",
    key=dbutils.secrets.get(CONFIG.secret_scope, CONFIG.secret_key),
)

dataset = "reanalysis-era5-single-levels"

We'll split our time horizon into three parts to allow us to successfully download the entire dataset without hitting any request limits.

We'll also consider "instantaneous" and "accumulative" weather variables separately [(see these helpful definitions to understand the difference between them)](https://confluence.ecmwf.int/pages/viewpage.action?pageId=85402030#heading-Instantaneousaccumulatedmeanrateandminmaxparameters).

The variables we're interested in are described in the table below:

| Variable | I/A | Name | Units | Definition |
| --- | --- | --- | --- | --- |
| 2t   | I | 2m_temperature | K | [2 metre temperature](https://apps.ecmwf.int/codes/grib/param-db/167) |
| 10u  | I | 10m_u_component_of_wind | m s-1 | [10 metre U wind component](https://apps.ecmwf.int/codes/grib/param-db/165) |
| 10v  | I | 10m_v_component_of_wind | m s-1 | [10 metre V wind component](https://apps.ecmwf.int/codes/grib/param-db/166) |
| ssrd | A | surface_solar_radiation_downwards | J m**-2 | [Surface solar radiation downwards](https://apps.ecmwf.int/codes/grib/param-db/169) |
| strd | A | surface_thermal_radiation_downwards | J m**-2 | [Surface thermal radiation downwards](https://apps.ecmwf.int/codes/grib/param-db/175) |

We use `itertools.product(intervals, variable_sets)` to generate every combination of time interval and variable set that we need to download, like this:

<img src="../docs/imgs/itertools_product.png" width="1440">
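
As a small sketch of what the product yields, the snippet below uses only the interval and variable set labels from this notebook (the names `interval_names`, `variable_set_names` and `file_names` are illustrative) and prints the file name that each combination would map to.

```python
import itertools

# Labels only; the full request parameters are left out to keep the sketch short.
interval_names = ["2024H1", "2024H2", "2025H1"]
variable_set_names = ["instantaneous", "accumulative"]

# Every interval is paired with every variable set: 3 x 2 = 6 combinations.
combinations = list(itertools.product(interval_names, variable_set_names))
file_names = [f"{interval}-{variables}.nc" for interval, variables in combinations]

for name in file_names:
    print(name)
```

The first name printed is `2024H1-instantaneous.nc` and the last is `2025H1-accumulative.nc`, with the variable sets cycling fastest.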


In [0]:
months = [f"{m:02}" for m in range(1, 13)]

intervals = {
  "2024H1": {"year": ["2024"], "month": months[:6]},
  "2024H2": {"year": ["2024"], "month": months[6:]},
  "2025H1": {"year": ["2025"], "month": months[:6]},
  }
variable_sets = {
  "instantaneous": {"variable": ["2t", "10u", "10v"]},
  "accumulative": {"variable": ["ssrd", "strd",]},
}

downloads = list(itertools.product(intervals.items(), variable_sets.items()))

# # View the structure of our download request parameters here.
# for i, v in downloads:
#   print(i[0])
#   print(f'intervals = {i}')
#   print(f'variables = {v}\n')

# Define a base dictionary of common parameters we will use as a template
request_base = {
  "product_type": ["reanalysis"],
  "day": [f"{d:02}" for d in range(1, 32)],
  "time": [f"{h:02}:00" for h in range(0, 24)],
  "data_format": "netcdf",
  "download_format": "unarchived",
  "area": query_bounding_box
}

Let's go ahead and execute those downloads now. The resulting data files will be stored in a UC Volume.

We could attempt to parallelise the request / download process (using e.g. multithreading) but, to prevent high volumes of requests breaking the CDS platform, our requests will be queued and executed sequentially anyway.

In [0]:
dbutils.fs.mkdirs(download_root)

for interval, parameter_set in downloads:
  # Unpack our tuples
  interval_name, interval_values = interval 
  variable_name, variable_values = parameter_set
  target_path = f"{download_root}/{interval_name}-{variable_name}.nc"
  # If we aren't set to overwrite data and the file already exists, skip it.
  # os.path.exists is used because dbutils.fs.ls raises an error for a missing path.
  if not CONFIG.overwrite_data and os.path.exists(target_path):
    print(f"File: {target_path} exists, skipping...")
    continue
  request = deepcopy(request_base)
  request.update(interval_values)
  request.update(variable_values)
  print(f'Requesting data for {interval_name} and {variable_name}')
  c.retrieve(dataset, request, target_path)
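
The request assembly in the loop above can be isolated into a small pure function, which makes it easy to inspect a request before submitting it. This is a sketch only: `build_request` is a hypothetical helper, and the shortened `base` dictionary stands in for `request_base`.

```python
from copy import deepcopy


def build_request(base, *overrides):
    """Return a new request: a deep copy of `base` updated with each override in turn."""
    request = deepcopy(base)
    for override in overrides:
        request.update(override)
    return request


# Shortened stand-in for the template used in this notebook.
base = {"product_type": ["reanalysis"], "data_format": "netcdf"}

request = build_request(
    base,
    {"year": ["2024"], "month": ["01", "02"]},
    {"variable": ["2t", "10u", "10v"]},
)
print(request)
print(base)  # the template itself is left untouched
```

Copying the template before updating it is what allows the same base dictionary to be reused safely for every combination.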

## Part 2: Loading data from files downloaded from CDS

### NetCDF, a brief introduction
Now we have data downloaded and stored away safely in a Volume. However, we still have an obstacle to using this data directly in that it is stored in an unusual format.

Rather than publishing giant CSV files full of weather data coordinates, timestamps and values, meteorological organisations like ECMWF tend to publish their data in specialist scientific data formats such as [NetCDF](https://www.unidata.ucar.edu/software/netcdf/) or [GRIB](https://confluence.ecmwf.int/display/CKB/What+are+GRIB+files+and+how+can+I+read+them).

Both are designed to efficiently store multidimensional array data, e.g. 'grids' of values, or collections of these grids. By convention, these grids are usually two-dimensional (but not always!) with the dimensions representing spatial dimensions such as longitude and latitude at regular intervals. In our case these intervals are 0.25 of a degree in both dimensions.
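
To get a feel for the size of one such grid, here is a rough calculation for our bounding box at 0.25° spacing. It assumes that both edges of the box are included in the returned grid, which is an assumption rather than something verified here; `grid_points` is a hypothetical helper.

```python
def grid_points(lower, upper, step=0.25):
    """Number of grid points between two bounds, counting both ends (an assumption)."""
    return round((upper - lower) / step) + 1


n_lat = grid_points(49.75, 61.25)  # South to North
n_lon = grid_points(-8.25, 1.75)   # West to East
cells_per_grid = n_lat * n_lon

print(n_lat, n_lon, cells_per_grid)  # 47 41 1927
```

Under that assumption, every hourly grid for a single variable holds 1,927 values.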

In the case of NetCDF, the format we are working with here, the files are also inherently hierarchical:

```
.
├── subdataset 1: 2 metre temperature
├── subdataset 2: 'U' component windspeed
│   ├── band 1: 2024-01-01T00:00:00.000Z
│   ├── band 2: 2024-01-01T01:00:00.000Z
│   ├── band_...
│   └── band N: 2025-03-30T23:00:00.000Z
├── subdataset ...
├── subdataset N
```

+ Data for each weather variable is stored in a 'subdataset', a logically separated partition of the data within the file.
+ Data for each forecast timestep within a subdataset is contained within a 'band'. Sequential bands represent the state of the variable on a regular spatial grid for sequential time steps.

The files are therefore often very large and, in an ideal world, we would just use Spark to read the values from the inside of these files.
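
Conceptually, reading such a file into a table means turning each band into one row per grid cell. The sketch below shows the idea with plain Python lists and toy values. It is not the implementation of the data source used in this notebook, and `flatten_bands` is a hypothetical name.

```python
from datetime import datetime, timedelta


def flatten_bands(bands, start, lats, lons):
    """Yield one (timestamp, lat, lon, value) row per grid cell, assuming hourly bands."""
    for band_index, grid in enumerate(bands):
        timestamp = start + timedelta(hours=band_index)
        for lat, row in zip(lats, grid):
            for lon, value in zip(lons, row):
                yield (timestamp, lat, lon, value)


# Two hourly bands on a toy 2 x 2 grid (values are made up).
bands = [
    [[280.1, 280.4], [279.8, 280.0]],
    [[280.3, 280.6], [279.9, 280.2]],
]
rows = list(
    flatten_bands(bands, datetime(2024, 1, 1), lats=[61.25, 61.0], lons=[-8.25, -8.0])
)

print(len(rows))  # 8
print(rows[0])
```

Two bands of four cells become eight rows, which is why the row count grows so quickly for real files.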

### Defining a custom Spark DataSource

To save space in this notebook, we have packaged the data source implementation into the includes folder. If you're interested in the source code, follow the link from the `%run` cell below.

If you just want to see it in action, keep reading.

In [0]:
%run ./includes/custom_spark_data_source_era5

In [0]:
# Once we have a custom reader correctly set up, we need to register it with Spark.
spark.dataSource.register(ERA5DataSource)

We can now read this data in with Spark just as though it _is_ a big ol' CSV.

In [0]:
raw_df = (
  spark.read
  .format("era5")
  .load(download_root)
  )
raw_df.display()

## Part 3: Saving the results

Now that we have our dataframe, we're going to grab the variable indicator from the subdataset name (the final string value following the `:` after `filename.nc`) and save our table to UC.
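
In plain Python, the extraction amounts to taking the last `:`-separated token. The example string below is hypothetical and only illustrates the general shape of a subdataset name; the names produced by the reader in this notebook may differ in detail.

```python
def variable_from_subdataset(subdataset):
    """Return the text after the final ':' in a subdataset name."""
    return subdataset.split(":")[-1]


# Hypothetical subdataset name, for illustration only.
example = 'NETCDF:"/tmp/2024H1-instantaneous.nc":t2m'
print(variable_from_subdataset(example))  # t2m
```

Taking the last element after a split is equivalent to reversing the split result and taking the first element, which is how the Spark expression in the cell below is written.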

**Note** Depending on your cluster size, this can take some time. We need to process 87M rows of data with a high degree of partitioning, and you will end up with around 54887 tasks to run.

The typical task time is ~1.5-2 seconds which, on a 32-core cluster, equates to somewhere between 43 and 57 minutes of run time.
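
That estimate is simple arithmetic and can be reproduced as below, assuming perfectly even parallelism across cores and no scheduling overhead. `estimated_minutes` is a hypothetical helper that you can reuse with your own core count.

```python
def estimated_minutes(tasks, seconds_per_task, cores):
    """Wall-clock estimate in minutes, assuming tasks spread evenly across all cores."""
    return tasks * seconds_per_task / cores / 60


low = estimated_minutes(54887, 1.5, 32)
high = estimated_minutes(54887, 2.0, 32)
print(f"{low:.0f} to {high:.0f} minutes")  # 43 to 57 minutes
```

Doubling the core count roughly halves the estimate under the same assumptions.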


In [0]:
print(f'Saving era5 data to: {target_table_name}')
(
  raw_df.withColumn(
      "variable", F.reverse(F.split("subdataset", ":")).getItem(0)
  ).write.saveAsTable(target_table_name, mode="overwrite")
)

Onto data pre-processing and analysis!