# Retrieve and process ERA5 fields
The following fields are needed for the metfut machine learning project:
#### time invariant fields
These fields do not change over time, so they only need to be retrieved for a single point in time.

* soil type (slt/[43](https://codes.ecmwf.int/grib/param-db/43))
* type of low vegetation (tvl/[29](https://codes.ecmwf.int/grib/param-db/29))
* type of high vegetation (tvh/[30](https://codes.ecmwf.int/grib/param-db/30))
* land-sea mask (lsm/[172](https://codes.ecmwf.int/grib/param-db/172))

#### time variant fields
* 2 meter temperature (t2m/[167](https://codes.ecmwf.int/grib/param-db/167))
* soil temperature level 1 (stl1/[139](https://codes.ecmwf.int/grib/param-db/139))
* sea surface temperature (sst/[34](https://codes.ecmwf.int/grib/param-db/34))
* sea ice area fraction (siconc/[31](https://codes.ecmwf.int/grib/param-db/31))
* geopotential 500 hPa (z/[129](https://codes.ecmwf.int/grib/param-db/129))
* temperature 850 hPa (t/[130](https://codes.ecmwf.int/grib/param-db/130))

In [None]:
# import packages
import os
import cdsapi
import numpy as np
import xarray as xr
# create the CDS API client
c = cdsapi.Client()

#### parameters
time period: 1979-01-01 to 2023-12-23 (one field per day at 09:00 UTC)  
resolution: 5.625 deg
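
As a rough sanity check on these parameters, the expected array sizes can be estimated before downloading anything. This is only a sketch: it assumes that the `"grid": "5.625/5.625"` keyword used in the requests yields a regular latitude-longitude grid that includes both poles, and that the end date of the period is inclusive. The variable names are illustrative.

```python
import numpy as np

# Assumption: a regular lat-lon grid from 90 to -90 degrees (poles included)
# and longitudes from 0 up to, but not including, 360 degrees.
resolution = 5.625
lats = np.arange(90.0, -90.0 - resolution / 2, -resolution)
lons = np.arange(0.0, 360.0, resolution)

# One time step per day over the stated period (end date treated as inclusive).
days = np.arange('1979-01-01', '2023-12-24', dtype='datetime64[D]')

print('latitudes :', lats.size)
print('longitudes:', lons.size)
print('time steps:', days.size)
```

If the dimensions of the retrieved files differ from these numbers, the grid assumption above does not hold and should be checked against the actual data.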

In [None]:
# time period
years = np.arange('1979', '2024', dtype='datetime64[Y]')
# paths to relevant directories
root_download  = '/work/awidulla/METFUT/data_download/ERA5/'
root_processed = '/work/awidulla/METFUT/data_processed/ERA5/'

#### retrieve time invariant fields
They are retrieved separately and do not need further processing, so they are written directly to `root_processed`.
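
The requests in this notebook share most of their keywords and differ only in the parameter, the date range and the level type. A minimal sketch of how the shared part could be factored out is given here. `build_request` is a hypothetical helper, not part of `cdsapi`, and the keyword values are copied from the retrieval cells of this notebook.

```python
def build_request(param, date, levtype="sfc", levelist=None):
    """Return a request dictionary for the 'reanalysis-era5-complete' dataset.

    Hypothetical helper for illustration; it only assembles the dictionary
    and does not contact the CDS.
    """
    request = {
        "class": "ea",
        "date": date,
        "expver": "1",
        "levtype": levtype,
        "grid": "5.625/5.625",
        "param": str(param),
        "step": "0",
        "stream": "oper",
        "time": "09:00:00",
        "type": "4v",
        "format": "netcdf",
    }
    # pressure level requests additionally need the list of levels
    if levelist is not None:
        request["levelist"] = levelist
    return request

# a time invariant surface field (land-sea mask)
invariant_request = build_request(172, "1979-01-01")
# one month of pressure level fields
pressure_request = build_request("129/130", "1979-01-01/to/1979-01-31",
                                 levtype="pl", levelist="500/850")
```

The resulting dictionaries could then be passed as the second argument of `c.retrieve`.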

In [None]:
# set param ids and names
param_ids    = [29, 30, 43, 172]
param_names  = ['tvl', 'tvh', 'slt', 'lsm']
# loop over requests
for i, n in zip(param_ids, param_names):
    c.retrieve("reanalysis-era5-complete", {
        "class": "ea",
        "date": "1979-01-01",
        "expver": "1",
        "levtype": "sfc",
        "grid": "5.625/5.625",
        "param": str(i),
        "step": "0",
        "stream": "oper",
        "time": "09:00:00",
        "type": "4v",
        "format": "netcdf"
    }, root_processed+n+'_5.625deg.nc' )

#### retrieve time variant fields
These need to be retrieved for every time step. To keep the MARS requests efficient, each request covers a single month, so that only one tape is accessed at a time. The resulting monthly files are then post-processed into yearly files.
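
The monthly date ranges can be checked without sending any request. The sketch below uses the same `datetime64` arithmetic as the retrieval loops; `monthly_date_stamps` is an illustrative helper name, not part of the original workflow.

```python
import numpy as np

def monthly_date_stamps(year):
    """Return one 'first-day/to/last-day' string per month of the given year."""
    y = np.datetime64(str(year), 'Y')
    months = np.arange(y, y + np.timedelta64(1, 'Y'), dtype='datetime64[M]')
    stamps = []
    for m in months:
        # all days of the month; the last entry accounts for leap years
        days = np.arange(m, m + np.timedelta64(1, 'M'), dtype='datetime64[D]')
        stamps.append(str(days[0]) + '/to/' + str(days[-1]))
    return stamps

stamps = monthly_date_stamps(1980)
print(stamps[1])  # February of a leap year
```

Printing the stamps for a leap year is a quick way to confirm that the last day of February is handled correctly.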

In [None]:
# create folder structure - some housekeeping for monthly files
for y in years:
    folder_path = root_download+str(y)
    # create the yearly directory; do nothing if it already exists
    os.makedirs(folder_path, exist_ok=True)

Retrieving surface level variables:

In [None]:
# loop over all months
for y in years:
    for m in np.arange(y,  y+np.timedelta64(1,'Y'), dtype='datetime64[M]'):
        days = np.arange(m, m+np.timedelta64(1,'M'), dtype='datetime64[D]')
        date_stamp = str(days[0])+'/to/'+str(days[-1])
        # execute CDS request for each month
        c.retrieve("reanalysis-era5-complete", {
            "class": "ea",
            "date": date_stamp,
            "expver": "1",
            "levtype": "sfc",
            "grid": "5.625/5.625",
            "param": "31/34/139/167",
            "step": "0",
            "stream": "oper",
            "time": "09:00:00",
            "type": "4v",
            "format": "netcdf"
        }, root_download+str(y)+'/era5_timevariant_'+str(m)+'.nc')

Retrieving pressure level variables:

In [None]:
# loop over all months
for y in years:
    for m in np.arange(y,  y+np.timedelta64(1,'Y'), dtype='datetime64[M]'):
        days = np.arange(m, m+np.timedelta64(1,'M'), dtype='datetime64[D]')
        date_stamp = str(days[0])+'/to/'+str(days[-1])
        # execute CDS request for each month
        c.retrieve("reanalysis-era5-complete", {
            "class": "ea",
            "date": date_stamp,
            "expver": "1",
            "levtype": "pl",
            "levelist": "500/850",
            "grid": "5.625/5.625",
            "param": "129/130",
            "step": "0",
            "stream": "oper",
            "time": "09:00:00",
            "type": "4v",
            "format": "netcdf"
        }, root_download+str(y)+'/era5_timevariant_pl_'+str(m)+'.nc')

#### post processing
The monthly files, which contain all time variant fields, are merged into yearly files that each contain a single variable.
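
The yearly files are written to one sub-folder per variable, and writing a netCDF file into a folder that does not exist is expected to fail. A minimal sketch of a helper that builds the output path and creates the folder first is shown here. `yearly_output_path` is a hypothetical name, and the demonstration writes to a temporary directory rather than to `root_processed`.

```python
import os
import tempfile

def yearly_output_path(root, name, year, resolution="5.625deg"):
    """Build '<root>/<name>_<res>/<name>_<year>_<res>.nc' and create its folder."""
    folder = os.path.join(root, name + '_' + resolution)
    # create the per-variable folder; do nothing if it already exists
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, name + '_' + str(year) + '_' + resolution + '.nc')

# demonstration in a temporary directory (stand-in for root_processed)
with tempfile.TemporaryDirectory() as tmp_root:
    example = yearly_output_path(tmp_root, 'z500', 1979)
    folder_exists = os.path.isdir(os.path.dirname(example))
    file_name = os.path.basename(example)

print(file_name, folder_exists)
```

The returned path could be passed to `to_netcdf` in place of the concatenated strings.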

In [None]:
# loop over years - the yearly sub-folders must exist in root_download and
# the per-variable sub-folders (e.g. siconc_5.625deg) must exist in root_processed
for y in years:
    path = root_download+str(y)+'/era5*.nc'
    data = xr.open_mfdataset(path, combine='by_coords')
    data['siconc'].to_netcdf(root_processed+'siconc_5.625deg/siconc_'+str(y)+'_5.625deg.nc')
    data['sst'].to_netcdf(root_processed+'sst_5.625deg/sst_'+str(y)+'_5.625deg.nc')
    data['stl1'].to_netcdf(root_processed+'stl1_5.625deg/stl1_'+str(y)+'_5.625deg.nc')
    data['t2m'].to_netcdf(root_processed+'t2m_5.625deg/t2m_'+str(y)+'_5.625deg.nc')
    data.z.sel(level=500).to_netcdf(root_processed+'z500_5.625deg/z500_'+str(y)+'_5.625deg.nc')
    data.t.sel(level=850).to_netcdf(root_processed+'t850_5.625deg/t850_'+str(y)+'_5.625deg.nc')