# Table of Contents
1. [Introduction](#introduction)
2. [Environment](#environment)
    1. [Imports](#imports)
    2. [User-defined inputs](#inputs)
3. [Data Download](#download)
    1. [Data preprocessing](#preprocessing)
    2. [Download Atmospheric Variables](#atm_vars)
    3. [Download Precipitation](#precipitation)

# Download data for the analysis in the paper: <a name="introduction"></a>
### [Extreme precipitation events in the Mediterranean: Spatiotemporal characteristics and connection to large-scale atmospheric flow patterns](https://rmets.onlinelibrary.wiley.com/doi/10.1002/joc.6985)

---
Author: Nikolaos Mastrantonas\
Email: nikolaos.mastrantonas@ecmwf.int; nikolaos.mastrantonas@doktorand.tu-freiberg.de

---
This work uses ERA5 data to quantify the connections between localized extreme precipitation and large-scale patterns. The main variables used are:
1. Mean Sea Level Pressure
2. Temperature at 850hPa
3. Geopotential height at 500hPa
4. Total Precipitation

Moreover, some additional variables related to moisture/water content were tested. These variables are:
1. Specific humidity at 850hPa
2. Water Vapour Flux (eastwards and northwards components)

The data are downloaded from the MARS data storage facility of ECMWF. Information about how to download data is available at [this](https://confluence.ecmwf.int/display/CKB/How+to+download+data+via+the+ECMWF+WebAPI) link.
For users without access to MARS, the data are also available from Copernicus CDS (https://cds.climate.copernicus.eu/#!/home). Downloading data from CDS requires non-trivial amendments to this script.

# Environment<a name="environment"></a>
Load the required packages and get the user-defined inputs.
The download was run on a Linux machine with 8 CPUs and 32 GB RAM, and took about 8 hours in total. With a single CPU, downloading takes about 6 hours per variable.

## Imports<a name="imports"></a>

Import the required packages (full package or specific functions).

In [1]:
import multiprocessing # parallel processing
import tqdm # progress bar with timing
from datetime import datetime
from itertools import groupby
from pathlib import Path # creation of directories

import numpy as np 
import pandas as pd
import metview as mv # metview package for downloading the data from MARS

## User-defined inputs <a name="inputs"></a>

Define the main folder where the data will be stored.

In [2]:
dir_loc = 'Data/'
Path(dir_loc).mkdir(parents=True, exist_ok=True) # generate the subfolder for storing the results

Define the spatiotemporal coverage and the grid resolution.

In [3]:
dates_generated_all = pd.date_range(start = '19790101', end = '20191231').strftime('%Y%m%d').to_list() # used dates

area_precipt = [47, -8, 29, 38] # coordinates in N/W/S/E order for the subdomain of interest (Mediterranean domain)
grid_precipt = [.25, .25] # grid resolution in degrees

area_atm_var = [80, -90, 10, 50] # [70, -60, 10, 80] # extended area compared to Precipitation data
grid_atm_var = [1, 1] # coarser resolution compared to Precipitation, as the interest is on large-scale patterns
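
As a quick sanity check on the inputs above, the number of grid points implied by each area and resolution can be estimated. This is only a sketch: `grid_shape` is a hypothetical helper, and it assumes a regular latitude/longitude grid that includes both boundary values of the area.

```python
def grid_shape(area, grid):
    """Return (n_lat, n_lon) for a regular lat/lon grid (assumption: both edges included)."""
    north, west, south, east = area
    step_lon, step_lat = grid  # both increments are equal in this notebook
    n_lat = round((north - south) / step_lat) + 1
    n_lon = round((east - west) / step_lon) + 1
    return n_lat, n_lon

area_precipt = [47, -8, 29, 38]   # N/W/S/E, as defined above
grid_precipt = [.25, .25]
area_atm_var = [80, -90, 10, 50]
grid_atm_var = [1, 1]

precip_shape = grid_shape(area_precipt, grid_precipt)
atm_shape = grid_shape(area_atm_var, grid_atm_var)
print('precipitation grid (lat, lon):', precip_shape)
print('atmospheric variables grid (lat, lon):', atm_shape)
```

Under these assumptions the larger atmospheric domain still holds fewer points than the precipitation domain, because of its coarser resolution.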

Define variables to be downloaded. The required information should be given in a list and include the following, in the exact order:
1. **levelist**: level of interest, e.g. 0 for surface parameters, 500 for 500 hPa
2. **levtype**: level type of interest, e.g. pressure levels ('pl'), surface ('sfc')
3. **atm_var**: parameter of interest (e.g. the SLP is flagged as 151.128 at *MARS*)
4. **file_name**: name of the file to save the data

The above information is passed to the function that retrieves the data from *MARS* with the *metview* package.

In [4]:
data_inputs = [[0, 'sfc', '151.128', 'D1_Mean_SLP'],   # SLP data
               [500, 'pl', '129.128', 'D1_Mean_Z500'], # Z500 data
               [850, 'pl', '130.128', 'D1_Mean_T850'], # T850 data
               [850, 'pl', 'q', 'D1_Mean_Q850'], # Q850 data  (not used for the main analysis)
               [0, 'sfc', '71.162', 'D1_Mean_WVFeast'], # Water Vapour Flux east data (not used for main analysis)
               [0, 'sfc', '72.162', 'D1_Mean_WVFnorth'], # Water Vapour Flux north data (not used for main analysis)
               ]
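
A minimal sketch of how one entry of `data_inputs` is read: each entry is unpacked positionally, and the last underscore-separated part of the file name is what the download loop later uses as the dictionary key. Only three of the entries are repeated here to keep the example short.

```python
data_inputs = [[0, 'sfc', '151.128', 'D1_Mean_SLP'],
               [500, 'pl', '129.128', 'D1_Mean_Z500'],
               [850, 'pl', '130.128', 'D1_Mean_T850']]

keys = []
for levelist, levtype, atm_var, file_name in data_inputs:
    key = file_name.split('_')[-1]  # e.g. 'D1_Mean_SLP' -> 'SLP'
    keys.append(key)
    print(f'{key}: param={atm_var}, levtype={levtype}, levelist={levelist}')
```

Because the unpacking is positional, swapping two items in an entry would silently produce a wrong request, which is why the order matters.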

# Data Download<a name="download"></a>

## Data preprocessing<a name="preprocessing"></a>

In [5]:
InitializationTime = datetime.now()

In [6]:
# dates are chunked per year-month for efficient download, since MARS uses this subsetting for storing the data
dates_atm_vars = [list(v) for l, v in groupby(dates_generated_all[:], lambda x: x[:6])]

# repeat the last value of each chunk at the start of the next one, since daily precip data need info from the previous day as well!
dates_precipit = [dates_atm_vars[i][-1:] + dates_atm_vars[i+1] for i in range(len(dates_atm_vars)-1)] # from 2nd chunk
dates_precipit.insert(0, dates_atm_vars[0]) # prepend the 1st chunk; its first date has no previous day, so no daily total is derived for it

# create a slightly larger extent so that the interpolation of precip data on the edges of the domain works better
Area_precipt_ext = [coord+2 if i in [0, 3] else coord-2 for i, coord in enumerate(area_precipt)]
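
The chunking logic above can be checked on a short, hypothetical date range that crosses two month boundaries. The sketch below uses only the standard library to build the dates (the notebook itself uses `pandas.date_range`) and repeats the same `groupby`, overlap, and area-extension steps.

```python
from datetime import date, timedelta
from itertools import groupby

# hypothetical short range, chosen only to cross two month boundaries
start, end = date(2019, 11, 29), date(2020, 1, 2)
dates = [(start + timedelta(days=d)).strftime('%Y%m%d')
         for d in range((end - start).days + 1)]

# one chunk per year-month (first 6 characters of 'YYYYMMDD')
monthly = [list(v) for _, v in groupby(dates, key=lambda x: x[:6])]

# precipitation chunks: prepend the last day of the previous chunk
with_overlap = [monthly[i][-1:] + monthly[i + 1] for i in range(len(monthly) - 1)]
with_overlap.insert(0, monthly[0])  # first chunk has no previous day available

# extend the N/W/S/E area by 2 degrees outwards on every side
area = [47, -8, 29, 38]
area_ext = [c + 2 if i in [0, 3] else c - 2 for i, c in enumerate(area)]

print([len(c) for c in monthly], [len(c) for c in with_overlap], area_ext)
```

Note that `groupby` only groups consecutive items, which is sufficient here because the dates are already sorted.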

## Download mean daily data of atmospheric variables<a name="atm_vars"></a>

**For some reason, the multiprocessing used for speeding up the process works only if the atmospheric variables are downloaded first and the precipitation data afterwards**. It is not understood how or why this issue occurs, but the downloaded data are correct and the process produces no wrong outputs.

In [7]:
def atm_subset(input_data):
    '''
    Function for downloading data of atmospheric variables from MARS and calculating daily mean values
    
    :param input_data: list of 4 items in this exact order: levelist, levtype, atm_var, dates_subset
    :param levelist: level of interest, e.g. 0 for surface parameters, 500 for 500 hPa
    :param levtype: level type of interest, e.g. pressure levels ('pl'), surface ('sfc')
    :param atm_var: parameter of interest (e.g. the SLP is flagged as 151.128 at MARS)
    :param dates_subset: the subset of dates to be downloaded
    :return: mv.Fieldset with one daily mean field per date in dates_subset
    '''
    
    levelist, levtype, atm_var, dates_subset = input_data # inputs to be a list of 4 in specific order!
    
    # function for retrieving the data from MARS
    fc_all = mv.retrieve(Class = 'ea', # class of data, e.g. ERA5 ('ea')
                         stream = 'oper', # stream of interest, e.g. Ensemble ('enfo'), Deterministic ('oper') 
                         expver = 1, # experiment's version, e.g. Operational (1), Research (xxxx[A-Z/0-9])
                         type = 'an', # type of data, e.g. Analysis ('an')
                         param = atm_var, 
                         levtype = levtype,
                         levelist = levelist,
                         date = dates_subset,
                         time = list(range(0,24)), # all hourly timesteps
                         area = area_atm_var,
                         grid = grid_atm_var, 
                         )

    Daily_sub = mv.Fieldset() # Fieldset for storing the daily mean fields of dates_subset
    fields = mv.grib_get(fc_all, ['date']) # get the 'date' key of every field in fc_all

    for day_i in dates_subset: # loop through the whole list of unique dates_subset

        used_indices = list(np.where(np.array(fields) == day_i)[0]) # indices that belong to the day of interest
        used_indices = np.array(used_indices, dtype='float64') # convert to float64 for using it at mv object
        daily_subset = fc_all[used_indices] # subset and keep only the fields of the day of interest
        Daily_sub.append(mv.mean(daily_subset)) # calculate the daily mean and append it
    
    return Daily_sub
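
The grouping step inside `atm_subset` can be illustrated without MARS access. In the sketch below, plain numbers stand in for the hourly GRIB fields and `statistics.fmean` stands in for `mv.mean`; the values and dates are hypothetical and serve only to show the select-by-date-then-average pattern.

```python
from statistics import fmean

# hypothetical hourly records (date label, value), standing in for GRIB fields
fields = [('20190101', float(h)) for h in range(24)] + [('20190102', 10.0)] * 24
dates_subset = ['20190101', '20190102']

daily_means = []
for day in dates_subset:
    # indices of the records that belong to the day of interest
    idx = [i for i, (d, _) in enumerate(fields) if d == day]
    # average the 24 hourly values of that day
    daily_means.append(fmean(fields[i][1] for i in idx))

print(daily_means)
```

In the real function each record is a two-dimensional field, so the average is taken across the fields of each day rather than across single numbers.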

Download the data of each variable, store them in a dictionary, and name the keys after the variable, e.g. for "D1_Mean_SLP", keep "SLP" as the key.

Note that for optimizing the downloading in MARS, it is preferable to loop through dates and download all variables, instead of looping through variables and downloading the dates ([find out more](https://confluence.ecmwf.int/display/WEBAPI/Retrieval+efficiency)). This script uses the latter approach for simplicity.
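
A rough sketch of what the date-first alternative could look like is given below. It only builds request dictionaries and does not contact MARS. `build_requests` is a hypothetical helper, and passing several parameters in a single `mv.retrieve` call is an assumption that should be checked against the metview documentation; the daily mean step would also need to separate the returned fields by parameter.

```python
from collections import defaultdict

data_inputs = [[0, 'sfc', '151.128', 'D1_Mean_SLP'],
               [500, 'pl', '129.128', 'D1_Mean_Z500'],
               [850, 'pl', '130.128', 'D1_Mean_T850'],
               [850, 'pl', 'q', 'D1_Mean_Q850'],
               [0, 'sfc', '71.162', 'D1_Mean_WVFeast'],
               [0, 'sfc', '72.162', 'D1_Mean_WVFnorth']]

# group parameters that share level type and level, so that no unwanted
# level/parameter combinations would be requested
groups = defaultdict(list)
for levelist, levtype, atm_var, _ in data_inputs:
    groups[(levtype, levelist)].append(atm_var)

def build_requests(dates_subset):
    """Hypothetical helper: one request dict per (levtype, levelist) group."""
    return [dict(levtype=levtype, levelist=levelist, param=params, date=dates_subset)
            for (levtype, levelist), params in groups.items()]

requests = build_requests(['20190101', '20190102'])
for request in requests:
    print(request)
```

With this grouping, each monthly chunk would need three requests instead of six.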

In [8]:
Times = len(dates_atm_vars)
    
AtmVar = {}
for var in data_inputs:
    
    levelist, levtype, atm_var, file_name = var 
    Inputs = list(zip([levelist]*Times, [levtype]*Times, [atm_var]*Times, dates_atm_vars))
    
    pool_atmvar = multiprocessing.Pool() # object for multiprocessing
    Daily = list(tqdm.tqdm(pool_atmvar.imap(atm_subset, Inputs), 
                           total=Times, position=0, leave=True)) # list of mv.Fieldsets
    pool_atmvar.close()
    del(pool_atmvar)
    
    for i in range(1, Times): # concatenate all Fieldsets to the first one
        Daily[0].append(Daily[i])

    Daily = Daily[0] # keep the full set of the atmospheric variable data
    
    mv.write(dir_loc + file_name + '.grb', Daily) # save data

    AtmVar[var[-1].split('_')[-1]] = Daily
    
del(Times, var, levelist, levtype, atm_var, file_name, Inputs, Daily, i)

100%|██████████| 492/492 [1:18:00<00:00,  9.51s/it]
100%|██████████| 492/492 [59:33<00:00,  7.26s/it]  
100%|██████████| 492/492 [55:07<00:00,  6.72s/it]  
100%|██████████| 492/492 [1:00:22<00:00,  7.36s/it]
100%|██████████| 492/492 [1:16:12<00:00,  9.29s/it]  
100%|██████████| 492/492 [1:12:04<00:00,  8.79s/it]


## Download total daily precipitation<a name="precipitation"></a>

In [9]:
def precip_subset(dates_subset):
    
    ' Function for downloading precipitation data from MARS and calculating daily total values '
    
    fc_all = mv.retrieve(Class = 'ea', # class of data, e.g. ERA5 ('ea')
                         stream = 'oper', # stream of interest, e.g. Ensemble ('enfo'), Deterministic ('oper') 
                         expver = 1, # experiment's version, e.g. Operational (1), Research (xxxx[A-Z/0-9])
                         type = 'fc', # type of data, e.g. Forecast ('fc'), Analysis ('an')
                         param = 'tp', # used parameter: Total Precipitation ('tp' = '228.128')
                         levtype = 'sfc',
                         levelist = 0,
                         date = dates_subset, # use the subset of dates
                         time = [6, 18], # time steps of interest (forecast fields only at 06:00 & 18:00)
                         step = list(range(7,19)), # precipitation is calculated from short-range forecasted data
                         grid = grid_precipt,
                         area = Area_precipt_ext,
                         interpolation = '"--interpolation=grid-box-average"' # first-order conservative remapping
                         )

    Daily_sub = mv.Fieldset() # create the mv object for storing the daily values for the dates_subset

    for i_day in range(len(dates_subset) - 1): # loop through all dates except the first, which only supplies the previous-day fields

        # downloaded data are in the sequence: day_i 06:00 steps 7-18 (12 steps), 18:00 steps 7-18 (12 steps)
        start_indice = 12 + 24*i_day # data for daily accumulation start at 18:00 step 7 of previous day
        end_indice = 12 + 24*(i_day+1) # data end at 06:00 step 18 of current day (24 hourly steps in total)

        sub = mv.sum(fc_all[start_indice : end_indice]) # total daily precipitation
        sub = mv.grib_set(sub, ['date', int(dates_subset[i_day + 1])]) # replace the date field with the correct date

        Daily_sub.append(sub) # append to Daily_sub metview object
    
    return Daily_sub
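
The index arithmetic in `precip_subset` can be verified with labels instead of real fields. The sketch below assumes, as stated in the code comments, that the retrieved fields are ordered by date, then base time, then step, and that each field holds the accumulation over the hour ending at its valid time. The dates and the helpers `daily_slice` and `valid_time` are hypothetical.

```python
from datetime import datetime, timedelta

# previous day plus two target days (hypothetical example dates)
dates_subset = ['20181231', '20190101', '20190102']

# (date, base time, step) labels in the assumed retrieval order
labels = [(d, t, s) for d in dates_subset for t in (6, 18) for s in range(7, 19)]

def daily_slice(i_day):
    """Labels of the 24 fields summed for dates_subset[i_day + 1]."""
    start = 12 + 24 * i_day
    end = 12 + 24 * (i_day + 1)
    return labels[start:end]

def valid_time(label):
    """Valid time of a forecast field: base date + base time + step."""
    d, t, s = label
    return datetime.strptime(d, '%Y%m%d') + timedelta(hours=t + s)

first = daily_slice(0)
print(first[0], '->', valid_time(first[0]))    # first hour of 1 January
print(first[-1], '->', valid_time(first[-1]))  # last hour of 1 January
```

The slice starts at the 18:00 run of the previous day, step 7, and ends at the 06:00 run of the target day, step 18, so the 24 hourly accumulations cover the full calendar day.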

In [10]:
pool_precip = multiprocessing.Pool() # object for multiprocessing for creating a list of mv.Fieldsets
Precip = list(tqdm.tqdm(pool_precip.imap(precip_subset, dates_precipit), 
                        total=len(dates_precipit), position=0, leave=True))
pool_precip.close()

for i in range(1, len(Precip)): # concatenate all Fieldsets to the first one
    Precip[0].append(Precip[i])

Precip = Precip[0] # keep the full set of the precipitation data
Precip = Precip*1000 # convert to mm
Precip = mv.read(data=Precip, area=area_precipt) # crop to the actual area of interest

mv.write(dir_loc + 'D1_Total_Precipitation.grb', Precip) # Save the daily total precipitation file

del(pool_precip, i)

100%|██████████| 492/492 [1:11:28<00:00,  8.72s/it]


In [11]:
print('Downloading completed in:', datetime.now() - InitializationTime, ' HR:MN:SC.')
del(InitializationTime)

Downloading completed in: 7:56:01.443161  HR:MN:SC.
