# Table of Contents
1. [Introduction](#introduction)
2. [Environment](#environment)
    1. [Imports](#imports)
    2. [User-defined inputs](#inputs)
3. [Data Analysis](#analysis)
    1. [Spatiotemporal Analysis of Extreme Precipitation Events](#spatiotemporal)
        1. [Preprocessing](#spatiotemporal-preprocessing)
        2. [Seasonality](#seasonality)
        3. [Temporal Dependencies](#temporal-dependencies)
        4. [Spatiotemporal Overlap](#spatiotemporal-overlap)
    2. [EOF and Clustering](#eof-clustering)
        1. [Preprocessing](#eof-clustering-preprocessing)
        2. [EOF Analysis](#eof-analysis)
        3. [Clustering Analysis](#clustering-analysis)
        4. [Frequencies of Clusters](#clustering-frequencies)
    3. [Connecting Extremes to Large-Scale Patterns](#extremes-to-patterns)
        1. [Auxiliary Functions](#auxiliary)
        2. [Quantifying the Connections](#quantifying-connections)
        3. [Summarizing Results](#summarizing)

# Data analysis for the work presented in the paper: <a name="introduction"></a>
### [Extreme precipitation events in the Mediterranean: Spatiotemporal characteristics and connection to large-scale atmospheric flow patterns](https://rmets.onlinelibrary.wiley.com/doi/10.1002/joc.6985)

---
Author: Nikolaos Mastrantonas\
Email: nikolaos.mastrantonas@ecmwf.int; nikolaos.mastrantonas@doktorand.tu-freiberg.de

---
The data analysis is organized in three sections:
1. **Spatiotemporal Analysis of Extreme Precipitation Events (EPEs)**: Find the seasonality, temporal dependencies, and spatiotemporal overlap.
2. **EOF and Clustering**: Perform EOF analysis and subsequent K-means clustering for a range of combinations of domains and atmospheric variables.
3. **Connecting Extremes to Large-Scale Patterns**: Use the results of the previous step to analyse how the different clusters relate to EPEs at each grid cell of the studied domain.

# Environment<a name="environment"></a>
Load the required packages and get the user-defined inputs.

The analysis was run on a Linux machine with 8 CPUs and 32 GB RAM. The total duration was about 1.5 hours.

## Imports<a name="imports"></a>

Import the required packages (full package or specific functions).

In [1]:
import multiprocessing # parallel processing
import tqdm # timing
from datetime import datetime # timing
from pathlib import Path # creation of directories
import warnings # for suppressing RuntimeWarning

# basic libraries for data analysis
import numpy as np 
import pandas as pd 
import xarray as xr

from itertools import combinations, product

# specialized libraries
from eofs.xarray import Eof # EOF analysis
from sklearn.cluster import KMeans # K-means clustering
from scipy.stats import binom # binomial distribution for significance testing of extremes and large-scale patterns

## User-defined inputs <a name="inputs"></a>

Define the directory of the input data, and specify and create the subfolder that stores all data used for generating the plots.

In [2]:
dir_loc = '' # the main folder where the input data are stored

results_loc = dir_loc + 'DataForPlots/' # the subfolder for storing the results
Path(results_loc).mkdir(parents=True, exist_ok=True) # generate the subfolder for storing the results

Define the inputs related to Spatiotemporal Analysis of the Precipitation.

In [3]:
grb_file_name = 'Data/D1_Total_Precipitation.grb' # the name of the grb file of the precipitation data

P_used = [95, 97, 99] # define the percentile(s) of interest

lags_depend_sets = [1, 3, 7, 15] # check % of extremes that happen up to x days from a previous extreme on same cell

Define the inputs related to EOF and Clustering analysis of the Atmospheric Variables.

In [4]:
variables_used = ['SLP', 'T850', 'Z500'] # variables used for the clustering analysis

# define the areas used for EOF and clustering: dict{area_name: [N, W, S, E]}. Preferred order from larger to smaller
Area_used = {'EuroAtlantic': [80, -90, 20, 30], # EuroAtlantic as used by Cassou 2008
             'MedExt': [55, -20, 20, 50], 'MedExtAtl': [51, -17, 26, 41], 'Med': [50, -11, 26, 41] }

Var_ex = 90 # define the minimum total variance [0-100] that the subset of kept EOFs should explain
Clusters_used = [4]+list(range(7, 13)) # clusters for Kmeans (4 for also analysing the common EuroAtlantic regimes)

# Data Analysis<a name="analysis"></a>

In [5]:
InitializationTime = datetime.now()

## Spatiotemporal Analysis of Extreme Precipitation <a name="spatiotemporal"></a>

### Read Precipitation data and generate auxiliary elements <a name="spatiotemporal-preprocessing"></a>

In [6]:
Precipitation_xr = xr.open_dataarray(dir_loc + grb_file_name, engine='cfgrib') # read data
Precipitation_xr = Precipitation_xr.drop_vars(['valid_time', 'step', 'surface', 'number']) # drop not-used coordinates

In [7]:
P_used = sorted(list(np.array(P_used).flatten())) # make P_used a sorted list for consistency and avoiding errors
lags_depend_sets = sorted(list(np.array(lags_depend_sets).flatten())) # same as above but for temporal dependencies

dates_all = pd.to_datetime(Precipitation_xr.time.values) # extract the dates of the xarray
dates_all = pd.to_datetime(dates_all.strftime('%Y%m%d')) # convert time to refer to start of day (actual is at 18:00)
Precipitation_xr = Precipitation_xr.assign_coords({'time': dates_all}) # change time to start of day
    
# calculate thresholds per location and percentile
Quant = Precipitation_xr.quantile(np.array(P_used)/100, interpolation='linear', dim='time', keep_attrs=True) # thresh.
Quant = Quant.rename({'quantile': 'percentile'}) # rename coordinate
Quant = Quant.assign_coords({'percentile': P_used}) # assign the coord values based on percentiles
    
Quant.to_netcdf(results_loc+'ThresholdsEPEs.nc') # save data

# boolean xarray for identifying if an event is over the threshold
Exceed_xr = [(Precipitation_xr>Quant.sel(percentile=i_p))*1 for i_p in P_used] 
Exceed_xr = xr.concat(Exceed_xr, dim=pd.Index(P_used, name='percentile')) # concatenate data for all percentiles

# create a 2d dataframe of the precipitation data, with the index referring to dates and the columns to grid cells
Prcp_df = Precipitation_xr.values.flatten()
Prcp_df = pd.DataFrame(np.reshape(Prcp_df, (len(Precipitation_xr), -1)), index=dates_all.strftime('%Y%m%d'))

# create DF with the coordinates of the grid points when used as a 2d dataframe
LON, LAT = np.meshgrid(Precipitation_xr.longitude.values, Precipitation_xr.latitude.values)
Coords = pd.DataFrame({'LAT': LAT.flatten(), 'LON': LON.flatten()})

del(LON, LAT)

### Seasonality of Extremes <a name="seasonality"></a>

In [8]:
dates_grouped = [(int(month)%12 + 3)//3 for month in dates_all.month] # 1-4 refers to Winter-Autumn

Exceed_xr = Exceed_xr.assign_coords({'time': dates_grouped}) # change the time values so they refer to the Season
Seasonality = Exceed_xr.groupby('time').sum()/Exceed_xr.sum(dim='time')*100 # percentage of EPEs per Season
Seasonality = Seasonality.rename({'time': 'season'}) # rename coordinate
Seasonality.to_netcdf(results_loc+'Seasonality.nc') # save data

Exceed_xr = Exceed_xr.assign_coords({'time': Precipitation_xr.time.values}) # rename to the actual time

del(dates_grouped)
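
The expression `(month % 12 + 3) // 3` used above maps calendar months to meteorological seasons, so that December is grouped with January and February. A minimal check of this mapping, where the helper name `season_of` is introduced here only for illustration:

```python
def season_of(month):
    """Map a calendar month (1-12) to a season code: 1=Winter (DJF), 2=Spring, 3=Summer, 4=Autumn."""
    return (month % 12 + 3) // 3

# season code of every calendar month
season_codes = {month: season_of(month) for month in range(1, 13)}
print(season_codes)
```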

### Temporal Dependencies of Extremes <a name="temporal-dependencies"></a>

In [9]:
def dependence(x, required_lags=[1, 3, 4, 7, 15]):
    
    '''Return the % of events ("x": dates of events) occurring up to "required_lags" days after a preceding event'''
    
    # pre-process the provided lags so that the function works without errors
    required_lags_used = np.array([required_lags]).flatten()
    required_lags_used = required_lags_used[required_lags_used != 1] # drop lag 1 here; it is always added separately below
    required_lags_used = list(required_lags_used)
    
    # sort array of dates and find the temporal distance in days
    dates_datetime = pd.to_datetime(x).sort_values() # convert to datetime and sort
    dates_dif = np.diff(dates_datetime)/np.timedelta64(1, 'D') # find difference in days (dif. days)
    final_stats = pd.Series(dates_dif).value_counts().sort_index().cumsum() # cumulative sum of events per dif. days
    final_stats.index = final_stats.index.astype(int) # convert index to integer (index refers to temporal lag)
    
    lag1_aux = pd.Series({1:0}) # in case there is no lag1 instances, the index is created and given 0 (0% of EPEs)
    final_stats = final_stats.combine(lag1_aux, max, fill_value=0) # fill 1-day lag with 0 if not available
    wanted_lags = pd.Series(np.nan, index=required_lags_used) # empty series with the lags of interest (lag 1 always)
    final_stats = final_stats.combine(wanted_lags, max) # add the user-defined lags if not available
    final_stats = final_stats.ffill() # forward-fill nan (in case some user-defined lags were missing)
    
    final_stats = final_stats/(len(x)-1)*100 # calculate the percentage of events within each lag
    final_stats = final_stats.loc[required_lags, ] # return only the lags of interest
    
    return final_stats

In [10]:
def temporal_dependence_xr(percentile):
    
    Quant_used = Quant.sel(percentile=percentile)
    Extremes = Prcp_df > Quant_used.values.flatten() # Boolean over / under-upto the Precip threshold
    DaysExceed = Extremes.apply(lambda x: list(Extremes.index[np.where(x == 1)[0]]), axis=0) # exceedance days
    depend_perc = DaysExceed.apply(dependence, required_lags=lags_depend_sets).T # find temporal dependence
    
    Temp_depend = Precipitation_xr[:len(lags_depend_sets)].copy(deep=True) # generate final xr with gridded results
    Temp_depend = Temp_depend.rename({'time': 'temporal_lag'}) # rename coordinate
    Temp_depend = Temp_depend.assign_coords({'temporal_lag': lags_depend_sets}) # assign the dim values based on lags
    depend_perc = depend_perc.values # get the values in np array so that xr gets the values in the correct order
    Temp_depend.values = np.reshape(depend_perc, Temp_depend.shape) # place the values to the final xarray item
        
    return Temp_depend

In [11]:
pool = multiprocessing.Pool() # object for multiprocessing
TemporalDependencies = list(tqdm.tqdm(pool.imap(temporal_dependence_xr, P_used), 
                                      total=len(P_used), position=0, leave=True))
pool.close()

TemporalDependencies = xr.concat(TemporalDependencies, dim=pd.Index(P_used, name='percentile')) # concatenate data
TemporalDependencies.to_netcdf(results_loc+'TemporalDependencies.nc') # save data

del(pool)

100%|██████████| 3/3 [03:13<00:00, 64.63s/it] 
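
The quantity computed by `dependence` can be illustrated on a single toy series of event dates. The following is a simplified sketch, not the function used in the analysis: the name `share_within_lags` and the toy dates are assumptions made for illustration, and the percentage is taken over the gaps between consecutive events (i.e. over `len(x)-1` values, as above).

```python
import numpy as np
import pandas as pd

def share_within_lags(event_dates, lags):
    """Percentage of events that occur at most `lag` days after the preceding event."""
    dates = pd.to_datetime(pd.Series(event_dates)).sort_values()
    gaps = dates.diff().dropna().dt.days.to_numpy()  # days between consecutive events
    return {lag: 100 * float(np.mean(gaps <= lag)) for lag in lags}

# hypothetical event dates with gaps of 1, 3 and 15 days
toy_events = ['2020-01-01', '2020-01-02', '2020-01-05', '2020-01-20']
toy_shares = share_within_lags(toy_events, lags=[1, 3, 7, 15])
print(toy_shares)
```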


### Spatiotemporal Dependencies of Extremes <a name="spatiotemporal-overlap"></a>

In [12]:
def spatiotemporal_dep(percentile):
    
    Quant_used = Quant.sel(percentile=percentile)
    Extremes = Prcp_df > Quant_used.values.flatten() # Boolean over / under-up to Precip threshold
    total_extremes = Extremes.sum() # number of extremes per grid cell
    
    global spatiotemporaloverlap_singlecell
    # Function for spatiotemporal analysis per grid cell. Set in global env. so it can be used in parallel processing
    def spatiotemporaloverlap_singlecell(i_cell):
        
        Data = Precipitation_xr[0].copy(deep=True) # keep a slice of xarray for having the same format 
        Data = Data.rename({'time': 'coordinates'}) # rename time to coordinates (lat, lon of target grid cell)
        Data = Data.assign_coords({'coordinates': i_cell}) # assign the value of index of the analysed grid cell

        extremes_i_cell = Extremes[Extremes.iloc[:,i_cell]==1] # keep only dates of extremes
        overlap = extremes_i_cell.sum() # count extremes per each loc that happen same day as extremes at i_cell
        overlap /= total_extremes # divide with total extremes per location, to find the percentage of co-occurrence
        overlap *= 100 # multiply with 100 for giving the value as percentage   

        Data.values = np.reshape(overlap.values, Data.shape) # place the values to the final xarray item

        return Data
    
    pool = multiprocessing.Pool() # object for multiprocessing
    Spatiotemporal_overlap = list(tqdm.tqdm(pool.imap(spatiotemporaloverlap_singlecell, range(Extremes.shape[1])), 
                                            total=Extremes.shape[1], position=0, leave=True))
    Spatiotemporal_overlap = xr.concat(Spatiotemporal_overlap, dim='coordinates') # concatenate all data
    pool.close()
    
    del(spatiotemporaloverlap_singlecell) # delete the function so it is not anymore in the global environment
    
    return Spatiotemporal_overlap

In [13]:
SpatiotemporalOverlap = [spatiotemporal_dep(percentile=i_p) for i_p in P_used]
SpatiotemporalOverlap = xr.concat(SpatiotemporalOverlap, dim=pd.Index(P_used, name='percentile')) # concatenate data

# change the simple indexing value of the "coordinates" dim, with the actual coordinates of the target grid cell
Coordinates = Coords.apply(lambda x:'Lat:{}, Lon:{}'.format(x['LAT'], x['LON']) , axis=1).values
SpatiotemporalOverlap = SpatiotemporalOverlap.assign_coords({'coordinates': Coordinates})

# save only the locations used for the plots
locs_of_interest = [3435, 2417, 4385, 9072, 8118, 820, 622] # locs of interest for the plots
SpatiotemporalOverlap.isel(coordinates=locs_of_interest).to_netcdf(results_loc+'SpatiotemporalOverlap.nc') # save data

del(Coordinates, locs_of_interest)

100%|██████████| 13505/13505 [02:08<00:00, 104.80it/s]
100%|██████████| 13505/13505 [02:03<00:00, 109.72it/s]
100%|██████████| 13505/13505 [01:14<00:00, 181.75it/s]
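
The overlap calculated per target grid cell above is the number of days with extremes at both the target cell and another cell, divided by the total extremes of that other cell. A vectorized sketch of the same quantity on a small, made-up boolean matrix (all `toy_*` names and values are illustrative assumptions, and a full cell-by-cell matrix like this may not fit in memory for the real grid):

```python
import numpy as np

# toy data: rows are days, columns are grid cells; 1 marks an extreme
toy_extremes = np.array([[1, 1, 0],
                         [1, 0, 0],
                         [0, 1, 1],
                         [0, 0, 1]])

totals = toy_extremes.sum(axis=0)              # extremes per grid cell
co_occurrence = toy_extremes.T @ toy_extremes  # [i, j]: days with extremes at both i and j
toy_overlap = 100 * co_occurrence / totals     # divide column j by the total extremes at cell j
print(toy_overlap)
```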


Delete variables that are no longer needed, to free memory.

In [14]:
del(SpatiotemporalOverlap, spatiotemporal_dep, TemporalDependencies, temporal_dependence_xr, dependence, Seasonality,
    Prcp_df, Precipitation_xr, lags_depend_sets, grb_file_name, Coords)

## EOF of Atmospheric variables and Clustering of daily patterns based on derived PCs <a name="eof-clustering"></a>

### Read the data and perform preprocessing by calculating the daily anomalies <a name="eof-clustering-preprocessing"></a>

In [15]:
def anomalies(variable):
    
    # read actual daily values
    file_path = dir_loc + 'Data/D1_Mean_'+variable+'.grb'
    Daily = xr.open_dataarray(file_path, engine='cfgrib') # read data
    
    actual_days = Daily.time.values # get actual timesteps
    dates_grouped = pd.to_datetime(Daily.time.values).strftime('%m%d') # get Month-Day of each timestep
    
    # 5-day smoothed climatology. Rolling can be applied directly because the daily data refer to consecutive days. If
    # days are not consecutive, xr.resample should be applied first, so that missing days are generated with NaN
    Smoothed = Daily.rolling(time=5, center=True, min_periods=1).mean() # 5-day smoothing
    
    Daily = Daily.assign_coords({'time': dates_grouped}) # change the time to Month-Day
    Smoothed = Smoothed.assign_coords({'time': dates_grouped}) # change the time to Month-Day
    
    Climatology = Smoothed.groupby('time').mean() # climatology of the smoothed data
    
    Anomalies = Daily.groupby('time') - Climatology
    Anomalies = Anomalies.assign_coords({'time': actual_days}) # change back to the original timestep information
    
    return Anomalies

In [16]:
pool = multiprocessing.Pool() # object for multiprocessing
Anomalies = list(tqdm.tqdm(pool.imap(anomalies, variables_used), 
                           total=len(variables_used), position=0, leave=True))
pool.close()

Anomalies = {variables_used[i_c]: i_anom for i_c, i_anom in enumerate(Anomalies)}

del(pool)

100%|██████████| 3/3 [00:36<00:00, 12.13s/it]
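
The anomaly calculation can be sketched for a single time series with pandas instead of xarray. This is only an illustration under assumptions: the function name `daily_anomalies` and the synthetic series are not part of the analysis, and the real data are gridded fields rather than one series.

```python
import numpy as np
import pandas as pd

def daily_anomalies(series):
    """Anomalies of a daily series relative to a 5-day smoothed calendar-day climatology."""
    calendar_day = series.index.strftime('%m%d')  # Month-Day of each timestep
    smoothed = series.rolling(window=5, center=True, min_periods=1).mean()  # 5-day smoothing
    climatology = smoothed.groupby(calendar_day).transform('mean')  # aligned with the original dates
    return series - climatology

# synthetic daily series with a simple annual cycle (three non-leap years)
toy_dates = pd.date_range('2001-01-01', '2003-12-31', freq='D')
toy_series = pd.Series(10 + 5 * np.sin(2 * np.pi * toy_dates.dayofyear / 365), index=toy_dates)
toy_anomalies = daily_anomalies(toy_series)
print(toy_anomalies.abs().max())
```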


### EOF Analysis <a name="eof-analysis"></a>

In [17]:
def eof_analysis(input_data):
    
    area_subset = input_data[0] # name of area_used (based on the keys of the "Area_used" dictionary)
    area_subset = Area_used[area_subset] # list with the coordinates of the bounding box of the selected area
    variable = input_data[1] # name of variable used (based on the keys of the "Anomalies" dictionary)
    
    dataset_used = Anomalies[variable].copy(deep=True) # dataset to be used for the analysis
    
    # subset area of interest 
    dataset_used = dataset_used.sel(latitude=slice(area_subset[0], area_subset[2]), 
                                    longitude=slice(area_subset[1], area_subset[3]))    
    
    
    coslats = np.cos(np.deg2rad(dataset_used.latitude.values)).clip(0, 1) # coslat for weights on EOF
    wgts = np.sqrt(coslats)[..., np.newaxis] # calculation of weights
    solver = Eof(dataset_used, weights=wgts) # EOF analysis of the subset
    
    N_eofs = int(np.searchsorted(np.cumsum(solver.varianceFraction().values), Var_ex/100)) # n. of EOFs needed
    N_eofs += 1 # searchsorted returns a zero-based index, so add 1 to get the number of EOFs
    
    EOFS = solver.eofs(neofs=N_eofs)
    PCS = pd.DataFrame(solver.pcs(npcs=N_eofs).values, index=dataset_used.time.values)
    VARS = solver.varianceFraction(neigs=N_eofs).values*100
    NOR = solver.northTest(neigs=N_eofs, vfscaled=True).values*100
    
    return {'EOFS': EOFS, 'PCS': PCS, 'VARS': VARS}

Note that the EOF analysis, as performed below with multiprocessing, consumes all of the available 32 GB of memory, so it may be necessary to run the section below with fewer cores, or in a loop (single core).

In [18]:
pool = multiprocessing.Pool()
Combs_used = list(product(Area_used.keys(), variables_used)) # generate all combinations of area and variable
EOF_analysis = list(tqdm.tqdm(pool.imap(eof_analysis, Combs_used), total=len(Combs_used), position=0, leave=True))
pool.close()

Combs_used = ['_'.join(i) for i in Combs_used]
EOF_analysis = {Combs_used[i_c]: i_eof for i_c, i_eof in enumerate(EOF_analysis)}

# save data used for plots (identified from the results of the last part of this script; Check Script4 Aux. Figure)
for key in ['Med_SLP', 'Med_Z500']:
    EOF_analysis[key]['EOFS'].to_netcdf(results_loc+'ComponentsEOF_{}.nc'.format(key)) # save EOF data for plotting
    np.savetxt(results_loc+'VarianceEOF_{}.out'.format(key), EOF_analysis[key]['VARS']) # save Variance data 
    
del(pool, Combs_used, key)

100%|██████████| 12/12 [11:18<00:00, 56.57s/it]  
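
Two steps of `eof_analysis` can be checked in isolation with NumPy: the latitude weights and the number of EOFs needed to reach the variance threshold. The variance fractions below are made up for illustration and are not results of the analysis.

```python
import numpy as np

# latitude weights: sqrt(cos(lat)), a common choice for EOF analysis on a regular lat/lon grid
toy_latitudes = np.array([60.0, 30.0, 0.0])
toy_weights = np.sqrt(np.cos(np.deg2rad(toy_latitudes)).clip(0, 1))

# hypothetical variance fractions of the leading EOFs (not taken from the analysis)
toy_variance_fraction = np.array([0.50, 0.30, 0.15, 0.05])
toy_threshold = 0.90  # corresponds to Var_ex = 90

# searchsorted gives the zero-based position of the first cumulative value >= threshold
n_eofs = int(np.searchsorted(np.cumsum(toy_variance_fraction), toy_threshold)) + 1
print(toy_weights, n_eofs)
```

With these toy values the cumulative fractions are 0.50, 0.80, 0.95 and 1.00, so three EOFs are needed to explain at least 90% of the variance.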


### Clustering of Daily Patterns <a name="clustering-analysis"></a>

In [19]:
def PC_norm(var_used):
    
    ' Normalize PCs based on standard deviation and weight them based on the % of explained variance'
    
    PCs = EOF_analysis[var_used]['PCS'] # extract PCs
    Stand = PCs/PCs.std() # standardize PCs
    
    # normalize per sqrt of variance so K-means distance is weighted based on the importance of each PC to expl. var.
    variance = EOF_analysis[var_used]['VARS']
    
    return Stand*np.sqrt(variance)
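
The effect of `PC_norm` can be verified on random data: after the scaling, the standard deviation of each PC equals the square root of its explained variance, so PCs that explain more variance contribute more to the K-means distance. All values below are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy_pcs = pd.DataFrame(rng.normal(size=(200, 3)) * [4.0, 2.0, 1.0])  # hypothetical PCs
toy_variance = np.array([60.0, 25.0, 15.0])  # hypothetical % of explained variance per PC

toy_standardized = toy_pcs / toy_pcs.std()               # unit standard deviation per PC
toy_weighted = toy_standardized * np.sqrt(toy_variance)  # rescale by sqrt of explained variance
print(toy_weighted.std())
```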

In [20]:
def combo_clusters(input_data):
    
    area_used = input_data[0] # area_subset used
    var_used = input_data[1] # variable used
    
    var_used = list( np.array(var_used).flatten() ) # consistent format; always list
    var_used = [area_used+'_'+i_var for i_var in var_used] # add info about area, so it is same name as EOF keys
    
    """ 
    Get the PCs of interest. If only 1 variable, then use the actual PCs, and if more than 1, then use normalized PCs.
    In fact the results with actual data or normalized for only 1 variable are practically the same (~99% similarity),
    but the actual ones are preferred for reducing additional computations that potentially lead to rounding errors.
    """
    if len(var_used) == 1:
        PCs = EOF_analysis[var_used[0]]['PCS']
    else:
        all_PCs = [PC_norm(i) for i in var_used] # make a list with the PCs of all variables of interest
        PCs = pd.concat(all_PCs, axis=1) # concatenate all the PCs to the final DF
    
    Col_names = ['Clusters_'+str(i_c) for i_c in Clusters_used]
    Labels_all = pd.DataFrame(np.nan, columns=Col_names, index=PCs.index)
    
    for i_c, clusters_used in enumerate(Clusters_used):
        
        KM_cluster = KMeans(n_clusters=clusters_used)
        np.random.seed(10) # set always the same seed for reproducibility
        KM_cluster.fit(PCs)
        Labels_all.iloc[:, i_c] = KM_cluster.labels_
        
    return Labels_all

In [21]:
def comb_lists(r):
    
    ' Use the combinations function to generate all combs of used variables from only 1, up to all of them together '
    
    data = combinations(variables_used, r)
    
    return [list(i) for i in data]

In [22]:
All_combs = [comb_lists(i) for i in range(1, len(variables_used)+1)] # create list with all combinations of variables
All_combs = [j for i in All_combs for j in i] # concat the sublists
All_combs = list(product(Area_used.keys(), All_combs)) # final list with all combinations of variables and areas

pool = multiprocessing.Pool() # object for multiprocessing
Clustering = list(tqdm.tqdm(pool.imap(combo_clusters, All_combs), total=len(All_combs), position=0, leave=True))
pool.close()

# change format so it can be used as dictionary key; format: <area_used>_<var1>~<var2>~<varN>, var2, ..N if avail.
All_combs = [[i, '~'.join(j)]  for i, j in All_combs]
All_combs = ['_'.join(i) for i in All_combs]

Clustering = {All_combs[i_c]: i_clustering for i_c, i_clustering in enumerate(Clustering)}

# save data used for plots (identified from the results of the last part of this script; Check Script4 Aux. Figure)
Clustering['Med_SLP~Z500'].to_csv(results_loc+'Clusters_Med_SLP~Z500.csv') 

del(All_combs, pool)

100%|██████████| 28/28 [01:19<00:00,  2.84s/it]
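
The 28 combinations reported by the progress bar follow from 7 non-empty subsets of the 3 variables times 4 areas. A short standard-library sketch of how the dictionary keys are built (the `toy_*` names are used only to avoid clashing with the variables of the analysis):

```python
from itertools import combinations, product

toy_variables = ['SLP', 'T850', 'Z500']
toy_areas = ['EuroAtlantic', 'MedExt', 'MedExtAtl', 'Med']

# all non-empty subsets of the variables, from single variables up to all three together
variable_sets = [list(c) for r in range(1, len(toy_variables) + 1)
                 for c in combinations(toy_variables, r)]

# dictionary keys in the format <area>_<var1>~<var2>~...
toy_keys = [area + '_' + '~'.join(variables) for area, variables in product(toy_areas, variable_sets)]
print(len(toy_keys), toy_keys[:3])
```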


Save the composites of **Med_SLP~Z500** combination for **9 clusters**, which is the preferred combination. (This is identified from the results of the last part of this script; Check Script4 Auxiliary Figure.)

In [23]:
# Mediterranean 9 Regimes
Labels_selection = Clustering['Med_SLP~Z500']['Clusters_9'] # get labels of cluster of interest
Comp_SLP = Anomalies['SLP'].assign_coords({'time': Labels_selection.values}).groupby('time').mean() # composites SLP
Comp_Z500 = Anomalies['Z500'].assign_coords({'time': Labels_selection.values}).groupby('time').mean() # compos. Z500
Composites = xr.concat([Comp_SLP, Comp_Z500], dim=pd.Index(['SLP', 'Z500'], name='variable')) # concatenate variables
Composites = Composites.rename({'time': 'cluster'}).drop_vars(['surface', 'isobaricInhPa', 'step', 'number'])
Composites = Composites.sel(latitude=slice(Area_used['Med'][0], Area_used['Med'][2]), # adjust domain for having only
                            longitude=slice(Area_used['Med'][1], Area_used['Med'][3])) # .. area of actual data used 

Composites.to_netcdf(results_loc+'ClusteringComposites_Med_SLP~Z500_Clusters9.nc') # save data used for plotting

del(Labels_selection, Comp_SLP, Comp_Z500, Composites)

### Frequencies of derived clusters per year and seasons <a name="clustering-frequencies"></a>

In [24]:
def aggregated_occurrence(x, aggregation='M', percentage=False):
    
    """
    Aggregate temporal dataframe and return the summation of occurrences or their percentages per column.
    As default the aggregation is done for monthly data and the summation of occurrences is returned.
    """
    
    data = x.copy()
    data.index = pd.to_datetime(data.index.astype(str))
    data.index = data.index.to_period(aggregation).astype(str)
    
    y = data.groupby([data.index, data.values]).size().to_frame('Occurrence').reset_index()
    y.columns = ['Date', 'level_1', 'Occurrence']
    y = y.pivot_table(values='Occurrence', index='Date', columns='level_1').fillna(0)
    
    if percentage:
        y = y.apply(lambda x: (100*x)/x.sum(), axis=1)
        
    return y

In [25]:
def seasonal_occurrences(combination_used):
    
    seasons_names = {'Q1': 'Winter', 'Q2': 'Spring', 'Q3': 'Summer', 'Q4': 'Autumn'} # auxiliary for renaming

    data_clusters = Clustering[combination_used] # read the data with the labels
    Occurrences = pd.DataFrame()
    for i_clusters in data_clusters: # loop through the columns (referring to number of clusters)
        
        data_used = data_clusters[i_clusters] # daily data used for the analysis

        # create the datasets with the occurrences per season and per year
        ssn_occur = aggregated_occurrence(data_used, aggregation='Q-Nov', percentage=True)
        ssn_occur = ssn_occur.iloc[1:-1,:] # first and last seasons are not complete, so remove the data
        ann_occur = aggregated_occurrence(data_used, aggregation='A', percentage=True)

        # process the ssn_occur for deriving all subsets of year-season-cluster in a melted format
        occurs_all = pd.melt(ssn_occur.reset_index(), id_vars='Date', value_vars=ssn_occur.columns)
        occurs_all['Season'] = [i[-2:] for i in occurs_all.Date.astype(str)] # keep season info, without the year
        occurs_all.replace({'Season': seasons_names}, inplace=True)
        occurs_all['Date'] = occurs_all.Date.astype(str).apply(lambda x: x[:-2]).astype(int) # keep year info

        # process the ann_occur data for having same format as the occurs_all
        ann_occur = pd.melt(ann_occur.reset_index(), id_vars='Date', value_vars=ann_occur.columns)
        ann_occur['Season'] = 'All'
        ann_occur['Date'] = ann_occur['Date'].astype(str).astype(int)

        # concatenate both dataframes for having the final dataframe
        occurs_all = pd.concat([occurs_all, ann_occur])
        occurs_all.index = pd.RangeIndex(len(occurs_all))
        occurs_all.columns = ['Date', 'Cluster', 'Occurrence (%)', 'Season']
        occurs_all['Clusters'] = i_clusters
        
        Occurrences = pd.concat([Occurrences, occurs_all])
        
    return Occurrences

In [26]:
pool = multiprocessing.Pool() # object for multiprocessing
All_combs_names = list(Clustering.keys())
SeasonalOccurrences = list(tqdm.tqdm(pool.imap(seasonal_occurrences, All_combs_names), 
                           total=len(All_combs_names), position=0, leave=True))
pool.close()

SeasonalOccurrences = {All_combs_names[i_c]: i_clustering for i_c, i_clustering in enumerate(SeasonalOccurrences)}

# save data used for plots (identified from the results of the last part of this script; Check Script4 Aux. Figure)
SeasonalOccurrences['Med_SLP~Z500'].to_csv(results_loc+'Frequencies_Med_SLP~Z500.csv')

del(pool, All_combs_names)

100%|██████████| 28/28 [00:20<00:00,  1.33it/s]
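
The seasonal aggregation relies on quarterly periods of a year that ends in November, so that December is grouped with the following January and February. A small sketch of how such period labels are split into season and year, using a few made-up dates (the alias is written here as `Q-NOV`; the `toy_*` names are illustrative):

```python
import pandas as pd

toy_days = pd.to_datetime(['2000-12-15', '2001-02-28', '2001-03-01', '2001-11-30'])
toy_quarters = toy_days.to_period('Q-NOV').astype(str)  # quarters of a year that ends in November
toy_seasons = [q[-2:] for q in toy_quarters]            # keep only the quarter, e.g. 'Q1'
toy_years = [int(q[:-2]) for q in toy_quarters]         # year label of each season
print(list(zip(toy_seasons, toy_years)))
```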


Delete variables that are no longer needed, to free memory.

In [27]:
del(SeasonalOccurrences, seasonal_occurrences, aggregated_occurrence, comb_lists, combo_clusters, 
    PC_norm, EOF_analysis, eof_analysis, anomalies, Anomalies)

## Connection of Localized Extremes to Large-Scale Atmospheric Flow Patterns <a name="extremes-to-patterns"></a>

### Auxiliary functions <a name="auxiliary"></a>

In [28]:
# cumulative distribution of binomial for statistical significance testing
def binom_test(occurrences, propabilities):
    return binom.cdf(k=occurrences-1, n=occurrences.sum(), p=propabilities)
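
The logic of the test can be illustrated for one grid cell and one cluster: given `n` extremes in total and a cluster frequency `p`, the value P(X <= k-1) is close to 1 when the `k` extremes observed in that cluster are more than expected by chance. The sketch below swaps `scipy.stats.binom.cdf` for a plain standard-library sum, and all numbers are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed with the standard library only."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))

toy_n = 20    # hypothetical number of extremes at a grid cell
toy_p = 0.10  # hypothetical climatological frequency of a cluster

# P(X <= k-1) for two hypothetical counts of extremes falling in that cluster
prob_many = binom_cdf(8 - 1, toy_n, toy_p)
prob_few = binom_cdf(2 - 1, toy_n, toy_p)
print(prob_many > 0.975, prob_few > 0.975)
```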

In [29]:
def transition_matrix(data, lead=1):
    
    '''     
    Function for calculating the transition matrix M of an item (list/numpy/pandas Series/pandas single column DF),
    where M[i][j] is the probability of transitioning from state i to state j.
    Basic code taken from stackoverflow:
    https://stackoverflow.com/questions/46657221/generating-markov-transition-matrix-in-python
    
    NOTE!: Data should not have NaN values, otherwise the code crashes!
    
    :param data : input data: one dimensional vector with elements of same type (e.g. all str, or all float, etc)
    :param lead : lead time for checking the transition (default=1)
    :return     : transition matrix as pandas DataFrame
    '''
    
    if type(data) == pd.core.frame.DataFrame:
        data_used = list(data.values.flatten())
    else:
        data_used = data

    unique_states = sorted(set(data_used)) # get the names of the unique states and sort them
    
    dict_sequencial = {val: i for i, val in enumerate(unique_states)} # sequential numbering of states
    
    transitions_numbered = pd.Series(data_used).map(dict_sequencial) # map the data to sequential order
    transitions_numbered = transitions_numbered.values # get only the actual values of the Series
    
    n = len(unique_states) # number of unique states

    M = [[0]*n for _ in range(n)] # transition matrix

    for (i,j) in zip(transitions_numbered,transitions_numbered[lead:]): # the total times of the transition M[i][j]
        M[i][j] += 1

    # now convert to probabilities:
    for row in M:
        s = sum(row)
        if s > 0:
            row[:] = [f/s for f in row]
    
    M = pd.DataFrame(M, columns=unique_states, index=unique_states) # convert to DF and name columns/rows as per data
    
    return M
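
A compact standard-library variant of the same idea, applied to a short made-up sequence of cluster labels, shows what the matrix and its diagonal (the persistence used later) look like. The function name `toy_transition_matrix` is an assumption for illustration and is not used elsewhere.

```python
from collections import Counter

def toy_transition_matrix(states, lead=1):
    """Row-normalized counts of transitions state[t] -> state[t + lead]."""
    labels = sorted(set(states))
    counts = Counter(zip(states, states[lead:]))  # occurrences of each (from, to) pair
    matrix = []
    for i in labels:
        row = [counts[(i, j)] for j in labels]
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return labels, matrix

toy_labels, toy_matrix = toy_transition_matrix([0, 0, 1, 0, 1, 1])
toy_persistence = [toy_matrix[i][i] for i in range(len(toy_labels))]  # self-transition probabilities
print(toy_matrix, toy_persistence)
```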

In [30]:
def statistics_clusters(input_data):
    
    ' Calculate statistics of occurrences and limits of climatological frequencies for each cluster '
    
    type_used = input_data[0] # variables & area used for clustering (e.g. "Med_T850" or "Full_SLP~Z500")
    clusters_used = input_data[1] # number of clusters used for K-means (e.g. "Clusters_9")

    Data = Clustering[type_used][clusters_used] # cluster pd.Series with cluster label for each day
    n_clusters = int(clusters_used.split('_')[1]) # number of clusters used, as integer

    # days per cluster, and statistics of total occurrences
    Totals = Data.value_counts() # days per cluster [use all the daily data available at the clustering results]
    Totals = pd.DataFrame(Totals.reindex(range(n_clusters))) # sort the data per cluster order
    Totals.rename(columns={clusters_used: 'Occurrences'}, inplace=True) # rename column
    
    # persistence, climatological frequencies, and effective size due to persistence
    transitions = transition_matrix(Clustering[type_used][clusters_used]) # next-day transition probabilities matrix
    Totals['Persistence'] = np.diag(transitions) # self-transition probability
    total_days = len(Data) # total days used for clustering
    Totals['Percent'] = Totals['Occurrences']/total_days # climatological frequencies
    Totals['N_ef'] = total_days*(1-Totals['Persistence'])/(1+Totals['Persistence']) # effective length

    # 95% CI of climatological frequencies: use normal approximation to Binomial distr. considering effective length 
    Totals['Perc_Upper'] = Totals['Percent']+1.96*np.sqrt(Totals['Percent']*(1-Totals['Percent'])/Totals['N_ef'])
    Totals['Perc_Lower'] = Totals['Percent']-1.96*np.sqrt(Totals['Percent']*(1-Totals['Percent'])/Totals['N_ef'])

    # Precipitation data do not include 1st Jan 1979, so use the Precipitation dates for accurate results
    subset_totals = Data.loc[dates_all].value_counts() # days per cluster for the Precipitation data
    Totals['Subset_Occurrences'] = subset_totals.reindex(range(n_clusters)) # sort the data per cluster order
    Totals['Occur_Max'] = np.ceil(Totals['Perc_Upper']*len(dates_all)) # ceiling to get the next integer
    
    return (Totals, Data)
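
The effective length and the confidence interval used in `statistics_clusters` can be worked through by hand for one cluster. The numbers below are hypothetical and serve only to show the arithmetic: the effective length is N(1-p)/(1+p) for persistence p, and the interval uses the normal approximation to the binomial distribution.

```python
import math

toy_total_days = 1000  # hypothetical number of days
toy_frequency = 0.20   # hypothetical climatological frequency of a cluster
toy_persist = 0.50     # hypothetical self-transition probability

# effective sample size: persistence reduces the number of independent days
toy_n_eff = toy_total_days * (1 - toy_persist) / (1 + toy_persist)

# 95% CI of the frequency via the normal approximation to the binomial distribution
half_width = 1.96 * math.sqrt(toy_frequency * (1 - toy_frequency) / toy_n_eff)
toy_lower, toy_upper = toy_frequency - half_width, toy_frequency + half_width
print(round(toy_n_eff, 1), round(toy_lower, 3), round(toy_upper, 3))
```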

### Quantifying the connections <a name="quantifying-connections"></a>

In [31]:
def extremes_to_clusters(input_data):
    
    ' Calculate connection of extremes to patterns; % of events per cluster, condit. prob. and stat. sign. '
    ' input data: list of 2: 1) variables & area used for clustering, and 2) number of clusters used'
    
    Totals, Data = statistics_clusters(input_data) # get statistics of clusters and daily attributions of labels
    
    ExceedCounts = Exceed_xr.copy(deep=True)
    ExceedCounts = ExceedCounts.assign_coords({'time': Data.loc[Exceed_xr.time.values].values}) # change to cluster id
    ExceedCounts = ExceedCounts.rename({'time': 'cluster'}) # rename the coordinate
    ExceedCounts = ExceedCounts.groupby('cluster').sum() # find total extremes at each cell allocated per cluster
    
    RatioCluster = ExceedCounts.transpose(..., 'cluster')/Totals['Subset_Occurrences'].values*100 # conditional prob.
    RatioClusterMax = ExceedCounts.transpose(..., 'cluster')/Totals['Occur_Max'].values*100 # cond. prob. of 95% freq.
    Exceed_Perc = ExceedCounts/ExceedCounts.sum(dim=['cluster'])*100 # percent of extremes per cluster
    
    "check statistical significance of occurrences based on binomial distribution for 95% Confidence Interval"
    # perform the analysis for the Upper tail and use the Upper 95% CI for the cluster probability
    Binom_Cum_Upper = ExceedCounts.copy(deep=True) # new xr (SOS: deep=True otherwise the data are overwritten later)
    Binom_Cum_Upper = Binom_Cum_Upper.astype(float) # convert to float from int
    Binom_Cum_Upper = Binom_Cum_Upper.transpose('cluster', ...)
    Counts_np = Binom_Cum_Upper.values.copy() # numpy of values for applying the function below
    Binom_Cum_Upper_np = np.apply_along_axis(binom_test, propabilities=Totals['Perc_Upper'],  axis=0, arr=Counts_np)
    Binom_Cum_Upper[:] = Binom_Cum_Upper_np # pass the results to the xr

    # perform the analysis for the Lower tail and use the Lower 95% CI for the cluster probability
    Binom_Cum_Lower = Binom_Cum_Upper.copy(deep=True)
    Binom_Cum_Lower_np = np.apply_along_axis(binom_test, propabilities=Totals['Perc_Lower'],  axis=0, arr=Counts_np)
    Binom_Cum_Lower[:] = Binom_Cum_Lower_np

    # significance flag: 1 = significantly more extremes than expected, -1 = significantly fewer, 0 = not significant
    Sign = (Binom_Cum_Upper > .975)*1 + (Binom_Cum_Lower < .025)*(-1)

    # final object with counts, percentages, and statistical significance
    All_data = [ExceedCounts, Exceed_Perc, RatioCluster, RatioClusterMax, Sign]
    Coord_name = ['Counts', 'PercExtremes', 'CondProb', 'CondProbUpperLimit', 'Significance']
    Coord_name = pd.Index(Coord_name, name='indicator')
    Final = xr.concat(All_data, dim=Coord_name)
    
    return Final
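
The significance flag computed in the function above can be sketched for a single grid cell. The helper `binom_test` is defined earlier in the notebook and is not shown in this section, so the snippet below rests on an assumption: that it evaluates the binomial cumulative probability of each cluster's count, with the cell's total number of extremes as the number of trials. The functions `binom_cdf` and `significance` and all input values are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed directly from the probability mass function."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def significance(counts, p_upper, p_lower):
    """Return 1, 0 or -1 per cluster (more than, as, or fewer than expected) for one grid cell."""
    n = sum(counts)  # assumption: trials = all extremes at the cell
    flags = []
    for k, pu, pl in zip(counts, p_upper, p_lower):
        more = binom_cdf(k, n, pu) > 0.975   # unusually many, even for the upper frequency bound
        fewer = binom_cdf(k, n, pl) < 0.025  # unusually few, even for the lower frequency bound
        flags.append(int(more) - int(fewer))
    return flags

counts = [30, 10, 2]          # hypothetical extremes per cluster at one cell
p_upper = [0.36, 0.36, 0.36]  # hypothetical upper 95% limits of cluster frequency
p_lower = [0.30, 0.30, 0.30]  # hypothetical lower 95% limits of cluster frequency
flags = significance(counts, p_upper, p_lower)
print(flags)
```

Using the upper frequency limit for the upper tail and the lower limit for the lower tail means a cluster is flagged only when the result holds across the whole 95% range of its climatological frequency.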

In [32]:
StartTime = datetime.now()

ExtremesClusters = {}
for i_type in Clustering:
    
    clusters_used = ['Clusters_'+ str(i) for i in Clusters_used]
    input_combo = list(zip([i_type]*len(clusters_used), clusters_used))
    
    # this section is memory demanding so it is not recommended to use multiprocessing
    Extremes_Subset = [extremes_to_clusters(input_combo_i) for input_combo_i in input_combo] 
    ExtremesClusters[i_type] = {clusters_used[i]: j for i, j in enumerate(Extremes_Subset)}
    print('Analysis for {} completed.'.format(i_type))
    
print('\nAnalysis completed in:', datetime.now()-StartTime, ' HR:MN:SC.')

# save data used for plots (identified from the results of the last part of this script; Check Script4 Aux. Figure)
ExtremesClusters['Med_SLP~Z500']['Clusters_9'].to_netcdf(results_loc+'ClusteringStats_Med_SLP~Z500_Clusters9.nc')

del(i_type, clusters_used, input_combo, Extremes_Subset, StartTime)

Analysis for EuroAtlantic_SLP completed.
Analysis for EuroAtlantic_T850 completed.
Analysis for EuroAtlantic_Z500 completed.
Analysis for EuroAtlantic_SLP~T850 completed.
Analysis for EuroAtlantic_SLP~Z500 completed.
Analysis for EuroAtlantic_T850~Z500 completed.
Analysis for EuroAtlantic_SLP~T850~Z500 completed.
Analysis for MedExt_SLP completed.
Analysis for MedExt_T850 completed.
Analysis for MedExt_Z500 completed.
Analysis for MedExt_SLP~T850 completed.
Analysis for MedExt_SLP~Z500 completed.
Analysis for MedExt_T850~Z500 completed.
Analysis for MedExt_SLP~T850~Z500 completed.
Analysis for MedExtAtl_SLP completed.
Analysis for MedExtAtl_T850 completed.
Analysis for MedExtAtl_Z500 completed.
Analysis for MedExtAtl_SLP~T850 completed.
Analysis for MedExtAtl_SLP~Z500 completed.
Analysis for MedExtAtl_T850~Z500 completed.
Analysis for MedExtAtl_SLP~T850~Z500 completed.
Analysis for Med_SLP completed.
Analysis for Med_T850 completed.
Analysis for Med_Z500 completed.
Analysis for Med_SLP

In [33]:
del(Exceed_xr, binom_test, transition_matrix, statistics_clusters, extremes_to_clusters)

### Summary Statistics for the connections between Extremes and Large-Scale Patterns <a name="summarizing"></a>

In [34]:
def sort_data(slice_used):
    
    """ 
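    Sort data based on multiple conditions. np.lexsort sorts ascending and treats the last key as the primary one,
    so the order of importance from most to least is: significance (-1, 0, 1), cond. prob., perc. extremes, and
    cond. prob. based on the upper 95% limit of the cluster frequency. The result is reversed to descending order.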
    """
    test_sort = np.lexsort((slice_used[3], slice_used[2], slice_used[1], slice_used[0]))
    
    return test_sort[::-1] # reverse the sorting from max to min (descending)
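
The key ordering of `np.lexsort` is easy to get backwards, so a small worked case may help. The sketch below uses a hypothetical 4-cluster slice for a single grid cell; only the call pattern mirrors `sort_data`, the values are invented.

```python
import numpy as np

# rows: significance, cond. prob., % of extremes, cond. prob. (upper limit)
# columns: four hypothetical clusters at one grid cell
slice_used = np.array([
    [0.0, 1.0, 1.0, -1.0],
    [9.0, 4.0, 6.0, 12.0],
    [20.0, 30.0, 25.0, 25.0],
    [8.0, 3.5, 5.0, 10.0],
])

# the LAST key passed to lexsort is the primary one, so significance dominates;
# ties in significance (clusters 1 and 2) are broken by the conditional probability
ascending = np.lexsort((slice_used[3], slice_used[2], slice_used[1], slice_used[0]))
order = ascending[::-1]  # descending: the "best" cluster comes first
print(order)
```

Cluster 3 has the highest conditional probability but ends up last, because its significance of -1 outranks every other criterion.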

In [35]:
def reorder_get_max_xr(xr_object, sorted_data, indicator):
    
    """
    Reorder the xarray with the support of np, and then get the corresponding values associated with the 1st slice.
    This is needed because the sorting is done initially on significance, and then on cond. prob, % of extremes, etc.
    So the idea is to be able to get the values that correspond to the selected "best" slice.
    """
    
    Sorted = np.take_along_axis(xr_object.sel(indicator=indicator).values, sorted_data, 0)
    Sorted_max = xr.DataArray(Sorted[0], dims=['latitude', 'longitude'],
                              coords={'latitude': xr_object['latitude'].values, 
                                      'longitude': xr_object['longitude'].values})
    return Sorted_max    
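
The reordering step relies on `np.take_along_axis`, which applies a per-cell ordering computed from one array to the values of another. The sketch below is illustrative only: the arrays are hypothetical, with shape (cluster, latitude, longitude) = (3, 1, 2), and the ranking is derived from a single key rather than from the combined sort used in the notebook.

```python
import numpy as np

# hypothetical ranking key (e.g. significance) and values to extract (e.g. cond. prob.)
key = np.array([[[0, 1]], [[1, 0]], [[-1, 0]]])
values = np.array([[[10.0, 40.0]], [[30.0, 20.0]], [[20.0, 60.0]]])

# per-cell cluster order, from the highest to the lowest key (stable for ties)
order = np.argsort(-key, axis=0, kind='stable')

# reorder the values along the cluster axis with that order, then keep the first slice
reordered = np.take_along_axis(values, order, axis=0)
best = reordered[0]
print(best)
```

At the second cell the value 60 is not selected, because the selection follows the ranking key and not the magnitude of the extracted values themselves.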

In [36]:
def max_values(subset, percentile):
    
    ' Get values of indicators that correspond to the cluster with highest association with extremes at each cell '
    
    Subset_perc = subset.sel(percentile=percentile) # get the data of interest

    # sort data based on the combined sorting: most important is significance, then cond. prob., etc ...
    Data_sort = Subset_perc.sel(indicator=['Significance', 'CondProb', 'PercExtremes', 'CondProbUpperLimit'])
    
    Data_sort = Data_sort.values

    Sorted_data = Data_sort[0,:].copy()
    Sorted_data[:] = np.nan
    for i_y in range(Sorted_data.shape[1]):
        for i_x in range(Sorted_data.shape[2]):
            Sorted_data[:, i_y, i_x] = sort_data(slice_used=Data_sort[:, :, i_y, i_x])

    Sorted_data = Sorted_data.astype(int) # convert the sorted indices to integer, as they refer to the order

    MaxPerExt = reorder_get_max_xr(xr_object=Subset_perc, sorted_data=Sorted_data, indicator='PercExtremes')
    MaxPerClu = reorder_get_max_xr(xr_object=Subset_perc, sorted_data=Sorted_data, indicator='CondProb')
    MaxPerCluMax = reorder_get_max_xr(xr_object=Subset_perc, sorted_data=Sorted_data, indicator='CondProbUpperLimit')
    
    return (MaxPerExt, MaxPerClu, MaxPerCluMax)

In [37]:
def summary_stats(input_data):
    
    type_used = input_data[0] # variables used for clustering (e.g. "T850" or "SLP~Z500")
    clusters_used = input_data[1] # number of clusters used for K-means (e.g. "Clusters_9")

    Subset = ExtremesClusters[type_used][clusters_used] # get the data of interest
    
    total_locs = len(Subset.longitude)*len(Subset.latitude) # total number of grid cells
    
    # general stats about % of locations that have significant connections to at least 1 cluster
    MaxSign = Subset.sel(indicator='Significance').max(dim='cluster') # maximum significance per location
    Sing_Perc = (MaxSign == 1).sum(dim=['latitude', 'longitude']) # number of sign. locations per percentile
    Sing_Perc = Sing_Perc/total_locs*100 # percent of significant locations
    
    Percentiles = ['P'+str(i_perc) for i_perc in P_used] # get the percentile names as Px, x:[0-100]
    SignPerc = pd.DataFrame({'Percentile': Percentiles, 'PercentSign': Sing_Perc.values}) # dataframe with results
    
    # analyse more detailed statistics for all locations that have significant connections
    # get the max values to the "best" cluster per cell
    MaxStats = [max_values(subset=Subset, percentile=i_perc) for i_perc in P_used] 
    MaxPerExt = xr.concat([i[0] for i in MaxStats], dim=pd.Index(P_used, name='percentile')) # % of extremes to clust.
    MaxPerClu = xr.concat([i[1] for i in MaxStats], dim=pd.Index(P_used, name='percentile')) # cond. prob. of cluster
    MaxPerCluMax = xr.concat([i[2] for i in MaxStats], dim=pd.Index(P_used, name='percentile')) # cond prob for 95% CI
    
    Sign_all = Subset.sel(indicator='Significance').where(Subset.sel(indicator='Significance') == 1) # all sign. locs
    
    TotalPercSign = (Subset.sel(indicator='PercExtremes')*Sign_all).sum(dim='cluster') # total % EPEs sign. connected
    with warnings.catch_warnings(): # if all are NaN then it gives a RuntimeWarning, which is suppressed here
        warnings.simplefilter('ignore', category=RuntimeWarning)
        MeanCond = (Subset.sel(indicator='CondProb')*Sign_all).mean(dim='cluster') # mean of conditional probabilities
    
    HighestOccurEx = MaxPerExt.where(MaxSign == 1).values # mask max % EPEs only for significant locations
    HighestOccurClu = MaxPerClu.where(MaxSign == 1).values # mask max cond. prob. only for significant locations
    HighestOccurCluMax = MaxPerCluMax.where(MaxSign == 1).values # as above for 95% CI
    TotalPercSign = TotalPercSign.where(MaxSign == 1).values # total % of EPEs, significant for each grid cell
    MeanCond = MeanCond.where(MaxSign == 1).values # mean cond. prob. of EPEs, significant for each grid cell
    
    Data_HighestOccur = pd.DataFrame() # dataframe for storing all values
    datasets_used = [HighestOccurEx, HighestOccurClu, HighestOccurCluMax, TotalPercSign, MeanCond]
    for i in range(HighestOccurEx.shape[0]): # loop through the percentiles
        
        data_aux = pd.DataFrame(columns=['MCP_PercExtr', 'MCP', 'MCP_UpperLim', 'TotalPercSign', 'MeanCondProbSign'])
        for i_tp, i_c in zip(datasets_used, data_aux.columns):
            data_aux_var = i_tp[i, :, :].flatten() # flatten the array
            data_aux_var = data_aux_var[~np.isnan(data_aux_var)] # remove the non-significant locations
            data_aux[i_c] = list(data_aux_var)
         
        data_aux['Percentile'] = Percentiles[i] # add the percentile info
        
        # add to general dataframe (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        Data_HighestOccur = pd.concat([Data_HighestOccur, data_aux])
    
    # add auxiliary columns for subsetting
    SignPerc['Clusters'] = Data_HighestOccur['Clusters'] = clusters_used
    SignPerc['Type'] = Data_HighestOccur['Type'] = type_used
    
    # merge SignPerc and median of Data_HighestOccur
    Data_Summary = Data_HighestOccur.groupby(['Percentile', 'Clusters', 'Type']).median().reset_index() # median
    Data_Summary = pd.merge(Data_Summary, SignPerc)
    
    Data_Summary['Area'] = [i.split('_')[0] for i in Data_Summary['Type']]
    Data_Summary['Variables'] = [i.split('_')[1] for i in Data_Summary['Type']]
    
    Data_HighestOccur['Area'] = [i.split('_')[0] for i in Data_HighestOccur['Type']]
    Data_HighestOccur['Variables'] = [i.split('_')[1] for i in Data_HighestOccur['Type']]
    
    return {'DataSummary': Data_Summary, 'DataAll': Data_HighestOccur} # return summary data & data from all locs

In [38]:
StartTime = datetime.now()

DataSummary = pd.DataFrame() # final dataframe with all the data

for i_type in Clustering:
    
    clusters_used = ['Clusters_'+ str(i) for i in Clusters_used]
    input_combo = list(zip([i_type]*len(clusters_used), clusters_used))
    
    pool = multiprocessing.Pool() # object for multiprocessing
    Summary_Stats = list(tqdm.tqdm(pool.imap(summary_stats, input_combo), 
                                   total=len(input_combo), position=0, leave=True))
    pool.close()
    
    # keep only the DataSummary, because the file with all the data is too large (over 100Mb) for saving later on
    DataSummary_aux = pd.concat([i_stats['DataSummary'] for i_stats in Summary_Stats]) # concat the data to 1 df
    
    DataSummary = pd.concat([DataSummary, DataSummary_aux]) # DataFrame.append was removed in pandas 2.0
    
print('Analysis completed in:', datetime.now()-StartTime)

DataSummary.to_csv(results_loc+'DataSummary.csv') # save data used for plotting

del(i_type, clusters_used, input_combo, pool, DataSummary_aux, StartTime, Summary_Stats)

100%|██████████| 7/7 [00:00<00:00,  9.00it/s]
100%|██████████| 7/7 [00:00<00:00,  8.39it/s]
100%|██████████| 7/7 [00:00<00:00,  7.92it/s]
100%|██████████| 7/7 [00:00<00:00,  9.21it/s]
100%|██████████| 7/7 [00:00<00:00,  9.07it/s]
100%|██████████| 7/7 [00:00<00:00,  9.79it/s]
100%|██████████| 7/7 [00:00<00:00,  9.39it/s]
100%|██████████| 7/7 [00:00<00:00,  8.88it/s]
100%|██████████| 7/7 [00:00<00:00,  9.35it/s]
100%|██████████| 7/7 [00:00<00:00,  8.78it/s]
100%|██████████| 7/7 [00:00<00:00,  9.00it/s]
100%|██████████| 7/7 [00:00<00:00,  8.65it/s]
100%|██████████| 7/7 [00:00<00:00,  8.64it/s]
100%|██████████| 7/7 [00:00<00:00,  8.19it/s]
100%|██████████| 7/7 [00:00<00:00,  8.69it/s]
100%|██████████| 7/7 [00:00<00:00,  9.07it/s]
100%|██████████| 7/7 [00:00<00:00,  8.56it/s]
100%|██████████| 7/7 [00:00<00:00,  8.19it/s]
100%|██████████| 7/7 [00:00<00:00,  8.54it/s]
100%|██████████| 7/7 [00:00<00:00,  8.83it/s]
100%|██████████| 7/7 [00:00<00:00,  8.83it/s]
100%|██████████| 7/7 [00:00<00:00,

Analysis completed in: 0:00:36.121428


In [39]:
print('Total Analysis completed in:', datetime.now() - InitializationTime, ' HR:MN:SC.')
del(InitializationTime)

Total Analysis completed in: 1:26:55.620317  HR:MN:SC.
