# HELIX-SCOPE
## Country Impacts Summaries
### Part 1

In this notebook we produce national summary statistics from the climate modelling outputs provided by the HELIX consortium. Summary statistics (mean, max, min, standard deviation and cell count) are calculated for every country and variable and, where possible, for every model and model run.

In [1]:
from netCDF4 import Dataset
import os
import re
import fiona
import rasterio
from rasterio.mask import mask
from rasterio.plot import show
from rasterstats import zonal_stats
import geopandas as gpd
import pandas as pd
import numpy as np
from matplotlib.pyplot import cm
import matplotlib.pyplot as plt
import datetime
%matplotlib inline

### FUNCTIONS
First let's define a few functions to separate the analysis tasks as we iterate through each country polygon/impact layer:

- __ncAttributes:__ Extracts metadata from gridded climate files using a combination of `netCDF4` functions and regular expressions. The results are returned as a dictionary.

- __zstats:__ A thin wrapper around the `rasterstats.zonal_stats` function. It extracts the summary statistics of interest and returns them as a dictionary.

- __climateSummaries:__ Combines the two functions above, returning a single dictionary that merges their outputs.

- __emptyDict:__ Returns a dictionary of empty lists to which the summary statistics and attribute information will be appended.

In [2]:
def ncAttributes(filepath):
    """
    Purpose: To extract useful metadata from nc files, such as variable names, model_taxonomies, SWL and so on.
    Process: Most of this information is stored in the file’s global attributes, which we access using the netCDF4 ncattrs and getncattr methods. A couple of attributes are also encoded in the file name itself; these are extracted using regular expressions.
    Input:  A string with the file’s path, starting from the data folder
    Output: This will return a dictionary object with: SWL_info, impact_tag, institution, is_multi_model_summary, is_seasonal, model_long_name, model_short_name, model_taxonomy, variable
    """

    # --- extract .nc global attribute data
    #: model_long_name, is_seasonal, is_multi_model_summary, SWL_info, impact_tag, institution
    # the context manager closes the file once the attributes have been read
    with Dataset(filepath, 'r') as nc_file:
        nc_globalatt_dic = {nc_attr: nc_file.getncattr(nc_attr)
                            for nc_attr in nc_file.ncattrs()}
    
    # convert text to bools where relevant
    nc_globalatt_dic['is_multi_model_summary'] = nc_globalatt_dic['is_multi_model_summary'] in ['true', 'True', 'TRUE']
    nc_globalatt_dic['is_seasonal'] = nc_globalatt_dic['is_seasonal'] in ['true', 'True', 'TRUE']

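    # --- extract additional data from the file name
    # NOTE: the indices assume the file is the fifth path component
    # and that its parent folder is named after the variable
    fname = filepath.split("/")[4]
    variable = filepath.split("/")[3]
    # raw strings avoid invalid-escape warnings for the '\.' pattern
    model_taxonomy = re.search(r'(^.*?)\.', fname, re.IGNORECASE).group(1)
    model_short_name = re.search(r'(^.*?)-', model_taxonomy, re.IGNORECASE).group(1)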
    
    # -- create attribute dictionary
    # drop the contact attribute; pop() avoids a KeyError if it is absent
    nc_globalatt_dic.pop('contact', None)
    nc_att_dic = {"model_short_name" : model_short_name, "variable" : variable, "model_taxonomy" : model_taxonomy}
    nc_att_dic.update(nc_globalatt_dic)

    return nc_att_dic
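
As a rough illustration of the file-name parsing inside `ncAttributes`, the sketch below applies the same two regular expressions to a made-up path. The path, its folder names and the `parse_helix_path` helper are hypothetical; they only show which part of the name each pattern captures, and real HELIX file names may be structured differently.

```python
import os
import re


def parse_helix_path(filepath):
    """Hypothetical helper: split a path into variable, taxonomy and short name."""
    fname = os.path.basename(filepath)                      # last path component
    variable = os.path.basename(os.path.dirname(filepath))  # parent folder name
    # everything before the first dot in the file name
    model_taxonomy = re.search(r'(^.*?)\.', fname).group(1)
    # everything before the first hyphen in the taxonomy
    model_short_name = re.search(r'(^.*?)-', model_taxonomy).group(1)
    return {"variable": variable,
            "model_taxonomy": model_taxonomy,
            "model_short_name": model_short_name}


# made-up path, used only to exercise the patterns
example = parse_helix_path("data/Helix/some_group/tas/modelA-run1-variant.SWL2.nc")
print(example)
```

Both patterns are non-greedy, so they stop at the first dot and the first hyphen respectively. A file name without a hyphen before the first dot would make the second `re.search` return `None` and the `.group(1)` call fail.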

In [3]:
def zstats(country_shp, rast, zstats_vars,nc_file):
    """
    Purpose: To extract country summaries in a standardised format
    Process: The bulk of this function uses the zonal_stats function to perform the calculations.
    We’re just adding a few pre-defined objects so it can be called in a for loop,
    as we iterate through each country polygon and climate data file.
    Inputs:  country_shp = a single-country GeoDataFrame, rast = the array returned by nc_file.read(), zstats_vars = a predefined list of statistics, nc_file = an open rasterio dataset
    Output: This will return a dictionary object with: max, mean, min, std, count
    """

    # get stats
    rast_zstats = zonal_stats(country_shp, rast[0], 
                      stats= zstats_vars,
                      all_touched=True,
                      raster_out=False,
                      affine=nc_file.profile['transform'], 
                      nodata= nc_file.profile.get("nodata"))
    
    # encode in dictionary format
    stats_dic = {"mean" : rast_zstats[0]['mean'], 
                 "max" : rast_zstats[0]['max'],
                 "min" : rast_zstats[0]['min'],
                 "std": rast_zstats[0]['std'],
                 "count": rast_zstats[0]['count']}
    
    return stats_dic


# define the zonal stats of interest to be used with the above function
zstats_vars = ['mean', 'max','min','std','count']
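
To make the statistics in `zstats_vars` concrete, here is a minimal pure-Python sketch of what a zonal summary amounts to for one country: keep the cells that fall inside the polygon, drop the nodata cells, and summarise what is left. It is a conceptual illustration, not how `rasterstats` is implemented. The grid, the mask, the `NODATA` value and the `masked_stats` helper are all made up, and the use of the population standard deviation is an assumption.

```python
import math

NODATA = -9999.0  # hypothetical nodata marker


def masked_stats(grid, mask, nodata=NODATA):
    """Summarise the grid cells where mask is True, skipping nodata cells."""
    values = [value
              for grid_row, mask_row in zip(grid, mask)
              for value, inside in zip(grid_row, mask_row)
              if inside and value != nodata]
    if not values:
        # no valid cells: report a zero count and no statistics
        return {"mean": None, "max": None, "min": None, "std": None, "count": 0}
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)  # population variance
    return {"mean": mean, "max": max(values), "min": min(values),
            "std": math.sqrt(variance), "count": len(values)}


grid = [[1.0, 2.0, NODATA],
        [3.0, 4.0, 5.0]]
mask = [[True, True, True],     # True marks cells inside the toy "country"
        [False, True, False]]

stats = masked_stats(grid, mask)
print(stats)
```

In this toy case three valid cells (1.0, 2.0 and 4.0) contribute to the summary; the nodata cell is inside the mask but is excluded from every statistic, including `count`.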

In [4]:
def climateSummaries(filepath,country_shp,zstats_vars,nc_file,rast):
    """
    Purpose: To integrate attribute and summary statistics extraction within each iteration
    Inputs:  filepath = a string with the file's path, country_shp = a single-country GeoDataFrame,
    zstats_vars = a predefined list of statistics, nc_file = an open rasterio dataset, rast = the array returned by nc_file.read()
    Output: This will return a dictionary object that merges the dictionaries returned by the ncAttributes and zstats functions
    """

    # get zonal stats
    stats_dic = zstats(country_shp,rast,zstats_vars,nc_file)

    # get nc file attributes
    nc_att_dic = ncAttributes(filepath)

    # -- Build 'data row' (for pandas) ---
    row = stats_dic
    row.update(nc_att_dic)
    return row

In [5]:
def emptyDict():
    """
    Used to generate an empty dictionary object, 
    which will be used to store the values resulting from the summary iterations
    """
    return {
    'country':[],
    'impact_tag':[],
    'variable':[],
    'SWL_info':[],
    'model_short_name':[],
    'max':[],
    'mean':[],
    'min':[],
    'std':[],
    'count':[],
    'model_long_name':[],
    'is_seasonal':[],
    'is_multi_model_summary':[],
    'iso2':[],
    'model_taxonomy':[],
    'institution':[]
}
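
The dictionary-of-lists pattern used here can be sketched in isolation as follows. The `empty_columns` helper, the column subset and the two example rows are hypothetical and stand in for the real statistics and attributes.

```python
def empty_columns(keys):
    """Hypothetical helper: one empty list per column name."""
    return {key: [] for key in keys}


columns = empty_columns(['country', 'mean', 'max'])

# made-up rows, shaped like the dictionaries built for each country and layer
rows = [
    {'country': 'Exampleland', 'mean': 1.5, 'max': 2.0},
    {'country': 'Samplestan', 'mean': 0.5, 'max': 0.9},
]

for row in rows:
    for key, value in row.items():
        columns[key].append(value)  # raises KeyError for an unexpected key

print(columns)
```

Because every row appends exactly one value to each list, the lists stay the same length and the dictionary can be passed directly to `pd.DataFrame`. A row carrying a key that is not in the dictionary raises a `KeyError`, so the keys returned by `emptyDict` need to cover every attribute that `climateSummaries` returns.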

# Batch Processing

### 1.1 Read data files

Below we will:

1. Create a list of all available impacts layers for processing (`filepaths` object)
2. Load a global national boundaries shapefile (`countries_shp` object)

In [6]:
# Get .nc filenames and paths
rootdir = "data/Helix/"

# create file paths list
filepaths = []
for (subdir, dirs, files) in os.walk(rootdir):
    for file in files:
        fpath = os.path.join(subdir, file)
        if re.search(r'\.nc$', fpath):  # escaped dot: match only a literal ".nc" suffix
            filepaths.append(fpath)
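
The file discovery step can be tried out on a throwaway directory, as in the sketch below. The `find_nc_files` helper and the file names are hypothetical; the temporary folder only exists so the example runs without the HELIX data.

```python
import os
import tempfile


def find_nc_files(rootdir):
    """Hypothetical helper: collect every path under rootdir ending in '.nc'."""
    found = []
    for subdir, _dirs, files in os.walk(rootdir):
        for name in files:
            if name.endswith('.nc'):
                found.append(os.path.join(subdir, name))
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'tas'))
    # one real match, one unrelated file, and one name that merely ends in "nc"
    for name in ('model-a.nc', 'notes.txt', 'abnc'):
        open(os.path.join(tmp, 'tas', name), 'w').close()
    matches = [os.path.basename(path) for path in find_nc_files(tmp)]

print(matches)
```

The file named `abnc` shows why the dot matters: a pattern such as `'.nc$'` treats the dot as "any character" and would accept it, whereas `str.endswith('.nc')` or an escaped `r'\.nc$'` would not.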

In [7]:
# Read countries
countries_shp = gpd.read_file('./data/minified_gadm28_countries/gadm28_countries.shp')

### 1.2 Asynchronous Batch Processing
The processing of impact layers will occur in discrete batches as more climatic data becomes available from the HELIX consortium. Since this is a time-consuming process, we keep a record of the analysed layers in the file `processed/processed-ncdfs.txt`. Impact layers that have already been processed are removed from the `filepaths` object so they are skipped when running the batch analysis script.

In [8]:
# Read list of processed files
processed = "processed/processed-ncdfs.txt"
processed_list =[]

with open(processed, 'r') as file:
    for line in file:
        line = line.rstrip()
        processed_list.append(line)

# Remove processed files from filepaths list     
filepaths = list(set(filepaths) - set(processed_list))
filepaths

[]
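
The bookkeeping above can be sketched as a pair of small helpers. `load_processed` and `mark_processed` are hypothetical names, the file names are made up, and treating a missing log file as "nothing processed yet" is an assumption that the cell above does not make (it expects the log file to exist).

```python
import os
import tempfile


def load_processed(path):
    """Return the set of logged paths; an absent log means nothing is done yet."""
    if not os.path.exists(path):
        return set()
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def mark_processed(path, filepath):
    """Append one finished layer to the log."""
    with open(path, 'a') as handle:
        handle.write(filepath + '\n')


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, 'processed-ncdfs.txt')
    all_files = ['a.nc', 'b.nc', 'c.nc']  # made-up layer names

    first_run = sorted(set(all_files) - load_processed(log_path))
    mark_processed(log_path, 'a.nc')
    mark_processed(log_path, 'b.nc')
    remaining = sorted(set(all_files) - load_processed(log_path))

print(first_run, remaining)
```

Writing each path to the log as soon as its layer is finished means an interrupted run can be resumed without redoing completed layers. Note that the set difference does not preserve the original order of `filepaths`, which is why the sketch sorts the result.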

### 1.3 Run Batch Processing

For each country polygon and impact layer the process will be to:

1. Extract the layers' metadata details
2. Clip impact layers with the country polygons
3. Extract the layers' summary statistics (mean, max, min, std, count)
4. Concatenate all summary outputs in a single dataframe

The resulting output will be stored as a csv file in the folder `processed/raw-summaries`

The script takes about __1.5 hrs__ to generate summaries for 250 climatic layers.

In [9]:
# Define the empty dictionary of lists that will become the DataFrame
data = emptyDict()

# ----  PRODUCE SUMMARIES ---
for filepath in filepaths:
    
    # read ncdf (the context manager closes the file once all countries are processed)
    with rasterio.open(filepath) as nc_file:
        rast = nc_file.read()

        for i in range(len(countries_shp)):

            # load country shape as a one-row GeoDataFrame
            country_shp = countries_shp.iloc[[i]]

            # get country_shp attributes as plain scalars
            country_att_dic = {'iso2' : country_shp.iso2.iloc[0],
                               'country' : country_shp.name_engli.iloc[0]}

            row = climateSummaries(filepath,country_shp,zstats_vars,nc_file,rast)
            row.update(country_att_dic)

            for var in row:
                data[var].append(row[var])
    
    # append filepath to list of processed impact layers
    with open(processed, 'a+') as file:
        file.write(filepath)
        file.write('\n')

In [10]:
#---- MAKE DATAFRAME ---
df = pd.DataFrame(data)

### Examine data types and save `.csv`

In [10]:
# examine dataframe 
print(df.dtypes)
print('Shape:' + str(df.shape))

SWL_info                  float64
count                       int64
country                    object
impact_tag                 object
institution                object
is_multi_model_summary       bool
is_seasonal                  bool
iso2                       object
max                       float64
mean                      float64
min                       float64
model_long_name            object
model_short_name           object
model_taxonomy             object
std                       float64
variable                   object
dtype: object
Shape:(59243, 16)


In [11]:
# Save to CSV
tday = datetime.date.today().strftime("%Y-%m-%d")
fname  = "%s%s.csv" % ("processed/raw-summaries/joined-summaries-", tday)
df.to_csv(fname,index=False,na_rep='')