# HELIX-SCOPE

### Developing the data processing

In this notebook we test how best to compute national summary statistics from the Helix consortium data. Summary statistics (mean, max, min and standard deviation) are calculated for every shape in an arbitrary shapefile, for every netCDF file found under a given path.
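
To make the idea of per-shape summary statistics concrete, here is a minimal pure-Python sketch, assuming the shape has already been rasterised into a boolean grid. The function name `zonal_summary` and the toy grid are hypothetical; the notebook itself relies on `rasterstats.zonal_stats` for this step.

```python
import math
import statistics


def zonal_summary(grid, zone, nodata=None):
    """Summarise the cells of `grid` that fall inside `zone`.

    grid   : 2D list of floats (one raster band)
    zone   : 2D list of bools, True where a cell belongs to the shape
    nodata : value marking missing cells (NaN cells are always skipped)
    """
    values = [
        value
        for grid_row, zone_row in zip(grid, zone)
        for value, inside in zip(grid_row, zone_row)
        if inside and value != nodata and not math.isnan(value)
    ]
    if not values:
        return {"count": 0, "min": None, "max": None, "mean": None, "std": None}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population standard deviation
    }


# Toy raster: -9999.0 marks a missing cell (an assumed nodata value)
grid = [[1.0, 2.0, -9999.0],
        [4.0, 5.0, 6.0]]
# Toy zone: the cells covered by one hypothetical shape
zone = [[True, True, True],
        [False, True, False]]

summary = zonal_summary(grid, zone, nodata=-9999.0)
print(summary)
```

Only cells that are inside the zone and hold valid data contribute, which is why a shape with no valid cells should be skipped rather than reported.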

Data should be downloaded from the SFTP site (bi.nsc.liu.se), which requires a username and password login. The data should be placed in the `/data` folder within this repo.

In [1]:
from netCDF4 import Dataset
import os
import re
import fiona
import rasterio
from rasterio.mask import mask
from rasterio.plot import show
from rasterstats import zonal_stats
import geopandas as gpd
import pandas as pd
import numpy as np
from matplotlib.pyplot import cm
import matplotlib.pyplot as plt
import datetime
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
def identify_netcdf_and_csv_files(path='data'):
    """Crawl through a specified folder and return a dict listing the netcdf
    files (d['nc']) and csv files (d['csv']) contained within.
    Returns something like
    {'nc': ['data/CNRS_data/cSoil/orchidee-giss-ecearth.SWL_15.eco.cSoil.nc'], 'csv': []}
    """
    netcdf_files = []
    csv_files = []
    for root, dirs, files in os.walk(path):
        for f in files:
            extension = f.split('.')[-1]
            if extension == 'nc':
                # Join with '/' because later steps split the path on '/'
                netcdf_files.append('/'.join([root, f]))
            elif extension == 'csv':
                csv_files.append('/'.join([root, f]))
    return {'nc': netcdf_files, 'csv': csv_files}


def generate_metadata(filepath):
    """Pass a path and file as a single string. Expected in the form of:
        data/CNRS_data/cSoil/orchidee-giss-ecearth.SWL_15.eco.cSoil.nc
    """
    file_metadata = get_nc_attributes(filepath)
    filename_properties = extract_medata_from_filename(filepath)
    return {**file_metadata, **filename_properties}


def extract_medata_from_filename(filepath):
    """Extract additional metadata from the file path and name using regex."""
    warning = "Filepath should resemble: data/CNRS_data/cSoil/orchidee-giss-ecearth.SWL_15.eco.cSoil.nc"
    assert len(filepath.split('/')) == 4, warning
    fname = filepath.split("/")[3]
    variable = filepath.split("/")[2]
    # Everything before the first '.', e.g. 'orchidee-giss-ecearth'
    model_taxonomy = re.search(r'(^.*?)\.', fname, re.IGNORECASE).group(1)
    # Everything before the first '-', e.g. 'orchidee'
    model_short_name = re.search(r'(^.*?)-', model_taxonomy, re.IGNORECASE).group(1)
    return {"model_short_name": model_short_name, "variable": variable, "model_taxonomy": model_taxonomy}


def get_nc_attributes(filepath):
    """Most info is stored in the file's global attributes, which we read
    with netCDF4's ncattrs and getncattr methods.
    Example:
         get_nc_attributes('data/CNRS_data/cSoil/orchidee-giss-ecearth.SWL_15.eco.cSoil.nc')
    """
    # The context manager closes the file once the attributes are read
    with Dataset(filepath, 'r') as nc_file:
        d = {nc_attr: nc_file.getncattr(nc_attr) for nc_attr in nc_file.ncattrs()}
    # Flags are stored as strings, so convert them to booleans
    could_be_true = ['true', 'True', 'TRUE']
    d['is_multi_model_summary'] = d['is_multi_model_summary'] in could_be_true
    d['is_seasonal'] = d['is_seasonal'] in could_be_true
    del d['contact']
    return d
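
The filename parsing above can also be written as a single named-group regex. The sketch below assumes a naming convention inferred only from the example paths in this notebook (`<model_taxonomy>.SWL_<level>.<impact_tag>.<variable>.nc`); the function name `parse_helix_filename` and the field names `swl` and `impact_tag` are hypothetical, and files that deviate from the convention raise an error rather than being guessed at.

```python
import re
from pathlib import PurePosixPath

# Assumed convention, inferred from the example paths only:
#   <model_taxonomy>.SWL_<level>.<impact_tag>.<variable>.nc
FILENAME_PATTERN = re.compile(
    r"^(?P<model_taxonomy>[^.]+)"
    r"\.SWL_(?P<swl>\d+)"
    r"\.(?P<impact_tag>[^.]+)"
    r"\.(?P<variable>[^.]+)"
    r"\.nc$"
)


def parse_helix_filename(filepath):
    """Split a Helix-style netCDF filename into its dot-separated fields."""
    name = PurePosixPath(filepath).name
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError("Unexpected filename: {}".format(name))
    fields = match.groupdict()
    # The short name is the part of the taxonomy before the first '-'
    fields["model_short_name"] = fields["model_taxonomy"].split("-")[0]
    return fields


example = parse_helix_filename(
    "data/CNRS_data/cSoil/orchidee-giss-ecearth.SWL_15.eco.cSoil.nc"
)
print(example)
```

The `swl` field is kept as the raw token from the filename; how it maps to a warming level is left to the file's own attributes.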

## Single core process

Place the data folders from Helixscope into the `data` folder of this repo:

```
data
├── CNRS_data
│   ├── README.txt
│   ├── cSoil
│   │   ├── orchidee-giss-ecearth.SWL_15.eco.cSoil.nc
│   │   ├── orchidee-giss-ecearth.SWL_2.eco.cSoil.nc
│   │   ├── orchidee-giss-ecearth.SWL_4.eco.cSoil.nc
│   │   ├── orchidee-ipsl-ecearth.SWL_15.eco.cSoil.nc
│   │   ├── orchidee-ipsl-ecearth.SWL_2.eco.cSoil.nc
│   │   ├── orchidee-ipsl-ecearth.SWL_4.eco.cSoil.nc
│   │   ├── orchidee-ipsl-hadgem.SWL_15.eco.cSoil.nc
│   │   ├── orchidee-ipsl-hadgem.SWL_2.eco.cSoil.nc
│   │   └── orchidee-ipsl-hadgem.SWL_4.eco.cSoil.nc
│   ├── cVeg
│   │   ├── orchidee-giss-ecearth.SWL_15.eco.cVeg.nc
│   │   ├── orchidee-giss-ecearth.SWL_2.eco.cVeg.nc
│   │   ├── orchidee-giss-ecearth.SWL_4.eco.cVeg.nc
│   │   ├── orchidee-ipsl-ecearth.SWL_15.eco.cVeg.nc
│   │   ├── orchidee-ipsl-ecearth.SWL_2.eco.cVeg.nc
```

Also include the shapefile in the data folder:

```
./data/minified_gadm28_countries/gadm28_countries.shp
```
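
A shapefile is a set of sidecar files rather than a single file, and the `.shp`, `.shx` and `.dbf` parts are all mandatory. A small check like the following sketch can catch an incomplete copy before the slow read step; the helper name `missing_shapefile_parts` is hypothetical, and the demo uses a temporary folder with empty placeholder files.

```python
import tempfile
from pathlib import Path

# The three mandatory parts of a shapefile
REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(shp_path):
    """Return the names of the mandatory sidecar files that are absent."""
    base = Path(shp_path)
    return [
        base.with_suffix(suffix).name
        for suffix in REQUIRED_SUFFIXES
        if not base.with_suffix(suffix).exists()
    ]


# Demo in a temporary folder: create the .shp and .dbf parts but omit the .shx
with tempfile.TemporaryDirectory() as tmp_dir:
    shp = Path(tmp_dir) / "gadm28_countries.shp"
    shp.touch()
    shp.with_suffix(".dbf").touch()
    demo_missing = missing_shapefile_parts(shp)

print(demo_missing)
```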

In [3]:
%%time
shps = gpd.read_file('./data/minified_gadm28_countries/gadm28_countries.shp')
shps = shps.to_crs(epsg=4326)
files = identify_netcdf_and_csv_files()

keys = ['country','iso2','admin1','admin2','variable','SWL_info',
        'count', 'max','min','mean','std','impact_tag','institution',
        'model_long_name','model_short_name','model_taxonomy',
        'is_multi_model_summary','is_seasonal']

CPU times: user 16.2 s, sys: 110 ms, total: 16.3 s
Wall time: 16.5 s


In [4]:
%%time

for file in files.get('nc')[0:1]:
    print("Processing '{}'".format(file))
    tmp_metadata = generate_metadata(file)
    with rasterio.open(file) as nc_file:   # Open the file for this iteration
        rast = nc_file.read()
        properties = nc_file.profile
    tmp = rast[0, :, :]                            # Keep the first band only, as a 2D array
    nodata_mask = tmp == properties.get('nodata')  # Build a mask of the missing data
    tmp[nodata_mask] = np.nan                      # and replace it with NaN
    stats_per_file = []
    for i in shps.index:
        shp = shps.loc[i].geometry
        zstats = zonal_stats(shp, tmp, band=1, stats=['mean', 'max', 'min', 'std', 'count'],
                             all_touched=True, raster_out=False,
                             affine=properties['transform'],
                             nodata=np.nan)
        if zstats[0].get('count', 0) > 0: # If the shape generated stats, then add it
            shp_atts = {'iso2' : shps.iso2[i],
                        'country' : shps.name_engli[i]}
            tmp_d = {**zstats[0], **shp_atts, **tmp_metadata}
            stats_per_file.append([tmp_d.get(key, None) for key in keys])



Processing 'data/CNRS_data/cSoil/orchidee-giss-ecearth.SWL_15.eco.cSoil.nc'
CPU times: user 17.2 s, sys: 90 ms, total: 17.2 s
Wall time: 17.7 s


In [5]:
df = pd.DataFrame(stats_per_file, columns=keys)
df.head()

Unnamed: 0,country,iso2,admin1,admin2,variable,SWL_info,count,max,min,mean,std,impact_tag,institution,model_long_name,model_short_name,model_taxonomy,is_multi_model_summary,is_seasonal
0,Norway,NO,,,cSoil,1.5,307,21.694836,10.195782,13.438408,2.055015,eco,LSCE,ORCHIDEE,orchidee,orchidee-giss-ecearth,False,False
1,Thailand,TH,,,cSoil,1.5,221,7.480751,1.998384,4.075577,1.098488,eco,LSCE,ORCHIDEE,orchidee,orchidee-giss-ecearth,False,False
2,Venezuela,VE,,,cSoil,1.5,341,11.517319,0.61295,5.481054,1.990044,eco,LSCE,ORCHIDEE,orchidee,orchidee-giss-ecearth,False,False
3,Nigeria,NG,,,cSoil,1.5,335,6.299715,0.343887,2.6732,1.205588,eco,LSCE,ORCHIDEE,orchidee,orchidee-giss-ecearth,False,False
4,Argentina,AR,,,cSoil,1.5,1231,26.46019,0.0,4.512354,4.254507,eco,LSCE,ORCHIDEE,orchidee,orchidee-giss-ecearth,False,False


In [6]:
os.makedirs('./processed', exist_ok=True)  # Make sure the output folder exists
df.to_csv('./processed/raw_output.csv')

Next steps:

* Need to ensure this can handle any admin1 or admin2 level shapefiles.
* Need to parallelise this so it will run in a convenient time.
* Need to check that the regex changes Alex applied post-table creation are included.
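
For the parallelisation step, one possible approach is to map a per-file function over the list of netCDF paths with `concurrent.futures`. The sketch below is only an outline under assumptions: `summarise_file` is a hypothetical stand-in for the per-file work in cell `In [4]`, and `summarise_all` is a hypothetical wrapper. The demo call uses threads so that it runs anywhere; CPU-bound zonal statistics would more likely benefit from the default `ProcessPoolExecutor`.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def summarise_file(path):
    """Hypothetical stand-in for the per-file processing.

    A real version would open the raster and return one row per shape;
    here we only pull the variable name out of the path.
    """
    return {"file": path, "variable": path.split("/")[-2]}


def summarise_all(paths, max_workers=2, executor_cls=ProcessPoolExecutor):
    """Run summarise_file over every path, keeping the input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(summarise_file, paths))


paths = [
    "data/CNRS_data/cSoil/orchidee-giss-ecearth.SWL_15.eco.cSoil.nc",
    "data/CNRS_data/cVeg/orchidee-giss-ecearth.SWL_15.eco.cVeg.nc",
]

# Threads are used for this demo only, so it runs in any environment
demo_rows = summarise_all(paths, executor_cls=ThreadPoolExecutor)
print(demo_rows)
```

With process pools, the worker function generally has to be importable by the child processes, so on platforms that spawn rather than fork it may need to live in a module instead of a notebook cell.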