## Using helix_funcs

An example of how to use the `helix_funcs` module to process the HelixScope data files and generate a master output table.

`process_file` should generate a similarly named CSV file in the `processed` folder.
`combine_processed_results` then joins those CSV files into a single master table.

In [2]:
import helix_funcs
import geopandas as gpd
import pandas as pd
# from tqdm import tqdm
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
#s = gpd.read_file('./data/gadm28_adm1/gadm28_adm1.shp')
#s = s.to_crs(epsg='4326')
s = gpd.read_file('./data/gadm28_countries/gadm28_countries.shp')
#s = gpd.read_file("./data/gadm28_adm0_simplified/gadm28_adm0_simplified.shp")  # <--- ADMIN LEVEL 0
#s = gpd.read_file("./data/gadm28_adm1_simplified/gadm28_adm1_simplified.shp")   # <--- ADMIN LEVEL 1

In [4]:
fs = helix_funcs.identify_netcdf_and_csv_files()

In [5]:
fs['nc'][100:101]

['data/UEA_data/climate/tx/HADGEM3-R3.SWL_4.cl.tx.MAM.nc']

In [6]:
%%time
for f in fs['nc'][100:101]:
    helix_funcs.process_file(file=f, shps=s, admin_level=0, overwrite=True, verbose=True)

working on  admin0/
Processing 'data/UEA_data/climate/tx/HADGEM3-R3.SWL_4.cl.tx.MAM.nc'
CPU times: user 12.9 s, sys: 150 ms, total: 13 s
Wall time: 13.6 s


# Note:

Don't process the monthly data for now. Note that the parsing of the seasonal data (e.g. MAM) needs to be checked, as the variables are not being properly selected.

Also check that all the metadata is being parsed when the new country-level shapes are used.

Create 10x10 degree grid to cover the globe. Intersect it with land areas. Create an ID for each shape. Run the pipeline against those shapes to create a coarse gridded dataset.
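A minimal sketch of the grid step described above, using only the standard library. The function name `make_grid` and the dict layout are illustrative, not part of `helix_funcs`. Each bounds tuple is ordered (min lon, min lat, max lon, max lat), so it could be passed to `shapely.geometry.box` and then intersected with the land polygons (for example with `geopandas.overlay(..., how='intersection')`); that step is left out here.

```python
def make_grid(cell_size=10):
    """Return one dict per grid cell, with an integer ID and lon/lat bounds.

    Assumes cell_size divides 180 evenly, so cells tile the globe exactly.
    """
    cells = []
    cell_id = 0
    for lat_min in range(-90, 90, cell_size):
        for lon_min in range(-180, 180, cell_size):
            cells.append({
                "id": cell_id,
                # (min lon, min lat, max lon, max lat)
                "bounds": (lon_min, lat_min,
                           lon_min + cell_size, lat_min + cell_size),
            })
            cell_id += 1
    return cells


grid = make_grid()
print(len(grid))  # 36 columns x 18 rows = 648 cells
```

Cells that fall entirely over ocean would drop out after the land intersection, so the final number of shapes should be smaller than 648.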

In [8]:
x = pd.read_csv("./processed/admin0/UEA_data/climate/tx/HADGEM3-R3.SWL_4.cl.tx.MAM.csv")

In [11]:
x[x['iso'] == 'SSD']

Unnamed: 0,name_0,iso,variable,swl_info,count,max,min,mean,std,impact_tag,institution,model_long_name,model_short_name,model_taxonomy,is_multi_model_summary,is_seasonal,season,is_monthly,month
155,South Sudan,SSD,tx,4.0,255,45.081566,34.072529,40.970466,2.453321,cl,,HADGEM3-R3,HADGEM3,HADGEM3-R3,False,True,MAM,False,


In [None]:
f = 'data/UEA_data/climate/tx/HADGEM3-R3.SWL_4.cl.tx.MAM.nc'
print(f)
helix_funcs.extract_medata_from_filename(f)

In [None]:
fs

In [None]:
seasons = ["MAM", "JJA", "SON", "DJF"]

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
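As a rough illustration of how the season and month lists could drive the filename check, here is a sketch of a hypothetical helper, `parse_period`. It is not the parser in `helix_funcs`. It assumes the period token is the last dot-separated element before the extension, which holds for the `...tx.MAM.nc` example above but has not been verified for the monthly files.

```python
import os

SEASONS = ["MAM", "JJA", "SON", "DJF"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def parse_period(path):
    """Classify the last filename token as a season, a month, or neither."""
    stem = os.path.splitext(os.path.basename(path))[0]  # drop folder and '.nc'
    token = stem.split(".")[-1]                         # assumed period token
    is_seasonal = token in SEASONS
    is_monthly = token in MONTHS
    return {
        "is_seasonal": is_seasonal,
        "season": token if is_seasonal else None,
        "is_monthly": is_monthly,
        "month": token if is_monthly else None,
    }


print(parse_period("data/UEA_data/climate/tx/HADGEM3-R3.SWL_4.cl.tx.MAM.nc"))
```

For the example file this reports a seasonal file with season `MAM`, which matches the `is_seasonal` and `season` columns in the processed output shown earlier.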

### Speed up 

Looks like we achieve a 34 second execution time per file (down from 2 mins) when we use a simplified set of geometries.



## Investigate areas with no zonal stats

Seems like small areas are failing. Need to investigate best way to handle this.

* check simple shapes produce same number of elements as complex shapes - confirmed
* check shapefile has same number of shapes as output file - confirmed
* check simple and complex processed results for same file show good agreement. - confirmed

* Add kwarg for ADMIN0 or ADMIN1. Have a different root folder for each in processed data.
* MOVE already processed ADMIN1 level data to the `processed/admin1/` directory - Failed. Columns changed. Will need to recalculate :(
* SIMPLIFY the ADMIN-0 level geometry file as I did with the ADMIN1 file. - confirmed
* RUN ADMIN0 for all data.

Perhaps we should have two tables: ADMIN0 level (based on simplified shapes), and ADMIN1 level.

At ADMIN 0 level it makes sense to calculate the full set of stats (mean, min, max, stdev), as those shapes should be large. BUT at ADMIN 1 level the shapes can be far smaller, in which case it doesn't make sense to calculate those stats for every shape.

* Given time, we could work out which shapes it makes sense for (e.g. those with counts > 3) and make some distinction.
* With little time (which is the case), it probably only makes sense to offer the mean over those areas, which in most cases will be the mean of 1 cell (i.e. simply the value of the grid cell).
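The count-based distinction could look something like the sketch below. The threshold `MIN_CELLS` and the sample values are made up for illustration; the column names follow the processed output shown earlier, with `id_1` standing in for the ADMIN 1 identifier.

```python
import numpy as np
import pandas as pd

MIN_CELLS = 3  # assumed threshold, from the "counts > 3" suggestion

# Illustrative ADMIN 1 results: one tiny shape and two larger ones
stats = pd.DataFrame({
    "id_1": [1, 2, 3],
    "count": [1, 4, 12],
    "mean": [30.1, 28.4, 27.9],
    "min": [30.1, 27.0, 25.2],
    "max": [30.1, 29.9, 31.0],
    "std": [0.0, 1.2, 1.9],
})

# Keep the mean everywhere, but blank the spread stats for small shapes
too_small = stats["count"] <= MIN_CELLS
stats.loc[too_small, ["min", "max", "std"]] = np.nan

print(stats)
```

Blanking rather than dropping rows keeps every shape in the table, so the row count still matches the shapefile.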


THEREFORE:

1. Process all data to ADMIN 0 (using simplified country shapes)
    * use this as MVP
2. CALCULATE TIME RELATIVE STATISTICS
    * i.e. for datasets with Month distinctions calculate the seasonality
3. Attempt to do the ADMIN 1 level data.
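"Seasonality" is not defined in these notes. One possible reading, sketched below, is the range of the monthly means per country and variable. The column `seasonal_range` and the sample values are hypothetical; only `iso`, `variable`, `month`, and `mean` come from the processed output.

```python
import pandas as pd

# Illustrative monthly rows for two countries (values are made up)
monthly = pd.DataFrame({
    "iso": ["ESP"] * 3 + ["FRA"] * 3,
    "variable": ["tx"] * 6,
    "month": ["Jan", "Feb", "Mar"] * 2,
    "mean": [10.0, 20.0, 15.0, 5.0, 5.0, 8.0],
})

# Range of the monthly means within each country/variable pair
grouped = monthly.groupby(["iso", "variable"])["mean"]
seasonality = (grouped.max() - grouped.min()).rename("seasonal_range").reset_index()

print(seasonality)
```

Other definitions (for example the standard deviation across months) would only need a different aggregation on the same grouping.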

# Deal with outputs

Reduce significant digits. Join together dataframes.

In [None]:
fs = helix_funcs.identify_netcdf_and_csv_files()

In [None]:
len(fs['csv'])

In [None]:
%%time
helix_funcs.combine_processed_results(path='./processed/admin0',
                                      table_name='./master_admin0.csv')  #<-- join all results files together into master_admin0.csv

In [None]:
tmp = pd.read_csv('./master_admin0.csv')

In [None]:
len(tmp)

In [None]:
tmp.head()

In [None]:
print(len(tmp['impact_tag'].unique()),'\n')
for var in (tmp['impact_tag'].unique()):
    print(var)

In [None]:
mask = tmp['variable'] == 'Rice_yield_perc_change' #'river_floods_PopAff' #"river_floods_ExpDam"
shorter = tmp[mask]

In [None]:
shorter.head()

In [None]:
shorter['swl_info'].unique()

In [None]:
tmp = pd.read_csv("./master_admin0.csv")
tmp.head()

In [None]:
set(tmp['iso'].values)

In [None]:
is_season = tmp['is_seasonal'] == True
print((len(tmp[is_season])/len(tmp)) * 100.,"% has seasonal data")

In [None]:
is_month = tmp['is_monthly'] == True
print((len(tmp[is_month])/len(tmp)) * 100.,"% has monthly data")

In [None]:
output_files = helix_funcs.identify_netcdf_and_csv_files('processed/')

In [None]:
len(output_files['csv'])

In [None]:
tmp = pd.read_csv(output_files['csv'][-1])

In [None]:
country_mask = tmp['iso'] == 'ESP'

In [None]:
tmp = tmp[country_mask].head()

In [None]:
admin_mask = tmp['id_1'] == 1

In [None]:
tmp['id_1'].unique()

## Test plots

We need an easy way to preview how a specific variable's choropleths will appear.
This means read a file, and make a preview plot using geopandas.


In [None]:
output_files = helix_funcs.identify_netcdf_and_csv_files('processed/')

In [None]:
helix_funcs.map_file_by_iso(f=output_files['csv'][10], s=s, iso='FRA')

## SIMPLIFY LIFE!

Before passing the polygons for zonal analysis, they should be simplified. 
Looks like the call should be `stest.geometry.simplify(0.2, preserve_topology=False)`

The geometry files themselves should be simplified beforehand and read in already simplified. I have another notebook to simplify (and correct) these data.

In [None]:
stest = s[s['iso'] == 'BRA']
stest.geometry.plot()

In [None]:
stest.geometry.simplify(0.2, preserve_topology=False).plot()

In [None]:
stest.geometry.simplify(0.2, preserve_topology=True).plot()

## Next Step

Work out the problem of small shapes (we will probably need a logical test for small shapes, and to buffer the geometry before zonal stats are calculated). Perhaps this should even be done prior to any loop, when the shapes are first read in.

First, we can look at which admins are absent from an output file (to see where the problem lies).

Test with simplified shapes.
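Checking which admins are absent from an output file can be done with a set difference on the identifier column. The sketch below uses small stand-in tables; in the notebook the first would be the shapefile `s` (which has an `iso` column) and the second a processed CSV.

```python
import pandas as pd

# Stand-ins for the shapefile attribute table and one processed output file
shapes = pd.DataFrame({"iso": ["ESP", "FRA", "MCO", "VAT"]})
output = pd.DataFrame({"iso": ["ESP", "FRA"], "mean": [1.0, 2.0]})

# ISO codes present in the shapes but missing from the processed results
missing = sorted(set(shapes["iso"]) - set(output["iso"]))
print(missing)
```

If the missing codes turn out to be mostly very small territories, that would support the small-shape explanation.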

#### REDUCE SIZE OF FINAL DATA BY removing significant digits from values
e.g. 10.6466969914 should be converted to simply 10.6
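A minimal sketch of the rounding step with pandas, applied only to the stat columns so that identifiers and counts are left alone. The row values are illustrative, and `STAT_COLUMNS` is a name introduced here for the example.

```python
import io

import pandas as pd

STAT_COLUMNS = ["max", "min", "mean", "std"]  # stat columns in the processed output

# Illustrative row; values are made up apart from the 10.6466969914 example
frame = pd.DataFrame({
    "iso": ["SSD"],
    "count": [255],
    "max": [45.081566],
    "min": [34.072529],
    "mean": [10.6466969914],
    "std": [2.453321],
})

# Round to one decimal place before writing, to shrink the final CSV
frame[STAT_COLUMNS] = frame[STAT_COLUMNS].round(1)

buffer = io.StringIO()  # write to memory here instead of a file on disk
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Rounding each frame before concatenation, rather than the master table afterwards, would also reduce memory use during the join.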





In [None]:
def combine_processed_results(path='./processed/admin1',
                              table_name="./master_admin1.csv"):
    """Combine all the csv files in the path (e.g. all processed files)
    into a single master table
    """
    output_files = helix_funcs.identify_netcdf_and_csv_files(path)
    frames = [pd.read_csv(csv_file) for csv_file in output_files['csv']]
    master_table = pd.concat(frames)
    master_table.to_csv(table_name, index=False)
    print("Made {0}: {1:,g} rows of data. {2:,g} sources.".format(table_name,
                                                        len(master_table),
                                                        len(output_files['csv'])
                                                                 ))
    return


In [None]:
path='./processed'
output_files = helix_funcs.identify_netcdf_and_csv_files(path)
frames = [pd.read_csv(csv_file) for csv_file in output_files['csv'][0:10]]

In [None]:
for n, frame in enumerate(frames):
    # Report the range of the 'min' and 'max' columns for each frame in turn
    print(n, min(frame['min']), max(frame['min']), min(frame['max']), max(frame['max']))

In [None]:
frames[0].head()

In [None]:
test = frames[0].round(1)

In [None]:
admin_level = 0

if admin_level == 0:
    admin_prefix = 'admin0/'
elif admin_level == 1:
    admin_prefix = 'admin1/'
else:
    raise ValueError("admin_level kwarg must be either 0 or 1")
print(admin_prefix)