## Using helix_funcs

This notebook shows how to use the `helix_funcs` module to process the HelixScope data files and generate a master output table.

`process_file` should generate a similarly named CSV file in the `processed/` folder.
`combine_processed_results` joins all of the results files together into `master_admin1.csv`.

In [1]:
import helix_funcs
import geopandas as gpd
import pandas as pd
# from tqdm import tqdm
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
s = gpd.read_file('./data/gadm28_adm1/gadm28_adm1.shp')
#s = s.to_crs(epsg='4326')
#s = gpd.read_file("./data/gadm28_adm1_simplified/gadm28_adm1_simplified.shp")

In [None]:
fs = helix_funcs.identify_netcdf_and_csv_files()

In [None]:
fs['nc'][-3:-1]

In [None]:
# test_list = ['data/UEA_data/climate/pr/ECEARTH-R1.SWL_15.cl.pr.Apr.nc',
#             'data/CNRS_data/cSoil/orchidee-giss-ecearth.SWL_2.eco.cSoil.nc'
#             ]

In [None]:
%%time
for f in ['data/UEA_data/climate/tx/HADGEM3-R8.SWL_4.cl.tx.Sep.nc']: #tqdm(fs['nc'][0:4]):
    helix_funcs.process_file(file=f, shps=s, verbose=True)

## Speed up

Looks like we achieve a 34 second execution per file (down from 2 mins) when we use a simplified set of geometries.


In [None]:
smaller = s.head()

In [None]:
smaller.geometry

In [None]:
# prepare a simplified set of geometries to work from...

In [None]:
tmp_geoms = smaller.geometry.simplify(0.2, preserve_topology=False)

In [None]:
smaller

In [None]:
smaller.keys()

In [None]:
new_data = []
for row_index in smaller.index:
    new_data.append([smaller['iso'][row_index],
                     smaller['name_0'][row_index],
                     smaller['id_1'][row_index],
                     smaller['name_1'][row_index],
                     smaller['type_1'][row_index],
                    ])

In [None]:
tester = gpd.GeoDataFrame(pd.DataFrame(new_data, columns=['iso','name_0','id_1','name_1','type_1']), geometry=tmp_geoms)
tester

In [None]:
tester.plot()

## Deal with outputs

TODO:

* I should round the outputs here to one decimal place, e.g. 10.6466969914 becomes 10.6 (saves a lot of space!)


In [3]:
helix_funcs.combine_processed_results()  #<-- join all results files together into a master_admin1.csv

KeyboardInterrupt: 

In [None]:
tmp = pd.read_csv("./master_admin1.csv")
tmp.head()

In [None]:
len(tmp)

In [None]:
output_files = helix_funcs.identify_netcdf_and_csv_files('processed/')

In [None]:
len(output_files['csv'])

In [None]:
tmp = pd.read_csv(output_files['csv'][-1])

In [None]:
country_mask = tmp['iso'] == 'ESP'

In [None]:
tmp = tmp[country_mask].head()

In [None]:
admin_mask = tmp['id_1'] == 1

In [None]:
tmp['id_1'].unique()

## Test plots

We need an easy way to preview how a specific variable's choropleths will appear.
This means reading a file and making a preview plot using geopandas.


In [4]:
output_files = helix_funcs.identify_netcdf_and_csv_files('processed/')

In [None]:
helix_funcs.map_file_by_iso(f=output_files['csv'][10], s=s, iso='DEU')

In [None]:
helix_funcs.map_file_by_iso(f=output_files['csv'][10], s=s, iso='RUS')

Written ./RUS.orchidee-giss-ecearth.SWL_2.eco.cVeg.png


## SIMPLIFY LIFE!

Before passing the polygons for zonal analysis, they should be simplified. This should be done once, prior to looping over the files, so we only calculate the simpler shapes a single time.

Looks like the call should be `stest.geometry.simplify(0.2, preserve_topology=False)`; judging from the plots below, the result seems acceptable.
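A minimal sketch of simplifying the shapes once, up front, while keeping the attribute columns. The helper name `simplify_shapes` and the tiny synthetic square are assumptions made for illustration; only the `simplify(0.2, preserve_topology=False)` call comes from the notes above.

```python
import geopandas as gpd
from shapely.geometry import Polygon


def simplify_shapes(shps, tolerance=0.2):
    """Return a copy of shps with simplified geometries; other columns are kept.

    Hypothetical helper, not part of helix_funcs.
    """
    simplified = shps.copy()
    simplified["geometry"] = shps.geometry.simplify(tolerance, preserve_topology=False)
    return simplified


# Synthetic stand-in for the GADM admin1 shapes: a square with one extra,
# nearly collinear vertex that the simplification should drop.
demo = gpd.GeoDataFrame(
    {"iso": ["XXX"], "id_1": [1]},
    geometry=[Polygon([(0, 0), (5, 0.01), (10, 0), (10, 10), (0, 10)])],
)
demo_simple = simplify_shapes(demo)

n_before = len(demo.geometry.iloc[0].exterior.coords)
n_after = len(demo_simple.geometry.iloc[0].exterior.coords)
print("vertices before:", n_before, "after:", n_after)
```

With the real data this would presumably be called as `s_simple = simplify_shapes(s)` before the loop, with `shps=s_simple` passed to `helix_funcs.process_file`.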

In [None]:
stest = s[s['iso'] == 'BRA']
stest.geometry.plot()

In [None]:
stest.geometry.simplify(0.2, preserve_topology=False).plot()

In [None]:
stest.geometry.simplify(0.2, preserve_topology=True).plot()

## Map a country's worth of data

Create a function to return a map for a country over a given processed file to see what is going on.

Should look like:

`helix_funcs.map_file_by_iso(f, s, iso='ESP', var='mean')`

And produce a saved png matplotlib plot.

Necessary steps:

1. Read processed file
2. subset data for only specified country and variable
3. read geometries
4. subset only geometry of interest (only specific country)
5. combine them into a single geopandas dataframe
6. plot the combined data as a choropleth and save it to a file, e.g. `map_data.plot(column='mean')`
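Steps 2 to 5 can also be sketched as a single merge on `id_1`, as an alternative to collecting geometries row by row. The helper name `join_values_to_shapes` and the synthetic boxes and values below are assumptions for illustration; the column names `iso`, `id_1` and `mean` are the ones used in the processed files.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box


def join_values_to_shapes(df, shps, iso, var="mean"):
    """Attach one variable from a processed table to one country's admin1 shapes.

    Hypothetical helper, not part of helix_funcs.
    """
    values = df.loc[df["iso"] == iso, ["id_1", var]]
    shapes = shps[shps["iso"] == iso]
    # An inner merge keeps only the admin areas that actually have data.
    return shapes.merge(values, on="id_1", how="inner")


# Synthetic stand-ins for the shapefile and for a processed CSV.
shapes = gpd.GeoDataFrame(
    {"iso": ["ESP", "ESP", "FRA"], "id_1": [1, 2, 1]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(0, 1, 1, 2)],
)
table = pd.DataFrame({"iso": ["ESP", "FRA"], "id_1": [1, 1], "mean": [10.6, 3.2]})

map_data = join_values_to_shapes(table, shapes, iso="ESP")
print(map_data[["iso", "id_1", "mean"]])
```

The result is a GeoDataFrame, so step 6 would then be a call such as `map_data.plot(column='mean')`.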

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
s = gpd.read_file('./data/gadm28_adm1/gadm28_adm1.shp')

In [None]:
output_files = helix_funcs.identify_netcdf_and_csv_files('processed/')

In [None]:
df = pd.read_csv(output_files['csv'][0])

In [None]:
country = "ESP"
country_subset = df['iso'] == country

In [None]:
df[country_subset].head()

In [None]:
country = "ESP"
var = 'mean'
keys = ['iso', 'id_1', var]
country_subset = df['iso'] == country
# Keep only the rows for this country and the columns of interest
tmp_df = df.loc[country_subset, keys].reset_index(drop=True)

In [None]:
tmp_df.head()

In [None]:
s_smaller = s[s['iso'] == country]

In [None]:
geoms = []
for row in tmp_df.index:
    #print(row, tmp_df['iso'][row], tmp_df['id_1'][row])
    s_smaller_mask = tmp_df['id_1'][row] == s_smaller['id_1']
    geoms.append(s_smaller[s_smaller_mask].geometry.values[0])#.simplify(0.01))

In [None]:
map_data = gpd.GeoDataFrame(tmp_df, geometry=geoms)

In [None]:
#ax = map_data.plot(column='mean',cmap='Accent')

In [None]:
fig, ax = plt.subplots()
ax.set_aspect('equal')
s_smaller.plot(ax=ax, color='red', linewidth=0.5)                           # Basemap of all country shapes (red polygons)
map_data.plot(ax=ax, column='mean',cmap='Blues', alpha=1.0, linewidth=0.5)  # All shapes for where there exist data (colored polygons)
#plt.show()
plt.savefig('./test.png', dpi=300)

In [None]:
def map_file_by_iso(f, s, iso="ESP", var='mean'):
    """Read a processed CSV (expecting admin1 level) and produce a
    sample choropleth plot.
    f is a filepath, e.g. 'processed/CNRS_data/cSoil/orchidee-ipsl-hadgem.SWL_2.eco.cSoil.csv'
    s is a loaded geopandas dataframe, e.g. s = gpd.read_file('./data/gadm28_adm1/gadm28_adm1.shp')
    """
    f_split = f.split('/')
    basename = f_split[-1].split('.csv')[0]
    title = " ".join([iso + ":", basename])
    df = pd.read_csv(f)
    # Extract the rows for the requested country and the columns of interest
    keys = ['iso', 'id_1', var]
    country_subset = df['iso'] == iso
    extracted_data = []
    for row in df[country_subset].index:
        extracted_data.append([df[k][row] for k in keys])
    tmp_df = pd.DataFrame(extracted_data, columns=keys)
    # Look up the geometry matching each admin1 id in the extracted data
    s_smaller = s[s['iso'] == iso]
    geoms = []
    for row in tmp_df.index:
        s_smaller_mask = tmp_df['id_1'][row] == s_smaller['id_1']
        geoms.append(s_smaller[s_smaller_mask].geometry.values[0])#.simplify(0.01))
    map_data = gpd.GeoDataFrame(tmp_df, geometry=geoms)
    # Plotting section
    fig, ax = plt.subplots()
    ax.set_aspect('equal')
    s_smaller.plot(ax=ax, color='red', linewidth=0.5)                           # Basemap of all country shapes (red polygons)
    map_data.plot(ax=ax, column=var, cmap='Pastel1', alpha=1.0, linewidth=0.5)  # All shapes for which data exist (colored polygons)
    plt.title(title)
    outfname = "".join(["./", iso, ".", basename, '.png'])
    plt.savefig(outfname, dpi=300)
    print("Written {0}".format(outfname))
    return tmp_df

In [None]:
map_file_by_iso(f=output_files['csv'][7], s=s, iso='GBR')

In [None]:
output_files['csv'][7]

## Next Step

Work out the problem of small shapes (we will probably need a logical test for small shapes, and to buffer the geom before zonal stats are calculated). Perhaps this should even be done prior to any loop, initially when the shapes are calculated.
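One possible shape for that logical test, as a rough sketch: buffer any geometry whose area falls below a threshold, once, before any loop. The helper name `buffer_small_shapes` and both thresholds are hypothetical (in degrees, since the shapes are in lon/lat), and would need tuning against the grid resolution of the NetCDF files.

```python
import geopandas as gpd
from shapely.geometry import box


def buffer_small_shapes(shps, min_area=0.25, distance=0.25):
    """Return a copy of shps in which shapes smaller than min_area are buffered.

    Hypothetical helper; min_area and distance are placeholder values.
    """
    buffered = [
        geom.buffer(distance) if geom.area < min_area else geom
        for geom in shps.geometry
    ]
    new_geoms = gpd.GeoSeries(buffered, index=shps.index, crs=shps.crs)
    return shps.set_geometry(new_geoms)


# Synthetic example: one tiny shape and one large shape.
demo = gpd.GeoDataFrame(
    {"iso": ["XXX", "XXX"], "id_1": [1, 2]},
    geometry=[box(0, 0, 0.1, 0.1), box(1, 1, 3, 3)],
)
demo_buffered = buffer_small_shapes(demo)
print(demo.geometry.area.tolist())
print(demo_buffered.geometry.area.tolist())
```

Only the tiny shape grows; the large one is passed through unchanged.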

First, we can look at which admins are absent from a processed file (to see where the problem lies).
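A small sketch of that check, assuming the shapefile attributes and the processed CSV share the `iso` and `id_1` columns as they do elsewhere in this notebook. The helper name `missing_admins` and the toy tables are made up for illustration.

```python
import pandas as pd


def missing_admins(shape_attrs, processed):
    """Return the (iso, id_1) pairs in shape_attrs with no row in processed.

    Hypothetical helper, not part of helix_funcs.
    """
    keys = ["iso", "id_1"]
    merged = shape_attrs[keys].drop_duplicates().merge(
        processed[keys].drop_duplicates(), on=keys, how="left", indicator=True
    )
    # Rows marked 'left_only' exist in the shapes but not in the processed file.
    return merged.loc[merged["_merge"] == "left_only", keys].reset_index(drop=True)


# Toy stand-ins for the shapefile attributes and a processed CSV.
shape_attrs = pd.DataFrame({"iso": ["ESP", "ESP", "GBR"], "id_1": [1, 2, 1]})
processed = pd.DataFrame({"iso": ["ESP", "GBR"], "id_1": [1, 1], "mean": [10.6, 8.1]})

absent = missing_admins(shape_attrs, processed)
print(absent)
```

With the real data, `shape_attrs` could simply be `s` itself and `processed` one of the CSVs read with `pd.read_csv`.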

Test with simplified shapes.

#### REDUCE SIZE OF FINAL DATA BY removing significant digits from values
e.g. 10.6466969914 should be converted to simply 10.6
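A minimal sketch of that reduction using pandas rounding. The helper name `round_numeric_columns` is hypothetical, and one decimal place is assumed from the example above.

```python
import pandas as pd


def round_numeric_columns(df, decimals=1):
    """Return a copy of df with every numeric column rounded to `decimals` places.

    Hypothetical helper, not part of helix_funcs.
    """
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].round(decimals)
    return out


# Toy row shaped like the processed output.
example = pd.DataFrame({"iso": ["ESP"], "id_1": [1], "mean": [10.6466969914]})
rounded = round_numeric_columns(example)
print(rounded)
```

This could be applied to each table before it is written out, so that the master CSV stores the shorter values.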





In [None]:
#test_geo_df = GeoDataFrame(df, crs=crs, geometry=geometry)