# Benchmarking

This notebook explores ways to speed up the execution of `pyEDM` code over large numbers of independent time series. The specific application is the use of empirical dynamic modeling (EDM) to examine the predictability of seasonal NDVI dynamics across East Africa using trajectories of rainfall and temperature.

Approaches to optimizing the code will include:

- Storing the data in a row-centric format that allows rapid querying in Dask.
- Possibly speeding up the `pyEDM` routines by cutting out extra analysis (e.g. model-testing code).
- Other approaches we have not thought of yet.
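
To judge whether any of these approaches helps, it is useful to measure the time spent per pixel on a small sample before and after a change. The helper below is a minimal sketch of such a measurement. `time_per_item` and `fake_pixel_job` are hypothetical names introduced here for illustration, and the stand-in workload would be replaced by the real per-pixel routine.

```python
import time


def time_per_item(func, items, repeats=3):
    """Return the best-of-`repeats` mean wall-clock seconds per item."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in items:
            func(item)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(items))
    return best


# Stand-in workload (an assumption); the notebook's per-pixel call goes here.
def fake_pixel_job(n):
    return sum(i * i for i in range(n))


seconds = time_per_item(fake_pixel_job, [1000] * 50)
print(f"{seconds * 1e6:.1f} microseconds per item")
```

Taking the best of several repeats reduces the influence of one-off delays such as disk caching or other processes competing for the CPU.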

## Benchmark data

The full dataset is too large to benchmark directly. We have 320k pixels, each with ~18 years of dekad (10-day) data. That is a 320,000 x 606 data cube, which is about 3.5 GB as a `.csv` file. More problematic is the fact that the `pyEDM` routine takes about 0.5 seconds per pixel.
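
The figures above imply a total runtime and a storage cost per value that are easy to work out. The short calculation below uses only the numbers quoted in the text and assumes the file size means 3.5 × 10⁹ bytes.

```python
n_pixels = 320_000
n_steps = 606
seconds_per_pixel = 0.5

n_values = n_pixels * n_steps
serial_hours = n_pixels * seconds_per_pixel / 3600
bytes_per_value = 3.5e9 / n_values  # assumes the file is 3.5e9 bytes

print(f"{n_values:,} values")                          # 193,920,000 values
print(f"{serial_hours:.1f} hours for one serial pass")  # 44.4 hours
print(f"{bytes_per_value:.1f} bytes per value in the CSV")  # 18.0 bytes
```

At roughly 44 hours for one serial pass, the per-pixel compute time matters more than the file size, which is why the benchmarks start from a small sample.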

So we start with a sample of 40k points. These data are stored in the repo under `Data/observations_tables/`. The files are:

- NDVI data: `ndvi_table_anom_lct_80000_84000.csv`
- Precipitation data: `precip_table_anom_lct_80000_84000.csv`
- Temperature data: `temp_table_anom_lct_80000_84000.csv`

The goal is to produce a table of Simplex predictions:

- Simplex data: `simplex_table_pred_lct_80000_84000.csv`


In [1]:
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd  # Dask dataframes implement much of the pandas API
from pyEDM import Simplex  # only Simplex is used in this notebook

dask.config.set(scheduler='threads')  # make the threaded scheduler the explicit default

<dask.config.set at 0x7fe9eda1c190>

In [6]:
data_dir = 'Data/observations_tables/'
datasets = {
    'ndvi': 'ndvi_table_anom_lct_80000_84000.csv',
    'prcp': 'precip_table_anom_lct_80000_84000.csv',
    'temp': 'temp_table_anom_lct_80000_84000.csv'
}

### Step 1. Read in the datasets and group data by pixels

Read in the individual CSV files and concatenate them into a single Dask dataframe, tagging each row with the variable it came from.

In [31]:
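df_list = []
for name, filename in datasets.items():
    df = dd.read_csv(data_dir + filename)
    df['var'] = name  # tag each row with its variable name
    # Keep only index labels 0 through 100 while prototyping
    df_list.append(df.loc[0:100])
data = dd.concat(df_list)
# Group rows by pixel so each group holds the ndvi, prcp and temp rows for one pixel
pixels = data.groupby('pixel_id')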
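
To make the resulting layout concrete, here is a pandas-only sketch of the same read, tag, concatenate and group steps on three tiny in-memory tables. The values, the two date columns and the pixel IDs are made up for illustration; only the `pixel_id` and `var` column names come from the notebook.

```python
import io

import pandas as pd

# Hypothetical miniature tables: one row per pixel, one column per dekad.
raw = {
    "ndvi": "pixel_id,2002-07-11,2002-07-21\n1,0.1,0.2\n2,0.3,0.4\n",
    "prcp": "pixel_id,2002-07-11,2002-07-21\n1,1.0,2.0\n2,3.0,4.0\n",
    "temp": "pixel_id,2002-07-11,2002-07-21\n1,20.0,21.0\n2,22.0,23.0\n",
}

frames = []
for name, text in raw.items():
    df = pd.read_csv(io.StringIO(text))
    df["var"] = name  # tag each row with its variable name
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
pixels = data.groupby("pixel_id")

# Each group holds one row per variable for a single pixel
one_pixel = pixels.get_group(1)
print(one_pixel)
```

Each group therefore has one row per variable and one column per time step, which is the shape the per-pixel functions below have to reshape before calling `Simplex`.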

### Step 2. Define the Simplex function for a pixel group

In [17]:
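def do_Simplex(pixel, target='ndvi', ed=6, pi=3):
    """Run a Simplex projection for one pixel group.

    ed is the embedding dimension (E); pi is the prediction interval (Tp).
    """
    # Transpose so that rows are time steps and columns are variables
    t = pixel.T
    t.columns = pixel['var'].values
    # Drop the first three rows (non-data columns of the input) and the last row ('var')
    t = t.iloc[3:-1].astype(float)
    columns = list(t.columns)
    # pyEDM expects time in the first column. Assign it positionally from the
    # index (a pd.Series here would align on labels and produce NaT).
    t.insert(0, 'Time', pd.to_datetime(t.index))
    # Use the full series as both the library and the prediction set
    lib = '1 ' + str(len(t))
    pred = lib
    result = Simplex(
        dataFrame=t, lib=lib, pred=pred,
        E=ed, Tp=pi,
        columns=columns, target=target, showPlot=False)
    return result.T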
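
For readers unfamiliar with what `E` and `Tp` control, the following NumPy sketch shows the general idea of simplex projection on a single series: embed the series with `E` lags, find the `E + 1` nearest neighbours of each point, and predict the value `Tp` steps ahead as a distance-weighted average of the neighbours' futures. This is an illustration only and not `pyEDM`'s implementation. `simplex_sketch` is a hypothetical name, and it uses a simple leave-one-out scheme with a lag of one step.

```python
import numpy as np


def simplex_sketch(x, E=3, Tp=1):
    """Leave-one-out simplex projection on a 1-D series (illustrative only)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Row j of the embedding is (x[t], x[t-1], ..., x[t-E+1]) with t = j + E - 1
    emb = np.column_stack([x[E - 1 - k: n - k] for k in range(E)])
    times = np.arange(E - 1, n)
    usable = times + Tp < n  # points whose value Tp steps ahead is observed
    preds = np.full(n, np.nan)
    for j in np.flatnonzero(usable):
        d = np.linalg.norm(emb - emb[j], axis=1)
        d[j] = np.inf        # exclude the point itself
        d[~usable] = np.inf  # neighbours need an observed future value
        nn = np.argsort(d)[: E + 1]
        # Exponential weights scaled by the distance to the nearest neighbour
        w = np.exp(-d[nn] / max(d[nn][0], 1e-12))
        preds[times[j] + Tp] = np.sum(w * x[times[nn] + Tp]) / np.sum(w)
    return preds


x = np.sin(0.3 * np.arange(200))
preds = simplex_sketch(x, E=3, Tp=1)
mask = ~np.isnan(preds)
rho = np.corrcoef(preds[mask], x[mask])[0, 1]
print(f"rho = {rho:.3f}")
```

The correlation between predictions and observations is the usual skill measure in EDM. The neighbour search runs once per time step, so its cost is paid again for every pixel.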

In [36]:
one_pixel = pixels.get_group(252780)
one_pixel = one_pixel.compute()
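
A single pixel group still has one row per variable, whereas `Simplex` takes a table with time in the first column and one column per variable. The sketch below performs that reshape on a made-up group, selecting columns by name rather than by position. The values and the three dates are hypothetical; only `pixel_id` and `var` are taken from the notebook.

```python
import pandas as pd

# Hypothetical pixel group: one row per variable, date columns, then 'var'.
one_pixel = pd.DataFrame(
    {
        "pixel_id": [1, 1, 1],
        "2002-07-11": [0.1, 1.0, 20.0],
        "2002-07-21": [0.2, 2.0, 21.0],
        "2002-08-01": [0.3, 3.0, 22.0],
        "var": ["ndvi", "prcp", "temp"],
    }
)

# Index rows by variable name, keep only the date columns, then transpose
wide = one_pixel.set_index("var").drop(columns="pixel_id").T.astype(float)
wide.index = pd.to_datetime(wide.index)
wide.index.name = "Time"
table = wide.reset_index()  # 'Time' becomes the first column
print(table)
```

Selecting by name avoids depending on how many leading non-data columns the input files happen to have.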

In [20]:
df = data.compute()

In [76]:
def pixel_transform(pixel):
    pixel = pixel.set_index(pixel.iloc[:,-1]).iloc[:,3:-1]  # set_index is not in-place; keep the result
    return ('temp', pixel.loc['temp',:].to_list())

### Step 3. Run the Simplex function for each pixel ID

In [77]:
test = pixels.apply(pixel_transform)

ValueError: Metadata inference failed in `groupby.apply(pixel_transform)`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
KeyError('temp')

Traceback:
---------
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/dask/dataframe/utils.py", line 174, in raise_on_meta_error
    yield
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/dask/dataframe/groupby.py", line 1613, in apply
    meta = self._meta_nonempty.apply(func, *meta_args, **meta_kwargs)
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 859, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 892, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 213, in apply
    res = f(group)
  File "<ipython-input-76-d05f278b5daf>", line 3, in pixel_transform
    return ('temp', pixel.loc['temp',:].to_list())
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/pandas/core/indexing.py", line 873, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/pandas/core/indexing.py", line 1044, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/pandas/core/indexing.py", line 786, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/pandas/core/indexing.py", line 1110, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/pandas/core/indexing.py", line 1059, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/pandas/core/generic.py", line 3491, in xs
    loc = self.index.get_loc(key)
  File "/Users/kellycaylor/opt/anaconda3/envs/droughtEDM/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
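
The traceback shows the `KeyError` being raised inside Dask's metadata inference (`self._meta_nonempty.apply(...)`), where the function is run on a small placeholder frame instead of the real data. A function that looks up a specific row label such as `'temp'` is likely to fail there, because the placeholder is not expected to contain that label. Separately, the lookup only works if the frame is actually indexed by `var`, and `set_index` returns a new frame instead of modifying its input. The pandas-only sketch below illustrates the second point on a made-up group. The first and third column names and all values are hypothetical, and `pixel_transform_fixed` is a name introduced here.

```python
import pandas as pd

# Hypothetical pixel group: three leading non-data columns, dates, then 'var'.
group = pd.DataFrame(
    {
        "row": [0, 0, 0],
        "pixel_id": [7, 7, 7],
        "lct": [2, 2, 2],
        "2002-07-11": [0.1, 1.0, 20.0],
        "2002-07-21": [0.2, 2.0, 21.0],
        "var": ["ndvi", "prcp", "temp"],
    }
)


def pixel_transform_fixed(pixel):
    # set_index returns a new frame; assign it so 'temp' becomes a row label
    by_var = pixel.set_index(pixel.iloc[:, -1]).iloc[:, 3:-1]
    return by_var.loc["temp", :].to_list()


print(pixel_transform_fixed(group))

# The input frame keeps its original index, so a label lookup on it fails
try:
    group.loc["temp", :]
except KeyError as err:
    print("KeyError:", err)
```

With Dask, a function like this would usually be applied with an explicit `meta=`, as the error message suggests, for example `pixels.apply(pixel_transform_fixed, meta=('temp', 'object'))`. The `meta` value shown is an assumption and depends on what the function returns.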


In [69]:
test.compute()

pixel_id,var,2002-07-11,2002-07-21,2002-08-01,2002-08-11,2002-08-21,2002-09-01,2002-09-11,2002-09-21,2002-10-01,2002-10-11,...,2019-01-21,2019-02-01,2019-02-11,2019-02-21,2019-03-01,2019-03-11,2019-03-21,2019-04-01,2019-04-11,2019-04-21
253359,temp,-0.552875,-0.217244,-0.701324,-0.701324,0.265766,0.406947,0.061001,0.985432,0.731399,0.221755,...,1.208922,1.298874,1.731830,0.207873,2.222059,0.965158,1.126209,0.927947,1.097937,0.901879
253359,ndvi,-0.613768,-0.682057,-0.750502,-0.798802,-0.813404,-0.811555,-0.825458,-0.923921,-0.861836,-0.534437,...,-0.290187,-0.437319,-0.496332,-0.538985,-0.614076,-0.656829,-0.654689,-0.545627,-0.340577,-0.070040
253359,prcp,-0.492448,-0.494050,-0.514979,-0.518604,-0.518604,-0.518604,-0.518604,-0.398309,-0.229213,-0.060973,...,-0.359019,-0.423448,-0.377278,-0.119065,-0.385213,-0.443841,0.126901,-0.452269,-0.458699,1.801077
253360,temp,-0.743305,-0.198759,-0.280825,-0.280825,0.174059,0.491491,0.073292,1.206714,0.984184,0.337311,...,1.384938,1.463201,1.893191,0.552285,2.334502,0.987960,1.282146,1.108175,1.293477,0.759397
253360,ndvi,-0.582699,-0.646529,-0.712318,-0.760645,-0.797051,-0.820513,-0.847850,-0.957199,-0.914057,-0.644665,...,-0.163175,-0.394060,-0.489789,-0.552099,-0.630360,-0.669550,-0.655078,-0.556882,-0.384853,-0.154831
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261947,ndvi,-0.474581,-0.642116,-0.835005,-1.112779,-1.176134,-1.022186,-0.681887,-0.217583,0.222808,0.673771,...,0.705471,0.386695,0.055722,-0.107172,-0.195289,-0.334184,-0.608328,-1.045149,,
261947,prcp,-1.229510,-1.053431,-1.078789,-1.237888,-0.575965,-0.395390,-0.287958,-0.293626,0.396356,0.218935,...,-0.281496,-1.087855,-0.727226,-0.429794,0.288698,-0.295102,0.222362,0.624170,0.764269,0.678677
262502,temp,-0.871718,0.298273,0.083427,0.083427,1.319651,-0.526801,-0.149129,-0.287817,0.081438,0.285309,...,0.024046,1.441787,-0.895484,1.542725,1.849648,1.260794,-0.537809,0.717443,-0.295967,2.132563
262502,ndvi,-0.483350,-0.537007,-0.633212,-0.809182,-0.760975,-0.569141,-0.284949,-0.054573,0.270137,0.493683,...,-1.516745,-1.561022,-1.466831,-1.282827,-1.287195,-1.372194,-1.376965,-1.326059,,-0.937727
