# Modelling uncertainty in LGBM predictions

Quantile regression in this context doesn't work as an estimate of Aus-wide uncertainty.

Instead, we will attempt to model the uncertainty that comes from the training data.  To do this, we will generate _n_ models where _n_ is equal to the number of unique EC flux tower sites that make up the training data (29 in this case). For each iteration, one site's entire time-series will be removed from the training data and a LGBM model will be fit on the remaining data.  This will result in 29 models that later we can use to make predictions with. The envelope of our predictions will inform our uncertainity


## Load modules

In [1]:
import os
import xarray as xr
import numpy as np
import pandas as pd
from joblib import dump
import multiprocessing
import matplotlib.pyplot as plt
from pprint import pprint
from lightgbm import LGBMRegressor
import lightgbm as lgbm
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import warnings
warnings.filterwarnings("ignore")

## Analysis Parameters

In [2]:
model_name = 'AUS'
model_var = 'NEE'

In [3]:
ncpus=multiprocessing.cpu_count()
print('ncpus = '+str(ncpus))

ncpus = 16


## Prepare Data

In [4]:
base = '/g/data/os22/chad_tmp/NEE_modelling/results/training_data/'
sites = os.listdir('/g/data/os22/chad_tmp/NEE_modelling/results/training_data/')

td = []
for site in sites:
    if '.csv' in site:
        xx = pd.read_csv(base+site, index_col='time', parse_dates=True)
        xx['site'] = site[0:5]
        td.append(xx)

ts = pd.concat(td).dropna() #we'll use this later

In [5]:
variables = [
            #'LAI_anom_RS',
             'kNDVI_anom_RS',
             'FPAR_RS',
             'LST_RS',
             'tree_cover_RS',
             'nontree_cover_RS',
             'nonveg_cover_RS',
             'LST-Tair_RS',
             'TWI_RS',
             'NDWI_RS',
             'rain_anom_RS',
             'rain_cml3_anom_RS',
             'rain_cml6_anom_RS',
             'rain_cml12_anom_RS',
             'srad_anom_RS',
             'vpd_RS',
             'tavg_anom_RS',
             'SOC_RS',
             #'CO2_RS',
             'site'
            ]

## prepare training data

In [6]:
xx = []
yy = []

for t in td:
    t = t.drop(['Fluxcom_RS-Meteo_NEE', 'Fluxcom_RS_NEE', 'ThisStudy_NEE', 'Cable_NEE',
       'Fluxcom_RS_GPP', 'Fluxcom_RS-meteo_GPP', 'ThisStudy_GPP', 'Cable_GPP',
       'MODIS_GPP', 'GOSIF_GPP'], axis=1)  
    
    t = t.dropna()  # remove NaNS
    df = t.drop(['NEE_SOLO_EC','GPP_SOLO_EC','ER_SOLO_EC'], axis=1) # seperate carbon fluxes
    
    #df = df.filter(regex='RS') # only use remote sensing variables   
    df = df[variables]
    
    if model_var == 'ET':
        df_var=t[model_var+'_EC']
    else:
        df_var=t[[model_var+'_SOLO_EC', 'site']] # seperate out the variable we're modelling
    
    x = df.reset_index(drop=True)#.to_numpy()
    y = df_var.reset_index(drop=True)#.to_numpy()
    xx.append(x)
    yy.append(y)

x = pd.concat(xx)
y = pd.concat(yy)

print(x.shape)

(2744, 18)


## Test model robustness with time-series K-fold cross validation

* If you set boosting as RF then the lightgbm algorithm behaves as random forest. According to the documentation, to use RF you must use bagging_fraction and feature_fraction smaller than 1


<img src="results/figs/cross_validation.png" width=800>

### Generate five sets of train-test indices

First we remove one site's time-series from the full training dataset.

Then,we do the per site TSCV: For each site, grab a sequential set of test samples (time-series-split methods), the remaining points (either side of test samples) go into training.  A single K-fold contains test and training samples from every site.

A model is built and saved that is trained of n-1 sites

We can **label the models by the site that is removed**

In [7]:
for site in x['site'].unique():
    
    if os.path.exists('/g/data/os22/chad_tmp/NEE_modelling/results/models_uncertainty/'+model_var+'_LGBM_'+site+'-rm.joblib'):
        print('skipping '+site)
        continue
    else:
        print('starting ' +site)
        x_n = x[x.site != site]
        y_n = y[y.site != site]

        sites_n = x_n['site'].unique()
        x_n['original_index'] = [i for i in range(0,len(x_n))]

        train_1=[]
        train_2=[]
        train_3=[]
        train_4=[]
        train_5=[]

        test_1=[]
        test_2=[]
        test_3=[]
        test_4=[]
        test_5=[]

        for site_n in sites_n:
            df = x_n.loc[x_n['site'] == site_n]
            tscv = TimeSeriesSplit(n_splits=5)
            i=1
            for train, test in tscv.split(df):
                all_indices=np.concatenate([train,test])
                left_over = df.loc[~df.index.isin(all_indices)].index.values
                train = np.concatenate([train, left_over])
                if i==1:
                    train_1.append(df.iloc[train]['original_index'].values)
                    test_1.append(df.iloc[test]['original_index'].values)
                if i==2:
                    train_2.append(df.iloc[train]['original_index'].values)
                    test_2.append(df.iloc[test]['original_index'].values)
                if i==3:
                    train_3.append(df.iloc[train]['original_index'].values)
                    test_3.append(df.iloc[test]['original_index'].values)
                if i==4:
                    train_4.append(df.iloc[train]['original_index'].values)
                    test_4.append(df.iloc[test]['original_index'].values)
                if i==4:
                    train_5.append(df.iloc[train]['original_index'].values)
                    test_5.append(df.iloc[test]['original_index'].values)
                i+=1

        train_1 = np.concatenate(train_1)
        train_2 = np.concatenate(train_2)
        train_3 = np.concatenate(train_3)
        train_4 = np.concatenate(train_4)
        train_5 = np.concatenate(train_5)

        test_1 = np.concatenate(test_1)
        test_2 = np.concatenate(test_2)
        test_3 = np.concatenate(test_3)
        test_4 = np.concatenate(test_4)
        test_5 = np.concatenate(test_5)

        train = [train_1, train_2, train_3, train_4, train_5]
        test = [test_1, test_2, test_3, test_4, test_5]

        #check there are no train indices in the test indices
        for i,j in zip(train, test):
            assert (np.sum(np.isin(i,j)) == 0)

        #remove the columns we no longer need
        x_n = x_n.drop(['site', 'original_index'], axis=1)
        y_n = y_n.drop('site', axis=1)

        #optimize hyperparamters
        param_grid = {
            'num_leaves': [7, 14, 21, 28, 31],
            'min_child_samples':[15, 20, 30],
            'boosting_type': ['gbdt', 'dart'],
            'max_depth': [5, 10, 15, 20],
            'n_estimators': [300, 400, 500],
            'early_stopping_round' : [10]
            }

        clf = GridSearchCV(LGBMRegressor(verbose=-1),
                           param_grid,
                           scoring='r2',
                           verbose=0,
                           cv=zip(train, test), #using timeseries custom splits here
                          )

        clf.fit(x_n, y_n, callbacks=None)
        print(site+ ": r2 score ", round(clf.best_score_, 2))

        #fit model and save
        model = LGBMRegressor(**clf.best_params_)
        model.fit(x_n,y_n)

        dump(model, '/g/data/os22/chad_tmp/NEE_modelling/results/models_uncertainty/'+model_var+'_LGBM_'+site+'-rm.joblib')

skipping Colli
skipping Tumba
skipping CapeT
skipping Boyag
skipping Litch
skipping Silve
skipping Emera
skipping Whroo
skipping Great
skipping Otway
skipping Sturt
skipping FoggD
skipping Gingi
skipping Adela
skipping DalyU
starting Riggs
Riggs: r2 score  0.57
starting Longr
Longr: r2 score  0.58
starting Samfo
Samfo: r2 score  0.58
starting Walla
Walla: r2 score  0.62
starting Robso
Robso: r2 score  0.59
starting Warra
Warra: r2 score  0.61
starting Womba
Womba: r2 score  0.54
starting Calpe
Calpe: r2 score  0.58
starting Yanco
Yanco: r2 score  0.57
starting Alice
Alice: r2 score  0.58
starting Ridge
Ridge: r2 score  0.57
starting DryRi
DryRi: r2 score  0.58
starting Cumbe
Cumbe: r2 score  0.59
starting CowBa
CowBa: r2 score  0.6
