# Modeling notebook

Here, we estimate country-wide tree height scores, using LVIS-derived height scores as the response variable and Landsat- and PALSAR-derived remote sensing data as covariates.

Steps:

* Extract co-aligned pixel values for the following datasets
    * LVIS lidar
    * Landsat fractional cover
    * PALSAR backscatter
    * CCB tree cover
* Perform hyperparameter tuning on a small subset of the data
* Train a model on the full dataset using the best-fit hyperparameters
* Apply that model to the full country's covariates
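The final step, applying a fitted model to a full covariate raster, is not shown in this section. A minimal sketch of the idea follows, assuming a hypothetical no-data value of -9999 and a stand-in `MeanOfBands` class in place of the trained model; `predict_raster` is an illustrative helper, not part of `ccb`.

```python
import numpy as np

NO_DATA = -9999.0  # hypothetical no-data value, for illustration only


class MeanOfBands:
    """Hypothetical stand-in for a fitted model; anything with .predict(X) works."""

    def predict(self, X):
        return X.mean(axis=1)


def predict_raster(model, stack, no_data=NO_DATA):
    """Apply `model` to a (bands, rows, cols) stack, skipping no-data pixels."""
    n_bands, n_rows, n_cols = stack.shape
    # a pixel is usable only if every band has valid data
    good = np.all(stack != no_data, axis=0)
    # pre-fill the output with no-data, then overwrite the usable pixels
    out = np.full((n_rows, n_cols), no_data, dtype=float)
    if good.any():
        X = stack[:, good].T  # shape (n_good_pixels, n_bands)
        out[good] = model.predict(X)
    return out


# toy 2-band, 2x2 stack with one no-data pixel in band 1
stack = np.array([[[1.0, 2.0], [NO_DATA, 4.0]],
                  [[3.0, 6.0], [5.0, 8.0]]])
height = predict_raster(MeanOfBands(), stack)
print(height)
```

Predicting only on good-data pixels keeps no-data values out of the model and avoids spending time on pixels that will be masked anyway.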

In [2]:
# set to allow re-loading modules while in-dev
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
# import packages
import os
import ccb
import random
import pickle
import numpy as np
from sklearn import metrics
from sklearn import ensemble
from sklearn import multiclass
from sklearn import calibration
from sklearn import preprocessing
from sklearn import model_selection

# set the seeds (scikit-learn draws from numpy's global random state by default)
random.seed(1985)
np.random.seed(1985)

# set and create directories
data_dir = 'data'
model_dir = 'models'
os.makedirs(model_dir, exist_ok=True)
    
# set data paths
covariate_path = os.path.join(data_dir, 'covariate-stack.tif')
response_path = os.path.join(data_dir, 'tree-height-lvis-masked.tif')

## Reading input data
We'll read data from two paths: the response data (LVIS lidar) and the covariate data (Landsat/PALSAR/tree cover). Each of these datasets was created and masked to the same good-data locations in the `data-processing` notebook.
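The masking pattern used when reading these rasters can be sketched with plain NumPy arrays. This is a minimal illustration, assuming a hypothetical no-data value of -9999 and small made-up arrays in place of the real rasters; it does not use the `ccb` reader.

```python
import numpy as np

NO_DATA = -9999.0  # hypothetical no-data value, for illustration only

# toy 2x3 "response" raster with two no-data pixels
response_raster = np.array([[12.5, NO_DATA, 8.0],
                            [NO_DATA, 20.1, 3.3]])

# toy covariate stack with shape (bands, rows, cols), co-aligned with the response
covariate_stack = np.arange(12, dtype=float).reshape(2, 2, 3)

# boolean mask of good-data pixels, built once from the response
good = response_raster != NO_DATA

# 1-D response vector and (n_pixels, n_bands) covariate matrix
y = response_raster[good]
X = np.column_stack([band[good] for band in covariate_stack])

print(X.shape, y.shape)
```

Because the same mask indexes every band, row `i` of `X` and element `i` of `y` always refer to the same pixel.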

In [4]:
# read the response data into memory and subset to the good-data values
print("Reading response variable data into memory")
r = ccb.read.raster(response_path)
r.read_all()
gd = r.data != r.no_data
gd_count = gd.sum()
response = r.data[gd]

# clear the ref
r.data = None
r = None

print("{n} good-data values found".format(n = gd_count))

Reading response variable data into memory
2061698 good-data values found


In [5]:
# read the covariate data into memory, band-by-band, and do the same subset
c = ccb.read.raster(covariate_path)
covariates = np.zeros((gd_count, c.nb))
for band_number in range(c.nb):
    print("Reading covariate band: {n}".format(n = band_number + 1))
    c.read_band(band_number + 1)
    covariates[:, band_number] = c.data[gd]
    c.data = None
    
c = None

Reading covariate band: 1
Reading covariate band: 2
Reading covariate band: 3
Reading covariate band: 4
Reading covariate band: 5
Reading covariate band: 6


## Model tuning
We'll hold out a small subset of the data and use it to run a grid search for hyperparameter selection.
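The internals of `ccb.ml.tuner` are not shown in this notebook, so the sketch below uses scikit-learn's `GridSearchCV` directly as a stand-in to illustrate the same idea. The synthetic data, the parameter grid, and the subset fraction are all assumptions chosen to keep the example fast; they are not the values used in the actual tuning run.

```python
import numpy as np
from sklearn import ensemble, model_selection

rng = np.random.RandomState(1985)

# synthetic stand-in data: 300 samples, 6 covariates (not the real rasters)
X = rng.rand(300, 6)
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# hold out a small tuning subset; the fraction is larger than the notebook's
# 0.08 so that the toy subset still has enough samples per fold
X_large, X_small, y_large, y_small = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=1985
)

# illustrative grid; these values are assumptions, not the grid ccb uses
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = model_selection.GridSearchCV(
    ensemble.GradientBoostingRegressor(random_state=1985),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=model_selection.KFold(3),
    n_jobs=1,
)
search.fit(X_small, y_small)

print(search.best_params_)
print(search.best_score_)  # negative MSE: closer to zero is better
```

Tuning on a subset trades some precision in the hyperparameter choice for a large reduction in run time, since every candidate is fit once per fold.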

In [6]:
xlarge, xsmall, ylarge, ysmall = model_selection.train_test_split(
    covariates, 
    response, 
    test_size=0.08
)

In [7]:
# run model tuning for gradient boosting regression
scoring = 'neg_mean_squared_error'

# create the tuning object
gbm = ccb.ml.tuner(
    xsmall, 
    ysmall, 
    scoring = scoring,
    n_splits = 3,
    cv = model_selection.KFold(3),
    n_jobs = 2,
    verbose = 2
)

gbm.GradientBoostRegressor()

Fitting 3 folds for each of 243 candidates, totalling 729 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   58.7s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed: 27.1min
[Parallel(n_jobs=2)]: Done 446 tasks      | elapsed: 85.8min
[Parallel(n_jobs=2)]: Done 729 out of 729 | elapsed: 667.9min finished


In [9]:
# save the best model
gbm_tuned_path = os.path.join(model_dir, 'gbm-tuned.pck')
with open(gbm_tuned_path, 'wb') as f:
    pickle.dump(gbm.best_estimator, f)

# report performance on the tuning data itself
# note: mean_squared_error returns the MSE (squared units), not the RMSE
ypred = gbm.best_estimator.predict(xsmall)
r2 = metrics.r2_score(ysmall, ypred)
mse = metrics.mean_squared_error(ysmall, ypred)
print('r**2 on tuning data : {:0.3f}'.format(r2))
print('mse on tuning data : {:0.3f}'.format(mse))

r**2 on tuning data : 0.586
mse on tuning data : 56.363
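Note that `metrics.mean_squared_error` returns the mean squared error, which is in squared units of the response; the root mean squared error is its square root and is in the same units as the response. A small sketch with made-up values (not results from this model):

```python
import numpy as np
from sklearn import metrics

# made-up observed and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# errors are 1, 0, -2, so the squared errors are 1, 0, 4
mse = metrics.mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print('mse  : {:0.3f}'.format(mse))
print('rmse : {:0.3f}'.format(rmse))
```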


## Model training
Now that we've found the best hyperparameters, we run a 3-fold cross-validation and then train the model on the full dataset.
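A minimal sketch of this two-stage pattern, using scikit-learn directly on synthetic data. The hyperparameter values here are hypothetical placeholders for whatever the grid search selected, and the data is randomly generated rather than read from the rasters.

```python
import numpy as np
from sklearn import ensemble, model_selection

rng = np.random.RandomState(1985)

# synthetic stand-in data: 300 samples, 6 covariates
X = rng.rand(300, 6)
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# hypothetical tuned hyperparameters standing in for the grid-search result
model = ensemble.GradientBoostingRegressor(
    n_estimators=50, max_depth=3, learning_rate=0.1, random_state=1985
)

# 3-fold cross-validation gives an out-of-sample error estimate
cv = model_selection.KFold(n_splits=3, shuffle=True, random_state=1985)
neg_mse = model_selection.cross_val_score(
    model, X, y, scoring="neg_mean_squared_error", cv=cv
)
rmse_per_fold = np.sqrt(-neg_mse)
print("cv rmse per fold:", np.round(rmse_per_fold, 3))

# final fit on every sample, once the cross-validated error looks acceptable
model.fit(X, y)
```

The cross-validation scores estimate how the model generalizes, while the final fit uses all available samples; `cross_val_score` fits clones of the estimator, so `model` itself stays unfitted until the last line.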