# Evaluating initial machine learning performance

I ran the combined Landsat 7 and household-income data through a large-grid ML pipeline, holding out 30% of the original data as a test set and recording the resulting $R^2$ values. The full grid parameters are kept in the `GRID_MAIN` variable in `config.py`.

This notebook explores results from various models.

In [26]:
import os
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from sklearn.metrics import r2_score
import graphviz

import config as cf

# Display settings
pd.options.display.max_columns = 999
pd.options.display.max_colwidth = None  # no truncation (-1 is deprecated since pandas 1.0)

### Load results file

In [2]:
RESULTS_PATH = os.path.join('output', 'results.csv')
df = pd.read_csv(RESULTS_PATH)
df.shape

(1092, 7)

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,regressor,params,features,r2,mse,max_err
0,0,Lasso,"{'alpha': 0.01, 'max_iter': 1000.0, 'selection...",DAY_FEATURES,0.002785,2832462000000.0,31127450.0
1,1,Lasso,"{'alpha': 0.01, 'max_iter': 1000.0, 'selection...",NIGHT_FEATURES,0.000684,2838430000000.0,31303300.0
2,2,Lasso,"{'alpha': 0.01, 'max_iter': 1000.0, 'selection...",ALL_FEATURES,0.002788,2832455000000.0,31114320.0
3,3,Lasso,"{'alpha': 0.01, 'max_iter': 1000.0, 'selection...",DAY_FEATURES,0.002978,2831916000000.0,31129670.0
4,4,Lasso,"{'alpha': 0.01, 'max_iter': 1000.0, 'selection...",NIGHT_FEATURES,0.000684,2838430000000.0,31303300.0


### Explore results

In [4]:
df.sort_values(by='r2', ascending=False).head()

Unnamed: 0.1,Unnamed: 0,regressor,params,features,r2,mse,max_err
190,190,DecisionTreeRegressor,"{'criterion': 'friedman_mse', 'splitter': 'bes...",NIGHT_FEATURES,0.695971,863557100000.0,20061798.0
118,118,DecisionTreeRegressor,"{'criterion': 'mse', 'splitter': 'best', 'max_...",NIGHT_FEATURES,0.695971,863557100000.0,20061798.0
121,121,DecisionTreeRegressor,"{'criterion': 'mse', 'splitter': 'best', 'max_...",NIGHT_FEATURES,0.695971,863557100000.0,20061798.0
193,193,DecisionTreeRegressor,"{'criterion': 'friedman_mse', 'splitter': 'bes...",NIGHT_FEATURES,0.695971,863557100000.0,20061798.0
225,225,DecisionTreeRegressor,"{'criterion': 'friedman_mse', 'splitter': 'ran...",DAY_FEATURES,0.680805,906633300000.0,20061798.0


The best-performing models (by $R^2$) were decision tree regression models trained on night-time features.

Hyperparameters for the best-performing models involved:
- a maximum tree depth of 20 (`max_depth = 20`) 
- searching sqrt(n) or log2(n) features at each split (`max_features = {sqrt, log2}`)
- using mean squared error to determine best non-random splits (`criterion = {mse, friedman_mse}`)
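
One likely reason the `sqrt` and `log2` runs tie on night-time features: that feature set has only four columns, and both settings then resolve to two candidate features per split. The sketch below mirrors the `max(1, int(...))` rule scikit-learn applies to these string options. The helper name `resolve_max_features` is mine, not part of the library, and the count of 28 daytime features is my inference (32 columns minus the four night-time ones).

```python
import math


def resolve_max_features(option, n_features):
    """Number of features considered per split for a string option.

    Hypothetical helper mirroring scikit-learn's max(1, int(...)) rule.
    """
    if option == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if option == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unsupported option: {option!r}")


# 4 night-time features; 28 daytime features is an assumption (32 - 4)
for n in (4, 28):
    print(n, resolve_max_features("sqrt", n), resolve_max_features("log2", n))
# 4 2 2
# 28 5 4
```

With four features the two options are interchangeable, so identical scores across them are expected rather than informative.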

In [5]:
# What was the best performance for each model type?
df.groupby('regressor')['r2'].max().sort_values(ascending=False)

regressor
DecisionTreeRegressor    0.695971
BaggingRegressor         0.635201
RandomForestRegressor    0.554162
LinearRegression         0.003086
Lasso                    0.003085
Ridge                    0.002697
LinearSVR                0.000802
Name: r2, dtype: float64

Only the tree-based models (decision tree, bagging, and random forest) achieved $R^2$ scores appreciably above 0; the linear models and `LinearSVR` all scored below 0.004.

Surprisingly, a single decision tree seems to outperform the random forests here. Given how random forests work, this suggests that the problem lies either with subsampling features at each split (i.e. useful information is spread across all features, so that we lose information by using subsets) or with bootstrapping observations (i.e. our initial dataset is too small for bootstrap samples to be useful).
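
One way to tell these two explanations apart would be to switch off each source of randomness in turn and compare test scores. The sketch below does this on synthetic data from `make_regression`, not on the project data; the function name `forest_ablation` and all settings are my assumptions. In `RandomForestRegressor`, `max_features=None` considers every feature at each split, and `bootstrap=False` trains each tree on the full training set.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def forest_ablation(x_train, y_train, x_test, y_test, n_estimators=50):
    """Test-set R^2 for forests with each source of randomness toggled."""
    settings = {
        'features + rows subsampled': dict(max_features='sqrt', bootstrap=True),
        'rows subsampled only': dict(max_features=None, bootstrap=True),
        'features subsampled only': dict(max_features='sqrt', bootstrap=False),
    }
    scores = {}
    for name, kwargs in settings.items():
        rf = RandomForestRegressor(n_estimators=n_estimators,
                                   random_state=0, **kwargs)
        rf.fit(x_train, y_train)
        scores[name] = r2_score(y_test, rf.predict(x_test))
    return scores


# Synthetic stand-in for the real data (illustrative only)
x, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)
scores = forest_ablation(x_tr, y_tr, x_te, y_te)
for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```

On the real data, a jump in $R^2$ for the "rows subsampled only" forest would point at feature subsampling, while a jump for "features subsampled only" would point at bootstrapping.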

### Get feature importances

To re-create feature importances, we'll have to re-train the best-performing model.

In [6]:
# Load final model data
CLEAN_DATA_PATH = os.path.join('output', 'final_data.pkl')
with open(CLEAN_DATA_PATH, 'rb') as f:
    x_train, x_test, y_train, y_test = pickle.load(f)
    
# Verify
for i in (x_train, x_test, y_train, y_test):
    print(i.shape)

(3412, 32)
(1463, 32)
(3412,)
(1463,)
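
These shapes are consistent with a 30% hold-out of 4,875 rows. As a quick sanity check, the sketch below reproduces the sizes using the rounding rule that scikit-learn's `train_test_split` applies to a fractional `test_size` (test size rounded up, the remainder used for training). Whether the pipeline actually used `train_test_split` is my assumption, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size=0.3):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test set is rounded up
    return n_samples - n_test, n_test          # training set gets the rest


n_train, n_test = split_sizes(3412 + 1463)
print(n_train, n_test)  # 3412 1463
```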


In [7]:
# Verify feature set
features = cf.NIGHT_FEATURES
features

['dmspols_2011', 'viirs_2012', 'dmspols_2011_imputed', 'viirs_2012_imputed']

In [8]:
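# Get parameters of the best-scoring row
# (ast.literal_eval parses the stored dict string without executing code)
import ast

params = ast.literal_eval(df.sort_values(by='r2', ascending=False).iloc[0]['params'])
params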

{'criterion': 'friedman_mse',
 'splitter': 'best',
 'max_depth': 20,
 'max_features': 'sqrt',
 'random_state': 0}

In [9]:
# Retrain best tree
dt = DecisionTreeRegressor(**params)
dt.fit(x_train[features], y_train)

DecisionTreeRegressor(criterion='friedman_mse', max_depth=20,
                      max_features='sqrt', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort=False,
                      random_state=0, splitter='best')

In [10]:
# Verify this is the model with the highest R2
pred_labels = dt.predict(x_test[features])

r2_score(y_true=y_test, y_pred=pred_labels)

-0.4502273446300675
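
A negative score is possible because $R^2 = 1 - SS_{res}/SS_{tot}$ compares the model against a baseline that always predicts the mean of `y_test`; anything below 0 means the model does worse than that baseline. Here is a minimal pure-Python sketch of the same definition on made-up numbers (the function name `r2` and the toy values are illustrative only):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
print(r2(y, y))                     # perfect predictions -> 1.0
print(r2(y, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean -> 0.0
print(r2(y, [4.0, 3.0, 2.0, 1.0]))  # reversed predictions -> -3.0
```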

In [11]:
importance = pd.DataFrame({'feature': features, 'importance': dt.feature_importances_})
importance

Unnamed: 0,feature,importance
0,dmspols_2011,0.104068
1,viirs_2012,0.895772
2,dmspols_2011_imputed,0.00016
3,viirs_2012_imputed,0.0


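Unsurprisingly, `viirs_2012` is by far the most important feature in this model, accounting for roughly 90% of the total importance. Note, however, that the retrained tree scored $R^2 = -0.45$ in the check above, far below the 0.695971 recorded in the results file, so it does not reproduce the grid run and these importances should be read with caution.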

### Visualize tree

The final decision tree (with a maximum depth of 20) is too large to display here, so the export code is left commented out.

In [23]:
# export_graphviz(dt,
#                 out_file=os.path.join('output', 'tree.dot'),
#                 feature_names=cf.NIGHT_FEATURES,
#                 filled=True)

# graphviz.Source.from_file(os.path.join('output', 'tree.dot'))
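
If a partial view is enough, `export_graphviz` accepts a `max_depth` argument that truncates the rendered tree, and `out_file=None` returns the DOT source as a string instead of writing a file. Here is a minimal sketch on a small synthetic tree (the data and feature names are placeholders, not the project's):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Placeholder data: 200 rows, 4 features, target driven by the second column
rng = np.random.RandomState(0)
x_demo = rng.rand(200, 4)
y_demo = 3 * x_demo[:, 1] + rng.normal(scale=0.1, size=200)

demo_tree = DecisionTreeRegressor(max_depth=20, random_state=0).fit(x_demo, y_demo)

# Render only the top three levels of the fitted tree
dot_source = export_graphviz(demo_tree,
                             out_file=None,
                             max_depth=3,
                             feature_names=['f0', 'f1', 'f2', 'f3'],
                             filled=True)
print(dot_source.splitlines()[0])
# In a notebook, graphviz.Source(dot_source) would display the truncated tree
```

For the real model, the same call with `dt` and `cf.NIGHT_FEATURES` should give a tree small enough to read, at the cost of hiding the deeper splits.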

## Models with daytime features

What about models trained on daytime features? Which ones performed best? Which were the most important features?

In [27]:
df.loc[df['features'] == 'DAY_FEATURES'] \
    .sort_values(by='r2', ascending=False) \
    .head()

Unnamed: 0.1,Unnamed: 0,regressor,params,features,r2,mse,max_err
225,225,DecisionTreeRegressor,"{'criterion': 'friedman_mse', 'splitter': 'random', 'max_depth': 20, 'max_features': 'sqrt', 'random_state': 0}",DAY_FEATURES,0.680805,906633300000.0,20061800.0
153,153,DecisionTreeRegressor,"{'criterion': 'mse', 'splitter': 'random', 'max_depth': 20, 'max_features': 'sqrt', 'random_state': 0}",DAY_FEATURES,0.680805,906633300000.0,20061800.0
189,189,DecisionTreeRegressor,"{'criterion': 'friedman_mse', 'splitter': 'best', 'max_depth': 20, 'max_features': 'sqrt', 'random_state': 0}",DAY_FEATURES,0.656971,974330600000.0,20036780.0
117,117,DecisionTreeRegressor,"{'criterion': 'mse', 'splitter': 'best', 'max_depth': 20, 'max_features': 'sqrt', 'random_state': 0}",DAY_FEATURES,0.656971,974330600000.0,20036780.0
1083,1083,BaggingRegressor,"{'n_estimators': 10000, 'max_features': 0.3, 'random_state': 0, 'n_jobs': -1}",DAY_FEATURES,0.635201,1036166000000.0,19897300.0


The best-performing daytime models (by $R^2$) were also decision trees.

Hyperparameters for the best-performing daytime models involved:
- a maximum tree depth of 20 (`max_depth = 20`) 
- searching sqrt(n) features at each split (`max_features = sqrt`)
- with random splits (`splitter = random`)

In [36]:
import ast

# Parse the best daytime row's stored parameter string into a dict
day_params = ast.literal_eval(df.loc[df['features'] == 'DAY_FEATURES'] \
    .sort_values(by='r2', ascending=False) \
    .iloc[0]['params'])
day_params

{'criterion': 'friedman_mse',
 'splitter': 'random',
 'max_depth': 20,
 'max_features': 'sqrt',
 'random_state': 0}

In [40]:
# Retrain best tree
day_dt = DecisionTreeRegressor(**day_params)
day_dt.fit(x_train[cf.DAY_FEATURES], y_train)

DecisionTreeRegressor(criterion='friedman_mse', max_depth=20,
                      max_features='sqrt', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort=False,
                      random_state=0, splitter='random')

In [41]:
# Verify this is the model with the highest R2
day_pred_labels = day_dt.predict(x_test[cf.DAY_FEATURES])

r2_score(y_true=y_test, y_pred=day_pred_labels)

-1.000808666245904

In [44]:
day_importance = pd.DataFrame({'feature': cf.DAY_FEATURES, 'importance': day_dt.feature_importances_})
day_importance.sort_values(by='importance', ascending=False).head()

Unnamed: 0,feature,importance
1,l7_2011_2,0.253304
27,ratio_6_7,0.144409
23,ratio_4_6,0.114767
4,l7_2011_5,0.074585
16,ratio_2_6,0.052273


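In this model, Landsat Band 2 seems to be the most important feature, followed by:
- the ratio of Band 6 to Band 7
- then the ratio of Band 4 to Band 6
- then Band 5

Assuming the feature suffixes follow the standard Landsat 7 ETM+ band numbering, these are Band 2 (green), Band 4 (NIR), Band 5 (SWIR1), Band 6 (thermal), and Band 7 (SWIR2). As with the night-time tree, the retrained model scored a negative $R^2$ (about -1.0) on the test set rather than the 0.680805 recorded in the results file, so these importances should be read with caution.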