# Train, fit, and evaluate classifier <img align="right" src="../Supplementary_data/DE_Africa_Logo_Stacked_RGB_small.jpg">


TODO:
- if/when datasets become large, consider implementing `dask_ml`

## Background

blahblahblah

## Description

1. Split the training data into a training set and a test set
2. Train the model using a Random Forest Classifier
3. Evaluate the classifier using a number of methods
4. Optimise the model hyperparameters
5. Retrain the model using the optimised hyperparameters
6. Save model to disk

## Load packages

In [None]:
import os
import pandas as pd
import numpy as np
from joblib import dump
import subprocess as sp
from pprint import pprint
import matplotlib.pyplot as plt
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split, cross_val_score


## Analysis Parameters

* `training_data`: Name and location of the training data `.txt` file output from running `1_Extract_training_data.ipynb`
* `class_dict`: A dictionary mapping the string name of each class to the integer value that represents it in the training data (e.g. `{'crop': 1, 'noncrop': 0}`)
* `ncpus`: The number of CPUs to use. Set this value to > 1 to parallelize the model fitting, e.g. `ncpus=8`.
* `metrics`: A string naming the scoring metric used to evaluate the predictions (e.g. `'f1'`). See the scoring parameter page [here](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) for a pre-defined list of options
* `cv`: The number of folds to use for k-fold cross-validation when testing the model. A higher number gives a more reliable estimate of model performance but takes longer to compute; 3-5 is a good default.

In [None]:
training_data = "results/training_data/test_training_data.txt"

class_dict = {'crop':1, 'noncrop':0}

ncpus = int(float(sp.getoutput('env | grep CPU')[-3:]))

metrics = 'f1'

cv = 5

print('ncpus = '+str(ncpus))
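
The `ncpus` line above relies on parsing the output of `env | grep CPU`, which fails if no matching variable exists. A more defensive alternative might look like the sketch below. It assumes a hypothetical environment variable named `CPU_LIMIT` (not confirmed by this notebook; substitute whatever your platform sets) and falls back to the machine-wide CPU count.

```python
import os

def detect_ncpus(environ=None, var_name="CPU_LIMIT"):
    """Return a usable worker count (always >= 1).

    `var_name` is a hypothetical environment variable holding a CPU
    allocation such as "4.0"; adjust it to match your platform.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(var_name)
    if raw is not None:
        try:
            return max(1, int(float(raw)))
        except (ValueError, OverflowError):
            pass  # unparsable value: fall through to the machine-wide count
    return os.cpu_count() or 1

ncpus_example = detect_ncpus()
print("ncpus =", ncpus_example)
```

Passing a plain dictionary as `environ` makes the function easy to check without touching the real environment.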

## Import training data

In [None]:
# load the data
model_input = np.loadtxt(training_data)

# load the column_names
with open(training_data, 'r') as file:
    header = file.readline()
    
column_names = header.split()[1:]

# Extract relevant indices from training data
model_col_indices = [column_names.index(var_name) for var_name in column_names[1:]]
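
The cell above assumes a particular file layout: a header line starting with `#`, followed by the class column and then the feature columns. The sketch below illustrates that parsing on a small made-up example; the band names and values are hypothetical and are not taken from the real training data.

```python
# Hypothetical two-row file in the layout np.loadtxt expects:
# a '#'-prefixed header, then one whitespace-separated row per sample.
example_text = "# class red nir ndvi\n1 0.10 0.45 0.64\n0 0.20 0.25 0.11\n"

header_line = example_text.splitlines()[0]
example_columns = header_line.split()[1:]   # drop the leading '#'

# Every column after the first (the class label) is treated as a feature
example_feature_cols = [example_columns.index(name)
                        for name in example_columns[1:]]

print(example_columns)       # ['class', 'red', 'nir', 'ndvi']
print(example_feature_cols)  # [1, 2, 3]
```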

### Split into training and testing data

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(model_input[:, model_col_indices],
                                                                  model_input[:, 0],
                                                                  test_size=0.2, 
                                                                  random_state=0)
print("Train shape:", train_features.shape)
print("Test shape:", test_features.shape)
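
If the classes are imbalanced, a purely random split can leave the test set with a different class ratio from the training set. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below demonstrates it on hypothetical labels (80 noncrop, 20 crop).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 noncrop (0) and 20 crop (1) samples
demo_labels = np.array([0] * 80 + [1] * 20)
demo_features = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_features, demo_labels,
    test_size=0.2,
    stratify=demo_labels,   # keep the 80/20 class ratio in both subsets
    random_state=0)

print("train crop fraction:", y_tr.mean())
print("test crop fraction: ", y_te.mean())
```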

## Train model

The initial model uses the default parameters. We will alter these parameters later, during hyperparameter tuning.

In [None]:
model = RandomForestClassifier(random_state=1, n_jobs=ncpus)

In [None]:
model.fit(train_features, train_labels)

## Evaluating Classifier

The following cells will help you examine the classifier and improve the results. We can do this by:
* Calculating the cross-validation scores, a classification report, and a confusion matrix
* Finding out which features (bands in the input data) are most useful for classifying, and which are not
* Evaluating which model parameters will optimize the model
* Plotting some of the decision trees from the random forest model to visualize how the algorithm is splitting the data


### Accuracy metrics

We can use the 20% sample of test data we partitioned earlier to test the accuracy of the trained model on this new, "unseen" data.


In [None]:
predictions = model.predict(test_features)

In [None]:
score = cross_val_score(model,
                        model_input[:, model_col_indices],
                        model_input[:, 0],
                        cv=cv,
                        scoring=metrics)
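
To make the `cv` argument concrete, the sketch below shows how k-fold cross-validation partitions sample indices: each fold is held out once for testing while the rest are used for training. This is a simplified, unshuffled illustration with a hypothetical helper name. When given an integer `cv` and a classifier, scikit-learn uses stratified folds instead, so its actual splits will differ.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

for fold, (train_idx, test_idx) in enumerate(kfold_indices(10, 5), start=1):
    print(f"fold {fold}: test rows {test_idx}, {len(train_idx)} training rows")
```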

In [None]:
print("=== Confusion Matrix ===")
print(confusion_matrix(test_labels, predictions))
print('\n')
print("=== Classification Report ===")
print(classification_report(test_labels, predictions))
print('\n')
print("=== All " +metrics+" Scores ===")
print(score)
print('\n')
print("=== Mean "+metrics+" Score ===")
print(score.mean())
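
The numbers in the classification report can be derived directly from the confusion matrix. The sketch below computes precision, recall, and F1 for the positive class from made-up counts; `binary_scores` is a hypothetical helper, and the counts are illustrative rather than results from this model.

```python
def binary_scores(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical counts laid out as scikit-learn prints them for labels [0, 1]:
# [[tn, fp],
#  [fn, tp]]
example_matrix = [[45, 10],
                  [5, 40]]
(tn, fp), (fn, tp) = example_matrix
precision, recall, f1 = binary_scores(tn, fp, fn, tp)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```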

### Determine Feature Importance

Extract classifier estimates of the relative importance of each band/variable for training the classifier. Useful for potentially selecting a subset of input bands/variables for model training/classification (i.e. optimising feature space). Results will be presented in descending order with the most important features listed first.  Importance is reported as a relative fraction between 0 and 1.

In [None]:
# Rank the input features by their importance for predicting the class labels,
# most important first
order = np.argsort(model.feature_importances_)[::-1]

plt.figure(figsize=(13,5))
plt.bar(x=np.array(column_names[1:])[order],
        height=model.feature_importances_[order])
plt.gca().set_ylabel('Importance', labelpad=10)
plt.gca().set_xlabel('Variable', labelpad=10);
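
One way to turn the importances into a reduced feature set is to keep the highest-ranked features until they account for a chosen share of the total importance. The sketch below is an illustration only: the helper name, the band names, the importance values, and the 0.9 threshold are all hypothetical.

```python
def rank_features(names, importances, keep_fraction=0.95):
    """Sort features by importance (descending) and return the smallest
    leading subset whose importances sum to at least `keep_fraction`
    of the total."""
    ranked = sorted(zip(names, importances),
                    key=lambda pair: pair[1], reverse=True)
    total = sum(importances)
    kept, running = [], 0.0
    for name, value in ranked:
        kept.append(name)
        running += value
        if running >= keep_fraction * total:
            break
    return ranked, kept

# Hypothetical band names and importances (they sum to 1, as in scikit-learn)
example_names = ["red", "nir", "ndvi", "slope"]
example_importances = [0.10, 0.30, 0.55, 0.05]
ranked, kept = rank_features(example_names, example_importances,
                             keep_fraction=0.9)
print(ranked)
print(kept)
```

Any subset chosen this way should be re-evaluated with cross-validation, since impurity-based importances can be misleading when features are strongly correlated.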

## Optimize hyperparameters

Hyperparameter searches are a standard step in machine learning. Machine learning models have certain “hyperparameters”: settings that are chosen before training rather than learned from the data. Finding good values for these settings is called a “hyperparameter search” or “hyperparameter optimization”.

To optimize the parameters in our model, we will take a two-step approach. First, we conduct a random search over many possible parameter values to narrow our search. Second, using the parameters returned from the random search, we use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to exhaustively search the neighbouring values and determine the combination that gives the highest score for the metric defined in `metrics`.

* `param_grid`: Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored.
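
The two-step approach can be sketched end to end on synthetic data. The example below uses deliberately small grids so that it runs in seconds. The dataset, the grid values, and the rule for building the fine grid around the coarse result are all assumptions made for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for the training data (not the notebook's real samples)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
demo_model = RandomForestClassifier(random_state=1)

# Step 1: sample a few combinations from a wide grid
coarse = RandomizedSearchCV(
    demo_model,
    {"n_estimators": [10, 20, 40], "max_depth": [2, 4, 8, None]},
    n_iter=4, scoring="f1", cv=3, random_state=0)
coarse.fit(X_demo, y_demo)

# Step 2: exhaustively test values close to the best coarse result
best_n = coarse.best_params_["n_estimators"]
fine = GridSearchCV(
    demo_model,
    {"n_estimators": [max(5, best_n - 5), best_n, best_n + 5],
     "max_depth": [coarse.best_params_["max_depth"]]},
    scoring="f1", cv=3)
fine.fit(X_demo, y_demo)

print("coarse best:", coarse.best_params_)
print("fine best:  ", fine.best_params_, round(fine.best_score_, 3))
```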

### Random hyperparameter grid search

In [None]:
random_param_grid = {
     'bootstrap': [True, False],
     'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
     # 'auto' (identical to 'sqrt' for classifiers) was removed in scikit-learn 1.3
     'max_features': ['sqrt', 'log2'],
     'min_samples_leaf': [1, 2, 4],
     'min_samples_split': [2, 5, 10],
     'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
}


In [None]:
#set up the parameter searching
random_grid_search = RandomizedSearchCV(model,
                           random_param_grid,
                           n_iter = 100,
                           scoring=metrics,
                           cv=cv,
                           n_jobs=ncpus)

random_grid_search.fit(model_input[:, model_col_indices], model_input[:, 0])

print("The most accurate combination of tested parameters is: ")
pprint(random_grid_search.best_params_)


### Grid Search with Cross Validation

Random search allowed us to narrow down the range for each hyperparameter. Now that we know where to concentrate our search, we can explicitly specify every combination of settings to try. We do this with GridSearchCV, a method that, instead of sampling randomly from a distribution, evaluates all combinations we define. To use Grid Search, we make another grid based on the best values provided by random search:

In [None]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [False],
    'max_depth': [70, 80, 90, 100],
    'max_features': ["sqrt"],
    'min_samples_leaf': [2, 3, 4, 5],
    'min_samples_split': [2,3,4,5],
    'n_estimators': [700,800,900]
}

# Instantiate the grid search model
grid_search = GridSearchCV(model,
                           param_grid, 
                           scoring=metrics,
                           cv=cv,
                           n_jobs=ncpus)

grid_search.fit(model_input[:, model_col_indices], model_input[:, 0])

print("The most accurate combination of tested parameters is: ")
pprint(grid_search.best_params_)
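
Because grid search fits every combination once per cross-validation fold, it helps to estimate the cost before running it. The sketch below counts the combinations in the grid above, assuming `cv = 5` as set in the analysis parameters.

```python
from itertools import product

# Same values as the fine grid defined above
example_grid = {
    'bootstrap': [False],
    'max_depth': [70, 80, 90, 100],
    'max_features': ['sqrt'],
    'min_samples_leaf': [2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5],
    'n_estimators': [700, 800, 900],
}
n_folds = 5  # the value of `cv` set in the analysis parameters

# Every combination of one value per parameter is one candidate model
combinations = list(product(*example_grid.values()))
n_fits = len(combinations) * n_folds
print(len(combinations), "combinations,", n_fits, "model fits")
# prints: 192 combinations, 960 model fits
```

With the default `refit=True`, `GridSearchCV` also performs one additional fit of the best combination on the full dataset.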

## Retrain model

Using the best parameters from our hyperparameter optimization search, we now retrain our model and re-test its accuracy.

In [None]:
model = RandomForestClassifier(**grid_search.best_params_, random_state=1, n_jobs=ncpus)
model.fit(train_features, train_labels)

### Accuracy metrics

In [None]:
predictions = model.predict(test_features)

In [None]:
new_score = cross_val_score(model,
                        model_input[:, model_col_indices],
                        model_input[:, 0],
                        cv=cv,
                        scoring=metrics)

In [None]:
print("=== Confusion Matrix ===")
print(confusion_matrix(test_labels, predictions))
print('\n')
print("=== Classification Report ===")
print(classification_report(test_labels, predictions))
print('\n')
print("=== All " +metrics+" Scores ===")
print(new_score)
print('\n')
print("=== Mean "+metrics+" Score ===")
print(new_score.mean())

In [None]:
print("Improvement in "+metrics+" score over default params:")
# Compare the tuned model's mean score with the default model's mean score
print(new_score.mean() - score.mean())

## Export & plot tree diagrams

Export .png plots of each decision tree in the random forest ensemble. Useful for inspecting the splits used by the classifier to classify the data.

> This can be quite slow if the classifier/number of trees is quite large

In [None]:
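loc = 'results/tree_graphs/'
os.makedirs(loc, exist_ok=True)  # make sure the output folder exists

# export_graphviz expects class names in ascending order of their integer label
class_names = sorted(class_dict, key=class_dict.get)

for n, tree_in_forest in enumerate(model.estimators_):

    # Create graph and save to dot file
    export_graphviz(tree_in_forest,
                    out_file = loc+'tree.dot',
                    feature_names = column_names[1:],
                    class_names = class_names,
                    filled = True,
                    rounded = True)

    # Convert the dot file to a .png figure (requires Graphviz's `dot` on the PATH)
    os.system('dot -Tpng ' + loc + 'tree.dot -o ' + loc + 'tree' + str(n + 1) + '.png')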

In [None]:
# Plot any tree
tree_number = 'tree1'

img = plt.imread(loc + tree_number + '.png')
plt.figure(figsize = (20, 20))
plt.imshow(img, interpolation = "bilinear")

### Save the model

Running this cell will export the classifier as a binary `.joblib` file. This allows the model to be imported in the subsequent notebook, `4_Predict.ipynb`.


In [None]:
dump(model, 'results/ml_model.joblib')
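
For reference, a saved model can be read back with `joblib.load`. The sketch below round-trips a small synthetic classifier through a temporary file and checks that the predictions are unchanged. The data, model size, and file name are hypothetical, and how `4_Predict.ipynb` actually loads the model is not shown here.

```python
import os
import tempfile
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the notebook's trained classifier
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
demo_model = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical path
    dump(demo_model, model_path)
    restored = load(model_path)

# The restored model should reproduce the original predictions
same = bool((restored.predict(X_demo) == demo_model.predict(X_demo)).all())
print("predictions identical after reload:", same)
```

Only load `.joblib` files from sources you trust, and ideally with the same scikit-learn version that was used to save them.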