## Random Forest Model for Costa Rican Poverty Level Prediction
Written by: Andrew Dunn

### Outline of Notebook
**[1. Project Set-Up](#project-set-up)** \
*[1.1 Load Packages and Data](#1.1-load-packages-and-data)* \
*[1.2 Prep the data](#1.2-prep-the-data)* \
\
**[2. What is a Random Forest Model?](#2.-what-is-a-random-forest-model?)** \
*[2.1 Summary of Approach](#2.1-summary-of-approach)* \
*[2.2 Summary of Findings](#2.2-summary-of-findings)* \
\
**[3. Models](#3.-models)** \
*[3.1 Conduct Random Search Cross Validation](#3.1-conduct-random-search-cross-validation)* \
*[3.2 Conduct Random Search Cross Validation on Resampled Data](#3.2-conduct-random-search-cross-validation-on-resampled-data)* \
*[3.3 Conduct Random Search Cross Validation on SMOTE data](#3.3-conduct-random-search-cross-validation-on-smote-data)* \
\
**[4. Limitations and Next Steps](#4.-limitations-and-next-steps)** 


## 1. Project Set-Up

### 1.1 Load Packages and Data

In [None]:
# change the working directory
%cd ..

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score, recall_score, classification_report
from sklearn.model_selection import RandomizedSearchCV
from imblearn.over_sampling import RandomOverSampler, SVMSMOTE
from collections import Counter
import pprint

import load_data
from evaluate_classification import evaluate_classification

### 1.2 Prep the data

In [None]:
SEED = 0

df, X_valid, y_valid = load_data.load_train_data()
X = df.iloc[:, :-1]
y = df.loc[:, 'Target']

# Create oversampled dataframes
train_X_resampled, train_y_resampled = load_data.gen_oversample_date(df)
X_smote, y_smote = load_data.gen_SMOTE_data(df)

## 2. What is a Random Forest Model?

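A random forest is an ensemble machine learning algorithm that trains many decision trees, each (by default) on a bootstrap sample of the rows and with a random subset of features considered at each split, and then combines their predictions. For classification, scikit-learn's implementation averages the trees' predicted class probabilities and predicts the most probable class. Because the errors of individual trees partly cancel out, a random forest is less likely than a single decision tree to overfit the training data. Random forests are frequently used for classification problems.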
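
To make the averaging step concrete, here is a minimal, library-free sketch. The per-tree probabilities are hypothetical numbers chosen for illustration, not output from this notebook's models, and the helper names are invented for this example.

```python
# Illustrative only: hypothetical class probabilities from three trees
# for a single household. Each inner list holds P(label 1) ... P(label 4).
tree_probabilities = [
    [0.10, 0.20, 0.30, 0.40],
    [0.00, 0.50, 0.25, 0.25],
    [0.05, 0.15, 0.20, 0.60],
]


def average_probabilities(per_tree):
    """Average the class probabilities across trees, one class at a time."""
    n_trees = len(per_tree)
    return [sum(column) / n_trees for column in zip(*per_tree)]


def predict_label(per_tree, labels=(1, 2, 3, 4)):
    """Return the label with the highest averaged probability."""
    averaged = average_probabilities(per_tree)
    best_index = max(range(len(averaged)), key=averaged.__getitem__)
    return labels[best_index]


forest_probabilities = average_probabilities(tree_probabilities)
predicted = predict_label(tree_probabilities)
print(forest_probabilities, predicted)
```

Note that the second tree on its own would pick label 2, but the averaged probabilities favor label 4. This is the sense in which the ensemble smooths over individual trees.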

We apply this to our Target column, which has values 1-4, representing the different levels of poverty.

### 2.1 Summary of Approach


We run random search cross validation on three different sets of data: 
- the raw cleaned data
- data where we randomly oversample from the underrepresented classes
- data where we apply the SMOTE methodology to generate additional rows for the underrepresented classes
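
The two resampling ideas in the list above can be sketched in plain Python. This is an illustration of the general techniques only: the notebook's actual helpers live in `load_data` and are not shown here, so the function names, toy rows, and labels below are all assumptions. Real SMOTE variants also choose the second row from among a sample's nearest minority-class neighbors, which this sketch skips.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            new_rows.append(rng.choice(pool))
            new_labels.append(label)
    return new_rows, new_labels


def smote_like_point(row_a, row_b, rng):
    """Create a synthetic row on the line segment between two rows
    that belong to the same class."""
    gap = rng.random()  # interpolation weight in [0, 1)
    return [a + gap * (b - a) for a, b in zip(row_a, row_b)]


# Toy data (hypothetical): label 4 is the majority class.
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [8.0, 9.0], [9.0, 8.0], [0.0, 0.0]]
labels = [4, 4, 4, 1, 1, 2]

balanced_rows, balanced_labels = random_oversample(rows, labels)
synthetic = smote_like_point(rows[3], rows[4], random.Random(0))
print(Counter(balanced_labels), synthetic)
```

Random oversampling only repeats existing rows, while the SMOTE-style step produces new rows that lie between existing ones.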


The random search cross validation process samples a fixed number of hyperparameter combinations at random (100 in this notebook), scores each one with 3-fold cross validation, and returns the combination with the best mean score.
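
As a rough sketch of that loop, the snippet below samples combinations from a small grid and keeps the best-scoring one. The grid values and `toy_score` are hypothetical stand-ins: in the notebook, the score would be the mean cross-validated score computed by `RandomizedSearchCV`. Unlike `RandomizedSearchCV` with list-valued grids, this sketch may draw the same combination more than once.

```python
import random


def random_search(param_grid, score_fn, n_iter=10, seed=0):
    """Sample n_iter parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently of the others.
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a mean cross-validated score."""
    return (-abs(params["n_estimators"] - 600) / 1000
            - abs(params["max_depth"] - 30) / 100)


toy_grid = {
    "n_estimators": [200, 400, 600, 800, 1000],
    "max_depth": [10, 20, 30, 40, 50],
}

best_params, best_score = random_search(toy_grid, toy_score, n_iter=20)
print(best_params, best_score)
```

Because only a sample of the grid is scored, the result is the best combination among those tried, not necessarily the best in the whole grid.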

### 2.2 Summary of Findings

Thus far, there is no notable difference between the models run on the different datasets. The best performing models from the random search cross validation process yield an accuracy of about 0.65 and an F1 score of about 0.59 when run on the validation data.
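
One way to see how accuracy and F1 can diverge on imbalanced labels is the small sketch below. It assumes a macro-averaged F1 (the unweighted mean of per-class F1 scores). The `evaluate_classification` helper is not shown in this notebook, so it may use a different averaging scheme. The labels here are made up for illustration.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for each label."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    metrics = per_class_metrics(y_true, y_pred)
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


# Hypothetical labels: the classifier leans heavily toward label 4.
y_true = [4, 4, 4, 4, 4, 4, 3, 3, 2, 1]
y_pred = [4, 4, 4, 4, 4, 4, 4, 3, 4, 4]

class_metrics = per_class_metrics(y_true, y_pred)
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

In this toy case every label-4 row is recovered while labels 1 and 2 are never predicted, so accuracy stays well above the macro-averaged F1.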

## 3. Models

### 3.0 Basic Models

In [None]:
# test of default model
clf = RandomForestClassifier(random_state= SEED)
clf.fit(X, y)

y_pred = clf.predict(X_valid)
evaluate_classification(y_pred, y_valid, cm=True)

In [None]:
# Make graph of feature importance
# https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
feature_names = X.columns.values.tolist()
importances = clf.feature_importances_
# spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
forest_importances = pd.Series(importances, index=feature_names)
forest_std = pd.Series(std, index=feature_names)

# top 20 features, sorted from most to least important
top_20 = forest_importances.nlargest(20)

fig, ax = plt.subplots()
# look up std by feature name so each error bar lines up with its own bar
top_20.plot.bar(yerr=forest_std[top_20.index], ax=ax)

ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()

In [None]:
# test number of trees in forest (100 to 2000 in steps of 100)
num_trees = list(range(100, 2100, 100))
accuracy_lst = []
f1_lst = []
recall_lst_1 = []
recall_lst_2 = []
recall_lst_3 = []
recall_lst_4 = []

for i in num_trees:
    clf = RandomForestClassifier(n_estimators = i,
                                random_state = SEED)
    clf.fit(X, y)
    y_pred = clf.predict(X_valid)

    accuracy, f1, recall = evaluate_classification(y_pred, y_valid, return_vals = True)
    
    accuracy_lst.append(accuracy)
    f1_lst.append(f1)
    recall_lst_1.append(recall[0])
    recall_lst_2.append(recall[1])
    recall_lst_3.append(recall[2])
    recall_lst_4.append(recall[3])


In [None]:
# plot the stats
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20,10))

ax.plot(num_trees, accuracy_lst, label = 'accuracy', color = 'blue')
ax.plot(num_trees, f1_lst, label = 'f1', color = 'red')
ax.plot(num_trees, recall_lst_1, label = 'recall_1', color = 'darkgreen')
ax.plot(num_trees, recall_lst_2, label = 'recall_2', color = 'green')
ax.plot(num_trees, recall_lst_3, label = 'recall_3', color = 'forestgreen')
ax.plot(num_trees, recall_lst_4, label = 'recall_4', color = 'seagreen')

ax.legend(loc="lower right", fontsize=16)
ax.set_xlabel("number of trees", fontsize=16)
ax.set_ylabel("performance stats", fontsize=16)
plt.show()


### 3.1 Conduct Random Search Cross Validation 

In [None]:
# Look via randomized search on the raw cleaned data
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' was an alias for 'sqrt' in classifiers and was removed in scikit-learn 1.3)
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
               
print(random_grid)

In [None]:
clf = RandomForestClassifier(random_state= SEED)

rf_random = RandomizedSearchCV(estimator = clf, 
                                param_distributions = random_grid, 
                                n_iter = 100, 
                                cv = 3, 
                                verbose = 2,
                                 random_state= SEED, 
                                 n_jobs = -1)
# Fit the random search model
rf_random.fit(X, y)

rf_random.best_params_

In [None]:
# Evaluate the best random model
# (best_estimator_ is already refit on all of X, y because refit=True by default)
best_random = rf_random.best_estimator_
y_pred = best_random.predict(X_valid)
evaluate_classification(y_pred, y_valid, cm = True)

### 3.2 Conduct Random Search Cross Validation on Resampled Data

In [None]:
# Run on the resampled data
clf = RandomForestClassifier(random_state= SEED)

rf_rs_random = RandomizedSearchCV(estimator = clf, 
                                param_distributions = random_grid, 
                                n_iter = 100, 
                                cv = 3, 
                                verbose = 2,
                                 random_state= SEED, 
                                 n_jobs = -1)
# Fit the random search model
rf_rs_random.fit(train_X_resampled, train_y_resampled)

rf_rs_random.best_params_

In [None]:
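# Evaluate the hyperparameters selected on the randomly oversampled data.
# Note: this refits on the raw X, y, so only the selected hyperparameters
# (not the oversampled training rows) carry over to the validation score.
best_rs_random = rf_rs_random.best_estimator_
best_rs_random.fit(X, y)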
y_pred = best_rs_random.predict(X_valid)
evaluate_classification(y_pred, y_valid, cm = True)

### 3.3 Conduct Random Search Cross Validation on SMOTE data

In [None]:
# Run on the SMOTE data
clf = RandomForestClassifier(random_state= SEED)

rf_smote_random = RandomizedSearchCV(estimator = clf, 
                                param_distributions = random_grid, 
                                n_iter = 100, 
                                cv = 3, 
                                verbose = 2,
                                random_state= SEED, 
                                n_jobs = -1)
# Fit the random search model
rf_smote_random.fit(X_smote, y_smote)

rf_smote_random.best_params_

In [None]:
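# Evaluate the hyperparameters selected on the SMOTE data.
# Note: this refits on the raw X, y, so only the selected hyperparameters
# (not the synthetic SMOTE rows) carry over to the validation score.
best_smote_random = rf_smote_random.best_estimator_
best_smote_random.fit(X, y)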
y_pred = best_smote_random.predict(X_valid)
evaluate_classification(y_pred, y_valid, cm = True)

## 4. Limitations and Next Steps

This analysis is only an initial examination of different random forest models. Further attention should be paid to data preprocessing and feature generation to ensure that the data inputs are as high quality as possible. The models clearly do well at predicting label 4, but appear to do a worse job of categorizing the other labels.

Next steps include further testing of different versions of these random forest models and different combinations of hyperparameters, as well as examining whether there are ways to better categorize the labels other than 4.