# Final Random Forest Model

This code takes our fully cleaned, feature-engineered dataset, trains a model on 80% of the data, and tests it on the remaining 20% to check accuracy. This tells us how much faith we can place in the model. In a real-life scenario, we would train it on all available labeled data and use it to predict new, unseen data points.

__Import Libraries:__ the first order of business is to import the necessary libraries.

In [1]:
# silence all warnings (added to hide numpy deprecation warnings)
def warn(*args, **kwargs): pass
import warnings
warnings.warn = warn

# import python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# import data science libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

__Import and Inspect__ the dataset, which has been cleaned, prepared, and feature engineered in the file `feature_engineering.ipynb`.

In [2]:
master_df = pd.read_csv('data/master_df.csv')
master_df.head()

Unnamed: 0,Lat,Long,MONTH,HOUR,TAVG,PRCP,SNOW,closest_property_value,neighborhood_avg,lamp_min_dist,...,hospital_min_dist,school_min_dist,school_density,DAY_OF_WEEK_Friday,DAY_OF_WEEK_Monday,DAY_OF_WEEK_Saturday,DAY_OF_WEEK_Sunday,DAY_OF_WEEK_Thursday,DAY_OF_WEEK_Tuesday,DAY_OF_WEEK_Wednesday
0,42.364331,-71.063193,9,4,70.0,0.0,0.0,16.843299,14.987931,7e-05,...,0.001813,0.008471,2.0,0,0,0,1,0,0,0
1,42.31463,-71.092615,9,4,70.0,0.0,0.0,12.754483,13.036407,8.3e-05,...,0.00394,0.001727,8.0,0,0,0,1,0,0,0
2,42.279675,-71.083813,9,3,70.0,0.0,0.0,12.997027,13.034991,7.8e-05,...,0.008593,0.003386,5.0,0,0,0,1,0,0,0
3,42.379124,-71.028082,9,2,70.0,0.0,0.0,15.638769,15.682548,0.000109,...,0.038462,0.002467,4.0,0,0,0,1,0,0,0
4,42.379124,-71.028082,9,2,70.0,0.0,0.0,12.77479,13.015478,0.000157,...,0.038462,0.002467,4.0,0,0,0,1,0,0,0


__Split the Dataset__ into predictor variables (X) and the response variable (y). Then, further split each into a training set and a test set, stratifying on the category (y) so that both sets keep the same class proportions as the full dataset. This gives us a fairer measurement of how well the model performs.

In [38]:
# split into predictors and response variables
X = master_df.drop(['category'], axis=1) 
y = master_df['category']

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
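
The effect of `stratify=y` can be checked on a small example. The sketch below uses made-up toy labels rather than the project data, and `random_state=0` is an added assumption for reproducibility (the cell above does not set one). It compares the class shares in the full data, the training set, and the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data with imbalanced classes (not the project dataset)
rng = np.random.default_rng(0)
toy_y = pd.Series(rng.choice(["a", "b", "c"], size=1000, p=[0.6, 0.3, 0.1]))
toy_X = pd.DataFrame({"feature": rng.normal(size=1000)})

# random_state is an assumption added here so the split is reproducible
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

# with stratification, class shares in each split should closely match the full data
shares = pd.DataFrame({
    "full": toy_y.value_counts(normalize=True),
    "train": toy_y_train.value_counts(normalize=True),
    "test": toy_y_test.value_counts(normalize=True),
})
print(shares.round(3))
```

Without `stratify`, a rare class could end up over- or under-represented in the test set by chance, which would make the measured accuracy less representative.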

__A Note About the Baseline Model:__ to have something to compare to, we looked at what sort of accuracy we would want our model to beat. One way to think of it is the accuracy we would get if responders guessed categories at random (this would be the accuracy given no information about the past). Another way to think about it is the accuracy that a model would produce if it just guessed the most common category 100% of the time (a very naive model). We included both, and decided to compare our model to the latter, because it is more realistic that we do have information about the past when making these decisions.

In [39]:
# calculate baseline accuracy if we guessed categories at random
naive_baseline = 1 / len(master_df.category.value_counts())

# calculate baseline accuracy if the model assumed the most common category
# (value_counts sorts in descending order, so .iloc[0] is the largest share)
model_baseline = master_df.category.value_counts(normalize=True).iloc[0]

print('naive baseline:', 100*naive_baseline, '%')
print('model baseline:', round(100*model_baseline,1), '%')

naive baseline: 20.0 %
model baseline: 35.9 %
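
The same two baselines could also be obtained from scikit-learn's `DummyClassifier`, which has the advantage of going through the same `fit`/`predict` interface as the real model. This is a minimal sketch on made-up labels (the class names and proportions are hypothetical), not the computation used in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# hypothetical toy labels with five imbalanced classes (not the project dataset)
rng = np.random.default_rng(0)
toy_y = rng.choice(["a", "b", "c", "d", "e"], size=2000, p=[0.36, 0.2, 0.2, 0.14, 0.1])
toy_X = np.zeros((len(toy_y), 1))  # DummyClassifier ignores the features

# baseline 1: always predict the most common class
most_frequent = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)

# baseline 2: guess uniformly at random among the classes
uniform = DummyClassifier(strategy="uniform", random_state=0).fit(toy_X, toy_y)

acc_most_frequent = accuracy_score(toy_y, most_frequent.predict(toy_X))
acc_uniform = accuracy_score(toy_y, uniform.predict(toy_X))

print("most-frequent baseline: {:.1%}".format(acc_most_frequent))
print("uniform-guess baseline: {:.1%}".format(acc_uniform))
```

The most-frequent baseline equals the share of the largest class, and the uniform baseline should land near one divided by the number of classes.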


__Set Up the Model__ using the knowledge we gained from refining different models, all of which can be found in the folder `old_models`. From this refining, we decided that the best model would be a Random Forest with a maximum tree depth of 10 (so as to get the best test accuracy without overfitting too much to the training set). We tested each of the hyperparameters that `RandomForestClassifier` takes, but none helped us improve our accuracy.

In [40]:
def test_rf(best_depth):
    """Fit a random forest with the given max depth and print train/test accuracy.

    Relies on the X_train, X_test, y_train and y_test variables defined above.
    """

    # set up the model and fit on training data
    model = RandomForestClassifier(n_estimators=100, max_depth=best_depth)
    model.fit(X_train, y_train)

    # predict train and test data to check accuracy
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # evaluate performance using accuracy score
    acc_random_forest_training = accuracy_score(y_train, y_pred_train)*100
    acc_random_forest_testing = accuracy_score(y_test, y_pred_test)*100

    print("Random Forest Accuracy, Training Set : {:0.2f}%".format(acc_random_forest_training))
    print("Random Forest Accuracy, Testing Set :  {:0.2f}%".format(acc_random_forest_testing))

    return model
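
The trade-off behind the choice of tree depth can be illustrated with a small sweep. The sketch below is not the search that was run in `old_models`. It uses synthetic data from `make_classification`, and the depth values, `n_estimators=50`, and `random_state=0` are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic five-class problem (hypothetical stand-in for the project data)
toy_X, toy_y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, n_classes=5, random_state=0
)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

# fit one forest per candidate depth and record (train accuracy, test accuracy)
results = {}
for depth in [2, 5, 10, None]:
    rf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    rf.fit(toy_X_train, toy_y_train)
    results[depth] = (
        accuracy_score(toy_y_train, rf.predict(toy_X_train)),
        accuracy_score(toy_y_test, rf.predict(toy_X_test)),
    )

for depth, (train_acc, test_acc) in results.items():
    print("max_depth={}: train {:.2%}, test {:.2%}".format(depth, train_acc, test_acc))
```

A widening gap between training and test accuracy as the depth grows is the usual sign of overfitting. The depth that works best depends on the data, so the numbers from this toy problem say nothing about the value of 10 chosen here.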

__Train and Evaluate__ the model using the train and test sets. We chose accuracy as our evaluation metric because the alternatives we considered fit this problem poorly. For example, we do care a lot about false negatives, since not all of our categories have the same number of entries. However, because this is not a binary-outcome problem, it didn't make sense to use ROC scores. After careful inspection of the class predictions and the `predict_proba` results, we decided that accuracy was a good measure of how well our model was performing.

In [41]:
model = test_rf(10)

Random Forest Accuracy, Training Set : 50.74%
Random Forest Accuracy, Testing Set :  46.08%
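
One way to inspect per-class errors in a multiclass setting, alongside overall accuracy, is a confusion matrix together with per-class recall. The sketch below uses a handful of made-up labels and predictions (the category names are hypothetical placeholders), so it only shows the mechanics and does not report results for this model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# hypothetical true labels and predictions (not output from the model above)
toy_true = np.array(["drugs", "drugs", "theft", "theft", "theft", "assault"])
toy_pred = np.array(["drugs", "theft", "theft", "theft", "drugs", "theft"])
labels = ["assault", "drugs", "theft"]

# rows are true classes, columns are predicted classes, both in `labels` order
cm = confusion_matrix(toy_true, toy_pred, labels=labels)

# recall per class: share of each true class that was predicted correctly
per_class_recall = recall_score(toy_true, toy_pred, labels=labels, average=None)

print(cm)
for label, recall in zip(labels, per_class_recall):
    print("{}: recall {:.2f}".format(label, recall))
```

A class with low recall is one the model often misses, which is the multiclass analogue of a high false-negative rate.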


__Analysis:__ as we can see above, our trained model (about 46% accuracy on the test set, 51% on the training set) performed better than the baseline model (35.9%). Although the model itself seems simple (we just trained a simple random forest on our training data), it is in fact the result of many careful decisions. The real model refining happened in the feature engineering and feature selection process.

Due to the nature of the data (crimes are hard to predict, it can be hard to differentiate between a crime that involves drugs and one that does not, and the predictors that we chose cannot possibly capture all the variability in the data), we are confident that our model's accuracy could not significantly improve without drastic changes in the data inputs.

That being said, we wanted to create something that could be useful in a real-life situation, i.e. how can we use the model we created to help emergency responders? The accuracy is pretty low: given an emergency, the model would be right about what sort of response was needed less than half the time (about 46% on the test set). So we decided to use our model to narrow down the types of responses that could be needed from 5 to 2. We could use our model to tell responders "this emergency likely involves drugs and weapons possession", or "this emergency likely involves domestic issues and drugs". This could help responders send different specialists to each situation, making their response more effective.

To accomplish this, we used `predict_proba` and chose the two classes with the highest predicted probability for each crime. We used the model trained on the training set and evaluated it on the test set, counting a prediction as correct when the true category was one of the two. We were hoping for an accuracy above 75% for a reliable and applicable tool.

In [42]:
# class labels in the same order as the columns returned by predict_proba
categories = model.classes_

# use model to predict probabilities of each class
predict_probas = model.predict_proba(X_test)

# for each test row, take the two classes with the highest predicted probability
top_two = categories[np.argsort(predict_probas, axis=1)[:, -2:]]

# count how many predicted pairs contained the correct category
correct = (top_two == y_test.to_numpy()[:, None]).any(axis=1).sum()

# calculate percentage guessed right
test_accuracy = correct / len(predict_probas)
print('percent of guesses that contained correct category:', round(100*test_accuracy, 1), '%')

percent of guesses that contained correct category: 74.2 %
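
The top-two check can also be computed with scikit-learn's `top_k_accuracy_score`. This assumes scikit-learn 0.24 or later, where that function is available. The sketch below uses a tiny made-up probability matrix with hypothetical class names, so the arithmetic can be verified by hand.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

# hypothetical class labels, in the column order of the probability matrix
toy_classes = np.array(["a", "b", "c"])
toy_true = np.array(["a", "b", "c", "c"])
toy_probas = np.array([
    [0.5, 0.3, 0.2],  # top two are a and b: contains the true class a
    [0.1, 0.3, 0.6],  # top two are c and b: contains the true class b
    [0.6, 0.3, 0.1],  # top two are a and b: misses the true class c
    [0.2, 0.3, 0.5],  # top two are c and b: contains the true class c
])

# share of rows whose true class is among the two most probable classes
top2 = top_k_accuracy_score(toy_true, toy_probas, k=2, labels=toy_classes)
print("top-2 accuracy: {:.2f}".format(top2))
```

With a fitted model, the equivalent call would pass `model.predict_proba(X_test)` as the scores and `model.classes_` as `labels`, so the columns and labels are guaranteed to line up.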


In [35]:
X_train.columns[np.argmax(model.feature_importances_)]

'police_min_dist'
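
The cell above reports only the single most important feature. To see how the remaining features compare, the importances can be ranked in a pandas Series. This is a minimal sketch on synthetic data with hypothetical feature names, not a result from the model in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data with hypothetical feature names (not the project dataset)
toy_X, toy_y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
toy_X = pd.DataFrame(toy_X, columns=["feature_{}".format(i) for i in range(6)])

toy_model = RandomForestClassifier(n_estimators=50, random_state=0)
toy_model.fit(toy_X, toy_y)

# pair each importance with its column name and sort from most to least important
importances = pd.Series(
    toy_model.feature_importances_, index=toy_X.columns
).sort_values(ascending=False)

print(importances.head(3))
```

The importances of a random forest sum to 1, so each value can be read as that feature's share of the total impurity reduction across the trees.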