# Random Forest

We'll continue by evaluating a random forest model, using AUC to find the best score. We'll also review the confusion matrix for the best model on the held-out data, along with the model's precision and recall.

---

## Imports and Data Import

In [1]:
import jupyter_black

jupyter_black.load()

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

from sklearn import tree

from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split

### Import the data

In [2]:
df = pd.read_csv("./final_data.csv")

In [3]:
df.shape

(72063, 18)

In [4]:
# Some records are not unique
df.drop_duplicates(inplace=True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71866 entries, 0 to 72062
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   block                 71866 non-null  object 
 1   iucr                  71866 non-null  object 
 2   primary_type          71866 non-null  object 
 3   description           71866 non-null  object 
 4   location_description  71866 non-null  object 
 5   arrest                71866 non-null  bool   
 6   domestic              71866 non-null  bool   
 7   beat                  71866 non-null  int64  
 8   district              71866 non-null  int64  
 9   ward                  71866 non-null  int64  
 10  community_area        71866 non-null  int64  
 11  fbi_code              71866 non-null  object 
 12  latitude              71866 non-null  float64
 13  longitude             71866 non-null  float64
 14  hour                  71866 non-null  int64  
 15  day                

### Set Target

The target feature is `arrest`, which is a boolean feature. For the model we need to change this to `0` and `1` values.

We will also change the `domestic` feature in the same manner.

In [6]:
df.arrest = df.arrest.astype(int)
df.domestic = df.domestic.astype(int)

### Identify Categorical and Numeric Columns

In [7]:
categorical_columns = [
    "iucr",
    "primary_type",
    "description",
    "location_description",
    "fbi_code",
    "zip",
    "street",
]

numerical_columns = [
    "domestic",
    "beat",
    "district",
    "ward",
    "community_area",
    "latitude",
    "longitude",
    "hour",
    "day",
]

features = categorical_columns + numerical_columns

---

# Random Forest

### Split the Data

The data will be split as follows:

#### Training

80% of the data will be used for training. During training, K-fold cross-validation will be used.

Once the best model parameters are identified, the full training dataset will be used to train the final model.

#### Test

20% of the data will be held back for final testing of the model.

In [8]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=11)

dict_train = df_train[features].to_dict(orient="records")
dict_test = df_test[features].to_dict(orient="records")

dv = DictVectorizer(sparse=False)

X_train = dv.fit_transform(dict_train)
X_test = dv.transform(dict_test)

y_train = df_train.arrest.values
y_test = df_test.arrest.values

In [9]:
dict_train[0]

{'iucr': '1477',
 'primary_type': 'WEAPONS VIOLATION',
 'description': 'RECKLESS FIREARM DISCHARGE',
 'location_description': 'PARK PROPERTY',
 'fbi_code': '15',
 'zip': '066XX',
 'street': 'N, WESTERN, AVE',
 'domestic': 0,
 'beat': 2412,
 'district': 24,
 'ward': 50,
 'community_area': 2,
 'latitude': 42.001822361,
 'longitude': -87.689987495,
 'hour': 19,
 'day': 3}

In [10]:
# Setup parameters with a wide spread

cv_wide = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=999)

random_dict = {
    "n_estimators": range(25, 201, 50),
    "min_samples_split": range(25, 201, 25),
    "min_samples_leaf": range(3, 21, 3),
    "max_depth": range(10, 21, 2),
    "bootstrap": [True, False],
}

rf = RandomForestClassifier(random_state=999)

---

### Accuracy

#### Caution

This took 30 minutes to run on an M1 MacBook Pro!

In [11]:
# Do a wide randomised search using accuracy
grid_random = RandomizedSearchCV(
    estimator=rf,
    param_distributions=random_dict,
    cv=cv_wide,
    verbose=1,
    scoring="accuracy",
    n_jobs=-1,
    n_iter=100,
)

grid_random.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
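
The fit count in the log line above follows from the search settings, and the same arithmetic shows how much of the parameter space the random search actually samples. The sketch below only re-uses the values from `random_dict`, `cv_wide` and `n_iter`; the variable names `n_combinations`, `total_fits` and `coverage` are introduced here for illustration.

```python
from math import prod

# Same search space as random_dict above
random_dict = {
    "n_estimators": range(25, 201, 50),
    "min_samples_split": range(25, 201, 25),
    "min_samples_leaf": range(3, 21, 3),
    "max_depth": range(10, 21, 2),
    "bootstrap": [True, False],
}

n_iter = 100      # candidates drawn by RandomizedSearchCV
n_splits = 5      # folds in cv_wide
n_repeats = 1     # repeats in cv_wide

# Number of distinct parameter combinations an exhaustive grid would try
n_combinations = prod(len(values) for values in random_dict.values())

total_fits = n_iter * n_splits * n_repeats
coverage = n_iter / n_combinations

print(n_combinations)  # 2304
print(total_fits)      # 500
print(f"{coverage:.1%} of the full grid is sampled")
```

An exhaustive `GridSearchCV` over the same space would need 2304 × 5 fits, which is why a randomised search is used for this wide first pass.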




In [12]:
# What are the best parameters
grid_random.best_params_

{'n_estimators': 125,
 'min_samples_split': 175,
 'min_samples_leaf': 3,
 'max_depth': 20,
 'bootstrap': False}

In [13]:
# What is the best score these parameters achieved
grid_random.best_score_

0.8897411696852311
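
The search above is scored on accuracy. A search scored on AUC instead could look roughly like the sketch below, which swaps in scikit-learn's `"roc_auc"` scoring string. It runs on a small synthetic dataset with a deliberately tiny parameter space so that it finishes quickly; the data, the parameter values and the names `search_auc` and `test_auc` are assumptions for illustration, not the notebook's actual AUC run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the crime data (not the real dataset)
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

search_auc = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [10, 25], "max_depth": [3, 5]},
    n_iter=3,
    cv=3,
    scoring="roc_auc",  # rank candidates by cross-validated AUC
    random_state=0,
)
search_auc.fit(X_tr, y_tr)

# AUC is computed from scores, so pass the positive-class probability
# rather than the hard labels returned by predict()
test_auc = roc_auc_score(y_te, search_auc.predict_proba(X_te)[:, 1])

print(search_auc.best_params_)
print(search_auc.best_score_, test_auc)
```

On imbalanced targets such as `arrest`, AUC is often preferred to accuracy because it is not inflated by always predicting the majority class.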

---

## Evaluate the Model

In [14]:
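# Note: n_estimators and min_samples_split below differ from the
# best_params_ returned by the accuracy search above (125 and 175)
rf_auc = RandomForestClassifier(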
    n_estimators=75,
    min_samples_split=150,
    min_samples_leaf=3,
    max_depth=20,
    bootstrap=False,
)

rf_auc.fit(X_train, y_train)

y_pred = rf_auc.predict(X_test)

In [15]:
print("Confusion Matrix Random Forest : \n", confusion_matrix(y_test, y_pred), "\n")
print("The precision for Random Forest is ", precision_score(y_test, y_pred))
print("The recall for Random Forest is ", recall_score(y_test, y_pred), "\n")

Confusion Matrix Random Forest : 
 [[11174   111]
 [ 1521  1568]] 

The precision for Random Forest is  0.9338892197736748
The recall for Random Forest is  0.5076076400129492 
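
As a check on how these figures relate, the precision and recall can be recomputed by hand from the confusion matrix. The sketch below copies the four counts from the output above and uses scikit-learn's layout (rows are actual classes, columns are predicted classes). The accuracy, F1 and majority-class baseline are extra derived figures that the notebook itself does not print.

```python
# Counts copied from the confusion matrix above
# (rows = actual class, columns = predicted class)
tn, fp = 11174, 111   # actual "no arrest"
fn, tp = 1521, 1568   # actual "arrest"

total = tn + fp + fn + tp

precision = tp / (tp + fp)   # of predicted arrests, how many were real
recall = tp / (tp + fn)      # of real arrests, how many were found
accuracy = (tp + tn) / total
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a model that always predicts "no arrest"
majority_baseline = (tn + fp) / total

print(f"precision={precision:.4f} recall={recall:.4f}")
print(f"accuracy={accuracy:.4f} f1={f1:.4f} baseline={majority_baseline:.4f}")
```

The model is right about 93% of the time when it predicts an arrest but finds only about half of the actual arrests. Test accuracy is about 0.89 against about 0.79 for always predicting no arrest.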

