# Modeling

This notebook contains all the steps and decisions in the modeling phase of the pipeline.

## The Required Imports

Here we'll import all the modules required to run the code cells in this notebook.

In [1]:
from time import time

import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.feature_selection import RFE

from wrangle import wrangle_crime_data
from prepare import split_data
from evaluate import *
from model import *

# We'll use this random seed for all the machine learning models.
random_seed = 42

## Acquire, Prepare, and Split the Data

Here we'll use the wrangle module to acquire and prepare the data. We'll then split the data into train, validate, and test datasets. The train dataset will be used to train the machine learning models. Validate and test will be used to determine how our models perform on unseen data.

In [2]:
df = wrangle_crime_data()
df = prep_data_for_modeling(df)

train, validate, test = split_data(df)
train.shape, validate.shape, test.shape

Using cached csv


((195764, 346), (83900, 346), (69917, 346))
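The `split_data` function comes from the project's `prepare` module and its code is not shown here. The shapes above are consistent with holding out 20% of the rows for test and then 30% of the remainder for validate, so a minimal sketch under that assumption might look like the following. The function name `split_data_sketch`, the stratification, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, target, seed=42):
    """Hypothetical stand-in for split_data: 20% test, then 30% of the rest for validate."""
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed, stratify=train_validate[target]
    )
    return train, validate, test


# Toy data with an imbalanced boolean target (assumption: roughly 21% positive).
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.normal(size=1000), 'cleared': rng.random(1000) < 0.21})

train_s, validate_s, test_s = split_data_sketch(toy, 'cleared')
print(train_s.shape, validate_s.shape, test_s.shape)
```

Stratifying on the target keeps the share of cleared cases similar across the three datasets, which matters when the classes are imbalanced. Whether the real `split_data` stratifies is not known from this notebook.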

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 195764 entries, 128272 to 382883
Columns: 346 entries, council_district to WEAPON VIOL - OTHER
dtypes: bool(1), float64(4), uint8(341)
memory usage: 71.3 MB


## Establish a Baseline

We first need a baseline model to serve as a performance reference for our models. The baseline takes the simplest possible approach to predicting clearance status: it always predicts the most frequent value. With this reference point, we can check whether our models at least outperform the simplest model we could build.

In [4]:
# Here we will establish a baseline model which will always predict the most frequent value in the target variable.

baseline = establish_classification_baseline(train.cleared)
baseline.value_counts()

False    195764
dtype: int64
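The helper `establish_classification_baseline` is imported from the project's own modules and is not shown. Judging only from its output above (a series holding the single value `False` for every row), a minimal sketch of the idea could be the following. The name `baseline_sketch` and the toy data are hypothetical.

```python
import pandas as pd


def baseline_sketch(target):
    """Return a series that repeats the most frequent value of the target."""
    most_frequent = target.mode().iloc[0]
    return pd.Series(most_frequent, index=target.index, name='baseline')


# Toy target: three uncleared cases and two cleared ones.
toy_target = pd.Series([False, False, True, False, True])
toy_baseline = baseline_sketch(toy_target)
print(toy_baseline.value_counts())
```

sklearn's `DummyClassifier(strategy='most_frequent')` offers the same behavior behind the usual `fit` and `predict` interface, which may be convenient if the baseline should be handled like the other models.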

In [5]:
# Calculate the roc auc score.
roc_auc_score(train.cleared, baseline)

0.5

In [6]:
# Calculate the accuracy score.
accuracy_score(train.cleared, baseline)

0.7883063280276251

We'll use two metrics to evaluate our models: accuracy and ROC AUC. Accuracy tells us how often the model predicts the clearance status of a case correctly. However, because the target variable is imbalanced (about 79% of cases in train are not cleared), a model can score well on accuracy just by predicting the majority class, as the baseline does. ROC AUC gives the baseline only 0.5, so it shows whether a model actually distinguishes cleared cases from uncleared ones.
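One detail worth illustrating: the cell above passes predicted labels, not probabilities, to `roc_auc_score`. With hard 0/1 predictions the ROC curve has a single interior point, and the resulting score equals balanced accuracy (the mean of the true positive rate and the true negative rate). The small example below uses made-up labels to show this. It is a sketch of the metric's behavior, not of the project's `evaluate` function.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score

# Made-up imbalanced labels: 8 negatives, 2 positives.
y_true = np.array([0] * 8 + [1] * 2)

# A constant majority-class predictor, like the baseline.
y_const = np.zeros(10, dtype=int)

# A predictor with 7 true negatives, 1 false positive, 1 true positive, 1 false negative.
y_pred = np.array([0] * 7 + [1] + [1, 0])

for name, pred in [('constant', y_const), ('model', y_pred)]:
    print(
        name,
        accuracy_score(y_true, pred),
        roc_auc_score(y_true, pred),
        balanced_accuracy_score(y_true, pred),
    )
```

Both predictors have the same accuracy here, but only the second one scores above 0.5 on ROC AUC. Passing `predict_proba` scores instead of labels would give a threshold-independent ROC AUC, which is a different (and usually higher) number.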

In [7]:
eval_df = append_model_results('baseline', evaluate(train.cleared, baseline, True))
eval_df

model,accuracy,roc_auc
baseline,0.79,0.5
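The `evaluate` and `append_model_results` helpers come from the project's `evaluate` module, which is not shown. Based only on how they are called and on the table they produce, a rough sketch might look like this. The names `evaluate_sketch` and `append_results_sketch` are hypothetical, and the third positional argument (`True`) used in the notebook is omitted because its meaning is not stated.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score


def evaluate_sketch(actual, predictions):
    """Return accuracy and ROC AUC, rounded to two decimals, as a dict."""
    actual = np.asarray(actual).astype(int)
    predictions = np.asarray(predictions).astype(int)
    return {
        'accuracy': round(float(accuracy_score(actual, predictions)), 2),
        'roc_auc': round(float(roc_auc_score(actual, predictions)), 2),
    }


def append_results_sketch(name, results, df=None):
    """Add one row of results, indexed by model name, to a results table."""
    row = pd.DataFrame(results, index=[name])
    return row if df is None else pd.concat([df, row])


# Toy usage with made-up labels.
actual = pd.Series([False, False, False, True])
results = append_results_sketch('baseline', evaluate_sketch(actual, [False] * 4))
results = append_results_sketch('perfect', evaluate_sketch(actual, actual), results)
print(results)
```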


## Feature Selection

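Before we begin building machine learning models, we can use recursive feature elimination (RFE) to rank the features in the dataset by importance. The cells below are commented out, so no ranking is shown in this notebook.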

In [8]:
# # We'll use RFE to rank the importance of the features in the dataset. We'll use a decision tree classifier 
# # as the model to compare the features.

# rfe = RFE(DecisionTreeClassifier(max_depth = 15), n_features_to_select = 2)
# rfe.fit(train.drop(columns = 'cleared'), train.cleared)

In [9]:
# pd.DataFrame({'Var': train.drop(columns = 'cleared').columns, 'Rank': rfe.ranking_}).sort_values(by = 'Rank').head(25)
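Since the RFE cells above are commented out, here is a small runnable illustration of the same approach on synthetic data. The dataset, feature names, and tree depth are assumptions chosen to keep the example fast. It is not a reproduction of the ranking for the crime data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features: 8 columns, 3 of them informative.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f'feature_{i}' for i in range(X_toy.shape[1])]

# RFE repeatedly fits the tree and drops the least important feature
# until only n_features_to_select remain.
rfe_toy = RFE(DecisionTreeClassifier(max_depth=5, random_state=42), n_features_to_select=2)
rfe_toy.fit(X_toy, y_toy)

# Rank 1 marks the selected features; higher ranks were eliminated earlier.
ranking = pd.DataFrame({'Var': feature_names, 'Rank': rfe_toy.ranking_}).sort_values(by='Rank')
print(ranking)
```

With the default `step=1`, RFE refits the estimator once per eliminated feature, so on a wide dataset the run time grows with the number of columns. The `step` parameter removes several features per iteration and can shorten the run at the cost of a coarser ranking.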

## Initial Set of Models

Now we will build an initial set of models to determine which ones perform best. We will try various classification algorithms provided by sklearn. These models will be evaluated on the train dataset, and the top 3 performers will then be evaluated on validate.

In [10]:
# All the machine learning model objects will be created using mostly default values with just a few exceptions 
# such as decision trees which will have a limited depth.

algorithms = {
    'Decision Tree' : DecisionTreeClassifier(max_depth = 5, random_state = random_seed),
    'Random Forest' : RandomForestClassifier(max_depth = 5, random_state = random_seed),
    'Ada Boost' : AdaBoostClassifier(random_state = random_seed),
    'Bagging Classifier' : BaggingClassifier(random_state = random_seed),
    'Gradient Boosting' : GradientBoostingClassifier(random_state = random_seed),
    'KNN' : KNeighborsClassifier(),
    'SGD' : SGDClassifier(random_state = random_seed),
    'Naive Bayes' : BernoulliNB()
}

models = {}

for key, algorithm in algorithms.items():
    print(f'Training {key} model, ', end = '')
    
    start = time()
    models[key] = Model(
        algorithm,
        train = train,
        features = train.drop(columns = 'cleared').columns,
        target = 'cleared'
    )
    
    end = time()
    print(f'{end - start} seconds')

Training Decision Tree model, 1.1325089931488037 seconds
Training Random Forest model, 5.312100887298584 seconds
Training Ada Boost model, 19.786761045455933 seconds
Training Bagging Classifier model, 74.75548791885376 seconds
Training Gradient Boosting model, 52.34478688240051 seconds
Training KNN model, 0.17876291275024414 seconds
Training SGD model, 4.529252052307129 seconds
Training Naive Bayes model, 0.6548099517822266 seconds
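The `Model` class is imported from the project's `model` module and is not shown. From the way it is used here (it is constructed with an estimator, a training frame, a feature list, and a target name, and later exposes `make_predictions`), a minimal wrapper in the same spirit might look like the sketch below. The class name `ModelSketch` and the toy data are hypothetical, and the real class may do more.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB


class ModelSketch:
    """Hypothetical wrapper: fits the estimator on construction, predicts on demand."""

    def __init__(self, algorithm, train, features, target):
        self.features = list(features)
        self.target = target
        # fit() returns the estimator itself, so we can keep the fitted object.
        self.estimator = algorithm.fit(train[self.features], train[target])

    def make_predictions(self, df):
        # Select the training features so extra columns (such as the target) are ignored.
        return self.estimator.predict(df[self.features])


# Toy binary features, similar in kind to one-hot encoded columns.
rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.integers(0, 2, size=(200, 3)), columns=['a', 'b', 'c'])
toy['cleared'] = toy['a'] == 1

model = ModelSketch(
    BernoulliNB(),
    train=toy,
    features=toy.drop(columns='cleared').columns,
    target='cleared',
)
preds = model.make_predictions(toy)
print(preds[:10])
```

Storing the feature list on the wrapper is what allows `make_predictions` to accept the full train, validate, or test frame, as the evaluation cells in this notebook do.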


In [11]:
# Now we'll evaluate the models.

for name, model in models.items():
    print(f'Evaluating {name} model, ', end = '')
    
    start = time()
    eval_df = append_model_results(
        name,
        evaluate(train.cleared, model.make_predictions(train), True),
        eval_df
    )
    
    end = time()
    print(f'{end - start} seconds')
    
eval_df

Evaluating Decision Tree model, 0.32296180725097656 seconds
Evaluating Random Forest model, 0.8737740516662598 seconds
Evaluating Ada Boost model, 5.832282781600952 seconds
Evaluating Bagging Classifier model, 3.4469261169433594 seconds
Evaluating Gradient Boosting model, 0.8725049495697021 seconds
Evaluating KNN model, 1200.461651802063 seconds
Evaluating SGD model, 0.2830479145050049 seconds
Evaluating Naive Bayes model, 0.6619489192962646 seconds


model,accuracy,roc_auc
baseline,0.79,0.5
Decision Tree,0.83,0.61
Random Forest,0.79,0.5
Ada Boost,0.89,0.79
Bagging Classifier,0.97,0.95
Gradient Boosting,0.89,0.78
KNN,0.9,0.81
SGD,0.81,0.75
Naive Bayes,0.89,0.81


In [13]:
eval_df.sort_values(by = 'roc_auc', ascending = False)

model,accuracy,roc_auc
Bagging Classifier,0.97,0.95
KNN,0.9,0.81
Naive Bayes,0.89,0.81
Ada Boost,0.89,0.79
Gradient Boosting,0.89,0.78
SGD,0.81,0.75
Decision Tree,0.83,0.61
baseline,0.79,0.5
Random Forest,0.79,0.5


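Ranked by ROC AUC, the three best-performing models are the Bagging Classifier, KNN, and Naive Bayes. The Bagging Classifier and KNN also lead on accuracy, while Naive Bayes ties with Ada Boost and Gradient Boosting at 0.89 accuracy but has a higher ROC AUC than either. We'll now evaluate these three models on the validate set.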

In [18]:
# We'll evaluate the top 3 performing models on validate.
algorithms = [
    'Bagging Classifier',
    'KNN',
    'Naive Bayes'
]

eval_df = None

for model in algorithms:
    print(f'Evaluating {model} model, ', end = '')
    
    start = time()
    eval_df = append_model_results(
        model,
        evaluate(validate.cleared, models[model].make_predictions(validate), True),
        eval_df
    )
    
    end = time()
    print(f'{end - start} seconds')
    
eval_df.sort_values(by = 'roc_auc', ascending = False)

Evaluating Bagging Classifier model, 1.2515251636505127 seconds
Evaluating KNN model, 422.44667077064514 seconds
Evaluating Naive Bayes model, 0.2880840301513672 seconds


model,accuracy,roc_auc
Bagging Classifier,0.89,0.81
Naive Bayes,0.89,0.81
KNN,0.87,0.76


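The Naive Bayes model performs the same on validate as it does on train (0.89 accuracy, 0.81 ROC AUC). The Bagging Classifier drops from 0.97 accuracy and 0.95 ROC AUC on train to 0.89 and 0.81 on validate, which suggests it overfit the training data, and KNN also degrades while taking far longer to make predictions. We'll therefore evaluate the Naive Bayes model on test.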
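The train-to-validate comparison can be made explicit by subtracting the two result tables. The snippet below re-enters the scores reported above by hand, so it runs on its own. In the notebook the same subtraction could be applied to the two results DataFrames if both were kept.

```python
import pandas as pd

model_names = ['Bagging Classifier', 'KNN', 'Naive Bayes']

# Scores copied from the train and validate tables in this notebook.
train_scores = pd.DataFrame(
    {'accuracy': [0.97, 0.90, 0.89], 'roc_auc': [0.95, 0.81, 0.81]}, index=model_names
)
validate_scores = pd.DataFrame(
    {'accuracy': [0.89, 0.87, 0.89], 'roc_auc': [0.81, 0.76, 0.81]}, index=model_names
)

# A positive gap means the model did better on train than on validate.
gap = (train_scores - validate_scores).round(2)
print(gap.sort_values(by='roc_auc'))
```

A gap near zero, as for Naive Bayes, indicates that the model generalizes about as well as it fits the training data.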

## Evaluate Best Model on Test

In [19]:
append_model_results(
    'Naive Bayes',
    evaluate(test.cleared, models['Naive Bayes'].make_predictions(test), True)
)

model,accuracy,roc_auc
Naive Bayes,0.89,0.81


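On the test set, the Naive Bayes model reaches 89% accuracy and a ROC AUC of 0.81, matching its train and validate scores and improving on the baseline's 79% accuracy and 0.5 ROC AUC on train.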