# Modeling

This notebook contains all the steps and decisions made in the modeling phase of the pipeline.

## The Required Imports

Here we'll import all the modules required to run the code cells in this notebook.

In [8]:
from time import time

import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.feature_selection import RFE, SelectKBest, f_classif

from sklearn.preprocessing import MinMaxScaler

from wrangle import wrangle_crime_data
from prepare import split_data
from evaluate import *
from model import *

# We'll use this random seed for all the machine learning models.
random_seed = 42

## Acquire, Prepare, and Split the Data

Here we'll use the wrangle module to acquire and prepare the data. We'll then split the data into train, validate, and test datasets. The train dataset will be used to train the machine learning models. Validate and test will be used to determine how our models perform on unseen data.

In [2]:
df = wrangle_crime_data()
df = prep_data_for_modeling(df)

train, validate, test = split_data(df)
train.shape, validate.shape, test.shape

Using cached csv


((195795, 392), (83913, 392), (69928, 392))
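
`split_data` comes from the project's `prepare` module, which is not shown in this notebook. The sketch below is one way such a three-way split could be written with sklearn's `train_test_split`. The function name `split_data_sketch`, the 20% test and 30% validate fractions (chosen because they roughly reproduce the shapes above), and stratifying on `cleared` are all assumptions, not the project's actual implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, target="cleared", seed=42):
    """Hypothetical stand-in for prepare.split_data (assumed behavior)."""
    # Hold out 20% of the rows as the test set.
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Hold out 30% of the remaining rows as the validate set.
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Toy frame with an imbalanced boolean target, only for illustration.
toy = pd.DataFrame({"x": range(1000), "cleared": [i % 5 == 0 for i in range(1000)]})
train_toy, validate_toy, test_toy = split_data_sketch(toy)
print(train_toy.shape, validate_toy.shape, test_toy.shape)
```

Stratifying keeps the share of cleared cases about the same in all three datasets, which matters when the target is imbalanced.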

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 195795 entries, 68896 to 178533
Columns: 392 entries, council_district to month
dtypes: bool(2), float64(1), int64(2), uint8(387)
memory usage: 78.6 MB


In [9]:
train_scaled, validate_scaled, test_scaled = train.copy(), validate.copy(), test.copy()

columns = train_scaled.drop(columns = 'cleared').columns

scaler = MinMaxScaler()
train_scaled[columns] = scaler.fit_transform(train[columns])
validate_scaled[columns] = scaler.transform(validate[columns])
test_scaled[columns] = scaler.transform(test[columns])

## Establish a Baseline

We need to establish a baseline model to serve as a performance reference for our models. The baseline uses the simplest possible approach to predict clearance status: it always predicts the most frequent value. With this reference point we will be able to determine whether our models at least perform better than the simplest model we could build.

In [4]:
# Here we will establish a baseline model which will always predict the most frequent value in the target variable.

baseline = establish_classification_baseline(train.cleared)
baseline.value_counts()

False    195795
dtype: int64
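
`establish_classification_baseline` is defined in one of the project modules imported above and is not shown here. A minimal sketch of what it presumably does, assuming it returns a Series that repeats the most frequent target value once per row (the name `baseline_sketch` is hypothetical):

```python
import pandas as pd


def baseline_sketch(y):
    """Return a Series that predicts the most frequent value of y for every row."""
    most_frequent = y.mode()[0]
    return pd.Series([most_frequent] * len(y), index=y.index)


# Toy target with the same kind of imbalance (mostly False).
y_toy = pd.Series([False, False, False, True, False])
print(baseline_sketch(y_toy).value_counts())
```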

In [5]:
# Calculate the roc auc score.
roc_auc_score(train.cleared, baseline)

0.5

In [6]:
# Calculate the accuracy score.
accuracy_score(train.cleared, baseline)

0.7882887714190863

We'll use two metrics to evaluate our models: accuracy and ROC AUC score. Accuracy tells us how often the model predicts the correct clearance status for the cases in our dataset. However, because the target variable is imbalanced (about 79% of the cases in train are not cleared), accuracy alone can be misleading, so we also use ROC AUC, which indicates how well the model separates cleared from uncleared cases regardless of the class balance.
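
As a worked example of why both metrics are needed, the sketch below scores an always-`False` predictor on toy labels. It assumes hard True/False predictions are being scored, as with the baseline above; in that case ROC AUC reduces to the average of the true positive rate and the true negative rate. The helper names are illustrative and are not the functions in the `evaluate` module.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc_hard_labels(y_true, y_pred):
    """ROC AUC for hard True/False predictions: (TPR + TNR) / 2."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return (tp / positives + tn / negatives) / 2


# Toy labels: 79 uncleared cases and 21 cleared ones, close to the train imbalance.
y_true = [False] * 79 + [True] * 21
always_false = [False] * 100

print(accuracy(y_true, always_false))             # 0.79
print(roc_auc_hard_labels(y_true, always_false))  # 0.5
```

The predictor looks reasonable by accuracy but finds none of the cleared cases, which the ROC AUC of 0.5 makes visible.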

In [34]:
eval_df = append_model_results('baseline', evaluate(train.cleared, baseline, True))
eval_df

Unnamed: 0,accuracy,roc_auc
baseline,0.79,0.5


## Feature Selection

Before we begin building machine learning models, let's use recursive feature elimination (RFE) to rank the importance of the features in the dataset.

In [8]:
# # We'll use RFE to rank the importance of the features in the dataset. We'll use a decision tree classifier 
# # as the model to compare the features.

# rfe = RFE(DecisionTreeClassifier(max_depth = 15), n_features_to_select = 2)
# rfe.fit(train.drop(columns = 'cleared'), train.cleared)

In [9]:
# pd.DataFrame({'Var': train.drop(columns = 'cleared').columns, 'Rank': rfe.ranking_}).sort_values(by = 'Rank').head(25)
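
Since the RFE cells above are commented out, here is a small self-contained sketch of the same idea on synthetic data, so the shape of the ranking output can be inspected quickly. The dataset from `make_classification`, the feature names, and the shallow tree are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the crime dataset: 8 features, 3 of them informative.
X_arr, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# Recursively drop the least important feature until 2 remain.
rfe_toy = RFE(
    DecisionTreeClassifier(max_depth=5, random_state=42), n_features_to_select=2
)
rfe_toy.fit(X_toy, y_toy)

# Rank 1 marks the selected features; larger ranks were eliminated earlier.
ranking = pd.DataFrame(
    {"Var": X_toy.columns, "Rank": rfe_toy.ranking_}
).sort_values(by="Rank")
print(ranking)
```

RFE refits the estimator once per eliminated feature, so on 391 features and roughly 196,000 rows it can take a long time to run.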

## Initial Set of Models

Now we will build a set of initial models to determine which ones perform best. We will try various classification algorithms provided by sklearn. These models will be evaluated on the train dataset, and the top-performing models will then be evaluated on validate.

In [23]:
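# Re-add the default Bagging Classifier to `models` (execution count 23: this cell appears to have run after the hyper-parameter search below).
models['Bagging Classifier'] = Model(BaggingClassifier(random_state = random_seed), train = train, features = train.drop(columns = 'cleared').columns, target = 'cleared')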

In [10]:
# All the machine learning model objects will be created using mostly default values with just a few exceptions 
# such as decision trees which will have a limited depth.

algorithms = {
    'Decision Tree' : DecisionTreeClassifier(max_depth = 15, random_state = random_seed),
    'Random Forest' : RandomForestClassifier(max_depth = 15, random_state = random_seed),
    'Ada Boost' : AdaBoostClassifier(random_state = random_seed),
    'Bagging Classifier' : BaggingClassifier(random_state = random_seed),
    'Gradient Boosting' : GradientBoostingClassifier(random_state = random_seed),
    'SGD' : SGDClassifier(random_state = random_seed),
    'Naive Bayes' : BernoulliNB()
}

models = {}

for key, algorithm in algorithms.items():
    print(f'Training {key} model, ', end = '')
    
    start = time()
    models[key] = Model(
        algorithm,
        train = train,
        features = train.drop(columns = 'cleared').columns,
        target = 'cleared'
    )
    
    end = time()
    print(f'{end - start} seconds')

Training Decision Tree model, 12.754553079605103 seconds
Training Random Forest model, 36.80265283584595 seconds
Training Ada Boost model, 251.447411775589 seconds
Training Bagging Classifier model, 223.18360877037048 seconds
Training Gradient Boosting model, 151.38878059387207 seconds
Training SGD model, 10.347418069839478 seconds
Training Naive Bayes model, 6.08551025390625 seconds
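
`Model` is imported from the project's `model` module and its source is not part of this notebook. Judging only from how it is called here (an estimator plus `train`, `features`, and `target`, and later `make_predictions(df)`), a wrapper along the following lines would behave the same way. The class name `ModelSketch` and everything inside it are assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


class ModelSketch:
    """Hypothetical wrapper that fits an estimator on construction."""

    def __init__(self, estimator, train, features, target):
        self.estimator = estimator
        self.features = list(features)
        self.target = target
        # Fit immediately (assumed; it would explain why the loop times the constructor).
        self.estimator.fit(train[self.features], train[self.target])

    def make_predictions(self, df):
        """Predict the target for df using only the stored feature columns."""
        predictions = self.estimator.predict(df[self.features])
        return pd.Series(predictions, index=df.index)


# Toy data: the target is True exactly when x1 is 1.
toy = pd.DataFrame({"x1": [0, 1, 0, 1, 1, 0], "x2": [1, 1, 0, 0, 1, 0]})
toy["cleared"] = toy["x1"] == 1

sketch = ModelSketch(
    DecisionTreeClassifier(max_depth=2, random_state=42),
    train=toy,
    features=toy.drop(columns="cleared").columns,
    target="cleared",
)
print(sketch.make_predictions(toy).tolist())
```

Storing the feature list on the object means the same columns are selected at prediction time, even if the dataframe passed in still contains the target.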


In [14]:
# Now we'll evaluate the models.

for name, model in models.items():
    print(f'Evaluating {name} model, ', end = '')
    
    start = time()
    eval_df = append_model_results(
        name,
        evaluate(train.cleared, model.make_predictions(train), True),
        eval_df
    )
    
    end = time()
    print(f'{end - start} seconds')
    
eval_df.sort_values(by = 'roc_auc', ascending = False)

Evaluating Decision Tree model, 6.028434991836548 seconds
Evaluating Random Forest model, 8.273238897323608 seconds
Evaluating Ada Boost model, 119.57484221458435 seconds
Evaluating Bagging Classifier model, 33.803488969802856 seconds
Evaluating Gradient Boosting model, 5.067091941833496 seconds
Evaluating SGD model, 4.459805965423584 seconds
Evaluating Naive Bayes model, 5.2549309730529785 seconds


Unnamed: 0,accuracy,roc_auc
Bagging Classifier,0.96,0.93
Naive Bayes,0.88,0.81
Ada Boost,0.89,0.79
Decision Tree,0.89,0.78
Gradient Boosting,0.89,0.77
SGD,0.87,0.76
Random Forest,0.86,0.68
baseline,0.79,0.5


Ranked by ROC AUC, the four models with the best performance are the Bagging Classifier, Naive Bayes, Ada Boost, and the Decision Tree. We'll now create some new models with different hyper-parameters for each of these algorithms.

In [19]:
# Create different versions of the top 4 models using various hyper-parameters.

algorithms = {}

# Create a variety of decision tree models with various hyper-parameter values for 
# max_depth, min_samples_leaf, and criterion.

for max_depth in range(10, 26):
    for min_samples_leaf in range(1, 6):
        for criterion in ['gini', 'entropy']:
            algorithms[
                f'Decision Tree md:{max_depth} msl:{min_samples_leaf} c:{criterion}'
            ] = DecisionTreeClassifier(
                max_depth = max_depth,
                min_samples_leaf = min_samples_leaf,
                criterion = criterion,
                random_state = random_seed
            )
            
# Create a variety of Ada Boost classifier models with various hyper-parameter values for
# n_estimators and base_estimator.

for n_estimators in range(45, 56):
    for name, base_estimator in {
        'dt' : DecisionTreeClassifier(max_depth = 1, random_state = random_seed),
        'rf' : RandomForestClassifier(max_depth = 1, random_state = random_seed)
    }.items():
        algorithms[
            f'Ada Boost b:{name} n:{n_estimators}'
        ] = AdaBoostClassifier(
            base_estimator = base_estimator,
            n_estimators = n_estimators,
            random_state = random_seed
        )
        
# Create a variety of bagging classifier models with various hyper-parameter values for 
# base_estimator, n_estimators, and max_features.

for name, base_estimator in {
    'dt' : DecisionTreeClassifier(max_depth = 1, random_state = random_seed),
    'rf' : RandomForestClassifier(max_depth = 1, random_state = random_seed)
}.items():
    for n_estimators in range(8, 13):
        for max_features in range(1, 3):
            algorithms[
                f'Bagging Classifier b:{name} n:{n_estimators} m:{max_features}'
            ] = BaggingClassifier(
                base_estimator = base_estimator,
                n_estimators = n_estimators,
                max_features = max_features,
                random_state = random_seed
            )
            
# Create a variety of naive bayes models with various hyper-parameter values for 
# fit_prior and alpha.

for fit_prior in [True, False]:
    for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
        algorithms[
            f'Naive Bayes f:{fit_prior} a:{alpha}'
        ] = BernoulliNB(fit_prior = fit_prior, alpha = alpha)

models = {}

for key, algorithm in algorithms.items():
    print(f'Training {key} model, ', end = '\r')
    
    start = time()
    models[key] = Model(
        algorithm,
        train = train,
        features = train.drop(columns = 'cleared').columns,
        target = 'cleared'
    )
    
    end = time()
    print(f'{end - start} seconds', end = '\r')

Training Naive Bayes f:False a:1.0 model, 5.279297590255737 seconds
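
The nested loops above enumerate every hyper-parameter combination by hand. As an aside, sklearn's `ParameterGrid` can generate the same combinations from a dictionary. The sketch below rebuilds only the decision tree grid; it is an alternative formulation, not what the notebook ran, and the names `tree_grid` and `tree_algorithms` are illustrative.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeClassifier

random_seed = 42

# Same ranges as the decision tree loops: 16 depths x 5 leaf sizes x 2 criteria.
tree_grid = ParameterGrid({
    "max_depth": list(range(10, 26)),
    "min_samples_leaf": list(range(1, 6)),
    "criterion": ["gini", "entropy"],
})

tree_algorithms = {}
for params in tree_grid:
    # Build the same style of key used in the notebook,
    # e.g. 'Decision Tree md:10 msl:1 c:gini'.
    key = (
        f"Decision Tree md:{params['max_depth']} "
        f"msl:{params['min_samples_leaf']} c:{params['criterion']}"
    )
    tree_algorithms[key] = DecisionTreeClassifier(random_state=random_seed, **params)

print(len(tree_algorithms))
```

Keeping the grid in one dictionary makes it easier to see how many models will be trained before starting a long run.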

In [20]:
# Now we'll evaluate the models.

for name, model in models.items():
    print(f'Evaluating {name} model, ', end = '')
    
    start = time()
    eval_df = append_model_results(
        name,
        evaluate(train.cleared, model.make_predictions(train), True),
        eval_df
    )
    
    end = time()
    print(f'{end - start} seconds')
    
eval_df.sort_values(by = 'roc_auc', ascending = False)

Evaluating Decision Tree md:10 msl:1 c:gini model, 3.9683518409729004 seconds
Evaluating Decision Tree md:10 msl:1 c:entropy model, 3.9314963817596436 seconds
Evaluating Decision Tree md:10 msl:2 c:gini model, 3.9557418823242188 seconds
Evaluating Decision Tree md:10 msl:2 c:entropy model, 3.9401259422302246 seconds
Evaluating Decision Tree md:10 msl:3 c:gini model, 3.954310894012451 seconds
Evaluating Decision Tree md:10 msl:3 c:entropy model, 3.924078941345215 seconds
Evaluating Decision Tree md:10 msl:4 c:gini model, 3.9283089637756348 seconds
Evaluating Decision Tree md:10 msl:4 c:entropy model, 3.9114458560943604 seconds
Evaluating Decision Tree md:10 msl:5 c:gini model, 3.950137138366699 seconds
Evaluating Decision Tree md:10 msl:5 c:entropy model, 3.927891969680786 seconds
Evaluating Decision Tree md:11 msl:1 c:gini model, 3.955997943878174 seconds
Evaluating Decision Tree md:11 msl:1 c:entropy model, 3.935448169708252 seconds
Evaluating Decision Tree md:11 msl:2 c:gini model, 3

Unnamed: 0,accuracy,roc_auc
Bagging Classifier,0.96,0.93
Naive Bayes f:False a:1.0,0.84,0.84
Naive Bayes f:False a:0.75,0.84,0.84
Naive Bayes f:False a:0.5,0.84,0.84
Naive Bayes f:False a:0.25,0.84,0.84
...,...,...
Bagging Classifier b:dt n:9 m:2,0.79,0.50
Bagging Classifier b:dt n:9 m:1,0.79,0.50
Bagging Classifier b:dt n:8 m:2,0.79,0.50
Bagging Classifier b:dt n:8 m:1,0.79,0.50


In [21]:
eval_df.sort_values(by = 'roc_auc', ascending = False).head(10)

Unnamed: 0,accuracy,roc_auc
Bagging Classifier,0.96,0.93
Naive Bayes f:False a:1.0,0.84,0.84
Naive Bayes f:False a:0.75,0.84,0.84
Naive Bayes f:False a:0.5,0.84,0.84
Naive Bayes f:False a:0.25,0.84,0.84
Naive Bayes f:False a:0.0,0.84,0.84
Decision Tree md:25 msl:1 c:gini,0.91,0.83
Decision Tree md:24 msl:1 c:gini,0.91,0.82
Decision Tree md:25 msl:1 c:entropy,0.91,0.82
Decision Tree md:25 msl:2 c:gini,0.91,0.82


In [24]:
# We'll evaluate the top 10 performing models on validate.
algorithms = [
    'Bagging Classifier',
    'Naive Bayes f:False a:1.0',
    'Naive Bayes f:False a:0.75',
    'Naive Bayes f:False a:0.5',
    'Naive Bayes f:False a:0.25',
    'Naive Bayes f:False a:0.0',
    'Decision Tree md:25 msl:1 c:gini',
    'Decision Tree md:24 msl:1 c:gini',
    'Decision Tree md:25 msl:1 c:entropy',
    'Decision Tree md:25 msl:2 c:gini'
]

eval_df = None

for model in algorithms:
    print(f'Evaluating {model} model, ', end = '')
    
    start = time()
    eval_df = append_model_results(
        model,
        evaluate(validate.cleared, models[model].make_predictions(validate), True),
        eval_df
    )
    
    end = time()
    print(f'{end - start} seconds')
    
eval_df.sort_values(by = 'roc_auc', ascending = False)

Evaluating Bagging Classifier model, 12.927634000778198 seconds
Evaluating Naive Bayes f:False a:1.0 model, 2.36434268951416 seconds
Evaluating Naive Bayes f:False a:0.75 model, 2.3913583755493164 seconds
Evaluating Naive Bayes f:False a:0.5 model, 2.3695759773254395 seconds
Evaluating Naive Bayes f:False a:0.25 model, 2.3552420139312744 seconds
Evaluating Naive Bayes f:False a:0.0 model, 2.3753459453582764 seconds
Evaluating Decision Tree md:25 msl:1 c:gini model, 1.6748161315917969 seconds
Evaluating Decision Tree md:24 msl:1 c:gini model, 1.6745269298553467 seconds
Evaluating Decision Tree md:25 msl:1 c:entropy model, 1.6799218654632568 seconds
Evaluating Decision Tree md:25 msl:2 c:gini model, 1.6673331260681152 seconds


Unnamed: 0,accuracy,roc_auc
Naive Bayes f:False a:1.0,0.84,0.84
Naive Bayes f:False a:0.75,0.84,0.84
Naive Bayes f:False a:0.5,0.84,0.84
Naive Bayes f:False a:0.25,0.84,0.84
Naive Bayes f:False a:0.0,0.84,0.84
Bagging Classifier,0.89,0.82
Decision Tree md:25 msl:1 c:gini,0.89,0.79
Decision Tree md:24 msl:1 c:gini,0.89,0.78
Decision Tree md:25 msl:1 c:entropy,0.88,0.78
Decision Tree md:25 msl:2 c:gini,0.89,0.78


The Naive Bayes models score the same on validate as they do on train (0.84 accuracy and 0.84 ROC AUC), which suggests they are not overfitting, so we'll evaluate Naive Bayes on test.
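
To make the comparison concrete, the snippet below computes the drop in ROC AUC from train to validate for three of the models, using the rounded scores reported in the tables above. The dictionary layout and the name `auc_scores` are just for illustration.

```python
# (train ROC AUC, validate ROC AUC), copied from the result tables above.
auc_scores = {
    "Bagging Classifier": (0.93, 0.82),
    "Naive Bayes f:False a:1.0": (0.84, 0.84),
    "Decision Tree md:25 msl:1 c:gini": (0.83, 0.79),
}

# A large positive gap suggests the model is overfitting the train dataset.
gaps = {
    name: round(train_auc - validate_auc, 2)
    for name, (train_auc, validate_auc) in auc_scores.items()
}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.2f}")
```

The Bagging Classifier has the highest train score but also the largest gap, while Naive Bayes shows no drop at all.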

## Trying Different Numbers of Features

Now we're going to build some models that use a variety of different numbers of features to see how this affects model performance.

In [28]:
features = []

# Select the top k features using SelectKBest for k = 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100.
for k in range(10, 101, 10):
    f_selector = SelectKBest(f_classif, k = k)
    f_selector.fit(train_scaled.drop(columns = 'cleared'), train_scaled.cleared)
    
    # boolean mask of whether the column was selected or not. 
    feature_mask = f_selector.get_support()

    # get list of top K features. 
    f_feature = train_scaled.drop(columns = 'cleared').iloc[:,feature_mask].columns.tolist()
    
    features.append(f_feature)
    
models = {}

for feature_set in features:
    print(f'Training Bagging Classifier model k = {len(feature_set)}, ', end = '\r')
    
    models[f'Bagging Classifier k:{len(feature_set)}'] = Model(
        BaggingClassifier(random_state = random_seed),
        train = train_scaled,
        features = feature_set,
        target = 'cleared'
    )
    
for feature_set in features:
    print(f'Training Decision Tree model k = {len(feature_set)}, ', end = '\r')
    
    models[f'Decision Tree k:{len(feature_set)}'] = Model(
        DecisionTreeClassifier(max_depth = 25, random_state = random_seed),
        train = train_scaled,
        features = feature_set,
        target = 'cleared'
    )
    
for feature_set in features:
    print(f'Training Naive Bayes model k = {len(feature_set)}, ', end = '\r')
    
    models[f'Naive Bayes k:{len(feature_set)}'] = Model(
        BernoulliNB(),
        train = train_scaled,
        features = feature_set,
        target = 'cleared'
    )
    
for feature_set in features:
    print(f'Training Naive Bayes fit_prior = False model k = {len(feature_set)}, ', end = '\r')
    
    models[f'Naive Bayes fit_prior:False k:{len(feature_set)}'] = Model(
        BernoulliNB(fit_prior = False),
        train = train_scaled,
        features = feature_set,
        target = 'cleared'
    )


Training Naive Bayes fit_prior = False model k = 100, 

In [35]:
# Now we'll evaluate the models.

for name, model in models.items():
    print(f'Evaluating {name} model, ', end = '')
    
    start = time()
    eval_df = append_model_results(
        name,
        evaluate(train.cleared, model.make_predictions(train_scaled), True),
        eval_df
    )
    
    end = time()
    print(f'{end - start} seconds')
    
eval_df.sort_values(by = 'roc_auc', ascending = False)

Evaluating Bagging Classifier k:10 model, 0.33852314949035645 seconds
Evaluating Bagging Classifier k:20 model, 0.47135400772094727 seconds
Evaluating Bagging Classifier k:30 model, 0.965256929397583 seconds
Evaluating Bagging Classifier k:40 model, 1.025602102279663 seconds
Evaluating Bagging Classifier k:50 model, 1.1582262516021729 seconds
Evaluating Bagging Classifier k:60 model, 1.3304362297058105 seconds
Evaluating Bagging Classifier k:70 model, 1.6224617958068848 seconds
Evaluating Bagging Classifier k:80 model, 1.819997787475586 seconds
Evaluating Bagging Classifier k:90 model, 3.3077950477600098 seconds
Evaluating Bagging Classifier k:100 model, 4.160368919372559 seconds
Evaluating Decision Tree k:10 model, 0.07445311546325684 seconds
Evaluating Decision Tree k:20 model, 0.08928298950195312 seconds
Evaluating Decision Tree k:30 model, 0.11636710166931152 seconds
Evaluating Decision Tree k:40 model, 0.1308748722076416 seconds
Evaluating Decision Tree k:50 model, 0.1454448699951

Unnamed: 0,accuracy,roc_auc
Bagging Classifier k:100,0.95,0.91
Bagging Classifier k:90,0.94,0.9
Bagging Classifier k:80,0.93,0.88
Bagging Classifier k:70,0.91,0.85
Bagging Classifier k:60,0.91,0.84
Decision Tree k:100,0.91,0.83
Bagging Classifier k:50,0.91,0.83
Decision Tree k:90,0.91,0.83
Decision Tree k:80,0.91,0.82
Naive Bayes fit_prior:False k:90,0.84,0.82


In [37]:
# Now we'll evaluate the models on validate.

algorithms = [
    'Bagging Classifier k:100',
    'Bagging Classifier k:90',
    'Bagging Classifier k:80',
    'Bagging Classifier k:70',
    'Bagging Classifier k:60',
    'Decision Tree k:100',
    'Bagging Classifier k:50',
    'Decision Tree k:90',
    'Decision Tree k:80',
    'Naive Bayes fit_prior:False k:60',
    'Naive Bayes fit_prior:False k:90'
]

eval_df = None

for model in algorithms:
    print(f'Evaluating {model} model, ', end = '')
    
    start = time()
    eval_df = append_model_results(
        model,
        evaluate(validate.cleared, models[model].make_predictions(validate_scaled), True),
        eval_df
    )
    
    end = time()
    print(f'{end - start} seconds')
    
eval_df.sort_values(by = 'roc_auc', ascending = False)

Evaluating Bagging Classifier k:100 model, 0.889296293258667 seconds
Evaluating Bagging Classifier k:90 model, 0.7825591564178467 seconds
Evaluating Bagging Classifier k:80 model, 0.8530402183532715 seconds
Evaluating Bagging Classifier k:70 model, 0.7350919246673584 seconds
Evaluating Bagging Classifier k:60 model, 0.524116039276123 seconds
Evaluating Decision Tree k:100 model, 0.08946681022644043 seconds
Evaluating Bagging Classifier k:50 model, 0.5005710124969482 seconds
Evaluating Decision Tree k:90 model, 0.09373998641967773 seconds
Evaluating Decision Tree k:80 model, 0.07422590255737305 seconds
Evaluating Naive Bayes fit_prior:False k:60 model, 0.1244499683380127 seconds
Evaluating Naive Bayes fit_prior:False k:90 model, 0.1652529239654541 seconds


Unnamed: 0,accuracy,roc_auc
Naive Bayes fit_prior:False k:60,0.85,0.82
Naive Bayes fit_prior:False k:90,0.83,0.82
Bagging Classifier k:100,0.88,0.81
Bagging Classifier k:90,0.89,0.81
Bagging Classifier k:80,0.89,0.81
Bagging Classifier k:70,0.89,0.81
Bagging Classifier k:60,0.89,0.81
Bagging Classifier k:50,0.89,0.8
Decision Tree k:100,0.89,0.79
Decision Tree k:90,0.89,0.79


## Evaluate Best Model on Test

In [19]:
append_model_results(
    'Naive Bayes',
    evaluate(test.cleared, models['Naive Bayes'].make_predictions(test), True)
)

Unnamed: 0,accuracy,roc_auc
Naive Bayes,0.89,0.81


On the test dataset, which the model has not seen before, the Naive Bayes model is 89% accurate with a ROC AUC of 0.81.