## Implementation of Tree Based Models

In this notebook, we extract the features from the dataset and provide it to a data pipeline that processes all numerical, categorical and text features to provide results on tree based models such as Decision Tree and Random Forests.

In [12]:
import pandas as pd
import numpy as np
import matplotlib as plt

#### 1. Data Processing

We import the dataset and find out the shape.

In [13]:
#Importing data
openpolicing_path="C:/Users/SwetaMankala/Desktop/Assignments/EAI6000/ma_statewide_2020.csv"

data=pd.read_csv(openpolicing_path,low_memory=False)
print('The shape of the dataset is:', data.shape)

The shape of the dataset is: (3416238, 24)


In [14]:
print(data.columns)

Index(['raw_row_number', 'date', 'location', 'county_name', 'subject_age',
       'subject_race', 'subject_sex', 'type', 'arrest_made', 'citation_issued',
       'contraband_weapons', 'contraband_alcohol', 'contraband_other',
       'frisk_performed', 'search_conducted', 'search_basis',
       'reason_for_stop', 'vehicle_type', 'vehicle_registration_state',
       'raw_Race'],
      dtype='object')


The features present in the dataset are split manually into three categories namely numerical, categorical and text features to process them based on their properties. Based on the features extracted, we assign them to model features and model target which will be used in our models. We select the features that are useful for our predictions below.

In [16]:
#numerical features
numerical_features = ['subject_age']

#categorical features
categorical_features = ['subject_sex', 'type', 'arrest_made', 'citation_issued', 'warning_issued',
                       'outcome', 'contraband_found', 'frisk_performed', 'search_conducted', 'search_basis', 'reason_for_stop',
                       'vehicle_type', 'vehicle_registration_state']

#text features
text_features = ['location', 'county_name']

model_features = numerical_features + categorical_features + text_features
model_target = 'subject_race'

print('Model Features:', model_features)
print('Model Target:', model_target)

Model Target: subject_race


The features are segregated to adjust them accordingly to fit in our training model. We have selected our model target to be subject race as we would like to classify the subjects based on their race which can be used to improve our statistics later on. 

We check the categories present in our model target.

In [17]:
data[model_target].value_counts()

white                     2543612
black                      353548
hispanic                   340271
asian/pacific islander     167735
other                       11072
Name: subject_race, dtype: int64

We convert the features categorical and text to type string format for ease of use.

In [18]:
data[categorical_features + text_features] = data[categorical_features + text_features].astype('str')

#### 2. Train-Test Split

We split our dataset and assign 10% to the test set. Out of the 90%, we assign 30% to validation and training sets.

In [19]:
from sklearn.model_selection import train_test_split

dataset, test_data = train_test_split(data, test_size=0.1, shuffle=True, random_state=23)

The validation set evaluates the model to use the fitted model to test the accuracy. This will ensure the model to work properly before running it on the test dataset.

In [20]:
train_data, val_data = train_test_split(dataset, test_size=0.7, shuffle=True, random_state=23)

### Data Pipeline with Decision Tree Classifier

#### 3. Decision Tree Data Pipeline

We build separate pipelines to handle the numerical, categorical, and text features, and then combine them into a composite pipeline along with an estimator, which is the Decision Tree Classifier.

 * For the numerical features pipeline, the __numerical_processor__ below, we impute missing values with the mean using sklearn's SimpleImputer, followed by a MinMaxScaler (don't have to scale features when using Decision Trees, but it's a good idea to see how to use more data transforms). If different processing is desired for different numerical features, different pipelines should be built - just like shown below for the two text features.
   
   
   * In the categoricals pipeline, the __categorical_processor__ below, we impute with a placeholder value (no effect here as we already encoded the 'nan's), and encode with sklearn's OneHotEncoder. If computing memory is an issue, it is a good idea to check categoricals' unique values, to get an estimate of many dummy features will be created by one-hot encoding. Note the __handle_unknown__ parameter that tells the encoder to ignore (rather than throw an error for) any unique value that might show in the validation/and or test set that was not present in the initial training set.
  
   
   * Finally, we build two more pipelines, one for each of our text features using the CountVectorizer().
   
These features are selected and combined together in the Column Transformer. This ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a validation dataset via cross-validation or making predictions on a test dataset in the future.

In [21]:
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier

numerical_processor = Pipeline([
    ('num_scaler', MinMaxScaler())
])

categorical_processor = Pipeline([
    ('cat_encoder', OneHotEncoder(handle_unknown='ignore'))
])

text_processor0 = Pipeline([
    ('text_vect0', CountVectorizer(binary=True, max_features=50))
])

text_processor1 = Pipeline([
    ('text_vect1', CountVectorizer(binary=True, max_features=50))
])

data_preprocessor = ColumnTransformer([
    ('numeric', numerical_processor, numerical_features),
    ('categoric', categorical_processor, categorical_features),
    ('text_pro0', text_processor0, text_features[0]),
    ('text_pro1', text_processor1, text_features[1])
])

pipeline1 = Pipeline([
    ('data_preprocessing', data_preprocessor),
    ('dt', DecisionTreeClassifier())
])

from sklearn import set_config
set_config(display='diagram')
pipeline1

### Training the Model

#### 4. Training the model on Train and Cross Validation datasets.

In [22]:
X_train = train_data[model_features]
y_train = train_data[model_target]

pipeline1.fit(X_train, y_train)

In [23]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

train_predictions = pipeline1.predict(X_train)

print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))
print('Accuracy Score:', accuracy_score(y_train, train_predictions))

[[  5346    788    524      3  38710]
 [   523  15799   3071      0  76156]
 [   393   2081  18948      3  70561]
 [    21     61     84    456   2380]
 [  1608   5983   8165    118 670602]]
                        precision    recall  f1-score   support

asian/pacific islander       0.68      0.12      0.20     45371
                 black       0.64      0.17      0.26     95549
              hispanic       0.62      0.21      0.31     91986
                 other       0.79      0.15      0.25      3002
                 white       0.78      0.98      0.87    686476

              accuracy                           0.77    922384
             macro avg       0.70      0.32      0.38    922384
          weighted avg       0.74      0.77      0.71    922384

Accuracy Score: 0.770992341584416


Using the training data, we obtain an acuracy of 77.09%. This is a nearly perfect score for training data. 

In [24]:
X_val = val_data[model_features]
y_val = val_data[model_target]

val_predictions = pipeline1.predict(X_val)

print(confusion_matrix(y_val, val_predictions))
print(classification_report(y_val, val_predictions))
print('Accuracy Score:', accuracy_score(y_val, val_predictions))

[[   2056    3456    2467      71   97462]
 [   2354   13965   11050     121  195278]
 [   1880    7828   22863     121  181864]
 [     88     230     262      78    6206]
 [  13078   34474   37643    1046 1516289]]
                        precision    recall  f1-score   support

asian/pacific islander       0.11      0.02      0.03    105512
                 black       0.23      0.06      0.10    222768
              hispanic       0.31      0.11      0.16    214556
                 other       0.05      0.01      0.02      6864
                 white       0.76      0.95      0.84   1602530

              accuracy                           0.72   2152230
             macro avg       0.29      0.23      0.23   2152230
          weighted avg       0.63      0.72      0.65   2152230

Accuracy Score: 0.722623046793326


The accuracy for the validation is 72.25% which is a little lesser than for the training set. 

#### 5. Grid Search Cross-Validation Hyperparameter Tuning

We use Grid Search to look for hyperparameter combinations to improve the accuracy on the test set (and reduce the generalization gap). As GridSearchCV does cross-validation train-validation split internally, 
our data transformers inside the Pipeline context will force the correct behavior of learning data transformations on the training set, and applying the transformations to the validation set when cross-validating, as well as on the test set later when running test predictions.

In [25]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# param_grid = {'dt__max_depth': [10, 20, 30],
#               'dt__min_samples_leaf':[1, 2, 5],
#               'dt__min_samples_split':[10, 20, 30]
#              }

param_grid = {'dt__max_depth': [10, 20],
              'dt__min_samples_leaf':[1, 2],
              'dt__min_samples_split':[10, 20]
             }

grid_search = GridSearchCV(pipeline1, 
                           param_grid,
#                            cv = 5,
                           verbose = 1,
                           n_jobs = -1)

grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed: 19.7min finished


In [26]:
grid_search.get_params().keys()

dict_keys(['cv', 'error_score', 'estimator__memory', 'estimator__steps', 'estimator__verbose', 'estimator__data_preprocessing', 'estimator__dt', 'estimator__data_preprocessing__n_jobs', 'estimator__data_preprocessing__remainder', 'estimator__data_preprocessing__sparse_threshold', 'estimator__data_preprocessing__transformer_weights', 'estimator__data_preprocessing__transformers', 'estimator__data_preprocessing__verbose', 'estimator__data_preprocessing__numeric', 'estimator__data_preprocessing__categoric', 'estimator__data_preprocessing__text_pro0', 'estimator__data_preprocessing__text_pro1', 'estimator__data_preprocessing__numeric__memory', 'estimator__data_preprocessing__numeric__steps', 'estimator__data_preprocessing__numeric__verbose', 'estimator__data_preprocessing__numeric__num_scaler', 'estimator__data_preprocessing__numeric__num_scaler__copy', 'estimator__data_preprocessing__numeric__num_scaler__feature_range', 'estimator__data_preprocessing__categoric__memory', 'estimator__data_

Using the get_params(), we can infer the type of parameters that are using in our hyperparameter tuning.

### Function to report best scores

In [29]:
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})"
                  .format(results['mean_test_score'][candidate],
                          results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [30]:
report(grid_search.cv_results_)
print(grid_search.best_params_)
print(grid_search.best_score_)

classifier1 = grid_search.best_estimator_

classifier1.fit(X_train, y_train)

Model with rank: 1
Mean validation score: 0.746 (std: 0.000)
Parameters: {'dt__max_depth': 10, 'dt__min_samples_leaf': 2, 'dt__min_samples_split': 20}

Model with rank: 2
Mean validation score: 0.746 (std: 0.000)
Parameters: {'dt__max_depth': 10, 'dt__min_samples_leaf': 1, 'dt__min_samples_split': 20}

Model with rank: 3
Mean validation score: 0.746 (std: 0.000)
Parameters: {'dt__max_depth': 10, 'dt__min_samples_leaf': 1, 'dt__min_samples_split': 10}

{'dt__max_depth': 10, 'dt__min_samples_leaf': 2, 'dt__min_samples_split': 20}
0.7458715678697917


With the best_estimator_, we can use the best set of parameters the grid-search has tuned to fit them to the classifier which will be used to fetch predictions over the test dataset.

In [31]:
X_test = test_data[model_features]
y_test = test_data[model_target]

In [32]:
test_predictions = classifier1.predict(X_test)

print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
print('Accuracy Score:', accuracy_score(y_test, test_predictions))

[[    10    255     80      0  16507]
 [    16   1094    954      0  33167]
 [     6    314   2018      0  31391]
 [     0     12     12      0   1182]
 [    13   1031   1582      0 251980]]


  _warn_prf(average, modifier, msg_start, len(result))


                        precision    recall  f1-score   support

asian/pacific islander       0.22      0.00      0.00     16852
                 black       0.40      0.03      0.06     35231
              hispanic       0.43      0.06      0.11     33729
                 other       0.00      0.00      0.00      1206
                 white       0.75      0.99      0.86    254606

              accuracy                           0.75    341624
             macro avg       0.36      0.22      0.20    341624
          weighted avg       0.66      0.75      0.65    341624

Accuracy Score: 0.7467332505912934


### Data Pipeline with Random Forests

#### 5. Random Forest Classifier with Data Pipeline

The data pipeline is implemented with the same method used before but we use the Random Forest classifier instead.

In [33]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipeline2 = Pipeline([
    ('data_preprocessing', data_preprocessor),
    ('rf', RandomForestClassifier())
])

# from sklearn import set_config
# set_config(display='diagram')
pipeline2

#### 6. Training and Cross Validation with Random Forest Classifier

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

pipeline2.fit(X_train, y_train)
train_pre = pipeline2.predict(X_train)

print(confusion_matrix(y_train, train_pre))
print(classification_report(y_train, train_pre))
print('Accuracy Score:', accuracy_score(y_train, train_pre))

On applying the Random Forest classifier, we get 79% accuracy over the training data. 

In [None]:
val_pre = pipeline2.predict(X_val)

print(confusion_matrix(y_val, val_pre))
print(classification_report(y_val, val_pre))
print('Accuracy Score:', accuracy_score(y_val, val_pre))

With applying the model on validation, we get an accuracy of 73% which is lesser than that of training set.

#### 7. Randomized Search Cross-Validation Hyperparameter Tuning

We use Randomized Search to look for hyperparameter combinations to improve the accuracy on the test set (and reduce the generalization gap).

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# param_rand = {'rf__max_depth': [10, 20, 30],
#               'rf__min_samples_leaf':[1, 2, 5],
#               'rf__min_samples_split':[10, 20, 30]
#              }

param_rand = {'rf__max_depth': [10, 20],
              'rf__min_samples_leaf':[1, 2],
              'rf__min_samples_split':[10, 20]
             }

random_search = GridSearchCV(pipeline2, 
                           param_rand,
                           cv = 3,
                           verbose = 1,
                           n_jobs = -1)

random_search.fit(X_train, y_train)

In this section, we have reduced the cross-validation to 3 k-folds and reduced the parameters for grid search as well as it takes a little too long to run the code.

In [None]:
test_pre = classifier2.predict(X_test)

print(confusion_matrix(y_test, test_pre))
print(classification_report(y_test, test_pre))
print('Accuracy Score:', accuracy_score(y_test, test_pre))

The final accuracy obtained in 72%. The decision tree classifier works much better than random forest in this section. We can also use the randomized search with random forest classifier as this is proven to be the best bagging optimization in machine learning models. 

We can infer that random forests take a long time to split the decisions and work on different trees that would consume more than a decision tree classifier wherein a grid search cross validated hyperparameter tuning that performs an exhaustive search and finds the optimum combination of the hyperparameters over the specified parameter values. 

The step of performing a cross validation builds multiple trees in the background that will improve the performance and structure of the existing model that we have built.

In our model predictions, when we look at the values of 'support' given in the classification report, it shows the number of actual occurences in our dataset.

#### Inferences:

1. We have found that the variables rawRace and subject_race have a correlation which affected the model's performance. Upon removing one of them, we could make accurate predictions.

2. Variables such as contraband_drug and contraband_found showed similarities which would overfit the model and hence, one of the features were used in general to overcome duplication of values in the data set. 

3. We have learned that bagging techniques help improve the model performance and also tried various combinations through this assignment. The best accuracy is provided for the Decision Tree classifier with the Grid Search Cross=Validation hyper parameter tuning model. 