# Project SURAKSHA : Enabling awareness

<div class = "alert alert-block alert-success">

 - <b>Version</b> : 1.0

 - <b>Authors</b> : 
    -  Anjali Muralidharan<br>
    -  Chitra Nair<br>
    -  Kavish Jhaveri<br>
    -  Simantini Ghosh<br>
    -  Sonal Rai
            
-  Built in association with Indian School of Business,Hyderbad as part of Capstone Project
</div>

### Problem Description:

To create a prototype for a mobile app that ensures women & child safety though the following modules
1.	Share Live Location & Details - With a chosen few, integrate with cab giants to link and display driver details to the chosen person receiving live updates
2.	Map the City Down Based On Safety Index - Provide real time alerts when in a zone/area with a poor safety rating, citations on prior crimes committed
3.	SOS Functionality: App Profile to Integrate with Women & Police Helplines Based on subject's zip code and make calls in need
4.	Offline Mode: The app must share data while offline (TBD)
5.	Prescriptive Prompts !!!: Must deliver real time alerts, if the subject is in an area that is inhabited (by criminal records) by convicted sex offender, child trafficker or other associated crimes with convicts personal identification indicators (Photo etc.) to enable educated decision making.

<div class = "alert alert-block alert-info">
<b>This code snippet is for modelling in order to predict the type of crime that may be committed at a given place at a particular time, day of he week, month and category of crime. This snippet is integrated with the user interface build using flutter to provide prescriptive prompts to users whenever they enter a location.</b>
</div>

### Dataset

The dataset has features pertaining to crimes committed in the city in hand like crime type, category, date and time of crime,etc.

#### The below experiments will be performed on the dataset provided. The experiment that suggests the best accuracy score will be integrated.

1. Decision Tree
2. Random Forest
3. Gradient Boost
4. KNN
5. XGBoost
6. Logistic Regression

#### Loading relevant libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
# Print version info for the sake of reproducibility
import sys
import sklearn as skl
print("python " + sys.version)
print("pandas " + str(pd.__version__))
print("numpy " + np.__version__)
print("sklearn" + skl.__version__)

python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
pandas 1.3.4
numpy 1.20.3
sklearn1.0.2


#### Loading the Delhi crime data

In [344]:
delhi_crime_data = pd.read_excel("Delhi_crime_data.xls") 
delhi_crime_data.head(3)

Unnamed: 0,Address,Area,City,Category,Day Quarter Group,Day Time,Delhi Districts Cluster,Delhi Incidents Cluster,Id,Incident Category,...,weekday,quarter,Delhi_Cluster_code,Delhi_Cluster_code_encode,time_hour,crime_code,Severity_index,Safety_index,Safety_index_code,Safety_index_code_val
0,"Minto Bridge Colony, Barakhamba, New Delhi, De...",Barakhamba,New Delhi,Verbal Abuse,Quarter 1 : 12 Midnight to 6 AM,1 AM-2 AM,Low: <100 Incidents,Mid: 20-50 Incidents,3887,Catcalls/Whistles,...,5,2,Med,1,1,6,Med,0.038469,Yellow,1
1,"Minto Bridge Colony, Barakhamba, New Delhi, De...",Barakhamba,New Delhi,Non-Verbal Abuse,Quarter 1 : 12 Midnight to 6 AM,1 AM-2 AM,Low: <100 Incidents,Mid: 20-50 Incidents,3996,Ogling/Lewd Facial Expressions/Staring,...,4,2,Med,1,1,4,Med,0.038469,Yellow,1
2,"Minto Bridge Colony, Barakhamba, New Delhi, De...",Barakhamba,New Delhi,Physical Abuse,Quarter 1 : 12 Midnight to 6 AM,1 AM-2 AM,Low: <100 Incidents,Mid: 20-50 Incidents,4000,Touching /Groping,...,4,2,Med,1,1,15,High,0.038469,Yellow,1


In [345]:
delhi_crime_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5829 entries, 0 to 5828
Data columns (total 32 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Address                    5829 non-null   object        
 1   Area                       5829 non-null   object        
 2   City                       5829 non-null   object        
 3   Category                   5829 non-null   object        
 4   Day Quarter Group          5829 non-null   object        
 5   Day Time                   5829 non-null   object        
 6   Delhi Districts Cluster    5829 non-null   object        
 7   Delhi Incidents Cluster    5829 non-null   object        
 8   Id                         5829 non-null   int64         
 9   Incident Category          5829 non-null   object        
 10  Incident Date              5829 non-null   datetime64[ns]
 11  Incident Time              5829 non-null   datetime64[ns]
 12  Title 

#### Feature set selection

In [346]:
x_features =['Area','year','month','day','dayofweek','time_hour','dayofyear','Category'] 

In [347]:
delhi_crime_data[x_features]

Unnamed: 0,Area,year,month,day,dayofweek,time_hour,dayofyear,Category
0,Barakhamba,2014,4,19,5,1,109,Verbal Abuse
1,Barakhamba,2014,4,4,4,1,94,Non-Verbal Abuse
2,Barakhamba,2014,4,18,4,1,108,Physical Abuse
3,Barakhamba,2014,4,13,6,1,103,Non-Verbal Abuse
4,Barakhamba,2014,2,5,2,13,36,Verbal Abuse
...,...,...,...,...,...,...,...,...
5824,Gazipur,2014,1,9,3,9,9,Physical Abuse
5825,Roop Nagar,2016,8,15,0,21,228,Verbal Abuse
5826,Roop Nagar,2015,9,25,4,21,268,Verbal Abuse
5827,New Aruna Nagar,2011,8,23,1,9,235,Physical Abuse


Segregating the features into categorical and numerical

In [350]:
cat_vars = ['Area','year','month','day','dayofweek','time_hour','dayofyear','Category']  #'Category','dayofyear',

In [351]:
num_vars = list(set(x_features) - set(cat_vars))
num_vars

[]

#### Setting x and y variables

In [352]:
#setting X and Y:
X = delhi_crime_data[x_features]
y = delhi_crime_data['crime_code']

### Partitioning data to train and test sets

In [353]:
from sklearn.model_selection import train_test_split

In [354]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 21)

In [355]:
X_train.shape

(4371, 8)

In [356]:
X_test.shape

(1458, 8)

#### Handling data imbalance

1. Oversampling method

In [357]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
ros = RandomOverSampler()
print(Counter(y_train))
X_oversampled_train, y_oversampled_train = ros.fit_resample(X_train, y_train)
print(Counter(y_oversampled_train))

Counter({7: 1053, 6: 922, 15: 627, 4: 575, 1: 303, 13: 173, 3: 143, 5: 128, 14: 120, 2: 82, 11: 71, 17: 66, 8: 60, 12: 39, 9: 5, 10: 3, 16: 1})
Counter({7: 1053, 15: 1053, 1: 1053, 6: 1053, 4: 1053, 8: 1053, 13: 1053, 12: 1053, 3: 1053, 11: 1053, 17: 1053, 2: 1053, 5: 1053, 14: 1053, 10: 1053, 9: 1053, 16: 1053})


### Defining Transformations

#### Encoding of categorical variables

1. One hot encoding

In [358]:
from sklearn.preprocessing import OneHotEncoder

ohe_encoder = OneHotEncoder(handle_unknown='ignore')

2. Target encoding

In [359]:
from category_encoders import TargetEncoder

target_encoder = TargetEncoder() 

#### Standardization

In [360]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

## Defining models

#### 1. Decision Tree

In [361]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(max_depth=50, random_state=42)

#### 2. Random Forest

In [362]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 42)

#### 3. Gradient Boost

In [363]:
from sklearn.ensemble import GradientBoostingClassifier
grad_class = GradientBoostingClassifier(learning_rate=0.1,n_estimators = 10, random_state = 42)

#### 4. KNN

In [364]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)

#### 5. XGBoost

In [365]:
from xgboost.sklearn import XGBClassifier
params = { "n_estimators": 400,
           "max_depth": 5,
           #"objective": 'reg:squarederror',
           "colsample_bytree": 0.8,
           "subsample": 0.75,
          "verbosity":0
          # "lambda": 100
         }

xgb = XGBClassifier(**params)

#### 6. Logistic Regression

In [366]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state = 100,max_iter=100)

### Setting scoring metric

In [367]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score,f1_score,precision_score

### Creating pipelines

In [368]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [369]:
cat_transformer_ohe = Pipeline( steps = [('ohencoder', ohe_encoder)
                                     ])

In [370]:
cat_transformer_target = Pipeline( steps = [('tencoder', target_encoder)
                                     ])

In [371]:
num_transformer = Pipeline( steps = [('scaler', scaler)                            
                                     ])

In [372]:
preprocessor_ohe = ColumnTransformer(
    transformers=[('cat_ohe',cat_transformer_ohe,cat_vars)
                 , ('num',num_transformer,num_vars)
                 ])

In [373]:
preprocessor_target = ColumnTransformer(
    transformers=[('cat_target',cat_transformer_target,cat_vars)
                 , ('num',num_transformer,num_vars)
                 ])

In [374]:
scorer = make_scorer(accuracy_score)#, average='weighted')

## Decision Tree Experiments

### Pipeline 1.1

Oversampling > One Hot Encoding > Decision Tree > Grid Search > Final Model

In [375]:
dt_v1 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                        ('decisiontree', dtree)])

In [376]:
dt_v1.fit(X_oversampled_train, y_oversampled_train)

Grid Search

In [377]:
dt_params = { "decisiontree__max_depth":[5,7,8,10,12,14],
                  "decisiontree__criterion":['gini','entropy'],
             "decisiontree__min_samples_leaf":[1,2],
             "decisiontree__min_samples_split":[5,10]
                  }

In [378]:
dt_grid_v1 = GridSearchCV(dt_v1,
                           param_grid=dt_params,
                           cv = 5,
                           scoring = scorer,
                         error_score='raise'
                         )

In [379]:
dt_grid_v1.fit(X_oversampled_train, y_oversampled_train)

In [380]:
dt_grid_v1.best_params_

{'decisiontree__criterion': 'entropy',
 'decisiontree__max_depth': 14,
 'decisiontree__min_samples_leaf': 1,
 'decisiontree__min_samples_split': 5}

In [381]:
dt_grid_v1.best_score_

0.7626392864887465

In [382]:
dt_grid_results = pd.DataFrame( dt_grid_v1.cv_results_ )
dt_grid_results[['param_decisiontree__criterion','param_decisiontree__max_depth', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_decisiontree__criterion,param_decisiontree__max_depth,mean_test_score,std_test_score
0,gini,5,0.50824,0.0158
1,gini,5,0.508296,0.015822
2,gini,5,0.508128,0.015913
3,gini,5,0.508184,0.015936
4,gini,7,0.593151,0.003705
5,gini,7,0.593151,0.003792
6,gini,7,0.592816,0.003713
7,gini,7,0.592816,0.003805
8,gini,8,0.615663,0.009724
9,gini,8,0.615551,0.009739


Building the final Decision Tree Model

In [383]:
final_model_dt = DecisionTreeClassifier(max_depth=dt_grid_v1.best_params_['decisiontree__max_depth'],
                                criterion=dt_grid_v1.best_params_['decisiontree__criterion'],
                                 min_samples_leaf=dt_grid_v1.best_params_['decisiontree__min_samples_leaf'],
                                  min_samples_split= dt_grid_v1.best_params_['decisiontree__min_samples_split']      
                                )

dt_final = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                          ('decisiontree', final_model_dt)])

In [384]:
dt_final.fit(X_oversampled_train, y_oversampled_train)

In [385]:
dt_final.score(X_test, y_test)

0.4615912208504801

In [386]:
from sklearn.metrics import mean_squared_error

In [387]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, dt_final.predict(X_test)))
final_rmse_dt

3.8113972465742028

In [388]:
y_pred=dt_final.predict(X_test)
y_pred

array([17, 17, 12, ...,  1, 17,  6], dtype=int64)

In [389]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.4615912208504801
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.91      0.93      0.92       112
           3       0.41      0.60      0.49        45
           4       0.89      0.44      0.59       187
           5       0.31      0.48      0.38        52
           6       0.43      0.35      0.39       293
           7       0.55      0.42      0.47       366
           8       0.16      0.70      0.25        20
           9       0.50      1.00      0.67         1
          10       0.00      0.00      0.00         2
          11       0.44      0.35      0.39        23
          12       0.09      1.00      0.16         7
          13       0.38      0.67      0.49        54
          14       0.15      0.46      0.22        56
          15       0.83      0.31      0.45       189
          16       0.00      0.00      0.00         0

   micro avg       0.46      0.46    

### Pipeline 1.2

One Hot Encoding > Decision Tree > Grid Search > Final Model

In [390]:
dt_v2 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                        ('decisiontree', dtree)])

In [391]:
dt_v2.fit(X_train, y_train)

In [392]:
from sklearn import set_config
set_config(display='diagram') 
dt_v2

Grid Search

In [393]:
dt_params = { "decisiontree__max_depth":[5,7,8,10,12,14],
                  "decisiontree__criterion":['gini','entropy'],
             "decisiontree__min_samples_leaf":[1,2],
             "decisiontree__min_samples_split":[5,10]
                  }

In [394]:
dt_grid_v2 = GridSearchCV(dt_v2,
                           param_grid=dt_params,
                           cv = 5,
                           scoring = scorer)

In [395]:
dt_grid_v2.fit(X_train, y_train)

In [396]:
dt_grid_v2.best_params_

{'decisiontree__criterion': 'gini',
 'decisiontree__max_depth': 10,
 'decisiontree__min_samples_leaf': 1,
 'decisiontree__min_samples_split': 10}

In [397]:
dt_grid_v2.best_score_

0.6094726381170317

In [398]:
dt_grid_results = pd.DataFrame( dt_grid_v2.cv_results_ )
dt_grid_results[['param_decisiontree__criterion','param_decisiontree__max_depth', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_decisiontree__criterion,param_decisiontree__max_depth,mean_test_score,std_test_score
0,gini,5,0.602608,0.003576
1,gini,5,0.602608,0.003576
2,gini,5,0.602837,0.003839
3,gini,5,0.602837,0.003839
4,gini,7,0.605813,0.0065
5,gini,7,0.605813,0.005322
6,gini,7,0.60627,0.006356
7,gini,7,0.606042,0.005771
8,gini,8,0.605813,0.008942
9,gini,8,0.606728,0.008058


Building the final Decision Tree Model

In [399]:
final_model_dt = DecisionTreeClassifier(max_depth=dt_grid_v2.best_params_['decisiontree__max_depth'],
                                criterion=dt_grid_v2.best_params_['decisiontree__criterion'],
                                        min_samples_leaf=dt_grid_v2.best_params_['decisiontree__min_samples_leaf'],
                                  min_samples_split= dt_grid_v2.best_params_['decisiontree__min_samples_split'] 
                                )

dt_final = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                          ('decisiontree', final_model_dt)])

In [400]:
dt_final.fit(X_train, y_train)

In [401]:
dt_final.score(X_test, y_test)

0.620713305898491

In [402]:
from sklearn.metrics import mean_squared_error

In [403]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, dt_final.predict(X_test)))
final_rmse_dt

2.1109486617159945

In [404]:
y_pred=dt_final.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 15,  7], dtype=int64)

In [405]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.620713305898491
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.91      0.95      0.93       112
           2       0.71      0.58      0.64        26
           3       0.60      0.20      0.30        45
           4       0.71      0.96      0.82       187
           5       0.25      0.02      0.04        52
           6       0.52      0.23      0.31       293
           7       0.52      0.88      0.65       366
           8       0.50      0.05      0.09        20
           9       1.00      1.00      1.00         1
          10       0.00      0.00      0.00         2
          11       0.60      0.26      0.36        23
          12       1.00      0.43      0.60         7
          13       0.57      0.31      0.40        54
          14       0.00      0.00      0.00        56
          15       0.70      0.94      0.80       189
          16       0.00      0.00      

### Pipeline 1.3

Target Encoding > Decision Tree > Grid Search > Final Model

In [406]:
dt_v3 = Pipeline(steps=[('preprocessor', preprocessor_target),
                        ('decisiontree', dtree)])

In [407]:
dt_v3.fit(X_train, y_train)

In [408]:
from sklearn import set_config
set_config(display='diagram') 
dt_v3

Grid Search

In [409]:
from sklearn.model_selection import GridSearchCV

In [410]:
dt_params = { "decisiontree__max_depth":[5,7,8,10,12,14],
                  "decisiontree__criterion":['gini','entropy'],
             "decisiontree__min_samples_leaf":[1,2],
             "decisiontree__min_samples_split":[5,10]
                  }

In [411]:
dt_grid_v3 = GridSearchCV(dt_v3,
                           param_grid=dt_params,
                           cv = 5,
                           scoring = scorer)

In [412]:
dt_grid_v3.fit(X_train, y_train)

In [413]:
dt_grid_v3.best_params_

{'decisiontree__criterion': 'entropy',
 'decisiontree__max_depth': 10,
 'decisiontree__min_samples_leaf': 1,
 'decisiontree__min_samples_split': 10}

In [414]:
dt_grid_v3.best_score_

0.6090144491663942

In [415]:
dt_grid_results = pd.DataFrame( dt_grid_v3.cv_results_ )
dt_grid_results[['param_decisiontree__criterion','param_decisiontree__max_depth', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_decisiontree__criterion,param_decisiontree__max_depth,mean_test_score,std_test_score
0,gini,5,0.607868,0.005684
1,gini,5,0.607868,0.005684
2,gini,5,0.60764,0.005317
3,gini,5,0.60764,0.005317
4,gini,7,0.60421,0.007883
5,gini,7,0.603752,0.009741
6,gini,7,0.603067,0.007144
7,gini,7,0.603294,0.007873
8,gini,8,0.602606,0.005947
9,gini,8,0.602148,0.006119


Building the final Decision Tree Model

In [416]:
final_model_dt = DecisionTreeClassifier(max_depth=dt_grid_v3.best_params_['decisiontree__max_depth'],
                                criterion=dt_grid_v3.best_params_['decisiontree__criterion'],
                                        min_samples_leaf=dt_grid_v3.best_params_['decisiontree__min_samples_leaf'],
                                  min_samples_split= dt_grid_v3.best_params_['decisiontree__min_samples_split'] 
                                )

dt_final = Pipeline(steps=[('preprocessor', preprocessor_target),
                          ('decisiontree', final_model_dt)])

In [417]:
dt_final.fit(X_train, y_train)

In [418]:
dt_final.score(X_test, y_test)

0.6042524005486969

In [419]:
from sklearn.metrics import mean_squared_error

In [420]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, dt_final.predict(X_test)))
final_rmse_dt

2.311775782863294

In [421]:
y_pred=dt_final.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 15,  6], dtype=int64)

In [422]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.6042524005486969
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.95      0.93      0.94       112
           2       0.74      0.77      0.75        26
           3       0.32      0.51      0.40        45
           4       0.75      0.87      0.80       187
           5       0.39      0.21      0.28        52
           6       0.49      0.59      0.53       293
           7       0.57      0.52      0.54       366
           8       0.22      0.10      0.14        20
           9       0.50      1.00      0.67         1
          10       0.00      0.00      0.00         2
          11       0.41      0.30      0.35        23
          12       0.50      0.43      0.46         7
          13       0.48      0.39      0.43        54
          14       0.00      0.00      0.00        56
          15       0.71      0.87      0.78       189
          16       0.00      0.00     

#### Conclusion for Decision Tree Classifier:
The highest accuracy (62.07%) is achieved for model 2, i.e.

OneHot Encoding > Decision Tree > Grid Search > Final Model

## Random Forest Experiments

### Pipeline 2.1

Oversampling > One Hot Encoding > Random Forest > Grid Search > Final Model

In [423]:
rf_v1 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                        ('randomforest', rf)])

In [424]:
rf_v1.fit(X_oversampled_train, y_oversampled_train)

Grid Search

In [425]:
rf_params = { "randomforest__max_depth":[5,8,10,12,14],
             "randomforest__criterion":['gini','entropy'],
             "randomforest__n_estimators":[10,20,50,100]
                  }

In [426]:
rf_grid_v1 = GridSearchCV(rf_v1,
                           param_grid=rf_params,
                           cv = 5,
                           scoring = scorer)

In [427]:
rf_grid_v1.fit(X_oversampled_train, y_oversampled_train)

In [428]:
rf_grid_v1.best_params_

{'randomforest__criterion': 'entropy',
 'randomforest__max_depth': 14,
 'randomforest__n_estimators': 100}

In [429]:
rf_grid_v1.best_score_

0.8594491255056559

In [430]:
rf_grid_results = pd.DataFrame( rf_grid_v1.cv_results_ )
rf_grid_results[['param_randomforest__criterion','param_randomforest__max_depth', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_randomforest__criterion,param_randomforest__max_depth,mean_test_score,std_test_score
0,gini,5,0.581084,0.026893
1,gini,5,0.638958,0.00713
2,gini,5,0.677448,0.008121
3,gini,5,0.695995,0.003583
4,gini,8,0.690464,0.008831
5,gini,8,0.737723,0.013541
6,gini,8,0.769454,0.011866
7,gini,8,0.778225,0.006468
8,gini,10,0.741522,0.009818
9,gini,10,0.782861,0.003518


Building the final Random Forest Model

In [431]:
final_model_rf = RandomForestClassifier(n_estimators = rf_grid_v1.best_params_['randomforest__n_estimators'], 
                       criterion = rf_grid_v1.best_params_['randomforest__criterion'], 
                       max_depth = rf_grid_v1.best_params_['randomforest__max_depth']
                      )

rf_final = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                          ('randomforest', final_model_rf)])

In [432]:
rf_final.fit(X_oversampled_train, y_oversampled_train)

In [433]:
rf_final.score(X_test, y_test)

0.5459533607681756

In [434]:
from sklearn.metrics import mean_squared_error

In [435]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, rf_final.predict(X_test)))
final_rmse_dt

3.093756711483485

In [436]:
y_pred=rf_final.predict(X_test)
y_pred

array([17, 17,  4, ...,  1, 17,  6], dtype=int64)

In [437]:
rf_final.score(X_test,y_test)

0.5459533607681756

In [438]:
rf_final.score(X_train,y_train)

0.7613818348204072

In [439]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.5459533607681756
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.96      0.95      0.95       112
           2       0.79      0.85      0.81        26
           3       0.29      0.84      0.43        45
           4       0.84      0.59      0.70       187
           5       0.33      0.46      0.38        52
           6       0.54      0.48      0.51       293
           7       0.62      0.43      0.51       366
           8       0.23      0.45      0.30        20
           9       1.00      1.00      1.00         1
          10       0.00      0.00      0.00         2
          11       0.22      0.43      0.29        23
          12       0.21      0.57      0.31         7
          13       0.44      0.67      0.53        54
          14       0.21      0.41      0.27        56
          15       0.83      0.55      0.66       189
          16       0.00      0.00     

### Pipeline 2.2

One Hot Encoding > Random Forest > Grid Search > Final Model

In [591]:
rf_v2 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                        ('randomforest', rf)])

In [592]:
rf_v2.fit(X_train, y_train)

Grid Search

In [593]:
from sklearn.model_selection import GridSearchCV

In [594]:
rf_params = { "randomforest__max_depth":[5,8,10,12,14],
             "randomforest__criterion":['gini','entropy'],
             "randomforest__n_estimators":[10,20,50,100]
                  }

In [595]:
rf_grid_v2 = GridSearchCV(rf_v2,
                           param_grid=rf_params,
                           cv = 5,
                           scoring = scorer)

In [596]:
rf_grid_v2.fit(X_train, y_train)

In [597]:
rf_grid_v2.best_params_

{'randomforest__criterion': 'gini',
 'randomforest__max_depth': 14,
 'randomforest__n_estimators': 100}

In [598]:
rf_grid_v1.best_score_

0.8594491255056559

In [599]:
rf_grid_results = pd.DataFrame( rf_grid_v1.cv_results_ )
rf_grid_results[['param_randomforest__criterion','param_randomforest__max_depth', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_randomforest__criterion,param_randomforest__max_depth,mean_test_score,std_test_score
0,gini,5,0.581084,0.026893
1,gini,5,0.638958,0.00713
2,gini,5,0.677448,0.008121
3,gini,5,0.695995,0.003583
4,gini,8,0.690464,0.008831
5,gini,8,0.737723,0.013541
6,gini,8,0.769454,0.011866
7,gini,8,0.778225,0.006468
8,gini,10,0.741522,0.009818
9,gini,10,0.782861,0.003518


Building the final Random Forest Model

In [600]:
final_model_rf = RandomForestClassifier(n_estimators = rf_grid_v2.best_params_['randomforest__n_estimators'], 
                       criterion = rf_grid_v2.best_params_['randomforest__criterion'], 
                       max_depth = rf_grid_v2.best_params_['randomforest__max_depth']
                      )

rf_final2 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                          ('randomforest', final_model_rf)])

In [601]:
rf_final2.fit(X_train, y_train)

In [602]:
rf_final2.score(X_test, y_test)

0.6289437585733882

In [603]:
from sklearn.metrics import mean_squared_error

In [604]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, rf_final2.predict(X_test)))
final_rmse_dt

2.1637365435523748

In [605]:
y_pred=rf_final2.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 15,  6], dtype=int64)

In [606]:
rf_final2.score(X_test,y_test)

0.6289437585733882

In [607]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.6289437585733882
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.90      0.96      0.93       112
           2       0.79      0.58      0.67        26
           3       0.27      0.07      0.11        45
           4       0.70      1.00      0.82       187
           5       1.00      0.04      0.07        52
           6       0.54      0.37      0.44       293
           7       0.54      0.82      0.65       366
           8       1.00      0.05      0.10        20
           9       0.00      0.00      0.00         1
          10       0.00      0.00      0.00         2
          11       0.00      0.00      0.00        23
          12       0.00      0.00      0.00         7
          13       0.71      0.09      0.16        54
          14       0.00      0.00      0.00        56
          15       0.67      0.99      0.80       189
          16       0.00      0.00     

### Pipeline 2.3

Target Encoding > Random Forest > Grid Search > Final Model

In [608]:
rf_v3 = Pipeline(steps=[('preprocessor', preprocessor_target),
                        ('randomforest', rf)])

In [609]:
rf_v3.fit(X_train, y_train)

Grid Search

In [610]:
rf_params = { "randomforest__max_depth":[5,8,10,12,14],
             "randomforest__criterion":['gini','entropy'],
             "randomforest__n_estimators":[10,20,50,100]
                  }

In [611]:
rf_grid_v3 = GridSearchCV(rf_v3,
                           param_grid=rf_params,
                           cv = 5,
                           scoring = scorer)

In [612]:
rf_grid_v3.fit(X_train, y_train)

In [613]:
rf_grid_v3.best_params_

{'randomforest__criterion': 'gini',
 'randomforest__max_depth': 10,
 'randomforest__n_estimators': 50}

In [614]:
rf_grid_v1.best_score_

0.8594491255056559

In [615]:
rf_grid_results = pd.DataFrame( rf_grid_v1.cv_results_ )
rf_grid_results[['param_randomforest__criterion','param_randomforest__max_depth', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_randomforest__criterion,param_randomforest__max_depth,mean_test_score,std_test_score
0,gini,5,0.581084,0.026893
1,gini,5,0.638958,0.00713
2,gini,5,0.677448,0.008121
3,gini,5,0.695995,0.003583
4,gini,8,0.690464,0.008831
5,gini,8,0.737723,0.013541
6,gini,8,0.769454,0.011866
7,gini,8,0.778225,0.006468
8,gini,10,0.741522,0.009818
9,gini,10,0.782861,0.003518


Building the final Random Forest Model

In [616]:
final_model_rf = RandomForestClassifier(n_estimators = rf_grid_v3.best_params_['randomforest__n_estimators'], 
                       criterion = rf_grid_v3.best_params_['randomforest__criterion'], 
                       max_depth = rf_grid_v3.best_params_['randomforest__max_depth']
                      )

rf_final3 = Pipeline(steps=[('preprocessor', preprocessor_target),
                          ('randomforest', final_model_rf)])

In [617]:
rf_final3.fit(X_train, y_train)

In [618]:
rf_final3.score(X_test, y_test)

0.6406035665294925

In [619]:
from sklearn.metrics import mean_squared_error

In [620]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, rf_final3.predict(X_test)))
final_rmse_dt

2.1159787936423258

In [621]:
y_pred=rf_final3.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 15,  6], dtype=int64)

In [625]:
final_score=rf_final3.score(X_test,y_test)
final_score

0.6406035665294925

In [623]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.6406035665294925
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.91      0.96      0.93       112
           2       0.76      0.62      0.68        26
           3       0.56      0.33      0.42        45
           4       0.71      0.96      0.82       187
           5       0.44      0.08      0.13        52
           6       0.53      0.53      0.53       293
           7       0.58      0.70      0.63       366
           8       0.67      0.10      0.17        20
           9       0.00      0.00      0.00         1
          10       0.00      0.00      0.00         2
          11       0.20      0.04      0.07        23
          12       1.00      0.43      0.60         7
          13       0.55      0.31      0.40        54
          14       0.00      0.00      0.00        56
          15       0.71      0.94      0.81       189
          16       0.00      0.00     

#### Conclusion for Random Forest Classifier:
The highest accuracy (64.06%) is achieved for model 3, i.e.

Target Encoding > Random Forest > Grid Search > Final Model

## Gradient Boost Experimentss

### Pipeline 3.1

Oversampling > One Hot Encoding > Gradient Boost > Grid Search > Final Model

In [473]:
gb_v1 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                        ('gradientboost', grad_class)])

In [474]:
gb_v1.fit(X_oversampled_train, y_oversampled_train)

Grid Search

In [475]:
gb_params = { "gradientboost__learning_rate":[0.9,0.5,0.1],
              "gradientboost__n_estimators":[50,100]}

In [476]:
gb_grid_v1 = GridSearchCV(gb_v1,
                           param_grid=gb_params,
                           cv = 5,
                           scoring = scorer)

In [477]:
gb_grid_v1.fit(X_oversampled_train, y_oversampled_train)

In [478]:
gb_grid_v1.best_params_

{'gradientboost__learning_rate': 0.5, 'gradientboost__n_estimators': 100}

In [479]:
gb_grid_v1.best_score_

0.8355953129412059

In [480]:
gb_grid_results = pd.DataFrame( gb_grid_v1.cv_results_ )
gb_grid_results[['param_gradientboost__learning_rate','param_gradientboost__n_estimators', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_gradientboost__learning_rate,param_gradientboost__n_estimators,mean_test_score,std_test_score
0,0.9,50,0.262613,0.222247
1,0.9,100,0.262948,0.222894
2,0.5,50,0.825149,0.03341
3,0.5,100,0.835595,0.040701
4,0.1,50,0.762806,0.005596
5,0.1,100,0.80219,0.004251


Building the final Gradient Boost Model

In [481]:
final_model_gb = GradientBoostingClassifier(n_estimators=gb_grid_v1.best_params_['gradientboost__n_estimators'], 
                                learning_rate=gb_grid_v1.best_params_['gradientboost__learning_rate'])

gb_final1 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                          ('gradientBoost', final_model_gb)])

In [482]:
gb_final1.fit(X_oversampled_train, y_oversampled_train)

In [483]:
gb_final1.score(X_test, y_test)

0.5672153635116598

In [484]:
from sklearn.metrics import mean_squared_error

In [485]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, gb_final1.predict(X_test)))
final_rmse_dt

3.0421140941564913

In [486]:
y_pred=gb_final1.predict(X_test)
y_pred

array([17, 17, 12, ...,  1, 17,  6], dtype=int64)

In [487]:
final_score = gb_final1.score(X_test,y_test)
final_score

0.5672153635116598

In [488]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.5672153635116598
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.91      0.95      0.93       112
           2       0.60      0.69      0.64        26
           3       0.52      0.67      0.58        45
           4       0.81      0.66      0.73       187
           5       0.34      0.37      0.35        52
           6       0.51      0.47      0.49       293
           7       0.62      0.55      0.58       366
           8       0.18      0.20      0.19        20
           9       1.00      1.00      1.00         1
          10       0.00      0.00      0.00         2
          11       0.26      0.52      0.35        23
          12       0.21      0.43      0.29         7
          13       0.42      0.63      0.51        54
          14       0.21      0.39      0.27        56
          15       0.80      0.54      0.65       189
          16       0.00      0.00     

### Pipeline 3.2

One Hot Encoding > Gradient Boost > Grid Search > Final Model

In [489]:
gb_v2 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                        ('gradientboost', grad_class)])

In [490]:
gb_v2.fit(X_train, y_train)

Grid Search

In [491]:
from sklearn.model_selection import GridSearchCV

In [492]:
gb_params = { "gradientboost__learning_rate":[0.9,0.5,0.1],
              "gradientboost__n_estimators":[50,100]}

In [493]:
gb_grid_v2 = GridSearchCV(gb_v2,
                           param_grid=gb_params,
                           cv = 5,
                           scoring = scorer)

In [494]:
gb_grid_v2.fit(X_train, y_train)

In [495]:
gb_grid_v2.best_params_

{'gradientboost__learning_rate': 0.1, 'gradientboost__n_estimators': 100}

In [496]:
gb_grid_v2.best_score_

0.6177103628636809

In [497]:
gb_grid_results = pd.DataFrame( gb_grid_v1.cv_results_ )
gb_grid_results[['param_gradientboost__learning_rate','param_gradientboost__n_estimators', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_gradientboost__learning_rate,param_gradientboost__n_estimators,mean_test_score,std_test_score
0,0.9,50,0.262613,0.222247
1,0.9,100,0.262948,0.222894
2,0.5,50,0.825149,0.03341
3,0.5,100,0.835595,0.040701
4,0.1,50,0.762806,0.005596
5,0.1,100,0.80219,0.004251


Building the final Gradient Boost Model

In [498]:
final_model_gb = GradientBoostingClassifier(n_estimators=gb_grid_v2.best_params_['gradientboost__n_estimators'], 
                                learning_rate=gb_grid_v2.best_params_['gradientboost__learning_rate'])

gb_final2 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                          ('randomforest', final_model_gb)])

In [499]:
gb_final2.fit(X_train, y_train)

In [500]:
gb_final2.score(X_test, y_test)

0.6172839506172839

In [501]:
from sklearn.metrics import mean_squared_error

In [502]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, gb_final2.predict(X_test)))
final_rmse_dt

2.166745804189445

In [503]:
y_pred=gb_final2.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 17,  6], dtype=int64)

In [504]:
gb_final2.score(X_test,y_test)

0.6172839506172839

In [505]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.6172839506172839
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.90      0.96      0.93       112
           2       0.79      0.58      0.67        26
           3       0.43      0.22      0.29        45
           4       0.73      0.96      0.83       187
           5       0.50      0.06      0.10        52
           6       0.48      0.48      0.48       293
           7       0.56      0.67      0.61       366
           8       0.33      0.05      0.09        20
           9       0.00      0.00      0.00         1
          10       0.00      0.00      0.00         2
          11       0.18      0.09      0.12        23
          12       0.50      0.43      0.46         7
          13       0.50      0.26      0.34        54
          14       0.29      0.04      0.06        56
          15       0.70      0.94      0.80       189
          16       0.00      0.00     

### Pipeline 3.3

Target Encoding > Gradient Boost > Grid Search > Final Model

In [506]:
gb_v3 = Pipeline(steps=[('preprocessor', preprocessor_target),
                        ('gradientboost', grad_class)])

In [507]:
gb_v3.fit(X_train, y_train)

Grid Search

In [508]:
from sklearn.model_selection import GridSearchCV

In [509]:
gb_params = { "gradientboost__learning_rate":[0.9,0.5,0.1],
              "gradientboost__n_estimators":[50,100]}

In [510]:
gb_grid_v3 = GridSearchCV(gb_v3,
                           param_grid=gb_params,
                           cv = 5,
                           scoring = scorer)

In [511]:
gb_grid_v3.fit(X_train, y_train)

In [512]:
gb_grid_v3.best_params_

{'gradientboost__learning_rate': 0.1, 'gradientboost__n_estimators': 100}

In [513]:
gb_grid_v3.best_score_

0.6170215102974829

In [514]:
gb_grid_results = pd.DataFrame( gb_grid_v1.cv_results_ )
gb_grid_results[['param_gradientboost__learning_rate','param_gradientboost__n_estimators', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_gradientboost__learning_rate,param_gradientboost__n_estimators,mean_test_score,std_test_score
0,0.9,50,0.262613,0.222247
1,0.9,100,0.262948,0.222894
2,0.5,50,0.825149,0.03341
3,0.5,100,0.835595,0.040701
4,0.1,50,0.762806,0.005596
5,0.1,100,0.80219,0.004251


Building the final Gradient Boost Model

In [515]:
final_model_gb = GradientBoostingClassifier(n_estimators=gb_grid_v3.best_params_['gradientboost__n_estimators'], 
                                learning_rate=gb_grid_v3.best_params_['gradientboost__learning_rate'])

gb_final3 = Pipeline(steps=[('preprocessor', preprocessor_target),
                          ('gradientboost', final_model_gb)])

In [516]:
gb_final3.fit(X_train, y_train)

In [517]:
gb_final3.score(X_test, y_test)

0.6330589849108368

In [518]:
from sklearn.metrics import mean_squared_error

In [519]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, gb_final3.predict(X_test)))
final_rmse_dt

2.2249982660556427

In [520]:
y_pred=gb_final3.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 15,  6], dtype=int64)

In [521]:
gb_final3.score(X_test,y_test)

0.6330589849108368

In [522]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.6330589849108368
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.96      0.93      0.95       112
           2       0.81      0.85      0.83        26
           3       0.45      0.33      0.38        45
           4       0.71      0.91      0.80       187
           5       0.32      0.12      0.17        52
           6       0.51      0.56      0.53       293
           7       0.59      0.66      0.63       366
           8       0.50      0.10      0.17        20
           9       0.25      1.00      0.40         1
          10       0.00      0.00      0.00         2
          11       0.25      0.09      0.13        23
          12       0.75      0.43      0.55         7
          13       0.50      0.26      0.34        54
          14       0.00      0.00      0.00        56
          15       0.69      0.94      0.80       189
          16       0.00      0.00     

#### Conclusion for Gradient Boost Classifier:
The highest accuracy (63.3%) is achieved for model 3, i.e.

Target Encoding > Gradient Boost > Grid Search > Final Model

## KNN Experiments

### Pipeline 4.1

Oversampling > One Hot Encoding > KNN > Grid Search > Final Model

In [523]:
knn_v1 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                        ('knn', knn)])

In [524]:
knn_v1.fit(X_oversampled_train, y_oversampled_train)

Grid Search

In [525]:
from sklearn.model_selection import GridSearchCV

In [526]:
knn_params = { "knn__n_neighbors": [5, 10, 15, 20, 25],
               "knn__weights": ['uniform', 'distance'],
               "knn__metric": ['minkowski', 'euclidean']}

In [527]:
knn_grid_v1 = GridSearchCV(knn_v1,
                           param_grid=knn_params,
                           cv = 10,
                           scoring = scorer,
                           )

In [528]:
knn_grid_v1.fit(X_oversampled_train, y_oversampled_train)

In [529]:
knn_grid_v1.best_params_

{'knn__metric': 'minkowski',
 'knn__n_neighbors': 10,
 'knn__weights': 'distance'}

In [530]:
knn_grid_results = pd.DataFrame( knn_grid_v1.cv_results_ )
knn_grid_results[['param_knn__n_neighbors', 'param_knn__weights', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_knn__n_neighbors,param_knn__weights,mean_test_score,std_test_score
0,5,uniform,0.83068,0.008589
1,5,distance,0.861014,0.012897
2,10,uniform,0.816268,0.00953
3,10,distance,0.864757,0.014292
4,15,uniform,0.790291,0.007468
5,15,distance,0.863025,0.014815
6,20,uniform,0.763756,0.005809
7,20,distance,0.862019,0.015167
8,25,uniform,0.742249,0.008398
9,25,distance,0.862299,0.015799


Building the final KNN Model

In [531]:
final_model_knn = KNeighborsClassifier(n_neighbors = knn_grid_v1.best_params_['knn__n_neighbors'], 
                                  weights = knn_grid_v1.best_params_['knn__weights'], 
                                  metric = knn_grid_v1.best_params_['knn__metric'])
knn_final = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                          ('knn', final_model_knn)])

In [532]:
knn_final.fit(X_oversampled_train, y_oversampled_train)

In [533]:
knn_final.score(X_test, y_test)

0.47050754458161864

In [534]:
from sklearn.metrics import mean_squared_error

In [535]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, knn_final.predict(X_test)))
final_rmse_dt

3.9673738000955368

In [536]:
y_pred=knn_final.predict(X_test)
y_pred

array([17, 17,  4, ...,  1, 17,  6], dtype=int64)

In [537]:
knn_final.score(X_test,y_test)

0.47050754458161864

In [538]:
knn_final.score(X_train,y_train)

0.8307023564401739

In [539]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]))  

ACCURACY OF THE MODEL:  0.47050754458161864
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.88      0.75      0.81       112
           2       0.64      0.81      0.71        26
           3       0.32      0.82      0.46        45
           4       0.82      0.52      0.64       187
           5       0.29      0.52      0.37        52
           6       0.47      0.39      0.42       293
           7       0.63      0.39      0.48       366
           8       0.18      0.35      0.24        20
           9       0.00      0.00      0.00         1
          10       0.00      0.00      0.00         2
          11       0.17      0.39      0.23        23
          12       0.15      0.57      0.24         7
          13       0.33      0.52      0.40        54
          14       0.17      0.48      0.25        56
          15       0.76      0.41      0.53       189
          16       0.00      0.00    

### Pipeline 4.2

One Hot Encoding > KNN > Grid Search > Final Model

In [540]:
knn_v2 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                        ('knn', knn)])

In [541]:
knn_v2.fit(X_train, y_train)

Grid Search

In [542]:
from sklearn.model_selection import GridSearchCV

In [543]:
knn_params = { "knn__n_neighbors": [5, 10, 15, 20, 25],
               "knn__weights": ['uniform', 'distance'],
               "knn__metric": ['minkowski', 'euclidean']}

In [544]:
knn_grid_v2 = GridSearchCV(knn_v2,
                           param_grid=knn_params,
                           cv = 10,
                           scoring = scorer)

In [545]:
knn_grid_v2.fit(X_train, y_train)

In [546]:
knn_grid_v2.best_params_

{'knn__metric': 'minkowski',
 'knn__n_neighbors': 25,
 'knn__weights': 'distance'}

In [547]:
knn_grid_v2.best_score_

0.5776715463465095

In [548]:
knn_grid_results = pd.DataFrame( knn_grid_v2.cv_results_ )
knn_grid_results[['param_knn__n_neighbors', 'param_knn__weights', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_knn__n_neighbors,param_knn__weights,mean_test_score,std_test_score
0,5,uniform,0.554567,0.019665
1,5,distance,0.564858,0.016078
2,10,uniform,0.552968,0.027091
3,10,distance,0.577445,0.026506
4,15,uniform,0.548844,0.016451
5,15,distance,0.574012,0.014388
6,20,uniform,0.547014,0.02309
7,20,distance,0.574468,0.015509
8,25,uniform,0.547014,0.0174
9,25,distance,0.577672,0.016744


Building the final KNN Model

In [549]:
final_model_knn = KNeighborsClassifier(n_neighbors = knn_grid_v2.best_params_['knn__n_neighbors'], 
                                  weights = knn_grid_v2.best_params_['knn__weights'], 
                                  metric = knn_grid_v2.best_params_['knn__metric'])
knn_final = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                          ('knn', final_model_knn)])

In [550]:
knn_final.fit(X_train, y_train)

In [551]:
knn_final.score(X_test, y_test)

0.5747599451303155

In [552]:
from sklearn.metrics import mean_squared_error

In [553]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, knn_final.predict(X_test)))
final_rmse_dt

2.563193096069164

In [554]:
y_pred=knn_final.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 15,  6], dtype=int64)

In [555]:
knn_final.score(X_test,y_test)

0.5747599451303155

In [556]:
knn_final.score(X_train,y_train)

0.8796614047128803

In [557]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.5747599451303155
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.89      0.79      0.83       112
           2       0.71      0.58      0.64        26
           3       0.39      0.29      0.33        45
           4       0.70      0.83      0.76       187
           5       0.43      0.06      0.10        52
           6       0.43      0.55      0.49       293
           7       0.50      0.59      0.54       366
           8       0.40      0.10      0.16        20
           9       0.00      0.00      0.00         1
          10       0.00      0.00      0.00         2
          11       1.00      0.22      0.36        23
          12       1.00      0.43      0.60         7
          13       0.56      0.19      0.28        54
          14       0.60      0.05      0.10        56
          15       0.69      0.87      0.77       189
          16       0.00      0.00     

### Pipeline 4.3

Target Encoding > KNN > Grid Search > Final Model

In [558]:
knn_v3 = Pipeline(steps=[('preprocessor', preprocessor_target),
                        ('knn', knn)])

In [559]:
knn_v3.fit(X_train, y_train)

Grid Search

In [560]:
from sklearn.model_selection import GridSearchCV

In [561]:
knn_params = { "knn__n_neighbors": [5, 10, 15, 20, 25],
               "knn__weights": ['uniform', 'distance'],
               "knn__metric": ['minkowski', 'euclidean']}

In [562]:
knn_grid_v3 = GridSearchCV(knn_v3,
                           param_grid=knn_params,
                           cv = 10,
                           scoring = 'precision',
                           refit=True ,
                           n_jobs = -1)

In [563]:
knn_grid_v3.fit(X_train, y_train)

In [564]:
knn_grid_v3.best_params_

{'knn__metric': 'minkowski', 'knn__n_neighbors': 5, 'knn__weights': 'uniform'}

In [565]:
knn_grid_v3.best_score_

nan

In [566]:
knn_grid_results = pd.DataFrame( knn_grid_v3.cv_results_ )
knn_grid_results[['param_knn__n_neighbors', 'param_knn__weights', 'mean_test_score', 'std_test_score']]

Unnamed: 0,param_knn__n_neighbors,param_knn__weights,mean_test_score,std_test_score
0,5,uniform,,
1,5,distance,,
2,10,uniform,,
3,10,distance,,
4,15,uniform,,
5,15,distance,,
6,20,uniform,,
7,20,distance,,
8,25,uniform,,
9,25,distance,,


Building the final KNN Model

In [567]:
final_model_knn = KNeighborsClassifier(n_neighbors = knn_grid_v3.best_params_['knn__n_neighbors'], 
                                  weights = knn_grid_v3.best_params_['knn__weights'], 
                                  metric = knn_grid_v3.best_params_['knn__metric'])
knn_final = Pipeline(steps=[('preprocessor', preprocessor_target),
                          ('knn', final_model_knn)])

In [568]:
knn_final.fit(X_train, y_train)

In [569]:
knn_final.score(X_test, y_test)

0.5294924554183813

In [570]:
from sklearn.metrics import mean_squared_error

In [571]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, knn_final.predict(X_test)))
final_rmse_dt

2.463450244194947

In [572]:
y_pred=knn_final.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 15,  7], dtype=int64)

In [573]:
knn_final.score(X_test,y_test)

0.5294924554183813

In [574]:
knn_final.score(X_train,y_train)

0.6607183710821323

In [575]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.5294924554183813
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.89      0.71      0.79       112
           2       0.64      0.54      0.58        26
           3       0.20      0.29      0.24        45
           4       0.58      0.64      0.61       187
           5       0.25      0.08      0.12        52
           6       0.43      0.51      0.46       293
           7       0.49      0.57      0.53       366
           8       0.30      0.15      0.20        20
           9       0.00      0.00      0.00         1
          10       0.00      0.00      0.00         2
          11       0.62      0.22      0.32        23
          12       1.00      0.43      0.60         7
          13       0.42      0.35      0.38        54
          14       0.00      0.00      0.00        56
          15       0.72      0.83      0.77       189
          16       0.00      0.00     

#### Conclusion for KNN Classifier:
The highest accuracy (57.47%) is achieved for model 2, i.e.

One hot Encoding > KNN > Grid Search > Final Model

## XGBoost Experiments

### Pipeline 5.1

Oversampling > One Hot Encoding > XGBoost > Grid Search > Final Model

In [590]:
xgb_v1 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                        ('xgb', xgb)])

In [591]:
xgb_v1.fit(X_oversampled_train, y_oversampled_train)

Grid Search

In [592]:
xgb_params = { "xgb__n_estimators": [100,200,300,400,500],
               "xgb__max_depth": [3,4,5,6,7],
              #"objective": 'reg:squarederror',
               "xgb__colsample_bytree": [0.5,0.6,7],
               "xgb__subsample": [0.75,0.7]
              #"lambda": 100
           }

In [593]:
xgb_grid_v1 = GridSearchCV(xgb_v1,
                           param_grid=xgb_params,
                           cv = 5,
                           scoring = scorer)

In [594]:
xgb_grid_v1.fit(X_oversampled_train, y_oversampled_train)

In [595]:
xgb_grid_v1.best_params_

{'xgb__colsample_bytree': 0.6,
 'xgb__max_depth': 3,
 'xgb__n_estimators': 100,
 'xgb__subsample': 0.7}

In [596]:
xgb_grid_v1.best_score_

0.9515817571182849

In [597]:
xgb_grid_results = pd.DataFrame( xgb_grid_v1.cv_results_ )
#xgb_grid_results[['mean_fit_time','mean_score_time','params']

In [598]:
xgb_grid_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_xgb__colsample_bytree,param_xgb__max_depth,param_xgb__n_estimators,param_xgb__subsample,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,1.057906,0.035263,0.036141,0.002711,0.5,3,100,0.75,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth...",0.952025,0.945744,0.949120,0.952737,0.956354,0.951196,0.003569,9
1,1.043346,0.020827,0.038490,0.002904,0.5,3,100,0.7,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth...",0.951302,0.945503,0.948638,0.953219,0.957077,0.951148,0.003941,10
2,1.965008,0.026175,0.027484,0.003308,0.5,3,200,0.75,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth...",0.950579,0.945503,0.950567,0.951290,0.955389,0.950666,0.003143,48
3,1.991378,0.016226,0.031277,0.004699,0.5,3,200,0.7,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth...",0.951784,0.945985,0.952014,0.951290,0.955389,0.951292,0.003024,2
4,2.897959,0.030725,0.033764,0.006245,0.5,3,300,0.75,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth...",0.951302,0.944779,0.951772,0.949843,0.954666,0.950473,0.003249,70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,0.087099,0.000966,0.000000,0.000000,7,7,300,0.7,"{'xgb__colsample_bytree': 7, 'xgb__max_depth':...",,,,,,,,120
146,0.085029,0.005928,0.000000,0.000000,7,7,400,0.75,"{'xgb__colsample_bytree': 7, 'xgb__max_depth':...",,,,,,,,111
147,0.080819,0.007055,0.000000,0.000000,7,7,400,0.7,"{'xgb__colsample_bytree': 7, 'xgb__max_depth':...",,,,,,,,122
148,0.080512,0.003575,0.000000,0.000000,7,7,500,0.75,"{'xgb__colsample_bytree': 7, 'xgb__max_depth':...",,,,,,,,127


Building the final XGBoost Model

In [599]:
final_model_xgb = XGBClassifier(n_estimators = xgb_grid_v1.best_params_['xgb__n_estimators'], 
                                      max_depth = xgb_grid_v1.best_params_['xgb__max_depth'],
                                      colsample_bytree = xgb_grid_v1.best_params_['xgb__colsample_bytree'],
                                      subsample = xgb_grid_v1.best_params_['xgb__subsample'])
                                       
xgb_final = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                          ('XGBoost', final_model_xgb)])

In [600]:
xgb_final.fit(X_oversampled_train, y_oversampled_train)

In [601]:
xgb_final.score(X_test, y_test)

0.958842705786471

In [602]:
from sklearn.metrics import mean_squared_error

In [603]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, xgb_final.predict(X_test)))
final_rmse_dt

0.20287260587257447

In [604]:
y_pred=xgb_final.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [605]:
xgb_final.score(X_test,y_test)

0.958842705786471

In [606]:
xgb_final.score(X_train,y_train)

0.9639364303178484

In [607]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]))  

ACCURACY OF THE MODEL:  0.958842705786471
              precision    recall  f1-score   support

           0       0.96      0.98      0.97      3435
           1       0.95      0.91      0.93      1473

    accuracy                           0.96      4908
   macro avg       0.96      0.94      0.95      4908
weighted avg       0.96      0.96      0.96      4908



### Pipeline 5.2

One Hot Encoding > XGBoost > Grid Search > Final Model

In [729]:
xgb_v2 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                        ('xgb', xgb)])

In [730]:
xgb_v2.fit(X_train, y_train)

Grid Search

In [731]:
xgb_params = { "xgb__n_estimators": [100,200,300,400,500],
               "xgb__max_depth": [3,4,5,6,7],
              #"objective": 'reg:squarederror',
               "xgb__colsample_bytree": [0.5,0.6,7],
               "xgb__subsample": [0.75,0.7]
              #"lambda": 100
           }

In [732]:
xgb_grid_v2 = GridSearchCV(xgb_v2,
                           param_grid=xgb_params,
                           cv = 5,
                           scoring = scorer)

In [733]:
xgb_grid_v2.fit(X_train, y_train)

In [734]:
xgb_grid_v2.best_params_

{'xgb__colsample_bytree': 0.5,
 'xgb__max_depth': 3,
 'xgb__n_estimators': 100,
 'xgb__subsample': 0.75}

In [735]:
xgb_grid_v2.best_score_

0.6140498202026807

In [736]:
xgb_grid_results = pd.DataFrame( xgb_grid_v2.cv_results_ )
xgb_grid_results[['mean_fit_time','mean_score_time','params']]

Unnamed: 0,mean_fit_time,mean_score_time,params
0,3.784076,0.040502,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
1,3.298417,0.036734,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
2,7.050893,0.053694,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
3,7.748960,0.059053,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
4,11.675990,0.136328,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
...,...,...,...
145,0.040777,0.000000,"{'xgb__colsample_bytree': 7, 'xgb__max_depth':..."
146,0.043997,0.000000,"{'xgb__colsample_bytree': 7, 'xgb__max_depth':..."
147,0.043902,0.000000,"{'xgb__colsample_bytree': 7, 'xgb__max_depth':..."
148,0.041380,0.000000,"{'xgb__colsample_bytree': 7, 'xgb__max_depth':..."


Building the final XGBoost Model

In [737]:
final_model_xgb = XGBClassifier(n_estimators = xgb_grid_v2.best_params_['xgb__n_estimators'], 
                                      max_depth = xgb_grid_v2.best_params_['xgb__max_depth'],
                                      colsample_bytree = xgb_grid_v2.best_params_['xgb__colsample_bytree'],
                                      subsample = xgb_grid_v2.best_params_['xgb__subsample'])
                                       
xgb_final = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                          ('XGBoost', final_model_xgb)])

In [738]:
xgb_final.fit(X_train, y_train)

In [739]:
xgb_final.score(X_test, y_test)

0.6213991769547325

In [740]:
from sklearn.metrics import mean_squared_error

In [741]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, xgb_final.predict(X_test)))
final_rmse_dt

2.2609283452085234

In [742]:
y_pred=xgb_final.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 15,  7], dtype=int64)

In [743]:
xgb_final.score(X_test,y_test)

0.6213991769547325

In [744]:
xgb_final.score(X_train,y_train)

0.7311827956989247

In [745]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.6213991769547325
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.92      0.95      0.93       112
           2       0.74      0.65      0.69        26
           3       0.38      0.33      0.36        45
           4       0.73      0.89      0.80       187
           5       0.33      0.13      0.19        52
           6       0.50      0.53      0.51       293
           7       0.58      0.64      0.61       366
           8       0.33      0.10      0.15        20
           9       1.00      1.00      1.00         1
          10       0.00      0.00      0.00         2
          11       0.54      0.30      0.39        23
          12       0.43      0.43      0.43         7
          13       0.50      0.31      0.39        54
          14       0.00      0.00      0.00        56
          15       0.71      0.92      0.80       189
          16       0.00      0.00     

### Creating pipeline 5.3

Target Encoding > XGBoost > Grid Search > Final Model

In [644]:
xgb_v3 = Pipeline(steps=[('preprocessor', preprocessor_target),
                        ('xgb', xgb)])

In [645]:
xgb_v3.fit(X_train, y_train)

Grid Search

In [646]:
from sklearn.model_selection import GridSearchCV

In [647]:
xgb_params = { "xgb__n_estimators": [100,200], #,300,400,500
               "xgb__max_depth": [3,4], #,5,6,7
              #"objective": 'reg:squarederror',
               "xgb__colsample_bytree": [0.5,0.6,7],
               "xgb__subsample": [0.75,0.7]
              #"lambda": 100
           }

In [648]:
xgb_grid_v3 = GridSearchCV(xgb_v3,
                           param_grid=xgb_params,
                           cv = 5,
                           scoring = scorer)

In [649]:
xgb_grid_v3.fit(X_train, y_train)

In [650]:
xgb_grid_v3.best_params_

{'xgb__colsample_bytree': 0.5,
 'xgb__max_depth': 3,
 'xgb__n_estimators': 100,
 'xgb__subsample': 0.75}

In [651]:
xgb_grid_v3.best_score_

0.6181661981039556

In [652]:
xgb_grid_results = pd.DataFrame( xgb_grid_v3.cv_results_ )
xgb_grid_results[['mean_fit_time','mean_score_time','params']]

Unnamed: 0,mean_fit_time,mean_score_time,params
0,3.301974,0.028736,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
1,3.999989,0.030775,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
2,7.728124,0.042947,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
3,8.057145,0.038427,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
4,4.752864,0.036288,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
5,4.86644,0.03555,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
6,9.697691,0.058471,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
7,9.691234,0.05908,"{'xgb__colsample_bytree': 0.5, 'xgb__max_depth..."
8,4.12287,0.031396,"{'xgb__colsample_bytree': 0.6, 'xgb__max_depth..."
9,4.386406,0.030905,"{'xgb__colsample_bytree': 0.6, 'xgb__max_depth..."


Building the final XGBoost Model

In [653]:
final_model_xgb = XGBClassifier(n_estimators = xgb_grid_v3.best_params_['xgb__n_estimators'], 
                                      max_depth = xgb_grid_v3.best_params_['xgb__max_depth'],
                                      colsample_bytree = xgb_grid_v3.best_params_['xgb__colsample_bytree'],
                                      subsample = xgb_grid_v3.best_params_['xgb__subsample'])
                                       
xgb_final = Pipeline(steps=[('preprocessor', preprocessor_target),
                          ('XGBoost', final_model_xgb)])

In [654]:
xgb_final.fit(X_train, y_train)

In [655]:
xgb_final.score(X_test, y_test)

0.6241426611796982

In [656]:
from sklearn.metrics import mean_squared_error

In [657]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, xgb_final.predict(X_test)))
final_rmse_dt

2.2034702646355497

In [658]:
y_pred=xgb_final.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 15,  6], dtype=int64)

In [659]:
xgb_final.score(X_test,y_test)

0.6241426611796982

In [660]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.6241426611796982
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.93      0.96      0.94       112
           2       0.78      0.69      0.73        26
           3       0.47      0.40      0.43        45
           4       0.74      0.88      0.80       187
           5       0.36      0.15      0.22        52
           6       0.50      0.56      0.52       293
           7       0.58      0.63      0.61       366
           8       0.43      0.15      0.22        20
           9       1.00      1.00      1.00         1
          10       0.00      0.00      0.00         2
          11       0.39      0.30      0.34        23
          12       0.43      0.43      0.43         7
          13       0.45      0.28      0.34        54
          14       0.00      0.00      0.00        56
          15       0.70      0.91      0.79       189
          16       0.00      0.00     

#### Conclusion for XGBoost Classifier:
The highest accuracy (62.4%) is achieved for model 3, i.e.

OneHot Encoding > XGBoost > Grid Search > Final Model

## AdaBoost Experiments

In [691]:
from sklearn.ensemble import AdaBoostClassifier

In [692]:
ada = AdaBoostClassifier(n_estimators=400,learning_rate=1,algorithm='SAMME')

One Hot Encoding

In [713]:
ada_v2 = Pipeline(steps=[('preprocessor', preprocessor_ohe),
                        ('AdaBoost', ada)])

In [714]:
ada_v2.fit(X_train, y_train)

Grid Search

In [715]:
from sklearn.model_selection import GridSearchCV

In [716]:
ada_params = { "AdaBoost__n_estimators": [300,400], #,300,400,500
               "AdaBoost__learning_rate": [1,2], 
               "AdaBoost__algorithm": ['SAMME']
              #"lambda": 100
           }

In [717]:
ada_grid_v2 = GridSearchCV(ada_v2,
                           param_grid=ada_params,
                           cv = 5,
                           scoring = scorer)

In [718]:
ada_grid_v2.fit(X_train, y_train)

In [719]:
ada_grid_v2.best_params_

{'AdaBoost__algorithm': 'SAMME',
 'AdaBoost__learning_rate': 1,
 'AdaBoost__n_estimators': 300}

In [720]:
ada_grid_v2.best_score_

0.5623374959136973

In [721]:
ada_grid_results = pd.DataFrame( ada_grid_v2.cv_results_ )
ada_grid_results[['mean_fit_time','mean_score_time','params']]

Unnamed: 0,mean_fit_time,mean_score_time,params
0,1.089254,0.14463,"{'AdaBoost__algorithm': 'SAMME', 'AdaBoost__le..."
1,1.429726,0.17437,"{'AdaBoost__algorithm': 'SAMME', 'AdaBoost__le..."
2,1.036894,0.115283,"{'AdaBoost__algorithm': 'SAMME', 'AdaBoost__le..."
3,1.303534,0.172749,"{'AdaBoost__algorithm': 'SAMME', 'AdaBoost__le..."


Building the final AdaBoost Model

In [722]:
final_model_ada = AdaBoostClassifier(algorithm = ada_grid_v2.best_params_['AdaBoost__algorithm'], 
                                      learning_rate = ada_grid_v2.best_params_['AdaBoost__learning_rate'],
                                      n_estimators = ada_grid_v2.best_params_['AdaBoost__n_estimators'])
                                     
                                       
ada_final2 = Pipeline(steps=[('preprocessor', preprocessor_target),
                          ('AdaBoost', final_model_ada)])

In [723]:
ada_final2.fit(X_train, y_train)

In [724]:
ada_final2.score(X_test, y_test)

0.5836762688614541

In [707]:
from sklearn.metrics import mean_squared_error

In [725]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, ada_final2.predict(X_test)))
final_rmse_dt

2.158182184834155

In [726]:
y_pred=ada_final2.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 15,  7], dtype=int64)

In [727]:
ada_final2.score(X_test,y_test)

0.5836762688614541

In [728]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.5836762688614541
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.81      1.00      0.89       112
           2       0.00      0.00      0.00        26
           3       0.00      0.00      0.00        45
           4       0.69      0.98      0.81       187
           5       0.20      0.02      0.04        52
           6       0.00      0.00      0.00       293
           7       0.48      1.00      0.65       366
           8       0.00      0.00      0.00        20
           9       0.00      0.00      0.00         1
          10       0.00      0.00      0.00         2
          11       0.00      0.00      0.00        23
          12       0.00      0.00      0.00         7
          13       0.00      0.00      0.00        54
          14       0.00      0.00      0.00        56
          15       0.66      1.00      0.79       189
          16       0.00      0.00     

Target Encoding

In [693]:
ada_v3 = Pipeline(steps=[('preprocessor', preprocessor_target),
                        ('AdaBoost', ada)])

In [694]:
ada_v3.fit(X_train, y_train)

Grid Search

In [695]:
from sklearn.model_selection import GridSearchCV

In [696]:
ada_v3.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'AdaBoost', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__verbose_feature_names_out', 'preprocessor__cat_target', 'preprocessor__num', 'preprocessor__cat_target__memory', 'preprocessor__cat_target__steps', 'preprocessor__cat_target__verbose', 'preprocessor__cat_target__tencoder', 'preprocessor__cat_target__tencoder__cols', 'preprocessor__cat_target__tencoder__drop_invariant', 'preprocessor__cat_target__tencoder__handle_missing', 'preprocessor__cat_target__tencoder__handle_unknown', 'preprocessor__cat_target__tencoder__min_samples_leaf', 'preprocessor__cat_target__tencoder__return_df', 'preprocessor__cat_target__tencoder__smoothing', 'preprocessor__cat_target__tencoder__verbose', 'preprocessor__num__memory', 'preprocessor__num__steps', 'preprocessor__num__verbose', 'preprocessor__num__sc

In [697]:
ada_params = { "AdaBoost__n_estimators": [300,400], #,300,400,500
               "AdaBoost__learning_rate": [1,2], 
               "AdaBoost__algorithm": ['SAMME']
              #"lambda": 100
           }

In [698]:
ada_grid_v3 = GridSearchCV(ada_v3,
                           param_grid=ada_params,
                           cv = 5,
                           scoring = scorer)

In [699]:
ada_grid_v3.fit(X_train, y_train)

In [700]:
ada_grid_v3.best_params_

{'AdaBoost__algorithm': 'SAMME',
 'AdaBoost__learning_rate': 1,
 'AdaBoost__n_estimators': 400}

In [701]:
ada_grid_v3.best_score_

0.5538777378228179

In [702]:
ada_grid_results = pd.DataFrame( ada_grid_v3.cv_results_ )
ada_grid_results[['mean_fit_time','mean_score_time','params']]

Unnamed: 0,mean_fit_time,mean_score_time,params
0,0.930152,0.090366,"{'AdaBoost__algorithm': 'SAMME', 'AdaBoost__le..."
1,1.125662,0.122037,"{'AdaBoost__algorithm': 'SAMME', 'AdaBoost__le..."
2,1.085865,0.127847,"{'AdaBoost__algorithm': 'SAMME', 'AdaBoost__le..."
3,1.305963,0.136317,"{'AdaBoost__algorithm': 'SAMME', 'AdaBoost__le..."


Building the final AdaBoost Model

In [704]:
final_model_ada = AdaBoostClassifier(algorithm = ada_grid_v3.best_params_['AdaBoost__algorithm'], 
                                      learning_rate = ada_grid_v3.best_params_['AdaBoost__learning_rate'],
                                      n_estimators = ada_grid_v3.best_params_['AdaBoost__n_estimators'])
                                     
                                       
ada_final3 = Pipeline(steps=[('preprocessor', preprocessor_target),
                          ('AdaBoost', final_model_ada)])

In [705]:
ada_final3.fit(X_train, y_train)

In [706]:
ada_final3.score(X_test, y_test)

0.53360768175583

In [707]:
from sklearn.metrics import mean_squared_error

In [708]:
final_rmse_dt = np.sqrt(mean_squared_error(y_test, ada_final3.predict(X_test)))
final_rmse_dt

2.2533316288010696

In [709]:
y_pred=ada_final3.predict(X_test)
y_pred

array([15, 15,  4, ...,  1, 15,  6], dtype=int64)

In [711]:
ada_final3.score(X_test,y_test)

0.53360768175583

In [712]:
print("ACCURACY OF THE MODEL: ", accuracy_score(y_test, y_pred))
#print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test,y_pred, labels=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])) 

ACCURACY OF THE MODEL:  0.53360768175583
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.81      1.00      0.89       112
           2       0.00      0.00      0.00        26
           3       0.00      0.00      0.00        45
           4       0.69      0.98      0.81       187
           5       0.20      0.02      0.04        52
           6       0.38      1.00      0.56       293
           7       0.00      0.00      0.00       366
           8       0.00      0.00      0.00        20
           9       0.00      0.00      0.00         1
          10       0.00      0.00      0.00         2
          11       0.00      0.00      0.00        23
          12       0.00      0.00      0.00         7
          13       0.00      0.00      0.00        54
          14       0.00      0.00      0.00        56
          15       0.66      1.00      0.79       189
          16       0.00      0.00      0

## Best Model

From above we found that Gradient Boost algorithm with oversampling and One-Hot encoding gives the best accuracy.

### Model Persistence

In [626]:
class PredictionModel():
    
    def __init__(self, model, features, acc):
        self.model = model
        self.features = features
        self.acc = acc

In [627]:
crime_model = PredictionModel(rf_final3, list(X_train.columns), final_score)

In [628]:
from joblib import dump

In [637]:
dump(crime_model, './suraksha_app_model_crime_pred_area_cat.pkl')

['./suraksha_app_model_crime_pred_area_cat.pkl']

Testing

In [638]:
from joblib import load

In [639]:
model_v1 = load("suraksha_app_model_crime_pred_area_cat.pkl")

In [640]:
type(model_v1)

__main__.PredictionModel

In [641]:
model_v1.model

In [642]:
model_v1.acc

0.6406035665294925

In [643]:
model_v1.features

['Area',
 'year',
 'month',
 'day',
 'dayofweek',
 'time_hour',
 'dayofyear',
 'Category']

#### Note:
We have kept the standard scalar in our code just in case in future, safety_index or some other numeric feature is used for modelling, no amendment would be required as part of the pipeline

# ****