# Model Fitting

Here we will try to find the best model for the preprocessed data.

# What's in this notebook?

First we try the following ML algorithms on the one-hot encoded data.

0. Over Sampling - SMOTE
1. Logistic Regression
2. Random Forest Classifier
3. XGBoost Classifier
4. SVM Classifier
5. KNN Classifier

Next we try the following ML algorithms on the non-encoded data.

0. Over Sampling - SMOTE
1. CatBoost Classifier


#### NOTE:
Here we will use Recall instead of Accuracy because we are more interested in correctly classifying the records where the target variable is 1.

Recall: the proportion of actual positives that are correctly classified, i.e. TP / (TP + FN).
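To make the difference between the two metrics concrete, here is a minimal sketch in plain Python. The toy labels are made up for illustration and are not taken from the dataset; the helper names `recall` and `accuracy` are ours. A model that always predicts 0 scores well on accuracy when positives are rare, yet finds none of them.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN), computed for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Share of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# toy labels: 2 positives among 10 records (illustrative values only)
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_always_zero = [0] * 10  # a "model" that never predicts the positive class

print("Accuracy:", accuracy(y_true, y_always_zero))  # 0.8
print("Recall:  ", recall(y_true, y_always_zero))    # 0.0
```

This is the same quantity that `scoring='recall'` asks scikit-learn to compute in the cells that follow.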


In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# loading the one-hot encoded preprocessed data
df = pd.read_csv('https://raw.githubusercontent.com/Suvam-Bit/Datasets/main/Tourism%20Package%20Purchase%20Prediction/preprocessed.csv')
df.head()

Unnamed: 0,ProdTaken,Age,CityTier,DurationOfPitch,NumberOfPersonVisited,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisited,MonthlyIncome,Self Enquiry,Large Business,Salaried,Small Business,Male,Deluxe,King,Standard,Super Deluxe,Married,Single,Executive,Manager,Senior Manager,VP
0,1,41.0,3,6.0,3,3.0,3.0,1.0,1,2,1,0.0,20993.0,1,0,1,0,0,1,0,0,0,0,1,0,1,0,0
1,0,49.0,1,14.0,3,4.0,4.0,2.0,0,3,1,2.0,20130.0,0,0,1,0,1,1,0,0,0,0,0,0,1,0,0
2,1,37.0,1,8.0,3,4.0,3.0,7.0,1,3,0,0.0,17090.0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0
3,0,33.0,1,9.0,2,3.0,3.0,2.0,1,5,1,1.0,17909.0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
4,0,33.054181,1,8.0,2,3.0,4.0,1.0,0,5,1,0.0,18468.0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0


In [None]:
# dimension of the data
df.shape

(4888, 28)

In [None]:
# splitting into feature data and target data
df_X = df.drop('ProdTaken', axis = 1)
df_y = df['ProdTaken']

Now let's see how many records there are for each category in the target variable.

In [None]:
print(df_y[df_y==1].shape, df_y[df_y==0].shape)

(920,) (3968,)


In [None]:
print(df_X[df_y==1].shape, df_X[df_y==0].shape)

(920, 27) (3968, 27)


Number of records where the target variable is 1 = 920

Number of records where the target variable is 0 = 3968


So this is IMBALANCED DATA: only about 18.8% of the records (920 of 4888) belong to class 1.

In [None]:
df_X.head()

Unnamed: 0,Age,CityTier,DurationOfPitch,NumberOfPersonVisited,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisited,MonthlyIncome,Self Enquiry,Large Business,Salaried,Small Business,Male,Deluxe,King,Standard,Super Deluxe,Married,Single,Executive,Manager,Senior Manager,VP
0,41.0,3,6.0,3,3.0,3.0,1.0,1,2,1,0.0,20993.0,1,0,1,0,0,1,0,0,0,0,1,0,1,0,0
1,49.0,1,14.0,3,4.0,4.0,2.0,0,3,1,2.0,20130.0,0,0,1,0,1,1,0,0,0,0,0,0,1,0,0
2,37.0,1,8.0,3,4.0,3.0,7.0,1,3,0,0.0,17090.0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0
3,33.0,1,9.0,2,3.0,3.0,2.0,1,5,1,1.0,17909.0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
4,33.054181,1,8.0,2,3.0,4.0,1.0,0,5,1,0.0,18468.0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0


# Over Sampling - SMOTE

To balance the data, we over-sample the records where the target variable is 1 using SMOTENC (the SMOTE variant for data with both numeric and categorical features), so that both categories are present with the same frequency.

Read more about SMOTE:

https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5
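The core idea of SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration for numeric features only, not the imbalanced-learn implementation; the function name `smote_numeric_sketch` and the toy points are assumptions made for the example. Each synthetic row is placed at a random position on the line segment between a minority row and one of its nearest minority neighbours. As far as we understand it, SMOTENC additionally fills each categorical column with the most frequent category among those neighbours.

```python
import numpy as np


def smote_numeric_sketch(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic minority rows by interpolating between a
    random minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # pairwise Euclidean distances within the minority class
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)            # starting rows
    pick = rng.integers(0, neighbours.shape[1], size=n_new)   # which neighbour
    nbr = neighbours[base, pick]
    lam = rng.random((n_new, 1))                              # position on the segment
    return X_min[base] + lam * (X_min[nbr] - X_min[base])


# four toy minority points in 2-D (illustrative values only)
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
synthetic = smote_numeric_sketch(X_min, n_new=6, k=2)
print(synthetic.shape)
```

In this notebook the minority class has 920 rows and the majority class 3968, so balancing the two means creating 3968 - 920 = 3048 synthetic rows.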

In [None]:
num_features = ['Age', 'DurationOfPitch', 'MonthlyIncome']
cat_features = [list(df_X.columns).index(i) for i in df_X.columns if i not in num_features]

In [None]:
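from imblearn.over_sampling import SMOTENC

# SMOTENC needs the positional indices of the categorical columns
smote_nc = SMOTENC(categorical_features = cat_features, random_state = 42)

# fit_resample is the current imbalanced-learn method name (fit_sample was removed in later releases)
X, y = smote_nc.fit_resample(df_X, df_y)

X = pd.DataFrame(data = X, columns = df_X.columns)
y = pd.Series(y, name = 'ProdTaken')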

Now let's check the frequencies of the two categories in the target variable.

In [None]:
print(y[y==1].shape, y[y==0].shape)

(3968,) (3968,)


In [None]:
print(X[y==1].shape, X[y==0].shape)

(3968, 27) (3968, 27)


After over-sampling, our data is now BALANCED.

In [None]:
X.head()

Unnamed: 0,Age,CityTier,DurationOfPitch,NumberOfPersonVisited,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisited,MonthlyIncome,Self Enquiry,Large Business,Salaried,Small Business,Male,Deluxe,King,Standard,Super Deluxe,Married,Single,Executive,Manager,Senior Manager,VP
0,41.0,3.0,6.0,3.0,3.0,3.0,1.0,1.0,2.0,1.0,0.0,20993.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1,49.0,1.0,14.0,3.0,4.0,4.0,2.0,0.0,3.0,1.0,2.0,20130.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,37.0,1.0,8.0,3.0,4.0,3.0,7.0,1.0,3.0,0.0,0.0,17090.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
3,33.0,1.0,9.0,2.0,3.0,3.0,2.0,1.0,5.0,1.0,1.0,17909.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,33.054181,1.0,8.0,2.0,3.0,4.0,1.0,0.0,5.0,1.0,0.0,18468.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [None]:
# transforming the numeric variables so that they become approximately Gaussian distributed (see the Preprocessing Notebook for reference)
X['Age'] = X['Age']**(1/2)
X['DurationOfPitch'] = X['DurationOfPitch']**(1/5)
X['MonthlyIncome'] = np.log(X['MonthlyIncome'])

In [None]:
# saving the over-sampled data as .csv file
df_sampled = pd.concat([X, y], axis = 1)
df_sampled.to_csv('over_sampled.csv', index = False)

# Cross Validation Technique

Here we will use stratified k-fold cross-validation so that the training and test folds contain the same proportion of the two categories of the target variable.

In [None]:
from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits = 5)
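
What stratification does can be checked on a small toy example. The sketch below uses made-up labels (20 negatives and 5 positives, so 20% positives) and placeholder features; the names `y_toy`, `X_toy` and `skf` are ours. Every test fold should keep the same 20% share of positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy target: 20 negatives and 5 positives (illustrative values only)
y_toy = np.array([0] * 20 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5)

# share of positives in the test part of each fold
fold_positive_rates = [
    float(y_toy[test_idx].mean()) for _, test_idx in skf.split(X_toy, y_toy)
]
print(fold_positive_rates)
```

A plain `KFold` without shuffling on the same sorted labels would put all five positives into the last fold, which is why the stratified variant is preferred for classification targets.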

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

logistic_model = LogisticRegression()

scores_lg = cross_val_score(logistic_model, X, y, scoring='recall', cv = skfold, n_jobs = -1)
print("Recall: ",scores_lg.mean())

Recall:  0.7459851788794267


By applying Logistic Regression we got 74.60% Recall.

# Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

skfold = StratifiedKFold(n_splits = 5)

params = {'n_estimators': [100,150,200,250,300],
          'max_depth': [None,5,7,10,13,15,17,20],
          'min_samples_split': [2,3,4,5,6,7,8,9,10],
          'min_samples_leaf': [1,2,3,4,5,6,7,8,9,10],
          'max_features': ['auto','sqrt']}

rf_cl = RandomizedSearchCV(RandomForestClassifier(), param_distributions = params, scoring = 'recall', n_iter = 100, cv = skfold, n_jobs = -1, verbose = 2)

rf_cl.fit(X, y)

print("Recall: ",rf_cl.best_score_)
print("Best parameters: ",rf_cl.best_params_)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   34.5s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 361 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  7.8min finished


Recall:  0.8954586892233998
Best parameters:  {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'auto', 'max_depth': 15}


By applying Random Forest Classifier we got 89.55% Recall.

# XGBoost Classifier

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

params = {'n_estimators': [50,75,100,125,150,175,200],
              'booster': ['gbtree'],
              'max_depth': [None,3,4,5,7,10,13,15],
              'learning_rate': [0.05,0.10,0.15,0.20,0.25,0.30],
              'min_child_weight': [1,3,5,7],
              'gamma': [ 0.0,0.1,0.2,0.3,0.4],
              'colsample_bytree' : [0.3,0.4,0.5,0.7]}

xgb_cl = RandomizedSearchCV(XGBClassifier(), param_distributions = params, cv = 5, scoring = 'recall', n_iter=100, n_jobs = -1, verbose = 2)

xgb_cl.fit(X, y)

print("Recall: ",xgb_cl.best_score_)
print("Best parameters: ",xgb_cl.best_params_)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   20.8s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 361 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  6.1min finished


Recall:  0.9052868137767176
Best parameters:  {'n_estimators': 125, 'min_child_weight': 1, 'max_depth': 13, 'learning_rate': 0.1, 'gamma': 0.2, 'colsample_bytree': 0.5, 'booster': 'gbtree'}


By applying XGBoost Classifier we got 90.53% Recall.

# SVM

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

'''
params = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5], 'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
                    {'kernel': ['sigmoid'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5], 'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
                    {'kernel': ['linear'], 'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]}
                   ]
'''

params = {'kernel':['linear', 'poly', 'rbf'],
          'C': [0.001, 0.10, 0.1, 10, 25, 50, 100],
          'gamma': ['scale', 'auto', 1e-1, 1e-2, 1e-3],
          'degree':[2,3,4,5]}

svc_cl = RandomizedSearchCV(SVC(), param_distributions = params, n_iter = 10, scoring = 'recall', cv = 5, n_jobs = -1, verbose = 2)

svc_cl.fit(X,y)

print("Recall: ",svc_cl.best_score_)
print("Best parameters: ",svc_cl.best_params_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  4.8min finished


Recall:  0.9070478144723509
Best parameters:  {'kernel': 'rbf', 'gamma': 0.1, 'degree': 5, 'C': 25}


By applying SVM we got 90.70% Recall.

# KNN

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns = X.columns)

In [None]:
X_scaled.head()

Unnamed: 0,Age,CityTier,DurationOfPitch,NumberOfPersonVisited,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisited,MonthlyIncome,Self Enquiry,Large Business,Salaried,Small Business,Male,Deluxe,King,Standard,Super Deluxe,Married,Single,Executive,Manager,Senior Manager,VP
0,0.562774,1.456137,-1.695472,0.116221,-0.807358,-0.718383,-1.192121,1.232625,-0.908561,0.718297,-1.378916,-0.347968,0.631005,-0.263708,1.101666,-0.843913,-1.301821,1.58086,-0.200942,-0.3849,-0.224876,-0.777723,1.236853,-0.941001,1.58086,-0.3849,-0.200942
1,1.359564,-0.711137,-0.082497,0.116221,0.254139,0.515085,-0.636167,-0.811277,-0.145966,0.718297,1.055584,-0.560741,-1.584773,-0.263708,1.101666,-0.843913,0.768155,1.58086,-0.200942,-0.3849,-0.224876,-0.777723,-0.808503,-0.941001,1.58086,-0.3849,-0.200942
2,0.135112,-0.711137,-1.178174,0.116221,0.254139,-0.718383,2.143603,1.232625,-0.145966,-1.392182,-1.378916,-1.390576,0.631005,-0.263708,-0.907716,-0.843913,0.768155,-0.632567,-0.200942,-0.3849,-0.224876,-0.777723,1.236853,1.062698,-0.632567,-0.3849,-0.200942
3,-0.316363,-0.711137,-0.957643,-1.300566,-0.807358,-0.718383,-0.636167,1.232625,1.379225,0.718297,-0.161666,-1.153311,-1.584773,-0.263708,1.101666,-0.843913,-1.301821,-0.632567,-0.200942,-0.3849,-0.224876,-0.777723,-0.808503,1.062698,-0.632567,-0.3849,-0.200942
4,-0.31007,-0.711137,-1.178174,-1.300566,-0.807358,0.515085,-1.192121,-0.811277,1.379225,0.718297,-1.378916,-0.997519,0.631005,-0.263708,-0.907716,1.184956,0.768155,-0.632567,-0.200942,-0.3849,-0.224876,-0.777723,-0.808503,1.062698,-0.632567,-0.3849,-0.200942


In [None]:
y.head()

0    1
1    0
2    1
3    0
4    0
Name: ProdTaken, dtype: int64

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV

params = {'n_neighbors':[3,4,5,6,7,8,9,10,11,13,15],
          'weights':['uniform','distance'],
          'metric':['euclidean','manhattan']}

knn_cl = RandomizedSearchCV(KNeighborsClassifier(), param_distributions = params, cv = 5, scoring = 'recall', n_iter=100, n_jobs = -1, verbose = 2)

knn_cl.fit(X_scaled, y)

Fitting 5 folds for each of 44 candidates, totalling 220 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   17.2s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 220 out of 220 | elapsed:  1.7min finished


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=KNeighborsClassifier(algorithm='auto',
                                                  leaf_size=30,
                                                  metric='minkowski',
                                                  metric_params=None,
                                                  n_jobs=None, n_neighbors=5,
                                                  p=2, weights='uniform'),
                   iid='deprecated', n_iter=100, n_jobs=-1,
                   param_distributions={'metric': ['euclidean', 'manhattan'],
                                        'n_neighbors': [3, 4, 5, 6, 7, 8, 9, 10,
                                                        11, 13, 15],
                                        'weights': ['uniform', 'distance']},
                   pre_dispatch='2*n_jobs', random_state=None, refit=True,
                   return_train_score=False, scoring='recall', verbose=2)

In [None]:
print("Recall: ",knn_cl.best_score_)
print("Best parameters: ",knn_cl.best_params_)

Recall:  0.9365223412669421
Best parameters:  {'weights': 'distance', 'n_neighbors': 5, 'metric': 'manhattan'}


By applying the K-nearest Neighbors Classifier we got 93.65% Recall.

# CatBoost

In [None]:
!pip install catboost

In [None]:
# loading the non-encoded data
df2 = pd.read_csv('https://raw.githubusercontent.com/Suvam-Bit/Datasets/main/Tourism%20Package%20Purchase%20Prediction/preprocessed2.csv')
df2.head()

Unnamed: 0,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisited,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisited,Designation,MonthlyIncome
0,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,0,33.054181,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


In [None]:
# splitting in features and target data
df2_X = df2.drop('ProdTaken', axis = 1)
df2_y = df2['ProdTaken']

In [None]:
# label-encoding the categorical features
cat_var = ['TypeofContact', 'Occupation', 'Gender', 'ProductPitched', 'MaritalStatus', 'Designation']

for i in cat_var:
    df2_X[i] = df2_X[i].astype('category').cat.codes

In [None]:
df2_X.head()

Unnamed: 0,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisited,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisited,Designation,MonthlyIncome
0,41.0,1,3,6.0,2,0,3,3.0,1,3.0,2,1.0,1,2,1,0.0,2,20993.0
1,49.0,0,1,14.0,2,1,3,4.0,1,4.0,0,2.0,0,3,1,2.0,2,20130.0
2,37.0,1,1,8.0,0,1,3,4.0,0,3.0,2,7.0,1,3,0,0.0,1,17090.0
3,33.0,0,1,9.0,2,0,2,3.0,0,3.0,0,2.0,1,5,1,1.0,1,17909.0
4,33.054181,1,1,8.0,3,1,2,3.0,0,4.0,0,1.0,0,5,1,0.0,1,18468.0


In [None]:
# category labels of each category in each categorical variable
cat_var = ['TypeofContact', 'Occupation', 'Gender', 'ProductPitched', 'MaritalStatus', 'Designation']
cat_dict = {}

for i in cat_var:
    # map each original category name to its integer code, in order of first appearance
    cat_dict[i] = dict(zip(df2[i], df2_X[i]))

for key, val in cat_dict.items():
    print(key)
    for i, j in val.items():
        print(f'{i} : {j}')
    print('\n')

TypeofContact
Self Enquiry : 1
Company Invited : 0


Occupation
Salaried : 2
Free Lancer : 0
Small Business : 3
Large Business : 1


Gender
Female : 0
Male : 1


ProductPitched
Deluxe : 1
Basic : 0
Standard : 3
Super Deluxe : 4
King : 2


MaritalStatus
Single : 2
Divorced : 0
Married : 1


Designation
Manager : 2
Executive : 1
Senior Manager : 3
AVP : 0
VP : 4




Now we'll apply SMOTENC to over-sample this data, as we did for the one-hot encoded data.

In [None]:
df2_X.shape

(4888, 18)

In [None]:
print(df2_X[df2_y==1].shape, df2_X[df2_y==0].shape)

(920, 18) (3968, 18)


In [None]:
print(df2_y[df2_y==1].shape, df2_y[df2_y==0].shape)

(920,) (3968,)


In [None]:
num_features_2 = ['Age', 'DurationOfPitch', 'MonthlyIncome']
cat_features_2 = [list(df2_X.columns).index(i) for i in df2_X.columns if i not in num_features_2]
cat_features_2

[1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

In [None]:
from imblearn.over_sampling import SMOTENC
smote_nc_2 = SMOTENC(categorical_features = cat_features_2, random_state = 42)

# fit_resample is the current imbalanced-learn method name (fit_sample was removed in later releases)
X2, y2 = smote_nc_2.fit_resample(df2_X, df2_y)

X2 = pd.DataFrame(data = X2, columns = df2_X.columns)
y2 = pd.Series(y2, name = 'ProdTaken')

In [None]:
print(y2[y2==1].shape, y2[y2==0].shape)

(3968,) (3968,)


In [None]:
print(X2[y2==1].shape, X2[y2==0].shape)

(3968, 18) (3968, 18)


In [None]:
X2.head()

Unnamed: 0,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisited,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisited,Designation,MonthlyIncome
0,41.0,1.0,3.0,6.0,2.0,0.0,3.0,3.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,0.0,2.0,20993.0
1,49.0,0.0,1.0,14.0,2.0,1.0,3.0,4.0,1.0,4.0,0.0,2.0,0.0,3.0,1.0,2.0,2.0,20130.0
2,37.0,1.0,1.0,8.0,0.0,1.0,3.0,4.0,0.0,3.0,2.0,7.0,1.0,3.0,0.0,0.0,1.0,17090.0
3,33.0,0.0,1.0,9.0,2.0,0.0,2.0,3.0,0.0,3.0,0.0,2.0,1.0,5.0,1.0,1.0,1.0,17909.0
4,33.054181,1.0,1.0,8.0,3.0,1.0,2.0,3.0,0.0,4.0,0.0,1.0,0.0,5.0,1.0,0.0,1.0,18468.0


In [None]:
# applying CatBoost Classifier
from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

params = {'depth':[5,6,7,8,9,10],
            'iterations':[300,400,500,600,700],
            'learning_rate':[0.01, 0.05, 0.1, 0.2, 0.5]}

cb_cl = RandomizedSearchCV(CatBoostClassifier(), param_distributions= params, scoring = 'recall', n_iter=100, cv = 5, n_jobs = -1, verbose = 2)

cb_cl.fit(X2, y2)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed: 11.5min
[Parallel(n_jobs=-1)]: Done 361 tasks      | elapsed: 26.7min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed: 35.6min finished


0:	learn: 0.6352768	total: 60.1ms	remaining: 42s
1:	learn: 0.5840919	total: 70.3ms	remaining: 24.5s
2:	learn: 0.5507388	total: 79.6ms	remaining: 18.5s
3:	learn: 0.5184297	total: 89.1ms	remaining: 15.5s
4:	learn: 0.4892403	total: 98.3ms	remaining: 13.7s
5:	learn: 0.4617897	total: 108ms	remaining: 12.5s
6:	learn: 0.4404389	total: 117ms	remaining: 11.6s
7:	learn: 0.4277782	total: 126ms	remaining: 10.9s
8:	learn: 0.4133253	total: 135ms	remaining: 10.4s
9:	learn: 0.4003412	total: 146ms	remaining: 10s
10:	learn: 0.3878669	total: 155ms	remaining: 9.73s
11:	learn: 0.3728922	total: 164ms	remaining: 9.43s
12:	learn: 0.3610323	total: 174ms	remaining: 9.19s
13:	learn: 0.3542866	total: 189ms	remaining: 9.27s
14:	learn: 0.3477298	total: 201ms	remaining: 9.17s
15:	learn: 0.3411382	total: 210ms	remaining: 9s
16:	learn: 0.3349948	total: 220ms	remaining: 8.83s
17:	learn: 0.3302685	total: 229ms	remaining: 8.68s
18:	learn: 0.3250136	total: 239ms	remaining: 8.56s
19:	learn: 0.3190595	total: 248ms	remaining

RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=<catboost.core.CatBoostClassifier object at 0x7fda764a3350>,
                   iid='deprecated', n_iter=100, n_jobs=-1,
                   param_distributions={'depth': [5, 6, 7, 8, 9, 10],
                                        'iterations': [300, 400, 500, 600, 700],
                                        'learning_rate': [0.01, 0.05, 0.1, 0.2,
                                                          0.5]},
                   pre_dispatch='2*n_jobs', random_state=None, refit=True,
                   return_train_score=False, scoring='recall', verbose=2)

In [None]:
print("Recall: ",cb_cl.best_score_)
print("Best parameters: ",cb_cl.best_params_)

Recall:  0.9135972505010784
Best parameters:  {'learning_rate': 0.1, 'iterations': 700, 'depth': 9}


After applying CatBoost Classifier we got 91.36% Recall.

# Best Model

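So far, the highest recall is 93.65%, obtained with the K-nearest Neighbors Classifier (compared with 74.60% for Logistic Regression, 89.55% for Random Forest, 90.53% for XGBoost, 90.70% for SVM and 91.36% for CatBoost).

So we choose the KNN Classifier as the best model for our data.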

In [None]:
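# dumping the knn model (the fitted RandomizedSearchCV object)
import pickle

# the context manager closes the file even if pickling fails
with open('knn_model.pkl', 'wb') as file:
    pickle.dump(knn_cl, file)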
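
To use the saved model later, it has to be loaded back with `pickle.load`. The sketch below shows the round trip with placeholder objects in a temporary directory, so it runs without the trained model; the helper `save_and_load`, the file name `knn_bundle.pkl` and the dictionary keys are hypothetical. Because `knn_cl` was fitted on `X_scaled`, new records would need the same numeric transformations and the same fitted `StandardScaler` before prediction, so one option is to pickle the scaler together with the model.

```python
import pickle
import tempfile
from pathlib import Path


def save_and_load(obj, path):
    """Pickle obj to path, then read it back and return the restored copy."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    with open(path, "rb") as f:
        return pickle.load(f)


# placeholders standing in for the fitted objects (assumption: in the notebook
# these would be knn_cl and scaler)
bundle = {"model": "fitted search object", "scaler": "fitted StandardScaler"}

with tempfile.TemporaryDirectory() as tmp:
    restored = save_and_load(bundle, Path(tmp) / "knn_bundle.pkl")

print(restored.keys())
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.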