# Agenda:

- Basic data cleaning - One Hot Encoding, Feature Scaling
- Train-test split
- Distance-based Models (kNN)
- Accuracy Metrics
- Validation Strategies (Validation Set)
- Cross-Validation
- Linear Models (Linear Regression/**Logistic Regression**)
- Model improvement strategies: impact of feature engineering/feature selection, missing value imputation, cleaning/capping outliers, balancing target categories (oversampling/SMOTE)
- Tree-based models (if time permits)

## Problem statement:

ABC Bank has provided a dataset of customer details in the file `BankAttrition - Details.csv`. Transaction information and the type of credit card each customer holds are provided in a second file, `Transaction and Card Details.csv`. The bank is currently facing a customer attrition problem. They have asked us to help them understand the patterns behind customer attrition and whether early signals can be found so that they can stop losing customers.

So far: merged the data and performed exploratory data analysis.

In [49]:
import pandas as pd
import numpy as np

# read input files
details = pd.read_csv("Datasets/BankAttrition - Details.csv")
transaction = pd.read_csv("Datasets/Transaction and Card Details.csv")

details.shape, transaction.shape

((10127, 8), (10127, 14))

In [50]:
# merge to create ADS

ads = pd.merge(details, transaction, how = 'outer', on = ['CLIENTNUM'])

In [51]:
# check for missing values
ads.info()

## consider Unknown as a separate category
# typecasting variables
ads['Gender'] = ads['Gender'].astype('category')
ads['Education_Level'] = ads['Education_Level'].astype('category')
ads['Marital_Status'] = ads['Marital_Status'].astype('category')
ads['Income_Category'] = ads['Income_Category'].astype('category')
ads['Card_Category'] = ads['Card_Category'].astype('category')


<class 'pandas.core.frame.DataFrame'>
Int64Index: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           10127 non-null  object 
 6   Marital_Status            10127 non-null  object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_

In [52]:
# drop ClientNum as it is just the identifier
ads.drop(["CLIENTNUM"], axis = 1, inplace = True)

In [53]:
ads['Education_Level'].unique()

['High School', 'Graduate', 'Uneducated', 'Unknown', 'College', 'Post-Graduate', 'Doctorate']
Categories (7, object): ['High School', 'Graduate', 'Uneducated', 'Unknown', 'College', 'Post-Graduate', 'Doctorate']

In [54]:
labels_ed = {
    "Unknown": -1,
    "Uneducated": 0,
    "High School": 1,
    "College": 2,   # note: College and Graduate share the same code
    "Graduate": 2,
    "Post-Graduate": 3,
    "Doctorate": 4
}

ads["Education_Level"] = ads["Education_Level"].map(labels_ed)

In [55]:
ads['Income_Category'].value_counts(normalize = True)

Less than $40K    0.351634
$40K - $60K       0.176755
$80K - $120K      0.151575
$60K - $80K       0.138442
Unknown           0.109805
$120K +           0.071788
Name: Income_Category, dtype: float64

In [56]:
labels_income = {
    "Unknown": 0.11,
    'Less than $40K': 0.35,
    '$40K - $60K': 0.17,
    '$60K - $80K': 0.14,
    '$80K - $120K': 0.15,
    '$120K +': 0.07
}

ads["Income_Category"] = ads["Income_Category"].map(labels_income)
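The later output lists dummy columns such as `Income_Category_0.35`, which suggests the mapped column was still treated as categorical and then one-hot encoded. If the intent was a numeric frequency encoding, one possible approach is sketched below. The toy series and the names `income`, `freq`, `income_freq`, and `toy` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the Income_Category column (values are illustrative only).
income = pd.Series(
    ["Less than $40K", "$40K - $60K", "Less than $40K", "Unknown"],
    dtype="category",
    name="Income_Category",
)

# Share of rows per category, computed from the data instead of typed by hand.
freq = income.value_counts(normalize=True).to_dict()

# Going through plain strings yields an ordinary float column,
# so select_dtypes(include=["category"]) will no longer pick it up.
income_freq = income.astype(str).map(freq)

toy = pd.DataFrame({
    "Income_Category": income_freq,
    "Gender": pd.Series(["M", "F", "F", "M"], dtype="category"),
})
categorical_vars = list(toy.select_dtypes(include=["category"]).columns)
encoded = pd.get_dummies(toy, columns=categorical_vars)
print(encoded.columns.tolist())
```

In this sketch only `Gender` is expanded into dummy columns, while the frequency-encoded income stays a single numeric feature.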

In [57]:
ads

Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,Existing Customer,45,M,3,1,Married,0.14,Blue,39,5,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061
1,Existing Customer,49,F,5,2,Single,0.35,Blue,44,6,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105
2,Existing Customer,51,M,3,2,Married,0.15,Blue,36,4,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.000
3,Existing Customer,40,F,4,1,Unknown,0.35,Blue,34,3,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.760
4,Existing Customer,40,M,3,0,Married,0.14,Blue,21,5,1,0,4716.0,0,4716.0,2.175,816,28,2.500,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10122,Existing Customer,50,M,2,2,Single,0.17,Blue,40,3,2,3,4003.0,1851,2152.0,0.703,15476,117,0.857,0.462
10123,Attrited Customer,41,M,2,-1,Divorced,0.17,Blue,25,4,2,3,4277.0,2186,2091.0,0.804,8764,69,0.683,0.511
10124,Attrited Customer,44,F,1,1,Married,0.35,Blue,36,5,3,4,5409.0,0,5409.0,0.819,10291,60,0.818,0.000
10125,Attrited Customer,30,M,2,2,Unknown,0.17,Blue,36,4,3,3,5281.0,0,5281.0,0.535,8395,62,0.722,0.000


In [58]:
# One hot encoding the categories
categorical_vars = ads.select_dtypes(include = ['category']).columns

ads = pd.get_dummies(ads, columns = categorical_vars)

In [59]:
ads

Unnamed: 0,Attrition_Flag,Customer_Age,Dependent_count,Education_Level,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,...,Income_Category_0.07,Income_Category_0.17,Income_Category_0.14,Income_Category_0.15,Income_Category_0.35,Income_Category_0.11,Card_Category_Blue,Card_Category_Gold,Card_Category_Platinum,Card_Category_Silver
0,Existing Customer,45,3,1,39,5,1,3,12691.0,777,...,0,0,1,0,0,0,1,0,0,0
1,Existing Customer,49,5,2,44,6,1,2,8256.0,864,...,0,0,0,0,1,0,1,0,0,0
2,Existing Customer,51,3,2,36,4,1,0,3418.0,0,...,0,0,0,1,0,0,1,0,0,0
3,Existing Customer,40,4,1,34,3,4,1,3313.0,2517,...,0,0,0,0,1,0,1,0,0,0
4,Existing Customer,40,3,0,21,5,1,0,4716.0,0,...,0,0,1,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10122,Existing Customer,50,2,2,40,3,2,3,4003.0,1851,...,0,1,0,0,0,0,1,0,0,0
10123,Attrited Customer,41,2,-1,25,4,2,3,4277.0,2186,...,0,1,0,0,0,0,1,0,0,0
10124,Attrited Customer,44,1,1,36,5,3,4,5409.0,0,...,0,0,0,0,1,0,1,0,0,0
10125,Attrited Customer,30,2,2,36,4,3,3,5281.0,0,...,0,1,0,0,0,0,1,0,0,0


In [60]:
ads.columns

Index(['Attrition_Flag', 'Customer_Age', 'Dependent_count', 'Education_Level',
       'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
       'Gender_F', 'Gender_M', 'Marital_Status_Divorced',
       'Marital_Status_Married', 'Marital_Status_Single',
       'Marital_Status_Unknown', 'Income_Category_0.07',
       'Income_Category_0.17', 'Income_Category_0.14', 'Income_Category_0.15',
       'Income_Category_0.35', 'Income_Category_0.11', 'Card_Category_Blue',
       'Card_Category_Gold', 'Card_Category_Platinum', 'Card_Category_Silver'],
      dtype='object')

In [61]:

# encode the target as 0/1 (a single assignment avoids the chained-indexing warning)
target_map = {'Existing Customer': 0, 'Attrited Customer': 1}
ads['Attrition_Flag'] = ads['Attrition_Flag'].map(target_map).astype('category')


In [62]:
## Feature engineering: log-transform skewed amount columns (+0.01 guards against log(0))
ads['Credit_Limit'] = np.log(ads['Credit_Limit'])
ads['Total_Revolving_Bal'] = np.log(ads['Total_Revolving_Bal'] + 0.01)
ads['Total_Trans_Amt'] = np.log(ads['Total_Trans_Amt'] + 0.01)
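A small sketch of how the `+ 0.01` offset behaves when a balance is exactly zero, compared with `np.log1p`. This is an illustrative comparison on made-up values, not a change to the notebook's transform; the array `balances` is an assumption.

```python
import numpy as np

# Illustrative balances; zeros do appear in Total_Revolving_Bal in the table above.
balances = np.array([0.0, 777.0, 2517.0])

shifted = np.log(balances + 0.01)  # the transform used in the cell above
log1p = np.log1p(balances)         # log(1 + x), a common alternative

# A zero balance lands near -4.6 with the 0.01 offset but at exactly 0 with log1p,
# so the offset version places zero-balance customers far from everyone else.
print(shifted.round(3))
print(log1p.round(3))
```

Which behaviour is preferable depends on the model: for distance-based models such as kNN, the large gap created by the small offset can dominate the scaled feature.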

In [63]:
bins = [0, 18, 30, 50, 70, 200]
ads['binned_age'] = pd.cut(ads['Customer_Age'], bins)


# tenure buckets from Months_on_book (.loc avoids chained assignment)
ads['book_cat'] = None
ads.loc[ads['Months_on_book'] <= 20, 'book_cat'] = 'Low'
ads.loc[ads['Months_on_book'] > 20, 'book_cat'] = 'Medium'
ads.loc[ads['Months_on_book'] > 50, 'book_cat'] = 'High'
ads['book_cat'] = ads['book_cat'].astype('category')


# utilization buckets; note that open_buy_cat is built from Avg_Utilization_Ratio
ads['open_buy_cat'] = None
ads.loc[ads['Avg_Utilization_Ratio'] <= 0, 'open_buy_cat'] = 'Low'
ads.loc[(ads['Avg_Utilization_Ratio'] > 0) & (ads['Avg_Utilization_Ratio'] <= 0.7), 'open_buy_cat'] = 'Medium'
ads.loc[ads['Avg_Utilization_Ratio'] > 0.7, 'open_buy_cat'] = 'High'
ads['open_buy_cat'] = ads['open_buy_cat'].astype('category')

ads = pd.get_dummies(ads, columns = ['binned_age', 'book_cat', 'open_buy_cat'])

In [64]:
# separating independent and dependent variables
x = ads.drop(['Attrition_Flag'], axis=1)
y = ads['Attrition_Flag']
x.shape, y.shape

((10127, 42), (10127,))

In [65]:
y.value_counts(normalize = True)

0    0.83934
1    0.16066
Name: Attrition_Flag, dtype: float64
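With roughly 84% existing customers, accuracy alone can look good for a model that never flags attrition, which is one reason the cells below score with F1. The sketch below works through that on made-up predictions. The helper functions and both prediction vectors are hypothetical and only mirror the class ratio shown above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (attrited) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 84 existing (0) and 16 attrited (1) customers, mirroring the ratio above.
y_true = [0] * 84 + [1] * 16

# Baseline that always predicts "existing customer".
always_existing = [0] * 100

# Hypothetical model: 6 false alarms, 12 of the 16 attrited customers caught.
some_model = [0] * 78 + [1] * 6 + [1] * 12 + [0] * 4

print(accuracy(y_true, always_existing), precision_recall_f1(y_true, always_existing))
print(accuracy(y_true, some_model), precision_recall_f1(y_true, some_model))
```

The baseline reaches 0.84 accuracy with an F1 of 0, while the hypothetical model is only six points more accurate but has an F1 of about 0.71.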

In [66]:
# importing the train test split function
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(x, y, random_state =123, stratify = y, test_size = 0.25)

from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors = 4)

x_smote, y_smote = smote.fit_resample(train_x, train_y)

y_smote.value_counts(), test_y.value_counts()

(1    6375
 0    6375
 Name: Attrition_Flag, dtype: int64,
 0    2125
 1     407
 Name: Attrition_Flag, dtype: int64)

In [67]:
train_y.value_counts(normalize = True), test_y.value_counts(normalize = True)

(0    0.839368
 1    0.160632
 Name: Attrition_Flag, dtype: float64,
 0    0.839258
 1    0.160742
 Name: Attrition_Flag, dtype: float64)

In [68]:
# import scalers

from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()

train_x = scaler.fit_transform(train_x)
test_x = scaler.transform(test_x)

In [None]:
train_x

In [None]:
# quick knn model

from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import accuracy_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay


clf = KNN(n_neighbors = 7)
clf.fit(train_x, train_y)

# make training prediction
train_yhat = clf.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = clf.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

In [None]:
def elbow_curve(k):
    train_score = []
    test_score = []

    # x_smote was resampled before scaling, so apply the fitted scaler here;
    # otherwise the model is fit on raw features but scored on scaled ones
    x_smote_scaled = scaler.transform(x_smote)

    for i in k:
        clf = KNN(n_neighbors = i, metric = "manhattan")
        clf.fit(x_smote_scaled, y_smote)

        train_yhat = clf.predict(train_x)
        train_score_tmp = f1_score(train_y, train_yhat)
        train_score.append(train_score_tmp)

        test_yhat = clf.predict(test_x)
        test_score_tmp = f1_score(test_y, test_yhat)
        test_score.append(test_score_tmp)

    return train_score, test_score

In [None]:
k = range(1, 30, 2)

train_f1, test_f1 = elbow_curve(k)

import matplotlib.pyplot as plt
plt.plot(k, train_f1, label = "train_f1", color = "red")
plt.plot(k, test_f1, label = "test_f1", color = "black")
plt.legend()
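The elbow curve above compares k values using the test set. The agenda also lists cross-validation, so here is a minimal sketch of choosing k with stratified k-fold cross-validation on training data only. It runs on synthetic data from `make_classification` as a stand-in for `train_x` and `train_y`; the dataset, fold count, and the names `cv_f1` and `best_k` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the training split.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.84, 0.16], random_state=123
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

cv_f1 = {}
for k in range(1, 30, 2):
    # The scaler sits inside the pipeline, so it is refit on each training fold
    # and the validation fold never influences the scaling statistics.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_f1[k] = cross_val_score(pipe, X, y, scoring="f1", cv=cv).mean()

best_k = max(cv_f1, key=cv_f1.get)
print(best_k, round(cv_f1[best_k], 3))
```

Selecting k this way leaves the test set untouched until the final evaluation, so the reported test F1 is not biased by the tuning step.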

In [None]:
from sklearn.linear_model import LogisticRegression as LR

lr = LR(max_iter = 10000, penalty = "l1", solver = 'saga')
lr.fit(train_x, train_y)


# make training prediction
train_yhat = lr.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = lr.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

sfs1 = sfs(lr,
          k_features = 20,
          forward = True,
          scoring = 'f1',
          cv = 3)

sfs1 = sfs1.fit(train_x, train_y)

In [None]:
features_selected = list(sfs1.k_feature_idx_)

In [None]:
features_selected

In [None]:
lr = LR(max_iter = 10000)
lr.fit(train_x[:, features_selected], train_y)


# make training prediction
train_yhat = lr.predict(train_x[:, features_selected])
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = lr.predict(test_x[:, features_selected])
test_score = f1_score(test_y, test_yhat)

train_score, test_score

## Tree-based models

In [21]:
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.metrics import accuracy_score, recall_score, f1_score


dt_model = DT(random_state = 555, max_depth = 6)
dt_model.fit(train_x, train_y)


# make training prediction
train_yhat = dt_model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = dt_model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.863013698630137, 0.8312655086848636)

In [None]:
## Tuning a DT on depth

train_score = []
test_score = []

for depth in range(1, 30):
    dt_model = DT(random_state = 555, max_depth = depth)
    dt_model.fit(train_x, train_y)
    train_yhat = dt_model.predict(train_x)
    train_f1 = f1_score(train_y, train_yhat)
    train_score.append(train_f1)
    test_yhat = dt_model.predict(test_x)
    test_f1 = f1_score(test_y, test_yhat)
    test_score.append(test_f1)

In [None]:
df = pd.DataFrame({'max_depth': range(1, 30), 'train_score': train_score, 'test_score': test_score})
df

In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.plot(df['max_depth'], df['train_score'], marker = 'o')
plt.plot(df['max_depth'], df['test_score'], marker = 'o')

In [29]:
## Randomized search CV - model tuning (RandomizedSearchCV samples n_iter combinations instead of trying the full grid)

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

params = {'max_depth': [6, 10, 15], 'min_samples_split': list(range(2, 50, 5)), 'criterion': ['gini', 'entropy'],
         'max_leaf_nodes': list(range(2, 50, 5)), 'ccp_alpha': [0, 0.01, 0.1, 1]}

grid_search = RandomizedSearchCV(DecisionTreeClassifier(random_state = 555), param_distributions = params, scoring = 'f1', cv = 3, 
                           verbose = 1, n_iter = 300)
grid_search.fit(train_x, train_y)

Fitting 3 folds for each of 300 candidates, totalling 900 fits


RandomizedSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=555),
                   n_iter=300,
                   param_distributions={'ccp_alpha': [0, 0.01, 0.1, 1],
                                        'criterion': ['gini', 'entropy'],
                                        'max_depth': [6, 10, 15],
                                        'max_leaf_nodes': [2, 7, 12, 17, 22, 27,
                                                           32, 37, 42, 47],
                                        'min_samples_split': [2, 7, 12, 17, 22,
                                                              27, 32, 37, 42,
                                                              47]},
                   scoring='f1', verbose=1)

In [30]:
grid_search.best_estimator_

DecisionTreeClassifier(ccp_alpha=0, max_depth=15, max_leaf_nodes=42,
                       min_samples_split=7, random_state=555)
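Rather than retyping the reported parameters into a new model, the fitted search object can be used directly, which avoids transcription slips. The sketch below shows this on synthetic data; the dataset, the reduced parameter grid, and the names `search` and `best_tree` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the notebook's features and target.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.84, 0.16], random_state=555
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=123
)

params = {"max_depth": [3, 6, 10], "min_samples_split": [2, 7, 12]}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=555),
    param_distributions=params,
    scoring="f1",
    cv=3,
    n_iter=5,
    random_state=555,  # makes the sampled combinations reproducible
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refit on all of X_tr.
best_tree = search.best_estimator_
test_f1 = f1_score(y_te, best_tree.predict(X_te))
print(search.best_params_, round(test_f1, 3))
```

`search.best_params_` can also be unpacked into a fresh estimator with `DecisionTreeClassifier(**search.best_params_, random_state=555)` if a separate model object is wanted.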

In [34]:
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.metrics import accuracy_score, recall_score, f1_score


# note: max_depth is set to 6 here, not the max_depth=15 reported by best_estimator_ above
dt_model = DT(ccp_alpha=0, max_depth=6, max_leaf_nodes=42,
                       min_samples_split=7, random_state=555)

dt_model.fit(train_x, train_y)


# make training prediction
train_yhat = dt_model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = dt_model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.8548179871520343, 0.8300492610837438)

## Random Forest

In [36]:
from sklearn.ensemble import RandomForestClassifier as RF

rf_model = RF(random_state = 555)
rf_model.fit(train_x, train_y)


# make training prediction
train_yhat = rf_model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = rf_model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(1.0, 0.8423772609819121)

In [38]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

params = {'max_depth': [5, 10, 15], 'min_samples_split': list(range(2, 50, 5)), 'criterion': ['gini', 'entropy'],
         'max_leaf_nodes': list(range(2, 50, 5)), 'ccp_alpha': [0, 0.01, 0.1, 1], 'n_estimators': [100, 200, 300, 400, 500]}

grid_search = RandomizedSearchCV(RF(random_state = 555), param_distributions = params, scoring = 'f1', cv = 3, 
                           verbose = 1, n_iter = 200)
grid_search.fit(train_x, train_y)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


RandomizedSearchCV(cv=3, estimator=RandomForestClassifier(random_state=555),
                   n_iter=200,
                   param_distributions={'ccp_alpha': [0, 0.01, 0.1, 1],
                                        'criterion': ['gini', 'entropy'],
                                        'max_depth': [5, 10, 15],
                                        'max_leaf_nodes': [2, 7, 12, 17, 22, 27,
                                                           32, 37, 42, 47],
                                        'min_samples_split': [2, 7, 12, 17, 22,
                                                              27, 32, 37, 42,
                                                              47],
                                        'n_estimators': [100, 200, 300, 400,
                                                         500]},
                   scoring='f1', verbose=1)

In [39]:
grid_search.best_estimator_

RandomForestClassifier(ccp_alpha=0, criterion='entropy', max_depth=15,
                       max_leaf_nodes=47, min_samples_split=22,
                       n_estimators=300, random_state=555)

In [40]:
from sklearn.ensemble import RandomForestClassifier as RF

rf_model = RF(ccp_alpha=0, criterion='entropy', max_depth=15,
                       max_leaf_nodes=47, min_samples_split=22,
                       n_estimators=300, random_state=555)
rf_model.fit(train_x, train_y)


# make training prediction
train_yhat = rf_model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = rf_model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.7731660231660231, 0.7402031930333817)

In [41]:
from sklearn.ensemble import GradientBoostingClassifier as GBM

gbm_model = GBM(random_state = 555)
gbm_model.fit(train_x, train_y)


# make training prediction
train_yhat = gbm_model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = gbm_model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.9224211423699915, 0.8854568854568855)

In [45]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

params = {'max_depth': [2, 3, 5, 7], 
          'learning_rate': [0.05, 0.1, 0.2, 0.3, 0.5],
          'ccp_alpha': [0, 0.01, 0.1, 1], 'n_estimators': [100, 200, 300, 400, 500]}

grid_search = RandomizedSearchCV(GBM(random_state = 555), param_distributions = params, scoring = 'f1', cv = 3, 
                           verbose = 3, n_iter = 10)
grid_search.fit(train_x, train_y)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END ccp_alpha=1, learning_rate=0.3, max_depth=3, n_estimators=500;, score=0.000 total time=  12.5s
[CV 2/3] END ccp_alpha=1, learning_rate=0.3, max_depth=3, n_estimators=500;, score=0.000 total time=  13.9s
[CV 3/3] END ccp_alpha=1, learning_rate=0.3, max_depth=3, n_estimators=500;, score=0.000 total time=  12.3s
[CV 1/3] END ccp_alpha=0, learning_rate=0.1, max_depth=7, n_estimators=200;, score=0.901 total time=   9.6s
[CV 2/3] END ccp_alpha=0, learning_rate=0.1, max_depth=7, n_estimators=200;, score=0.899 total time=  10.3s
[CV 3/3] END ccp_alpha=0, learning_rate=0.1, max_depth=7, n_estimators=200;, score=0.884 total time=  11.3s
[CV 1/3] END ccp_alpha=0.01, learning_rate=0.3, max_depth=7, n_estimators=100;, score=0.512 total time=   4.4s
[CV 2/3] END ccp_alpha=0.01, learning_rate=0.3, max_depth=7, n_estimators=100;, score=0.383 total time=   4.5s
[CV 3/3] END ccp_alpha=0.01, learning_rate=0.3, max_depth=7, n_estima

RandomizedSearchCV(cv=3, estimator=GradientBoostingClassifier(random_state=555),
                   param_distributions={'ccp_alpha': [0, 0.01, 0.1, 1],
                                        'learning_rate': [0.05, 0.1, 0.2, 0.3,
                                                          0.5],
                                        'max_depth': [2, 3, 5, 7],
                                        'n_estimators': [100, 200, 300, 400,
                                                         500]},
                   scoring='f1', verbose=3)

In [46]:
grid_search.best_estimator_

GradientBoostingClassifier(ccp_alpha=0, max_depth=7, n_estimators=200,
                           random_state=555)

In [69]:
from sklearn.ensemble import GradientBoostingClassifier as GBM

# note: max_depth is set to 5 here, not the max_depth=7 reported by best_estimator_ above
gbm_model = GBM(ccp_alpha=0, max_depth=5, n_estimators=200,
                           random_state=555)
gbm_model.fit(train_x, train_y)


# make training prediction
train_yhat = gbm_model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = gbm_model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.9987689782519491, 0.9016602809706258)

In [71]:
import xgboost as xgb

model = xgb.XGBClassifier()

model.fit(train_x, train_y)


# make training prediction
train_yhat = model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(1.0, 0.910691823899371)

In [72]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

params = {'max_depth': [2, 3, 5, 7, 10], 
          'learning_rate': [0.05, 0.1, 0.2, 0.3, 0.5],
          'n_estimators': [100, 200, 300, 400, 500]
         }

grid_search = RandomizedSearchCV(model, param_distributions = params, scoring = 'f1', cv = 3, 
                           verbose = 3, n_iter = 20)
grid_search.fit(train_x, train_y)
grid_search.best_estimator_

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3] END learning_rate=0.3, max_depth=10, n_estimators=100;, score=0.883 total time=   1.4s
[CV 2/3] END learning_rate=0.3, max_depth=10, n_estimators=100;, score=0.902 total time=   2.1s
[CV 3/3] END learning_rate=0.3, max_depth=10, n_estimators=100;, score=0.904 total time=   2.9s
[CV 1/3] END learning_rate=0.5, max_depth=10, n_estimators=400;, score=0.880 total time=   7.7s
[CV 2/3] END learning_rate=0.5, max_depth=10, n_estimators=400;, score=0.904 total time=   6.5s
[CV 3/3] END learning_rate=0.5, max_depth=10, n_estimators=400;, score=0.908 total time=   3.8s
[CV 1/3] END learning_rate=0.3, max_depth=5, n_estimators=100;, score=0.892 total time=   1.8s
[CV 2/3] END learning_rate=0.3, max_depth=5, n_estimators=100;, score=0.917 total time=   1.8s
[CV 3/3] END learning_rate=0.3, max_depth=5, n_estimators=100;, score=0.901 total time=   1.4s
[CV 1/3] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.881 total time=   0.9s
[CV 2/3] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.883 total time=   0.7s
[CV 3/3] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.879 total time=   0.8s
[CV 1/3] END learning_rate=0.05, max_depth=5, n_estimators=100;, score=0.867 total time=   1.1s
[CV 2/3] END learning_rate=0.05, max_depth=5, n_estimators=100;, score=0.875 total time=   1.9s
[CV 3/3] END learning_rate=0.05, max_depth=5, n_estimators=100;, score=0.876 total time=   1.5s
[CV 1/3] END learning_rate=0.2, max_depth=2, n_estimators=400;, score=0.901 total time=   2.4s
[CV 2/3] END learning_rate=0.2, max_depth=2, n_estimators=400;, score=0.904 total time=   1.9s
[CV 3/3] END learning_rate=0.2, max_depth=2, n_estimators=400;, score=0.905 total time=   3.6s
[CV 1/3] END learning_rate=0.1, max_depth=10, n_estimators=200;, score=0.889 total time=   4.9s
[CV 2/3] END learning_rate=0.1, max_depth=10, n_estimators=200;, score=0.904 total time=   3.6s
[CV 3/3] END learning_rate=0.1, max_depth=10, n_estimators=200;, score=0.898 total time=   5.0s
[CV 1/3] END learning_rate=0.05, max_depth=2, n_estimators=300;, score=0.879 total time=   1.9s
[CV 2/3] END learning_rate=0.05, max_depth=2, n_estimators=300;, score=0.855 total time=   2.2s
[CV 3/3] END learning_rate=0.05, max_depth=2, n_estimators=300;, score=0.863 total time=   1.7s
[CV 1/3] END learning_rate=0.2, max_depth=7, n_estimators=400;, score=0.894 total time=   5.1s
[CV 2/3] END learning_rate=0.2, max_depth=7, n_estimators=400;, score=0.905 total time=   4.7s
[CV 3/3] END learning_rate=0.2, max_depth=7, n_estimators=400;, score=0.898 total time=   5.0s
[CV 1/3] END learning_rate=0.1, max_depth=3, n_estimators=300;, score=0.903 total time=   2.5s
[CV 2/3] END learning_rate=0.1, max_depth=3, n_estimators=300;, score=0.913 total time=   2.4s
[CV 3/3] END learning_rate=0.1, max_depth=3, n_estimators=300;, score=0.894 total time=   2.8s
[CV 1/3] END learning_rate=0.3, max_depth=2, n_estimators=400;, score=0.897 total time=   2.2s
[CV 2/3] END learning_rate=0.3, max_depth=2, n_estimators=400;, score=0.898 total time=   2.1s
[CV 3/3] END learning_rate=0.3, max_depth=2, n_estimators=400;, score=0.899 total time=   1.9s
[CV 1/3] END learning_rate=0.3, max_depth=7, n_estimators=400;, score=0.884 total time=   7.3s
[CV 2/3] END learning_rate=0.3, max_depth=7, n_estimators=400;, score=0.897 total time=   5.7s
[CV 3/3] END learning_rate=0.3, max_depth=7, n_estimators=400;, score=0.898 total time=   4.3s
[CV 1/3] END learning_rate=0.1, max_depth=7, n_estimators=400;, score=0.894 total time=   5.4s
[CV 2/3] END learning_rate=0.1, max_depth=7, n_estimators=400;, score=0.904 total time=   6.6s
[CV 3/3] END learning_rate=0.1, max_depth=7, n_estimators=400;, score=0.898 total time=   8.2s
[CV 1/3] END learning_rate=0.2, max_depth=2, n_estimators=100;, score=0.901 total time=   0.6s
[CV 2/3] END learning_rate=0.2, max_depth=2, n_estimators=100;, score=0.880 total time=   0.6s
[CV 3/3] END learning_rate=0.2, max_depth=2, n_estimators=100;, score=0.877 total time=   0.6s
[CV 1/3] END learning_rate=0.1, max_depth=5, n_estimators=400;, score=0.890 total time=   4.5s
[CV 2/3] END learning_rate=0.1, max_depth=5, n_estimators=400;, score=0.908 total time=   3.7s
[CV 3/3] END learning_rate=0.1, max_depth=5, n_estimators=400;, score=0.905 total time=   4.0s
[CV 1/3] END learning_rate=0.05, max_depth=7, n_estimators=500;, score=0.897 total time=   7.2s
[CV 2/3] END learning_rate=0.05, max_depth=7, n_estimators=500;, score=0.908 total time=   7.2s
[CV 3/3] END learning_rate=0.05, max_depth=7, n_estimators=500;, score=0.900 total time=   7.5s
[CV 1/3] END learning_rate=0.3, max_depth=7, n_estimators=100;, score=0.887 total time=   1.1s
[CV 2/3] END learning_rate=0.3, max_depth=7, n_estimators=100;, score=0.896 total time=   1.5s
[CV 3/3] END learning_rate=0.3, max_depth=7, n_estimators=100;, score=0.892 total time=   1.5s
[CV 1/3] END learning_rate=0.3, max_depth=3, n_estimators=500;, score=0.896 total time=   3.2s
[CV 2/3] END learning_rate=0.3, max_depth=3, n_estimators=500;, score=0.900 total time=   3.1s
[CV 3/3] END learning_rate=0.3, max_depth=3, n_estimators=500;, score=0.901 total time=   3.4s
[CV 1/3] END learning_rate=0.3, max_depth=10, n_estimators=500;, score=0.885 total time=   5.0s
[CV 2/3] END learning_rate=0.3, max_depth=10, n_estimators=500;, score=0.898 total time=   5.0s
[CV 3/3] END learning_rate=0.3, max_depth=10, n_estimators=500;, score=0.904 total time=   5.5s
[CV 1/3] END learning_rate=0.05, max_depth=5, n_estimators=500;, score=0.891 total time=   5.7s
[CV 2/3] END learning_rate=0.05, max_depth=5, n_estimators=500;, score=0.912 total time=   5.7s
[CV 3/3] END learning_rate=0.05, max_depth=5, n_estimators=500;, score=0.909 total time=   6.0s

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.05, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=500, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [74]:
import xgboost as xgb

model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.05, max_delta_step=0, max_depth=5,
              min_child_weight=1, monotone_constraints='()',
              n_estimators=500, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

model.fit(train_x, train_y)


# make training prediction
train_yhat = model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.9959049959049959, 0.9147869674185464)

## LogitBoost

In [1]:
%pip install logitboost

Collecting logitboost
  Downloading logitboost-0.7-py3-none-any.whl (9.1 kB)
Installing collected packages: logitboost
Successfully installed logitboost-0.7
Note: you may need to restart the kernel to use updated packages.


In [76]:
from logitboost import LogitBoost

lboost = LogitBoost(n_estimators = 200, random_state = 555)
lboost.fit(train_x, train_y)


# make training prediction
train_yhat = lboost.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = lboost.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.9247491638795986, 0.8849557522123894)

In [78]:
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier

estimators = [('rf', RandomForestClassifier(n_estimators = 500, random_state = 555)),
             ('lb', LogitBoost(n_estimators = 200, random_state = 555))
             ]

clf = StackingClassifier(estimators)

clf.fit(train_x, train_y)

# make training prediction
train_yhat = clf.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = clf.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.9835661462612982, 0.8936708860759494)

In [80]:
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(train_x, train_y)

# make training prediction
train_yhat = svc_model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = svc_model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.7735849056603773, 0.7091172214182345)

In [None]:
### Tuning SVM takes a lot of time

param_dist = {"C":[0.01, 0.1, 1, 10, 100, 1000, 10000],
             "kernel": ['linear', 'poly', 'rbf'],
             "degree": [2, 3, 4],
             "gamma": ['auto', 'scale']}

grid_search = RandomizedSearchCV(svc_model, param_dist, cv = 3, n_iter = 30,  
                                   verbose=10, n_jobs=-1, scoring = "f1")

grid_search.fit(train_x, train_y)

grid_search.best_estimator_