# Modeling - Hyperparameters Optimized
- After hyperparameter optimization, we run the models again (10 iterations) with the same features and the best parameters found by GridSearchCV and RandomizedSearchCV.
- The goal is to improve model performance.
- Note: GridSearchCV and RandomizedSearchCV do not always improve the results.
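
The workflow described above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the search that produced the parameters in this notebook; the grid values and the names `X_demo`, `y_demo`, `param_grid` and `tuned_clf` are assumptions chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.85, 0.15], random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=123)

# Hypothetical grid; a real search would usually cover more values
param_grid = {'C': [0.1, 1, 5, 10]}

search = GridSearchCV(LogisticRegression(solver='lbfgs', max_iter=500),
                      param_grid, scoring='f1', cv=5)
search.fit(X_tr, y_tr)

# Best parameter combination found across the cross-validation folds
best_params = search.best_params_
print(best_params)

# Re-run the model with the best parameters, as the sections below do
tuned_clf = LogisticRegression(solver='lbfgs', max_iter=500, **best_params)
tuned_clf.fit(X_tr, y_tr)
print('test accuracy: {:.3f}'.format(tuned_clf.score(X_te, y_te)))
```

The search only ever sees the values listed in the grid, so a tuned model can match or even trail the defaults when the grid is narrow.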

<img align="center" width="600" height="500" src="hypermerameter_optimized_modeling.ipynb">


## Results <a name="t"></a>
1. [XGBoost](#xgb) 
2. [SVC](#svc) 
3. [Logistic Regression](#log) 
4. [KNN](#knn)
5. [Random Forest](#rf) 
6. [MLP](#mlp) 

In [1]:
# for preprocessing/eda models
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# balancing
from imblearn.over_sampling import SMOTE

# accuracy metrics and data split models
from sklearn.model_selection import train_test_split
from sklearn import metrics, model_selection
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

pd.set_option('display.max_columns', 500) # display up to 500 columns
pd.set_option('display.max_rows', 500) # display up to 500 rows

Using TensorFlow backend.


In [2]:
# read in data 
all_data = pd.read_csv('all_model_data.csv', index_col = 0)
all_data.head()

Unnamed: 0,Weekend,Revenue,Administrative_Duration_Scaled,Informational_Duration_Scaled,ProductRelated_Duration_Scaled,BounceRates_Scaled,ExitRates_Scaled,PageValues_Scaled,ExitRatesImpute_Scaled,totalFracAdmin_Scaled,totalFracInfo_Scaled,totalFracProd_Scaled,BounceExitAvg_Scaled,BounceExitW1_Scaled,BounceExitW2_Scaled,BounceExitW3_Scaled,BounceExitW4_Scaled,BouncePageRatio_Scaled,ExitPageRatio_Scaled,InfoPageRatio_Scaled,ProdRelPageRatio_Scaled,InfoBounceRatio_Scaled,AdminBounceRatio_Scaled,ProdRelBounceRatio_Scaled,InfoExitRatio_Scaled,ProdRelExitRatio_Scaled,Administrative_Duration_Scaled_Norm,Informational_Duration_Scaled_Norm,ProductRelated_Duration_Scaled_Norm,BounceRates_Scaled_Norm,ExitRates_Scaled_Norm,PageValues_Scaled_Norm,ExitRatesImpute_Scaled_Norm,totalFracAdmin_Scaled_Norm,totalFracInfo_Scaled_Norm,totalFracProd_Scaled_Norm,BounceExitAvg_Scaled_Norm,BounceExitW1_Scaled_Norm,BounceExitW2_Scaled_Norm,BounceExitW3_Scaled_Norm,BounceExitW4_Scaled_Norm,BouncePageRatio_Scaled_Norm,ExitPageRatio_Scaled_Norm,InfoPageRatio_Scaled_Norm,ProdRelPageRatio_Scaled_Norm,InfoBounceRatio_Scaled_Norm,AdminBounceRatio_Scaled_Norm,ProdRelBounceRatio_Scaled_Norm,InfoExitRatio_Scaled_Norm,ProdRelExitRatio_Scaled_Norm,VisitorType_bin_1,VisitorType_bin_2,VisitorType_bin_3,Month_bin_1,Month_bin_2,Month_bin_3,Month_bin_4,SpecialDay_0.0,SpecialDay_0.2,SpecialDay_0.4,SpecialDay_0.6,SpecialDay_0.8,SpecialDay_1.0,Browser_Bin_1,Browser_Bin_2,Browser_Bin_3,TrafficType_Bin_1,TrafficType_Bin_2,TrafficType_Bin_3,Region_Bin_1,Region_Bin_2,Region_Bin_3,OperatingSystems_Bin_1,OperatingSystems_Bin_2,OperatingSystems_Bin_3,Informational_Duration_Scaled_Bin,PageValues_Scaled_Bin,totalFracInfo_Scaled_Bin,BouncePageRatio_Scaled_Bin,ExitPageRatio_Scaled_Bin,InfoPageRatio_Scaled_Bin,ProdRelPageRatio_Scaled_Bin,InfoBounceRatio_Scaled_Bin,InfoExitRatio_Scaled_Bin,totalFracProd_Bin,Administrative_Duration_Norm_Scaled,Informational_Duration_Norm_Scaled,ProductRelated_Duration_Norm_Scaled,BounceRates_Norm_Scaled,ExitRates_Norm_Scaled,PageValues_Norm_Scaled,ExitRatesImpute_Norm_Scaled,totalFracAdmin_Norm_Scaled,totalFracInfo_Norm_Scaled,totalFracProd_Norm_Scaled,BounceExitAvg_Norm_Scaled,BounceExitW1_Norm_Scaled,BounceExitW2_Norm_Scaled,BounceExitW3_Norm_Scaled,BounceExitW4_Norm_Scaled,BouncePageRatio_Norm_Scaled,ExitPageRatio_Norm_Scaled,InfoPageRatio_Norm_Scaled,ProdRelPageRatio_Norm_Scaled,InfoBounceRatio_Norm_Scaled,AdminBounceRatio_Norm_Scaled,ProdRelBounceRatio_Norm_Scaled,InfoExitRatio_Norm_Scaled,ProdRelExitRatio_Norm_Scaled
0,False,False,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.339602,0.19895,0.0,0.196854,0.0,0.0,1.0,0.388586,0.376399,0.364063,0.160423,0.16993,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,True,False,False,True,False,False,False,True,False,False,False,False,False,True,False,False,True,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,-0.996659,-0.492101,-2.096783,1.733188,1.982547,-0.531818,1.982622,-1.005365,-0.515133,0.757905,1.76066,1.784062,1.80832,1.738774,1.718535,-0.366273,-0.496257,-0.298863,-0.532522,-0.39044,-1.029711,-0.986837,-0.491352,-2.077588
1,False,False,0.0,0.0,0.001,0.0,0.5,0.0,0.499561,0.0,0.0,1.0,0.25,0.2,0.15,0.3,0.35,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.5e-05,0.0,0.0,0.031306,0.0,0.177272,0.0,0.175783,0.0,0.0,1.0,0.342421,0.320879,0.294237,0.130272,0.142518,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006663,True,False,False,True,False,False,False,True,False,False,False,False,False,True,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,-0.996659,-0.492101,-1.074189,-0.974179,1.569866,-0.531818,1.57393,-1.005365,-0.515133,0.757905,1.171289,1.064098,0.903147,1.245637,1.299009,-0.366273,-0.496257,-0.298863,-0.532522,-0.39044,-1.029711,-0.986837,-0.491352,-1.190272
2,False,False,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.339602,0.19895,0.0,0.196854,0.0,0.0,1.0,0.388586,0.376399,0.364063,0.160423,0.16993,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,True,False,False,True,False,False,False,True,False,False,False,False,False,True,False,False,True,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,True,-0.996659,-0.492101,-2.096783,1.733188,1.982547,-0.531818,1.982622,-1.005365,-0.515133,0.757905,1.76066,1.784062,1.80832,1.738774,1.718535,-0.366273,-0.496257,-0.298863,-0.532522,-0.39044,-1.029711,-0.986837,-0.491352,-2.077588
3,False,False,0.0,0.0,4.2e-05,0.25,0.7,0.0,0.699736,0.0,0.0,1.0,0.475,0.43,0.385,0.52,0.565,0.0,0.0,0.0,0.0,0.0,0.0,1.705531e-07,0.0,1e-06,0.0,0.0,0.006454,0.314382,0.190387,0.0,0.188576,0.0,0.0,1.0,0.375055,0.362113,0.348861,0.150307,0.16007,0.0,0.0,0.0,0.0,0.0,0.0,0.142554,0.0,0.001151,True,False,False,True,False,False,False,True,False,False,False,False,False,True,False,False,True,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,True,-0.996659,-0.492101,-1.875436,1.499177,1.832073,-0.531818,1.834505,-1.005365,-0.515133,0.757905,1.600768,1.609932,1.618258,1.592022,1.58424,-0.366273,-0.496257,-0.298863,-0.532522,-0.39044,-1.029711,0.62373,-0.491352,-1.788555
4,True,False,0.0,0.0,0.009809,0.1,0.25,0.0,0.249341,0.0,0.0,1.0,0.175,0.16,0.145,0.19,0.205,0.0,0.0,0.0,0.0,0.0,0.0,0.0001003332,0.0,0.000873,0.0,0.0,0.089834,0.254789,0.136293,0.0,0.135385,0.0,0.0,1.0,0.315545,0.303794,0.291588,0.106852,0.114152,0.0,0.0,0.0,0.0,0.0,0.0,0.314969,0.0,0.028814,True,False,False,True,False,False,False,True,False,False,False,False,False,False,True,False,True,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,-0.996659,-0.492101,0.057515,0.97376,0.72246,-0.531818,0.723407,-1.005365,-0.515133,0.757905,0.825942,0.845382,0.869842,0.81042,0.797889,-0.366273,-0.496257,-0.298863,-0.532522,-0.39044,-1.029711,0.990261,-0.491352,-0.279806


### Separate features and label

In [3]:
# select X and y 
features = all_data.drop('Revenue', axis =1) #features
target = all_data['Revenue'] #target
print(all_data.shape)
print(features.shape)
print(target.shape)

(12330, 109)
(12330, 108)
(12330,)


## Logistic Regression <a name="log"></a>

Back to [Table of Contents](#t)

In [4]:
# select x and y
X = features[['ProdRelPageRatio_Scaled_Bin','totalFracAdmin_Scaled','Administrative_Duration_Scaled'
             ,'BounceRates_Norm_Scaled', 'ExitRates_Scaled','SpecialDay_1.0']]
y = target

In [5]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []
f1_scores= []
roc_scores = []
    
# loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):
    
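    # initialize logistic regression
    # (class_weight=dict passed the built-in type rather than a mapping; it appears
    # to have been ignored, so None states the same thing explicitly)
    clf = LogisticRegression(solver='lbfgs', C=5, class_weight=None,
                             dual=False, random_state=123, max_iter=90,
                             verbose=0, warm_start=True)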

    # create training and testing vars
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)
        
    # oversample the minority class with SMOTE (training split only)
    sm = SMOTE(random_state=123, sampling_strategy = 'minority')
    x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

    # Train model
    clf.fit(x_train_res, y_train_res)

    # Predict on test set
    pred_y = clf.predict(X_test)

    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle=True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring=scoring) 
    
    # the mean over the 10 folds is reported below as "bias", the std dev over the folds as "variance"
    f1_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='f1')
    roc_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='roc_auc')
    
    #calculate AUC
    clf_roc_auc = roc_auc_score(y_test, pred_y)
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y, average='weighted')[2])
    auc_lst.append(clf_roc_auc)

# display average AUC and F1 score
print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

print('average f1 score (bias)', f1_scores.mean())
print('average f1 score (variance)', f1_scores.std())
print('average AUC score (bias)', roc_scores.mean())
print('average AUC score (variance)', roc_scores.std())
    
# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y ),'class' )

# Print accuracy score
print('Accuracy of classifier on test set: {:.3f}'.format(clf.score(X_test, y_test)))
    
# Display 10-fold cross validation average accuracy
print("10-fold cross validation average accuracy of clf_0: %.3f" % (results.mean()))
    
# calculate confusion matrix
confusion_matrix_y = confusion_matrix(y_test, pred_y)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y))

F1 0.8749; AUC 0.8358 
average f1 score (bias) 0.8417846719871267
average f1 score (variance) 0.010284491967658438
average AUC score (bias) 0.8840503461580618
average AUC score (variance) 0.007458558130610671
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.866
10-fold cross validation average accuracy of clf_0: 0.848
Confusion Matrix for Classfier:
[[1822  247]
 [  83  314]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.96      0.88      0.92      2069
        True       0.56      0.79      0.66       397

    accuracy                           0.87      2466
   macro avg       0.76      0.84      0.79      2466
weighted avg       0.89      0.87      0.87      2466
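
The per-class numbers in the report for the `True` class follow directly from the confusion matrix. The arithmetic below is a small worked check using the four counts printed above; the variable names are ours.

```python
# Confusion matrix copied from the output above:
# rows are true classes (False, True), columns are predicted classes
tn, fp = 1822, 247
fn, tp = 83, 314

precision = tp / (tp + fp)   # share of predicted True rows that are truly True
recall = tp / (tp + fn)      # share of truly True rows that were predicted True
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print('precision {:.2f}, recall {:.2f}, f1 {:.2f}, accuracy {:.3f}'.format(
    precision, recall, f1, accuracy))
# prints: precision 0.56, recall 0.79, f1 0.66, accuracy 0.866
```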



- results stay the same (tuning did not help)
- perhaps our GridSearchCV search was not extensive enough, or the default parameters are already the best 
    - we believe it could be more extensive, but our computer was not powerful enough 
- this model is more variance-prone, so it might benefit from **bagging**
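
As a rough illustration of the bagging idea mentioned above, the sketch below wraps a logistic regression in scikit-learn's `BaggingClassifier` and compares the fold-to-fold spread with a single model. It runs on synthetic data, and `n_estimators=25` is an arbitrary assumption, not a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the real data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=6,
                                     weights=[0.85, 0.15], random_state=123)

base = LogisticRegression(solver='lbfgs', C=5, max_iter=500)

# Each of the 25 copies is trained on a bootstrap sample of the rows;
# the ensemble combines their predictions
bagged = BaggingClassifier(base, n_estimators=25, random_state=123)

single_scores = cross_val_score(base, X_demo, y_demo, cv=5, scoring='f1')
bagged_scores = cross_val_score(bagged, X_demo, y_demo, cv=5, scoring='f1')

print('single: mean {:.3f}, std {:.3f}'.format(single_scores.mean(), single_scores.std()))
print('bagged: mean {:.3f}, std {:.3f}'.format(bagged_scores.mean(), bagged_scores.std()))
```

Whether the spread actually shrinks depends on the data; a linear model is fairly stable already, so the gain may be small.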

## SVC <a name="svc"></a>

Back to [Table of Contents](#t)

In [6]:
# select x and y
X = features[['Month_bin_2','Month_bin_4','Month_bin_1','totalFracProd_Bin',
              'ProdRelPageRatio_Scaled_Bin','BounceExitAvg_Norm_Scaled','totalFracInfo_Scaled']]
y = target

In [None]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []
f1_scores = []
roc_scores = []

# loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):

    # create training and testing vars
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle = True)
    
    # begin up-sampling with SMOTE (training split only)
    sm = SMOTE(random_state=123, sampling_strategy = 'minority')
    x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

    # Train model
    clf = SVC(kernel='rbf', gamma=1.0672387970376063, class_weight='balanced', C=0.8914369396699439,
            probability=True, random_state = 123) # penalize

    clf.fit(x_train_res, y_train_res)

    # Predict on test set
    pred_y = clf.predict(X_test)

    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle = True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring=scoring)
    
    # the mean over the 10 folds is reported below as "bias", the std dev over the folds as "variance"
    f1_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='f1')
    roc_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='roc_auc')

    #calculate AUC
    clf_roc_auc = roc_auc_score(y_test, pred_y)
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y, average='weighted')[2])
    auc_lst.append(clf_roc_auc)
    

print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

print('average f1 score (bias)', f1_scores.mean())
print('average f1 score (variance)', f1_scores.std())
print('average AUC score (bias)', roc_scores.mean())
print('average AUC score (variance)', roc_scores.std())

# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y ),'class' )

print('Accuracy of classifier on test set: {:.3f}'.format(clf.score(X_test, y_test)))

print("10-fold cross validation average accuracy of clf_3: %.3f" % (results.mean()))

confusion_matrix_y = confusion_matrix(y_test, pred_y)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y))
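
The loops in this notebook reuse `random_state=123` for the split and for SMOTE on every pass, so each pass is expected to reproduce the same numbers and the average over the runs equals a single run. The sketch below, on synthetic data with a hypothetical helper `run_once`, contrasts that with varying the seed per pass, which is one way to get a spread across runs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=6,
                                     weights=[0.85, 0.15], random_state=0)

def run_once(seed):
    """Split, fit and score once, using the given seed for the split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed)
    clf = LogisticRegression(solver='lbfgs', max_iter=500).fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te), average='weighted')

fixed_seed_scores = [run_once(123) for _ in range(10)]       # same seed each pass
varied_seed_scores = [run_once(seed) for seed in range(10)]  # new seed each pass

print('fixed seed:  mean {:.4f}, std {:.4f}'.format(
    np.mean(fixed_seed_scores), np.std(fixed_seed_scores)))
print('varied seed: mean {:.4f}, std {:.4f}'.format(
    np.mean(varied_seed_scores), np.std(varied_seed_scores)))
```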

## Random Forest <a name="rf"></a>

Back to [Table of Contents](#t)

In [4]:
X = features[['ProductRelated_Duration_Scaled',
       'BounceRates_Scaled','PageValues_Scaled','totalFracAdmin_Scaled','Month_bin_2','ExitRates_Scaled']]
y = target

In [5]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []
f1_scores = []
roc_scores = []

# loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):

    # create training and testing vars
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle = True)
    
    # oversample the minority class with SMOTE (training split only)
    sm = SMOTE(random_state=123, sampling_strategy = 'minority')
    x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

    # Train model
    clf = RandomForestClassifier(bootstrap=False,class_weight='balanced',criterion='entropy',max_depth=20,
                                 max_features=0.4,max_leaf_nodes=5,min_samples_leaf=20,min_samples_split=14,
                                 n_estimators=100,random_state = 123)
    # fit model
    clf.fit(x_train_res, y_train_res)

    # Predict on test set
    pred_y = clf.predict(X_test)

    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle = True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring=scoring)
    
    # the mean over the 10 folds is reported below as "bias", the std dev over the folds as "variance"
    f1_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='f1')
    roc_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='roc_auc')

    # calculate AUC
    clf_roc_auc = roc_auc_score(y_test, pred_y)
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y, average='weighted')[2])
    auc_lst.append(clf_roc_auc)
    
    
print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

print('average f1 score (bias)', f1_scores.mean())
print('average f1 score (variance)', f1_scores.std())
print('average AUC score (bias)', roc_scores.mean())
print('average AUC score (variance)', roc_scores.std())

# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y ),'class' )

print('Accuracy of classifier on test set: {:.3f}'.format(clf.score(X_test, y_test)))

print("10-fold cross validation average accuracy of clf_4: %.3f" % (results.mean()))

confusion_matrix_y = confusion_matrix(y_test, pred_y)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y))

F1 0.8740; AUC 0.8371 
average f1 score (bias) 0.8701715314075807
average f1 score (variance) 0.011241117023038623
average AUC score (bias) 0.9195668204903518
average AUC score (variance) 0.0055102743147837025
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.865
10-fold cross validation average accuracy of clf_4: 0.872
Confusion Matrix for Classfier:
[[1817  252]
 [  81  316]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.96      0.88      0.92      2069
        True       0.56      0.80      0.65       397

    accuracy                           0.86      2466
   macro avg       0.76      0.84      0.79      2466
weighted avg       0.89      0.86      0.87      2466



- our average cross-validated F1 and AUC decreased 
- as random forest is already a type of ensemble, we assume the model wouldn't benefit much from bagging (high variance)
- instead we will use the model in the voting ensemble 
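
A voting ensemble of the kind mentioned above could look roughly like this. It is a sketch on synthetic data using scikit-learn's `VotingClassifier`; the three member models and their settings are placeholders, not the tuned models from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the real data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=6,
                                     weights=[0.85, 0.15], random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=123)

# Soft voting averages the predicted class probabilities of the members
voter = VotingClassifier(
    estimators=[
        ('log', LogisticRegression(solver='lbfgs', max_iter=500)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=123)),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
    ],
    voting='soft')
voter.fit(X_tr, y_tr)

print('Accuracy of voting ensemble on test set: {:.3f}'.format(voter.score(X_te, y_te)))
```

Soft voting needs every member to expose `predict_proba`; with `voting='hard'` the members vote with their predicted labels instead.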

## XGBoost <a name="xgb"></a>

Back to [Table of Contents](#t)

In [6]:
X = features[['PageValues_Norm_Scaled','AdminBounceRatio_Norm_Scaled','ProdRelExitRatio_Norm_Scaled',
              'Month_bin_4','Month_bin_2','VisitorType_bin_2','Informational_Duration_Scaled','totalFracProd_Bin']]
y = target

In [7]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []
f1_scores = []
roc_scores = []

# loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):

    # create training and testing vars
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle = True)
    
    # oversample the minority class with SMOTE (training split only)
    sm = SMOTE(random_state=123, sampling_strategy = 'minority')
    x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

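    # fit model on training data
    # (loss and max_leaf_nodes are scikit-learn GradientBoostingClassifier
    # arguments, not XGBoost parameters, so they are dropped here)
    clf = XGBClassifier(random_state=123, learning_rate=0.3, max_depth=11,
                        n_estimators=110, subsample=1.0)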
  
    clf.fit(x_train_res, y_train_res)

    # make predictions for test data
    y_pred = clf.predict(X_test)
    predictions = [round(value) for value in y_pred]
    
    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle = True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring=scoring)
    
    # the mean over the 10 folds is reported below as "bias", the std dev over the folds as "variance"
    f1_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='f1')
    roc_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='roc_auc')

    # calculate AUC
    clf_roc_auc = roc_auc_score(y_test, y_pred)
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, y_pred, average='weighted')[2])
    auc_lst.append(clf_roc_auc)
    
    
print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

print('average f1 score (bias)', f1_scores.mean())
print('average f1 score (variance)', f1_scores.std())
print('average AUC score (bias)', roc_scores.mean())
print('average AUC score (variance)', roc_scores.std())

# Is our model still predicting just one class?
print('Model is predicting ',np.unique( y_pred ),'class' )

print('Accuracy of classifier on test set: {:.3f}'.format(clf.score(X_test, y_test)))

print("10-fold cross validation average accuracy of clf_4: %.3f" % (results.mean()))

confusion_matrix_y = confusion_matrix(y_test, y_pred)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, y_pred))

F1 0.8536; AUC 0.8042 
average f1 score (bias) 0.8897120405587552
average f1 score (variance) 0.005450070995259071
average AUC score (bias) 0.9516176379740318
average AUC score (variance) 0.003491019553245417
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.842
10-fold cross validation average accuracy of clf_4: 0.888
Confusion Matrix for Classfier:
[[1780  289]
 [ 100  297]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.95      0.86      0.90      2069
        True       0.51      0.75      0.60       397

    accuracy                           0.84      2466
   macro avg       0.73      0.80      0.75      2466
weighted avg       0.88      0.84      0.85      2466



- we got better results here, but precision and recall decreased
- we assume the GridSearchCV search should be more comprehensive 
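
One way to make a search broader without paying for a full grid is `RandomizedSearchCV`, which samples a fixed number of combinations (`n_iter`) from parameter distributions. The sketch below is an assumption-laden illustration: it uses synthetic data, scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that it depends only on scikit-learn and SciPy, and ranges picked for speed rather than quality.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the real data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.85, 0.15], random_state=123)

# Hypothetical ranges; uniform(loc, scale) samples from [loc, loc + scale]
param_distributions = {
    'learning_rate': uniform(0.01, 0.3),  # 0.01 to 0.31
    'max_depth': randint(2, 8),           # integers 2 to 7
    'n_estimators': randint(20, 80),      # integers 20 to 79
    'subsample': uniform(0.6, 0.4),       # 0.6 to 1.0
}

# n_iter fixes the compute budget regardless of how wide the ranges are
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=123),
    param_distributions, n_iter=8, scoring='f1', cv=3, random_state=123)
search.fit(X_demo, y_demo)

print(search.best_params_)
print('best CV f1: {:.3f}'.format(search.best_score_))
```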

## Neural Network <a name="mlp"></a>

Back to [Table of Contents](#t)

In [None]:
X = features[['PageValues_Scaled_Bin', 'ExitRates_Scaled']]
y = target

In [124]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []
f1_scores = []
roc_scores = []

# loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):

    # create training and testing vars
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle = True)
    
    # balance classes (up-sample the training split only)
    sm = SMOTE(random_state=123, sampling_strategy = 'minority')
    x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

    # initialize our model params
    # (gamma, class_weight and C are not MLPClassifier arguments and raise a
    # TypeError, so they are removed)
    clf = MLPClassifier(hidden_layer_sizes=(20, 40), random_state=123,
                        verbose=True, activation='identity', solver='lbfgs',
                        alpha=0.0003, learning_rate='constant')
    
    # fit model
    clf.fit(x_train_res,y_train_res)
    
    # predict
    pred_y = clf.predict(X_test)
    
    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle = True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring=scoring)
    
    # the mean over the 10 folds is reported below as "bias", the std dev over the folds as "variance"
    f1_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='f1')
    roc_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='roc_auc')

    # calculate AUC
    clf_roc_auc = roc_auc_score(y_test, pred_y)
    
    # store this run's weighted F1 and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y, average='weighted')[2])
    auc_lst.append(clf_roc_auc)

print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

print('average f1 score (bias)', f1_scores.mean())
print('average f1 score (variance)', f1_scores.std())
print('average AUC score (bias)', roc_scores.mean())
print('average AUC score (variance)', roc_scores.std())

# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y ),'class' )

print('Accuracy of classifier on test set: {:.3f}'.format(clf.score(X_test, y_test)))

print("10-fold cross validation average accuracy of clf_4: %.3f" % (results.mean()))

confusion_matrix_y = confusion_matrix(y_test, pred_y)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y))

F1 0.8710; AUC 0.7212 
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.880
10-fold cross validation average accuracy of clf_4: 0.851
Confusion Matrix for Classfier:
[[2960  130]
 [ 314  295]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.90      0.96      0.93      3090
        True       0.69      0.48      0.57       609

    accuracy                           0.88      3699
   macro avg       0.80      0.72      0.75      3699
weighted avg       0.87      0.88      0.87      3699



## KNN <a name="knn"></a>

Back to [Table of Contents](#t)

In [8]:
X = features[['PageValues_Norm_Scaled','ExitRates_Scaled','totalFracProd_Scaled']]

In [9]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []
f1_scores = []
roc_scores = []

# loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):

    # create training and testing vars
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle = True)
    
    # oversample the minority class with SMOTE (training split only)
    sm = SMOTE(random_state=123, sampling_strategy = 'minority')
    x_train_res, y_train_res = sm.fit_resample(X_train, y_train)
    
    # start our model with params
    clf = KNeighborsClassifier(n_neighbors=4,algorithm='auto',leaf_size=20,metric='minkowski',
                              p=3,weights='distance')
    # fit the model
    clf.fit(x_train_res, y_train_res)
    
    pred_y = clf.predict(X_test)
    
    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle = True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring=scoring)
    
    # the mean over the 10 folds is reported below as "bias", the std dev over the folds as "variance"
    f1_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='f1')
    roc_scores = model_selection.cross_val_score(clf, x_train_res, y_train_res, cv=kfold, scoring='roc_auc')

    # calculate AUC
    clf_roc_auc = roc_auc_score(y_test, pred_y)
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y, average='weighted')[2])
    auc_lst.append(clf_roc_auc)

print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

print('average f1 score (bias)', f1_scores.mean())
print('average f1 score (variance)', f1_scores.std())
print('average AUC score (bias)', roc_scores.mean())
print('average AUC score (variance)', roc_scores.std())

# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y ),'class' )

print('Accuracy of classifier on test set: {:.3f}'.format(clf.score(X_test, y_test)))

print("10-fold cross validation average accuracy of clf_4: %.3f" % (results.mean()))

confusion_matrix_y = confusion_matrix(y_test, pred_y)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y))

F1 0.8316; AUC 0.7615 
average f1 score (bias) 0.8806464905875065
average f1 score (variance) 0.00943496057582695
average AUC score (bias) 0.9261122437452487
average AUC score (variance) 0.005792720357839831
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.818
10-fold cross validation average accuracy of clf_4: 0.878
Confusion Matrix for Classfier:
[[1749  320]
 [ 128  269]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.93      0.85      0.89      2069
        True       0.46      0.68      0.55       397

    accuracy                           0.82      2466
   macro avg       0.69      0.76      0.72      2466
weighted avg       0.86      0.82      0.83      2466



- results are now much worse 
- we assume our GridSearchCV search was not comprehensive enough
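
One possible contributor to the gap between the cross-validated scores (F1 about 0.88) and the test scores (F1 about 0.83) is that cross-validation here runs on data that was already oversampled, so points derived from training rows can end up in the validation folds. The sketch below shows the alternative of oversampling inside each fold. It is an assumption-based illustration: it uses synthetic data, and plain random oversampling (the hypothetical helper `oversample_minority`) stands in for SMOTE so that it depends only on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the real data; class 1 is the minority
X_demo, y_demo = make_classification(n_samples=800, n_features=5, n_informative=3,
                                     weights=[0.85, 0.15], random_state=123)

def oversample_minority(X, y, seed):
    """Randomly duplicate class-1 rows until both classes have equal size."""
    n_extra = int((y == 0).sum() - (y == 1).sum())
    extra = resample(X[y == 1], replace=True, n_samples=n_extra, random_state=seed)
    return np.vstack([X, extra]), np.concatenate([y, np.ones(n_extra, dtype=y.dtype)])

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
for train_idx, val_idx in cv.split(X_demo, y_demo):
    # Oversample the training part only; the validation fold keeps its
    # original class balance and contains no duplicated rows
    X_res, y_res = oversample_minority(X_demo[train_idx], y_demo[train_idx], seed=123)
    clf = KNeighborsClassifier(n_neighbors=4, weights='distance').fit(X_res, y_res)
    fold_scores.append(f1_score(y_demo[val_idx], clf.predict(X_demo[val_idx])))

print('F1 per fold:', np.round(fold_scores, 3))
print('mean F1: {:.3f}'.format(np.mean(fold_scores)))
```

Scores obtained this way are measured on untouched validation rows, so they should be closer to what the held-out test set reports.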