## Hyperparameter optimization
We run **GridSearchCV** and **RandomizedSearchCV** to find the best-scoring hyperparameters for each model. Each model is still trained on the same features as before.

### Results: <a name="t"></a>

1. GridSearchCV: [XGBoost](#xgb) 
- AUC score: 0.884893 
- Parameters: <span style="color:red">{'learning_rate': 0.3, 'loss': 'deviance', 'max_depth': 11, 'max_leaf_nodes': 1, 'n_estimators': 110, 'subsample': 1.0}</span>

2. GridSearchCV: [KNN](#knn)
- AUC score: 0.878967
- Parameters: <span style="color:red">{'algorithm': 'auto', 'leaf_size': 20, 'metric': 'minkowski', 'n_neighbors': 4, 'p': 3, 'weights': 'distance'}</span>

3. GridSearchCV: [Random Forest](#rf)
- AUC score: 0.872383 
- Parameters: <span style="color:red">{'bootstrap': False, 'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 20, 'max_features': 0.4, 'max_leaf_nodes': 5, 'min_samples_leaf': 20, 'min_samples_split': 14, 'n_estimators': 100}</span>

4. RandomizedSearchCV: [SVC](#svc)
- AUC score: 0.850234 
- Parameters: <span style="color:red">{'kernel': 'rbf', 'gamma': 1.0672387970376063, 'class_weight': 'balanced', 'C': 0.8914369396699439}</span>

5. GridSearchCV: [Logistic Regression](#lr) 
- AUC score: 0.847899 
- Parameters: <span style="color:red">{'C': 5, 'class_weight': &lt;class 'dict'&gt;, 'dual': False, 'max_iter': 90, 'solver': 'lbfgs', 'verbose': 0, 'warm_start': True}</span>

6. GridSearchCV: [MLP](#mlp) 
- AUC score: 0.847720
- Parameters: <span style="color:red">{'activation': 'identity', 'alpha': 0.0003, 'hidden_layer_sizes': (20, 40), 'learning_rate': 'constant', 'solver': 'lbfgs', 'verbose': True}</span>
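The ranking above can also be kept as a small table, which makes it easier to sort or extend later. The sketch below only re-enters the AUC scores listed in this section; the column names (`model`, `search`, `auc`) are illustrative choices, not names used elsewhere in the notebook.

```python
import pandas as pd

# Scores copied from the results list above (best cross-validated AUC per model).
results = pd.DataFrame(
    [
        ("XGBoost", "GridSearchCV", 0.884893),
        ("KNN", "GridSearchCV", 0.878967),
        ("Random Forest", "GridSearchCV", 0.872383),
        ("SVC", "RandomizedSearchCV", 0.850234),
        ("Logistic Regression", "GridSearchCV", 0.847899),
        ("MLP", "GridSearchCV", 0.847720),
    ],
    columns=["model", "search", "auc"],  # hypothetical column names
)

# Sort from best to worst score and print without the index column.
ranked = results.sort_values("auc", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```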

In [1]:
import pandas as pd
import numpy as np

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.metrics import make_scorer, roc_auc_score

# hyperparameter search and evaluation metrics
from sklearn.model_selection import GridSearchCV
from sklearn import metrics, model_selection
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

Using TensorFlow backend.


In [2]:
data = pd.read_csv('all_model_data.csv', index_col = 0)

In [3]:
data.shape

(12330, 6)

## GridSearchCV and RandomizedSearchCV

### 1. Grid Search: Logistic Regression <a name="lr"></a>

Back to [results](#t)

In [None]:
# select x and y
X = data[['ProdRelPageRatio_Scaled_Bin','totalFracAdmin_Scaled','Administrative_Duration_Scaled'
             ,'BounceRates_Norm_Scaled', 'ExitRates_Scaled','SpecialDay_1.0']]
y = data.Revenue

In [3]:
# score each hyperparameter candidate by AUC
# (make_scorer(roc_auc_score) computes AUC from predicted class labels, not probabilities)
scorer = make_scorer(roc_auc_score)

# hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle = True)

# balance the training data by oversampling the minority class
sm = SMOTE(random_state=123, sampling_strategy = 'minority')
x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# candidate values for the grid search (every combination is tried)
# note: dual=True is only supported by the liblinear solver, so those combinations fail to fit
dual=[True,False]
C = [3,5,7]
max_iter=[90,100,110]
solver = ['lbfgs','newton-cg']
verbose = [0,1,2]
warm_start = [True, False]
# note: `dict` here is the built-in type, not a {class: weight} mapping
class_weight = [dict,'balanced',None]

param_grid = dict(dual=dual,C=C,max_iter=max_iter,solver=solver,warm_start=warm_start,class_weight=class_weight,
                 verbose=verbose)

# Create a classifier with the parameter candidates
grid = GridSearchCV(estimator=LogisticRegression(random_state=123), param_grid=param_grid, n_jobs=-1,scoring=scorer,
                    cv = 3)

# fit grid to the model
grid_result = grid.fit(x_train_res, y_train_res)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.847899 using {'C': 5, 'class_weight': <class 'dict'>, 'dual': False, 'max_iter': 90, 'solver': 'lbfgs', 'verbose': 0, 'warm_start': True}
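A note on the scorer: `make_scorer(roc_auc_score)` with default arguments passes the output of `predict` (hard class labels) to `roc_auc_score`, so the reported value is the area under a one-point ROC curve, which equals balanced accuracy. The sketch below illustrates the difference on synthetic data; `make_classification` and all variable names here are assumptions for the demonstration, not part of this notebook's data. Passing `scoring="roc_auc"` to the search would score on probabilities or decision values instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.85, 0.15], random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)
model = LogisticRegression(max_iter=200).fit(X_tr, y_tr)

# AUC computed from hard labels: what make_scorer(roc_auc_score) measures.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))
# AUC computed from predicted probabilities: what scoring="roc_auc" measures.
auc_from_probs = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
# For binary labels, the label-based AUC coincides with balanced accuracy.
bal_acc = balanced_accuracy_score(y_te, model.predict(X_te))

print(f"AUC from labels:        {auc_from_labels:.4f}")
print(f"balanced accuracy:      {bal_acc:.4f}")
print(f"AUC from probabilities: {auc_from_probs:.4f}")
```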


### 2. Random Search: SVC <a name="svc"></a>

In [3]:
X = data[['Month_bin_2','Month_bin_4','Month_bin_1','totalFracProd_Bin',
              'ProdRelPageRatio_Scaled_Bin','BounceExitAvg_Norm_Scaled','totalFracInfo_Scaled']]
y = data.Revenue

In [6]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, roc_auc_score

# hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle = True)

# balance the training data by oversampling the minority class
sm = SMOTE(random_state=123, sampling_strategy = 'minority')
x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# seed NumPy's random number generator
np.random.seed(123)
# draw one random value per hyperparameter
# (size is 1, so each entry below holds a single candidate and the search sees one combination)
C1 = np.random.normal(1,0.1,1).astype(float)
kernel = np.random.choice(['rbf','sigmoid'],1)
gamma = np.random.uniform(0.1,1.5,1)
# note: `dict` here is the built-in type, not a {class: weight} mapping
class_weight = np.random.choice([dict,'balanced'],1)

# collect the candidates into a dictionary
param_grid1 = dict(C=C1,kernel=kernel,gamma=gamma,class_weight=class_weight)

# initialize the model
svc = SVC(random_state = 123)
# use AUC to score
scorer = make_scorer(roc_auc_score)
# initialize the random search with the candidates, 3-fold CV, and all processors
random = RandomizedSearchCV(estimator=svc, param_distributions=param_grid1, cv = 3, n_jobs=-1,scoring=scorer)

#fit the model
random_result = random.fit(x_train_res, y_train_res)

# Summarize results
print("Best: %f using %s" % (random_result.best_score_, random_result.best_params_))



Best: 0.850234 using {'kernel': 'rbf', 'gamma': 1.0672387970376063, 'class_weight': 'balanced', 'C': 0.8914369396699439}
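In the cell above, every `np.random.*` call draws a single value, so `RandomizedSearchCV` receives exactly one combination to evaluate. A randomized search usually takes distributions (or longer lists) and samples `n_iter` candidates from them. The following is a minimal sketch on synthetic data, assuming ranges similar to the ones above; the data, the `loguniform` range for `C`, and `n_iter=8` are illustrative assumptions rather than the notebook's settings.

```python
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=123)

# Distributions are sampled anew for every candidate; lists are sampled uniformly.
param_distributions = {
    "C": loguniform(1e-1, 1e1),      # assumed range, log-uniform between 0.1 and 10
    "gamma": uniform(0.1, 1.4),      # uniform(loc, scale) covers [0.1, 1.5]
    "kernel": ["rbf", "sigmoid"],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    estimator=SVC(random_state=123),
    param_distributions=param_distributions,
    n_iter=8,               # number of sampled candidates
    cv=3,
    scoring="roc_auc",      # scores on decision values rather than hard labels
    random_state=123,       # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print("candidates evaluated:", len(search.cv_results_["params"]))
print("best parameters:", search.best_params_)
```

With `n_iter` candidates drawn from continuous ranges, the search can actually compare alternatives, which a single pre-drawn value cannot do.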


### 3. Grid Search: MLP <a name="mlp"></a>

In [9]:
X = data[['PageValues_Scaled_Bin', 'ExitRates_Scaled']]
y = data.Revenue

In [4]:
# score each hyperparameter candidate by AUC
scorer = make_scorer(roc_auc_score)

# hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle = True)

# balance the training data by oversampling the minority class
sm = SMOTE(random_state=123, sampling_strategy = 'minority')
x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Set the parameter candidates
hidden_layer_sizes=[(20,40),(20,40,80),(40,80)]
activation = ['identity','logistic','relu']
# note: 'solver' is not a valid MLPClassifier solver (the options are 'lbfgs', 'sgd', 'adam'),
# so those combinations fail to fit
solver = ['lbfgs','solver']
alpha = [0.0003,0.0005,0.0007]
max_iter = [200,300,400]
learning_rate = ['constant', 'invscaling', 'adaptive']
#max_fun = [15000,17000]
verbose = [True,False]

# create param grid (join them in the dictionary)
param_grid = dict(hidden_layer_sizes=hidden_layer_sizes,activation=activation,solver=solver,alpha=alpha,
                 learning_rate=learning_rate,verbose=verbose,max_iter=max_iter)

# Create a classifier with the parameter candidates
grid1 = GridSearchCV(estimator=MLPClassifier(random_state=123), param_grid=param_grid, n_jobs=-1, scoring = scorer, cv = 3)

# Train the classifier on training data
grid_results1 = grid1.fit(x_train_res, y_train_res)

# Summarize results
print("Best: %f using %s" % (grid_results1.best_score_, grid_results1.best_params_))

Best: 0.847720 using {'activation': 'identity', 'alpha': 0.0003, 'hidden_layer_sizes': (20, 40), 'learning_rate': 'constant', 'solver': 'lbfgs', 'verbose': True}
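Because the grid above contains a solver value that `MLPClassifier` does not accept, part of the search is spent on fits that fail. One way to spot this is to look for NaN scores in `cv_results_`. The sketch below reproduces the situation on synthetic data with a deliberately reduced grid; the data, the small `max_iter`, and the variable names are assumptions made to keep the example fast.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=123)

# Reduced grid that keeps the invalid 'solver' entry from the cell above.
param_grid = {"solver": ["lbfgs", "solver"], "alpha": [0.0003, 0.0005]}

with warnings.catch_warnings():
    # Silence fit-failure and convergence warnings for this demonstration.
    warnings.simplefilter("ignore")
    grid = GridSearchCV(
        MLPClassifier(max_iter=50, random_state=123),
        param_grid,
        cv=3,
        scoring="roc_auc",
        error_score=np.nan,  # failed fits are recorded as NaN instead of raising
    )
    grid.fit(X_demo, y_demo)

# Candidates whose mean score is NaN never produced a usable model.
scores = grid.cv_results_["mean_test_score"]
failed = [p for p, s in zip(grid.cv_results_["params"], scores) if np.isnan(s)]

print("failed candidates:", failed)
print("best parameters:", grid.best_params_)
```

Setting `error_score="raise"` instead would stop the search at the first invalid candidate, which can be useful while debugging a grid.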


### 4. Grid Search: XGBoost <a name="xgb"></a>

In [4]:
X = data[['PageValues_Norm_Scaled','AdminBounceRatio_Norm_Scaled','ProdRelExitRatio_Norm_Scaled',
              'Month_bin_4','Month_bin_2','VisitorType_bin_2','Informational_Duration_Scaled','totalFracProd_Bin']]
y = data.Revenue

In [5]:
# score each hyperparameter candidate by AUC
scorer = make_scorer(roc_auc_score)

# hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle = True)

# balance the training data by oversampling the minority class
sm = SMOTE(random_state=123, sampling_strategy = 'minority')
x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Set the parameter candidates
# note: `loss` and `max_leaf_nodes` are scikit-learn GradientBoostingClassifier names,
# not documented XGBClassifier parameters
loss = ['deviance', 'exponential']
learning_rate = [0.3,0.4,0.5]
n_estimators = [110,120,130]
# note: XGBoost requires subsample in (0, 1], so 1.2 is out of range
subsample = [0.8,1.0,1.2]
max_depth = [9,11,13]
max_leaf_nodes = [1,2,None]

# create param grid (join them in the dictionary)
param_grid = dict(loss=loss,learning_rate=learning_rate,n_estimators=n_estimators,subsample=subsample,
                 max_depth=max_depth,max_leaf_nodes=max_leaf_nodes)

# Create a classifier with the parameter candidates
grid2 = GridSearchCV(estimator=XGBClassifier(random_state=123), param_grid=param_grid, n_jobs=-1, scoring = scorer, cv = 3)

# Train the classifier on training data
grid_results2 = grid2.fit(x_train_res, y_train_res)

# Summarize results
print("Best: %f using %s" % (grid_results2.best_score_, grid_results2.best_params_))

Best: 0.884893 using {'learning_rate': 0.3, 'loss': 'deviance', 'max_depth': 11, 'max_leaf_nodes': 1, 'n_estimators': 110, 'subsample': 1.0}
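The grid above uses parameter names (`loss`, `max_leaf_nodes`) that match scikit-learn's `GradientBoostingClassifier` rather than XGBoost. A quick way to catch this kind of mismatch is to compare the grid keys with the estimator's `get_params()` before searching. The helper `unknown_grid_keys` below is a hypothetical name introduced for illustration; it is demonstrated on scikit-learn estimators only, on the assumption that the same check would apply to `XGBClassifier()` since it follows the scikit-learn estimator interface.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier


def unknown_grid_keys(estimator, param_grid):
    """Return the grid keys that are not parameters of the estimator (hypothetical helper)."""
    known = estimator.get_params(deep=True)
    return sorted(key for key in param_grid if key not in known)


# Same keys as the grid in the cell above.
param_grid = {
    "loss": ["deviance", "exponential"],
    "learning_rate": [0.3, 0.4, 0.5],
    "n_estimators": [110, 120, 130],
    "subsample": [0.8, 1.0, 1.2],
    "max_depth": [9, 11, 13],
    "max_leaf_nodes": [1, 2, None],
}

# Every key is a GradientBoostingClassifier parameter.
gb_unknown = unknown_grid_keys(GradientBoostingClassifier(), param_grid)
# A different estimator exposes the mismatch.
rf_unknown = unknown_grid_keys(RandomForestClassifier(), param_grid)

print("unknown for GradientBoostingClassifier:", gb_unknown)
print("unknown for RandomForestClassifier:", rf_unknown)
```

The check only covers parameter names; out-of-range values such as a `subsample` above 1 still have to be caught separately.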


### 5. Grid Search: Random Forest <a name="rf"></a>

In [3]:
X = data[['ProductRelated_Duration_Scaled','BounceRates_Scaled','PageValues_Scaled','totalFracAdmin_Scaled',
         'Month_bin_2','ExitRates_Scaled']]
y = data.Revenue

In [4]:
# score each hyperparameter candidate by AUC
scorer = make_scorer(roc_auc_score)

# hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle = True)

# balance the training data by oversampling the minority class
sm = SMOTE(random_state=123, sampling_strategy = 'minority')
x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Set the parameter candidates
n_estimators= [100]
max_depth= [10, 20]
max_leaf_nodes= [2,5]
class_weight= [None,'balanced']
bootstrap = [True, False]
criterion=['entropy','gini']
# 'sqrt' is what 'auto' meant for classifiers; newer scikit-learn releases no longer accept 'auto'
max_features=['sqrt',0.4]
min_samples_leaf=[15,20]
min_samples_split=[12,14]

# create param grid (join them in the dictionary)
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth, max_leaf_nodes=max_leaf_nodes, class_weight=class_weight,
                 bootstrap=bootstrap,criterion=criterion,max_features=max_features,min_samples_leaf=min_samples_leaf,
                 min_samples_split=min_samples_split)

# Create a classifier with the parameter candidates
clf = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, n_jobs=-1, scoring = scorer, cv=3)

# Train the classifier on training data
grid_result= clf.fit(x_train_res, y_train_res)

# Print out the results 
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.872383 using {'bootstrap': False, 'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 20, 'max_features': 0.4, 'max_leaf_nodes': 5, 'min_samples_leaf': 20, 'min_samples_split': 14, 'n_estimators': 100}
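Grid search cost grows multiplicatively with every list in the grid, so it helps to count the candidates before launching a search. The sketch below uses scikit-learn's `ParameterGrid` on a grid with the same list lengths as the random forest grid above and multiplies by the 3 folds used in the search.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the random forest grid above: one list of length 1, eight of length 2.
param_grid = dict(
    n_estimators=[100],
    max_depth=[10, 20],
    max_leaf_nodes=[2, 5],
    class_weight=[None, "balanced"],
    bootstrap=[True, False],
    criterion=["entropy", "gini"],
    max_features=["sqrt", 0.4],
    min_samples_leaf=[15, 20],
    min_samples_split=[12, 14],
)

n_candidates = len(ParameterGrid(param_grid))  # product of the list lengths
n_fits = n_candidates * 3                      # cv=3 means three fits per candidate

print(f"{n_candidates} candidates, {n_fits} fits")  # 256 candidates, 768 fits
```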


### 6. Grid Search: KNN <a name="knn"></a>

In [5]:
X = data[['PageValues_Norm_Scaled','ExitRates_Scaled','totalFracProd_Scaled']]
y = data.Revenue

In [8]:
# score each hyperparameter candidate by AUC
scorer = make_scorer(roc_auc_score)

# hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle = True)

# balance the training data by oversampling the minority class
sm = SMOTE(random_state=123, sampling_strategy = 'minority')
x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Set the parameter candidates
n_neighbors= [4,5,6]
weights= ['uniform','distance']
algorithm= ['auto', 'ball_tree','kd_tree']
leaf_size=[20,30,40]
p=[3,4]
metric= ['minkowski']

# create param grid (join them in the dictionary)
param_grid = dict(n_neighbors=n_neighbors, weights=weights, algorithm=algorithm, leaf_size=leaf_size, p=p, metric=metric)

# Create a classifier with the parameter candidates
clf = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=param_grid, n_jobs=-1, scoring = scorer, cv=10)

# Train the classifier on training data
grid_result= clf.fit(x_train_res, y_train_res)

# Print out the results 
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.878967 using {'algorithm': 'auto', 'leaf_size': 20, 'metric': 'minkowski', 'n_neighbors': 4, 'p': 3, 'weights': 'distance'}
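Each section splits off `X_test` and `y_test`, but the reported scores are cross-validation scores on the resampled training data. A natural follow-up is to score the refitted best estimator on the held-out split. The following is a minimal sketch on synthetic data, assuming a reduced KNN grid; the data and variable names are illustrative and the resulting number says nothing about the notebook's models.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with three features (not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# Reduced version of the KNN grid above.
param_grid = {"n_neighbors": [4, 5, 6], "weights": ["uniform", "distance"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="roc_auc", cv=3)
grid.fit(X_tr, y_tr)

# best_estimator_ is refit on all training data; score it on data it has never seen.
test_proba = grid.best_estimator_.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_proba)

print(f"cross-validated AUC: {grid.best_score_:.4f}")
print(f"held-out test AUC:   {test_auc:.4f}")
```

Comparing the two numbers gives a rough indication of how optimistic the cross-validated score is.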
