## Baseline Modeling Results
- Find the best sampling technique to balance the data 
    - SMOTE Up-Sampling (see table of contents below)
- Run 7 models with 10 iterations to get baseline results 

![Baseline Results](Baseline_models_results.jpeg)



# Table of contents <a name="t"></a>
1. [Basic Data Prep](#i)
2. [Logistic Regression: Unbalanced performance](#para1)
3. [Logistic Regression: Up-Sample Manual](#para2)
4. [Logistic Regression: Up-Sample SMOTE](#para3)
- Best balancing method: SMOTE (up-sampling)

**Models**
1. [MLP](#paragraph4)
2. [XGBoost](#paragraph5)
3. [Random Forest](#paragraph6)
4. [SVC](#paragraph7)
5. [Logistic Regression](#para3)
6. [Naive Bayes](#paragraph9)
7. [Decision Tree](#paragraph10)

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from sklearn import metrics, model_selection
from sklearn.metrics import classification_report,roc_auc_score,accuracy_score,confusion_matrix
from sklearn.metrics import roc_curve,f1_score,precision_recall_fscore_support

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler

Using TensorFlow backend.


# Basic Data Prep <a name="i"></a>
- Try different modeling techniques and balancing methods on data that has not been preprocessed
- We need to minimally encode the data for this step

In [2]:
# Loaded in the data
data = pd.read_csv('online_shoppers_intention.csv')
data.head(3)

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False


## Encode object/bool variables
- We group months into quarters and one-hot encode the quarters
- We also one-hot encode visitor type

In [3]:
# we have 10 months, missing Jan, Apr 
# encode into quarters 
def division_func_month(div) :
    if div == 'Feb':
        return 1
    elif div == 'Mar':
        return 1
    elif div == 'May':
        return 2
    elif div == 'June':
        return 2
    elif div == 'Jul':
        return 3
    elif div == 'Aug':
        return 3
    elif div == 'Sep':
        return 3
    elif div == 'Oct':
        return 4
    elif div == 'Nov':
        return 4
    elif div == 'Dec':
        return 4
    else:
        return 0

Note: try encoding this some other way too.
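One alternative way to write the same month-to-quarter encoding is a dictionary lookup. This is only a sketch: the names `MONTH_TO_QUARTER` and `month_to_quarter` are hypothetical and not part of the notebook, and the toy Series stands in for `data['Month']`.

```python
import pandas as pd

# Month labels as they appear in the dataset (note 'June', not 'Jun').
# Hypothetical helper names, used for illustration only.
MONTH_TO_QUARTER = {
    'Feb': 1, 'Mar': 1,
    'May': 2, 'June': 2,
    'Jul': 3, 'Aug': 3, 'Sep': 3,
    'Oct': 4, 'Nov': 4, 'Dec': 4,
}

def month_to_quarter(months: pd.Series) -> pd.Series:
    """Map month labels to quarter numbers; unknown labels become 0."""
    return months.map(MONTH_TO_QUARTER).fillna(0).astype(int)

# toy example, not the real data
toy_months = pd.Series(['Feb', 'June', 'Nov', 'Jan'])
print(month_to_quarter(toy_months).tolist())  # [1, 2, 4, 0]
```

The lookup table keeps the mapping in one place, so adding a missing month only needs one new entry.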

In [4]:
# from object to integers
data['quarter'] = data['Month']
# apply function
data['quarter'] = data['quarter'].apply(division_func_month)

In [5]:
data['quarter'].value_counts()

4    5274
2    3652
1    2091
3    1313
Name: quarter, dtype: int64

In [6]:
# encoding for visitor type
def div_fun_visType(div) :
    if div == 'Returning_Visitor':
        return 1
    if div == 'New_Visitor':
        return 2
    else:
        return 0

In [7]:
# from object (string) to integers
data['VisitorType_encode'] = data['VisitorType']
# apply function
data['VisitorType_encode'] = data['VisitorType_encode'].apply(div_fun_visType)

In [8]:
data['VisitorType_encode'].value_counts()

1    10551
2     1694
0       85
Name: VisitorType_encode, dtype: int64

In [9]:
# 1908 sessions ended in a purchase and 10422 did not (12330 sessions in total)
print('{:.2f}% made the purchase'.format(1908/(1908+10422)*100))
print('{:.2f}% did not make the purchase'.format(10422/(1908+10422)*100))

15.47% made the purchase
84.53% did not make the purchase
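The class shares can also be computed directly from the label column instead of typing the counts by hand. A minimal sketch, assuming a boolean label column like `Revenue`; the toy Series below is made up and is not the real data.

```python
import pandas as pd

# toy stand-in for data['Revenue'] (hypothetical values, not the real data)
revenue = pd.Series([True, False, False, False, True,
                     False, False, False, False, False])

# the mean of a boolean Series is the fraction of True values
purchase_share = revenue.mean() * 100
print('{:.2f}% made the purchase'.format(purchase_share))
print('{:.2f}% did not make the purchase'.format(100 - purchase_share))
```

Dividing by the total number of rows (which is what the mean does) gives the share of all sessions, not the ratio between the two classes.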


In [10]:
data1 = data.copy()
data1 = data1.drop(['VisitorType','Month'], axis = 1)

### Get dummies

In [11]:
# Revenue (y) and Weekend only have 2 unique values, so there is no need to encode them
data1 = pd.get_dummies(data1, columns=['VisitorType_encode','quarter'])
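To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up frame (the values are hypothetical). Each listed column is replaced by one indicator column per distinct value, and the other columns are left untouched.

```python
import pandas as pd

# toy frame with one categorical code column and one untouched column
toy = pd.DataFrame({'quarter': [1, 2, 4, 2],
                    'Weekend': [False, True, False, False]})

encoded = pd.get_dummies(toy, columns=['quarter'])
print(encoded.columns.tolist())
# ['Weekend', 'quarter_1', 'quarter_2', 'quarter_4']
```

Only the values that occur in the data get a column, which is why there is no `quarter_3` in this toy output.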

## Logistic Regression: Unbalanced performance <a name="para1"></a>
- use MinMaxScaler to scale the features to the [0, 1] range
- run logistic regression without sampling (balancing data)

Back to [Table of Contents](#t)

In [25]:
# Separate input features (X) and target variable (y)
y = data1.Revenue
X = data1.drop('Revenue', axis=1)

# make a function as we will be reusing it 
def logistic_regression_function(y, X):
    
    # lists for f1-score and AUC
    f1_score_lst = []
    auc_lst = []
    
    scaler = MinMaxScaler() # initialize the scaler
    x_scaled = scaler.fit_transform(X) # scale features to [0, 1] (X has no target feature)
    
    # loop to calculate f1 and auc scores and present averages after 10 runs
    # (random_state is fixed, so every run sees the same split and gives the same scores)
    for count in range(10):
    
        # initialize logistic regression 
        clf_0 = LogisticRegression(solver='lbfgs', max_iter=4000) # raise max_iter so the solver converges (default = 100)

        # create training and testing vars
        Xs_train, Xs_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=123, shuffle=True)
        #print ('X_train: ',X_train.shape,'y_train', y_train.shape)
        #print ('X_test: ',X_test.shape,'y_test: ', y_test.shape)

        # Train model
        clf_0.fit(Xs_train, y_train)

        # Predict on the test set
        pred_y_0 = clf_0.predict(Xs_test)

        #10-fold cross validation
        kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle=True)
        scoring = 'accuracy'
        results = model_selection.cross_val_score(clf_0, Xs_train, y_train, cv=kfold, scoring=scoring)

        #calculate f1-score and AUC
        clf_0_roc_auc = roc_auc_score(y_test, pred_y_0)
        #f1_score = precision_recall_fscore_support(y_test, pred_y_0, average='weighted')[2]
        
        #calculate average f1-score and AUC
        f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y_0, average='weighted')[2])
        auc_lst.append(clf_0_roc_auc)

    # display average AUC and F1 score
    print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))
    
    # Is our model still predicting just one class?
    print('Model is predicting ',np.unique( pred_y_0 ),'class' )

    # Print accuracy score
    print('Accuracy of classifier on test set: {:.3f}'.format(clf_0.score(Xs_test, y_test)))
    
    # Display 10-fold cross validation average accuracy
    print("10-fold cross validation average accuracy of clf_0: %.3f" % (results.mean()))
    
    # calculate confusion matrix
    confusion_matrix_y = confusion_matrix(y_test, pred_y_0)
    print('Confusion Matrix for Classfier:')
    print(confusion_matrix_y)

    print('Classification Report for Classfier:')
    print(classification_report(y_test, pred_y_0))
    
    return 

In [26]:
# call function
logistic_regression_function(y,X)

F1 0.8441; AUC 0.6295 
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.872
10-fold cross validation average accuracy of clf_0: 0.875
Confusion Matrix for Classfier:
[[2042   27]
 [ 289  108]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.88      0.99      0.93      2069
        True       0.80      0.27      0.41       397

    accuracy                           0.87      2466
   macro avg       0.84      0.63      0.67      2466
weighted avg       0.86      0.87      0.84      2466



## Logistic Regression - Up-Sample minority class  <a name="para2"></a>

- Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.
- There are several heuristics for doing so, but the most common way is to simply resample with replacement.

### Important: Do not up-sample before splitting your data
- By oversampling before splitting into training and validation datasets, you “bleed” information from the validation set into the training of the model.
- If you up-sample a dataset before splitting it into a train and validation set, you can end up with the same observation in both datasets.
- As a result, a complex enough model will be able to perfectly predict the value for those observations when predicting on the validation set, inflating the accuracy and recall.
- By oversampling only the training data, none of the information in the validation data is used to create duplicated or synthetic observations, so the results should generalize.
- You should only balance the training set and leave the test set unbalanced, so that the test data still reflects the real class distribution.
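The split-then-up-sample order described above can be sketched on toy data. This is an illustration under stated assumptions: the data frame is synthetic, and it uses `sklearn.utils.resample` rather than the concat-based approach in the notebook cell.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# toy imbalanced data (hypothetical): 90 negatives, 10 positives
rng = np.random.RandomState(0)
toy = pd.DataFrame({'feature': rng.rand(100), 'label': [0] * 90 + [1] * 10})

# 1) split first, so the test set contains only original observations
train, test = train_test_split(toy, test_size=0.2, random_state=0,
                               stratify=toy['label'])

# 2) up-sample the minority class inside the training set only
majority = train[train['label'] == 0]
minority = train[train['label'] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced['label'].value_counts().to_dict())
print(test['label'].value_counts().to_dict())
```

Because the duplicates are drawn from `train` only, no row index in `test` can appear in `train_balanced`, which is the property the bullets above ask for.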

Let's sample the right way by sampling after the data split. For this we need to slightly change our logistic regression.

Back to [Table of Contents](#t)

In [27]:
# Separate input features (X) and target variable (y)
y = data1.Revenue
X = data1.drop('Revenue', axis=1)
 
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []
    
scaler = MinMaxScaler() # initialize the scaler
x_scaled = scaler.fit_transform(X) # scale features to [0, 1] (X has no target feature)
x_scaled = pd.DataFrame(x_scaled) # convert back to a DataFrame, necessary for the concat-based up-sampling
x_scaled.columns = X.columns # name the columns 
    
# loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):
    
    # initialize logistic regression 
    clf_0 = LogisticRegression(solver='lbfgs', max_iter=1000) # raise max_iter so the solver converges (default = 100)

    # create training and testing vars
    Xs_train, Xs_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=123, shuffle=True)
        
    # Begin oversampling: rejoin features and target, then top up each class to the majority size
    oversample = pd.concat([Xs_train,y_train],axis=1)
    max_size = oversample['Revenue'].value_counts().max()
    lst = [oversample]
    for class_index, group in oversample.groupby('Revenue'):
        # sample with replacement the rows needed to reach max_size (0 rows for the majority class)
        lst.append(group.sample(max_size-len(group), replace=True))
    Xs_train = pd.concat(lst)
    y_train = Xs_train['Revenue'].copy()
    Xs_train = Xs_train.drop(columns='Revenue')

    # Train model
    clf_0.fit(Xs_train, y_train)

    # Predict on the test set
    pred_y_0 = clf_0.predict(Xs_test)

    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle=True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf_0, Xs_train, y_train, cv=kfold, scoring=scoring)

    #calculate f1-score and AUC
    clf_0_roc_auc = roc_auc_score(y_test, pred_y_0)
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y_0, average='weighted')[2])
    auc_lst.append(clf_0_roc_auc)

# display average AUC and F1 score
print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))
    
# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y_0 ),'class' )

# Print accuracy score
print('Accuracy of classifier on test set: {:.3f}'.format(clf_0.score(Xs_test, y_test)))
    
# Display 10-fold cross validation average accuracy
print("10-fold cross validation average accuracy of clf_0: %.3f" % (results.mean()))
    
# calculate confusion matrix
confusion_matrix_y = confusion_matrix(y_test, pred_y_0)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y_0))

F1 0.8451; AUC 0.7915 
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.834
10-fold cross validation average accuracy of clf_0: 0.800
Confusion Matrix for Classfier:
[[1764  305]
 [ 104  293]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.94      0.85      0.90      2069
        True       0.49      0.74      0.59       397

    accuracy                           0.83      2466
   macro avg       0.72      0.80      0.74      2466
weighted avg       0.87      0.83      0.85      2466



Great, now the model is no longer disproportionately predicting one class. Accuracy dropped (0.872 to 0.834), but it is now more meaningful as a performance metric.

- AUC (0.79), weighted F1 (0.85), accuracy (0.83) and cross-validation accuracy (0.80) now all fall in a similar range of roughly 0.79 to 0.85
- The model correctly classifies about 83% of test sessions and now recovers 74% of actual purchases (recall for the True class), up from 27% without balancing
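A note on how to read the AUC numbers: the cells in this notebook pass the output of `predict` (hard class labels) to `roc_auc_score`. For a binary problem that value equals the balanced accuracy, i.e. the mean of the recall of both classes. The sketch below, on synthetic data that is only a stand-in for the shoppers data, shows the difference from the AUC computed on predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic imbalanced data (hypothetical stand-in for the real features)
X_toy, y_toy = make_classification(n_samples=2000, weights=[0.85, 0.15],
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=0, stratify=y_toy)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)            # 0/1 predictions at the 0.5 threshold
scores = clf.predict_proba(X_te)[:, 1]     # probability of the positive class

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, scores)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print('AUC from hard labels : {:.4f}'.format(auc_from_labels))
print('Balanced accuracy    : {:.4f}'.format(bal_acc))
print('AUC from probabilities: {:.4f}'.format(auc_from_scores))
```

The first two printed values coincide. The probability-based AUC measures ranking quality across all thresholds, so it is the one usually meant when an AUC is reported.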

Let's try up-sampling with SMOTE, which is often considered a better alternative.

## Up-sample with SMOTE - Logistic Regression <a name="para3"></a>
- Works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line
- Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space
- This procedure can be used to create as many synthetic examples for the minority class as are required
- Installation: pip install imbalanced-learn
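The interpolation step described above can be sketched in plain NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_like_sample` is hypothetical, it creates a single point per call, and it assumes the minority rows are distinct so that the closest row to a point is the point itself.

```python
import numpy as np

def smote_like_sample(minority, k=5, rng=None):
    """Create one synthetic point from a 2-D array of minority-class rows."""
    rng = np.random.default_rng() if rng is None else rng
    # 1) pick a random minority example
    x = minority[rng.integers(len(minority))]
    # 2) find its k nearest minority neighbours (index 0 is the point itself)
    dists = np.linalg.norm(minority - x, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]
    # 3) pick one neighbour at random and interpolate along the segment
    x_nn = minority[rng.choice(neighbours)]
    gap = rng.random()  # uniform in [0, 1)
    return x + gap * (x_nn - x)

# toy minority class: 20 random points in 2-D (hypothetical data)
rng = np.random.default_rng(0)
minority_points = rng.random((20, 2))
new_point = smote_like_sample(minority_points, k=5, rng=rng)
print(new_point)
```

Every synthetic point lies on a segment between two existing minority points, so it always stays inside the region already covered by the minority class.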

Code help: https://beckernick.github.io/oversampling-modeling/

Back to [Table of Contents](#t)

In [28]:
# import SMOTE 
from imblearn.over_sampling import SMOTE

# Separate input features (X) and target variable (y)
y = data1.Revenue
X = data1.drop('Revenue', axis=1)
 
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []
    
scaler = MinMaxScaler() # initialize the scaler
x_scaled = scaler.fit_transform(X) # scale features to [0, 1] (X has no target feature)
x_scaled = pd.DataFrame(x_scaled) 
x_scaled.columns = X.columns # name the columns 
    
# loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):
    
    # initialize logistic regression 
    clf_0 = LogisticRegression(solver='lbfgs', max_iter=3000) # raise max_iter so the solver converges (default = 100)

    # create training and testing vars
    Xs_train, Xs_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.3, random_state=123, shuffle=True)
        
    # apply SMOTE to the training split only, after splitting the dataset
    sm = SMOTE(random_state=123, sampling_strategy = 'minority')
    x_train_res, y_train_res = sm.fit_resample(Xs_train, y_train) # fit_sample in older imbalanced-learn releases

    # Train model
    clf_0.fit(x_train_res, y_train_res)

    # Predict on the test set
    pred_y_0 = clf_0.predict(Xs_test)

    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle=True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf_0, x_train_res, y_train_res, cv=kfold, scoring=scoring)

    #calculate AUC
    clf_0_roc_auc = roc_auc_score(y_test, pred_y_0)
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y_0, average='weighted')[2])
    auc_lst.append(clf_0_roc_auc)

# display average AUC and F1 score
print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))
    
# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y_0 ),'class' )

# Print accuracy score
print('Accuracy of classifier on test set: {:.3f}'.format(clf_0.score(Xs_test, y_test)))
    
# Display 10-fold cross validation average accuracy
print("10-fold cross validation average accuracy of clf_0: %.3f" % (results.mean()))
    
# calculate confusion matrix
confusion_matrix_y = confusion_matrix(y_test, pred_y_0)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y_0))

F1 0.8506; AUC 0.7887 
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.841
10-fold cross validation average accuracy of clf_0: 0.827
Confusion Matrix for Classfier:
[[2677  413]
 [ 176  433]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.94      0.87      0.90      3090
        True       0.51      0.71      0.60       609

    accuracy                           0.84      3699
   macro avg       0.73      0.79      0.75      3699
weighted avg       0.87      0.84      0.85      3699



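- Looking at the classification report, SMOTE gives slightly better results than manual up-sampling (weighted F1 0.851 vs 0.845, accuracy 0.841 vs 0.834) with a comparable AUC (0.789 vs 0.792). Note that the SMOTE run used a 30% test split instead of 20%. **We will use SMOTE in our modeling**

Under-sampling was considered but gave lower results. With under-sampling you can lose important information.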
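For reference, random under-sampling can be sketched as follows. The notebook does not show the code that was tried, so this is only an assumption about what such a step might look like; the helper name `random_undersample` and the toy frame are hypothetical.

```python
import pandas as pd

def random_undersample(train, label_col, random_state=0):
    """Keep a random subset of each class, sized to the smallest class."""
    min_size = train[label_col].value_counts().min()
    parts = [group.sample(min_size, random_state=random_state)
             for _, group in train.groupby(label_col)]
    return pd.concat(parts)

# toy training split (hypothetical values): 8 negatives, 2 positives
toy_train = pd.DataFrame({'feature': range(10),
                          'Revenue': [False] * 8 + [True] * 2})
balanced = random_undersample(toy_train, 'Revenue')
print(len(toy_train), '->', len(balanced))  # 10 -> 4
```

The toy output shows the cost: six of the eight majority rows are thrown away, which is the information loss mentioned above.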

# SVC <a name="paragraph7"></a>

Back to [Table of Contents](#t)

- The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes on the minority class (we will use SVC)
- During training, we can use the argument **class_weight='balanced'** to penalize mistakes on the minority class by an amount proportional to how under-represented it is
- We also include the argument **probability=True** to enable probability estimates for the SVM
- We can re-use most of the code from the function above; only the algorithm is different
- Let's train a penalized SVM. Note that the cell below still applies SMOTE to the training split, so the classes are already balanced when class_weight='balanced' takes effect
- **Warning:** SVC takes a long time to train, so running the model once is good enough
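To see what `class_weight='balanced'` does numerically, the weights follow the heuristic `n_samples / (n_classes * count_per_class)` documented by scikit-learn. The sketch below recomputes it by hand; the helper name is hypothetical, and it assumes the 1908 and 10422 figures printed earlier are the per-class counts of the full dataset.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights per class: n_samples / (n_classes * count_per_class)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# assumed class counts: 10422 non-purchases, 1908 purchases
y_full = np.array([False] * 10422 + [True] * 1908)
weights_full = balanced_class_weights(y_full)
print(weights_full)  # roughly 0.59 for False and 3.23 for True

# after balancing to equal class counts, both weights are 1
y_equal = np.array([False] * 100 + [True] * 100)
weights_equal = balanced_class_weights(y_equal)
print(weights_equal)
```

On a training set that has already been balanced, both classes get a weight of 1, so the penalty only changes the fit when the model is trained on the imbalanced data.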

In [13]:
from sklearn.svm import SVC

# Separate input features (X) and target variable (y)
y = data1.Revenue
X = data1.drop('Revenue', axis=1)

# scale the features
scaler = MinMaxScaler() # initialize the scaler
x_scaled = scaler.fit_transform(X) # scale features to [0, 1] (X has no target feature)
x_scaled = pd.DataFrame(x_scaled) 
x_scaled.columns = X.columns # name the columns 

# create training and testing vars
Xs_train, Xs_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=123, shuffle = True)
    
# begin up-sampling with SMOTE
sm = SMOTE(random_state=123, sampling_strategy = 'minority')
x_train_res, y_train_res = sm.fit_resample(Xs_train, y_train) # fit_sample in older imbalanced-learn releases

# Train model
clf_3 = SVC(kernel='rbf', 
            class_weight='balanced', # penalize
            probability=True, random_state = 123)

clf_3.fit(x_train_res, y_train_res)

# Predict on the test set
pred_y_3 = clf_3.predict(Xs_test)

#10-fold cross validation
kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle = True)
scoring = 'accuracy'
results = model_selection.cross_val_score(clf_3, x_train_res, y_train_res, cv=kfold, scoring=scoring)

#calculate f1-score and AUC (single run, so no averaging)
clf_3_roc_auc = roc_auc_score(y_test, pred_y_3)
f1_score_lst = precision_recall_fscore_support(y_test, pred_y_3, average='weighted')[2]

print('F1 {:.4f}; AUC {:.4f} '.format(f1_score_lst,clf_3_roc_auc))

# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y_3 ),'class' )

print('Accuracy of classifier on test set: {:.3f}'.format(clf_3.score(Xs_test, y_test)))

print("10-fold cross validation average accuracy of clf_3: %.3f" % (results.mean()))

confusion_matrix_y = confusion_matrix(y_test, pred_y_3)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y_3))

F1 0.8519; AUC 0.7953 
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.841
10-fold cross validation average accuracy of clf_3: 0.839
Confusion Matrix for Classfier:
[[1785  284]
 [ 108  289]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.94      0.86      0.90      2069
        True       0.50      0.73      0.60       397

    accuracy                           0.84      2466
   macro avg       0.72      0.80      0.75      2466
weighted avg       0.87      0.84      0.85      2466



# Naive Bayes <a name="paragraph9"></a>
Back to [Table of Contents](#t)

In [13]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

# Separate input features (X) and target variable (y)
y = data1.Revenue
X = data1.drop('Revenue', axis=1)

# scale the features
scaler = MinMaxScaler() # initialize the scaler
x_scaled = scaler.fit_transform(X) # scale features to [0, 1] (X has no target feature)
x_scaled = pd.DataFrame(x_scaled) 
x_scaled.columns = X.columns # name the columns 

# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []

#loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):

    #Create a Gaussian Classifier
    gnb = GaussianNB()

    # create training and testing vars
    Xs_train, Xs_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=123, shuffle=True)
        
    # apply SMOTE to the training split only, after splitting the dataset
    sm = SMOTE(random_state=123, sampling_strategy = 'minority')
    x_train_res, y_train_res = sm.fit_resample(Xs_train, y_train) # fit_sample in older imbalanced-learn releases
    
    #Train the model using the training sets
    gnb.fit(x_train_res, y_train_res)
    
    #Predict the response for test dataset
    pred_y_4 = gnb.predict(Xs_test)

    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle=True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(gnb, x_train_res, y_train_res, cv=kfold, scoring=scoring)

    #calculate AUC
    clf_4_roc_auc = roc_auc_score(y_test, pred_y_4)
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y_4, average='weighted')[2])
    auc_lst.append(clf_4_roc_auc)

# display average AUC and F1 score
print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))
    
# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y_4 ),'class' )

# Print accuracy score
print('Accuracy of classifier on test set: {:.3f}'.format(gnb.score(Xs_test, y_test)))
    
# Display 10-fold cross validation average accuracy
print("10-fold cross validation average accuracy of clf_0: %.3f" % (results.mean()))
    
# calculate confusion matrix
confusion_matrix_y = confusion_matrix(y_test, pred_y_4)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y_4))

F1 0.6522; AUC 0.7042 
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.601
10-fold cross validation average accuracy of clf_0: 0.728
Confusion Matrix for Classfier:
[[1142  927]
 [  57  340]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.95      0.55      0.70      2069
        True       0.27      0.86      0.41       397

    accuracy                           0.60      2466
   macro avg       0.61      0.70      0.55      2466
weighted avg       0.84      0.60      0.65      2466



# Neural Network (MLP) <a name="paragraph4"></a>
Back to [Table of Contents](#t)

In [12]:
# import library 
from sklearn.neural_network import MLPClassifier

# Separate input features (X) and target variable (y)
y = data1.Revenue
X = data1.drop('Revenue', axis=1)

# scale the features
scaler = MinMaxScaler() # initialize the scaler
x_scaled = scaler.fit_transform(X) # scale features to [0, 1] (X has no target feature)
x_scaled = pd.DataFrame(x_scaled) 
x_scaled.columns = X.columns # name the columns 

# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []

# create training and testing vars
Xs_train, Xs_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=123, shuffle = True)

# begin up-sampling with SMOTE
sm = SMOTE(random_state=123, sampling_strategy = 'minority')
x_train_res, y_train_res = sm.fit_resample(Xs_train, y_train) # fit_sample in older imbalanced-learn releases

# Train model
clf_3 = MLPClassifier(random_state = 123, hidden_layer_sizes = (20,40,80))

clf_3.fit(x_train_res, y_train_res)

# Predict on the test set
pred_y_3 = clf_3.predict(Xs_test)

#10-fold cross validation
kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle = True)
scoring = 'accuracy'
results = model_selection.cross_val_score(clf_3, x_train_res, y_train_res, cv=kfold, scoring=scoring)

#calculate AUC
clf_3_roc_auc = roc_auc_score(y_test, pred_y_3)

#calculate average f1-score and AUC
f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y_3, average='weighted')[2])
auc_lst.append(clf_3_roc_auc)
    

print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y_3 ),'class' )

print('Accuracy of classifier on test set: {:.3f}'.format(clf_3.score(Xs_test, y_test)))

print("10-fold cross validation average accuracy of clf_3: %.3f" % (results.mean()))

confusion_matrix_y = confusion_matrix(y_test, pred_y_3)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y_3))



F1 0.8667; AUC 0.8098 
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.858
10-fold cross validation average accuracy of clf_3: 0.904
Confusion Matrix for Classfier:
[[1824  245]
 [ 104  293]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.95      0.88      0.91      2069
        True       0.54      0.74      0.63       397

    accuracy                           0.86      2466
   macro avg       0.75      0.81      0.77      2466
weighted avg       0.88      0.86      0.87      2466



# Gradient Boosting Classifier <a name="paragraph5"></a>
Back to [Table of Contents](#t)

In [13]:
from xgboost import XGBClassifier

# Separate input features (X) and target variable (y)
y = data1.Revenue
X = data1.drop('Revenue', axis=1)

# scale the features
scaler = MinMaxScaler() # initialize the scaler
x_scaled = scaler.fit_transform(X) # scale features to [0, 1] (X has no target feature)
x_scaled = pd.DataFrame(x_scaled) 
x_scaled.columns = X.columns # name the columns 

# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []

#loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):

    # create training and testing vars
    Xs_train, Xs_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=123, shuffle = True)
    
    # begin up-sampling with SMOTE
    sm = SMOTE(random_state=123, sampling_strategy = 'minority')
    x_train_res, y_train_res = sm.fit_resample(Xs_train, y_train) # fit_sample in older imbalanced-learn releases

    # Train model
    clf_3 = XGBClassifier(random_state = 123)

    clf_3.fit(x_train_res, y_train_res)

    # Predict on the test set
    pred_y_3 = clf_3.predict(Xs_test)

    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle = True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf_3, x_train_res, y_train_res, cv=kfold, scoring=scoring)

    #calculate AUC
    clf_3_roc_auc = roc_auc_score(y_test, pred_y_3)
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y_3, average='weighted')[2])
    auc_lst.append(clf_3_roc_auc)
    

print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y_3 ),'class' )

print('Accuracy of classifier on test set: {:.3f}'.format(clf_3.score(Xs_test, y_test)))

print("10-fold cross validation average accuracy of clf_3: %.3f" % (results.mean()))

confusion_matrix_y = confusion_matrix(y_test, pred_y_3)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y_3))

F1 0.8831; AUC 0.8196 
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.878
10-fold cross validation average accuracy of clf_3: 0.925
Confusion Matrix for Classfier:
[[1875  194]
 [ 106  291]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.95      0.91      0.93      2069
        True       0.60      0.73      0.66       397

    accuracy                           0.88      2466
   macro avg       0.77      0.82      0.79      2466
weighted avg       0.89      0.88      0.88      2466



## Use Tree-Based Algorithms: Random Forest <a name="paragraph6"></a>

Back to [Table of Contents](#t)
 
- Tree models often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes.
- In modern applied machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform singular decision trees, so we'll jump right into those:

In [14]:
from sklearn.ensemble import RandomForestClassifier

# Separate input features (X) and target variable (y)
y = data1.Revenue
X = data1.drop('Revenue', axis=1)

# scale the features
scaler = MinMaxScaler() # initialize the scaler
x_scaled = scaler.fit_transform(X) # scale features to [0, 1] (X has no target feature)
x_scaled = pd.DataFrame(x_scaled) 
x_scaled.columns = X.columns # name the columns 

# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []

#loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):

    # create training and testing vars
    Xs_train, Xs_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=123, shuffle = True)
    
    # apply SMOTE to the training split only, after splitting the dataset
    sm = SMOTE(random_state=123, sampling_strategy = 'minority')
    x_train_res, y_train_res = sm.fit_resample(Xs_train, y_train) # fit_sample in older imbalanced-learn releases

    # Train model
    clf_5 = RandomForestClassifier(random_state = 123)
    clf_5.fit(x_train_res, y_train_res)

    # Predict on the test set
    pred_y_5 = clf_5.predict(Xs_test)

    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle = True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf_5, x_train_res, y_train_res, cv=kfold, scoring=scoring)

    #calculate f1-score and AUC
    clf_5_roc_auc = roc_auc_score(y_test, pred_y_5)
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y_5, average='weighted')[2])
    auc_lst.append(clf_5_roc_auc)
    
    
print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y_5 ),'class' )

print('Accuracy of classifier on test set: {:.3f}'.format(clf_5.score(Xs_test, y_test)))

print("10-fold cross validation average accuracy of clf_4: %.3f" % (results.mean()))

confusion_matrix_y = confusion_matrix(y_test, pred_y_5)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y_5))

F1 0.8816; AUC 0.8138 
Model is predicting  [False  True] class
Accuracy of classifier on test set: 0.877
10-fold cross validation average accuracy of clf_4: 0.939
Confusion Matrix for Classfier:
[[1877  192]
 [ 111  286]]
Classification Report for Classfier:
              precision    recall  f1-score   support

       False       0.94      0.91      0.93      2069
        True       0.60      0.72      0.65       397

    accuracy                           0.88      2466
   macro avg       0.77      0.81      0.79      2466
weighted avg       0.89      0.88      0.88      2466



## Decision tree <a name="paragraph10"></a>

Back to [Table of Contents](#t)

In [15]:
# Imported the needed packages
from sklearn.tree import DecisionTreeClassifier 

# if you want to draw the tree you need these packages 
#from sklearn.tree import export_graphviz
#from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
#from IPython.display import Image  
#import pydotplus

# Separate input features (X) and target variable (y)
y = data1.Revenue
X = data1.drop('Revenue', axis=1)

# scale the features
scaler = MinMaxScaler() # initialize the scaler
x_scaled = scaler.fit_transform(X) # scale features to [0, 1] (X has no target feature)
x_scaled = pd.DataFrame(x_scaled) 
x_scaled.columns = X.columns # name the columns 

# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []

#loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):

    # create training and testing vars
    Xs_train, Xs_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=123, shuffle = True)
    
    # apply SMOTE to the training split only
    sm = SMOTE(random_state=123, sampling_strategy = 'minority')
    x_train_res, y_train_res = sm.fit_resample(Xs_train, y_train) # fit_sample in older imbalanced-learn releases

    # Train model
    clf_6 = DecisionTreeClassifier(random_state = 123)
    clf_6.fit(x_train_res, y_train_res)

    # Predict on the test set
    pred_y_6 = clf_6.predict(Xs_test)

    #10-fold cross validation
    kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle = True)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf_6, x_train_res, y_train_res, cv=kfold, scoring=scoring)

    #calculate f1-score and AUC
    clf_6_roc_auc = roc_auc_score(y_test, pred_y_6)
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y_test, pred_y_6, average='weighted')[2])
    auc_lst.append(clf_6_roc_auc)

print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

# Is our model still predicting just one class?
print('Model is predicting ',np.unique( pred_y_6 ),'class' )

print('Accuracy of classifier on test set: {:.3f}'.format(clf_6.score(Xs_test, y_test)))

print("10-fold cross validation average accuracy of clf_6: %.3f" % (results.mean()))

confusion_matrix_y = confusion_matrix(y_test, pred_y_6)
print('Confusion Matrix for Classfier:')
print(confusion_matrix_y)

print('Classification Report for Classfier:')
print(classification_report(y_test, pred_y_6))

- results are not as good as random forest 