# Can online customers' intention be predicted beforehand?

### In this project I'm going to try to predict if a customer causes revenue to the company based on the features of the dataset and compare two fundamental ML algorithms. My main focus will be to show Scikit-Learn's general functionality and prediction but not interpretability. I will explain what steps I took and why over the notebook. Let's get started!

In [2]:
# Import necessary libraries and packages

import pandas as pd
import numpy as np
import imblearn
import matplotlib.pyplot as plt

from imblearn.pipeline          import make_pipeline
from imblearn.over_sampling     import SMOTE
from sklearn.pipeline           import Pipeline
from sklearn.model_selection    import train_test_split, RandomizedSearchCV
from sklearn.linear_model       import LogisticRegression
from sklearn.ensemble           import RandomForestClassifier 
from sklearn.impute             import SimpleImputer
from sklearn.preprocessing      import StandardScaler, OneHotEncoder
from sklearn.metrics            import balanced_accuracy_score, roc_auc_score, f1_score, precision_score, recall_score, confusion_matrix
from sklearn.compose            import ColumnTransformer
from sklearn.inspection         import permutation_importance
from sklearn                    import set_config
set_config(display='diagram')

# DATASET

In [3]:
# Read in the data

data = pd.read_csv('online_shoppers_intention.csv')

In [4]:
data.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


## Data information

In [5]:
# Size

data.shape

(12330, 18)

- There are no missing(np.nan) values in any of the columns based on pandas inspection
- Further inspection is done on a seperate notebook to see if missing values encoded differently than np.nan
- After doing EDA, I still have the same conclusion. **There aren't any missing values**

## Check target balance

In [7]:
revenue_positive = data.Revenue.sum() # Total class 1
revenue_negative = (~data.Revenue.values).sum() # Total class 0

revenue_positive, revenue_negative

(1908, 10422)

- Target is **imbalanced** and needed to be treated accordingly
- Approximately 16% of the data is *class 1*, rest is *class 0*

## Train/Test split

In [8]:
# Define features and label

X, y = data.drop('Revenue', axis=1), data.Revenue
y = y.values.ravel() # numpy trick to be able to split to train/test

- Use **stratified splitting** based on target since the dataset is imbalanced
- Important to do **before** applying any transformations

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1) # 0.9/0.1 stratified splitting

# Preprocessing and Feature Engineering

In [10]:
# Define categorical and continuous features to preprocess accordingly

cat_cols = ['SpecialDay', 'Month', 'OperatingSystems', 'Browser',
            'Region', 'TrafficType', 'VisitorType', 'Weekend']

con_cols = ['Administrative', 'Administrative_Duration', 'Informational',
            'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
            'BounceRates', 'ExitRates', 'PageValues']

## Categorical Variables

Even though there aren't any missing values at the moment, it doesn't mean we aren't going to have missing values in the future. Hence, the need for imputer.

- **Imputation**: *Most frequent imputation* for simplicity and fast modeling. Other options *kNNimputer*, *IterativeImputer*, *Learned Complex Model*...
- **Encoding**: *One Hot Encode* as there isn't high cardinality. Okay if not done or *Ordinal Encoding* for tree based algorithms but *OHE* has to be done for linear models.

In [11]:
# Pipeline for categorical variables

cat_pipe = Pipeline([('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent', add_indicator=True)), # Median Impute
                     ('ohe', OneHotEncoder(handle_unknown='ignore'))]) # One Hot Encode

## Continuous Variables

Imputer is needed for same reasoning as categorical variables

- **Imputation**: *Median imputation* for simplicity and fast modeling. Same options exist as categorical.
- **Encoding**: *Variance scaling* as it makes modeling faster. Okay if not done for tree based models but improves linear models because of faster gradient descent.

In [12]:
# Pipeline for continuous variables

con_pipe = Pipeline([('imputer', SimpleImputer(missing_values=np.nan, strategy='median', add_indicator=True)), # Mean Impute
                     ('scaler', StandardScaler())]) # Variance scale

## Column transformer to put everything together

In [13]:
preprocessing = ColumnTransformer([('categorical', cat_pipe, cat_cols), # Preprocess categorical variables
                                   ('continuous', con_pipe, con_cols)]) # Preprocess continuous variables

In [14]:
preprocessing # Show the steps for preprocessing

# Algorithms & Search 

We want to use *imblearn pipeline* for hyperparameter search with cross validation as we have an imbalanced dataset and *sklearn pipeline* doesn't play well with over sampling methods

- **Hyperparameter Search**: *RandomizedSearchCV* for faster search compared to grid search and it's still as accurate
- **Fold Count**: *10 fold* as we have big enough data to be able to do so and it would give a better idea about model's overall performance
- **Over Sampling**: *SMOTE* technique will be used for over sampling as it creates synthetic (not duplicate) samples of the minority class for model to notice
- **CV Scoring Metric**: *balanced accuracy* to chose the best model from the *RandomizedSearchCV* as we have an imbalanced dataset and it's a suitable metric for the business case. Other metrics will be evaluated for the Test set *f1 score*, *ROC*...

## Logistic Regression

In [15]:
# create a pipeline for linear regression

lr_pipe = make_pipeline(                                    
                        preprocessing,                      # preprocessing pipeline we created before
                        imblearn.over_sampling.SMOTE(),     # upsample using SMOTE
                        LogisticRegression()                # algorithm to use
                       )

# hyperparameter search space

hyperparameters = dict(smote__k_neighbors = [5, 10, 15],                           # k-neighbors to look for in SMOTE
                       logisticregression__penalty = ['l1', 'l2'],                 # penalty term
                       logisticregression__class_weight=['balanced', None],        # class weights
                       logisticregression__solver=['lbfgs', 'liblinear', 'saga'],  # solver types for logistic regression
                       logisticregression__C=np.logspace(0, 4, 10),                # Inverse regularization strength
                       logisticregression__max_iter=np.linspace(50, 250, 5))     # Maximum number of iterations taken for the solvers to converge

# Randomized Search Cross Validation

lr_rand_cv = RandomizedSearchCV(estimator = lr_pipe,                  # use the pipe as an estimator
                                param_distributions=hyperparameters,  # hyperparameters to search
                                scoring='balanced_accuracy',          # chose best model based on this metric
                                n_iter = 100,                         # do CV for 100 different models with different combination of hyperparameters
                                cv = 10,                              # 10 fold CV
                                n_jobs=-1,                            # use all cores
                                verbose=True)

In [16]:
lr_rand_cv.fit(X_train, y_train);

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


 0.81251678 0.81229846 0.81253586 0.81324478 0.81002944 0.81303807
 0.81036509 0.81007269 0.81332721 0.81049587 0.81434496 0.8106818
        nan 0.81378944 0.81107911 0.81172188 0.81108747 0.81299483
 0.81112901 0.81278317 0.81094833 0.81030187 0.8117995  0.8105284
 0.81020022 0.81283321 0.81418675 0.81011948 0.80946556 0.81073341
 0.81093407        nan 0.81186876 0.80999705 0.8111081  0.81256343
        nan 0.81275233 0.80922477 0.81246334 0.81145224        nan
 0.80814286        nan        nan 0.812386   0.81166036 0.81317736
 0.81248921 0.8135987  0.8136503  0.8098514  0.81023431 0.81099667
 0.80943813 0.81267996 0.81327872        nan 0.81010536 0.80986443
 0.81014861 0.81118558 0.81246164 0.8142334  0.81278657        nan
 0.81190781 0.81264247 0.81288992        nan 0.81200281 0.809409
        nan 0.81325299        nan 0.81094833 0.81175366 0.81195446
 0.81437594 0.81033936 0.81056853 0.81452593 0.8125343  0.81333203
 0.81148463        nan 0.81295068 0.8123792  0.81150385 0.8112564


In [18]:
# Best model's CV balanced accuracy score

lr_rand_cv.best_score_

0.8145259343255745

## Random Forest

Some advantages over Linear Regression:

- **Less hyperparameters** to search for compared to *Logistic Regression*
- Can take care of modeling without needing to *One Hot Encode* the categorical variables and *scaling* continuous variables
- Tends to perform better with **tabular** and **imbalanced data**

Disadvantage:

- Can **not extrapolate** like *Logistic Regression*
- Slower CV search

In [19]:
# create a pipeline for random forest

rf_pipe = make_pipeline(                                    
                        preprocessing,                      # preprocessing pipeline we created before
                        imblearn.over_sampling.SMOTE(),     # upsample using SMOTE
                        RandomForestClassifier()            # algorithm to use
                       )

# hyperparameter search space

hyperparameters = dict(smote__k_neighbors = [5, 10, 15],                               # k-neighbors to look for in SMOTE
                       randomforestclassifier__criterion=['gini', 'entropy'],          # loss functions for rf classifier
                       randomforestclassifier__max_depth=[5,10,20,30,40],              # max depth of a single tree in the forest
                       randomforestclassifier__min_samples_leaf=[1,2,3,4,5],           # min sample 
                       randomforestclassifier__max_features=['auto', 'sqrt', 'log'])   # max number of features to l                  
                       

# Randomized Search Cross Validation

rf_rand_cv = RandomizedSearchCV(estimator = rf_pipe,                  # use the pipe as an estimator
                                param_distributions=hyperparameters,  # hyperparameters to search
                                scoring='balanced_accuracy',          # chose best model based on this metric
                                n_iter = 100,                         # do CV for 100 different models with different combination of hyperparameters
                                cv = 10,                              # 10 fold CV
                                n_jobs=-1,                            # use all cores
                                verbose=True)

In [20]:
rf_rand_cv.fit(X_train, y_train);

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


 0.80585035 0.82342454 0.82475377 0.82601481 0.82297209 0.8211103
        nan 0.84055945 0.81118838        nan        nan 0.83722759
 0.8426426  0.8135964  0.82443068 0.82506864 0.82150265 0.82503951
 0.83941899 0.83926382        nan 0.83674905 0.82208248 0.82495551
 0.83364977 0.83204751        nan 0.81726027        nan 0.82622505
 0.82572935 0.83842737        nan 0.81898264        nan        nan
        nan 0.80828032 0.83825612 0.83254165        nan        nan
 0.83394962 0.83790254 0.81892697 0.83846501        nan        nan
        nan        nan 0.81750327 0.83893129 0.81640541 0.83842971
 0.82326619        nan 0.82712146 0.83998613 0.83859912 0.82841843
        nan 0.84061587        nan        nan 0.82825639 0.81671192
        nan 0.83617587 0.84121315 0.81430022 0.8390883  0.82367761
        nan 0.84038818 0.81788831        nan 0.804536          nan
        nan 0.8258208         nan        nan        nan        nan
 0.82295118 0.82657415 0.83809607 0.82879129 0.83868774 0.81657

In [22]:
# Best random forest model's CV balanced accuracy score

rf_rand_cv.best_score_

0.8426426038485083

# Results

## Champion Final Model

- According to the CV balanced accuracy score results, better algorithm is **Random Forest**
- Let's get some more insight about our best model by looking at the whole pipeline, hyperparameters, feature importances and metrics on testing set

In [23]:
# Pick the best model from Randomized Search CV

best_model = rf_rand_cv.best_estimator_

### Pipeline steps and non-default Hyperparameters of the Best Model

In [24]:
best_model

In [57]:
# Best hyperparameters

rf_rand_cv.best_params_

{'smote__k_neighbors': 5,
 'randomforestclassifier__min_samples_leaf': 5,
 'randomforestclassifier__max_features': 'auto',
 'randomforestclassifier__max_depth': 5,
 'randomforestclassifier__criterion': 'entropy'}

## Test set feature importance

In [40]:
# Get permutation importance of features

result = permutation_importance(best_model, X_test, y_test, n_repeats=10)

In [1]:
# Display permutation importance
# source: https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-py

sorted_idx = result.importances_mean.argsort()

fig, ax = plt.subplots(figsize=(14,7))
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=X_train.columns[sorted_idx])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()

NameError: name 'result' is not defined

As we can see above, in the test set, *PageValues* is a **highly important predictor** for our trained model. Meaning that it helps the most, out of the features we have, to differentiate between revenue and not revenue for the **unseen data**. There is a dramatic decrease in importance after that.
- **PageValues**: Represent the average value for a web page that a user visited before completing an e-commerce transaction. Which makes sense for it to be important for revenue prediction because, speaking from experience, if we are really interested in buying a product we tend to go back and forth between websites to decide on where and what to buy

Starting from second ranked feature(*Browser*), the decrese in the importance is gradual

Starting from **ProductRelated_Duration** variable and other less important variables are actually causing our model to perform **worse** in general for the test set, based on negative permutation importance score

- Now that we tested our model on the test set and realized how features affect our model, we can decide on what features to include or not for the next updated model if we have more data. Ex. try **dropping** *TrafficType* feature from the model...

## Best model's test set performance on several evaluation metrics

In [27]:
# Get predictions from the best model

predictions = best_model.predict(X_test)

In [56]:
# Define which metrics to look at
metrics_to_test = [balanced_accuracy_score, recall_score, precision_score, f1_score]

### Why such metrics?

- For imbalanced datasets, **accuracy** score might not be the best indicator for model performance
- Because of that reason, we check **balanced accuracy** score but that itself is not enough to look at
- We also would like to see the **recall** and **precision** performance of our model on such imbalanced dataset
- In addition to such metrics, we might want to look at **f1 score** to see how well our model performs when we combine precision and recall performances
- On top of all, **confusion matrix** acts like an icing on the cake to better see the overall picture in terms of performance

In [55]:
# print metric names and corresponding scores
for metric in metrics_to_test:
    print(f"{metric.__name__} is {metric(y_test, predictions):.3f}", end='\n\n')

balanced_accuracy_score is 0.860

recall_score is 0.832

precision_score is 0.574

f1_score is 0.679



In [52]:
# Look at the confusion matrix

confusion_matrix(y_test, predictions)

array([[924, 118],
       [ 32, 159]])

## Conclusion

- Based on **balanced accuracy** and **recall**, our model **performs relatively well**. Accuracy score suggests that we can correctly classify 86% of the user sessions if they will cause a revenue or not. We have a recall score of 83.2% in that sense

- But if we look at our 57.4% **precision score**, we can see that our model is **over-predicting** a user causing a revenue and performing poorly in that sense. This might be because of our model not learning enough about the class *causing a revenue* as a result of having an **imbalanced dataset**

- 67.9% f1 score suggests that the model performs okay if we care equally about recall and precision based on business decisions.

- **Overall**: Our model's performance depend on **what we expect** from it and the **business goals**. It doesn't perform great in every aspect, but it performs well in terms of accuracy and recall. 

## Why does this matter?

- Customers who aren’t making a purchase can be analyzed in detail and further action can be taken accordingly to increase revenue

## Summary

- 1) Select dataset and research question
- 2) Examine data to understand features, identify missing values, and EDA to detect any anomalies in the data
- 3) Split data before doing any imputation and modeling to prevent leaking
- 4) Select which features are going to be in the model, and create new variables from existing ones if appropriate
- 5) Preprocess categorical and continuos variables accordingly
- 6) Decide on which algorithms and hyperparameters to search for and do CV search to chose the best model based on a certain metric
- 7) Select the best model and gather insights by looking at feature importance and different evaluation metrics

## Next Steps

- Features that caused our training model to perform worse based on the permutation importance can be taken out from the model and retraining can be done to see if there will be an improvement
- Collect more data about the minority class, if it makes sense in terms of cost/return
- Monitor model's performance and make adjustments if necessary