# Prediction of online shoppers’ purchasing intention
*by Georgy Lazarev* (**mlcourse slackname: jorgy**)

As the title suggests, the task is to predict whether a user intends to make a purchase in an online shop. The data for this project can be found [here](https://archive.ics.uci.edu/ml/datasets/Online+Shoppers%E2%80%99+Purchasing+Intention+Dataset). 

## Dataset and features description

We have a binary classification problem that measures a user's intention to finalize a transaction. This dataset was originally used in [research](https://link.springer.com/article/10.1007/s00521-018-3523-0) that attempted to build a system consisting of two modules. The first module determines the visitor's likelihood of leaving the site. If that probability is higher than a set threshold, the second module predicts whether or not this person has commercial intention. As the authors of the paper state, the data is real and was collected and provided by a retailer. A company might be interested in a system that can, in real time, make a special offer to a client with positive commercial intention.
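
The two-module logic described above can be sketched roughly as follows. This is only an illustration: the function name `should_make_offer` and both threshold values are assumptions and are not taken from the paper.

```python
def should_make_offer(leave_prob, intent_prob,
                      leave_threshold=0.5, intent_threshold=0.5):
    """Decide whether to show a special offer to a visitor.

    Both thresholds are hypothetical placeholders.
    """
    # Module 1: act only if the visitor is likely to leave the site.
    if leave_prob <= leave_threshold:
        return False
    # Module 2: among likely leavers, target those with purchasing intention.
    return intent_prob > intent_threshold


print(should_make_offer(leave_prob=0.8, intent_prob=0.7))
print(should_make_offer(leave_prob=0.2, intent_prob=0.9))
```

This notebook deals only with the second part: estimating purchasing intention.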

The dataset was formed so that each session belongs to a different user over a 1-year period, to avoid any tendency in the data. 
The target variable is called 'Revenue' and takes two values, 0 and 1, indicating whether or not the session ended with a purchase.
There are 10 numeric and 7 categorical features:

 ***Numeric:***

*first six features were derived from the URL information of the pages visited by the user. They were updated each time visitor moved from one page to another till the end of the session.* 

 - **Administrative**  - Number of pages about account management visited by person
    
    
 - **Administrative duration** - Total amount of time (in seconds) spent by the visitor on administrative pages  
 
 
 - **Informational**  - Number of pages in session about Web site, communication and address information of the shopping site
 
 
 - **Informational duration** - time (in seconds) spent on informational pages 
 
 
 - **Product related**  - Number of pages concerning product visited
 
 
 - **Product related duration** - time spent on product related pages
 
 
*next three features were measured by "Google Analytics" for each page in the online-shop website:*


 - **Bounce rate**  - Average bounce rate value of the pages visited by the visitor. The bounce rate of a page is the percentage of visitors who enter the site from that page and then leave
 
 
 - **Exit rate**  - Average exit rate value of the pages visited by the visitor. Value of exit rate for page is percentage of all views of this page that were last in the session
 
 
 - **Page value**  - Average page value of the pages visited. Indicates how valuable a specific page is to shop holder in monetary terms
 
 
 
 
 - **Special day**  - Closeness of the site visiting time to a special day. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine's Day this value is nonzero between February 2 and February 12, zero before and after these dates unless it is close to another special day, and reaches its maximum value of 1 on February 8.
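
To make the bounce rate and exit rate definitions above concrete, here is a small sketch that computes both per page from a handful of made-up sessions. The session data and the function name `bounce_and_exit_rates` are assumptions for illustration only; Google Analytics computes these metrics itself.

```python
from collections import Counter

# Each hypothetical session is the ordered list of pages a visitor viewed.
sessions = [
    ["home"],
    ["home", "product"],
    ["product", "cart", "checkout"],
    ["home", "product"],
    ["product"],
]


def bounce_and_exit_rates(sessions):
    entrances, bounces = Counter(), Counter()
    views, exits = Counter(), Counter()
    for pages in sessions:
        entrances[pages[0]] += 1      # session entered the site on this page
        if len(pages) == 1:
            bounces[pages[0]] += 1    # entered and left without another view
        views.update(pages)           # every page view counts
        exits[pages[-1]] += 1         # last view of the session
    bounce = {p: bounces[p] / entrances[p] for p in entrances}
    exit_ = {p: exits[p] / views[p] for p in views}
    return bounce, exit_


bounce, exit_ = bounce_and_exit_rates(sessions)
print(bounce)
print(exit_)
```

The dataset's *BounceRates* and *ExitRates* columns are then averages of such per-page values over the pages a visitor saw.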
 

***Categorical:***

 - **OperatingSystems**  - Operating system of the visitor
 
 
 - **Browser**  - Browser of the visitor 
 
 
 - **Region** - Geographic region from which the session has been started by the visitor
 
 
 - **TrafficType** - Traffic source by which the visitor has arrived at the Web site (e.g., banner, SMS, direct)
 
 
 - **VisitorType** - Whether the visitor is new or returning (or not specified)
 
 
 - **Weekend**  - Boolean value indicating whether the date of the visit is weekend 
 
 
 - **Month**  - Month value of the visit date 
 
 
 The dataset was formed in such a way that each session corresponds to a unique person. That was done to prevent any possible trends.

## Exploratory data analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [None]:
#load data
df=pd.read_csv('online_shoppers_intention (1).csv')

In [None]:
df.shape

Let's look at dataset:

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.info()

There is no missing data in dataset. 

Now let's look at distribution of target value:

In [None]:
sns.countplot(x=df.Revenue)

In [None]:
df.Revenue.value_counts(normalize=True)

It seems that we are dealing with somewhat imbalanced classes. More visitors leave the shop's website without purchasing anything, which is not surprising.

The target value will be converted to binary type.

In [None]:
#list of numeric features
num_feats=['Administrative', 'Administrative_Duration', 'Informational',
       'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
       'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay']

In [None]:
df[num_feats].describe()

We certainly will scale numerical features. As we see they are of different scales

Now let's look at categorical features:

In [None]:
cat_feats=['Month','OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType','Weekend']

In [None]:
df[cat_feats].head()

As we see, some features are already label-encoded. Some are still in string format. *Weekend* will be converted to binary.

In [None]:
df[cat_feats].astype('category').describe()

There are two interesting observations: the number of months present and the number of visitor types.

In [None]:
df.Month.unique()

January and April are missing.

In [None]:
df.VisitorType.unique()

'Other'? Let's see how many such values there are in our dataset:

In [None]:
df.VisitorType.value_counts()

That makes no sense though. We'll get back to that later.

In [None]:
df.groupby('VisitorType')['Revenue'].mean()

A bit surprising. I expected the percentage of potentially profitable clients to be higher among returning visitors than among new ones. 

In [None]:
sum(df.loc[df.Revenue==1].Administrative==0)

In [None]:
sum(df.loc[df.Revenue==1].Informational==0)

In [None]:
sum(df.loc[df.Revenue==1].ProductRelated==0)

That makes sense. Only six people made a purchase without visiting any product-related pages.

In [None]:
(df.Administrative==0).sum()

In [None]:
(df.Administrative_Duration==0).sum()

So, there were cases when number of pages was greater than 0 but time spent was 0.

In [None]:
df.loc[df.Administrative>0].loc[df.Administrative_Duration==0].Administrative.value_counts()

So theoretically it is possible.

The *Special day* feature shows closeness to... special days, right. We might expect this feature to positively affect the target value.

In [None]:
df.loc[df.SpecialDay>0].Revenue.value_counts(normalize=True)

How come? That's again not what I expected. 

In [None]:
df[['Revenue','SpecialDay']].corr()

That's actually strange.

## Primary visual data analysis

Here goes pairwise Pearson-correlation of numerical features:

In [None]:
corrl=num_feats.copy()
corrl.append('Revenue')

In [None]:
sns.heatmap(df[corrl].corr())

Yes, some features indeed are highly correlated!

In [None]:
df[['ProductRelated', 'ProductRelated_Duration','BounceRates', 'ExitRates']].corr()

In [None]:
fig, axes = plt.subplots(ncols=4, nrows = 2, figsize=(24, 18))
for i in range(len(cat_feats)):
    sns.countplot(x=df[cat_feats[i]],ax=axes[i//4, i%4])

Well, I'd say it's difficult to draw any concrete conclusions from this plot. There are dominant values in each feature.
Now let's explore some features a bit more with respect to the target value:

In [None]:
sns.countplot(x=df.Weekend,hue=df.Revenue)

plt.figure(figsize=(15,15))
plt.subplot(321)
df.groupby('Month').Revenue.mean().plot.bar()
plt.subplot(322)
df.groupby('Browser').Revenue.mean().plot.bar()
plt.subplot(323)
df.groupby('TrafficType').Revenue.mean().plot.bar()
plt.subplot(324)
df.groupby('OperatingSystems').Revenue.mean().plot.bar()

The percentage of visitors who made purchases in November seems a bit higher than in other months. In February there was a small number of visitors, and very few of them ended up buying something. Maybe bad advertising or pricing policy was the reason.

As for the other features, the distribution of session results seems consistent. It's difficult to interpret those results because the feature values are already label-encoded, so we don't really know which real meanings stand behind them.

In [None]:
tmp=['Revenue','Administrative_Duration','Informational_Duration','ProductRelated_Duration','BounceRates','ExitRates','PageValues']

In [None]:
r=['Revenue','Administrative','Administrative_Duration','Informational','Informational_Duration','ProductRelated','ProductRelated_Duration','BounceRates','ExitRates','PageValues']

In [None]:
sns.pairplot(df[r],hue='Revenue',diag_kind='hist')

In general, the trends here make sense. Lower bounce and exit rates correspond to more frequent transactions. On the other hand, higher PageValues do not always lead to a purchase. Also, in most distributions and pair plots related to website pages, we see cases where a visitor spent a lot of time on the website but still left without purchasing. That happens in real life too. Thus, I assume there are no outliers that need to be removed.

In [None]:
plt.figure(figsize=(10,20))   
for i,v in enumerate(range(len(num_feats))):
    v = v+1
    ax1 = plt.subplot(len(num_feats),1,v)
    ax1=sns.distplot(df[num_feats[i]])

Right-skewed. All of them.

## Insights and found dependencies

1. There are no sessions recorded for January and April. By the numbers, it seems that in November and October a bigger percentage of sessions ended with purchases. 
2. At the same time, the SpecialDay feature shows a negative effect on the target value. We can explain it by assuming that most visitors prefer to shop in advance. 
3. There are two pairs of highly correlated features. It's worth checking later whether deleting one feature from each pair improves our models.
4. Almost 25% of new visitors made transactions, in contrast to ~14% of returning ones. 
5. We also have 85 instances with VisitorType 'Other'. As there are no sensible options except New and Returning, this likely means that the information wasn't correctly derived. As this is only 0.6894% of the whole data, let's take a deep breath and drop these instances.
6. I got the impression that all numeric features are right-skewed. It may be useful later to apply a log transformation.
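
As a quick illustration of the last point, the sketch below measures skewness before and after `log1p` on a small made-up sample with one very long session. The numbers are hypothetical and are not taken from the dataset; `log1p` is used because the features contain zeros.

```python
import math


def skewness(values):
    """Population skewness: third central moment over std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical durations in seconds, with one extreme value on the right.
durations = [0, 0, 1, 2, 3, 5, 8, 13, 400]
logged = [math.log1p(v) for v in durations]

skew_raw = skewness(durations)
skew_log = skewness(logged)
print(round(skew_raw, 2), round(skew_log, 2))
```

The transformed sample is still right-skewed, but noticeably less so, which is the effect we hope for with linear models.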

## Metric choice

As we are dealing with imbalanced classes, accuracy is not the best option. Due to the specifics of the task, the company doesn't want to miss potential buyers: the cost of showing a special offer to a client is lower than the loss from visitors who intended to make a purchase but left.
Moreover, it is a good idea not to depend on a threshold when deciding on the class. Class probabilities can be treated as intention scores, so special offers can be adjusted to the degree of the visitor's intention. So ROC AUC seems like a good fit for our task.
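
To see why ROC AUC does not depend on a threshold, recall that it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (ties count as one half). Below is a minimal sketch of that definition on made-up labels and scores; `roc_auc` is a helper written for this illustration, not the scikit-learn function used later in the notebook.

```python
def roc_auc(y_true, scores):
    """Pairwise definition of ROC AUC; O(n_pos * n_neg), fine for toy data."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical intention scores for six sessions (two ended with a purchase).
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

auc = roc_auc(y_true, scores)
# Any order-preserving rescaling of the scores leaves the AUC unchanged.
auc_squared = roc_auc(y_true, [s ** 2 for s in scores])
print(auc, auc_squared)
```

Only the ranking of sessions matters, which is exactly what we want if offers are scaled to the degree of intention.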

## Model Choice

Following models were selected:
- Logistic Regression - classic and interpretable. We'll do OHE for categorical features and scale numeric.
- Random Forest - tree based model in contrast to LR, worth trying (we have categorical features as well as numerical). No need for OHE and scaling. 
- XGBoost Classifier - because why not? 

## Data preprocessing 

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

First, we'll convert the boolean features to integer type:

In [None]:
df[['Weekend','Revenue']]=df[['Weekend','Revenue']].apply(lambda x:x.astype(int))

Then instances with *VisitorType* equal to 'Other' will be dropped:

In [None]:
df=df.drop(df.loc[df.VisitorType=='Other'].index)

In [None]:
df.shape

In [None]:
vt=df.VisitorType.map({'New_Visitor':0,'Returning_Visitor':1})

Preprocessing will differ because we work with models based on different approaches.

For logistic regression it's a good idea to scale our numeric features and apply One-Hot Encoding to categorical ones. To avoid data leakage, scaling will be done after splitting the data. As for OHE and LabelEncoding (for tree-based models), I suppose we can do it before splitting, as we know the range of all possible values of the categorical features, so there is no data leakage to prevent.
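
The leakage point can be shown with a tiny sketch: the mean and standard deviation are estimated on the training part only and then reused for the validation part. The helper names and numbers are made up for illustration; the population standard deviation (`pstdev`) is used here because that is what `StandardScaler` computes.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Estimate scaling statistics from training data only."""
    return mean(train_values), pstdev(train_values)


def transform(values, mu, sigma):
    """Apply previously fitted statistics to any split."""
    return [(v - mu) / sigma for v in values]


train = [0.0, 10.0, 20.0, 30.0]   # hypothetical feature values
valid = [15.0, 45.0]

mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
valid_scaled = transform(valid, mu, sigma)  # no statistics taken from valid
print(train_scaled)
print(valid_scaled)
```

The validation values are not centered or unit-variance on their own, and that is expected: they are expressed in units learned from the training data.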

In [None]:
dummies=pd.concat([pd.get_dummies(df.Month,drop_first=True),
                   pd.get_dummies(df.Browser,drop_first=True,prefix='Browser'),
                   pd.get_dummies(df.Region,drop_first=True,prefix='Region'),
                   pd.get_dummies(df.OperatingSystems,drop_first=True,prefix='OS'),
                   pd.get_dummies(df.TrafficType,drop_first=True,prefix='TT')],axis=1)

In [None]:
dummies.shape

In [None]:
target=df.Revenue

*feats_logreg* will contain all features for Logistic Regression. *feats_tb* is for tree-based models 

In [None]:
feats_logreg=pd.concat([df[num_feats],dummies,df['Weekend'],vt],axis=1)

In [None]:
feats_logreg.shape

Now we'll split our data.  *stratify* used due to imbalance in classes.

In [None]:
X_train_logreg_,X_test_logreg_,y_train_logreg,y_test_logreg=train_test_split(feats_logreg,
                                        target,test_size=0.3,random_state=17,stratify=target)

Let's check distribution of classes in train and test sets:

In [None]:
plt.subplot(121)
y_train_logreg.value_counts(normalize=True).plot.bar()
plt.subplot(122)
y_test_logreg.value_counts(normalize=True).plot.bar()

Yep, that seems right.
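
For intuition, here is roughly what stratification does under the hood: each class is shuffled and split separately, so both parts keep the same class ratio. This is a simplified sketch with made-up labels, not scikit-learn's actual implementation; `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # same share from each class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 80 + [1] * 20   # hypothetical imbalanced target
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=17)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)
```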

Now test set will be split into two same-sized sets: one for validation and other for final test. We won't test our models on second one until the end.

In [None]:
X_valid_logreg_,X_test_logreg_,y_valid_logreg,y_test_logreg=train_test_split(X_test_logreg_,
                                                            y_test_logreg,test_size=0.5,random_state=17)

In [None]:
scaler=StandardScaler()

In [None]:
X_train_logreg=X_train_logreg_.copy(deep=True)
X_valid_logreg=X_valid_logreg_.copy(deep=True)
X_test_logreg=X_test_logreg_.copy(deep=True)


X_train_logreg[num_feats]=scaler.fit_transform(X_train_logreg[num_feats])
X_valid_logreg[num_feats]=scaler.transform(X_valid_logreg[num_feats])
X_test_logreg[num_feats]=scaler.transform(X_test_logreg[num_feats])

In [None]:
X_train_logreg.shape

For tree based models our preprocessing will include only LabelEncoding of *month*. Other Categorical features except boolean one are already label-encoded. Splitting into ***3*** sets is the same.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le=LabelEncoder()

In [None]:
feats_tb=pd.concat([df[num_feats],df[['Weekend','TrafficType','OperatingSystems','Browser','Region']],vt],axis=1)

In [None]:
feats_tb['month_enc']=le.fit_transform(df.Month)

In [None]:
feats_tb.shape

Let's split our data. *stratify* used due to imbalance in classes.

In [None]:
X_train_tb,X_test_tb,y_train_tb,y_test_tb=train_test_split(feats_tb,
                                        target,test_size=0.3,random_state=17,stratify=target)

In [None]:
plt.subplot(121)
y_train_tb.value_counts(normalize=True).plot.bar()
plt.subplot(122)
y_test_tb.value_counts(normalize=True).plot.bar()

In [None]:
X_valid_tb,X_test_tb,y_valid_tb,y_test_tb=train_test_split(X_test_tb,
                                        y_test_tb,test_size=0.5,random_state=17)

## Cross-validation and adjustment of model hyperparameters

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

We'll use stratified cross-validation, again due to the imbalanced classes. 

In [None]:
#random_state only takes effect together with shuffle=True
skf=StratifiedKFold(n_splits=5,shuffle=True,random_state=17)

### Logistic Regression

Let's first train a basic LogReg, without tuning hyperparameters or creating new features, to establish a sort of baseline:

In [None]:
pr_lr=LogisticRegression(class_weight='balanced')

In [None]:
pr_lr.fit(X_train_logreg_,y_train_logreg)

In [None]:
print('Mean ROC-AUC on cross-validation:', np.mean(cross_val_score(pr_lr,X_train_logreg_,y_train_logreg,scoring='roc_auc',cv=skf)))

In [None]:
print('ROC AUC on valid set :', roc_auc_score(y_valid_logreg,pr_lr.predict_proba(X_valid_logreg_)[:,1]))

#### Tuning hyperparameters

In [None]:
param_grid={ 'C':np.logspace(-2,1,7), 'class_weight':[None, 'balanced']}

In [None]:
gs = GridSearchCV(pr_lr, param_grid, scoring='roc_auc', n_jobs=-1, cv=skf)

In [None]:
gs.fit(X_train_logreg_,y_train_logreg)

In [None]:
display(gs.best_params_)
display(gs.best_score_)

Now I'll select a narrower range for *C*:

In [None]:
gs = GridSearchCV(pr_lr, {'C':np.linspace(0.05,0.2,10),'class_weight':['balanced']}, scoring='roc_auc', n_jobs=-1, cv=skf)

In [None]:
gs.fit(X_train_logreg_,y_train_logreg)

In [None]:
print('Best found parameters for LogReg:',gs.best_params_)
print('Best score found for LogReg with GridSearch:',gs.best_score_)

In [None]:
print('ROC AUC on valid set :', roc_auc_score(y_valid_logreg,gs.predict_proba(X_valid_logreg_)[:,1]))

The score increased by ~0.002.

***Oversampling minority class***

This is a known technique for handling imbalanced classes, implemented in the ***imbalanced-learn*** [package](https://imbalanced-learn.readthedocs.io/en/stable/). We'll just create new synthetic data instances corresponding to the '1' class.
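
The core idea of SMOTE is interpolation: a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The sketch below shows only this interpolation step on two made-up points; the real algorithm also performs the neighbour search, and `synthetic_sample` is a hypothetical helper, not part of imbalanced-learn.

```python
import random


def synthetic_sample(point, neighbor, rng):
    """Create one synthetic point between a sample and its neighbour."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]


rng = random.Random(17)
minority_a = [1.0, 2.0]   # hypothetical minority-class samples
minority_b = [3.0, 6.0]

new_point = synthetic_sample(minority_a, minority_b, rng)
print(new_point)
```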

In [None]:
from imblearn.over_sampling import SMOTE

I'm going to check whether oversampling improves LR performance:

In [None]:
%%time
best_params=[]
best_scores=[]
rocs=[]
for d in np.linspace(0.4,1,5):
    sm=SMOTE(sampling_strategy=d,random_state=17)
    X_train_logreg_res, y_train_logreg_res = sm.fit_resample(X_train_logreg, y_train_logreg)
    lr=LogisticRegression()
    lr.fit(X_train_logreg_res,y_train_logreg_res)
    gs=GridSearchCV(lr, {'C':np.linspace(0.05,1,11)}, scoring='roc_auc', n_jobs=-1, cv=skf)
    gs.fit(X_train_logreg_res,y_train_logreg_res)
    best_params.append(gs.best_params_)
    best_scores.append(gs.best_score_)
    rocs.append(roc_auc_score(y_valid_logreg,gs.predict_proba(X_valid_logreg)[:,1]))

In [None]:
max(best_scores)

In [None]:
max(rocs)

LogReg doesn't perform better after oversampling, so we won't use it.

### Random Forest

For this model and for XGBoost we use data with the suffix *tb* (tree-based). The data is not scaled, and categorical features are label-encoded.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

In [None]:
rfc=RandomForestClassifier()

In [None]:
rfc.fit(X_train_tb,y_train_tb)

In [None]:
roc_auc_score(y_valid_tb,rfc.predict_proba(X_valid_tb)[:,1])

Time for gridsearch:

In [None]:
param_grid = {
    "n_estimators": [500],
    "max_depth": [4,5,10,15],
    "min_samples_split": [2,3],
    "min_samples_leaf": [2], #,1,3],
    'max_features': [1,'sqrt','log2'], 
    'criterion': ['gini'] }

In [None]:
gs = GridSearchCV(rfc, param_grid, scoring='roc_auc', n_jobs=-1, cv=skf, verbose=1)

In [None]:
%%time
gs.fit(X_train_tb, y_train_tb)
print('Best parameters for Random Forest: ', gs.best_params_)
print('Best score: ', gs.best_score_)

In [None]:
roc_auc_score(y_valid_tb,gs.predict_proba(X_valid_tb)[:,1])

Seems much better. Will XGBoost beat this?

We'll save best RandomForest version for future reference

In [None]:
rfc=gs.best_estimator_

In [None]:
rfc.fit(X_train_tb,y_train_tb)

### XGBoost Classifier

In [None]:
from xgboost import XGBClassifier

In [None]:
xgbclf = XGBClassifier(random_state=17, n_jobs=-1)

In [None]:
xgbclf.fit(X_train_tb,y_train_tb)

In [None]:
roc_auc_score(y_valid_tb,xgbclf.predict_proba(X_valid_tb)[:,1])

In [None]:
param_grid = {
    'max_depth': [2,3,4,5], 
    'n_estimators': [50,100,150,300], 
    'learning_rate':[0.01,0.05,0.1], 
    'reg_alpha': [0, 0.1, 0.2],
    'gamma': [0,1]
}

In [None]:
gs = GridSearchCV(xgbclf, param_grid, scoring='roc_auc', n_jobs=-1, cv=skf, verbose=1)

In [None]:
%%time
gs.fit(X_train_tb, y_train_tb)
print('Best parameters for XGBoost Classifier: ', gs.best_params_)
print('Best score for XGBoost Classifier: ', gs.best_score_)

In [None]:
roc_auc_score(y_valid_tb,gs.predict_proba(X_valid_tb)[:,1])

A bit worse.

OK, basic XGBoost without tuning shows the best results among the 3 models. But there is room for improvement in the case of Logistic Regression, so we'll get back to it once more.

### Logistic Regression 2.0

As you might remember, all our numeric features are right-skewed, so let's see if a log transformation improves our model's performance. *_lt* stands for log transformation.

In [None]:
plt.figure(figsize=(10,20))   
for i,v in enumerate(range(len(num_feats))):
    v = v+1
    ax1 = plt.subplot(len(num_feats),1,v)
    ax1=sns.distplot(np.log1p(df[num_feats[i]]))

That's better.

In [None]:
#we'll do transformation over a copy of dataset. '_' stands for version before scaling
X_train_logreg_lt=X_train_logreg_.copy(deep=True)
X_valid_logreg_lt=X_valid_logreg_.copy(deep=True)
X_test_logreg_lt=X_test_logreg_.copy(deep=True)

In [None]:
X_train_logreg_lt[num_feats]=np.log1p(X_train_logreg_lt[num_feats])
X_valid_logreg_lt[num_feats]=np.log1p(X_valid_logreg_lt[num_feats])
X_test_logreg_lt[num_feats]=np.log1p(X_test_logreg_lt[num_feats])

X_train_logreg_lt[num_feats]=scaler.fit_transform(X_train_logreg_lt[num_feats])
X_valid_logreg_lt[num_feats]=scaler.transform(X_valid_logreg_lt[num_feats])
X_test_logreg_lt[num_feats]=scaler.transform(X_test_logreg_lt[num_feats])

In [None]:
lr=LogisticRegression(class_weight='balanced',C=0.05)

In [None]:
lr.fit(X_train_logreg_lt,y_train_logreg)

In [None]:
print('Mean ROC AUC on dataset after log transformation of numeric features',
      np.mean(cross_val_score(lr,X_train_logreg_lt,y_train_logreg,scoring='roc_auc',cv=skf)))

In [None]:
print('ROC AUC on valid set',roc_auc_score(y_valid_logreg,lr.predict_proba(X_valid_logreg_lt)[:,1]))

That's quite an improvement (~1.3%!!! :-)) in comparison with the very first basic LogReg before grid search (~0.895). We'll keep the transformed dataset for further exploration.

***Feature selection***

There were two pairs of highly correlated numerical features: *ProductRelated - ProductRelated_Duration* and *BounceRates - ExitRates*. Maybe deleting one feature from each pair will improve the model.

In [None]:
#we'll make again a transformation on copy
X_train_logreg_copy=X_train_logreg_lt.copy(deep=True)
X_valid_logreg_copy=X_valid_logreg_lt.copy(deep=True)
X_test_logreg_copy=X_test_logreg_lt.copy(deep=True)

X_train_logreg_copy.drop(['ProductRelated','BounceRates'],axis=1,inplace=True)
X_valid_logreg_copy.drop(['ProductRelated','BounceRates'],axis=1,inplace=True)
X_test_logreg_copy.drop(['ProductRelated','BounceRates'],axis=1,inplace=True)

In [None]:
lr=LogisticRegression(class_weight='balanced',C=0.05)

In [None]:
lr.fit(X_train_logreg_copy,y_train_logreg)

In [None]:
print('Mean ROC AUC on dataset after deleting "ProductRelated" and "BounceRates"',np.mean(cross_val_score(lr,X_train_logreg_copy,y_train_logreg,scoring='roc_auc',cv=skf)))

In [None]:
print('ROC AUC on valid set',roc_auc_score(y_valid_logreg,lr.predict_proba(X_valid_logreg_copy)[:,1]))

Results on cross-validation haven't changed, but slightly decreased on hold-out validation. It's not quite clear what should be done, so we'll keep the old version.
Let's turn to our RandomForest and XGBoost models and see which features were the least important.

In [None]:
feat_names=['Administrative','Administrative_Duration','Informational','Informational_Duration','ProductRelated','ProductRelated_Duration',
 'BounceRates','ExitRates','PageValues','SpecialDay','Weekend','TrafficType','OperatingSystems','Browser','Region','VisitorType',
           'month']

In [None]:
rfc_feat_imp=dict(zip(feat_names, rfc.feature_importances_))
xgb_feat_imp=dict(zip(feat_names, xgbclf.feature_importances_))

In [None]:
plt.figure(figsize=(20,15))
plt.subplot(211)
plt.bar(range(len(feat_names)),list(rfc_feat_imp.values()),tick_label=list(rfc_feat_imp.keys()))
plt.xticks(rotation=90)
plt.subplot(212)
plt.bar(range(len(feat_names)),list(xgb_feat_imp.values()),tick_label=list(xgb_feat_imp.keys()))
plt.xticks(rotation=90)

We'll drop *Informational_Duration*, *Weekend*, *Browser*, *OperatingSystems* and *Region* (well, dummy columns for the last three)

In [None]:
X_train_logreg_copy=X_train_logreg_lt.copy(deep=True)
X_valid_logreg_copy=X_valid_logreg_lt.copy(deep=True)
X_test_logreg_copy=X_test_logreg_lt.copy(deep=True)

In [None]:
X_train_logreg_copy.drop(['Informational_Duration','Weekend','Browser_2',
       'Browser_3', 'Browser_4', 'Browser_5', 'Browser_6', 'Browser_7',
       'Browser_8', 'Browser_9', 'Browser_10', 'Browser_11', 'Browser_12',
       'Browser_13','OS_2', 'OS_3', 'OS_4',
       'OS_5', 'OS_6', 'OS_7', 'OS_8','Region_2', 'Region_3', 'Region_4', 'Region_5',
       'Region_6', 'Region_7', 'Region_8', 'Region_9',],axis=1,inplace=True)

X_valid_logreg_copy.drop(['Informational_Duration','Weekend','Browser_2',
       'Browser_3', 'Browser_4', 'Browser_5', 'Browser_6', 'Browser_7',
       'Browser_8', 'Browser_9', 'Browser_10', 'Browser_11', 'Browser_12',
       'Browser_13','OS_2', 'OS_3', 'OS_4',
       'OS_5', 'OS_6', 'OS_7', 'OS_8','Region_2', 'Region_3', 'Region_4', 'Region_5',
       'Region_6', 'Region_7', 'Region_8', 'Region_9',],axis=1,inplace=True)
X_test_logreg_copy.drop(['Informational_Duration','Weekend','Browser_2',
       'Browser_3', 'Browser_4', 'Browser_5', 'Browser_6', 'Browser_7',
       'Browser_8', 'Browser_9', 'Browser_10', 'Browser_11', 'Browser_12',
       'Browser_13','OS_2', 'OS_3', 'OS_4',
       'OS_5', 'OS_6', 'OS_7', 'OS_8','Region_2', 'Region_3', 'Region_4', 'Region_5',
       'Region_6', 'Region_7', 'Region_8', 'Region_9'],axis=1,inplace=True)

In [None]:
lr=LogisticRegression(class_weight='balanced',C=0.1)

In [None]:
lr.fit(X_train_logreg_copy,y_train_logreg)

In [None]:
np.mean(cross_val_score(lr,X_train_logreg_copy,y_train_logreg,scoring='roc_auc',cv=skf))

In [None]:
roc_auc_score(y_valid_logreg,lr.predict_proba(X_valid_logreg_copy)[:,1])

Getting rid of the five least important features (from the RandomForestClassifier perspective) gave a small improvement in our logistic regression performance. But it's still not close to Random Forest or XGBoost Classifier.

In [None]:
gs = GridSearchCV(lr, {'C':np.logspace(-2,1,10)}, scoring='roc_auc', n_jobs=-1, cv=skf)

In [None]:
gs.fit(X_train_logreg_copy,y_train_logreg)

In [None]:
gs.best_score_

In [None]:
print('ROC AUC on valid set :', roc_auc_score(y_valid_logreg,gs.predict_proba(X_valid_logreg_copy)[:,1]))

In [None]:
lr=gs.best_estimator_

Well, OK. We can conclude that this is now the best version of Logistic Regression. 
Log transformation and feature selection based on tree-based models were the right decisions, while deleting the two most correlated features was not. 

XGBoost is the best model so far, and Logistic Regression is still the worst. We'll keep them all for experiments with engineering new features.

## Creation of new features and description of this process

### Logistic Regression

So I'd try to make several interaction features by myself and see what will happen. Intuitively I suppose that feature showing amount of time visitor spends on ProductRelated pages with the fact that visitor already was there might be useful for determining this visitor's intention. Interaction between VisitorType and PageValues might be important too (PageValues itself is quite important as we could see on previous plots)

In [None]:
#_wn = with new features
X_train_logreg_wn=X_train_logreg_copy.copy(deep=True)
X_valid_logreg_wn=X_valid_logreg_copy.copy(deep=True)
X_test_logreg_wn=X_test_logreg_copy.copy(deep=True)

In [None]:
X_train_logreg_wn['interfeat1']=X_train_logreg_wn.VisitorType*X_train_logreg_wn.ProductRelated
X_valid_logreg_wn['interfeat1']=X_valid_logreg_wn.VisitorType*X_valid_logreg_wn.ProductRelated
X_test_logreg_wn['interfeat1']=X_test_logreg_wn.VisitorType*X_test_logreg_wn.ProductRelated

X_train_logreg_wn['condfeat1']=X_train_logreg_wn.VisitorType*X_train_logreg_wn.PageValues
X_valid_logreg_wn['condfeat1']=X_valid_logreg_wn.VisitorType*X_valid_logreg_wn.PageValues
X_test_logreg_wn['condfeat1']=X_test_logreg_wn.VisitorType*X_test_logreg_wn.PageValues

In [None]:
lr.fit(X_train_logreg_wn,y_train_logreg)

In [None]:
tmp_train=X_train_logreg_wn.copy(deep=True)
tmp_valid=X_valid_logreg_wn.copy(deep=True)
tmp_test=X_test_logreg_wn.copy(deep=True)

In [None]:
print('ROC AUC on valid set :', roc_auc_score(y_valid_logreg,lr.predict_proba(X_valid_logreg_wn)[:,1]))

Again, a very small improvement, but an improvement nonetheless.

Now let's try to generate interaction features using [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html). Then we'll select the most important ones.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
polfeat = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

In [None]:
polfeats_train=pd.DataFrame(polfeat.fit_transform(X_train_logreg_wn))
polfeats_valid=pd.DataFrame(polfeat.transform(X_valid_logreg_wn))
polfeats_test=pd.DataFrame(polfeat.transform(X_test_logreg_wn))

In [None]:
polfeats_train.shape

I'll fit XGBoost Classifier on those 820 features to see which features are the most important from its perspective.
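
The number 820 follows from the settings above: with `interaction_only=True`, degree 2 and no bias column, 40 input columns give the 40 originals plus one product per pair, 40 + 40·39/2 = 820. The sketch below reproduces the count and the `x0 x1`-style naming with plain Python; `interaction_feature_names` is a helper written for this illustration, not a scikit-learn function.

```python
from itertools import combinations


def interaction_feature_names(columns):
    """Original columns followed by all pairwise products, in pair order."""
    pairs = [f"{a} {b}" for a, b in combinations(columns, 2)]
    return list(columns) + pairs


columns = [f"x{i}" for i in range(40)]
names = interaction_feature_names(columns)
print(len(names))
print(names[40:43])
```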

In [None]:
xg=XGBClassifier()
xg.fit(polfeats_train,y_train_logreg)

In [None]:
plt.figure(figsize=(25,15))
plt.bar(range(820),list(xg.feature_importances_),tick_label=polfeat.get_feature_names())
plt.xticks(rotation=90)

Yep, the feature names are not readable. I'll print ***10*** most important features.

In [None]:
xg_imp_2=dict(list(zip(polfeat.get_feature_names(),xg.feature_importances_)))

In [None]:
sorted(xg_imp_2.items(), key=lambda x: x[1], reverse=True)[:10]

In [None]:
dict(zip(polfeat.get_feature_names()[:38],X_train_logreg_wn.columns))

In [None]:
X_train_logreg_wn['x7 x13']=X_train_logreg_wn.PageValues*X_train_logreg_wn.Mar
X_valid_logreg_wn['x7 x13']=X_valid_logreg_wn.PageValues*X_valid_logreg_wn.Mar
X_test_logreg_wn['x7 x13']=X_test_logreg_wn.PageValues*X_test_logreg_wn.Mar


X_train_logreg_wn['x3 x6']=X_train_logreg_wn.ProductRelated*X_train_logreg_wn.ExitRates
X_valid_logreg_wn['x3 x6']=X_valid_logreg_wn.ProductRelated*X_valid_logreg_wn.ExitRates
X_test_logreg_wn['x3 x6']=X_test_logreg_wn.ProductRelated*X_test_logreg_wn.ExitRates


X_train_logreg_wn['x7 x14']=X_train_logreg_wn.PageValues*X_train_logreg_wn.May
X_valid_logreg_wn['x7 x14']=X_valid_logreg_wn.PageValues*X_valid_logreg_wn.May
X_test_logreg_wn['x7 x14']=X_test_logreg_wn.PageValues*X_test_logreg_wn.May


X_train_logreg_wn['x6 x7']=X_train_logreg_wn.PageValues*X_train_logreg_wn.ExitRates
X_valid_logreg_wn['x6 x7']=X_valid_logreg_wn.PageValues*X_valid_logreg_wn.ExitRates
X_test_logreg_wn['x6 x7']=X_test_logreg_wn.PageValues*X_test_logreg_wn.ExitRates


#x0 is Administrative and x7 is PageValues
X_train_logreg_wn['x0 x7']=X_train_logreg_wn.Administrative*X_train_logreg_wn.PageValues
X_valid_logreg_wn['x0 x7']=X_valid_logreg_wn.Administrative*X_valid_logreg_wn.PageValues
X_test_logreg_wn['x0 x7']=X_test_logreg_wn.Administrative*X_test_logreg_wn.PageValues


X_train_logreg_wn['x4 x7']=X_train_logreg_wn.ProductRelated_Duration*X_train_logreg_wn.PageValues
X_valid_logreg_wn['x4 x7']=X_valid_logreg_wn.ProductRelated_Duration*X_valid_logreg_wn.PageValues
X_test_logreg_wn['x4 x7']=X_test_logreg_wn.ProductRelated_Duration*X_test_logreg_wn.PageValues


#x4 is ProductRelated_Duration and x15 is Nov
X_train_logreg_wn['x4 x15']=X_train_logreg_wn.ProductRelated_Duration*X_train_logreg_wn.Nov
X_valid_logreg_wn['x4 x15']=X_valid_logreg_wn.ProductRelated_Duration*X_valid_logreg_wn.Nov
X_test_logreg_wn['x4 x15']=X_test_logreg_wn.ProductRelated_Duration*X_test_logreg_wn.Nov


The feature added *below* was made up intuitively.

In [None]:
X_train_logreg_wn['condfeat2']=(X_train_logreg_wn.ProductRelated_Duration>1).astype(int)
X_valid_logreg_wn['condfeat2']=(X_valid_logreg_wn.ProductRelated_Duration>1).astype(int)
X_test_logreg_wn['condfeat2']=(X_test_logreg_wn.ProductRelated_Duration>1).astype(int)

We can also get a rough sense of feature importance from the *coef_* attribute of ***lr***:

In [None]:
X_valid_logreg_copy.shape

In [None]:
plt.figure(figsize=(25,15))
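# One bar per coefficient; the count is taken from the model instead of being hard-coded
plt.bar(range(len(lr.coef_[0])),list(lr.coef_[0]),tick_label=list(tmp_train.columns))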
plt.xticks(rotation=90)
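
Coefficient magnitudes are only comparable when the features share a common scale, and a sorted table can be easier to read than a wide bar chart. Below is a minimal sketch on synthetic data. The `make_classification` data, the `feat_*` column names and the `coef_table` variable are assumptions made for illustration and are not part of the project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project's training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=17
)
cols = [f"feat_{i}" for i in range(X_demo.shape[1])]
# Standardize so that coefficient magnitudes are comparable across features.
X_demo = pd.DataFrame(StandardScaler().fit_transform(X_demo), columns=cols)

lr_demo = LogisticRegression(C=0.1, class_weight="balanced").fit(X_demo, y_demo)

# Rank features by absolute coefficient; the sign gives the direction of the effect.
coef_table = (
    pd.DataFrame({"feature": cols, "coef": lr_demo.coef_[0]})
    .assign(abs_coef=lambda d: d["coef"].abs())
    .sort_values("abs_coef", ascending=False)
    .reset_index(drop=True)
)
print(coef_table)
```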

Deleting some columns:

In [None]:
# Drop the same columns from all three splits
cols_to_drop=['TT_14','TT_17','TT_7','TT_12','TT_18','TT_19','TT_9','TT_15','TT_6','TT_4','June','Oct']
for df in (X_train_logreg_wn,X_valid_logreg_wn,X_test_logreg_wn):
    df.drop(cols_to_drop,axis=1,inplace=True)

In [None]:
# Drop the raw features from all three splits
raw_cols_to_drop=['ExitRates','Informational','Administrative_Duration','ProductRelated_Duration']
for df in (X_train_logreg_wn,X_valid_logreg_wn,X_test_logreg_wn):
    df.drop(raw_cols_to_drop,axis=1,inplace=True)

In [None]:
lr=LogisticRegression(C=0.1,class_weight='balanced')

In [None]:
lr.fit(X_train_logreg_wn,y_train_logreg)

In [None]:
print('ROC AUC on validation set :', roc_auc_score(y_valid_logreg,lr.predict_proba(X_valid_logreg_wn)[:,1]))

Our feature engineering improved the model's ROC AUC by approximately 0.8%.

# *Part without name*

Actually, RandomForest and the XGBoost classifier showed better ROC AUC scores, but fitting and tuning Logistic Regression is much faster. Now suppose the retail company has set the decision threshold to *0.5*. Out of curiosity, I decided to check the recall score:

In [None]:
from sklearn.metrics import recall_score

In [None]:
print('Random Forest Classifier identified %f %% of all visitors who actually made a purchase (threshold 0.5)'
      % (100*recall_score(y_test_tb,rfc.predict(X_test_tb))))

In [None]:
print('XGBoost Classifier identified %f %% of all visitors who actually made a purchase (threshold 0.5)'
      % (100*recall_score(y_test_tb,xgbclf.predict(X_test_tb))))

In [None]:
print('Logistic Regression identified %f %% of all visitors who actually made a purchase (threshold 0.5)'
      % (100*recall_score(y_test_logreg,lr.predict(X_test_logreg_wn))))
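
For binary problems, `predict` effectively applies a 0.5 cut-off to the positive-class probability, so recall at any other threshold has to be computed from `predict_proba`. The sketch below shows one way to do that on synthetic imbalanced data. The data, the `clf_demo` model and the threshold values are assumptions for illustration, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the shoppers dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, weights=[0.85, 0.15], random_state=17
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=17
)

clf_demo = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
proba = clf_demo.predict_proba(X_te)[:, 1]  # probability of the positive class

# Recall for several hypothetical business thresholds.
recalls = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_pred)
    print(f"threshold={threshold:.1f}  recall={recalls[threshold]:.3f}")
```

Lowering the threshold can only keep or increase recall, usually at the cost of precision, so the choice depends on how expensive a wasted special offer is compared with a missed buyer.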

## Plotting training and validation curves

In [None]:
from sklearn.model_selection import learning_curve,validation_curve

In [None]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("ROC AUC")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring='roc_auc')
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [None]:
plt.figure(figsize=(10, 7))
plot_learning_curve(lr, 'Logistic Regression', X_train_logreg_wn, y_train_logreg, cv=skf, n_jobs=-1);

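We observe a good sign: the training and cross-validation curves tend to converge, which suggests the model is not overfitting. Since the score they converge to is high, there is no sign of underfitting either.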

In [None]:
plt.figure(figsize=(10,7))
param_range=np.array([0.01, 0.05, 0.1, 0.25, 0.5, 1, 5])
train_scores, test_scores = validation_curve(lr, X_train_logreg_wn, y_train_logreg, param_name="C",
                                             param_range=param_range, cv=skf, scoring="roc_auc", n_jobs=-1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve")
plt.xlabel("C")
plt.ylabel("ROC AUC")
#plt.ylim(0.0, 1.1)
lw = 2
plt.plot(param_range, train_scores_mean, label="Training score",
             color="darkorange", lw=lw)
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2,
                 color="darkorange", lw=lw)
plt.plot(param_range, test_scores_mean, label="Cross-validation score",
             color="navy", lw=lw)
plt.fill_between(param_range, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2,
                 color="navy", lw=lw)
plt.legend(loc="best")
plt.show()

So the scores are fairly consistent across most of the *C* range, although ROC AUC rises sharply as *C* increases from the smallest values to about *0.4*.
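
Because *C* spans several orders of magnitude, a log-spaced grid often shows the shape of the curve more clearly, and the best value can be read directly from the mean cross-validation scores. This is a minimal sketch on synthetic data. The data, `skf_demo`, `c_grid` and `best_c` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, validation_curve

# Synthetic imbalanced data (assumption, not the project's data).
X_demo, y_demo = make_classification(
    n_samples=600, weights=[0.85, 0.15], random_state=17
)
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
c_grid = np.logspace(-3, 1, 9)  # 0.001 ... 10, evenly spaced on a log scale

train_sc, cv_sc = validation_curve(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    X_demo, y_demo,
    param_name="C", param_range=c_grid,
    cv=skf_demo, scoring="roc_auc",
)

# Average over folds and pick the C with the highest mean CV score.
cv_mean = cv_sc.mean(axis=1)
best_c = c_grid[np.argmax(cv_mean)]
for c, m in zip(c_grid, cv_mean):
    print(f"C={c:<8.4g} mean CV ROC AUC={m:.4f}")
print("best C:", best_c)
```

When plotting such a grid, `plt.semilogx` in place of `plt.plot` keeps the small values of *C* from being squeezed against the left edge.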

## Prediction for test or hold-out samples

Now it's time to make predictions on the ***test*** set. This is the set we created at the beginning and transformed together with *train* and *valid*, but have not used for fitting or tuning.

Before that, we can refit each model on ***train*** + ***valid***.

**RandomForest**

In [None]:
X_train_tb_fin=pd.concat([X_train_tb,X_valid_tb],axis=0)
y_train_tb_fin=pd.concat([y_train_tb,y_valid_tb],axis=0)

In [None]:
%%time
rfc.fit(X_train_tb_fin,y_train_tb_fin)

In [None]:
print('ROC AUC on test set :', roc_auc_score(y_test_tb,rfc.predict_proba(X_test_tb)[:,1]))

**XGBoost Classifier**

In [None]:
%%time
xgbclf.fit(X_train_tb_fin,y_train_tb_fin)

In [None]:
print('ROC AUC on test set :', roc_auc_score(y_test_tb,xgbclf.predict_proba(X_test_tb)[:,1]))

**Logistic Regression**

In [None]:
X_train_logreg_wn.shape

In [None]:
X_valid_logreg_wn.shape

In [None]:
X_train_logreg_fin=pd.concat([X_train_logreg_wn,X_valid_logreg_wn],axis=0)
y_train_logreg_fin=pd.concat([y_train_logreg,y_valid_logreg],axis=0)

In [None]:
X_train_logreg_fin.shape

In [None]:
y_train_logreg_fin.shape

In [None]:
y_test_logreg.shape

In [None]:
lr.fit(X_train_logreg_fin,y_train_logreg_fin)

In [None]:
print('ROC AUC on test set :', roc_auc_score(y_test_logreg,lr.predict_proba(X_test_logreg_wn)[:,1]))

The results are even higher than on the ***valid*** set.

## Conclusions 

So this is the end. We used three models on imbalanced data, and each of them showed quite a high ROC AUC score. We could observe that the score on the valid set changed in line with *cross_val_score*. 
I'd stick with Logistic Regression, as you may have noticed from the last parts of the project. Moreover, I feel that more could be done for RandomForest and XGBoost.

When integrated with the other module, which determines the likelihood of a visitor leaving the site (mentioned in the beginning), the company can use this classification model to show individual special offers to such visitors before they leave the shop's website.


- Data was collected over one year, and we have a *Month* feature, so we basically have a sort of timeline. However, I used *Month* solely as a categorical feature without any time context, and I'm not sure that treating it as a time variable would make sense.

- Feature selection and engineering helped a bit to increase the LogReg score. Apart from experimenting with visualizations, I don't see a further way to improve this process. Oversampling didn't help (but didn't make things worse either).

- As for choosing parameter ranges: I'm very new to this, so I don't have much experience in tuning models. I either take a range of values close to the defaults, or start with a wide but coarse range and then iterate over a finer grid in the smaller regions that seem optimal. The parameter grids used in the code above are what I arrived at after some time. Hyperparameter tuning definitely needs a wiser approach.

- It would also be interesting to build an online learning system that could be updated with each new example.
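
On the tuning point, one commonly used alternative to hand-built grids is a randomized search that samples *C* from a log-uniform distribution. The sketch below runs it on synthetic data. The data, the search bounds and `n_iter=20` are assumptions for illustration, not settings taken from this project.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data (assumption, not the project's data).
X_demo, y_demo = make_classification(
    n_samples=600, weights=[0.85, 0.15], random_state=17
)

search = RandomizedSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    # Sample C log-uniformly so every order of magnitude is equally likely.
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=20,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=17),
    random_state=17,
)
search.fit(X_demo, y_demo)
print("best C:", search.best_params_["C"])
print("best mean CV ROC AUC:", round(search.best_score_, 4))
```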

## Thank you for your attention!