# COSC2669 Case Studies in Data Science
# Fortnightly Task 1
## Name: Bhargav Rele (s3761977)

In [175]:
#Packages required for this notebook
from IPython.display import Image
import pandas as pd
import numpy as np
import altair as alt

from sklearn import preprocessing 
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn import metrics

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
import warnings
warnings.filterwarnings('ignore')

## Part 1: Job Role

The description of the job role is as follows:
[Note: the job description has been copied and pasted from the original link on the company application portal: https://omnicommediagroup.springboard.com.au/jobs/OMG-1470758?in_site=LINKEDIN]


### Job Description
##### Senior Data Scientist

Annalect is a global data, technology, and analytics consultancy, with offices based in Sydney, Melbourne, Brisbane and major global hubs. We create solutions across data, measurement, analytics, visualisation, and marketing technology to maximise the effectiveness of media and our client's media ROI.

When we launched Annalect globally in 2011, we dedicated ourselves to transforming the effects of data and analytics on media. Now we have learned to transform the entire marketing process.

Annalect is part of the Omnicom Media Group, and services hundreds of clients across OMD, PHD, Hearts & Science, Resolution Media, Foundation, and OMG United, in addition to working on our own portfolio of direct clients

Our work with PHD won B&T's 2019 award for Best Media Campaign for Children's Panadol, and our work with OMD won B&T's 2019 award for Data Driven Marketing for McDonald's Monopoly.

##### The role

As a Senior Data Scientist in the marketing science team, you will be responsible for helping some of Omnicom's leading clients make informed data driven marketing decisions.

Specifically, you will analyse large amounts of data to determine how customers interact with our clients' advertising across different channels and devices.

Your focus will be to develop custom attribution and predictive models for both online and offline media. Using these models, you will deliver actionable insights that will drive value to our clients and improve their ROI.

This role will allow you refine your work across a number of clients internally and externally. You will be responsible for evangelising the product across the agencies within Omnicom and really own the process that you will be responsible for creating.

##### Responsibilities

* Your role will be to apply your broad skillset across data and analytics to understand business problems, create insights and envisage practical solutions in areas such as media measurement and customer insights
* Build bespoke digital attribution models and improve our digital attribution product offering
* Provide business solutions and optimisation through various statistical and quantitative methods.
* Provide statistical models to support predictive analytics and deliver non-technical presentations to all levels of the business as well as technical documentation to the wider team.
* Derive insights from data and communicate those insights to a non-technical audience through presentations and documentation
* Build Annalect's experimentation/predictive modelling ecosystem
* Build scalable backend solutions for automation of data processing
* Analyse and mash-up massive amounts of data to mine useful business insights

##### What you'll need

* A strong background in data science, analytics, or data engineering
* Proficiency and experience in statistical modelling and machine learning techniques (feature engineering, regression, classification, segmentation, cross validation, bootstrapping, Bayesian techniques etc.)
* A high level of proficiency with at least one programming languages used in data science (R/Python/SAS/Scala/MATLAB).
* Excellent communication skills - being able to both interpret and convey information in a clear, concise way with people from technical and non-technical background and use data to tell a story
* Strong understanding of media, and media data
* Ability to work in a fast-paced growth environment.

### Domain

Annalect is part of Omnicom Media Group, an American media company that services clients across several domains. Its primary purpose is to use advanced analytical techniques to improve the marketing decisions and strategies of those clients. Because Omnicom deals with clients across a diverse range of domains, Annalect uses data and analytics to optimise marketing efforts for companies primarily in healthcare and business.

### Insights or Predictions?

As a senior data scientist at Annalect, my primary purpose will be to analyse offline and online customer activity using customer datasets provided by clients. The analysis will use supervised machine learning models (such as classifiers) to gain insight into the effectiveness of current marketing strategies by examining the way in which customers interact with advertising. With that insight, Annalect may be able to predict whether or not a new customer is likely to make purchases online. Based on the results of the analysis, we can recommend to clients how they can maximise the return on investment of their marketing strategies. Hence, Annalect relies on analytics not only to gain insights but also to make predictions.



## Part 2: Data Set

The purpose of this analysis is to create a model that is capable (with a certain degree of accuracy) of predicting whether or not a customer will contribute to the sales revenue of an online retailer. To be able to do so, we shall train and test Classifiers on an existing customer dataset that has been sourced from the UCI Machine Learning Repository (Web Link: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset).

The dataset consists of 12,330 online customers of an undisclosed online retailer and comprises 10 numerical features and 8 categorical features. The description of each feature can be accessed via the web link provided above. Given that the role requires the applicant to analyse customer interactions with online retailers, the dataset was deemed appropriate.

The feature termed 'Revenue' takes the value 'TRUE' for customers who made a purchase online and 'FALSE' for customers who did not. The dataset consists of 10,422 customers where 'Revenue' = 'FALSE' and 1,908 customers where 'Revenue' = 'TRUE'. By predicting the 'Revenue' label of new customers, we may gain insight into the effectiveness of targeted marketing strategies and into how those strategies can be optimised. For instance, a greater portion of marketing effort can be focused on new customers who are predicted to make purchases online. Alternatively, a greater portion of marketing effort can be focused on customers who are predicted not to make purchases online, thus enabling the client to capture a larger market share ('long tail theory').

## Part 3: Experiment

To create a model that classifies customers into those who contribute to online sales and those who do not, we use a supervised machine learning technique (i.e., classification). The variable we are trying to predict is the 'Revenue' variable. Hence, we have a binary classification problem. Furthermore, given that we have 10,422 customers where 'Revenue' = 'FALSE' and 1,908 customers where 'Revenue' = 'TRUE', we have a class imbalance problem. However, we assume that the dataset is a statistically accurate representation of customer behaviour online. Hence, we do not resample the data to balance the class labels in 'Revenue'.

The two classification techniques we use are the K-Nearest Neighbours classifier and the Random Forest classifier. Both classifiers are tuned on the same training data using repeated stratified K-fold cross-validation and then evaluated on a held-out test set. We use Pipelines within a grid search to construct several models at once (based on several hyperparameter values and feature selection methods). From all the models that are created, we pick the one with the highest performance metric.

The dataset does not consist of any missing values and is clean for the most part. Nonetheless, we make several adjustments to the dataset in order to prepare it for the pipeline we wish to create. The next section goes over the transformation and preparation of the dataset. The modelling phase is commenced after.

### Data Preparation and Transformation

##### Data Importation

In [176]:
# on_bad_lines='skip' replaces error_bad_lines=False, which newer pandas versions no longer accept
sales = pd.read_csv('online_shoppers_intention.csv', on_bad_lines='skip')

In [177]:
sales.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


In [178]:
sales.tail()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
12325,3,145.0,0,0.0,53,1783.791667,0.007143,0.029031,12.241717,0.0,Dec,4,6,1,1,Returning_Visitor,True,False
12326,0,0.0,0,0.0,5,465.75,0.0,0.021333,0.0,0.0,Nov,3,2,1,8,Returning_Visitor,True,False
12327,0,0.0,0,0.0,6,184.25,0.083333,0.086667,0.0,0.0,Nov,3,2,1,13,Returning_Visitor,True,False
12328,4,75.0,0,0.0,15,346.0,0.0,0.021053,0.0,0.0,Nov,2,2,3,11,Returning_Visitor,False,False
12329,0,0.0,0,0.0,3,21.25,0.0,0.066667,0.0,0.0,Nov,3,2,1,2,New_Visitor,True,False


In [179]:
sales.dtypes

Administrative               int64
Administrative_Duration    float64
Informational                int64
Informational_Duration     float64
ProductRelated               int64
ProductRelated_Duration    float64
BounceRates                float64
ExitRates                  float64
PageValues                 float64
SpecialDay                 float64
Month                       object
OperatingSystems             int64
Browser                      int64
Region                       int64
TrafficType                  int64
VisitorType                 object
Weekend                       bool
Revenue                       bool
dtype: object

##### Data type specifications

The first 10 columns of the dataset are the numeric features. The 8 columns that follow after are the categorical features. It can be observed from the previous output that a few of our categorical features consist of numeric data types. Hence, we convert them to object data types.

In [180]:
cat_columns = sales.columns[10:]  # the 8 categorical columns
sales[cat_columns] = sales[cat_columns].astype(object)

In [181]:
sales.dtypes

Administrative               int64
Administrative_Duration    float64
Informational                int64
Informational_Duration     float64
ProductRelated               int64
ProductRelated_Duration    float64
BounceRates                float64
ExitRates                  float64
PageValues                 float64
SpecialDay                 float64
Month                       object
OperatingSystems            object
Browser                     object
Region                      object
TrafficType                 object
VisitorType                 object
Weekend                     object
Revenue                     object
dtype: object

##### Feature Encoding

Prior to encoding, we separate the descriptive features from the target feature of our dataset. We then use one-hot encoding to encode the categorical features.

In [182]:
sales_descrip = sales.iloc[:,0:17]
sales_target = sales.iloc[:,17]

sales_descrip_onehot = sales_descrip.copy()
sales_descrip_onehot['Month'] = sales_descrip_onehot['Month'].replace({'Jan':1,'Feb':2,'Mar':3,'Apr':4,
                                                                       'May':5,'Jun':6,'Jul':7,'Aug':8,
                                                                       'Sep':9,'Oct':10,'Nov':11,'Dec':12})
sales_descrip_onehot['Month'] = sales_descrip_onehot['Month'].astype(object)

In [183]:
cat_cols = sales_descrip_onehot.columns[sales_descrip_onehot.dtypes==object].tolist()

for i in cat_cols:
    n = len(sales_descrip_onehot[i].unique())
    if (n==2):
        sales_descrip_onehot[i]=pd.get_dummies(sales_descrip_onehot[i],drop_first=True)

sales_descrip_onehot = pd.get_dummies(sales_descrip_onehot)

In [184]:
sales_descrip_onehot.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,...,TrafficType_14,TrafficType_15,TrafficType_16,TrafficType_17,TrafficType_18,TrafficType_19,TrafficType_20,VisitorType_New_Visitor,VisitorType_Other,VisitorType_Returning_Visitor
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
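
To illustrate the encoding logic on a smaller scale, the following is a minimal sketch on a toy frame. The values are hypothetical; only the column names mirror the dataset. A two-level column is reduced to a single 0/1 indicator, while a multi-level column gets one indicator per level.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset.
toy = pd.DataFrame({
    "PageValues": [0.0, 12.2, 0.0],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Other"],
    "Weekend": [False, True, False],
})

# Binary column: drop_first removes the redundant level, leaving one 0/1 indicator.
toy["Weekend"] = pd.get_dummies(toy["Weekend"], drop_first=True, dtype=int).iloc[:, 0]

# Multi-level column: one indicator column per level.
encoded = pd.get_dummies(toy, columns=["VisitorType"], dtype=int)
print(encoded.columns.tolist())
```

The `dtype=int` argument is used here only so that the indicators print as 0/1 regardless of the pandas version.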


##### Target Encoding

We can then proceed with encoding our target feature such that 'False' = 0 (negative class) and 'True' = 1 (positive class).

In [185]:
sales_target_encoded = sales_target.copy()

In [186]:
sales_target.value_counts()

False    10422
True      1908
Name: Revenue, dtype: int64

In [187]:
sales_target_encoded= sales_target_encoded.replace({False:0,True:1})

In [188]:
sales_target_encoded.value_counts()

0    10422
1     1908
Name: Revenue, dtype: int64
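
As a quick worked example of what these counts imply, the sketch below computes the share of the positive class, the imbalance ratio, and the accuracy that a trivial "always predict 0" rule would achieve. The counts are copied from the output above; the variable names are illustrative.

```python
# Class counts as reported by value_counts() above.
n_false, n_true = 10422, 1908
n_total = n_false + n_true

positive_share = n_true / n_total        # share of customers who purchased
majority_baseline = n_false / n_total    # accuracy of always predicting "no purchase"
imbalance_ratio = n_false / n_true       # negatives per positive

print(f"positive share:    {positive_share:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print(f"imbalance ratio:   {imbalance_ratio:.2f}")
```

The majority baseline is a useful reference point when reading accuracy figures later on, since any accuracy close to it says little about how well purchasers are identified.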

##### Normalisation/Scaling

The dataset is then rescaled using a robust scaler in order to:
1) Put the numeric features on a comparable scale by centring each on its median and dividing by its interquartile range.
2) Prevent outliers from influencing the scaling procedure.
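
A minimal sketch of what the robust scaler does, using a hypothetical single feature with one extreme value (the array is made up for illustration). It compares scikit-learn's result with the manual median and interquartile-range calculation.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual equivalent: subtract the median, divide by the interquartile range.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

In this toy column the extreme value maps to 48.5 while the other values stay between -1 and 0.5, so the outlier remains visible but does not compress the rest of the feature.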

In [189]:
sales_descrip_onehot_norm = sales_descrip_onehot.copy()

sales_descrip_onehot_norm = preprocessing.RobustScaler().fit_transform(sales_descrip_onehot)

In [190]:
sales_descrip_onehot_norm = pd.DataFrame(sales_descrip_onehot_norm,
                                         columns=sales_descrip_onehot.columns.values)

In [191]:
sales_descrip_onehot_norm

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,...,TrafficType_14,TrafficType_15,TrafficType_16,TrafficType_17,TrafficType_18,TrafficType_19,TrafficType_20,VisitorType_New_Visitor,VisitorType_Other,VisitorType_Returning_Visitor
0,-0.25,-0.080424,0.0,0.0,-0.548387,-0.467912,11.710742,4.895621,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.25,-0.080424,0.0,0.0,-0.516129,-0.417913,-0.185128,2.095621,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.25,-0.080424,0.0,0.0,-0.548387,-0.467912,11.710742,4.895621,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.25,-0.080424,0.0,0.0,-0.516129,-0.465829,2.788840,3.215621,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.25,-0.080424,0.0,0.0,-0.258065,0.022315,1.004459,0.695621,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12325,0.50,1.474432,0.0,0.0,1.129032,0.925654,0.239725,0.108478,12.241717,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12326,-0.25,-0.080424,0.0,0.0,-0.419355,-0.104051,-0.185128,-0.107046,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12327,-0.25,-0.080424,0.0,0.0,-0.387097,-0.323969,4.771485,1.722287,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12328,0.75,0.723812,0.0,0.0,-0.096774,-0.197604,-0.185128,-0.114906,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [192]:
df_combo = sales_descrip_onehot_norm
df_combo['Target'] = sales_target_encoded
df_final = df_combo.sample(n=5000, random_state=999)
df_final.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,...,TrafficType_15,TrafficType_16,TrafficType_17,TrafficType_18,TrafficType_19,TrafficType_20,VisitorType_New_Visitor,VisitorType_Other,VisitorType_Returning_Visitor,Target
332,0.0,0.144762,3.0,116.0,0.096774,0.011507,0.310534,-0.043268,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
8274,2.5,2.423162,0.0,0.0,3.225806,2.853293,-0.185128,-0.34991,14.671986,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
10754,0.75,0.486116,0.0,0.0,1.258065,0.908071,0.225075,-0.082195,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4817,-0.25,-0.080424,0.0,0.0,-0.548387,-0.467912,11.710742,4.895621,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
436,-0.25,-0.080424,0.0,0.0,0.096774,-0.063681,0.947812,-0.004379,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


### Data Modelling

In order to fit the model, we first specify the training and testing descriptive and target arrays. Rows are assigned to each set at random. Specifying the stratify argument ensures that the training and testing sets have the same class imbalance in the target feature as the sampled dataset.

[Note: We randomly sample 5000 observations from the dataset. This is to ensure we do not encounter any computational issues while running the models.]

In [193]:
Data = df_final.iloc[:,0:74]
target=df_final.iloc[:,74].astype('category')

D_train, D_test, t_train, t_test = train_test_split(Data, target, test_size = 0.3, 
                                                    stratify=target, shuffle=True, 
                                                    random_state=999)
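
The effect of the `stratify` argument can be checked with a small sketch. The target below is hypothetical (850 negatives and 150 positives) and stands in for `df_final`; the split settings match the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 850 negatives, 150 positives.
y = np.array([0] * 850 + [1] * 150)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, shuffle=True, random_state=999
)

# With stratify=y the positive share is preserved in both parts.
print(y.mean(), y_tr.mean(), y_te.mean())
```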

We can then proceed with specifying the cross validation experiment to be used. We want the sample to be split into 5 equal parts (at random). In each experiment of the cross validation, 1/5th of the class labels will be predicted using 4/5th of the dataset. This will occur until every split has participated in the testing and training procedure. The cross validation will be carried out thrice. Note that we do not run the cross validation experiment outside of the pipeline. We are simply specifying the parameters of the experiment in the immediate code below.

In [194]:
cv_method = RepeatedStratifiedKFold(n_splits=5,
                                    n_repeats=3,
                                    random_state=999)
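
To make the cross-validation scheme concrete, here is a minimal sketch on a hypothetical target of 100 rows with 20 positives. It confirms that 5 splits repeated 3 times give 15 train/validation pairs, and that each validation fold keeps the class ratio.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical imbalanced target: 100 rows with 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=999)
print(cv.get_n_splits(X, y))  # 5 folds x 3 repeats

for train_idx, valid_idx in cv.split(X, y):
    # Each validation fold holds one fifth of the rows and keeps the class ratio.
    assert len(valid_idx) == 20
    assert y[valid_idx].sum() == 4
```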

#### K-Nearest Neighbours 

The first classification model we shall use is the K-Nearest Neighbours Algorithm. Pipelines enable us to specify various parameters over which we would like to construct our model. These parameter specifications are as follows:

* The F-Score method and the Mutual Information method are specified as feature selection methods. 

* The model will utilise several combinations of features (30,50,70 or 74 features at a time), within these methods.

* The number of neighbours used to predict any given observation can be 71, 81, 91 or 101. According to popular literature (Subramanian, 2019), a good approximation of k is the square root of the number of observations in the training set. For 70% of the full dataset this gives approximately 93 [=√(12330*0.7)], which is the benchmark around which the candidate values were chosen. Note that the models below are fitted on 70% of the 5,000-row sample, for which the same rule gives approximately 59 [=√(5000*0.7)]. Furthermore, given that we have a binary classification task at hand, it is advisable to use an odd number of neighbours (Subramanian, 2019), which prevents tied votes between the two classes.

* The appropriate value of 'p' in our distance metric can be either; 1, 2, or 3.

* According to Raschka (2018), the scoring metric we should aim to optimise is the F-1 Score (macro/micro/sample/weighted) or the Receiver Operating Characteristic (ROC-AUC) score, given that we have a class imbalance problem in our target feature (i.e., the number of non-purchasers is more than 5 times the number of purchasers in our dataset).
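
The square-root rule of thumb for k mentioned in the list above can be worked through as follows. This is a sketch only: `to_odd` is a hypothetical helper, and the two training-set sizes are 70% of the full dataset and 70% of the 5,000-row sample used in this notebook.

```python
import math

def to_odd(value: float) -> int:
    """Round to the nearest integer, then step up by one if the result is even."""
    k = round(value)
    return k if k % 2 == 1 else k + 1

# Integer arithmetic avoids floating-point surprises when taking 70% of the rows.
n_train_full = 12330 * 7 // 10
n_train_sample = 5000 * 7 // 10

for label, n in [("full dataset", n_train_full), ("5,000-row sample", n_train_sample)]:
    print(label, n, round(math.sqrt(n), 1), to_odd(math.sqrt(n)))
```

The rule is only a starting point; the grid search still decides which candidate value of k performs best under cross-validation.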

In [195]:
pipe_KNN = Pipeline( [ ('fselector',SelectKBest()),
                     ('knn',KNeighborsClassifier())] )

params_pipe_KNN = {'fselector__score_func':[f_classif,mutual_info_classif],
                  'fselector__k':[30,50,70,74],
                  'knn__n_neighbors':[71,81,91,101],
                  'knn__p':[1,2,3]}

gs_pipe_KNN = GridSearchCV(estimator=pipe_KNN,
                          param_grid = params_pipe_KNN,
                          cv=cv_method,
                          verbose=1,
                          scoring='f1_weighted')

gs_pipe_KNN.fit(D_train,t_train);

Fitting 15 folds for each of 96 candidates, totalling 1440 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1440 out of 1440 | elapsed: 24.1min finished


In [203]:
gs_pipe_KNN.best_estimator_

Pipeline(memory=None,
         steps=[('fselector',
                 SelectKBest(k=30,
                             score_func=<function mutual_info_classif at 0x1a20dc94d0>)),
                ('knn',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=91, p=3,
                                      weights='uniform'))],
         verbose=False)

Hence, from the 96 candidate models evaluated in the grid search (1,440 fits in total, since each candidate is fitted on 15 folds), the model that yields the best weighted F1 score has the following parameter specifications:

* 30 descriptive features are being used to predict the target feature. (mutual information feature selection)
* The optimal value for the number of neighbors is 91
* The selected distance metric is the Minkowski distance metric where p=3.
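
The candidate and fit counts reported in the grid search log can be reproduced with scikit-learn's `ParameterGrid`. In this sketch the two score functions are replaced by placeholder strings so that no further imports are needed; the other values are copied from `params_pipe_KNN`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as params_pipe_KNN, with placeholder strings for the score functions.
grid = {
    "fselector__score_func": ["f_classif", "mutual_info_classif"],
    "fselector__k": [30, 50, 70, 74],
    "knn__n_neighbors": [71, 81, 91, 101],
    "knn__p": [1, 2, 3],
}

n_candidates = len(ParameterGrid(grid))
n_folds = 5 * 3  # n_splits x n_repeats from cv_method
print(n_candidates, n_candidates * n_folds)
```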

In [197]:
gs_pipe_KNN.best_score_

0.8864922822765989

Hence, the best mean cross-validated weighted F1 score is 0.8865 (88.65%). Now that our model has been trained, we can use it to make predictions on the testing data. In particular, the model examines the descriptive features of the testing dataset (D_test), and its predictions are compared against the true target values (t_test).

We then find the various scoring metrics in order to evaluate the accuracy of our predictions:

In [198]:
knn_model=gs_pipe_KNN.best_estimator_

t_pred = knn_model.predict(D_test)

print('the roc auc score is:',metrics.roc_auc_score(t_test, t_pred))
print('the confusion matrix is:',metrics.confusion_matrix(t_test,t_pred))
print('the accuracy score is:',metrics.accuracy_score(t_test,t_pred))
print('the classification report is: ')
print(metrics.classification_report(t_test, t_pred))
print('the F-1 score is:',metrics.f1_score(t_test,t_pred))

the roc auc score is: 0.7582228255194737
the confusion matrix is: [[1184   72]
 [ 104  140]]
the accuracy score is: 0.8826666666666667
the classification report is: 
              precision    recall  f1-score   support

           0       0.92      0.94      0.93      1256
           1       0.66      0.57      0.61       244

    accuracy                           0.88      1500
   macro avg       0.79      0.76      0.77      1500
weighted avg       0.88      0.88      0.88      1500

the F-1 score is: 0.6140350877192982


The accuracy of the model is 88.27%. A non-purchaser is misclassified as a purchaser 72 times, while a purchaser is misclassified as a non-purchaser 104 times. Depending on the profit and loss matrix (to be provided by the client), we can determine the significance of these results.

The classification report indicates strong precision and recall for non-purchasers (0.92 and 0.94), but noticeably weaker values for purchasers (0.66 and 0.57).
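
The reported metrics can be recovered directly from the confusion matrix. The sketch below uses the four counts printed above and the standard definitions. It also shows that the ROC-AUC value printed by the cell equals the mean of recall and specificity, which is what `roc_auc_score` returns when it is given hard 0/1 predictions rather than probabilities.

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predictions.
cm = np.array([[1184, 72],
               [104, 140]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
print(round(balanced_accuracy, 4))
```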

The ROC-AUC metric (0.7582) and the F1-Score for the positive class (0.6140) are, however, less than ideal.

#### Random Forest Classifier

The next classification model we shall use is the Random Forest Classifier. The parameters over which we would like to construct our model are as follows:

* The F-Score method and the Mutual Information method are specified as the appropriate feature selection methods, again. 

* The model will utilise several combinations of features (20,30,50 or 74 features at a time), within these methods.

* The number of estimators (n_estimators), according to Jan (2013), shouldn't be overestimated, in order to prevent overfitting and limit excess strain on computational power. Although the model accuracy may increase as more trees are constructed, the cost associated with computation time and overfitting may far outweigh the benefits. Hence, the random forest model will construct 100, 150 or 200 trees at a time within our grid search.

* For the same reasons stated in the K-Nearest Neighbours parameter specifications, we shall choose to optimise the F-1 Score or the ROC-AUC Score in our grid search.

In [199]:
pipe_RFC = Pipeline( [ ('fselector',SelectKBest()),
                     ('rfc',RandomForestClassifier())] )

params_pipe_RFC = {'fselector__score_func':[f_classif,mutual_info_classif],
                  'fselector__k':[20,30,50,74],
                  'rfc__n_estimators':[100,150,200]}

gs_pipe_RFC = GridSearchCV(estimator=pipe_RFC,
                          param_grid = params_pipe_RFC,
                          cv=cv_method,
                          verbose=1,
                          scoring='f1_weighted')

gs_pipe_RFC.fit(D_train,t_train);

Fitting 15 folds for each of 24 candidates, totalling 360 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:  7.7min finished


In [204]:
gs_pipe_RFC.best_estimator_

Pipeline(memory=None,
         steps=[('fselector',
                 SelectKBest(k=20,
                             score_func=<function f_classif at 0x1a2017ab90>)),
                ('rfc',
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False)

Hence, from the 24 candidate models evaluated in the grid search (360 fits in total, since each candidate is fitted on 15 folds), the model that yields the highest weighted F1 score has the following parameter specifications:

* the forest consists of 100 decision trees
* the min_samples_split is specified as 2
* the criterion for splitting is specified as 'gini'
* the min_samples_leaf is specified as 1
* 20 descriptive features have been utilised in the model 

In [201]:
gs_pipe_RFC.best_score_

0.8928982939240057

Hence, the best mean cross-validated weighted F1 score is 0.8929 (89.29%). Now that our model has been trained, we can use it to make predictions on the testing data. In particular, the model examines the descriptive features of the testing dataset (D_test), and its predictions are compared against the true target values (t_test).

We then find the various scoring metrics in order to evaluate the accuracy of our predictions:

In [202]:
rfc_model=gs_pipe_RFC.best_estimator_

t_pred = rfc_model.predict(D_test)

print('the roc auc score is:',metrics.roc_auc_score(t_test, t_pred))
print('the confusion matrix is:',metrics.confusion_matrix(t_test,t_pred))
print('the accuracy score is:',metrics.accuracy_score(t_test,t_pred))
print('the classification report is: ')
print(metrics.classification_report(t_test, t_pred))
print('the F-1 score is:',metrics.f1_score(t_test,t_pred))

the roc auc score is: 0.7876487939855905
the confusion matrix is: [[1191   65]
 [  91  153]]
the accuracy score is: 0.896
the classification report is: 
              precision    recall  f1-score   support

           0       0.93      0.95      0.94      1256
           1       0.70      0.63      0.66       244

    accuracy                           0.90      1500
   macro avg       0.82      0.79      0.80      1500
weighted avg       0.89      0.90      0.89      1500

the F-1 score is: 0.6623376623376623


The accuracy of the model is 89.6%. A non-purchaser is misclassified as a purchaser 65 times, while a purchaser is misclassified as a non-purchaser 91 times. Depending on the profit and loss matrix (to be provided by the client), we can determine the significance of these results.

The classification report indicates strong precision and recall for non-purchasers (0.93 and 0.95), but noticeably weaker values for purchasers (0.70 and 0.63).

The ROC-AUC metric (0.7876) and the F1-Score for the positive class (0.6623) are, however, less than ideal.
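
The ROC-AUC values in this notebook are computed from hard class predictions (`predict`). A common alternative is to pass predicted probabilities, which lets the metric consider every threshold. The sketch below illustrates the difference on synthetic data; the dataset, split and model settings are assumptions for illustration and are not the notebook's `D_train` and `D_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the notebook's training and testing sets.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85, 0.15],
                           random_state=999)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=999)

model = RandomForestClassifier(n_estimators=100, random_state=999).fit(X_tr, y_tr)

labels = model.predict(X_te)               # hard 0/1 predictions
scores = model.predict_proba(X_te)[:, 1]   # estimated probability of class 1

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_scores = roc_auc_score(y_te, scores)

print("AUC from labels:  ", round(auc_from_labels, 4))
print("AUC from scores:  ", round(auc_from_scores, 4))
print("balanced accuracy:", round(balanced_accuracy_score(y_te, labels), 4))
```

With hard labels the ROC curve has a single operating point, so the resulting value coincides with balanced accuracy. The probability-based value is typically higher and is the figure usually meant by "ROC-AUC".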

### Summary

The performance evaluation metrics indicate that the random forest classifier is more accurate at predicting whether or not a customer is going to make purchases online. The random forest classifier has a higher F1-score and ROC-AUC score than the KNN model, a result that is of greater significance to us given that we have a class imbalance problem. It should also be noted that classification techniques were used to predict whether or not new customers are going to make purchases online, based on the data of previous customers. If the purpose of the analysis is to gain insights into the online behaviour of consumers, we recommend using clustering machine learning algorithms.

### References

Raschka, S. (2018). Machine Learning FAQ: How can the F1-score help with dealing with class imbalance? sebastianraschka.com. https://sebastianraschka.com/faq/docs/computing-the-f1-score.html

Subramanian, D. (2019). A simple introduction to K-Nearest Neighbours algorithm. Towards Data Science. https://towardsdatascience.com/a-simple-introduction-to-k-nearest-neighbors-algorithm-b3519ed98e

Aksakalli, V. (2019). Feature Selection and Ranking in Machine Learning. www.featureranking.com

Sakar, C.O., Polat, S.O., Katircioglu, M. et al. (2018). Real-time prediction of online shoppers' purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Computing and Applications.