# Data Science Challenge - Part-II - Report : by Debisree Ray


##  1. The Data:

The data contains user action logs from a popular online retail website, captured over 14 days from 2016-06-01 to 2016-06-14 (both days inclusive). The columns in the dataset are as follows:

* **userid:** unique identifier of user who visited the website
* **offerid:** unique identifier of the offer shown
* **countrycode:** two-character country code
* **category:** category ID of the offer
* **merchant:** unique identifier of the merchant who has published the offer
* **utcdate:** timestamp of the user action
* **rating:** if the user has clicked the offer or not (1:clicked, 0: not clicked, only viewed)


##  1.a. Questions:

1. Think about a situation where a mobile advertisement company has this historical data. Each impression (placing an advertisement) costs the advertisement company 1 cent, and each click costs the advertisement company 1 USD (1 USD = 100 cents). Each {userid, offerid, merchantid} should have 10 impressions. The merchants (the companies who have contracted with the advertisement company to run the advertisement campaign) have stated that for each impression the ROI (return on investment) for the merchants is 10 cents and for each click the ROI for the merchant is 10 USD. The advertisement company has 10,000 USD to run the advertisement campaign in the next 7 days. Based on the above historical dataset, could you identify the {userid, offerid, merchantid} combination (or combinations) that the advertisement agency should target in this campaign? Please clearly narrate your intuition and process behind choosing the combinations.

2. Develop at least two models which will predict whether the advertisement will be clicked or not. (***rating*** is the dependent variable). Provide detailed reports behind choosing different parameters in building your models by comments in your code. Produce the relevant validation metrics for training and testing the data.
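
The cost and return figures in Question 1 reduce to simple arithmetic per {userid, offerid, merchantid} combination. The following is a minimal sketch, not the method used in this report. It assumes that a clicked impression incurs both the impression cost and the click cost, and that each of the 10 impressions is clicked with the same probability `p_click`; the function names are hypothetical.

```python
# All amounts are in cents. The constants come from the question text;
# the function names and the click-probability model are assumptions.
IMPRESSIONS_PER_COMBO = 10
COST_PER_IMPRESSION = 1      # paid by the advertisement company
COST_PER_CLICK = 100         # 1 USD
ROI_PER_IMPRESSION = 10      # return for the merchant
ROI_PER_CLICK = 1000         # 10 USD
BUDGET = 10_000 * 100        # 10,000 USD


def expected_cost(p_click):
    """Expected spend on one combination (assumes click cost is added to impression cost)."""
    return IMPRESSIONS_PER_COMBO * (COST_PER_IMPRESSION + p_click * COST_PER_CLICK)


def expected_roi(p_click):
    """Expected merchant return on one combination."""
    return IMPRESSIONS_PER_COMBO * (ROI_PER_IMPRESSION + p_click * ROI_PER_CLICK)


def max_combinations(p_click, budget=BUDGET):
    """Number of combinations the budget covers at the expected cost."""
    return int(budget // expected_cost(p_click))


p = 0.045  # overall click rate reported in the EDA, used only as an example
cost_example = expected_cost(p)
roi_example = expected_roi(p)
combos_example = max_combinations(p)
```

Under these assumptions the expected return is ten times the expected cost for every value of `p_click`, so the click probability mainly determines how many combinations the budget can cover.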

##  2. Data wrangling:

* To start, we first import all the necessary modules and libraries.
* The important libraries include:
   * **NumPy:** Provides a fast numerical array structure and helper functions.
   * **pandas:** Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
   * **scikit-learn:** The essential Machine Learning package in Python.
   * **matplotlib:** Basic plotting library in Python; most other Python plotting libraries are built on top of it.
   * **Seaborn:** Advanced statistical plotting library.
   
* Read the train and test data sets into **pandas DataFrames**.

* There are 7 columns and 15844717 rows in the training data.

* There are 7 columns and 1919561 rows in the test data. The first few lines of the train and the test data sets look as follows:

<img src="train.png" align="center" width="90%"/>
<img src="test.png" align="center" width="90%"/>

* The columns are: **'userid', 'offerid', 'countrycode', 'category', 'merchant', 'utcdate', 'rating'**.
* There are no missing values in any of the columns.
* The column **'countrycode'** equals **'de'** for the entire dataset, so we can drop it: a constant column carries no information for the modeling.
* The target variable is **'rating'**, which takes the value 1 or 0 (depending on whether the offer has been clicked or not).
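
The wrangling checks above can be sketched with pandas. This is an illustration on a tiny made-up frame rather than the real logs; the helper name `drop_constant_columns` and the timestamp format are assumptions.

```python
import pandas as pd


def drop_constant_columns(df):
    """Return a copy of df without columns that hold a single unique value."""
    constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    return df.drop(columns=constant)


# Tiny stand-in for the real logs; all values are made up for illustration.
sample = pd.DataFrame({
    "userid": ["u1", "u2", "u1"],
    "offerid": ["o1", "o2", "o3"],
    "countrycode": ["de", "de", "de"],
    "category": [10, 11, 10],
    "merchant": ["m1", "m2", "m1"],
    "utcdate": ["2016-06-01 10:00:00", "2016-06-02 17:30:00", "2016-06-14 08:15:00"],
    "rating": [0, 1, 0],
})

missing_per_column = sample.isnull().sum()   # number of missing values per column
cleaned = drop_constant_columns(sample)      # removes 'countrycode' here
```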


##  3. Exploratory Data Analysis (EDA):

* To start the EDA, each feature has been studied and plotted against the target variable, so as to infer any relationship between them.

###  3.1. userid:

* This is the unique identifier of the user who visited the website.
* There are 291485 users in total.
* Some users are frequent users (visited the website at least twice).
* Some users are one-timers (did not return to the website).
* 48.9% of the total users are non-returning users who clicked.
* 0.5% of the total users are non-returning users who did not click.
* 28.9% of the total users are returning users who did not click.
* 21.7% of the total users are returning users who clicked.

<img src="1.png" align="left" width="50%"/>
<img src="2.png" align="right" width="50%"/>

### 3.2. offerid:

* This is the unique identifier of the offer shown.
* There are 2158859 offer-IDs listed.
* Here we have shown the distribution of the offer IDs for both clicked and not clicked.

<img src="3.png" align="center" width="50%"/>


###  3.3. category :

* These are the category IDs of the offers.
* There are 271 unique offer categories.
* Maximum frequency for an offer category = 934537
* Minimum frequency for an offer category = 15
* Here we have shown the distribution of the categories for both clicked and not clicked.

<img src="4.png" align="center" width="50%"/>


###  3.4. merchantid:

* This is the unique identifier of the merchant who has published the offer.
* There are 703 different merchants.
* Here we have shown the distribution of the merchant IDs for both clicked and not clicked.

<img src="5.png" align="center" width="50%"/>

###  3.5. utcdate:

* This is the timestamp of the user action.
* Here we have split the timestamp into finer components.
* The additional columns created are:
   * **'dayofweek'**: We have plotted the user activity over the days of the week, with the clicked ones shown on the same graph. The maximum activity is on Fridays. However, 30.5% of the clicks occur on Tuesdays.
   * **'date'**: We have plotted the activity over the dates, with the clicked ones shown on the same graph. The maximum activity is on the 12th, whereas the maximum number of clicks is on the 14th.
   * **'hour'**: We have plotted the user activity over the hours of the day, with the click and non-click rates shown in two separate graphs. The click rate is highest in the 17th hour.
 
* Here we have shown the distribution of the dayofweek, date and hour for both clicked and not clicked.

<img src="6.png" align="left" width="50%"/> <img src="7.png" align="right" width="50%"/>
<img src="8.png" align="left" width="50%"/> <img src="9.png" align="right" width="50%"/>
          <img src="10.png" align="center" width="50%"/> 
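
Splitting the timestamp into the three columns above can be sketched as follows. This is a hedged example: it assumes 'utcdate' parses with `pd.to_datetime` and that 'date' means the day of the month; the helper name `add_time_features` is hypothetical.

```python
import pandas as pd


def add_time_features(df, timestamp_col="utcdate"):
    """Return a copy of df with 'dayofweek', 'date' and 'hour' columns added."""
    out = df.copy()
    ts = pd.to_datetime(out[timestamp_col])
    out["dayofweek"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["date"] = ts.dt.day             # day of the month (assumption)
    out["hour"] = ts.dt.hour            # 0 ... 23
    return out


# Two made-up timestamps inside the 14-day window of the data.
logs = pd.DataFrame({"utcdate": ["2016-06-01 10:00:00", "2016-06-14 17:30:00"]})
features = add_time_features(logs)
```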

### 3.6. rating:

* This column tells whether the user has clicked the offer or not (1: clicked, 0: not clicked, only viewed).
* This is the target variable.
* We see that there is a **major class imbalance** in the data: very few click-throughs (<5%) compared to the large number of non-click-throughs.
* In 95.5% of the cases, the click-through = 0 (i.e. not clicked).
* In only 4.5% of the cases, the click-through = 1 (clicked).
* Here we have shown the distribution of the rating for both clicked and not clicked.

<img src="11.png" align="left" width="50%"/> <img src="12.png" align="right" width="50%"/>


##  4. Some additional features through feature engineering:

* The three unique identifiers (user ID, offer ID and merchant ID) cannot be used directly as features in the ML models.
* So, I am creating additional columns that hold the total number of clicks (rating = 1) for each user ID, offer ID and merchant ID.
* So, for both train and test set, the columns we would consider for the Machine Learning are as follows:
**'userid', 'offerid', 'category', 'merchant','dayofweek', 'date', 'hour', 'user_rating_sum', 'mer_rating_sum',
 'off_rating_sum'**.
 
* The target variable is **'rating'**.
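
One way to build the three click-sum columns is a group-by on the training data, mapped onto both sets. The report does not state exactly how the sums were computed, so the sketch below is an assumption: sums are taken from the training set only, and identifiers unseen in training get 0. The helper name `add_click_sums` is hypothetical.

```python
import pandas as pd


def add_click_sums(train, test):
    """Add per-user, per-merchant and per-offer click totals computed on train."""
    train, test = train.copy(), test.copy()
    for key, new_col in [("userid", "user_rating_sum"),
                         ("merchant", "mer_rating_sum"),
                         ("offerid", "off_rating_sum")]:
        click_sums = train.groupby(key)["rating"].sum()
        train[new_col] = train[key].map(click_sums)
        # Assumption: identifiers never seen in training get a click total of 0.
        test[new_col] = test[key].map(click_sums).fillna(0).astype(int)
    return train, test


# Made-up rows for illustration only.
train_df = pd.DataFrame({
    "userid": ["u1", "u1", "u2"],
    "offerid": ["o1", "o2", "o1"],
    "merchant": ["m1", "m1", "m2"],
    "rating": [1, 1, 0],
})
test_df = pd.DataFrame({
    "userid": ["u1", "u3"],
    "offerid": ["o2", "o9"],
    "merchant": ["m2", "m1"],
})
train_fe, test_fe = add_click_sums(train_df, test_df)
```

In this sketch each training row's own rating is part of its sums, which can inflate training scores; excluding the current row is one possible refinement.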

## 5. Prepare the data for applying ML (Classification Algorithms):

As discussed earlier, the dataset has a huge class imbalance. There are different ways to tackle class imbalance:

* **Resampling: oversample minority class:**  Good when you don't have a ton of data
* **Resampling: undersample majority class:** Good when you have huge data
* **Generate synthetic samples:** SMOTE (Synthetic minority oversampling technique)
* **Class_weight:** This is one of the simplest ways to address the problem. The idea is to assign each class a weight that places more emphasis on the minority class, so that the resulting classifier can learn equally from both classes.

Ref: https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/
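
To make the class-weight idea concrete, the sketch below reproduces the formula behind scikit-learn's `class_weight="balanced"` option, `n_samples / (n_classes * count_of_class)`, in plain Python. The label list is made up to mirror the 95.5% / 4.5% split seen in the EDA; the helper name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up labels with the same class ratio as the 'rating' column.
labels = [0] * 955 + [1] * 45
weights = balanced_class_weights(labels)
# The rare class gets a much larger weight, so both classes contribute
# the same total weight to the loss.
```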

## 6. Modeling:

As this is a classification problem, we will apply i) **Random Forest**, ii) **Gradient Boost** and iii) **Logistic Regression** and compare their performances.

To experiment with the models and find the best parameters, I will work with a random subset of the data. The original datasets are too large: given my very limited computation power, the computation is too time consuming, or not possible at all.

The subset is intended to be representative of the original dataset.


### 6.1. Low volume dataset(s) : Random subsets:

* Here we have taken a random (low volume) subset of the training and test datasets.
* The low volume train data has 79224 rows and 12 columns.
* The low volume test data has 9598 rows and 12 columns.
* The new, low volume data is representative of the main dataset.
* I have performed the same EDA on the 'rating' feature again on this smaller dataset, and found that the two plots (one from the full data, one from the small representative subset) look the same.

<img src="13.png" align="left" width="50%"/> <img src="14.png" align="right" width="50%"/>
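
The row counts above are consistent with a 0.5% random sample of each dataset. A minimal sketch of such a draw with pandas, together with a check that the click rate is preserved, might look like this; the helper name, the seed and the demo fraction are assumptions.

```python
import pandas as pd


def random_subset(df, frac=0.005, seed=42):
    """Draw a reproducible random sample of the rows."""
    return df.sample(frac=frac, random_state=seed).reset_index(drop=True)


# Made-up frame with a 4.5% click rate; a larger fraction is used here
# only because the demo frame is small.
full = pd.DataFrame({"rating": [1] * 450 + [0] * 9550})
small = random_subset(full, frac=0.1)

full_click_rate = full["rating"].mean()
small_click_rate = small["rating"].mean()
```

Comparing `full_click_rate` with `small_click_rate` is a quick numeric counterpart to comparing the two plots.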

### 6.2. Applying ML algorithms and comparing their performances:

This is a classification problem, in supervised learning. Here we have used the following classification models:

* **Logistic Regression**
* **Random Forest**
* **Gradient Boost**

Evaluating the performance of a model by training and testing on the same dataset can lead to overfitting. Hence the model evaluation is based on splitting the dataset into a train and a validation set. However, the measured performance then depends on the random choice of the (train, validation) pair. To overcome this, the Cross-Validation procedure is used: under the k-fold CV approach, the training set is split into k smaller sets, a model is trained using k-1 of the folds as training data, and the model is validated on the remaining fold.
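
The k-fold splitting itself can be illustrated in plain Python. This is a sketch of the idea only; in practice scikit-learn's cross-validation utilities would be used, and the function name here is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        val_idx = folds[i]
        # Train on the other k-1 folds.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


splits = list(k_fold_indices(10, k=5))
```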

* **Classification/Confusion Matrix:** This matrix summarizes the correct and incorrect classifications that a classifier produced for a certain dataset. Rows and columns of the classification matrix correspond to the true and predicted classes respectively. The two diagonal cells (upper left, lower right) give the number of correct classifications, where the predicted class coincides with the actual class of the observation. The off-diagonal cells give the counts of the misclassifications. The classification matrix gives estimates of the true classification and misclassification rates.
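
A small worked example of the matrix layout described above, with rows as true classes and columns as predicted classes. The labels are made up and the helper names are hypothetical.

```python
def confusion_matrix_binary(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are true classes, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1
    return matrix


def precision_recall(matrix):
    """Precision and recall of the positive (clicked) class."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 1 = clicked, 0 = not clicked.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 0, 0, 1]
cm = confusion_matrix_binary(y_true, y_pred)
precision, recall = precision_recall(cm)
```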

We applied different ML models above and evaluated their performances in terms of ROC-AUC score for both the training and test data. Here we have tabulated the scores and plotted them.

<img src="score.png" align="center" width="50%"/> 

<img src="17.png" align="left" width="50%"/> <img src="18.png" align="right" width="50%"/> 

### 6.3. Hyperparameter Tuning and the final two models:

In Machine Learning, hyperparameter tuning means choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is set before the learning process begins. This is significant as the performance of the entire model depends on the hyperparameter values specified. Some examples of hyperparameters include the penalty in logistic regression and the loss in stochastic gradient descent.

Two common methods for optimizing hyperparameters are: 1) Grid Search 2) Random Search

**Grid search** is a traditional way to perform hyperparameter optimization. It works by searching exhaustively through a specified subset of hyperparameters. Using sklearn’s GridSearchCV, we first define our grid of parameters to search over and then run the grid search. Here I ran the GridSearchCV for both the **Random Forest** and **Gradient Boost**.
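
A minimal `GridSearchCV` sketch for a Random Forest is given here for illustration. The synthetic data, the parameter grid and the number of folds are all assumptions chosen to keep the example fast; they are not the values used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the low-volume training data.
X, y = make_classification(n_samples=600, n_features=7,
                           weights=[0.9, 0.1], random_state=0)

# Hypothetical grid: every combination is fitted and scored with cross-validation.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # same metric as used for model comparison
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_  # best combination found on the grid
best_score = search.best_score_    # mean cross-validated ROC-AUC of that combination
```

The cost grows with the product of the grid sizes and the number of folds, which is why this step tends to dominate the computation time.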

* After fitting the models with the tuned parameters, we improved our cross-validation scores.

* The ROC-AUC score for the final RF (tuned) model is 0.9008744808634841 and for the GB (tuned) model is 0.9181525410415521.

* A feature importance analysis reveals that the engineered features are the most important ones.

<img src="15.png" align="left" width="50%"/> <img src="16.png" align="right" width="50%"/> 

* Based on the low-volume training and test data, the final prediction tables have been saved as **'final_result_rf.csv'** and **'final_result_GB.csv'**. Each file has four columns: User-ID, Offer-ID, Merchant-ID, and the Rating.

* The ROC curves are as follows

<img src="19.png" align="left" width="50%"/> <img src="20.png" align="right" width="50%"/> 
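
The area under a ROC curve equals the probability that a randomly chosen clicked example receives a higher score than a randomly chosen non-clicked one. The sketch below computes it directly from that definition on made-up scores. It is quadratic in the number of samples and meant only as an illustration; the function name is hypothetical.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted click probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
```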

##  7. Conclusions:


* The original dataset is enormous (15844717 rows in the train set). Given my limited computational facility and time, it is almost impossible to deal with data of this size. So, I have decided to take a random subset of it. This low-volume data is easier to handle and intended to be representative of the population.

* The 'country-code' feature has nothing to do with the analysis as the entire data belongs to only one country. So, I dropped the column.

* The original data, and hence the low-volume subset, suffers from a significant class-imbalance problem. Though there are many ways to deal with it, I have decided to use the 'class_weight' parameter in sklearn for RF and LR. The GB already takes care of the class imbalance problem.

* To predict the ratings (probable clicks by the user), I have considered a set of 7 features, either taken directly from the dataset or engineered/derived from the data. Interestingly, the engineered features are the most important ones in terms of relative importance.

* This is a **Classification** problem. Here we have used the following classification models:
  * Logistic Regression
  * Random Forest
  * Gradient Boost

* Evaluating the performance of a model by training and testing on the same dataset can lead to overfitting. Hence the model evaluation is based on splitting the dataset into a train and a validation set. But the measured performance then depends on the random choice of the (train, validation) pair. To overcome that, the **Cross-Validation** procedure is used: under the k-fold CV approach, the training set is split into k smaller sets, a model is trained using k-1 of the folds as training data, and the model is validated on the remaining fold.

* We have evaluated each model in terms of accuracy, precision, recall, f1, and the 'ROC-AUC' score for both the training and test data, and plotted them. The two best performing models are the Random Forest and the Gradient Boost. Both are ensemble models based on decision trees.

* Next, we have carried out the grid search CV for the hyperparameter tuning for both the models separately. This step was the most time consuming one in terms of computation. With the result of the optimized hyperparameters, we have again fitted the two models and got the predictions separately.

* We have evaluated the ROC-AUC scores with the optimized hyperparameters. The model performance improved with the optimized parameters. The final ROC-AUC scores for RF and GB are 0.901 and 0.918, respectively.

* The final prediction tables (four columns: **User-ID**, **Offer-ID**, **Merchant-ID** and **Rating**) are saved as csv files.

## 8. Future Direction:

There is enough room to improve the model.

* The first target would be to tackle the full data volume. Given some resources (a cloud computing platform, like AWS), the modeling should be done on the full train set.

* To tackle the class imbalance, other methods (including generating synthetic samples using SMOTE) need to be tried and tested against each other.

* Here we have used only the data of 14 days. The model can be improved if we can use more data.

* Use ensembles of the machine learning models to average out bias and improve performance.

* I wish there were more features in the data, such as the age/sex of the users, location information, and some login information. These would have helped the modeling.

* Try to fit and predict using the Extreme Gradient Boosting (XGBoost) classifier model.