# Data Science Challenge - Part-I
## Notebook-I by Debisree Ray


### 1. The Data:

The data contains the user action logs from a popular online retail website, captured over 14 days from 2016-06-01 to 2016-06-14 (both days inclusive). The columns in the dataset are as follows:

* **userid:** unique identifier of user who visited the website
* **offerid:** unique identifier of the offer shown
* **countrycode:** two-character country code
* **category:** category ID of the offer
* **merchant:** unique identifier of the merchant who has published the offer
* **utcdate:** timestamp of the user action
* **rating:** if the user has clicked the offer or not (1:clicked, 0: not clicked, only viewed)


### 1.a Question:

1. Think about a situation where a mobile advertisement company has this historical data. Each impression (placing an advertisement) costs the advertisement company 1 cent, and each click costs the advertisement company 1 USD (1 USD = 100 cents). Each {userid, offerid, merchantid} should have 10 impressions. The merchants (the companies who have contracted with the advertisement company to run the advertisement campaign) have stated that the ROI (return on investment) for the merchants is 10 cents per impression and 10 USD per click. The advertisement company has 10,000 USD to run the advertisement campaign in the next 7 days. Based on the above historical dataset, could you identify the {userid, offerid, merchantid} combination (or combinations) that the advertisement agency should target in this campaign? Please clearly narrate your intuition and process behind choosing the combinations.

2. Develop at least two models which will predict whether the advertisement will be clicked or not. (***rating*** is the dependent variable). Provide detailed reports behind choosing different parameters in building your models by comments in your code. Produce the relevant validation metrics for training and testing the data.

In [2]:
import os
import math 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#ignore warning messages to ensure clean outputs
import warnings
warnings.filterwarnings('ignore')

In [70]:
#Read the Train set:
df_train = pd.read_csv('train_de.csv', sep='\t', encoding="latin1")
#df_train.head()

In [5]:
df_train['countrycode'].value_counts()

de    15844717
Name: countrycode, dtype: int64

* Every row has **countrycode = 'de'**, so this column carries no information and can be dropped.

In [71]:
df_train=df_train.drop(['countrycode'], axis=1)
#df_train.head()
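The same check can be applied to every column at once instead of inspecting `value_counts()` one column at a time. The following is a minimal sketch on a made-up toy frame; the helper name `constant_columns` and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    # dropna=False counts NaN as its own value, so a column that mixes one
    # value with missing entries is not reported as constant.
    n_distinct = df.nunique(dropna=False)
    return n_distinct[n_distinct <= 1].index.tolist()

# Toy frame standing in for df_train (values are invented for illustration).
toy = pd.DataFrame({
    "userid": ["u1", "u2", "u3"],
    "countrycode": ["de", "de", "de"],
    "rating": [0, 1, 0],
})

const_cols = constant_columns(toy)        # columns with no variation
toy_clean = toy.drop(columns=const_cols)  # same effect as the drop above
```

On the real data, `constant_columns(df_train)` would be expected to include `countrycode`, given the `value_counts()` output above.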

In [33]:
# Split the training dataframe into two groups: not clicked (0) and clicked (1).

df_zero = df_train[df_train['rating'] == 0]
df_one = df_train[df_train['rating'] > 0]

In [37]:
# Unique (userid, offerid, merchant) combinations among the clicked rows,
# with a 'count' column holding the number of clicks per combination:

df_one_uniq = df_one.groupby(['userid', 'offerid', 'merchant']).size().reset_index(name='count')
df_one_uniq.shape

(434373, 4)
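To make the meaning of the 'count' column concrete, here is a small sketch of the same grouping on an invented four-row frame. The IDs `u1`, `o1`, `m1` and so on are placeholders, not values from the dataset.

```python
import pandas as pd

# Four clicked rows: user u1 clicked offer o1 twice, and two other
# combinations were clicked once each.
toy_clicks = pd.DataFrame({
    "userid":   ["u1", "u1", "u1", "u2"],
    "offerid":  ["o1", "o1", "o2", "o1"],
    "merchant": ["m1", "m1", "m1", "m1"],
    "rating":   [1, 1, 1, 1],
})

# One row per unique combination; 'count' is the number of clicked rows.
toy_uniq = (
    toy_clicks.groupby(["userid", "offerid", "merchant"])
    .size()
    .reset_index(name="count")
)
```

The result has three rows (one per combination) and four columns, which mirrors the `(434373, 4)` shape reported for the full data.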

In [65]:
# Distribution of 'count': index = clicks per combination,
# value = number of combinations with that many clicks.

uniq_count=df_one_uniq['count'].value_counts()

In [68]:
# Smallest and largest number of clicks recorded for a single combination:

print(uniq_count.index.min())
print(uniq_count.index.max())

1
670


In [69]:
# Smallest and largest number of combinations that share the same click count:

print(uniq_count.values.min())
print(uniq_count.values.max())

1
393580


In [62]:
# Keep only the combinations that clicked more than 48 times:

df_one_target = df_one_uniq[df_one_uniq['count'] > 48]
df_one_target.shape



(994, 4)
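A cutoff like the one above can also be computed directly rather than tuned by hand. The sketch below returns the smallest cutoff `c` for which at most `max_combos` combinations satisfy `count > c`. The function name `smallest_cutoff` and the toy counts are assumptions for illustration; this is one possible approach, not the method used in the notebook.

```python
import pandas as pd

def smallest_cutoff(counts: pd.Series, max_combos: int) -> int:
    """Smallest c such that (counts > c).sum() <= max_combos."""
    # Sort click counts from largest to smallest.
    ordered = counts.sort_values(ascending=False).to_numpy()
    if len(ordered) <= max_combos:
        return 0  # every combination with at least one click already fits
    # The (max_combos + 1)-th largest count must be excluded, so the cutoff
    # has to be at least that value; ties with it are excluded as well.
    return int(ordered[max_combos])

# Toy click counts (invented for illustration).
toy_counts = pd.Series([50, 49, 49, 30, 10, 1])

cut3 = smallest_cutoff(toy_counts, max_combos=3)
kept3 = int((toy_counts > cut3).sum())

cut2 = smallest_cutoff(toy_counts, max_combos=2)
kept2 = int((toy_counts > cut2).sum())
```

In the notebook this would be called as `smallest_cutoff(df_one_uniq['count'], 1000)`. Because tied counts are dropped together, the number of combinations kept can be smaller than `max_combos`, as `kept2` shows.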

In [75]:
# The 'count' column is no longer needed.
# The 'target' dataframe holds the selected combinations of user ID, offer ID
# and merchant ID that the company should target:

target = df_one_target.drop(['count'], axis=1)
target.head()

| | userid | offerid | merchant |
|---|---|---|---|
| 184 | 001fecc308b147cbd9837051c62f035fd75ab42b3ef19c... | 0f2fcf95319f5c1e5745371351f521e5 | a7b2f269064dbe77eb21b5a8b0f067d3f297a26aa185d3... |
| 187 | 001fecc308b147cbd9837051c62f035fd75ab42b3ef19c... | deafd09a713ed5c1be69d354bf0d7f5d | 66863da8db7e6c51bed5eccc89a91f756e4baee85ad446... |
| 188 | 001fecc308b147cbd9837051c62f035fd75ab42b3ef19c... | eb0389774fca117ee06c5c02a6ba76af | 66863da8db7e6c51bed5eccc89a91f756e4baee85ad446... |
| 189 | 001fecc308b147cbd9837051c62f035fd75ab42b3ef19c... | ebb77a97cfdfd01c8b2f5cbffb1d5627 | ac26975cf46eae9898b7d906bdfbbf99ce7813ffc3f9b7... |
| 190 | 001fecc308b147cbd9837051c62f035fd75ab42b3ef19c... | f2206f242381e739775e6f60740842e9 | ac26975cf46eae9898b7d906bdfbbf99ce7813ffc3f9b7... |


### Solution to Question-1


My assumptions for targeting the unique users (combined with offer and merchant) are as follows:

* **Those who have clicked in the past are more likely to click in the future.**
* **Those who have clicked more often in the past are the most likely to click again in the future.**

Here is what I have done:

1. A click produces more ROI than a mere view of the advertisement, so the merchants will clearly look for clicks. My analysis therefore focuses on rating = 1.

2. From the entire training data, I kept only the rows with rating = 1, because I want to target only those combinations that have produced clicks before.

3. Next, to find the unique combinations of user ID, offer ID, and merchant ID, I grouped by these three columns and counted the rows of each combination.

4. The 'count' column therefore gives the number of times each combination clicked the advertisement.

5. Ideally, the company would target the entire population of combinations that have clicked (434,373 of them). However, there is a budget constraint.

6. So we need to narrow the list down to the most probable combinations: those that have clicked the most often.

7. The problem states that each combination receives 10 impressions, and each click costs 1 USD.

8. The budget is 10,000 USD.

9. So I need to select roughly 1,000 combinations: each one receives 10 advertisements, and each advertisement costs 1 USD if it is clicked, so 1,000 × 10 × 1 USD = 10,000 USD.

10. By trial and error with the 'count' value, I found that filtering with count > 48 (that is, combinations that clicked more than 48 times) gives 994 unique combinations.

11. Cost check:
    * 994 combinations × 10 advertisements each = 9,940 impressions.
    * In the worst case every impression is clicked and costs 1 USD, treating the 1 USD click cost as the full cost of a clicked impression. An impression that is not clicked costs only 1 cent.
    * So the total cost is at most 9,940 USD, which is within the budget (10,000 USD).

12. So the 'target' data frame is the final result. It consists of three columns (user ID, offer ID, and merchant ID), and each row is one unique combination to be targeted in the campaign.
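The budget arithmetic above can be sketched in a few lines. The constants come from the problem statement; the function name and the `click_replaces_impression_fee` flag are assumptions added for illustration. The flag exists because the problem statement does not say whether a clicked impression is billed 1 USD in total (the reading used in this notebook) or 1 USD on top of the 1-cent impression fee. Amounts are kept in integer cents to avoid floating-point rounding.

```python
IMPRESSION_COST_CENTS = 1        # 1 cent per impression (problem statement)
CLICK_COST_CENTS = 100           # 1 USD per click (problem statement)
IMPRESSIONS_PER_COMBO = 10       # impressions per {userid, offerid, merchantid}
BUDGET_CENTS = 10_000 * 100      # 10,000 USD

def worst_case_cost_cents(n_combos, click_replaces_impression_fee=True):
    """Campaign cost in cents if every single impression is clicked."""
    impressions = n_combos * IMPRESSIONS_PER_COMBO
    click_cost = impressions * CLICK_COST_CENTS
    if click_replaces_impression_fee:
        # Reading used in the notebook: a clicked impression costs 1 USD.
        return click_cost
    # Alternative reading: the 1-cent impression fee is charged as well.
    return click_cost + impressions * IMPRESSION_COST_CENTS

cost_notebook = worst_case_cost_cents(994)
cost_strict = worst_case_cost_cents(994, click_replaces_impression_fee=False)

within_budget_notebook = cost_notebook <= BUDGET_CENTS
within_budget_strict = cost_strict <= BUDGET_CENTS
```

Under the notebook's reading, the worst case for 994 combinations is 994,000 cents (9,940 USD), which fits the budget. Under the stricter reading it is 1,003,940 cents (10,039.40 USD), so a campaign planned on that assumption would have to select slightly fewer combinations, or accept that the budget only holds if not every impression is clicked.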
