## Outbrain Click Prediction

## I. Data

In [9]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import gc
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

df_train = pd.read_csv('/Users/aiswaryasankar/Desktop/Kaggle/kaggle_winter/outbrain/clicks_train.csv')
df_test = pd.read_csv('/Users/aiswaryasankar/Desktop/Kaggle/kaggle_winter/outbrain/clicks_test.csv')

# Any results you write to the current directory are saved as output.

Since we have so many files, which should we start off with?  The test.csv and train.csv contain the main data that we will be using to train and test our model so we'll look at that first.

In [30]:
print('Train columns\n')
print(df_train.columns)

print('Test columns\n')
print(df_test.columns)

Train columns

Index(['display_id', 'ad_id', 'clicked'], dtype='object')
Test columns

Index(['ad_id', 'count_x', 'sum_x', 'mean_x', 'count_y', 'sum_y', 'mean_y',
       'display_id', 'count', 'sum', 'mean'],
      dtype='object')


In [31]:
print('Train head')
print(df_train.head)

print('Test head')
print(df_test.head)

Train head
<bound method DataFrame.head of           display_id   ad_id  clicked
0                  1   42337        0
1                  1  139684        0
2                  1  144739        1
3                  1  156824        0
4                  1  279295        0
5                  1  296965        0
6                  2  125211        0
7                  2  156535        0
8                  2  169564        0
9                  2  308455        1
10                 3   71547        0
11                 3   95814        0
12                 3  152141        0
13                 3  183846        0
14                 3  228657        1
15                 3  250082        0
16                 4  149930        0
17                 4  153623        1
18                 4  184709        0
19                 4  186849        0
20                 4  233730        0
21                 5   54606        0
22                 5  162868        0
23                 5  184548        0
24     

So the only data we are given in the train file is the display id with all the ads that are a part of that display and which of those ads in the display was actually clicked.

Our test set has a set of displays with their corresponding ads and we have to rank the ads in order of chance of being clicked.

## II. Brute Force Solution

Now let's try to do a brute force solution! 
Our test data file has a list of display_ids with all the ad_ids that are in each display.
We need to rank the ads from most likely to be clicked to the least likely to be clicked.

What we can do first is to ignore the fact that the ad is in a specific display in the first place.  
Instead let's just see which ads are clicked the most overall.  If an add has a high probability of being clicked when displayed then it will likely be clicked again when displayed on a new page.  
Let's thus start by figuring out the click percentage of each ad.

We will store this information in a dictionary.  Then when we need to make predictions we will sort according to these percentages.  If not in the dictionary we will add to the end randomly.

## A. Calculate the click percent for each ad

In [32]:
# First groupby ad_id in the train data and calculate the mean.
percentages = df_train.groupby(['ad_id'])['clicked'].agg(['count', 'sum', 'mean']).reset_index()
print(percentages)

         ad_id  count  sum      mean
0            1      2    0  0.000000
1            2     22    1  0.045455
2            3    161   11  0.068323
3            4     32    4  0.125000
4            5      1    0  0.000000
5            6      1    0  0.000000
6            7   3406  985  0.289196
7            8     28    4  0.142857
8            9      3    0  0.000000
9           10     81    9  0.111111
10          11     11    0  0.000000
11          12     12    2  0.166667
12          13      8    1  0.125000
13          14     10    1  0.100000
14          15      2    0  0.000000
15          16      3    1  0.333333
16          17     20    2  0.100000
17          18      9    0  0.000000
18          20     14    0  0.000000
19          21     32    0  0.000000
20          22      4    0  0.000000
21          23     11    2  0.181818
22          24     13    0  0.000000
23          25      1    0  0.000000
24          26     22    1  0.045455
25          27      4    0  0.000000
2

In [33]:
# This gives us the probability that a given ad was clicked. 
# Lets us check whether or not all the ads in the test set are included in
# the training set.

no_diff = sum([1 for idVal in df_test['ad_id'] if idVal not in df_train['ad_id']])
print(no_diff)

0


## B. Merge ad click percents with test data

Now we have the percent clicked for each Id. 
How can we use this to rank the ads for each display in our test data?

We will first merge the percent clicked column with the test data. 
We will merge on id_name.  
Let us print that out and see what it looks like.

In [34]:
df_test = df_test.merge(percentages, how='left')
print(df_test)

# Thus we see that we have the test data merged with the mean probabilities 
# for each ad_id.

           ad_id  count_x  sum_x    mean_x  count_y  sum_y    mean_y  \
0         145293    25321   4516  0.178350    25321   4516  0.178350   
1         201649    51067   4074  0.079778    51067   4074  0.079778   
2         292363     7602    589  0.077480     7602    589  0.077480   
3          57097    34294   1261  0.036770    34294   1261  0.036770   
4          89725    19390   3991  0.205828    19390   3991  0.205828   
5          86281    11425   1852  0.162101    11425   1852  0.162101   
6         254698      208     21  0.100962      208     21  0.100962   
7         187797   108989   9346  0.085752   108989   9346  0.085752   
8          83252    18266   5760  0.315340    18266   5760  0.315340   
9         162994    11430   2585  0.226159    11430   2585  0.226159   
10        489589      642     99  0.154206      642     99  0.154206   
11        173403   128030  16931  0.132242   128030  16931  0.132242   
12        208244    27184   7630  0.280680    27184   7630  0.28

Now let's group by display_ID and get a list of all the IDS that correspond to that display.
Then we need to extract the corresponding percentages for each ID.
Lastly we need to sort and output to a file.

In [35]:
print(df_test['mean'].fillna(M, inplace=True))

None


## C. Sort by Display_id and Mean

In [29]:
# Now we want to sort the data by display_id and the mean. This will 
# give us each display with the ad_id's sorted from the ones that occur
# with the highest probability to the lowest probability.

df_test.sort_values(['display_id', 'mean'], inplace=True, ascending=False)
print(df_test)

           ad_id  count_x  sum_x    mean_x  count_y  sum_y    mean_y  \
12083645  145293    25321   4516  0.178350    25321   4516  0.178350   
19057859  201649    51067   4074  0.079778    51067   4074  0.079778   
24552163  292363     7602    589  0.077480     7602    589  0.077480   
4106239    57097    34294   1261  0.036770    34294   1261  0.036770   
6836909    89725    19390   3991  0.205828    19390   3991  0.205828   
6507003    86281    11425   1852  0.162101    11425   1852  0.162101   
22454656  254698      208     21  0.100962      208     21  0.100962   
17564633  187797   108989   9346  0.085752   108989   9346  0.085752   
6224066    83252    18266   5760  0.315340    18266   5760  0.315340   
14504252  162994    11430   2585  0.226159    11430   2585  0.226159   
29379284  489589      642     99  0.154206      642     99  0.154206   
15807826  173403   128030  16931  0.132242   128030  16931  0.132242   
19628242  208244    27184   7630  0.280680    27184   7630  0.28

## D. Generate Submission

In [36]:
# Lastly we need to convert this into a list for each display_id.
# We can do this by using a join operation.

# We first group by display.
submission = df_test.groupby('display_id').ad_id.apply(lambda x: ' '.join(map(str, x))).reset_index()
print(submission)

         display_id                                              ad_id
0          16874594           170392 172888 162754 150083 66758 180797
1          16874595                                  8846 143982 30609
2          16874596  289915 11430 289122 132820 57197 153260 173005...
3          16874597  305790 285834 143981 182039 155945 180965 3088...
4          16874598                         145937 335632 67292 250082
5          16874599           173130 91681 213116 210516 296295 163776
6          16874600               30682 2150 57591 70529 133050 114836
7          16874601            190713 92003 129490 14082 140942 118470
8          16874602                 154918 281563 269017 131316 268548
9          16874603                         163139 156050 44460 140423
10         16874604  169925 149047 83804 240836 129490 308610 16702...
11         16874605                                257544 240836 40526
12         16874606                        303990 267702 158632 163113
13    

In [37]:
submission.to_csv("submisison.csv", index=False)

## II. Improve brute force 

Now we will try to improve upon the brute force method
One main issue with the brute force method is that it doesn't take into
consideration the fact that some ads are displayed much more than
the others are.  For example, one add may be displayed 800 times 
and clicked on 600 times with a ratio of 75% while another add may have
only been displayed once with a 100% click ratio.  Due to the limited
info on that ad, we should not weight it as highly, thus we add in 
a regularization term 10.

In [47]:

train = pd.read_csv('/Users/aiswaryasankar/Desktop/Kaggle/kaggle_winter/outbrain/clicks_train.csv')
                    
# The number of times each add was clicked 
total_clicks_per_ad = train[train['clicked'] == 1].ad_id.value_counts()

# Total number of adds displayed
total_displayed = train.ad_id.value_counts()

print(count)

175214    44824
173130    41852
303990    41380
3314      36782
130952    35421
184220    34999
92759     32935
26711     32030
193952    31905
228959    31404
151028    31074
84109     30879
142690    28061
57071     27744
153193    27542
332844    27458
250319    27096
199220    26177
330515    24371
101761    24191
202812    23649
190216    23553
231378    22605
190567    21527
48185     21485
476937    21402
124591    20477
6915      20091
204447    20069
10344     20007
          ...  
25816         1
462875        1
353400        1
364619        1
320584        1
315491        1
441819        1
490979        1
315501        1
364613        1
528887        1
337020        1
479260        1
446508        1
353396        1
304204        1
331860        1
348252        1
435212        1
233604        1
135348        1
539658        1
118988        1
173187        1
20732         1
304195        1
342324        1
402451        1
91386         1
227043        1
Name: ad_id, dtype: int6

In [51]:
# Now we'll compute the probability differently
# This way we 
def get_prob(k):
    if k not in total_clicks_per_ad:
        return 0
    else:
        return total_clicks_per_ad[k]/ (float(total_displayed[k]) + 10)

# k is the ad_id
# Create the grouped test data
# pass in the ids to get_prob and append to the test data
# then sort and create submission

def sort(x):
    ad_ids = map(int, x.split())
    
    # sorts ids based on the value returned from get_prob
    ad_ids = sorted(ad_ids, key=get_prob, reverse=True)
    
    # creates the string representation of display_id, sorted ids
    return " ".join(map(str, ad_ids))
    
submission = pd.read_csv('/Users/aiswaryasankar/Desktop/Kaggle/kaggle_winter/outbrain/sample_submission.csv')

# rewrites the ad_id column to include all of the sorted ids 
submission['ad_id'] = submission.ad_id.apply(lambda x: sort(x))

submission.to_csv('submission.csv', index=False)

## IV. Random Forest

Now let's try a more involved model. The shortcoming of the brute and optimized methods was that it didn't take an display information into consideration.

How can we take display into consideration when building our model?

In [53]:
from sklearn.ensemble import RandomForestClassifier

# Now let's look at some of the other files
# promoted_content.csv has the following columns: 
# ad_id, document_id, campaign_id, advertiser_id

# We will thus create a random forest with all 3 fields
# This will be better than the previous because it includes document!

content = pd.read_csv('/Users/aiswaryasankar/Desktop/Kaggle/kaggle_winter/outbrain/promoted_content.csv')

# Here we will merge content and train to include document, campaign and 
# advertiser info
train=pd.merge(train, content, how='left', on='ad_id')
print(train)

          display_id   ad_id  clicked  document_id  campaign_id  advertiser_id
0                  1   42337        0       938164         5969           1499
1                  1  139684        0      1085937        17527           2563
2                  1  144739        1      1337362        18488           2909
3                  1  156824        0       992370         7283           1919
4                  1  279295        0      1670176        27524           1820
5                  1  296965        0      1133290        28334           3922
6                  2  125211        0       876083        16054            772
7                  2  156535        0       630534        16636            380
8                  2  169564        0      1394819        20109            640
9                  2  308455        1      1712989          245            232
10                 3   71547        0      1043039         8711           1919
11                 3   95814        0       811706  

## A. Train our Random Forest

In [59]:
# what features are we using for Random Forest?
# columns that aren't display or clicked since those are in test

predictors = [x for x in train.columns if x not in ['display_id', 'clicked']]
train = train.fillna(0.0)

# fit the classifier
# These hyperparameters for the RandomForestClassifier can be determined
# by iterating through values and finding the ones with the lowerst
# validation set error

forest = RandomForestClassifier(random_state=1, n_estimators=3, 
                               min_samples_split=4, min_samples_leaf=2)

# fit the algorithm with features, predicted result (click status)
forest.fit(train[predictors], train['clicked'])


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=4, min_weight_fraction_leaf=0.0,
            n_estimators=3, n_jobs=1, oob_score=False, random_state=1,
            verbose=0, warm_start=False)

## B. Predict on test data

In [61]:
# We need to merge the content with the test data

df_test = pd.merge(df_test, content, how='left', on='ad_id')
df_test = df_test.fillna(0.0)

# Predict using the trained RandomForest model from train data
# Use the predict_proba function that will output a probability likelihood
predictions = list(forest.predict_proba(df_test[predictors]).astype(float)[:,1])



## C. Generate submission

In [62]:
# What do we have in test and predictions?
results = pd.concat(df_test, pd.DataFrame(predictions), axis=1, ignore_index=True)

print(results.head())

results.columns = ['display_id', 'ad_id', 'clicked']

results = results.sort_values(by=['display_id', 'clicked'], ascending=[True,False])

restuls = results.reset_index(drop=True)
results = results[['display_id', 'ad_id']].groupby('display_id')['ad_id'].agg(
    lambda col: ' '.join(map(str,col)))
results.columns = [['display_id', 'ad_id']]

results.to_csv('submission3.csv')


TypeError: concat() got multiple values for argument 'axis'