# Outbrain Ad Clickthrough - Creating train.csv and test.csv (Initial US Split)

https://www.kaggle.com/c/outbrain-click-prediction

Kagglers are challenged to predict which pieces of content its global base of users are likely to click on.



In [2]:
import pandas as pd
import numpy as np
import random

## Prepare the data and then split for train and test

The original intent was to randomly split the data, with the bulk to be used for train and the remainder to be used for test. The US data, however, is geotagged to Designsated Market Area (a geographical split by Nielsen, usually focused on major metropolitan markets), so in order to expand the predictor to include geographical data, only the US data will be used. This consituted just over 80% of the entire data set. The event data set, which is the number of clicks on ads, is over 23 million rows.

In [6]:
data = pd.read_csv('./data/clicks_train.csv')
events = pd.read_csv('./data/events.csv')

  interactivity=interactivity, compiler=compiler, result=result)


## Split out US data

During the initial EDA, the US data had the finest level of geo tagging, so in order to see if geography has much of an impact (beyond timezone), I decided to split out and work only with US data.

In [4]:
data.head()


Unnamed: 0,display_id,ad_id,clicked
0,1,42337,0
1,1,139684,0
2,1,144739,1
3,1,156824,0
4,1,279295,0


In [7]:
events.head()

Unnamed: 0,display_id,uuid,document_id,timestamp,platform,geo_location
0,1,cb8c55702adb93,379743,61,3,US>SC>519
1,2,79a85fa78311b9,1794259,81,2,US>CA>807
2,3,822932ce3d8757,1179111,182,2,US>MI>505
3,4,85281d0a49f7ac,1777797,234,2,US>WV>564
4,5,8d0daef4bf5b56,252458,338,2,SG>00


### Split out the country section of geo_location

In [8]:
events['country'] = events['geo_location'].str[0:2]

In [9]:
country_count = events.groupby('country')['country'].count().sort_values(ascending = False)
country_count.head(10)

country
US    18595452
CA     1215350
GB     1117544
AU      483021
IN      228461
ZA      111523
NZ      109802
PH       85338
DE       82384
SG       81975
Name: country, dtype: int64

In [10]:
print('The US constitutes {0:.2f}% of the data'.\
      format(100*country_count[0]/float(country_count.sum())))

The US constitutes 80.43% of the data


In [26]:
# US_bool = events['country'] == 'US'
us_events = events[events['country'] == 'US']
us_events.head()

Unnamed: 0,display_id,uuid,document_id,timestamp,platform,geo_location,country
0,1,cb8c55702adb93,379743,61,3,US>SC>519,US
1,2,79a85fa78311b9,1794259,81,2,US>CA>807,US
2,3,822932ce3d8757,1179111,182,2,US>MI>505,US
3,4,85281d0a49f7ac,1777797,234,2,US>WV>564,US
5,6,7765b4faae4ad4,1773517,395,3,US>OH>510,US


In [27]:
us_events['country'].unique()

array(['US'], dtype=object)

In [18]:
# this gives the tags for display_id's that were US based (click events that occurred 
# in the US)

us_display_id = us_events['display_id']

In [19]:
data['US'] = [1 if x in us_display_id else 0 for x in data['display_id']]

In [21]:
data.tail()

Unnamed: 0,display_id,ad_id,clicked,US
87141726,16874592,186600,0,1
87141727,16874593,151498,1,1
87141728,16874593,282350,0,1
87141729,16874593,521828,0,1
87141730,16874593,522693,0,1


In [22]:
data['US'].unique()

array([1, 0])

In [25]:
np.sum(data['US'])/float(len(data))

0.8049518892389227

In [28]:
data = data[data['US'] == 1]
data.head()

Unnamed: 0,display_id,ad_id,clicked,US
0,1,42337,0,1
1,1,139684,0,1
2,1,144739,1,1
3,1,156824,0,1
4,1,279295,0,1


In [30]:
np.sum(data['US'])/float(len(data))

1.0

## Split US data into Train and Test
Now that the main dataset has been filtered down to US only, the test data set will be extracted from the main data set.

In [32]:
random.seed(a=47)
rand_list = random.sample(data['display_id'],500000)

In [33]:
# set up test data
test = data[data['display_id'].isin(rand_list)]
# set up training data
train = data[~data['display_id'].isin(rand_list)]

In [1]:
test['display_id'].nunique()

NameError: name 'test' is not defined

In [35]:
train['display_id'].nunique()

13085451

In [40]:
test.describe()

Unnamed: 0,display_id,ad_id,clicked,US
count,2862690.0,2862690.0,2862690.0,2862690.0
mean,8432952.0,188719.2,0.1716728,1.0
std,4858113.0,124339.0,0.3770959,0.0
min,37.0,3.0,0.0,1.0
25%,4175247.0,95725.0,0.0,1.0
50%,8502946.0,167204.0,0.0,1.0
75%,12603300.0,251959.0,0.0,1.0
max,16874560.0,547865.0,1.0,1.0


In [37]:
peek = pd.pivot_table(train, index='ad_id', values=['display_id'], 
                      aggfunc='count').sort_values(by = 'display_id', ascending = False)
peek.describe()

Unnamed: 0,display_id
count,449840.0
mean,149.569205
std,1535.896224
min,1.0
25%,2.0
50%,5.0
75%,16.0
max,165522.0


In [38]:
test.to_csv('./data/test.csv',index=False)

In [39]:
train.to_csv('./data/train.csv',index=False)