# Outbrain Clickthrough

## Sampling for Train and Test data & merging tables
Given the size and scope of the data, and having done the initial EDA, the preliminary model is to be developed using a random subset of the data.

#### The following factors are being applied to pair down the data
* Only using US data
* This is a predominant part of the data, and EDA shows not much variation in behaviour in regards to display count and other factors.
* Of this data, only those with full geo_location data will be used. This is to ensure proper correction for time zones (base times are in UTC counts).
* From this a random slice of events will be chosen.

The initial training data will be 5,000 events and the test set will be 500 events. Since each event has on average around five ad choices, this will tranlate in 25,000 rows of data for developing the training set.

#### Further refinements to the process will include what selection of document data to use
* Much of the data has low confidence levels
* Will need to develop a threshold for when to use and when to ignore confidence level
* Many documents also have multiple categories, so will need to determine how to pick the main one to use, or as the model develops, how to incorporate multiple categories.

## Slicing and prepping the data

Note: CAPITAL LETTER items reference tables provided by Outbrains.

Process:
* Read in Outbrain CLICK data in parts, too big to read in as one dataframe
* Compare to EVENTS data supplied by outbrain (Events is larger set the Click data)
* EVENTS - make 2nd version that:
    * Slice EVENTS down so that it only has items that are in CLICK
    * Split up geo_location column into 'states' and 'DMA'
    * Filter out empties and non US data, as well as data missing state and DMA info


In [74]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
import random
import matplotlib
import matplotlib.cm as cm
import seaborn as sns
import datetime
import time

matplotlib.style.use('ggplot') 
%matplotlib inline

## Read in Event Data

The event file contains the item clicked (not the other ads displayed) and some basic user information.


In [2]:
# !ls ./data/ 

In [3]:
events = pd.read_csv('./data/events.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [4]:
print('Total number of events in full data: {0:,}'.\
      format(events['display_id'].nunique()))

Total number of events in full data: 23,120,126


## Read the Click Data into 2 data frames
File is too large to read into one dataframe. This file contains all the ads displayed at each click event. Each selection of ads is denoted by 'display_id'.

In [5]:
df1 = pd.read_csv('./data/clicks_train.csv', nrows = 40000000)
df2 = pd.read_csv('./data/clicks_train.csv', skiprows = 40000000)

In [6]:
titles = df1.columns
df2.columns = titles

In [7]:
events.head(2)

Unnamed: 0,display_id,uuid,document_id,timestamp,platform,geo_location
0,1,cb8c55702adb93,379743,61,3,US>SC>519
1,2,79a85fa78311b9,1794259,81,2,US>CA>807


In [9]:
df1.head(7)

Unnamed: 0,display_id,ad_id,clicked
0,1,42337,0
1,1,139684,0
2,1,144739,1
3,1,156824,0
4,1,279295,0
5,1,296965,0
6,2,125211,0


As seen above the 'display_id' = 1 has 6 ads displayed and the one with 'ad_id' = 144739 was clicked.

## Do some data cleaning

In [10]:
print(events['platform'].unique())
events['platform'].value_counts()

[3 2 1 '2' '1' '3' '\\N']


2     10684579
1      8747280
3      3032907
2       291699
1       279988
3        83668
\N           5
Name: platform, dtype: int64

In [11]:
# function to convert all data to strings (will be used for classification)

def platform_filter(x):
    if x == '\\N':
        return np.nan
    elif type(x) is int:
        return str(x)  # make into string since this is a category
    else:
        return x

# updated platform column
events['platform'] = [platform_filter(x) for x in events['platform']]
events['platform'].unique()

array(['3', '2', '1', nan], dtype=object)

## Break out Geo_Location data and use only US data

In [12]:
events['country'] = events['geo_location'].str[0:2]
# take only US data from events
# and break out other geographic data
events = events[events['country'] == 'US']
events['state'] = events['geo_location'].str[3:5]
events['DMA'] = events['geo_location'].str[6:]

In [13]:
events.head(3)

Unnamed: 0,display_id,uuid,document_id,timestamp,platform,geo_location,country,state,DMA
0,1,cb8c55702adb93,379743,61,3,US>SC>519,US,SC,519
1,2,79a85fa78311b9,1794259,81,2,US>CA>807,US,CA,807
2,3,822932ce3d8757,1179111,182,2,US>MI>505,US,MI,505


In [14]:
# Clean DMA and state columns
def filter_DMA(x):
    try:
        return str(int(x))
    except:
        return np.nan

def filter_state(x):
    if x == '':
        return np.nan
    else:
        return x
    
events['state'] = [filter_state(x) for x in events['state']]
events['DMA'] = [filter_DMA(x) for x in events['DMA']]

In [15]:
print('# rows mising state data: {0:,}'.format(len(events[events['state'].isnull()])))
print('# rows mising platform data: {0:,}'.format( len(events[events['platform'].isnull()])))
print('# rows mising DMA data: {0:,}'.format( len(events[events['DMA'].isnull()])))

# rows mising state data: 758,487
# rows mising platform data: 5
# rows mising DMA data: 1,264,852


In [16]:
print('# of total US events: {0:,}'.format(len(events)))

# of total US events: 18,595,452


In [17]:
US_orig_length = len(events)

### Take out events missing geo data & military location designators
AA and AP are military postal designations for overseas.

In [18]:
events = events[(events['state'].notnull()) & (events['DMA'].notnull()) \
                & (events['state'] != "AA") & (events['state'] != 'AP') \
                & (events['platform'].notnull())]


In [19]:
print('# rows mising state data: {0:,}'.format(len(events[events['state'].isnull()])))
print('# rows mising platform data: {0:,}'.format( len(events[events['platform'].isnull()])))
print('# rows mising DMA data: {0:,}'.format( len(events[events['DMA'].isnull()])))

# rows mising state data: 0
# rows mising platform data: 0
# rows mising DMA data: 0


In [20]:
print('# of rows reduced: {0:,}'.format(US_orig_length - len(events)))

# of rows reduced: 1,264,857


All the missing state data came from rows missing the DMA data (the 5 stragglers may be from misssing platform data or DMA or both)

### Sync Data Between Main Click Data and EVENT table
EVENT has more click's than in the main data set, so we want to reduce EVENT table to only have correpsonding click events.

In [21]:
# identify events that are in main click data set 
event_id = list(df1['display_id'].unique()) + list(df2['display_id'].unique())

In [22]:
# filter main events dataframe to only include event_id's in main click data set
events = events[events['display_id'].isin(event_id)]

In [23]:
print('Further adjustment to number of events: {0:,}'.\
      format(US_orig_length - 1264857 - len(events)))

Further adjustment to number of events: 4,652,848


Check events dataframe for stray nan or other odd values (ok since nothing odd at begining or end)

In [24]:
events['uuid'].unique()

array(['cb8c55702adb93', '79a85fa78311b9', '822932ce3d8757', ...,
       '4032cf074d74a3', '49396799cb3a40', '21f03d8a66e702'], dtype=object)

In [25]:
events['document_id'].unique()

array([ 379743, 1794259, 1179111, ..., 2681997, 2788096,  682611])

In [26]:
events['timestamp'].unique()

array([        61,         81,        182, ..., 1123199470, 1123199601,
       1123199936])

### Take list from events['display_id'] to generate random sample list
This will be used later to draw random samples from the assembled dataset.

In [27]:
event_US_id = list(events['display_id'].unique())

## Date - put timstamp in terms of dates and local times

In [28]:
# to adjsut timestamp later
from datetime import datetime
datetime.fromtimestamp(1465876800)
# the timestamp is adjusted to time zero for data colleciton, whcih is 6/14/2016
# adj in milliseconds: 1465876799998ms

datetime.datetime(2016, 6, 14, 0, 0)

In [29]:
tz= pd.read_csv('./data/tz.csv')
tz.head()

Unnamed: 0,state,tz,utc_summer,tz_adj_sec
0,AK,AKST,-8,-28800
1,AL,CST,-5,-18000
2,AR,CST,-5,-18000
3,IL,CST,-5,-18000
4,IA,CST,-5,-18000


In [30]:
events = events.merge(tz,on='state',how='left', suffixes=('_l','_r'))

In [31]:
events['ts_UTC'] = [t + 1465876799998 for t in events['timestamp']]
events['date_UTC'] = pd.to_datetime(events['ts_UTC'],unit='ms')

In [32]:
events['ts_local'] = events['ts_UTC'] + 1000*events['tz_adj_sec']  # assigned to a column
events['date_local'] = pd.to_datetime(events['ts_local'],unit='ms')
events.tail()

Unnamed: 0,display_id,uuid,document_id,timestamp,platform,geo_location,country,state,DMA,tz,utc_summer,tz_adj_sec,ts_UTC,date_UTC,ts_local,date_local
12677742,16874588,1bf30bbd832319,2822648,1123199298,2,US>VA>511,US,VA,511,EST,-4,-14400,1466999999296,2016-06-27 03:59:59.296,1466985599296,2016-06-26 23:59:59.296
12677743,16874589,d2c47d8183e37b,876520,1123199313,3,US>TN>557,US,TN,557,CST,-5,-18000,1466999999311,2016-06-27 03:59:59.311,1466981999311,2016-06-26 22:59:59.311
12677744,16874590,4032cf074d74a3,2819923,1123199470,3,US>NM>790,US,NM,790,MST,-6,-21600,1466999999468,2016-06-27 03:59:59.468,1466978399468,2016-06-26 21:59:59.468
12677745,16874591,49396799cb3a40,2816969,1123199601,1,US>IN>582,US,IN,582,EST,-4,-14400,1466999999599,2016-06-27 03:59:59.599,1466985599599,2016-06-26 23:59:59.599
12677746,16874593,21f03d8a66e702,2777166,1123199936,2,US>NJ>501,US,NJ,501,EST,-4,-14400,1466999999934,2016-06-27 03:59:59.934,1466985599934,2016-06-26 23:59:59.934


In [33]:
events.drop(['utc_summer','timestamp','tz','utc_summer','tz_adj_sec','ts_UTC'],\
            axis=1, inplace=True)

In [34]:
events.drop(['date_UTC','ts_local'], axis=1, inplace=True)

In [35]:
events.head()

Unnamed: 0,display_id,uuid,document_id,platform,geo_location,country,state,DMA,date_local
0,1,cb8c55702adb93,379743,3,US>SC>519,US,SC,519,2016-06-14 00:00:00.059
1,2,79a85fa78311b9,1794259,2,US>CA>807,US,CA,807,2016-06-13 21:00:00.079
2,3,822932ce3d8757,1179111,2,US>MI>505,US,MI,505,2016-06-14 00:00:00.180
3,4,85281d0a49f7ac,1777797,2,US>WV>564,US,WV,564,2016-06-14 00:00:00.232
4,6,7765b4faae4ad4,1773517,3,US>OH>510,US,OH,510,2016-06-14 00:00:00.393


## Now slice and dice main data and events data to what is needed to run models

This will be 5,000 for training data and 500 for test data

* Random list of 5,500
* Filter EVENTS
    * Done a few steps above
* Filter df1 and df2 for 5.5k id's in sample list
    * Merge now that down to manageable size
    * Split into train and test
    * Save to MySQL

In [36]:
import random

# Random list from event_id list of df1 & df2
random.seed(a=47)
rand_list = random.sample(event_US_id,5500)

train_list = rand_list[:5000]
test_list = rand_list[5000:]


In [37]:
train1 = df1[df1['display_id'].isin(train_list)]
train2 = df2[df2['display_id'].isin(train_list)]
test1 = df1[df1['display_id'].isin(test_list)]
test2 = df2[df2['display_id'].isin(test_list)]


In [38]:
train_US = pd.concat([train1,train2], axis=0)
test_US = pd.concat([test1,test2], axis=0)

In [39]:
train_US.head(3)

Unnamed: 0,display_id,ad_id,clicked
9140,1796,9638,1
9141,1796,14120,0
9142,1796,134057,0


In [40]:
print('Number of events in train data: {0:,}'.format(train_US['display_id'].nunique()))
print('Number of rows (total ads shown) in train data: {0:,}'.\
      format(len(train_US['display_id'])))

Number of events in train data: 5,000
Number of rows (total ads shown) in train data: 25,744


In [41]:
print('Number of events in test data: {0:,}'.format(test_US['display_id'].nunique()))
print('Number of rows (total ads shown) in test data: {0:,}'.\
      format(len(test_US['display_id'])))


Number of events in test data: 500
Number of rows (total ads shown) in test data: 2,649


In [42]:
# clean up dataframes no longer being used:
del df1, df2, train1, train2, test1, test2

## Merge click and event data into main dataframe

In [43]:
data_test = test_US.merge(events,on='display_id',how='left', suffixes=('_l','_r'))
data_train = train_US.merge(events,on='display_id',how='left', suffixes=('_l','_r'))

In [44]:
data_test.tail(6)

Unnamed: 0,display_id,ad_id,clicked,uuid,document_id,platform,geo_location,country,state,DMA,date_local
2643,16838190,8561,0,a92dc3af90157a,1655195,1,US>VA>511,US,VA,511,2016-06-26 23:21:57.782
2644,16838190,76898,0,a92dc3af90157a,1655195,1,US>VA>511,US,VA,511,2016-06-26 23:21:57.782
2645,16838190,99081,0,a92dc3af90157a,1655195,1,US>VA>511,US,VA,511,2016-06-26 23:21:57.782
2646,16838190,105336,0,a92dc3af90157a,1655195,1,US>VA>511,US,VA,511,2016-06-26 23:21:57.782
2647,16838190,150808,1,a92dc3af90157a,1655195,1,US>VA>511,US,VA,511,2016-06-26 23:21:57.782
2648,16838190,197526,0,a92dc3af90157a,1655195,1,US>VA>511,US,VA,511,2016-06-26 23:21:57.782


In [45]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2649 entries, 0 to 2648
Data columns (total 11 columns):
display_id      2649 non-null int64
ad_id           2649 non-null int64
clicked         2649 non-null int64
uuid            2649 non-null object
document_id     2649 non-null int64
platform        2649 non-null object
geo_location    2649 non-null object
country         2649 non-null object
state           2649 non-null object
DMA             2649 non-null object
date_local      2649 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(6)
memory usage: 248.3+ KB


In [46]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25744 entries, 0 to 25743
Data columns (total 11 columns):
display_id      25744 non-null int64
ad_id           25744 non-null int64
clicked         25744 non-null int64
uuid            25744 non-null object
document_id     25744 non-null int64
platform        25744 non-null object
geo_location    25744 non-null object
country         25744 non-null object
state           25744 non-null object
DMA             25744 non-null object
date_local      25744 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(6)
memory usage: 2.4+ MB


## Bring in promoted content info
This table ties the ad_id from the click data to the ad's document_id, the advertiser and the campaign. The document_id is key to pulling topical/categorical data later.

In [47]:
pc = pd.read_csv('./data/promoted_content.csv')
pc.head()

Unnamed: 0,ad_id,document_id,campaign_id,advertiser_id
0,1,6614,1,7
1,2,471467,2,7
2,3,7692,3,7
3,4,471471,2,7
4,5,471472,2,7


In [48]:
data_train = data_train.merge(pc,on='ad_id',how='left', suffixes=('_l','_r'))
data_test = data_test.merge(pc,on='ad_id',how='left', suffixes=('_l','_r'))

In [49]:
data_train.head()

Unnamed: 0,display_id,ad_id,clicked,uuid,document_id_l,platform,geo_location,country,state,DMA,date_local,document_id_r,campaign_id,advertiser_id
0,1796,9638,1,6ebcd6c9c6bc96,1047732,2,US>CA>807,US,CA,807,2016-06-13 21:02:05.717,844724,1855,1197
1,1796,14120,0,6ebcd6c9c6bc96,1047732,2,US>CA>807,US,CA,807,2016-06-13 21:02:05.717,700394,2496,1222
2,1796,134057,0,6ebcd6c9c6bc96,1047732,2,US>CA>807,US,CA,807,2016-06-13 21:02:05.717,1139308,6269,1043
3,1796,141437,0,6ebcd6c9c6bc96,1047732,2,US>CA>807,US,CA,807,2016-06-13 21:02:05.717,1325421,18107,2151
4,7314,202991,0,e0ff8bc8860f93,1608197,2,US>MI>505,US,MI,505,2016-06-14 00:08:21.809,1517459,23497,2043


In [50]:
data_train.rename(columns={'document_id_l':'document_id','document_id_r':'ad_document_id' },\
                    inplace=True)
data_test.rename(columns={'document_id_l':'document_id','document_id_r':'ad_document_id' },\
                    inplace=True)

In [51]:
data_train.head(3)

Unnamed: 0,display_id,ad_id,clicked,uuid,document_id,platform,geo_location,country,state,DMA,date_local,ad_document_id,campaign_id,advertiser_id
0,1796,9638,1,6ebcd6c9c6bc96,1047732,2,US>CA>807,US,CA,807,2016-06-13 21:02:05.717,844724,1855,1197
1,1796,14120,0,6ebcd6c9c6bc96,1047732,2,US>CA>807,US,CA,807,2016-06-13 21:02:05.717,700394,2496,1222
2,1796,134057,0,6ebcd6c9c6bc96,1047732,2,US>CA>807,US,CA,807,2016-06-13 21:02:05.717,1139308,6269,1043


In [52]:
data_test.head(3)

Unnamed: 0,display_id,ad_id,clicked,uuid,document_id,platform,geo_location,country,state,DMA,date_local,ad_document_id,campaign_id,advertiser_id
0,65537,47575,0,416a527548d8f5,279813,2,US>TX>618,US,TX,618,2016-06-14 00:29:14.664,954537,6560,1890
1,65537,59603,0,416a527548d8f5,279813,2,US>TX>618,US,TX,618,2016-06-14 00:29:14.664,834542,7978,1521
2,65537,149931,0,416a527548d8f5,279813,2,US>TX>618,US,TX,618,2016-06-14 00:29:14.664,1357158,19011,1819


In [53]:
data_train[(data_train['document_id'] == 1047732) &\
             (data_train['ad_document_id'] == 700394)]


Unnamed: 0,display_id,ad_id,clicked,uuid,document_id,platform,geo_location,country,state,DMA,date_local,ad_document_id,campaign_id,advertiser_id
1,1796,14120,0,6ebcd6c9c6bc96,1047732,2,US>CA>807,US,CA,807,2016-06-13 21:02:05.717,700394,2496,1222


## Bring in Document Data
There are three sets of document data that attempt to classify the material inside a document page:
* Topic
* Category
* Entity <br>

Some documents have more than one of the above for categorization, in those cases the item with the highest confidence level will be added to the table.

In [54]:
d_top = pd.read_csv('./data/documents_topics.csv')
d_cat = pd.read_csv('./data/documents_categories.csv')
d_ent = pd.read_csv('./data/documents_entities.csv')

In [88]:
d_cat.head(2)

Unnamed: 0,document_id,category_id,confidence_level
0,1595802,1611,0.92
1,1595802,1610,0.07


In [89]:
d_ent.head(2)

Unnamed: 0,document_id,entity_id,confidence_level
0,1524246,f9eec25663db4cd83183f5c805186f16,0.672865
1,1524246,55ebcfbdaff1d6f60b3907151f38527a,0.399114


An example of how data is organized in the tables.
* document_id of source page: 1773554
* document_id's of ad pages:
    * 1328059
    * 1086095
    * 1414815
    * 1130922

For the source page, topic 199 will be chosen, for the first ad page 160 will be chose, and since the remainder have only one choice that will be what is used.

In [55]:
# the original page document_id
d_top[d_top['document_id'] == 1773554]

Unnamed: 0,document_id,topic_id,confidence_level
8371044,1773554,199,0.090703
8371045,1773554,183,0.013426
8371046,1773554,133,0.008202


In [67]:
# the corresponding ad doc_id's
d_top[d_top['document_id'] == 1328059]

Unnamed: 0,document_id,topic_id,confidence_level


In [57]:
d_top[d_top['document_id'] == 1086095]

Unnamed: 0,document_id,topic_id,confidence_level
10588629,1086095,26,0.259674


In [58]:
d_top[d_top['document_id'] == 1414815]

Unnamed: 0,document_id,topic_id,confidence_level
9395631,1414815,183,0.309797


In [59]:
d_top[d_top['document_id'] == 1130922]

Unnamed: 0,document_id,topic_id,confidence_level
10551623,1130922,107,0.261741


In [70]:
# function to gind the max confidence level 
def doc_max (x,df,df_col):
    try:
        d_index = df[df['document_id'] == x][df_col].idxmax(axis=0, skipna=True)
        return d_top.iloc[d_index][df_col]
    except:
        pass
    else:
        return np.nan


In [71]:
data_train['doc_id_topic'] = [doc_max(rows['document_id'],d_top,'topic_id') for index, \
                                rows in data_train.iterrows()]

In [75]:
start = datetime.datetime.now()

data_test['doc_id_topic'] = [doc_max(rows['document_id'],d_top,'topic_id') for index, \
                                rows in data_test.iterrows()]

finish = datetime.datetime.now()
print('Time to run: {0}'.format(finish - start))

Time to run: 0:00:30.748548


In [76]:
data_test['doc_id_topic'].isnull().sum()

253

Not all documents are showing a topic, about half are missing. Will still include 

In [84]:
data_test['display_id'].nunique()

500

In [86]:
data_test.dropna()['display_id'].nunique()

457

About 10% of sample set does not have doc_id, to critical to be imputed so will continue with reduced dataset and work on remediation on next round of analysis.

In [87]:
data_train = data_train.dropna()
data_test = data_test.dropna()

In [None]:
data_train.to_csv('./data/data_train.csv',index=False)
data_test.to_csv('./data/data_test.csv',index=False)


# <font color='blue'> maybe adjust this so it goes by index #, iterrows may be too slow...
Kickes off warnings, and 

time to execute below was: 9:15

In [90]:
start = datetime.datetime.now()

data_train['doc_id_topic'] = [doc_max(rows['document_id'],d_top,'topic_id') for index, \
                                rows in data_train.iterrows()]
data_test['doc_id_topic'] = [doc_max(rows['document_id'],d_top,'topic_id') for index, \
                                rows in data_test.iterrows()]

data_train['doc_id_cat'] = [doc_max(rows['document_id'],d_cat,'category_id') for index, \
                                rows in data_train.iterrows()]
data_test['doc_id_cat'] = [doc_max(rows['document_id'],d_cat,'category_id') for index, \
                                rows in data_test.iterrows()]

data_train['doc_id_ent'] = [doc_max(rows['document_id'],d_ent,'entity_id') for index, \
                                rows in data_train.iterrows()]
data_test['doc_id_ent'] = [doc_max(rows['document_id'],d_ent,'entity_id') for index, \
                                rows in data_test.iterrows()]


finish = datetime.datetime.now()
print('Time to run: {0}'.format(finish - start))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Time to run: 0:09:15.334621


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [91]:
start = datetime.datetime.now()

data_train['ad_doc_id_topic'] = [doc_max(rows['ad_document_id'],d_top,'topic_id') for index, \
                                rows in data_train.iterrows()]
data_test['ad_doc_id_topic'] = [doc_max(rows['ad_document_id'],d_top,'topic_id') for index, \
                                rows in data_test.iterrows()]

data_train['ad_doc_id_cat'] = [doc_max(rows['ad_document_id'],d_cat,'category_id') for index, \
                                rows in data_train.iterrows()]
data_test['ad_doc_id_cat'] = [doc_max(rows['ad_document_id'],d_cat,'category_id') for index, \
                                rows in data_test.iterrows()]

data_train['ad_doc_id_ent'] = [doc_max(rows['ad_document_id'],d_ent,'entity_id') for index, \
                                rows in data_train.iterrows()]
data_test['ad_doc_id_ent'] = [doc_max(rows['ad_document_id'],d_ent,'entity_id') for index, \
                                rows in data_test.iterrows()]


finish = datetime.datetime.now()
print('Time to run: {0}'.format(finish - start))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Time to run: 0:10:17.455862


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [92]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23179 entries, 0 to 25740
Data columns (total 20 columns):
display_id         23179 non-null int64
ad_id              23179 non-null int64
clicked            23179 non-null int64
uuid               23179 non-null object
document_id        23179 non-null int64
platform           23179 non-null object
geo_location       23179 non-null object
country            23179 non-null object
state              23179 non-null object
DMA                23179 non-null object
date_local         23179 non-null datetime64[ns]
ad_document_id     23179 non-null int64
campaign_id        23179 non-null int64
advertiser_id      23179 non-null int64
doc_id_topic       23179 non-null float64
doc_id_cat         0 non-null object
doc_id_ent         0 non-null object
ad_doc_id_topic    23017 non-null float64
ad_doc_id_cat      0 non-null object
ad_doc_id_ent      0 non-null object
dtypes: datetime64[ns](1), float64(2), int64(7), object(10)
memory usage: 3.7+ MB


# Save tables to .csv

In [93]:
data_train.to_csv('./data/data_train.csv',index=False)
data_test.to_csv('./data/data_test.csv',index=False)
