# Introduction

The initial analysis and exploration of the train_sample set (100K observations) gave us some initial expectations about the data. Here we build a pipeline to process the data sets efficiently for modeling.

# Planning the Pipeline

1. We will prepare 3 data sets from the training set. The training set provided contains 200 million samples, and we don't have the computational power to use all of this data at the moment. Instead, we will extract 3 data sets, each containing ~1 million observations. We will refer to these sets as:

    - training_set (rows 1:1,000,000 of the original training set)
    - validation_set1 (rows 1,000,001:2,000,000 of the original training set)
    - validation_set2 (rows 2,000,001:3,000,000 of the original training set)

2. Build the feature extraction and selection pipeline using the training set:

    - Using the insights we obtained from data exploration, we will create dummy variables from the following features: device, app, os and channel. We will do this by converting these features to strings, tokenizing them, and selecting the 300 best features.
    - We will write custom processing functions to add the log_total_clicks and log_total_click_time features, and remove the unwanted base features.

3. We will prepare the remainder of the pipeline to incorporate interaction terms and perform scaling and standardization.

## Prepare training and validation sets


In [19]:
import pandas as pd
training_set = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/train.csv",
                           nrows=1000000,
                           dtype = "str")
print("Finished training_set")
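# Note: an integer skiprows also counts the header line, so skipping 1,000,000 lines
# starts this set at the last row of training_set (one shared row);
# skiprows = 1000001 would start at the row after it instead
validation_set1 = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/train.csv",
                           skiprows = 1000000,names = list(training_set.columns),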
                           nrows=1000000,
                           dtype = "str")
print("Finished validation_set1")
validation_set2 = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/train.csv",
                           skiprows = 2000000,names = list(training_set.columns),
                           nrows=1000000,
                           dtype = "str")
print("Finished validation_set2")


Finished training_set
Finished validation_set1
Finished validation_set2
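
The three reads above take consecutive blocks of rows from one large file. As a minimal sketch of that idea, the helper below reads the k-th block while keeping the header row, so the column names do not have to be passed in by hand. The name `read_block` and the tiny in-memory CSV are hypothetical and only serve the illustration; the last lines show how an integer `skiprows` also counts the header line.

```python
import io
import pandas as pd

def read_block(source, block_index, block_size):
    """Read the block_index-th block of block_size data rows, keeping the header."""
    start = block_index * block_size
    # Line 0 is the header; skip data lines 1..start so the header is still parsed
    return pd.read_csv(source, skiprows=range(1, start + 1),
                       nrows=block_size, dtype="str")

# Toy file standing in for train.csv: a header plus 6 data rows
csv_text = "ip,is_attributed\n" + "".join(f"{i},0\n" for i in range(6))

blocks = [read_block(io.StringIO(csv_text), k, 2) for k in range(3)]
for block in blocks:
    print(list(block["ip"]))

# An integer skiprows counts the header too: skipping 2 lines drops the header
# and only ONE data row, so this read starts at the last row of the first block
shifted = pd.read_csv(io.StringIO(csv_text), skiprows=2, nrows=2,
                      names=["ip", "is_attributed"], dtype="str")
print(list(shifted["ip"]))
```

With a range passed to `skiprows`, the blocks do not overlap and each block carries its own column names.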


In [20]:
validation_set1.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,121848,24,1,19,105,2017-11-06 16:21:51,,0
1,2698,25,1,30,259,2017-11-06 16:21:51,,0
2,5729,2,1,19,237,2017-11-06 16:21:51,,0
3,122891,3,1,35,280,2017-11-06 16:21:51,,0
4,105433,15,2,25,245,2017-11-06 16:21:51,,0


In [2]:
training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
ip                 1000000 non-null object
app                1000000 non-null object
device             1000000 non-null object
os                 1000000 non-null object
channel            1000000 non-null object
click_time         1000000 non-null object
attributed_time    1693 non-null object
is_attributed      1000000 non-null object
dtypes: object(8)
memory usage: 61.0+ MB


In [21]:
# Save each set to disk so it can be reloaded individually later
training_set.to_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/training_set.csv")
print("Wrote training_set to disk")

validation_set1.to_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/validation_set1.csv")
print("Wrote validation_set1 to disk")

validation_set2.to_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/validation_set2.csv")
print("Wrote validation_set2 to disk")

Wrote training_set to disk
Wrote validation_set1 to disk
Wrote validation_set2 to disk


In [22]:
training_set = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/training_set.csv",
                          index_col = 0, dtype = "str")

In [23]:
training_set.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,83230,3,1,13,379,2017-11-06 14:32:21,,0
1,17357,3,1,19,379,2017-11-06 14:33:34,,0
2,35810,3,1,13,379,2017-11-06 14:34:12,,0
3,45745,14,1,13,478,2017-11-06 14:34:52,,0
4,161007,3,1,13,379,2017-11-06 14:35:08,,0


In [24]:
list(training_set.columns)

['ip',
 'app',
 'device',
 'os',
 'channel',
 'click_time',
 'attributed_time',
 'is_attributed']

### Separate target labels from feature matrix

We will separate the target labels from the features for each of these data sets and pickle them for future use:

In [25]:
import os
import pandas as pd
import numpy as np
import pickle

X_train = training_set.drop(["is_attributed","attributed_time"], axis = 1)
y_train = pd.to_numeric(training_set.is_attributed) 

X_train.to_pickle("X_train.pkl")
y_train.to_pickle("y_train.pkl")

X_val1 = validation_set1.drop(["is_attributed","attributed_time"], axis = 1)
y_val1 = pd.to_numeric(validation_set1.is_attributed) 

X_val1.to_pickle("X_val1.pkl")
y_val1.to_pickle("y_val1.pkl")

X_val2 = validation_set2.drop(["is_attributed","attributed_time"], axis = 1)
y_val2 = pd.to_numeric(validation_set2.is_attributed) 

X_val2.to_pickle("X_val2.pkl")
y_val2.to_pickle("y_val2.pkl")
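
The cell above repeats the same split for each data set. A small helper could express it once; the sketch below is one possible way to do that, with `split_features_target` and the toy frame being hypothetical names and data used only for illustration.

```python
import pandas as pd

def split_features_target(df, target="is_attributed", drop=("attributed_time",)):
    """Return (X, y): features without the target/leaky columns, and a numeric target."""
    X = df.drop(columns=[target, *drop])
    y = pd.to_numeric(df[target])
    return X, y

# Toy frame with the same string dtype the notebook uses when reading the CSV
toy = pd.DataFrame({
    "ip": ["83230", "17357"],
    "click_time": ["2017-11-06 14:32:21", "2017-11-06 14:33:34"],
    "attributed_time": [None, None],
    "is_attributed": ["0", "1"],
})
X_toy, y_toy = split_features_target(toy)
print(list(X_toy.columns))
print(y_toy.tolist())
```

Each pair returned this way can then be pickled with `to_pickle`, as in the cell above.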

In [2]:
import os
os.listdir()

['.git',
 '.ipynb_checkpoints',
 '.Rhistory',
 'app_dummy.rds',
 'channel_dummy.rds',
 'device_dummy.rds',
 'os_dummy.rds',
 'test_processed.csv',
 'train_sample.csv',
 'User-click-detection-predictive-modeling.ipynb',
 'UserClickDetectionPredictiveModeling.Rmd',
 'X_train.pkl',
 'X_val1.pkl',
 'X_val2.pkl',
 'y_train.pkl',
 'y_val1.pkl',
 'y_val2.pkl']

## Build the feature extraction and selection pipeline using the training set



In [98]:
import pandas as pd
import pickle

# Read the pickled training set
X_train = pd.read_pickle("X_train.pkl")
y_train = pd.read_pickle("y_train.pkl")

In [7]:
# Label text features
Text_features = ["app","device","os","channel"]

In [30]:
# Define a utility function to parse and process the text features
# Note: we avoid lambda functions because they cannot be pickled when we save the pipeline later
def column_text_processer_nolambda(df,text_columns = Text_features):
    """Join all text features in each row into one string, ready for tokenization.
    - Missing values are converted to empty strings.
    - The text is converted to lowercase.
    df = pandas dataframe
    text_columns = names of the text features in df
    """
    import pandas as pd

    # Select only the text columns; copy them so the caller's dataframe is not modified
    text_data = df[text_columns].copy()

    # Fill the missing values in text_data using empty strings
    text_data = text_data.fillna("")

    # Prefix each category encoding with its feature name
    # E.g: encoding 3 in the device column becomes device3, so each encoding is unique to its feature
    for col_name in list(text_data.columns):
        text_data[col_name] = col_name + text_data[col_name].astype(str)

    # Join all the strings in a given row into one lowercase string
    # (lambda version, avoided: text_data.apply(lambda x: " ".join(x), axis = 1))
    text_vector = []
    for index,rows in text_data.iterrows():
        text_item = " ".join(rows).lower()
        text_vector.append(text_item)

    # Return text_vector as a pd.Series object to enter the tokenization pipeline
    return pd.Series(text_vector)
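
The function above produces one string per row, such as `app3 device1 os13 channel379`. A minimal sketch of how such strings could be tokenized into dummy variables and reduced to the best features follows. The notebook does not name the tokenizer or the selection score, so scikit-learn's `CountVectorizer` and `SelectKBest` with `chi2` are assumptions here, and `k=3` stands in for the 300 features mentioned in the plan because the toy vocabulary is tiny.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

# Toy strings in the format the text processor returns (made-up rows)
text_vector = pd.Series([
    "app3 device1 os13 channel379",
    "app3 device1 os19 channel379",
    "app14 device1 os13 channel478",
    "app14 device2 os13 channel478",
])
labels = pd.Series([0, 0, 1, 1])

text_pipeline = Pipeline([
    # One column per distinct token, e.g. "app3" or "channel478"
    ("vectorizer", CountVectorizer()),
    # Keep the k tokens with the highest chi-squared score against the labels
    ("selector", SelectKBest(chi2, k=3)),
])

selected = text_pipeline.fit_transform(text_vector, labels)
print(selected.shape)
```

`SelectKBest` raises an error when `k` exceeds the number of available tokens, which is why the toy example uses a small `k`.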

In [188]:
# Define a custom processing function that adds the log_total_clicks and
# log_total_click_time features and removes the unwanted base features
def column_time_processer(X_train):
    import pandas as pd
    import numpy as np

    # Convert click_time to datetime64 dtype
    # Note: this also converts the column in the dataframe that was passed in
    X_train.click_time = pd.to_datetime(X_train.click_time)

    # Calculate the log_total_clicks for each ip and store it in temp_data
    temp_data = np.log(X_train.groupby(["ip"]).size()).rename("log_total_clicks").reset_index()

    # Calculate the log_total_click_time for each ip
    # First define a function to process a selected ip group
    def get_log_total_click_time(group):
        # total_seconds() covers the whole span; .seconds would drop any full days
        diff = (group.click_time.max() - group.click_time.min()).total_seconds()
        return np.log(diff+1)

    # Then apply this function to each ip group to get the total click time per ip
    log_time_frame = (X_train.groupby(["ip"]).apply(get_log_total_click_time)
                      .rename("log_total_click_time").reset_index())

    # Then add this new feature to temp_data
    temp_data = pd.merge(temp_data,log_time_frame, how = "left",on = "ip")

    # Left-merge temp_data onto X_train to keep the row order of X_train
    temp_data = pd.merge(X_train,temp_data,how = "left",on = "ip")

    # Drop features that are not needed
    temp_data = temp_data[["log_total_clicks","log_total_click_time"]]

    # Return only the two numeric features as a dataframe to feed the numeric branch of the pipeline
    return temp_data

In [189]:
column_time_processer(X_train)

Unnamed: 0,log_total_clicks,log_total_click_time
0,4.787492,8.785234
1,4.969813,8.774004
2,4.290459,8.753213
3,7.415777,8.766862
4,0.000000,0.000000
5,2.079442,8.661294
6,4.189655,8.709960
7,3.401197,8.733594
8,3.496508,8.734721
9,1.609438,8.539346
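
The same two per-ip features can also be computed without the intermediate merges. The sketch below is an alternative under stated assumptions, not the notebook's own method: `add_ip_click_features` and the toy data are hypothetical, and the per-ip values are broadcast back to the rows with `map` and `groupby(...).transform`.

```python
import numpy as np
import pandas as pd

def add_ip_click_features(df):
    """Return log_total_clicks and log_total_click_time, one row per input row."""
    click_time = pd.to_datetime(df["click_time"])
    # Number of clicks per ip, mapped back onto every row of that ip
    total_clicks = df["ip"].map(df["ip"].value_counts())
    # First and last click per ip, broadcast back to the original rows
    grouped = click_time.groupby(df["ip"])
    span = grouped.transform("max") - grouped.transform("min")
    return pd.DataFrame({
        "log_total_clicks": np.log(total_clicks),
        # log1p(x) is log(1 + x), matching np.log(diff + 1)
        "log_total_click_time": np.log1p(span.dt.total_seconds()),
    })

toy = pd.DataFrame({
    "ip": ["a", "a", "b"],
    "click_time": ["2017-11-06 14:00:00", "2017-11-06 14:00:10",
                   "2017-11-06 15:00:00"],
})
features = add_ip_click_features(toy)
print(features)
```

Because nothing is merged, the result keeps the index and row order of the input, and the input frame is left unchanged.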


In [187]:
temp_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
log_total_clicks        1000000 non-null float64
log_total_click_time    1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB


In [184]:
temp_data.head()

Unnamed: 0,ip,app,device,os,channel,click_time,log_total_clicks,get_log_total_click_time
0,83230,3,1,13,379,2017-11-06 14:32:21,4.787492,8.785234
1,17357,3,1,19,379,2017-11-06 14:33:34,4.969813,8.774004
2,35810,3,1,13,379,2017-11-06 14:34:12,4.290459,8.753213
3,45745,14,1,13,478,2017-11-06 14:34:52,7.415777,8.766862
4,161007,3,1,13,379,2017-11-06 14:35:08,0.0,0.0


In [185]:
X_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time
0,83230,3,1,13,379,2017-11-06 14:32:21
1,17357,3,1,19,379,2017-11-06 14:33:34
2,35810,3,1,13,379,2017-11-06 14:34:12
3,45745,14,1,13,478,2017-11-06 14:34:52
4,161007,3,1,13,379,2017-11-06 14:35:08


In [164]:
temp_data = pd.merge(temp_data,log_time_frame)
temp_data.head()

Unnamed: 0,ip,log_total_clicks,get_log_total_click_time
0,10,0.0,0.0
1,100002,3.713572,6.817831
2,100005,2.079442,6.234411
3,100009,3.401197,7.166266
4,100013,3.78419,7.094235


In [180]:
pd.merge(X_train,temp_data,how = "left",on = "ip").shape

(1000000, 8)

In [71]:
X_train.click_time[0] - X_train.click_time[1]


Timedelta('-1 days +23:58:47')

In [70]:
(X_train.click_time[0] - X_train.click_time[1]).seconds


86327
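
The two outputs above show why `.seconds` needs care: it returns only the seconds component of a Timedelta (0 to 86399) and ignores the days component, so a difference of -73 seconds reads as 86327. The short sketch below reproduces this with the two timestamps from the first rows of X_train and contrasts it with `total_seconds()`; the `two_days` value is a made-up example.

```python
import pandas as pd

first_click = pd.Timestamp("2017-11-06 14:32:21")
second_click = pd.Timestamp("2017-11-06 14:33:34")

negative_gap = first_click - second_click
print(negative_gap.seconds)          # seconds component only
print(negative_gap.total_seconds())  # signed length of the whole interval

# A span longer than one day: .seconds drops the two full days
two_days = pd.Timedelta(days=2, seconds=5)
print(two_days.seconds)
print(two_days.total_seconds())
```

For spans that are non-negative and shorter than one day the two agree, but `total_seconds()` stays correct when the data covers several days.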