# Introduction

Through the initial analysis and exploration of the train_sample set (100K observations), we developed some initial expectations about the data. Here we build a pipeline to process data sets efficiently for modeling.

# Planning for the Pipeline

1. We will prepare 3 data sets from the training set. The training set provided contains 200 million samples, and we don't have the computational power to use all of this data at the moment. Instead, we will extract 3 data sets, each containing ~1 million observations. We will refer to these sets as:

    - training_set (rows 1 to 1,000,000 of the original training set)
    - validation_set1 (rows 1,000,001 to 2,000,000 of the original training set)
    - validation_set2 (rows 2,000,001 to 3,000,000 of the original training set)

2. Build the feature extraction and selection pipeline using the training set:

    - Using the insights we obtained from data exploration, the following features will be used to create dummy variables: device, app, os and channel. We will do this by converting these features to strings, tokenizing them, and selecting the 300 best features.
    - We will write custom processing functions to add the log_total_clicks and log_total_click_time features, and remove the unwanted base features.

3. We will prepare the remainder of the pipeline to incorporate interaction terms and perform scaling and standardization.

## Prepare training and validation sets


In [19]:
import pandas as pd
training_set = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/train.csv",
                           nrows=1000000,
                           dtype = "str")
print("Finished training_set")
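# Note: skiprows = 1000000 skips the header line plus the first 999,999 data rows,
# so the first row read here is also the last row of training_set.
# Using skiprows = 1000001 (and 2000001 below) would make the sets fully disjoint.
validation_set1 = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/train.csv",
                           skiprows = 1000000,names = list(training_set.columns),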
                           nrows=1000000,
                           dtype = "str")
print("Finished validation_set1")
validation_set2 = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/train.csv",
                           skiprows = 2000000,names = list(training_set.columns),
                           nrows=1000000,
                           dtype = "str")
print("Finished validation_set2")


Finished training_set
Finished validation_set1
Finished validation_set2
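
The interaction between `skiprows` and the header line in the reads above is easy to get wrong, so here is a minimal sketch on a tiny in-memory CSV. The file contents, the column names `id` and `value`, and the block size `n` are made up for illustration; only the `pd.read_csv` arguments mirror the cell above.

```python
import io
import pandas as pd

# Toy CSV: one header line followed by data rows with ids 1..9 (illustrative data only)
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 10)) + "\n"
n = 3  # block size, standing in for 1,000,000

# First block: the header is parsed normally, then n data rows are read
first = pd.read_csv(io.StringIO(csv_text), nrows=n, dtype="str")
cols = list(first.columns)

# skiprows=n counts the header as one of the skipped lines,
# so this block starts at the last row of the first block
overlapping = pd.read_csv(io.StringIO(csv_text), skiprows=n, names=cols,
                          nrows=n, dtype="str")

# skiprows=n + 1 skips the header plus n data rows, giving a disjoint block
disjoint = pd.read_csv(io.StringIO(csv_text), skiprows=n + 1, names=cols,
                       nrows=n, dtype="str")

print(list(first["id"]), list(overlapping["id"]), list(disjoint["id"]))
```

In this sketch the first block ends with id 3, the `skiprows=n` block starts with id 3 again, and the `skiprows=n + 1` block starts with id 4.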


In [20]:
validation_set1.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,121848,24,1,19,105,2017-11-06 16:21:51,,0
1,2698,25,1,30,259,2017-11-06 16:21:51,,0
2,5729,2,1,19,237,2017-11-06 16:21:51,,0
3,122891,3,1,35,280,2017-11-06 16:21:51,,0
4,105433,15,2,25,245,2017-11-06 16:21:51,,0


In [2]:
training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
ip                 1000000 non-null object
app                1000000 non-null object
device             1000000 non-null object
os                 1000000 non-null object
channel            1000000 non-null object
click_time         1000000 non-null object
attributed_time    1693 non-null object
is_attributed      1000000 non-null object
dtypes: object(8)
memory usage: 61.0+ MB


In [21]:
# Let's save each set to disk so it can be loaded individually later
training_set.to_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/training_set.csv")
print("Wrote training_set to disk")

validation_set1.to_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/validation_set1.csv")
print("Wrote validation_set1 to disk")

validation_set2.to_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/validation_set2.csv")
print("Wrote validation_set2 to disk")

Wrote training_set to disk
Wrote validation_set1 to disk
Wrote validation_set2 to disk


In [22]:
training_set = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/training_set.csv",
                          index_col = 0, dtype = "str")

In [23]:
training_set.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,83230,3,1,13,379,2017-11-06 14:32:21,,0
1,17357,3,1,19,379,2017-11-06 14:33:34,,0
2,35810,3,1,13,379,2017-11-06 14:34:12,,0
3,45745,14,1,13,478,2017-11-06 14:34:52,,0
4,161007,3,1,13,379,2017-11-06 14:35:08,,0


In [24]:
list(training_set.columns)

['ip',
 'app',
 'device',
 'os',
 'channel',
 'click_time',
 'attributed_time',
 'is_attributed']

### Separate target labels from feature matrix

We will separate the target labels from the features for each of these data sets and pickle them for future use:

In [25]:
import os
import pandas as pd
import numpy as np
import pickle

X_train = training_set.drop(["is_attributed","attributed_time"], axis = 1)
y_train = pd.to_numeric(training_set.is_attributed) 

X_train.to_pickle("X_train.pkl")
y_train.to_pickle("y_train.pkl")

X_val1 = validation_set1.drop(["is_attributed","attributed_time"], axis = 1)
y_val1 = pd.to_numeric(validation_set1.is_attributed) 

X_val1.to_pickle("X_val1.pkl")
y_val1.to_pickle("y_val1.pkl")

X_val2 = validation_set2.drop(["is_attributed","attributed_time"], axis = 1)
y_val2 = pd.to_numeric(validation_set2.is_attributed) 

X_val2.to_pickle("X_val2.pkl")
y_val2.to_pickle("y_val2.pkl")

In [2]:
import os
os.listdir()

['.git',
 '.ipynb_checkpoints',
 '.Rhistory',
 'app_dummy.rds',
 'channel_dummy.rds',
 'device_dummy.rds',
 'os_dummy.rds',
 'test_processed.csv',
 'train_sample.csv',
 'User-click-detection-predictive-modeling.ipynb',
 'UserClickDetectionPredictiveModeling.Rmd',
 'X_train.pkl',
 'X_val1.pkl',
 'X_val2.pkl',
 'y_train.pkl',
 'y_val1.pkl',
 'y_val2.pkl']

## Build the feature extraction and selection pipeline using the training set



In [217]:
import pandas as pd
import pickle

# Read the pickled training set
X_train = pd.read_pickle("X_train.pkl")
y_train = pd.read_pickle("y_train.pkl")

# Label text features
Text_features = ["app","device","os","channel"]

##############################################################
# Define utility function to parse and process text features
##############################################################
# Note we avoid lambda functions since they don't pickle when we want to save the pipeline later   
def column_text_processer_nolambda(df,text_columns = Text_features):
    """A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df= pandas dataframe
    text_columns = names of the text features in df
    """ 
    import pandas as pd
    import numpy as np
    # Select only the text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings
    # (the in-place fill and the column assignment below operate on a slice of df,
    # which is the source of the SettingWithCopyWarning messages printed when the pipeline runs)
    text_data.fillna("",inplace=True)
    
    # Concatenate feature name to each category encoding for each row
    # E.g: encoding 3 at device column will read as device3 to make each encoding unique for a given feature
    for col_index in list(text_data.columns):
        text_data[col_index] = col_index + text_data[col_index].astype(str)
    
    # Join all the strings in a given row to make a vector
    # text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    text_vector = []
    for index,rows in text_data.iterrows():
        text_item = " ".join(rows).lower()
        text_vector.append(text_item)

    # return text_vector as pd.Series object to enter the tokenization pipeline
    return pd.Series(text_vector)

#######################################################################
# Define custom processing functions to add the log_total_clicks and 
# log_total_click_time features, and remove the unwanted base features
#######################################################################
def column_time_processer(X_train):
    import pandas as pd
    import numpy as np

    # Convert click_time to datetime64 dtype 
    X_train.click_time = pd.to_datetime(X_train.click_time)

    # Calculate the log_total_clicks for each ip and add as a new feature to temp_data
    temp_data = pd.DataFrame(np.log(X_train.groupby(["ip"]).size()),
                                    columns = ["log_total_clicks"]).reset_index()


    # Calculate the log_total_click_time for each ip and add as a new feature to temp_data
    # First define a function to process selected ip group 
    def get_log_total_click_time(group):
        # total_seconds() covers the whole time span; .seconds would drop any full days
        diff = (max(group.click_time) - min(group.click_time)).total_seconds()
        return np.log(diff+1)

    # Then apply this function to each ip group and extract the total click time per ip group
    log_time_frame = pd.DataFrame(X_train.groupby(["ip"]).apply(get_log_total_click_time),
                                  columns=["log_total_click_time"]).reset_index()

    # Then add this new feature to the temp_data
    temp_data = pd.merge(temp_data,log_time_frame, how = "left",on = "ip")

    # Combine temp_data with X_train to maintain X_train key order
    temp_data = pd.merge(X_train,temp_data,how = "left",on = "ip")

    # Drop features that are not needed
    temp_data = temp_data[["log_total_clicks","log_total_click_time"]]

    # Return only the numeric features as a DataFrame to integrate into the numeric feature branch of the pipeline
    return temp_data


#############################################################################
# We need to wrap these custom utility functions using FunctionTransformer.
# FunctionTransformer wraps the utility functions that parse text and numeric features.
# Note how we avoid passing any arguments to column_text_processer_nolambda or column_time_processer
#############################################################################
from sklearn.preprocessing import FunctionTransformer
get_numeric_data = FunctionTransformer(func = column_time_processer, validate=False) 
get_text_data = FunctionTransformer(func = column_text_processer_nolambda,validate=False) 

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
# Note this regex matches a run of alphanumeric characters only when it is
# followed by whitespace. Our text contains only whitespace separators, but the
# last token in each row (the channel token) has no trailing whitespace and is
# therefore not matched
#############################################################################
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   

###############################################
# Construct our feature extraction pipeline
###############################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer
from sklearn.feature_selection import SelectKBest, chi2 # We will use chi-squared as a scoring function to select features for classification
from sklearn.metrics import auc
from SparseInteractions import * #Load SparseInteractions (from : https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py) as a module since it was saved into working directory as SparseInteractions.py

userclick_pipeline1 = Pipeline([
    
    ("union",FeatureUnion(
        # Note that FeatureUnion() also accepts list of tuples, the first half of each tuple 
        # is the name of the transformer within the FeatureUnion
        
        transformer_list = [
            
            ("numeric_subpipeline",Pipeline([        # Note we have subpipeline branches inside the main pipeline
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()) # Step2: impute any missing data using default (mean), note we don't expect missing values in this case. 
            ])), # End of: numeric_subpipeline
            
            ("text_subpipeline",Pipeline([
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC, # Step2: use HashingVectorizer for automated tokenization and feature extraction
                                             ngram_range = (1,1),
                                             non_negative=True, 
                                             norm=None, binary=True )), # Note here we use binary=True since our hack is to use tokenization to generate dummy variables  
                ('dim_red', SelectKBest(chi2,300)) # Step3: use dimension reduction to select 300 best features using chi2 as scoring function
            ]))
        ]
        
    )),# End of step: union, this is the fusion point to main pipeline, all features are numeric at this stage
    
    # Common steps:
            
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms up to the second degree polynomial
    ("scaler",MaxAbsScaler()) # Scale each feature by its maximum absolute value (0 to 1 here, since all features are non-negative)
            
])# End of: userclick_pipeline1
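
To see what the token pattern in the cell above extracts, the following small check applies it with Python's `re` module to one joined row. The row string is a made-up example in the format produced by `column_text_processer_nolambda`; the second pattern, without the lookahead, is shown only for comparison and is not what the pipeline uses.

```python
import re

# Pattern copied from the pipeline cell
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Hypothetical joined row: one token per text feature, separated by single spaces
row_text = "app3 device1 os13 channel379"

# The lookahead requires whitespace after each token,
# so the final token of the row is not returned
tokens_with_lookahead = re.findall(TOKENS_ALPHANUMERIC, row_text)

# Comparison pattern without the lookahead (an alternative, not the notebook's choice)
tokens_plain = re.findall('[A-Za-z0-9]+', row_text)

print(tokens_with_lookahead)  # ['app3', 'device1', 'os13']
print(tokens_plain)           # ['app3', 'device1', 'os13', 'channel379']
```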


Fit userclick_pipeline1 using the training set:

In [239]:
import datetime
start = datetime.datetime.now()

userclick_pipeline1.fit(X_train,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


It took: 3.3333333333333335 minutes.


Having fitted our pipeline on the training set, we will pickle it and store it for reuse. This ensures consistency: every time we process a data set with it, we extract the same set of features.

In [240]:
# Pickle and store the userclick_pipeline1
import pickle
with open("userclick_pipeline1.pkl","wb") as f:
    pickle.dump(userclick_pipeline1,f)    
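
The reason the utility functions were written without lambdas is that `pickle` stores functions by reference to an importable name. A minimal sketch of that behavior, using `math.log1p` as a stand-in for a named function (it is not part of the pipeline):

```python
import math
import pickle

# A named, importable function round-trips through pickle by reference
restored = pickle.loads(pickle.dumps(math.log1p))
named_function_ok = restored is math.log1p

# An equivalent lambda has no importable name, so pickling it fails
try:
    pickle.dumps(lambda x: math.log(1 + x))
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

print(named_function_ok, lambda_pickled)
```

The same restriction applies when unpickling the pipeline in a new session: the functions wrapped by FunctionTransformer need to be defined or importable under the same names before `pickle.load` is called.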

## Transform the features in the training set using the established pipeline

In [242]:
# Re-load the userclick_pipeline1 to work with
import pickle
with open("userclick_pipeline1.pkl","rb") as f:
    userclick_pipeline1 = pickle.load(f)

In [243]:
import datetime
start = datetime.datetime.now()

# Transform the training set features
X_train_trans_pl1 = userclick_pipeline1.transform(X_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


It took: 3.0833333333333335 minutes.


In [245]:
X_train_trans_pl1.shape

(1000000, 45753)
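
The column count above can be checked by hand. The sketch below assumes that the union step yields 2 numeric features plus the 300 selected text features, and that `SparseInteractions(degree=2)` keeps the original columns and appends one product column per unordered pair of distinct columns; both are readings of the pipeline definition rather than something verified against the fitted object.

```python
from math import comb

n_numeric = 2      # log_total_clicks, log_total_click_time
n_text = 300       # features kept by SelectKBest
n_base = n_numeric + n_text

# One interaction column per unordered pair of distinct base columns
n_pairs = comb(n_base, 2)

n_total = n_base + n_pairs
print(n_base, n_pairs, n_total)  # 302 45451 45753
```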

In [246]:
type(X_train_trans_pl1)

scipy.sparse.csc.csc_matrix

We will pickle and save this transformed version of the features from the training set. We spent about 3 minutes fitting the pipeline and an additional 3 minutes transforming the features.

In the future, we will only use this pipeline with the .transform method to process any data sets we would like to use in our models.
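
As a rough illustration of the fit-once, transform-many pattern, here is a small sketch of the text branch only. The rows and labels are toy data, the plain token pattern and `k=5` are assumptions chosen for the example, and the parameter names follow current scikit-learn (`alternate_sign=False` in place of the older `non_negative=True`, and `k` passed by keyword).

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy rows in the same "feature name + code" format (illustrative only)
train_text = ["app3 device1 os13 channel379",
              "app14 device1 os13 channel478",
              "app3 device2 os19 channel379",
              "app25 device1 os30 channel259"]
train_y = [0, 1, 0, 1]
val_text = ["app3 device1 os19 channel478",
            "app25 device2 os13 channel259"]

text_pipeline = Pipeline([
    ("tokenizer", HashingVectorizer(token_pattern=r"[A-Za-z0-9]+",
                                    n_features=2 ** 10,
                                    alternate_sign=False,  # keep hashed counts non-negative for chi2
                                    norm=None, binary=True)),
    ("dim_red", SelectKBest(chi2, k=5)),
])

# Fit on the training rows only, then reuse the fitted steps on new rows
text_pipeline.fit(train_text, train_y)
val_features = text_pipeline.transform(val_text)

print(val_features.shape)
```

Because the selected columns are fixed at fit time, any data set passed through `.transform` comes out with the same columns in the same order.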

In [248]:
# Save the transformed version of training set features
import pickle
with open("X_train_trans_pl1.pkl","wb") as f:
    pickle.dump(X_train_trans_pl1,f)  

# Fitting Regularized Linear Model for predicting true events

We will start with a linear model and train a Ridge classifier with the regularization parameter alpha set to 0.5. We will look at the untuned model's performance first, then try to optimize it.

In [255]:
# Read the transformed features and target labels from the training set
import pickle
with open("X_train_trans_pl1.pkl","rb") as f:
    X_train_trans_pl1 = pickle.load(f)
with open("y_train.pkl","rb") as f:
    y_train = pickle.load(f)

from sklearn.linear_model import RidgeClassifier

# Instantiate a Ridge classifier with a medium alpha 
Ridge = RidgeClassifier(alpha=0.5, random_state= 321)

# Train the model
import datetime
start = datetime.datetime.now()

Ridge.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

It took: 1.3333333333333333 minutes.


In [270]:
# Predict class labels using training set
import datetime
start = datetime.datetime.now()

# Compute a score for each sample
# Note that there is no predict_proba on RidgeClassifier
# So we use the trick in https://stackoverflow.com/questions/22538080/scikit-learn-ridge-classifier-extracting-class-probabilities 

d = Ridge.decision_function(X= X_train_trans_pl1) # Predict confidence scores for samples
# Normalize exp(d) over all samples. Note that these values sum to 1 across the whole data set,
# so they are not per-sample class probabilities; they preserve the ranking of d, which is all ROC AUC depends on
probs = np.exp(d) / np.sum(np.exp(d))

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

It took: 0.016666666666666666 minutes.
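
ROC AUC depends only on how the samples are ranked, so any strictly increasing transformation of the decision scores gives the same value. The sketch below illustrates this with a small pairwise AUC function written for the example (it is not scikit-learn's implementation) and made-up labels and scores; a per-sample sigmoid is included as one possible alternative mapping into (0, 1).

```python
import math

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative labels and decision scores (not taken from the model)
labels = [0, 0, 1, 0, 1, 0]
decision_scores = [-1.2, -0.4, 0.3, -0.9, -0.5, -1.0]

# Normalizing exp(d) over all samples, as in the cell above
exp_scores = [math.exp(d) for d in decision_scores]
softmax_over_samples = [e / sum(exp_scores) for e in exp_scores]

# Per-sample sigmoid: each value lies in (0, 1) independently of the other samples
sigmoid_scores = [1 / (1 + math.exp(-d)) for d in decision_scores]

auc_raw = roc_auc(labels, decision_scores)
auc_softmax = roc_auc(labels, softmax_over_samples)
auc_sigmoid = roc_auc(labels, sigmoid_scores)
print(auc_raw, auc_softmax, auc_sigmoid)  # 0.875 0.875 0.875
```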


In [271]:
# Calculate the ROC AUC between the predicted scores and the observed target
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train,probs)

0.94980780103952844

Even this naive attempt gave us a ROC AUC of 0.949 on the training set. Next, we will try hyperparameter optimization to see how much further we can get.

## Hyperparameter optimization using Ridge model

In [276]:
from sklearn.model_selection import RandomizedSearchCV

import datetime
start = datetime.datetime.now()

params_space = {"alpha": np.arange(0,1,0.01)}
RidgeSearch = RandomizedSearchCV(Ridge,cv = 2,verbose=10,
                                 n_iter=20,
                                 n_jobs=3,
                                 param_distributions=params_space,
                                 random_state= 321,
                                 scoring= "roc_auc")

RidgeSearch.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

Fitting 2 folds for each of 20 candidates, totalling 40 fits
[CV] alpha=0.86 ......................................................
[CV] alpha=0.86 ......................................................
[CV] alpha=0.25 ......................................................
[CV] ............. alpha=0.86, score=0.8914263110278222, total=  37.6s
[CV] alpha=0.25 ......................................................
[CV] ............. alpha=0.86, score=0.8956513669854133, total= 1.1min
[CV] alpha=0.45 ......................................................


[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed:  1.2min


[CV] .............. alpha=0.25, score=0.884190883324053, total= 1.4min
[CV] alpha=0.45 ......................................................
[CV] ............. alpha=0.25, score=0.8914015886647204, total= 1.3min
[CV] alpha=0.32 ......................................................
[CV] ............. alpha=0.45, score=0.8886544167323791, total=  53.1s
[CV] alpha=0.32 ......................................................
[CV] ............. alpha=0.45, score=0.8941574912061789, total=  55.7s
[CV] alpha=0.2 .......................................................
[CV] ............. alpha=0.32, score=0.8875157179170297, total= 1.2min
[CV] alpha=0.2 .......................................................


[Parallel(n_jobs=3)]: Done   7 tasks      | elapsed:  3.2min


[CV] ............. alpha=0.32, score=0.8924351290802932, total= 1.2min
[CV] alpha=0.06 ......................................................
[CV] ............... alpha=0.2, score=0.882164720700341, total= 1.8min
[CV] alpha=0.06 ......................................................
[CV] .............. alpha=0.2, score=0.8905415407269656, total= 1.8min
[CV] alpha=0.15 ......................................................
[CV] ............. alpha=0.06, score=0.8786576036594884, total= 2.8min
[CV] alpha=0.15 ......................................................
[CV] .............. alpha=0.15, score=0.883062963076619, total= 1.8min
[CV] alpha=0.82 ......................................................


[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:  6.7min


[CV] ............. alpha=0.06, score=0.8759978611806225, total= 2.7min
[CV] alpha=0.82 ......................................................
[CV] ............. alpha=0.82, score=0.8913547259152571, total=  33.9s
[CV] alpha=0.08 ......................................................
[CV] ............. alpha=0.15, score=0.8884136760769671, total= 1.7min
[CV] alpha=0.08 ......................................................
[CV] ............. alpha=0.82, score=0.8954961726441188, total=  55.6s
[CV] alpha=0.38 ......................................................
[CV] ............. alpha=0.38, score=0.8886718559207138, total=  57.1s
[CV] alpha=0.38 ......................................................
[CV] .............. alpha=0.08, score=0.880409659046294, total= 2.2min
[CV] alpha=0.69 ......................................................
[CV] ............. alpha=0.38, score=0.8924743918088495, total= 1.0min
[CV] alpha=0.69 ......................................................


[Parallel(n_jobs=3)]: Done  19 tasks      | elapsed:  9.9min


[CV] ............. alpha=0.08, score=0.8830280871683156, total= 2.3min
[CV] alpha=0.48 ......................................................
[CV] ............. alpha=0.69, score=0.8903844868681919, total=  38.9s
[CV] alpha=0.48 ......................................................
[CV] .............. alpha=0.69, score=0.894370026181325, total=  40.2s
[CV] alpha=0.95 ......................................................
[CV] ............. alpha=0.48, score=0.8886348535251752, total=  48.8s
[CV] alpha=0.95 ......................................................
[CV] ............. alpha=0.48, score=0.8942835960807918, total=  50.9s
[CV] alpha=0.42 ......................................................
[CV] ............. alpha=0.95, score=0.8912950972786214, total=  29.2s
[CV] alpha=0.42 ......................................................
[CV] .............. alpha=0.42, score=0.888656706320648, total=  54.5s
[CV] alpha=0.67 ......................................................


[Parallel(n_jobs=3)]: Done  26 tasks      | elapsed: 12.0min


[CV] ............. alpha=0.42, score=0.8933858578180848, total=  58.5s
[CV] alpha=0.67 ......................................................
[CV] ............. alpha=0.95, score=0.8976710720537423, total= 1.3min
[CV] alpha=0.74 ......................................................
[CV] ............. alpha=0.67, score=0.8904195284489833, total=  38.5s
[CV] alpha=0.74 ......................................................
[CV] ............. alpha=0.67, score=0.8943406596495505, total=  42.0s
[CV] alpha=0.7 .......................................................
[CV] ............. alpha=0.74, score=0.8902902991673008, total=  36.1s
[CV] alpha=0.7 .......................................................
[CV] ............. alpha=0.74, score=0.8951258526506547, total=  43.6s
[CV] alpha=0.44 ......................................................
[CV] .............. alpha=0.7, score=0.8904273622468626, total=  38.9s
[CV] alpha=0.44 ......................................................
[CV] .

[Parallel(n_jobs=3)]: Done  35 tasks      | elapsed: 14.3min


[CV] ............. alpha=0.51, score=0.8885397528273091, total=  47.1s
[CV] alpha=0.53 ......................................................
[CV] ............. alpha=0.44, score=0.8940703814673404, total=  55.5s
[CV] alpha=0.53 ......................................................
[CV] ............. alpha=0.51, score=0.8943254518544484, total=  51.8s
[CV] ............. alpha=0.53, score=0.8885548291037833, total=  47.6s
[CV] ............. alpha=0.53, score=0.8943041770442346, total=  49.3s


[Parallel(n_jobs=3)]: Done  40 out of  40 | elapsed: 15.2min remaining:    0.0s
[Parallel(n_jobs=3)]: Done  40 out of  40 | elapsed: 15.2min finished


It took: 16.15 minutes.


In [277]:
RidgeSearch.best_score_

0.89448307829020712

In [278]:
RidgeSearch.best_params_

{'alpha': 0.95000000000000007}
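
For reference, `best_score_` summarizes the per-fold scores of the winning candidate. The check below averages the two fold scores logged for alpha=0.95 in the search output above; the small remaining difference from the reported value may come from how the scikit-learn version used here weights the folds, which is an assumption rather than something checked.

```python
from statistics import mean

# Fold scores for alpha=0.95, copied from the search log above
fold_scores_alpha_095 = [0.8912950972786214, 0.8976710720537423]
reported_best_score = 0.89448307829020712

mean_fold_score = mean(fold_scores_alpha_095)
difference = abs(mean_fold_score - reported_best_score)
print(mean_fold_score, difference)
```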

In [280]:
# Use 'alpha': 0.95 to re-fit the Ridge classifier and calculate the performance
# Read the transformed features and target labels from the training set
import pickle
with open("X_train_trans_pl1.pkl","rb") as f:
    X_train_trans_pl1 = pickle.load(f)
with open("y_train.pkl","rb") as f:
    y_train = pickle.load(f)

from sklearn.linear_model import RidgeClassifier

# Instantiate a Ridge classifier with the best alpha found by the search
Ridge = RidgeClassifier(alpha=0.95, random_state= 321)

# Train the model
import datetime
start = datetime.datetime.now()

Ridge.fit(X_train_trans_pl1,y_train)
d = Ridge.decision_function(X= X_train_trans_pl1) # Predict confidence scores for samples
probs = np.exp(d) / np.sum(np.exp(d)) # Normalize exp(d) over all samples; this preserves the ranking used by ROC AUC

# Calculate ROC score between the predicted probability and the observed target
from sklearn.metrics import roc_auc_score
print( "The ROC score is:" + str(roc_auc_score(y_train,probs)))

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")


The ROC score is:0.949325548082
It took: 1.05 minutes.


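We could not improve the performance of the Ridge model using this approach: refitting with alpha = 0.95 gives a training-set ROC AUC of 0.9493, slightly below the 0.9498 obtained with alpha = 0.5. Note that the best cross-validated score (0.894) is measured on held-out folds, so it is not directly comparable to these training-set scores. Let's save this model and continue to explore other types of classifiers.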

In [281]:
import pickle
with open("Ridge_classifier.pkl", "wb") as f:
    pickle.dump(Ridge, f)

# Training Naive Bayes Classifier

In [1]:
# Read the transformed features and target labels from the training set
import pickle
import numpy as np
with open("X_train_trans_pl1.pkl","rb") as f:
    X_train_trans_pl1 = pickle.load(f)
with open("y_train.pkl","rb") as f:
    y_train = pickle.load(f)    

In [2]:
# Train the model
import datetime
start = datetime.datetime.now()

from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

It took: 0.016666666666666666 minutes.


In [3]:
# Get the probability predictions
probs = mnb.predict_proba(X_train_trans_pl1)  

In [6]:
probs # We need the second column which is the probabilities for class label: 1

array([[  1.00000000e+00,   6.32335454e-15],
       [  1.00000000e+00,   7.52020576e-15],
       [  1.00000000e+00,   1.22347330e-14],
       ..., 
       [  1.00000000e+00,   1.74420279e-12],
       [  1.00000000e+00,   5.90993405e-16],
       [  1.00000000e+00,   5.74730378e-13]])

In [8]:
probs = probs[:,1]

In [9]:
# Calculate the roc score for the training set
from sklearn.metrics import roc_auc_score
print("NB roc score is: " + str(roc_auc_score(y_train,probs)))

NB roc score is: 0.946744947288


This is our untuned classifier, which performs similarly to Ridge on the training set. Can we tune it to perform better? Looking into the class description, we find the following parameters that can be tuned:

Parameters

    alpha : float, optional (default=1.0)
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    fit_prior : boolean, optional (default=True)
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like, size (n_classes,), optional (default=None)
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.
 

- We can search over alpha (0 to 1).
- We will leave fit_prior = True.
- We have some idea about the prior probability of class 1: sum(y_train)/len(y_train) is 0.00169 in our training set. We will include this information in our hyperparameter search to see if it makes any difference.

Since the classifier trains very fast, we can perform an exhaustive grid search with 3-fold CV.
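
The size of this search can be worked out before running it. The sketch below rebuilds the grid in plain Python; it takes the 1693 non-null attributed_time rows from `training_set.info()` as the number of positive labels, which is an assumption, and the variable names are illustrative.

```python
from itertools import product

# Class prior for label 1 (assumes 1693 positives among 1,000,000 rows)
n_attributed = 1693
n_rows = 1000000
prior_class1 = round(n_attributed / n_rows, 5)

# Same values as np.arange(0, 1.1, 0.1): 0.0, 0.1, ..., 1.0
alphas = [i / 10 for i in range(11)]
fit_priors = [True]
class_priors = [None, [1 - prior_class1, prior_class1]]

# An exhaustive grid search tries every combination once per CV fold
candidates = list(product(alphas, fit_priors, class_priors))
n_folds = 3
n_fits = len(candidates) * n_folds

print(prior_class1, len(candidates), n_fits)  # 0.00169 22 66
```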

## Hyperparameter optimization for Naive Bayes Classifier

In [17]:
import datetime
start = datetime.datetime.now()

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()

params_space = {
    "alpha":np.arange(0,1.1,0.1),
    "fit_prior":[True],
    "class_prior":[None,[1-0.00169,0.00169]]
}

MNBsearch = GridSearchCV(mnb,
                         param_grid= params_space,
                         scoring="roc_auc",
                         cv =3, n_jobs=3,verbose=10)

MNBsearch.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

Fitting 3 folds for each of 22 candidates, totalling 66 fits
[CV] alpha=0.0, class_prior=None, fit_prior=True .....................
[CV] alpha=0.0, class_prior=None, fit_prior=True .....................


  'setting alpha = %.1e' % _ALPHA_MIN)


[CV] alpha=0.0, class_prior=None, fit_prior=True .....................


  'setting alpha = %.1e' % _ALPHA_MIN)


[CV]  alpha=0.0, class_prior=None, fit_prior=True, score=0.8959370258094743, total=   1.6s
[CV] alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True .......


  'setting alpha = %.1e' % _ALPHA_MIN)


[CV]  alpha=0.0, class_prior=None, fit_prior=True, score=0.8868648185207342, total=   1.9s
[CV] alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True .......


[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed:    5.2s
  'setting alpha = %.1e' % _ALPHA_MIN)


[CV]  alpha=0.0, class_prior=None, fit_prior=True, score=0.8857097885869714, total=   2.0s
[CV] alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True .......


  'setting alpha = %.1e' % _ALPHA_MIN)


[CV]  alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.8959370204907351, total=   1.7s
[CV] alpha=0.1, class_prior=None, fit_prior=True .....................


  'setting alpha = %.1e' % _ALPHA_MIN)


[CV]  alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.886864834505243, total=   1.7s
[CV] alpha=0.1, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.8857097859228865, total=   1.6s
[CV] alpha=0.1, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.1, class_prior=None, fit_prior=True, score=0.9435606570419294, total=   1.5s
[CV] alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True .......


[Parallel(n_jobs=3)]: Done   7 tasks      | elapsed:    9.3s


[CV]  alpha=0.1, class_prior=None, fit_prior=True, score=0.9327180517680261, total=   1.7s
[CV] alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.1, class_prior=None, fit_prior=True, score=0.9505139967923142, total=   1.5s
[CV] alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9435606517231905, total=   1.5s
[CV] alpha=0.2, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9327180624243653, total=   1.6s
[CV] alpha=0.2, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9505139914641447, total=   1.7s
[CV] alpha=0.2, class_prior=None, fit_prior=True .....................


[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:   13.2s


[CV]  alpha=0.2, class_prior=None, fit_prior=True, score=0.9444671084783707, total=   1.6s
[CV] alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.2, class_prior=None, fit_prior=True, score=0.9326124714247604, total=   1.6s
[CV] alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.2, class_prior=None, fit_prior=True, score=0.9520073601628835, total=   1.5s
[CV] alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9444671270939577, total=   1.4s
[CV] alpha=0.3, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9326125300346252, total=   1.7s
[CV] alpha=0.3, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9520073814755616, total=   1.7s
[CV] alpha=0.3, class_prior=None, fit_prior=True ........

[Parallel(n_jobs=3)]: Done  19 tasks      | elapsed:   18.9s


[CV]  alpha=0.3, class_prior=None, fit_prior=True, score=0.931901677625326, total=   1.5s
[CV] alpha=0.3, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.3, class_prior=None, fit_prior=True, score=0.9524465345361612, total=   1.6s
[CV] alpha=0.3, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.3, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9446671994447662, total=   1.8s
[CV] alpha=0.4, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.3, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9319016589767328, total=   1.7s
[CV] alpha=0.4, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.3, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9524465345361612, total=   1.6s
[CV] alpha=0.4, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.4, class_prior=None, fit_prior=True, score=0.9445732731709474, total=   1.4s
[CV] alpha=0.4, class_prior=[0.99831, 0.00169], fit_prior=

[Parallel(n_jobs=3)]: Done  26 tasks      | elapsed:   24.6s


[CV]  alpha=0.4, class_prior=None, fit_prior=True, score=0.9524166168642662, total=   1.4s
[CV] alpha=0.4, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.4, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9445732731709476, total=   1.4s
[CV] alpha=0.5, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.4, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9310744313527057, total=   1.4s
[CV] alpha=0.5, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.4, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9524165928875032, total=   1.5s
[CV] alpha=0.5, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.5, class_prior=None, fit_prior=True, score=0.9442627013551642, total=   1.5s
[CV] alpha=0.5, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.5, class_prior=None, fit_prior=True, score=0.9302755229497157, total=   1.5s
[CV] alpha=0.5, class_prior=[0.99831, 0.00169], fit_prior

[Parallel(n_jobs=3)]: Done  35 tasks      | elapsed:   31.6s


[CV]  alpha=0.5, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9521864452688615, total=   1.5s
[CV] alpha=0.6, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.6, class_prior=None, fit_prior=True, score=0.9438514218731604, total=   1.6s
[CV] alpha=0.6, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.6, class_prior=None, fit_prior=True, score=0.9294995310038618, total=   1.7s
[CV] alpha=0.6, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.6, class_prior=None, fit_prior=True, score=0.9518519294655212, total=   1.4s
[CV] alpha=0.6, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.6, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9438514218731604, total=   1.5s
[CV] alpha=0.7, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.6, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9294995310038618, total=   1.6s
[CV] alpha=0.7, class_prior=None, fit_prior=True ........

[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:   38.8s


[CV]  alpha=0.7, class_prior=None, fit_prior=True, score=0.9514674114552534, total=   1.6s
[CV] alpha=0.7, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.7, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9433635685037778, total=   1.6s
[CV] alpha=0.8, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.7, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9287725022718782, total=   1.6s
[CV] alpha=0.8, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.7, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9514674061270838, total=   1.8s
[CV] alpha=0.8, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.8, class_prior=None, fit_prior=True, score=0.942827681601234, total=   2.0s
[CV] alpha=0.8, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.8, class_prior=None, fit_prior=True, score=0.9280890526384574, total=   2.1s
[CV] alpha=0.8, class_prior=[0.99831, 0.00169], fit_prior=

[Parallel(n_jobs=3)]: Done  55 tasks      | elapsed:   48.8s


[CV]  alpha=0.9, class_prior=None, fit_prior=True, score=0.9274512973869017, total=   1.7s
[CV] alpha=0.9, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.9, class_prior=None, fit_prior=True, score=0.9506482853129924, total=   1.5s
[CV] alpha=0.9, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.9, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9422726206440957, total=   1.9s
[CV] alpha=1.0, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.9, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9274512973869015, total=   2.1s
[CV] alpha=1.0, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.9, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9506482853129923, total=   2.2s
[CV] alpha=1.0, class_prior=None, fit_prior=True .....................
[CV]  alpha=1.0, class_prior=None, fit_prior=True, score=0.9417146237429526, total=   2.3s
[CV] alpha=1.0, class_prior=[0.99831, 0.00169], fit_prior

[Parallel(n_jobs=3)]: Done  66 out of  66 | elapsed:   58.1s finished


It took: 1.0 minutes.


In [18]:
MNBsearch.best_params_

{'alpha': 0.20000000000000001,
 'class_prior': [0.99831, 0.00169],
 'fit_prior': True}

In [19]:
MNBsearch.best_score_

0.94302901430616237

In [24]:
from sklearn.metrics import roc_auc_score
probs = MNBsearch.best_estimator_.predict_proba(X_train_trans_pl1)[:,1]
roc_auc_score(y_train,probs)

0.95332437154555072

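It looks like the hyperparameter search improved MNB model performance: the best cross-validated ROC AUC is 0.943, and the ROC AUC computed on the training set itself is 0.953. We will store the best estimator we obtained for future use: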

In [25]:
import pickle
with open("mnb.pkl","wb") as f:
    pickle.dump(MNBsearch.best_estimator_,f)
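
The ROC AUC values reported above can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The following is a minimal pure-Python sketch of that definition. The labels and scores are made up for illustration (they are not the competition data), and `pairwise_auc` is a hypothetical helper, not part of scikit-learn.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Toy data, invented for illustration
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

The double loop is quadratic and only meant to illustrate the definition; `roc_auc_score` should be preferred on real data. Note also that a score computed on the rows used for fitting (0.953 above) is expected to be somewhat more optimistic than the cross-validated `best_score_` (0.943).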

# Training Support Vector Machine Classifiers

In [3]:
# Read the transformed features and target labels from the training set
import pickle
import numpy as np
with open("X_train_trans_pl1.pkl","rb") as f:
    X_train_trans_pl1 = pickle.load(f)
with open("y_train.pkl","rb") as f:
    y_train = pickle.load(f) 

In [4]:
import datetime
start = datetime.datetime.now()

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
# Note that we can't get probabilities directly from this LinearSVC function
# We need to wrap into Calibrated Classifier 
# (see: https://stackoverflow.com/questions/35212213/sklearn-how-to-get-decision-probabilities-for-linearsvc-classifier)

lsvc = LinearSVC(verbose=10)

cal_lsvc = CalibratedClassifierCV(base_estimator = lsvc,
                                  cv = 3, # Also performs cross-validation
                                  method= "sigmoid") # We use sigmoid function to get probabilities

cal_lsvc.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

[LibLinear][LibLinear][LibLinear]It took: 2.25 minutes.


In [7]:
probs = cal_lsvc.predict_proba(X_train_trans_pl1)[:,1] 

In [10]:
# Calculate ROC score
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train,probs)

0.95099179905081954

This is a good start for an untuned classifier, although the score is computed on the same data the model was trained on. Let's perform a hyperparameter search to see if we can improve this performance.
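
With `method="sigmoid"`, the calibration wrapper turns each SVM decision score `f` into a probability through a fitted sigmoid of the form `p = 1 / (1 + exp(A*f + B))` (Platt scaling). The sketch below shows only this mapping. The coefficients `A` and `B` are made-up values (scikit-learn estimates them from held-out folds), and `platt_probability` is a hypothetical helper used for illustration.

```python
import math

def platt_probability(score, a, b):
    """Map a decision score to a probability with a Platt-style sigmoid."""
    return 1.0 / (1.0 + math.exp(a * score + b))

# Hypothetical coefficients; a negative `a` makes larger scores more probable
A, B = -2.0, 0.5

decision_scores = [-1.5, -0.2, 0.0, 0.8, 2.0]
probabilities = [platt_probability(s, A, B) for s in decision_scores]

for s, p in zip(decision_scores, probabilities):
    print(f"score={s:+.1f} -> p={p:.3f}")
```

A single sigmoid of this kind is monotonic, so it rescales the scores into the (0, 1) range without changing their order.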

## Hyperparameter tuning for SVC classifier

We need to understand which LinearSVC parameters we can tune through the calibrated classifier wrapper. Inside the wrapper they are exposed with the `base_estimator__` prefix:

In [12]:
cal_lsvc.get_params()

{'base_estimator': LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
      intercept_scaling=1, loss='squared_hinge', max_iter=1000,
      multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
      verbose=10),
 'base_estimator__C': 1.0,
 'base_estimator__class_weight': None,
 'base_estimator__dual': True,
 'base_estimator__fit_intercept': True,
 'base_estimator__intercept_scaling': 1,
 'base_estimator__loss': 'squared_hinge',
 'base_estimator__max_iter': 1000,
 'base_estimator__multi_class': 'ovr',
 'base_estimator__penalty': 'l2',
 'base_estimator__random_state': None,
 'base_estimator__tol': 0.0001,
 'base_estimator__verbose': 10,
 'cv': 3,
 'method': 'sigmoid'}

 Parameters
 
 |  penalty : string, 'l1' or 'l2' (default='l2')
 |      Specifies the norm used in the penalization. The 'l2'
 |      penalty is the standard used in SVC. The 'l1' leads to ``coef_``
 |      vectors that are sparse.
 |  
 
 
 |  loss : string, 'hinge' or 'squared_hinge' (default='squared_hinge')
 |      Specifies the loss function. 'hinge' is the standard SVM loss
 |      (used e.g. by the SVC class) while 'squared_hinge' is the
 |      square of the hinge loss.
 
 
 |  dual : bool, (default=True)
 |      Select the algorithm to either solve the dual or primal
 |      optimization problem. Prefer dual=False when n_samples > n_features.
 
 
 |  tol : float, optional (default=1e-4)
 |      Tolerance for stopping criteria.
 
 
 |  C : float, optional (default=1.0)
 |      Penalty parameter C of the error term.
 
 
 |  multi_class : string, 'ovr' or 'crammer_singer' (default='ovr')
 |      Determines the multi-class strategy if `y` contains more than
 |      two classes.
 |      ``"ovr"`` trains n_classes one-vs-rest classifiers, while
 |      ``"crammer_singer"`` optimizes a joint objective over all classes.
 |      While `crammer_singer` is interesting from a theoretical perspective
 |      as it is consistent, it is seldom used in practice as it rarely leads
 |      to better accuracy and is more expensive to compute.
 |      If ``"crammer_singer"`` is chosen, the options loss, penalty and dual
 |      will be ignored.
 
 
 |  fit_intercept : boolean, optional (default=True)
 |      Whether to calculate the intercept for this model. If set
 |      to false, no intercept will be used in calculations
 |      (i.e. data is expected to be already centered).
 
 
 |  intercept_scaling : float, optional (default=1)
 |      When self.fit_intercept is True, instance vector x becomes
 |      ``[x, self.intercept_scaling]``,
 |      i.e. a "synthetic" feature with constant value equals to
 |      intercept_scaling is appended to the instance vector.
 |      The intercept becomes intercept_scaling * synthetic feature weight
 |      Note! the synthetic feature weight is subject to l1/l2 regularization
 |      as all other features.
 |      To lessen the effect of regularization on synthetic feature weight
 |      (and therefore on the intercept) intercept_scaling has to be increased.
 
 
 |  class_weight : {dict, 'balanced'}, optional
 |      Set the parameter C of class i to ``class_weight[i]*C`` for
 |      SVC. If not given, all classes are supposed to have
 |      weight one.
 |      The "balanced" mode uses the values of y to automatically adjust
 |      weights inversely proportional to class frequencies in the input data
 |      as ``n_samples / (n_classes * np.bincount(y))``
 
 
 |  verbose : int, (default=0)
 |      Enable verbose output. Note that this setting takes advantage of a
 |      per-process runtime setting in liblinear that, if enabled, may not work
 |      properly in a multithreaded context.
 
 
 |  random_state : int, RandomState instance or None, optional (default=None)
 |      The seed of the pseudo random number generator to use when shuffling
 |      the data.  If int, random_state is the seed used by the random number
 |      generator; If RandomState instance, random_state is the random number
 |      generator; If None, the random number generator is the RandomState
 |      instance used by `np.random`.
 
 
 |  max_iter : int, (default=1000)
 |      The maximum number of iterations to be run.

In [29]:
import datetime
start = datetime.datetime.now()

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import RandomizedSearchCV

lsvc = LinearSVC(verbose=10)

cal_lsvc = CalibratedClassifierCV(base_estimator = lsvc,
                                  cv = 3, # Also performs cross-validation if needed
                                  method= "sigmoid") # We use sigmoid function to get probabilities

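# Note: np.logspace takes exponents, so with base = 2 the C values
# range from 2**0.1 (~1.07) up to 2**100 (~1.27e30)
params_space = {
    "base_estimator__penalty":['l2'],
    "base_estimator__dual":[False,True],
    "base_estimator__C":np.logspace(0.1,100,base = 2, num=100)
}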

CAL_LSVC_search = RandomizedSearchCV(cal_lsvc,
                                     param_distributions= params_space,
                                     n_jobs=3, cv = 3, 
                                     n_iter = 10,verbose=10,
                                     scoring="roc_auc"  )

CAL_LSVC_search.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.21208098657e+17 
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.21208098657e+17 
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.21208098657e+17 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.21208098657e+17, score=0.8840566440686211, total=22.3min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.00778542013e+14 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.21208098657e+17, score=0.8177943609594873, total=22.5min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.00778542013e+14 


[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed: 22.6min


[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.21208098657e+17, score=0.8700200891436635, total=22.6min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.00778542013e+14 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.00778542013e+14, score=0.8676939739830596, total=40.0min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=11334486647.2 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.00778542013e+14, score=0.8831522714858911, total=41.2min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=11334486647.2 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.00778542013e+14, score=0.8876377920585509, total=42.3min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=11334486647.2 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=11334486647.2, score=0.8613303863263513, total=25.2min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=3.51965492688e+25 


[Parallel(n_jobs=3)]: Done   7 tasks      | elapsed: 87.9min


[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=11334486647.2, score=0.8810403140175892, total=24.8min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=3.51965492688e+25 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=11334486647.2, score=0.8418134881077068, total=24.7min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=3.51965492688e+25 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=3.51965492688e+25, score=0.8735556797658436, total=12.9min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=5.29541857683e+23 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=3.51965492688e+25, score=0.882629551122085, total=15.9min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=5.29541857683e+23 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=3.51965492688e+25, score=0.883657409654119, total=18.2min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=5.29541857683e+23 


[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed: 106.6min


[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=5.29541857683e+23, score=0.8835153339382337, total=15.5min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=314704.081426 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=5.29541857683e+23, score=0.8729862955856605, total=12.9min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=314704.081426 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=5.29541857683e+23, score=0.8886630011417841, total=18.2min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=314704.081426 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=314704.081426, score=0.8644759524778105, total=10.0min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=6.69835040938e+15 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=314704.081426, score=0.869363875594573, total= 9.9min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=6.69835040938e+15 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=314704.081426, score=0.8276504835452378, total=10.4min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=6.69835040938e+15 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=6.69835040938e+15, score=0.8703198586002563, total=12.4min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.9591531793e+19 


[Parallel(n_jobs=3)]: Done  19 tasks      | elapsed: 140.5min


[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=6.69835040938e+15, score=0.8839019726354164, total=11.9min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.9591531793e+19 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=6.69835040938e+15, score=0.8127863398265177, total=23.0min
[CV] base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.9591531793e+19 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.9591531793e+19, score=0.8759856507864273, total=21.5min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=9.47617652913e+27 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.9591531793e+19, score=0.8800628027079634, total=21.1min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=9.47617652913e+27 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=True, base_estimator__C=2.9591531793e+19, score=0.8740074286192056, total=10.4min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=9.47617652913e+27 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=9.47617652913e+27, score=0.8870179021594196, total=14.9min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.07177346254 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=9.47617652913e+27, score=0.8828418273198472, total=15.8min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.07177346254 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=9.47617652913e+27, score=0.8754417665277526, total=13.4min
[CV] base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.07177346254 




[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.07177346254, score=0.8965921880965714, total=358.8min
[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.07177346254, score=0.9036990529221292, total=359.4min
[LibLinear][LibLinear][LibLinear][CV]  base_estimator__penalty=l2, base_estimator__dual=False, base_estimator__C=1.07177346254, score=0.8892588556681781, total=359.2min


[Parallel(n_jobs=3)]: Done  30 out of  30 | elapsed: 538.5min finished


[LibLinear][LibLinear][LibLinear]It took: 553.1833333333333 minutes.
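
The timing lines in these cells report `process_time.seconds / 60`. That is accurate for runs shorter than one day, such as the ones shown here, but the `.seconds` attribute of a `timedelta` holds only the seconds component and excludes whole days. A short sketch with a made-up duration shows the difference from `total_seconds()`:

```python
import datetime

# A hypothetical run that lasts 1 day, 2 hours and 30 minutes
elapsed = datetime.timedelta(days=1, hours=2, minutes=30)

# .seconds holds only the seconds component (0..86399); days are kept separately
minutes_from_seconds_attr = elapsed.seconds / 60

# .total_seconds() covers the whole duration
minutes_from_total = elapsed.total_seconds() / 60

print(minutes_from_seconds_attr)
print(minutes_from_total)
```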


In [30]:
CAL_LSVC_search.best_params_

{'base_estimator__C': 1.0717734625362931,
 'base_estimator__dual': False,
 'base_estimator__penalty': 'l2'}

In [31]:
CAL_LSVC_search.best_score_

0.8965166989711153

In [33]:
probs = CAL_LSVC_search.best_estimator_.predict_proba(X_train_trans_pl1)[:,1]
roc_auc_score(y_train,probs)

0.9509676015575883

In [None]:
import pickle
with open("svc.pkl","wb") as f:
    pickle.dump(CAL_LSVC_search.best_estimator_,f)
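
The `C` values in the search log above follow from how `np.logspace` interprets its arguments: the first two are exponents, so `np.logspace(0.1, 100, base=2, num=100)` spans roughly 1.07 to 1.27e30. The best candidate (C of about 1.07) is the lower end of that grid. The sketch below reproduces the endpoints with a pure-Python stand-in for `np.logspace`. The narrower grid at the end is only an assumption about what a follow-up search might try, not something the notebook ran.

```python
def logspace(start, stop, num, base=10.0):
    """Pure-Python stand-in for np.logspace: base ** (evenly spaced exponents)."""
    if num == 1:
        return [base ** start]
    step = (stop - start) / (num - 1)
    return [base ** (start + i * step) for i in range(num)]

# Grid used in the search above: exponents 0.1 .. 100 with base 2
searched = logspace(0.1, 100, num=100, base=2)
print(min(searched), max(searched))

# Hypothetical narrower grid: C from 1e-3 to 1e2
narrow = logspace(-3, 2, num=20)
print(min(narrow), max(narrow))
```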

# Working on an early submission

Let's work on an early submission using the models we have on hand.

## Processing the test data set using the locked-down pipeline

We first need to process the test data set with the same pipeline we trained, so that exactly the same features are extracted:

In [113]:
# Label text features
Text_features = ["app","device","os","channel"]

##############################################################
# Define utility function to parse and process text features
##############################################################
# Note we avoid lambda functions since they don't pickle when we want to save the pipeline later   
def column_text_processer_nolambda(df,text_columns = Text_features):
    """A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df= pandas dataframe
    text_columns = names of the text features in df
    """ 
    import pandas as pd
    import numpy as np
    # Select only the text columns that are in the df and fill the missing
    # values with empty strings (fillna returns a new dataframe, so the
    # input df is left untouched and no SettingWithCopyWarning is raised)
    text_data = df[text_columns].fillna("")
    
    # Concatenate feature name to each category encoding for each row
    # E.g: encoding 3 at device column will read as device3 to make each encoding unique for a given feature
    for col_index in list(text_data.columns):
        text_data[col_index] = col_index + text_data[col_index].astype(str)
    
    # Join all the strings in a given row to make a vector
    # text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    text_vector = []
    for index,rows in text_data.iterrows():
        text_item = " ".join(rows).lower()
        text_vector.append(text_item)

    # return text_vector as pd.Series object to enter the tokenization pipeline
    return pd.Series(text_vector)

#######################################################################
# Define custom processing functions to add the log_total_clicks and 
# log_total_click_time features, and remove the unwanted base features
#######################################################################
def column_time_processer(X_train):
    import pandas as pd
    import numpy as np

    # Convert click_time to datetime64 dtype 
    X_train.click_time = pd.to_datetime(X_train.click_time)

    # Calculate the log_total_clicks for each ip and add as a new feature to temp_data
    temp_data = pd.DataFrame(np.log(X_train.groupby(["ip"]).size()),
                                    columns = ["log_total_clicks"]).reset_index()


    # Calculate the log_total_click_time for each ip and add as a new feature to temp_data
    # First define a function to process selected ip group 
    def get_log_total_click_time(group):
        # Note: .seconds is only the seconds component of the timedelta (whole days are not included)
        diff = (max(group.click_time) - min(group.click_time)).seconds
        return np.log(diff+1)

    # Then apply this function to each ip group and extract the total click time per ip group
    log_time_frame = pd.DataFrame(X_train.groupby(["ip"]).apply(get_log_total_click_time),
                                  columns=["log_total_click_time"]).reset_index()

    # Then add this new feature to the temp_data
    temp_data = pd.merge(temp_data,log_time_frame, how = "left",on = "ip")

    # Combine temp_data with X_train to maintain X_train key order
    temp_data = pd.merge(X_train,temp_data,how = "left",on = "ip")

    # Drop features that are not needed
    temp_data = temp_data[["log_total_clicks","log_total_click_time"]]

    # Return only the numeric features as a tensor to integrate into the numeric feature branch of the pipeline
    return temp_data


#############################################################################
# We need to wrap these custom utility functions using FunctionTransformer
# FunctionTransformer wrapper of utility functions to parse text and numeric features
# Note how we avoid putting any arguments into column_text_processer or column_time_processer
#############################################################################
from sklearn.preprocessing import FunctionTransformer
get_numeric_data = FunctionTransformer(func = column_time_processer, validate=False) 
get_text_data = FunctionTransformer(func = column_text_processer_nolambda,validate=False) 

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
# Note this regex matches a run of alphanumeric characters that is followed by
# whitespace (the lookahead does not consume the whitespace); in our case the
# only separators in the text are white spaces. A token at the very end of a
# string has no trailing whitespace and is therefore not matched.
#############################################################################
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' 

# Read the pipeline
import pickle
with open('userclick_pipeline1.pkl',"rb") as f: 
    userclick_pipeline1 = pickle.load(f)
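
To make the text branch concrete, the sketch below builds the string for one row the same way `column_text_processer_nolambda` does (column name prefixed to each encoding, joined with spaces, lowercased) and applies `TOKENS_ALPHANUMERIC` to it. The row values are invented for illustration. How the pattern is consumed inside the pickled pipeline is not shown in this section, so the snippet demonstrates only the behaviour of the regex itself.

```python
import re

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# One hypothetical row of the four categorical columns (values invented)
row = {"app": "3", "device": "1", "os": "13", "channel": "379"}

# Prefix each encoding with its column name, then join with spaces
row_text = " ".join(col + val for col, val in row.items()).lower()
print(row_text)

# Tokens matched by the pattern: alphanumeric runs followed by whitespace
tokens = re.findall(TOKENS_ALPHANUMERIC, row_text)
print(tokens)

# For comparison: the same character class without the lookahead
tokens_no_lookahead = re.findall('[A-Za-z0-9]+', row_text)
print(tokens_no_lookahead)
```

Because of the `(?=\s+)` lookahead, the last token of the string is not returned by the first `findall` call, while the pattern without the lookahead returns all four tokens.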

Note that this processing is computationally intensive, so we will perform it in chunks. First, let's check how many rows remain after the first 18 million:

In [11]:
import pandas as pd

test_set = pd.read_csv("test.csv", dtype= "str",skiprows= 18 * (10**6) , nrows= 1000000)
test_set.shape

(790469, 7)

The test dataset has 18,790,469 samples. To avoid exhausting the available memory, we will load the data in chunks of 1 million rows, which gives 19 chunks (18 full chunks plus a final chunk of 790,469 rows). As they are loaded, we will process the chunks using the pipeline we trained to extract the same features. Recall that the sklearn pipeline returns a sparse matrix. We will aggregate the processed chunks of the test data set using the `vstack` ("vertical stack") function of scipy.sparse.
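
The chunk boundaries described above can be worked out in a few lines. The sketch below is plain arithmetic and touches no files. `chunk_plan` is a hypothetical helper, and the `skiprows` values it produces assume a CSV with a single header line, as in `test.csv`.

```python
import math

N_ROWS = 18_790_469      # test rows, as stated above
CHUNK_SIZE = 1_000_000   # rows per chunk

n_chunks = math.ceil(N_ROWS / CHUNK_SIZE)

def chunk_plan(n_rows, chunk_size):
    """Yield (chunk_number, skiprows, rows_in_chunk) for each chunk.

    skiprows counts the header line plus all data rows already consumed,
    so the first chunk skips nothing and reads the header normally.
    """
    for i in range(math.ceil(n_rows / chunk_size)):
        start = i * chunk_size
        rows = min(chunk_size, n_rows - start)
        skiprows = 0 if i == 0 else start + 1
        yield i + 1, skiprows, rows

plan = list(chunk_plan(N_ROWS, CHUNK_SIZE))
print(n_chunks)
print(plan[1])
print(plan[-1])
```

An alternative worth considering is the `chunksize` argument of `pd.read_csv`, which returns an iterator over consecutive chunks and avoids re-scanning the file from the start for every chunk.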

In [114]:
import datetime
start = datetime.datetime.now()

import pandas as pd
from scipy.sparse import vstack

filename = "/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/test.csv"

test_set = pd.read_csv(filename, dtype= "str", nrows= 1000000)

end = datetime.datetime.now()
process_time = end - start
print("Finished reading first chunk " + "so far it took : " + str(process_time.seconds/60) + " minutes.")


test_proc_p11 = userclick_pipeline1.transform(test_set)

end = datetime.datetime.now()
process_time = end - start
print("Finished processing first chunk " + "so far it took : " + str(process_time.seconds/60) + " minutes.")

print("Shape of the tensor is : " + str(test_proc_p11.shape))

column_names = test_set.columns

skip = (10**6)+1

end = datetime.datetime.now()
process_time = end - start
print("Added chunk : " + "1 " + "so far it took : " + str(process_time.seconds/60) + " minutes.")

for i in range(2,20): 
    test_set = pd.read_csv(filename, dtype= "str",skiprows= skip, nrows= 1000000,names = list(column_names)) 
    temp_stack = userclick_pipeline1.transform(test_set)
    test_proc_p11 = vstack([test_proc_p11,temp_stack]) 
    skip = skip + (10**6)
    end = datetime.datetime.now()
    process_time = end - start
    print("Added chunk: " + str(i) + " so far it took : " + str(process_time.seconds/60) + " minutes.")
    print("Shape of the tensor is : " + str(test_proc_p11.shape))

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

Finished reading first chunk so far it took : 0.03333333333333333 minutes.




Finished processing first chunk so far it took : 2.85 minutes.
Shape of the tensor is : (1000000, 45753)
Added chunk : 1 so far it took : 2.85 minutes.




Added chunk: 2 so far it took : 6.133333333333334 minutes.
Shape of the tensor is : (2000000, 45753)




Added chunk: 3 so far it took : 10.55 minutes.
Shape of the tensor is : (3000000, 45753)




Added chunk: 4 so far it took : 13.966666666666667 minutes.
Shape of the tensor is : (4000000, 45753)




Added chunk: 5 so far it took : 17.25 minutes.
Shape of the tensor is : (5000000, 45753)




Added chunk: 6 so far it took : 20.333333333333332 minutes.
Shape of the tensor is : (6000000, 45753)




Added chunk: 7 so far it took : 23.55 minutes.
Shape of the tensor is : (7000000, 45753)




Added chunk: 8 so far it took : 27.216666666666665 minutes.
Shape of the tensor is : (8000000, 45753)




Added chunk: 9 so far it took : 30.716666666666665 minutes.
Shape of the tensor is : (9000000, 45753)




Added chunk: 10 so far it took : 34.1 minutes.
Shape of the tensor is : (10000000, 45753)




Added chunk: 11 so far it took : 37.81666666666667 minutes.
Shape of the tensor is : (11000000, 45753)




Added chunk: 12 so far it took : 41.7 minutes.
Shape of the tensor is : (12000000, 45753)




Added chunk: 13 so far it took : 45.61666666666667 minutes.
Shape of the tensor is : (13000000, 45753)




Added chunk: 14 so far it took : 49.483333333333334 minutes.
Shape of the tensor is : (14000000, 45753)




Added chunk: 15 so far it took : 53.666666666666664 minutes.
Shape of the tensor is : (15000000, 45753)




Added chunk: 16 so far it took : 57.833333333333336 minutes.
Shape of the tensor is : (16000000, 45753)




Added chunk: 17 so far it took : 62.15 minutes.
Shape of the tensor is : (17000000, 45753)




Added chunk: 18 so far it took : 66.55 minutes.
Shape of the tensor is : (18000000, 45753)




Added chunk: 19 so far it took : 70.73333333333333 minutes.
Shape of the tensor is : (18790469, 45753)
It took: 70.75 minutes.


In [8]:
# Note: writing files over 4GB through pickle runs into a known bug:
# https://stackoverflow.com/questions/31468117/python-3-can-pickle-handle-byte-objects-larger-than-4gb
# The workaround is to write the pickled bytes in chunks of at most 2**31 - 1 bytes,
# and to read them back from the file the same way.
# The file does not fit on the internal disk, so we save it to an external drive.

import pickle
import os.path

file_path = "/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/test_proc_pl1.pkl"
n_bytes = 2**31
max_bytes = 2**31 - 1


## write in chunks
bytes_out = pickle.dumps(test_proc_p11)
with open(file_path, 'wb') as f_out:
    for idx in range(0, n_bytes, max_bytes):
        f_out.write(bytes_out[idx:idx+max_bytes])



We finally saved the processed test data set to the external disk. Before making predictions with the models we prepared so far, let's read it back:

### Reading the large sparse matrix in chunks

In [2]:
import pickle
import os.path
import pandas as pd

file_path = "/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/test_proc_pl1.pkl"
n_bytes = 2**31
max_bytes = 2**31 - 1

## read in chunks
bytes_in = bytearray(0)
input_size = os.path.getsize(file_path)
with open(file_path, 'rb') as f_in:
    for idx in range(0, input_size, max_bytes):
        bytes_in += f_in.read(max_bytes)
test_proc_p11 = pickle.loads(bytes_in)

UnpicklingError: pickle data was truncated

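We get this error when reading the file back. The most likely cause is the write loop above: it iterates over range(0, n_bytes, max_bytes) with n_bytes fixed at 2**31 rather than len(bytes_out), so it writes only two chunks (just under 4GB) and drops the rest of the pickle. Rather than repair the chunked write, we will use a different method to save and load the processed test set sparse matrix.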
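
For reference, here is a minimal sketch of a chunked pickle round trip. It assumes the truncation came from the write loop being bounded by n_bytes = 2**31 instead of the length of the pickled bytes. The helper names dump_in_chunks and load_in_chunks are hypothetical and do not appear elsewhere in this notebook.

```python
import os
import pickle

MAX_BYTES = 2**31 - 1  # largest chunk written or read in a single call


def dump_in_chunks(obj, file_path, max_bytes=MAX_BYTES):
    """Pickle obj and write the resulting bytes to file_path in chunks."""
    # Protocol 4 (Python 3.4+) supports very large objects.
    bytes_out = pickle.dumps(obj, protocol=4)
    with open(file_path, "wb") as f_out:
        # Loop over the full length of the pickle, not a fixed bound.
        for idx in range(0, len(bytes_out), max_bytes):
            f_out.write(bytes_out[idx:idx + max_bytes])


def load_in_chunks(file_path, max_bytes=MAX_BYTES):
    """Read file_path in chunks and unpickle the concatenated bytes."""
    bytes_in = bytearray()
    input_size = os.path.getsize(file_path)
    with open(file_path, "rb") as f_in:
        for _ in range(0, input_size, max_bytes):
            bytes_in += f_in.read(max_bytes)
    return pickle.loads(bytes_in)
```

With this loop bound the file on disk should have the same size as the pickled byte string, which gives a quick check before trying to load it.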

### Saving and reading the large sparse matrix using scipy.sparse.save_npz and .load_npz:

In [115]:
test_proc_p11

<18790469x45753 sparse matrix of type '<class 'numpy.float64'>'
	with 272396216 stored elements in COOrdinate format>

In [116]:
# Save a sparse matrix to a file using .npz format
# See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html
import scipy.sparse as sp
sp.save_npz("/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/test_proc_pl1.npz",
            test_proc_p11)

In [117]:
# Load the sparse matrix 
import scipy.sparse as sp
test_proc_p11 = sp.load_npz("/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/test_proc_pl1.npz")


This method is a much better way of saving the sparse matrix. The resulting file is significantly smaller than the pickle file (~700MB vs. 4GB).
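
As a small self-contained illustration, the sketch below runs the same save_npz / load_npz round trip on a made-up random matrix and checks that nothing is lost. The helper name roundtrip_npz, the matrix dimensions and the temporary path are assumptions for the example only.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp


def roundtrip_npz(matrix, path):
    """Save a sparse matrix with save_npz, load it back, and return the copy."""
    sp.save_npz(path, matrix, compressed=True)  # compressed=True is the default
    return sp.load_npz(path)


# Small random COO matrix standing in for the processed test set (made-up data).
rng = np.random.default_rng(0)
rows = rng.integers(0, 1000, size=2000)
cols = rng.integers(0, 200, size=2000)
vals = rng.random(2000)
matrix = sp.coo_matrix((vals, (rows, cols)), shape=(1000, 200))

path = os.path.join(tempfile.mkdtemp(), "matrix.npz")
restored = roundtrip_npz(matrix, path)

# Compare in CSR form; an empty "not equal" pattern means the values match.
identical = (matrix.tocsr() != restored.tocsr()).nnz == 0
print(restored.shape, restored.format, identical)
```

The loaded matrix comes back in the format it was saved in (COO here). Converting it once with .tocsr() may be worthwhile if later steps slice it by rows.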

Let's use the processed test set and make predictions using the models we prepared so far to establish some benchmarks.

## Make predictions using the Ridge, Naive Bayes and SVC classifiers

In [5]:
import os
os.listdir()

['.git',
 '.ipynb_checkpoints',
 '.Rhistory',
 '__pycache__',
 'app_dummy.rds',
 'channel_dummy.rds',
 'device_dummy.rds',
 'mnb.pkl',
 'os_dummy.rds',
 'Ridge_classifier.pkl',
 'sample_submission.csv',
 'SparseInteractions.py',
 'svc.pkl',
 'test.csv',
 'test_processed.csv',
 'train_sample.csv',
 'User-click-detection-predictive-modeling.ipynb',
 'userclick_pipeline1.pkl',
 'UserClickDetectionPredictiveModeling.Rmd',
 'X_train.pkl',
 'X_train_trans_pl1.pkl',
 'X_val1.pkl',
 'X_val2.pkl',
 'y_train.pkl',
 'y_val1.pkl',
 'y_val2.pkl']

In [21]:
# Load the sparse matrix 
import scipy.sparse as sp
test_proc_p11 = sp.load_npz("/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/test_proc_pl1.npz")

print("Loaded sparse matrix.")

# Load the model objects
import pickle
with open('Ridge_classifier.pkl',"rb") as f:
    Ridge_classifier = pickle.load(f)

import pickle
with open('mnb.pkl',"rb") as f:
    mnb = pickle.load(f)
    
import pickle
with open('svc.pkl',"rb") as f:
    svc = pickle.load(f) 

print("Loaded model objects.") 

# Collect predictions
import datetime
import numpy as np
import pandas as pd 
start = datetime.datetime.now()

# Ridge
d = Ridge_classifier.decision_function(X= test_proc_p11) # Predict confidence scores for samples
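# Normalize exp(d) over all test rows to get values between 0 and 1.
# Note: this softmax runs across samples, so the values sum to 1 over the whole test set
# (roughly 1/18.8M each). They rank the rows for ROC-AUC but are not per-click probabilities.
probs_Ridge = np.exp(d) / np.sum(np.exp(d))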
end = datetime.datetime.now()
process_time = end-start
print("Completed Ridge predictions, it took: " + str((process_time.seconds)/60) + " minutes.")

# mnb
probs_mnb = mnb.predict_proba(test_proc_p11)[:,1]
end1 = datetime.datetime.now()
process_time = end1-end
print("Completed MNB predictions, it took: " + str((process_time.seconds)/60) + " minutes.")

# SVC
probs_svc = svc.predict_proba(test_proc_p11)[:,1]
end2 = datetime.datetime.now()
process_time = end2-end1
print("Completed SVC predictions, it took: " + str((process_time.seconds)/60) + " minutes.")

Loaded sparse matrix.
Loaded model objects.
Completed Ridge predictions, it took: 1.7166666666666666 minutes.
Completed MNB predictions, it took: 2.1 minutes.
Completed SVC predictions, it took: 6.05 minutes.
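
The Ridge cell above normalizes exp(d) over all rows, which is why each value ends up near 1/18,790,469. A possible alternative, sketched below under the assumption that only a monotone map into (0, 1) is needed, is a per-sample logistic sigmoid. The function name scores_to_unit_interval and the example scores are made up, and the result is still not a calibrated probability.

```python
import numpy as np


def scores_to_unit_interval(decision_scores):
    """Map signed decision scores to (0, 1) with a numerically stable sigmoid."""
    d = np.asarray(decision_scores, dtype=np.float64)
    # sigmoid(d) = 1 / (1 + exp(-d)) = exp(-log(1 + exp(-d)))
    return np.exp(-np.logaddexp(0.0, -d))


d = np.array([-2.0, -0.5, 0.0, 1.5])         # made-up scores for illustration
per_sample = scores_to_unit_interval(d)
across_rows = np.exp(d) / np.sum(np.exp(d))  # normalization used in the Ridge cell

# Both transforms are strictly increasing in d, so they rank the rows identically.
same_ranking = bool((np.argsort(per_sample) == np.argsort(across_rows)).all())
print(per_sample, same_ranking)
```

Because ROC-AUC depends only on the ordering of the scores, either transform should give the same leaderboard score. The sigmoid version just keeps each value interpretable on its own row.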


In [22]:
print(probs_Ridge.shape)
print(probs_mnb.shape)
print(probs_svc.shape)

(18790469,)
(18790469,)
(18790469,)


## A custom function to prepare submission files 

In [24]:
# Save the click_id from test set to disk to reuse
import pandas as pd
click_id = pd.read_csv("/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/test.csv", 
                       dtype = "str").click_id.astype('int64')
click_id.to_hdf("/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/click_id.h5",
               "click_id")

In [33]:
import pandas as pd
click_id = pd.read_hdf("/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/click_id.h5")
def prepare_submission(predictions,filename = "new_submission", click_id = click_id):
    """predictions: a list containing the predicted probabilities in the test set. """
    is_attributed = pd.Series(predictions)
    submission_frame = pd.DataFrame()
    submission_frame["click_id"] = click_id
    submission_frame["is_attributed"] = is_attributed.apply(lambda x: format(x,".9f"))  # Format the probabilities to 9 decimal places
    filename = filename + ".csv"
    submission_frame.to_csv(filename,index = False)
    print("File saved as :" + filename)
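
With close to 19 million rows, formatting each value through .apply can be slow. The sketch below is a hypothetical variant (prepare_submission_fast is not part of this notebook) that lets to_csv do the formatting through its float_format argument. It assumes predictions and click_id are already in the same row order, because it pairs them by position rather than by index.

```python
import pandas as pd


def prepare_submission_fast(predictions, click_id, filename="new_submission"):
    """Write click_id and predictions to <filename>.csv with 9 decimal places."""
    submission_frame = pd.DataFrame({
        # to_numpy() drops the index, so rows are matched by position.
        "click_id": pd.Series(click_id).to_numpy(),
        "is_attributed": pd.Series(predictions).to_numpy(),
    })
    filename = filename + ".csv"
    # float_format only affects float columns, so the integer click_id is untouched.
    submission_frame.to_csv(filename, index=False, float_format="%.9f")
    print("File saved as :" + filename)
    return filename
```

Unlike the function above, this version takes click_id as an explicit argument instead of binding it as a default at definition time.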

We will prepare these 3 predictions for submission. At the same time, we build a data frame to hold the predictions we accumulate along the way:

In [26]:
prediction_frames = pd.DataFrame()
prediction_frames["click_id"] = click_id
prediction_frames["probs_Ridge"] = probs_Ridge
prediction_frames["probs_mnb"] = probs_mnb
prediction_frames["probs_svc"] = probs_svc

In [27]:
prediction_frames.head()

|   | click_id | probs_Ridge | probs_mnb | probs_svc |
|---|---|---|---|---|
| 0 | 0 | 5.279898e-08 | 1.263286e-07 | 0.000795 |
| 1 | 1 | 5.273787e-08 | 2.544532e-08 | 0.000837 |
| 2 | 2 | 5.268285e-08 | 5.561252e-11 | 0.000602 |
| 3 | 3 | 5.270626e-08 | 4.261830e-11 | 0.001191 |
| 4 | 4 | 5.275368e-08 | 4.717778e-11 | 0.001255 |


In [28]:
prediction_frames.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18790469 entries, 0 to 18790468
Data columns (total 4 columns):
click_id       int64
probs_Ridge    float64
probs_mnb      float64
probs_svc      float64
dtypes: float64(3), int64(1)
memory usage: 716.8 MB


In [29]:
prediction_frames.tail()

|   | click_id | probs_Ridge | probs_mnb | probs_svc |
|---|---|---|---|---|
| 18790464 | 18790464 | 5.366131e-08 | 1.249719e-05 | 0.001435 |
| 18790465 | 18790465 | 5.269257e-08 | 2.442719e-14 | 0.000321 |
| 18790466 | 18790467 | 5.267693e-08 | 2.790520e-10 | 0.000945 |
| 18790467 | 18790466 | 5.283781e-08 | 2.501097e-07 | 0.001181 |
| 18790468 | 18790468 | 5.241018e-08 | 2.759973e-17 | 0.000040 |


In [31]:
prediction_frames.to_hdf("/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/prediction_frames.h5",
                         'prediction_frames')

In [4]:
# Need to repeat these submission preparations!!
import pandas as pd
prediction_frames = pd.read_hdf("/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/prediction_frames.h5")

prepare_submission(predictions= prediction_frames.probs_Ridge, 
                   filename= "/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/Ridge_submission")
prepare_submission(predictions= prediction_frames.probs_mnb, 
                   filename= "/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/MNB_submission")
prepare_submission(predictions= prediction_frames.probs_svc, 
                   filename= "/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/SVC_submission")

File saved as :/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/Ridge_submission.csv
File saved as :/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/MNB_submission.csv
File saved as :/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/SVC_submission.csv


These preliminary submissions gave us some benchmarks. The best model was the MNB classifier, which scored 0.9425 ROC-AUC on the test data set. The SVC and Ridge classifiers each scored 0.88. Most likely we won't use these two classifiers in future ensemble models.

# Testing Bayesian Optimization to tune MNB classifier

The MNB model has the best performance among the 3 classifiers we have trained. Let's try Bayesian Optimization to see whether this type of hyperparameter optimization can improve it further.

In [6]:
from bayes_opt import BayesianOptimization
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np
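# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide cross_val_score in sklearn.model_selection.
from sklearn.cross_validation import cross_val_score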

# Read the transformed features and target labels from the training set
import pickle
with open("X_train_trans_pl1.pkl","rb") as f:
    X_train_trans_pl1 = pickle.load(f)
with open("y_train.pkl","rb") as f:
    y_train = pickle.load(f)

In [5]:
help(MultinomialNB)

Help on class MultinomialNB in module sklearn.naive_bayes:

class MultinomialNB(BaseDiscreteNB)
 |  Naive Bayes classifier for multinomial models
 |  
 |  The multinomial Naive Bayes classifier is suitable for classification with
 |  discrete features (e.g., word counts for text classification). The
 |  multinomial distribution normally requires integer feature counts. However,
 |  in practice, fractional counts such as tf-idf may also work.
 |  
 |  Read more in the :ref:`User Guide <multinomial_naive_bayes>`.
 |  
 |  Parameters
 |  ----------
 |  alpha : float, optional (default=1.0)
 |      Additive (Laplace/Lidstone) smoothing parameter
 |      (0 for no smoothing).
 |  
 |  fit_prior : boolean, optional (default=True)
 |      Whether to learn class prior probabilities or not.
 |      If false, a uniform prior will be used.
 |  
 |  class_prior : array-like, size (n_classes,), optional (default=None)
 |      Prior probabilities of the classes. If specified the priors are not
 |   

Our earlier best parameters were:

{'alpha': 0.20000000000000001,
 'class_prior': [0.99831, 0.00169],
 'fit_prior': True}
 


In [14]:
# We start by defining the score we want to maximize using Bayesian Optimization:
# the MEAN cross-validated 'roc_auc' score of the Multinomial Naive Bayes classifier.
# Note that the parameters we optimize are passed as the function's arguments.

def mnbcv(alpha):
    val = cross_val_score(MultinomialNB(alpha = alpha, fit_prior= True, class_prior= None),
                         X_train_trans_pl1,y_train, scoring = 'roc_auc', cv=5, n_jobs = 3).mean()
    return val


In [22]:
import warnings
warnings.filterwarnings('ignore')

# alpha is a parameter of the Gaussian process
# Note that this is itself a hyperparameter that can be optimized.
gp_params = {"alpha": 1e-10}

# We create the BayesianOptimization objects using the functions that utilize
# the respective classifiers and return cross-validated scores to be optimized.

seed = 112 # Random seed

# We create the bayes_opt object and pass the function to be maximized
# together with the parameters names and their bounds.

mnbBO = BayesianOptimization(f = mnbcv, 
                             pbounds =  {'alpha': (0.2, 0.3)},
                             random_state = seed,
                             verbose = 10)

# Finally we call .maximize method of the optimizer with the appropriate arguments

mnbBO.maximize(init_points=5,n_iter=10,acq='ucb', kappa=3, **gp_params)

[31mInitialization[0m
[94m-----------------------------------------[0m
 Step |   Time |      Value |     alpha | 
    1 | 00m06s | [35m   0.94277[0m | [32m   0.2375[0m | 
    2 | 00m06s | [35m   0.94286[0m | [32m   0.2640[0m | 
    3 | 00m06s | [35m   0.94291[0m | [32m   0.2950[0m | 
    4 | 00m06s |    0.94264 |    0.2076 | 
    5 | 00m06s |    0.94288 |    0.2777 | 
[31mBayesian Optimization[0m
[94m-----------------------------------------[0m
 Step |   Time |      Value |     alpha | 
    6 | 00m12s |    0.94283 |    0.2541 | 
    7 | 00m13s | [35m   0.94291[0m | [32m   0.2957[0m | 
    8 | 00m13s |    0.94288 |    0.2776 | 
    9 | 00m10s |    0.94290 |    0.2858 | 
   10 | 00m13s |    0.94287 |    0.2679 | 
   11 | 00m14s |    0.94277 |    0.2371 | 
   12 | 00m14s |    0.94285 |    0.2589 | 
   13 | 00m14s |    0.94281 |    0.2484 | 
   14 | 00m08s |    0.94262 |    0.2046 | 
   15 | 00m08s |    0.94290 |    0.2899 | 
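
For reference, the acq='ucb' setting selects the upper confidence bound acquisition rule. As a sketch in standard notation (the symbols $\mu$ and $\sigma$ are not from this notebook; they stand for the Gaussian process posterior mean and standard deviation of the cross-validated score), each new alpha is chosen as:

$$
\alpha_{\text{next}} = \arg\max_{\alpha \in [0.2,\ 0.3]} \; \mu(\alpha) + \kappa\,\sigma(\alpha), \qquad \kappa = 3
$$

A larger $\kappa$ favors exploring uncertain regions, a smaller one favors exploiting the current best estimate. In the log above the score rises steadily with alpha and the best point sits close to the upper bound of 0.3, so widening the bounds might be worth trying.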


In [23]:
mnbBO.res

{'all': {'params': [{'alpha': 0.25405897371886416},
   {'alpha': 0.2957218967374855},
   {'alpha': 0.27762561466946939},
   {'alpha': 0.28580178477869017},
   {'alpha': 0.26787478048201852},
   {'alpha': 0.23709997691568072},
   {'alpha': 0.25885144513692321},
   {'alpha': 0.24839723916628981},
   {'alpha': 0.20455275849005611},
   {'alpha': 0.28992123720651092}],
  'values': [0.9428309561856455,
   0.94291343915763548,
   0.94288420657109628,
   0.94289622895547076,
   0.94286742251853073,
   0.94277156078197832,
   0.94284550808716239,
   0.9428131702186654,
   0.94262378824705217,
   0.94290411500060434]},
 'max': {'max_params': {'alpha': 0.2957218967374855},
  'max_val': 0.94291343915763548}}

In [26]:
# We re-fit the mnb using the optimized parameters
from sklearn.metrics import roc_auc_score

mnb_bopt = MultinomialNB(alpha=0.29572, class_prior= None, fit_prior= True)
mnb_bopt.fit(X_train_trans_pl1,y_train)

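# Note: these probabilities are for the training data itself, so the score below
# is a training ROC-AUC, not a held-out estimate like the cross-validated ~0.9429.
probs = mnb_bopt.predict_proba(X_train_trans_pl1)[:,1]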

print("mnb roc score after bayesian optimization: " + str(roc_auc_score(y_train,probs)))

mnb roc score after bayesian optimization: 0.95235789271


In [31]:
# save the bayesian optimized model
import pickle
with open("mnb_bopt.pkl","wb") as f:
    pickle.dump(mnb_bopt,f)

In [32]:
# let's perform a prediction using the test set
# Load the sparse matrix 
import scipy.sparse as sp
test_proc_p11 = sp.load_npz("/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/test_proc_pl1.npz")

print("Loaded processed test sparse matrix.")

# Load the model object
import pickle
with open('mnb_bopt.pkl',"rb") as f:
    mnb_bopt = pickle.load(f)

print("Loaded model object.") 

# Collect predictions
import datetime
import numpy as np
import pandas as pd 
start = datetime.datetime.now()


# mnb_bopt
probs_mnb_bopt = mnb_bopt.predict_proba(test_proc_p11)[:,1]
end1 = datetime.datetime.now()
process_time = end1-start
print("Completed MNB predictions, it took: " + str((process_time.seconds)/60) + " minutes.")


Loaded processed test sparse matrix.
Loaded model object.
Completed MNB predictions, it took: 3.0833333333333335 minutes.



In [34]:
# Prepare the submission file

prepare_submission(predictions= probs_mnb_bopt, 
                   filename= "/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/MNB_Bayesian_OPT_submission")

File saved as :/Volumes/Iomega_HDD/2016/Data science/Kaggle/User-click-detection-predictive-modeling/MNB_Bayesian_OPT_submission.csv
