# Introduction

Our initial analysis and exploration of the train_sample set (100K observations) gave us some expectations about the data. Here we build a pipeline to process data sets consistently for modeling.

# Planning for the Pipeline

1. We will prepare 3 data sets from the training set. The training set provided has 200 million samples, and we don't have the computational power to use all of this data at the moment. Instead, we will extract 3 data sets, each containing ~1 million observations. We will refer to these sets as:

    - training_set (the first 1,000,000 rows of the original training set)
    - validation_set1 (the second 1,000,000 rows of the original training set)
    - validation_set2 (the third 1,000,000 rows of the original training set)

2. Build the feature extraction and selection pipeline using the training set:

    - Using the insights we obtained from data exploration, the following features will be used to create dummy variables: device, app, os and channel. We will do this by converting these features to strings, tokenizing them, and selecting the 300 best features.
    - We will write custom processing functions to add the log_total_clicks and log_total_click_time features, and remove the unwanted base features.
    
3. We will prepare the remainder of the pipeline to incorporate interaction terms and perform scaling and standardization.    
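The two per-IP features named in step 2 can be sketched with `groupby(...).transform`, which keeps the original row order without a merge. This is a minimal illustration on made-up rows, not the function used in the pipeline; `add_ip_click_features` and the `demo` frame are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

def add_ip_click_features(df):
    """Return per-row log_total_clicks and log_total_click_time, grouped by ip."""
    click_time = pd.to_datetime(df["click_time"])  # parsed copy; df itself is not modified
    by_ip = click_time.groupby(df["ip"])
    # Time between the first and last click of each ip, in seconds
    span_seconds = (by_ip.transform("max") - by_ip.transform("min")).dt.total_seconds()
    return pd.DataFrame({
        "log_total_clicks": np.log(by_ip.transform("count")),
        "log_total_click_time": np.log(span_seconds + 1),
    }, index=df.index)

# Tiny made-up example: ip "a" clicks twice, 60 seconds apart; ip "b" clicks once
demo = pd.DataFrame({
    "ip": ["a", "a", "b"],
    "click_time": ["2017-11-06 14:32:21", "2017-11-06 14:33:21", "2017-11-06 14:34:12"],
})
ip_features = add_ip_click_features(demo)
print(ip_features)
```

Each row receives the statistics of its own ip group, so an ip with a single click gets 0 for both features.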

## Prepare training and validation sets


In [19]:
import pandas as pd
training_set = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/train.csv",
                           nrows=1000000,
                           dtype = "str")
print("Finished training_set")
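# Note: an integer skiprows also counts the header line, so skiprows = 1000000 starts this
# slice at the last row already read into training_set (a one-row overlap)
validation_set1 = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/train.csv",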
                           skiprows = 1000000,names = list(training_set.columns),
                           nrows=1000000,
                           dtype = "str")
print("Finished validation_set1")
validation_set2 = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/train.csv",
                           skiprows = 2000000,names = list(training_set.columns),
                           nrows=1000000,
                           dtype = "str")
print("Finished validation_set2")


Finished training_set
Finished validation_set1
Finished validation_set2
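As an alternative to separate reads with `skiprows`, the same file can be read once in consecutive chunks, which are disjoint by construction. The following is a minimal sketch under that assumption, not the code used in this notebook; `read_consecutive_chunks` and the small in-memory CSV are hypothetical and only stand in for train.csv.

```python
import io
import itertools
import pandas as pd

def read_consecutive_chunks(source, n_chunks=3, chunk_rows=1_000_000):
    """Read the first n_chunks blocks of chunk_rows rows each, in a single pass."""
    reader = pd.read_csv(source, chunksize=chunk_rows, dtype="str")
    chunks = itertools.islice(reader, n_chunks)
    # Each chunk keeps the global row index; reset it so every set starts at 0
    return [chunk.reset_index(drop=True) for chunk in chunks]

# Tiny stand-in for the real file: a header plus 10 data rows
demo_csv = io.StringIO("ip,app\n" + "\n".join(f"{i},{i % 3}" for i in range(10)))
train_demo, val1_demo, val2_demo = read_consecutive_chunks(demo_csv, chunk_rows=3)
print(train_demo, val1_demo, val2_demo, sep="\n\n")
```

The header is parsed once and applied to every chunk, so no `names` argument is needed.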


In [20]:
validation_set1.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,121848,24,1,19,105,2017-11-06 16:21:51,,0
1,2698,25,1,30,259,2017-11-06 16:21:51,,0
2,5729,2,1,19,237,2017-11-06 16:21:51,,0
3,122891,3,1,35,280,2017-11-06 16:21:51,,0
4,105433,15,2,25,245,2017-11-06 16:21:51,,0


In [2]:
training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
ip                 1000000 non-null object
app                1000000 non-null object
device             1000000 non-null object
os                 1000000 non-null object
channel            1000000 non-null object
click_time         1000000 non-null object
attributed_time    1693 non-null object
is_attributed      1000000 non-null object
dtypes: object(8)
memory usage: 61.0+ MB


In [21]:
# Save each set to its own file so it can be loaded individually later
training_set.to_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/training_set.csv")
print("Wrote training_set to disk")

validation_set1.to_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/validation_set1.csv")
print("Wrote validation_set1 to disk")

validation_set2.to_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/validation_set2.csv")
print("Wrote validation_set2 to disk")

Wrote training_set to disk
Wrote validation_set1 to disk
Wrote validation_set2 to disk


In [22]:
training_set = pd.read_csv("/Volumes/500GB/Data_science/Kaggle/User-click-detection-predictive-modeling/training_set.csv",
                          index_col = 0, dtype = "str")

In [23]:
training_set.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,83230,3,1,13,379,2017-11-06 14:32:21,,0
1,17357,3,1,19,379,2017-11-06 14:33:34,,0
2,35810,3,1,13,379,2017-11-06 14:34:12,,0
3,45745,14,1,13,478,2017-11-06 14:34:52,,0
4,161007,3,1,13,379,2017-11-06 14:35:08,,0


In [24]:
list(training_set.columns)

['ip',
 'app',
 'device',
 'os',
 'channel',
 'click_time',
 'attributed_time',
 'is_attributed']

### Separate target labels from feature matrix 

We will separate the target labels from the features for each of these data sets and pickle them for future use:

In [25]:
import os
import pandas as pd
import numpy as np
import pickle

X_train = training_set.drop(["is_attributed","attributed_time"], axis = 1)
y_train = pd.to_numeric(training_set.is_attributed) 

X_train.to_pickle("X_train.pkl")
y_train.to_pickle("y_train.pkl")

X_val1 = validation_set1.drop(["is_attributed","attributed_time"], axis = 1)
y_val1 = pd.to_numeric(validation_set1.is_attributed) 

X_val1.to_pickle("X_val1.pkl")
y_val1.to_pickle("y_val1.pkl")

X_val2 = validation_set2.drop(["is_attributed","attributed_time"], axis = 1)
y_val2 = pd.to_numeric(validation_set2.is_attributed) 

X_val2.to_pickle("X_val2.pkl")
y_val2.to_pickle("y_val2.pkl")

In [2]:
import os
os.listdir()

['.git',
 '.ipynb_checkpoints',
 '.Rhistory',
 'app_dummy.rds',
 'channel_dummy.rds',
 'device_dummy.rds',
 'os_dummy.rds',
 'test_processed.csv',
 'train_sample.csv',
 'User-click-detection-predictive-modeling.ipynb',
 'UserClickDetectionPredictiveModeling.Rmd',
 'X_train.pkl',
 'X_val1.pkl',
 'X_val2.pkl',
 'y_train.pkl',
 'y_val1.pkl',
 'y_val2.pkl']

## Build the feature extraction and selection pipeline using the training set



In [217]:
import pandas as pd
import pickle

# Read the pickled training set
X_train = pd.read_pickle("X_train.pkl")
y_train = pd.read_pickle("y_train.pkl")

# Label text features
Text_features = ["app","device","os","channel"]

##############################################################
# Define utility function to parse and process text features
##############################################################
# Note we avoid lambda functions since they cannot be pickled when we want to save the pipeline later   
def column_text_processer_nolambda(df,text_columns = Text_features):
    """A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df= pandas dataframe
    text_columns = names of the text features in df
    """ 
    import pandas as pd
    import numpy as np
    # Select only the text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings
    # (modifying this slice in place triggers the SettingWithCopyWarning shown in the outputs below;
    # taking df[text_columns].copy() instead would avoid it)
    text_data.fillna("",inplace=True)
    
    # Concatenate feature name to each category encoding for each row
    # E.g: encoding 3 at device column will read as device3 to make each encoding unique for a given feature
    for col_index in list(text_data.columns):
        text_data[col_index] = col_index + text_data[col_index].astype(str)
    
    # Join all the strings in a given row to make a vector
    # text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    text_vector = []
    for index,rows in text_data.iterrows():
        text_item = " ".join(rows).lower()
        text_vector.append(text_item)

    # return text_vector as pd.Series object to enter the tokenization pipeline
    return pd.Series(text_vector)

#######################################################################
# Define custom processing functions to add the log_total_clicks and 
# log_total_click_time features, and remove the unwanted base features
#######################################################################
def column_time_processer(X_train):
    import pandas as pd
    import numpy as np

    # Convert click_time to datetime64 dtype 
    X_train.click_time = pd.to_datetime(X_train.click_time)

    # Calculate the log_total_clicks for each ip and add as a new feature to temp_data
    temp_data = pd.DataFrame(np.log(X_train.groupby(["ip"]).size()),
                                    columns = ["log_total_clicks"]).reset_index()


    # Calculate the log_total_click_time for each ip and add as a new feature to temp_data
    # First define a function to process selected ip group 
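    def get_log_total_click_time(group):
        # total_seconds() covers the whole span; .seconds would ignore any full days in the difference
        diff = (max(group.click_time) - min(group.click_time)).total_seconds()
        return np.log(diff+1)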

    # Then apply this function to each ip group and extract the total click time per ip group
    log_time_frame = pd.DataFrame(X_train.groupby(["ip"]).apply(get_log_total_click_time),
                                  columns=["log_total_click_time"]).reset_index()

    # Then add this new feature to the temp_data
    temp_data = pd.merge(temp_data,log_time_frame, how = "left",on = "ip")

    # Combine temp_data with X_train to maintain X_train key order
    temp_data = pd.merge(X_train,temp_data,how = "left",on = "ip")

    # Drop features that are not needed
    temp_data = temp_data[["log_total_clicks","log_total_click_time"]]

    # Return only the numeric features as a DataFrame to integrate into the numeric feature branch of the pipeline
    return temp_data


#############################################################################
# We need to wrap these custom utility functions using FunctionTransformer
# FunctionTransformer wrapper of utility functions to parse text and numeric features
# Note how we avoid putting any arguments into column_text_processer or column_time_processer
#############################################################################
from sklearn.preprocessing import FunctionTransformer
get_numeric_data = FunctionTransformer(func = column_time_processer, validate=False) 
get_text_data = FunctionTransformer(func = column_text_processer_nolambda,validate=False) 

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
# Note this regex matches a run of alphanumeric characters only when it is followed 
# by whitespace. The last token of each row string (here the channel token) has no 
# trailing whitespace, so this pattern does not match it
#############################################################################
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   

###############################################
# Construct our feature extraction pipeline
###############################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import HashingVectorizer
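from sklearn.preprocessing import MaxAbsScaler, Imputer # Imputer was deprecated in scikit-learn 0.20 and later removed; sklearn.impute.SimpleImputer replaces it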
from sklearn.feature_selection import SelectKBest, chi2 # We will use chi-squared as a scoring function to select features for classification
from sklearn.metrics import auc
from SparseInteractions import * #Load SparseInteractions (from : https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py) as a module since it was saved into working directory as SparseInteractions.py

userclick_pipeline1 = Pipeline([
    
    ("union",FeatureUnion(
        # Note that FeatureUnion() also accepts list of tuples, the first half of each tuple 
        # is the name of the transformer within the FeatureUnion
        
        transformer_list = [
            
            ("numeric_subpipeline",Pipeline([        # Note we have subpipeline branches inside the main pipeline
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()) # Step2: impute any missing data using default (mean), note we don't expect missing values in this case. 
            ])), # End of: numeric_subpipeline
            
            ("text_subpipeline",Pipeline([
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC, # Step2: use HashingVectorizer for automated tokenization and feature extraction
                                             ngram_range = (1,1),
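                                             non_negative=True, # deprecated in scikit-learn 0.19 and later removed; alternate_sign=False is the suggested replacement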
                                             norm=None, binary=True )), # Note here we use binary=True since our hack is to use tokenization to generate dummy variables  
                ('dim_red', SelectKBest(chi2,k=300)) # Step3: use dimension reduction to select 300 best features using chi2 as scoring function
            ]))
        ]
        
    )),# End of step: union, this is the fusion point to main pipeline, all features are numeric at this stage
    
    # Common steps:
            
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms up to the second degree polynomial
    ("scaler",MaxAbsScaler()) # Scale each feature by its maximum absolute value (to [0, 1] here, since all features are non-negative)
            
])# End of: userclick_pipeline1
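The effect of the token pattern can be checked on a single row string with the standard `re` module. As far as we know, `HashingVectorizer` applies `token_pattern` through a regex `findall`, so this should mirror what the tokenizer sees. The row string and the alternative pattern without the lookahead are hypothetical examples, not part of the pipeline.

```python
import re

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Shape of one string produced by the text parser: "<feature><code>" joined by spaces
row_text = "app3 device1 os13 channel379"

# Pattern used in the pipeline: a token must be followed by whitespace
tokens_with_lookahead = re.findall(TOKENS_ALPHANUMERIC, row_text)

# Hypothetical alternative without the lookahead, shown only for comparison
tokens_plain = re.findall('[A-Za-z0-9]+', row_text)

print(tokens_with_lookahead)  # ['app3', 'device1', 'os13']
print(tokens_plain)           # ['app3', 'device1', 'os13', 'channel379']
```

With the lookahead, the final token in each string is skipped because nothing follows it, which is worth keeping in mind when interpreting which features reach SelectKBest.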


Fit userclick_pipeline1 on the training set:

In [239]:
import datetime
start = datetime.datetime.now()

userclick_pipeline1.fit(X_train,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


It took: 3.3333333333333335 minutes.


Having fitted our pipeline on the training set, we will pickle it and store it for reuse. This ensures that we extract the same set of features every time we process a data set.

In [240]:
# Pickle and store the userclick_pipeline1
import pickle
with open("userclick_pipeline1.pkl","wb") as f:
    pickle.dump(userclick_pipeline1,f)    

## Transform the features in the training set using the established pipeline

In [242]:
# Re-load the userclick_pipeline1 to work with
import pickle
with open("userclick_pipeline1.pkl","rb") as f:
    userclick_pipeline1 = pickle.load(f)

In [243]:
import datetime
start = datetime.datetime.now()

# Transform the training set features
X_train_trans_pl1 = userclick_pipeline1.transform(X_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


It took: 3.0833333333333335 minutes.


In [245]:
X_train_trans_pl1.shape

(1000000, 45753)

In [246]:
type(X_train_trans_pl1)

scipy.sparse.csc.csc_matrix
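The 45,753 columns are consistent with the pipeline definition if we assume that `SparseInteractions(degree=2)` keeps the original columns and appends one product column for every pair of distinct columns. That is our reading of the linked implementation, not something verified here. The arithmetic can be sketched as:

```python
from math import comb

n_numeric = 2            # log_total_clicks and log_total_click_time
n_text_selected = 300    # k passed to SelectKBest
n_base = n_numeric + n_text_selected

# One interaction column per unordered pair of distinct base columns
n_pairwise = comb(n_base, 2)
n_total = n_base + n_pairwise

print(n_base, n_pairwise, n_total)  # 302 45451 45753
```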

We will pickle and save this transformed version of the training set features. Fitting the pipeline took about 3 minutes, and transforming the features took about 3 more.

From now on, we will only call this pipeline's .transform method to process any data set we would like to use in our models.

In [248]:
# Save the transformed version of training set features
import pickle
with open("X_train_trans_pl1.pkl","wb") as f:
    pickle.dump(X_train_trans_pl1,f)  

# Fitting Regularized Linear Model for predicting true events

We will start with a linear model and train a Ridge classifier with the regularization parameter alpha set to 0.5. We will look at the untuned model performance first, then try to optimize it. 

In [255]:
# Read the transformed features and target labels from the training set
import pickle
with open("X_train_trans_pl1.pkl","rb") as f:
    X_train_trans_pl1 = pickle.load(f)
with open("y_train.pkl","rb") as f:
    y_train = pickle.load(f)

from sklearn.linear_model import RidgeClassifier

# Instantiate a Ridge classifier with a medium alpha 
Ridge = RidgeClassifier(alpha=0.5, random_state= 321)

# Train the model
import datetime
start = datetime.datetime.now()

Ridge.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

It took: 1.3333333333333333 minutes.


In [270]:
# Predict class labels using training set
import datetime
start = datetime.datetime.now()

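# Predict class scores
# Note that there is no predict_proba on RidgeClassifier
# A softmax over d (as in https://stackoverflow.com/questions/22538080/scikit-learn-ridge-classifier-extracting-class-probabilities)
# would normalize across samples here, because for a binary problem decision_function returns one score per sample.
# Instead we squash each score with a logistic (sigmoid) function, which keeps the ranking that ROC AUC relies on.

d = Ridge.decision_function(X= X_train_trans_pl1) # Predict confidence scores for samples
probs = 1 / (1 + np.exp(-d)) # Map each score to (0, 1); these are not calibrated probabilities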

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

It took: 0.016666666666666666 minutes.


In [271]:
# Calculate ROC score between the predicted probability and the observed target
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train,probs)

0.94980780103952844
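ROC AUC depends only on how the samples are ranked, so any strictly increasing transformation of the decision scores should give the same value. A small sketch with synthetic scores (the data and variable names below are made up for illustration and are unrelated to the click data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
# Hypothetical stand-in for decision_function output: higher on average for class 1
scores = rng.normal(size=200) + y_demo

# Per-sample logistic squashing: each value lies in (0, 1)
sigmoid = 1.0 / (1.0 + np.exp(-scores))
# Softmax taken across samples: values sum to 1 over the whole data set
softmax_over_samples = np.exp(scores) / np.sum(np.exp(scores))

auc_raw = roc_auc_score(y_demo, scores)
auc_sigmoid = roc_auc_score(y_demo, sigmoid)
auc_softmax = roc_auc_score(y_demo, softmax_over_samples)
print(auc_raw, auc_sigmoid, auc_softmax)
```

All three values agree because the ordering of the samples is unchanged, but only the sigmoid output can be read as a per-sample value between 0 and 1.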

Even this naive attempt gave us a training-set ROC AUC of 0.949. Next, we will try hyperparameter optimization to see how much further we can get: 

## Hyperparameter optimization using Ridge model

In [276]:
from sklearn.model_selection import RandomizedSearchCV

import datetime
start = datetime.datetime.now()

params_space = {"alpha": np.arange(0,1,0.01)}
RidgeSearch = RandomizedSearchCV(Ridge,cv = 2,verbose=10,
                                 n_iter=20,
                                 n_jobs=3,
                                 param_distributions=params_space,
                                 random_state= 321,
                                 scoring= "roc_auc")

RidgeSearch.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

Fitting 2 folds for each of 20 candidates, totalling 40 fits
[CV] alpha=0.86 ......................................................
[CV] alpha=0.86 ......................................................
[CV] alpha=0.25 ......................................................
[CV] ............. alpha=0.86, score=0.8914263110278222, total=  37.6s
[CV] alpha=0.25 ......................................................
[CV] ............. alpha=0.86, score=0.8956513669854133, total= 1.1min
[CV] alpha=0.45 ......................................................


[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed:  1.2min


[CV] .............. alpha=0.25, score=0.884190883324053, total= 1.4min
[CV] alpha=0.45 ......................................................
[CV] ............. alpha=0.25, score=0.8914015886647204, total= 1.3min
[CV] alpha=0.32 ......................................................
[CV] ............. alpha=0.45, score=0.8886544167323791, total=  53.1s
[CV] alpha=0.32 ......................................................
[CV] ............. alpha=0.45, score=0.8941574912061789, total=  55.7s
[CV] alpha=0.2 .......................................................
[CV] ............. alpha=0.32, score=0.8875157179170297, total= 1.2min
[CV] alpha=0.2 .......................................................


[Parallel(n_jobs=3)]: Done   7 tasks      | elapsed:  3.2min


[CV] ............. alpha=0.32, score=0.8924351290802932, total= 1.2min
[CV] alpha=0.06 ......................................................
[CV] ............... alpha=0.2, score=0.882164720700341, total= 1.8min
[CV] alpha=0.06 ......................................................
[CV] .............. alpha=0.2, score=0.8905415407269656, total= 1.8min
[CV] alpha=0.15 ......................................................
[CV] ............. alpha=0.06, score=0.8786576036594884, total= 2.8min
[CV] alpha=0.15 ......................................................
[CV] .............. alpha=0.15, score=0.883062963076619, total= 1.8min
[CV] alpha=0.82 ......................................................


[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:  6.7min


[CV] ............. alpha=0.06, score=0.8759978611806225, total= 2.7min
[CV] alpha=0.82 ......................................................
[CV] ............. alpha=0.82, score=0.8913547259152571, total=  33.9s
[CV] alpha=0.08 ......................................................
[CV] ............. alpha=0.15, score=0.8884136760769671, total= 1.7min
[CV] alpha=0.08 ......................................................
[CV] ............. alpha=0.82, score=0.8954961726441188, total=  55.6s
[CV] alpha=0.38 ......................................................
[CV] ............. alpha=0.38, score=0.8886718559207138, total=  57.1s
[CV] alpha=0.38 ......................................................
[CV] .............. alpha=0.08, score=0.880409659046294, total= 2.2min
[CV] alpha=0.69 ......................................................
[CV] ............. alpha=0.38, score=0.8924743918088495, total= 1.0min
[CV] alpha=0.69 ......................................................


[Parallel(n_jobs=3)]: Done  19 tasks      | elapsed:  9.9min


[CV] ............. alpha=0.08, score=0.8830280871683156, total= 2.3min
[CV] alpha=0.48 ......................................................
[CV] ............. alpha=0.69, score=0.8903844868681919, total=  38.9s
[CV] alpha=0.48 ......................................................
[CV] .............. alpha=0.69, score=0.894370026181325, total=  40.2s
[CV] alpha=0.95 ......................................................
[CV] ............. alpha=0.48, score=0.8886348535251752, total=  48.8s
[CV] alpha=0.95 ......................................................
[CV] ............. alpha=0.48, score=0.8942835960807918, total=  50.9s
[CV] alpha=0.42 ......................................................
[CV] ............. alpha=0.95, score=0.8912950972786214, total=  29.2s
[CV] alpha=0.42 ......................................................
[CV] .............. alpha=0.42, score=0.888656706320648, total=  54.5s
[CV] alpha=0.67 ......................................................


[Parallel(n_jobs=3)]: Done  26 tasks      | elapsed: 12.0min


[CV] ............. alpha=0.42, score=0.8933858578180848, total=  58.5s
[CV] alpha=0.67 ......................................................
[CV] ............. alpha=0.95, score=0.8976710720537423, total= 1.3min
[CV] alpha=0.74 ......................................................
[CV] ............. alpha=0.67, score=0.8904195284489833, total=  38.5s
[CV] alpha=0.74 ......................................................
[CV] ............. alpha=0.67, score=0.8943406596495505, total=  42.0s
[CV] alpha=0.7 .......................................................
[CV] ............. alpha=0.74, score=0.8902902991673008, total=  36.1s
[CV] alpha=0.7 .......................................................
[CV] ............. alpha=0.74, score=0.8951258526506547, total=  43.6s
[CV] alpha=0.44 ......................................................
[CV] .............. alpha=0.7, score=0.8904273622468626, total=  38.9s
[CV] alpha=0.44 ......................................................
[CV] .

[Parallel(n_jobs=3)]: Done  35 tasks      | elapsed: 14.3min


[CV] ............. alpha=0.51, score=0.8885397528273091, total=  47.1s
[CV] alpha=0.53 ......................................................
[CV] ............. alpha=0.44, score=0.8940703814673404, total=  55.5s
[CV] alpha=0.53 ......................................................
[CV] ............. alpha=0.51, score=0.8943254518544484, total=  51.8s
[CV] ............. alpha=0.53, score=0.8885548291037833, total=  47.6s
[CV] ............. alpha=0.53, score=0.8943041770442346, total=  49.3s


[Parallel(n_jobs=3)]: Done  40 out of  40 | elapsed: 15.2min remaining:    0.0s
[Parallel(n_jobs=3)]: Done  40 out of  40 | elapsed: 15.2min finished


It took: 16.15 minutes.


In [277]:
RidgeSearch.best_score_

0.89448307829020712

In [278]:
RidgeSearch.best_params_

{'alpha': 0.95000000000000007}
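`best_score_` is the mean cross-validated score of the best candidate, computed on held-out folds, so it is not directly comparable with a score computed on the data the model was fitted on. The full set of candidates can be inspected through `cv_results_`. A minimal sketch on synthetic data (`X_demo`, `y_demo` and `search_demo` are hypothetical and much smaller than the real search):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic binary problem standing in for the transformed training set
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=321)

search_demo = RandomizedSearchCV(
    RidgeClassifier(),
    param_distributions={"alpha": np.arange(0.01, 1, 0.01)},
    n_iter=5, cv=2, scoring="roc_auc", random_state=321,
)
search_demo.fit(X_demo, y_demo)

# One row per sampled candidate, with mean and spread of the held-out fold scores
results = pd.DataFrame(search_demo.cv_results_)
summary = results[["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether differences between alpha values are larger than the fold-to-fold variation.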

In [280]:
# Use 'alpha': 0.95 to re-fit the Ridge classifier and calculate the performance
# Read the transformed features and target labels from the training set
import pickle
with open("X_train_trans_pl1.pkl","rb") as f:
    X_train_trans_pl1 = pickle.load(f)
with open("y_train.pkl","rb") as f:
    y_train = pickle.load(f)

from sklearn.linear_model import RidgeClassifier

# Instantiate a Ridge classifier with the best alpha found by the search 
Ridge = RidgeClassifier(alpha=0.95, random_state= 321)

# Train the model
import datetime
start = datetime.datetime.now()

Ridge.fit(X_train_trans_pl1,y_train)
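d = Ridge.decision_function(X= X_train_trans_pl1) # Predict confidence scores for samples
# A softmax over d would normalize across samples (one score per sample in the binary case),
# so squash each score with a sigmoid instead; the ranking, and hence ROC AUC, is unchanged
probs = 1 / (1 + np.exp(-d))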

# Calculate ROC score between the predicted probability and the observed target
from sklearn.metrics import roc_auc_score
print( "The ROC score is:" + str(roc_auc_score(y_train,probs)))

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")


The ROC score is:0.949325548082
It took: 1.05 minutes.


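We could not improve the performance of the Ridge model using this approach: the training-set ROC AUC (0.9493) is essentially the same as with alpha = 0.5 (0.9498), while the best cross-validated score was 0.894. Let's save this model and continue to explore other types of classifiers.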

In [281]:
import pickle
with open("Ridge_classifier.pkl", "wb") as f:
    pickle.dump(Ridge, f)

# Training Naive Bayes Classifier

In [1]:
# Read the transformed features and target labels from the training set
import pickle
import numpy as np
with open("X_train_trans_pl1.pkl","rb") as f:
    X_train_trans_pl1 = pickle.load(f)
with open("y_train.pkl","rb") as f:
    y_train = pickle.load(f)    

In [2]:
# Train the model
import datetime
start = datetime.datetime.now()

from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

It took: 0.016666666666666666 minutes.


In [3]:
# Get the probability predictions
probs = mnb.predict_proba(X_train_trans_pl1)  

In [6]:
probs # We need the second column which is the probabilities for class label: 1

array([[  1.00000000e+00,   6.32335454e-15],
       [  1.00000000e+00,   7.52020576e-15],
       [  1.00000000e+00,   1.22347330e-14],
       ..., 
       [  1.00000000e+00,   1.74420279e-12],
       [  1.00000000e+00,   5.90993405e-16],
       [  1.00000000e+00,   5.74730378e-13]])

In [8]:
probs = probs[:,1]

In [9]:
# Calculate the roc score for the training set
from sklearn.metrics import roc_auc_score
print("NB roc score is: " + str(roc_auc_score(y_train,probs)))

NB roc score is: 0.946744947288


This is our untuned classifier, which performs similarly to Ridge on the training set. Can we tune it to perform better? Looking at the class description, we find the following parameters that can be tuned:

Parameters

 |  alpha : float, optional (default=1.0)
 |      Additive (Laplace/Lidstone) smoothing parameter
 |      (0 for no smoothing).


 |  fit_prior : boolean, optional (default=True)
 |      Whether to learn class prior probabilities or not.
 |      If false, a uniform prior will be used.


 |  class_prior : array-like, size (n_classes,), optional (default=None)
 |      Prior probabilities of the classes. If specified the priors are not
 |      adjusted according to the data.
 

- We can search across alpha (0-1). 
- We will leave fit_prior = True.
- We have some idea about the probability of being in class 1, which is (sum(y_train)/len(y_train)) (0.00169) in our training set. Let's use this information in our hyperparameter search and see if it makes any difference.

Since the classifier trained very fast, we can perform exhaustive GridSearch with 3-fold CV.
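One thing to keep in mind when reading the results: in a two-class naive Bayes model, the class prior adds the same constant to every sample's log-odds, so it should not change how samples are ranked and therefore should have little or no effect on ROC AUC. A minimal sketch on synthetic count data (all names and numbers below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=500)
# Non-negative count features; class 1 gets extra counts in the first three columns
X_demo = rng.poisson(lam=1.0, size=(500, 10))
X_demo[:, :3] += rng.poisson(lam=1.0, size=(500, 3)) * y_demo[:, None]

def train_auc(**nb_kwargs):
    """Fit MultinomialNB with the given options and return ROC AUC on the same data."""
    model = MultinomialNB(alpha=0.5, **nb_kwargs).fit(X_demo, y_demo)
    return roc_auc_score(y_demo, model.predict_proba(X_demo)[:, 1])

auc_learned_prior = train_auc()                       # prior estimated from y_demo
auc_fixed_prior = train_auc(class_prior=[0.9, 0.1])   # hypothetical fixed prior
print(auc_learned_prior, auc_fixed_prior)
```

The prior still matters for the predicted probabilities themselves and for any metric that depends on a decision threshold.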

## Hyperparameter optimization for Naive Bayes Classifier

In [17]:
import datetime
start = datetime.datetime.now()

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()

params_space = {
    "alpha":np.arange(0,1.1,0.1),
    "fit_prior":[True],
    "class_prior":[None,[1-0.00169,0.00169]]
}

MNBsearch = GridSearchCV(mnb,
                         param_grid= params_space,
                         scoring="roc_auc",
                         cv =3, n_jobs=3,verbose=10)

MNBsearch.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

Fitting 3 folds for each of 22 candidates, totalling 66 fits
[CV] alpha=0.0, class_prior=None, fit_prior=True .....................
[CV] alpha=0.0, class_prior=None, fit_prior=True .....................


  'setting alpha = %.1e' % _ALPHA_MIN)


[CV] alpha=0.0, class_prior=None, fit_prior=True .....................


  'setting alpha = %.1e' % _ALPHA_MIN)


[CV]  alpha=0.0, class_prior=None, fit_prior=True, score=0.8959370258094743, total=   1.6s
[CV] alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True .......


  'setting alpha = %.1e' % _ALPHA_MIN)


[CV]  alpha=0.0, class_prior=None, fit_prior=True, score=0.8868648185207342, total=   1.9s
[CV] alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True .......


[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed:    5.2s
  'setting alpha = %.1e' % _ALPHA_MIN)


[CV]  alpha=0.0, class_prior=None, fit_prior=True, score=0.8857097885869714, total=   2.0s
[CV] alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True .......


  'setting alpha = %.1e' % _ALPHA_MIN)


[CV]  alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.8959370204907351, total=   1.7s
[CV] alpha=0.1, class_prior=None, fit_prior=True .....................


  'setting alpha = %.1e' % _ALPHA_MIN)


[CV]  alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.886864834505243, total=   1.7s
[CV] alpha=0.1, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.0, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.8857097859228865, total=   1.6s
[CV] alpha=0.1, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.1, class_prior=None, fit_prior=True, score=0.9435606570419294, total=   1.5s
[CV] alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True .......


[Parallel(n_jobs=3)]: Done   7 tasks      | elapsed:    9.3s


[CV]  alpha=0.1, class_prior=None, fit_prior=True, score=0.9327180517680261, total=   1.7s
[CV] alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.1, class_prior=None, fit_prior=True, score=0.9505139967923142, total=   1.5s
[CV] alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9435606517231905, total=   1.5s
[CV] alpha=0.2, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9327180624243653, total=   1.6s
[CV] alpha=0.2, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.1, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9505139914641447, total=   1.7s
[CV] alpha=0.2, class_prior=None, fit_prior=True .....................


[Parallel(n_jobs=3)]: Done  12 tasks      | elapsed:   13.2s


[CV]  alpha=0.2, class_prior=None, fit_prior=True, score=0.9444671084783707, total=   1.6s
[CV] alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.2, class_prior=None, fit_prior=True, score=0.9326124714247604, total=   1.6s
[CV] alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.2, class_prior=None, fit_prior=True, score=0.9520073601628835, total=   1.5s
[CV] alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9444671270939577, total=   1.4s
[CV] alpha=0.3, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9326125300346252, total=   1.7s
[CV] alpha=0.3, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.2, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9520073814755616, total=   1.7s
[CV] alpha=0.3, class_prior=None, fit_prior=True ........

[Parallel(n_jobs=3)]: Done  19 tasks      | elapsed:   18.9s


[CV]  alpha=0.3, class_prior=None, fit_prior=True, score=0.931901677625326, total=   1.5s
[CV] alpha=0.3, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.3, class_prior=None, fit_prior=True, score=0.9524465345361612, total=   1.6s
[CV] alpha=0.3, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.3, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9446671994447662, total=   1.8s
[CV] alpha=0.4, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.3, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9319016589767328, total=   1.7s
[CV] alpha=0.4, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.3, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9524465345361612, total=   1.6s
[CV] alpha=0.4, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.4, class_prior=None, fit_prior=True, score=0.9445732731709474, total=   1.4s
[CV] alpha=0.4, class_prior=[0.99831, 0.00169], fit_prior=

[Parallel(n_jobs=3)]: Done  26 tasks      | elapsed:   24.6s


[CV]  alpha=0.4, class_prior=None, fit_prior=True, score=0.9524166168642662, total=   1.4s
[CV] alpha=0.4, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.4, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9445732731709476, total=   1.4s
[CV] alpha=0.5, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.4, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9310744313527057, total=   1.4s
[CV] alpha=0.5, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.4, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9524165928875032, total=   1.5s
[CV] alpha=0.5, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.5, class_prior=None, fit_prior=True, score=0.9442627013551642, total=   1.5s
[CV] alpha=0.5, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.5, class_prior=None, fit_prior=True, score=0.9302755229497157, total=   1.5s
[CV] alpha=0.5, class_prior=[0.99831, 0.00169], fit_prior

[Parallel(n_jobs=3)]: Done  35 tasks      | elapsed:   31.6s


[CV]  alpha=0.5, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9521864452688615, total=   1.5s
[CV] alpha=0.6, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.6, class_prior=None, fit_prior=True, score=0.9438514218731604, total=   1.6s
[CV] alpha=0.6, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.6, class_prior=None, fit_prior=True, score=0.9294995310038618, total=   1.7s
[CV] alpha=0.6, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.6, class_prior=None, fit_prior=True, score=0.9518519294655212, total=   1.4s
[CV] alpha=0.6, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.6, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9438514218731604, total=   1.5s
[CV] alpha=0.7, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.6, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9294995310038618, total=   1.6s
[CV] alpha=0.7, class_prior=None, fit_prior=True ........

[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:   38.8s


[CV]  alpha=0.7, class_prior=None, fit_prior=True, score=0.9514674114552534, total=   1.6s
[CV] alpha=0.7, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.7, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9433635685037778, total=   1.6s
[CV] alpha=0.8, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.7, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9287725022718782, total=   1.6s
[CV] alpha=0.8, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.7, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9514674061270838, total=   1.8s
[CV] alpha=0.8, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.8, class_prior=None, fit_prior=True, score=0.942827681601234, total=   2.0s
[CV] alpha=0.8, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.8, class_prior=None, fit_prior=True, score=0.9280890526384574, total=   2.1s
[CV] alpha=0.8, class_prior=[0.99831, 0.00169], fit_prior=

[Parallel(n_jobs=3)]: Done  55 tasks      | elapsed:   48.8s


[CV]  alpha=0.9, class_prior=None, fit_prior=True, score=0.9274512973869017, total=   1.7s
[CV] alpha=0.9, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.9, class_prior=None, fit_prior=True, score=0.9506482853129924, total=   1.5s
[CV] alpha=0.9, class_prior=[0.99831, 0.00169], fit_prior=True .......
[CV]  alpha=0.9, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9422726206440957, total=   1.9s
[CV] alpha=1.0, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.9, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9274512973869015, total=   2.1s
[CV] alpha=1.0, class_prior=None, fit_prior=True .....................
[CV]  alpha=0.9, class_prior=[0.99831, 0.00169], fit_prior=True, score=0.9506482853129923, total=   2.2s
[CV] alpha=1.0, class_prior=None, fit_prior=True .....................
[CV]  alpha=1.0, class_prior=None, fit_prior=True, score=0.9417146237429526, total=   2.3s
[CV] alpha=1.0, class_prior=[0.99831, 0.00169], fit_prior

[Parallel(n_jobs=3)]: Done  66 out of  66 | elapsed:   58.1s finished


It took: 1.0 minutes.


In [18]:
MNBsearch.best_params_

{'alpha': 0.20000000000000001,
 'class_prior': [0.99831, 0.00169],
 'fit_prior': True}

In [19]:
MNBsearch.best_score_

0.94302901430616237
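
As a sanity check, `best_score_` can be related back to the per-fold scores printed in the search log. The sketch below averages the three fold scores logged for `alpha=0.2` under each `class_prior` setting (values copied from the log above). The tiny gap between the plain mean and the reported `best_score_` may come from how this scikit-learn version weights the folds; that is an assumption, not something the log confirms.

```python
# Fold scores copied from the GridSearchCV log above for alpha=0.2
folds_with_prior = [0.9444671270939577, 0.9326125300346252, 0.9520073814755616]
folds_no_prior = [0.9444671084783707, 0.9326124714247604, 0.9520073601628835]

def mean(values):
    """Plain arithmetic mean of a list of floats."""
    return sum(values) / len(values)

mean_with_prior = mean(folds_with_prior)
mean_no_prior = mean(folds_no_prior)
reported_best = 0.94302901430616237  # MNBsearch.best_score_ from the cell above

print("mean with class_prior   :", mean_with_prior)
print("mean without class_prior:", mean_no_prior)
print("gap to best_score_      :", abs(mean_with_prior - reported_best))
print("prior minus no prior    :", mean_with_prior - mean_no_prior)
```

The two `class_prior` settings differ by only about 3e-8 in mean score, so `class_prior` has essentially no effect here; `alpha` is the parameter that drives the result.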

In [24]:
from sklearn.metrics import roc_auc_score
probs = MNBsearch.best_estimator_.predict_proba(X_train_trans_pl1)[:,1]
roc_auc_score(y_train,probs)

0.95332437154555072

The hyperparameter search appears to have improved the MNB model's performance, although this ROC AUC is computed on the same training data the model was fit on. We store the best estimator for future use:

In [25]:
import pickle
with open("mnb.pkl","wb") as f:
    pickle.dump(MNBsearch.best_estimator_,f)

# Training Support Vector Machine Classifiers

In [3]:
# Read the transformed features and target labels from the training set
import pickle
import numpy as np
with open("X_train_trans_pl1.pkl","rb") as f:
    X_train_trans_pl1 = pickle.load(f)
with open("y_train.pkl","rb") as f:
    y_train = pickle.load(f) 

In [4]:
import datetime
start = datetime.datetime.now()

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
# LinearSVC has no predict_proba method, so we cannot get probabilities from it directly.
# We wrap it in CalibratedClassifierCV to obtain calibrated probabilities
# (see: https://stackoverflow.com/questions/35212213/sklearn-how-to-get-decision-probabilities-for-linearsvc-classifier)

lsvc = LinearSVC(verbose=10)

cal_lsvc = CalibratedClassifierCV(base_estimator = lsvc,
                                  cv = 3, # Also performs cross-validation
                                  method= "sigmoid") # We use sigmoid function to get probabilities

cal_lsvc.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

[LibLinear][LibLinear][LibLinear]It took: 2.25 minutes.
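
To make the `method="sigmoid"` option more concrete: sigmoid calibration (Platt scaling) maps the raw decision score `f` of the SVM to a probability through `1 / (1 + exp(A*f + B))`. The sketch below applies that mapping by hand. The values of `A` and `B` are hypothetical; in the cell above they are fitted by `CalibratedClassifierCV` on held-out folds.

```python
import math

def platt_probability(score, a, b):
    """Map a raw decision score to a probability with a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(a * score + b))

# Hypothetical calibration parameters (a negative A means that larger
# decision scores give larger probabilities for the positive class)
A, B = -2.0, 0.5

decision_scores = [-1.5, 0.0, 2.0]  # e.g. outputs of decision_function
probs = [platt_probability(s, A, B) for s in decision_scores]

for s, p in zip(decision_scores, probs):
    print("score = %5.2f -> probability = %.4f" % (s, p))
```

Because the mapping is monotonic, calibration changes the probability values but not the ranking of the observations, so it should have little effect on a rank-based metric such as ROC AUC.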


In [7]:
probs = cal_lsvc.predict_proba(X_train_trans_pl1)[:,1] 

In [10]:
# Calculate ROC score
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train,probs)

0.95099179905081954

This is a good start for an untuned classifier, keeping in mind that the score is computed on the training data. Let's run a hyperparameter search to see if we can improve it.
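
For intuition about the metric used above, ROC AUC can be read as the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that definition on toy data (the labels and scores are made up for illustration). It compares every positive/negative pair, so it is only practical for small inputs; `roc_auc_score` is the tool to use on the real data.

```python
def pairwise_roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 2 negatives, 2 positives
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Since only the ordering of the scores matters, this metric remains informative even with the strong class imbalance in this data set.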

## Hyperparameter tuning for SVC classifier

First, we need to see which LinearSVC parameters can be tuned through the calibrated classifier wrapper:

In [12]:
cal_lsvc.get_params()

{'base_estimator': LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
      intercept_scaling=1, loss='squared_hinge', max_iter=1000,
      multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
      verbose=10),
 'base_estimator__C': 1.0,
 'base_estimator__class_weight': None,
 'base_estimator__dual': True,
 'base_estimator__fit_intercept': True,
 'base_estimator__intercept_scaling': 1,
 'base_estimator__loss': 'squared_hinge',
 'base_estimator__max_iter': 1000,
 'base_estimator__multi_class': 'ovr',
 'base_estimator__penalty': 'l2',
 'base_estimator__random_state': None,
 'base_estimator__tol': 0.0001,
 'base_estimator__verbose': 10,
 'cv': 3,
 'method': 'sigmoid'}
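
The double underscore in names such as `base_estimator__C` is how scikit-learn addresses a parameter of a nested estimator: the part before `__` names the component and the part after it names the parameter on that component. The sketch below only illustrates this naming convention in plain Python; `route_param` is a hypothetical helper, not a scikit-learn function.

```python
def route_param(name):
    """Split a parameter name into (component, parameter).

    Names without a double underscore belong to the wrapper itself,
    which is reported here as component None.
    """
    component, delimiter, sub_param = name.partition("__")
    if delimiter:
        return component, sub_param
    return None, component

names = ["base_estimator__C", "base_estimator__penalty", "cv", "method"]
routed = {name: route_param(name) for name in names}

for name, (component, param) in routed.items():
    target = component if component is not None else "the wrapper itself"
    print("%-26s -> parameter %r on %s" % (name, param, target))
```

This is why the search space for the wrapped classifier has to use the `base_estimator__` prefix for every LinearSVC parameter.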

 Parameters
 
 |  penalty : string, 'l1' or 'l2' (default='l2')
 |      Specifies the norm used in the penalization. The 'l2'
 |      penalty is the standard used in SVC. The 'l1' leads to ``coef_``
 |      vectors that are sparse.
 |  
 
 
 |  loss : string, 'hinge' or 'squared_hinge' (default='squared_hinge')
 |      Specifies the loss function. 'hinge' is the standard SVM loss
 |      (used e.g. by the SVC class) while 'squared_hinge' is the
 |      square of the hinge loss.
 
 
 |  dual : bool, (default=True)
 |      Select the algorithm to either solve the dual or primal
 |      optimization problem. Prefer dual=False when n_samples > n_features.
 
 
 |  tol : float, optional (default=1e-4)
 |      Tolerance for stopping criteria.
 
 
 |  C : float, optional (default=1.0)
 |      Penalty parameter C of the error term.
 
 
 |  multi_class : string, 'ovr' or 'crammer_singer' (default='ovr')
 |      Determines the multi-class strategy if `y` contains more than
 |      two classes.
 |      ``"ovr"`` trains n_classes one-vs-rest classifiers, while
 |      ``"crammer_singer"`` optimizes a joint objective over all classes.
 |      While `crammer_singer` is interesting from a theoretical perspective
 |      as it is consistent, it is seldom used in practice as it rarely leads
 |      to better accuracy and is more expensive to compute.
 |      If ``"crammer_singer"`` is chosen, the options loss, penalty and dual
 |      will be ignored.
 
 
 |  fit_intercept : boolean, optional (default=True)
 |      Whether to calculate the intercept for this model. If set
 |      to false, no intercept will be used in calculations
 |      (i.e. data is expected to be already centered).
 
 
 |  intercept_scaling : float, optional (default=1)
 |      When self.fit_intercept is True, instance vector x becomes
 |      ``[x, self.intercept_scaling]``,
 |      i.e. a "synthetic" feature with constant value equals to
 |      intercept_scaling is appended to the instance vector.
 |      The intercept becomes intercept_scaling * synthetic feature weight
 |      Note! the synthetic feature weight is subject to l1/l2 regularization
 |      as all other features.
 |      To lessen the effect of regularization on synthetic feature weight
 |      (and therefore on the intercept) intercept_scaling has to be increased.
 
 
 |  class_weight : {dict, 'balanced'}, optional
 |      Set the parameter C of class i to ``class_weight[i]*C`` for
 |      SVC. If not given, all classes are supposed to have
 |      weight one.
 |      The "balanced" mode uses the values of y to automatically adjust
 |      weights inversely proportional to class frequencies in the input data
 |      as ``n_samples / (n_classes * np.bincount(y))``
 
 
 |  verbose : int, (default=0)
 |      Enable verbose output. Note that this setting takes advantage of a
 |      per-process runtime setting in liblinear that, if enabled, may not work
 |      properly in a multithreaded context.
 
 
 |  random_state : int, RandomState instance or None, optional (default=None)
 |      The seed of the pseudo random number generator to use when shuffling
 |      the data.  If int, random_state is the seed used by the random number
 |      generator; If RandomState instance, random_state is the random number
 |      generator; If None, the random number generator is the RandomState
 |      instance used by `np.random`.
 
 
 |  max_iter : int, (default=1000)
 |      The maximum number of iterations to be run.
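
As a worked example of the `class_weight="balanced"` formula quoted above, `n_samples / (n_classes * np.bincount(y))`, the sketch below computes the weights for a strongly imbalanced target. The class counts are hypothetical: they assume 1,000,000 rows split according to the `class_prior` values `[0.99831, 0.00169]` used in the MNB search.

```python
# Hypothetical class counts (assumption: 1,000,000 rows split according
# to the class_prior values [0.99831, 0.00169] used earlier)
counts = {0: 998310, 1: 1690}

n_samples = sum(counts.values())
n_classes = len(counts)

# Same formula as in the docstring: n_samples / (n_classes * count_of_class)
balanced = {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

for label, weight in balanced.items():
    print("class %d: weight = %.4f" % (label, weight))
```

Under these assumed counts, the rare class gets a weight of roughly 296 against about 0.5 for the majority class, so errors on the rare class are penalized several hundred times more heavily.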

In [25]:
import datetime
start = datetime.datetime.now()

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import RandomizedSearchCV

lsvc = LinearSVC(verbose=10)

cal_lsvc = CalibratedClassifierCV(base_estimator = lsvc,
                                  cv = 3, # Also performs cross-validation if needed
                                  method= "sigmoid") # We use sigmoid function to get probabilities

params_space = {
    "base_estimator__penalty":['l1','l2'],
    "base_estimator__dual":[False,True],
    "base_estimator__C":np.logspace(0.1,100,base = 2, num=100)   
}

CAL_LSVC_search = RandomizedSearchCV(cal_lsvc,
                                     param_distributions= params_space,
                                     n_jobs=3, cv = 3, 
                                     n_iter = 20,verbose=10)

CAL_LSVC_search.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] base_estimator__penalty=l1, base_estimator__dual=True, base_estimator__C=1.65361574567e+15 
[CV] base_estimator__penalty=l1, base_estimator__dual=True, base_estimator__C=1.65361574567e+15 
[CV] base_estimator__penalty=l1, base_estimator__dual=True, base_estimator__C=1.65361574567e+15 
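
The log lines above already hint at two issues with this search space. First, the sampled candidate combines `penalty='l1'` with `dual=True`; as far as we know, LinearSVC rejects this combination for the default `squared_hinge` loss, which is a likely cause of the exception that follows. Second, `np.logspace(0.1, 100, base=2, num=100)` spans `C` from about 2^0.1 to 2^100, and the sampled `C` of about 1.65e15 is the 51st point of that grid. The sketch below reproduces both observations in plain Python. The `is_supported` helper is hypothetical and only encodes our understanding of the liblinear constraints.

```python
from itertools import product

def is_supported(penalty, loss, dual):
    """Our understanding of which LinearSVC combinations liblinear accepts."""
    if penalty == "l2" and loss == "squared_hinge":
        return True          # both the primal and the dual solver exist
    if penalty == "l2" and loss == "hinge":
        return dual          # dual formulation only
    if penalty == "l1" and loss == "squared_hinge":
        return not dual      # primal formulation only
    return False             # l1 penalty with hinge loss is not available

# Plain-Python equivalent of np.logspace(0.1, 100, base=2, num=100):
# 100 exponents spaced evenly from 0.1 to 100, then 2 ** exponent
exponents = [0.1 + k * (100 - 0.1) / 99 for k in range(100)]
c_grid = [2.0 ** e for e in exponents]

print("smallest C:", c_grid[0])
print("largest C :", c_grid[-1])
print("51st C    :", c_grid[50])  # compare with C=1.65361574567e+15 in the log

# Keep only the (penalty, dual) pairs that work with the default loss
valid_pairs = [(p, d) for p, d in product(["l1", "l2"], [False, True])
               if is_supported(p, "squared_hinge", d)]
print(valid_pairs)
```

A narrower range of `C` and a search space restricted to the valid pairs would avoid both problems. Recent scikit-learn versions accept a list of dictionaries for `param_distributions`, which is one way to express such restrictions; whether the version used in this notebook supports that is not something we have checked.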


JoblibValueError: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell='import datetime\nstart = datetime.datetime.now()\n...: " + str(process_time.seconds/60) + " minutes.")', store_history=True, silent=False, shell_futures=True)
   2723                 self.displayhook.exec_result = result
   2724 
   2725                 # Execute the user code
   2726                 interactivity = "none" if silent else self.ast_node_interactivity
   2727                 has_raised = self.run_ast_nodes(code_ast.body, cell_name,
-> 2728                    interactivity=interactivity, compiler=compiler, result=result)
        interactivity = 'last_expr'
        compiler = <IPython.core.compilerop.CachingCompiler object>
   2729                 
   2730                 self.last_execution_succeeded = not has_raised
   2731                 self.last_execution_result = result
   2732 

...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_ast_nodes(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, nodelist=[<_ast.Import object>, <_ast.Assign object>, <_ast.ImportFrom object>, <_ast.ImportFrom object>, <_ast.ImportFrom object>, <_ast.Assign object>, <_ast.Assign object>, <_ast.Assign object>, <_ast.Assign object>, <_ast.Expr object>, <_ast.Assign object>, <_ast.Assign object>, <_ast.Expr object>], cell_name='<ipython-input-25-de1c77550da9>', interactivity='last', compiler=<IPython.core.compilerop.CachingCompiler object>, result=<ExecutionResult object at 1a290950f0, execution..._before_exec=None error_in_exec=None result=None>)
   2845 
   2846         try:
   2847             for i, node in enumerate(to_run_exec):
   2848                 mod = ast.Module([node])
   2849                 code = compiler(mod, cell_name, "exec")
-> 2850                 if self.run_code(code, result):
        self.run_code = <bound method InteractiveShell.run_code of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = <code object <module> at 0x10bac21e0, file "<ipython-input-25-de1c77550da9>", line 25>
        result = <ExecutionResult object at 1a290950f0, execution..._before_exec=None error_in_exec=None result=None>
   2851                     return True
   2852 
   2853             for i, node in enumerate(to_run_interactive):
   2854                 mod = ast.Interactive([node])

...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_code(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, code_obj=<code object <module> at 0x10bac21e0, file "<ipython-input-25-de1c77550da9>", line 25>, result=<ExecutionResult object at 1a290950f0, execution..._before_exec=None error_in_exec=None result=None>)
   2905         outflag = True  # happens in more places, so it's easier as default
   2906         try:
   2907             try:
   2908                 self.hooks.pre_run_code_hook()
   2909                 #rprint('Running code', repr(code_obj)) # dbg
-> 2910                 exec(code_obj, self.user_global_ns, self.user_ns)
        code_obj = <code object <module> at 0x10bac21e0, file "<ipython-input-25-de1c77550da9>", line 25>
        self.user_global_ns = {'CAL_LSVC_search': RandomizedSearchCV(cv=3, error_score='raise',
  ...urn_train_score='warn', scoring=None, verbose=10), 'CalibratedClassifierCV': <class 'sklearn.calibration.CalibratedClassifierCV'>, 'In': ['', 'from sklearn.svm import LinearSVC\nfrom sklearn.calibration import CalibratedClassifierCV', 'import datetime\nstart = datetime.datetime.now()\n...: " + str(process_time.seconds/60) + " minutes.")', '# Read the transformed features and target label...in.pkl","rb") as f:\n    y_train = pickle.load(f) ', 'import datetime\nstart = datetime.datetime.now()\n...: " + str(process_time.seconds/60) + " minutes.")', 'probs = cal_lsvc.predict_proba(X_train_trans_pl1) ', 'probs', 'probs = cal_lsvc.predict_proba(X_train_trans_pl1)[:,1] ', 'probs', '# Calculate ROC score\nfrom sklearn.metrics import roc_auc_score', '# Calculate ROC score\nfrom sklearn.metrics import roc_auc_score\nroc_auc_score(y_train,probs)', 'cal_lsvc.get_params', 'cal_lsvc.get_params()', 'help(LinearSVC)', 'np.logspace(0.1,1000)', 'np.logspace(1,1000)', 'np.logspace(1000)', 'np.logspace(0.1,1000,2)', 'np.logspace(0.1,1000,100)', 'np.logspace(0.1,1000,10)', ...], 'LinearSVC': <class 'sklearn.svm.classes.LinearSVC'>, 'Out': {6: array([[  9.98604863e-01,   1.39513749e-03],
   ...3],
       [  9.98843163e-01,   1.15683681e-03]]), 8: array([  1.39513749e-03,   1.37266016e-03,   1.3...1099911e-05,   1.23671390e-03,   1.15683681e-03]), 10: 0.95099179905081954, 11: <bound method BaseEstimator.get_params of Calibr...verbose=10),
            cv=3, method='sigmoid')>, 12: {'base_estimator': LinearSVC(C=1.0, class_weight=None, dual=True, f..., random_state=None, tol=0.0001,
     verbose=10), 'base_estimator__C': 1.0, 'base_estimator__class_weight': None, 'base_estimator__dual': True, 'base_estimator__fit_intercept': True, 'base_estimator__intercept_scaling': 1, 'base_estimator__loss': 'squared_hinge', 'base_estimator__max_iter': 1000, 'base_estimator__multi_class': 'ovr', 'base_estimator__penalty': 'l2', ...}, 14: array([  1.25892541e+000,   3.20717346e+020,   8...nf,
                     inf,               inf]), 15: array([  1.00000000e+001,   2.44205309e+021,   5...nf,
                     inf,               inf]), 17: array([ 1.25892541,         inf]), 18: array([  1.25892541e+000,   1.58489319e+010,   1...nf,               inf,
                     inf]), 19: array([  1.25892541e+000,   1.58489319e+111,   1...nf,               inf,
                     inf]), ...}, 'RandomizedSearchCV': <class 'sklearn.model_selection._search.RandomizedSearchCV'>, 'X_train_trans_pl1': <1000000x45753 sparse matrix of type '<class 'nu...ored elements in Compressed Sparse Column format>, '_': array([  1.07177346e+00,   2.15709679e+00,   4.3...e+29,   6.29843910e+29,
         1.26765060e+30]), '_10': 0.95099179905081954, '_11': <bound method BaseEstimator.get_params of Calibr...verbose=10),
            cv=3, method='sigmoid')>, ...}
        self.user_ns = {'CAL_LSVC_search': RandomizedSearchCV(cv=3, error_score='raise',
  ...urn_train_score='warn', scoring=None, verbose=10), 'CalibratedClassifierCV': <class 'sklearn.calibration.CalibratedClassifierCV'>, 'In': ['', 'from sklearn.svm import LinearSVC\nfrom sklearn.calibration import CalibratedClassifierCV', 'import datetime\nstart = datetime.datetime.now()\n...: " + str(process_time.seconds/60) + " minutes.")', '# Read the transformed features and target label...in.pkl","rb") as f:\n    y_train = pickle.load(f) ', 'import datetime\nstart = datetime.datetime.now()\n...: " + str(process_time.seconds/60) + " minutes.")', 'probs = cal_lsvc.predict_proba(X_train_trans_pl1) ', 'probs', 'probs = cal_lsvc.predict_proba(X_train_trans_pl1)[:,1] ', 'probs', '# Calculate ROC score\nfrom sklearn.metrics import roc_auc_score', '# Calculate ROC score\nfrom sklearn.metrics import roc_auc_score\nroc_auc_score(y_train,probs)', 'cal_lsvc.get_params', 'cal_lsvc.get_params()', 'help(LinearSVC)', 'np.logspace(0.1,1000)', 'np.logspace(1,1000)', 'np.logspace(1000)', 'np.logspace(0.1,1000,2)', 'np.logspace(0.1,1000,100)', 'np.logspace(0.1,1000,10)', ...], 'LinearSVC': <class 'sklearn.svm.classes.LinearSVC'>, 'Out': {6: array([[  9.98604863e-01,   1.39513749e-03],
   ...3],
       [  9.98843163e-01,   1.15683681e-03]]), 8: array([  1.39513749e-03,   1.37266016e-03,   1.3...1099911e-05,   1.23671390e-03,   1.15683681e-03]), 10: 0.95099179905081954, 11: <bound method BaseEstimator.get_params of Calibr...verbose=10),
            cv=3, method='sigmoid')>, 12: {'base_estimator': LinearSVC(C=1.0, class_weight=None, dual=True, f..., random_state=None, tol=0.0001,
     verbose=10), 'base_estimator__C': 1.0, 'base_estimator__class_weight': None, 'base_estimator__dual': True, 'base_estimator__fit_intercept': True, 'base_estimator__intercept_scaling': 1, 'base_estimator__loss': 'squared_hinge', 'base_estimator__max_iter': 1000, 'base_estimator__multi_class': 'ovr', 'base_estimator__penalty': 'l2', ...}, 14: array([  1.25892541e+000,   3.20717346e+020,   8...nf,
                     inf,               inf]), 15: array([  1.00000000e+001,   2.44205309e+021,   5...nf,
                     inf,               inf]), 17: array([ 1.25892541,         inf]), 18: array([  1.25892541e+000,   1.58489319e+010,   1...nf,               inf,
                     inf]), 19: array([  1.25892541e+000,   1.58489319e+111,   1...nf,               inf,
                     inf]), ...}, 'RandomizedSearchCV': <class 'sklearn.model_selection._search.RandomizedSearchCV'>, 'X_train_trans_pl1': <1000000x45753 sparse matrix of type '<class 'nu...ored elements in Compressed Sparse Column format>, '_': array([  1.07177346e+00,   2.15709679e+00,   4.3...e+29,   6.29843910e+29,
         1.26765060e+30]), '_10': 0.95099179905081954, '_11': <bound method BaseEstimator.get_params of Calibr...verbose=10),
            cv=3, method='sigmoid')>, ...}
   2911             finally:
   2912                 # Reset our crash handler in place
   2913                 sys.excepthook = old_excepthook
   2914         except SystemExit as e:

...........................................................................
/Users/OZANAYGUN/Desktop/2016/Data_science/Kaggle/User-click-detection-predictive-modeling/<ipython-input-25-de1c77550da9> in <module>()
     20 CAL_LSVC_search = RandomizedSearchCV(cal_lsvc,
     21                                      param_distributions= params_space,
     22                                      n_jobs=3, cv = 3, 
     23                                      n_iter = 20,verbose=10)
     24 
---> 25 CAL_LSVC_search.fit(X_train_trans_pl1,y_train)
     26 
     27 end = datetime.datetime.now()
     28 process_time = end - start
     29 print("It took: " + str(process_time.seconds/60) + " minutes.")

...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self=RandomizedSearchCV(cv=3, error_score='raise',
  ...urn_train_score='warn', scoring=None, verbose=10), X=<1000000x45753 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>, y=0         0
1         0
2         0
3         0
...ame: is_attributed, Length: 1000000, dtype: int64, groups=None, **fit_params={})
    634                                   return_train_score=self.return_train_score,
    635                                   return_n_test_samples=True,
    636                                   return_times=True, return_parameters=False,
    637                                   error_score=self.error_score)
    638           for parameters, (train, test) in product(candidate_params,
--> 639                                                    cv.split(X, y, groups)))
        cv.split = <bound method StratifiedKFold.split of Stratifie...ld(n_splits=3, random_state=None, shuffle=False)>
        X = <1000000x45753 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>
        y = 0         0
1         0
2         0
3         0
...ame: is_attributed, Length: 1000000, dtype: int64
        groups = None
    640 
    641         # if one choose to see train score, "out" will contain train score info
    642         if self.return_train_score:
    643             (train_score_dicts, test_score_dicts, test_sample_counts, fit_time,

...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=3), iterable=<generator object BaseSearchCV.fit.<locals>.<genexpr>>)
    784             if pre_dispatch == "all" or n_jobs == 1:
    785                 # The iterable was consumed all at once by the above for loop.
    786                 # No need to wait for async callbacks to trigger to
    787                 # consumption.
    788                 self._iterating = False
--> 789             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=3)>
    790             # Make sure that we get a last message telling us we are done
    791             elapsed_time = time.time() - self._start_time
    792             self._print('Done %3i out of %3i | elapsed: %s finished',
    793                         (len(self._output), len(self._output),

---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError                                         Fri Mar 30 11:20:57 2018
PID: 3028                Python 3.6.1: /Users/OZANAYGUN/anaconda/bin/python
...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function _fit_and_score>, (CalibratedClassifierCV(base_estimator=LinearSVC(... verbose=10),
            cv=3, method='sigmoid'), <1000000x45753 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>, 0         0
1         0
2         0
3         0
...ame: is_attributed, Length: 1000000, dtype: int64, {'score': <function _passthrough_scorer>}, memmap([332819, 333335, 333336, ..., 999997, 999998, 999999]), memmap([     0,      1,      2, ..., 333332, 333333, 333334]), 10, {'base_estimator__C': 1653615745673469.2, 'base_estimator__dual': True, 'base_estimator__penalty': 'l1'}), {'error_score': 'raise', 'fit_params': {}, 'return_n_test_samples': True, 'return_parameters': False, 'return_times': True, 'return_train_score': 'warn'})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _fit_and_score>
        args = (CalibratedClassifierCV(base_estimator=LinearSVC(... verbose=10),
            cv=3, method='sigmoid'), <1000000x45753 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>, 0         0
1         0
2         0
3         0
...ame: is_attributed, Length: 1000000, dtype: int64, {'score': <function _passthrough_scorer>}, memmap([332819, 333335, 333336, ..., 999997, 999998, 999999]), memmap([     0,      1,      2, ..., 333332, 333333, 333334]), 10, {'base_estimator__C': 1653615745673469.2, 'base_estimator__dual': True, 'base_estimator__penalty': 'l1'})
        kwargs = {'error_score': 'raise', 'fit_params': {}, 'return_n_test_samples': True, 'return_parameters': False, 'return_times': True, 'return_train_score': 'warn'}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator=CalibratedClassifierCV(base_estimator=LinearSVC(... verbose=10),
            cv=3, method='sigmoid'), X=<1000000x45753 sparse matrix of type '<class 'nu... stored elements in Compressed Sparse Row format>, y=0         0
1         0
2         0
3         0
...ame: is_attributed, Length: 1000000, dtype: int64, scorer={'score': <function _passthrough_scorer>}, train=memmap([332819, 333335, 333336, ..., 999997, 999998, 999999]), test=memmap([     0,      1,      2, ..., 333332, 333333, 333334]), verbose=10, parameters={'base_estimator__C': 1653615745673469.2, 'base_estimator__dual': True, 'base_estimator__penalty': 'l1'}, fit_params={}, return_train_score='warn', return_parameters=False, return_n_test_samples=True, return_times=True, error_score='raise')
    453 
    454     try:
    455         if y_train is None:
    456             estimator.fit(X_train, **fit_params)
    457         else:
--> 458             estimator.fit(X_train, y_train, **fit_params)
        estimator.fit = <bound method CalibratedClassifierCV.fit of Cali...verbose=10),
            cv=3, method='sigmoid')>
        X_train = <666666x45753 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>
        y_train = 332819    1
333335    0
333336    0
333337    0
...Name: is_attributed, Length: 666666, dtype: int64
        fit_params = {}
    459 
    460     except Exception as e:
    461         # Note fit time as time until error
    462         fit_time = time.time() - start_time

...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/sklearn/calibration.py in fit(self=CalibratedClassifierCV(base_estimator=LinearSVC(... verbose=10),
            cv=3, method='sigmoid'), X=<666666x45753 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>, y=array([1, 0, 0, ..., 0, 0, 0]), sample_weight=None)
    176                 if base_estimator_sample_weight is not None:
    177                     this_estimator.fit(
    178                         X[train], y[train],
    179                         sample_weight=base_estimator_sample_weight[train])
    180                 else:
--> 181                     this_estimator.fit(X[train], y[train])
        this_estimator.fit = <bound method LinearSVC.fit of LinearSVC(C=16536... random_state=None,
     tol=0.0001, verbose=10)>
        X = <666666x45753 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>
        train = array([215398, 215495, 216584, ..., 666663, 666664, 666665])
        y = array([1, 0, 0, ..., 0, 0, 0])
    182 
    183                 calibrated_classifier = _CalibratedClassifier(
    184                     this_estimator, method=self.method,
    185                     classes=self.classes_)

...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/sklearn/svm/classes.py in fit(self=LinearSVC(C=1653615745673469.2, class_weight=Non..., random_state=None,
     tol=0.0001, verbose=10), X=<444444x45753 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>, y=array([1, 1, 1, ..., 0, 0, 0]), sample_weight=None)
    230 
    231         self.coef_, self.intercept_, self.n_iter_ = _fit_liblinear(
    232             X, y, self.C, self.fit_intercept, self.intercept_scaling,
    233             self.class_weight, self.penalty, self.dual, self.verbose,
    234             self.max_iter, self.tol, self.random_state, self.multi_class,
--> 235             self.loss, sample_weight=sample_weight)
        self.loss = 'squared_hinge'
        sample_weight = None
    236 
    237         if self.multi_class == "crammer_singer" and len(self.classes_) == 2:
    238             self.coef_ = (self.coef_[1] - self.coef_[0]).reshape(1, -1)
    239             if self.fit_intercept:

...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/sklearn/svm/base.py in _fit_liblinear(X=<444444x45753 sparse matrix of type '<class 'num... stored elements in Compressed Sparse Row format>, y=array([1, 1, 1, ..., 0, 0, 0]), C=1653615745673469.2, fit_intercept=True, intercept_scaling=1, class_weight=None, penalty='l1', dual=True, verbose=10, max_iter=1000, tol=0.0001, random_state=None, multi_class='ovr', loss='squared_hinge', epsilon=0.1, sample_weight=array([ 1.,  1.,  1., ...,  1.,  1.,  1.]))
    881         sample_weight = np.ones(X.shape[0])
    882     else:
    883         sample_weight = np.array(sample_weight, dtype=np.float64, order='C')
    884         check_consistent_length(sample_weight, X)
    885 
--> 886     solver_type = _get_liblinear_solver_type(multi_class, penalty, loss, dual)
        solver_type = undefined
        multi_class = 'ovr'
        penalty = 'l1'
        loss = 'squared_hinge'
        dual = True
    887     raw_coef_, n_iter_ = liblinear.train_wrap(
    888         X, y_ind, sp.isspmatrix(X), solver_type, tol, bias, C,
    889         class_weight_, max_iter, rnd.randint(np.iinfo('i').max),
    890         epsilon, sample_weight)

...........................................................................
/Users/OZANAYGUN/anaconda/lib/python3.6/site-packages/sklearn/svm/base.py in _get_liblinear_solver_type(multi_class='ovr', penalty='l1', loss='squared_hinge', dual=True)
    742                                 % (penalty, loss, dual))
    743             else:
    744                 return solver_num
    745     raise ValueError('Unsupported set of arguments: %s, '
    746                      'Parameters: penalty=%r, loss=%r, dual=%r'
--> 747                      % (error_string, penalty, loss, dual))
        error_string = "The combination of penalty='l1' and loss='squared_hinge' are not supported when dual=True"
        penalty = 'l1'
        loss = 'squared_hinge'
        dual = True
    748 
    749 
    750 def _fit_liblinear(X, y, C, fit_intercept, intercept_scaling, class_weight,
    751                    penalty, dual, verbose, max_iter, tol,

ValueError: Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True
___________________________________________________________________________
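
The traceback above ends in `_get_liblinear_solver_type`: the search drew `penalty='l1'` together with `dual=True`, a combination liblinear does not support with `loss='squared_hinge'`. The drawn `C` (about 1.65e15) is also very large; `np.logspace` takes exponents rather than end values, so `np.logspace(0.1, 1000)` spans 10^0.1 to 10^1000 and overflows to `inf`, as the earlier outputs show. Below is a minimal sketch of one way to avoid both problems by sampling only from supported combinations and a bounded `C` range. The names `SUPPORTED`, `is_supported`, `C_VALUES`, and `sample_params` are hypothetical helpers, not part of scikit-learn, and the `base_estimator__` prefix is taken from the parameter names in the traceback (newer scikit-learn releases may use a different prefix).

```python
import random

# (penalty, loss, dual) combinations that liblinear accepts for LinearSVC.
# The combination from the traceback, ("l1", "squared_hinge", True), is absent.
SUPPORTED = {
    ("l2", "hinge", True),
    ("l2", "squared_hinge", True),
    ("l2", "squared_hinge", False),
    ("l1", "squared_hinge", False),
}


def is_supported(penalty, loss, dual):
    """Return True if liblinear can fit this penalty/loss/dual combination."""
    return (penalty, loss, dual) in SUPPORTED


# Bounded log-scale grid for C: 10^-3 ... 10^3 (an assumed, illustrative range).
C_VALUES = [10.0 ** exponent for exponent in range(-3, 4)]


def sample_params(n_iter, seed=0, prefix="base_estimator__"):
    """Draw n_iter parameter dicts that only contain supported combinations."""
    rng = random.Random(seed)
    combos = sorted(SUPPORTED)  # sorted so that draws are reproducible
    samples = []
    for _ in range(n_iter):
        penalty, loss, dual = rng.choice(combos)
        samples.append({
            prefix + "C": rng.choice(C_VALUES),
            prefix + "penalty": penalty,
            prefix + "loss": loss,
            prefix + "dual": dual,
        })
    return samples


candidates = sample_params(20)
```

Each dict in `candidates` could then be applied with `cal_lsvc.set_params(**params)` inside a manual cross-validation loop. In scikit-learn versions where `param_distributions` accepts a list of dicts, passing one dict per supported combination to `RandomizedSearchCV` should have a similar effect.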