<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose, provided that the authors and the source are credited.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [164]:
import warnings
import pickle
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [174]:
PATH_TO_DATA = '/home/andrei/Desktop/alice_kaggle'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

Separate target feature 

In [175]:
y = train_df['target']
train_df.drop('target', axis=1, inplace=True)
times = ['time%s' % i for i in range(1, 11)]
train_df[times] = train_df[times].apply(pd.to_datetime)
test_df[times] = test_df[times].apply(pd.to_datetime)

train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,site6,time6,site7,time7,site8,time8,site9,time9,site10,time10
1,718,2014-02-20 10:02:45,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT
2,890,2014-02-22 11:19:50,941.0,2014-02-22 11:19:50,3847.0,2014-02-22 11:19:51,941.0,2014-02-22 11:19:51,942.0,2014-02-22 11:19:51,3846.0,2014-02-22 11:19:51,3847.0,2014-02-22 11:19:52,3846.0,2014-02-22 11:19:52,1516.0,2014-02-22 11:20:15,1518.0,2014-02-22 11:20:16
3,14769,2013-12-16 16:40:17,39.0,2013-12-16 16:40:18,14768.0,2013-12-16 16:40:19,14769.0,2013-12-16 16:40:19,37.0,2013-12-16 16:40:19,39.0,2013-12-16 16:40:19,14768.0,2013-12-16 16:40:20,14768.0,2013-12-16 16:40:21,14768.0,2013-12-16 16:40:22,14768.0,2013-12-16 16:40:24
4,782,2014-03-28 10:52:12,782.0,2014-03-28 10:52:42,782.0,2014-03-28 10:53:12,782.0,2014-03-28 10:53:42,782.0,2014-03-28 10:54:12,782.0,2014-03-28 10:54:42,782.0,2014-03-28 10:55:12,782.0,2014-03-28 10:55:42,782.0,2014-03-28 10:56:12,782.0,2014-03-28 10:56:42
5,22,2014-02-28 10:53:05,177.0,2014-02-28 10:55:22,175.0,2014-02-28 10:55:22,178.0,2014-02-28 10:55:23,177.0,2014-02-28 10:55:23,178.0,2014-02-28 10:55:59,175.0,2014-02-28 10:55:59,177.0,2014-02-28 10:55:59,177.0,2014-02-28 10:57:06,178.0,2014-02-28 10:57:11


In [176]:
# load the site dictionary
with open(r"../../data/site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

# DataFrame of the site dictionary, indexed by site ID
sites_dict_df = pd.DataFrame(list(site_dict.keys()), 
                          index=list(site_dict.values()), 
                          columns=['site'])
print('total sites:', sites_dict_df.shape[0])
sites_dict_df.head()

total sites: 48371


site_id,site
4861,i1-js-14-3-01-10005-485187784-i.init.cedexis-r...
42718,bouletcorp-new.typhon.net
45534,images.askmen.com
21782,i1-js-14-3-01-10013-547470276-i.init.cedexis-r...
47142,i1-js-14-3-01-10005-933748847-i.init.cedexis-r...


In [177]:
site_name_columns = ['site_column%s' % i for i in range(1, 11)]
site_names = sites_dict_df['site']  # Series: site ID -> host name

for df in (train_df, test_df):
    for i in range(1, 11):
        # Map site IDs to host names; empty session slots (NaN) become ''
        df['site_column%s' % i] = df['site%s' % i].map(site_names).fillna('')
    # Join the host names with spaces so that adjacent names stay separate tokens
    df['text_col'] = df[site_name_columns].apply(' '.join, axis=1).str.strip()

print(train_df.head())

# Fit the word-level Tf-Idf vocabulary on every individual site name in the training set
all_train_text = pd.concat([train_df[c] for c in site_name_columns])
word_vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), max_features=100000)
word_vec.fit(all_train_text)

            site1               time1  site2               time2    site3  \
session_id                                                                  
1             718 2014-02-20 10:02:45    NaN                 NaT      NaN   
2             890 2014-02-22 11:19:50  941.0 2014-02-22 11:19:50   3847.0   
3           14769 2013-12-16 16:40:17   39.0 2013-12-16 16:40:18  14768.0   
4             782 2014-03-28 10:52:12  782.0 2014-03-28 10:52:42    782.0   
5              22 2014-02-28 10:53:05  177.0 2014-02-28 10:55:22    175.0   

                         time3    site4               time4  site5  \
session_id                                                           
1                          NaT      NaN                 NaT    NaN   
2          2014-02-22 11:19:51    941.0 2014-02-22 11:19:51  942.0   
3          2013-12-16 16:40:19  14769.0 2013-12-16 16:40:19   37.0   
4          2014-03-28 10:53:12    782.0 2014-03-28 10:53:42  782.0   
5          2014-02-28 10:55:22    178.0 

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=100000, min_df=1,
        ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [178]:
train_tfidf = [word_vec.transform(train_df['text_col'].values.astype('U'))]
train_X = hstack(train_tfidf, format='csr')
print(train_X.shape)
test_tfidf = [word_vec.transform(test_df['text_col'].values.astype('U'))]
test_X = hstack(test_tfidf, format='csr')
print(test_X.shape)

(253561, 100000)
(82797, 100000)


Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

Add features based on the session start time: hour, whether it's morning, day or night and so on.

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).
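The three steps above can be sketched on toy data as follows. The session strings, the two numeric features, and every variable name here are made up for illustration. The `token_pattern` that treats a whole host name as one token and the choice to fit the scaler on the training rows only are assumptions of this sketch, not necessarily what the cells below do.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy sessions: each string is a space-separated list of visited sites (made-up data)
sessions_train = ['mail.google.com youtube.com', 'vk.com youtube.com vk.com', 'mail.google.com']
sessions_test = ['youtube.com vk.com']

# Toy numeric features: session start hour and duration in seconds (made-up data)
extra_train = np.array([[9.0, 30.0], [16.0, 120.0], [22.0, 0.0]])
extra_test = np.array([[10.0, 45.0]])

# Sites contain dots, so split on whitespace only and treat each site as one token
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100000, token_pattern=r'\S+')
tfidf_train = vectorizer.fit_transform(sessions_train)
tfidf_test = vectorizer.transform(sessions_test)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler()
scaled_train = scaler.fit_transform(extra_train)
scaled_test = scaler.transform(extra_test)

# Append the dense scaled columns to the sparse Tf-Idf matrix
X_train_demo = hstack([tfidf_train, scaled_train], format='csr')
X_test_demo = hstack([tfidf_test, scaled_test], format='csr')
print(X_train_demo.shape, X_test_demo.shape)
```

Reusing one fitted vectorizer and one fitted scaler for both sets keeps the train and test matrices at the same width and on the same scale.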

In [181]:
def get_part_from_hour(h):
    # 0 = night (0-5), 1 = morning (6-11), 2 = day (12-17), 3 = evening (18-23)
    if h < 6:
        return 0
    if h < 12:
        return 1
    if h < 18:
        return 2
    return 3


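def get_session_duration(row):
    # Seconds between the earliest and latest timestamp of the session;
    # all ten time columns are used, and empty slots (NaT) are dropped
    time_values = row[['time%s' % i for i in range(1, 11)]].dropna().values
    duration = (time_values.max() - time_values.min()).total_seconds()
    return duration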


def add_extra_features(df):
    df['hour'] = df['time1'].dt.hour
    df['part_of_day'] = df['hour'].apply(get_part_from_hour)

    for i in range(2, 11):
        df['delta%s' % (i-1)] = (df['time%s' % i] - df['time%s' % (i-1)]).dt.total_seconds()

    df['duration'] = df.apply(get_session_duration, axis=1)
    return df


def get_scaled_features(df):
    print(df.columns)
    scalable_columns = ['hour', 'part_of_day', 'delta1', 'delta2', 'delta3',
       'delta4', 'delta5', 'delta6', 'delta7', 'delta8', 'delta9', 'duration']
    return StandardScaler().fit_transform(df[scalable_columns].fillna(-1))


def get_extra_features(df):
    return get_scaled_features(add_extra_features(df))
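
Row-wise `apply` calls a Python function once per session, which can be slow on a few hundred thousand rows. A vectorized alternative for the session duration might look like the sketch below. The toy frame uses three time columns instead of ten, and the names `demo` and `time_columns` are made up for illustration.

```python
import pandas as pd

# Toy frame with three timestamps per session instead of ten (made-up data)
demo = pd.DataFrame({
    'time1': pd.to_datetime(['2014-02-20 10:02:45', '2014-02-22 11:19:50']),
    'time2': pd.to_datetime([None, '2014-02-22 11:19:51']),
    'time3': pd.to_datetime([None, '2014-02-22 11:20:16']),
})
time_columns = ['time1', 'time2', 'time3']

# max/min along axis=1 skip NaT, so short sessions need no special handling
duration = (demo[time_columns].max(axis=1) - demo[time_columns].min(axis=1)).dt.total_seconds()
print(duration.tolist())
```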

Perform cross-validation with logistic regression.
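
A minimal sketch of such a cross-validation, assuming stratified 5-fold splits and ROC AUC as the score. The data is synthetic and the names `X_demo`, `y_demo`, and `logit_demo` are made up for illustration; the real feature matrix is built in the cells that follow.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the sparse session matrix and the binary target (made-up data)
rng = np.random.RandomState(17)
X_demo = csr_matrix(rng.rand(200, 20))
y_demo = (X_demo[:, 0].toarray().ravel() + 0.1 * rng.randn(200) > 0.5).astype(int)

# Stratified folds keep the class ratio similar in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
logit_demo = LogisticRegression(C=1.0, random_state=17, max_iter=1000)

# One ROC AUC value per fold
cv_scores = cross_val_score(logit_demo, X_demo, y_demo, cv=skf, scoring='roc_auc')
print(cv_scores.mean(), cv_scores.std())
```

Looking at the spread across folds as well as the mean gives a rough idea of how much a small change in the score can be trusted.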

In [182]:
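print(train_df.columns)
# Build from the Tf-Idf blocks rather than from train_X/test_X, so that re-running
# this cell does not append the extra features a second time
train_X = hstack(train_tfidf + [get_extra_features(train_df)], format='csr')
test_X = hstack(test_tfidf + [get_extra_features(test_df)], format='csr')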

Index(['site1', 'time1', 'site2', 'time2', 'site3', 'time3', 'site4', 'time4',
       'site5', 'time5', 'site6', 'time6', 'site7', 'time7', 'site8', 'time8',
       'site9', 'time9', 'site10', 'time10', 'text_col', 'site_column1',
       'site_column2', 'site_column3', 'site_column4', 'site_column5',
       'site_column6', 'site_column7', 'site_column8', 'site_column9',
       'site_column10', 'hour', 'part_of_day', 'delta1', 'delta2', 'delta3',
       'duration'],
      dtype='object')
Index(['site1', 'time1', 'site2', 'time2', 'site3', 'time3', 'site4', 'time4',
       'site5', 'time5', 'site6', 'time6', 'site7', 'time7', 'site8', 'time8',
       'site9', 'time9', 'site10', 'time10', 'text_col', 'site_column1',
       'site_column2', 'site_column3', 'site_column4', 'site_column5',
       'site_column6', 'site_column7', 'site_column8', 'site_column9',
       'site_column10', 'hour', 'part_of_day', 'delta1', 'delta2', 'delta3',
       'duration', 'delta4', 'delta5', 'delta6', 'delta7',

Make predictions for the test set and create a submission file.
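
A minimal sketch of this step on synthetic data, assuming that the submission needs a `session_id` column numbered from 1 and a `target` column holding the predicted probability of class 1. The model, data, and variable names are made up for illustration.

```python
import io
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for train/test features (made-up data)
rng = np.random.RandomState(17)
X_tr, X_te = rng.rand(100, 5), rng.rand(10, 5)
y_tr = (X_tr[:, 0] > 0.5).astype(int)

model = LogisticRegression(random_state=17).fit(X_tr, y_tr)

# ROC AUC is computed from scores, so use the probability of class 1, not hard labels
test_proba = model.predict_proba(X_te)[:, 1]

# session_id is assumed to run from 1 to the number of test sessions
submission = pd.DataFrame({'target': test_proba},
                          index=pd.RangeIndex(1, len(test_proba) + 1, name='session_id'))

# Write to an in-memory buffer here; pass a file path to `to_csv` for a real submission
buffer = io.StringIO()
submission.to_csv(buffer)
print(buffer.getvalue().splitlines()[0])
```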

In [183]:
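logit = LogisticRegression(n_jobs=-1, random_state=17)

# Search over C = 10^-3 ... 10^2 with 5-fold cross-validation, scored by ROC AUC
searchCV = LogisticRegressionCV(
    Cs=list(np.power(10.0, np.arange(-3, 3))),
    penalty='l2',
    scoring='roc_auc',
    cv=5,
    random_state=17,
    verbose=2,
)
# Fit on the combined training matrix built above
searchCV.fit(train_X, y)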

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.9min finished


LogisticRegressionCV(Cs=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
           class_weight=None, cv=5, dual=False, fit_intercept=True,
           intercept_scaling=1.0, max_iter=100, multi_class='ovr',
           n_jobs=1, penalty='l2', random_state=17, refit=True,
           scoring='roc_auc', solver='lbfgs', tol=0.0001, verbose=2)

In [201]:
# The test sessions have no labels, so the model cannot be scored on test_X;
# inspect the mean cross-validated ROC AUC for each value of C instead
searchCV.scores_[1].mean(axis=0)

In [202]:
test_pred = searchCV.predict_proba(test_X)[:,1]

In [162]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

In [163]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")