<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
import seaborn as sns
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt

Reading original data

In [2]:
PATH_TO_DATA = '../../data'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

In [3]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [4]:
train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
1,718,2014-02-20 10:02:45,,,,,,,,,...,,,,,,,,,,0
2,890,2014-02-22 11:19:50,941.0,2014-02-22 11:19:50,3847.0,2014-02-22 11:19:51,941.0,2014-02-22 11:19:51,942.0,2014-02-22 11:19:51,...,2014-02-22 11:19:51,3847.0,2014-02-22 11:19:52,3846.0,2014-02-22 11:19:52,1516.0,2014-02-22 11:20:15,1518.0,2014-02-22 11:20:16,0
3,14769,2013-12-16 16:40:17,39.0,2013-12-16 16:40:18,14768.0,2013-12-16 16:40:19,14769.0,2013-12-16 16:40:19,37.0,2013-12-16 16:40:19,...,2013-12-16 16:40:19,14768.0,2013-12-16 16:40:20,14768.0,2013-12-16 16:40:21,14768.0,2013-12-16 16:40:22,14768.0,2013-12-16 16:40:24,0
4,782,2014-03-28 10:52:12,782.0,2014-03-28 10:52:42,782.0,2014-03-28 10:53:12,782.0,2014-03-28 10:53:42,782.0,2014-03-28 10:54:12,...,2014-03-28 10:54:42,782.0,2014-03-28 10:55:12,782.0,2014-03-28 10:55:42,782.0,2014-03-28 10:56:12,782.0,2014-03-28 10:56:42,0
5,22,2014-02-28 10:53:05,177.0,2014-02-28 10:55:22,175.0,2014-02-28 10:55:22,178.0,2014-02-28 10:55:23,177.0,2014-02-28 10:55:23,...,2014-02-28 10:55:59,175.0,2014-02-28 10:55:59,177.0,2014-02-28 10:55:59,177.0,2014-02-28 10:57:06,178.0,2014-02-28 10:57:11,0


Separate the target feature.

In [6]:
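y_train = train_df['target']
# drop() returns a new DataFrame, so assign the result back to actually remove the column
train_df = train_df.drop('target', axis=1)
train_df.head(10)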

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,site6,time6,site7,time7,site8,time8,site9,time9,site10,time10
1,718,2014-02-20 10:02:45,,,,,,,,,,,,,,,,,,
2,890,2014-02-22 11:19:50,941.0,2014-02-22 11:19:50,3847.0,2014-02-22 11:19:51,941.0,2014-02-22 11:19:51,942.0,2014-02-22 11:19:51,3846.0,2014-02-22 11:19:51,3847.0,2014-02-22 11:19:52,3846.0,2014-02-22 11:19:52,1516.0,2014-02-22 11:20:15,1518.0,2014-02-22 11:20:16
3,14769,2013-12-16 16:40:17,39.0,2013-12-16 16:40:18,14768.0,2013-12-16 16:40:19,14769.0,2013-12-16 16:40:19,37.0,2013-12-16 16:40:19,39.0,2013-12-16 16:40:19,14768.0,2013-12-16 16:40:20,14768.0,2013-12-16 16:40:21,14768.0,2013-12-16 16:40:22,14768.0,2013-12-16 16:40:24
4,782,2014-03-28 10:52:12,782.0,2014-03-28 10:52:42,782.0,2014-03-28 10:53:12,782.0,2014-03-28 10:53:42,782.0,2014-03-28 10:54:12,782.0,2014-03-28 10:54:42,782.0,2014-03-28 10:55:12,782.0,2014-03-28 10:55:42,782.0,2014-03-28 10:56:12,782.0,2014-03-28 10:56:42
5,22,2014-02-28 10:53:05,177.0,2014-02-28 10:55:22,175.0,2014-02-28 10:55:22,178.0,2014-02-28 10:55:23,177.0,2014-02-28 10:55:23,178.0,2014-02-28 10:55:59,175.0,2014-02-28 10:55:59,177.0,2014-02-28 10:55:59,177.0,2014-02-28 10:57:06,178.0,2014-02-28 10:57:11
6,570,2014-03-18 15:18:31,21.0,2014-03-18 15:18:39,570.0,2014-03-18 15:23:02,21.0,2014-03-18 15:23:43,21.0,2014-03-18 15:29:57,,,,,,,,,,
7,803,2014-02-13 16:45:35,23.0,2014-02-13 16:45:35,5956.0,2014-02-13 16:45:35,17513.0,2014-02-13 16:45:35,37.0,2014-02-13 16:46:05,21.0,2014-02-13 16:47:14,803.0,2014-02-13 16:47:14,17514.0,2014-02-13 16:47:15,17514.0,2014-02-13 16:47:16,17514.0,2014-02-13 16:47:17
8,22,2013-04-12 10:27:26,21.0,2013-04-12 10:27:26,29.0,2013-04-12 10:27:28,5041.0,2013-04-12 10:27:29,14422.0,2013-04-12 10:27:29,23.0,2013-04-12 10:27:29,21.0,2013-04-12 10:27:29,5041.0,2013-04-12 10:27:31,14421.0,2013-04-12 10:27:31,14421.0,2013-04-12 10:27:32
9,668,2014-03-17 16:23:08,940.0,2014-03-17 16:23:35,942.0,2014-03-17 16:23:35,941.0,2014-03-17 16:23:35,941.0,2014-03-17 16:23:36,942.0,2014-03-17 16:23:36,940.0,2014-03-17 16:23:36,23.0,2014-03-17 16:23:52,21.0,2014-03-17 16:23:52,22.0,2014-03-17 16:23:53
10,3700,2014-02-20 16:09:13,229.0,2014-02-20 16:10:08,570.0,2014-02-20 16:10:08,21.0,2014-02-20 16:10:08,229.0,2014-02-20 16:10:24,21.0,2014-02-20 16:10:24,21.0,2014-02-20 16:10:29,21.0,2014-02-20 16:10:39,2336.0,2014-02-20 16:10:40,2044.0,2014-02-20 16:10:40


Build Tf-Idf features based on sites. You can use `ngram_range`=(1, 3) and `max_features`=100000 or more
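
To see what `ngram_range=(1, 3)` adds, here is a minimal sketch on three made-up sessions written as strings of site IDs. The `toy_*` names and the `token_pattern=r"\S+"` setting are assumptions for illustration only. With the default token pattern, `TfidfVectorizer` splits on punctuation, so a domain name such as `groups.live.com` would be broken into separate word tokens. The whitespace-only pattern keeps each site as a single token.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sessions: each one is a space-separated string of site IDs.
toy_sessions = [
    "718 890 941",
    "890 941 3847",
    "718 718 890",
]

# token_pattern=r"\S+" (an assumption, not the notebook's setting) treats every
# whitespace-separated ID as one token; ngram_range=(1, 3) then adds pairs and
# triples of consecutive sites to the vocabulary.
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100000,
                                 token_pattern=r"\S+")
toy_matrix = toy_vectorizer.fit_transform(toy_sessions)

# Inspect the learned vocabulary: single sites, bigrams and trigrams.
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_matrix.shape)
print(toy_vocab)
```

Each row of `toy_matrix` is one session, and each column is a site or a short sequence of consecutive sites, so the model can pick up on typical browsing paths as well as individual sites.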

In [7]:
import pickle

sites = ['site%s' % i for i in range(1, 11)]
train_df[sites] = train_df[sites].fillna(0).astype('int')
test_df[sites] = test_df[sites].fillna(0).astype('int')

# Load websites dictionary
with open(os.path.join(PATH_TO_DATA, 'site_dic.pkl'), 'rb') as input_file:
    site_dict = pickle.load(input_file)

# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
sites_dict.head()

Unnamed: 0,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu


In [8]:
def vec(train, test):
    # Turn each session into one space-separated string of site names.
    # IDs absent from the dictionary map to NaN and are replaced by empty strings.
    def to_text(df):
        names = df.apply(lambda col: col.map(sites_dict['site'])).fillna('')
        return names.apply(' '.join, axis=1)

    train_sites = to_text(train)
    test_sites = to_text(test)

    # The vocabulary is fitted on the train and test sessions together
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
    vectorizer.fit(pd.concat([train_sites, test_sites]))

    return vectorizer.transform(train_sites), vectorizer.transform(test_sites)

In [9]:
X_train_sites, X_test_sites = vec(train_df[sites], test_df[sites])

Add features based on the session start time: hour, whether it's morning, day or night and so on.

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).
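
One possible way to do this is sketched below on three made-up start times. The bucket boundaries for morning, day and evening, the `toy_*` names and the placeholder matrix standing in for the Tf-Idf output are all assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Made-up session start times standing in for the `time1` column.
toy_start = pd.to_datetime(pd.Series([
    "2014-02-20 10:02:45",
    "2014-02-22 13:19:50",
    "2013-12-16 19:40:17",
]))

toy_hour = toy_start.dt.hour
toy_time_features = pd.DataFrame({
    "hour": toy_hour,
    # Bucket boundaries are illustrative assumptions.
    "morning": ((toy_hour >= 7) & (toy_hour <= 11)).astype(int),
    "day": ((toy_hour >= 12) & (toy_hour <= 18)).astype(int),
    "evening": ((toy_hour >= 19) & (toy_hour <= 23)).astype(int),
})

# Scale the dense features to zero mean and unit variance.
toy_scaled = StandardScaler().fit_transform(toy_time_features)

# Placeholder 3x4 sparse matrix standing in for the Tf-Idf features.
toy_tfidf = csr_matrix(np.eye(3, 4))

# hstack appends the scaled columns to the right of the sparse matrix.
toy_X = hstack([toy_tfidf, csr_matrix(toy_scaled)]).tocsr()
print(toy_X.shape)
```

On the real data the scaler would be fitted on the training sessions and then applied to the test sessions with `transform`, so that both sets share the same scaling.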

In [10]:
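times = ['time%s' % i for i in range(1, 11)]

# Placeholder timestamp for visits missing from short sessions (its hour is 0)
train_df[times] = train_df[times].fillna(pd.to_datetime('2001-01-01 00:00:00'))
test_df[times] = test_df[times].fillna(pd.to_datetime('2001-01-01 00:00:00'))

def full_times(train_times, test_times):
    # Hour of day (0-23) for each of the ten visits in a session
    train_hours, test_hours = pd.DataFrame(), pd.DataFrame()
    for n in times:
        train_hours[n] = pd.DatetimeIndex(train_times[n]).hour
        test_hours[n] = pd.DatetimeIndex(test_times[n]).hour

    # One-hot encode the hours. The default sparse output stacks directly with
    # the Tf-Idf matrix and avoids the `sparse` keyword, which newer
    # scikit-learn versions renamed to `sparse_output`.
    encoder = OneHotEncoder()
    encoder.fit(pd.concat([train_hours, test_hours]))

    return encoder.transform(train_hours), encoder.transform(test_hours)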

In [32]:
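# Hour features for the full data, used when stacking the matrices in the next cell
X_train_times, X_test_times = full_times(train_df[times], test_df[times])

# Prototype on the first three sessions: flag whether any visit falls into
# the 0-8, 8-16 or 16-24 hour bucket.
train_times = train_df[times][0:3]

# Create the columns with one row per session. Assigning a scalar to an empty
# DataFrame gives a zero-length column, which caused the broadcast error.
train_times_ = pd.DataFrame(0, index=train_times.index, columns=['0', '1', '2'])

for n in times:
    data = pd.DatetimeIndex(train_times[n]).hour
    # Accumulate with OR so a flag set by an earlier visit is kept.
    # Padded visits (placeholder hour 0) also count toward bucket '0'.
    train_times_['0'] = np.where((train_times_['0'] == 1) | ((data >= 0) & (data < 8)), 1, 0)
    train_times_['1'] = np.where((train_times_['1'] == 1) | ((data >= 8) & (data < 16)), 1, 0)
    train_times_['2'] = np.where((train_times_['2'] == 1) | ((data >= 16) & (data < 24)), 1, 0)

train_times_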

In [12]:
X_train = csr_matrix(hstack([X_train_sites, X_train_times]))
X_test = csr_matrix(hstack([X_test_sites, X_test_times]))

Perform cross-validation with logistic regression.

In [13]:
from sklearn.model_selection import cross_val_score

lr = LogisticRegression(C=0.5, random_state=17, n_jobs=-1, class_weight='balanced')

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

cv_aucs = cross_val_score(lr, X_train, y_train, scoring="roc_auc", cv=skf)

print(np.mean(cv_aucs))

0.985868558196


Make predictions for the test set and form a submission file.

In [15]:
lr.fit(X_train, y_train)
test_pred = lr.predict_proba(X_test)[:, 1]

In [219]:
write_to_submission_file(test_pred, os.path.join(PATH_TO_DATA, "assignment6_alice_submission.csv"))
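
Before uploading, it can help to confirm that the file has the expected layout. The sketch below repeats the layout used by `write_to_submission_file` on three made-up probabilities, writes it to an in-memory buffer and reads it back. The `toy_*` names and the particular checks are assumptions chosen for illustration.

```python
import io

import numpy as np
import pandas as pd

# Made-up probabilities standing in for `test_pred`.
toy_pred = np.array([0.01, 0.85, 0.10])

# Same layout as the writer: 1-based `session_id` index, one `target` column.
toy_submission = pd.DataFrame(toy_pred,
                              index=np.arange(1, toy_pred.shape[0] + 1),
                              columns=["target"])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index_label="session_id")
buffer.seek(0)

# Read the CSV back and check the column name, the index and the value range.
check = pd.read_csv(buffer, index_col="session_id")
layout_ok = (
    list(check.columns) == ["target"]
    and check.index[0] == 1
    and check.index.is_unique
    and check["target"].between(0, 1).all()
)
print(layout_ok)
```

The same checks can be run on the real file by passing its path to `pd.read_csv` instead of the buffer.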