<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose, provided that the authors and the source are credited.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
%load_ext watermark

In [2]:
%watermark -v -m -p numpy,scipy,pandas,matplotlib,sklearn -g

CPython 3.6.3
IPython 6.2.1

numpy 1.14.0
scipy 1.0.0
pandas 0.22.0
matplotlib 2.1.2
sklearn 0.19.1

compiler   : GCC 7.2.0
system     : Linux
release    : 4.13.0-38-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit
Git hash   : c92a5f908fe34981eca12bed2d62691791bf6dfc


In [3]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
import pickle
from collections import Counter
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    """Write predictions to a CSV file indexed by session IDs 1..N."""
    predicted_df = pd.DataFrame(predicted_labels,
                                index=np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


Reading original data

In [5]:
PATH_TO_DATA = '../../data'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

# NOTE: the shuffled copy is not assigned, so train_df keeps its original order.
train_df.sample(frac=1).reset_index(drop=True)
y = train_df['target']
ratio = 0.9
idx = int(round(train_df.shape[0] * ratio))
skf = StratifiedKFold(n_splits=7, shuffle=True, random_state=17)
train_split = train_df.shape[0]

In [6]:
sum(y)

2297

In [7]:
with open(r"../../data/site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)
    sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])

In [8]:
sites_dict.head(5)

| | site |
|---|---|
| 25075 | www.abmecatronique.com |
| 13997 | groups.live.com |
| 42436 | majeureliguefootball.wordpress.com |
| 30911 | cdt46.media.tourinsoft.eu |
| 8104 | www.hdwallpapers.eu |


In [9]:
X = pd.concat([train_df.drop(columns='target'), test_df], axis = 0)

In [10]:
X.shape

(336358, 20)

In [11]:
times = ['time%s' % i for i in range(1, 11)]
sites = ['site%s' % i for i in range(1, 11)]
X[times] = X[times].apply(pd.to_datetime)
X[sites] = X[sites].fillna(0).astype('int')
X['sites'] = X[sites].apply(lambda x: " ".join(x.astype('str')), axis = 1)

In [12]:
X['hour'] = X['time1'].apply(lambda x: x.hour)

In [13]:
X['month'] = X['time1'].apply(lambda x: x.month)

In [14]:
X['year'] = X['time1'].apply(lambda x: x.year)

In [15]:
X['yearmonth'] = 12*(X['year'] - 2013) + X['month']

In [16]:
X['hour1618'] = X['time1'].apply(lambda x: 16<=x.hour<=18).astype('int')

In [17]:
X['hour1213'] = X['time1'].apply(lambda x: 12<=x.hour<=13).astype('int')

In [18]:
X['hour915'] = X['time1'].apply(lambda x: (x.hour == 9 or x.hour == 15)).astype('int')

In [19]:
X['minute'] = X['time1'].apply(lambda x: x.minute)

In [20]:
X['len'] = (X['time10'] - X['time1']).apply(lambda x: np.log1p(x.total_seconds())).fillna(0)
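
The `len` feature above is computed from `time10`, so a session with fewer than ten pages yields `NaT` and falls back to 0. As a hedged alternative (not what this notebook does), the sketch below measures the duration up to the last recorded timestamp instead. The toy frame, its timestamps, and the names `toy`, `time_cols`, `last_seen`, and `log_len` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy sessions with three time slots; the first session has a single page.
toy = pd.DataFrame({
    "time1": pd.to_datetime(["2014-02-20 10:02:45", "2014-02-22 11:19:50"]),
    "time2": pd.to_datetime([None, "2014-02-22 11:19:51"]),
    "time3": pd.to_datetime([None, "2014-02-22 11:20:16"]),
})
time_cols = ["time1", "time2", "time3"]

# Row-wise maximum skips NaT, so it picks the last recorded timestamp.
last_seen = toy[time_cols].max(axis=1)

# Duration in seconds, then the same log1p transform as the `len` feature.
seconds = (last_seen - toy["time1"]).dt.total_seconds()
log_len = np.log1p(seconds)
print(log_len)
```

Whether this variant helps the model is an open question that would need to be checked with cross-validation.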

In [21]:
X['hourminute'] = X['hour'] * 60 + X['minute']

In [22]:
X['weekday'] = X['time1'].apply(lambda x: x.weekday())

In [23]:
def get_part_day(x):
    h = x.hour
    if 0 <= h <= 5:
        return 0
    if 6 <= h <= 11:
        return 1
    if 12 <= h <= 17:
        return 2
    if 18 <= h <= 23:
        return 3

In [24]:
X['partday'] = X['time1'].apply(lambda x: get_part_day(x))
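
Since `get_part_day` splits the day into four equal 6-hour bins, the same labels can be obtained with integer division of the hour. The sketch below checks this on one timestamp per hour; `part_day_reference`, `start_times`, and the other names are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def part_day_reference(ts):
    """Same four 6-hour bins as get_part_day, written with explicit ranges."""
    h = ts.hour
    if 0 <= h <= 5:
        return 0
    if 6 <= h <= 11:
        return 1
    if 12 <= h <= 17:
        return 2
    return 3

# One hypothetical session start per hour of a single day.
start_times = pd.Series(
    pd.Timestamp("2014-02-20 00:30:00") + pd.to_timedelta(np.arange(24), unit="h")
)

# Vectorized version: hours 0-5 -> 0, 6-11 -> 1, 12-17 -> 2, 18-23 -> 3.
part_day_fast = start_times.dt.hour // 6
part_day_slow = start_times.apply(part_day_reference)
print((part_day_fast == part_day_slow).all())
```

The vectorized form avoids a Python-level call per row, which can matter on a frame with several hundred thousand sessions.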

In [25]:
X['saturday'] = (X['weekday'] == 5).astype('int')

In [26]:
X['sunday'] = (X['weekday'] == 6).astype('int')

In [27]:
X['week'] = X['time1'].apply(lambda x: x.isocalendar()[1])

In [28]:
X.head(3)

| session_id | site1 | time1 | site2 | time2 | site3 | time3 | site4 | time4 | site5 | time5 | ... | hour1213 | hour915 | minute | len | hourminute | weekday | partday | saturday | sunday | week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 718 | 2014-02-20 10:02:45 | 0 | NaT | 0 | NaT | 0 | NaT | 0 | NaT | ... | 0 | 0 | 2 | 0.0 | 602 | 3 | 1 | 0 | 0 | 8 |
| 2 | 890 | 2014-02-22 11:19:50 | 941 | 2014-02-22 11:19:50 | 3847 | 2014-02-22 11:19:51 | 941 | 2014-02-22 11:19:51 | 942 | 2014-02-22 11:19:51 | ... | 0 | 0 | 19 | 3.295837 | 679 | 5 | 1 | 1 | 0 | 8 |
| 3 | 14769 | 2013-12-16 16:40:17 | 39 | 2013-12-16 16:40:18 | 14768 | 2013-12-16 16:40:19 | 14769 | 2013-12-16 16:40:19 | 37 | 2013-12-16 16:40:19 | ... | 0 | 0 | 40 | 2.079442 | 1000 | 0 | 2 | 0 | 0 | 51 |


In [29]:
full_sites = X[sites]
sites_flatten = full_sites.values.flatten()
full_sites_sparse = csr_matrix(([1] * sites_flatten.shape[0],
                                sites_flatten,
                                range(0, sites_flatten.shape[0]  + 10, 10)))[:, 1:]
full_sites_sparse

<336358x48371 sparse matrix of type '<class 'numpy.int64'>'
	with 3195430 stored elements in Compressed Sparse Row format>
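
The cell above builds a session-by-site count matrix through the `(data, indices, indptr)` form of `csr_matrix`. A minimal sketch of the same trick on toy data may make it easier to follow; the array `toy_sites` and the four-slot session length are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy data: 3 sessions with 4 site slots each; 0 marks an empty slot.
toy_sites = np.array([[1, 2, 1, 0],
                      [3, 0, 0, 0],
                      [2, 2, 2, 2]])
session_len = toy_sites.shape[1]
flat = toy_sites.flatten()

# Row i owns the entries flat[indptr[i]:indptr[i + 1]], i.e. one session.
indptr = np.arange(0, flat.shape[0] + session_len, session_len)

# Each visited site ID becomes a column index with value 1.
counts = csr_matrix((np.ones_like(flat), flat, indptr))

# Converting to dense sums repeated column indices, giving visit counts.
# Column 0 only counts the padding zeros, so it is dropped.
dense_counts = counts.toarray()[:, 1:]
print(dense_counts)
```

In the dense view, each row is a session and each column is a site ID starting from 1, so the third toy session shows four visits to site 2.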

In [30]:
features_ohe = ['hour', 'weekday', 'partday', 'month', 'week', 'yearmonth']

In [31]:
ohe = OneHotEncoder().fit_transform(X[features_ohe])

In [32]:
train_df = X[:train_split]
test_df = X[train_split:]

In [33]:
print(train_df.shape,test_df.shape)

(253561, 36) (82797, 36)


Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [34]:
tf = TfidfVectorizer(ngram_range=(1,7), max_features = 200000)
tf.fit(train_df['sites'].values)
tf_idf = tf.transform(X['sites'])
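
One detail worth checking when vectorizing site IDs as text: the default `token_pattern` of `TfidfVectorizer` is `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters. Single-digit site IDs and the `0` padding are therefore ignored. The sketch below illustrates this on toy session strings; `toy_sessions` and the alternative pattern are illustrative assumptions, not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy "sessions" written as space-separated site IDs, as in X['sites'].
toy_sessions = ["718 0 0", "890 941 3847", "5 39 5"]

# Default tokenizer: tokens need at least two word characters.
default_tf = TfidfVectorizer(ngram_range=(1, 2))
default_tf.fit(toy_sessions)

# Assumed alternative pattern that also keeps one-character tokens.
all_tokens_tf = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
all_tokens_tf.fit(toy_sessions)

print(sorted(default_tf.vocabulary_))
print(sorted(all_tokens_tf.vocabulary_))
```

Dropping the `0` padding is arguably harmless, but losing single-digit site IDs may or may not matter; comparing both variants with cross-validation would settle it.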

Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [35]:
features = ['saturday', 'sunday', 'len', 'hour1618', 'hour1213', 'hour915']
full_df = hstack([tf_idf, X[features], ohe], format='csr')
X_train = full_df[:train_split]
X_test = full_df[train_split:]

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).
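
The cell above stacks the extra features without scaling them, although `StandardScaler` is imported. A minimal sketch of the scale-then-stack step on synthetic data follows; the shapes and the names `sparse_part`, `dense_part`, and `combined` are illustrative assumptions. Fitting the scaler on the training rows only is one reasonable choice, not something the notebook prescribes.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(17)
n_train, n_test = 6, 3
n_rows = n_train + n_test

# Stand-ins for the Tf-Idf matrix and for dense features such as 'len'.
sparse_part = csr_matrix(rng.randint(0, 2, size=(n_rows, 5)).astype(float))
dense_part = rng.uniform(0, 8, size=(n_rows, 2))

# Fit the scaler on training rows only, then transform all rows.
scaler = StandardScaler().fit(dense_part[:n_train])
dense_scaled = scaler.transform(dense_part)

# Stack sparse and scaled dense blocks column-wise, keeping CSR format.
combined = hstack([sparse_part, dense_scaled], format="csr")
X_train_toy, X_test_toy = combined[:n_train], combined[n_train:]
print(combined.shape)
```

Binary indicator columns such as `saturday` usually do not need scaling, so in practice only continuous features like `len` would go through the scaler.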

Perform cross-validation with logistic regression.

In [44]:
np.mean(cross_val_score(LogisticRegression(C=10), X_train, y, n_jobs=-1, scoring='roc_auc', cv=skf))

0.9849491484422331

In [48]:
%%time
lr = LogisticRegressionCV(scoring = 'roc_auc', random_state=17, n_jobs=-1, cv=skf)
lr.fit(X_train[:idx, :], y[:idx])
y_pred = lr.predict_proba(X_train[idx:, :])[:, 1]

CPU times: user 10.9 s, sys: 485 ms, total: 11.3 s
Wall time: 4min 5s


In [36]:
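# Hold-out ROC AUC. This needs y_pred from the LogisticRegressionCV cell above;
# the NameError below was recorded when this cell (In [36]) ran before that one (In [48]).
roc_auc_score(y[idx:], y_pred)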

NameError: name 'y_pred' is not defined

In [50]:
lr.C_

array([21.5443469])

In [51]:
sum(lr.predict(X_train))

2129

In [None]:
# TODO: tune C around lr.C_ (try 20, 21, 23, 24 ...)
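
The value reported by `lr.C_` is a point of the default grid: with `Cs=10`, `LogisticRegressionCV` tries ten values spaced logarithmically between 1e-4 and 1e4. For the tuning mentioned in the TODO, one option is a finer grid between the two neighbouring grid points. The sketch below only builds such a grid; `refined_grid` and its nine points are illustrative assumptions.

```python
import numpy as np

# Default grid of LogisticRegressionCV when Cs=10.
default_grid = np.logspace(-4, 4, 10)

# Value reported by lr.C_ in this notebook.
reported_C = 21.5443469
nearest = int(np.argmin(np.abs(default_grid - reported_C)))

# Neighbouring grid points bracket the region worth searching.
lower, upper = default_grid[nearest - 1], default_grid[nearest + 1]

# Finer log-spaced grid between the neighbours (nine points is arbitrary).
refined_grid = np.logspace(np.log10(lower), np.log10(upper), 9)
print(np.round(refined_grid, 3))
```

The resulting array can be passed as `Cs=refined_grid` to `LogisticRegressionCV`, which accepts a list of floats as well as an integer.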

In [44]:
%%time
lr.fit(X_train, y)

192
CPU times: user 12.3 s, sys: 526 ms, total: 12.8 s
Wall time: 4min 50s


In [55]:
test_pred = lr.predict_proba(X_test)[:, 1]
print(sum(lr.predict(X_test)))

181


Make predictions for the test set and write a submission file.

In [56]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")

In [38]:
%%time
#best C = 5
for C in [7, 8]:
    linear = LogisticRegression(n_jobs=-1, C=C)
    linear.fit(X_train, y)
    test_pred = linear.predict_proba(X_test)[:, 1]
    print(sum(linear.predict(X_test)))
    write_to_submission_file(test_pred, "assignment6_alice_submission_{}.csv".format(C))

289
297
CPU times: user 2min 6s, sys: 6.46 s, total: 2min 13s
Wall time: 35.7 s
