<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose, provided that the authors and the source are credited.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

Reading original data

In [2]:
PATH_TO_DATA = '../../data'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

In [3]:
with open(os.path.join(PATH_TO_DATA, 'site_dic.pkl'), 'rb') as file:
    site_dict = pickle.load(file)

sites_dict_df = pd.DataFrame(list(site_dict.keys()),
                            index=list(site_dict.values()),
                            columns=['site'])
sites_dict_df.head()

site_id,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu


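Separate the target from the features, then stack the train and test sets so that features are built the same way for both.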

In [4]:
y = train_df['target']

full_df = pd.concat([train_df.drop('target', axis=1), test_df])

index_split = train_df.shape[0]

In [5]:
times = ['time%s' % i for i in range(1, 11)]
full_df[times] = full_df[times].apply(pd.to_datetime)

In [6]:
sites = ['site%s' % i for i in range(1, 11)]
full_df[sites] = full_df[sites].fillna(0).astype('int')

In [7]:
# Copy the slice so that adding a column later does not trigger SettingWithCopyWarning
sites_df = full_df[sites].copy()
sites_df.head()

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
1,718,0,0,0,0,0,0,0,0,0
2,890,941,3847,941,942,3846,3847,3846,1516,1518
3,14769,39,14768,14769,37,39,14768,14768,14768,14768
4,782,782,782,782,782,782,782,782,782,782
5,22,177,175,178,177,178,175,177,177,178


In [10]:
sites_df['str'] = sites_df.apply(lambda row: ' '.join(str(x) for x in row.values.tolist()), axis=1)

sites_df['str'].head()

session_id
1                                718 0 0 0 0 0 0 0 0 0
2        890 941 3847 941 942 3846 3847 3846 1516 1518
3    14769 39 14768 14769 37 39 14768 14768 14768 1...
4              782 782 782 782 782 782 782 782 782 782
5               22 177 175 178 177 178 175 177 177 178
Name: str, dtype: object

Build Tf-Idf features from the site sequences. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [11]:
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
model = vectorizer.fit_transform(sites_df['str'])
model.shape

(336358, 100000)
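One detail worth checking when vectorizing numeric site IDs: the default `token_pattern` of `TfidfVectorizer` only keeps tokens of two or more characters, so single-digit IDs (including the `0` padding) are silently dropped. The toy sessions below are made up for illustration; whether keeping single-digit IDs helps on the real data is an open question, not something the original notebook tested.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy sessions in the same "space-separated site IDs" format
toy_sessions = ["718 0 0", "890 941 3847", "22 177 5"]

# Default token_pattern r"(?u)\b\w\w+\b" ignores one-character tokens
default_vec = TfidfVectorizer(ngram_range=(1, 2))
default_vec.fit(toy_sessions)
default_vocab = sorted(default_vec.vocabulary_)

# A pattern that also keeps one-character tokens such as "0" or "5"
keep_all_vec = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
keep_all_vec.fit(toy_sessions)
keep_all_vocab = sorted(keep_all_vec.vocabulary_)

print("default:", default_vocab)
print("keep all:", keep_all_vocab)
```

With the default pattern, dropping the `0` padding is arguably convenient, but real sites whose ID is below 10 disappear as well.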

Add features based on the session start time: the hour, the weekday, whether it is morning, afternoon, evening or night, and so on.

In [12]:
full_time = pd.DataFrame()
full_time['duration'] = full_df[times].apply(lambda row: int((row.max() - row.min()).total_seconds()), axis=1)

In [13]:
time_features = pd.DataFrame(index=full_df.index)
day_features = pd.DataFrame(index=full_df.index)
days = ['day%s' % i for i in range(1, 11)]
hours = ['hour%s' % i for i in range(1, 11)]

In [14]:
for i in range(10):
    day_features[days[i]] = full_df[times[i]].apply(lambda day: day.weekday())
    time_features[hours[i]] = full_df[times[i]].apply(lambda day: day.hour)

In [15]:
time_features = pd.get_dummies(time_features, columns=hours, sparse=False)
day_features = pd.get_dummies(day_features, columns=days, sparse=False)

In [16]:
full_time['start_hour'] = full_df['time1'].apply(lambda ts: ts.hour)
full_time['weekday'] = full_df['time1'].apply(lambda ts: ts.weekday())
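# 1 if the session starts during typical working hours (9:00-17:59), else 0
full_time['work_time'] = full_time['start_hour'].apply(lambda i: 1 if 9 <= i <= 17 else 0)
# Part of day: 1 = morning (6-11), 2 = afternoon (12-17), 3 = evening (18-23), 0 = night (0-5)
full_time['date_time'] = full_time['start_hour'].apply(lambda i: 1 if i in range(6, 12) else
                                                      (2 if i in range(12, 18) else
                                                      (3 if i in range(18, 24) else 0)))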
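The same kind of time features can also be computed without row-wise `apply`, using the pandas `.dt` accessor and column-wise reductions. This is a minimal sketch on a made-up two-column frame (`toy`, `time_cols` and `feats` are illustrative names), not a drop-in replacement for the cells above.

```python
import pandas as pd

# Hypothetical toy sessions: the second one has only a single visit (time2 is NaT)
toy = pd.DataFrame({
    "time1": pd.to_datetime(["2014-02-20 10:02:45", "2014-02-22 11:19:50"]),
    "time2": pd.to_datetime(["2014-02-20 10:03:15", None]),
})
time_cols = ["time1", "time2"]

feats = pd.DataFrame(index=toy.index)
feats["start_hour"] = toy["time1"].dt.hour
feats["weekday"] = toy["time1"].dt.weekday  # Monday = 0
# Row-wise max/min skip NaT, so short sessions get a duration of 0 seconds
span = toy[time_cols].max(axis=1) - toy[time_cols].min(axis=1)
feats["duration"] = span.dt.total_seconds().astype(int)
feats["work_time"] = feats["start_hour"].between(9, 17).astype(int)

print(feats)
```

On frames with hundreds of thousands of rows, the vectorized form is usually much faster than a Python-level lambda per row.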

Scale these features and combine them with the Tf-Idf site features (you'll need `scipy.sparse.hstack`).

In [17]:
# hstack is already imported in the first cell.
# Note: the extra features are stacked as-is here, without scaling.
matrix = hstack([model, time_features, day_features, full_time]).tocsr(copy=False)
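The cell above stacks the extra features without scaling them, so a raw column such as `duration` (in seconds) sits next to Tf-Idf values in [0, 1]. A possible way to scale before stacking is sketched below on made-up numbers; fitting the scaler on the training rows only is a design choice assumed here, not something prescribed by the assignment.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins: a tiny Tf-Idf block and a raw duration column
tfidf_part = csr_matrix(np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 0.0], [1.0, 1.0]]))
duration = np.array([[0.0], [30.0], [600.0], [1800.0]])
n_train = 3  # the first 3 rows play the role of the training set

# Fit the scaler on training rows only, then transform all rows
scaler = StandardScaler().fit(duration[:n_train])
duration_scaled = scaler.transform(duration)

# Stack sparse and (now sparse) scaled features column-wise
combined = hstack([tfidf_part, csr_matrix(duration_scaled)]).tocsr()
print(combined.shape)
```

Scaling matters for regularized logistic regression because the L2 penalty treats all coefficients alike, regardless of the units of the underlying feature.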

Perform cross-validation with logistic regression.

In [18]:
X_train = matrix[:index_split, :]
X_test = matrix[index_split:, :]

In [21]:
logit_c_values = np.logspace(-4, 2, 10)

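# random_state only matters when shuffle=True; recent scikit-learn versions raise
# an error if it is set without shuffling, so it is omitted (the folds are unchanged)
skf = StratifiedKFold(n_splits=3)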

logit_grid_searcher = LogisticRegressionCV(Cs=logit_c_values, cv=skf, n_jobs=-1)
logit_grid_searcher.fit(X_train, y)

LogisticRegressionCV(Cs=array([  1.00000e-04,   4.64159e-04,   2.15443e-03,   1.00000e-02,
         4.64159e-02,   2.15443e-01,   1.00000e+00,   4.64159e+00,
         2.15443e+01,   1.00000e+02]),
           class_weight=None,
           cv=StratifiedKFold(n_splits=3, random_state=17, shuffle=False),
           dual=False, fit_intercept=True, intercept_scaling=1.0,
           max_iter=100, multi_class='ovr', n_jobs=-1, penalty='l2',
           random_state=None, refit=True, scoring=None, solver='lbfgs',
           tol=0.0001, verbose=0)

In [22]:
# Mean score per C across folds; with scoring=None these are accuracies, not ROC AUC
logit_mean_cv_scores = next(iter(logit_grid_searcher.scores_.values())).mean(axis=0)
pd.Series(logit_mean_cv_scores, index=logit_grid_searcher.Cs_).sort_values(ascending=False)

100.000000    0.992471
21.544347     0.992108
4.641589      0.991726
1.000000      0.991442
0.215443      0.991434
0.046416      0.990941
0.010000      0.990941
0.002154      0.990941
0.000464      0.990941
0.000100      0.990941
dtype: float64
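Because `scoring=None` makes `LogisticRegressionCV` use accuracy, the values above are not comparable with the ROC AUC used elsewhere in this notebook, and on heavily imbalanced data accuracy barely moves across values of `C`. The sketch below shows how the search could be run with `scoring='roc_auc'` instead; it uses synthetic data from `make_classification`, so the numbers it prints say nothing about the competition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data, for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   weights=[0.9, 0.1], random_state=17)

cs = np.logspace(-2, 1, 4)
searcher = LogisticRegressionCV(Cs=cs, cv=StratifiedKFold(n_splits=3),
                                scoring="roc_auc", max_iter=1000)
searcher.fit(X_toy, y_toy)

# scores_ maps the positive class to an array of shape (n_folds, n_Cs)
mean_auc = next(iter(searcher.scores_.values())).mean(axis=0)
print(dict(zip(cs, mean_auc)))
```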

In [23]:
# Hold out the last 10% of the training rows as a validation set
train_len = int(round(X_train.shape[0] * 0.9))
lr = LogisticRegression(C=1.8, random_state=17, n_jobs=-1).fit(X_train[:train_len, :], y.iloc[:train_len])
y_pred = lr.predict_proba(X_train[train_len:, :])[:, 1]

In [24]:
roc_auc_score(y.iloc[train_len:], y_pred)

0.98506307589921216

Make prediction for the test set and form a submission file.

In [25]:
lr = LogisticRegression(C=1.8, random_state=17).fit(X_train, y)

test_pred = lr.predict_proba(X_test)[:, 1]

In [26]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    """Write predictions to a CSV file with session IDs numbered from 1."""
    predicted_df = pd.DataFrame(predicted_labels,
                                index=np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [27]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")
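Before uploading, it can be worth reading the file back and checking its layout. The sketch below reproduces the same CSV format in memory with made-up predictions (`toy_pred` is hypothetical) and verifies the header, the 1-based `session_id` index and the probability range.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical predictions standing in for test_pred
toy_pred = np.array([0.01, 0.8, 0.2])

submission = pd.DataFrame({"target": toy_pred},
                          index=np.arange(1, len(toy_pred) + 1))

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
submission.to_csv(buffer, index_label="session_id")
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), index_col="session_id")

print(round_trip)
```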

In [28]:
test_pred.shape

(82797,)

In [30]:
test_pred

array([  5.32057040e-06,   1.29387774e-06,   7.99552861e-05, ...,
         1.22617684e-05,   1.10717050e-06,   6.34136914e-09])