<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [20]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
import pickle
from scipy.sparse import csr_matrix, hstack, vstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [2]:
train_df = pd.read_csv('../../data/train_sessions.csv', index_col='session_id')
X_test = pd.read_csv('../../data/test_sessions.csv', index_col='session_id')

Convert the timestamp columns to datetimes, load the site dictionary, and separate the target feature

In [4]:
sites = ['site%s' % i for i in range(1, 11)]
times = ['time%s' % i for i in range(1, 11)]

train_df[times] = train_df[times].apply(pd.to_datetime)
X_test[times] = X_test[times].apply(pd.to_datetime)

with open(r"../../data/site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

In [5]:
# Invert the dictionary so that it maps site IDs back to site names.
site_dict = {v: k for k, v in site_dict.items()}

In [6]:
X, y = train_df.drop('target', axis=1), train_df['target']
train_size = int(0.7 * train_df.shape[0])

In [7]:
# Copy the slices so that later column assignments modify independent DataFrames.
X_train, X_valid = X.iloc[:train_size, :].copy(), X.iloc[train_size:, :].copy()

y_train, y_valid = y.iloc[:train_size], y.iloc[train_size:]

In [8]:
for i in sites:
    X_train[i] = X_train[i].map(site_dict)
    X_valid[i] = X_valid[i].map(site_dict)
    X_test[i] = X_test[i].map(site_dict)
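
The mapping step above relies on how `Series.map` treats a dictionary: values without a matching key, including `NaN` placeholders for short sessions, come back as `NaN`. The following is a minimal sketch on toy data; the names `site_to_id`, `id_to_site`, and the two-entry dictionary are illustrative assumptions, not the contents of `site_dic.pkl`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the pickled dictionary (site name -> integer ID).
site_to_id = {"maps.google.com": 1, "annotathon.org": 2}

# Invert it so that integer IDs can be mapped back to site names.
id_to_site = {site_id: site for site, site_id in site_to_id.items()}

# A column with one missing visit (NaN) and one ID absent from the dictionary.
site_ids = pd.Series([1, 2, np.nan, 3], name="site1")
site_names = site_ids.map(id_to_site)

print(site_names)
```

Both the missing visit and the unknown ID end up as `NaN`, which is why the later concatenation step needs `fillna('')`.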

In [9]:
X_train.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,site6,time6,site7,time7,site8,time8,site9,time9,site10,time10
1,rr.office.microsoft.com,2014-02-20 10:02:45,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT
2,maps.google.com,2014-02-22 11:19:50,mts0.google.com,2014-02-22 11:19:50,khms0.google.com,2014-02-22 11:19:51,mts0.google.com,2014-02-22 11:19:51,mts1.google.com,2014-02-22 11:19:51,khms1.google.com,2014-02-22 11:19:51,khms0.google.com,2014-02-22 11:19:52,khms1.google.com,2014-02-22 11:19:52,193.164.197.30,2014-02-22 11:20:15,193.164.196.60,2014-02-22 11:20:16
3,cbk1.googleapis.com,2013-12-16 16:40:17,accounts.google.com,2013-12-16 16:40:18,cbk0.googleapis.com,2013-12-16 16:40:19,cbk1.googleapis.com,2013-12-16 16:40:19,twitter.com,2013-12-16 16:40:19,accounts.google.com,2013-12-16 16:40:19,cbk0.googleapis.com,2013-12-16 16:40:20,cbk0.googleapis.com,2013-12-16 16:40:21,cbk0.googleapis.com,2013-12-16 16:40:22,cbk0.googleapis.com,2013-12-16 16:40:24
4,annotathon.org,2014-03-28 10:52:12,annotathon.org,2014-03-28 10:52:42,annotathon.org,2014-03-28 10:53:12,annotathon.org,2014-03-28 10:53:42,annotathon.org,2014-03-28 10:54:12,annotathon.org,2014-03-28 10:54:42,annotathon.org,2014-03-28 10:55:12,annotathon.org,2014-03-28 10:55:42,annotathon.org,2014-03-28 10:56:12,annotathon.org,2014-03-28 10:56:42
5,apis.google.com,2014-02-28 10:53:05,fr.wikipedia.org,2014-02-28 10:55:22,bits.wikimedia.org,2014-02-28 10:55:22,meta.wikimedia.org,2014-02-28 10:55:23,fr.wikipedia.org,2014-02-28 10:55:23,meta.wikimedia.org,2014-02-28 10:55:59,bits.wikimedia.org,2014-02-28 10:55:59,fr.wikipedia.org,2014-02-28 10:55:59,fr.wikipedia.org,2014-02-28 10:57:06,meta.wikimedia.org,2014-02-28 10:57:11


Build Tf-Idf features based on sites. You can use `ngram_range`=(1, 3) and `max_features`=100000 or more

In [10]:
X_train['sites'] = X_train['site1']
X_valid['sites'] = X_valid['site1']
X_test['sites'] = X_test['site1']
for i in sites[1:]:
    X_train['sites'] += ' ' + X_train[i].fillna('')
    X_valid['sites'] += ' ' + X_valid[i].fillna('')
    X_test['sites'] += ' ' + X_test[i].fillna('')

In [11]:
# Your code here
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)

X_train_vect = vectorizer.fit_transform(X_train['sites'])
X_valid_vect = vectorizer.transform(X_valid['sites'])
X_test_vect = vectorizer.transform(X_test['sites'])
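
One detail worth checking when vectorizing host names: with its default `token_pattern`, `TfidfVectorizer` splits on punctuation, so `maps.google.com` becomes the tokens `maps`, `google`, and `com`, and the n-grams are built over those pieces. The sketch below compares this with a whitespace-only pattern on two made-up sessions. The `token_pattern=r"\S+"` variant is an assumption offered for comparison, not something the assignment prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy sessions written as space-separated host names.
sessions = [
    "maps.google.com mts0.google.com khms0.google.com",
    "annotathon.org annotathon.org annotathon.org",
]

# Default tokenization: host names are split at the dots.
default_vect = TfidfVectorizer(ngram_range=(1, 3))
default_vect.fit(sessions)

# Whitespace-only tokenization: every host name stays a single token.
whole_host_vect = TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"\S+")
whole_host_vect.fit(sessions)

default_vocab = set(default_vect.vocabulary_)
whole_host_vocab = set(whole_host_vect.vocabulary_)

print(sorted(default_vocab)[:5])
print(sorted(whole_host_vocab)[:5])
```

Which tokenization scores better is an empirical question; comparing both on the validation split is a reasonable way to decide.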

Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [12]:
# Your code here
X_train['start_time'] = [d.hour for d in X_train['time1']]
X_valid['start_time'] = [d.hour for d in X_valid['time1']]
X_test['start_time'] = [d.hour for d in X_test['time1']]

In [13]:
# Note: the strict inequalities leave hours 4, 12, and 20 outside all three flags.
X_train['morning'] = [1 if 4 < d < 12 else 0 for d in X_train.start_time]
X_train['day'] = [1 if 12 < d < 20 else 0 for d in X_train.start_time]
X_train['night'] = [1 if d < 4 or d > 20 else 0 for d in X_train.start_time]

X_valid['morning'] = [1 if 4 < d < 12 else 0 for d in X_valid.start_time]
X_valid['day'] = [1 if 12 < d < 20 else 0 for d in X_valid.start_time]
X_valid['night'] = [1 if d < 4 or d > 20 else 0 for d in X_valid.start_time]

X_test['morning'] = [1 if 4 < d < 12 else 0 for d in X_test.start_time]
X_test['day'] = [1 if 12 < d < 20 else 0 for d in X_test.start_time]
X_test['night'] = [1 if d < 4 or d > 20 else 0 for d in X_test.start_time]
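
The same flags can be computed with vectorized comparisons on `.dt.hour` instead of list comprehensions. The sketch below uses a hypothetical helper, `time_of_day_flags`, which is not part of the original notebook. It keeps the same strict boundaries, so a session that starts at hour 4, 12, or 20 gets zeros in all three columns.

```python
import pandas as pd

def time_of_day_flags(start_times: pd.Series) -> pd.DataFrame:
    """Return 0/1 morning, day, and night flags for a datetime Series."""
    hour = start_times.dt.hour
    return pd.DataFrame(
        {
            "morning": ((hour > 4) & (hour < 12)).astype(int),
            "day": ((hour > 12) & (hour < 20)).astype(int),
            "night": ((hour < 4) | (hour > 20)).astype(int),
        },
        index=start_times.index,
    )

# Three made-up session start times: 10:02, 12:30, and 23:15.
demo = pd.Series(pd.to_datetime([
    "2014-02-20 10:02:45",
    "2014-02-20 12:30:00",
    "2014-02-20 23:15:00",
]))
flags = time_of_day_flags(demo)
print(flags)
```

Wrapping the logic in one function also avoids repeating the three lines for the train, validation, and test frames.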

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`)

In [14]:
scaler = StandardScaler()
features = ['morning', 'day', 'night']
X_train_dates_scaled = scaler.fit_transform(X_train[features])
X_valid_dates_scaled = scaler.transform(X_valid[features])
X_test_dates_scaled = scaler.transform(X_test[features])

In [15]:
# Your code here
X_train_new = hstack([X_train_vect, X_train_dates_scaled])
X_valid_new = hstack([X_valid_vect, X_valid_dates_scaled])
X_test_new = hstack([X_test_vect, X_test_dates_scaled])
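
As a small illustration of the stacking step, the sketch below joins a sparse block and a dense block column-wise on toy matrices. Two choices here are assumptions rather than part of the notebook: the dense block is wrapped in `csr_matrix` explicitly, and `format="csr"` is requested because `hstack` returns a COO matrix by default, while CSR supports row slicing and indexing.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-in for a Tf-Idf matrix: 2 sessions, 3 text features.
tfidf_block = csr_matrix(np.array([[0.0, 0.7, 0.7],
                                   [1.0, 0.0, 0.0]]))

# Toy stand-in for scaled time features: 2 sessions, 2 dense columns.
time_block = np.array([[1.0, -1.0],
                       [-1.0, 1.0]])

# Stack column-wise; the row counts of the blocks must match.
combined = hstack([tfidf_block, csr_matrix(time_block)], format="csr")

print(combined.shape)
print(combined.toarray())
```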

Perform cross-validation with logistic regression.

In [35]:
# Your code here
lr_cv = LogisticRegressionCV(cv=3)
lr_cv.fit(X_train_new, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

In [36]:
y_pred = lr_cv.predict_proba(X_valid_new)[:, 1]

In [37]:
roc_auc_score(y_valid, y_pred)

0.9757491935911209
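
When `scoring` is left unset, `LogisticRegressionCV` selects the regularization strength by accuracy, while this competition is evaluated with ROC AUC. The sketch below shows one possible way to align the two, using `scoring="roc_auc"` and an explicit `StratifiedKFold`. It runs on synthetic data from `make_classification`, and every parameter value (`Cs=5`, `n_splits=3`, `random_state=17`, and so on) is an illustrative assumption, not a tuned setting for this task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, imbalanced binary problem standing in for the session data.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], class_sep=2.0, random_state=17,
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=17,
)

# Stratified folds keep the share of the rare class similar in every fold.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)

# Choose C by cross-validated ROC AUC instead of the default accuracy.
model = LogisticRegressionCV(Cs=5, cv=skf, scoring="roc_auc", max_iter=1000)
model.fit(X_fit, y_fit)

holdout_auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
print(holdout_auc)
```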

Make predictions for the test set and create a submission file.

In [38]:
X_all = vstack([X_train_new, X_valid_new])
lr_cv.fit(X_all, y)

LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

In [39]:
test_pred = lr_cv.predict_proba(X_test_new)[:, 1]

In [40]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index=np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [41]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")
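
Before uploading, it can help to confirm that the file has the expected layout. The sketch below repeats the helper so that the snippet runs on its own, writes three made-up probabilities to an in-memory buffer instead of a file, and reads them back. The buffer and the toy predictions are assumptions for illustration.

```python
import io

import numpy as np
import pandas as pd

def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    # Session IDs in the submission start at 1.
    predicted_df = pd.DataFrame(predicted_labels,
                                index=np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

# Made-up probabilities standing in for test_pred.
toy_pred = np.array([0.1, 0.9, 0.5])

buffer = io.StringIO()
write_to_submission_file(toy_pred, buffer)

# Read the CSV back and inspect its structure.
buffer.seek(0)
check = pd.read_csv(buffer, index_col="session_id")
print(check)
```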