# Catch Me If You Can
*[Kaggle Competition - Intruder detection](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2)*

Web-user identification is a hot research topic at the intersection of sequential pattern mining and behavioral psychology.

Here we try to identify a user on the Internet by tracking the sequence of web pages he or she visits. The algorithm takes a web page session (a sequence of pages visited consecutively by the same person) and predicts whether it belongs to Alice or to somebody else.

The data comes from the proxy servers of Blaise Pascal University. Related paper: "A Tool for Classification of Sequential Data" by Giacomo Kahn, Yannick Loiseau and Olivier Raynaud.


## Libraries

In [2]:
import numpy as np
import pandas as pd
import pickle

# Matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

## Datasets

In [3]:
import os

folder_path = "./datasets/"
train_path = os.path.join(folder_path, "train_sessions.csv")
test_path = os.path.join(folder_path, "test_sessions.csv")

## Data-preprocessing

In [4]:
# Load the training and test datasets
train_df = pd.read_csv(train_path, index_col='session_id')
test_df = pd.read_csv(test_path, index_col='session_id')

train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
1,718,2014-02-20 10:02:45,,,,,,,,,...,,,,,,,,,,0
2,890,2014-02-22 11:19:50,941.0,2014-02-22 11:19:50,3847.0,2014-02-22 11:19:51,941.0,2014-02-22 11:19:51,942.0,2014-02-22 11:19:51,...,2014-02-22 11:19:51,3847.0,2014-02-22 11:19:52,3846.0,2014-02-22 11:19:52,1516.0,2014-02-22 11:20:15,1518.0,2014-02-22 11:20:16,0
3,14769,2013-12-16 16:40:17,39.0,2013-12-16 16:40:18,14768.0,2013-12-16 16:40:19,14769.0,2013-12-16 16:40:19,37.0,2013-12-16 16:40:19,...,2013-12-16 16:40:19,14768.0,2013-12-16 16:40:20,14768.0,2013-12-16 16:40:21,14768.0,2013-12-16 16:40:22,14768.0,2013-12-16 16:40:24,0
4,782,2014-03-28 10:52:12,782.0,2014-03-28 10:52:42,782.0,2014-03-28 10:53:12,782.0,2014-03-28 10:53:42,782.0,2014-03-28 10:54:12,...,2014-03-28 10:54:42,782.0,2014-03-28 10:55:12,782.0,2014-03-28 10:55:42,782.0,2014-03-28 10:56:12,782.0,2014-03-28 10:56:42,0
5,22,2014-02-28 10:53:05,177.0,2014-02-28 10:55:22,175.0,2014-02-28 10:55:22,178.0,2014-02-28 10:55:23,177.0,2014-02-28 10:55:23,...,2014-02-28 10:55:59,175.0,2014-02-28 10:55:59,177.0,2014-02-28 10:55:59,177.0,2014-02-28 10:57:06,178.0,2014-02-28 10:57:11,0


## Tidy up datasets

In [5]:
times = ['time%s' % i for i in range(1, 11)]

# Convert time1 ... time10 to datetime type
train_df[times] = train_df[times].apply(pd.to_datetime)
test_df[times] = test_df[times].apply(pd.to_datetime)

# Sort the data by time
train_df = train_df.sort_values(by='time1')
# Look at the first rows of the training set
train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
21669,56,2013-01-12 08:05:57,55.0,2013-01-12 08:05:57,,NaT,,NaT,,NaT,...,NaT,,NaT,,NaT,,NaT,,NaT,0
54843,56,2013-01-12 08:37:23,55.0,2013-01-12 08:37:23,56.0,2013-01-12 09:07:07,55.0,2013-01-12 09:07:09,,NaT,...,NaT,,NaT,,NaT,,NaT,,NaT,0
77292,946,2013-01-12 08:50:13,946.0,2013-01-12 08:50:14,951.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:16,...,2013-01-12 08:50:16,948.0,2013-01-12 08:50:16,784.0,2013-01-12 08:50:16,949.0,2013-01-12 08:50:17,946.0,2013-01-12 08:50:17,0
114021,945,2013-01-12 08:50:17,948.0,2013-01-12 08:50:17,949.0,2013-01-12 08:50:18,948.0,2013-01-12 08:50:18,945.0,2013-01-12 08:50:18,...,2013-01-12 08:50:18,947.0,2013-01-12 08:50:19,945.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:20,0
146670,947,2013-01-12 08:50:20,950.0,2013-01-12 08:50:20,948.0,2013-01-12 08:50:20,947.0,2013-01-12 08:50:21,950.0,2013-01-12 08:50:21,...,2013-01-12 08:50:21,946.0,2013-01-12 08:50:21,951.0,2013-01-12 08:50:22,946.0,2013-01-12 08:50:22,947.0,2013-01-12 08:50:22,0
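
As a minimal sketch of the conversion and sorting step, assuming a toy two-session frame with hypothetical values (not taken from the competition data): `pd.to_datetime` turns missing timestamps into `NaT`, and sorting by `time1` puts the sessions in chronological order.

```python
import pandas as pd

# Toy sessions (hypothetical values): session 1 has only one page visit
toy = pd.DataFrame(
    {
        "time1": ["2014-02-20 10:02:45", "2013-01-12 08:05:57"],
        "time2": [None, "2013-01-12 08:05:58"],
    },
    index=pd.Index([1, 2], name="session_id"),
)

time_cols = ["time1", "time2"]
toy[time_cols] = toy[time_cols].apply(pd.to_datetime)

# Chronological order: the 2013 session now comes first
toy_sorted = toy.sort_values(by="time1")
print(toy_sorted)
```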


In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 253561 entries, 21669 to 204762
Data columns (total 21 columns):
site1     253561 non-null int64
time1     253561 non-null datetime64[ns]
site2     250098 non-null float64
time2     250098 non-null datetime64[ns]
site3     246919 non-null float64
time3     246919 non-null datetime64[ns]
site4     244321 non-null float64
time4     244321 non-null datetime64[ns]
site5     241829 non-null float64
time5     241829 non-null datetime64[ns]
site6     239495 non-null float64
time6     239495 non-null datetime64[ns]
site7     237297 non-null float64
time7     237297 non-null datetime64[ns]
site8     235224 non-null float64
time8     235224 non-null datetime64[ns]
site9     233084 non-null float64
time9     233084 non-null datetime64[ns]
site10    231052 non-null float64
time10    231052 non-null datetime64[ns]
target    253561 non-null int64
dtypes: datetime64[ns](10), float64(9), int64(2)
memory usage: 42.6 MB


In [7]:
# Fill missing sites with 0, then cast the site columns to int
sites = ["site%s" % i for i in range(1, 11)]

train_df[sites] = train_df[sites].fillna(0).astype('int')
test_df[sites] = test_df[sites].fillna(0).astype('int')
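
A short illustration of why the cast is needed, using a hypothetical toy frame: a column that contains `NaN` is stored as `float64`, so the site IDs only become integers once the missing values are replaced. Here 0 acts as a placeholder for "no page visited".

```python
import numpy as np
import pandas as pd

# Hypothetical site IDs; NaN marks a session shorter than two pages
toy = pd.DataFrame({"site1": [718, 890], "site2": [np.nan, 941.0]})
print(toy.dtypes)  # site2 is float64 because of the NaN

# Replace the missing value with the placeholder 0, then cast to integers
toy_int = toy.fillna(0).astype("int")
print(toy_int)
```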

In [9]:
# path to site_pkl
site_path = os.path.join(folder_path, 'site_dic.pkl')

# Load website - site%s mapping from pkl file
with open(site_path, 'rb') as input_file:
  site_dict = pickle.load(input_file)

# Create dataframe to represent dictionary
web_table = pd.DataFrame(list(site_dict.keys()),
                          index=list(site_dict.values()), columns=["website"])
web_table.head()

,website
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu
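
The table can be used to translate a session's site IDs back into website names. The sketch below assumes a hypothetical two-entry dictionary in place of `site_dict` and treats ID 0 as the padding value that has no website behind it.

```python
import pandas as pd

# Hypothetical stand-in for site_dict (website name -> integer ID)
toy_site_dict = {"www.example.org": 1, "mail.example.com": 2}

# Same construction as above: IDs become the index, names the single column
toy_table = pd.DataFrame(
    list(toy_site_dict.keys()),
    index=list(toy_site_dict.values()),
    columns=["website"],
)

# Plain dict (ID -> website) for convenient lookups with a default
id_to_name = toy_table["website"].to_dict()

# Translate one session; the padding ID 0 has no entry in the table
session = [2, 1, 0]
names = [id_to_name.get(site_id, "<empty>") for site_id in session]
print(names)
```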


## Sparse Matrix Representation

In [10]:
# Combine train and test data to get sparse matrix
idx_train = train_df.shape[0]
y_train = train_df.target
full_df = pd.concat([train_df.drop('target', axis=1),
                      test_df]).reset_index(drop=True)

# Use positional slicing: .loc is label-based and would include row idx_train
print(full_df.iloc[:idx_train, :].shape)

(253561, 20)
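
After `reset_index(drop=True)` the row labels are 0, 1, 2, ..., so a label-based `.loc[:n]` slice returns `n + 1` rows (the end label is included), while a positional `.iloc[:n]` slice returns exactly `n`. A toy sketch with hypothetical values makes the difference visible:

```python
import pandas as pd

# Toy frames standing in for the train and test sets (hypothetical values)
toy_train = pd.DataFrame({"site1": [56, 946, 945]})
toy_test = pd.DataFrame({"site1": [29, 782]})

n_train = toy_train.shape[0]
toy_full = pd.concat([toy_train, toy_test]).reset_index(drop=True)

# .loc slices by label and includes the end label, so it returns one extra row
loc_rows = toy_full.loc[:n_train, :].shape[0]
# .iloc slices by position and excludes the end, matching the training size
iloc_rows = toy_full.iloc[:n_train, :].shape[0]
print(loc_rows, iloc_rows)
```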


In [11]:
# Take the sites variable only
full_sites = full_df[sites]
full_sites.head()

,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
0,56,55,0,0,0,0,0,0,0,0
1,56,55,56,55,0,0,0,0,0,0
2,946,946,951,946,946,945,948,784,949,946
3,945,948,949,948,945,946,947,945,946,946
4,947,950,948,947,950,952,946,951,946,947


In [12]:
# How many unique sites?
pd.unique(full_sites.values.ravel()).shape

(48372,)

In [13]:
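# Flatten the site IDs into a 1-D array, session after session
sites_flatten = full_sites.values.flatten()

# Build a sparse session-by-site count matrix
from scipy.sparse import csr_matrix

# data: a 1 per visit; indices: the site ID is the column;
# indptr: every session (row) owns 10 consecutive entries
full_sites_sparse = csr_matrix(([1]*sites_flatten.size,
                                sites_flatten,
                                range(0, sites_flatten.size+10, 10)))

# Drop column 0, which only counts the padding for missing sites
full_sites_sparse = full_sites_sparse[:, 1:]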
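
To see what the `(data, indices, indptr)` construction produces, here is a small sketch on three hypothetical sessions of length 4 instead of 10. Each row of the result is a session, each column a site ID, and each cell the number of times that site appears in the session.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Three toy sessions of fixed length 4; 0 pads sessions that ended early
toy_sites = np.array([
    [3, 1, 3, 0],
    [2, 2, 2, 2],
    [1, 0, 0, 0],
])
session_len = toy_sites.shape[1]
flat = toy_sites.flatten()

# data: one count per visit; indices: column = site ID;
# indptr: row i owns flat[indptr[i]:indptr[i + 1]]
toy_sparse = csr_matrix(
    (np.ones(flat.size, dtype=int), flat,
     np.arange(0, flat.size + session_len, session_len))
)

# Repeated (row, column) entries are summed when the matrix is densified,
# so each cell holds how often that site occurs in the session
dense = toy_sparse.toarray()
print(dense)

# Drop column 0, which only counts the padding
trimmed = toy_sparse[:, 1:].toarray()
print(trimmed)
```

The dense form is only practical for a toy example; with tens of thousands of site columns the matrix should stay sparse.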

## Training & Fitting: Logistic Regression

In [14]:
# Prepare data ready for training
X_train = full_sites_sparse[:idx_train, :]

print(X_train.shape, y_train.shape)

(253561, 48371) (253561,)


In [16]:
!pip install scikit-learn


In [17]:
from scipy.sparse import hstack
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [18]:
def get_auc_score(X, y, C=1.0, seed=42, train_ratio=0.85):
  # Time-ordered split: the first train_ratio share of the rows is used for
  # fitting, the remaining rows for validation
  idx_split = int(round(X.shape[0]*train_ratio))
  # Logistic Regression fit
  log_reg = LogisticRegression(C=C, random_state=seed, solver='lbfgs', n_jobs=-1).fit(X[:idx_split,:], y[:idx_split])
  # Predicted probability of the positive class on the validation part
  y_pred = log_reg.predict_proba(X[idx_split:,:])
  y_pred_1 = y_pred[:,1]
  # ROC AUC
  score = roc_auc_score(y[idx_split:], y_pred_1)

  return score

# Get ROC AUC score 
score = get_auc_score(X_train, y_train)
print("auc test score: ", score)

auc test score:  0.9193060425403747
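
ROC AUC can be read as the probability that a randomly chosen positive session receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly with NumPy on hypothetical labels and scores; `pairwise_auc` is an illustrative helper, not part of scikit-learn, and it is only practical for small inputs because it compares every positive with every negative.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.2, 0.8]
auc = pairwise_auc(y_true, y_score)
print(auc)
```

Because the metric depends only on how the scores are ordered, the predicted probabilities do not need to be calibrated for it to be meaningful.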


## Submission 

In [19]:
def write_submission(y_pred_test, outfile, target_name='target', index_label='session_id'):
  # Session IDs in the submission file start at 1, in test-set row order
  submission_df = pd.DataFrame(y_pred_test,
                               index=np.arange(1, y_pred_test.size+1),
                               columns=[target_name])
  submission_df.to_csv(outfile, index_label=index_label)

In [20]:
# Setup path for outfile
outfile_path = os.path.join(folder_path, 'baseline1.csv') #The output file will be 'baseline1.csv'

In [21]:
# Get test dataset
X_test = full_sites_sparse[idx_train:, :]

# Logistic Regression fit
log_reg = LogisticRegression(C=1.0, random_state=42, solver='lbfgs', n_jobs=-1).fit(X_train, y_train)
y_pred_test = log_reg.predict_proba(X_test)[:,1]

# Write submission
write_submission(y_pred_test, outfile_path)
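
To check the layout of the submission file without touching the disk, the same frame can be written to an in-memory buffer. The probabilities below are hypothetical. Note that numbering the rows from 1 relies on the test rows keeping their original order, which holds here because only the training set was sorted.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical predicted probabilities for three test sessions
toy_pred = np.array([0.02, 0.91, 0.15])

toy_submission = pd.DataFrame(
    toy_pred,
    index=np.arange(1, toy_pred.size + 1),
    columns=["target"],
)

# Write to an in-memory buffer instead of a file to inspect the CSV text
buffer = io.StringIO()
toy_submission.to_csv(buffer, index_label="session_id")
csv_text = buffer.getvalue()
print(csv_text)
```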

## Further Reading
* Sparse matrices: [SciPy documentation](https://docs.scipy.org/doc/scipy/reference/sparse.html), [Intro to Sparse Matrix](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/), [CSR Matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html)
* A nice and concise overview of linear models is given in the book “Deep Learning” (I. Goodfellow, Y. Bengio, and A. Courville).
* Linear models are covered practically in every ML book. We recommend “Pattern Recognition and Machine Learning” (C. Bishop) and “Machine Learning: A Probabilistic Perspective” (K. Murphy).
* [Scikit-learn](https://scikit-learn.org/stable/documentation.html) library. These guys work hard on writing really clear documentation.
* [Scipy 2017 scikit-learn tutorial](https://github.com/amueller/scipy-2017-sklearn) by Alex Gramfort and Andreas Mueller.
