# Catch Me If You Can ("Alice") - Kaggle Competition
### Intruder Detection through Webpage Session Tracking
https://www.kaggle.com/competitions/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2

User sessions are constructed so that each one lasts no longer than 30 minutes and contains no more than 10 websites.
I.e. a session is considered ended either when the user has visited 10 websites or when the session has lasted over 30 minutes.
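
The session rule can be sketched in a few lines. This is only an illustration under assumptions: the competition files are already split into sessions, and the function name `split_into_sessions` and the `(site_id, timestamp)` input format are hypothetical.

```python
from datetime import datetime, timedelta

def split_into_sessions(visits, max_sites=10, max_duration=timedelta(minutes=30)):
    """Group time-ordered (site_id, timestamp) visits into sessions.

    A new session starts when the current one already holds `max_sites`
    visits, or when the next visit falls more than `max_duration` after
    the first visit of the current session.
    """
    sessions, current = [], []
    for site_id, ts in visits:
        if current and (len(current) >= max_sites
                        or ts - current[0][1] > max_duration):
            sessions.append(current)
            current = []
        current.append((site_id, ts))
    if current:
        sessions.append(current)
    return sessions

start = datetime(2013, 1, 12, 8, 0, 0)
# 12 visits one second apart, then one more visit 40 minutes after the start
visits = [(i, start + timedelta(seconds=i)) for i in range(12)]
visits.append((99, start + timedelta(minutes=40)))
sessions = split_into_sessions(visits)
print([len(s) for s in sessions])
```

The first ten visits fill one session, the next two start a second one, and the late visit opens a third because it exceeds the 30-minute window.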

## 1. Initialization

### 1.1 Import and read files

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LogisticRegression
from scipy.sparse import csr_matrix, hstack

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, TargetEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, r2_score, roc_curve, roc_auc_score, \
                            precision_score, precision_recall_curve, auc, f1_score, accuracy_score, \
                            log_loss, recall_score

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [2]:
seed_value = 42
np.random.seed(seed_value)

In [3]:
train_df = pd.read_csv('train_sessions.csv')
# set id as index
train_df.set_index('session_id', inplace=True)
# sort by time1, which is never missing (every session starts with a first visit)
train_df = train_df.sort_values(by='time1')

train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
21669,56,2013-01-12 08:05:57,55.0,2013-01-12 08:05:57,,,,,,,...,,,,,,,,,,0
54843,56,2013-01-12 08:37:23,55.0,2013-01-12 08:37:23,56.0,2013-01-12 09:07:07,55.0,2013-01-12 09:07:09,,,...,,,,,,,,,,0
77292,946,2013-01-12 08:50:13,946.0,2013-01-12 08:50:14,951.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:16,...,2013-01-12 08:50:16,948.0,2013-01-12 08:50:16,784.0,2013-01-12 08:50:16,949.0,2013-01-12 08:50:17,946.0,2013-01-12 08:50:17,0
114021,945,2013-01-12 08:50:17,948.0,2013-01-12 08:50:17,949.0,2013-01-12 08:50:18,948.0,2013-01-12 08:50:18,945.0,2013-01-12 08:50:18,...,2013-01-12 08:50:18,947.0,2013-01-12 08:50:19,945.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:20,0
146670,947,2013-01-12 08:50:20,950.0,2013-01-12 08:50:20,948.0,2013-01-12 08:50:20,947.0,2013-01-12 08:50:21,950.0,2013-01-12 08:50:21,...,2013-01-12 08:50:21,946.0,2013-01-12 08:50:21,951.0,2013-01-12 08:50:22,946.0,2013-01-12 08:50:22,947.0,2013-01-12 08:50:22,0


In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 253561 entries, 21669 to 204762
Data columns (total 21 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   site1   253561 non-null  int64  
 1   time1   253561 non-null  object 
 2   site2   250098 non-null  float64
 3   time2   250098 non-null  object 
 4   site3   246919 non-null  float64
 5   time3   246919 non-null  object 
 6   site4   244321 non-null  float64
 7   time4   244321 non-null  object 
 8   site5   241829 non-null  float64
 9   time5   241829 non-null  object 
 10  site6   239495 non-null  float64
 11  time6   239495 non-null  object 
 12  site7   237297 non-null  float64
 13  time7   237297 non-null  object 
 14  site8   235224 non-null  float64
 15  time8   235224 non-null  object 
 16  site9   233084 non-null  float64
 17  time9   233084 non-null  object 
 18  site10  231052 non-null  float64
 19  time10  231052 non-null  object 
 20  target  253561 non-null  int64  
dtypes: float64(

### 1.2 Dictionary

In [5]:
# READ dictionary from pkl file
import pickle

# Load the site dictionary (site name -> integer id) from the pickle file
with open('site_dic.pkl', 'rb') as file:
    loaded_dict = pickle.load(file)

In [6]:
no_int = 0
for key, value in loaded_dict.items():
    if not isinstance(value, int):
        no_int += 1
        print(f'Here {value} is no integer.')
        print(type(value))

if no_int == 0: print('The values in loaded_dict are all integers -> I can transform all the float64 of sites into int.')

The values in loaded_dict are all integers -> I can transform all the float64 of sites into int.


In [7]:
# Create dataframe for the dictionary
loaded_dict_df = pd.DataFrame(list(loaded_dict.keys()), index=list(loaded_dict.values()), 
                          columns=['site'])
print('Tot websites:', loaded_dict_df.shape[0])
loaded_dict_df.head()

Tot websites: 48371


Unnamed: 0,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu


In [8]:
loaded_dict_df.loc[loaded_dict_df.index == 1]

Unnamed: 0,site
1,fpdownload2.macromedia.com


### 1.3 Missing values

In [9]:
# train_df.isnull().sum()
train_df.isna().sum()

site1         0
time1         0
site2      3463
time2      3463
site3      6642
time3      6642
site4      9240
time4      9240
site5     11732
time5     11732
site6     14066
time6     14066
site7     16264
time7     16264
site8     18337
time8     18337
site9     20477
time9     20477
site10    22509
time10    22509
target        0
dtype: int64

In [10]:
for column in train_df.columns:
    if column.startswith('site'):
        train_df[column] = train_df[column].fillna(0)

In [11]:
# train_df.isnull().sum()
train_df.isna().sum()

site1         0
time1         0
site2         0
time2      3463
site3         0
time3      6642
site4         0
time4      9240
site5         0
time5     11732
site6         0
time6     14066
site7         0
time7     16264
site8         0
time8     18337
site9         0
time9     20477
site10        0
time10    22509
target        0
dtype: int64

### 1.4 Transform datatypes

In [12]:
for column in train_df.columns:
    if column.startswith('site'):
        train_df[column] = train_df[column].astype(int)
    elif column.startswith('time'):
        train_df[column] = pd.to_datetime(train_df[column])

In [13]:
train_df.info()
train_df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 253561 entries, 21669 to 204762
Data columns (total 21 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   site1   253561 non-null  int64         
 1   time1   253561 non-null  datetime64[ns]
 2   site2   253561 non-null  int64         
 3   time2   250098 non-null  datetime64[ns]
 4   site3   253561 non-null  int64         
 5   time3   246919 non-null  datetime64[ns]
 6   site4   253561 non-null  int64         
 7   time4   244321 non-null  datetime64[ns]
 8   site5   253561 non-null  int64         
 9   time5   241829 non-null  datetime64[ns]
 10  site6   253561 non-null  int64         
 11  time6   239495 non-null  datetime64[ns]
 12  site7   253561 non-null  int64         
 13  time7   237297 non-null  datetime64[ns]
 14  site8   253561 non-null  int64         
 15  time8   235224 non-null  datetime64[ns]
 16  site9   253561 non-null  int64         
 17  time9   233084 non-null  datet

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
21669,56,2013-01-12 08:05:57,55,2013-01-12 08:05:57,0,NaT,0,NaT,0,NaT,...,NaT,0,NaT,0,NaT,0,NaT,0,NaT,0
54843,56,2013-01-12 08:37:23,55,2013-01-12 08:37:23,56,2013-01-12 09:07:07,55,2013-01-12 09:07:09,0,NaT,...,NaT,0,NaT,0,NaT,0,NaT,0,NaT,0
77292,946,2013-01-12 08:50:13,946,2013-01-12 08:50:14,951,2013-01-12 08:50:15,946,2013-01-12 08:50:15,946,2013-01-12 08:50:16,...,2013-01-12 08:50:16,948,2013-01-12 08:50:16,784,2013-01-12 08:50:16,949,2013-01-12 08:50:17,946,2013-01-12 08:50:17,0
114021,945,2013-01-12 08:50:17,948,2013-01-12 08:50:17,949,2013-01-12 08:50:18,948,2013-01-12 08:50:18,945,2013-01-12 08:50:18,...,2013-01-12 08:50:18,947,2013-01-12 08:50:19,945,2013-01-12 08:50:19,946,2013-01-12 08:50:19,946,2013-01-12 08:50:20,0
146670,947,2013-01-12 08:50:20,950,2013-01-12 08:50:20,948,2013-01-12 08:50:20,947,2013-01-12 08:50:21,950,2013-01-12 08:50:21,...,2013-01-12 08:50:21,946,2013-01-12 08:50:21,951,2013-01-12 08:50:22,946,2013-01-12 08:50:22,947,2013-01-12 08:50:22,0
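
The same cleaning steps (fill missing sites with 0, cast sites to int, parse timestamps) are applied to `test_df` as well, so they could be wrapped in one helper. A minimal sketch, assuming the `site*` / `time*` column naming used above; the function name `prepare_sessions` is hypothetical and the tiny frame only stands in for the real CSV files.

```python
import numpy as np
import pandas as pd

def prepare_sessions(df):
    """Return a copy with site columns as int (NaN -> 0) and time columns as datetime."""
    df = df.copy()
    for col in df.columns:
        if col.startswith('site'):
            df[col] = df[col].fillna(0).astype(int)
        elif col.startswith('time'):
            df[col] = pd.to_datetime(df[col])
    return df

# Toy stand-in for a slice of train_sessions.csv
toy = pd.DataFrame({
    'site1': [56, 946],
    'time1': ['2013-01-12 08:05:57', '2013-01-12 08:50:13'],
    'site2': [55.0, np.nan],
    'time2': ['2013-01-12 08:05:57', None],
})
clean = prepare_sessions(toy)
print(clean.dtypes)
```

Calling the same function on both the training and the test frame would keep the two preprocessing paths from drifting apart.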


### Function to write submission file

In [14]:
def write_submission_file(predicted_test, test_df, csv_filename):
    # Pair each session_id with its predicted probability and save without the row index
    submission_file = pd.DataFrame({'session_id': test_df.index, 'target': pd.Series(predicted_test)})
    submission_file.to_csv(csv_filename, index=False)

### Function to print scores

In [15]:
def print_scores(model):
    train_rocauc = roc_auc_score(y_train, model.predict_proba(X_train)[:,1])
    test_rocauc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
    train_f1 = f1_score(y_train, model.predict(X_train))
    test_f1 = f1_score(y_test, model.predict(X_test))
    train_log_loss = log_loss(y_train, model.predict_proba(X_train)[:,1])
    test_log_loss = log_loss(y_test, model.predict_proba(X_test)[:,1])
    train_ps = precision_score(y_train, model.predict(X_train))
    test_ps = precision_score(y_test, model.predict(X_test))
    train_rs = recall_score(y_train, model.predict(X_train))
    test_rs = recall_score(y_test, model.predict(X_test))
    
    print('ROC-AUC \t F1 \t\t LogLoss \t precision \t recall')
    print(round(train_rocauc,4), round(test_rocauc,4), '\t', round(train_f1,4), round(test_f1,4), '\t', \
          round(train_log_loss,4), round(test_log_loss,4), '\t',round(train_ps,4), round(test_ps,4), '\t', \
          round(train_rs,4), round(test_rs,4))

### test_df

In [16]:
# test_df
test_df = pd.read_csv('test_sessions.csv')
test_df.set_index('session_id', inplace=True)

# replace NaN in site columns with 0
for column in test_df.columns:
    if column.startswith('site'):
        test_df[column] = test_df[column].fillna(0)

# change datatype to integers for sites and to datetime for timestamps
for column in test_df.columns:
    if column.startswith('site'):
        test_df[column] = test_df[column].astype(int)
    elif column.startswith('time'):
        test_df[column] = pd.to_datetime(test_df[column])

test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 82797 entries, 1 to 82797
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   site1   82797 non-null  int64         
 1   time1   82797 non-null  datetime64[ns]
 2   site2   82797 non-null  int64         
 3   time2   81308 non-null  datetime64[ns]
 4   site3   82797 non-null  int64         
 5   time3   80075 non-null  datetime64[ns]
 6   site4   82797 non-null  int64         
 7   time4   79182 non-null  datetime64[ns]
 8   site5   82797 non-null  int64         
 9   time5   78341 non-null  datetime64[ns]
 10  site6   82797 non-null  int64         
 11  time6   77566 non-null  datetime64[ns]
 12  site7   82797 non-null  int64         
 13  time7   76840 non-null  datetime64[ns]
 14  site8   82797 non-null  int64         
 15  time8   76151 non-null  datetime64[ns]
 16  site9   82797 non-null  int64         
 17  time9   75484 non-null  datetime64[ns]
 18  site10  827

### all_websites_sparse

In [17]:
site_columns = [col for col in train_df.columns if col.startswith('site')]

train_test_df = pd.concat([train_df.drop('target', axis=1), test_df])
all_websites_df = train_test_df[site_columns]

print(train_df.shape[0] + test_df.shape[0])
print(train_df.shape, test_df.shape, all_websites_df.shape)

# Index to split the training and test data sets
idx_split = train_df.shape[0]

all_websites_series = pd.Series(all_websites_df.values.flatten())
all_websites_series.nunique(), loaded_dict_df.shape[0], all_websites_series.shape[0]
# number of different websites in train_df + test_df = n. of websites in dictionary (there is also 0 in all_websites_series)

# I create a sparse matrix with: (rows x columns) = (n. session_ids in train_df+test_df x n. of ALL different websites in train_df+test_df)
data = [1] * all_websites_series.shape[0]
indices = all_websites_series#.values
ind_ptr = range(0, all_websites_series.shape[0] + 10, 10)

all_websites_sparse = csr_matrix((data, indices, ind_ptr))[:, 1:]

all_websites_sparse.shape

336358
(253561, 21) (82797, 20) (336358, 10)


(336358, 48371)
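
The `csr_matrix((data, indices, indptr))` constructor can be hard to read at first. The toy example below, with made-up site ids and a session length of 3 instead of 10, shows how the flattened site ids become column indices and how `indptr` marks where each session's row begins and ends. Column 0 (the placeholder for missing sites) is dropped in the same way as above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two toy sessions of length 3; 0 marks a missing site
sessions = np.array([[1, 3, 1],
                     [2, 0, 0]])
n_sessions, session_len = sessions.shape

indices = sessions.flatten()                  # column index = site id
data = np.ones(indices.shape[0], dtype=int)   # each visit counts once
# Row boundaries in the flattened array: [0, 3, 6]
indptr = np.arange(0, indices.shape[0] + session_len, session_len)

# Build the matrix and drop column 0, the "no site" placeholder
counts = csr_matrix((data, indices, indptr))[:, 1:]

# Repeated ids within a row are summed when converting to a dense array
print(counts.toarray())
```

Each row is therefore a bag-of-sites vector: the value in a column is how many times that site appears in the session.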

# Models

## 0. Dummy model: only sites with OneHotEncoder

In [28]:
site_columns = [col for col in train_df.columns if col.startswith('site')]
new_train = train_df[site_columns].copy()

new_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 253561 entries, 21669 to 204762
Data columns (total 10 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   site1   253561 non-null  int64
 1   site2   253561 non-null  int64
 2   site3   253561 non-null  int64
 3   site4   253561 non-null  int64
 4   site5   253561 non-null  int64
 5   site6   253561 non-null  int64
 6   site7   253561 non-null  int64
 7   site8   253561 non-null  int64
 8   site9   253561 non-null  int64
 9   site10  253561 non-null  int64
dtypes: int64(10)
memory usage: 21.3 MB


In [29]:
X = new_train
y = train_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=seed_value)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(202848, 10) (50713, 10) (202848,) (50713,)


In [30]:
pipeline_logReg = make_pipeline(OneHotEncoder(
                                    sparse_output=False,
                                    handle_unknown='infrequent_if_exist',
                                    min_frequency=0.03
                                ),
                                 # StandardScaler(),
                                 LogisticRegression(
                                     random_state=seed_value,
                                     max_iter=5000
                                 )
                                )

pipeline_logReg.fit(X_train, y_train)

### roc_auc_score = 0.61 ... f1 = 0.0

In [40]:
model = pipeline_logReg

train_rocauc = roc_auc_score(y_train, model.predict_proba(X_train)[:,1])
test_rocauc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
print(round(train_rocauc,4), round(test_rocauc,4))

0.616 0.6074


In [47]:
train_f1 = f1_score(y_train, model.predict(X_train))
test_f1 = f1_score(y_test, model.predict(X_test))
print(round(train_f1,4), round(test_f1,4))

0.0 0.0
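
An F1 of 0.0 next to a non-trivial ROC-AUC is what one would expect if no predicted probability reaches the default 0.5 threshold used by `predict`: ROC-AUC depends only on the ranking of the scores, while F1 depends on the thresholded labels. The submission preview further down, where probabilities are around 0.01, is consistent with this reading, although the notebook does not verify it. The synthetic scores below are made up purely to illustrate the effect.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# Positives are ranked fairly high, but every probability is far below 0.5
proba = np.array([0.001, 0.002, 0.003, 0.004, 0.005,
                  0.006, 0.007, 0.020, 0.015, 0.030])

auc_value = roc_auc_score(y_true, proba)
# Default decision rule: label 1 only if the probability is at least 0.5
f1_default = f1_score(y_true, (proba >= 0.5).astype(int), zero_division=0)
# A much lower, hand-picked threshold for comparison
f1_lowered = f1_score(y_true, (proba >= 0.01).astype(int), zero_division=0)

print(round(auc_value, 4), f1_default, round(f1_lowered, 4))
```

With imbalanced targets like this one, comparing models by F1 would need a threshold chosen on validation data; ROC-AUC, the competition metric, does not need one.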


#### SUBMISSION 0: ROC-AUC = 0.60926

In [33]:
# test_df
test_df = pd.read_csv('test_sessions.csv')
test_df.set_index('session_id', inplace=True)

# replace NaN in site columns with 0
for column in test_df.columns:
    if column.startswith('site'):
        test_df[column] = test_df[column].fillna(0)

# change datatype to integers for sites
for column in test_df.columns:
    if column.startswith('site'):
        test_df[column] = test_df[column].astype(int)

new_test = test_df[site_columns].copy()

In [34]:
new_train.shape, new_test.shape

((253561, 10), (82797, 10))

In [37]:
predicted_test = pipeline_logReg.predict_proba(new_test)[:, 1]
predicted_test.shape

(82797,)

In [38]:
write_submission_file(predicted_test, new_test, 'submission.csv')
check_submission = pd.read_csv('submission.csv')
check_submission

Unnamed: 0,session_id,target
0,1,0.008624
1,2,0.000038
2,3,0.012235
3,4,0.012235
4,5,0.012235
...,...,...
82792,82793,0.001204
82793,82794,0.012235
82794,82795,0.007205
82795,82796,0.003588


## 1. Dummy model + Time: all features with OneHotEncoder

In [144]:
time_columns = train_df.select_dtypes(exclude="number").columns
site_columns = train_df.drop('target', axis=1).select_dtypes(include="number").columns
time_columns, site_columns

new_train = train_df[site_columns].copy()

for time_clmn in time_columns:
    new_train[f'{time_clmn}_year'] = train_df[time_clmn].dt.year.fillna(0).astype(int)
    new_train[f'{time_clmn}_month'] = train_df[time_clmn].dt.month.fillna(0).astype(int)
    new_train[f'{time_clmn}_day'] = train_df[time_clmn].dt.day.fillna(0).astype(int)
    new_train[f'{time_clmn}_hour'] = train_df[time_clmn].dt.hour.fillna(0).astype(int)
    new_train[f'{time_clmn}_minute'] = train_df[time_clmn].dt.minute.fillna(0).astype(int)

new_train.head()

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,...,time9_year,time9_month,time9_day,time9_hour,time9_minute,time10_year,time10_month,time10_day,time10_hour,time10_minute
21669,56,55,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
54843,56,55,56,55,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
77292,946,946,951,946,946,945,948,784,949,946,...,2013,1,12,8,50,2013,1,12,8,50
114021,945,948,949,948,945,946,947,945,946,946,...,2013,1,12,8,50,2013,1,12,8,50
146670,947,950,948,947,950,952,946,951,946,947,...,2013,1,12,8,50,2013,1,12,8,50
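
The same five date parts are extracted again for the test set later on. A small helper would keep both in sync. This is a sketch: the name `add_time_parts` is hypothetical, and missing timestamps are mapped to 0 as in the loop above.

```python
import pandas as pd

def add_time_parts(features, source, time_columns,
                   parts=('year', 'month', 'day', 'hour', 'minute')):
    """Append integer date parts of each timestamp column to `features`; NaT becomes 0."""
    features = features.copy()
    for col in time_columns:
        for part in parts:
            values = getattr(source[col].dt, part)
            features[f'{col}_{part}'] = values.fillna(0).astype(int)
    return features

# Toy stand-in for two sessions, the second one with a missing second visit
toy = pd.DataFrame({
    'site1': [56, 946],
    'time1': pd.to_datetime(['2013-01-12 08:05:57', '2013-01-12 08:50:13']),
    'time2': pd.to_datetime(['2013-01-12 08:05:57', None]),
})
toy_features = add_time_parts(toy[['site1']], toy, ['time1', 'time2'])
print(toy_features)
```

The same call with `test_df` as `source` would produce identically named columns for prediction.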


In [145]:
# for clmn in new_train.columns:
#     new_train[clmn] = new_train[clmn].astype(str)

# new_train.head()

In [146]:
X = new_train
y = train_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=seed_value)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(202848, 60) (50713, 60) (202848,) (50713,)


In [147]:
time_columns = [col for col in new_train.columns if col.startswith('time')]
site_columns = [col for col in new_train.columns if col.startswith('site')]

# times_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value=0))

# sites_pipe = make_pipeline(
#     # SimpleImputer(strategy="constant", fill_value="N_A"),
#     OneHotEncoder(sparse_output=False,handle_unknown='infrequent_if_exist',min_frequency=0.03)
# )

# preprocessor = ColumnTransformer(transformers=[
#     ('times_pipe', times_pipe, time_columns),
#     ('sites_pipe', sites_pipe, site_columns),
# ])

full_pipeline_LR = make_pipeline(#preprocessor,
                                 OneHotEncoder(sparse_output=False,handle_unknown='infrequent_if_exist',min_frequency=0.03),
                                 StandardScaler(),
                                 LogisticRegression(
                                     random_state=seed_value,
                                     max_iter=5000
                                 )
                                )

full_pipeline_LR.fit(X_train, y_train)

### roc_auc_score = 0.94 ... f1 = 0.0

In [52]:
model = full_pipeline_LR

train_rocauc = roc_auc_score(y_train, model.predict_proba(X_train)[:,1])
test_rocauc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
print(round(train_rocauc,4), round(test_rocauc,4))

0.9369 0.9345


In [53]:
train_f1 = f1_score(y_train, model.predict(X_train))
test_f1 = f1_score(y_test, model.predict(X_test))
print(round(train_f1,4), round(test_f1,4))

0.0 0.0
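
`train_test_split` shuffles the sessions, so the hold-out set is drawn from the same time period as the training set. Because `train_df` is sorted by `time1` and `TimeSeriesSplit` is already imported, a time-ordered validation is a natural alternative to try. The sketch below uses synthetic data in place of the real features and only shows the mechanics; whether it gives scores closer to the leaderboard here is not something the notebook establishes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(42)
# Synthetic stand-in for time-sorted features and a binary target
X_toy = rng.normal(size=(500, 4))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=500) > 1).astype(int)

# Each fold trains on an initial stretch of rows and validates on the rows right after it
time_split = TimeSeriesSplit(n_splits=5)
for train_idx, valid_idx in time_split.split(X_toy):
    assert train_idx.max() < valid_idx.min()

scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy,
                         cv=time_split, scoring='roc_auc')
print(scores.round(4))
```

With the real data, `X_toy` and `y_toy` would be replaced by the time-sorted feature frame and `train_df['target']`.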


#### SUBMISSION 1: ROC-AUC = 0.75423

In [81]:
# test_df
test_df = pd.read_csv('test_sessions.csv')
test_df.set_index('session_id', inplace=True)

# replace NaN in site columns with 0
for column in test_df.columns:
    if column.startswith('site'):
        test_df[column] = test_df[column].fillna(0)

# change datatype to integers for sites and to datetime for timestamps
for column in test_df.columns:
    if column.startswith('site'):
        test_df[column] = test_df[column].astype(int)
    elif column.startswith('time'):
        test_df[column] = pd.to_datetime(test_df[column])

In [91]:
time_columns = [col for col in test_df.columns if col.startswith('time')]
site_columns = [col for col in test_df.columns if col.startswith('site')]

new_test = test_df[site_columns].copy()

for time_clmn in time_columns:
    new_test[f'{time_clmn}_year'] = test_df[time_clmn].dt.year.fillna(0).astype(int)
    new_test[f'{time_clmn}_month'] = test_df[time_clmn].dt.month.fillna(0).astype(int)
    new_test[f'{time_clmn}_day'] = test_df[time_clmn].dt.day.fillna(0).astype(int)
    new_test[f'{time_clmn}_hour'] = test_df[time_clmn].dt.hour.fillna(0).astype(int)
    new_test[f'{time_clmn}_minute'] = test_df[time_clmn].dt.minute.fillna(0).astype(int)

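# Cast every feature to str. The encoder only recognises these values if the training
# features were cast the same way (see the commented-out cast in the training cells above).
for clmn in new_train.columns:
    new_test[clmn] = new_test[clmn].astype(str)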

new_test.info()
new_test.head()

<class 'pandas.core.frame.DataFrame'>
Index: 82797 entries, 1 to 82797
Data columns (total 60 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   site1          82797 non-null  object
 1   site2          82797 non-null  object
 2   site3          82797 non-null  object
 3   site4          82797 non-null  object
 4   site5          82797 non-null  object
 5   site6          82797 non-null  object
 6   site7          82797 non-null  object
 7   site8          82797 non-null  object
 8   site9          82797 non-null  object
 9   site10         82797 non-null  object
 10  time1_year     82797 non-null  object
 11  time1_month    82797 non-null  object
 12  time1_day      82797 non-null  object
 13  time1_hour     82797 non-null  object
 14  time1_minute   82797 non-null  object
 15  time2_year     82797 non-null  object
 16  time2_month    82797 non-null  object
 17  time2_day      82797 non-null  object
 18  time2_hour     82797 non-null  

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,...,time9_year,time9_month,time9_day,time9_hour,time9_minute,time10_year,time10_month,time10_day,time10_hour,time10_minute
1,29,35,22,321,23,2211,6730,21,44582,15336,...,2014,10,4,11,20,2014,10,4,11,20
2,782,782,782,782,782,782,782,782,782,782,...,2014,7,3,11,1,2014,7,3,11,1
3,55,55,55,55,55,55,55,55,1445,1445,...,2014,12,5,15,56,2014,12,5,15,56
4,1023,1022,50,222,202,3374,50,48,48,3374,...,2014,11,4,10,3,2014,11,4,10,3
5,301,301,301,66,67,69,70,68,71,167,...,2014,5,16,15,5,2014,5,16,15,5


In [92]:
new_train.shape, new_test.shape

((253561, 60), (82797, 60))

In [93]:
predicted_test = full_pipeline_LR.predict_proba(new_test)[:, 1]
predicted_test.shape

(82797,)

In [96]:
write_submission_file(predicted_test, new_test, 'submission.csv')
check_submission = pd.read_csv('submission.csv')
check_submission

Unnamed: 0,session_id,target
0,1,4.030621e-08
1,2,2.603474e-13
2,3,1.500893e-06
3,4,2.856246e-11
4,5,5.662895e-03
...,...,...
82792,82793,7.434874e-04
82793,82794,1.231837e-10
82794,82795,9.159234e-08
82795,82796,3.548438e-08


## 2. Sites with TargetEncoder + Times (50 columns) with OneHotEncoder

In [207]:
time_columns = [col for col in train_df.columns if col.startswith('time')]
site_columns = [col for col in train_df.columns if col.startswith('site')]

new_train = train_df[site_columns].copy()

for time_clmn in time_columns:
    new_train[f'{time_clmn}_year'] = train_df[time_clmn].dt.year.fillna(0).astype(int)
    new_train[f'{time_clmn}_month'] = train_df[time_clmn].dt.month.fillna(0).astype(int)
    new_train[f'{time_clmn}_day'] = train_df[time_clmn].dt.day.fillna(0).astype(int)
    new_train[f'{time_clmn}_hour'] = train_df[time_clmn].dt.hour.fillna(0).astype(int)
    new_train[f'{time_clmn}_minute'] = train_df[time_clmn].dt.minute.fillna(0).astype(int)

new_train.info()
new_train.head()

<class 'pandas.core.frame.DataFrame'>
Index: 253561 entries, 21669 to 204762
Data columns (total 60 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   site1          253561 non-null  int64
 1   site2          253561 non-null  int64
 2   site3          253561 non-null  int64
 3   site4          253561 non-null  int64
 4   site5          253561 non-null  int64
 5   site6          253561 non-null  int64
 6   site7          253561 non-null  int64
 7   site8          253561 non-null  int64
 8   site9          253561 non-null  int64
 9   site10         253561 non-null  int64
 10  time1_year     253561 non-null  int64
 11  time1_month    253561 non-null  int64
 12  time1_day      253561 non-null  int64
 13  time1_hour     253561 non-null  int64
 14  time1_minute   253561 non-null  int64
 15  time2_year     253561 non-null  int64
 16  time2_month    253561 non-null  int64
 17  time2_day      253561 non-null  int64
 18  time2_hour     253561 non

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,...,time9_year,time9_month,time9_day,time9_hour,time9_minute,time10_year,time10_month,time10_day,time10_hour,time10_minute
21669,56,55,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
54843,56,55,56,55,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
77292,946,946,951,946,946,945,948,784,949,946,...,2013,1,12,8,50,2013,1,12,8,50
114021,945,948,949,948,945,946,947,945,946,946,...,2013,1,12,8,50,2013,1,12,8,50
146670,947,950,948,947,950,952,946,951,946,947,...,2013,1,12,8,50,2013,1,12,8,50


In [208]:
X = new_train
y = train_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=seed_value)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(202848, 60) (50713, 60) (202848,) (50713,)


In [209]:
time_columns = [col for col in new_train.columns if col.startswith('time')]
site_columns = [col for col in new_train.columns if col.startswith('site')]

times_pipe = make_pipeline(
    # SimpleImputer(strategy="constant", fill_value="0"),
    OneHotEncoder(sparse_output=False,handle_unknown='infrequent_if_exist',min_frequency=0.03)
)

sites_pipe = make_pipeline(
    # SimpleImputer(strategy="constant", fill_value="N_A"),
    TargetEncoder()
)

preprocessor = ColumnTransformer(transformers=[
    ('times_pipe', times_pipe, time_columns),
    ('sites_pipe', sites_pipe, site_columns),
])

targetEnc_LR = make_pipeline(preprocessor,
                                 # OneHotEncoder(sparse_output=False,handle_unknown='infrequent_if_exist',min_frequency=0.03),
                                 StandardScaler(),
                                 LogisticRegression(
                                     random_state=seed_value,
                                     max_iter=5000
                                 )
                                )

targetEnc_LR.fit(X_train, y_train)

### roc_auc_score = 0.98 ... f1 = 0.63

In [58]:
model = targetEnc_LR

train_rocauc = roc_auc_score(y_train, model.predict_proba(X_train)[:,1])
test_rocauc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
print(round(train_rocauc,4), round(test_rocauc,4))

0.9852 0.9709


In [59]:
train_f1 = f1_score(y_train, model.predict(X_train))
test_f1 = f1_score(y_test, model.predict(X_test))
print(round(train_f1,4), round(test_f1,4))

0.6336 0.4817


#### SUBMISSION 2: ROC-AUC = 0.83550

In [212]:
# test_df
test_df = pd.read_csv('test_sessions.csv')
test_df.set_index('session_id', inplace=True)

# replace NaN in site columns with 0
for column in test_df.columns:
    if column.startswith('site'):
        test_df[column] = test_df[column].fillna(0)

# change datatype to integers for sites and to datetime for timestamps
for column in test_df.columns:
    if column.startswith('site'):
        test_df[column] = test_df[column].astype(int)
    elif column.startswith('time'):
        test_df[column] = pd.to_datetime(test_df[column])

In [213]:
time_columns = [col for col in test_df.columns if col.startswith('time')]
site_columns = [col for col in test_df.columns if col.startswith('site')]

new_test = test_df[site_columns].copy()

for time_clmn in time_columns:
    new_test[f'{time_clmn}_year'] = test_df[time_clmn].dt.year.fillna(0).astype(int)
    new_test[f'{time_clmn}_month'] = test_df[time_clmn].dt.month.fillna(0).astype(int)
    new_test[f'{time_clmn}_day'] = test_df[time_clmn].dt.day.fillna(0).astype(int)
    new_test[f'{time_clmn}_hour'] = test_df[time_clmn].dt.hour.fillna(0).astype(int)
    new_test[f'{time_clmn}_minute'] = test_df[time_clmn].dt.minute.fillna(0).astype(int)

# for clmn in new_train.columns:
#     new_test[clmn] = new_test[clmn].astype(str)

new_test.info()
new_test.head()

<class 'pandas.core.frame.DataFrame'>
Index: 82797 entries, 1 to 82797
Data columns (total 60 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   site1          82797 non-null  int64
 1   site2          82797 non-null  int64
 2   site3          82797 non-null  int64
 3   site4          82797 non-null  int64
 4   site5          82797 non-null  int64
 5   site6          82797 non-null  int64
 6   site7          82797 non-null  int64
 7   site8          82797 non-null  int64
 8   site9          82797 non-null  int64
 9   site10         82797 non-null  int64
 10  time1_year     82797 non-null  int64
 11  time1_month    82797 non-null  int64
 12  time1_day      82797 non-null  int64
 13  time1_hour     82797 non-null  int64
 14  time1_minute   82797 non-null  int64
 15  time2_year     82797 non-null  int64
 16  time2_month    82797 non-null  int64
 17  time2_day      82797 non-null  int64
 18  time2_hour     82797 non-null  int64
 19  time2_min

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,...,time9_year,time9_month,time9_day,time9_hour,time9_minute,time10_year,time10_month,time10_day,time10_hour,time10_minute
1,29,35,22,321,23,2211,6730,21,44582,15336,...,2014,10,4,11,20,2014,10,4,11,20
2,782,782,782,782,782,782,782,782,782,782,...,2014,7,3,11,1,2014,7,3,11,1
3,55,55,55,55,55,55,55,55,1445,1445,...,2014,12,5,15,56,2014,12,5,15,56
4,1023,1022,50,222,202,3374,50,48,48,3374,...,2014,11,4,10,3,2014,11,4,10,3
5,301,301,301,66,67,69,70,68,71,167,...,2014,5,16,15,5,2014,5,16,15,5


In [214]:
new_train.shape, new_test.shape

((253561, 60), (82797, 60))

In [215]:
predicted_test = targetEnc_LR.predict_proba(new_test)[:, 1]
predicted_test.shape

(82797,)

In [216]:
write_submission_file(predicted_test, new_test, 'submission.csv')
check_submission = pd.read_csv('submission.csv')
check_submission

Unnamed: 0,session_id,target
0,1,9.002416e-09
1,2,7.731760e-09
2,3,8.684670e-07
3,4,7.031104e-12
4,5,1.009593e-03
...,...,...
82792,82793,1.165102e-03
82793,82794,4.550354e-13
82794,82795,2.167851e-08
82795,82796,9.469215e-09


## 3. Sites with TargetEncoder + min/max time features (7 columns) with OHE

In [18]:
time_columns = [col for col in train_df.columns if col.startswith('time')]
site_columns = [col for col in train_df.columns if col.startswith('site')]

new_train = train_df[site_columns].copy()

new_train['year'] = train_df['time1'].dt.year.apply(lambda x: 0 if x == 2013 else 1)

time_train_df = pd.DataFrame(index= train_df.index,
                       data = {'min': train_df[time_columns].min(axis=1, skipna=True),
                               'max': train_df[time_columns].max(axis=1, skipna=True)
                              },
                      )

# for time_clmn in time_columns:
for time_clmn in ['min', 'max']:
    # new_train[f'{time_clmn}_year'] = train_df[time_clmn].dt.year.fillna(0).astype(int)
    new_train[f'{time_clmn}_month'] = time_train_df[time_clmn].dt.month.fillna(0).astype(int)
    new_train[f'{time_clmn}_day'] = time_train_df[time_clmn].dt.dayofweek.fillna(0).astype(int)
    new_train[f'{time_clmn}_hour'] = time_train_df[time_clmn].dt.hour.fillna(0).astype(int)
    # new_train[f'{time_clmn}_minute'] = time_train_df[time_clmn].dt.minute.fillna(0).astype(int)


new_train.info()
new_train.head()

<class 'pandas.core.frame.DataFrame'>
Index: 253561 entries, 21669 to 204762
Data columns (total 17 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   site1      253561 non-null  int64
 1   site2      253561 non-null  int64
 2   site3      253561 non-null  int64
 3   site4      253561 non-null  int64
 4   site5      253561 non-null  int64
 5   site6      253561 non-null  int64
 6   site7      253561 non-null  int64
 7   site8      253561 non-null  int64
 8   site9      253561 non-null  int64
 9   site10     253561 non-null  int64
 10  year       253561 non-null  int64
 11  min_month  253561 non-null  int64
 12  min_day    253561 non-null  int64
 13  min_hour   253561 non-null  int64
 14  max_month  253561 non-null  int64
 15  max_day    253561 non-null  int64
 16  max_hour   253561 non-null  int64
dtypes: int64(17)
memory usage: 34.8 MB


Unnamed: 0_level_0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,year,min_month,min_day,min_hour,max_month,max_day,max_hour
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
21669,56,55,0,0,0,0,0,0,0,0,0,1,5,8,1,5,8
54843,56,55,56,55,0,0,0,0,0,0,0,1,5,8,1,5,9
77292,946,946,951,946,946,945,948,784,949,946,0,1,5,8,1,5,8
114021,945,948,949,948,945,946,947,945,946,946,0,1,5,8,1,5,8
146670,947,950,948,947,950,952,946,951,946,947,0,1,5,8,1,5,8
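
The same feature construction is repeated for the test set later in this section. A minimal sketch of how both frames could go through one function, assuming a hypothetical helper `build_session_features` (not part of the original notebook) and a two-row toy frame standing in for `train_df` / `test_df`:

```python
import pandas as pd


def build_session_features(df: pd.DataFrame) -> pd.DataFrame:
    """Site ids plus a year flag and month / weekday / hour of the first and last visit."""
    site_cols = [c for c in df.columns if c.startswith('site')]
    time_cols = [c for c in df.columns if c.startswith('time')]

    features = df[site_cols].copy()
    # 0 for sessions that start in 2013, 1 otherwise (same rule as the cell above)
    features['year'] = (df['time1'].dt.year != 2013).astype(int)

    # row-wise earliest and latest timestamp; NaT entries are skipped
    bounds = {
        'min': df[time_cols].min(axis=1, skipna=True),
        'max': df[time_cols].max(axis=1, skipna=True),
    }
    for name, stamps in bounds.items():
        features[f'{name}_month'] = stamps.dt.month.fillna(0).astype(int)
        features[f'{name}_day'] = stamps.dt.dayofweek.fillna(0).astype(int)
        features[f'{name}_hour'] = stamps.dt.hour.fillna(0).astype(int)
    return features


# toy frame (values are illustrative assumptions, not rows from the dataset)
toy = pd.DataFrame(
    {
        'site1': [56, 946],
        'time1': pd.to_datetime(['2013-01-12 08:05:57', '2014-02-20 10:02:45']),
        'site2': [55, 0],
        'time2': pd.to_datetime(['2013-01-12 09:07:09', None]),
    },
    index=pd.Index([21669, 1], name='session_id'),
)
toy_features = build_session_features(toy)
print(toy_features)
```

Because the year flag is read from the frame passed in, the helper cannot accidentally mix train timestamps into test features.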


In [19]:
X = new_train
y = train_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=seed_value)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(202848, 17) (50713, 17) (202848,) (50713,)
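
The notebook sorts `train_df` by `time1` and imports `TimeSeriesSplit`, while the hold-out above is a random split, so validation rows can precede training rows in time. A sketch of a time-ordered alternative on toy data (all `toy_` names are assumptions for illustration; this is not what produced the scores recorded in this notebook):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 toy rows that are assumed to be in chronological order already
toy_X = np.arange(20).reshape(10, 2)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(toy_X))

for train_idx, valid_idx in folds:
    # every validation row comes strictly after every training row
    print('train up to row', train_idx.max(), '| validate rows', valid_idx.tolist())
```

The same splitter object can be passed as `cv=` to `cross_val_score` or `GridSearchCV` when the rows are sorted by session start time.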


In [112]:
# site columns first, then every remaining column (year plus min/max month, day, hour) is a time feature
site_columns = [col for col in new_train.columns if col.startswith('site')]
time_columns = new_train.columns.drop(site_columns)

times_pipe = make_pipeline(
    # SimpleImputer(strategy="constant", fill_value="0"),
    OneHotEncoder(sparse_output=False,handle_unknown='infrequent_if_exist',min_frequency=0.03)
    # 'passthrough'
)

sites_pipe = make_pipeline(
    # SimpleImputer(strategy="constant", fill_value="N_A"),
    TargetEncoder()
)

preprocessor = ColumnTransformer(transformers=[
    ('times_pipe', times_pipe, time_columns),
    ('sites_pipe', sites_pipe, site_columns),
])

targetEnc_LR = make_pipeline(preprocessor,
                                 # OneHotEncoder(sparse_output=False,handle_unknown='infrequent_if_exist',min_frequency=0.03),
                                 StandardScaler(),
                                 LogisticRegression(
                                     random_state=seed_value,
                                     max_iter=5000
                                 )
                                )

targetEnc_LR.fit(X_train, y_train)

In [113]:
print_scores(model = targetEnc_LR)

ROC-AUC 	 F1 		 LogLoss 	 precision 	 recall
0.9841 0.9702 	 0.6541 0.4897 	 0.018 0.0232 	 0.9269 0.8191 	 0.5054 0.3492


### roc_auc_score = 0.9839

In [96]:
print_scores(model = targetEnc_LR)

ROC-AUC 	 F1 		 LogLoss 	 precision 	 recall
0.9839 0.9699 	 0.6451 0.496 	 0.0181 0.0232 	 0.9273 0.8424 	 0.4946 0.3515


#### SUBMISSION 3: ROC-AUC = 0.93445

In [97]:
# test_df
test_df = pd.read_csv('test_sessions.csv')
test_df.set_index('session_id', inplace=True)

# replace NaN in site columns with 0 and cast site ids to integers;
# parse timestamp columns as datetime
for column in test_df.columns:
    if column.startswith('site'):
        test_df[column] = test_df[column].fillna(0).astype(int)
    elif column.startswith('time'):
        test_df[column] = pd.to_datetime(test_df[column])

# collect the time columns only after test_df has been loaded
time_columns = [col for col in test_df.columns if col.startswith('time')]

In [98]:
time_columns = [col for col in test_df.columns if col.startswith('time')]
site_columns = [col for col in test_df.columns if col.startswith('site')]

new_test = test_df[site_columns].copy()

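# NOTE: the year flag must come from test_df; train_df['time1'] aligns on train session_ids
# and yields wrong flags (the recorded outputs and Submission 3 below were produced with train_df)
new_test['year'] = test_df['time1'].dt.year.apply(lambda x: 0 if x == 2013 else 1)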

time_test_df = pd.DataFrame(index= test_df.index,
                       data = {'min': test_df[time_columns].min(axis=1, skipna=True),
                               'max': test_df[time_columns].max(axis=1, skipna=True)
                              },
                      )

# for time_clmn in time_columns:
for time_clmn in ['min', 'max']:
    # new_test[f'{time_clmn}_year'] = time_test_df[time_clmn].dt.year.fillna(0).astype(int)
    new_test[f'{time_clmn}_month'] = time_test_df[time_clmn].dt.month.fillna(0).astype(int)
    new_test[f'{time_clmn}_day'] = time_test_df[time_clmn].dt.dayofweek.fillna(0).astype(int)
    new_test[f'{time_clmn}_hour'] = time_test_df[time_clmn].dt.hour.fillna(0).astype(int)
    # new_test[f'{time_clmn}_minute'] = time_test_df[time_clmn].dt.minute.fillna(0).astype(int)

new_test.info()
new_test.head()

<class 'pandas.core.frame.DataFrame'>
Index: 82797 entries, 1 to 82797
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   site1      82797 non-null  int64
 1   site2      82797 non-null  int64
 2   site3      82797 non-null  int64
 3   site4      82797 non-null  int64
 4   site5      82797 non-null  int64
 5   site6      82797 non-null  int64
 6   site7      82797 non-null  int64
 7   site8      82797 non-null  int64
 8   site9      82797 non-null  int64
 9   site10     82797 non-null  int64
 10  year       82797 non-null  int64
 11  min_month  82797 non-null  int64
 12  min_day    82797 non-null  int64
 13  min_hour   82797 non-null  int64
 14  max_month  82797 non-null  int64
 15  max_day    82797 non-null  int64
 16  max_hour   82797 non-null  int64
dtypes: int64(17)
memory usage: 11.4 MB


Unnamed: 0_level_0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,year,min_month,min_day,min_hour,max_month,max_day,max_hour
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
1,29,35,22,321,23,2211,6730,21,44582,15336,1,10,5,11,10,5,11
2,782,782,782,782,782,782,782,782,782,782,1,7,3,11,7,3,11
3,55,55,55,55,55,55,55,55,1445,1445,0,12,4,15,12,4,15
4,1023,1022,50,222,202,3374,50,48,48,3374,1,11,1,10,11,1,10
5,301,301,301,66,67,69,70,68,71,167,1,5,4,15,5,4,15


In [99]:
new_train.shape, new_test.shape

((253561, 17), (82797, 17))

In [100]:
predicted_test = targetEnc_LR.predict_proba(new_test)[:, 1]
predicted_test.shape

(82797,)

In [101]:
write_submission_file(predicted_test, new_test, 'submission.csv')
check_submission = pd.read_csv('submission.csv')
check_submission

Unnamed: 0,session_id,target
0,1,4.768772e-06
1,2,4.850395e-06
2,3,3.130267e-04
3,4,3.903076e-06
4,5,2.437795e-03
...,...,...
82792,82793,1.607518e-02
82793,82794,3.729668e-05
82794,82795,1.678996e-05
82795,82796,4.396381e-06


## 4. all_websites_sparse + year (1 column) + min month/day/hour (3 columns) + seconds, weekend, morning

In [54]:
time_columns = [col for col in train_df.columns if col.startswith('time')]
idx_split = train_df.shape[0]

train_test_df = pd.concat([train_df.drop('target', axis=1), test_df])
time_features = pd.DataFrame(index=train_test_df.index)


time_features['year'] = train_test_df['time1'].dt.year.apply(lambda x: 0 if x == 2013 else 1)

time_train_test_df = pd.DataFrame(index= train_test_df.index,
                       data = {'min': train_test_df[time_columns].min(axis=1, skipna=True),
                               'max': train_test_df[time_columns].max(axis=1, skipna=True)
                              },
                      )

for time_clmn in ['min']:
    time_features[f'{time_clmn}_month'] = time_train_test_df[time_clmn].dt.month.fillna(0).astype(int)
    time_features[f'{time_clmn}_day'] = time_train_test_df[time_clmn].dt.dayofweek.fillna(0).astype(int)
    time_features[f'{time_clmn}_hour'] = time_train_test_df[time_clmn].dt.hour.fillna(0).astype(int)

time_features['seconds'] = ((time_train_test_df['max'] - time_train_test_df['min']) / np.timedelta64(1, 's')).astype(int)

time_features['weekend'] = np.where((time_features.min_day.values == 5) | (time_features.min_day.values == 6), 1, 0)
time_features['morning'] = np.where((time_features.min_hour.values >= 6) & (time_features.min_hour.values < 12), 1, 0)

time_features.info()
time_features.head()

<class 'pandas.core.frame.DataFrame'>
Index: 336358 entries, 21669 to 82797
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   year       336358 non-null  int64
 1   min_month  336358 non-null  int64
 2   min_day    336358 non-null  int64
 3   min_hour   336358 non-null  int64
 4   seconds    336358 non-null  int64
 5   weekend    336358 non-null  int64
 6   morning    336358 non-null  int64
dtypes: int64(7)
memory usage: 20.5 MB


Unnamed: 0_level_0,year,min_month,min_day,min_hour,seconds,weekend,morning
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
21669,0,1,5,8,0,1,1
54843,0,1,5,8,1786,1,1
77292,0,1,5,8,4,1,1
114021,0,1,5,8,3,1,1
146670,0,1,5,8,2,1,1


In [55]:
ohe_encoded_features = []
for clmn in time_features.columns:
# for clmn in ['time1_month', 'time1_day']:
    # ohe_encoded_feature = OneHotEncoder().fit_transform(time_features[[clmn]].values[:idx_split, :])
    ohe_encoded_feature = OneHotEncoder().fit_transform(time_features[[clmn]].values)
    ohe_encoded_features.append(ohe_encoded_feature)
print(len(ohe_encoded_features))

# combine the sparse site matrix and the one-hot encoded time features into a list
# all_features = [all_websites_sparse[:idx_split, :]] + ohe_encoded_features + [seconds_feature]
all_features = [all_websites_sparse] + ohe_encoded_features

# Stack all features horizontally into a CSR matrix
all_features_sparse = csr_matrix(hstack(all_features))

all_features_sparse.shape

7


(336358, 50214)
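
The cell above fits each `OneHotEncoder` on train and test rows together (the commented-out line fits on the first `idx_split` rows only). A minimal sketch of the train-only variant on toy data, where `handle_unknown='ignore'` encodes categories unseen in training as all zeros; the `toy_` names and values are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

# toy stand-ins for time_features and all_websites_sparse: 3 train rows + 1 test row
toy_time = pd.DataFrame({'min_hour': [8, 9, 8, 15], 'weekend': [1, 1, 0, 0]})
toy_sites = csr_matrix(np.array([[1, 0, 2],
                                 [0, 1, 0],
                                 [3, 0, 0],
                                 [0, 0, 1]]))
toy_split = 3

# learn the categories from the train rows only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(toy_time.iloc[:toy_split])

# transform train and test rows with the same fitted encoder
toy_time_ohe = encoder.transform(toy_time)

# stack site counts and one-hot time features into one CSR matrix
toy_all = hstack([toy_sites, toy_time_ohe]).tocsr()
print(toy_all.shape)
```

The test row has `min_hour == 15`, which never occurs in the toy train rows, so its hour block is all zeros and only its `weekend` column is set.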

In [56]:
X = all_features_sparse[:idx_split]
y = train_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=seed_value)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(202848, 50214) (50713, 50214) (202848,) (50713,)


In [71]:
# X = all_websites_sparse[:idx_split]
# y = train_df['target']

# X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=seed_value)
# print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(202848, 48371) (50713, 48371) (202848,) (50713,)


In [116]:
full_pipeline_LR = make_pipeline(#StandardScaler(with_mean=False),
                                 LogisticRegression(
                                     # solver='saga',
                                     C = 100,
                                     random_state=seed_value,
                                     max_iter=5000
                                 )
                                )

full_pipeline_LR.fit(X_train, y_train)

# param_grid = {
#     # 'logisticregression__penalty': ['l1', 'l2', None],
#     # 'logisticregression__solver': ['liblinear', 'newton-cg', 'newton-cholesky', 'lbfgs', 'sag', 'saga'],
#     # 'logisticregression__solver': ['lbfgs', 'sag', 'saga'],
#     # 'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100],
#     # 'logisticregression__C': [80, 100, 120],
#     # 'logisticregression__C': [120, 150, 200]
#     # 'logisticregression__C': [110, 120, 130]
#     # 'logisticregression__C': [130, 140]
#     # 'logisticregression__C': range(125, 135, 2)
#     'logisticregression__C': range(125, 130)
# }

# search_LR = GridSearchCV(full_pipeline_LR,
#                          param_grid,
#                          cv=5,
#                          n_jobs=2,
#                          verbose=1,
#                          scoring='roc_auc'
#                         )

# search_LR.fit(X_train, y_train)

In [117]:
print_scores(model = full_pipeline_LR)

ROC-AUC 	 F1 		 LogLoss 	 precision 	 recall
0.9968 0.9888 	 0.8075 0.658 	 0.0102 0.0177 	 0.8819 0.7754 	 0.7446 0.5714


In [48]:
print_scores(model = full_pipeline_LR)

ROC-AUC 	 F1 		 LogLoss 	 precision 	 recall
0.9887 0.9799 	 0.4681 0.3993 	 0.019 0.0218 	 0.8299 0.7333 	 0.326 0.2744


### roc_auc_score = 0.9964, C=127

In [32]:
print_scores(model = search_LR)

ROC-AUC 	 F1 		 LogLoss 	 precision 	 recall
0.9964 0.9881 	 0.7951 0.6564 	 0.0107 0.0181 	 0.8675 0.7589 	 0.7338 0.5782


In [33]:
model = search_LR
print(model.best_score_)
model.best_params_

0.9901868977894732


{'logisticregression__C': 127}

#### SUBMISSION 4: ROC-AUC = 0.94914

In [34]:
alice_test_sub = all_features_sparse[idx_split:]
alice_test_sub.shape, test_df.shape

((82797, 50214), (82797, 20))

In [35]:
predicted_test = search_LR.predict_proba(alice_test_sub)[:, 1]
predicted_test.shape
# write_submission_file(predicted_test, test_df, 'submission.csv')

(82797,)

In [36]:
write_submission_file(predicted_test, test_df, 'submission.csv')
check_submission = pd.read_csv('submission.csv')
check_submission

Unnamed: 0,session_id,target
0,1,1.705955e-11
1,2,1.275440e-18
2,3,3.202580e-23
3,4,3.492180e-21
4,5,1.225351e-10
...,...,...
82792,82793,5.783817e-12
82793,82794,1.968681e-15
82794,82795,1.142692e-09
82795,82796,4.385041e-12
