<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose. This material is a translated version of the Capstone project (by the same author) from specialization "Machine learning and data analysis" by Yandex and MIPT. No solutions shared.

# <center> Week 6. Vowpal Wabbit
This week, we explore the popular library called Vowpal Wabbit and apply it to site visits data.

Week 6 roadmap:
- Part 1. Overview of Vowpal Wabbit
- Part 2. Applying Vowpal Wabbit to site visits data
    - 2.1 Data preprocessing
    - 2.2 Holdout validation
    - 2.3 Test set validation (on public leaderboard)
    
Resources: Vowpal Wabbit's [documentation](https://github.com/JohnLangford/vowpal_wabbit/wiki)

## The task 
1. Fill in code in this notebook
2. Choose answers in the [webform](https://docs.google.com/forms/d/1VWfSupfYXvb6gyROR0enXYVMjuxqgRaTScYhEz4f6YQ)

## Part 1. Overview of Vowpal Wabbit

Read the [article](https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-8-vowpal-wabbit-fast-learning-with-gigabytes-of-data-60f750086237) on Vowpal Wabbit from the OpenDataScience machine learning course. Download the [notebook](https://mlcourse.ai/notebooks/blob/master/jupyter_english/topic08_sgd_hashing_vowpal_wabbit/topic8_sgd_hashing_vowpal_wabbit.ipynb?flush_cache=true), play with the code a bit - that is the most effective way to get started.  

## Part 2. Applying Vowpal Wabbit to site visits data

## 2.1 Data preprocessing

Now we will see Vowpal Wabbit in action. Had we used it for a binary classification task, we would not have noticed much difference in either accuracy or speed. Instead, we will do 400-class classification. The source data is the same, but now we have 400 users and our goal is to identify each one of them. 
- Download the data from [here](https://www.kaggle.com/c/identify-me-if-you-can4/data) - files **train_sessions_400users.csv** and **test_sessions_400users.csv**

In [1]:
import os
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

In [2]:
# Change to your data path
PATH_TO_DATA = '../../data/'

Read the train and test data. You may notice that sessions in the test subset span a different time period than the train sessions. 

In [3]:
train_df_400 = pd.read_csv(os.path.join(PATH_TO_DATA,'train_sessions_400users.csv'), 
                           index_col='session_id')

In [4]:
test_df_400 = pd.read_csv(os.path.join(PATH_TO_DATA,'test_sessions_400users.csv'), 
                           index_col='session_id')

In [5]:
train_df_400.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,user_id
1,23713,2014-03-24 15:22:40,23720.0,2014-03-24 15:22:48,23713.0,2014-03-24 15:22:48,23713.0,2014-03-24 15:22:54,23720.0,2014-03-24 15:22:54,...,2014-03-24 15:22:55,23713.0,2014-03-24 15:23:01,23713.0,2014-03-24 15:23:03,23713.0,2014-03-24 15:23:04,23713.0,2014-03-24 15:23:05,653
2,8726,2014-04-17 14:25:58,8725.0,2014-04-17 14:25:59,665.0,2014-04-17 14:25:59,8727.0,2014-04-17 14:25:59,45.0,2014-04-17 14:25:59,...,2014-04-17 14:26:01,45.0,2014-04-17 14:26:01,5320.0,2014-04-17 14:26:18,5320.0,2014-04-17 14:26:47,5320.0,2014-04-17 14:26:48,198
3,303,2014-03-21 10:12:24,19.0,2014-03-21 10:12:36,303.0,2014-03-21 10:12:54,303.0,2014-03-21 10:13:01,303.0,2014-03-21 10:13:24,...,2014-03-21 10:13:36,303.0,2014-03-21 10:13:54,309.0,2014-03-21 10:14:01,303.0,2014-03-21 10:14:06,303.0,2014-03-21 10:14:24,34
4,1359,2013-12-13 09:52:28,925.0,2013-12-13 09:54:34,1240.0,2013-12-13 09:54:34,1360.0,2013-12-13 09:54:34,1344.0,2013-12-13 09:54:34,...,2013-12-13 09:54:34,1346.0,2013-12-13 09:54:34,1345.0,2013-12-13 09:54:34,1344.0,2013-12-13 09:58:19,1345.0,2013-12-13 09:58:19,601
5,11,2013-11-26 12:35:29,85.0,2013-11-26 12:35:31,52.0,2013-11-26 12:35:31,85.0,2013-11-26 12:35:32,11.0,2013-11-26 12:35:32,...,2013-11-26 12:35:32,11.0,2013-11-26 12:37:03,85.0,2013-11-26 12:37:03,10.0,2013-11-26 12:37:03,85.0,2013-11-26 12:37:04,273


**There are 182793 sessions in train data, 46473 in test data and 400 unique users.**

In [6]:
train_df_400.shape, test_df_400.shape, train_df_400['user_id'].nunique()

((182793, 21), (46473, 20), 400)

Vowpal Wabbit requires class labels to be encoded from 1 to K, where K is the total number of classes in the classification task (400 in our case). So we apply *LabelEncoder* and add 1 to its result (*LabelEncoder* maps all labels to the range 0 to K-1). We will also have to perform the inverse transformation later.

In [7]:
y = train_df_400['user_id'].values
class_encoder = LabelEncoder()
y_for_vw = class_encoder.fit_transform(y) + 1
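
To make the +1 shift and the later inverse transformation concrete, here is a minimal sketch on made-up user ids. The values in `toy_user_ids` and all `toy_*` names are illustrative assumptions, not part of the dataset or the assignment.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical user ids, chosen only for illustration
toy_user_ids = np.array([653, 198, 34, 653, 273])

toy_encoder = LabelEncoder()
# LabelEncoder yields codes 0..K-1; VW's --oaa expects 1..K, hence the +1
toy_labels_for_vw = toy_encoder.fit_transform(toy_user_ids) + 1

# Going back: subtract 1, then map the codes to the original ids
toy_restored = toy_encoder.inverse_transform(toy_labels_for_vw - 1)

print(toy_labels_for_vw)
print(toy_restored)
```

The same subtract-then-invert step is what turns VW predictions back into `user_id` values.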

Next we will compare VW with SGDClassifier and logistic regression. All these models require processed data. Prepare sparse matrices for the sklearn models (just like we did in the previous part):
- concatenate train and test data
- choose only websites (features 'site1' through 'site10')
- impute missing values with 0 (we started enumerating sites from 0)
- transform data to *csr_matrix* 
- split back to *train* and *test*

In [8]:
sites = ['site' + str(i) for i in range(1, 11)]

In [9]:
train_test_df = pd.concat([train_df_400.iloc[:, :-1], test_df_400])
train_test_df_sites = train_test_df[sites].fillna(0).astype('int')

In [10]:
flattened = train_test_df_sites.values.flatten()
train_test_sparse = csr_matrix(([1] * flattened.shape[0],
                                flattened,
                                range(0, flattened.shape[0] + 10, 10)))[:, 1:]
X_train_sparse = train_test_sparse[:train_df_400.shape[0],:]
X_test_sparse = train_test_sparse[train_df_400.shape[0]:,:]
y = train_df_400['user_id'].values
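
The `csr_matrix((data, indices, indptr))` call above can be hard to read at first, so here is a small sketch of the same construction on a toy array. The array `toy_sessions` (two sessions of three sites each) is an assumption made for illustration; 0 stands for a missing site, as in the imputation step.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two hypothetical sessions with three site slots each; 0 marks a missing site
toy_sessions = np.array([[1, 2, 1],
                         [3, 0, 0]])
session_length = toy_sessions.shape[1]

flattened = toy_sessions.flatten()
# data: a 1 for every visit; indices: the site id acts as the column index;
# indptr: row boundaries, one new row every `session_length` entries
toy_sparse = csr_matrix((np.ones(flattened.shape[0], dtype=int),
                         flattened,
                         np.arange(0, flattened.shape[0] + session_length,
                                   session_length)))[:, 1:]

# Repeated column indices within a row are summed, giving visit counts
toy_dense = toy_sparse.toarray()
print(toy_dense)
```

Dropping column 0 with `[:, 1:]` discards the placeholder for missing sites, so each remaining column counts visits to one real site.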

## 2.2 Holdout validation

Split the data into training (70%) and validation (30%) subsets. We do not shuffle the data, taking into account that sessions are sorted by time.

In [11]:
train_share = int(.7 * train_df_400.shape[0])
train_df_part = train_df_400[sites].iloc[:train_share, :]
valid_df = train_df_400[sites].iloc[train_share:, :]
X_train_part_sparse = X_train_sparse[:train_share, :]
X_valid_sparse = X_train_sparse[train_share:, :]

In [12]:
y_train_part = y[:train_share]
y_valid = y[train_share:]
y_train_part_for_vw = y_for_vw[:train_share]
y_valid_for_vw = y_for_vw[train_share:]
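
As a quick check of the split logic, the sketch below computes the cut point for the 182793 train sessions and confirms that every holdout row comes after every training row. The helper name `time_ordered_split_point` is hypothetical.

```python
def time_ordered_split_point(n_rows, train_fraction=0.7):
    """Index at which a time-sorted dataset is cut into train and holdout parts."""
    return int(train_fraction * n_rows)

n_rows = 182793  # number of train sessions reported above
split_at = time_ordered_split_point(n_rows)

# No shuffling: the first rows go to training, the later ones to the holdout set
train_rows = range(0, split_at)
holdout_rows = range(split_at, n_rows)

print(split_at, len(train_rows), len(holdout_rows))
```

Keeping the order intact means the model is validated on sessions recorded later than the ones it was trained on, which is closer to how the test set is built.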

Implement the function **arrays_to_vw**, which transforms data to the Vowpal Wabbit format. 

Input: 
- X - numpy matrix (training data)
- y (optional) - target numpy vector. It is optional since we will apply the same function to test data.
- train (flag) - True, if we are passing training data as X, False otherwise
- out_file - path to .vw file, in which we'll write results

Details:
- iterate over every row of X and write all of its values to the file, separated by whitespace. Put the target value at the start of each row, separated from the features by |.
- when applying the function to test data, you can write any target value (1, for example)

In [13]:
train_df_part.values.shape

(127955, 10)

In [14]:
train_df_part.values[0,:]

array([23713., 23720., 23713., 23713., 23720., 23713., 23713., 23713.,
       23713., 23713.])

In [50]:
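def to_vw_format(x, y):
    # One VW example: "<label> | <site> <site> ..."; a missing label falls back to 1
    return str(y or '1') + ' | ' + ' '.join(x.astype(int).astype(str)) + '\n'

def arrays_to_vw(X, y=None, train=True, out_file='tmp.vw'):
    # `train` is kept for the interface described above; passing y=None is what
    # actually selects the dummy target used for test data
    with open(os.path.join(PATH_TO_DATA, out_file), 'w') as vw_file:
        for i in range(X.shape[0]):
            target = 1 if y is None else y[i]
            vw_file.write(to_vw_format(X[i, :], target))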
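
One detail worth keeping in mind: sessions with fewer than ten sites carry NaN in the trailing site columns, and casting NaN to `int` is not well defined. The sketch below is an alternative row formatter that simply skips missing sites. It is not the formatter that produced the outputs shown in this notebook, and the name `row_to_vw_line` is hypothetical.

```python
import numpy as np

def row_to_vw_line(sites_row, label=1):
    """Format one session as '<label> | site site ...', skipping NaN entries."""
    visited = sites_row[~np.isnan(sites_row)].astype(int)
    return f"{label} | " + " ".join(map(str, visited)) + "\n"

# A made-up short session: two visited sites followed by a missing slot
line = row_to_vw_line(np.array([23713., 23720., np.nan]), label=262)
print(line, end="")
```

With this variant, short sessions yield fewer features per example, so feature counts in the VW log may differ from the ones shown below.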

Apply the function to the subset of training data (train_df_part, y_train_part_for_vw), to the holdout set (valid_df, y_valid_for_vw), to the whole training data and to the whole test data. **Notice that our method takes only numpy arrays as inputs.**

In [51]:
%%time
# should be 4 calls
arrays_to_vw(train_df_part.values, y_train_part_for_vw, out_file='train_part.vw')
arrays_to_vw(valid_df.values, y_valid_for_vw, out_file='valid.vw')
arrays_to_vw(train_df_400[sites].values, y_for_vw, out_file='train.vw')
arrays_to_vw(test_df_400[sites].values, None, out_file='test.vw')

Wall time: 6.38 s


In [52]:
# Won't work on Windows
!head -3 $PATH_TO_DATA/train_part.vw

262 | 23713 23720 23713 23713 23720 23713 23713 23713 23713 23713
82 | 8726 8725 665 8727 45 8725 45 5320 5320 5320
16 | 303 19 303 303 303 303 303 309 303 303


In [53]:
# Won't work on Windows
!head -3  $PATH_TO_DATA/valid.vw

4 | 7 923 923 923 11 924 7 924 838 7
160 | 91 198 11 11 302 91 668 311 310 91
312 | 27085 848 118 118 118 118 11 118 118 118


In [54]:
# Won't work on Windows
!head -3 $PATH_TO_DATA/test.vw

1 | 9 304 308 307 91 308 312 300 305 309
1 | 838 504 68 11 838 11 838 886 27 305
1 | 190 192 8 189 191 189 190 2375 192 8


Train Vowpal Wabbit on **train_part.vw**. Specify a classification task with 400 classes (**--oaa**) and make 3 passes over the dataset (**--passes**). You can also specify a cache file (**--cache_file** or the flag **-c**) so that VW performs all passes after the first one faster (you can delete a previous cache file with the argument **-k**). Also specify **-b 26**: this is the number of bits used for hashing, and in this case we need more than the default 18 bits. Finally, specify **--random_seed 17**. Do not change other parameters.
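
For readers who prefer to assemble the command in Python rather than typing it after `!`, here is a minimal sketch that only builds the argument list from the flags named above. The helper `build_vw_train_command` and its defaults are assumptions for illustration; it also adds `-k`, which the command in the next cells does not use.

```python
def build_vw_train_command(data_path, model_path, n_classes=400,
                           passes=3, bits=26, seed=17, vw_binary='vw'):
    """Return the VW training command as a list of arguments."""
    return [vw_binary,
            '-d', data_path,             # input file in VW format
            '--oaa', str(n_classes),     # one-against-all multiclass
            '--passes', str(passes),     # number of passes over the data
            '-b', str(bits),             # bits of the hashing space
            '--random_seed', str(seed),
            '-c', '-k',                  # use a cache file, discarding an old one
            '-f', model_path]            # where to save the trained model

cmd = build_vw_train_command('train_part.vw', 'train_part_mdl.vw')
print(' '.join(cmd))
```

If `vw` is available on your PATH, the list can be passed to `subprocess.run(cmd, check=True)`; on Windows you would set `vw_binary` to the full path of the executable.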

In [41]:
train_part_vw = os.path.join(PATH_TO_DATA, 'train_part.vw')
valid_vw = os.path.join(PATH_TO_DATA, 'valid.vw')
train_vw = os.path.join(PATH_TO_DATA, 'train.vw')
test_vw = os.path.join(PATH_TO_DATA, 'test.vw')
model = os.path.join(PATH_TO_DATA, 'vw_model.vw')
pred = os.path.join(PATH_TO_DATA, 'vw_pred.csv')

In [56]:
%%time
!"C:\Program Files\VowpalWabbit\vw" -d $train_part_vw --oaa 400 --passes 3 -b 26 --random_seed 17 -c -f $PATH_TO_DATA/train_part_mdl.vw

Wall time: 43.3 s


final_regressor = ../../data//train_part_mdl.vw
Num weight bits = 26
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = ../../data/train_part.vw.cache
Reading datafile = ../../data/train_part.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0      262        1       11
1.000000 1.000000            2            2.0       82      262       11
1.000000 1.000000            4            4.0      241      262       11
1.000000 1.000000            8            8.0      352      262       11
1.000000 1.000000           16           16.0      135       16       11
1.000000 1.000000           32           32.0       71      112       11
0.968750 0.937500           64           64.0      358      231       11
0.976563 0.984375          128          128.0      348      346       11
0.941406 0.90625

Write predictions for **valid.vw** to **vw_valid_pred.csv**

In [59]:
%%time
!"C:\Program Files\VowpalWabbit\vw" -i $PATH_TO_DATA/train_part_mdl.vw -t -d $valid_vw -p $PATH_TO_DATA/vw_valid_pred.csv

Wall time: 1.15 s


only testing
predictions = ../../data//vw_valid_pred.csv
Num weight bits = 26
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = ../../data/valid.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0        4      188       11
1.000000 1.000000            2            2.0      160      220       11
0.750000 0.500000            4            4.0      143      143       11
0.750000 0.750000            8            8.0      247      247       11
0.687500 0.625000           16           16.0      341       30       11
0.593750 0.500000           32           32.0      237      237       11
0.609375 0.625000           64           64.0      178      178       11
0.640625 0.671875          128          128.0      132      228       11
0.656250 0.671875          256          256.0       14       14       11
0

Read the predictions from *vw_valid_pred.csv* and compute the fraction of correct answers on the holdout set. 

In [60]:
with open(os.path.join(PATH_TO_DATA, 'vw_valid_pred.csv')) as pred_file:
    test_prediction_mult = [float(label) for label in pred_file.readlines()]
    
accuracy_score(y_valid_for_vw, test_prediction_mult)
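
The accuracy here is just the share of lines in the prediction file whose label equals the true encoded label. A minimal sketch with made-up file contents follows; it assumes each line starts with the predicted class, and the helper names are hypothetical.

```python
import io

def read_vw_predictions(lines):
    """Parse predicted class labels, one per non-empty line."""
    return [int(float(line.split()[0])) for line in lines if line.strip()]

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where prediction and truth agree."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Made-up stand-in for a prediction file; not real VW output
fake_pred_file = io.StringIO("4\n220\n143\n247\n")
toy_predictions = read_vw_predictions(fake_pred_file)
toy_truth = [4, 160, 143, 247]

toy_accuracy = accuracy(toy_truth, toy_predictions)
print(toy_accuracy)
```

Note that both sides of the comparison must use the same encoding: VW predictions are in the 1 to K range, so they are compared against `y_valid_for_vw`, not against the raw `user_id` values.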

0.34541741128414605

Now train *SGDClassifier* (3 passes, logistic loss function) and *LogisticRegression* on 70% of the sparse train dataset (X_train_part_sparse, y_train_part), make predictions for the holdout set (X_valid_sparse, y_valid) and calculate accuracy. Logistic regression will take some time to fit (around 8 minutes for me), which is okay; set *multi_class='multinomial'* to make it train much faster. Specify *random_state=17* and *n_jobs=-1* everywhere. For *SGDClassifier*, also specify *max_iter=3*.

In [31]:
logit = LogisticRegression(solver='lbfgs', random_state=17, n_jobs=-1, multi_class='multinomial')
# Note: scikit-learn 1.1+ names this loss 'log_loss'; 'log' matches the version used here
sgd_logit = SGDClassifier(loss='log', max_iter=3, n_jobs=-1, random_state=17)

In [32]:
%%time
logit.fit(X_train_part_sparse, y_train_part)

Wall time: 8min 18s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=-1, penalty='l2', random_state=17, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [34]:
%%time
sgd_logit.fit(X_train_part_sparse, y_train_part)

Wall time: 6.62 s


SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=3,
       n_iter=None, n_iter_no_change=5, n_jobs=-1, penalty='l2',
       power_t=0.5, random_state=17, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

- **Calculate accuracy on the holdout set for Vowpal Wabbit, round to 3 decimal places**
- **Calculate accuracy on the holdout set for SGD, round to 3 decimal places**
- **Calculate accuracy on the holdout set for logistic regression, round to 3 decimal places**

In [35]:
vw_valid_acc = accuracy_score(y_valid_for_vw, test_prediction_mult)
sgd_valid_acc = accuracy_score(y_valid, sgd_logit.predict(X_valid_sparse))
logit_valid_acc = accuracy_score(y_valid, logit.predict(X_valid_sparse))
print(vw_valid_acc, sgd_valid_acc, logit_valid_acc)

0.3072139757102739 0.2910755315657026 0.352201028483898


## 2.3 Test set validation (public leaderboard)

Train a VW model with the same parameters on the whole training data - **train.vw**

In [None]:
%%time
!vw '''YOUR CODE HERE'''

Make predictions for test data

In [None]:
%%time
!vw '''YOUR CODE HERE'''

Write predictions to file, perform the reverse label transformation (we got our labels by adding 1 to the output of the *LabelEncoder* instance) and send the submission to Kaggle. 

In [None]:
def write_to_submission_file(predicted_labels, out_file,
                             target='user_id', index_label="session_id"):
    # turn predictions into data frame and save as csv file
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

In [None]:
vw_pred = '''YOUR CODE HERE'''

In [None]:
write_to_submission_file(vw_pred, os.path.join(PATH_TO_DATA, 'vw_400_users.csv'))

Do the same for SGD and logistic regression. I know, it is pretty annoying to wait for logistic regression to fit on this data, but let's be patient. 

In [None]:
'''YOUR CODE HERE'''

In [None]:
# Assumes the previous cell defines logit_test_pred and sgd_logit_test_pred
write_to_submission_file(logit_test_pred, 
                         os.path.join(PATH_TO_DATA, 'logit_400_users.csv'))
write_to_submission_file(sgd_logit_test_pred, 
                         os.path.join(PATH_TO_DATA, 'sgd_400_users.csv'))

Let's look at Public Leaderboard scores in [this](https://www.kaggle.com/c/identify-me-if-you-can4) competition.

- **What is the Public Leaderboard score for Vowpal Wabbit?**
- **What is the Public Leaderboard score for SGD?**
- **What is the Public Leaderboard score for logistic regression?**

In conclusion:
- think about how Vowpal Wabbit, SGD and logistic regression compare in terms of training speed and classification quality
- the 400-user classification task probably can't be solved well enough if we use an "honest" time-based split for testing. Next we will compete in identifying only one user (Alice) - [here](https://inclass.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2) is the competition you are advised to participate in.

Good luck!