# Part 2. Applying Vowpal Wabbit to site visit data

### 2.1. Data preparation

**Next, let's see Vowpal Wabbit in action. In our competition task, binary classification of web sessions, we would not notice a difference in either quality or speed (you can check this yourself), so instead we will demonstrate how fast VW is on a classification task with 400 classes. The source data is the same, but 400 users have been selected and the task is to identify them. Download the data from here: the files train_sessions_400users.csv and test_sessions_400users.csv.**

In [53]:
import os
import pandas as pd
import numpy as np
import scipy.sparse as sps
from time import time
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import preprocessing

In [54]:
# Change this to your own data path
PATH_TO_DATA = 'capstone_user_identification'

**Load the training and test sets. Note that the test sessions are clearly separated in time from the sessions in the training set.**

In [55]:
train_df_400 = pd.read_csv(os.path.join(PATH_TO_DATA,'train_sessions_400users.csv'), 
                           index_col='session_id')

In [56]:
test_df_400 = pd.read_csv(os.path.join(PATH_TO_DATA,'test_sessions_400users.csv'), 
                           index_col='session_id')

In [57]:
train_df_400.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,user_id
1,23713,2014-03-24 15:22:40,23720.0,2014-03-24 15:22:48,23713.0,2014-03-24 15:22:48,23713.0,2014-03-24 15:22:54,23720.0,2014-03-24 15:22:54,...,2014-03-24 15:22:55,23713.0,2014-03-24 15:23:01,23713.0,2014-03-24 15:23:03,23713.0,2014-03-24 15:23:04,23713.0,2014-03-24 15:23:05,653
2,8726,2014-04-17 14:25:58,8725.0,2014-04-17 14:25:59,665.0,2014-04-17 14:25:59,8727.0,2014-04-17 14:25:59,45.0,2014-04-17 14:25:59,...,2014-04-17 14:26:01,45.0,2014-04-17 14:26:01,5320.0,2014-04-17 14:26:18,5320.0,2014-04-17 14:26:47,5320.0,2014-04-17 14:26:48,198
3,303,2014-03-21 10:12:24,19.0,2014-03-21 10:12:36,303.0,2014-03-21 10:12:54,303.0,2014-03-21 10:13:01,303.0,2014-03-21 10:13:24,...,2014-03-21 10:13:36,303.0,2014-03-21 10:13:54,309.0,2014-03-21 10:14:01,303.0,2014-03-21 10:14:06,303.0,2014-03-21 10:14:24,34
4,1359,2013-12-13 09:52:28,925.0,2013-12-13 09:54:34,1240.0,2013-12-13 09:54:34,1360.0,2013-12-13 09:54:34,1344.0,2013-12-13 09:54:34,...,2013-12-13 09:54:34,1346.0,2013-12-13 09:54:34,1345.0,2013-12-13 09:54:34,1344.0,2013-12-13 09:58:19,1345.0,2013-12-13 09:58:19,601
5,11,2013-11-26 12:35:29,85.0,2013-11-26 12:35:31,52.0,2013-11-26 12:35:31,85.0,2013-11-26 12:35:32,11.0,2013-11-26 12:35:32,...,2013-11-26 12:35:32,11.0,2013-11-26 12:37:03,85.0,2013-11-26 12:37:03,10.0,2013-11-26 12:37:03,85.0,2013-11-26 12:37:04,273


**We see that the training set contains 182793 sessions, the test set contains 46473, and the sessions do belong to 400 different users.**

In [58]:
train_df_400.shape, test_df_400.shape, train_df_400['user_id'].nunique()

((182793, 21), (46473, 20), 400)

**Vowpal Wabbit expects class labels to run from 1 to K, where K is the number of classes in the classification task (400 in our case). So we have to apply LabelEncoder and then add 1 (LabelEncoder maps labels to the range from 0 to K-1). Later we will need to apply the inverse transformation.**

In [59]:
y = train_df_400.user_id
class_encoder = preprocessing.LabelEncoder()
y_for_vw = class_encoder.fit_transform(y)+1
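
To make the round trip concrete, here is a dependency-free sketch of the same idea: encode raw labels into 1..K and decode them back. The helper names `fit_labels`, `encode_for_vw` and `decode_from_vw` are hypothetical, and the user IDs are toy values rather than the real data; with scikit-learn the decoding step would correspond to `class_encoder.inverse_transform(labels - 1)`.

```python
def fit_labels(raw_labels):
    """Return the sorted unique raw labels; position i encodes as i + 1."""
    return sorted(set(raw_labels))

def encode_for_vw(raw_labels, classes):
    """Map raw labels to 1..K, the range expected by VW's --oaa."""
    index = {label: i + 1 for i, label in enumerate(classes)}
    return [index[label] for label in raw_labels]

def decode_from_vw(vw_labels, classes):
    """Inverse mapping: subtract 1 and look the raw label up again."""
    return [classes[label - 1] for label in vw_labels]

user_ids = [653, 198, 34, 601, 273, 198]   # toy values, not the real data
classes = fit_labels(user_ids)
encoded = encode_for_vw(user_ids, classes)
decoded = decode_from_vw(encoded, classes)
print(encoded)
print(decoded == user_ids)
```

The important detail is that the shift by 1 is applied exactly once when encoding and undone exactly once when decoding.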

**Next we will compare VW with SGDClassifier and with logistic regression. All of these models need preprocessed input data. Prepare sparse matrices for the sklearn models, as we did in part 5:**

* concatenate the training and test sets
* select only the sites (features 'site1' through 'site10')
* replace missing values with zeros (our sites were numbered from 0)
* convert to the sparse csr_matrix format
* split back into training and test parts
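
One way to read these steps is as a bag-of-sites encoding: each row is a session and each column counts how often a site occurs in it. The sketch below builds the (data, rows, cols) triplets in plain Python. The helper `sessions_to_coo` is hypothetical, and it assumes that 0 is reserved for missing values and that real site IDs start at 1.

```python
from collections import Counter

def sessions_to_coo(sessions):
    """Turn sessions (lists of site IDs, 0 = missing) into COO triplets."""
    rows, cols, data = [], [], []
    for row, session in enumerate(sessions):
        counts = Counter(site for site in session if site != 0)
        for site, count in sorted(counts.items()):
            rows.append(row)
            cols.append(site - 1)   # assumption: real site IDs start at 1
            data.append(count)
    return data, rows, cols

toy_sessions = [[3, 3, 1, 0], [2, 0, 0, 0]]   # toy data, not the real sessions
data, rows, cols = sessions_to_coo(toy_sessions)
print(data, rows, cols)
```

The triplets can then be passed to `scipy.sparse.coo_matrix((data, (rows, cols)))` and converted with `.tocsr()` before splitting back into training and test rows.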

In [60]:
sites = ['site' + str(i) for i in range(1, 11)]

In [61]:
train_test_df = pd.concat([train_df_400, test_df_400], sort=False)


In [62]:
train_test_df_sites = train_test_df[sites]

In [63]:
train_test_df_sites.isnull().sum().sum()

81858

In [64]:
train_test_df_sites = train_test_df_sites.fillna(0)

In [65]:
train_test_df_sites.isnull().sum().sum()

0

In [66]:
train_test_df_sites = train_test_df_sites.astype(int)

In [67]:
train_test_df_sites.shape

(229266, 10)

In [68]:
tmp = test_df_400[sites].fillna(0).astype(int)

In [69]:
# 25082019

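def create_sparse_matrix(dataframe):
    # Builds a COO matrix that keeps the 10 positional columns as they are:
    # entry (row, i) holds the raw site ID found in column i of that session.
    # Zeros (missing sites) are skipped, so they are simply not stored.
    tmp_arr = np.array(dataframe)
    rows = []
    cols = []
    data = []

    for row, arr in enumerate(tmp_arr):
        for i, val in enumerate(arr):
            if val != 0:
                data.append(val)
                cols.append(i)
                rows.append(row)

    return sps.coo_matrix((data, (rows, cols)))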

In [70]:
t_start = time()

idx_split = train_df_400.shape[0]

#tmp_sparse = csr_matrix(create_sparse_matrix(tmp))

train_test_sparse = csr_matrix(create_sparse_matrix(train_test_df_sites))

X_train_sparse = train_test_sparse[:idx_split, :]
X_test_sparse = train_test_sparse[idx_split:,:]

#y = train_df.target

print('Train DF size: {0}\nTest DF size: {1}\nTarget size: {2}'.format(
    str(X_train_sparse.shape), str(X_test_sparse.shape), str(y.shape)))

print("Time elapse: ", time() - t_start)

Train DF size: (182793, 10)
Test DF size: (46473, 10)
Target size: (182793,)
Time elapse:  1.8941981792449951


### 2.2. Validation on a held-out set

**Split the original training set into a training part (70%) and a held-out part (30%). We do not shuffle the data, because the sessions are sorted by time.**

In [71]:
train_share = int(.7 * train_df_400.shape[0])
train_df_part = train_df_400[sites].iloc[:train_share, :]
valid_df = train_df_400[sites].iloc[train_share:, :]
X_train_part_sparse = X_train_sparse[:train_share, :]
X_valid_sparse = X_train_sparse[train_share:, :]

In [72]:
y_train_part = y[:train_share]
y_valid = y[train_share:]
y_train_part_for_vw = y_for_vw[:train_share]
y_valid_for_vw = y_for_vw[train_share:]
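
The same unshuffled split can be written as a small reusable helper. This is only a sketch; the name `time_ordered_split` is hypothetical, and the input is assumed to be already sorted by time.

```python
def time_ordered_split(items, train_fraction=0.7):
    """Split a time-sorted sequence without shuffling."""
    cut = int(train_fraction * len(items))
    return items[:cut], items[cut:]

sessions = list(range(10))          # stand-in for time-sorted sessions
train_part, valid_part = time_ordered_split(sessions)
print(train_part, valid_part)
```

Because nothing is shuffled, every held-out item comes after every training item, which mimics predicting future sessions from past ones.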

In [101]:
tmp = y_valid.values
y_valid_for_vw.shape

(54838,)

Implement the function arrays_to_vw, which converts a training set into the Vowpal Wabbit format.

Input:

* X - a NumPy matrix (the training set)
* y (optional) - the vector of answers (NumPy). It is optional because we will process the test matrix with the same function
* train - a flag, True for the training set and False for the test set
* out_file - path to the .vw file to write to

Details:

* iterate over all rows of X and write all values separated by spaces, prepending the corresponding class label from y and the separator character |
* in the test set any value can be written in place of the target class labels, for example 1
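
As a minimal illustration of the line format described above, the sketch below formats a single session. The helper `to_vw_line` is hypothetical and covers only the plain "label | features" layout used in this notebook.

```python
def to_vw_line(sites, label=None):
    """Format one session as '<label> | <site> <site> ...'."""
    label = 1 if label is None else label   # placeholder for unlabeled rows
    return "{} | {}".format(label, " ".join(str(s) for s in sites))

labeled_line = to_vw_line([23713, 23720, 23713], label=262)
unlabeled_line = to_vw_line([9, 304, 308])
print(labeled_line)
print(unlabeled_line)
```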

In [73]:
def arrays_to_vw(X, y=None, train=True, out_file='tmp.vw'):
    # Missing sites (NaN) become 0; site IDs are written as integers.
    X = np.nan_to_num(X)
    X = X.astype(int)

    with open(out_file, 'w') as f:
        print(X.shape)
        for i in range(X.shape[0]):
            string = ' '.join([str(x) for x in X[i]])
            # Without labels (test set), write a placeholder label of 1.
            # Note: the `train` flag is accepted but not used here.
            label = 1 if y is None else y[i]
            f.write(str(label) + " | " + string + "\n")

**Apply the function to the part of the training set (train_df_part, y_train_part_for_vw), to the held-out set (valid_df, y_valid_for_vw), to the whole training set, and to the whole test set. Note that our function takes NumPy matrices and vectors as input.**

In [74]:
t_start = time()
#tmp = create_sparse_matrix(train_df_part)


arrays_to_vw(train_df_part.values, y_train_part_for_vw, True, 'kaggle_data/train_part.vw')
arrays_to_vw(valid_df.values, y_valid_for_vw, False, 'kaggle_data/valid.vw')
arrays_to_vw(train_df_400[sites].values, y_for_vw, True, 'kaggle_data/train.vw')
arrays_to_vw(test_df_400[sites].values, None, False, 'kaggle_data/test.vw')

print("Time elapse: ", time() - t_start)

(127955, 10)
(54838, 10)
(182793, 10)
(46473, 10)
Time elapse:  3.5745303630828857


**Train a Vowpal Wabbit model on train_part.vw. Specify that this is a classification task with 400 classes (--oaa) and make 3 passes over the data (--passes). Set a cache file (--cache_file, or simply the -c flag) so that VW makes all passes after the first one faster (a previous cache file is removed with the -k argument). Also set b=26. This is the number of bits used for hashing; in this case we need more than the default 18. Finally, set random_seed=17. Do not change the other parameters for now; later, in the free-form part of the competition, you can try other loss functions.**
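
The flags listed above can be assembled programmatically, which makes it harder to forget one of them. The sketch below only builds the argument list; the helper `build_vw_train_cmd` and the file names are hypothetical, and actually running it (for example with `subprocess.run(cmd, check=True)`) assumes that the `vw` executable is installed.

```python
def build_vw_train_cmd(data_path, model_path, n_classes=400,
                       passes=3, bits=26, seed=17):
    """Assemble the vw training command as an argument list."""
    return [
        "vw",
        "--oaa", str(n_classes),      # one-against-all multiclass
        "--passes", str(passes),
        "-c", "-k",                   # use a cache file, drop any old one
        "-b", str(bits),              # number of hash bits
        "--random_seed", str(seed),
        "-d", data_path,
        "-f", model_path,             # where to save the trained model
    ]

cmd = build_vw_train_cmd("train_part.vw", "vw_model.vw")
print(" ".join(cmd))
```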

In [75]:
train_part_vw = os.path.join(PATH_TO_DATA, 'train_part.vw')
valid_vw = os.path.join(PATH_TO_DATA, 'valid.vw')
train_vw = os.path.join(PATH_TO_DATA, 'train.vw')
test_vw = os.path.join(PATH_TO_DATA, 'test.vw')
model = os.path.join(PATH_TO_DATA, 'vw_model.vw')
pred = os.path.join(PATH_TO_DATA, 'vw_pred.csv')

In [76]:
!head -3 kaggle_data/test.vw

1 | 9 304 308 307 91 308 312 300 305 309
1 | 838 504 68 11 838 11 838 886 27 305
1 | 190 192 8 189 191 189 190 2375 192 8


In [77]:
!head -3 kaggle_data/train_part.vw

262 | 23713 23720 23713 23713 23720 23713 23713 23713 23713 23713
82 | 8726 8725 665 8727 45 8725 45 5320 5320 5320
16 | 303 19 303 303 303 303 303 309 303 303


In [78]:
!head -3 kaggle_data/valid.vw

4 | 7 923 923 923 11 924 7 924 838 7
160 | 91 198 11 11 302 91 668 311 310 91
312 | 27085 848 118 118 118 118 11 118 118 118


In [79]:
#!vw --oaa 400 kaggle_data/train_part.vw -f kaggle_data_result/oaa.model 

In [80]:
!vw --oaa 400 --passes 3 -c -k kaggle_data/train_part.vw -b 26 -f  kaggle_data/model.vw --random_seed 17

final_regressor = kaggle_data/model.vw
Num weight bits = 26
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = kaggle_data/train_part.vw.cache
Reading datafile = kaggle_data/train_part.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0      262        1       11
1.000000 1.000000            2            2.0       82      262       11
1.000000 1.000000            4            4.0      241      262       11
1.000000 1.000000            8            8.0      352      262       11
1.000000 1.000000           16           16.0      135       16       11
1.000000 1.000000           32           32.0       71      112       11
0.968750 0.937500           64           64.0      358      231       11
0.976562 0.984375          128          128.0      348      346       11
0.941406 0.906250      

**Write the predictions for valid.vw to vw_valid_pred.csv.**

In [29]:
!vw -i kaggle_data_result/model.vw -t -d kaggle_data/valid.vw -p kaggle_data_result/vw_valid_pred.csv --quiet

In [81]:
!vw -t -i kaggle_data/model.vw kaggle_data/valid.vw -p kaggle_data/vw_valid_pred.csv

only testing
predictions = kaggle_data/vw_valid_pred.csv
Num weight bits = 26
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = kaggle_data/valid.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0        4      188       11
1.000000 1.000000            2            2.0      160      220       11
0.750000 0.500000            4            4.0      143      143       11
0.750000 0.750000            8            8.0      247      247       11
0.687500 0.625000           16           16.0      341       30       11
0.593750 0.500000           32           32.0      237      237       11
0.609375 0.625000           64           64.0      178      178       11
0.640625 0.671875          128          128.0      132      228       11
0.656250 0.671875          256          256.0       14       14       11


**Read the predictions from kaggle_data/vw_valid_pred.csv and look at the accuracy on the held-out part.**

In [82]:
y_pred = pd.read_csv('kaggle_data/vw_valid_pred.csv', header=None)

In [83]:
y_valid

session_id
127956      7
127957    402
127958    798
127959    360
127960    815
127961    370
127962    328
127963    616
127964     59
127965    558
127966    534
127967    100
127968    120
127969    812
127970    778
127971    877
127972    364
127973    832
127974    613
127975    495
127976    613
127977    475
127978    917
127979    179
127980    107
127981    586
127982     30
127983    832
127984    432
127985    753
         ... 
182764    120
182765    361
182766    318
182767    310
182768    205
182769    789
182770    200
182771    803
182772    939
182773    292
182774      7
182775    935
182776    829
182777    208
182778    725
182779    954
182780    829
182781    575
182782    778
182783    572
182784     48
182785    393
182786    371
182787    862
182788    512
182789    183
182790    448
182791    632
182792    232
182793    635
Name: user_id, Length: 54838, dtype: int64

In [98]:
y_pred.values

array([[188],
       [220],
       [364],
       ...,
       [118],
       [318],
       [125]])

In [99]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix

In [104]:
y_valid_for_vw

array([  4, 160, 312, ..., 254,  91, 256])

In [105]:
accuracy_score(y_valid_for_vw, y_pred.values.ravel())

0.34541741128414605
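
For reference, the accuracy above is just the share of sessions where the predicted class equals the encoded label. Here is a dependency-free sketch, assuming the prediction file holds one predicted class per line; the helper names and the toy values are hypothetical.

```python
def read_vw_predictions(lines):
    """Parse one predicted class per line; tolerates values like '188.0'."""
    return [int(float(line.split()[0])) for line in lines if line.strip()]

def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("length mismatch")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

toy_pred_lines = ["188\n", "220\n", "143\n", "247\n"]   # made-up file content
toy_true = [4, 160, 143, 247]                           # made-up labels
preds = read_vw_predictions(toy_pred_lines)
toy_accuracy = accuracy(toy_true, preds)
print(toy_accuracy)
```

Both sequences must use the same label encoding (here, the 1..K labels written to the .vw file) for the number to be meaningful.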

**This does not work with the scheme above. I will try a different sparse matrix.**

In [33]:
def create_sparse_matrix(dataframe):
    # Bag-of-sites matrix: entry (row, site_id - 1) holds how many times
    # site_id occurs in the session; 0 marks a missing site and is skipped.
    tmp_arr = np.array(dataframe)
    rows = []
    cols = []
    data = []

    for row, arr in enumerate(tmp_arr):
        unique, counts = np.unique(arr, return_counts=True)
        for key, value in zip(unique, counts):
            if key != 0:
                rows.append(row)
                cols.append(key - 1)
                data.append(value)

    return sps.coo_matrix((data, (rows, cols)))

In [34]:
t_start = time()

idx_split = train_df_400.shape[0]

#tmp_sparse = csr_matrix(create_sparse_matrix(tmp))

train_test_sparse = csr_matrix(create_sparse_matrix(train_test_df_sites))

X_train_sparse = train_test_sparse[:idx_split, :]
X_test_sparse = train_test_sparse[idx_split:,:]

#y = train_df.target

print('Train DF size: {0}\nTest DF size: {1}\nTarget size: {2}'.format(
    str(X_train_sparse.shape), str(X_test_sparse.shape), str(y.shape)))

print("Time elapse: ", time() - t_start)

Train DF size: (182793, 36656)
Test DF size: (46473, 36656)
Target size: (182793,)
Time elapse:  5.441497325897217


In [35]:
train_df_part = X_train_sparse[:train_share, :]
valid_df = X_train_sparse[train_share:, :]

In [36]:
tmp = [((i, j), train_df_part[i,j]) for i, j in zip(*train_df_part.nonzero())]

In [37]:
X_train_sparse.shape

(182793, 36656)

In [38]:
def arrays_to_vw_2(X, y=None, train=True, out_file='tmp.vw'):
    # X is a CSR matrix of per-session site counts (column = site ID - 1).
    with open(out_file, 'w') as fid:
        for ind in range(X.shape[0]):
            string = ""
            row = X.getrow(ind)
            data = row.data
            indices = row.indices
            # Collect "column:count" features, summing any repeated columns.
            val_dict = {}
            for index, value in zip(indices, data):
                if index not in val_dict:
                    val_dict[index] = value
                else:
                    val_dict[index] += value

            for site, count in val_dict.items():
                string = string + " " + (str(site) + ":" + str(count))

            if y is None:
                fid.write(str(1) + " | " + string + "\n")
            else:
                # NOTE: this adds 1 to every label. The calls below pass
                # y_*_for_vw, which were already shifted by +1, so labels
                # end up in 2..401 and VW warns about label 401.
                fid.write(str(y[ind]+1) + " | " + string + "\n")


        

In [39]:
arrays_to_vw_2(train_df_part, y_train_part_for_vw, True, 'kaggle_data/train_part.vw')
arrays_to_vw_2(valid_df, y_valid_for_vw, False, 'kaggle_data/valid.vw')
#arrays_to_vw(train_df_400[sites].values, y_for_vw, True, 'kaggle_data/train.vw')
#arrays_to_vw(test_df_400[sites].values, None, False, 'kaggle_data/test.vw')

In [40]:
!head -3 kaggle_data/train_part.vw

263 |  23712:8 23719:2
83 |  44:2 664:1 5319:3 8724:2 8725:1 8726:1
17 |  18:1 302:8 308:1
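
The labels in this file are one higher than in the earlier version of train_part.vw (263, 83, 17 versus 262, 82, 16), which suggests that 1 was added to the labels twice. A quick range check before training can catch this. The sketch below uses a hypothetical helper `labels_out_of_range` on toy lines; it assumes every line starts with an integer label followed by `|`.

```python
def labels_out_of_range(vw_lines, n_classes):
    """Return the sorted set of labels that fall outside 1..n_classes."""
    bad = set()
    for line in vw_lines:
        label = int(line.split("|", 1)[0].split()[0])
        if not 1 <= label <= n_classes:
            bad.add(label)
    return sorted(bad)

toy_lines = ["263 |  23712:8 23719:2", "401 |  18:1 302:8", "1 |  44:2"]
bad_labels = labels_out_of_range(toy_lines, 400)
print(bad_labels)
```

An empty result means every label fits the 1..K range that --oaa K expects.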


In [41]:
!head -3 kaggle_data/valid.vw

5 |  6:3 10:1 837:1 922:3 923:2
161 |  10:2 90:3 197:1 301:1 309:1 310:1 667:1
313 |  10:1 117:7 847:1 27084:1


In [44]:
!vw -d kaggle_data/train_part.vw --passes 3 -c -f kaggle_data/model.vw

final_regressor = kaggle_data/model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = kaggle_data/train_part.vw.cache
ignoring text input in favor of cache input
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features

finished run
number of examples per pass = 0
passes used = 3
weighted example sum = 0.000000
weighted label sum = 0.000000
average loss = undefined (no holdout)
total feature number = 0
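
This log reports that the text input was ignored in favor of an existing cache file and that zero examples were read per pass. Passing -k, as in the earlier command, tells VW to discard the old cache; deleting the file by hand should have the same effect. The sketch below assumes the default cache name seen in the logs (the data path plus `.cache`); the helper `remove_stale_cache` is hypothetical and is demonstrated on a temporary directory.

```python
import os
import tempfile

def remove_stale_cache(data_path):
    """Delete '<data_path>.cache' if it exists; return True if removed."""
    cache_path = data_path + ".cache"
    if os.path.exists(cache_path):
        os.remove(cache_path)
        return True
    return False

# Demonstration on a throwaway directory rather than the real data
with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "train_part.vw")
    open(data_path + ".cache", "w").close()
    first = remove_stale_cache(data_path)
    second = remove_stale_cache(data_path)
print(first, second)
```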


In [42]:
!vw --oaa 400 kaggle_data/train_part.vw -f kaggle_data/oaa.model --loss_function=hinge

final_regressor = kaggle_data/oaa.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = kaggle_data/train_part.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0      263        1        3
1.000000 1.000000            2            2.0       83      263        7
1.000000 1.000000            4            4.0      242      263        8
1.000000 1.000000            8            8.0      353      263        3
1.000000 1.000000           16           16.0      157      263        3
1.000000 1.000000           32           32.0      244      167        9
0.984375 0.968750           64           64.0       80       93        7
0.984375 0.984375          128          128.0      299      310        3
label 401 is not in {1,400} This won't work right.
0.964844 0.945312          256    


In [43]:
!vw --oaa 400 --passes 3 -c -k kaggle_data/train_part.vw -b 26 -f  kaggle_data_result/model.vw --random_seed 17

final_regressor = kaggle_data_result/model.vw
Num weight bits = 26
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = kaggle_data/train_part.vw.cache
Reading datafile = kaggle_data/train_part.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0      263        1        3
1.000000 1.000000            2            2.0       83      263        7
1.000000 1.000000            4            4.0      242      263        8
1.000000 1.000000            8            8.0      353      263        3
1.000000 1.000000           16           16.0      136       17        4
1.000000 1.000000           32           32.0       72      263        6
0.968750 0.937500           64           64.0      359      100        5
label 401 is not in {1,400} This won't work right.
0.976562 0.984375          128        


In [45]:
!vw -d kaggle_data/valid.vw -i kaggle_data/model.vw -t -p kaggle_data/vw_valid_pred.csv

only testing
predictions = kaggle_data/vw_valid_pred.csv
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = kaggle_data/valid.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
25.000000 25.000000            1            1.0   5.0000   0.0000        6
12973.000000 25921.000000            2            2.0 161.0000   0.0000        8
36162.750000 59352.500000            4            4.0 144.0000   0.0000        5
43522.500000 50882.250000            8            8.0 248.0000   0.0000        9
47310.750000 51099.000000           16           16.0 342.0000   0.0000        3
48562.843750 49814.937500           32           32.0 238.0000   0.0000        3
49527.062500 50491.281250           64           64.0 179.0000   0.0000        9
51180.046875 52833.031250          128          128.0 133.0000   0.0000        4
51485.492188 51

In [46]:
y_pred = pd.read_csv('kaggle_data_result/vw_valid_pred.csv', header=None)

In [47]:
accuracy_score(y_pred, y_valid)

0.00176884642036544
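
Note that here the predictions are compared with y_valid, which holds the original user_id values, while VW predicts encoded labels in 1..K. The two have to be brought into the same label space first. With scikit-learn that would be `class_encoder.inverse_transform(pred - 1)`; the sketch below shows the same idea in plain Python on toy values, with a hypothetical helper `accuracy_in_raw_ids`.

```python
def accuracy_in_raw_ids(vw_preds, raw_true, classes):
    """Decode VW labels (1..K) to raw IDs, then compare with raw truth."""
    decoded = [classes[p - 1] if 1 <= p <= len(classes) else None
               for p in vw_preds]
    hits = sum(d == t for d, t in zip(decoded, raw_true))
    return hits / len(raw_true)

classes = [7, 59, 100, 120]        # toy sorted raw user IDs
raw_true = [7, 120, 59, 100]       # toy ground truth in raw IDs
vw_preds = [1, 4, 3, 0]            # toy VW output; 0 is not a valid class
raw_accuracy = accuracy_in_raw_ids(vw_preds, raw_true, classes)
print(raw_accuracy)
```

Predictions outside 1..K, such as the 0.0000 values in the log above, are counted as misses rather than decoded.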

In [48]:
!vw --oaa 400 --passes 3 -c -k kaggle_data/train_part.vw -b 26 -f  kaggle_data_result/model.vw --random_seed 17



In [49]:
!vw --oaa 400 --passes 3 --cache_file train.cache -b 26 --random_seed 17 -k -d $1 -f kaggle_data_result/vw.model

final_regressor = kaggle_data_result/vw.model
Num weight bits = 26
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = train.cache
Reading datafile = 1
can't open '1', sailing on!
num sources = 0
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features

finished run
number of examples per pass = 0
passes used = 3
weighted example sum = 0.000000
weighted label sum = 0.000000
average loss = undefined (no holdout)
total feature number = 0
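
The log shows VW trying to read a data file named 1, so the -d argument did not resolve to a real path and no examples were read. A small pre-flight check can make this kind of mistake fail loudly before VW starts. The helper `require_file` below is a hypothetical sketch, demonstrated on a temporary directory.

```python
import os
import tempfile

def require_file(path):
    """Return path unchanged if it is an existing file, else raise."""
    if not os.path.isfile(path):
        raise FileNotFoundError("VW data file not found: " + repr(path))
    return path

with tempfile.TemporaryDirectory() as tmp_dir:
    good = os.path.join(tmp_dir, "train.vw")
    with open(good, "w") as f:
        f.write("1 | 9 304 308\n")
    checked = require_file(good)
    try:
        require_file(os.path.join(tmp_dir, "1"))
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
print(missing_detected)
```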
