## What is Vowpal Wabbit

Vowpal Wabbit (VW) is a general-purpose machine learning library that implements, among other things, logistic regression with ideas like the hashing trick and per-coordinate adaptive learning rates (in fact, the hashing trick was made popular by that library). A big advantage of Vowpal Wabbit is that it is blazing fast, not only because its underlying implementation is in C++, but also because it learns online, streaming through the examples one at a time. By default VW uses a variant of stochastic gradient descent (SGD); it also offers the L-BFGS optimization method as an option (--bfgs). L-BFGS stands for “Limited-memory Broyden–Fletcher–Goldfarb–Shanno” and approximates the Broyden–Fletcher–Goldfarb–Shanno ([BFGS](https://en.wikipedia.org/wiki/Broyden–Fletcher–Goldfarb–Shanno_algorithm)) method using a limited amount of memory. This method is much more complex to implement than stochastic gradient descent (which can be implemented in a few lines of code, as we saw in our previous post), but it often converges in fewer iterations. If you want to read more about L-BFGS and/or understand how it differs from other optimisation methods, you can check [this](https://github.com/JohnLangford/vowpal_wabbit/wiki/L-BFGS.pdf) (doc from Vowpal Wabbit) or [this](http://aria42.com/blog/2014/12/understanding-lbfgs) (nice blog post). Note that L-BFGS has been observed empirically to outperform SGD in a number of cases, including some deep learning settings (check out this [paper](http://ai.stanford.edu/~quocle/LeNgiCoaLahProNg11.pdf) on that topic).

### Download task data

In [1]:
task_data_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip'

In [2]:
import os
os.system("wget {} -O /data/vw_tutorial.zip".format(task_data_link))

0

In [3]:
os.system("unzip /data/vw_tutorial.zip -d /data/")

256

In [4]:
! du -sh /data/bank*

4.4M	/data/bank-full.csv
4.0K	/data/bank-names.txt
452K	/data/bank.csv
0	/data/bank_train.vw


## Input format, Namespaces and more

### [Dataset description](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#)

The dataset describes a bank's attempt to predict whether a marketing phone call will end in the customer subscribing to a bank term deposit, based on a set of signals such as the customer's socio-economic factors (e.g. “does the customer have a loan?”).



The traditional way to represent such datasets is a tsv or csv file, with the header holding the names of the signals and each line holding the values of one training example for each signal. Each line of the training set thus has a fixed size, and a missing value is either a blank cell or some specific value that marks it as missing. For this dataset, the header looks like this:

#### age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y

Here y is the actual supervision (i.e. did the call end in a bank term deposit). A typical training example looks like this:

#### 58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no

In Vowpal Wabbit, there is no header, and each signal name is embedded in the training example itself. For example, the training example above can look like this in Vowpal Wabbit format:

#### -1 |i age:58 balance:2143 duration:261 campaign:1 pdays:-1 previous:0 |c job=management marital=married education=tertiary default=no housing=yes loan=no contact=unknown day=5 month=may poutcome=unknown


Let’s discuss several important things here:

* -1 says that this is a negative example.
* The |i and |c specify that the features that follow belong to the same feature namespace. Being part of a namespace simply means that all the features in the namespace are hashed together in the same feature space (this relates to the hashing trick, cf. the previous post of this series).
* Here, I artificially created two namespaces: one for numerical features and another for categorical ones. That was just to illustrate the idea of a namespace.
* In practice, namespaces can be used for different reasons (check the doc here), but one that is particularly useful is that they let you create feature interactions:
* For instance, on the command line, --quadratic ic combines all the features of the namespaces i and c in our example above to create 2-way interaction features on the fly. The value of age and job together then becomes a new signal (maybe people of a certain age in a certain profession are more or less likely to make a bank term deposit).
* Note as well that I used the colon ‘:’ for the numerical features and ‘=’ for the categorical ones.
* Only the ‘:’ is interpreted by Vowpal Wabbit. Both in training and when applying the model, the weight of the corresponding numerical feature (let’s say age) is multiplied by the actual numerical value in the weighted linear product of the logistic hypothesis (more on that later).
* The ‘=’ is just cosmetic and there for clarity. Technically, writing married instead of marital=married makes absolutely no difference for the training, unless the value married could show up in different contexts. E.g. if there were another signal childMarital indicating the marital status of the customer’s children, you would have to distinguish whether the value married refers to the customer or to the children, in which case the feature name would be necessary. Note that if you put two such features in different namespaces, they cannot be mixed up and the prefix is again unnecessary.
* Note that for each signal, I’ve used the full name of the signal as a prefix (e.g. age or marital). We just saw that for categorical features this is not strictly required. For numerical signals, though, it is (i.e. you cannot just throw in a number without context). Now, for huge training sets, you don’t necessarily want a long string repeated millions of times (or more). A good compromise is a mapping between signal names and very short strings (e.g. F1, F2, F3, …). In the following section, I provide some code that generates such a training set with a signal name mapping.
* There is a nice answer on Quora here with a short cheat-sheet as a reminder of these points and of how to encode boolean, categorical, ordinal+monotonic or numerical variables in VW.
* Last but not least, one thing I love about this format is that it is very well suited to sparse data. If you have thousands of features, or maybe just a list of words, you don’t need to care about the order of the features or about missing values: you just throw in the features with the right prefix and/or in the right namespace and you’re done. VW will then hash them into their proper bucket in their proper hashing namespace.
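
To make the format concrete, here is a minimal sketch that turns the CSV example above into a VW line. The function name `to_vw_line` and the two column lists are assumptions made for illustration (in particular, treating day as categorical simply follows the example line), and the sketch emits every categorical column of the row, including loan.

```python
# Assumption: this numerical/categorical split mirrors the example line above;
# "day" is treated as categorical and "y" is the label.
NUM_FEATURES = ["age", "balance", "duration", "campaign", "pdays", "previous"]
CAT_FEATURES = ["job", "marital", "education", "default", "housing",
                "loan", "contact", "day", "month", "poutcome"]


def to_vw_line(row):
    """Format one example (a dict of column name -> value) as a VW line."""
    label = "1" if row["y"] == "yes" else "-1"
    # Numerical features use name:value, so VW reads the number as the feature value.
    numeric = " ".join(f"{name}:{row[name]}" for name in NUM_FEATURES)
    # Categorical features use name=value, which VW sees as one opaque token.
    categorical = " ".join(f"{name}={row[name]}" for name in CAT_FEATURES)
    return f"{label} |i {numeric} |c {categorical}"


header = ("age;job;marital;education;default;balance;housing;loan;contact;"
          "day;month;duration;campaign;pdays;previous;poutcome;y").split(";")
values = ("58;management;married;tertiary;no;2143;yes;no;unknown;"
          "5;may;261;1;-1;0;unknown;no").split(";")
example = dict(zip(header, values))
print(to_vw_line(example))
```

This sketch does no escaping: a value containing a space, ‘:’ or ‘|’ would need to be cleaned first, since those characters have a meaning in the VW format.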


### Transform your CSV datasets into VW format

In [5]:
import pandas as pd

In [6]:
data = pd.read_csv('/data/bank-full.csv', sep=';')
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [53]:
with open('/data/bank-names.txt', 'r') as text:
    print(text.read())

Citation Request:
  This dataset is public available for research. The details are described in [Moro et al., 2011]. 
  Please include this citation if you plan to use this database:

  [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. 
  In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

  Available at: [pdf] http://hdl.handle.net/1822/14838
                [bib] http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt

1. Title: Bank Marketing

2. Sources
   Created by: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL) @ 2012
   
3. Past Usage:

  The full dataset was described and analyzed in:

  S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. 
  In P. Novais et al. (Eds.), Proceedin

In [7]:
header = list(data)
# list the names of the numerical features; every other column except the label 'y' is categorical
num_features = #YOUR CODE HERE
cat_features = set(header) - set(num_features) - {'y'}
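
One possible way to fill in this cell is sketched below. It is only an assumption about the intended split: the six numerical columns are the ones written with ‘:’ in the example VW line earlier, and day is kept categorical as in that example. The header is spelled out so that the snippet runs on its own; in the notebook it is simply list(data).

```python
# Column names, as returned by list(data) in the notebook.
header = ["age", "job", "marital", "education", "default", "balance",
          "housing", "loan", "contact", "day", "month", "duration",
          "campaign", "pdays", "previous", "poutcome", "y"]

# Assumption: these six columns are treated as numerical; "day" stays
# categorical here, although it could also be encoded as a number.
num_features = ["age", "balance", "duration", "campaign", "pdays", "previous"]

# Everything that is neither numerical nor the label "y" is categorical.
# A list comprehension keeps the original column order, unlike a set.
cat_features = [name for name in header
                if name not in num_features and name != "y"]
print(cat_features)
```

A list is used instead of a set so that the features are always written in the same order, which makes the generated files easier to compare by eye.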


In [8]:
from sklearn.model_selection import train_test_split

In [29]:
data_train, data_test= train_test_split(data, test_size=0.3, random_state=42)

In [30]:
# Replace 'no' with -1 and 'yes' with 1 in the label column y

#YOUR CODE HERE

In [31]:
def create_vw_file(filename, dataset):
    # write the dataset to the file `filename` in Vowpal Wabbit format, one example per line
    
    #YOUR CODE HERE
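
A minimal sketch of what such a function could look like is given below; it is one possible solution, not the reference one. The defaults NUM_FEATURES and LABELS are assumptions, the namespaces i and c follow the example line from the previous section, and the function accepts either a pandas DataFrame or a plain list of dicts so that it can be tried without pandas.

```python
# Assumption: same numerical columns as in the example VW line.
NUM_FEATURES = ("age", "balance", "duration", "campaign", "pdays", "previous")
# Labels may still be 'no'/'yes' or may already have been replaced by -1/1.
LABELS = {"no": -1, "yes": 1}


def create_vw_file(filename, dataset, num_features=NUM_FEATURES, label="y"):
    """Write `dataset` to `filename` in VW format, one example per line.

    `dataset` is a pandas DataFrame or a list of dicts (column -> value).
    """
    # DataFrame.to_dict("records") yields one dict per row.
    rows = dataset.to_dict("records") if hasattr(dataset, "to_dict") else dataset
    with open(filename, "w") as out:
        for row in rows:
            y = LABELS.get(row[label], row[label])
            # Numerical features: name:value in namespace i.
            numeric = " ".join(f"{k}:{row[k]}" for k in num_features)
            # All remaining columns except the label: name=value in namespace c.
            categorical = " ".join(
                f"{k}={v}" for k, v in row.items()
                if k not in num_features and k != label
            )
            out.write(f"{y} |i {numeric} |c {categorical}\n")
```

With this sketch, the calls in the next cell work unchanged whether or not the labels were already mapped to -1 and 1.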


In [33]:
create_vw_file('bank_train.vw', data_train)
create_vw_file('bank_test.vw', data_test)

## Train VW

Let’s start with a first command to train a regression model:

In [41]:
!vw bank_train.vw -f model.vw --loss_function squared

final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = bank_train.vw
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.416214   0.416214            3         3.0  -1.0000  -0.7298       16
0.235059   0.053904            6         6.0  -1.0000  -1.0000       17
0.491115   0.798382           11        11.0   1.0000  -0.9921       18
0.643804   0.796493           22        22.0  -1.0000  -0.2591       18
0.502135   0.360465           44        44.0  -1.0000  -1.0000       17
0.377984   0.250946           87        87.0  -1.0000  -1.0000       17
0.377113   0.376242          174       174.0  -1.0000  -1.0000       17
0.388875   0.400637          348       348.0   1.0000   0.0729       18
0.391542   0.394209          696       696.0  -1.0000  -1.0000       17
0.371666   0.351790         1392      1392.0  -1

It is pretty much self-explanatory:
* -f specifies the filename of the output model, and
* --loss_function specifies which loss function to use, squared in our case.

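Then, to actually use the model on a separate test set, you simply run the following (-t disables learning, -i loads the trained model and -p writes the predictions to a file):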


In [42]:
!vw bank_test.vw -t -i model.vw -p preds.txt

only testing
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = preds.txt
using no cache
Reading datafile = bank_test.vw
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.069848   0.069848            3         3.0  -1.0000  -0.5437       17
0.089859   0.109869            6         6.0  -1.0000  -0.8352       17
0.320497   0.597264           11        11.0   1.0000  -0.5074       17
0.401549   0.482601           22        22.0   1.0000   0.1998       17
0.295090   0.188632           44        44.0  -1.0000  -1.0000       17
0.372497   0.451703           87        87.0   1.0000  -0.5474       17
0.415658   0.458819          174       174.0  -1.0000  -1.0000       17
0.349235   0.282813          348       348.0  -1.0000  -1.0000       18
0.326170   0.303104          696       696.0   1.0000  -0.9686       17
0.296370   0.266571         1392      13

Some options I found useful and interesting for the training part:

* -c --passes N. This tells VW to make N passes over the training set while learning the weights. In deep learning, the term epoch is often used instead of pass; both mean one full pass over the whole training set to update the weights. Doing several passes often leads to stronger models, and the ideal number of passes can be tuned as a hyperparameter. Note that the -c option, which enables caching, is necessary when doing multiple passes because, from the second pass on, VW reads the pre-compiled information that it prepared and cached during the first pass.
* -b N. The -b option controls the number of bits in the hashing namespace (cf. part 2 of this series for an explanation of the hashing trick), which sets the number of buckets to 2^N. The default value for N is 18, which might be more than enough (e.g. for the toy bank dataset) or too small, depending on the cardinality of your feature values. If you need to encode features with a high cardinality, i.e. with a lot of different values, such as a product id in a catalog of millions of products, or, more frequently, if you need to create feature interactions (i.e. the cartesian product of the values of two features), which also often leads to high-cardinality features, then you’ll probably need to increase N. The higher it is, the fewer collisions you’ll have in your namespace, but the more memory you’ll need.
* --interactions arg. This is a very powerful one. Basically arg is a list of letters, and each letter represents a namespace (assuming you organised your features into namespaces, as in our example in the previous section). With that option, VW automatically creates interactions between all the features in the corresponding namespaces. For instance, in our example above, adding --interactions ic instantly creates a whole bunch of new features in the model: all the interaction pairs between features in the namespace i and features in the namespace c. Note that in this case the option is equivalent to --quadratic ic, but --interactions is more general, as it can create not only quadratic interactions but also higher-order ones (triplets, quadruplets, etc.). Such a feature gets you somewhat closer to factorization machine models.
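
The interplay between interactions and the -b option can be sketched in a few lines of Python. This is only an illustration of the idea, not of VW's internals: zlib.crc32 stands in for VW's own hash function, and the helper names bucket and quadratic as well as the feature lists are made up for the example.

```python
import zlib
from itertools import product


def bucket(namespace, feature, bits=18):
    """Map a namespaced feature name to one of 2**bits weight indices.

    zlib.crc32 is only a stand-in here; VW uses its own hash function.
    """
    return zlib.crc32(f"{namespace}^{feature}".encode()) % (2 ** bits)


def quadratic(feats_a, feats_b):
    """Pair every feature of one namespace with every feature of the other."""
    return [f"{a}*{b}" for a, b in product(feats_a, feats_b)]


i_feats = ["age", "balance", "duration", "campaign", "pdays", "previous"]
c_feats = ["job=management", "marital=married", "education=tertiary"]

# 6 numerical x 3 categorical features give 18 interaction features.
pairs = quadratic(i_feats, c_feats)
print(len(pairs), "interaction features, e.g.", pairs[:2])

# With very few bits, several features are forced into the same bucket.
for bits in (4, 18):
    used = {bucket("ic", p, bits) for p in pairs}
    print(f"-b {bits}: {len(pairs)} features use {len(used)} of {2 ** bits} buckets")
```

With 4 bits there are only 16 buckets for 18 interaction features, so collisions are unavoidable; this is the effect that a larger N mitigates at the cost of memory.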

In [44]:
preds = pd.read_csv('preds.txt', header=None)
test_split = pd.read_csv('bank_test.vw', header=None, sep = '|')
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(test_split[0].values, preds[0].values)
auc = metrics.auc(fpr, tpr)
print(auc)

0.902321099812


Look at the other [arguments](https://github.com/JohnLangford/vowpal_wabbit/wiki/Command-line-arguments) and try to achieve better results.