<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose, provided that the authors and the source are credited.

# <center> Assignment № 8
## <center> Vowpal Wabbit for Stackoverflow question tag classification

## Plan
    1. Introduction
    2. Data description
    3. Data preprocessing
    4. Training and validation of models
    5. Conclusion

### 1. Introduction

In this task, you will do something that we do every week at Mail.Ru Group: train models on several GBs of data. You might cope with Python on Windows, but we strongly recommend working on a \*NIX system (for instance, via Docker) and using bash utilities.
A sad but true fact is that if you want to work in ML at the best companies in the world, you will need experience with the UNIX shell. Here is an interactive [tutorial](https://www.codecademy.com/en/courses/learn-the-command-line/lessons/environment/exercises/bash-profile) from Codecademy on the UNIX command line (1-2 hours).

Submit your answers through the [web-form](https://docs.google.com/forms/d/14adHGB-XKtpHlG9JJgog3DUzMUabd4y1YWG3b866m54/edit).

For this particular task, you will need Vowpal Wabbit installed. It is already included in the Docker container of our course; see the instructions in the README of the course [repo](https://github.com/Yorko/mlcourse_open). Make sure you have approximately 70 GB of disk space. I have tested the solution on an ordinary 2015 MacBook Pro (8 cores, 16 GB RAM), and the heaviest model trained in about 12 minutes, so this task is doable on ordinary hardware. Still, if you plan to rent Amazon servers, now is a good time to do it.

### 2. Data description

We have 10 GB of questions from StackOverflow – [download](https://drive.google.com/file/d/1ZU4J3KhJDrHVMj48fROFcTsTZKorPGlG/view) and unpack the archive. 

The data format is simple:<br>
<center>*question text* (space-delimited words) TAB *question tags* (space-delimited)

TAB is the tab character.
Let's see the first sample from the training set:

In [1]:
!head -1 stackoverflow.10kk.tsv

 is there a way to apply a background color through css at the tr level i can apply it at the td level like this my td background color e8e8e8 background e8e8e8 however the background color doesn t seem to get applied when i attempt to apply the background color at the tr level like this my tr background color e8e8e8 background e8e8e8 is there a css trick to making this work or does css not natively support this for some reason 	css css3 css-selectors


Here, we have the question text, followed by a tab and the question tags: *css, css3* and *css-selectors*. There are 10 million such questions in our dataset.
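To make the format concrete, here is a minimal sketch of splitting one raw line into its text and its tags. The helper name `parse_line` is hypothetical and the sample line is a shortened version of the one above.

```python
def parse_line(line):
    """Split one raw line into (question_text, tags); return None if malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        return None  # the format expects exactly one tab per line
    text, tags = parts
    return text.strip(), tags.split()


# Shortened version of the first line of the dataset
sample = " is there a way to apply a background color through css \tcss css3 css-selectors\n"
text, tags = parse_line(sample)
print(text)
print(tags)
```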

In [2]:
%%time
!wc -l stackoverflow.10kk.tsv

 10000000 stackoverflow.10kk.tsv
CPU times: user 235 ms, sys: 103 ms, total: 338 ms
Wall time: 13.9 s


Note that we do not want to load this amount of data into memory, so we will rely on Unix utilities such as `head`, `tail`, `wc`, `cat`, and `cut`.
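The same streaming idea carries over to Python: read the file in fixed-size chunks instead of loading it whole. The sketch below is a rough equivalent of `wc -l`; the function name `count_lines` and the 1 MB chunk size are arbitrary choices for illustration.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline characters while holding at most chunk_size bytes in memory."""
    n_lines = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            n_lines += chunk.count(b"\n")
    return n_lines
```

Calling `count_lines("stackoverflow.10kk.tsv")` should agree with the `wc -l` output above, since both count newline characters.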

### 3. Data preprocessing

Let's select all questions with the tags *javascript, java, python, ruby, php, c++, c#, go, scala*, and *swift* from the data source, and prepare the training set in Vowpal Wabbit's data format. We will perform 10-class question classification over the tags we've selected.

In general, questions may have several tags, but we will simplify the task by keeping only questions that carry exactly one of the listed tags and dropping all the others.
Note that VW supports multilabel classification (`--multilabel_oaa` parameter).
<br>
<br>
Implement your data preprocessing code in a separate file `preprocess.py`. Your code must select lines with our tags and write them to a separate file in Vowpal Wabbit format. Details are as follows:
 - the script must accept command line arguments: the file paths for input and output
 - lines are processed one by one (there is a wonderful `tqdm` module for counting iterations)
 - if a line has no tab character or more than one tab character, the line is broken; skip it
 - if a line has exactly one tab character, check how many of its tags are from our list *javascript, java, python, ruby, php, c++, c#, go, scala* or *swift*. If there is exactly one such tag, write the line to the output in VW format: `label | text`, where `label` is a number from 1 to 10 (1 - *javascript*, ... 10 – *swift*). Skip lines with more than one such tag or with none.
 - remove the `:` and `|` symbols from the question text, since they have a special meaning for VW
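The per-line rules above can be sketched as a single function. This is only one possible shape for the core of `preprocess.py`, not a reference solution: the name `to_vw` is hypothetical, and the label numbering assumes that labels follow the order in which the tags are listed.

```python
# Assumption: labels 1..10 follow the order of the tag list in the text
TAGS = ["javascript", "java", "python", "ruby", "php",
        "c++", "c#", "go", "scala", "swift"]
LABELS = {tag: i for i, tag in enumerate(TAGS, start=1)}


def to_vw(line):
    """Return 'label | text' for a usable line, or None if it must be skipped."""
    if line.count("\t") != 1:
        return None  # broken line: no tab or more than one tab
    text, tags = line.rstrip("\n").split("\t")
    matched = [tag for tag in tags.split() if tag in LABELS]
    if len(matched) != 1:
        return None  # none of our tags, or more than one of them
    # ':' and '|' are separators in the VW input format
    text = text.replace(":", "").replace("|", "").strip()
    return f"{LABELS[matched[0]]} | {text}"


print(to_vw("how to sort a list\tpython sorting\n"))
```

A full script would wrap this in a loop over the input file, write the non-`None` results to the output path, and count broken lines separately from lines skipped because of their tags.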

In [3]:
import os
from tqdm import tqdm
from time import time
import numpy as np
from sklearn.metrics import accuracy_score

You should have 4389054 lines in the preprocessed data file. As the log below shows, the preprocessing script gets through the 10 GB of data in roughly 1-2 minutes.

In [4]:
!python preprocess.py stackoverflow.10kk.tsv stackoverflow.vw

10000000it [01:20, 123690.31it/s]
4389054 lines selected, 15 lines corrupted.


Split the dataset into training, validation, and test sets in equal proportions, with 1463018 lines in each file. We don't need to shuffle the data: the first 1463018 lines must go into the training file `stackoverflow_train.vw`, the last 1463018 lines into the test file `stackoverflow_test.vw`, and the rest into the validation file `stackoverflow_valid.vw`.

Save answer vectors for validation and test sets into separate files: `stackoverflow_valid_labels.txt` and `stackoverflow_test_labels.txt`, respectively.

Do not hesitate to use the `head`, `tail`, `split`, `cat` and `cut` Linux utilities.
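If you prefer to stay in Python, the split can be sketched with `itertools.islice`, which reads the file as a stream. The function name `split_vw` and its argument order are hypothetical. The sketch relies on the input having exactly three times `n_each` lines, so that the third block is also the last `n_each` lines.

```python
from itertools import islice


def split_vw(src, n_each, train, valid, test, valid_labels, test_labels):
    """Write three consecutive blocks of n_each lines, plus labels for the last two."""
    with open(src) as f:
        # First block: training data, labels stay inside the file
        with open(train, "w") as out:
            out.writelines(islice(f, n_each))
        # Second and third blocks: validation and test data with separate label files
        for part_path, labels_path in ((valid, valid_labels), (test, test_labels)):
            with open(part_path, "w") as out, open(labels_path, "w") as lab:
                for line in islice(f, n_each):
                    out.write(line)
                    # In 'label | text' the label is the first space-separated token
                    lab.write(line.split(" ", 1)[0] + "\n")
```

For this assignment the call would be `split_vw("stackoverflow.vw", 1463018, "stackoverflow_train.vw", "stackoverflow_valid.vw", "stackoverflow_test.vw", "stackoverflow_valid_labels.txt", "stackoverflow_test_labels.txt")`.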

In [1]:
# Your code here

### 4. Training and validation of models

Train Vowpal Wabbit on `stackoverflow_train.vw` 9 times, once for each combination of the number of passes (1, 3, 5) and the n-gram size (n = 1, 2, 3).
The remaining parameters are `bit_precision=28` and `seed=17`. Don't forget to tell VW that we have a 10-class problem.
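One way to organize the nine runs is to build the `vw` command lines in Python and hand them to `subprocess.run`. The sketch below only constructs and prints the commands. The helper names and model file names are hypothetical, and it assumes that `seed` maps to the `--random_seed` flag and that the 10-class problem is expressed with `--oaa 10`.

```python
from itertools import product


def vw_train_cmd(train_path, model_path, passes, ngram):
    """Build the argument list for one training run (not executed here)."""
    cmd = ["vw", "--oaa", "10", "-d", train_path, "-f", model_path,
           "-b", "28", "--random_seed", "17",
           "--passes", str(passes), "--ngram", str(ngram), "--quiet"]
    if passes > 1:
        # VW needs a cache file for multiple passes; -k rebuilds it on every run
        cmd += ["-c", "-k"]
    return cmd


def vw_predict_cmd(data_path, model_path, pred_path):
    """Build the argument list for test-only prediction with a saved model."""
    return ["vw", "-t", "-i", model_path, "-d", data_path, "-p", pred_path, "--quiet"]


grid = list(product((1, 3, 5), (1, 2, 3)))  # (passes, ngram) pairs
for passes, ngram in grid:
    model = f"vw_model_p{passes}_n{ngram}.vw"  # hypothetical file name
    print(" ".join(vw_train_cmd("stackoverflow_train.vw", model, passes, ngram)))
```

Each list can be passed as is to `subprocess.run(cmd, check=True)` once `vw` is on the `PATH`.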

Evaluate the accuracy of each model on `stackoverflow_valid.vw`. Choose the model with the best parameters, and test it on the `stackoverflow_test.vw` set.
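Accuracy here is simply the share of predictions that match the saved labels. Below is a minimal sketch in plain Python, assuming the prediction file holds one class label per line as its first token. The helper names are hypothetical, and `accuracy_score` from scikit-learn computes the same quantity.

```python
def read_labels(path):
    """Read one integer class label per line (also accepts values like '3.0')."""
    with open(path) as f:
        return [int(float(line.split()[0])) for line in f if line.strip()]


def accuracy(y_true, y_pred):
    """Share of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction counts differ")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, `accuracy(read_labels("stackoverflow_valid_labels.txt"), read_labels(pred_path))` scores one model, where `pred_path` is whatever file the predictions were written to.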

In [None]:
# Your code here

**Question 1.** Which parameter set provides the best accuracy on the validation set `stackoverflow_valid.vw`?
- bigrams and 3 passes
- trigrams and 5 passes
- bigrams and 1 pass
- unigrams and 1 pass

Check the best (according to validation accuracy) model on the test set. 

In [2]:
# Your code here

**Question 2.** Compare the best validation accuracy with the test accuracy. Choose the correct answer (differences are measured in percentage points, i.e. a drop from 50% to 40% counts as 10%, not 20%).
- Test accuracy is lower by approx. 2%
- Test accuracy is lower by approx. 3%
- The difference is less than 0.5%

Train VW with parameters selected on the validation set, but first merge the training and validation sets. Evaluate the share of correct answers on the test set. 
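Merging the two files is what `cat` does in the shell. A Python sketch that streams the bytes without loading the files into memory is given below; the function name `concat_files` is hypothetical, and it assumes every input file ends with a newline.

```python
import shutil


def concat_files(paths, dst):
    """Append the files in `paths` to `dst`, in order, copying in chunks."""
    with open(dst, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

For example, `concat_files(["stackoverflow_train.vw", "stackoverflow_valid.vw"], merged_path)` builds the merged training file at a path of your choice.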

In [3]:
# Your code here

**Question 3.** How large is the gain after training with 2x the data (training `stackoverflow_train.vw` + validation `stackoverflow_valid.vw`) versus the model trained solely on `stackoverflow_train.vw`?
 - 0.1%
 - 0.4%
 - 0.8%
 - 1.2%

### 5. Conclusion

We have only just scratched the surface of Vowpal Wabbit in this assignment. Here are some hints on what to do next:
 - Multilabel classification (the `multilabel_oaa` argument): the data format matches this type of problem perfectly
 - Tuning VW parameters with hyperopt. VW developers say that the accuracy strongly depends on the gradient descent parameters (`initial_t` and `power_t`). We can also test different loss functions, e.g. train logistic regression and a linear SVM
 - Learning about factorization machines and their implementation in VW (the `lrq` argument)