# Limitations of word2vec
1. No sentence representations -- averaging the pre-trained word vectors of a sentence is a popular workaround, but it does not work very well.
2. No use of morphology -- words that share a stem do not share parameters -- e.g., disastrous and disaster are treated as 2 unrelated words, as are mangera and mangerai.

# Goal of fastText library
* Unified framework for:
  * Text representation
    * Word representation (with character-level features with character n-grams)
  * Text classification
* Core of the library: given a set of indices -> predict an index
* CBOW, skip-gram, and bag-of-words (BoW) text classification are instances of this model.
  * CBOW: given many words (the context), predict the word in the middle
  * skip-gram: given a word, predict a surrounding word
  * text-classification: given many words, predict the label
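
The "set of indices in, one index out" idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not fastText's implementation: the function names, the sum pooling, and the toy vectors are all made up here.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_index(input_indices, input_vectors, output_vectors):
    """Pool the input vectors (here: a plain sum) and score every output index."""
    dim = len(input_vectors[0])
    hidden = [sum(input_vectors[i][d] for i in input_indices) for d in range(dim)]
    scores = [sum(h * v for h, v in zip(hidden, out)) for out in output_vectors]
    return softmax(scores)

# Toy parameters (hypothetical): 4 input indices, 3 output indices, dimension 2.
input_vectors = [[0.1, 0.3], [0.4, -0.2], [-0.5, 0.2], [0.0, 0.6]]
output_vectors = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.4]]

# CBOW and text classification pass several indices; skip-gram passes one.
probs = predict_index([0, 1, 3], input_vectors, output_vectors)
single = predict_index([2], input_vectors, output_vectors)
print(probs)
print(single)
```

Only the meaning of the indices changes between the three tasks: context words and a target word for CBOW, a word and a context word for skip-gram, words and a label for classification.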

# The CBOW and the skipgram models
![cbow+skipgram](figs/cbow_and_skipgram.png)


### CBOW
- Given a context (i.e., the words surrounding a target word -- the `n` words preceding it and the `n` words following it), predict the word in the middle.
  ![cbow](figs/cbow.png)
- Model the probability of a `word` given a context:
  $$
  p(w|C) = \dfrac{e^{h^T_Cv_w}}{\sum_{k=1}^K e^{h^T_Cv_k}}
  $$
- where the feature for context $C$ is $h_C$,
- the classifier for word $w$ is $v_w$, and $K$ is the vocabulary size.
- Continuous `Bag of Words`: $h_C = \sum_{c\in C}x_c$, i.e., the sum of the context word vectors.



### Skip-gram
* You are given the word in the middle, predict the surrounding words.
  ![skipgram](figs/skipgram.png)
* Model probability of a `context word` given a word: $p(c|w) = \dfrac{e^{x^T_wv_c}}{\sum_{k=1}^Ke^{x^T_wv_k}}$
  * where, feature for word $w$: $x_w$,
  * classifier for word $c$: $v_c$
  * Word vectors, $x_w \in \mathbb{R}^d$
* Minimize a `negative log likelihood`: 
  * Given, a stream of words $(w_1, \cdots, w_t, \cdots, w_T)$
  * Objective: 
  $$
  \min_{x, v} \quad -\sum_{t=1}^T\sum_{c\in C_t} \log \dfrac{e^{x^T_{w_t}v_c}}{\sum_{k=1}^Ke^{x^T_{w_t}v_k}}
  $$
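  * The sum in the denominator runs over all $K$ words and is computationally intensive! The above sum hides `co-occurrence counts`: each (word, context word) pair contributes one term per co-occurrence.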
* Approximations to the loss:
  * Replace the multi-class loss by a set of binary logistic losses.
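
To make the cost difference concrete, here is a minimal sketch that compares the full softmax loss for one (word, context word) pair with a set of binary logistic losses. It assumes negatives are sampled uniformly, which is a simplification chosen for illustration; the vocabulary size, dimension, and random vectors are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def softmax_nll(x_w, v, c):
    """-log p(c | w) with the full softmax: touches all K output vectors."""
    scores = [dot(x_w, v_k) for v_k in v]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[c]

def negative_sampling_loss(x_w, v, c, negatives):
    """One positive binary logistic loss plus one per sampled negative."""
    loss = -log_sigmoid(dot(x_w, v[c]))
    for n in negatives:
        loss -= log_sigmoid(-dot(x_w, v[n]))
    return loss

random.seed(0)
K, d = 1000, 8  # hypothetical vocabulary size and vector dimension
v = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(K)]
x_w = [random.uniform(-0.5, 0.5) for _ in range(d)]
c = 42  # index of the observed context word
negatives = random.sample([k for k in range(K) if k != c], 5)

full = softmax_nll(x_w, v, c)
approx = negative_sampling_loss(x_w, v, c, negatives)
print(f"full softmax loss:      {full:.3f} (uses {K} output vectors)")
print(f"binary logistic losses: {approx:.3f} (uses {1 + len(negatives)} output vectors)")
```

The two losses are not numerically equal; the point is that the second one only needs a handful of dot products per training pair instead of $K$.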

## fastText
* Both models are instances of a broader set of models
* Different input and output dictionaries.
* Common core but different pooling strategies.
* Efficient and modular C++ implementation.
* Allows easy building of extensions by writing your own pooling.
* Ease of use is at the core of the library
  * `./fasttext supervised -input data/dbpedia.train -output data/dbpedia`
  * `./fasttext test data/dbpedia.bin data/dbpedia.test`
* Model probability of a `label` given a paragraph:
  * feature for paragraph $P$: $h_P$
  * classifier for label $l$: $v_l$
  $$
  p(l|P) = \dfrac{e^{h^T_Pv_l}}{\sum_{k=1}^Ke^{h^T_Pv_k}}
  $$
  * paragraph feature $h_P = \sum_{w\in P}x_w$
  * Word vectors are latent and not useful *per se*
  * If supervised data is scarce, use pre-trained word vectors.
* Represent each word as the sum of its `character n-grams` and the word itself.
  * Uses a hashed dictionary
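
As a rough sketch of the last point, the snippet below extracts character n-grams from a word wrapped in boundary markers and maps each one to a bucket with a 32-bit FNV-1a hash. The helper names and the parameter values (`minn`, `maxn`, `bucket`) are chosen for illustration, and the bucket ids are not guaranteed to match the ones the library computes.

```python
def char_ngrams(word, minn=3, maxn=6):
    """All character n-grams of '<word>' with minn <= n <= maxn."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

def fnv1a_32(text):
    """Standard 32-bit FNV-1a hash of the UTF-8 bytes of text."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def ngram_bucket(gram, bucket=2_000_000):
    """Map an n-gram to a row of a fixed-size (hashed) embedding table."""
    return fnv1a_32(gram) % bucket

print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Words with a common stem now share parameters through shared n-grams.
shared = set(char_ngrams("disaster")) & set(char_ngrams("disastrous"))
print(sorted(shared))
print({g: ngram_bucket(g) for g in sorted(shared)[:3]})
```

Hashing keeps memory bounded: the number of distinct n-grams can be huge, but the table never has more than `bucket` rows, at the cost of occasional collisions.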

# Now, let's code

### Installing fastText
Reference [https://fasttext.cc/docs/en/supervised-tutorial.html](https://fasttext.cc/docs/en/supervised-tutorial.html)
* `$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip`
* `$ unzip v0.9.2.zip`
* `$ cd fastText-0.9.2`
* `$ make`
* `$ pip install .`

Calling the `help` function shows high-level documentation of the library.

In [1]:
import fasttext

In [2]:
help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    cbow(*kargs, **kwargs)
    
    eprint(*args, **kwargs)
    
    load_model(path)
        Load a model given a filepath and return a model object.
    
    read_args(arg_list, arg_dict, arg_names, default_values)
    
    skipgram(*kargs, **kwargs)
    
    supervised(*kargs, **kwargs)
    
    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
    
    train_supervised(*kargs, **kwargs)
        Train a supervised model and return a model object.
        
        input must be a filepath. The input text does not need to be tokenized
        as per the tokenize function, but it must be preprocessed and encoded
        as UTF-8. You might wan

### Our first classifier

The dataset looks like this. Each line holds one question, preceded by one or more labels that carry the `__label__` prefix:

`__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?`

`__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments`

`__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?`

`__label__restaurant Michelin Three Star Restaurant; but if the chef is not there`

`__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?`
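
To make the file format explicit, here is a small sketch that splits one such line into its labels and its text. The helper `parse_line` is hypothetical and not part of fastText; it only assumes that labels are whitespace-separated tokens starting with `__label__`.

```python
def parse_line(line, prefix="__label__"):
    """Split a training line into (labels, text)."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(prefix):
            labels.append(token[len(prefix):])  # drop the prefix
        else:
            words.append(token)
    return labels, " ".join(words)

line = "__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?"
labels, text = parse_line(line)
print(labels)  # ['sauce', 'cheese']
print(text)    # How much does potato starch affect a cheese sauce recipe?
```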



* `$ head -n 12404 cooking.stackexchange.txt > cooking.stackexchange.12404.train`
* `$ tail -n 3000 cooking.stackexchange.txt > cooking.stackexchange.3000.valid`

In [3]:
import fasttext
model = fasttext.train_supervised(input='../datasets/cooking.stackexchange/cooking.stackexchange.12404.train')

Read 0M words
Number of words:  14543
Number of labels: 735
Progress: 100.0% words/sec/thread:   40933 lr:  0.000000 avg.loss: 10.033125 ETA:   0h 0m 0s


In [4]:
model.save_model('../model_files/model_cooking_stackexchange_fasttext.bin')

In [5]:
#Now, we can test the classifier
model.predict("Which baking dish is best to bake a banana bread?")

(('__label__baking',), array([0.06053093]))

In [6]:
model.predict("Why not put knives in the dishwasher?")

(('__label__baking',), array([0.06984674]))

In [7]:
model.test("../datasets/cooking.stackexchange/cooking.stackexchange.3000.valid")

#returns: num_of_samples, precision, recall

(3000, 0.14, 0.060544904137235116)

In [8]:
model.test("../datasets/cooking.stackexchange/cooking.stackexchange.3000.valid", k=5)

#returns: num_of_samples, precision@k, recall@k

(3000, 0.0672, 0.14530776992936428)
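
The trade-off visible in the two results above (precision drops and recall rises as `k` grows) follows from how the two numbers are computed. The sketch below is a plain-Python illustration on made-up predictions, assuming both metrics are pooled over all examples (correct predictions divided by all predictions, and by all gold labels, respectively).

```python
def precision_recall_at_k(predicted, gold, k):
    """Pooled precision@k and recall@k over a list of examples."""
    hits = n_predicted = n_gold = 0
    for pred, true in zip(predicted, gold):
        top_k = pred[:k]  # labels are assumed sorted by decreasing probability
        hits += len(set(top_k) & set(true))
        n_predicted += len(top_k)
        n_gold += len(true)
    return hits / n_predicted, hits / n_gold

# Hypothetical ranked predictions and gold labels for two questions.
predicted = [["baking", "oven", "bread"], ["equipment", "knives", "cleaning"]]
gold = [["baking", "bread"], ["equipment"]]

p1, r1 = precision_recall_at_k(predicted, gold, k=1)
p3, r3 = precision_recall_at_k(predicted, gold, k=3)
print(p1, r1)  # 2 of 2 predictions correct, 2 of 3 gold labels found
print(p3, r3)  # 3 of 6 predictions correct, 3 of 3 gold labels found
```

Predicting more labels per question can only find more of the gold labels, but most of the extra predictions are wrong when a question has only two or three true labels.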

## Making the model better
The model obtained by running fastText with the default arguments is pretty bad at classifying new questions. Let's try to improve the performance by changing the default parameters.

### Preprocessing the data
* Looking at the data, we observe that some words contain uppercase letters or punctuation.
* One of the first steps to improve the performance of our model is to apply some simple pre-processing.
* A crude normalization can be obtained using command line tools such as `sed` and `tr`:
  * `$ cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.stackexchange.preprocessed.txt`
  * `$ head -n 12404 cooking.stackexchange.preprocessed.txt > cooking.stackexchange.preprocessed.12404.train`
  * `$ tail -n 3000 cooking.stackexchange.preprocessed.txt > cooking.stackexchange.preprocessed.3000.valid`
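
For readers who prefer to stay in Python, the same crude normalization can be approximated with the `re` module. This is a sketch rather than an exact port of the shell pipeline: the function name `normalize` is made up, and only the punctuation characters listed in the `sed` expression are handled.

```python
import re

# Same punctuation characters as in the sed expression above.
_PUNCT = re.compile(r"([.!?,'/()])")

def normalize(line):
    """Surround punctuation with spaces, then lowercase everything."""
    return _PUNCT.sub(r" \1 ", line).lower()

print(normalize("How do I cover up the white spots?"))
print(normalize("__label__food-safety Dangerous pathogens"))
```

Hyphens and underscores are left alone, so the `__label__` prefix and labels such as `food-safety` survive unchanged.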

* Let's train a new model on the preprocessed data

In [9]:
model = fasttext.train_supervised(input='../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.12404.train')

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   44573 lr:  0.000000 avg.loss: 10.008055 ETA:   0h 0m 0s


Thanks to the pre-processing, the vocabulary is smaller (from about 14.5k words to about 9k). Precision at one also starts to go up, by about 3 percentage points (from 14.0% to 16.9%), as the test below shows.

In [11]:
model.test("../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.3000.valid")

#returns: num_of_samples, precision, recall

(3000, 0.16866666666666666, 0.0729421940320023)

### More epochs and larger learning rate
* By default, `fastText` sees each training example only five times during training, which is pretty small given that our training set has only about 12k training examples. The number of times each example is seen (also known as the number of epochs) can be increased using the `-epoch` option (the `epoch` argument in the Python API):

In [13]:
model = fasttext.train_supervised(input="../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.12404.train", epoch=25)

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   44510 lr:  0.000000 avg.loss:  7.193739 ETA:   0h 0m 0s


In [14]:
#Now, test the model
model.test("../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.3000.valid")

#returns: num_of_samples, precision, recall

(3000, 0.5136666666666667, 0.22214213637018884)

* This is much better! 
* Another way to change the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would mean that the model does not change at all, and thus, does not learn anything. Good values of the learning rate are in the range 0.1 - 1.0.
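
The training logs in this notebook end with `lr:  0.000000`, which is consistent with a learning rate that shrinks as training progresses. The sketch below assumes a simple linear decay from the initial value to zero; the function name and the token counts are hypothetical, and the library's exact schedule may differ in detail.

```python
def decayed_lr(initial_lr, tokens_seen, total_tokens):
    """Learning rate under an assumed linear decay to zero."""
    progress = tokens_seen / total_tokens
    return initial_lr * (1.0 - progress)

total = 1_000  # hypothetical number of tokens processed over all epochs
for seen in (0, 250, 500, 750, 1_000):
    print(f"progress {seen / total:4.0%}  lr = {decayed_lr(1.0, seen, total):.2f}")
```

Under this assumption, the `lr` argument sets the starting step size, and raising `epoch` stretches the same decay over more passes through the data.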

In [16]:
model = fasttext.train_supervised(input="../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.12404.train", lr=1.0, epoch=25)

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   44533 lr:  0.000000 avg.loss:  4.380679 ETA:   0h 0m 0s


In [17]:
#Now, test the model
model.test("../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.3000.valid")

#returns: num_of_samples, precision, recall

(3000, 0.581, 0.2512613521695257)

Better!

### Improving with word n-grams
* Finally, we can improve the performance of a model by using word bigrams, instead of just unigrams. This is especially important for classification problems where word order is important, such as sentiment analysis.
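
To see what word bigrams add, the sketch below lists the features a sentence would produce with unigrams only and with unigrams plus bigrams. The helper `word_ngrams` is illustrative; it returns the n-grams as strings, whereas a hashed dictionary (controlled by the `bucket` argument used later in this notebook) would store them as bucket ids.

```python
def word_ngrams(tokens, max_n=2):
    """All word n-grams with 1 <= n <= max_n, shortest first."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "not good at all".split()
print(word_ngrams(tokens, max_n=1))  # ['not', 'good', 'at', 'all']
print(word_ngrams(tokens, max_n=2))
# ['not', 'good', 'at', 'all', 'not good', 'good at', 'at all']
```

With unigrams alone, "not good" and "good" contribute the same `good` feature; the bigram `not good` lets the classifier tell them apart.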

In [18]:
model = fasttext.train_supervised(input="../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.12404.train", lr=1.0, epoch=25, wordNgrams=2)

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   44212 lr:  0.000000 avg.loss:  3.066071 ETA:   0h 0m 0s


In [19]:
#Now, test the model
model.test("../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.3000.valid")

#returns: num_of_samples, precision, recall

(3000, 0.6086666666666667, 0.2632261784633127)

* With a few steps, we were able to go from a precision at one of 14.0% to 60.9%. Important steps included:
    * preprocessing the data;
    * changing the number of epochs (using the option `-epoch`, standard range [5 - 50]);
    * changing the learning rate (using the option `-lr`, standard range [0.1 - 1.0]);
    * using word n-grams (using the option `-wordNgrams`, standard range [1 - 5]).

## Scaling things up
* Since we are training our model on a few thousand examples, training only takes a few seconds. But training models on larger datasets, with more labels, can start to be too slow. A potential way to make training faster is to use the hierarchical softmax instead of the regular softmax. This can be done with the option `-loss hs` (`loss='hs'` in the Python API):

In [20]:
model = fasttext.train_supervised(input="../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.12404.train", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='hs')

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread: 1438728 lr:  0.000000 avg.loss:  2.262096 ETA:   0h 0m 0s


Training should now take less than a second.
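
The speed-up comes from replacing one $K$-way decision with a short path of binary decisions in a tree over the labels. The sketch below builds a Huffman tree from label counts and reports how many binary decisions each label needs. It is a simplified illustration, assuming a frequency-based Huffman tree; the counts are made up and the code is not the library's implementation.

```python
import heapq
import math

def huffman_depths(counts):
    """Depth (number of binary decisions) of each label in a Huffman tree."""
    # Heap items: (total count, unique tie-breaker, label ids under this node).
    heap = [(c, i, [i]) for i, c in enumerate(counts)]
    heapq.heapify(heap)
    depths = [0] * len(counts)
    next_id = len(counts)
    while len(heap) > 1:
        c1, _, ids1 = heapq.heappop(heap)
        c2, _, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:
            depths[i] += 1  # every label under the merged node moves one level down
        heapq.heappush(heap, (c1 + c2, next_id, ids1 + ids2))
        next_id += 1
    return depths

# Frequent labels get short paths, rare labels get long ones.
print(huffman_depths([8, 4, 2, 1, 1]))  # [1, 2, 3, 4, 4]

# With 735 equally frequent labels, no path is longer than about log2(735).
uniform = huffman_depths([1] * 735)
print(max(uniform), math.log2(735))
```

Each training step then updates only the classifiers along one path instead of all 735 output vectors, which is why the gain grows with the number of labels.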

## Multi-label classification
* When we want to assign a document to multiple labels, we can still use the softmax loss and play with the prediction parameters, namely the number of labels to predict and the threshold for the predicted probability. However, playing with these arguments can be tricky and unintuitive, since the softmax probabilities must sum to 1.
* A convenient way to handle multiple labels is to use independent binary classifiers for each label. This can be done with `-loss one-vs-all` or `-loss ova` (`loss='ova'` in the Python API).
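
The difference between the two approaches shows up on a toy example. The scores below are hypothetical; the sketch only illustrates why a fixed threshold is awkward when the probabilities must sum to 1 and natural when each label has its own binary classifier.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scores: three labels fit the document well, one does not.
scores = [4.0, 3.5, 3.0, -2.0]

softmax_probs = softmax(scores)
ova_probs = [sigmoid(s) for s in scores]

print([round(p, 3) for p in softmax_probs], "sum =", round(sum(softmax_probs), 3))
print([round(p, 3) for p in ova_probs], "sum =", round(sum(ova_probs), 3))

threshold = 0.5
n_softmax = sum(p >= threshold for p in softmax_probs)
n_ova = sum(p >= threshold for p in ova_probs)
print(n_softmax, n_ova)
```

With the softmax, the three good labels compete for a shared probability mass, so only one of them clears the threshold; with independent sigmoids, all three do.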

In [21]:
model = fasttext.train_supervised(input="../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.12404.train", lr=0.5, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='ova')

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   82759 lr:  0.000000 avg.loss:  4.038188 ETA:   0h 0m 0s


* With this loss, it is a good idea to use a lower learning rate than with the other loss functions (here `lr=0.5` instead of `1.0`).

In [22]:
#Now, test the model
model.test("../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.3000.valid")

#returns: num_of_samples, precision, recall

(3000, 0.6063333333333333, 0.2622170967276921)

In [24]:
#Now, test the model
model.test("../datasets/cooking.stackexchange/cooking.stackexchange.preprocessed.3000.valid", k=-1)

#returns: num_of_samples, precision, recall

(3000, 0.003146031746031746, 1.0)

Now let's have a look at our predictions. We want as many predictions as possible (argument `k=-1`), and we want only labels with a probability higher than or equal to 0.5:

In [23]:
model.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)

(('__label__baking',
  '__label__bread',
  '__label__equipment',
  '__label__bananas'),
 array([1.00001001, 0.98935753, 0.97069776, 0.83974397]))
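
The effect of `k` and `threshold` can be mimicked on a plain dictionary of label probabilities. In the sketch below, the first four values are copied from the output above, the last two are hypothetical low-probability labels added for contrast, and `select_labels` is an illustrative helper, not a fastText function.

```python
def select_labels(scored, k=-1, threshold=0.0):
    """Keep labels with probability >= threshold, best first, at most k (-1 = all)."""
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    kept = [(label, p) for label, p in ranked if p >= threshold]
    return kept if k == -1 else kept[:k]

scored = {
    "baking": 1.00001001,
    "bread": 0.98935753,
    "equipment": 0.97069776,
    "bananas": 0.83974397,
    "oven": 0.21,     # hypothetical
    "storage": 0.03,  # hypothetical
}

print(select_labels(scored, k=-1, threshold=0.5))
print(select_labels(scored, k=2, threshold=0.5))
print(len(select_labels(scored)))
```

With `k=-1` the threshold alone decides how many labels come back, which suits the one-vs-all loss because each probability can be read on its own.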

We can also evaluate our results with the test function: