# FastText - Word Vectors & Text Classification

Hello and welcome to the IAA FastText tutorial!

In this tutorial, we will dive into the following topics:

- Word vectors and word space arithmetic
- Text classification

You will be able to experiment freely with word vectors and then move on to classifying text into categories. So, let's get started!

We'll be using FastText from Facebook throughout this tutorial. FastText is a highly optimized open-source tool that serves the following three purposes:

- Learning vector representations for words. See [this paper](https://arxiv.org/pdf/1301.3781.pdf).
- Classifying text into categories. See [this paper](https://arxiv.org/pdf/1607.01759.pdf).
- Compressing these models to work on mobile devices. See [this paper](https://arxiv.org/pdf/1612.03651.pdf).

To start working on this notebook, please **select the Python 3 kernel** via Kernel > Change kernel.

## Word representations via Skipgram model

In 2013, the research group around Tomas Mikolov at Google [introduced](https://arxiv.org/pdf/1301.3781.pdf) two models for learning vector representations for words from very large data sets. In this tutorial, we will concentrate on the Skipgram model.

The skipgram model is surprisingly simple: given the current word in a sentence, it predicts the words surrounding it. Due to limited time in this tutorial, we cannot explain the model in detail. Please refer to [the original paper](https://arxiv.org/pdf/1301.3781.pdf), [our presentation in the team meeting](https://orangesharing.com/confluence/download/attachments/35651640/DOCLA%20FINAL_v3.pptx?version=1&modificationDate=1510305214047&api=v2) and [this excellent blog post](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/).
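To make the idea a little more concrete, here is a minimal sketch of how (current word, surrounding word) training pairs could be generated from a sentence. The function name `skipgram_pairs` and the fixed `window` argument are our own choices for illustration; real implementations add further tricks (for example subsampling of frequent words) that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the context window at the sentence boundaries
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

Each pair is one training example: the model sees the first word and is asked to predict the second.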

Facebook provides pre-trained models trained on Wikipedia. We have prepared these models in this workspace. So, let's load them!

In [1]:
import numpy as np
import subprocess
from scipy import spatial
%pylab inline
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import TSNE

import fastText

Populating the interactive namespace from numpy and matplotlib


In [3]:
vec_model = fastText.load_model("/home/iaa/lib/fastText/data/wiki.en.bin")

We have now loaded the pre-trained skipgram model. Note that this model was trained using sub-word information, meaning that a word is composed of a set of character n-grams. We can see the n-grams a word is made up of using the `get_subwords` function:

In [4]:
vec_model.get_subwords("wikipedia")

(['wikipedia',
  '<wi',
  '<wik',
  '<wiki',
  '<wikip',
  'wik',
  'wiki',
  'wikip',
  'wikipe',
  'iki',
  'ikip',
  'ikipe',
  'ikiped',
  'kip',
  'kipe',
  'kiped',
  'kipedi',
  'ipe',
  'iped',
  'ipedi',
  'ipedia',
  'ped',
  'pedi',
  'pedia',
  'pedia>',
  'edi',
  'edia',
  'edia>',
  'dia',
  'dia>',
  'ia>'],
 array([    104, 3464641, 4459358, 3986705, 4499551, 3641052, 2981995,
        4003405, 4022108, 3360838, 3088430, 3734365, 2960415, 3583541,
        4103636, 3494260, 2885315, 3309957, 3864535, 3689398, 3311169,
        3519608, 3537807, 3432822, 4513568, 3119881, 3214276, 3085910,
        3969639, 2531043, 2799581]))

In the previous output, the `<` and `>` indicate the beginning and end of a word.
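The n-gram list above can be reproduced with a few lines of plain Python. This is only a sketch: the lengths 3 to 6 are read off the output above rather than taken from the model's configuration, the helper name `char_ngrams` is ours, and the sketch works on Python characters and ignores the hashing that produces the integer ids.

```python
def char_ngrams(word, minn=3, maxn=6):
    """List the character n-grams of a word wrapped in < and > markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            end = start + n
            if end > len(wrapped):
                break  # longer n-grams would run past the end marker
            ngrams.append(wrapped[start:end])
    return ngrams


ngrams = char_ngrams("wikipedia")
print(len(ngrams), ngrams[:4], ngrams[-1])
```

Together with the full word itself, these 30 n-grams account for the 31 entries returned by `get_subwords`.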

The vector representation of a word can be queried using the `get_word_vector` function:

In [5]:
vec_model.get_word_vector("wikipedia")

array([ -3.69488209e-01,  -3.06198329e-01,  -4.78578627e-01,
         3.66119631e-02,  -3.71713400e-01,  -1.29155189e-01,
         3.08121353e-01,  -6.10210359e-01,  -4.26935256e-01,
         2.64354289e-01,   1.28045790e-02,   3.42148066e-01,
         6.46890253e-02,   5.30238673e-02,  -2.96356902e-02,
        -2.14672059e-01,  -1.80559069e-01,   1.27842560e-01,
         5.68398312e-02,   4.88140643e-01,  -2.61829883e-01,
         5.56036174e-01,   2.24148571e-01,   2.22534016e-01,
        -1.85583994e-01,  -1.10030048e-01,  -1.90432772e-01,
        -3.12642530e-02,  -3.27409059e-02,  -1.52658150e-01,
        -1.69946402e-01,  -1.20262094e-01,  -3.58753651e-01,
         3.11426520e-01,  -4.05167937e-01,  -3.54144722e-01,
        -2.18132645e-01,   3.32609594e-01,  -1.36299971e-02,
         8.77231508e-02,  -4.32297498e-01,  -3.00237298e-01,
        -1.92881003e-01,  -1.04065567e-01,   8.00600201e-02,
         5.41281223e-01,  -1.00463189e-01,  -1.50544688e-01,
         7.27532152e-03,

The model has been trained with the tuning parameter `dim` set to 300. Hence, a word vector is just a vector of 300 numbers in continuous space.

**<span style="color:red">Please note that you should lowercase all words because the model has been trained with lowercase input!</span>**

## Word vector arithmetic in continuous space

By now, you may have seen visualizations of word vectors and word relationships on several occasions. But does this actually work?

In [6]:
wine = vec_model.get_word_vector("wine")
beer = vec_model.get_word_vector("beer")
france = vec_model.get_word_vector("france")
germany = vec_model.get_word_vector("germany")
cognac = vec_model.get_word_vector("cognac")

In [7]:
cosine_similarity((wine - france).reshape(1, -1), (beer - germany).reshape(1, -1))

array([[ 0.61361015]], dtype=float32)

In [8]:
cosine_similarity((wine - france).reshape(1, -1), (cognac - germany).reshape(1, -1))

array([[ 0.37843513]], dtype=float32)

The similarity between (wine - france) and (beer - germany) is much higher than between (wine - france) and (cognac - germany)!
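If you wonder what `cosine_similarity` computes, it is the cosine of the angle between two vectors: their dot product divided by the product of their lengths. Here is a minimal sketch in plain Python; the helper name `cosine` and the toy 2-d vectors are invented for illustration and have nothing to do with the pre-trained model.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only
same_direction = cosine([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])
opposite = cosine([1.0, 0.0], [-4.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Values close to 1 mean the two vectors point in almost the same direction, values around 0 mean they are unrelated, and negative values mean they point in opposite directions. The length of the vectors does not matter.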

We can also do more arithmetic in conjunction with a tree search to search for the solution of an equation.

For example, let's try to find what country would be equivalent to Germany without beer but with wine!

In [9]:
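# Restrict the search to the first 100,000 words of the vocabulary
word_list = vec_model.get_words()[:100000]

# One vector per word, in the same order as word_list
A = [vec_model.get_word_vector(word) for word in word_list]

# Build a k-d tree for nearest-neighbour queries (Euclidean distance)
T = spatial.KDTree(A)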

In [10]:
def find_closest(formula, k=5):
    dists, indices = T.query(formula, k=k)
    words = [word_list[i] for i in indices]
    
    return dists, indices, words

In [11]:
find = vec_model.get_word_vector("germany") - vec_model.get_word_vector("beer") + vec_model.get_word_vector("wine")

In [12]:
find_closest(find)

(array([ 4.06677678,  4.72205323,  4.82766514,  5.14881981,  5.19047626]),
 array([ 582,  947,  479, 2306, 1128]),
 ['germany', 'italy', 'france', 'switzerland', 'spain'])

The closest match is still Germany itself, but the second and third matches are already Italy and France! Feel free to try something else!
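Since the query word itself usually ends up as the closest match, a common variation is to exclude the words used in the formula and to rank by cosine similarity instead of Euclidean distance. The following sketch shows this on a tiny vocabulary; all vectors are invented 2-d toy values and the helper names are ours, so it runs without the pre-trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def closest_words(target, vectors, exclude=(), k=3):
    """Rank words by cosine similarity to target, skipping excluded words."""
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word not in exclude]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Invented toy vectors: axis 0 ~ "is a country", axis 1 ~ "wine (+) vs. beer (-)"
toy = {
    "germany": [1.0, -1.0],
    "france": [1.0, 1.0],
    "italy": [0.9, 0.8],
    "beer": [0.0, -1.0],
    "wine": [0.0, 1.0],
}

# germany - beer + wine, computed component by component
target = [g - b + w for g, b, w in zip(toy["germany"], toy["beer"], toy["wine"])]
result = closest_words(target, toy, exclude={"germany", "beer", "wine"}, k=2)
print(result)
```

With the real model, the same idea would amount to dropping "germany", "beer" and "wine" from the results of `find_closest`.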

## Text Classification with fastText

In this section, we use fastText to train a classification model on the AG News corpus that categorizes news articles into the following four classes:

- World (index 1)
- Sports (index 2)
- Business (index 3)
- Science & Tech (index 4)

The corpus is comprised of one million news articles. Note that for performance reasons and the limited time during the workshop, we are not using pre-trained word vectors in this section.


In [13]:
!head -n3 ~/lib/fastText/data/ag_news.train

__label__2 , california beats n . carolina 9-2 in llws ( ap ) , ap - danny leon hit a two-run homer , and tyler carp and john lister added solo homers to lead conejo valley little league to a 9-2 victory over morganton , n . c . , on sunday in the little league world series . 
__label__3 , stocks open lower as oil nears \$50/barrel , us stocks opened lower on monday as investor concerns were fueled by oil prices hitting a fresh record high , marching closer to \$50 a barrel , on supply concerns emerging from nigeria . 


As can be seen, the data set has already been pre-processed to a data format suitable for fastText. For supervised classification models, fastText provides a `train_supervised` function:

In [14]:
help(fastText.train_supervised)

Help on function train_supervised in module fastText.FastText:

train_supervised(input, lr=0.1, dim=100, ws=5, epoch=5, minCount=1, minCountLabel=0, minn=0, maxn=0, neg=5, wordNgrams=1, loss='softmax', bucket=2000000, thread=12, lrUpdateRate=100, t=0.0001, label='__label__', verbose=2, pretrainedVectors='', saveOutput=0)
    Train a supervised model and return a model object.
    
    input must be a filepath. The input text does not need to be tokenized
    as per the tokenize function, but it must be preprocessed and encoded
    as UTF-8. You might want to consult standard preprocessing scripts such
    as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html
    
    The input file must must contain at least one label per line. For an
    example consult the example datasets which are part of the fastText
    repository such as the dataset pulled by classification-example.sh.



Its most important arguments are as follows:

- `lr`: the learning rate of the neural network
- `dim`: the dimension of the embedding vectors to be trained; if using pre-trained embeddings, the dimension should match the dimension of these vectors
- `ws`: size of the context window
- `epoch`: the number of times to go over the training data
- `minCount`: minimum number of occurrences a word needs to be included in the vocabulary
- `wordNgrams`: the maximum n in the word n-grams to compute
- `bucket`: number of buckets for feature hashing
- `minn`: minimum length of char ngrams (for subword embeddings)
- `maxn`: maximum length of char ngrams (for subword embeddings)
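To give a feeling for how `wordNgrams` and `bucket` interact, here is a small sketch of feature hashing for word n-grams. It rests on a few assumptions: as far as we understand, single words keep their own vocabulary entries and only the longer n-grams are hashed, so the sketch hashes n-grams with n >= 2 only. The CRC32 hash is merely a stand-in and not the hash function fastText uses, and the helper names are hypothetical.

```python
import zlib


def word_ngrams(tokens, max_n):
    """All contiguous word n-grams for n = 2 .. max_n."""
    return [" ".join(tokens[i:i + n])
            for n in range(2, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def bucket_id(ngram, num_buckets):
    """Map an n-gram to a bucket index (CRC32 is a stand-in hash)."""
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets


tokens = "nuclear test in north korea".split()
grams = word_ngrams(tokens, max_n=2)
ids = [bucket_id(g, num_buckets=10) for g in grams]
print(list(zip(grams, ids)))
```

Different n-grams can land in the same bucket and then share one embedding. A larger `bucket` value reduces such collisions at the cost of a bigger model.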

Let's train our first model! Note that the Python API does not yet fully support everything fastText provides and only supports file-based data sets.
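Because training reads from a file, your own data first has to be written to disk with one example per line and the label marked by the `__label__` prefix. The following is a minimal sketch of that step. The examples, the helper name `to_fasttext_line` and the temporary file are hypothetical, and the AG News file used in this tutorial was prepared with its own preprocessing, which this sketch does not try to reproduce.

```python
import os
import tempfile


def to_fasttext_line(label, text, prefix="__label__"):
    """Format one example as '<prefix><label> <lowercased, single-spaced text>'."""
    return "{}{} {}".format(prefix, label, " ".join(text.lower().split()))


# Hypothetical examples: (class index, raw text)
examples = [
    (2, "California beats N. Carolina 9-2"),
    (3, "Stocks open  lower as oil nears record"),
]

# Write one example per line to a temporary training file
path = os.path.join(tempfile.mkdtemp(), "toy.train")
with open(path, "w", encoding="utf-8") as f:
    for label, text in examples:
        f.write(to_fasttext_line(label, text) + "\n")

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

A file written this way could then be passed as the `input` argument of `train_supervised`.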

In [45]:
model = fastText.train_supervised("/home/iaa/lib/fastText/data/ag_news.train", lr=0.5, minn=1, maxn=1, wordNgrams=5, minCount=1, bucket=10000000)

In [41]:
model.save_model("model")

In [42]:
!~/lib/fastText/fasttext test ~/proj/ing_tutorials/iaa_fasttext/model ~/lib/fastText/data/ag_news.test

N	7600
P@1	0.915
R@1	0.915
Number of examples: 7600
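P@1 and R@1 stand for precision and recall when only the top prediction is considered. As a rough sketch of how these numbers come about (the helper name and the toy predictions are ours, not actual model output): precision divides the number of correct predictions by the number of predictions made, recall divides it by the number of true labels.

```python
def precision_recall_at_k(predicted, gold, k=1):
    """predicted: ranked label lists; gold: sets of true labels per example."""
    hits = n_predicted = n_gold = 0
    for ranked, truth in zip(predicted, gold):
        top_k = ranked[:k]
        hits += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    return hits / n_predicted, hits / n_gold


# Toy data, invented for illustration: 3 of 4 top predictions are correct
predicted = [["__label__1"], ["__label__3"], ["__label__2"], ["__label__4"]]
gold = [{"__label__1"}, {"__label__3"}, {"__label__4"}, {"__label__4"}]
precision, recall = precision_recall_at_k(predicted, gold)
print(precision, recall)
```

Every article in our data set has exactly one label and we ask for exactly one prediction, so both denominators are the same. That is why P@1 and R@1 are identical in the output above.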


With a precision of 0.915, the model already works quite well. However, when we inspect it further, we find the following problem:

In [65]:
model.predict("nuclear test in north korea")

(('__label__1',), array([ 1.]))

In [66]:
model.predict("nuclear?? test in north korea??")

(('__label__3',), array([ 0.81640625]))

### Task 1

Modify the model to use sub-word embeddings by setting the `maxn` parameter to a value greater than 1.

In [None]:
## Your code goes here
model2 = fastText.train_supervised("/home/iaa/lib/fastText/data/ag_news.train", lr=0.5, minn=1, maxn=?, wordNgrams=5, minCount=1, bucket=10000000)
##

Rerun the predictions for `nuclear test in north korea` and `nuclear?? test in north korea??`. Can you explain why the results differ from those of the previous model?

In [75]:
model2.predict("nuclear test in north korea")

(('__label__1',), array([ 1.]))

In [76]:
model2.predict("nuclear?? test in north korea??")

(('__label__1',), array([ 0.99804687]))

### Task 2

Feel free to experiment with fastText. What is the best performance you can achieve? If you want, you can also compare it to other approaches!

In [73]:
## Your code goes here

##