# NLP Tutorial : fastText

### What is fastText ?
**fastText** refers to two different but related algorithms from Facebook Research: a word embedding technique and a text classification technique built on top of it. The name "fastText" may refer to either one. Let's quickly talk about the embedding technique.

Unlike other classic word embedding techniques such as Word2vec or GloVe, fastText can work at the **character level**. This means that instead of learning embeddings only for words or word N-grams (N successive words), fastText is able to learn embeddings for **character N-grams**. For example:

- Instead of learning a word representation for "thanks", the algorithm may learn one for "< thank", "thanks" and "hanks >"


- Note that the word representation for the character N-gram "hanks >" will be different from the one of the word "< hanks >" from Tom Hanks


- Let's assume the model has never seen the word "< thankful >". It is then able to produce a good embedding using the word vector for "< thank".
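
To make the idea concrete, here is a minimal sketch of how a word can be split into character N-grams. It assumes the `<` and `>` boundary markers and the 3-to-6 character range described in [1]. The function name `char_ngrams` is hypothetical and is not part of the fastText API.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character N-grams of `word`, including boundary markers."""
    token = "<" + word + ">"  # mark the start and the end of the word
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(token) - n + 1):
            ngrams.append(token[start:start + n])
    return ngrams


# N-grams shared by a known word and an unseen, related word
shared = set(char_ngrams("thank")) & set(char_ngrams("thankful"))
print(sorted(shared))
```

The shared N-grams (including "<thank") are what lets the model build a vector for a word it has never seen.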

Using these word embeddings along with many optimization tricks, the fastText classification model trains a simple linear model that is extremely fast both at training and prediction time.
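
The classifier described in [2] averages the embeddings of the input tokens and feeds the result to a linear layer followed by a softmax. The sketch below illustrates that computation only. The names (`embeddings`, `weights`, `predict_proba`) and all the numbers are made up for illustration, and none of fastText's optimization tricks are included.

```python
import math

# Toy 2-dimensional embeddings (made-up values, for illustration only)
embeddings = {
    "this": [0.1, 0.0],
    "is": [0.0, 0.1],
    "amazing": [0.9, -0.4],
    "horrible": [-0.8, 0.7],
}

# One weight vector per class (made-up values)
weights = {
    "__label__positive": [1.5, -1.0],
    "__label__negative": [-1.5, 1.0],
}


def predict_proba(tokens):
    """Average the token embeddings, then apply a linear layer and a softmax."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no known token in the input")
    dim = len(known[0])
    mean = [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
    scores = {
        label: sum(w * x for w, x in zip(w_vec, mean))
        for label, w_vec in weights.items()
    }
    # Softmax, shifted by the max score for numerical stability
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


print(predict_proba("this is amazing".split()))
```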

For more information about the theory, please refer to the following papers:

[[1]](https://arxiv.org/abs/1607.04606) P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information  
[[2]](https://arxiv.org/abs/1607.01759) A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

### How to use fastText for text classification ?

fastText was developed to be used through a command line interface, but thankfully there is an official Python wrapper that allows us to train, load, compress and save models. This tutorial is written in Python 2.7, but everything should work in Python 3. The only thing to be cautious about is that all the strings are UTF-8 encoded.

Requirements: to use fastText we will need to install the following packages: `numpy`, `scipy`, `pybind11` and `fasttext`.

- First we install the dependencies by running in a terminal `pip install numpy scipy pybind11`


- Then we install the actual fastText package from its repository: `pip install git+https://github.com/facebookresearch/fastText.git`

Now let's try to import fastText:

In [1]:
import fastText

It works ! Let's move on to actually using this library.

## fastText : using a pre-trained model

fastText provides us with many models that were pre-trained on different tasks. These tasks range from **sentiment analysis** through **topic classification** to **language identification**. To access the list of available models please refer to [this page](https://github.com/facebookresearch/fastText/blob/master/docs/supervised-models.md). Every pre-trained model is available in two versions: a more accurate but larger version (.bin) and a compressed, lighter but slightly less accurate version (.ftz).

Let's load the compressed **sentiment analysis** classifier that was trained on **Amazon reviews**. First we need to download the model by running:  
`wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_full.ftz`

Once downloaded, we can load the model using the `load_model` function:

In [2]:
import os


MODEL_DIR_PATH = "yourpath/pretrained_models/fasttext/"

model = fastText.load_model(os.path.join(MODEL_DIR_PATH, "amazon_review_full.ftz"))

That's it ! We now have a model that is able to perform sentiment analysis on any English text you have.

Let's try it:

In [3]:
model.predict("This tutorial is amazing.")

((u'__label__3',), array([0.84078544]))

The model outputs two things, a label and a number. The label is the predicted category and it depends on the task the model was trained on. Here we used a model trained on the amazon reviews dataset which learned to predict a sentiment value ranging from 1 to 5. The other output is a probability. The higher it is, the more the model is confident in its prediction.

In this example we see that the model predicts a sentiment of 3 (meaning a rather neutral sentence) with a confidence of 84%. This is weird because the sentence is clearly positive. So what happened ?

The problem is that we didn't pre-process the sentence properly. In fact, all these pre-trained models were trained on texts that were pre-processed so that all the strings are lowercased and all the punctuation is removed. Let's write a quick pre-processing function.

In [4]:
import re
import string
maketrans = string.maketrans


def clean_text(text):
    """
    Applies some pre-processing to clean text data.
    
    In particular:
    - lowers the string
    - removes the character [']
    - replaces punctuation characters with spaces

    """
    
    text = text.lower()

    text = re.sub(r"\'", "", text)  # remove the character [']

    # removing the punctuation
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    split = " "

    if isinstance(text, unicode):
        translate_map = dict((ord(c), unicode(split)) for c in filters)
        text = text.translate(translate_map)
    elif len(split) == 1:
        translate_map = maketrans(filters, split * len(filters))
        text = text.translate(translate_map)
    else:
        for c in filters:
            text = text.replace(c, split)

    return text
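
The function above relies on Python 2 names (`unicode`, `string.maketrans`). For readers on Python 3, a sketch of an equivalent could look like the following. The name `clean_text_py3` is hypothetical, and the sketch assumes the input is a `str`.

```python
# Same characters as the `filters` string used in the function above
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Map every filtered character to a space and delete the apostrophe
TRANSLATION_TABLE = str.maketrans(FILTERS, " " * len(FILTERS), "'")


def clean_text_py3(text):
    """Lowercase the text, drop apostrophes and replace punctuation with spaces."""
    return text.lower().translate(TRANSLATION_TABLE)


print(repr(clean_text_py3("This tutorial is amazing.")))
```

Note that the final period becomes a trailing space, exactly as in the Python 2 version.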

Now let's define a prediction function that pre-processes the input text before feeding it to the model:

In [5]:
predict_sentiment = lambda s: model.predict(clean_text(s))

And let's retry the previous example:

In [6]:
predict_sentiment("This tutorial is amazing.")

((u'__label__4',), array([0.52384621]))

This is better! The model predicts a more positive sentiment, but it still does so with a low confidence... This may be because of the word "tutorial" that might not be in the model's vocabulary. Let's try again without the word "tutorial":

In [7]:
predict_sentiment("This is amazing.")

((u'__label__5',), array([0.84315604]))

We were right! We now predict a highly positive sentiment with a pretty good confidence. Let's try some other sentences before we move on:

In [8]:
predict_sentiment("This tutorial is okay.")  # neutral

((u'__label__3',), array([0.81006908]))

In [9]:
predict_sentiment("This tutorial is horrible.")  # highly negative

((u'__label__1',), array([0.95363003]))
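
The earlier guess that "tutorial" is missing from the model's vocabulary can be checked rather than assumed. The sketch below works on any collection of words. `find_unknown_tokens` is a hypothetical helper, the vocabulary is a toy list, and the commented line assumes the Python wrapper exposes the model's word list through `model.get_words()`.

```python
def find_unknown_tokens(sentence, vocabulary):
    """Return the whitespace-separated tokens of `sentence` missing from `vocabulary`."""
    known = set(vocabulary)
    return [token for token in sentence.split() if token not in known]


# Toy vocabulary for illustration; with a loaded model one could try:
# vocabulary = model.get_words()
vocabulary = ["this", "is", "amazing", "okay", "horrible"]
print(find_unknown_tokens("this tutorial is amazing", vocabulary))
```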

## fastText : train a fresh model

Let's see how we can train our own fastText classifier on our own data. The process is very simple. The only thing we need to do is to create a file where each sentence is on a separate line, and each line starts with "\__label\__" followed by the category. For example:

- Assume we have a dataset with samples (review, sentiment)
- A sample of this dataset is ("This pizza is amazing!", "1")
- The corresponding line in the file should be: \__label\__1 this pizza is amazing

Note that we pre-processed the text before making our training data file. This is not strictly necessary, but it is highly recommended and usually improves the performance of the model.
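
The line format described above can be sketched with a small helper. The name `to_fasttext_line` is hypothetical, and the text is assumed to be already pre-processed.

```python
def to_fasttext_line(label, text):
    """Build one training line: the label prefix, the label, then the text."""
    # Collapse repeated whitespace so the example stays on a single clean line
    return "__label__{} {}".format(label, " ".join(text.split()))


print(to_fasttext_line("1", "this pizza is amazing "))
```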

Let's train a fastText classifier on the [IMDB movie reviews dataset](http://ai.stanford.edu/~amaas/data/sentiment/):

In [10]:
import pandas as pd


DATA_DIR_PATH = "path_to_IMDB_csv_files"

imdb_train = pd.read_csv(os.path.join(DATA_DIR_PATH, "imdb_reviews.train.csv"))
imdb_test = pd.read_csv(os.path.join(DATA_DIR_PATH, "imdb_reviews.test.csv"))

A preview of the training data:

In [11]:
imdb_train.head()

|   | sentence | sentiment | polarity |
|---|---|---|---|
| 0 | This show is wonderful. It has some of the bes... | 10 | 1 |
| 1 | In fact, Marc Blitzstein's off-Broadway adapta... | 4 | 0 |
| 2 | This so called remake is terrible. I went to s... | 1 | 0 |
| 3 | This is a nice little movie with a nice story,... | 8 | 1 |
| 4 | I managed to sneak away one night and go to th... | 1 | 0 |


This dataset has two sentiment columns, one ranging from 1 to 10 and another binarized into 0-1. Let's use the `polarity` column. We need to change this column by prepending the word "\__label\__" to it.

In [12]:
imdb_train.polarity = imdb_train.polarity.apply(lambda v: "__label__" + str(v))

Then we also need to pre-process the reviews:

In [13]:
imdb_train.sentence = imdb_train.sentence.apply(lambda s: clean_text(s))

Let's have a look at these columns before writing them into our training file:

In [14]:
imdb_train[["polarity", "sentence"]].head()

|   | polarity | sentence |
|---|---|---|
| 0 | __label__1 | this show is wonderful it has some of the bes... |
| 1 | __label__0 | in fact marc blitzsteins off broadway adaptat... |
| 2 | __label__0 | this so called remake is terrible i went to s... |
| 3 | __label__1 | this is a nice little movie with a nice story ... |
| 4 | __label__0 | i managed to sneak away one night and go to th... |


Don't forget to use UTF-8 encoding !

In [15]:
imdb_train[["polarity", "sentence"]].to_csv(
    "yourpath/imdb.train",
    sep=" ", encoding="utf-8",
    index=False, header=None
)
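
Depending on the quoting settings, `to_csv` with `sep=" "` may wrap the sentences in double quotes because they contain the separator, so it is worth opening the file to check. An alternative sketch is to write the lines directly. `write_training_file` is a hypothetical helper, and the two rows below are toy data standing in for the `polarity` and `sentence` columns.

```python
import io
import os
import tempfile


def write_training_file(path, labels, sentences):
    """Write one `label sentence` pair per line, UTF-8 encoded."""
    with io.open(path, "w", encoding="utf-8") as f:
        for label, sentence in zip(labels, sentences):
            f.write(u"{} {}\n".format(label, sentence))


# Toy rows for illustration
labels = ["__label__1", "__label__0"]
sentences = ["this show is wonderful", "this so called remake is terrible"]

path = os.path.join(tempfile.mkdtemp(), "toy.train")
write_training_file(path, labels, sentences)

# Read the file back to check that every line has the expected shape
with io.open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```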

Now that we have our training file, we can use the `train_supervised` function to train a fastText classifier. This function accepts many training parameters; we will just use the ones from the fastText online documentation:

In [16]:
import time


t0 = time.time()
model = fastText.train_supervised(
    input="yourpath/imdb.train", epoch=25,
    lr=1.0, wordNgrams=2, verbose=2, minCount=1)

print("The training took {:.2f} seconds".format(time.time() - t0))

The training took 31.76 seconds


That's it: we trained a whole classifier on 25,000 English reviews in about 30 seconds. This is much faster than typical deep-learning-based classification techniques, but does it work as well ?

Let's pre-process our test sentences and predict their sentiment polarities:

In [17]:
test_sentences = imdb_test.sentence.apply(lambda s: clean_text(s).decode("utf-8")).values

In [18]:
t0 = time.time()
predicted_labels, probabilities = model.predict(list(test_sentences))

print("Prediction took {:.2f} seconds".format(time.time() - t0))

Prediction took 3.79 seconds


We need to quickly convert the predicted labels (strings that begin with \__label\__) to numerical polarities:

In [19]:
import numpy as np


predicted_labels = np.vectorize(lambda s: int(s[-1]))(np.array(predicted_labels).ravel())
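
Taking the last character with `s[-1]` works here because the labels are the single digits 0 and 1. For label sets with longer names (for example ratings up to 10), a sketch that strips the prefix instead could look like this. `label_to_int` is a hypothetical helper and the predictions are toy values.

```python
LABEL_PREFIX = "__label__"


def label_to_int(label):
    """Convert a fastText label such as '__label__10' to the integer 10."""
    if not label.startswith(LABEL_PREFIX):
        raise ValueError("unexpected label: {!r}".format(label))
    return int(label[len(LABEL_PREFIX):])


# Toy predictions: one sequence of labels per input sentence
toy_predictions = [("__label__1",), ("__label__0",), ("__label__10",)]
print([label_to_int(labels[0]) for labels in toy_predictions])
```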

Now let's compute the accuracy on the test dataset:

In [20]:
actual_labels = imdb_test.polarity.values

print("The accuracy of the freshly trained model is: {:.2f}".format((predicted_labels==actual_labels).mean()*100))

The accuracy of the freshly trained model is: 89.64


89.64 % ! This is a great accuracy on this dataset. Keep in mind that we didn't fine-tune anything and that the training only took 30 seconds.

These are great results, but at what cost ? Let's quickly check the size of this model:

In [21]:
model.save_model("yourpath/imdb.bin")

In [22]:
print("The model size is {:.0f} mb.".format(os.stat("yourpath/imdb.bin").st_size/1e6))

The model size is 835 mb.


800+ megabytes is a big model. Fortunately, fastText provides a method for compressing a model without losing too much performance:  
(please refer to [this paper](https://arxiv.org/abs/1612.03651) for details)

In [23]:
t0 = time.time()
model.quantize(input="yourpath/imdb.train", qnorm=True, retrain=True, cutoff=100000)

print("Compressing the model took {:.2f} seconds".format(time.time() - t0))

model.save_model("yourpath/imdb.ftz")

Compressing the model took 80.19 seconds


In [24]:
print("The model size is {:.0f} mb.".format(os.stat("yourpath/imdb.ftz").st_size/1e6))

The model size is 6 mb.


The compression took a little more than a minute, but the result is worth it: the model is now 6 MB !

But did we lose any accuracy ?

In [25]:
predicted_labels, probabilities = model.predict(list(test_sentences))
predicted_labels = np.vectorize(lambda s: int(s[-1]))(np.array(predicted_labels).ravel())
actual_labels = imdb_test.polarity.values

print("The accuracy of the compressed model is: {:.2f}".format((predicted_labels==actual_labels).mean()*100))

The accuracy of the compressed model is: 89.45


The accuracy is still over 89%, so we saved over 800 MB of space for the very small price (on a balanced dataset) of about 0.2 percentage points of accuracy.
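
The trade-off can be made explicit with a little arithmetic on the numbers reported above. The variable names are only illustrative.

```python
size_before_mb = 835.0   # size of imdb.bin reported above
size_after_mb = 6.0      # size of imdb.ftz reported above
accuracy_before = 89.64  # accuracy (%) before compression
accuracy_after = 89.45   # accuracy (%) after compression

compression_ratio = size_before_mb / size_after_mb
accuracy_drop = accuracy_before - accuracy_after

print("Compression ratio: {:.0f}x".format(compression_ratio))
print("Accuracy drop: {:.2f} percentage points".format(accuracy_drop))
```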

___
___