<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#MultiLabel-Text-Classification-with-FastText2" data-toc-modified-id="MultiLabel-Text-Classification-with-FastText2-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>MultiLabel Text Classification with FastText2</a></span><ul class="toc-item"><li><span><a href="#BackGround" data-toc-modified-id="BackGround-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>BackGround</a></span></li><li><span><a href="#Quick-Introduction-to-Fasttext" data-toc-modified-id="Quick-Introduction-to-Fasttext-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Quick Introduction to Fasttext</a></span></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Data Preparation</a></span></li><li><span><a href="#Model-Training" data-toc-modified-id="Model-Training-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Model Training</a></span></li></ul></li><li><span><a href="#Fasttext2-Inferencing-Benchmark" data-toc-modified-id="Fasttext2-Inferencing-Benchmark-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Fasttext2 Inferencing Benchmark</a></span></li><li><span><a href="#Fasttext2-Modification" data-toc-modified-id="Fasttext2-Modification-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Fasttext2 Modification</a></span></li><li><span><a href="#Reference" data-toc-modified-id="Reference-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Reference</a></span></li></ul></div>

In [1]:
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import time

# notice the import is from fasttext2
# can be installed with python setup.py install,
# this package is not on pip and as the package is
# renamed to fasttext2 it does not conflict with
# the original fasttext package
import fasttext2 as fasttext

%watermark -a 'Ethen' -d -t -v -p numpy,pandas

Ethen 2020-05-07 13:31:48 

CPython 3.6.4
IPython 7.9.0

numpy 1.16.5
pandas 0.25.0


# MultiLabel Text Classification with FastText2

## BackGround

Multi-label classification differs from the regular classification task, where each record has a single ground-truth label that we are predicting. Here, each record can have multiple labels attached to it. For example, in the data that we'll be working with later, our goal is to build a classifier that assigns tags to Stack Exchange questions about cooking. As we can imagine, each question can belong to multiple tags/topics at the same time, i.e. each record has multiple "correct" labels/targets. Let's look at some examples to make this concrete.

```
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
```

Looking at the first few lines, we can see that for each question, its corresponding tags are prepended with the `__label__` prefix. Our task is to train a model that predicts the tags/labels given the question.

This file format is expected by [Fasttext](https://fasttext.cc/), the library we'll be using to train our tag classifier.

## Quick Introduction to Fasttext

We'll be using Fasttext to train our text classifier. Fasttext at its core is composed of two main ideas.

First, unlike deep learning methods where there are multiple hidden layers, the architecture is similar to Word2vec. After feeding the words into a single hidden layer, the word representations are averaged into a sentence representation, which is directly followed by the output layer.

<img src="img/fasttext.png" width="50%" height="50%">

This seemingly simple method works extremely well on classification tasks. According to the original paper, it can achieve performance on par with more complex deep learning methods while being significantly quicker to train.
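To make the "average the word vectors, then apply the output layer" idea concrete, here is a minimal pure-Python sketch. The toy vectors, the label weights, and the helper names `sentence_vector` and `label_scores` are all made up for illustration; this is not Fasttext's actual implementation, only the shape of the computation described above.

```python
from typing import Dict, List

# hypothetical toy embeddings (the "hidden layer" weights), 2 dimensions per word
toy_word_vectors: Dict[str, List[float]] = {
    'cheese': [1.0, 0.0],
    'sauce': [0.0, 1.0],
    'recipe': [1.0, 1.0],
}

# hypothetical output layer: one weight vector per label
toy_label_vectors: Dict[str, List[float]] = {
    '__label__cheese': [2.0, 0.0],
    '__label__sauce': [0.0, 2.0],
}


def sentence_vector(tokens: List[str]) -> List[float]:
    """Average the vectors of the known tokens into one sentence vector."""
    vectors = [toy_word_vectors[token] for token in tokens if token in toy_word_vectors]
    dim = len(next(iter(toy_word_vectors.values())))
    if not vectors:
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def label_scores(tokens: List[str]) -> Dict[str, float]:
    """Dot product between the sentence vector and each label's weight vector."""
    sentence = sentence_vector(tokens)
    return {
        label: sum(s * w for s, w in zip(sentence, weights))
        for label, weights in toy_label_vectors.items()
    }


toy_tokens = ['cheese', 'sauce', 'recipe']
print(sentence_vector(toy_tokens))
print(label_scores(toy_tokens))
```

Note that there is no non-linearity between the averaged representation and the output scores, which is a large part of why this kind of model is cheap to train and to evaluate.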

The second idea is that, instead of treating words as the basic entity, it uses character n-grams or word n-grams as additional features. For example, in the sentence "I like apple", the word 1-grams are 'I', 'like', 'apple'. The word 2-grams are consecutive words such as 'I like' and 'like apple', whereas the character 2-grams for the word apple are 'ap', 'pp', 'pl', 'le'. By using word n-grams, the model can capture some information from the ordering of the words. With character n-grams, the model can generate better embeddings for rare words or even out-of-vocabulary words, as we can compose the embedding for a word using the sum or average of its character n-grams.
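The n-gram example above can be reproduced with a few lines of Python. The helper names `word_ngrams` and `char_ngrams` are hypothetical, and the sketch only mirrors the simplified description in the text; the library's own feature extraction differs in its details.

```python
from typing import List


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Consecutive runs of n words, joined by a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word: str, n: int) -> List[str]:
    """Consecutive runs of n characters within a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


sentence_tokens = 'I like apple'.split()
print(word_ngrams(sentence_tokens, 1))  # ['I', 'like', 'apple']
print(word_ngrams(sentence_tokens, 2))  # ['I like', 'like apple']
print(char_ngrams('apple', 2))          # ['ap', 'pp', 'pl', 'le']
```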

For readers accustomed to Word2vec, one should note that the word embedding/representation for the classification task is neither the skipgram nor the cbow method. Instead, it is tailored for the classification task at hand. To elaborate:

- Given a word, predict which other words should appear around it (skipgram).
- Given a sentence with a missing word, find the missing word (cbow).
- Given a sentence, tell which label corresponds to this sentence (classification).

Hence, for skipgram and cbow, words in the same context will tend to have their word embedding/representation close to each other. As for the classification task, words that are most discriminative for a given label will be close to each other.

## Data Preparation

We'll download the data and take a peek at it. Then it's the standard train and test split on our text file. As Fasttext accepts the input data as files, the following code chunk provides a function to perform the split without reading all the data into memory. 

In [2]:
# download the data and un-tar it under the 'data' folder

# -P or --directory-prefix specifies which directory to download the data to
!wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz -P data
# -C specifies the target directory to extract an archive to
!tar xvzf data/cooking.stackexchange.tar.gz -C data

--2020-05-07 13:31:49--  https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 457609 (447K) [application/x-tar]
Saving to: ‘data/cooking.stackexchange.tar.gz’


2020-05-07 13:31:49 (4.28 MB/s) - ‘data/cooking.stackexchange.tar.gz’ saved [457609/457609]

x cooking.stackexchange.id
x cooking.stackexchange.txt
x readme.txt


In [3]:
!head -n 3 data/cooking.stackexchange.txt

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?


In [4]:
import os
import random


def train_test_split_file(input_path: str,
                          output_path_train: str,
                          output_path_test: str,
                          test_size: float,
                          random_state: int=1234,
                          encoding: str='utf-8',
                          verbose: bool=True):
    random.seed(random_state)

    # we record the number of data in the training and test
    count_train = 0
    count_test = 0
    train_range = 1 - test_size

    with open(input_path, encoding=encoding) as f_in, \
         open(output_path_train, 'w', encoding=encoding) as f_train, \
         open(output_path_test, 'w', encoding=encoding) as f_test:

        for line in f_in:
            random_num = random.random()
            if random_num < train_range:
                f_train.write(line)
                count_train += 1
            else:
                f_test.write(line)
                count_test += 1

    if verbose:
        print('train size: ', count_train)
        print('test size: ', count_test)


def prepend_file_name(path: str, name: str) -> str:
    """
    e.g. data/cooking.stackexchange.txt
    prepend 'train' to the base file name
    data/train_cooking.stackexchange.txt
    """
    directory = os.path.dirname(path)
    file_name = os.path.basename(path)
    return os.path.join(directory, name + '_' + file_name)

In [5]:
data_dir = 'data'
test_size = 0.2
input_path = os.path.join(data_dir, 'cooking.stackexchange.txt')
input_path_train = prepend_file_name(input_path, 'train')
input_path_test = prepend_file_name(input_path, 'test')
random_state = 1234
encoding = 'utf-8'

train_test_split_file(input_path, input_path_train, input_path_test,
                      test_size, random_state, encoding)
print('train path: ', input_path_train)
print('test path: ', input_path_test)

train size:  12297
test size:  3107
train path:  data/train_cooking.stackexchange.txt
test path:  data/test_cooking.stackexchange.txt


## Model Training

We can refer to the full list of parameters from [Fasttext's documentation page](https://fasttext.cc/docs/en/python-module.html#train_supervised-parameters). As with all machine learning models, feel free to experiment with various hyperparameters and see which ones lead to better performance.

In [6]:
# lr = learning rate
# lrUpdateRate = how often (in processed tokens) the learning rate gets updated
fasttext_params = {
    'input': input_path_train,
    'lr': 0.1,
    'lrUpdateRate': 1000,
    'thread': 8,
    'epoch': 10,
    'wordNgrams': 1,
    'dim': 100,
    'loss': 'ova'
}
model = fasttext.train_supervised(**fasttext_params)

print('vocab size: ', len(model.words))
print('label size: ', len(model.labels))
print('example vocab: ', model.words[:5])
print('example label: ', model.labels[:5])

vocab size:  14496
label size:  733
example vocab:  ['</s>', 'to', 'a', 'How', 'the']
example label:  ['__label__baking', '__label__food-safety', '__label__substitutions', '__label__equipment', '__label__bread']


Although not used here, fasttext has a parameter called `bucket`. It can be a bit unintuitive what the parameter controls. We note down the [explanation provided by the package maintainer](https://github.com/facebookresearch/fastText/issues/641).

> The size of the model will increase linearly with the number of buckets. The size of the input matrix is DIM x (VS + BS), where VS is the number of words in the vocabulary and BS is the number of buckets. The number of buckets does not have other influence on the model size.
> The buckets are used for hashed features (such as character ngrams or word ngrams), which are used in addition to word features. In the input matrix, each word is represented by a vector, and the additional ngram features are represented by a fixed number of vectors (which corresponds to the number of buckets).
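
The `DIM x (VS + BS)` formula from the quote can be turned into a quick back-of-the-envelope calculation. In this sketch, `dim` and `vocab_size` are taken from the training cell above, while the two bucket values and the 4 bytes per parameter (32-bit floats) are assumptions chosen purely for illustration. The helper name `input_matrix_params` is hypothetical.

```python
def input_matrix_params(dim: int, vocab_size: int, bucket: int) -> int:
    """Number of parameters in the input matrix: DIM x (VS + BS)."""
    return dim * (vocab_size + bucket)


BYTES_PER_PARAM = 4  # assumption: parameters are stored as 32-bit floats
embedding_dim, vocab_size = 100, 14496  # values from the training cell above

# illustrative bucket counts: no hashed features vs. two million buckets
for bucket in (0, 2_000_000):
    num_params = input_matrix_params(embedding_dim, vocab_size, bucket)
    size_mb = num_params * BYTES_PER_PARAM / 1e6
    print('bucket={:>9,}  params={:>12,}  ~{:,.1f} MB'.format(bucket, num_params, size_mb))
```

The takeaway is that once hashed n-gram features are enabled, the bucket count can dominate the size of the input matrix, far more than the vocabulary itself.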

The loss function that we've specified is one versus all, `ova` for short. This type of loss function handles the multiple labels by building independent binary classifiers for each label.
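As a rough illustration of "independent binary classifiers for each label", the sketch below applies a sigmoid to each label's raw score separately and sums the per-label binary log losses. The scores are made up and the function name `one_vs_all_loss` is hypothetical; it shows the general one-versus-all idea rather than the library's exact code.

```python
import math
from typing import Dict, Set


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def one_vs_all_loss(raw_scores: Dict[str, float], true_labels: Set[str]) -> float:
    """Sum of independent binary log losses, one per label."""
    loss = 0.0
    for label, score in raw_scores.items():
        prob = sigmoid(score)
        # probability assigned to the correct binary outcome for this label
        target_prob = prob if label in true_labels else 1.0 - prob
        loss += -math.log(target_prob)
    return loss


# hypothetical raw scores for three labels
toy_scores = {'__label__sauce': 2.0, '__label__cheese': 1.0, '__label__stove': -3.0}
toy_true_labels = {'__label__sauce', '__label__cheese'}

print({label: round(sigmoid(score), 3) for label, score in toy_scores.items()})
print(round(one_vs_all_loss(toy_scores, toy_true_labels), 3))
```

Because each label gets its own sigmoid, the predicted probabilities do not need to sum to 1, unlike a softmax output. This is what lets several labels receive a high score for the same input.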

After training the model, we can look at the predictions it generates by passing a question to the `.predict` method.

In [7]:
text = 'How much does potato starch affect a cheese sauce recipe?'
model.predict(text, k=2)

(('__label__sauce', '__label__cheese'), array([0.77185351, 0.53899324]))

The annotated tags for this question were `__label__sauce` and `__label__cheese`, meaning both predictions are correct when asking for the top 2 tags, i.e. the precision@2 (precision at 2) for this example is 100%.

In [8]:
text = 'Dangerous pathogens capable of growing in acidic environments'
model.predict(text, k=2)

(('__label__food-safety', '__label__storage-method'),
 array([0.21207881, 0.06561483]))

In this example, the annotated tags were `__label__food-safety` and `__label__acidity`, whereas the model predicted `__label__food-safety` and `__label__storage-method`. In other words, 1 of our 2 predicted tags was wrong, hence the precision@2 is 50%.
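The precision@2 numbers for these two examples can be checked with a small helper. The function name `precision_recall_at_k` is hypothetical; the predictions are copied from the two cells above and the annotated tags come from the raw data file printed earlier.

```python
from typing import Sequence, Tuple


def precision_recall_at_k(predicted: Sequence[str],
                          actual: Sequence[str],
                          top_k: int) -> Tuple[float, float]:
    """Precision@k and recall@k for a single example."""
    top_predictions = predicted[:top_k]
    num_correct = len(set(top_predictions) & set(actual))
    precision = num_correct / len(top_predictions) if top_predictions else 0.0
    recall = num_correct / len(actual) if actual else 0.0
    return precision, recall


print(precision_recall_at_k(
    ['__label__sauce', '__label__cheese'],
    ['__label__sauce', '__label__cheese'], top_k=2))  # (1.0, 1.0)
print(precision_recall_at_k(
    ['__label__food-safety', '__label__storage-method'],
    ['__label__food-safety', '__label__acidity'], top_k=2))  # (0.5, 0.5)
```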

Notice that the second prediction's score is pretty low. When calling the `.predict` method, we can also provide a threshold to cut off predictions with a score lower than that value.

In [9]:
text = 'Dangerous pathogens capable of growing in acidic environments'
model.predict(text, k=2, threshold=0.1)

(('__label__food-safety',), array([0.21207881]))

The `.predict` method also supports batch prediction, where we pass in a list of texts.

In [10]:
texts = [
    'How much does potato starch affect a cheese sauce recipe?',
    'Dangerous pathogens capable of growing in acidic environments'
]
batch_results = model.predict(texts, k=2)
batch_results

([['__label__sauce', '__label__cheese'],
  ['__label__food-safety', '__label__storage-method']],
 [array([0.7718535 , 0.53899324], dtype=float32),
  array([0.21207881, 0.06561483], dtype=float32)])

After obtaining the predictions, we might want to clean up the predicted labels, e.g. removing the `__label__` prefix that fasttext uses to differentiate label tokens from input words/tokens, and replacing the dash `-` within each label with a whitespace.

In [11]:
FASTTEXT_LABEL = '__label__'


def parse_fasttext_label(label):
    return label[len(FASTTEXT_LABEL):].replace('-', ' ')


def parse_batch_labels(batch_labels):
    parsed_batch_labels = []
    for labels in batch_labels:
        parsed_labels = [parse_fasttext_label(label) for label in labels]
        parsed_batch_labels.append(parsed_labels)

    return parsed_batch_labels

In [12]:
# the result from the .predict method is a tuple of label and score
# we only need to parse the label
parse_batch_labels(batch_results[0])

[['sauce', 'cheese'], ['food safety', 'storage method']]

To evaluate the overall precision and recall metrics on our train and test files, we can leverage the model's `.test` method.

In [13]:
def print_results(model, input_path, k):
    num_records, precision_at_k, recall_at_k = model.test(input_path, k)
    f1_at_k = 2 * (precision_at_k * recall_at_k) / (precision_at_k + recall_at_k)

    print("records\t{}".format(num_records))
    print("Precision@{}\t{:.3f}".format(k, precision_at_k))
    print("Recall@{}\t{:.3f}".format(k, recall_at_k))
    print("F1@{}\t{:.3f}".format(k, f1_at_k))
    print()

In [14]:
k = 1
print('train metrics:')
print_results(model, input_path_train, k)

print('test metrics:')
print_results(model, input_path_test, k)

train metrics:
records	12297
Precision@1	0.485
Recall@1	0.211
F1@1	0.294

test metrics:
records	3107
Precision@1	0.410
Recall@1	0.177
F1@1	0.247



In [15]:
# we save the model under its own folder/directory
directory = 'cooking_model'
# exist_ok=True makes this a no-op when the directory already exists
os.makedirs(directory, exist_ok=True)

model_checkpoint = os.path.join(directory, 'fasttext2_model.fasttext')
model.save_model(model_checkpoint)

# Fasttext2 Inferencing Benchmark

Everything shown above is provided by the original `fasttext` library. `fasttext2` mainly adds code to speed up model inferencing, i.e. generating predictions for new inputs after training the model. The core logic that does the model training remains the same.

Here, we benchmark the different predict methods. We fix the top-k labels we're predicting and the batch size (the number of input texts that we wish to get the output for).

In [16]:
k = 5
batch_size = 500

In [17]:
from itertools import islice


def batch_read_text(input_path: str, batch_size: int=500, encoding: str='utf-8'):
    """Read the first `batch_size` lines and strip out the label tokens."""
    texts = []
    with open(input_path, encoding=encoding) as f:
        # islice stops at the end of the file, so a file with fewer
        # lines than batch_size simply yields a smaller batch
        for line in islice(f, batch_size):
            tokens = [token for token in line.strip('\n').split(' ')
                      if FASTTEXT_LABEL not in token]
            texts.append(' '.join(tokens))

    return texts

In [18]:
batch_texts = batch_read_text(input_path_train, batch_size)
print('batch size: ', len(batch_texts))
batch_texts[:3]

batch size:  500


['Dangerous pathogens capable of growing in acidic environments',
 'How do I cover up the white spots on my cast iron stove?',
 "What's the purpose of a bread box?"]

The original `.predict` method from fasttext library.

In [19]:
%%timeit
batch_results = model.predict(batch_texts, k=k)
batch_labels = parse_batch_labels(batch_results[0])

44.4 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


If we only want the predicted label, and not the score, we can use `.predict_label`. The speed gain here is not much.

In [20]:
%%timeit
batch_labels = model.predict_label(batch_texts, k=k)
batch_labels = parse_batch_labels(batch_labels)

43 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


fasttext uses a sequential loop to generate predictions for a batch of text. One way to improve the inferencing speed is to switch to a parallel loop. This is exposed through the `predict_label_future` method. It uses C++'s async/future pattern to parallelize the predictions, hence the future naming.
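For readers less familiar with the async/future pattern, here is a Python analogue using the standard library's `concurrent.futures`. It is only meant to show the shape of the pattern (submit one task per text, then collect the results in order). The `predict_one` function is a hypothetical stand-in for a single-text prediction, and Python threads will not give the CPU-level parallelism that the C++ implementation can.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def predict_one(text: str) -> List[str]:
    """Hypothetical stand-in for predicting the labels of a single text."""
    return ['__label__' + token.lower() for token in text.split()[:2]]


def predict_sequential(inputs: List[str]) -> List[List[str]]:
    """Sequential loop: one prediction after another."""
    return [predict_one(text) for text in inputs]


def predict_with_futures(inputs: List[str], max_workers: int = 4) -> List[List[str]]:
    """Submit one task per text, then gather the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit returns immediately with a future for each text
        futures = [executor.submit(predict_one, text) for text in inputs]
        # calling result() in submission order keeps outputs aligned with inputs
        return [future.result() for future in futures]


toy_inputs = ['cheese sauce recipe', 'cast iron stove']
print(predict_with_futures(toy_inputs))
print(predict_with_futures(toy_inputs) == predict_sequential(toy_inputs))
```

The key property is that the parallel version returns exactly the same results, in the same order, as the sequential loop; only the scheduling changes.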

In [21]:
%%timeit
batch_labels = model.predict_label_future(batch_texts, k=k)
batch_labels = parse_batch_labels(batch_labels)

14.1 ms ± 480 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


The idea behind the next speed up is not as straightforward as implementing parallel prediction on top of the sequential prediction.

At a high level, it uses `hnswlib` to create an index on the output matrix of the model. This indexing allows us to speed up the search for the top-k predicted labels for all future predictions. We'll defer the detailed explanation till later, and first see how to leverage it as an end-user.
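To see what the index is meant to speed up, consider the brute-force baseline: score every row of the output matrix against the hidden vector and keep the k best. The sketch below uses made-up numbers and a hypothetical helper name `brute_force_top_k`; it illustrates the exhaustive search only, not how `hnswlib` or `fasttext2` implement the approximate search.

```python
import heapq
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def brute_force_top_k(hidden_vector: List[float],
                      output_matrix: List[List[float]],
                      top_k: int) -> List[Tuple[int, float]]:
    """Score every label row against the hidden vector, keep the best top_k."""
    raw_scores = [
        sum(h * w for h, w in zip(hidden_vector, row))
        for row in output_matrix
    ]
    best_ids = heapq.nlargest(top_k, range(len(raw_scores)), key=raw_scores.__getitem__)
    return [(label_id, sigmoid(raw_scores[label_id])) for label_id in best_ids]


# hypothetical hidden vector and output matrix (4 labels, 2 dimensions)
toy_hidden = [0.5, 1.0]
toy_output_matrix = [
    [1.0, 1.0],    # label 0
    [-1.0, 0.0],   # label 1
    [0.0, 3.0],    # label 2
    [2.0, -1.0],   # label 3
]
print(brute_force_top_k(toy_hidden, toy_output_matrix, top_k=2))
```

The exhaustive search touches every label, so its cost grows linearly with the size of the label space. Since the sigmoid is monotonic, ranking labels by raw inner product gives the same order as ranking by probability, which is why a nearest-neighbor style index over the output matrix can stand in for the full computation.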

We need to create an index using the `create_index` method.

In [22]:
index_params = {
    'ef_construction': 100,
    'M': 5,
    'random_seed': 100
}
model.create_index(**index_params)

<fasttext2.FastText._FastText at 0x11e9bdbe0>

After that, we call `batch_predict_label` to perform the batch prediction. Note that multi-label classification problems with a large label space will see more gains from this "indexing" trick.

In [23]:
%%timeit
batch_labels = model.batch_predict_label(batch_texts, k=k)
batch_labels = parse_batch_labels(batch_labels)

11.1 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


We can print out the predicted labels to make sure the predictions remain the same.

In [24]:
threshold = 0.1
batch_labels = model.predict_label_future(texts, k=k, threshold=threshold)
batch_labels

[['__label__sauce', '__label__cheese', '__label__tomatoes'],
 ['__label__food-safety']]

In [25]:
batch_labels = model.batch_predict_label(texts, k=k, threshold=threshold)
batch_labels

[['__label__sauce', '__label__cheese', '__label__tomatoes'],
 ['__label__food-safety']]

We can save the "indexed model" using the original `save_model` method, and load it back.

In [26]:
indexed_model_checkpoint = os.path.join(directory, 'fasttext2_indexed_model.fasttext')
model.save_model(indexed_model_checkpoint)

In [27]:
fasttext_model = fasttext.load_model(indexed_model_checkpoint)

# confirm prediction still works
batch_labels = fasttext_model.batch_predict_label(texts, k=k)
batch_labels[:2]



[['__label__sauce',
  '__label__cheese',
  '__label__tomatoes',
  '__label__pasta',
  '__label__flavor'],
 ['__label__food-safety',
  '__label__storage-method',
  '__label__storage-lifetime',
  '__label__food-science',
  '__label__refrigerator']]

The rest of this section demonstrates that we can use the quantization capability provided by fasttext together with the indexing trick to both speed up inferencing and reduce the memory usage of our model.

In [29]:
# we load the model
fasttext_model = fasttext.load_model(model_checkpoint)



In [30]:
# we use the original .quantize method to quantize the model
fasttext_model.quantize(dsub=2)

In [31]:
# creates the index
index_params = {
    'ef_construction': 100,
    'M': 5,
    'random_seed': 100
}
fasttext_model.create_index(**index_params)

<fasttext2.FastText._FastText at 0x12642d748>

In [32]:
# check the prediction
batch_labels = fasttext_model.batch_predict_label(texts, k=k, threshold=threshold)
batch_labels

[['__label__sauce', '__label__cheese', '__label__pasta', '__label__tomatoes'],
 ['__label__food-safety']]

In [33]:
# saves the quantized and indexed model
quantized_model_checkpoint = os.path.join(directory, 'fasttext2_quantized_model.fasttext')
fasttext_model.save_model(quantized_model_checkpoint)

In [34]:
# load the quantized and indexed model back into memory
fasttext_model = fasttext.load_model(quantized_model_checkpoint)



In [35]:
# check the size of each model
!ls -lh cooking_model

total 28904
-rw-r--r--  1 mingyuliu  110304721   6.4M May  7 13:32 fasttext2_indexed_model.fasttext
-rw-r--r--  1 mingyuliu  110304721   6.1M May  7 13:31 fasttext2_model.fasttext
-rw-r--r--  1 mingyuliu  110304721   1.6M May  7 13:32 fasttext2_quantized_model.fasttext


In [36]:
# check the prediction
batch_labels = fasttext_model.batch_predict_label(texts, k=k, threshold=threshold)
batch_labels

[['__label__sauce', '__label__cheese', '__label__pasta', '__label__tomatoes'],
 ['__label__food-safety']]

In [37]:
%%timeit
batch_labels = fasttext_model.predict_label_future(texts, k=5)
batch_labels = parse_batch_labels(batch_labels)

135 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


# Fasttext2 Modification

We list the major code changes to the original Fasttext source code.

- We switched the package name from `fasttext` to `fasttext2` to prevent the two packages from conflicting with each other. Note that in the import statement, we can use `import fasttext2 as fasttext` to avoid changes to other parts of the code. As this work is mainly around speeding up inferencing, all the other commonly used methods, such as `train_supervised`, `save_model`, `load_model`, `quantize`, will work as usual.
- The fasttext object from `fasttext2` mainly adds the `create_index`, `batch_predict_label`, and `predict_label_future` methods to speed up inferencing.
- For the underlying C++ code:
    - In the main `FastText` class, there is a `Model` class that takes care of the computational aspect and a `Loss` class that handles the loss and applies gradients to the output. Here, we refactor `Model` so that, instead of a single method that computes the hidden layer and output layer in one go, `computeHidden` is isolated. This gives us the flexibility to compute only up to the hidden layer and redirect the rest of the inferencing to the faster indexing method. `Loss` also contains a method that computes the `sigmoid` of an input number; we expose it so it can be re-used outside the `Loss` context.
    - We add the `Index` class, which contains the core logic to create the hnsw index and to perform knn search against it. The `Index` class is a wrapper around the `hnswlib` library, and we include it as a member of the `FastText` class. It can be created using the newly added `createIndex` method of the `FastText` class. We also added an `indexed_` attribute to denote whether the model is indexed, similar to the `quant_` attribute that indicates whether the model is quantized. The `Index` has `save` and `load` methods that work with the `FastText` object, i.e. when we save/load the `FastText` object using its `saveModel`/`loadModel` methods, the underlying `Index` is saved/loaded together with it if we created one.
    - The `knnQueryLineLabel` method is added to the `FastText` object to generate the prediction using the index.
    - In the `fasttext2_pybind.cc` script, which uses pybind11 to expose the C++ code to Python, we added the following methods to the `FastText` class. `multilinePredictLabelFuture` performs parallel prediction using C++'s async/future pattern, whereas the original `multilinePredict` method performs prediction sequentially for the batch of input text. `multilineKnnQueryLabel` performs parallel prediction after we've indexed the model.

# Reference

- [Fasttext Documentation: Text Classification](https://fasttext.cc/docs/en/supervised-tutorial.html)
- [Quora: What is the main difference between word2vec and fastText?](https://www.quora.com/What-is-the-main-difference-between-word2vec-and-fastText)
- [Paper: A. Joulin, E. Grave, P. Bojanowski, T. Mikolov - Bag of Tricks for Efficient Text Classification (2016)](https://arxiv.org/abs/1607.01759)