# Flair Tutorial on Document Classification

(C) 2019 by [Damir Cavar](http://damir.cavar.me/)

## Corpus Preparation

We use a script to split the corpus into training, development, and test sets. The corpus files use the [FastText](https://fasttext.cc/docs/en/supervised-tutorial.html) format, in which each line holds one document prefixed with its label (`__label__<class>`). The corpus is split into:

- training set
- development set
- test set

We use the development (dev) data to monitor over-fitting during training.
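The splitting script itself is not shown in this tutorial. The following is a minimal sketch of what such a script could look like, assuming a single FastText-format input with one document per line and an 80/10/10 split. The function name `split_corpus`, the fractions, and the seed are illustrative assumptions, not part of the original setup.

```python
import random
from typing import List, Tuple


def split_corpus(lines: List[str],
                 train_frac: float = 0.8,
                 dev_frac: float = 0.1,
                 seed: int = 42) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle documents and cut them into train, dev, and test portions."""
    docs = [line.strip() for line in lines if line.strip()]
    random.Random(seed).shuffle(docs)  # fixed seed keeps the split reproducible
    n_train = int(len(docs) * train_frac)
    n_dev = int(len(docs) * dev_frac)
    train = docs[:n_train]
    dev = docs[n_train:n_train + n_dev]
    test = docs[n_train + n_dev:]  # whatever remains goes to the test set
    return train, dev, test


# Toy corpus in FastText format: "__label__<class> <text>", one document per line.
toy_lines = [f"__label__{1 + i % 2} sample review number {i}" for i in range(10)]
train, dev, test = split_corpus(toy_lines)
print(len(train), len(dev), len(test))
```

Writing the three lists to `train.txt`, `dev.txt`, and `test.txt` in the data folder would give the file layout that is loaded in the next cells.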

In [12]:
from flair.data_fetcher import NLPTaskDataFetcher
from flair.data import TaggedCorpus
from pathlib import Path

Set the path to the corpus files:

In [13]:
data_folder = Path('./data')

Load the corpus files:

In [14]:
corpus: TaggedCorpus = NLPTaskDataFetcher.load_classification_corpus(data_folder,
                                                                     test_file='test.txt',
                                                                     dev_file='dev.txt',
                                                                     train_file='train.txt')

2019-04-10 10:08:37,254 Reading data from data
2019-04-10 10:08:37,255 Train: data/train.txt
2019-04-10 10:08:37,255 Dev: data/dev.txt
2019-04-10 10:08:37,256 Test: data/test.txt


Print out some stats for the corpus:

In [15]:
stats = corpus.obtain_statistics()
print(stats)

{
    "TRAIN": {
        "dataset": "TRAIN",
        "total_number_of_documents": 8000,
        "number_of_documents_per_class": {
            "1": 4099,
            "2": 3901
        },
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 737984,
            "min": 17,
            "max": 252,
            "avg": 92.248
        }
    },
    "TEST": {
        "dataset": "TEST",
        "total_number_of_documents": 1000,
        "number_of_documents_per_class": {
            "2": 508,
            "1": 492
        },
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 92053,
            "min": 18,
            "max": 228,
            "avg": 92.053
        }
    },
    "DEV": {
        "dataset": "DEV",
        "total_number_of_documents": 1000,
        "number_of_documents_per_class": {
            "1": 506,
            "2": 494
        },
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
      

Load the modules for training a network:

In [16]:
from flair.data_fetcher import NLPTask
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

Create a label dictionary:

In [17]:
label_dict = corpus.make_label_dictionary()
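Conceptually, a label dictionary maps every class label in the corpus to an integer index that the classifier's output layer can work with. The sketch below illustrates the idea in plain Python. It is not Flair's implementation, and `build_label_index` is a hypothetical helper.

```python
from typing import Dict, List


def build_label_index(lines: List[str], prefix: str = "__label__") -> Dict[str, int]:
    """Assign a consecutive index to every distinct label, in order of first appearance."""
    index: Dict[str, int] = {}
    for line in lines:
        for token in line.split():
            if token.startswith(prefix):
                label = token[len(prefix):]
                index.setdefault(label, len(index))
    return index


# Made-up documents in FastText format.
docs = ["__label__2 great product",
        "__label__1 broke after a day",
        "__label__2 works well"]
label_index = build_label_index(docs)
print(label_index)  # {'2': 0, '1': 1}
```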

Load the different word embeddings:

In [18]:
word_embeddings = [WordEmbeddings('glove'),
                   FlairEmbeddings('news-forward'),
                   FlairEmbeddings('news-backward'),
                   ]

The three embedding models will be concatenated and should give state-of-the-art results. If this is too slow or too demanding for your computer, try it first without the *FlairEmbeddings*.
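Concatenation means that, for each token, the vectors from the individual embedding models are joined end to end, so the resulting dimensionality is the sum of the individual dimensionalities. A minimal sketch with made-up toy vectors (the sizes and values are illustrative and do not correspond to the real models):

```python
from typing import List


def concatenate(vectors: List[List[float]]) -> List[float]:
    """Join several embedding vectors for the same token end to end."""
    stacked: List[float] = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


# Made-up vectors for one token from three hypothetical embedding models.
static_vec = [0.1, 0.2, 0.3]
forward_vec = [0.5, -0.5]
backward_vec = [0.9, 0.0]

token_vec = concatenate([static_vec, forward_vec, backward_vec])
print(len(token_vec))  # 7 = 3 + 2 + 2
```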

Document embeddings produce a single embedding for an entire text. The resulting embeddings are PyTorch tensors. There are two methods for deriving a document embedding from word embeddings:

- Pooling Operation
- RNN
### Pooling Operation

The **Pooling Operation** applies a pooling operation over all word embeddings in a document. The default operation is mean, which averages the embeddings of all words in the text. The result is used as the document embedding.



To create a mean document embedding, first create any number of TokenEmbeddings and put them in a list. Then initialize the DocumentPoolEmbeddings with this list of TokenEmbeddings. If you want to create a document embedding using GloVe embeddings together with forward and backward FlairEmbeddings, use the following code:

In [1]:
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

glove_embedding = WordEmbeddings('glove')
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

document_embeddings = DocumentPoolEmbeddings([glove_embedding,
                                              flair_embedding_backward,
                                              flair_embedding_forward])

Now, create an example sentence and call the embedding's embed() method.

In [2]:
sentence = Sentence('The grass is green . And the sky is blue .')

document_embeddings.embed(sentence)

print(sentence.get_embedding())

tensor([-0.3197,  0.2621,  0.4037,  ..., -0.0013, -0.0026,  0.0170])


Since the document embedding is derived from word embeddings, its dimensionality depends on the dimensionality of word embeddings you are using.

Besides mean pooling, you can also use min or max pooling. Pass the pooling operation you want to use when initializing the DocumentPoolEmbeddings:

In [3]:
document_embeddings = DocumentPoolEmbeddings([glove_embedding,
                                              flair_embedding_backward,
                                              flair_embedding_forward],
                                             mode='min')
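What the pooling modes compute can be sketched in plain Python: every embedding dimension is reduced across all tokens of the document. This is an illustration of the idea only, not Flair's implementation, and `pool` is a hypothetical helper operating on made-up vectors.

```python
from typing import List


def pool(token_vectors: List[List[float]], mode: str = "mean") -> List[float]:
    """Reduce a list of equally sized token vectors to one vector, dimension by dimension."""
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "min":
        return [min(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Three made-up token vectors with two dimensions each.
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(pool(tokens))          # [2.0, 2.0]
print(pool(tokens, "min"))   # [1.0, 0.0]
print(pool(tokens, "max"))   # [3.0, 4.0]
```

Note that the pooled vector has the same length as each token vector, and that the order of the tokens does not affect the result.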

### Use an RNN to obtain Embeddings

The RNN takes the word embeddings of every token in the document as input and provides its last output state as document embedding. You can choose which type of RNN you wish to use.
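To make "last output state" concrete, here is a minimal sketch of a plain tanh recurrent network in pure Python. It uses fixed, made-up weights and is not Flair's implementation. All names and values are illustrative assumptions.

```python
import math
from typing import List


def rnn_document_embedding(token_vectors: List[List[float]],
                           w_in: List[List[float]],
                           w_rec: List[List[float]]) -> List[float]:
    """Run a simple tanh RNN over the tokens and return the final hidden state."""
    hidden = [0.0] * len(w_rec)
    for x in token_vectors:
        new_hidden = []
        for j in range(len(hidden)):
            # Input contribution plus recurrent contribution for hidden unit j.
            total = sum(w_in[j][i] * x[i] for i in range(len(x)))
            total += sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
    return hidden  # the last state serves as the document embedding


# Made-up weights: 3-dimensional token vectors, 2 hidden units.
w_in = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
w_rec = [[0.1, 0.0], [0.0, 0.1]]
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

doc_vec = rnn_document_embedding(tokens, w_in, w_rec)
print(len(doc_vec))  # 2, the number of hidden units
```

In this sketch the length of the result is set by the number of hidden units rather than by the token vectors, and the result depends on the order of the tokens.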

Create a document embeddings RNN:

In [4]:
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

glove_embedding = WordEmbeddings('glove')

document_embeddings = DocumentRNNEmbeddings([glove_embedding])

By default, a GRU-type RNN is instantiated. The GRU (Gated Recurrent Unit; see Cho et al., 2014) aims to mitigate the vanishing gradient problem of standard recurrent neural networks (RNNs). It can be considered a variation on the LSTM, because both are designed similarly and, in many cases, produce comparable results.

Now, create an example sentence and call the embedding's embed() method.

In [5]:
sentence = Sentence('The grass is green . And the sky is blue .')

document_embeddings.embed(sentence)

print(sentence.get_embedding())

tensor([-1.1233, -0.3052, -0.7384, -0.7638,  0.0027,  0.3457, -0.8189,  0.0000,
         1.0095,  0.2369, -0.0000,  0.1885,  0.5334, -0.0000, -0.0000,  0.0083,
         0.4750,  0.1510, -0.5622, -0.2329,  0.2736,  0.0000, -0.6329, -0.5956,
         0.0000,  0.0000, -0.0000, -0.1479, -1.3262, -0.0000, -0.0000,  0.0000,
         0.0000,  0.3408,  0.0000, -0.0000, -0.2298, -0.2081,  0.5123, -0.0000,
        -0.5355,  0.9092, -0.0000,  0.3954,  0.0000, -0.0000,  0.0000,  0.3119,
         0.0000,  0.2398, -0.2036, -0.0813,  0.0000,  0.1089,  0.0000, -0.5947,
        -0.0000, -0.4465,  0.2653, -0.0000,  0.0000, -0.0000,  0.0000,  0.3511,
        -0.5406, -0.3120, -0.0000, -0.0000,  0.0000, -0.0000,  0.1125,  0.3827,
        -0.7062,  0.0000,  0.5449, -0.6990,  0.0000, -0.0000,  0.0000, -0.0000,
        -0.4847, -0.0000, -0.0000,  0.0000, -0.0000,  0.0000,  0.1327,  0.0000,
        -0.3140, -0.2876, -0.4289,  0.0000,  0.7989,  0.0000, -0.0000,  0.9270,
         0.1857, -0.0000, -0.7961,  0.00

This will output a single embedding for the complete sentence. The embedding dimensionality depends on the number of hidden states you are using and whether the RNN is bidirectional or not.

If you want to use a different type of RNN, you need to set the rnn_type parameter in the constructor. So, to initialize a document RNN embedding with an LSTM, do:

In [6]:
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

glove_embedding = WordEmbeddings('glove')

document_lstm_embeddings = DocumentRNNEmbeddings([glove_embedding], rnn_type='LSTM')

Note that while DocumentPoolEmbeddings are immediately meaningful, DocumentRNNEmbeddings need to be tuned on the downstream task. This happens automatically in Flair when you train a new model with these embeddings. Once the model is trained, you can access the tuned DocumentRNNEmbeddings object directly from the classifier object and use it to embed sentences.

The model feeds the word embeddings into an RNN to obtain a text representation, and then passes this representation through a linear layer to predict the class label. The model can handle both single-label and multi-label data sets.

In [19]:
from flair.models import TextClassifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=False)
document_embeddings = classifier.document_embeddings

sentence = Sentence('The grass is green . And the sky is blue .')

document_embeddings.embed(sentence)

print(sentence.get_embedding())

tensor([-0.0000, -0.4872, -0.0000, -0.7042,  0.1878, -0.0000, -0.0000, -0.0000,
         0.0000,  0.1602, -0.7092,  0.0000,  0.0948, -0.7893, -0.0965, -0.0000,
         0.4577,  0.0000, -0.2404, -0.5142,  0.0000,  0.2066, -0.6878, -0.0000,
        -0.6402,  0.3356, -0.0000,  0.0000, -0.9928, -0.5057, -0.0000,  0.0719,
        -0.0000,  0.0000,  0.0000, -0.0000,  0.0000, -0.1804,  0.8450, -0.0000,
        -1.2277,  0.4854, -0.0000,  0.7708, -0.2778, -0.0000,  0.4839,  0.0876,
         0.3991,  0.5270, -0.1792,  0.6602,  0.0000, -0.3155,  0.0000, -0.5706,
        -0.0000, -0.0000, -0.3597,  0.0391,  0.0000, -0.1249,  0.5825,  0.0000,
        -0.0000, -0.0000, -0.0000, -0.0000,  0.0000,  0.6441,  0.0957,  0.2359,
        -0.7695,  0.0000,  0.3934, -0.0000, -0.0000,  0.0000,  0.7513,  0.0100,
        -0.0000, -0.0000,  0.0000, -0.0000, -0.0000,  0.6134,  0.0184,  0.8898,
        -0.0000, -0.2680,  0.4325, -0.2350,  0.5146,  0.0000, -0.2735,  0.0000,
         0.1126,  0.0000, -0.0000, -0.00

DocumentRNNEmbeddings have a number of hyper-parameters that can be tuned to improve learning:

- hidden_size: the number of hidden states in the RNN.
- rnn_layers: the number of layers for the RNN.
- reproject_words: boolean value, indicating whether to reproject the token embeddings in a separate linear layer before putting them into the RNN or not.
- reproject_words_dimension: output dimension of reprojecting token embeddings. If None the same output dimension as before will be taken.
- bidirectional: boolean value, indicating whether to use a bidirectional RNN or not.
- dropout: the dropout value to be used.
- word_dropout: the word dropout value to be used, if 0.0 word dropout is not used.
- locked_dropout: the locked dropout value to be used, if 0.0 locked dropout is not used.
- rnn_type: one of 'GRU' (the default), 'RNN', 'LSTM', 'RNN_TANH' or 'RNN_RELU'

In our current example of Amazon reviews, we will use the following settings:

In [20]:
document_embeddings: DocumentRNNEmbeddings = DocumentRNNEmbeddings(word_embeddings,
                                                                     hidden_size=512,
                                                                     reproject_words=True,
                                                                     reproject_words_dimension=256,
                                                                     )

...

Create a classifier using the *document_embeddings*:

In [21]:
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=False)

Create a trainer from the classifier and the corpus:

In [22]:
trainer = ModelTrainer(classifier, corpus)

Train the model:

In [None]:
trainer.train('resources/taggers/ag_news',
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=150)
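The `anneal_factor` and `patience` settings describe learning-rate annealing on a plateau: when the monitored score has not improved for more than `patience` epochs, the learning rate is multiplied by `anneal_factor`. The following is a simplified sketch of that idea, assuming the dev loss is the monitored quantity. The function `anneal_schedule` and the loss values are made up for illustration, and the sketch does not reproduce the details of Flair's trainer.

```python
from typing import List


def anneal_schedule(dev_losses: List[float],
                    learning_rate: float = 0.1,
                    anneal_factor: float = 0.5,
                    patience: int = 5) -> List[float]:
    """Return the learning rate in effect after each epoch's dev loss is seen."""
    best = float("inf")
    bad_epochs = 0
    rates = []
    for loss in dev_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            learning_rate *= anneal_factor  # plateau detected: shrink the step size
            bad_epochs = 0
        rates.append(learning_rate)
    return rates


# Made-up dev losses: two improving epochs, then a plateau of six epochs.
losses = [0.70, 0.60] + [0.61] * 6
rates = anneal_schedule(losses)
print(rates)
```

With these made-up losses, the learning rate stays at 0.1 for the first seven epochs and is halved in the eighth, once the plateau has lasted longer than `patience` epochs.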

Visualize the training curve:

In [1]:
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves('resources/taggers/ag_news/loss.tsv')
plotter.plot_weights('resources/taggers/ag_news/weights.txt')

2019-04-10 02:41:21,074 No handles with labels found to put in legend.
2019-04-10 02:41:21,083 No handles with labels found to put in legend.
2019-04-10 02:41:21,094 No handles with labels found to put in legend.
