[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ajpar94/embeddings-comparison/blob/master/notebooks/train_flair_embedding.ipynb)
# Training a Flair Language Model

Example code for training your own [flair](https://github.com/zalandoresearch/flair) language models (*Flair Embeddings*). Most of this content follows [Tutorial: Training your own Flair Embeddings](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_9_TRAINING_LM_EMBEDDINGS.md), so check that tutorial for more detail.




### Google Colab setup
First, turn on 'Hardware accelerator: GPU' in *Edit > Notebook Settings*. Next, mount your Google Drive to easily access corpora, datasets and embeddings, and to store models. Finally, install flair and configure your paths.

In [0]:
# Mount google drive
from google.colab import drive
drive.mount('/gdrive')

In [0]:
!pip install flair --quiet

In [0]:
# PATHS
base_path = "/gdrive/My Drive/WordEmbeddings-Comparison/"
lm_path = f"{base_path}Language-Modeling/resources/"

## Preparing the Corpus
To train your own embeddings you need a suitably large plain text file. The corpus must consist of the following parts:

*   Test data -> *'test.txt'*
*   Validation data -> *'valid.txt'*
*   Training data split into many smaller parts, contained in the folder *'train'*

This means the corpus folder structure has to look like this:

```console
corpus
 |- test.txt
 |- valid.txt
 |- train
     |- train_split_1
     |- train_split_2
     |  ...
     |- train_split_x
```
 
To create a corpus folder from one plain text file, you can use this script: [*make_corpus_folder.py*](https://github.com/ajpar94/embeddings-comparison/blob/master/Language-Modeling/preprocessing/make_corpus_folder.py). For example, if you want to use 97% of the data for training, 1% for validation and 2% for testing, and want the training data split into 20 parts, you can run

```console
$ python make_corpus_folder.py corpus.txt /path/to/corpus_folder -p 97-1-2 -s 20
```
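
If you would rather build the folder inside the notebook, the same layout can be produced with a few lines of Python. The following is a minimal sketch and not the linked script: the function name `make_corpus_folder` and its parameters are hypothetical, it assumes the corpus fits in memory, and it splits line-wise in file order without shuffling.

```python
from pathlib import Path


def make_corpus_folder(source, target, train_pct=97, valid_pct=1, test_pct=2, n_splits=20):
    """Split a plain text file line-wise into train splits, valid.txt and test.txt."""
    if train_pct + valid_pct + test_pct != 100:
        raise ValueError("percentages must sum to 100")

    lines = Path(source).read_text(encoding="utf-8").splitlines(keepends=True)
    n_train = len(lines) * train_pct // 100
    n_valid = len(lines) * valid_pct // 100

    # contiguous slices: training first, then validation, the remainder is test
    train = lines[:n_train]
    valid = lines[n_train:n_train + n_valid]
    test = lines[n_train + n_valid:]

    target = Path(target)
    (target / "train").mkdir(parents=True, exist_ok=True)
    (target / "valid.txt").write_text("".join(valid), encoding="utf-8")
    (target / "test.txt").write_text("".join(test), encoding="utf-8")

    # distribute the training lines over n_splits files of near-equal size
    for i in range(n_splits):
        start = i * len(train) // n_splits
        end = (i + 1) * len(train) // n_splits
        split_file = target / "train" / f"train_split_{i + 1}"
        split_file.write_text("".join(train[start:end]), encoding="utf-8")
```

Calling `make_corpus_folder("corpus.txt", "/path/to/corpus_folder")` would then create `test.txt`, `valid.txt` and `train/train_split_1` to `train/train_split_20`.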






## Training the Language Model
You need to specify whether you want to train a forward or a backward model (I would recommend training both). If your corpus uses a Latin alphabet, load the default character dictionary. For non-Latin alphabets, check the [tutorial](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_9_TRAINING_LM_EMBEDDINGS.md#non-latin-alphabets) on how to create your own character dictionary. Then initialize your *TextCorpus*.

In [0]:
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import TextCorpus

# are you training a forward or backward LM?
is_forward_lm = True

# load the default character dictionary
dictionary = Dictionary.load('chars')

# corpus folder with train splits, test and valid
corpus_folder = "example-corpus"
corpus_path = f"{lm_path}corpus/{corpus_folder}/"

# initialize corpus
corpus = TextCorpus(corpus_path,
                    dictionary,
                    is_forward_lm,
                    character_level=True,
                    random_case_flip=False)
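
For a non-Latin alphabet, the first step towards a custom character dictionary is deciding which characters to keep. The sketch below only covers that counting step with the standard library; `frequent_characters` and its `coverage` threshold are hypothetical names chosen for illustration. The selected characters would then be added to a flair `Dictionary` as described in the linked tutorial.

```python
from collections import Counter


def frequent_characters(lines, coverage=0.99999):
    """Return the most frequent characters that together cover `coverage` of all characters."""
    counter = Counter()
    for line in lines:
        counter.update(line)  # counts every single character of the line

    total = sum(counter.values())
    selected, covered = [], 0
    for char, count in counter.most_common():
        # stop once the characters chosen so far reach the requested coverage
        if total and covered / total >= coverage:
            break
        selected.append(char)
        covered += count
    return selected
```

Dropping very rare characters keeps the dictionary small; anything left out is then unknown to the model, so the threshold is a trade-off you have to choose for your corpus.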

Next, specify the model folder. After training, this folder will contain the best model, the final model checkpoint, checkpoints for every epoch, training.log and a loss.txt (details about the training process).

Training a decent language model requires a powerful GPU and probably a lot of time. However, you can stop training whenever you want and continue later from a checkpoint (either `checkpoint.pt` or `epoch_X.pt`). The training log shows the progress of the training process, including the current loss, perplexity and learning rate. After each epoch, the log displays a sequence of text generated by the current model. How closely this text resembles real language is an indication of how well the model has learned. The model is trained with an annealed learning rate. Setting `patience=10` means the scheduler will decrease the learning rate after 10 splits without improvement. For a full list of training parameters check [here](https://github.com/zalandoresearch/flair/blob/master/flair/trainers/language_model_trainer.py#L244).
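
To make the logged numbers easier to read, here is a small standalone simulation of the two ideas above. It is a simplified sketch, not flair's implementation: `perplexity` assumes the loss is a cross-entropy in nats, `anneal` is a hypothetical helper, the factor `0.25` is an assumed value, and the real scheduler may count the patience window slightly differently.

```python
import math


def perplexity(loss):
    """Perplexity for a cross-entropy loss measured in nats."""
    return math.exp(loss)


def anneal(losses, learning_rate=20.0, patience=10, anneal_factor=0.25):
    """Yield the learning rate in effect after each validation loss (one per split)."""
    best = float("inf")
    bad_splits = 0
    for loss in losses:
        if loss < best:
            # improvement: remember the new best loss and reset the counter
            best, bad_splits = loss, 0
        else:
            bad_splits += 1
            if bad_splits >= patience:
                # no improvement for `patience` splits: shrink the learning rate
                learning_rate *= anneal_factor
                bad_splits = 0
        yield learning_rate
```

For example, `list(anneal([2.0, 1.5] + [1.6] * 10))` stays at `20.0` for eleven splits and drops to `5.0` on the twelfth, because the loss never beat `1.5` during the last ten splits.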

In [0]:
from flair.trainers.language_model_trainer import LanguageModelTrainer

# model folder
model_folder = "example-fwd"
model_path = f"{lm_path}models/FLAIR/{model_folder}/"

# option to continue training from checkpoint
continue_training = False

if not continue_training:
    # instantiate your language model, set hidden size and number of layers
    language_model = LanguageModel(dictionary,
                                   is_forward_lm,
                                   hidden_size=1024,
                                   nlayers=1)
  
    trainer = LanguageModelTrainer(language_model, corpus)
  
else:
    checkpoint = f"{model_path}checkpoint.pt"
    trainer = LanguageModelTrainer.load_from_checkpoint(checkpoint, corpus)


trainer.log_interval = 500
trainer.train(model_path,
              sequence_length=250,
              mini_batch_size=32,
              max_epochs=10,
              learning_rate=20.0,
              patience=10,
              checkpoint=True,
              num_workers=2)
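
The cell above always resumes from `checkpoint.pt`. If you want to fall back to the most recent `epoch_X.pt` when that file is missing, a helper along these lines could pick the path for you. This is a sketch under the assumption that the model folder uses exactly the file names mentioned above; `latest_checkpoint` is a hypothetical name and not part of flair.

```python
import re
from pathlib import Path


def latest_checkpoint(model_path):
    """Return checkpoint.pt if present, else the highest-numbered epoch_X.pt, else None."""
    folder = Path(model_path)
    final = folder / "checkpoint.pt"
    if final.is_file():
        return final

    # collect (epoch number, path) pairs for files named epoch_<number>.pt
    epochs = []
    for path in folder.glob("epoch_*.pt"):
        match = re.fullmatch(r"epoch_(\d+)\.pt", path.name)
        if match:
            epochs.append((int(match.group(1)), path))

    if not epochs:
        return None
    return max(epochs, key=lambda item: item[0])[1]
```

The returned path (or `None` if nothing was saved yet) can be passed to `LanguageModelTrainer.load_from_checkpoint` in place of the hard-coded `checkpoint` variable.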

## Fine-Tuning an Existing Language Model

Fine-tuning an existing model can be easier than training from scratch, for example if you have a general LM for German and would like to adapt it to a specific domain. You can access the language model of an existing embedding like this: `FlairEmbeddings('de-forward').lm`

Keep in mind that you need to match the direction of the model and dictionary. 

In [0]:
from flair.data import Dictionary
from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus


# instantiate an existing LM, such as one from the FlairEmbeddings
language_model = FlairEmbeddings('de-forward').lm

# are you fine-tuning a forward or backward LM?
is_forward_lm = language_model.is_forward_lm

# get the dictionary from the existing language model
dictionary: Dictionary = language_model.dictionary

# corpus folder with train splits, test and valid
corpus_folder = "example-corpus"
corpus_path = f"{lm_path}corpus/{corpus_folder}/"

# initialize Corpus
corpus = TextCorpus(corpus_path,
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# use the model trainer to fine-tune this model on your corpus
trainer = LanguageModelTrainer(language_model, corpus)

# model folder
model_folder = "de-forward-finetuned"
model_path = f"{lm_path}models/FLAIR/{model_folder}/"

trainer.train(model_path,
              sequence_length=100,
              mini_batch_size=100,
              learning_rate=20,
              patience=10,
              checkpoint=True)