# FARM Building Blocks

Welcome to the FARM building blocks tutorial! There are many different ways to make use of this repository, but in this notebook, we will go through the most important building blocks that will help you harvest the rewards of a successfully trained NLP model.

Happy FARMing!

## 1) Text Classification

GNAD (https://tblock.github.io/10kGNAD/) is a dataset of 10K German documents labelled with one of 9 classes. In this section, we are going to build a classifier for this task that is composed of Google's BERT language model and a feed forward neural network prediction head.

### Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Let's start by adjusting the working directory so that it is the root of the repository
# This should be run just once.

import os
os.chdir('../')
print("Current working directory is {}".format(os.getcwd()))

Current working directory is /Users/deepset/deepset/FARM


In [3]:
# Here are the imports we need

import torch
from farm.modeling.tokenization import BertTokenizer
from farm.data_handler.processor import GNADProcessor
from farm.data_handler.data_silo import DataSilo
from farm.modeling.language_model import Bert
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.adaptive_model import AdaptiveModel
from farm.experiment import initialize_optimizer
from farm.train import Trainer

Apex not installed. If you use distributed training with local rank != -1 apex must be installed.


In [4]:
# We need to fetch the right device to drive the growth of our model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Data Handling

In [5]:
# Here we initialize a tokenizer that will be used for preprocessing text
# This is the BERT Tokenizer, which uses the WordPiece subword method.
# It is currently loaded with a German model

tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path="bert-base-german-cased",
    do_lower_case=False)

07/19/2019 17:20:51 - INFO - farm.modeling.tokenization_utils -   loading file https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt from cache at /Users/deepset/.cache/torch/farm/da299cdd121a3d71e1626f2908dda0d02658f42e925a3d6abd8273ec08cf41a6.2a48e6c60dcdb582effb718237ce5894652e3b4abb94f0a4d9a857b70333308d


In [6]:
# We can test out how it will do on an example sentence

EXAMPLE_SENTENCE = "Selbst ein blindes Huhn findet mal ein Korn."
tokenizer.tokenize(EXAMPLE_SENTENCE)

['Selbst',
 'ein',
 'bl',
 '##inde',
 '##s',
 'Hu',
 '##hn',
 'findet',
 'mal',
 'ein',
 'Korn',
 '.']
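
The split of `blindes` into `bl`, `##inde`, `##s` comes from greedy longest-match-first lookup against a fixed vocabulary, which is how WordPiece tokenization works. The following is a minimal sketch of that idea, not FARM's tokenizer: the function name `wordpiece_tokenize` and the toy vocabulary are assumptions made for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first and shrink until a match is found
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not the real German BERT vocabulary)
TOY_VOCAB = {"Selbst", "ein", "bl", "##inde", "##s", "Hu", "##hn", "findet", "mal", "Korn", "."}

print(wordpiece_tokenize("blindes", TOY_VOCAB))  # ['bl', '##inde', '##s']
print(wordpiece_tokenize("Huhn", TOY_VOCAB))     # ['Hu', '##hn']
```

Words that cannot be covered by any vocabulary entry fall back to a single unknown token in this sketch.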

In [7]:
# In order to prepare the data for the model, we need a set of
# functions to transform data files into PyTorch Datasets.
# We group these together in Processor objects.
# We will need a new Processor object for each new source of data.
# The abstract class can be found in farm.data_handler.processor.Processor

processor = GNADProcessor(tokenizer=tokenizer,
                          max_seq_len=128,
                          data_dir="data/gnad")

In [8]:
# We need a DataSilo in order to keep our train, dev and test sets separate.
# The DataSilo will call the functions in the Processor to generate these sets.
# From the DataSilo, we can fetch a PyTorch DataLoader object which will
# be passed on to the model.
# Here is a good place to define a batch size for the model

BATCH_SIZE = 32

data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)

07/19/2019 17:09:48 - INFO - farm.data_handler.data_silo -   Loading train set from: data/gnad/train.csv
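
To make the role of the batch size concrete, here is a minimal sketch of holding out a dev set and cutting the remaining samples into batches. It is not how `DataSilo` is implemented: the helper names `split_train_dev` and `make_batches` and the toy data are hypothetical.

```python
def split_train_dev(samples, dev_size):
    """Hold out the first `dev_size` samples as a dev set; the rest is training data."""
    return samples[dev_size:], samples[:dev_size]


def make_batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items each."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Hypothetical data set of 100 samples, standing in for preprocessed documents
toy_samples = list(range(100))
train, dev = split_train_dev(toy_samples, dev_size=10)
batches = list(make_batches(train, batch_size=32))

print(len(train), len(dev))   # 90 10
print(len(batches))           # 3 batches per epoch
print(len(batches[-1]))       # 26, the final batch is smaller than the others
```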



### Modeling

In FARM, we make a strong distinction between the language model and prediction head so that you can mix and match different building blocks for your needs.

For example, in the transfer learning paradigm, you might use one language model for both document classification and NER. Or perhaps you have a pretrained language model that you would like to adapt to your domain and then use for a downstream task such as question answering.

All this is possible within FARM and requires only the replacement of a few modular components, as we shall see below.

Let's first have a look at how we might set up a model.

In [6]:
# The language model is the foundation on which modern NLP systems are built.
# It encapsulates a general understanding of sentence semantics
# and is not specific to any one task.

# Here we are using Google's BERT model as implemented by HuggingFace.
# The model being loaded is a German model that we trained.
# You can also change MODEL_NAME_OR_PATH to point to a BERT model that you
# have saved, or to the name of a model available from the HuggingFace repository.
# See farm.modeling.language_model.PRETRAINED_MODEL_ARCHIVE_MAP for a list of
# available models

MODEL_NAME_OR_PATH = "bert-base-german-cased"

language_model = Bert.load(MODEL_NAME_OR_PATH)

07/19/2019 17:20:56 - INFO - botocore.credentials -   Found credentials in shared credentials file: ~/.aws/credentials
07/19/2019 17:20:56 - INFO - farm.modeling.language_model -   loading archive file s3://int-models-bert/bert-base-cased-de-2b-end/bert-base-cased-de-2b-end.tar.gz from cache at /Users/deepset/.cache/torch/farm/9c04dc7fe652b18e117a3bcfbbd50a46dd97dcce1b27f4689c47c52fbf0ebf77.5d8643be67c1cdea6a3daad77fcc21f22aaca376cc4bffa9152d4260c4285410
07/19/2019 17:20:56 - INFO - farm.modeling.language_model -   extracting archive file /Users/deepset/.cache/torch/farm/9c04dc7fe652b18e117a3bcfbbd50a46dd97dcce1b27f4689c47c52fbf0ebf77.5d8643be67c1cdea6a3daad77fcc21f22aaca376cc4bffa9152d4260c4285410 to temp dir /var/folders/9y/kg7mpp0947l5n87j11ff64200000gp/T/tmpxfi2jxxi
07/19/2019 17:21:03 - INFO - farm.modeling.language_model -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02

In [9]:
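# More concretely, a language model takes a sequence of token ids and returns vectors.
# Its forward pass also needs segment_ids and a padding_mask, so we build all three
# tensors by hand for the example sentence. This is a minimal sketch; the processor
# normally creates these inputs for you.

tokens = ["[CLS]"] + tokenizer.tokenize(EXAMPLE_SENTENCE) + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
segment_ids = torch.zeros_like(input_ids)   # a single segment, so all zeros
padding_mask = torch.ones_like(input_ids)   # no padding, so all ones

language_model(input_ids, segment_ids, padding_mask)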

In [10]:
# A prediction head is a model that processes the output of the language model
# for a specific task.
# Prediction heads will look different depending on whether you're doing text classification,
# Named Entity Recognition (NER), question answering or some other task.
# They should generate logits over the available prediction classes and contain methods
# to convert these logits to losses or predictions.

# Here we use TextClassificationHead which receives a single fixed length sentence vector
# and processes it using a feed forward neural network. layer_dims is a list of dimensions:
# [input_dims, hidden_1_dims, hidden_2_dims ..., output_dims]

# Here by default we have a single layer network.
# It takes in a vector of length 768 (the default size of BERT's output).
# It outputs a vector of length 9 (the number of classes in the GNAD dataset)

LAYER_DIMS = [768, 9]

prediction_head = TextClassificationHead(layer_dims=LAYER_DIMS)
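
To illustrate how a list such as `[768, 9]` can describe a feed forward network, the sketch below builds one linear layer per consecutive pair of dimensions and runs a vector through them. It is plain Python rather than FARM's `TextClassificationHead`. The helper names, the random initialization and the ReLU between layers are all assumptions.

```python
import random


def build_layers(layer_dims, seed=0):
    """Create one (weights, bias) pair per consecutive pair of dimensions."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        bias = [0.0] * n_out
        layers.append((weights, bias))
    return layers


def forward(layers, vector):
    """Apply each linear layer in turn, with ReLU between layers but not after the last."""
    for i, (weights, bias) in enumerate(layers):
        vector = [sum(w * x for w, x in zip(row, vector)) + b
                  for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            vector = [max(0.0, v) for v in vector]
    return vector


layers = build_layers([768, 9])
logits = forward(layers, [0.5] * 768)  # stand-in for a sentence vector
print(len(layers), len(logits))        # 1 9
```

A list with more entries, for example `[768, 128, 9]`, would yield two layers in this sketch, with a hidden size of 128.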

In [11]:
# The language model and prediction head are coupled together in the Adaptive Model.
# This class takes care of model saving and loading and also coordinates
# cases where there is more than one prediction head.

# EMBEDS_DROPOUT_PROB is the probability that an element of the output vector from the
# language model will be set to zero.
EMBEDS_DROPOUT_PROB = 0.1

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_sequence"],
    device=device)
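
The effect of a dropout probability of 0.1 can be sketched in a few lines of plain Python. This assumes the common "inverted dropout" formulation, where surviving elements are rescaled by `1 / (1 - p)` during training. The `dropout` helper below is hypothetical and is not FARM's code.

```python
import random


def dropout(vector, drop_prob, rng):
    """Zero each element with probability `drop_prob` and rescale the survivors."""
    keep_prob = 1.0 - drop_prob
    return [x / keep_prob if rng.random() >= drop_prob else 0.0 for x in vector]


rng = random.Random(42)
embedding = [1.0] * 10000            # stand-in for language model output values
dropped = dropout(embedding, drop_prob=0.1, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
print(round(zero_fraction, 2))       # close to the chosen dropout probability
```

Dropout of this kind is only applied during training; at prediction time the vector is passed through unchanged.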



### Training

In [10]:
# Here we initialize a BERT Adam optimizer whose learning rate warms up linearly
# and then decays linearly.
# You can set the learning rate, the warmup proportion and the number of epochs to train for

LEARNING_RATE = 2e-5
WARMUP_PROPORTION = 0.1
N_EPOCHS = 1

optimizer, warmup_linear = initialize_optimizer(
    model=model,
    learning_rate=LEARNING_RATE,
    warmup_proportion=WARMUP_PROPORTION,
    n_examples=data_silo.n_samples("train"),
    batch_size=data_silo.batch_size,
    n_epochs=N_EPOCHS)
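
The shape of a linear warmup and warmdown schedule can be sketched as a multiplier on the learning rate. This is an illustration under stated assumptions, not the function returned by `initialize_optimizer`: the name `linear_warmup_schedule` and the example count of 8000 training samples are hypothetical.

```python
import math


def linear_warmup_schedule(step, total_steps, warmup_proportion):
    """Return the learning-rate multiplier in [0, 1] for a given optimization step."""
    progress = step / total_steps
    if progress < warmup_proportion:
        return progress / warmup_proportion                          # ramp up from 0 to 1
    return max(0.0, (1.0 - progress) / (1.0 - warmup_proportion))   # decay back to 0


N_EXAMPLES = 8000        # hypothetical number of training samples
BATCH_SIZE = 32
N_EPOCHS = 1
WARMUP_PROPORTION = 0.1
LEARNING_RATE = 2e-5

total_steps = math.ceil(N_EXAMPLES / BATCH_SIZE) * N_EPOCHS   # 250 steps

for step in (0, 12, 25, 125, 250):
    multiplier = linear_warmup_schedule(step, total_steps, WARMUP_PROPORTION)
    print(step, LEARNING_RATE * multiplier)
```

With these numbers the learning rate peaks at step 25, after the first 10% of training, and reaches zero at the final step.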


In [11]:
# The Trainer handles the training loop.
# It will also trigger evaluation during training using the dev data
# and after training using the test data.

# Set N_GPU to a positive value if CUDA is available
N_GPU = 0

trainer = Trainer(
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    warmup_linear=warmup_linear,
    device=device,
)


In [14]:
model = trainer.train(model)

07/19/2019 17:05:22 - INFO - farm.train -   ***** Running training *****
Train epoch 1/1:   0%|          | 0/260 [00:00<?, ?it/s]



## 2) Switch to NER

In a transfer learning paradigm, there is a core computation that is shared amongst all tasks. FARM's modular structure means that you can easily swap out different building blocks to make the same language model work for many different tasks.

We can adapt the above text classification model to NER by simply switching out the processor and prediction head.

In [7]:
# Import the new building blocks

from farm.data_handler.processor import CONLLProcessor
from farm.modeling.prediction_head import TokenClassificationHead

In [8]:
# This processor will preprocess the data for the CoNLL03 NER task

ner_processor = CONLLProcessor(tokenizer=tokenizer,
                               max_seq_len=128,
                               data_dir="data/conll03")
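
As a rough picture of what reading CoNLL-style data involves, the sketch below groups lines into sentences of (token, tag) pairs. It assumes the layout in which each line holds a token in the first column and its NER tag in the last, with blank lines and `-DOCSTART-` markers separating sentences. The function `read_conll_lines` is a hypothetical helper, not the reader used by `CONLLProcessor`.

```python
def read_conll_lines(lines):
    """Group CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        stripped = line.strip()
        # Blank lines and -DOCSTART- markers separate sentences
        if not stripped or stripped.startswith("-DOCSTART"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = stripped.split()
        current.append((columns[0], columns[-1]))  # first column: token, last column: tag
    if current:
        sentences.append(current)
    return sentences


toy_lines = [
    "-DOCSTART- -X- -X- -X- O\n",
    "\n",
    "Schartau O O O B-PER\n",
    "sagte O O O O\n",
    "\n",
]
print(read_conll_lines(toy_lines))  # [[('Schartau', 'B-PER'), ('sagte', 'O')]]
```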

In [9]:
# This prediction head is also a feed forward neural network but expects one
# vector per token in the input sequence and will generate a set of logits
# for each token

LAYER_DIMS = [768, 9]

ner_prediction_head = TokenClassificationHead(layer_dims=LAYER_DIMS)
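
Because the head predicts one label per token while NER data is labelled per word, word-level labels have to be spread over the sub-word tokens. A minimal sketch of one way to do this follows: word-initial tokens take the word's label and all other tokens take a placeholder. The function name, the `"X"` placeholder and the example split of "Schartau" are assumptions for illustration.

```python
def expand_word_labels(word_labels, initial_mask, continuation_label="X"):
    """Map word-level labels onto tokens using a mask of word-initial positions."""
    token_labels = []
    word_index = 0
    for is_initial in initial_mask:
        if is_initial:
            # Word-initial token: take the label of the next word
            token_labels.append(word_labels[word_index])
            word_index += 1
        else:
            # Continuation piece: mark it with the placeholder label
            token_labels.append(continuation_label)
    return token_labels


# "Schartau sagte" with a hypothetical split into ["Schar", "##tau", "sagte"]
print(expand_word_labels(["B-PER", "O"], [1, 0, 1]))  # ['B-PER', 'X', 'O']
```

The mask must contain exactly as many word-initial positions as there are word labels, otherwise the lookup runs out of labels or leaves some unused.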

In [None]:
# We can integrate these new pieces with the rest using this code
# It is pretty much the same structure as what we had above for text classification

BATCH_SIZE = 32

data_silo = DataSilo(
    processor=ner_processor,
    batch_size=BATCH_SIZE)

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[ner_prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_token"],
    device=device)

optimizer, warmup_linear = initialize_optimizer(
    model=model,
    learning_rate=LEARNING_RATE,
    warmup_proportion=WARMUP_PROPORTION,
    n_examples=data_silo.n_samples("train"),
    batch_size=BATCH_SIZE,
    n_epochs=N_EPOCHS)

trainer = Trainer(
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    warmup_linear=warmup_linear,
    device=device,
)

07/19/2019 17:21:13 - INFO - farm.data_handler.data_silo -   Loading train set from: data/conll03/train.txt



In [None]:
model = trainer.train(model)

## 3) Save and load