# Tutorial "How to solve NLP tasks with DeepPavlov"

This tutorial aims to familiarize participants with solving NLP tasks using `DeepPavlov`.
We are going to use **BERT-based models** in this tutorial.

The tutorial has the following structure:

* [BERT input representation](#BERT-input-representation)

* [DeepPavlov Installation](#DeepPavlov-Installation)

* [Configs](#Configs)

* [Command line interface](#Command-line-interface)

* [Python code interface](#Python-code-interface)

* [BERT for text classification](#BERT-for-text-classification)

* [BERT for tagging](#BERT-for-tagging)

* [BERT for Question Answering](#BERT-for-Question-Answering)

* [Zero-shot Transfer from English to 103 languages](#Zero-shot-Transfer-from-English-to-103-languages)

## BERT input representation
Text preprocessing for BERT relies on splitting text into subtokens (WordPieces). BERT then represents each subtoken internally as the sum of three vectors:

* subtoken embedding
* segment embedding
* position embedding

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_input.png?raw=1" width="75%" />
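
The three-vector sum described above can be sketched in a few lines. This is only a toy illustration: the lookup tables, the 4-dimensional vectors and the `embed` helper are assumptions made up for this example, not BERT's real weights or DeepPavlov code.

```python
# Toy illustration: every table below holds made-up numbers, not BERT weights.
EMBEDDING_DIM = 4

# Hypothetical lookup tables: subtoken -> vector, segment id -> vector, position -> vector.
subtoken_embeddings = {
    "[CLS]": [0.1, 0.0, 0.0, 0.0],
    "play": [0.0, 0.2, 0.0, 0.0],
    "##ing": [0.0, 0.0, 0.3, 0.0],
    "[SEP]": [0.0, 0.0, 0.0, 0.4],
}
segment_embeddings = {0: [1.0] * EMBEDDING_DIM, 1: [2.0] * EMBEDDING_DIM}
position_embeddings = [[10.0 * i] * EMBEDDING_DIM for i in range(8)]


def embed(subtokens, segment_ids):
    """Return one input vector per subtoken: subtoken + segment + position."""
    vectors = []
    for position, (subtoken, segment_id) in enumerate(zip(subtokens, segment_ids)):
        parts = zip(
            subtoken_embeddings[subtoken],
            segment_embeddings[segment_id],
            position_embeddings[position],
        )
        # Element-wise sum of the three vectors.
        vectors.append([tok + seg + pos for tok, seg, pos in parts])
    return vectors


inputs = embed(["[CLS]", "play", "##ing", "[SEP]"], [0, 0, 0, 0])
for vector in inputs:
    print(vector)
```

Because the position vector differs at every index, the same subtoken gets a different input vector depending on where it occurs in the sequence.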

## DeepPavlov Installation

The following command installs the basic requirements of `DeepPavlov`. Note that a particular model will probably require additional dependencies. To keep versions consistent, install those additional requirements through `DeepPavlov` as well.

In [0]:
!pip install deeppavlov

## Configs

One of the main concepts of `DeepPavlov` is that each model is defined by a configuration file, the so-called `config`. A `config` is just a `json` file containing a dictionary with the dataset reader, dataset iterator, model pipeline, training parameters and some metadata.

So, if you want to use a pre-defined model, just find the config you are interested in [here](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs).

If you want to compose your own model pipeline or modify an existing model, the best approach is to take one of the configs presented [here](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs) and change the data, pipeline elements, training parameters, etc.
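
As a rough sketch of that "copy and modify" workflow, the snippet below overrides one training parameter in a config dictionary without touching the original. The trimmed `base_config`, its key names and values, and the `with_overrides` helper are all assumptions made for illustration; a real config contains many more fields, so check an actual file for the exact layout.

```python
import copy
import json

# A heavily trimmed, hypothetical config: key names and values are placeholders
# chosen for illustration and should be checked against a real config file.
base_config = {
    "dataset_reader": {"class_name": "my_reader", "data_path": "./data"},
    "dataset_iterator": {"class_name": "my_iterator"},
    "chainer": {"in": ["x"], "in_y": ["y"], "pipe": [], "out": ["y_pred"]},
    "train": {"batch_size": 64, "epochs": 10},
    "metadata": {},
}


def with_overrides(config, section, **overrides):
    """Return a deep copy of `config` with some keys of `section` replaced."""
    new_config = copy.deepcopy(config)
    new_config[section].update(overrides)
    return new_config


small_batch_config = with_overrides(base_config, "train", batch_size=16)

print(json.dumps(small_batch_config["train"], indent=2))
```

Working on a deep copy keeps the original dictionary intact, which is convenient when several variants of the same config are compared.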

For example, suppose we want to use a BERT-based classification model to detect insults in the [Social Commentary](https://www.kaggle.com/c/detecting-insults-in-social-commentary) dataset.

Let's look into the config.


In [0]:
import json
from deeppavlov import configs

config_path = configs.classifiers.insults_kaggle_bert
print("Path to config: {}".format(config_path))

with open(config_path, "r") as f:
    config = json.load(f)
    
print(json.dumps(config, indent=2))

There are two ways to work with configs in `DeepPavlov`: from the command line or from `python` code.

## Command line interface


First, let's install the additional dependencies for the BERT-based classification model.

In [0]:
! python -m deeppavlov install insults_kaggle_bert


From the command line, one can start an interactive session with the model as follows (the `-d` flag downloads the model; files that were already downloaded and have not been modified won't be downloaded again):

In [0]:
! python -m deeppavlov interact -d insults_kaggle_bert

Configs can also be used for training and evaluation (calculating scores) via the `train` and `evaluate` modes. The `-d` flag is optional here as well.

```
! python -m deeppavlov train [-d] insults_kaggle_bert
```
and
```
! python -m deeppavlov evaluate [-d] insults_kaggle_bert
```

Note that `insults_kaggle_bert` in the examples above is not a special keyword but simply the stem of the corresponding [config file name](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/classifiers/insults_kaggle_bert.json). Any config file from the folder [`deeppavlov/configs/`](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/) can therefore be passed to the command line by its file name without the extension. Alternatively, you can save a config file anywhere and specify the full path to it.
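
When these commands need to be launched from a script rather than typed by hand, the argument list can be assembled programmatically. The sketch below only builds the list and does not run anything; `deeppavlov_command` is a hypothetical helper written for this example, and it covers just the three modes shown above with the `-d` flag.

```python
import sys

# Modes shown above that accept the optional -d flag.
MODES = ("interact", "train", "evaluate")


def deeppavlov_command(mode, config, download=False):
    """Build the argument list for `python -m deeppavlov <mode> [-d] <config>`.

    `config` may be a config file stem or a full path to a config file.
    """
    if mode not in MODES:
        raise ValueError("unsupported mode: {}".format(mode))
    args = [sys.executable, "-m", "deeppavlov", mode]
    if download:
        args.append("-d")
    args.append(config)
    return args


command = deeppavlov_command("train", "insults_kaggle_bert", download=True)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` in an environment where `DeepPavlov` is installed.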

## Python code interface

In the `python` interface, the functions `build_model` and `train_evaluate_model_from_config` accept either **a path to a config file or the config dictionary itself**.

`DeepPavlov` models can be used from `python` in the following way:

In [0]:
from deeppavlov import build_model, configs

model = build_model(configs.classifiers.insults_kaggle_bert, 
                    download=False) # download=True if model is not downloaded yet

In [0]:
model(['Hey, you are stupid!', 
       'Hey, you are smart!'])

Configs can also be used for training and/or evaluation (calculating scores) with the `train_evaluate_model_from_config` function. The `download` parameter is optional here as well.

```python
from deeppavlov import configs, train_evaluate_model_from_config

train_evaluate_model_from_config(configs.classifiers.insults_kaggle_bert,
                                 to_train=False,  # set to True to train the model
                                 to_validate=True,
                                 download=True)  # download=True if model is not downloaded yet
```

## BERT for text classification
To use a BERT model for text classification, it is enough to add a single dense layer on top of the output of the last BERT Transformer layer for the special `[CLS]` token.

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_classification.png?raw=1" width="75%" />
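
The classification head described above can be illustrated with a tiny pure-Python sketch: one dense layer followed by a softmax, applied to the `[CLS]` vector. The 3-dimensional vector, the weights and the two-class setup are made-up numbers for illustration only; in a real model the vector comes from BERT and the weights are learned.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias):
    """Apply one dense layer (weights @ cls_vector + bias), then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Made-up values: a 3-dimensional [CLS] output and a 2-class dense layer.
cls_vector = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.0],   # class 0, e.g. "not insult"
           [0.0, 0.0, 1.0]]   # class 1, e.g. "insult"
bias = [0.0, 0.0]

probabilities = classify(cls_vector, weights, bias)
print(probabilities)
```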

Let's recap and discuss what the config for BERT-based text classification looks like:

In [0]:
import json
from deeppavlov import configs

config_path = configs.classifiers.insults_kaggle_bert
print("Path to config: {}".format(config_path))

with open(config_path, "r") as f:
    config = json.load(f)
    
print(json.dumps(config, indent=2))

## BERT for tagging

A BERT model can be used for tagging tasks such as Named Entity Recognition and Part of Speech tagging.
We train only one dense layer on top of the output of the last BERT Transformer layer for each token. You can optionally add a CRF layer on top of the dense layer, as in the most common tagging architecture, BiLSTM + CRF.

Named Entity Recognition:

For example, we want to extract persons' and organizations' names from the text. Then for the input text:

    Yan Goodfellow works for Google Brain

a NER model needs to provide the following sequence of tags:

    B-PER I-PER    O     O   B-ORG  I-ORG

Here the *B-* and *I-* prefixes stand for the beginning and the inside of an entity, while *O* stands for outside, i.e. no tag. Markup with this prefix scheme is called *BIO markup*. It makes it possible to distinguish consecutive entities of the same type.
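
To make the role of the prefixes concrete, here is a minimal sketch that groups BIO tags back into entities. `bio_to_spans` is a hypothetical helper written for this tutorial, not a `DeepPavlov` function, and it simply drops an *I-* tag that does not continue an open entity.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, entity_text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = "Yan Goodfellow works for Google Brain".split()
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
entities = bio_to_spans(tokens, tags)
print(entities)

# Two adjacent entities of the same type stay separate thanks to the B- prefix.
adjacent = bio_to_spans(["Moscow", "London"], ["B-LOC", "B-LOC"])
print(adjacent)
```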

Here is how input is preprocessed for tagging:

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_NER.png?raw=1" width="75%" />

In [0]:
from deeppavlov import build_model, configs

model = build_model(configs.ner.ner_ontonotes_bert, 
                    download=True) # download=True if model is not downloaded yet

In [0]:
model(['Moscow Institute of Physics and Technology is aimed to win Alexa Prize Challenge'])

Data for the Named Entity Recognition task is usually stored in CoNLL files.
A typical CoNLL file with NER data contains lines with pairs of tokens (words or punctuation symbols) and tags, separated by whitespace. In many cases additional information, such as POS tags, is included between the token and the tag. Different documents are separated by lines **starting** with the **-DOCSTART-** token. Different sentences are separated by an empty line. Example:

    -DOCSTART- -X- -X- O

    EU NNP B-NP B-ORG
    rejects VBZ B-VP O
    German JJ B-NP B-MISC
    call NN I-NP O
    to TO B-VP O
    boycott VB I-VP O
    British JJ B-NP B-MISC
    lamb NN I-NP O
    . . O O

    Peter NNP B-NP B-PER
    Blackburn NNP I-NP I-PER
    
    
If you want to train a model on your own data, you can convert it to this CoNLL format or implement your own version of `dataset_reader`.
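
For orientation, a reader for this format can be sketched in plain Python as follows. This is not the `dataset_reader` shipped with `DeepPavlov`; `parse_conll` is a hypothetical helper that assumes the token is the first column and the NER tag is the last one, as in the example above.

```python
def parse_conll(text):
    """Return a list of sentences, each a list of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # An empty line or a -DOCSTART- line ends the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag;
        # anything in between (POS tag, chunk tag) is ignored here.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

sentences = parse_conll(sample)
for sentence in sentences:
    print(sentence)
```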

Now let's look into the config for BERT-based NER.

In [0]:
import json
from deeppavlov import configs

config_path = configs.ner.ner_ontonotes_bert
print("Path to config: {}".format(config_path))

with open(config_path, "r") as f:
    config = json.load(f)
    
print(json.dumps(config, indent=2))

## BERT for Question Answering 

One can use a BERT model for extractive Question Answering. For example, given the context:
```markdown
In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail… Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals **within a cloud**. Short, intense periods of rain in scattered locations are called “showers”.
```
and question:
```
Where do water droplets collide with ice crystals to form precipitation?
```
The answer is always a span from the context.

To solve this task with a BERT model, all we need is to train two dense layers that predict the answer start and answer end positions:

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_QA.png?raw=1" width="50%" />
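
Once the two dense layers have produced a start score and an end score for every subtoken, an answer span still has to be chosen. The sketch below uses a common decoding rule as an assumption (maximize start score plus end score, with the end not before the start and a length limit); it is not necessarily the exact procedure `DeepPavlov` uses, and the scores are made up.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximizing start_scores[start] + end_scores[end].

    Only spans with start <= end and at most `max_answer_len` subtokens are considered.
    """
    best = (0, 0, float("-inf"))
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Made-up scores for four subtokens of the context.
subtokens = ["within", "a", "cloud", "."]
start_scores = [2.0, 0.1, 0.3, -1.0]
end_scores = [0.2, 0.1, 3.0, 0.5]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(subtokens[start:end + 1])
print(answer, score)
```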

In [0]:
from deeppavlov import build_model, configs

model = build_model(configs.squad.squad_bert,
                    download=True)

The model returns the answer, its position in the context (in characters) and a confidence score.

In [0]:
model(['In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail… Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called “showers”.'], 
      ['Where do water droplets collide with ice crystals to form precipitation?'])

To train the model on your own data, put it into json files in SQuAD format: https://rajpurkar.github.io/SQuAD-explorer/

These json files contain paragraphs, questions and answers.
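
As a rough guide, a minimal file of this kind can be built as shown below. The field names follow the SQuAD v1.1 layout as far as I am aware of it, and the title and the `id` value are made up; compare with the official files before relying on this sketch.

```python
import json

context = ("Precipitation forms as smaller droplets coalesce via collision "
           "with other rain drops or ice crystals within a cloud.")
answer_text = "within a cloud"

squad_like = {
    "version": "1.1",
    "data": [{
        "title": "Precipitation",  # made-up title
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "example-0",  # made-up identifier
                "question": "Where do water droplets collide with ice crystals "
                            "to form precipitation?",
                "answers": [{
                    "text": answer_text,
                    # Character offset of the answer inside the context.
                    "answer_start": context.index(answer_text),
                }],
            }],
        }],
    }],
}

# Sanity check: the offset must point at the answer text.
answer = squad_like["data"][0]["paragraphs"][0]["qas"][0]["answers"][0]
offset = answer["answer_start"]
recovered = context[offset:offset + len(answer["text"])]

print(json.dumps(squad_like, indent=2))
```

Checking that `answer_start` really points at the answer text is worthwhile, because extractive QA training depends on these character offsets.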

## Zero-shot Transfer from English to 103 languages

The BERT model was originally trained only for English, but a multilingual model trained on 103 languages was released later. It makes it possible to train models on one language and use them for 103 other languages. This technique is called zero-shot transfer because we don't use any training data for the target language.

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_multilingual.png?raw=1" width="75%" />

We will cover two examples:
 * NER transfer from Ontonotes dataset (English -> 103)
 * QA transfer from SQuAD dataset (English -> 103)
 
These models are also available at [demo.ipavlov.ai](https://demo.ipavlov.ai/#multiLang)

### Zero-shot multilingual NER

Download the model and interact with it:

In [0]:
from deeppavlov import build_model, configs

model = build_model(configs.ner.ner_ontonotes_bert_mult, 
                    download=True)

In [0]:
model(['Curling World Championship will be held in Antananarivo'])

In [0]:
model(['Чемпионат мира по кёрлингу пройдёт в Антананариву']) # Чемпионат мира по кёрлингу == Curling World Championship

### Zero-shot multilingual QA

Get the configuration file, download the model and interact with it:

In [0]:
! wget https://raw.githubusercontent.com/deepmipt/DeepPavlov/squad_multilingual_configs/deeppavlov/configs/squad/squad_bert_multilingual_freezed_emb.json

In [0]:
from deeppavlov import build_model, configs

model = build_model('./squad_bert_multilingual_freezed_emb.json', download=True)

In [0]:
model(['Su área de distribución comprende casi toda Sudamérica al este de los Andes en las '
       'cuencas del río Orinoco, del Amazonas y del Río de la Plata; cubriendo desde el este '
       'de Venezuela y la Guyana hasta Uruguay y el norte y centro de Argentina. Pueden vivir '
       'en diferentes tipos de hábitat, pero muestran preferencia por algunos en concreto. '
       'Suelen encontrarse cerca de lagos, ríos, marismas o manglares.'],
      ['What countries do capybara live in?'])

As you can see, the model works even when the context and the question are in different languages!

### Zero-shot transfer performance

Results for Zero-Shot NER from English to Russian:

| model                           | Overall (Span F-1) | PER (Span F-1) | LOC (Span F-1) | ORG (Span F-1) |
|---------------------------------|--------------------|----------------|----------------|----------------|
| RuBERT NER                      | 97.7               | 98.3           | 99.7           | 94.9           |
| Zero-shot Multilingual BERT NER | 79.4               | 95.7           | 82.6           | 55.7           |

Results for Zero-Shot QA from English to Russian:

| model                            | F-1   |
|----------------------------------|-------|
| RuBERT QA | 84.6 |
| Zero-shot Multilingual BERT QA   | 77.36 |

Results for Zero-Shot QA from Russian to English:

| model                            | F-1   |
|----------------------------------|-------|
| BERT QA | 88.49 |
| Zero-shot Multilingual BERT QA   | 75.26 |