# DeepPavlov Library

DeepPavlov is an open-source framework for developing conversational Natural Language Processing (NLP) models. It is designed for modular, configuration-driven development of NLP models, is based on PyTorch, and supports HuggingFace `transformers`.

Useful links:

- `deeppavlov` library documentation: http://docs.deeppavlov.ai/en/master/
- `deeppavlov` demo: https://demo.deeppavlov.ai/

Install the library

In [None]:
!pip install deeppavlov -q

## Configuration files

DeepPavlov models are defined in their corresponding configuration files (see the full list in the docs or on GitHub).

In this notebook we will be working with the `topics_distilbert_base_uncased` model; its configuration file can be found [here](https://github.com/deeppavlov/DeepPavlov/blob/master/deeppavlov/configs/classifiers/topics_distilbert_base_uncased.json).

This model is a DistilBERT-based classifier trained on a dataset of conversational topics. Let's inspect what a typical configuration file looks like.

Each configuration file consists of five main sections: `dataset_reader`,  `dataset_iterator`, `chainer`, `train`, and `metadata`.



- `dataset_reader` and `dataset_iterator` are responsible for **accessing the data** and **splitting it** into training, validation, and test sets. `dataset_reader` supports datasets from HuggingFace. Here we define the path to the dataset folder and the file name for each data split, and use a basic iterator to iterate over the examples during training and inference.

```
"dataset_reader": {
    "class_name": "basic_classification_reader",
    "class_sep": ";",
    "x": "text",
    "y": "topic",
    "data_path": "{DOWNLOADS_PATH}/dp_topics_downsampled_data/",
    "train" : "train.csv",
    "valid" : "valid.csv"  
  },
  "dataset_iterator": {
    "class_name": "basic_classification_iterator",
    "seed": 42
  }
```
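
To make the reader settings more concrete, here is a minimal sketch of what reading such a CSV file could look like. It is not DeepPavlov's implementation: the function `read_classification_csv` and the sample rows are hypothetical, and the sketch assumes that `x` and `y` name CSV columns and that `class_sep` separates several labels within one cell.

```python
import csv
import io

# Hypothetical in-memory stand-in for train.csv; the real dataset is not reproduced here.
SAMPLE_CSV = """text,topic
"I love pasta and pizza",Food
"The match ended in a draw",Sports
"We cooked dinner while watching the game",Food;Sports
"""


def read_classification_csv(handle, x="text", y="topic", class_sep=";"):
    """Return a list of (text, [labels]) pairs read from a CSV file object."""
    samples = []
    for row in csv.DictReader(handle):
        # One cell may hold several labels joined by class_sep.
        labels = [label for label in row[y].split(class_sep) if label]
        samples.append((row[x], labels))
    return samples


train = read_classification_csv(io.StringIO(SAMPLE_CSV))
for text, labels in train:
    print(labels, text)
```

Keeping the labels as a list, even for single-label rows, lets the same code path handle multi-label examples.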

- `chainer` is a **core concept** of DeepPavlov: it **builds a pipeline** from heterogeneous components (rule-based, ML, or DL) and allows the entire pipeline to be trained or run for inference as a single unit. `chainer` also specifies component inputs (`in`, `in_y`) and outputs (`out`) as arrays of names.
A pipeline element can be either a function or an object of a class that **implements the `__call__` method**. Any configuration file can be used within another configuration file as an element of the `chainer`, and any field of the nested configuration file can be overwritten.
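
The idea of routing data between components by name can be illustrated with a toy pipeline. This is only a sketch under simplifying assumptions, not DeepPavlov's `Chainer`: the `ToyPipeline` class, the `lowercase` function, and the `KeywordTagger` class are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence


class ToyPipeline:
    """Runs callables in order, routing values through a dict of named slots."""

    def __init__(self, in_x: Sequence[str], out: Sequence[str]):
        self.in_x = list(in_x)
        self.out = list(out)
        self.steps = []

    def append(self, component: Callable, in_x: Sequence[str], out: Sequence[str]) -> None:
        self.steps.append((component, list(in_x), list(out)))

    def __call__(self, *args):
        # Bind the pipeline inputs to their names, then run every step in order.
        memory = dict(zip(self.in_x, args))
        for component, in_x, out in self.steps:
            result = component(*[memory[name] for name in in_x])
            if len(out) == 1:
                result = (result,)
            memory.update(zip(out, result))
        values = [memory[name] for name in self.out]
        return values[0] if len(values) == 1 else values


def lowercase(batch):
    """A plain function can serve as a pipeline element."""
    return [text.lower() for text in batch]


class KeywordTagger:
    """So can any object whose class implements __call__."""

    def __init__(self, keyword):
        self.keyword = keyword

    def __call__(self, batch):
        return [self.keyword in text for text in batch]


pipeline = ToyPipeline(in_x=["x"], out=["y_pred"])
pipeline.append(lowercase, in_x=["x"], out=["x_lower"])
pipeline.append(KeywordTagger("pizza"), in_x=["x_lower"], out=["y_pred"])

print(pipeline(["I like PIZZA", "I like tea"]))  # prints [True, False]
```

Because each step only declares the names it reads and writes, components can be swapped or reordered without changing their code.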

In this model's pipeline, we first preprocess the data with the transformer preprocessor to get BERT input features for our examples:

```
"chainer": {
    "in": ["x"],
    "in_y": ["y"],
    "pipe": [
      {
        "class_name": "torch_transformers_preprocessor",
        "vocab_file": "{TRANSFORMER}",
        "do_lower_case": true,
        "max_seq_length": 128,
        "in": ["x"],
        "out": ["bert_features"]
      },
```
We also create the vocabulary of our target class labels and encode them into one-hot vectors.
```
      {
        "id": "classes_vocab",
        "class_name": "simple_vocab",
        "fit_on": ["y"],
        "save_path": "{MODEL_PATH}/classes.dict",
        "load_path": "{MODEL_PATH}/classes.dict",
        "in": ["y"],
        "out": ["y_ids"]
      },
      {
        "in": ["y_ids"],
        "out": ["y_onehot"],
        "class_name": "one_hotter",
        "id": "my_one_hotter",
        "depth": "#classes_vocab.len",
        "single_vector": true
      },
```
We pass our vectors to the classifier.
```
      {
        "class_name": "torch_transformers_classifier",
        "one_hot_labels": true,
        "n_classes": "#classes_vocab.len",
        "return_probas": true,
        "pretrained_bert": "{TRANSFORMER}",
        "save_path": "{MODEL_PATH}/model",
        "load_path": "{MODEL_PATH}/model",
        "multilabel": true,
        "optimizer": "AdamW",
        "optimizer_parameters": {"lr": 1e-05},
        "learning_rate_drop_patience": 5,
        "learning_rate_drop_div": 2.0,
        "in": ["bert_features"],
        "in_y": ["y_onehot"],
        "out": ["y_pred_probas"]
      },

```
And we decode the predicted probabilities to labels.
```
      {
        "in": "y_pred_probas",
        "out": "y_pred_ids",
        "class_name": "proba2labels",
        "max_proba": false,
        "confidence_threshold": 0.5
      },
      {
        "in": "y_pred_ids",
        "out": "y_pred_labels",
        "ref": "classes_vocab"
      },
      {
        "ref": "my_one_hotter",
        "in": "y_pred_ids",
        "out": "y_pred_onehot"
      }
    ],
    "out": ["y_pred_labels"]
  },
```
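
The decoding steps above can be sketched in plain Python. This is an illustration rather than the library's code: the vocabulary `ID_TO_LABEL` and both helper functions are hypothetical, and the strict `>` comparison against `confidence_threshold` is an assumption.

```python
from typing import Dict, List

# Hypothetical label vocabulary; the real one is stored in classes.dict.
ID_TO_LABEL: Dict[int, str] = {0: "Food", 1: "Movies", 2: "Sports"}


def probas_to_ids(probas: List[float], confidence_threshold: float = 0.5) -> List[int]:
    """Keep every class whose probability exceeds the threshold (multi-label)."""
    return [i for i, p in enumerate(probas) if p > confidence_threshold]


def ids_to_multi_hot(ids: List[int], depth: int) -> List[int]:
    """Merge several class ids into a single vector of length `depth`."""
    vector = [0] * depth
    for i in ids:
        vector[i] = 1
    return vector


batch_probas = [[0.91, 0.08, 0.62], [0.10, 0.20, 0.30]]
for probas in batch_probas:
    ids = probas_to_ids(probas)
    labels = [ID_TO_LABEL[i] for i in ids]
    print(ids, labels, ids_to_multi_hot(ids, depth=len(ID_TO_LABEL)))
# prints:
# [0, 2] ['Food', 'Sports'] [1, 0, 1]
# [] [] [0, 0, 0]
```

With a threshold instead of an argmax, an example can receive several labels or none at all, which is what a multi-label setup needs.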

- the `train` section defines **training parameters**, such as the trainer class, evaluation metrics, batch size, etc.

This config defines all of these parameters. In addition, `validation_patience` stops training if the validation metric has not improved 10 times in a row (set it to -1 if you want to train for all the defined epochs), and `val_every_n_epochs` defines how often validation is run.

```
  "train": {
    "epochs": 100,
    "batch_size": 64,
    "metrics": [
      {
        "name": "f1_macro",
        "inputs": [
          "y_onehot",
          "y_pred_onehot"
        ]
      },
      {
        "name": "f1_weighted",
        "inputs": [
          "y_onehot",
          "y_pred_onehot"
        ]
      },
      {
        "name": "accuracy",
        "inputs": [
          "y",
          "y_pred_labels"
        ]
      },
      {
        "name": "roc_auc",
        "inputs": [
          "y_onehot",
          "y_pred_probas"
        ]
      }
    ],
    "validation_patience": 10,
    "val_every_n_epochs": 1,
    "log_every_n_epochs": 1,
    "log_every_n_batches": 100,
    "show_examples": false,
    "evaluation_targets": [
      "train",
      "valid",
      "test"
    ],
    "tensorboard_log_dir": "{MODEL_PATH}/logs",
    "class_name": "torch_trainer"
  },
```
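
The effect of `validation_patience` can be shown with a small simulation. This is a sketch of the general early-stopping idea, not the trainer's actual code: the function `epochs_until_stop` and the validation scores are made up for illustration, and it assumes one validation per epoch (`val_every_n_epochs` equal to 1).

```python
from typing import List


def epochs_until_stop(valid_scores: List[float], validation_patience: int) -> int:
    """Return the number of epochs that run before training stops.

    Training stops once the score has failed to improve `validation_patience`
    times in a row; a non-positive patience disables early stopping.
    """
    best = float("-inf")
    bad_rounds = 0
    for epoch, score in enumerate(valid_scores, start=1):
        if score > best:
            best, bad_rounds = score, 0
        else:
            bad_rounds += 1
            if 0 < validation_patience <= bad_rounds:
                return epoch
    return len(valid_scores)


# Hypothetical validation scores, one per epoch.
scores = [0.61, 0.70, 0.69, 0.70, 0.68, 0.72, 0.71]
print(epochs_until_stop(scores, validation_patience=3))   # prints 5
print(epochs_until_stop(scores, validation_patience=-1))  # prints 7
```

Note that with a patience of 3 the run stops before the better score at epoch 6 is reached, which is the usual trade-off between training time and the risk of stopping too early.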

- the `metadata` section contains **variables** used in other sections of the configuration file, as well as a list of files required by the `chainer` components. Here we define the backbone transformer (using its name from HuggingFace) and the path where the model is stored after training or downloading, and we provide links to the dataset and to the pretrained model (if one exists).
```
"metadata": {
    "variables": {
      "TRANSFORMER": "distilbert-base-uncased",
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "MODELS_PATH": "{ROOT_PATH}/models",
      "MODEL_PATH": "{MODELS_PATH}/classifiers/topic_distilbert_base_v0"
    },
    "download": [
      {
        "url": "http://files.deeppavlov.ai/datasets/dp_topics_downsampled_dataset_v0.tar.gz",
        "subdir": "{DOWNLOADS_PATH}"
      },
      {
        "url": "http://files.deeppavlov.ai/deeppavlov_data/classifiers/topic_distilbert_base_v0.tar.gz",
        "subdir": "{MODELS_PATH}/classifiers"
      }
    ]
  }
```
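
To see how the `{NAME}` placeholders fit together, here is a minimal sketch of variable substitution. It is not how DeepPavlov parses configs internally: `resolve_variables` is a hypothetical helper, and the sketch leaves `~` unexpanded.

```python
import re
from typing import Dict


def resolve_variables(variables: Dict[str, str]) -> Dict[str, str]:
    """Expand `{NAME}` placeholders until no known placeholder is left."""
    resolved = dict(variables)
    pattern = re.compile(r"\{(\w+)\}")
    for _ in range(len(resolved)):  # enough passes for any acyclic chain
        changed = False
        for key, value in resolved.items():
            # Unknown names are left untouched instead of raising an error.
            new_value = pattern.sub(lambda m: resolved.get(m.group(1), m.group(0)), value)
            if new_value != value:
                resolved[key] = new_value
                changed = True
        if not changed:
            break
    return resolved


variables = resolve_variables({
    "TRANSFORMER": "distilbert-base-uncased",
    "ROOT_PATH": "~/.deeppavlov",
    "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
    "MODELS_PATH": "{ROOT_PATH}/models",
    "MODEL_PATH": "{MODELS_PATH}/classifiers/topic_distilbert_base_v0",
})

# The same substitution applies to fields elsewhere in the config.
save_path = "{MODEL_PATH}/classes.dict".format(**variables)
print(save_path)
# prints ~/.deeppavlov/models/classifiers/topic_distilbert_base_v0/classes.dict
```

Defining paths through variables means that changing `ROOT_PATH` or `MODEL_PATH` in one place relocates every file the pipeline reads or writes.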

## Usage

There are two ways to work with DeepPavlov models: through the command line interface (CLI) and through Python.

## Use DeepPavlov from CLI

Learn about the available commands and their parameters by running:

In [None]:
!python -m deeppavlov -h

To get interactive predictions from the pretrained model, run:

In [None]:
!python -m deeppavlov interact sentiment_sst_conv_bert -d -i

The `-d` flag downloads the pretrained model along with the embeddings and all other files needed to run the model, as listed in the `download` section of `metadata`.

The `-i` flag installs all the packages required for the correct use of the specific model.

Run the following command to evaluate your model:

In [None]:
!python -m deeppavlov evaluate sentiment_sst_conv_bert -d -i

## Use DeepPavlov from Python

Alternatively, the same operations can be performed from Python. We will be using the same model as in the previous section.

Build the model using the same config name. The `download` and `install` arguments correspond to the `-d` and `-i` flags of the command line.

In [None]:
from deeppavlov import build_model

sentiment_classifier = build_model('sentiment_sst_conv_bert', download=True, install=True)

Get predictions from the pretrained model.

In [None]:
sentiment_classifier(['I like Italian cuisine?', 'This movie was actually neither that funny, nor super witty.'])

## Python pipelines 

Python pipelines have recently been added to the DeepPavlov library as well. Here is how you can build the same model we used previously with Python classes. Currently, this interface only works for inference.

In [None]:
from deeppavlov import Element, Model
from deeppavlov.core.commands.utils import expand_path
from deeppavlov.core.data.simple_vocab import SimpleVocabulary
from deeppavlov.download import download_resource
from deeppavlov.models.classifiers.proba2labels import Proba2Labels
from deeppavlov.utils.pip_wrapper.pip_wrapper import install_from_config

In [None]:
classifiers_path = expand_path('~/.deeppavlov/models/classifiers')
model_path = classifiers_path / 'sentiment_sst_bert_torch'
transformer_name = 'DeepPavlov/bert-base-cased-conversational'
vocab_path = model_path / 'classes.dict'

In [None]:
install_from_config('sentiment_sst_conv_bert')

download_resource(
    'http://files.deeppavlov.ai/v1/classifiers/sentiment_sst_bert/sentiment_sst_bert_torch.tar.gz',
    {classifiers_path}
)

In [None]:
from deeppavlov.models.preprocessors.torch_transformers_preprocessor import TorchTransformersPreprocessor
from deeppavlov.models.torch_bert.torch_transformers_classifier import TorchTransformersClassifierModel

In [None]:
preprocessor = TorchTransformersPreprocessor(vocab_file=transformer_name, max_seq_length=64)

classes_vocab = SimpleVocabulary(load_path=vocab_path, save_path=vocab_path)

classifier = TorchTransformersClassifierModel(
    n_classes=classes_vocab.len,
    return_probas=True,
    pretrained_bert=transformer_name,
    save_path=model_path / 'model',
    optimizer_parameters={'lr': 1e-05}
)

proba2labels = Proba2Labels(max_proba=True)

model = Model(
    x=['x'],
    out=['y_pred_labels'],
    pipe=[
        Element(component=preprocessor, x=['x'], out=['bert_features']),
        Element(component=classifier, x=['bert_features'], out=['y_pred_probas']),
        Element(component=proba2labels, x=['y_pred_probas'], out=['y_pred_ids']),
        Element(component=classes_vocab, x=['y_pred_ids'], out=['y_pred_labels'])
    ]
)

In [None]:
model(['I like watching Arrival with Amy Adams'])

## Train your custom model

To change the config parameters and train your own model, parse the configuration file and change it the way you need.   

In [None]:
!wget https://raw.githubusercontent.com/deeppavlov/DeepPavlov/master/deeppavlov/configs/classifiers/sentiment_sst_conv_bert.json

Let's change the transformer to `bert-base-uncased` and reduce the number of training epochs to 2.


NB: if you have already used the pretrained model in this session, but now you want to train the model from scratch, check that the folder with the model is empty, or change the `MODEL_PATH` config variable to save it in another directory.  

In [None]:
from deeppavlov.core.common.file import read_json

config_json = read_json('sentiment_sst_conv_bert.json')

# original backbone transformer 
print(config_json['metadata']['variables']['TRANSFORMER'])

DeepPavlov/bert-base-cased-conversational


In [None]:
config_json['metadata']['variables']['TRANSFORMER'] = 'bert-base-uncased'
config_json['train']['epochs'] = 2
config_json['metadata']['variables']['MODEL_PATH'] = 'my_custom_models/sentiment_classifier'

Parse the config and have a look at the result.

In [None]:
from deeppavlov.core.commands.utils import parse_config

model_config = parse_config(config_json)

In [None]:
model_config

Train your model using the parsed config.

In [None]:
from deeppavlov import train_model

new_sentiment_classifier = train_model(model_config)

Check your model's predictions.

In [None]:
new_sentiment_classifier(['I like Italian cuisine?', 'I like listening rock music'])

Evaluate your model.

In [None]:
from deeppavlov import evaluate_model

evaluate_model(model_config, download=False)

To train your own version of the model from CLI, make all the necessary changes in the configuration file and run: 

In [None]:
!python -m deeppavlov train path_to_config.json -i
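
One way to produce such a modified configuration file is to edit the JSON with the standard library and save it under a new name. The sketch below uses a heavily trimmed, hypothetical stand-in for a real config; the file name `my_sentiment_config.json` and the original `MODEL_PATH` value shown here are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, heavily trimmed stand-in for a downloaded configuration file.
config = {
    "train": {"epochs": 100, "batch_size": 64},
    "metadata": {
        "variables": {
            "TRANSFORMER": "DeepPavlov/bert-base-cased-conversational",
            "MODEL_PATH": "{MODELS_PATH}/classifiers/sentiment_sst_bert_torch",
        }
    },
}

# Apply the same kind of changes as in the Python workflow above.
config["train"]["epochs"] = 2
config["metadata"]["variables"]["TRANSFORMER"] = "bert-base-uncased"
config["metadata"]["variables"]["MODEL_PATH"] = "my_custom_models/sentiment_classifier"

# Save the edited config to a new file so the original stays untouched.
config_path = Path(tempfile.mkdtemp()) / "my_sentiment_config.json"
config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

# Read it back to confirm that the changes were written.
reloaded = json.loads(config_path.read_text(encoding="utf-8"))
print(config_path)
print(reloaded["train"]["epochs"], reloaded["metadata"]["variables"]["TRANSFORMER"])
```

The printed path can then be passed to the `train` command in place of `path_to_config.json`.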

# Assignment
Build and train your own BERT-based classifier for the Recognizing Textual Entailment (RTE) task of the GLUE benchmark using [this config](https://github.com/deeppavlov/DeepPavlov/blob/master/deeppavlov/configs/classifiers/glue/glue_rte_cased_bert_torch.json). Read more about the GLUE benchmark [on its website](https://gluebenchmark.com/).

Please follow the instructions and do not delete the cells' outputs.

## Part 1. Train and measure performance.

In [None]:
# Read and parse the config

rte_classifier_config = ...

In [None]:
# Train your model; this config does not include a pretrained model

from deeppavlov import train_model

rte_classifier = train_model(rte_classifier_config)

In [None]:
# Interact with the model

rte_classifier([sentence1], [sentence2])

In [None]:
# Describe the purpose of the model and its classes, and provide a few examples for different classes. Find misclassified samples.

In [None]:
# Evaluate your model

from deeppavlov import evaluate_model

evaluate_model(rte_classifier_config, download=False, install=False)

## Part 2. Improve RTE performance.

In [None]:
# Propose as many ways as possible to improve the model's performance. Implement the most promising one, change the config accordingly, retrain the model, and evaluate it.

# do not forget to change the path
rte_classifier_config['metadata']['variables']['MODEL_PATH'] = 'my_custom_models/rte_improved'
###

In [None]:
# Retrain the model

from deeppavlov import train_model

rte_classifier_improved = train_model(rte_classifier_config)

In [None]:
# Evaluate the improved model

from deeppavlov import evaluate_model

# evaluate_model expects a config (path or dict), not a trained model object
evaluate_model(rte_classifier_config, download=False, install=False)

## Part 3. Explain improvements.

In [None]:
# Evaluate [config](https://github.com/deeppavlov/DeepPavlov/blob/master/deeppavlov/configs/classifiers/glue/glue_rte_roberta_mnli.json)


In [None]:
# Describe why the performance of the new config is higher than that of the source config.


## Part 4. Leave feedback about **DeepPavlov** framework.

### 1. What did you like about the framework?

### 2. What didn't you like about the framework?

### 3. How would you improve **DeepPavlov**?