# Open-domain question answering with DeepPavlov


The architecture of the DeepPavlov ODQA skill is modular and consists of two components: a **ranker** and a **reader**. To answer a question, the **ranker** first retrieves a few relevant articles from the article collection, and then the **reader** scans them carefully to identify the answer.

The **ranker** is based on DrQA [1], proposed by Facebook Research. Specifically, the DrQA approach uses unigram-bigram hashing and TF-IDF matching to efficiently return a subset of relevant articles for a question.

The **reader** is based on R-NET [2], proposed by Microsoft Research Asia, and on its implementation by Wenxuan Zhou. R-NET is an end-to-end neural network model that aims to answer questions based on a given article. It first matches the question and the article via gated attention-based recurrent networks to obtain a question-aware article representation. Then the self-matching attention mechanism refines the representation by matching the article against itself, which effectively encodes information from the whole article. Finally, pointer networks locate the positions of answers in the article. The scheme below shows the DeepPavlov ODQA system architecture.

DeepPavlov’s ODQA system has two Wikipedia-based models. The first one is based on the English Wikipedia dump from 2018-02-11 (5,180,368 articles) and the second one is based on the Russian Wikipedia dump from 2018-04-01 (1,463,888 articles).

[1] [Chen, Danqi, et al. "Reading wikipedia to answer open-domain questions." arXiv preprint arXiv:1704.00051 (2017)](https://arxiv.org/pdf/1704.00051.pdf)

[2] [R-NET: Machine reading comprehension with self-matching networks](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf)

<img src="odqa.png">

<center>Picture 1. The DeepPavlov-based ODQA system architecture</center>
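The ranker idea described above (hash unigrams and bigrams into a fixed number of buckets, then score articles by TF-IDF overlap with the question) can be illustrated with a small standard-library sketch. This is not the DeepPavlov or DrQA implementation: the bucket count, the `crc32` hash, the smoothed IDF formula, and the function names are all assumptions chosen for brevity.

```python
import math
import re
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 20  # illustrative size of the hash space (assumption)


def ngrams(text):
    """Return the lowercase unigrams and bigrams of a text."""
    tokens = re.findall(r"\w+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def hashed_counts(text):
    """Count n-grams after hashing each one into a fixed number of buckets."""
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in ngrams(text))


def rank(question, articles, top_n=2):
    """Return the titles of the top_n articles by TF-IDF dot product."""
    doc_counts = {title: hashed_counts(text) for title, text in articles.items()}
    n_docs = len(articles)
    # Document frequency: in how many articles each bucket occurs.
    df = Counter(bucket for counts in doc_counts.values() for bucket in counts)

    def weight(count, bucket):
        # Simplified smoothed TF-IDF weight (an assumption, not DrQA's exact formula).
        return math.log(1 + count) * math.log(1 + n_docs / df[bucket])

    q_counts = hashed_counts(question)
    scores = {
        title: sum(
            weight(q_count, bucket) * weight(counts[bucket], bucket)
            for bucket, q_count in q_counts.items()
            if bucket in counts
        )
        for title, counts in doc_counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


toy_articles = {
    "tb": "Tuberculosis is an infectious disease caused by bacteria",
    "brain": "The cerebellum is a part of the brain",
    "ml": "Machine learning models learn from data",
}
print(rank("what is the cerebellum", toy_articles, top_n=1))
```

Hashing keeps the vocabulary size fixed regardless of how many distinct bigrams the collection contains, at the cost of occasional collisions between unrelated n-grams.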

# Model Requirements

The two Wikipedia-based models have substantial storage and memory requirements. The English Wikipedia model requires 35 GB of local storage, whereas the Russian one takes up about 20 GB. The Wikipedia dumps can be rebuilt by following the steps described in the [documentation](http://docs.deeppavlov.ai/en/0.1.6/components/tfidf_ranking.html#available-data-and-pretrained-models). Both models require about 24 GB of RAM. They can run on a 16 GB machine, provided that the swap size is at least 8 GB.
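Before downloading anything, it can be worth checking that the target disk has enough free space. The sketch below uses only the standard library. The helper name and the `REQUIRED_DISK_GB` table are illustrative assumptions built from the figures quoted above, and the check covers storage only, not RAM or swap.

```python
import shutil

# Storage figures quoted above, in gigabytes.
REQUIRED_DISK_GB = {"en": 35, "ru": 20}


def has_enough_disk(path, required_gb):
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return free_gb >= required_gb


# Check the current directory against the English model's requirement.
print(has_enough_disk(".", REQUIRED_DISK_GB["en"]))
```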
 
Before anything else, install DeepPavlov and the model's requirements.

In [None]:
!pip install -q deeppavlov
!python -m deeppavlov install en_odqa_infer_wiki

# Model Description

The architecture of the ODQA skill is modular and consists of two components, a **ranker** and a **reader**. To answer a question, the **ranker** first retrieves the **top_n** most relevant articles from the document collection, and then the **reader** scans them carefully to identify the answer. A detailed description of the ODQA models can be found in the [DeepPavlov documentation](http://docs.deeppavlov.ai/en/0.1.6/skills/odqa.html).
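If you want to change how many articles the ranker hands to the reader, one option is to edit the configuration dictionary before building the model. The helper below is a minimal sketch that assumes **top_n** is stored as a key of one of the components in the `chainer` → `pipe` list. The toy configuration is hypothetical and heavily trimmed, so inspect your real configuration file to confirm where the parameter lives.

```python
def set_top_n(config, top_n):
    """Set `top_n` on every pipeline component that already defines it.

    Returns the number of components that were updated.
    """
    updated = 0
    for component in config.get("chainer", {}).get("pipe", []):
        if isinstance(component, dict) and "top_n" in component:
            component["top_n"] = top_n
            updated += 1
    return updated


# Hypothetical, heavily trimmed config used only to exercise the helper.
toy_config = {
    "chainer": {
        "pipe": [
            {"class_name": "hashing_tfidf_vectorizer"},
            {"class_name": "tfidf_ranker", "top_n": 25},
        ]
    }
}
set_top_n(toy_config, 5)
print(toy_config["chainer"]["pipe"][1])
```

A smaller **top_n** gives the reader less text to scan, while a larger value raises the chance that the article containing the answer is among the candidates.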

# Interacting with the model

**As mentioned above, the Wikipedia-based models have significant storage and RAM requirements, so it is impossible to interact with them on Colab. You can, however, run them locally once the requirements are satisfied. Alternatively, you can check out our [demo](http://demo.ipavlov.ai/).**

Make sure that you can navigate the configuration files by using autocomplete (Tab key) on the **configs** module.

# Training the model

You can train a model by running the framework with the **train** parameter; the model is then trained on the document collection defined in the **dataset_reader** section of the configuration file. The **dataset_reader** section of the ranker’s configuration defines the source of the articles. The source can have one of the following **dataset_format** values:

* *wiki* - a Wikipedia dump
* *txt* - a directory of text files, one document per file
* *json* - JSON files, each formatted as a list of dicts that contain the 'title' and 'doc' keys
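To make the *json* format concrete, the sketch below converts a directory of text files into a single JSON file shaped as a list of dicts. The 'title' and 'doc' key names are taken from the description above, while the helper name and the choice of the file name as the title are assumptions made for illustration.

```python
import json
from pathlib import Path


def txt_dir_to_json(txt_dir, json_path):
    """Collect every *.txt file in `txt_dir` into one JSON list of dicts."""
    docs = [
        {"title": path.name, "doc": path.read_text(encoding="utf-8")}
        for path in sorted(Path(txt_dir).glob("*.txt"))
    ]
    Path(json_path).write_text(json.dumps(docs, ensure_ascii=False), encoding="utf-8")
    return len(docs)
```

Calling `txt_dir_to_json("my_articles", "my_articles.json")` (both paths are placeholders) returns the number of documents written.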

As a training corpus, we will use the PLoS sentence corpus. It consists of 300 computational biology articles, each of them stored in a separate *txt* file. For simplicity, we will use the same configuration files that are used for the Wikipedia-based ODQA system; however, we strongly encourage you to create custom configuration files for your own models.

In [3]:
!wget -q http://archive.ics.uci.edu/ml/machine-learning-databases/00311/SentenceCorpus.zip
!unzip SentenceCorpus.zip



In [6]:
pwd

'/Users/cubreto/Downloads/PSB/qa_query'

In order to fit a model on new data, first, change the **data_path** parameter of the **dataset_reader** section. Then change the **dataset_format** to *txt*. Finally, train the model.

In [8]:
from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

# Start from the Wikipedia ranker config and point it at the local corpus
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = "/Users/cubreto/Downloads/PSB/qa_query/SentenceCorpus/unlabeled_articles/plos_unlabeled"
model_config["dataset_reader"]["dataset_format"] = "txt"
# Build the document database and fit the TF-IDF ranker on it
doc_retrieval = train_model(model_config)

2021-01-23 12:21:22.432 INFO in 'deeppavlov.dataset_readers.odqa_reader'['odqa_reader'] at line 57: Reading files...
2021-01-23 12:21:22.433 INFO in 'deeppavlov.dataset_readers.odqa_reader'['odqa_reader'] at line 134: Building the database...
100%|██████████| 300/300 [00:00<00:00, 7888.33it/s]
2021-01-23 12:21:22.553 INFO in 'deeppavlov.dataset_iterators.sqlite_iterator'['sqlite_iterator'] at line 57: Connecting to database, path: /Users/cubreto/.deeppavlov/downloads/odqa/enwiki.db
2021-01-23 12:21:22.554 INFO in 'deeppavlov.dataset_iterators.sqlite_iterator'['sqlite_iterator'] at line 112: SQLite iterator: The size of the database is 300 documents
[nltk_data] Downloading package punkt to /Users/cubreto/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cubreto/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downl

Examine the ranker output.

In [9]:
doc_retrieval(['cerebellum'])

[['499.txt',
  '563.txt',
  '566.txt',
  '585.txt',
  '58.txt',
  '50.txt',
  '426.txt',
  '494.txt',
  '490.txt',
  '485.txt',
  '484.txt',
  '583.txt',
  '478.txt',
  '466.txt',
  '46.txt',
  '453.txt',
  '445.txt',
  '438.txt',
  '437.txt',
  '436.txt',
  '430.txt',
  '429.txt',
  '505.txt',
  '470.txt',
  '59.txt']]
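The ranker returns one list of document identifiers per question. To see what those documents actually contain, a small helper can map each identifier back to its file and show the first few characters. This sketch assumes that the identifiers are file names inside the directory passed as **data_path**; the helper name and the preview length are illustrative.

```python
from pathlib import Path


def preview_ranked(doc_ids, corpus_dir, n_chars=80):
    """Map ranked document ids to the first characters of each file."""
    previews = []
    for doc_id in doc_ids:
        path = Path(corpus_dir) / doc_id
        # Missing files get an empty preview instead of raising an error.
        text = path.read_text(encoding="utf-8", errors="replace") if path.is_file() else ""
        # Collapse whitespace so each preview fits on one line.
        previews.append((doc_id, " ".join(text.split())[:n_chars]))
    return previews
```

With the trained ranker, this could be called as `preview_ranked(doc_retrieval(['cerebellum'])[0], model_config["dataset_reader"]["data_path"])`.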

Everything is now ready to run the ODQA component. Make sure that **download = False**; otherwise the pretrained Wikipedia files will be downloaded and will overwrite your model.

In [10]:
from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model

# Download the pretrained SQuAD reader
squad = build_model(configs.squad.multi_squad_noans_infer, download=True)
# Do not download the ODQA files: the ranker has just been trained locally
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=False)
answers = odqa(["what is tuberculosis?", "how should I take antibiotics?"])

2021-01-23 12:23:29.475 INFO in 'deeppavlov.core.data.utils'['utils'] at line 64: Downloading from http://files.deeppavlov.ai/deeppavlov_data/multi_squad_model_noans_1.1.tar.gz to /Users/cubreto/.deeppavlov/multi_squad_model_noans_1.1.tar.gz
100%|██████████| 265M/265M [00:43<00:00, 6.07MB/s] 
2021-01-23 12:24:13.73 INFO in 'deeppavlov.core.data.utils'['utils'] at line 216: Extracting /Users/cubreto/.deeppavlov/multi_squad_model_noans_1.1.tar.gz archive into /Users/cubreto/.deeppavlov/models
2021-01-23 12:24:17.373 INFO in 'deeppavlov.models.preprocessors.squad_preprocessor'['squad_preprocessor'] at line 310: SquadVocabEmbedder: loading saved tokens vocab from /Users/cubreto/.deeppavlov/models/multi_squad_model_noans/emb/vocab_embedder.pckl
2021-01-23 12:24:17.698 INFO in 'deeppavlov.models.preprocessors.squad_preprocessor'['squad_preprocessor'] at line 310: SquadVocabEmbedder: loading saved chars vocab from /Users/cubreto/.deeppavlov/models/multi_squad_model_noans/emb/char_vocab_embedd
2021-01-23 12:24:43.679 INFO in 'deeppavlov.core.models.tf_model'['tf_model'] at line 52: [loading model from /Users/cubreto/.deeppavlov/models/multi_squad_model_noans/model]



INFO:tensorflow:Restoring parameters from /Users/cubreto/.deeppavlov/models/multi_squad_model_noans/model


2021-01-23 12:24:45.171 INFO in 'deeppavlov.models.vectorizers.hashing_tfidf_vectorizer'['hashing_tfidf_vectorizer'] at line 264: Loading tfidf matrix from /Users/cubreto/.deeppavlov/models/odqa/enwiki_tfidf_matrix.npz
2021-01-23 12:24:46.52 INFO in 'deeppavlov.dataset_iterators.sqlite_iterator'['sqlite_iterator'] at line 57: Connecting to database, path: /Users/cubreto/.deeppavlov/downloads/odqa/enwiki.db
2021-01-23 12:24:46.54 INFO in 'deeppavlov.dataset_iterators.sqlite_iterator'['sqlite_iterator'] at line 112: SQLite iterator: The size of the database is 300 documents
2021-01-23 12:24:46.59 INFO in 'deeppavlov.models.preprocessors.squad_preprocessor'['squad_preprocessor'] at line 310: SquadVocabEmbedder: loading saved tokens vocab from /Users/cubreto/.deeppavlov/models/multi_squad_model_noans/emb/vocab_embedder.pckl
2021-01-23 12:24:46.889 INFO in 'deeppavlov.models.preprocessors.squad_preprocessor'['squad_preprocessor'] at line 310: SquadVocabEmbedder: loading saved chars vocab fr
2021-01-23 12:25:06.350 INFO in 'deeppavlov.core.models.tf_model'['tf_model'] at line 52: [loading model from /Users/cubreto/.deeppavlov/models/multi_squad_model_noans/model]


INFO:tensorflow:Restoring parameters from /Users/cubreto/.deeppavlov/models/multi_squad_model_noans/model




In [11]:
answers = odqa(["what is tuberculosis?", "how should I take antibiotics?"])

In [12]:
answers

['a disease for which a new drug is desperately needed', '']
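The second answer in the output above is an empty string. Presumably this means the reader found no answer span in the retrieved articles, although that reading is an assumption based on the `noans` part of the reader's name. When displaying results, it can help to pair each question with its answer and mark the empty ones explicitly. The helper below is an illustrative sketch, not part of DeepPavlov.

```python
def pair_answers(questions, answers, placeholder="<no answer>"):
    """Pair each question with its answer, marking empty answers."""
    return [(q, a if a.strip() else placeholder) for q, a in zip(questions, answers)]


questions = ["what is tuberculosis?", "how should I take antibiotics?"]
# The answers are copied from the output shown above.
model_answers = ["a disease for which a new drug is desperately needed", ""]
for question, answer in pair_answers(questions, model_answers):
    print(f"{question} -> {answer}")
```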

# Useful links

[DeepPavlov repository](https://github.com/deepmipt/DeepPavlov)

[DeepPavlov demo page](https://demo.ipavlov.ai)

[DeepPavlov documentation](https://docs.deeppavlov.ai)