# Open-domain question answering with DeepPavlov


This notebook shows how to train and interact with the Open Domain Question Answering (ODQA) component of the DeepPavlov framework. The DeepPavlov ODQA system has a modular architecture where the ranker is based on the **DrQA** [1] approach proposed by Facebook Research and the reader is based on **R-NET** [2] proposed by Microsoft Research Asia and its implementation by Wenxuan Zhou. The DeepPavlov framework contains pretrained models to extract answers from Wikipedia for Russian and English.

[1]
[2]

# Model Requirements

Both Wikipedia-based models require about 24 GB of RAM. They can run on a 16 GB machine, provided the swap size is at least 8 GB. You can also run both models on Google Colab after enabling a hardware accelerator under Edit->Notebook settings.

 
First, install DeepPavlov and all of the model's requirements.

In [None]:
!pip install -q deeppavlov
!python -m deeppavlov install en_odqa_infer_wiki

# Model Description

The architecture of the ODQA skill is modular and consists of two components, a **ranker** and a **reader**. To answer a question, the **ranker** first retrieves the **top_n** most relevant articles from the document collection, and then the **reader** scans them carefully to identify the answer. A detailed description of the ODQA models can be found in the [DeepPavlov documentation](http://docs.deeppavlov.ai/en/master/skills/odqa.html).
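
The two-stage idea can be illustrated with a toy example. The sketch below is not DeepPavlov's implementation: the `rank` and `read` functions, the scoring formula, and the three sample documents are all assumptions made for illustration. It replaces the TF-IDF ranker with a few lines of term weighting and replaces the neural reader with simple word overlap.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (tok.strip(".,?!;:").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def rank(question, documents, top_n=2):
    """Toy ranker: return the top_n documents by TF-IDF overlap with the question."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    # Number of documents in which each term occurs
    doc_freq = Counter(term for tokens in doc_tokens for term in set(tokens))
    q_terms = set(tokenize(question))
    scored = []
    for idx, tokens in enumerate(doc_tokens):
        term_freq = Counter(tokens)
        score = sum(
            term_freq[t] * math.log((1 + n_docs) / (1 + doc_freq[t]))
            for t in q_terms
            if t in term_freq
        )
        scored.append((score, idx))
    # Highest score first; ties keep the original document order
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_n]
    return [documents[idx] for _, idx in best]


def read(question, documents):
    """Toy reader: return the sentence sharing the most words with the question."""
    q_terms = set(tokenize(question))
    sentences = [s.strip() for doc in documents for s in doc.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_terms & set(tokenize(s))))


# Hypothetical mini-collection standing in for Wikipedia
docs = [
    "Paris is the capital of France. It lies on the Seine.",
    "Berlin is the capital of Germany. It lies on the Spree.",
    "The cerebellum coordinates movement.",
]
question = "What is the capital of France?"
candidates = rank(question, docs, top_n=2)  # stage 1: narrow down the collection
answer = read(question, candidates)         # stage 2: look for the answer in the candidates
print(answer)
```

Splitting the work this way keeps the expensive step (reading) limited to a handful of candidate documents, which is why **top_n** trades speed against recall.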


In [None]:
%load https://raw.githubusercontent.com/deepmipt/DeepPavlov/master/deeppavlov/configs/odqa/en_odqa_infer_wiki.json

# Interacting with the model

The DeepPavlov ODQA system has two Wikipedia-based models. The English Wikipedia model (enwiki.db) requires ~9.5 GB of local storage, whereas the Russian version takes ~2.7 GB. The Wikipedia dumps can be rebuilt by following the steps described in the [documentation](http://docs.deeppavlov.ai/en/master/components/tfidf_ranking.html#available-data-and-pretrained-models).
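
Before downloading, it may be worth checking that enough disk space is available. The following is a minimal sketch using only the standard library; the helper name `has_free_space` is hypothetical, and `"."` is a placeholder for whichever directory DeepPavlov downloads to on your machine.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


# Sizes are the approximate figures quoted above; "." is a placeholder path.
for name, size_gb in [("English Wikipedia", 9.5), ("Russian Wikipedia", 2.7)]:
    status = "enough" if has_free_space(".", size_gb) else "NOT enough"
    print(f"{name} model (~{size_gb} GB): {status} free space")
```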

You can browse the available configuration files by using autocomplete (Tab key) on the **configs** module.

In [None]:
from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model

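# Build the ODQA pipeline; download=True fetches the pretrained model and its data
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=True)
# The model takes a batch (list) of questions
a = odqa(["Who destroyed the Death Star?"])
a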

# Training the model

You can train a model by running the framework with the **train** parameter; the model is then trained on the document collection defined in the **dataset_reader** section of the configuration file. DeepPavlov ODQA supports three types of document sources:

* *wiki* - a Wikipedia dump
* *txt* - each document in a separate txt file
* *json* - JSON files formatted as a list of dicts, each containing the 'title' and 'doc' keys
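
As an illustration of the *json* layout, the sketch below writes a small file in that shape and reads it back. The 'title' and 'doc' keys follow the description above and should be checked against the documentation of your installed DeepPavlov version. The sample documents, the temporary directory, and the file name `corpus.json` are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical toy documents; the 'title' and 'doc' keys follow the description above.
documents = [
    {"title": "Cerebellum", "doc": "The cerebellum coordinates voluntary movement."},
    {"title": "Tuberculosis", "doc": "Tuberculosis is an infectious disease caused by bacteria."},
]

corpus_dir = tempfile.mkdtemp()  # stand-in for your own data directory
corpus_path = os.path.join(corpus_dir, "corpus.json")  # hypothetical file name

with open(corpus_path, "w", encoding="utf-8") as f:
    json.dump(documents, f, ensure_ascii=False, indent=2)

# Read the file back to confirm it is a list of dicts with the expected keys
with open(corpus_path, encoding="utf-8") as f:
    loaded = json.load(f)

print(f"Wrote {len(loaded)} documents to {corpus_path}")
```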

Let's train the model on the PLoS sentence corpus [3]. This corpus consists of 300 computational biology articles, each stored in a separate *txt* file.

[3]: A. Chambers. Statistical models for text classification: Applications and analysis, Ph.D., University of California, Irvine (2013). ProQuest Dissertations and Theses.

In [None]:
!wget -q http://archive.ics.uci.edu/ml/machine-learning-databases/00311/SentenceCorpus.zip
!unzip SentenceCorpus.zip

For simplicity, we will use the same configuration files that are used for the Wikipedia-based ODQA system; however, we strongly encourage you to create custom configuration files for your own models. In order to fit a model on new data, first, change the **data_path** parameter of the **dataset_reader** section. Then change the **dataset_format** to *txt*. In addition, you can alter the number of **top_n** documents to retrieve. Finally, train the model.

In [None]:
from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = "/content/SentenceCorpus/unlabeled_articles/plos_unlabeled"
model_config["dataset_reader"]["dataset_format"] = "txt"
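# "chainer" is a dict; its components live in the "pipe" list.
# Assumption: the ranker (which takes top_n) is the last component in the pipe.
model_config["chainer"]["pipe"][-1]["top_n"] = 30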
doc_retrieval = train_model(model_config)
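
Since custom configuration files are encouraged for your own models, one possible approach is to derive a new configuration from a base one and save it separately. The sketch below uses a small stand-in dictionary rather than a real DeepPavlov configuration; the helper `make_custom_config` and the file name `my_ranker_config.json` are hypothetical.

```python
import copy
import json
import os
import tempfile

# Stand-in for a loaded configuration; only the keys used in this notebook are shown.
base_config = {
    "dataset_reader": {"data_path": "path/to/wiki", "dataset_format": "wiki"},
}


def make_custom_config(base, data_path, dataset_format):
    """Return a modified deep copy so that the base configuration stays untouched."""
    custom = copy.deepcopy(base)
    custom["dataset_reader"]["data_path"] = data_path
    custom["dataset_reader"]["dataset_format"] = dataset_format
    return custom


custom_config = make_custom_config(
    base_config, "SentenceCorpus/unlabeled_articles/plos_unlabeled", "txt"
)

# Hypothetical output location for the custom configuration file
config_path = os.path.join(tempfile.mkdtemp(), "my_ranker_config.json")
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(custom_config, f, indent=2)

print(f"Saved custom configuration to {config_path}")
```

Working on a deep copy avoids accidentally changing the nested sections of the base configuration, which a shallow copy would share.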

Examine the ranker output.

In [None]:
doc_retrieval(['cerebellum'])

Everything is now ready to run the ODQA component. Make sure that **download=False**; otherwise the pretrained Wikipedia data will be downloaded and will overwrite your model.

In [None]:
from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model

# download=False keeps the ranker trained above instead of fetching the Wikipedia one
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=False)
a = odqa(["what is tuberculosis ?"])
a

# About Us

We are iPavlov. Our story started in 2017, when we decided to build a conversational AI framework that would contain all the NLP components required to build chatbots and still be easy to use. This work resulted in the release of the DeepPavlov library. Our lab at MIPT is honored with the Facebook AI Academic Partnership and NVIDIA GPU Research Center status. We combine research and extreme coding in our week-long DeepHack.me hackathons: DeepHack.Game, DeepHack.Q&A and DeepHack.RL. We serve the global AI community by organizing the NIPS Conversational Challenge to evaluate state-of-the-art techniques in dialog systems and to collect open-source dialog datasets.

# Useful links

[DeepPavlov repository](https://github.com/deepmipt/DeepPavlov)

[DeepPavlov demo page](https://demo.ipavlov.ai)

[DeepPavlov documentation](https://docs.deeppavlov.ai)