# Open-domain question answering with DeepPavlov


The architecture of the DeepPavlov ODQA skill is modular and consists of two components: a **ranker** and a **reader**. To answer a question, the **ranker** first retrieves a few relevant articles from the article collection, and then the **reader** scans them carefully to identify the answer.

The **ranker** is based on DrQA [1], proposed by Facebook Research. Specifically, the DrQA approach uses unigram-bigram hashing and TF-IDF matching to efficiently return a subset of relevant articles for a question.

The **reader** is based on R-NET [2], proposed by Microsoft Research Asia, and on its implementation by Wenxuan Zhou. R-NET is an end-to-end neural network model that aims to answer questions based on a given article. It first matches the question and the article via gated attention-based recurrent networks to obtain a question-aware article representation. A self-matching attention mechanism then refines the representation by matching the article against itself, which effectively encodes information from the whole article. Finally, pointer networks locate the positions of the answer in the article. The scheme below shows the DeepPavlov ODQA system architecture.

DeepPavlov’s ODQA system has two Wikipedia-based models. The first one is based on the English Wikipedia dump from 2018-02-11 (5,180,368 articles) and the second one is based on the Russian Wikipedia dump from 2018-04-01 (1,463,888 articles).

[1] [Chen, Danqi, et al. "Reading wikipedia to answer open-domain questions." arXiv preprint arXiv:1704.00051 (2017)](https://arxiv.org/pdf/1704.00051.pdf)

[2] [R-NET: Machine reading comprehension with self-matching networks](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf)

<img src="https://github.com/deepmipt/dp_notebooks/blob/master/odqa.png?raw=1">

<center>Picture 1. The DeepPavlov-based ODQA system architecture</center>
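To make the ranker idea more concrete, here is a minimal sketch of TF-IDF ranking over hashed unigrams and bigrams. It is not the DrQA or DeepPavlov implementation: the bucket count `NUM_BUCKETS`, the use of `zlib.crc32` as the hash, the smoothed IDF formula, and all function names are assumptions chosen for illustration.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 16  # assumption: illustrative size, real systems use far more


def hashed_ngrams(text):
    """Return hashed bucket ids for the unigrams and bigrams of a text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # zlib.crc32 is a deterministic stand-in for the hash used in practice.
    return [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in tokens + bigrams]


def build_index(documents):
    """Compute per-document bucket counts and a smoothed IDF per bucket."""
    doc_counts = [Counter(hashed_ngrams(doc)) for doc in documents]
    doc_freq = Counter(bucket for counts in doc_counts for bucket in counts)
    n_docs = len(documents)
    idf = {b: math.log(1 + n_docs / df) for b, df in doc_freq.items()}
    return doc_counts, idf


def rank(question, doc_counts, idf, top_n=2):
    """Return the indices of the top_n documents by TF-IDF dot product."""
    q_counts = Counter(hashed_ngrams(question))
    scored = []
    for idx, counts in enumerate(doc_counts):
        score = sum(
            (math.log1p(q_tf) * idf[b]) * (math.log1p(counts[b]) * idf[b])
            for b, q_tf in q_counts.items()
            if b in counts
        )
        scored.append((score, idx))
    # Highest score first; ties are broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_n]]


documents = [
    "Guinea pigs originated in the Andes of South America",
    "Bastille Day is celebrated on 14 July in France",
    "The Lynmouth flood happened in August 1952",
]
doc_counts, idf = build_index(documents)
best = rank("Where did guinea pigs originate", doc_counts, idf, top_n=1)
print(documents[best[0]])
```

Hashing n-grams into a fixed number of buckets keeps the index size bounded regardless of vocabulary size, at the cost of occasional collisions between unrelated n-grams.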

# Model Requirements

The DeepPavlov ODQA system has two Wikipedia-based models. The English Wikipedia model requires 35 GB of local storage, whereas the Russian one takes up about 20 GB. The Wikipedia dumps can be rebuilt by following the steps described in the [documentation](http://docs.deeppavlov.ai/en/0.1.6/components/tfidf_ranking.html#available-data-and-pretrained-models). Both models require about 24 GB of RAM. They can run on a machine with 16 GB of RAM, provided that the swap size is at least 8 GB.
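The figures above can be turned into a quick sanity check before downloading anything. This is only a sketch: the helper names `meets_requirements` and `free_disk_gb` are hypothetical, the thresholds simply restate the numbers quoted in this section, and the RAM and swap values have to be supplied by you.

```python
import shutil

# Figures quoted in this section, in GB.
STORAGE_GB = {"en": 35, "ru": 20}
RAM_GB = 24
MIN_RAM_WITH_SWAP_GB = 16
MIN_SWAP_GB = 8


def meets_requirements(lang, disk_gb, ram_gb, swap_gb=0):
    """Return True if the given resources satisfy the stated requirements."""
    if disk_gb < STORAGE_GB[lang]:
        return False
    if ram_gb >= RAM_GB:
        return True
    # Less RAM is acceptable only with enough swap.
    return ram_gb >= MIN_RAM_WITH_SWAP_GB and swap_gb >= MIN_SWAP_GB


def free_disk_gb(path="."):
    """Free space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024 ** 3


print(meets_requirements("ru", free_disk_gb(), ram_gb=16, swap_gb=8))
```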
 
But first, install DeepPavlov and all the model's requirements.

In [0]:
!pip install -q deeppavlov
!python -m deeppavlov install ru_odqa_infer_wiki


# Model Description

The architecture of the ODQA skill is modular and consists of two components, a **ranker** and a **reader**. To answer a question, the **ranker** first retrieves the **top_n** most relevant articles from the document collection, and then the **reader** scans them carefully to identify the answer. A detailed description of the ODQA models can be found in the [DeepPavlov documentation](http://docs.deeppavlov.ai/en/0.1.6/skills/odqa.html).
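The two-stage flow can be sketched in a few lines of plain Python. The components below are toy stand-ins, not the DeepPavlov ones: `toy_ranker` scores articles by word overlap instead of TF-IDF, and `toy_reader` returns the best-matching sentence instead of a span predicted by a neural network. All names are hypothetical.

```python
def tokenize(text):
    """Lowercase a text and split it into a set of words."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def toy_ranker(question, articles, top_n):
    """Stage 1: keep the top_n articles that share the most words with the question."""
    q = tokenize(question)
    ranked = sorted(articles, key=lambda a: len(q & tokenize(a)), reverse=True)
    return ranked[:top_n]


def toy_reader(question, article):
    """Stage 2: pick the sentence with the largest word overlap and its score."""
    q = tokenize(question)
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    best = max(sentences, key=lambda s: len(q & tokenize(s)))
    return best, len(q & tokenize(best))


def answer(question, articles, top_n=2):
    """Run the ranker, read every retrieved article, return the best candidate."""
    candidates = [toy_reader(question, a) for a in toy_ranker(question, articles, top_n)]
    best_text, _ = max(candidates, key=lambda c: c[1])
    return best_text


articles = [
    "The guinea pig is a rodent. Guinea pigs originate from the Andes.",
    "Bastille Day is the national day of France. It falls on 14 July.",
    "Lynmouth is a village in Devon. The Lynmouth flood happened in August 1952.",
]
print(answer("Where did guinea pigs originate?", articles))
```

The point of the split is that the cheap ranker touches the whole collection, while the expensive reader only sees the `top_n` articles that survive the first stage.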

In [0]:
%load https://github.com/deepmipt/DeepPavlov/blob/0.1.6/deeppavlov/configs/odqa/ru_odqa_infer_wiki.json

# Interacting with the model

**As mentioned above, the Wikipedia-based models have significant storage and RAM requirements, so it is impossible to interact with them on Colab. You can, however, do so locally, provided that the requirements are satisfied. Alternatively, you can check out our [demo](http://demo.ipavlov.ai/).**

Note that you can navigate the configuration files by using autocomplete (the Tab key) on the **configs** module.

from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model

odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=True)
answers = odqa([
    "Where did guinea pigs originate?",
    "When did the Lynmouth floods happen?",
    "When is the Bastille Day?",
])

# Training the model

You can train a model by running the framework with the **train** parameter; the model will then be trained on the document collection defined in the **dataset_reader** section of the configuration file. The **dataset_reader** section of the ranker's configuration defines the source of the articles. The source can have one of the following **dataset_format** values:

* *wiki* - the Wikipedia dump
* *txt* - a path to text files, with each document stored in a separate txt file
* *json* - JSON files, which should be formatted as a list of dicts that contain the 'title' and 'doc' keys
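As an illustration of the *json* format, the sketch below converts a directory of txt files into a single JSON file that holds a list of dicts with 'title' and 'doc' keys, as described in the list above. The helper name `txt_dir_to_json` and the choice of the file stem as the title are assumptions; check the key names against the documentation of the DeepPavlov version you have installed.

```python
import json
import tempfile
from pathlib import Path


def txt_dir_to_json(txt_dir, json_path):
    """Collect every .txt file in txt_dir into one JSON list of dicts."""
    records = [
        # Assumption: the file name without its extension serves as the title.
        {"title": path.stem, "doc": path.read_text(encoding="utf-8")}
        for path in sorted(Path(txt_dir).glob("*.txt"))
    ]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=4)
    return records


# Demo on a temporary directory with two small documents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("First document.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Second document.", encoding="utf-8")
    out_path = Path(tmp, "corpus.json")
    txt_dir_to_json(tmp, out_path)
    loaded = json.loads(out_path.read_text(encoding="utf-8"))

print(loaded)
```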

As a training corpus, we will use the PLoS sentence corpus. It consists of 300 computational biology articles, each stored in a separate *txt* file. For simplicity, we will use the same configuration file that is used for the Wikipedia-based ODQA system; however, we strongly encourage you to create custom configuration files for your own models.

In [0]:
# !wget -q http://archive.ics.uci.edu/ml/machine-learning-databases/00311/SentenceCorpus.zip
# !unzip SentenceCorpus.zip

In order to fit a model on new data, first, change the **data_path** parameter of the **dataset_reader** section. Then change the **dataset_format** to *txt*. Finally, train the model.

In [0]:
# from deeppavlov import configs, train_model
# from deeppavlov.core.common.file import read_json

# model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
# model_config["dataset_reader"]["data_path"] = "/content/SentenceCorpus/unlabeled_articles/plos_unlabeled"
# model_config["dataset_reader"]["dataset_format"] = "txt"
# doc_retrieval = train_model(model_config)

Examine the ranker output.

In [0]:
# doc_retrieval(['cerebellum'])

Everything is now ready to run the ODQA component. Make sure that **download = False**; otherwise the downloaded pretrained Wikipedia data will overwrite your model.

In [0]:
from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model

# # Download all the SQuAD models
# squad = build_model(configs.squad.multi_squad_noans_infer, download = True)
# download=True fetches the pretrained models; set download=False
# to keep a ranker that you have just trained yourself
odqa = build_model(configs.odqa.ru_odqa_infer_wiki, download=True)
# The questions mean "What is love?" and "How to live?"
answers = odqa(["Что такое любовь?", "Как жить?"])

2020-04-26 09:05:56.508 INFO in 'deeppavlov.core.data.utils'['utils'] at line 80: Downloading from http://files.deeppavlov.ai/embeddings/ft_native_300_ru_wiki_lenta_nltk_word_tokenize/ft_native_300_ru_wiki_lenta_nltk_word_tokenize.vec to /root/.deeppavlov/downloads/embeddings/ft_native_300_ru_wiki_lenta_nltk_word_tokenize.vec
100%|██████████| 4.53G/4.53G [15:58<00:00, 4.73MB/s]
2020-04-26 09:21:55.589 INFO in 'deeppavlov.core.data.utils'['utils'] at line 80: Downloading from http://files.deeppavlov.ai/embeddings/ft_native_300_ru_wiki_lenta_nltk_word_tokenize-char.vec to /root/.deeppavlov/downloads/embeddings/ft_native_300_ru_wiki_lenta_nltk_word_tokenize-char.vec
100%|██████████| 4.52M/4.52M [00:01<00:00, 3.06MB/s]
2020-04-26 09:21:57.765 INFO in 'deeppavlov.core.data.utils'['utils'] at line 80: Downloading from http://files.deeppavlov.ai/deeppavlov_data/ru_odqa.tar.gz to /root/.deeppavlov/ru_odqa.tar.gz
100%|██████████| 1.19G/1.19G [03:16<00:00, 6.06MB/s]
2020-04-26 09:40:09.210 INFO in 'deeppavlov.core.models.tf_model'['tf_model'] at line 51: [loading model from /root/.deeppavlov/models/squad_model_ru/model]



INFO:tensorflow:Restoring parameters from /root/.deeppavlov/models/squad_model_ru/model




In [0]:
answers

['спокойствие, новый художественный идеал', 'в финал Кубка Англии']

In [0]:
# answers = odqa(["What is Tuberculosis?"])



In [0]:
# mount a folder

from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Mounted at /gdrive
/gdrive


In [0]:
%cd 'My Drive'
%cd Maga
%cd Diploma
%cd data

/gdrive/My Drive
/gdrive/My Drive/Maga
/gdrive/My Drive/Maga/Diploma
/gdrive/My Drive/Maga/Diploma/data


In [0]:
%cd Answers/

/gdrive/My Drive/Maga/Diploma/data/Answers


In [0]:
import os

# os.chdir("Processed_ege_report")

direc = os.getcwd()  # current working directory
ext = '.json'  # extension of the files to collect

# Keep only the files with the chosen extension
txt_files = [i for i in os.listdir(direc) if os.path.splitext(i)[1] == ext]

In [0]:
# Collect the records from all the small files into one list
import json

big_list = []

# Iterate over the JSON files
for f in txt_files:
    # Each file holds a list of records; append them all to big_list
    with open(os.path.join(direc, f), 'r', encoding='utf-8') as file_object:
        big_list.extend(json.load(file_object))

In [0]:
big_list[1]

In [0]:
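import json

# Load the list of question records used in the cells below
with open("open_9368.json", "r", encoding="utf-8") as f:
    dl = json.load(f)

In [0]:
# Find the index of a particular question
# ("In what year was Dmitry Fyodorovich Ustinov born?")
for i, di in enumerate(dl):
    if di["open_q"] == "В каком году родился Дмитрий Фёдорович Устинов?":
        print(i)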

6241


In [0]:
dl[0]

{'Q': 'Q142917',
 'answer_open': 'Евгений Викторович Вучетич',
 'atr': 'Q557355',
 'ent': 'Родина-мать (Киев)',
 'open_q': 'Кто создал скульптуру «Родина-мать»?',
 'rel': 'P84',
 'rel_r': 'архитектор (P84)',
 'type': 'author',
 'what': 'колоссальная статуя'}

In [0]:
questions = [d['open_q'] for d in dl]

In [0]:
questions[10:13]

['Кто был режиссёром фильма «Ирония судьбы, или С лёгким паром!»?',
 'Кто снял фильм «Ирония судьбы, или С лёгким паром!»?',
 'В каком году было Цусимское сражение?']

In [0]:
answers = odqa(questions[10:11])



In [0]:
answers

['Раиса Александровна Лукина']

In [0]:
def batch(iterable, n=1):
    """Yield successive slices of at most n items from a sequence."""
    length = len(iterable)
    for ndx in range(0, length, n):
        yield iterable[ndx:min(ndx + n, length)]


In [0]:
import json
import os
from datetime import datetime

from tqdm import tqdm_notebook

# Directory that receives one JSON file per batch
dir_name = 'Answers'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
    print("Directory", dir_name, "created")
else:
    print("Directory", dir_name, "already exists")

# Answer the questions in batches of 10, starting from entry 6248,
# and save each batch as soon as it is finished
for chunk in tqdm_notebook(batch(dl[6248:], 10), desc='a loop'):
    pack = []
    for d in chunk:
        print(d)
        d['answer_dp'] = odqa([d['open_q']])
        pack.append(d)

    # Timestamped file name with millisecond precision
    name = datetime.utcnow().strftime('%Y-%m-%d %H_%M_%S.%f')[:-3]
    filename = os.path.join(dir_name, "%s.json" % name)
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(pack, f, ensure_ascii=False, indent=4)
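The hard-coded start index in `dl[6248:]` suggests that the loop was resumed by hand after an interruption. A possible way to compute the resume point is to count the records already saved in the batch files. The helper `count_saved_records` below is hypothetical, and the approach assumes that batches were written in order from the beginning of the list, with no gaps.

```python
import json
import os
import tempfile


def count_saved_records(directory):
    """Count the records stored across all JSON batch files in a directory."""
    total = 0
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            total += len(json.load(f))
    return total


# Demo on a temporary directory with two fake batch files.
with tempfile.TemporaryDirectory() as tmp:
    for name, size in [("batch_a.json", 10), ("batch_b.json", 3)]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            json.dump([{"open_q": "q%d" % i} for i in range(size)], f)
    done = count_saved_records(tmp)

print(done)
```

With such a count, the loop could start from `dl[done:]` instead of a fixed index, under the ordering assumption stated above.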


  

In [0]:
answers

['Раиса Александровна Лукина', 'Раиса Александровна', 'около 14:30']

In [0]:
import json

with open("answers_open_dp.json", "w", encoding="utf-8") as f:
    json.dump(answers, f, ensure_ascii=False, indent=4)

In [0]:
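# Attach each answer to its question; zip stops at the shorter sequence,
# so entries without an answer are left unchanged
for d, answer in zip(dl, answers):
    d['answer_dp'] = answer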

In [0]:
import json

with open("open_with_dp_answers.json", "w", encoding="utf-8") as f:
    json.dump(dl, f, ensure_ascii=False, indent=4)

# Useful links

[DeepPavlov repository](https://github.com/deepmipt/DeepPavlov)

[DeepPavlov demo page](https://demo.ipavlov.ai)

[DeepPavlov documentation](https://docs.deeppavlov.ai)