In [1]:
import os
import sys

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
docqa_path = 'C:\\Users\\Вова\\PycharmProjects\\DocQA' # change the path if necessary
sys.path.append(docqa_path)
os.chdir(docqa_path)

## Description

The DocQA library provides simple tools for creating and using Retriever, Ranker, Translator, CatBoost, and QG pipelines. These pipelines can also be combined in a general Pipeline to build your own QA system.

### Pipeline outputs

Every pipeline except QgPipeline returns a standardized output: a list of dicts, one per input string, each with 3 keys:
* input - an input string (question is expected)
* output - a dict with these keys:
    * answers - a list of dicts with these keys:
        * answer (only in general Pipeline) - an answer string
        * index (except for general Pipeline) - an index of answer in the list of document parts
        * total_score (except for general Pipeline) - the average of all pipelines' scores
        * weights_sum (except for general Pipeline) - the sum of all pipelines' weights
        * scores - a dict that maps each pipeline's score name to its score value
* modified_input - a preprocessed input string
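
The structure above can be traversed with ordinary dict and list operations. The following is a minimal sketch, not part of DocQA: the helper `top_answers`, the list `doc_parts`, and the sample `results` are hypothetical and only mirror the keys listed above. It picks the best-scoring answer for each input and resolves `index` against a list of document parts when no `answer` string is present.

```python
def top_answers(results, doc_parts):
    """Map each input string to the text of its best-scoring answer (or None)."""
    resolved = {}
    for item in results:
        answers = item["output"]["answers"]
        if not answers:
            # Some pipelines return an empty 'answers' list.
            resolved[item["input"]] = None
            continue
        best = max(answers, key=lambda a: a["total_score"])
        if "answer" in best:
            # General Pipeline: the answer text is returned directly.
            resolved[item["input"]] = best["answer"]
        else:
            # Other pipelines: 'index' points into the list of document parts.
            resolved[item["input"]] = doc_parts[best["index"]]
    return resolved


# Hypothetical data that follows the standardized output format.
doc_parts = ["part 0", "part 1", "part 2"]
results = [
    {
        "input": "q1",
        "output": {
            "answers": [
                {"index": 2, "total_score": 0.4, "weights_sum": 1.0,
                 "scores": {"retriever_cos_sim": 0.4}},
                {"index": 0, "total_score": 0.9, "weights_sum": 1.0,
                 "scores": {"retriever_cos_sim": 0.9}},
            ]
        },
        "modified_input": "q1",
    },
    {"input": "q2", "output": {"answers": []}, "modified_input": "q2"},
]

best = top_answers(results, doc_parts)
print(best)  # {'q1': 'part 0', 'q2': None}
```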

The Retriever, Ranker, Translator, and CatBoost pipelines and the general Pipeline share the same architecture, which makes them simple and predictable to use. The examples below show the output of each pipeline:

### Translator pipeline

In [4]:
from docQA.pipelines import TranslatorPipeline

pipe = TranslatorPipeline(num_beams=15)

In [5]:
input_text = 'Что такое ПДн?'

pipe(input_text)

[{'input': 'Что такое ПДн?',
  'output': {'answers': []},
  'modified_input': 'What is PDN?'}]

In [6]:
input_text = ['Что такое ПДн?', 'Что такое трансграничная передача персональных данных?']

pipe(input_text)

[{'input': 'Что такое ПДн?',
  'output': {'answers': []},
  'modified_input': 'What is PDN?'},
 {'input': 'Что такое трансграничная передача персональных данных?',
  'output': {'answers': []},
  'modified_input': 'What is cross-border transfer of personal data?'}]

In [7]:
input_text = ['Что такое ПДн?', 'Что такое трансграничная передача персональных данных?']

pipe(input_text, standardized=False) # only the Translator pipeline supports the 'standardized' flag

['What is PDN?', 'What is cross-border transfer of personal data?']
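
When the standardized output is kept, the translated strings can still be recovered from the `modified_input` key, for example to pass them to another pipeline by hand. The sketch below assumes only the output format shown above; the helper name `modified_inputs` is hypothetical, and the sample data is copied from the standardized Translator output earlier in this section.

```python
def modified_inputs(results):
    """Collect the preprocessed (here: translated) string of every result."""
    return [item["modified_input"] for item in results]


# Standardized Translator output, copied from the cell above.
translated = [
    {"input": "Что такое ПДн?",
     "output": {"answers": []},
     "modified_input": "What is PDN?"},
    {"input": "Что такое трансграничная передача персональных данных?",
     "output": {"answers": []},
     "modified_input": "What is cross-border transfer of personal data?"},
]

queries = modified_inputs(translated)
print(queries)
```

For this data the result is the same list that `standardized=False` returned above.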

### Retriever pipeline

The Retriever pipeline is used in the same way as the Ranker pipeline.

In [8]:
from docQA.pipelines import RetrieverPipeline
from docQA.nodes.storage import Storage

storage = Storage(storage_name='base_storage', docs_links=['docs/152.txt']) # create a storage based on Russian Federal Law No. 152

In [9]:
pipe = RetrieverPipeline(storage.retriever_docs_translated)

In [10]:
input_text = ['What is PDN?', 'What is cross-border transfer of personal data?']

pipe(input_text, return_num=3)

[{'input': 'What is PDN?',
  'output': {'answers': [{'index': 339,
     'total_score': 0.27039051055908203,
     'weights_sum': 1.0,
     'scores': {'retriever_cos_sim': 0.27039051055908203}},
    {'index': 3,
     'total_score': 0.25810524821281433,
     'weights_sum': 1.0,
     'scores': {'retriever_cos_sim': 0.25810524821281433}},
    {'index': 25,
     'total_score': 0.2117874026298523,
     'weights_sum': 1.0,
     'scores': {'retriever_cos_sim': 0.2117874026298523}}]},
  'modified_input': 'What is PDN?'},
 {'input': 'What is cross-border transfer of personal data?',
  'output': {'answers': [{'index': 36,
     'total_score': 0.7742951512336731,
     'weights_sum': 1.0,
     'scores': {'retriever_cos_sim': 0.7742951512336731}},
    {'index': 141,
     'total_score': 0.7269560098648071,
     'weights_sum': 1.0,
     'scores': {'retriever_cos_sim': 0.7269560098648071}},
    {'index': 142,
     'total_score': 0.6852957010269165,
     'weights_sum': 1.0,
     'scores': {'retriever_cos_

### Ranker pipeline

The Ranker pipeline is used in the same way as the Retriever pipeline.

In [11]:
from docQA.pipelines import RankerPipeline

In [12]:
pipe = RankerPipeline(storage.ranker_docs_translated)

In [13]:
input_text = ['What is PDN?', 'What is cross-border transfer of personal data?']

pipe(input_text, return_num=3)

[{'input': 'What is PDN?',
  'output': {'answers': [{'index': 36,
     'total_score': 0.2962704002857208,
     'weights_sum': 1.0,
     'scores': {'ranker_cos_sim': 0.2962704002857208}},
    {'index': 256,
     'total_score': 0.27291399240493774,
     'weights_sum': 1.0,
     'scores': {'ranker_cos_sim': 0.27291399240493774}},
    {'index': 172,
     'total_score': 0.24463218450546265,
     'weights_sum': 1.0,
     'scores': {'ranker_cos_sim': 0.24463218450546265}}]},
  'modified_input': 'What is PDN?'},
 {'input': 'What is cross-border transfer of personal data?',
  'output': {'answers': [{'index': 137,
     'total_score': 0.886732280254364,
     'weights_sum': 1.0,
     'scores': {'ranker_cos_sim': 0.886732280254364}},
    {'index': 36,
     'total_score': 0.8719310164451599,
     'weights_sum': 1.0,
     'scores': {'ranker_cos_sim': 0.8719310164451599}},
    {'index': 168,
     'total_score': 0.8492573499679565,
     'weights_sum': 1.0,
     'scores': {'ranker_cos_sim': 0.8492573499

### General pipeline

In [14]:
from docQA.pipelines import Pipeline

pipe = Pipeline(storage)

In [15]:
pipe.add_node(TranslatorPipeline, name='translator', is_technical=True, demo_only=True, num_beams=15)
pipe.add_node(RetrieverPipeline, name='retriever')
pipe.add_node(RankerPipeline, name='ranker')

In [16]:
input_text = 'Что такое персональные данные?'

pipe(input_text)

[{'input': 'Что такое персональные данные?',
  'output': {'answers': [{'answer': '1) персональные данные - любая информация, относящаяся к прямо или косвенно определенному или определяемому физическому лицу (субъекту персональных данных);',
     'total_score': 0.7820469439029694,
     'scores': {'retriever_cos_sim': 0.7021254301071167,
      'ranker_cos_sim': 0.861968457698822}},
    {'answer': '3) предполагаемые пользователи персональных данных;',
     'total_score': 0.7196908891201019,
     'scores': {'retriever_cos_sim': 0.6950922012329102,
      'ranker_cos_sim': 0.7442895770072937}},
    {'answer': '2) цель обработки персональных данных;',
     'total_score': 0.6678484380245209,
     'scores': {'retriever_cos_sim': 0.6220505237579346,
      'ranker_cos_sim': 0.7136463522911072}},
    {'answer': '2) правовые основания и цели обработки персональных данных;',
     'total_score': 0.6541507244110107,
     'scores': {'retriever_cos_sim': 0.6059879660606384,
      'ranker_cos_sim': 0.702
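
The `total_score` values in this output are consistent with a plain average of the per-node scores. The sketch below recomputes that average for the three complete answers shown above (answer texts omitted). It assumes both nodes carry equal weight; how custom weights or `weights_sum` enter the computation is not shown in this tutorial.

```python
import math

# Scores copied from the general Pipeline output above.
answers = [
    {"total_score": 0.7820469439029694,
     "scores": {"retriever_cos_sim": 0.7021254301071167,
                "ranker_cos_sim": 0.861968457698822}},
    {"total_score": 0.7196908891201019,
     "scores": {"retriever_cos_sim": 0.6950922012329102,
                "ranker_cos_sim": 0.7442895770072937}},
    {"total_score": 0.6678484380245209,
     "scores": {"retriever_cos_sim": 0.6220505237579346,
                "ranker_cos_sim": 0.7136463522911072}},
]

matches = []
for answer in answers:
    values = list(answer["scores"].values())
    recomputed = sum(values) / len(values)  # unweighted mean (assumption)
    matches.append(math.isclose(recomputed, answer["total_score"], rel_tol=1e-9))

print(all(matches))  # True
```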

### CatBoost pipeline

The CatBoost pipeline must be fitted before it can be used. For usage examples, see the pipeline fitting tutorial.