# Build a Scalable Question Answering System

- **Time to complete**: 20 minutes
- **Nodes Used**: `ElasticsearchDocumentStore`, `BM25Retriever`, `FARMReader`


Set the logging level to INFO for Haystack (all other loggers stay at WARNING):

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Initializing the ElasticsearchDocumentStore

A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. Here, we're using the [`ElasticsearchDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-elasticsearch) which connects to a running Elasticsearch service. It's a fast and scalable text-focused storage option. This service runs independently from Haystack and persists even after the Haystack program has finished running. To learn more about the DocumentStore and the different types of external databases that we support, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

1. Install requirements:

In [None]:
! pip install 'farm-haystack[preprocessing,elasticsearch]'

2. Start the server:

In [2]:
! docker-compose -f ../search-engine/docker-compose.yml up -d

Recreating elasticsearch_clinical ... done

In [None]:
import time

# Give the Elasticsearch container time to start before connecting to it.
time.sleep(30)
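A fixed 30-second pause can be too short on a slow machine and wasteful on a fast one. As a possible alternative, the sketch below polls the service until it answers. The helper names `wait_until_ready` and `elasticsearch_is_up` are hypothetical, and the URL assumes Elasticsearch is reachable over plain HTTP on `localhost:9202` without authentication, as the empty username and password in this notebook suggest.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(probe, timeout=60.0, interval=2.0):
    """Call `probe` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def elasticsearch_is_up(url="http://localhost:9202"):
    """Return True if an HTTP GET on `url` answers with status 200.

    The URL is an assumption based on the port used in this notebook.
    """
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


# Usage, once the container has been started:
# wait_until_ready(elasticsearch_is_up, timeout=60)
```

The probe is passed in as a function so the waiting logic can be reused for other services or exercised without a running server.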

3. Initialize the ElasticsearchDocumentStore:


In [1]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

document_store = ElasticsearchDocumentStore(
    host=host,
    port=9202,
    username="",
    password=""
)



In [2]:
# Clear any Documents left over from a previous run.
# `delete_all_documents()` is deprecated in favor of `delete_documents()`;
# see https://github.com/deepset-ai/haystack/issues/1045
document_store.delete_documents()


ElasticsearchDocumentStore is up and running and ready to store the Documents.

## Indexing Documents with a Pipeline

The next step is adding the files to the DocumentStore. The indexing pipeline turns your files into Document objects and writes them to the DocumentStore. Our indexing pipeline will have three nodes: `TextConverter`, which turns `.txt` files into Haystack `Document` objects; `PreProcessor`, which cleans and splits the text within a `Document`; and the DocumentStore itself, which receives the processed Documents.

Once we combine these nodes into a pipeline, the pipeline will ingest `.txt` file paths, preprocess them, and write them into the DocumentStore.


1. Initialize the pipeline, TextConverter, and PreProcessor:

In [31]:
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor

indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=400,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)


To learn more about the parameters of the `PreProcessor`, see [Usage](https://docs.haystack.deepset.ai/docs/preprocessor#usage). To understand why document splitting is important for your question answering system's performance, see [Document Length](https://docs.haystack.deepset.ai/docs/optimization#document-length).
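To make `split_length` and `split_overlap` concrete, here is a minimal sketch of word-based splitting with overlap. It is not Haystack's implementation: the function name `split_words` is hypothetical, and the sketch ignores cleaning and `split_respect_sentence_boundary`.

```python
def split_words(text, split_length=400, split_overlap=20):
    """Split `text` into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words.
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


# Ten placeholder words, chunks of four words, one word shared between neighbors.
chunks = split_words(" ".join(f"w{i}" for i in range(10)), split_length=4, split_overlap=1)
print(chunks)
```

With these toy values, each chunk starts three words after the previous one, so the last word of one chunk is repeated as the first word of the next.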

2. Add the nodes into an indexing pipeline. Provide the name or names of the preceding nodes as the `inputs` argument. Note that in an indexing pipeline, the input to the first node is `File`.

In [32]:
import os

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])


3. Run the indexing pipeline to write the text data into the DocumentStore:

In [33]:
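doc_dir = '../data/processed/n2c2_2022_it/'
# Collect every file whose name ends in .txt; a substring test would also match names like 'a.txt.bak'.
files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir) if f.endswith('.txt')]
indexing_pipeline.run_batch(file_paths=files_to_index)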

Converting files: 100%|██████████| 350/350 [00:00<00:00, 1348.13it/s]
We found one or more sentences whose word count is higher than the split length.
Preprocessing: 100%|██████████| 350/350 [00:00<00:00, 732.77docs/s]


{'documents': [<Document: {'content': "Data di registrazione: 2085-05-11\n\nNOME: Giusto, Quiana\nMRN: 9814048\n\nQuiana è una donna di 73 anni con ipertensione, diabete e stenosi spinale che\nviene per un fisico. Ha una stenosi spinale come documentato dalla sua precedente risonanza magnetica.\nDi solito può camminare per circa 5 minuti e poi sviluppa dolore alle cosce e\nvitelli. Ha qualche dolore al ginocchio mentre cammina, ma non è coerente e non lo fa\nverificarsi dopo brevi distanze di 5 minuti. Il dolore alla coscia e al polpaccio si risolve con il riposo. Lei\nnon sempre ha mal di schiena con dolore alla coscia e al polpaccio. Ha anche occasionali\ndolore al piede laterale destro che sembra un dolore artritico doloroso al piede. Lei prende\nVioxx occasionalmente per quel dolore al piede. Non ha alcuna disestesia o\nintorpidimento ai piedi o alle mani. Non ha traumi recenti, febbri o brividi. Lei ha\nnessun cambiamento nel suo intestino o nella vescica.\n\nHa il diabete di tipo

Now that the Documents are in the DocumentStore, let's initialize the nodes we want to use in our query pipeline.

## Initializing the Retriever

Our query pipeline is going to use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only those that are relevant to the question. This tutorial uses the BM25Retriever. This is the recommended Retriever for a question answering system like the one we're creating. For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever).

In [34]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

The BM25Retriever is initialized and ready for the pipeline.
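BM25 ranks Documents by how often the query terms occur in them, discounted for long Documents and for terms that appear in many Documents. The sketch below illustrates one common form of the formula in plain Python. It is only an illustration under stated assumptions: Elasticsearch computes its own scores internally, the function name `bm25_scores` is hypothetical, tokenization is a plain whitespace split, and the three example strings are made up.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against `query` with a BM25 formula."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    query_terms = query.lower().split()

    # Document frequency: in how many documents each query term occurs.
    doc_freq = {
        term: sum(1 for tokens in tokenized if term in tokens)
        for term in query_terms
    }

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_counts[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer documents are penalized.
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


# Made-up example strings, not taken from the indexed corpus.
docs = [
    "paziente con ipertensione e diabete",
    "nessuna allergia nota",
    "ipertensione ipertensione controllata",
]
scores = bm25_scores("ipertensione", docs)
```

In this toy example, the second string gets a score of zero because it does not contain the query term, and the third string outranks the first because the term occurs twice in a shorter text.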

## Initializing the Reader

Our query pipeline also needs a Reader, so we'll initialize it next. A Reader scans the texts it received from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. A good all-round model to start with is the base-sized RoBERTa question answering model [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2). The cell below instead passes the first entry of `models`, the local path `../models/medBIT-r3-plus_75/`, to a FARMReader; `deepset/xlm-roberta-large-squad2` is listed as an alternative. To find a model that's best for your use case, see [Models](https://docs.haystack.deepset.ai/docs/reader#models).

In [35]:
from haystack.nodes import FARMReader

models = [
    "../models/medBIT-r3-plus_75/",
    'deepset/xlm-roberta-large-squad2'
]

reader = FARMReader(model_name_or_path=models[0], use_gpu=False)

## Creating the Retriever-Reader Pipeline

You can combine the Reader and Retriever in a querying pipeline using the `Pipeline` class. The combination of the two speeds up processing because the Reader only processes the Documents that it received from the Retriever. 

Initialize the `Pipeline` object and add the Retriever and Reader as nodes. Provide the name or names of the preceding nodes as the `inputs` argument. Note that in a querying pipeline, the input to the first node is `Query`.

In [36]:
from haystack import Pipeline

querying_pipeline = Pipeline()
querying_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
querying_pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])


That's it! Your pipeline's ready to answer your questions!

## Asking a Question

1. Use the pipeline's `run()` method to ask a question. The `query` argument is where you type your question. Additionally, you can set the number of results you want the Retriever and the Reader to return using each node's `top_k` parameter. To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments). To understand the importance of the `top_k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).


In [64]:
prediction = querying_pipeline.run(
    query="ipertensione",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 10}
    }
)

Inferencing Samples: 100%|██████████| 1/1 [00:05<00:00,  5.10s/ Batches]


In [76]:
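from haystack.pipelines import ExtractiveQAPipeline

# ExtractiveQAPipeline is a ready-made Retriever-Reader pipeline,
# so this cell repeats the query above without the manual add_node() calls.
pipe = ExtractiveQAPipeline(reader, retriever)
prediction = pipe.run(
    query="ipertensione",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 10}
    }
)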

Inferencing Samples: 100%|██████████| 1/1 [00:05<00:00,  5.25s/ Batches]


Here are some other queries you could try, using terms that appear in the indexed notes:
- diabete
- stenosi spinale
- dolore al piede

2. Print out the answers the pipeline returns:

In [80]:
from pprint import pprint

pprint(prediction)

{'answers': [<Answer {'answer': 'diabete', 'type': 'extractive', 'score': 0.6587664484977722, 'context': ' mostrato malattia multivasale, BiV ICD, DDD St. Jude, 05/13/2081, CHF, diabete, ipertensione, ex fumatore.\n\nModifiche alle allergie\n\nNKA: nessuna all', 'offsets_in_document': [{'start': 1276, 'end': 1283}], 'offsets_in_context': [{'start': 72, 'end': 79}], 'document_ids': ['4fd49c28ae1056cb6befaef1cbdf0ebd'], 'meta': {'_split_id': 0, '_split_overlap': [{'range': [2337, 2538], 'doc_id': 'eeba194d04f547efc618365f0479f16c'}]}}>,
             <Answer {'answer': 'Malattia carotidea', 'type': 'extractive', 'score': 0.5369483232498169, 'context': "erventi CV\n\nAltra anamnesi medica/chirurgica passata\nIpertensione\nMalattia carotidea Stenosi del 20-49% bilateralmente nell'ICA bilateralmente\nnessuna", 'offsets_in_document': [{'start': 2182, 'end': 2200}], 'offsets_in_context': [{'start': 66, 'end': 84}], 'document_ids': ['a2c985b926cd4460bfd99546c69792fe'], 'meta': {'_split_id': 0, 

In [85]:
# Build one record per retrieved Document and attach the answer extracted from it, if any.
json_output = []
for document in prediction['documents']:
    output_dict = {
        'text': document.content,
        'document_score': document.score,
    }

    for answer in prediction['answers']:
        # An answer belongs to a Document when its first document id matches.
        # If several answers match, the last one in the list is kept.
        if answer.document_ids[0] == document.id:
            output_dict['context'] = answer.context
            output_dict['answer'] = answer.answer
    json_output.append(output_dict)
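Records shaped like the entries of `json_output` can then be serialized with the standard `json` module. The sketch below uses two made-up records with the same keys and assumes the scores are plain Python floats. Passing `ensure_ascii=False` keeps accented Italian characters readable instead of escaping them.

```python
import json

# Made-up records with the same keys as the entries of `json_output`.
sample_output = [
    {
        "text": "Ipertensione e diabete.",
        "document_score": 0.91,
        "context": "Ipertensione e diabete.",
        "answer": "diabete",
    },
    # A Document without a matching answer has no 'context' or 'answer' key.
    {"text": "Nessuna allergia nota è stata riportata.", "document_score": 0.42},
]

# ensure_ascii=False writes "è" as-is rather than as the escape "\u00e8".
serialized = json.dumps(sample_output, ensure_ascii=False, indent=2)

# Round trip: parsing the string gives back the same records.
restored = json.loads(serialized)
```

To write the records to disk instead, the same arguments work with `json.dump` and a file opened with `encoding="utf-8"`.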

3. Simplify the printed answers:

In [79]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="minimum"  # Choose from `minimum`, `medium`, and `all`
)

'Query: ipertensione'
'Answers:'
[   {   'answer': 'diabete',
        'context': ' mostrato malattia multivasale, BiV ICD, DDD St. Jude, '
                   '05/13/2081, CHF, diabete, ipertensione, ex fumatore.\n'
                   '\n'
                   'Modifiche alle allergie\n'
                   '\n'
                   'NKA: nessuna all'},
    {   'answer': 'Malattia carotidea',
        'context': 'erventi CV\n'
                   '\n'
                   'Altra anamnesi medica/chirurgica passata\n'
                   'Ipertensione\n'
                   'Malattia carotidea Stenosi del 20-49% bilateralmente '
                   "nell'ICA bilateralmente\n"
                   'nessuna'},
    {   'answer': 'diabete con scarso controllo ipertensivo. Aggiungerò\n'
                  'idroclorotiazide 12,5 mg p.o. q.a.m. Per seguire in 2 '
                  'settimane.\n'
                  '\n'
                  '2. URI virale',
        'context': 'nsione e diabete con scarso controllo 

And there you have it! Congratulations on building a scalable, machine-learning-based question answering system!

## Next Steps

To learn how to improve the performance of the Reader, see [Fine-Tune a Reader](https://haystack.deepset.ai/tutorials/02_finetune_a_model_on_your_data).