# Build a Scalable Question Answering System

- **Time to complete**: 20 minutes
- **Nodes Used**: `ElasticsearchDocumentStore`, `BM25Retriever`, `FARMReader`


Set the logging level to INFO for Haystack, and leave all other libraries at WARNING:

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
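
As a small illustration of why setting the level on the `haystack` logger is enough, the sketch below checks the effective level of a child logger. The child logger name `haystack.nodes.retriever` is only an illustrative assumption; any logger whose name starts with `haystack.` behaves the same way.

```python
import logging

# Mirror the tutorial's setup: root logger at WARNING, "haystack" at INFO.
logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

# A child logger has no level of its own, so it inherits from "haystack".
# The name below is illustrative, not a logger this tutorial relies on.
child = logging.getLogger("haystack.nodes.retriever")
effective = logging.getLevelName(child.getEffectiveLevel())

print(effective)
```

Loggers from other libraries keep following the root logger, so their INFO messages stay hidden.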

## Initializing the ElasticsearchDocumentStore

A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. Here, we're using the [`ElasticsearchDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-elasticsearch) which connects to a running Elasticsearch service. It's a fast and scalable text-focused storage option. This service runs independently from Haystack and persists even after the Haystack program has finished running. To learn more about the DocumentStore and the different types of external databases that we support, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

1. Install requirements:

In [None]:
! pip install 'farm-haystack[preprocessing,elasticsearch]'

2. Start the server:

In [2]:
! docker-compose -f ../search-engine/docker-compose.yml up -d

Recreating elasticsearch_clinical ... done

In [None]:
import time

# Give Elasticsearch time to start before connecting to it
time.sleep(30)
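
A fixed 30-second pause can be too short on a slow machine and wasteful on a fast one. As a hedged alternative, the sketch below polls the service until it answers. The helper names `http_probe` and `wait_until_ready` are hypothetical and not part of Haystack, and the timeout and interval values are assumptions.

```python
import time
import urllib.error
import urllib.request


def http_probe(url):
    """Return True if the URL answers an HTTP request at all."""
    try:
        with urllib.request.urlopen(url, timeout=2):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so it is reachable.
        return True
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, probe=http_probe, timeout=60.0, interval=2.0):
    """Poll `probe(url)` until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the port used in this tutorial, a call could look like `wait_until_ready("http://localhost:9202")`. The `probe` argument is injectable so the waiting logic can be exercised without a running server.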

3. Initialize the ElasticsearchDocumentStore:


In [3]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

document_store = ElasticsearchDocumentStore(
    host=host,
    port=9202,
    username="",
    password=""
)



ElasticsearchDocumentStore is up and running and ready to store the Documents.

## Indexing Documents with a Pipeline

The next step is adding the files to the DocumentStore. The indexing pipeline turns your files into Document objects and writes them to the DocumentStore. Our indexing pipeline will have two nodes: `TextConverter`, which turns `.txt` files into Haystack `Document` objects, and `PreProcessor`, which cleans and splits the text within a `Document`.

Once we combine these nodes into a pipeline, the pipeline will ingest `.txt` file paths, preprocess them, and write them into the DocumentStore.


1. Initialize the pipeline, TextConverter, and PreProcessor:

In [4]:
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor

indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)


To learn more about the parameters of the `PreProcessor`, see [Usage](https://docs.haystack.deepset.ai/docs/preprocessor#usage). To understand why document splitting is important for your question answering system's performance, see [Document Length](https://docs.haystack.deepset.ai/docs/optimization#document-length).
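
To make `split_length` and `split_overlap` concrete, here is a minimal sketch of word-based splitting with overlap. It is not Haystack's implementation: the function name `split_words` is hypothetical, and unlike the `PreProcessor` configured above it ignores sentence boundaries.

```python
def split_words(text, split_length=200, split_overlap=20):
    """Split text into word windows; neighbouring windows share `split_overlap` words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a window has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = split_words(sample, split_length=4, split_overlap=1)
print(chunks)
```

Each window starts `split_length - split_overlap` words after the previous one, so the last word of one chunk in this example is repeated as the first word of the next.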

2. Add the nodes to an indexing pipeline. Pass the name or names of the preceding nodes as the `inputs` argument. Note that in an indexing pipeline, the input to the first node is `File`.

In [5]:
import os

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])


3. Run the indexing pipeline to write the text data into the DocumentStore:

In [10]:
doc_dir = '../data/raw/n2c2_2022/'
# Collect only the files whose names end in .txt
files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir) if f.endswith('.txt')]
indexing_pipeline.run_batch(file_paths=files_to_index)

Converting files: 100%|██████████| 350/350 [00:00<00:00, 2469.22it/s]
We found one or more sentences whose word count is higher than the split length.
Preprocessing: 100%|██████████| 350/350 [00:00<00:00, 564.03docs/s]


{'documents': [<Document: {'content': "\n\nRecord date: 2085-05-11\n\nNAME:    Justus, Quiana\nMRN:       9814048\n\nQuiana is a 73-year-old woman with hypertension, diabetes and spinal stenosis who\ncomes in for a physical.  She has spinal stenosis as documented by her prior MRI's.\nShe can usually walk about 5 minutes and then develops pain in her thighs and\ncalves.  She has some knee pain with walking, but it is not consistent and does not\noccur after short 5-minute distances.  The thigh and calf pain resolve with rest.  She\ndoes not always have back pain with the thigh and calf pain.  She also has occasional\nright lateral foot pain that feels like an achy arthritic pain in her foot.  She takes\nVioxx occasionally for that foot pain.  She does not have any dysesthesias or\nnumbness in her feet or hands.  She has no recent trauma, fevers or chills.  She has\nno change in her bowel or bladder.\n\nShe has type 2 diabetes.  She has been taking Glyburide 5 mg p.o. q.d.  Her weight is

Now that the Documents are in the DocumentStore, let's initialize the nodes we want to use in our query pipeline.

## Initializing the Retriever

Our query pipeline is going to use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only those that are relevant to the question. This tutorial uses the BM25Retriever. This is the recommended Retriever for a question answering system like the one we're creating. For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever).

In [11]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

The BM25Retriever is initialized and ready for the pipeline.
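
For intuition about how BM25 ranks Documents, the sketch below scores a few short texts in plain Python. It is a simplified illustration, not what Elasticsearch runs: tokenization is a plain whitespace split, the function name `bm25_scores` and the sample sentences are made up for this example, and `k1=1.2` and `b=0.75` are the commonly used defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Long documents are penalized relative to the average length.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "patient has type 2 diabetes and hypertension",
    "patient reports knee pain with walking",
    "follow up visit for spinal stenosis",
]
scores = bm25_scores("diabetes hypertension", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Because BM25 only matches terms literally, a misspelled or differently inflected query word contributes nothing to the score unless the index applies text analysis such as stemming.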

## Initializing the Reader

Our query pipeline also needs a Reader, so we'll initialize it next. A Reader scans the texts it receives from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. This tutorial uses a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2). It's a good all-round model to start with. To find a model that's best for your use case, see [Models](https://docs.haystack.deepset.ai/docs/reader#models).

In [12]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

Downloading (…)lve/main/config.json: 100%|██████████| 571/571 [00:00<00:00, 99.5kB/s]
Downloading pytorch_model.bin: 100%|██████████| 496M/496M [00:04<00:00, 110MB/s]  
Downloading (…)okenizer_config.json: 100%|██████████| 79.0/79.0 [00:00<00:00, 61.2kB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 2.24MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 1.50MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 772/772 [00:00<00:00, 710kB/s]


## Creating the Retriever-Reader Pipeline

You can combine the Reader and Retriever in a querying pipeline using the `Pipeline` class. The combination of the two speeds up processing because the Reader only processes the Documents that it received from the Retriever. 

Initialize the `Pipeline` object and add the Retriever and Reader as nodes. Pass the name or names of the preceding nodes as the `inputs` argument. Note that in a querying pipeline, the input to the first node is `Query`.

In [13]:
from haystack import Pipeline

querying_pipeline = Pipeline()
querying_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
querying_pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])


That's it! Your pipeline's ready to answer your questions!
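
To illustrate why the two-stage design saves work, here is a toy retrieve-then-read sketch in plain Python. It does not use Haystack: `run_query`, `overlap`, and `toy_reader` are hypothetical stand-ins, and the `retriever_top_k` and `reader_top_k` arguments only mimic the role of the `top_k` values passed to a real pipeline.

```python
def run_query(query, docs, retrieve, read, retriever_top_k=10, reader_top_k=5):
    """Two-stage search: a cheap ranking narrows the pool, a costly step reads it."""
    # Stage 1: rank every document with the cheap function, keep the best few.
    ranked = sorted(docs, key=lambda doc: retrieve(query, doc), reverse=True)
    candidates = ranked[:retriever_top_k]
    # Stage 2: run the expensive function on the candidates only.
    answers = [read(query, doc) for doc in candidates]
    answers.sort(key=lambda ans: ans["score"], reverse=True)
    return answers[:reader_top_k]


def overlap(query, doc):
    """Cheap stand-in retriever: count shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


reader_calls = []


def toy_reader(query, doc):
    """Stand-in reader: records each call and returns a dummy answer."""
    reader_calls.append(doc)
    return {"answer": doc.split()[0], "score": overlap(query, doc) / 10}


docs = ["diabetes type 2 noted", "knee pain noted", "diabetes follow up", "eye appointment"]
answers = run_query("diabetes type", docs, overlap, toy_reader, retriever_top_k=2, reader_top_k=1)
print(len(reader_calls), answers)
```

The stand-in reader is called for only two of the four texts. A smaller Retriever `top_k` makes the query faster but raises the risk that the passage containing the answer never reaches the Reader.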

## Asking a Question

1. Use the pipeline's `run()` method to ask a question. The `query` argument is where you type your question. Additionally, you can set the number of documents you want the Reader and Retriever to return using the `top_k` parameter. To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments). To understand the importance of the `top_k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).


In [17]:
prediction = querying_pipeline.run(
    query="quiana Justus had diabete?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 5}
    }
)

Inferencing Samples: 100%|██████████| 1/1 [00:02<00:00,  2.32s/ Batches]


Here are some questions you could try out:
- How old is Quiana Justus?
- What does Quiana Justus take for her foot pain?
- Which patient has spinal stenosis?

2. Print out the answers the pipeline returns:

In [15]:
from pprint import pprint

pprint(prediction)

{'answers': [<Answer {'answer': '2085-05-11', 'type': 'extractive', 'score': 0.14135760068893433, 'context': '\n\nRecord date: 2085-05-11\n\nNAME:    Justus, Quiana\nMRN:       9814048\n\nQuiana is a 73-year-old woman with hypertension, diabetes and spinal stenosis w', 'offsets_in_document': [{'start': 15, 'end': 25}], 'offsets_in_context': [{'start': 15, 'end': 25}], 'document_ids': ['d5e6cc2b6ce929e4d2c9bfdf8450dc9a'], 'meta': {'_split_id': 0, '_split_overlap': [{'range': [1001, 1114], 'doc_id': '647be74496df7f3f45650b86c4352680'}]}}>,
             <Answer {'answer': 'March of this year', 'type': 'extractive', 'score': 0.12971088290214539, 'context': 'rtion worsened his symptoms.  He has had these symptoms before in March of this year, but his symptoms today are more mild than when he presented then', 'offsets_in_document': [{'start': 203, 'end': 221}], 'offsets_in_context': [{'start': 66, 'end': 84}], 'document_ids': ['c3b5450f2a62bb26c6a46e96fb57103d'], 'meta': {'_split_id': 1, '_s

3. Simplify the printed answers:

In [18]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="minimum"  # Choose from `minimum`, `medium`, and `all`
)

'Query: quiana Justus had diabete?'
'Answers:'
[   {   'answer': 'Diabetes mellitus type 2',
        'context': 'has a long history of cardiac disease with stenting in the '
                   'pastDiabetes mellitus type 2-Dr Willis  : Eye appt-JHThe '
                   'patient is followed by Dr. Willis '},
    {   'answer': '2070-10-27',
        'context': '\n'
                   '\n'
                   'Record date: 2070-10-27\n'
                   '\n'
                   'CLERMONT COUNTY HOSPITAL\n'
                   '\n'
                   'ERVING, VERMONT\n'
                   '\n'
                   'GI CONSULT NOTE \t\t\t\t\t\t\t\t10/27/2070\n'
                   '\n'
                   'Patient: Langston, Sherman\n'
                   '\n'
                   'Unit #: 7223692\n'},
    {   'answer': 'RACINE, MAINE\t\t\t\tName:\tQuiana Justus\n'
                  '\n'
                  'NIHC #:\t981-40-48',
        'context': 'HEALTH CENTER\t\t        Date:\tNovember 17, 2083\n
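
The scores in the raw prediction shown earlier are low (about 0.14 and 0.13), so it can help to inspect and filter answers before trusting them. The sketch below uses plain dictionaries as stand-ins for Haystack `Answer` objects, with a few keys copied from that printed output. The context string is shortened, and the 0.5 threshold and the helper names `span_text` and `confident` are assumptions for illustration.

```python
# Plain dicts standing in for Haystack Answer objects; only the keys used
# here are copied from the printed prediction.
answers = [
    {
        "answer": "2085-05-11",
        "score": 0.14135760068893433,
        "context": "\n\nRecord date: 2085-05-11\n\nNAME:    Justus, Quiana",
        "offsets_in_context": [{"start": 15, "end": 25}],
    },
    {
        "answer": "March of this year",
        "score": 0.12971088290214539,
        "context": "",  # omitted in this sketch
        "offsets_in_context": [],
    },
]


def span_text(answer):
    """Slice the answer text out of its context using the first offset pair."""
    offsets = answer["offsets_in_context"]
    if not offsets:
        return None
    return answer["context"][offsets[0]["start"]:offsets[0]["end"]]


def confident(candidates, min_score=0.5):
    """Keep answers whose score reaches `min_score` (an assumed threshold)."""
    return [a for a in candidates if a["score"] >= min_score]


print(span_text(answers[0]))
print(confident(answers))
```

With this threshold neither answer survives, which signals that the question or the `top_k` settings may need adjusting before the results are used.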

And there you have it! Congratulations on building a scalable, machine-learning-based question answering system!

## Next Steps

To learn how to improve the performance of the Reader, see [Fine-Tune a Reader](https://haystack.deepset.ai/tutorials/02_finetune_a_model_on_your_data).