# Scalable Question Answering System on PubMed

## Setup

### Download the dataset

Download the dataset and a utility script to extract the relevant information from PubMed XML documents.

In [None]:
!wget https://ccia.esei.uvigo.es/docencia/TM/2324/practicas/pubmed23n1083.mini.xml.gz
!wget https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed23n1000.xml.gz
!wget https://ccia.esei.uvigo.es/docencia/TM/2324/practicas/medline.py

--2023-12-05 22:50:42--  https://ccia.esei.uvigo.es/docencia/TM/2324/practicas/pubmed23n1083.mini.xml.gz
Resolving ccia.esei.uvigo.es (ccia.esei.uvigo.es)... 193.147.87.31
Connecting to ccia.esei.uvigo.es (ccia.esei.uvigo.es)|193.147.87.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18240 (18K) [application/x-gzip]
Saving to: ‘pubmed23n1083.mini.xml.gz’


2023-12-05 22:50:44 (74.4 KB/s) - ‘pubmed23n1083.mini.xml.gz’ saved [18240/18240]

--2023-12-05 22:50:44--  https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed23n1000.xml.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 130.14.250.7, 2607:f220:41e:250::7, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 62210931 (59M) [application/x-gzip]
Saving to: ‘pubmed23n1000.xml.gz’


2023-12-05 22:50:49 (15.3 MB/s) - ‘pubmed23n1000.xml.gz’ saved [62210931/62210931]

--2023-12-05 22:50:49-- 

### Install Haystack

Haystack is an open-source Python library for building end-to-end Search and Question Answering (QA) systems and NLP applications.
* Provides an easy-to-use framework and a set of components that cover all stages of an NLP project, making it possible to work with text data, perform document retrieval, and apply ML to extract answers from documents.
* Includes integration components to work with LLMs (_Large Language Models_) and to interface with models from [Hugging Face](https://huggingface.co/), [Sentence Bert](https://www.sbert.net/), [OpenAI](https://platform.openai.com/docs/models) and others.

Applications built on Haystack are based on the [Pipeline](https://docs.haystack.deepset.ai/docs/pipelines) concept, which organizes the sequence of tasks to be performed on processed text or documents.
- These _Pipelines_ are composed of different components (called _Nodes_ in Haystack) that perform the corresponding task.
- Pipeline components take core elements as input and emit them as output: [_Documents_, _Answers_, _Labels_](https://docs.haystack.deepset.ai/docs/documents_answers_labels).
- Available pipeline components are grouped according to the function they perform (Data Handling, Semantic Search, Prompt and LLM, etc.). See [Pipeline Components Overview](https://docs.haystack.deepset.ai/docs/nodes_overview).

In [None]:
%%bash
pip install --upgrade pip
# Quote the requirement so the shell does not treat the brackets as a glob pattern
pip install "farm-haystack[colab,preprocessing,elasticsearch,inference]"

Collecting pip
  Downloading pip-23.3.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 21.7 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.3.1
Collecting farm-haystack[colab,elasticsearch,inference,preprocessing]
  Downloading farm_haystack-1.22.1-py3-none-any.whl.metadata (28 kB)
Collecting boilerpy3 (from farm-haystack[colab,elasticsearch,inference,preprocessing])
  Downloading boilerpy3-1.0.7-py3-none-any.whl.metadata (5.8 kB)
Collecting events (from farm-haystack[colab,elasticsearch,inference,preprocessing])
  Downloading Events-0.5-py3-none-any.whl.metadata (3.9 kB)
Collecting httpx (from farm-haystack[colab,elasticsearch,inference,preprocessing])
  Downloading httpx-0.25.2-py3-none-any.whl.metadata (6.9 kB)
Collecting lazy-imports==0.3.1 (from farm-haystack[colab,elasti

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires openai, which is not installed.


Set the logging level to INFO for Haystack (other loggers stay at WARNING):

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

Download and extract the Elasticsearch archive, then give ownership of the extracted directory to the `daemon` user (Elasticsearch refuses to run as root):

In [None]:
%%bash
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2

Start the server:

In [None]:
%%bash --bg
sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch

Wait 30 seconds for the server to fully start up:

In [None]:
import time
time.sleep(30)
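
A fixed 30-second sleep can be too short on a slow machine and wasteful on a fast one. As a possible alternative, the sketch below polls the server until it answers. The helper names `elasticsearch_is_up` and `wait_until_ready` are hypothetical, and the URL assumes the default `localhost:9200` endpoint used later in this notebook.

```python
import time
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200"):
    """Return True once the HTTP endpoint answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is not ready yet.
        return False


def wait_until_ready(probe, timeout=120.0, interval=2.0, sleep=time.sleep):
    """Call `probe` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling `wait_until_ready(elasticsearch_is_up)` would then return `True` as soon as the server responds, or `False` if it has not come up within the timeout.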

## Index Documents

Install the dependencies of `medline.py`, the utility script to extract the cleaned content from the PubMed XML documents:

- Pubmed Parser is a Python library for parsing the PubMed Open-Access (OA) subset, MEDLINE XML repositories, and Entrez Programming Utilities (E-utils).

- RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm that tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.
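
To make the RAKE idea concrete, here is a minimal pure-Python sketch of the usual scoring scheme: candidate phrases are the word runs between stopwords and punctuation, and each phrase is scored by summing degree(word) / frequency(word) over its words. It is an illustration only, not the `rake_nltk` implementation; the tiny stopword list and the function names are assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); rake_nltk relies on NLTK's list.
STOPWORDS = {"a", "an", "and", "by", "in", "is", "of", "the", "to", "with"}


def candidate_phrases(text):
    """Split text at punctuation and stopwords into candidate key phrases."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Score each phrase as the sum of degree(word) / frequency(word)."""
    phrases = candidate_phrases(text)
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            # Degree counts co-occurrences within the phrase, the word itself included.
            degree[word] += len(phrase)
    scores = {}
    for phrase in set(phrases):
        scores[" ".join(phrase)] = sum(degree[w] / frequency[w] for w in phrase)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Words that tend to appear inside longer phrases get a higher degree, so multi-word technical terms usually rank above isolated words.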

In [None]:
%%bash
pip install pubmed_parser
pip install rake_nltk
python -c "import nltk; nltk.download('stopwords')"
python -c "import nltk; nltk.download('punkt')"

Collecting pubmed_parser
  Downloading pubmed_parser-0.3.1.tar.gz (21 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting unidecode (from pubmed_parser)
  Downloading Unidecode-1.3.7-py3-none-any.whl.metadata (13 kB)
Collecting pytest-cov (from pubmed_parser)
  Downloading pytest_cov-4.1.0-py3-none-any.whl.metadata (26 kB)
Collecting coverage>=5.2.1 (from coverage[toml]>=5.2.1->pytest-cov->pubmed_parser)
  Downloading coverage-7.3.2-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Downloading pytest_cov-4.1.0-py3-none-any.whl (21 kB)
Downloading Unidecode-1.3.7-py3-none-any.whl (235 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 235.5/235.5 kB 3.9 MB/s eta 0:00:00
Downloading coverage-7.3.2-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (227 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 227.5/227.5 kB 4

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Load the content (title + abstract) from the documents:

In [None]:
from medline import MedlineLoader

loader = MedlineLoader(add_keywords=False)
docs = loader.loadFromFile("./pubmed23n1000.xml.gz")
doc_texts = [d.title + " . " + d.abstractText for d in docs]

Create Haystack documents with the content of the documents:

In [None]:
from haystack import Document

documents = []
for idx, pubmed_doc in enumerate(doc_texts, start=1):
    doc = Document(
        content=pubmed_doc,
        content_type="text",
        id=str(idx),  # Haystack document IDs are strings
        meta={"name": f"PubMed Document {idx}"},
    )
    documents.append(doc)

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).


Optionally, clean and split the text within each Document. In many pipelines the Reader is the most computationally expensive component. For sparse Retrievers, very long documents pose a challenge, since the signal of the relevant section can get washed out by the rest of the Document. To balance Reader speed against Retriever performance, the configuration below splits documents into chunks of at most 200 words. When splitting, it is generally a bad idea to let chunk boundaries fall in the middle of a sentence: each chunk would then contain sentence fragments that are hard for both the Retriever and the Reader to interpret. Therefore we set `split_respect_sentence_boundary=True`.

**We observed that splitting the documents leads to poorer results. Also, the text is already cleaned by the** _MedlineLoader_ **from** `medline.py`. **The preprocessing step below is therefore left commented out.**

In [None]:
# from haystack.nodes import PreProcessor

# preprocessor = PreProcessor(
#     clean_whitespace=True,
#     clean_header_footer=True,
#     clean_empty_lines=True,
#     split_by="word",
#     split_length=200,
#     split_overlap=0,
#     split_respect_sentence_boundary=True,
# )

# preprocessed_documents = preprocessor.process(documents)
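
To illustrate what a word-based split that respects sentence boundaries means, the following pure-Python sketch groups whole sentences into chunks of at most `split_length` words. It is not Haystack's `PreProcessor` implementation: the regex sentence splitter and the handling of overlong sentences are simplifying assumptions, and `split_by_words` is a hypothetical name.

```python
import re


def split_by_words(text, split_length=200):
    """Group whole sentences into chunks of at most `split_length` words.

    A single sentence longer than `split_length` is kept whole in its own chunk.
    """
    # Naive sentence splitter (an assumption): break after ., ! or ? plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_words = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and current_words + n_words > split_length:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks are closed only between sentences, no chunk ever starts or ends with a sentence fragment.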

Write the documents to an Elasticsearch document store.

Elasticsearch is a distributed search and analytics engine built on top of Apache Lucene. It is designed to handle large volumes of data and provide near-real-time search capabilities. It is commonly used for text mining, which involves extracting valuable information and insights from unstructured text data. Some key aspects of Elasticsearch and its benefits for text mining are:

- It is distributed in nature, allowing it to scale horizontally across multiple nodes. This enables it to handle large amounts of text data efficiently, making it suitable for applications with extensive text mining requirements.

- It supports various text analysis features, such as tokenization, stemming, and synonym expansion. These features help in preprocessing text data, making it easier to uncover meaningful patterns and relationships during text mining.

- It provides near-real-time indexing, meaning that newly ingested data becomes searchable almost immediately. This is crucial for text mining applications that require up-to-date information for analysis.

- It stores data in JSON format and treats each piece of data as a document. This document-oriented approach is well-suited for handling unstructured text data commonly encountered in text mining applications.

- It is open source, which means that it is freely available, and users can modify and extend its functionality according to their needs. The open-source nature encourages a large and active community, leading to continuous improvements and updates.

In summary, Elasticsearch provides a robust and scalable platform for text mining applications, offering powerful search capabilities, advanced querying, and flexible analysis features that are well-suited for extracting valuable insights from large volumes of unstructured text data.

In [None]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")
elasticsearch_store = ElasticsearchDocumentStore(host=host, username="", password="", index="document", port=9200)
elasticsearch_store.write_documents(documents)
# elasticsearch_store.write_documents(preprocessed_documents)

In [None]:
# Display document texts and metadata
for doc in elasticsearch_store.get_all_documents():
    print(f"Document ID: {doc.id}")
    print(f"Document Text: {doc.content}")
    print(f"Document Meta: {doc.meta}")
    print("\n---\n")

The last 5000 lines of the output stream have been truncated.

---

Document ID: 8741
Document Text: Arsenic impairs GLUT1 trafficking through the inhibition of the calpain system in lymphocytes. . Exposure to arsenic is associated with increased risk of developing insulin resistance and type 2 diabetes. The proteases calpain-1 (CAPN1), calpain-2 (CAPN2) and calpain-10 (CAPN10) and their endogenous inhibitor calpastatin (CAST) regulate glucose uptake in skeletal muscle and adipocytes. We investigated whether arsenic disrupts GLUT1 trafficking and function through calpain inhibition, using lymphocytes as a cell model. Lymphocytes from healthy subjects were treated with 0.1 or 1 μM of sodium arsenite for 72 h and challenged with 3.9 or 11.1 mM of glucose. Our results showed that arsenite inhibited GLUT1 trafficking, glucose uptake, and calpain activity in the presence of 11.1 mM of glucose. These correlated with a decrease in the autolytical fragment of 50 kDa of CAPN1 and i

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Document Meta: {'name': 'PubMed Document 3913'}

---

Document ID: 3914
Document Text: Error in a Data Point in Table 3. . 
Document Meta: {'name': 'PubMed Document 3914'}

---

Document ID: 3915
Document Text: Erroneous Cohort Totals in Abstract. . 
Document Meta: {'name': 'PubMed Document 3915'}

---

Document ID: 3916
Document Text: Error in Title. . 
Document Meta: {'name': 'PubMed Document 3916'}

---

Document ID: 3917
Document Text: Making Machine Learning Models Clinically Useful. . 
Document Meta: {'name': 'PubMed Document 3917'}

---

Document ID: 3918
Document Text: A Broad Impact for Global Oncology. . 
Document Meta: {'name': 'PubMed Document 3918'}

---

Document ID: 3919
Document Text: Association of Giant Cell Arteritis With Race. . IMPORTANCE. Giant cell arteritis (GCA) is the most common vasculitis in adults and is associated with significant morbidity and mortality. Its incidence has been carefully studied in white populations, yet its relevance among other racial an

Document Text: Effects of astaxanthin onaxonal regeneration via cAMP/PKA signaling pathway in mice with focal cerebral infarction. . OBJECTIVE. To investigate the effect of astaxanthin on the neurological function of the middle cerebral artery occlusion (MCAO) mice and its possible mechanism. . . MATERIALS AND METHODS. The male C57BL/6 mice were selected to establish the model of MCAO via electrocoagulation, and they were randomly divided into 4 groups: the sham operation group (Sham group), the cerebral ischemia model group (MCAO group), the astaxanthin intervention group (gavage with 30 mg/kg astaxanthin for 28 days, twice a day; Ast group), and astaxanthin + H89 group (Ast + H89 group). At 3, 7, 14, and 28 d after the operation, the Rotarod test and the balance beam footstep error test were performed. The brain tissues were taken for immunofluorescence to observe the expression of the growth-associated protei


---

Document ID: 22258
Document Text: Identification of Sulfonated and Hydroxy-Sulfonated Polychlorinated Biphenyl (PCB) Metabolites in Soil: New Classes of Intermediate Products of PCB Degradation? . In this paper we describe the identification of two classes of contaminants: sulfonated-PCBs and hydroxy-sulfonated-PCBs. This is the first published report of the detection of these chemicals in soil. They were found, along with hydroxy-PCBs, in soil samples coming from a site historically contaminated by the industrial production of PCBs and in background soils. Sulfonated-PCB levels were approximately 0.4-0.8% of the native PCB levels in soils and about twice the levels of hydroxy-sulfonated-PCBs and hydroxy-PCBs. The identification of sulfonated-PCBs was confirmed by the chemical synthesis of reference standards, obtained through the sulfonation of an industrial mixture of PCBs. We then reviewed the literature to investigate for the potential agents responsible for the sulfonation.


---

Document ID: 16846
Document Text: Outcome Measures of Free-Living Activity in Spinal Cord Injury Rehabilitation. . PURPOSE OF REVIEW. The purpose of this article was to describe the utilization of body worn activity monitors in the SCI population and discuss the challenges of using body worn sensors in rehabilitation research. . . RECENT FINDINGS. Many activity monitor-based measures have been used and validated in the SCI population including stroke number, push frequency, upper limb activity counts and wheelchair propulsion distance measured from a sensor attached to the wheelchair. . . SUMMARY. The ability to accurately measure physical activity in the free-living environment using body-worn sensors has the potential to enhance the understanding of barriers to adequate activity and identify possible effective interventions. As the use of activity monitors used in SCI rehabilitation research continues to grow, care must be taken to overcome challenges related to participant adh


---

Document ID: 26729
Document Text: Use of Natural Language Processing Tools to Identify and Classify Periprosthetic Femur Fractures. . BACKGROUND. Manual chart review is labor-intensive and requires specialized knowledge possessed by highly trained medical professionals. The cost and infrastructure challenges required to implement this is prohibitive for most hospitals. Natural language processing (NLP) tools are distinctive in their ability to extract critical information from unstructured text in the electronic health records. As a simple proof-of-concept for the potential application of NLP technology in total hip arthroplasty (THA), we examined its ability to identify periprosthetic femur fractures (PPFFx) followed by more complex Vancouver classification. . . METHODS. PPFFx were identified among all THAs performed at a single academic institution between 1998 and 2016. A randomly selected training cohort (1538 THAs with 89 PPFFx cases) was used to develop the prototype NLP al
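
Printing every stored document is most likely what triggers the "IOPub data rate exceeded" messages in this output. A bounded preview avoids that. The sketch below uses a hypothetical helper, `preview_documents`, that prints only the first few documents and truncates long texts. It runs on stand-in objects here; it only assumes each document has `id`, `content`, and `meta` attributes, as Haystack documents do.

```python
from itertools import islice
from types import SimpleNamespace


def preview_documents(docs, limit=5, max_chars=200):
    """Return printable summaries for at most `limit` documents."""
    lines = []
    for doc in islice(docs, limit):
        text = doc.content
        if len(text) > max_chars:
            text = text[:max_chars] + "..."
        lines.append(f"Document ID: {doc.id} | Meta: {doc.meta} | Text: {text}")
    return lines


# Stand-in objects so the sketch runs without a document store.
sample_docs = (
    SimpleNamespace(id=str(i), content="word " * 100, meta={"name": f"PubMed Document {i}"})
    for i in range(1, 1001)
)
for line in preview_documents(sample_docs, limit=3, max_chars=40):
    print(line)
```

With the real store, the same helper could be fed from `elasticsearch_store.get_all_documents_generator()`, which Haystack 1.x document stores are expected to provide; check that call against the installed version.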


## Build the Q&A pipeline

Use the sparse `BM25Retriever` as the retriever. BM25 is an improved variant of TF-IDF: it saturates term frequency after a set number of occurrences of a term in a document, and it normalises by document length so that short documents are favoured over long ones with the same amount of word overlap with the query. The Retriever sweeps through a DocumentStore and returns a set of candidate Documents that are relevant to the query.

In [None]:
from haystack.nodes import BM25Retriever
retriever = BM25Retriever(document_store=elasticsearch_store)
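
The two properties described above, term-frequency saturation and document-length normalisation, can be seen in a small pure-Python sketch of a common BM25 formulation. This is an illustration, not the scoring code that Elasticsearch runs; whitespace tokenisation, the function name `bm25_scores`, and the parameter values `k1=1.2` and `b=0.75` (widely used defaults) are assumptions of the sketch.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a common BM25 formulation."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Longer-than-average documents get a larger normaliser, hence a lower score.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Saturation: this fraction approaches (k1 + 1) as tf grows.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores
```

Setting `b=0` switches off length normalisation, which makes the saturation effect easy to inspect on its own: repeating a query term four times raises the score, but by much less than a factor of four.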

Use the `FARMReader` as the reader. It wraps a transformer-based model (in our case [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2)) for extractive Question Answering: it takes a question and a set of Documents as input and returns an answer by selecting a text span within the Documents.

In [None]:
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Downloading config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


Downloading model.safetensors:   0%|          | 0.00/496M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


Downloading tokenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
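
Extractive readers of this kind typically score every token as a possible answer start and as a possible answer end, then return the best-scoring span. The sketch below shows that selection step in isolation. It is a simplification: real models work on subword tokens and also handle the "no answer" case. The scores are made up for illustration and are not model output, and `best_span` is a hypothetical helper.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximising start_scores[s] + end_scores[e].

    Only spans with start <= end and at most `max_answer_len` tokens are considered.
    """
    best = None
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


tokens = ["Campylobacter", "jejuni", "is", "a", "leading", "cause", "of", "foodborne", "illness"]
# Hypothetical per-token scores, invented for this illustration.
start_scores = [4.0, 0.5, -1.0, -1.0, 0.2, 0.1, -1.0, 1.0, 0.3]
end_scores = [1.0, 3.5, -1.0, -1.0, -0.5, 0.4, -1.0, 0.2, 1.5]

start, end, score = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]), score)
```

The `start <= end` constraint matters: without it, the highest start score and the highest end score could be paired even when the end comes before the start.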


Connect the Retriever's output to the Reader's input by using a pre-defined QA pipeline.

In [None]:
from haystack.pipelines import ExtractiveQAPipeline
pipeline = ExtractiveQAPipeline(reader, retriever)

Run the pipeline to ask a question. The query text is passed as the `query` argument of `run()`. Here the Retriever returns 10 documents and the Reader produces 5 answers. The choice of the Retriever's `top_k` is a trade-off between speed and accuracy, especially when there is a Reader in the pipeline: a higher value passes more documents to the Reader, reducing the chance that the answer-containing passage is missed, but it also increases the Reader's workload. `top_k=10` gives decent overall performance.

In [None]:
from haystack.utils import print_answers

prediction = pipeline.run(
    query="What causes the Foodborne illness?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

print_answers(prediction, details="medium")

Inferencing Samples: 100%|██████████| 1/1 [00:03<00:00,  3.21s/ Batches]

'Query: What causes the Foodborne illness?'
'Answers:'
[   {   'answer': 'Campylobacter jejuni',
        'context': 'ble for global distribution without cold chain '
                   'infrastructure. . Campylobacter jejuni is a leading cause '
                   'of foodborne illness globally. In this study,',
        'score': 0.8362059593200684},
    {   'answer': 'Salmonella enterica serovar Typhimurium',
        'context': 'e with Salmonella Typhimurium. . The bacterial pathogen '
                   'Salmonella enterica serovar Typhimurium is one of the most '
                   'common causes of foodborne disease ',
        'score': 0.614564836025238},
    {   'answer': 'Enterotoxigenic Escherichia coli',
        'context': 'ence of Enterotoxigenic Escherichia coli Podophage LL11. . '
                   'Enterotoxigenic Escherichia coli (ETEC) is an '
                   'opportunistic pathogen that commonly causes f',
        'score': 0.5467362403869629},
    {   'answer': 'Campy


