This Master's project is part of the module Web-Interface for Language Processing Systems in the M.Sc. Intelligent Adaptive Systems. We build a biomedical question answering system composed of the following components:
- A Text Extractor that extracts passages from the XML files of research articles. The XML data dump is taken from the PubMed Open Access Non-Commercial Data Dump.
- A Query Formulation module that uses a biomedical Named Entity Recognition model to extract keywords from the user's natural-language questions.
- A Passage Retrieval module that retrieves relevant passages from the PubMed text in a two-stage process:
  - Retrieve the top 100 passages using BM25 with fuzzy search.
  - Re-rank the passages by the cosine similarity between the question embedding and the passage embeddings, created with a MiniLM model [1]. The top 50 passages are selected (a minimal sketch of this stage follows the list).
- A Question Answering model that uses PubMedBERT fine-tuned on the SQuAD v2 dataset [2] to extract answers from the passages.
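The re-ranking stage lives in the repository's retrieval code; the following is only a minimal sketch of the idea, assuming the `sentence-transformers/all-MiniLM-L6-v2` checkpoint and an already-retrieved list of BM25 candidate passages. The checkpoint and function names here are assumptions and may differ from the project's actual code.

```python
# Minimal sketch of the second retrieval stage: re-rank BM25 candidates by the
# cosine similarity between the question embedding and each passage embedding.
# The checkpoint below is an assumption; the system's MiniLM variant may differ.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def rerank(question: str, candidates: list[str], top_k: int = 50) -> list[str]:
    """Return the top_k candidate passages, best-scoring first."""
    q_emb = model.encode(question, convert_to_tensor=True)
    p_emb = model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, p_emb)[0]            # one score per candidate
    order = scores.argsort(descending=True)[:top_k]   # indices of best passages
    return [candidates[int(i)] for i in order]
```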
Prerequisite: Docker installed
- Run the following command:

  ```bash
  docker-compose up --build
  ```

  The system can be accessed at http://0.0.0.0:8000. A screenshot of the landing page is shown below.
- Use `Ctrl+C` to shut down, then run `docker-compose down`.
Prerequisite: Download the PubMed XML data dumps from the link above.

`loader_script.py` needs to be run from outside the Docker container, since the data is not copied into the Docker build context.
- Create a Python environment and install all dependencies:

  ```bash
  python3 -m venv .env
  source .env/bin/activate
  pip3 install -r requirements.txt
  ```
- Load the data by running `loader_script.py`. The script can also be modified to load the data into an HNSWLib index (see the sketch below).

  ```bash
  python3 app/loader_script.py
  ```
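If you adapt the loader for HNSWLib, the core index-building calls look roughly like the sketch below. The file name, embedding dimension, and parameters are illustrative assumptions, not the values used in `loader_script.py`.

```python
# Illustrative sketch of building an HNSWLib index over passage embeddings.
# File names and parameters are assumptions, not those of loader_script.py.
import hnswlib
import numpy as np

embeddings = np.load("passage_embeddings.npy")   # hypothetical array of MiniLM vectors
dim = embeddings.shape[1]                        # 384 for MiniLM embeddings

index = hnswlib.Index(space="cosine", dim=dim)   # cosine distance matches the re-ranking stage
index.init_index(max_elements=len(embeddings), ef_construction=200, M=16)
index.add_items(embeddings, np.arange(len(embeddings)))
index.set_ef(50)                                 # query-time recall/speed trade-off
index.save_index("passages_hnsw.bin")
```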
The user can now type a medical question in natural language into the search input section. The system displays up to five passages, with the answers to the question highlighted in a separate section.
The quality of the question answering system can also be evaluated by pressing the Evaluate button for each query individually. The user can mark each answer as Relevant or Not-relevant. The responses are stored in the system as JSON files.
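The exact schema of the stored feedback files is defined by the system; purely as an illustration, a single feedback record could be written like this (all field names are hypothetical):

```python
# Purely illustrative: the actual schema and file layout of the stored
# feedback files are defined by the system, not by this sketch.
import json

feedback = {
    "question": "What receptor does SARS-CoV-2 use for cell entry?",
    "answer": "ACE2",
    "label": "Relevant",   # or "Not-relevant"
}

with open("feedback.json", "w") as f:
    json.dump(feedback, f, indent=2)
```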
[1] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Nov. 2019.

[2] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-specific language model pretraining for biomedical natural language processing,” ACM Transactions on Computing for Healthcare (HEALTH), vol. 3, no. 1, pp. 1–23, 2021.