FindZebra/fz-openqa

Warning: We are currently working on extending this code to other NLP tasks at VodLM.

Note: An implementation of the sampling methods and the VOD objective is available at VodLM/vod.

Variational Open-Domain Question Answering


Abstract

Retrieval-augmented models have proven to be effective in natural language processing tasks, yet there remains a lack of research on their optimization using variational inference. We introduce the Variational Open-Domain (VOD) framework for end-to-end training and evaluation of retrieval-augmented models, focusing on open-domain question answering and language modelling. The VOD objective, a self-normalized estimate of the Rényi variational bound, is a lower bound to the task marginal likelihood and is evaluated under samples drawn from an auxiliary sampling distribution (cached retriever and/or approximate posterior). It remains tractable, even for retriever distributions defined on large corpora. We demonstrate VOD's versatility by training reader-retriever BERT-sized models on multiple-choice medical exam questions. On the MedMCQA dataset, we outperform the domain-tuned Med-PaLM by +5.3% despite using 2,500× fewer parameters. Our retrieval-augmented BioLinkBERT model scored 62.9% on the MedMCQA and 55.0% on the MedQA-USMLE. Last, we show the effectiveness of our learned retriever component in the context of medical semantic search.
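
For intuition, the sketch below shows a generic self-normalized importance-sampling estimate of a Rényi variational bound, in the spirit of the VOD objective described above. It is a simplified illustration, not the exact objective used in this repository: all names (renyi_bound_snis, log_p_joint, log_q, log_r) are hypothetical, and the precise formulation is given in the paper and implemented in VodLM/vod.

import torch

def renyi_bound_snis(log_p_joint, log_q, log_r, alpha=0.5):
    """Self-normalized importance-sampling estimate of a Rényi variational bound.

    Simplified, hypothetical sketch (not the repository's API):
      log_p_joint: (K,) log p(answer, d_k | question), the reader-retriever joint
      log_q:       (K,) log q(d_k | question), the latent/retriever distribution
      log_r:       (K,) log r(d_k | question), the auxiliary proposal the K documents
                   were sampled from (e.g. a cached retriever)
      alpha:       Rényi order in [0, 1); alpha -> 1 approaches an ELBO-like bound.
    """
    # Self-normalized importance weights, correcting for sampling from r instead of q.
    log_w = log_q - log_r
    log_w = log_w - torch.logsumexp(log_w, dim=0)
    # (p / q)^(1 - alpha) inside the expectation, kept in log-space for stability.
    log_ratio = (1.0 - alpha) * (log_p_joint - log_q)
    # L_alpha ≈ 1/(1 - alpha) * log sum_k w_k * (p_k / q_k)^(1 - alpha)
    return torch.logsumexp(log_w + log_ratio, dim=0) / (1.0 - alpha)

# Toy usage with K = 8 sampled documents and arbitrary log-scores.
K = 8
bound = renyi_bound_snis(
    log_p_joint=torch.randn(K),
    log_q=torch.log_softmax(torch.randn(K), dim=0),
    log_r=torch.log_softmax(torch.randn(K), dim=0),
    alpha=0.5,
)
print(bound)

Because the weights are normalized over the K drawn documents only, the estimate stays tractable even when the retriever distribution is defined over a large corpus.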

Setup

  1. Set up the Python environment
curl -sSL https://install.python-poetry.org | python -
poetry install
  2. Set up ElasticSearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.14.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.14.1-linux-x86_64.tar.gz

To run ElasticSearch, navigate to the elasticsearch-7.14.1 folder in a terminal and run ./bin/elasticsearch.
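
Optionally, you can verify that the node is reachable before launching any experiment. The snippet below is a minimal check using only the Python standard library; it assumes ElasticSearch is running locally on the default port 9200 with security disabled (the default for a fresh 7.14.1 archive install).

# Minimal sanity check that a local ElasticSearch node is reachable.
import json
from urllib.request import urlopen

with urlopen("http://localhost:9200") as response:
    info = json.load(response)

print(info["version"]["number"])  # expected to print "7.14.1"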

  3. Run the main script
poetry run python run.py
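
The entry point is configured with Hydra, so hyper-parameters can be overridden from the command line in the usual key=value style. The sketch below only illustrates the general shape of such a Hydra entry point; the config path, config name, and keys are hypothetical and do not necessarily match this repository's configuration tree or the actual run.py.

# Illustrative Hydra entry point (hypothetical config path/name).
# Hydra composes the YAML config tree and any command-line overrides
# into a single DictConfig object and passes it to main().
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="configs", config_name="config")
def main(cfg: DictConfig) -> None:
    # Print the fully composed configuration before handing it to the trainer.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()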

Citation

@misc{https://doi.org/10.48550/arxiv.2210.06345,
  doi = {10.48550/ARXIV.2210.06345},
  url = {https://arxiv.org/abs/2210.06345},
  author = {Liévin, Valentin and Motzfeldt, Andreas Geert and Jensen, Ida Riis and Winther, Ole},
  keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, I.2.7; H.3.3; I.2.1},
  title = {Variational Open-Domain Question Answering},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Credits

The package relies on:

  • Lightning to simplify training management, including distributed computing, logging, checkpointing, early stopping, half-precision training, ...
  • Hydra for clean experiment management (setting hyper-parameters, ...)
  • Weights and Biases for logging and experiment tracking
  • Poetry for stricter dependency management and easy packaging
  • The original template was copied from ashleve