# Neural search for question answering

The exercise introduces the problem of passage retrieval, an important step in factual question answering. This part concentrates on methods for retrieving documents whose content may be useful for answering a question. We compare lexical text representations (e.g. Elasticsearch's default behaviour) with dense text representations (e.g. the multilingual E5 neural model).

## Tasks

### Objectives (8 points)

  1. Read the documentation of the document store and the retriever in the Haystack framework.
  2. Install the Haystack framework (e.g. with `pip install 'farm-haystack[all]'`).
  3. Configure a document store based on Faiss, backed by the multilingual E5 model (a minimal sketch follows this list):
    1. For Faiss use the multilingual E5 or the Silver Retriever base encoder.
    2. Warning: if you use E5, make sure to properly configure the store.
    3. If you have problems using Faiss, you can use InMemoryDocumentStore instead, but this requires re-indexing all documents each time the script is run, which is time-consuming.
  4. Load the documents (passages) from the FiQA corpus.
  5. Use the set of questions and the relevance scores defined in this corpus to compute NDCG@5 for the dense retriever.
  6. Compare the NDCG score from this exercise with the score from lab 2 and from lab 6.
  7. Bonus (+2p): Combine dense retrieval with the classification model from lab 6 to implement two-step retrieval. Compute NDCG@5 for this combined model.
  8. Bonus (+2p): Use a different dense encoder, e.g. E5 large or Polish RoBERTa base, and compute NDCG@5.
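
The sketch below shows one way of putting objectives 3-5 together with Haystack 1.x (farm-haystack). It is a minimal sketch, not the reference solution: `intfloat/multilingual-e5-base` is the public checkpoint of the multilingual E5 base model, and the `load_fiqa_*` helpers are hypothetical placeholders standing in for however you load the FiQA passages, questions, and relevance scores.

```python
import math

from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever

# Document store: E5-base produces 768-dimensional vectors and is trained
# with cosine similarity, so configure the store accordingly.
document_store = FAISSDocumentStore(embedding_dim=768, similarity="cosine")

# Retriever backed by the multilingual E5 model.  Note that E5 expects the
# "query: " / "passage: " prefixes; depending on the Haystack version you may
# have to add them to the texts yourself.
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="intfloat/multilingual-e5-base",
    model_format="sentence_transformers",
)

# Index the FiQA passages and compute their embeddings.
passages = load_fiqa_passages()  # hypothetical helper: [(passage_id, text), ...]
document_store.write_documents(
    [{"content": text, "meta": {"passage_id": pid}} for pid, text in passages]
)
document_store.update_embeddings(retriever)

# NDCG@5 with the linear-gain formulation.
def dcg(relevances):
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_5(retrieved_ids, qrels_for_question):
    gains = [qrels_for_question.get(pid, 0) for pid in retrieved_ids[:5]]
    ideal = sorted(qrels_for_question.values(), reverse=True)[:5]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

qrels = load_fiqa_qrels()          # hypothetical helper: {question_id: {passage_id: relevance}}
questions = load_fiqa_questions()  # hypothetical helper: [(question_id, text), ...]

scores = []
for qid, question in questions:
    results = retriever.retrieve(query=question, top_k=5)
    retrieved_ids = [doc.meta["passage_id"] for doc in results]
    scores.append(ndcg_at_5(retrieved_ids, qrels.get(qid, {})))

print(f"NDCG@5: {sum(scores) / len(scores):.4f}")
```

For the two-step bonus, the same loop can first retrieve a larger candidate set (e.g. `top_k=100`) and then re-rank the candidates with the lab 6 classification model before computing NDCG@5.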

### Questions (2 points)

  1. Which of the methods works better: lexical match (e.g. Elasticsearch) or dense representations?
  2. Which of the methods is faster?
  3. Try to determine other pros and cons of using lexical search and dense document retrieval models.

## Hints

  1. Haystack is a framework for building question answering applications.
  2. Lexical document retrieval is based on traditional NLP pipelines (e.g. lemmatization), i.e. on bag-of-words models. Elasticsearch's typical usage is based on a lexical search model.
  3. Dense document retrieval is based on dense vector models produced by neural networks. These dense vectors might be generated directly, e.g. by averaging the word embeddings of the tokens in a given text fragment, yet the performance of such models is inferior to sparse models.
  4. More sophisticated models are trained directly on the document retrieval task: e.g. DPR uses a bi-encoder architecture with separate neural networks for encoding the question and the passage, while the E5 model has a shared encoder. These networks are trained to maximise the dot product of the vectors produced for matching question-passage pairs (see the sketch after this list).
  5. Using dense vector representations requires computing the dense vectors for all passages in the dataset. These vectors can be stored in document stores such as FAISS for faster retrieval, especially when the dataset is very large (i.e. does not fit into memory).
  6. The Polish retrieval benchmark lists and compares models that implement dense retrieval for Polish.
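
As a small illustration of hint 4, the sketch below scores a question against two passages with the shared E5 encoder, using sentence-transformers directly (outside Haystack). The model name is the public multilingual E5 base checkpoint; the example texts are made up.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

# E5 is trained with instruction-style prefixes: "query: " for questions
# and "passage: " for documents.
query = "query: What is the expense ratio of an index fund?"
passages = [
    "passage: The expense ratio is the annual fee a fund charges its shareholders.",
    "passage: A bond is a fixed-income instrument representing a loan to a borrower.",
]

# Encode both sides with the shared encoder; normalizing the embeddings makes
# the dot product equal to cosine similarity.
query_emb = model.encode(query, normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

# The relevance score is the dot product between the query and passage vectors.
scores = util.dot_score(query_emb, passage_embs)
print(scores)  # the first passage should score higher than the second
```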