DensePassageRetriever: Add Training, Refactor Inference to FARM modules #527

Merged: 20 commits from dense_dpr_training into master on Oct 30, 2020

Conversation

@kolk (Contributor) commented Oct 28, 2020

  • DPR training from haystack
  • DPR inference modules using FARM instead of transformers
  • Modify test_dpr_retriever with correct test cases
  • Update tutorials
  • Update docs
  • Use FARM==0.5.0
After training, model files are saved as follows (see the loading sketch after this list):
  1. query_encoder files are saved in ../saved_models/dpr-tutorial/lm1
  2. passage_encoder files are saved in ../saved_models/dpr-tutorial/lm2
  3. prediction_head, tokenizer, vocab details are saved in ../saved_models/
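
Given that layout, reloading a trained retriever should look roughly like this (a sketch; the load keyword names are assumptions based on the save paths above, not confirmed in this PR):

from haystack.retriever.dense import DensePassageRetriever

# Reload the trained encoders from the tutorial save directory
# (lm1/ holds the query encoder, lm2/ the passage encoder,
# plus prediction_head, tokenizer and vocab files)
reloaded_retriever = DensePassageRetriever.load(
                 load_dir="../saved_models/dpr-tutorial",
                 document_store=doc_store)  # document store as in the initialization example below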

Initializing DPR:

doc_store = ElasticsearchDocumentStore()
retriever = DensePassageRetriever(
                 document_store=doc_store,
                 query_embedding_model="bert-base-uncased",
                 passage_embedding_model="bert-base-uncased",
                 use_gpu=True,
                 batch_size=16,
                 embed_title=True)
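
Once the document store holds embedded passages, retrieval is a one-liner (a minimal usage sketch; the query string is illustrative):

# Fetch the top 5 passages for a query
results = retriever.retrieve(query="What is a dense passage retriever?", top_k=5)
for doc in results:
    print(doc.text[:100])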

Note: the remove_sep_tok_from_untitled_passages argument has been removed, as TextSimilarityProcessor in FARM encodes untitled passages as [CLS] [SEP] [SEP] <passage_tok_1> <passage_tok_2> ... <passage_tok_n> [SEP] without significant changes to accuracy.
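
For comparison, a titled passage would presumably fill the title slot ahead of the first [SEP] (an extrapolation from the untitled case above, not spelled out in this PR):

[CLS] <title_tok_1> ... <title_tok_m> [SEP] [SEP] <passage_tok_1> <passage_tok_2> ... <passage_tok_n> [SEP]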

Sample script to train DPR:

retriever.train(data_dir="data/retriever",
                train_filename="nq-dev.json",
                dev_filename="nq-dev.json",
                test_filename="nq-dev.json",
                batch_size=2,
                embed_title=True,
                num_hard_negatives=1,
                num_negatives=0,
                n_epochs=3,
                evaluate_every=1000,
                n_gpu=1,
                learning_rate=1e-5,
                epsilon=1e-08,
                weight_decay=0.0,
                num_warmup_steps=100,
                grad_acc_steps=1,
                optimizer_name="TransformersAdamW",
                optimizer_correct_bias=True,
                save_dir="../saved_models/dpr-tutorial")
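
After training, the index still holds vectors produced by the old passage encoder, so they need to be regenerated (a minimal sketch; update_embeddings is the standard document store call, shown under the assumption that the freshly trained retriever is reused in place):

# Re-embed all indexed passages with the newly trained passage encoder
doc_store.update_embeddings(retriever)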

@lalitpagaria (Contributor) commented:

Just a suggestion to also include the corresponding updates to the DPR tutorials and documentation in the task list.

@tholor (Member) left a comment


Awesome. Good job! Let's merge it

@tholor changed the title from "[WIP] DensePassageRetriever refactoring with FARM modules" to "DensePassageRetriever: Add Training, Refactor Inference to FARM modules" on Oct 30, 2020
@tholor merged commit 72b637a into master on Oct 30, 2020
@tholor mentioned this pull request on Nov 2, 2020
@mchari commented Nov 16, 2020

@tholor, since the Retriever is initialized with a query_embedding_model and a passage_embedding_model, may I assume that retriever.train() will incrementally train on the set of documents given as input?
So:

  • The Retriever is initialized with query_embedding_model and passage_embedding_model <- trained on a huge corpus of documents
  • It then incrementally updates the weights based on a new (much smaller) set of documents.

Is there a sample data file for train_filename (that I can use as a template) as input to train()? Is it the same as the "Retriever input data format" in https://github.com/facebookresearch/DPR/? (A sketch of that format follows.)
Thanks!
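
For reference, the retriever input format in the facebookresearch/DPR repo is a JSON list of question records (a sketch with placeholder values; the field names follow the DPR README, everything else is illustrative):

[
    {
        "question": "....",
        "answers": ["...."],
        "positive_ctxs": [{"title": "....", "text": "...."}],
        "negative_ctxs": [{"title": "....", "text": "...."}],
        "hard_negative_ctxs": [{"title": "....", "text": "...."}]
    }
]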

@julian-risch julian-risch deleted the dense_dpr_training branch November 15, 2021 07:08