DensePassageRetriever: Add Training, Refactor Inference to FARM modules #527

Merged: 20 commits from dense_dpr_training into master on Oct 30, 2020

Conversation

@kolk (Contributor) commented Oct 28, 2020

  • DPR training from haystack
  • DPR inference modules using FARM instead of transformers
  • Modify test_dpr_retriever with correct test cases
  • Update tutorials
  • Update docs
  • Use FARM==0.5.0
After training, model files are saved as follows (see the loading sketch after this list):
  1. query_encoder files are saved in ../saved_models/dpr-tutorial/lm1
  2. passage_encoder files are saved in ../saved_models/dpr-tutorial/lm2
  3. prediction_head, tokenizer, vocab details are saved in ../saved_models/
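
Given that layout, reloading a trained retriever should look roughly like this (a sketch; the load keyword names are assumptions based on the save paths above, not confirmed in this PR):

from haystack.retriever.dense import DensePassageRetriever

# Reload the trained encoders from the tutorial save directory
# (lm1/ holds the query encoder, lm2/ the passage encoder,
# plus prediction_head, tokenizer and vocab files)
reloaded_retriever = DensePassageRetriever.load(
                 load_dir="../saved_models/dpr-tutorial",
                 document_store=doc_store)  # document store as in the initialization example below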

Initializing DPR:

doc_store = ElasticsearchDocumentStore()
retriever = DensePassageRetriever(
                 document_store=doc_store,
                 query_embedding_model="bert-base-uncased",
                 passage_embedding_model="bert-base-uncased",
                 use_gpu=True,
                 batch_size=16,
                 embed_title=True)
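
Once the document store holds embedded passages, retrieval is a one-liner (a minimal usage sketch; the query string is illustrative):

# Fetch the top 5 passages for a query
results = retriever.retrieve(query="What is a dense passage retriever?", top_k=5)
for doc in results:
    print(doc.text[:100])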

Note: the remove_sep_tok_from_untitled_passages argument has been removed, as TextSimilarityProcessor in FARM encodes untitled passages as [CLS] [SEP] [SEP] <passage_tok_1> <passage_tok_2> ... <passage_tok_n> [SEP] without significant changes to accuracy.
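
For comparison, a titled passage would presumably fill the title slot ahead of the first [SEP] (an extrapolation from the untitled case above, not spelled out in this PR):

[CLS] <title_tok_1> ... <title_tok_m> [SEP] [SEP] <passage_tok_1> <passage_tok_2> ... <passage_tok_n> [SEP]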

Sample script to train DPR:

retriever.train(data_dir="data/retriever",
                train_filename="nq-dev.json",
                dev_filename="nq-dev.json",
                test_filename="nq-dev.json",
                batch_size=2,
                embed_title=True,
                num_hard_negatives=1,
                num_negatives=0,
                n_epochs=3,
                evaluate_every=1000,
                n_gpu=1,
                learning_rate=1e-5,
                epsilon=1e-08,
                weight_decay=0.0,
                num_warmup_steps=100,
                grad_acc_steps=1,
                optimizer_name="TransformersAdamW",
                optimizer_correct_bias=True,
                save_dir="../saved_models/dpr-tutorial")
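
After training, the index still holds vectors produced by the old passage encoder, so they need to be regenerated (a minimal sketch; update_embeddings is the standard document store call, shown under the assumption that the freshly trained retriever is reused in place):

# Re-embed all indexed passages with the newly trained passage encoder
doc_store.update_embeddings(retriever)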

@lalitpagaria (Contributor) commented:

Just a suggestion to also include the corresponding updates to the DPR tutorials and documentation in the task list.

@tholor (Member) left a comment


Awesome. Good job! Let's merge it

@tholor changed the title from "[WIP] DensePassageRetriever refactoring with FARM modules" to "DensePassageRetriever: Add Training, Refactor Inference to FARM modules" on Oct 30, 2020
@tholor merged commit 72b637a into master on Oct 30, 2020
@tholor mentioned this pull request on Nov 2, 2020
@mchari commented Nov 16, 2020

@tholor, since the Retriever is initialized with a query_embedding_model and a passage_embedding_model, may I assume that retriever.train() will incrementally train on the set of documents given as input?
So:

  • The Retriever is initialized with query_embedding_model and passage_embedding_model <- trained on a huge corpus of documents
  • It then incrementally updates the weights based on a new (much smaller) set of documents.

Is there a sample data file for train_filename (that I can use as a template) as input to train()? Is it the same as the "Retriever input data format" in https://github.com/facebookresearch/DPR/? (A sketch of that format follows.)
Thanks!
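
For reference, the retriever input format in the facebookresearch/DPR repo is a JSON list of question records (a sketch with placeholder values; the field names follow the DPR README, everything else is illustrative):

[
    {
        "question": "....",
        "answers": ["...."],
        "positive_ctxs": [{"title": "....", "text": "...."}],
        "negative_ctxs": [{"title": "....", "text": "...."}],
        "hard_negative_ctxs": [{"title": "....", "text": "...."}]
    }
]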

@julian-risch julian-risch deleted the dense_dpr_training branch November 15, 2021 07:08