Skip to content
This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

FiD gold retrieved docs agent n-docs debug #4146

Merged
merged 5 commits into from Nov 8, 2021
Merged

Conversation

mojtaba-komeili
Copy link
Contributor

Patch description
Working with WizIntGoldDocRetrieverFiDAgent, I noticed it OOMing more often that it should. Looking into it, I realized that the agent was including all the retrieved docs. WoI doc chunk mutators split a document into 100 chunks or more and ObservationEchoRetriever returns these all docs---ignoring the n_docs parameter.

This patch fixed this issue by picking n_docs number of docs from the list of docs and then adding some filler docs.

NOTE: this problem could be fixed in the upstream by setting --woi-doc-max-chunks from woi_dropout_retrieved_docs as well. But this makes sure that it is safe to run it with other mutators, teachers as well.

Testing it
Ran 100 steps of training with a debug level message that shows the number of docs before and after trimming.

Screen Shot 2021-11-05 at 6 09 34 PM

Copy link
Contributor

@klshuster klshuster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good for the WizIntGoldDocRetriever, but this should probably be respected at the top level as well --> i.e., in _set_query_vec, we should cut the retrieved documents to be no more than opt['n_docs']

continue

retrieved_docs.append(self._extract_doc_from_message(message, doc_idx))
if len(retrieved_docs) == self._n_docs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not checked before this for-loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we gain much by adding it before the for loop. It also makes the code a bit more succinct by having the loop that breaks right after start. But I can move it to the top of the loop to avoid extra check on already_added_doc_idx.

@@ -352,6 +353,14 @@ def get_retrieved_knowledge(self, message):

def _set_query_vec(self, observation: Message) -> Message:
retrieved_docs = self.get_retrieved_knowledge(observation)
if len(retrieved_docs) > self._n_docs:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klshuster picking the right documents, if there are more than n_docs, depends a lot on the data set. I add this warning here and trim it down to the first n_docs. I hope this properly addresses your comment. LMK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me. given how interactive this retriever is (i.e., interactive with the user --> the user has total control over what documents are being passed), it makes sense that the onus is on them to either 1) provide the right number of documents, or 2) set --n-docs correctly

@@ -352,6 +353,14 @@ def get_retrieved_knowledge(self, message):

def _set_query_vec(self, observation: Message) -> Message:
retrieved_docs = self.get_retrieved_knowledge(observation)
if len(retrieved_docs) > self._n_docs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me. given how interactive this retriever is (i.e., interactive with the user --> the user has total control over what documents are being passed), it makes sense that the onus is on them to either 1) provide the right number of documents, or 2) set --n-docs correctly

f'Your `get_retrieved_knowledge` method returned {len(retrieved_docs)} Documents, '
f'instead of the expected {self._n_docs}. '
f'This agent will only use the first {self._n_docs} Documents. '
'Consider modifying your implementation of `get_retrieved_knowledge` to avoid unexpected results.'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also say 'or set the --n-docs parameter accordingly' to inform the user that they can do that too?

@mojtaba-komeili mojtaba-komeili merged commit 825a057 into main Nov 8, 2021
@mojtaba-komeili mojtaba-komeili deleted the fid-golddoc-ndoc branch November 8, 2021 21:18
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants