FiD gold retrieved docs agent n-docs debug #4146

mojtaba-komeili · 2021-11-05T22:30:21Z

Patch description
Working with WizIntGoldDocRetrieverFiDAgent, I noticed it OOMing more often than it should. Looking into it, I realized that the agent was including all the retrieved docs. The WoI doc chunk mutators split a document into 100 chunks or more, and ObservationEchoRetriever returns all of these docs, ignoring the n_docs parameter.

This patch fixes the issue by picking n_docs docs from the list of docs and then adding some filler docs.

NOTE: this problem could also be fixed upstream by setting --woi-doc-max-chunks from woi_dropout_retrieved_docs. But this patch makes sure that it is safe to run with other mutators and teachers as well.
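A minimal sketch of the trim-then-pad idea described above, not the actual ParlAI code. The function name `trim_and_pad_docs`, the dict-based document shape, and the empty `FILLER_DOC` are all hypothetical. Picking the docs by random sampling is an assumption (suggested only by the "random shuffling of docs" commit title), as is padding only when fewer than n_docs docs are available.

```python
import random
from typing import Dict, List, Optional

# Hypothetical filler document; the real agent's filler format is not shown in this PR.
FILLER_DOC: Dict[str, str] = {"title": "", "text": ""}


def trim_and_pad_docs(
    docs: List[Dict[str, str]],
    n_docs: int,
    rng: Optional[random.Random] = None,
) -> List[Dict[str, str]]:
    """Return exactly n_docs documents: sample down if too many, pad if too few."""
    rng = rng or random.Random()
    docs = list(docs)  # copy so the caller's list is left untouched
    if len(docs) > n_docs:
        # Assumption: pick a random subset rather than the first n_docs.
        docs = rng.sample(docs, n_docs)
    # Pad with filler docs so the result always has exactly n_docs entries.
    docs.extend(dict(FILLER_DOC) for _ in range(n_docs - len(docs)))
    return docs
```

With a document split into 100 chunks and n_docs set to 5, only 5 chunks would be kept, which bounds the memory used per example regardless of what the mutator produced.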

Testing it
Ran 100 steps of training with a debug-level message that shows the number of docs before and after trimming.

klshuster

This looks good for the WizIntGoldDocRetriever, but this should probably be respected at the top level as well --> i.e., in _set_query_vec, we should cut the retrieved documents to be no more than opt['n_docs']

klshuster · 2021-11-08T15:59:32Z

parlai/agents/fid/fid.py

+                continue
+
+            retrieved_docs.append(self._extract_doc_from_message(message, doc_idx))
+            if len(retrieved_docs) == self._n_docs:


why is this not checked before this for-loop?

I don't think we gain much by adding it before the for loop. The code is also a bit more succinct if the loop simply breaks right after it starts. But I can move the check to the top of the loop to avoid the extra check on already_added_doc_idx.
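For illustration, here is a sketch of the variant discussed above, with the limit check at the top of the loop body. It is not the code from the diff: `collect_docs`, `doc_indices`, `already_added`, and `extract` are hypothetical names standing in for the agent's own loop, `already_added_doc_idx`, and `_extract_doc_from_message`.

```python
from typing import Callable, Iterable, List, Set, TypeVar

Doc = TypeVar("Doc")


def collect_docs(
    doc_indices: Iterable[int],
    already_added: Set[int],
    extract: Callable[[int], Doc],
    n_docs: int,
) -> List[Doc]:
    """Collect at most n_docs documents, skipping indices that were already added."""
    retrieved: List[Doc] = []
    for doc_idx in doc_indices:
        # Checking the limit first avoids the membership test once the list is full.
        if len(retrieved) >= n_docs:
            break
        if doc_idx in already_added:
            continue
        retrieved.append(extract(doc_idx))
    return retrieved
```

One side effect of this placement is that the loop also stops immediately when n_docs is 0, whereas an equality check placed after the append would never trigger in that case.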

mojtaba-komeili · 2021-11-08T18:23:56Z

parlai/agents/fid/fid.py

@@ -352,6 +353,14 @@ def get_retrieved_knowledge(self, message):

    def _set_query_vec(self, observation: Message) -> Message:
        retrieved_docs = self.get_retrieved_knowledge(observation)
+        if len(retrieved_docs) > self._n_docs:


@klshuster picking the right documents, if there are more than n_docs, depends a lot on the data set. I added a warning here and trim the list down to the first n_docs. I hope this properly addresses your comment. LMK.

klshuster · 2021-11-08T20:15:12Z

sounds good to me. given how interactive this retriever is (i.e., interactive with the user --> the user has total control over what documents are being passed), it makes sense that the onus is on them to either 1) provide the right number of documents, or 2) set --n-docs correctly

klshuster · 2021-11-08T20:17:16Z

parlai/agents/fid/fid.py

+                f'Your `get_retrieved_knowledge` method returned {len(retrieved_docs)} Documents, '
+                f'instead of the expected {self._n_docs}. '
+                f'This agent will only use the first {self._n_docs} Documents. '
+                'Consider modifying your implementation of `get_retrieved_knowledge` to avoid unexpected results.'


can we also say 'or set the --n-docs parameter accordingly' to inform the user that they can do that too?
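A rough sketch of the top-level guard discussed in this thread: warn, then keep only the first n_docs documents. The helper name `enforce_n_docs` is hypothetical, and the standard-library `logging` module is used here as a stand-in for whatever logger the agent actually uses. The last sentence of the message includes the suggested mention of --n-docs, whose final wording in the merged patch is not shown here.

```python
import logging
from typing import List, TypeVar

Doc = TypeVar("Doc")
logger = logging.getLogger(__name__)


def enforce_n_docs(retrieved_docs: List[Doc], n_docs: int) -> List[Doc]:
    """Warn and truncate when more than n_docs documents were retrieved."""
    if len(retrieved_docs) > n_docs:
        logger.warning(
            f'Your `get_retrieved_knowledge` method returned {len(retrieved_docs)} Documents, '
            f'instead of the expected {n_docs}. '
            f'This agent will only use the first {n_docs} Documents. '
            'Consider modifying your implementation of `get_retrieved_knowledge`, '
            'or set the --n-docs parameter accordingly, to avoid unexpected results.'
        )
        # Keep the first n_docs; choosing better ones is left to the caller.
        retrieved_docs = retrieved_docs[:n_docs]
    return retrieved_docs
```

Truncating to the first n_docs keeps the guard independent of the data set, in line with the point above that picking the right documents is the caller's responsibility.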

mojtaba-komeili requested review from klshuster and jaseweston November 5, 2021 22:30

facebook-github-bot added the CLA Signed label Nov 5, 2021

mojtaba-komeili added 3 commits November 5, 2021 18:39

trimming docs down to n_docs

bcd24fe

added debug message

07c1b26

random shuffling of docs

b1f9102

mojtaba-komeili force-pushed the fid-golddoc-ndoc branch from 0cd22ec to b1f9102 Compare November 6, 2021 01:52

klshuster reviewed Nov 8, 2021

pr comments

3f67e82

mojtaba-komeili commented Nov 8, 2021

klshuster approved these changes Nov 8, 2021

pr comments 2

6a1b558

mojtaba-komeili merged commit 825a057 into main Nov 8, 2021

mojtaba-komeili deleted the fid-golddoc-ndoc branch November 8, 2021 21:18
