Encoding of QA IDs #171
Conversation
As discussed and agreed, we need the squad IDs, e.g. to match predictions with the gold labels when we create them.
What if we keep the squad-id in the SampleBasket as a string and a normal FARM ID that we can pass around in torch?
I just checked the code, this is how we already have it: squad_id = basket.raw["squad_id"]
But baskets and predictions are just matched by index. This matching should happen via some ID that is consistent across predictions and baskets.
So why do we need the full squad_id encoding?
I was hoping we could find a simpler solution...
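To illustrate the point about matching by ID instead of index, here is a minimal sketch. The names (`match_preds_to_baskets`, the `"id"` keys, the dict-shaped baskets) are illustrative assumptions, not FARM's actual API:

```python
# Hypothetical sketch: join predictions to baskets on a shared id
# instead of relying on list position, so the pairing stays correct
# even if one side is filtered or reordered.
# All names here are illustrative, not FARM's real data structures.

def match_preds_to_baskets(preds, baskets):
    """Return {id: (basket, pred)} pairs joined on a shared id."""
    baskets_by_id = {b["id"]: b for b in baskets}
    matched = {}
    for pred in preds:
        basket = baskets_by_id.get(pred["id"])
        if basket is None:
            raise ValueError(f"No basket found for prediction id {pred['id']}")
        matched[pred["id"]] = (basket, pred)
    return matched

baskets = [{"id": "squad-42", "raw": {"squad_id": "squad-42"}}]
preds = [{"id": "squad-42", "answer": "Paris"}]
matched = match_preds_to_baskets(preds, baskets)
```

With a stable ID on both sides, the matching no longer breaks when baskets are dropped or shuffled during preprocessing.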
Hey, I think the ID solution is simple and useful now. There seems to be quite some code unrelated to the PR, or even code that you did not write (I'm unsure about that). Please check my comments inside the changes.
@@ -1166,6 +1164,8 @@ def reduce_preds(self, preds, n_best=5):
    returns the n_best predictions on the document level. """

    # Initialize some variables
    # no_ans_threshold is how much greater the no_answer logit needs to be over the pos_answer in order to be chosen
    no_ans_threshold = 0
    document_no_answer = True
Why do we have both a no-answer threshold and a flag "document_no_answer"?
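For context, the comment in the diff describes a thresholded comparison between the no-answer logit and the best positive answer. A minimal sketch of that logic, with assumed variable names (FARM's actual `reduce_preds` is more involved):

```python
# Illustrative sketch of the thresholding described in the diff comment:
# "no answer" wins only if its score exceeds the best positive answer
# by MORE than no_ans_threshold. With the default threshold of 0,
# any positive margin for no_answer is enough.
# Function and argument names are assumptions for illustration.

def pick_answer(no_answer_score, best_pos_score, no_ans_threshold=0):
    if no_answer_score - best_pos_score > no_ans_threshold:
        return "no_answer"
    return "positive_answer"
```

Raising `no_ans_threshold` above 0 biases the model toward returning a positive answer unless the no-answer evidence is clearly stronger.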
farm/utils.py
Outdated
# This lets us avoid cases in lm_finetuning where a chunk only has a single doc and hence cannot pick
# a valid next sentence substitute from another document
while num_dicts % multiprocessing_chunk_size == 1:
    multiprocessing_chunk_size -= -1
Why is this code in the PR? It seems you did not work on that?
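For what it's worth, the intent of the quoted loop (the unusual `-= -1` is just an increment) seems to be: grow the chunk size until no chunk would contain exactly one document, since a single-doc chunk cannot supply a next-sentence substitute from another document. A standalone sketch, with an assumed function name:

```python
# Sketch of the quoted loop's intent: bump the chunk size until the
# last chunk would not contain exactly one dict (document), because a
# single-doc chunk has no other document to draw a next-sentence
# substitute from. `adjust_chunk_size` is a name made up for this sketch.

def adjust_chunk_size(num_dicts, chunk_size):
    while num_dicts % chunk_size == 1:
        chunk_size += 1  # equivalent to the quoted `-= -1`
    return chunk_size

# Example: 10 dicts with chunk size 3 leave a remainder of 1,
# so the size is bumped to 4 (10 % 4 == 2, which is fine).
result = adjust_chunk_size(10, 3)
```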
if "farm_lm_name" in kwargs:
    albert.name = kwargs["farm_lm_name"]
else:
    albert.name = pretrained_model_name_or_path
I think it's ok for now, but having unrelated changes in a PR is not best practice.
@@ -288,6 +289,8 @@ def _get_random_sentence(all_baskets, forbidden_doc):
        rand_sent_idx = random.randrange(len(rand_doc))
        sentence = rand_doc[rand_sent_idx]
        break
    if sentence is None:
        raise Exception("Failed to pick out a suitable random substitute for next sentence")
These are changes to LM finetuning? Why are they in here?
looking good
This implements a more general ID encoding strategy that is not specific to SQuAD. There are also some general error reporting and print-out improvements.