
Encoding of QA IDs #171

Merged
merged 11 commits into from Jan 3, 2020

Conversation

brandenchan
Contributor

This implements a more general ID encoding strategy that is not specific to SQuAD. There are also some general error-reporting and printout improvements.

@tholor tholor changed the title Encoding Encoding of QA IDs Dec 16, 2019
Timoeller (Contributor) left a comment

As discussed and agreed, we need the SQuAD IDs, e.g. when we create predictions and match them with the gold labels.
What if we keep the squad-id in the SampleBasket as a string, plus a normal FARM ID that we can pass around in torch?
I just checked the code; this is how we already have it: squad_id = basket.raw["squad_id"]
But baskets and predictions are currently matched only by index. This matching should happen via an ID that is consistent across predictions and baskets.
So why do we need the full squad_id encoding?
I was hoping we could find a simpler solution...
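The index-based matching described above could instead be done as a join on a shared ID. A minimal sketch of the idea (the dicts and `match_by_id` here are hypothetical stand-ins, not FARM's actual SampleBasket API):

```python
def match_by_id(baskets, predictions):
    """Pair each basket with its prediction via a shared 'id' field,
    so reordering either list cannot silently misalign them."""
    preds_by_id = {p["id"]: p for p in predictions}
    return [(basket, preds_by_id[basket["id"]]) for basket in baskets]

baskets = [{"id": "squad-42", "gold": "Paris"}, {"id": "squad-7", "gold": "Berlin"}]
# Predictions arrive in a different order than the baskets:
preds = [{"id": "squad-7", "answer": "Berlin"}, {"id": "squad-42", "answer": "Paris"}]
pairs = match_by_id(baskets, preds)
```

Unlike zipping by index, this stays correct even if predictions are filtered or reordered upstream.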

Timoeller (Contributor) left a comment

Hey, I think the ID solution is simple and useful now. There seems to be quite some code unrelated to the PR, or even code that you did not write (I'm unsure about that). Please check my comments inside the changes.

@@ -1166,6 +1164,8 @@ def reduce_preds(self, preds, n_best=5):
returns the n_best predictions on the document level. """

# Initialize some variables
# no_ans_threshold is how much greater the no_answer logit needs to be over the pos_answer in order to be chosen
no_ans_threshold = 0
document_no_answer = True

Why do we have a no-answer threshold and a flag "document_no_answer"?
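As I read the snippet, the threshold gates how decisively the no-answer option must win. A hedged sketch of that logic (function and names are illustrative, not FARM's actual internals):

```python
def choose_answer(best_pos_logit, no_answer_logit, no_ans_threshold=0):
    """'no answer' wins only if its logit exceeds the best positive answer's
    logit by more than no_ans_threshold; with threshold 0 this reduces to a
    plain comparison, matching the initialization in the diff."""
    if no_answer_logit - best_pos_logit > no_ans_threshold:
        return "no_answer"
    return "positive_answer"
```

Raising the threshold biases the model toward returning a positive answer unless the no-answer evidence is strong.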

farm/utils.py Outdated
# This lets us avoid cases in lm_finetuning where a chunk only has a single doc and hence cannot pick
# a valid next sentence substitute from another document
while num_dicts % multiprocessing_chunk_size == 1:
    multiprocessing_chunk_size += 1

Why is this code in the PR? It seems as if you did not work on that part.
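For what it's worth, the quoted loop can be read as bumping the chunk size until the final chunk cannot hold exactly one document. A self-contained restatement of that behavior (illustrative, not the exact utils code):

```python
def adjust_chunk_size(num_dicts, chunk_size):
    """Increase chunk_size until no chunk would contain exactly one dict,
    which would break next-sentence substitution in lm_finetuning."""
    while num_dicts % chunk_size == 1:
        chunk_size += 1
    return chunk_size
```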

if "farm_lm_name" in kwargs:
albert.name = kwargs["farm_lm_name"]
else:
albert.name = pretrained_model_name_or_path

I think it's OK for now, but having unrelated changes in a PR is not best practice.
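Side note: the quoted if/else just lets callers override the model name via a keyword argument; a compact equivalent using `dict.get` (illustrative stand-in, not the library's code):

```python
def resolve_name(pretrained_model_name_or_path, **kwargs):
    # Use the "farm_lm_name" override when supplied, otherwise fall back
    # to the pretrained model identifier.
    return kwargs.get("farm_lm_name", pretrained_model_name_or_path)
```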

@@ -288,6 +289,8 @@ def _get_random_sentence(all_baskets, forbidden_doc):
rand_sent_idx = random.randrange(len(rand_doc))
sentence = rand_doc[rand_sent_idx]
break
if sentence is None:
raise Exception("Failed to pick out a suitable random substitute for next sentence")

These are changes to LM finetuning? Why are they in here?
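For context, the guarded sampling in the diff can be sketched as follows: draw a sentence from a random document other than the forbidden one, and fail loudly if no valid substitute is found. This is a simplified stand-in, not FARM's actual `_get_random_sentence`:

```python
import random

def get_random_sentence(all_docs, forbidden_idx, tries=10):
    """Pick a sentence from any document except all_docs[forbidden_idx];
    raise if no suitable substitute is found within the allotted tries."""
    sentence = None
    for _ in range(tries):
        doc_idx = random.randrange(len(all_docs))
        if doc_idx == forbidden_idx:
            continue  # must not sample the "next sentence" from the same doc
        doc = all_docs[doc_idx]
        sentence = doc[random.randrange(len(doc))]
        break
    if sentence is None:
        raise Exception("Failed to pick out a suitable random substitute for next sentence")
    return sentence
```

The explicit exception replaces a silent `None` that would otherwise surface as a confusing error further downstream.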

Timoeller (Contributor) left a comment

looking good

@brandenchan brandenchan merged commit efbeb16 into master Jan 3, 2020
@tholor tholor deleted the encoding_id branch April 28, 2020 07:30