Encoding of QA IDs #171
Conversation
As discussed and agreed, we need the squad IDs, e.g. to match predictions with the gold labels when we create them.
What if we keep the squad-id in the SampleBasket as a string and a normal FARM ID that we can pass around in torch?
I just checked the code, this is how we already have it: squad_id = basket.raw["squad_id"]
But baskets and predictions are just matched by index. This matching should happen via some ID that is consistent across predictions and baskets.
So why do we need the full squad_id encoding?
I was hoping we could find a simpler solution...
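To illustrate the point about matching by ID instead of index, here is a minimal sketch. The names (`match_preds_to_baskets`, the `"id"` keys, the dict-shaped baskets) are illustrative assumptions, not FARM's actual API:

```python
# Hypothetical sketch: join predictions to baskets on a shared id
# instead of relying on list position, so the pairing stays correct
# even if one side is filtered or reordered.
# All names here are illustrative, not FARM's real data structures.

def match_preds_to_baskets(preds, baskets):
    """Return {id: (basket, pred)} pairs joined on a shared id."""
    baskets_by_id = {b["id"]: b for b in baskets}
    matched = {}
    for pred in preds:
        basket = baskets_by_id.get(pred["id"])
        if basket is None:
            raise ValueError(f"No basket found for prediction id {pred['id']}")
        matched[pred["id"]] = (basket, pred)
    return matched

baskets = [{"id": "squad-42", "raw": {"squad_id": "squad-42"}}]
preds = [{"id": "squad-42", "answer": "Paris"}]
matched = match_preds_to_baskets(preds, baskets)
```

With a stable ID on both sides, the matching no longer breaks when baskets are dropped or shuffled during preprocessing.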
Hey, I think the ID solution is simple and useful now. There seems to be quite some code unrelated to the PR, or even code that you did not write (I'm unsure about that). Please check my comments inside the changes.
@@ -1166,6 +1164,8 @@ def reduce_preds(self, preds, n_best=5):
    returns the n_best predictions on the document level. """

    # Initialize some variables
    # no_ans_threshold is how much greater the no_answer logit needs to be over the pos_answer in order to be chosen
    no_ans_threshold = 0
    document_no_answer = True
Why do we have both a no-answer threshold and a flag "document_no_answer"?
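For context, the comment in the diff describes a thresholded comparison between the no-answer logit and the best positive answer. A minimal sketch of that logic, with assumed variable names (FARM's actual `reduce_preds` is more involved):

```python
# Illustrative sketch of the thresholding described in the diff comment:
# "no answer" wins only if its score exceeds the best positive answer
# by MORE than no_ans_threshold. With the default threshold of 0,
# any positive margin for no_answer is enough.
# Function and argument names are assumptions for illustration.

def pick_answer(no_answer_score, best_pos_score, no_ans_threshold=0):
    if no_answer_score - best_pos_score > no_ans_threshold:
        return "no_answer"
    return "positive_answer"
```

Raising `no_ans_threshold` above 0 biases the model toward returning a positive answer unless the no-answer evidence is clearly stronger.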
farm/utils.py
Outdated
# This lets us avoid cases in lm_finetuning where a chunk only has a single doc and hence cannot pick
# a valid next sentence substitute from another document
while num_dicts % multiprocessing_chunk_size == 1:
    multiprocessing_chunk_size -= -1
Why is this code in the PR? It seems you did not work on that?
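For what it's worth, the intent of the quoted loop (the unusual `-= -1` is just an increment) seems to be: grow the chunk size until no chunk would contain exactly one document, since a single-doc chunk cannot supply a next-sentence substitute from another document. A standalone sketch, with an assumed function name:

```python
# Sketch of the quoted loop's intent: bump the chunk size until the
# last chunk would not contain exactly one dict (document), because a
# single-doc chunk has no other document to draw a next-sentence
# substitute from. `adjust_chunk_size` is a name made up for this sketch.

def adjust_chunk_size(num_dicts, chunk_size):
    while num_dicts % chunk_size == 1:
        chunk_size += 1  # equivalent to the quoted `-= -1`
    return chunk_size

# Example: 10 dicts with chunk size 3 leave a remainder of 1,
# so the size is bumped to 4 (10 % 4 == 2, which is fine).
result = adjust_chunk_size(10, 3)
```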
if "farm_lm_name" in kwargs:
    albert.name = kwargs["farm_lm_name"]
else:
    albert.name = pretrained_model_name_or_path
I think it's ok for now, but having unrelated changes in a PR is not best practice.
@@ -288,6 +289,8 @@ def _get_random_sentence(all_baskets, forbidden_doc):
        rand_sent_idx = random.randrange(len(rand_doc))
        sentence = rand_doc[rand_sent_idx]
        break
    if sentence is None:
        raise Exception("Failed to pick out a suitable random substitute for next sentence")
These are changes to LM finetuning? Why are they in here?
looking good
This implements a more general ID encoding strategy that is not specific to SQuAD. There are also some general error reporting and print-out improvements.