unit-test bypass args by hakunanatasha · Pull Request #398 · bigscience-workshop/biomedical

hakunanatasha · 2022-04-09T20:03:09Z

I used @barthfab's scripts as a template, since they are a good example of a dataset that passes the unit tests except on the test split, where the key to be predicted is understandably empty.

Adds the following features for the unit-tests:

(1) Allows you to bypass testing for a split found in data generators (via ignoring in the test_schema for-loop)
(2) Allows you to bypass testing of a particular key in a schema (via removing from non_empty_features)
(3) Allows you to ignore a single key within a specified split

The defaults for all of these are empty.
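As a rough sketch of how three such flags could be wired up: the flag names below come from this PR, but the parser layout, the `dataloader_path` positional, and the `parse_split_key_pairs` helper are assumptions for illustration, not the actual code in tests/test_bigbio.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three bypass flags in this PR."""
    parser = argparse.ArgumentParser(description="Sketch of the unit-test bypass flags")
    # Assumed positional argument: path to the dataset script under test.
    parser.add_argument("dataloader_path")
    # Each flag takes zero or more values and defaults to an empty list,
    # so leaving a flag out keeps the original test behavior.
    parser.add_argument("--bypass_splits", nargs="*", default=[])
    parser.add_argument("--bypass_keys", nargs="*", default=[])
    parser.add_argument("--bypass_split_and_key", nargs="*", default=[])
    return parser


def parse_split_key_pairs(raw_pairs):
    """Turn ["test,events"] into [("test", "events")]."""
    pairs = []
    for raw in raw_pairs:
        split_name, _, key = raw.partition(",")
        if not split_name or not key:
            raise ValueError(f"Expected 'split,key' but got {raw!r}")
        pairs.append((split_name, key))
    return pairs


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.bypass_splits, args.bypass_keys, parse_split_key_pairs(args.bypass_split_and_key))
```

With this layout, omitting every flag yields three empty lists, which matches the "defaults are empty" behavior described above.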

You can test them as follows (I recommend using this PR to test):

# This will fail
python -m tests.test_bigbio biodatasets/bionlp_st_2013_pc/bionlp_st_2013_pc.py 

# These will work
# Omit "events" in test
python -m tests.test_bigbio biodatasets/bionlp_st_2013_pc/bionlp_st_2013_pc.py --bypass_split_and_key test,events

python -m tests.test_bigbio biodatasets/bionlp_st_2013_pc/bionlp_st_2013_pc.py --bypass_splits test

python -m tests.test_bigbio biodatasets/bionlp_st_2013_pc/bionlp_st_2013_pc.py --bypass_keys events

# This will fail again, because only the test split is the issue
python -m tests.test_bigbio biodatasets/bionlp_st_2013_pc/bionlp_st_2013_pc.py --bypass_splits train

I included the (split, key) combo because there are cases where you want to ensure most of the schema passes, but you know a single key (or a few keys) may be problematic. The other two options can be used more generally, with admin approval.

galtay · 2022-04-09T20:35:26Z

+        )
+
+        # Omit a particular key in a split
+        if len(self.BYPASS_SPLIT_AND_KEY) > 0:


this seems like a lot of extra code ... is it possible to frame this as an early return condition in the original loop?

for split_name, split in self.datasets_bigbio[schema].items():
    if split_name in self.BYPASS_SPLITS:
        continue  # ?
    self.assertEqual(split.info.features, features)
    for non_empty_feature in non_empty_features:
        if split_to_feature_counts[split_name][non_empty_feature] == 0:
            raise AssertionError(f"Required key '{non_empty_feature}' does not have any instances")

BYPASS_SPLIT_AND_KEY is a specific condition.

In the above implementation, you can either bypass a whole data split via BYPASS_SPLITS (e.g. ignore ALL testing for "train") or bypass a key via BYPASS_KEYS (e.g. ignore "events"). Either option applies uniformly: to every key in the split, or to that key in every split.

In certain cases (e.g. many of the bionlp datasets), the test split is the only split with a missing key, and only one key is missing (for example, the test set is missing "events" in an event extraction task). For this case, it's wasteful and possibly dangerous to waive testing on the ENTIRE split, when the input examples (like entities) may still need to be checked. The logic branches because in these cases you need a specific combination of (split name, key to ignore), hence the overhead here.

If you have a suggestion that can streamline it, please feel free. This isn't written particularly efficiently (or nicely 🤡) but it works on the cases I've tested, to ensure that both the new args work and the old behavior is maintained.
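To make the difference between the three modes concrete, here is a small standalone sketch. The function name `find_empty_required_keys` and the toy counts are made up for illustration; only the idea of a `split_to_feature_counts` mapping and the three bypass options come from this thread.

```python
def find_empty_required_keys(
    split_to_feature_counts,
    non_empty_features,
    bypass_splits=(),
    bypass_keys=(),
    bypass_split_and_key=(),
):
    """Return (split, key) pairs whose required key has zero instances."""
    failures = []
    for split_name, counts in split_to_feature_counts.items():
        if split_name in bypass_splits:
            continue  # skip every check for this split
        for key in non_empty_features:
            if key in bypass_keys:
                continue  # skip this key in every split
            if (split_name, key) in bypass_split_and_key:
                continue  # skip only this exact (split, key) combination
            if counts.get(key, 0) == 0:
                failures.append((split_name, key))
    return failures


# Toy counts (invented): the test split has no events, which is expected,
# but it also has no entities, which would be a real bug.
counts = {
    "train": {"entities": 120, "events": 40},
    "test": {"entities": 0, "events": 0},
}
required = ["entities", "events"]

# No bypass: both problems are reported.
print(find_empty_required_keys(counts, required))
# Bypassing the whole test split hides the entities bug as well.
print(find_empty_required_keys(counts, required, bypass_splits=["test"]))
# Bypassing only (test, events) still reports the entities bug.
print(find_empty_required_keys(counts, required, bypass_split_and_key=[("test", "events")]))
```

In this toy case the whole-split bypass returns no failures at all, while the (split, key) bypass still surfaces the missing entities, which is the risk described above.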

I might be missing something (and if this implementation is totally off, we can ignore it) but I was thinking all 3 conditions could be incorporated with early returns ....

bypass_split_keys = [i.split(",") for i in self.BYPASS_SPLIT_AND_KEY]
for split_name, split in self.datasets_bigbio[schema].items():
    # skip entire split
    if split_name in self.BYPASS_SPLITS:
        logger.info(f"skipping {split_name}")
        continue
    self.assertEqual(split.info.features, features)
    for non_empty_feature in non_empty_features:
        # skip specific split,key combo
        split_key = [split_name, non_empty_feature]
        if split_key in bypass_split_keys:
            logger.info(f"skipping {split_key}")
            continue
        # skip key in every split
        if non_empty_feature in self.BYPASS_KEYS:
            logger.info(f"skipping {non_empty_feature}")
            continue
        if split_to_feature_counts[split_name][non_empty_feature] == 0:
            raise AssertionError(f"Required key '{non_empty_feature}' does not have any instances")

ps, the above was just me trying to do some code re-factor ... if we want to just merge these updates as is and cleanup after the hackathon, I'm OK with that too!

this is good - I don't mind implementing this, albeit at the earliest sometime tomorrow. If you can replace the logic, feel free to pull the branch or edit the file directly.

code-refactors are good; as mentioned earlier, I wrote this pretty inelegantly to just get it done. Always good to have a second pair of eyes to squash some of the inefficiencies.

sg-wbi · 2022-04-25T08:48:49Z

This PR #453 has multiple subset_ids. It has 2 schemas (TEXT and KB), however these are mutually exclusive, i.e. one subset_id has only TEXT and another has KB. When I run the test I get this error:

ValueError: BuilderConfig codiesp_p_bigbio_kb not found. Available: ['codiesp_d_source', 'codiesp_p_source', 'codiesp_x_source', 'codiesp_extra_mesh_source', 'codiesp_extra_cie_source', 'codiesp_d_bigbio_text', 'codiesp_p_bigbio_text', 'codiesp_x_bigbio_kb', 'codiesp_extra_mesh_bigbio_text', 'codiesp_extra_cie_bigbio_text']

but for subset_id "codiesp_p" there is not supposed to be a "codiesp_p_bigbio_kb". Am I doing something wrong, or does this warrant a bypass_schema flag?
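A minimal sketch of what such a flag might do, purely as an illustration: `expected_bigbio_configs` and its `bypass_schemas` parameter are hypothetical, and the `{subset_id}_bigbio_{schema}` naming pattern is inferred from the config names in the error message above rather than taken from the test code.

```python
def expected_bigbio_configs(subset_id, schemas, bypass_schemas=()):
    """Build the bigbio config names a test would look up, minus bypassed schemas."""
    return [
        f"{subset_id}_bigbio_{schema.lower()}"
        for schema in schemas
        if schema not in bypass_schemas
    ]


# Config names copied from the error message in this comment.
available = [
    "codiesp_d_source", "codiesp_p_source", "codiesp_x_source",
    "codiesp_extra_mesh_source", "codiesp_extra_cie_source",
    "codiesp_d_bigbio_text", "codiesp_p_bigbio_text", "codiesp_x_bigbio_kb",
    "codiesp_extra_mesh_bigbio_text", "codiesp_extra_cie_bigbio_text",
]

# Without a bypass, the KB config is requested and is missing.
wanted = expected_bigbio_configs("codiesp_p", ["TEXT", "KB"])
print([name for name in wanted if name not in available])

# With KB bypassed for this subset, every requested config exists.
wanted = expected_bigbio_configs("codiesp_p", ["TEXT", "KB"], bypass_schemas=["KB"])
print([name for name in wanted if name not in available])
```

An explicit bypass keeps the decision visible on the command line, whereas silently skipping any missing config could hide a genuinely broken dataset script.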

sg-wbi · 2022-04-27T07:53:03Z

For reference: #488 (comment)

hakunanatasha · 2022-05-01T20:54:44Z

Superseded by #533

hakunanatasha added 4 commits April 9, 2022 10:53

ft + doc: adds 2 new args + updates docstring

30de13c

fix: style + adds split/key bypasser

ee2fd8f

ft: adds bypass split feature

baf3c36

ft: adds bypasser for split+key combo

9bc64ae

hakunanatasha requested review from galtay, jason-fries, leonweber, ruisi-su, sg-wbi and sunnnymskang as code owners April 9, 2022 20:03

galtay reviewed Apr 9, 2022

hakunanatasha mentioned this pull request Apr 10, 2022

Closes #217 #332

Merged

WojciechKusa mentioned this pull request Apr 20, 2022

Closes #61 [DEPRECATED] #422

Closed

8 tasks

MFreidank mentioned this pull request Apr 21, 2022

Closes #252 #432

Merged

8 tasks

sg-wbi mentioned this pull request Apr 25, 2022

Closes #64 #453

Merged

8 tasks

hakunanatasha mentioned this pull request Apr 27, 2022

Close #23 #519

Merged

8 tasks

hakunanatasha mentioned this pull request May 1, 2022

Unit-test bypass args #533

Merged

hakunanatasha closed this May 1, 2022

galtay deleted the bypass_split branch September 5, 2022 21:02

hakunanatasha wants to merge 4 commits into master from bypass_split

hakunanatasha commented Apr 9, 2022
galtay Apr 9, 2022
hakunanatasha Apr 10, 2022 (edited)
galtay Apr 10, 2022 (edited)
galtay Apr 10, 2022
hakunanatasha Apr 11, 2022
hakunanatasha Apr 11, 2022 (edited)
sg-wbi commented Apr 25, 2022
sg-wbi commented Apr 27, 2022
hakunanatasha commented May 1, 2022

3 participants
