Skip to content

unit-test bypass args#398

Closed
hakunanatasha wants to merge 4 commits intomasterfrom
bypass_split
Closed

unit-test bypass args#398
hakunanatasha wants to merge 4 commits intomasterfrom
bypass_split

Conversation

@hakunanatasha
Copy link
Copy Markdown
Collaborator

@leonweber @galtay

I used @barthfab 's scripts as a template, since they are a good candidate of when they pass unit-tests except for the testing split which is understandably empty in the key that is to-be predicted.

Adds the following features for the unit-tests:

(1) Allows you to bypass testing for a split found in data generators (via ignoring in the test_schema for-loop)
(2) Allows you to bypass testing of a particular key in a schema (via removing from non_empty_features)
(3) Allows you to ignore a single key within a specified split

The defaults for all of these are empty.

You can test them as follows (I recommend using this PR to test):

# This will fail
python -m tests.test_bigbio biodatasets/bionlp_st_2013_pc/bionlp_st_2013_pc.py 

# These will work
# Omit "events" in test
python -m tests.test_bigbio biodatasets/bionlp_st_2013_pc/bionlp_st_2013_pc.py --bypass_split_and_key test,events

python -m tests.test_bigbio biodatasets/bionlp_st_2013_pc/bionlp_st_2013_pc.py --bypass_splits test

python -m tests.test_bigbio biodatasets/bionlp_st_2013_pc/bionlp_st_2013_pc.py --bypass_keys events

# This will fail again, because only the test split is the issue
python -m tests.test_bigbio biodatasets/bionlp_st_2013_pc/bionlp_st_2013_pc.py --bypass_splits train

I included the (split, key) combo, because there are cases where you want to ensure most of the schema passes and you know a single key (or a few keys) may be problematic. The others can be used generally with admin approval

Comment thread tests/test_bigbio.py
)

# Omit a particular key in a split
if len(self.BYPASS_SPLIT_AND_KEY) > 0:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a lot of extra code ... is it possible to frame this as an early return condition in the original loop?

for split_name, split in self.datasets_bigbio[schema].items():
    if split_name in self.BYPASS_SPLITS:
        continue  ? 
    self.assertEqual(split.info.features, features)
    for non_empty_feature in non_empty_features:
        if split_to_feature_counts[split_name][non_empty_feature] == 0:
            raise AssertionError(f"Required key '{non_empty_feature}' does not have any instances")

Copy link
Copy Markdown
Collaborator Author

@hakunanatasha hakunanatasha Apr 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BYPASS_SPLIT_AND_KEY is a specific condition

In the above implementation, you can either bypass a data split via BYPASS_SPLITS (i.e. ignore ALL testing for "train"), bypass a key via BYPASS_KEYS (i.e. "ignore events"). This applies uniformly to all cases.

In certain cases (i.e. many of the bionlp datasets), the test set is the only dataset to have only 1 key missing (for example, the test set is missing "events" in an event extraction task). For this case, it's wasteful and possibly dangerous to waive testing on the ENTIRE split, when the input examples (like entities) may need to be checked. The logic splits because in certain cases you need a specific combination of (split name, key to ignore), hence the overhead here.

If you have a suggestion that can streamline it, please feel free. This isn't written particularly efficiently (or nicely 🤡) but it works on the cases I've tested to ensure both the new args, and old behavior is maintained.

Copy link
Copy Markdown
Collaborator

@galtay galtay Apr 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something (and if this implementation is totally off, we can ignore it) but I was thinking all 3 conditions could be incorporated with early returns ....

bypass_split_keys = [i.split(",") for i in self.BYPASS_SPLIT_AND_KEY]
for split_name, split in self.datasets_bigbio[schema].items():

    # skip entire split
    if split_name in self.BYPASS_SPLITS:
        logger.info(f"skipping {split_name}")
        continue  

    self.assertEqual(split.info.features, features)

    for non_empty_feature in non_empty_features:
        # skip specific split,key combo 
        split_key = [split_name, non_empty_feature]
        if split_key in bypass_split_keys:
            logger.info(f"skipping {split_key}")

        # skip key in every split
        if non_empty_features in self.BYPASS_KEYS:
            logger.info(f"skipping {non_empty_feature}")   

        if split_to_feature_counts[split_name][non_empty_feature] == 0:
            raise AssertionError(f"Required key '{non_empty_feature}' does not have any instances")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ps, the above was just me trying to do some code re-factor ... if we want to just merge these updates as is and cleanup after the hackathon, I'm OK with that too!

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is good - i dont mind implementing this albeit AT earliest sometime tomorrow. If you can replace the logic feel free to pull the branch or edit directly on the file.

Copy link
Copy Markdown
Collaborator Author

@hakunanatasha hakunanatasha Apr 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code-refactors are good; as mentioned earlier, I wrote this pretty inelegantly to just get it done. Always good to have a second pair of eyes to squash some of the inefficiencies.

@hakunanatasha hakunanatasha mentioned this pull request Apr 10, 2022
@WojciechKusa WojciechKusa mentioned this pull request Apr 20, 2022
8 tasks
@MFreidank MFreidank mentioned this pull request Apr 21, 2022
8 tasks
@sg-wbi
Copy link
Copy Markdown
Collaborator

sg-wbi commented Apr 25, 2022

This PR #453 has multiple subset_ids. It has 2 schemas (TEXT and KB), however these are mutually exclusive, i.e. one subset_id has only TEXT and another has KB. When I run the test I get this error:

ValueError: BuilderConfig codiesp_p_bigbio_kb not found. Available: ['codiesp_d_source', 'codiesp_p_source', 'codiesp_x_source', 'codiesp_extra_mesh_source', 'codiesp_extra_cie_source', 'codiesp_d_bigbio_text', 'codiesp_p_bigbio_text', 'codiesp_x_bigbio_kb', 'codiesp_extra_mesh_bigbio_text', 'codiesp_extra_cie_bigbio_text']

but for subset_id "codiesp_p" there is not supposed to be a "codiesp_p_bigbio_kb". Am I doing something wrong or does this warrant a a bypass_schema flag?

@sg-wbi sg-wbi mentioned this pull request Apr 25, 2022
8 tasks
@hakunanatasha hakunanatasha mentioned this pull request Apr 27, 2022
8 tasks
@sg-wbi
Copy link
Copy Markdown
Collaborator

sg-wbi commented Apr 27, 2022

For reference: #488 (comment)

@hakunanatasha
Copy link
Copy Markdown
Collaborator Author

Superceded by #533

@galtay galtay deleted the bypass_split branch September 5, 2022 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants