Conversation
| ) | ||
|
|
||
| # Omit a particular key in a split | ||
| if len(self.BYPASS_SPLIT_AND_KEY) > 0: |
There was a problem hiding this comment.
this seems like a lot of extra code ... is it possible to frame this as an early return condition in the original loop?
for split_name, split in self.datasets_bigbio[schema].items():
if split_name in self.BYPASS_SPLITS:
continue ?
self.assertEqual(split.info.features, features)
for non_empty_feature in non_empty_features:
if split_to_feature_counts[split_name][non_empty_feature] == 0:
raise AssertionError(f"Required key '{non_empty_feature}' does not have any instances")
There was a problem hiding this comment.
BYPASS_SPLIT_AND_KEY is a specific condition
In the above implementation, you can either bypass a data split via BYPASS_SPLITS (i.e. ignore ALL testing for "train"), bypass a key via BYPASS_KEYS (i.e. "ignore events"). This applies uniformly to all cases.
In certain cases (i.e. many of the bionlp datasets), the test set is the only dataset to have only 1 key missing (for example, the test set is missing "events" in an event extraction task). For this case, it's wasteful and possibly dangerous to waive testing on the ENTIRE split, when the input examples (like entities) may need to be checked. The logic splits because in certain cases you need a specific combination of (split name, key to ignore), hence the overhead here.
If you have a suggestion that can streamline it, please feel free. This isn't written particularly efficiently (or nicely 🤡) but it works on the cases I've tested to ensure both the new args, and old behavior is maintained.
There was a problem hiding this comment.
I might be missing something (and if this implementation is totally off, we can ignore it) but I was thinking all 3 conditions could be incorporated with early returns ....
bypass_split_keys = [i.split(",") for i in self.BYPASS_SPLIT_AND_KEY]
for split_name, split in self.datasets_bigbio[schema].items():
# skip entire split
if split_name in self.BYPASS_SPLITS:
logger.info(f"skipping {split_name}")
continue
self.assertEqual(split.info.features, features)
for non_empty_feature in non_empty_features:
# skip specific split,key combo
split_key = [split_name, non_empty_feature]
if split_key in bypass_split_keys:
logger.info(f"skipping {split_key}")
# skip key in every split
if non_empty_features in self.BYPASS_KEYS:
logger.info(f"skipping {non_empty_feature}")
if split_to_feature_counts[split_name][non_empty_feature] == 0:
raise AssertionError(f"Required key '{non_empty_feature}' does not have any instances")
There was a problem hiding this comment.
ps, the above was just me trying to do some code re-factor ... if we want to just merge these updates as is and cleanup after the hackathon, I'm OK with that too!
There was a problem hiding this comment.
this is good - i dont mind implementing this albeit AT earliest sometime tomorrow. If you can replace the logic feel free to pull the branch or edit directly on the file.
There was a problem hiding this comment.
code-refactors are good; as mentioned earlier, I wrote this pretty inelegantly to just get it done. Always good to have a second pair of eyes to squash some of the inefficiencies.
|
This PR #453 has multiple ValueError: BuilderConfig codiesp_p_bigbio_kb not found. Available: ['codiesp_d_source', 'codiesp_p_source', 'codiesp_x_source', 'codiesp_extra_mesh_source', 'codiesp_extra_cie_source', 'codiesp_d_bigbio_text', 'codiesp_p_bigbio_text', 'codiesp_x_bigbio_kb', 'codiesp_extra_mesh_bigbio_text', 'codiesp_extra_cie_bigbio_text']but for |
|
For reference: #488 (comment) |
|
Superceded by #533 |
@leonweber @galtay
I used @barthfab 's scripts as a template, since they are a good candidate of when they pass unit-tests except for the testing split which is understandably empty in the key that is to-be predicted.
Adds the following features for the unit-tests:
(1) Allows you to bypass testing for a split found in data generators (via ignoring in the
test_schemafor-loop)(2) Allows you to bypass testing of a particular key in a schema (via removing from non_empty_features)
(3) Allows you to ignore a single key within a specified split
The defaults for all of these are empty.
You can test them as follows (I recommend using this PR to test):
I included the (split, key) combo, because there are cases where you want to ensure most of the schema passes and you know a single key (or a few keys) may be problematic. The others can be used generally with admin approval