Clean logging and error handling #639
Conversation
As it is right now, I have removed sample level error messages in NER that would, for example, say that a given sample has a label not in the label list, or that it doesn't have a label field at all. Instead, failed featurization is handled at a higher level so that we get a single line mentioning how many samples are not properly featurized. However, since we chunk our data, we actually have one line per chunk that states how many problematic samples there are. Pros:
Cons:
If we really want a single line of error printout, we will need to make changes to some very high-level functions that will have an impact across all tasks (e.g. …). This is also a more general problem that will apply to QA, so it's worth strategizing about how we should tackle this. @Timoeller thoughts?
I think with problematic_ids we need some further testing, especially to check whether other tasks break.
farm/data_handler/processor.py
Outdated
        # This mode is for inference where we need to keep baskets
        if return_baskets:
            dataset, tensor_names = self._create_dataset(keep_baskets=True)
            return dataset, tensor_names, self.baskets
        ret = [dataset, tensor_names]
I am not sure about the consequences, but lists are mutable and tuples (the return type before) are not. This might introduce some pretty weird behaviour. Do we need it?
Please also check the other return statement above
Ok sure, I agree a tuple could be more stable. I wrote it like this to avoid having too many nested if statements, since both return_baskets and return_problematic can alter the number of objects that are returned. I will convert ret into a tuple in the next commit to avoid this mutability.
@@ -181,10 +181,12 @@ def _get_dataset(self, filename, dicts=None):
         if filename:
             desc += f" (unknown)"
         with tqdm(total=len(dicts), unit=' Dicts', desc=desc) as pbar:
-            for dataset, tensor_names in results:
+            for dataset, tensor_names, problematic_samples in results:
Doesn't this statement break when the problematic samples are not returned?
In this context, results is always the output of _dataset_from_chunk(), which I modified to always return 3 objects, so I don't think this unpacking is an issue. As an aside, this function (i.e. _get_dataset()) is the only method in our code base which calls _dataset_from_chunk(), so it shouldn't be problematic that I've modified it to return 3 objects.
Ok, agreed, but this means we always need to return those ids.
We do have a flag in dataset_from_dicts, return_problematic=True, which seems to break the processing when set to False, right?
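The unpacking stays safe regardless of such a flag if the chunk-level function always returns a fixed-arity tuple, with an empty list of problematic ids in the common case. A hypothetical stand-in for _dataset_from_chunk (the sample schema and featurization are invented for illustration):

```python
def dataset_from_chunk(chunk):
    """Featurize one chunk of sample dicts.

    Always returns a 3-tuple (dataset, tensor_names, problematic_ids),
    even when problematic_ids is empty, so that a caller's
    `for dataset, tensor_names, problematic_samples in results:` loop
    never breaks on unpacking.
    """
    dataset, problematic_ids = [], []
    for sample in chunk:
        # Samples missing a "label" field cannot be featurized;
        # record their id instead of logging a per-sample error.
        if "label" in sample:
            dataset.append((sample["text"], sample["label"]))
        else:
            problematic_ids.append(sample["id"])
    tensor_names = ["input", "label"]
    return dataset, tensor_names, problematic_ids
```

With a fixed return arity, a flag like return_problematic only needs to control whether the ids are logged or propagated further, not the shape of the return value.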
                logger.error(f"Error message: {e}")
                curr_problematic_sample_ids.append(sample.id)
        if curr_problematic_sample_ids:
            self.problematic_sample_ids.update(curr_problematic_sample_ids)
Why are we updating the problematic_sample_ids here and when we collect results from MP?
The idea here is that we have two levels at which problematic_ids are stored. Within multiprocessing, each Processor needs to collect problematic_ids as it iterates through the baskets in its chunks. These are eventually returned out of multiprocessing via dataset_from_dicts(). At this point, the returned problematic_ids are collected together in the non-multiprocessing Processor.
I went for this approach because it allows us to implement logging of problematic samples neatly as a method of the processor (Processor.log_problematic()), as long as the problematic sample ids are stored in the Processor.
I do admit that it might be a little confusing coordinating between the higher-level single Processor and the multiple lower-level Processors. Do you have any thoughts or suggestions on this?
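The two-level design can be sketched as follows. This is a hypothetical minimal Processor (invented sample schema and method bodies), not FARM's actual implementation: worker-side calls collect per-chunk ids, and the parent Processor aggregates them into a set and logs a single summary line.

```python
class ProcessorSketch:
    """Illustrates two-level collection of problematic sample ids."""

    def __init__(self):
        # Aggregated on the parent (non-multiprocessing) Processor.
        self.problematic_sample_ids = set()

    def dataset_from_dicts(self, dicts):
        # Worker side: featurize a chunk and return the ids of samples
        # that fail (here, simply those missing a "label" field).
        curr_problematic_ids = [d["id"] for d in dicts if "label" not in d]
        dataset = [d for d in dicts if "label" in d]
        return dataset, ["input", "label"], curr_problematic_ids

    def log_problematic(self):
        # Parent side: one summary line instead of one error per sample.
        if self.problematic_sample_ids:
            print(f"Unable to convert {len(self.problematic_sample_ids)} "
                  f"samples to features: {sorted(self.problematic_sample_ids)}")
```

The parent would call self.problematic_sample_ids.update(ids) on each 3-tuple coming back from the worker pool, then call log_problematic() once after all chunks are processed, which yields the desired single line of error printout per run rather than per chunk.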
This PR addresses many of the issues in #590 regarding error handling and logging.