
Clean logging and error handling #639

Merged
12 commits merged into master on Dec 3, 2020
Conversation

brandenchan (Contributor)

This PR addresses many of the issues in #590 regarding error handling and logging.

brandenchan (Contributor, Author) commented Nov 26, 2020

As it is right now, I have removed the sample-level error messages in NER that would, for example, say that a given sample has a label that is not in the label list, or that it doesn't have a label field at all.

Instead, failed featurization is handled at a higher level so that we get a single line mentioning how many samples were not properly featurized. However, since we chunk our data, we actually get one line per chunk stating how many problematic samples there are.
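For illustration, here is a rough sketch of this per-chunk behaviour. It is not the actual FARM code; the featurize() helper, the loop structure, and the log wording are all assumptions:

```python
import logging

logger = logging.getLogger(__name__)

def featurize_chunk(samples, featurize):
    """Featurize one chunk of samples; emit a single summary line for failures."""
    featurized, problematic_ids = [], []
    for sample in samples:
        try:
            featurized.append(featurize(sample))
        except Exception:
            # No per-sample error message anymore; just remember the id
            problematic_ids.append(sample.id)
    if problematic_ids:
        # One error line per chunk rather than one per failing sample
        logger.error(f"Could not featurize {len(problematic_ids)} samples in this chunk")
    return featurized, problematic_ids
```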

Pros:

  • Reduced number of error printouts

Cons:

  • No sample specific error messages
  • Potentially still quite a lot of lines of error logging

If we really want a single line of error printout, we will need to make changes to some very high-level functions that will have an impact across all tasks (e.g. Datasilo._dataset_from_chunk() or Processor.dataset_from_dicts()).

This is also a more general problem that will apply to QA, so it's worth strategizing about how we should tackle this.

@Timoeller thoughts?

Timoeller (Contributor) left a comment

I think with problematic_ids we need some further testing, especially to check whether other tasks break.

        # This mode is for inference where we need to keep baskets
        if return_baskets:
            dataset, tensor_names = self._create_dataset(keep_baskets=True)
            return dataset, tensor_names, self.baskets
        ret = [dataset, tensor_names]
Timoeller (Contributor)

I am not sure about the consequences, but lists are mutable and tuples (the return type before) are not. This might introduce some pretty weird behaviour. Do we need it?

Please also check the other return statement above.

brandenchan (Contributor, Author) commented Dec 1, 2020

Ok sure, I agree a tuple could be more stable. I wrote it like this to avoid having too many nested if statements, since both return_baskets and return_problematic can alter the number of objects that are returned. I will convert ret into a tuple in the next commit to avoid this mutability.
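For illustration, one way to keep the return value immutable while still varying its length is to build a tuple and extend it conditionally. This is only a sketch of the idea discussed here; the helper name _build_return is hypothetical, while return_baskets and return_problematic are the flags mentioned above:

```python
def _build_return(dataset, tensor_names, baskets, problematic_ids,
                  return_baskets=False, return_problematic=False):
    """Sketch: assemble an immutable return value without deeply nested ifs."""
    ret = (dataset, tensor_names)        # always returned
    if return_baskets:
        ret += (baskets,)                # optionally append the baskets
    if return_problematic:
        ret += (problematic_ids,)        # optionally append problematic sample ids
    return ret
```

The tuple is still assembled conditionally, but callers can never mutate the result after it has been returned.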

@@ -181,10 +181,12 @@ def _get_dataset(self, filename, dicts=None):
         if filename:
             desc += f" {filename}"
         with tqdm(total=len(dicts), unit=' Dicts', desc=desc) as pbar:
-            for dataset, tensor_names in results:
+            for dataset, tensor_names, problematic_samples in results:
Timoeller (Contributor)

Doesn't this statement break in the case where the problematic samples are not returned?

brandenchan (Contributor, Author)

In this context, results is always the output of _dataset_from_chunk(), which I modified to always return 3 objects, so I don't think this unpacking is an issue. As an aside, this function (i.e. _get_dataset()) is the only method in our code base that calls _dataset_from_chunk(), so it shouldn't be problematic that I've modified it to return 3 objects.
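A minimal sketch of the invariant described here, i.e. the chunk worker always returning three objects so that its single caller can unpack them unconditionally. The (index, dict) chunk layout and the exact keyword arguments are assumptions:

```python
def _dataset_from_chunk(chunk, processor):
    """Sketch: convert one chunk of (index, dict) pairs and always return 3 objects."""
    dicts = [d for _, d in chunk]
    dataset, tensor_names, problematic_ids = processor.dataset_from_dicts(
        dicts=dicts, return_problematic=True
    )
    return dataset, tensor_names, problematic_ids
```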

Timoeller (Contributor)

Ok, agreed, but this means we always need to return those ids.
We do have a flag in dataset_from_dicts, return_problematic=True, which seems to break the processing when set to False, right?

                logger.error(f"Error message: {e}")
                curr_problematic_sample_ids.append(sample.id)
        if curr_problematic_sample_ids:
            self.problematic_sample_ids.update(curr_problematic_sample_ids)
Timoeller (Contributor)

Why are we updating the problematic_sample_ids here and when we collect results from MP?

brandenchan (Contributor, Author)

The idea here is that we have two levels at which problematic_ids are stored. Within multiprocessing, each Processor needs to collect problematic_ids as it iterates through the baskets in its chunks. These are eventually returned out of multiprocessing via dataset_from_dicts(). At that point, the returned problematic_ids are collected together in the non-multiprocessing Processor.

I went for this approach because it allows us to implement logging of problematic samples neatly as a method of the processor (Processor.log_problematic()), as long as the problematic sample ids are stored in the Processor.

I do admit that it might be a little confusing coordinating between the higher level single Processor and the multiple lower level Processors. Do you have any thoughts or suggestions on this?
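A minimal sketch of the two-level collection described above. The names problematic_sample_ids, dataset_from_dicts, and log_problematic come from this thread; everything else is assumed and heavily simplified:

```python
import logging

logger = logging.getLogger(__name__)

class Processor:
    def __init__(self):
        self.problematic_sample_ids = set()

    def dataset_from_dicts(self, dicts, return_problematic=True):
        # Level 1: inside each multiprocessing worker, the worker's own Processor
        # copy records the ids of samples it could not featurize
        # (the featurization itself is omitted in this sketch).
        dataset, tensor_names = None, None
        if return_problematic:
            return dataset, tensor_names, self.problematic_sample_ids
        return dataset, tensor_names

    def log_problematic(self):
        # Level 2: the single parent Processor, after aggregating the ids returned
        # by all workers, emits one summary line.
        if self.problematic_sample_ids:
            logger.error(f"Could not featurize these sample ids: {self.problematic_sample_ids}")

# Aggregation happens in the parent process while collecting worker results, e.g.:
# parent_processor.problematic_sample_ids.update(worker_problematic_ids)
```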

brandenchan merged commit 9122f17 into master on Dec 3, 2020