Clean logging and error handling #639

brandenchan · 2020-11-26T16:32:16Z

This PR addresses many of the issues in #590 regarding error handling and logging.

brandenchan · 2020-11-26T16:41:59Z

As it is right now, I have removed sample level error messages in NER that would, for example, say that a given sample has a label not in the label list, or that it doesn't have a label field at all.

Instead, failed featurization is handled at a higher level so that we get a single line mentioning how many samples are not properly featurized. However, since we chunk our data, we actually have one line per chunk that states how many problematic samples there are.

Pros:

Reduced number of error printouts

Cons:

No sample specific error messages
Potentially still quite a lot of lines of error logging

If we really want a single line of error printout, we will need to make changes to some very high level fns that will have impact across all tasks (e.g. Datasilo._dataset_from_chunk( ) or Processor.dataset_from_dicts( ) )

This is also a more general problem that will apply to QA so its worth strategizing about how we should tackle this

@Timoeller thoughts?

Timoeller

I think with problematic_ids we need some further testing, especially if other tasks break.

Timoeller · 2020-11-30T16:38:46Z

farm/data_handler/processor.py

        # This mode is for inference where we need to keep baskets
        if return_baskets:
            dataset, tensor_names = self._create_dataset(keep_baskets=True)
-            return dataset, tensor_names, self.baskets
+            ret = [dataset, tensor_names]


I am not sure about the consequences, but lists are mutable and tuples (the return type before) are not. This might introduce some pretty weird behaviour. Do we need it?

Please also check the other return statement above

Ok sure, I agree a tuple could be more stable. I wrote it like this to avoid having too many nested if statements since both return_baskets and return_problematic can alter the number of objects that are returned. I will convert ret into a tuple in the next commit to avoid this mutability.

Timoeller · 2020-11-30T16:43:14Z

farm/data_handler/data_silo.py

@@ -181,10 +181,12 @@ def _get_dataset(self, filename, dicts=None):
            if filename:
                desc += f" {filename}"
            with tqdm(total=len(dicts), unit=' Dicts', desc=desc) as pbar:
-                for dataset, tensor_names in results:
+                for dataset, tensor_names, problematic_samples in results:


Doesnt this statement break in case when the problematic samples are not returned?

In this context, results is always the output of _dataset_from_chunk( ) which I modified to always return 3 objects so I don't think this unpacking is an issue. As an asside, this function (i.e. _get_dataset( )) is the only method in our code base which calls _dataset_from_chunk( ) so it shouldn't be problematic that I've modified it to return 3 objects.

Ok, agreed, but this means we always need to return those ids.
We do have a flag in dataset_from_dicts, return_problematic=True, which seems to break the processing when set to false, right?

Timoeller · 2020-11-30T16:44:18Z

farm/data_handler/processor.py

-                    logger.error(f"Error message: {e}")
+                    curr_problematic_sample_ids.append(sample.id)
+        if curr_problematic_sample_ids:
+            self.problematic_sample_ids.update(curr_problematic_sample_ids)


Why are we updating the problematic_sample_ids here and when we collect results from MP?

The idea here is that we have two levels at which problematic_ids are stored. Within multiprocessing, each Processor needs to collect problematic_ids as it iterates through the baskets in its chunks. These are eventually returned out of multiprocessing via dataset_from_dicts( ). At this point, these returned problematic_ids are collected together in the non-multiprocessing Processor .

I went for this approach because it allows for us to implement logging problematic samples neatly as a method of the processor (Processor.log_problematic( )) as long as the problematic sample ids are stored in the Processor.

I do admit that it might be a little confusing coordinating between the higher level single Processor and the multiple lower level Processors. Do you have any thoughts or suggestions on this?

clean NER problematic sample logging

677130d

brandenchan added 9 commits November 27, 2020 10:46

Add structure to logging

343181b

WIP: Fixing NER

7a17841

Remove empty label list check

e7fd33c

Reintroduce per sample NER logging

43b7049

Fix deprecation message

3c0095b

Clean logging of problematic samples

f871250

Clean sample printout

eb5e112

Clean model loading messages

f30b61a

Fix failing tests

ca358df

brandenchan requested a review from Timoeller November 30, 2020 16:31

Timoeller suggested changes Nov 30, 2020

View reviewed changes

brandenchan added 2 commits December 1, 2020 11:14

Convert list to tuple

fa9829c

Clearer object returns, add inference support

200ccb9

brandenchan merged commit 9122f17 into master Dec 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean logging and error handling #639

Clean logging and error handling #639

brandenchan commented Nov 26, 2020

brandenchan commented Nov 26, 2020 •

edited

Loading

Timoeller left a comment

Timoeller Nov 30, 2020

brandenchan Dec 1, 2020 •

edited

Loading

Timoeller Nov 30, 2020

brandenchan Dec 1, 2020

Timoeller Dec 1, 2020

Timoeller Nov 30, 2020

brandenchan Dec 1, 2020

Clean logging and error handling #639

Clean logging and error handling #639

Conversation

brandenchan commented Nov 26, 2020

brandenchan commented Nov 26, 2020 • edited Loading

Timoeller left a comment

Choose a reason for hiding this comment

Timoeller Nov 30, 2020

Choose a reason for hiding this comment

brandenchan Dec 1, 2020 • edited Loading

Choose a reason for hiding this comment

Timoeller Nov 30, 2020

Choose a reason for hiding this comment

brandenchan Dec 1, 2020

Choose a reason for hiding this comment

Timoeller Dec 1, 2020

Choose a reason for hiding this comment

Timoeller Nov 30, 2020

Choose a reason for hiding this comment

brandenchan Dec 1, 2020

Choose a reason for hiding this comment

brandenchan commented Nov 26, 2020 •

edited

Loading

brandenchan Dec 1, 2020 •

edited

Loading