
Catch Errors/Warnings for integrity checks on non-raw datasets #102

Closed
maxschloegel opened this issue Mar 10, 2021 · 4 comments · Fixed by #106

@maxschloegel (Contributor) commented Mar 10, 2021

Issue Summary

Applying integrity checks to child datasets, or to datasets for which filters were defined, leads to a KeyError or to spurious integrity warnings.
Since integrity checks are only supposed to run on raw datasets anyway, these cases can be caught by raising a NotImplementedError. For clarity, see the problem description below.

Solution

As mentioned above, both issues can be resolved by raising a NotImplementedError when integrity checks are run on datasets with filters or on datasets further down the pipeline (hierarchy children).


Problem Details

Integrity checks on datasets with filters defined:

Defining filters for a dataset ds (they do not have to be applied; defining them is enough) and then running an integrity check on that dataset leads to a KeyError.
Example Code:

import dclab
from dclab.rtdc_dataset.check import IntegrityChecker

ds = dclab.new_dataset("calibration_beads.rtdc")
ic = IntegrityChecker(ds)
# defining the box filter is enough; it does not have to be applied
ds.config["filtering"]["fl1_max max"] = 1
ic.check()  # raises KeyError

This happens because the ds.config dict (which was changed when the filter was defined) now contains keys that are missing from the dfn.config_types dict.
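The mechanism can be sketched with plain dictionaries. This is an illustration only, not dclab's actual code: config_types and check_metadata are hypothetical stand-ins for dfn.config_types and the checker's metadata loop, and the real key sets are much larger.

```python
# Illustrative sketch only: `config_types` and `check_metadata` are
# hypothetical stand-ins for dclab's dfn.config_types and the metadata
# check loop.
config_types = {
    "filtering": {"enable filters": bool, "hierarchy parent": str},
}

def check_metadata(config):
    cues = []
    for section, keys in config.items():
        for key, value in keys.items():
            # A user-defined box filter such as "fl1_max max" is not a
            # known configuration key, so this lookup raises KeyError.
            expected_type = config_types[section][key]
            if not isinstance(value, expected_type):
                cues.append((section, key))
    return cues

config = {"filtering": {"enable filters": True, "fl1_max max": 1}}
try:
    check_metadata(config)
except KeyError as err:
    print("unknown configuration key:", err)
```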

Integrity checks on child-datasets:

When creating a child dataset from a parent dataset, running integrity checks on the child dataset introduces new warnings, even though it contains the exact same data:
Example Code:

import dclab
from dclab.rtdc_dataset.check import IntegrityChecker

ds = dclab.new_dataset("calibration_beads.rtdc")
child = dclab.new_dataset(ds)  # hierarchy child of ds
ic = IntegrityChecker(ds)
ic_child = IntegrityChecker(child)

The result of ic_child.check() then contains <ICue: 'Metadata: fluorescence channel count inconsistent' at ...>.

This is because __getitem__() of the child refers to the parent's ds._events for scalar features, presumably to save storage. The corresponding feature names therefore do not need to be present in child._events.keys().
These keys (or rather their absence) are also used to generate certain warnings, as is the case for the metadata warning "fluorescence channel count inconsistent", even though the fluorescence channel count is consistent with the data of the child dataset.
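The lookup described above can be sketched with two minimal classes. Parent, Child, and their attributes are illustrative names, not dclab's actual implementation:

```python
# Illustrative sketch of the hierarchy lookup; Parent/Child and their
# attributes are made-up names, not dclab's actual classes.
class Parent:
    def __init__(self, events):
        self._events = events

class Child:
    def __init__(self, parent):
        self.parent = parent
        self._events = {}  # scalar features are not copied to the child

    def __getitem__(self, feat):
        # Fall back to the parent's event store to save memory.
        if feat in self._events:
            return self._events[feat]
        return self.parent._events[feat]

parent = Parent({"fl1_max": [1.0, 2.0, 3.0]})
child = Child(parent)
assert child["fl1_max"] == [1.0, 2.0, 3.0]  # data is accessible ...
assert "fl1_max" not in child._events  # ... but a key-based check misses it
```

A check that infers the available features from child._events.keys() therefore draws the wrong conclusion, even though the data itself is reachable through __getitem__().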

@maxschloegel maxschloegel self-assigned this Mar 10, 2021
@maxschloegel (Contributor, Author)

I was looking for an easy way to check whether the apply_filter() function of a dataset has been called, but could not find one. It would allow me to raise the NotImplementedError for datasets with applied filters when check() is run on an IntegrityChecker, without a more complicated comparison.
I was thinking about introducing a flag that indicates whether filters have been applied to a dataset (e.g. was_filtered). Is that OK @paulmueller, or is there a way to check this that I have missed?

@paulmueller (Member)

The only way to find out whether any filters have had an effect on a dataset is to check whether np.sum(ds.filter.all) == len(ds) evaluates to False. I think this is one thing that should be done. But checking whether a filter has been applied does not help with the first example (ds.config["filtering"]["fl1_max max"] = 1), because only the configuration was changed; no filter was applied. I noticed that there is a # TODO entry here:

https://github.com/ZELLMECHANIK-DRESDEN/dclab/blob/861c9fe2ae88b8a273b64f980f56a048e1708afc/dclab/rtdc_dataset/check.py#L533-L535

right below the lines that throw the KeyError. I assume that correctly treating the configuration sections filtering and calculation (including the min/max box filters) would resolve this issue.

(Regarding the integrity checks on hierarchy children, a simple isinstance check would suffice.)
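The two guards suggested above could be sketched as follows. This is a hedged illustration with made-up names (Dataset, ChildDataset, filter_mask, assert_raw_dataset); dclab itself uses ds.filter.all and its hierarchy-child class instead.

```python
# Hedged sketch of the two proposed guards; Dataset, ChildDataset,
# filter_mask, and assert_raw_dataset are illustrative names only.
class Dataset:
    def __init__(self, n):
        self.filter_mask = [True] * n  # stand-in for ds.filter.all

    def __len__(self):
        return len(self.filter_mask)

class ChildDataset(Dataset):
    """Stand-in for a hierarchy child."""

def assert_raw_dataset(ds):
    # Guard 1: a filter that excluded any event means "not raw"
    # (mirrors the np.sum(ds.filter.all) == len(ds) check).
    if sum(ds.filter_mask) != len(ds):
        raise NotImplementedError("integrity checks only run on raw datasets")
    # Guard 2: hierarchy children are never raw.
    if isinstance(ds, ChildDataset):
        raise NotImplementedError("integrity checks only run on raw datasets")

ds = Dataset(5)
assert_raw_dataset(ds)  # passes: raw, unfiltered dataset
ds.filter_mask[0] = False
try:
    assert_raw_dataset(ds)
except NotImplementedError as err:
    print("caught:", err)
```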

@maxschloegel (Contributor, Author) commented Mar 16, 2021

Ok, thank you!
Would it be reasonable to split this into two issues?

  • The first one covers the original issue of running integrity checks on non-raw datasets (such as child datasets and datasets with applied filters).
  • The second one deals with the implementation of the TODO you mentioned in your post, which implicitly resolves the KeyError issue.

The two issues do not seem to be directly related, and since we will improve the implementation of the check_metadata_bad() function anyway, we can deal with the KeyError at that point.

@paulmueller (Member)

Yes, good plan!

paulmueller pushed a commit that referenced this issue Mar 18, 2021
Integrity checks are only supposed to run on raw datasets, not
on datasets with applied filters or hierarchy datasets. To catch
misuse, a 'NotImplementedError' was added for such cases.
Fixes #102