
Catch Errors/Warnings for integrity checks on non-raw datasets #102

Closed
maxschloegel opened this issue Mar 10, 2021 · 4 comments · Fixed by #106

@maxschloegel (Contributor) commented Mar 10, 2021

Issue Summary

Applying integrity checks to child datasets, or to datasets for which filters were defined, leads to a KeyError or to spurious integrity warnings.
Since integrity checks are only supposed to run on raw datasets anyway, these cases can be caught by raising a NotImplementedError. For clarity, see the problem description below.

Solution

As mentioned above, both issues can be resolved by raising a NotImplementedError when integrity checks are run on datasets with filters or on datasets further down the pipeline (hierarchy children).


Problem Details

Integrity checks on datasets with filters defined:

Defining filters for a dataset ds (they do not have to be applied; defining them is enough) and then running an integrity check on that dataset leads to a KeyError.
Example Code:

import dclab
from dclab.rtdc_dataset.check import IntegrityChecker

ds = dclab.new_dataset("calibration_beads.rtdc")
ic = IntegrityChecker(ds)
# defining the box filter is enough; it does not have to be applied
ds.config["filtering"]["fl1_max max"] = 1
ic.check()  # raises KeyError

This happens because the ds.config dict (which was changed when the filter was defined) now contains keys that are missing from the dfn.config_types dict.
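The mechanism can be sketched with plain dictionaries. This is an illustration only, not dclab's actual code: config_types and check_metadata are hypothetical stand-ins for dfn.config_types and the checker's metadata loop, and the real key sets are much larger.

```python
# Illustrative sketch only: `config_types` and `check_metadata` are
# hypothetical stand-ins for dclab's dfn.config_types and the metadata
# check loop.
config_types = {
    "filtering": {"enable filters": bool, "hierarchy parent": str},
}

def check_metadata(config):
    cues = []
    for section, keys in config.items():
        for key, value in keys.items():
            # A user-defined box filter such as "fl1_max max" is not a
            # known configuration key, so this lookup raises KeyError.
            expected_type = config_types[section][key]
            if not isinstance(value, expected_type):
                cues.append((section, key))
    return cues

config = {"filtering": {"enable filters": True, "fl1_max max": 1}}
try:
    check_metadata(config)
except KeyError as err:
    print("unknown configuration key:", err)
```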

Integrity checks on child-datasets:

When creating a child dataset from a parent dataset, running integrity checks on the child dataset introduces new warnings, even though it contains the exact same data:
Example Code:

import dclab
from dclab.rtdc_dataset.check import IntegrityChecker

ds = dclab.new_dataset("calibration_beads.rtdc")
child = dclab.new_dataset(ds)  # hierarchy child of ds
ic = IntegrityChecker(ds)
ic_child = IntegrityChecker(child)

The result of ic_child.check() then contains <ICue: 'Metadata: fluorescence channel count inconsistent' at ...>.

This is because __getitem__() of the child refers to the parent's ds._events for scalar features, presumably to save storage. The corresponding feature names therefore do not need to be present in child._events.keys().
These keys (or rather their absence) are also used to generate certain warnings, as is the case for the metadata warning "fluorescence channel count inconsistent", even though the fluorescence channel count is consistent with the data of the child dataset.
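The lookup described above can be sketched with two minimal classes. Parent, Child, and their attributes are illustrative names, not dclab's actual implementation:

```python
# Illustrative sketch of the hierarchy lookup; Parent/Child and their
# attributes are made-up names, not dclab's actual classes.
class Parent:
    def __init__(self, events):
        self._events = events

class Child:
    def __init__(self, parent):
        self.parent = parent
        self._events = {}  # scalar features are not copied to the child

    def __getitem__(self, feat):
        # Fall back to the parent's event store to save memory.
        if feat in self._events:
            return self._events[feat]
        return self.parent._events[feat]

parent = Parent({"fl1_max": [1.0, 2.0, 3.0]})
child = Child(parent)
assert child["fl1_max"] == [1.0, 2.0, 3.0]  # data is accessible ...
assert "fl1_max" not in child._events  # ... but a key-based check misses it
```

A check that infers the available features from child._events.keys() therefore draws the wrong conclusion, even though the data itself is reachable through __getitem__().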

@maxschloegel maxschloegel self-assigned this Mar 10, 2021
@maxschloegel (Contributor, Author)

I was looking for an easy way to check whether the apply_filter() function of a dataset has been called, but could not find one. It would allow me to raise the NotImplementedError for datasets with applied filters when check() is run on an IntegrityChecker, without a more complicated comparison.
I was thinking about introducing a flag that indicates whether filters have been applied to a dataset (e.g. was_filtered). Is that OK @paulmueller, or is there a way to check this that I have missed?

@paulmueller (Member)

The only way to find out whether any filters have had an effect on a dataset is to check whether np.sum(ds.filter.all) == len(ds) evaluates to False. I think this is one thing that should be done. But checking whether a filter has been applied does not help with the first example (ds.config["filtering"]["fl1_max max"] = 1), because only the configuration was changed; no filter was applied. I noticed that there is a # TODO entry here:

https://github.com/ZELLMECHANIK-DRESDEN/dclab/blob/861c9fe2ae88b8a273b64f980f56a048e1708afc/dclab/rtdc_dataset/check.py#L533-L535

right below the lines that throw the KeyError. I assume that correctly treating the configuration sections filtering and calculation (including the min/max box filters) would resolve this issue.

(Regarding the integrity checks on hierarchy children, a simple isinstance check would suffice.)
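The two guards suggested above could be sketched as follows. This is a hedged illustration with made-up names (Dataset, ChildDataset, filter_mask, assert_raw_dataset); dclab itself uses ds.filter.all and its hierarchy-child class instead.

```python
# Hedged sketch of the two proposed guards; Dataset, ChildDataset,
# filter_mask, and assert_raw_dataset are illustrative names only.
class Dataset:
    def __init__(self, n):
        self.filter_mask = [True] * n  # stand-in for ds.filter.all

    def __len__(self):
        return len(self.filter_mask)

class ChildDataset(Dataset):
    """Stand-in for a hierarchy child."""

def assert_raw_dataset(ds):
    # Guard 1: a filter that excluded any event means "not raw"
    # (mirrors the np.sum(ds.filter.all) == len(ds) check).
    if sum(ds.filter_mask) != len(ds):
        raise NotImplementedError("integrity checks only run on raw datasets")
    # Guard 2: hierarchy children are never raw.
    if isinstance(ds, ChildDataset):
        raise NotImplementedError("integrity checks only run on raw datasets")

ds = Dataset(5)
assert_raw_dataset(ds)  # passes: raw, unfiltered dataset
ds.filter_mask[0] = False
try:
    assert_raw_dataset(ds)
except NotImplementedError as err:
    print("caught:", err)
```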

@maxschloegel (Contributor, Author) commented Mar 16, 2021

Ok, thank you!
Would it be reasonable to split this into two issues?

  • The first one covers the original issue of running integrity checks on non-raw datasets (such as child datasets and datasets with applied filters).
  • The second one deals with the implementation of the TODO you mentioned in your post, which implicitly resolves the KeyError issue.

The two issues do not seem to be directly related, and since we will improve the implementation of the check_metadata_bad() function anyway, we can deal with the KeyError at that point.

@paulmueller (Member)

Yes, good plan!

paulmueller pushed a commit that referenced this issue Mar 18, 2021
Integrity checks are only supposed to run on raw datasets, not
on datasets with applied filters or hierarchy datasets. To catch
misuse, a 'NotImplementedError' was added for such cases.
Fixes #102