[FEAT] ability to not log actual data in result or display #2320

Open
MichaelMarien opened this issue Feb 3, 2023 · 4 comments
Labels
feature, good first issue, linear

Comments

@MichaelMarien
Contributor

Is your feature request related to a problem? Please describe.
Some industries have strict regulations on handling sensitive client data: how it is stored, who can access it, and so on. Some checks keep small samples of the data as part of their output (both result and display), e.g. TrainTestSamplesMix. This can conflict with those regulations, because model metrics are stored and accessed in systems that do not comply with all of the restrictions.

Describe the solution you'd like
Provide a boolean flag "log_example_data" for each relevant check, which can be set at both the check level and the suite level.
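
To make the request concrete, here is a minimal sketch of how such a flag could behave. Everything in it is hypothetical: `SamplesMixCheck`, `Suite`, and the result keys are stand-ins written for illustration and are not the deepchecks API. Only the flag name "log_example_data" comes from the proposal above.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, List, Optional, Tuple

Row = Tuple[Any, ...]


@dataclass
class SamplesMixCheck:
    """Hypothetical stand-in for a check; NOT the real deepchecks class."""

    log_example_data: bool = True

    def run(self, train: List[Row], test: List[Row]) -> Dict[str, Any]:
        train_rows = set(train)
        leaked = [row for row in test if row in train_rows]
        # The aggregate metric is always safe to store.
        result: Dict[str, Any] = {"ratio": len(leaked) / len(test) if test else 0.0}
        # Raw rows are attached only when the flag allows it.
        if self.log_example_data:
            result["examples"] = leaked[:5]
        return result


@dataclass
class Suite:
    """Hypothetical suite; a non-None flag overrides every check's own setting."""

    checks: List[SamplesMixCheck] = field(default_factory=list)
    log_example_data: Optional[bool] = None

    def run(self, train: List[Row], test: List[Row]) -> List[Dict[str, Any]]:
        results = []
        for check in self.checks:
            if self.log_example_data is not None:
                # Copy the check so the caller's instance is left unchanged.
                check = replace(check, log_example_data=self.log_example_data)
            results.append(check.run(train, test))
        return results
```

With this shape, `Suite([SamplesMixCheck()], log_example_data=False)` would return only the aggregate ratio for every check, regardless of how each check was configured.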

@github-actions github-actions bot added the needs triage and linear labels Feb 3, 2023
@noamzbr
Collaborator

noamzbr commented Feb 8, 2023

Thanks for the suggestion @MichaelMarien!

@noamzbr noamzbr added the feature and good first issue labels and removed the needs triage label Feb 8, 2023
@MichaelMarien
Contributor Author

Hi @noamzbr, if you can give some guidance on the approach, I wouldn't mind making a PR myself.

@MichaelMarien
Contributor Author

It turns out most checks already have a parameter 'n_to_show' that makes it possible to suppress data in the display. I opened PR #2337 to add this to the TrainTestSamplesMix check as well (the only one I could find that did not have it).

However, I'm still unsure how to address this in general, for instance how to make sure that the result (not just the display) of TrainTestSamplesMix does not contain data (currently it does). Following the same approach as for the displays would introduce a new parameter in the context and a lot of if-else logic in every check, which might be hard to maintain.

@noamzbr
Collaborator

noamzbr commented Mar 9, 2023

Hi @MichaelMarien, admittedly I don't have a great solution for this, and I agree that adding a new parameter would add significant complexity and technical debt. I think that for now the best solution in such a case is for the user to anonymize their data before running the checks.

One solution we could consider is adding a function to deepchecks that anonymizes two datasets automatically (while making sure they are anonymized in a way that still enables comparison between the datasets). If you're interested in implementing something like this I'd be glad to discuss.
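
As a starting point for that discussion, here is a rough sketch of what "anonymized in a way that enables comparison" could mean in practice. It is not an existing deepchecks function: `pseudonymize` and `anonymize_pair` are hypothetical names, and rows are plain dicts rather than deepchecks Dataset objects. The idea is to replace each sensitive value with a keyed hash (HMAC-SHA256) using the same secret key for both datasets, so equal values in train and test still map to equal tokens.

```python
import hashlib
import hmac
import secrets
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, object]


def pseudonymize(value: object, key: bytes) -> str:
    """Map a value to a stable token; the same key and value give the same token."""
    digest = hmac.new(key, repr(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_pair(
    train: List[Row],
    test: List[Row],
    sensitive_columns: Sequence[str],
    key: Optional[bytes] = None,
) -> Tuple[List[Row], List[Row]]:
    """Hypothetical helper: pseudonymize the given columns in both datasets.

    One key is shared by both datasets, so duplicate rows and shared
    categories remain detectable after anonymization.
    """
    # A fresh random key per call unless the caller supplies one.
    key = key if key is not None else secrets.token_bytes(32)
    columns = set(sensitive_columns)

    def _anonymize(rows: List[Row]) -> List[Row]:
        return [
            {col: pseudonymize(val, key) if col in columns else val
             for col, val in row.items()}
            for row in rows
        ]

    return _anonymize(train), _anonymize(test)
```

Two caveats under these assumptions: this is pseudonymization rather than full anonymization, so whether it satisfies a given regulation would need to be checked separately, and hashing a numeric column destroys its ordering, so it only suits columns that the checks compare by equality.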
