
report not showing when using a bit more data #47

Closed
rmminusrslash opened this issue Aug 11, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@rmminusrslash

Hey,

I wanted to run against a production dataset of small-mid size:

65 columns, 150K points in each dataframe.

If I reduce the dataset to one feature, the report shows. If I use all features, the report grows from 16 MB to 600 MB and does not display (whether saved or in Jupyter).

@emeli-dral
Contributor

Hey @rmminusrslash,
Thanks for reporting! Unfortunately, this is a current limitation of the tool.

The report is large because the tool stores all the data necessary to generate interactive plots directly inside the HTML. We plan to fix it when we create a service version of the tool (where we decouple the data storage and the browser-based web service).

For now there are two workarounds:

  1. Use a sampling strategy for your dataset, for instance random sampling. In a Jupyter notebook, this can be done directly with pandas. For the command line interface, we have a configuration option: you can choose random sampling or pick every n-th row.
  2. Use a JSON profile. This way, Evidently calculates the metrics and statistical tests, and they can be logged or displayed elsewhere. We have an example for MLflow at https://docs.evidentlyai.com/step-by-step-guides/integrations/evidently-+-mlflow, and I am now working on one for Grafana.
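The first workaround, random sampling with pandas before building the report, might look like the sketch below. The `downsample` helper and the 10,000-row cap are illustrative, not part of Evidently's API:

```python
import pandas as pd


def downsample(df: pd.DataFrame, max_rows: int = 10_000, seed: int = 42) -> pd.DataFrame:
    """Return a random sample of at most max_rows rows, preserving row order."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed).sort_index()


# Sample both dataframes before passing them to the report, e.g.:
# reference = downsample(reference)
# current = downsample(current)
```

Sorting the sampled index back into order keeps any time-ordered structure in the data intact, which matters for drift plots.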

We understand this limits how you can use the tool right now, and we are working hard on a more fully featured version!

@rmminusrslash
Author

rmminusrslash commented Aug 16, 2021

Hey @emeli-dral,

Ah, I probably should have been clearer about what I was asking. I tried sampling once I figured out the root cause; up to 10K datapoints worked.

Would it make sense to

  • add sampling as the default when the dataset exceeds the current limits (and display a message that sampling happened)
  • if you decide against that, at least raise an unsupported-operation exception that mentions the sampling option, and document the limitation

The current behavior of failing silently might not be ideal until you release the full version (unless you expect people to try the tool mostly with toy data).
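The two suggested behaviors could be sketched roughly as follows. All names here (`MAX_ROWS`, `DatasetTooLargeError`, `check_or_sample`) are hypothetical, purely to illustrate the proposal, and the fixed threshold is exactly the kind of infrastructure-dependent constant discussed below:

```python
import warnings

import pandas as pd

MAX_ROWS = 10_000  # illustrative limit; a real threshold would depend on the user's machine


class DatasetTooLargeError(ValueError):
    """Raised when a dataset exceeds the size the HTML report can handle."""


def check_or_sample(df: pd.DataFrame, auto_sample: bool = True) -> pd.DataFrame:
    """Suggestion 1: sample automatically with a warning; suggestion 2: raise with guidance."""
    if len(df) <= MAX_ROWS:
        return df
    if auto_sample:
        warnings.warn(
            f"Dataset has {len(df)} rows; randomly sampling {MAX_ROWS} for display."
        )
        return df.sample(n=MAX_ROWS, random_state=0)
    raise DatasetTooLargeError(
        f"Dataset has {len(df)} rows, above the {MAX_ROWS}-row display limit; "
        "consider random sampling, e.g. df.sample(n=10_000)."
    )
```

Either path makes the failure explicit instead of producing an HTML file that silently refuses to render.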

@emeli-dral
Contributor

Hey @rmminusrslash ,
thanks for the additional details!

We thought about adding an error message based on data size, but the limit would depend on the user's infrastructure, especially when run locally, so it would be hard to set a universal threshold for when sampling should kick in. As a priority, we are also working right now to speed up the UI, which should address a good share of the cases where reports are too large to display. Hopefully, it will help a lot 🤞

We are also thinking about adding a flag later that users can set themselves ("large dataset"), which would generate a variation of the report best suited to larger datasets. It would include not only sampling but also aggregated views for some parts of the report.

Agreed on your point about making the large-dataset limitation and the sampling option clearer for Jupyter notebooks: we have now added this to the Quick-start part of the docs.

@emeli-dral emeli-dral added the bug Something isn't working label Jan 20, 2022
@emeli-dral
Contributor

Reports now omit raw data plots by default, which reduces report size significantly.
