
Document best practice for sampling mode #33

Closed
jbe456 opened this issue May 7, 2020 · 1 comment

jbe456 commented May 7, 2020

Description

It's not obvious to somebody using the API when to call load_all_data.

I see two reasons to keep sampling on when using large files:

  • I'm exploring a data set within a notebook, so I might create temporary measures or intermediate queries, or even try to join stores, and I want all of those results quickly.
  • I haven't created a cube yet, and since data from stores is reloaded after each join and create_cube call, I'd rather load everything afterwards.

In https://github.com/atoti/notebooks/blob/master/retail/pricing-simulations-around-product-classes/main.ipynb I ended up writing:

# We can now load all the data so that visualizations operate on the entire dataset.
# NB: as a best practice, to optimize speed while exploring your data, we recommend keeping the default sampling mode enabled.
#     Once the model is ready, as is the case in this notebook, you may call session.load_all_data() after creating the cube.
session.load_all_data()

Can you confirm what the best practice is? Could you also document it somewhere?

@jbe456 added the 📝 docs (docs missing or incorrect) and question ❓ labels May 7, 2020
@fabiencelier self-assigned this May 11, 2020
@fabiencelier (Contributor) commented:

Atoti can handle very large volumes of data while still providing fast answers to queries. However, loading a large amount of data during the modeling phase of the application is rarely a good idea, because creating stores, cubes, hierarchies, and measures are all operations that take more time when there is more data.

Sampling gives you immediate feedback for each cell you run, so as a rule of thumb you should call session.load_all_data as late as possible in your project, even as the last line of your notebook if you can.

Think of it as building your model on a sample of the data first, then replaying everything with the whole dataset; except that instead of re-running each cell, you just call session.load_all_data.
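A minimal sketch of that workflow, assuming the atoti Python API as of 2020 (the file name, keys, and store name below are hypothetical):

import atoti as tt

session = tt.create_session()

# Sampling is on by default, so only a sample of the file is loaded here.
sales = session.read_csv("sales.csv", keys=["sale_id"], store_name="Sales")

# Build the whole model on the sample: joins, cube, hierarchies, measures...
cube = session.create_cube(sales)

# ...explore and iterate until the model is right...

# Last line: replay the data loading once, on the full dataset.
session.load_all_data()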

I encourage you to read this Medium article about sampling.

I will add some documentation to the session.load_all_data method to clarify this.

@fabiencelier added this to the Next release milestone May 14, 2020