
Document best practice for sampling mode #33

Closed
jbe456 opened this issue May 7, 2020 · 1 comment

jbe456 commented May 7, 2020

Description

It's not obvious to somebody using the API when to call load_all_data.

I see two reasons to keep sampling on when using large files:

  • I'm exploring a data set within a notebook, so I might create temporary measures or intermediate queries, or even try to join stores, and I want all of those results quickly.
  • I haven't created a cube yet, and since data from stores is reloaded after each join and create_cube call, I'd rather load everything afterwards.

In https://github.com/atoti/notebooks/blob/master/retail/pricing-simulations-around-product-classes/main.ipynb I ended up writing:

# We can now load all the data so that visualizations operate on the entire dataset.
# NB: as a best practice, to optimize speed while exploring your data, we recommend keeping the default sampling mode enabled.
#     Once the model is ready, as is the case in this notebook, you may call session.load_all_data() after creating the cube.
session.load_all_data()

Can you confirm what the best practice is? Could you also document it somewhere?

@jbe456 added the 📝 docs (docs missing or incorrect) and question ❓ labels May 7, 2020
@fabiencelier self-assigned this May 11, 2020
@fabiencelier (Contributor) commented:

Atoti can handle very large volumes of data while still providing fast answers to queries. However, loading a large amount of data during the modeling phase of the application is rarely a good idea, because creating stores, cubes, hierarchies, and measures are all operations that take more time when there is more data.

Sampling gives you immediate feedback for each cell you run, so as a rule of thumb you should call session.load_all_data as late as possible in your project, even as the last line of your notebook if you can.

Think of it as building your model on a sample of the data first, then replaying everything with the whole dataset; except that instead of re-running each cell, you just call session.load_all_data.
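A minimal sketch of that workflow, assuming the atoti Python API as of 2020 (the file name, keys, and store name below are hypothetical):

import atoti as tt

session = tt.create_session()

# Sampling is on by default, so only a sample of the file is loaded here.
sales = session.read_csv("sales.csv", keys=["sale_id"], store_name="Sales")

# Build the whole model on the sample: joins, cube, hierarchies, measures...
cube = session.create_cube(sales)

# ...explore and iterate until the model is right...

# Last line: replay the data loading once, on the full dataset.
session.load_all_data()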

I encourage you to read this Medium article about sampling.

I will add some documentation to the session.load_all_data method to clarify this.

@fabiencelier added this to the Next release milestone May 14, 2020