dask-mini-tutorial lessons #2

Merged
merged 17 commits into from
Sep 29, 2021

Conversation

ncclementi
Contributor

This is the dask-mini-tutorial. The first 3 notebooks were built from several previous tutorials, as well as material created by Coiled people for previous tutorials.
The last two notebooks are very similar to the material from data-science-at-scale, with small modifications. Please feel free to take a look and comment. I have not spell-checked the material, but code-wise everything runs as expected in an environment created from the environment.yml file in the binder directory.

cc: @pavithraes @rrpelgrim @jrbourbeau

@pavithraes

pavithraes commented Sep 27, 2021

@ncclementi I love it! Thanks for creating this!

On a high level:

  • I feel like this material might take ~3h to cover at a reasonable pace, so it's probably worth trimming down some parts. What do you think?
  • I noticed some typos in the notebooks, just FYI, I generally use this spellchecker with JupyterLab :)
  • I bulk-executed all the cells, and everything seems to be working fine!

Specific notes:

  • In 1-delayed notebook:
    • There's a heading "Delayed interface as a decorator", where you show compute using @delayed -- I thought the previous computations with delayed(inc)(1) are also delayed being used as a decorator?
  • In 2-schedulers notebook:
    • Under "Single Machine Schedulers" -> under "processes": mention that it's the default scheduler for Dask Bag?
    • Under "Futures interface": We say "Another interesting feature of the distributed scheduler is the futures interface." Personally, I prefer calling Futures a Dask 'Collection' or 'Interface', and not a 'feature' -- I feel it may cause confusion otherwise. We can say something like "The distributed scheduler enables another Dask Collection -- the Futures Interface"
    • Under Extra resources: we can link to Futures documentation, tutorial, etc?
    • I think it might be nice to decouple Futures from Schedulers and create a separate notebook for it, thoughts?
    • Under "Distributed Scheduler", maybe we can discuss some plots in detail: task stream, progress bar, workers memory?
  • In 4-machine learning notebook:
    • The "KMeans from Dask-ml" heading can be a level 3 heading?

@ncclementi
Contributor Author

ncclementi commented Sep 27, 2021

> @ncclementi I love it! Thanks for creating this!
> On a high level:
> I feel like this material might take ~3h to cover at a reasonable pace, so it's probably worth trimming down some parts. What do you think?

I will give it a test run just by myself and see how long it takes, but yes I might need to cut it even more.

> I noticed some typos in the notebooks, just FYI, I generally use this spellchecker with JupyterLab :)

Great tip, I will take a look. In general I use Jupyter notebooks because I like them more, and I haven't been able to set up a spell checker there, but I'll switch to Lab to fix this.

> Specific notes:
> In 1-delayed notebook:
> There's a heading "Delayed interface as a decorator", where you show compute using @delayed -- I thought the previous computations with delayed(inc)(1) are also delayed being used as a decorator?

That's an interesting point. I always thought that decorators were when we "decorate" a function with @, but maybe I'm wrong. I noticed that the documentation says the delayed() function "decorates your function", but then there is a separate section named "Decorator".
I'll dig deeper into this.

EDIT: After talking to James, he cleared up my doubt: both are decorators. Thank you @pavithraes for pointing this out, I'll push a fix for this soon.
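
For reference, here is a minimal sketch of the two spellings discussed above. It assumes `inc` is a small increment helper like the one in the notebook (its body here is a guess), and `inc_lazy` is a hypothetical name used only for illustration:

```python
from dask import delayed


def inc(x):
    # Assumed stand-in for the notebook's `inc` helper.
    return x + 1


# Function-call form: wrap an existing function at the call site.
a = delayed(inc)(1)


# Decorator form: `@delayed` is shorthand for `inc_lazy = delayed(inc_lazy)`.
@delayed
def inc_lazy(x):
    return x + 1


b = inc_lazy(1)

# Both are lazy Delayed objects; nothing runs until compute() is called.
result_a = a.compute()
result_b = b.compute()
```

Both forms apply the same wrapper to the function, so the choice is mostly about whether the function should always be lazy (decorator) or only lazy at some call sites (function call).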

> In 2-schedulers notebook:
> Under "Single Machine Schedulers" -> under "processes": mention that it's the default scheduler for Dask Bag?

Done.
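
A small sketch of what that note refers to, assuming a toy bag (the data and variable names are illustrative, not taken from the notebook). Only the explicit scheduler choices are executed here, so the sketch stays light:

```python
import dask.bag as db

# A tiny bag split into two partitions.
b = db.from_sequence(range(10), npartitions=2)
total = b.map(lambda x: x * 2).sum()

# Calling total.compute() with no arguments would use the bag default,
# which is the multiprocessing ("processes") scheduler.
# The scheduler can also be chosen per call:
threaded_result = total.compute(scheduler="threads")
sync_result = total.compute(scheduler="synchronous")
```

The `scheduler=` keyword works the same way for other collections, which makes it easy to compare single-machine schedulers on the same computation.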

> Under "Futures interface": We say "Another interesting feature of the distributed scheduler is the futures interface." Personally, I prefer calling Futures a Dask 'Collection' or 'Interface', and not a 'feature' -- I feel it may cause confusion otherwise. We can say something like "The distributed scheduler enables another Dask Collection -- the Futures Interface"

That's a great point, I modified it.
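
As a rough illustration of the Futures interface, here is a sketch that assumes `dask.distributed` is installed. The `inc` and `add` helpers are hypothetical, and the in-process cluster settings are chosen only to keep the example small:

```python
from dask.distributed import Client


def inc(x):
    return x + 1


def add(x, y):
    return x + y


# In-process cluster (threads only) with the dashboard disabled.
client = Client(
    processes=False,
    n_workers=1,
    threads_per_worker=2,
    dashboard_address=None,
)

# submit() returns a Future immediately; the work starts in the background.
f1 = client.submit(inc, 1)
f2 = client.submit(inc, 10)

# Futures can be passed as arguments to further submissions.
f3 = client.submit(add, f1, f2)

# result() blocks until the value is available.
total = f3.result()

client.close()
```

Unlike `delayed`, nothing here waits for an explicit `compute()` call: each `submit` starts work right away, which is why Futures need the distributed scheduler.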

> Under Extra resources: we can link to Futures documentation, tutorial, etc?

Added the docs. There is no separate lesson on Futures in the tutorial, but there is something in the advanced delayed lesson, which I added.

> I think it might be nice to decouple Futures from Schedulers and create a separate notebook for it, thoughts?

Since it's just a small section, I feel we can leave it there for now. Maybe in the future (no pun intended) we can reconsider this as a separate, more extended section on Futures.

> Under "Distributed Scheduler", maybe we can discuss some plots in detail: task stream, progress bar, workers memory?

Yes, I was planning on doing this in the live session, but I wasn't sure how to put it in the notebook itself. Probably with some screenshots. I will add something then.

> In 4-machine learning notebook:
> The "KMeans from Dask-ml" heading can be a level 3 heading?

Yup.

@ncclementi
Contributor Author

I'm including all the comments from @pavithraes. I will add a more extended dashboard section (see #3) after merging this PR, because I want to test the binder build first.

@ncclementi ncclementi merged commit eda4206 into main Sep 29, 2021