
Rewrite #29

Merged
merged 12 commits into dask:master from martindurant:rewrite on Feb 23, 2017

Conversation

@martindurant
Member

commented Feb 7, 2017

I am putting new notebooks with numbers prepended with _, before changing any of the existing ones.

Aside from some updating (delayed, parquet), distributed is assumed, and a new section describes it in some detail; the order has changed, since I thought it better to describe how the system works before showing applications. I may be wrong about that. More explanation up front, or comparison to other tools like ipyparallel and the built-in multiprocessing module, may help.

martindurant added some commits Jan 31, 2017

Begin with new modules
Side-by-side with old ones for now, because diff is hard for ipynb
Update dataframe and bag
Need work on arrays, because of hdf5 problem with distributed.
May yet put in some zarr.
@martindurant

Member Author

commented Feb 7, 2017

Fixes #14, #15, #16, #17

@mrocklin

Member

commented Feb 7, 2017

cc @JohnCrickett this may interest you

@mrocklin

Member

commented Feb 7, 2017

I looked very briefly through the changes here and things seem pretty good to me. I'm still in favor of removing raw graphs from the notebooks though. I don't think it's important to use or even to understand what Dask is doing. One might consider rewriting foundations using dask.delayed, and building up intuition using graphviz visualizations along with python for loopy code.

The foundations notebook also introduces other topics like the threading and subprocess module which may also be somewhat distracting.

@martindurant

Member Author

commented Feb 7, 2017

_02_foundations is written using delayed. You mean removing any mention of graphs at all?

I had wanted to put something in about background execution in "normal" usage, the kinds of things people might have seen before (but perhaps not). I could colour-code the sections by "important", "power-user", or somesuch, which is what the training group does.

@mrocklin

Member

commented Feb 7, 2017

I think that discussing task graphs is valuable to provide intuition on how dask works. I think that the fact that we store these as python dictionaries of tuples may be distracting. Currently it looks like the student constructs graphs with delayed and then looks at the .dask attribute. I wonder if having them call .visualize would be a better approach.
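The suggested flow could look something like the following sketch (assuming dask is installed; `inc` and `add` are illustrative functions, not taken from the tutorial):

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

# Build a small task graph lazily
total = add(inc(1), inc(2))

# Rather than inspecting the raw dict in total.dask,
# render the graph as a picture (requires graphviz):
# total.visualize()

print(total.compute())  # 5
```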

@mrocklin

Member

commented Feb 7, 2017

Looking at foundations again I'll also suggest that we drop the partial/lambda/closures discussion. This seems like a lot of technology to throw at the student. I get that there is a bit of gain here (delayed execution is a thing) but it seems like the sort of thing that might contribute towards overwhelming the student.

@martindurant

Member Author

commented Feb 7, 2017

How about I move the more detailed stuff to the end of foundations, and explicitly say "for the interested", so that people are not put off?

@mrocklin

Member

commented Feb 7, 2017

@JohnCrickett

Contributor

commented Feb 7, 2017

Some comments on 01:

  • I think it's worth including details on graphviz for those of us stuck on Windows etc.
  • Is it worth making 'python prep.py' executable within the notebook, for those who have already started the notebook or overlooked the readme?
@JohnCrickett

Contributor

commented Feb 7, 2017

Some comments on 02:

  • It almost feels like Example 1 isn't quite finished; it leaves the reader asking 'so what?'. Whereas it seems this is an intro to blocked algorithms? Perhaps a sentence or two to tie this to Dask would round it off well?
  • Maybe I'm biased as I usually end up working on Windows, but it would be nice if Example 2 were not platform specific. Downloading a web page shows the same thing.
  • Perhaps a few more words would help the flow from these examples into what Dask is, building on what you've illustrated?

More later....

@martindurant

Member Author

commented Feb 8, 2017

"I think it's worth including details on graphviz for those of us stuck on Windows etc." - does conda install not work? I don't actually know what the instructions would be... http://www.graphviz.org/Download_windows.php ?

@JohnCrickett

Contributor

commented Feb 8, 2017

conda install graphviz installs a python wrapper, Windows users also need graphviz, the instructions are:

Windows users can install graphviz as follows:

Install Graphviz from http://www.graphviz.org/Download_windows.php
Add C:\Program Files (x86)\Graphviz2.38\bin to the PATH
Run "pip install graphviz" on the command line

@JohnCrickett

Contributor

commented Feb 8, 2017

Final comment on 02: It would be good to include a model answer that can be loaded, like in the current tutorial, particularly as this is the first exercise.

I hope that helps. Overall looks like a really good start.

@JohnCrickett

Contributor

commented Feb 9, 2017

Comments on 01:

  • it's picky I'm afraid, but is it Dask or dask? The text is inconsistent.
  • "practical advice to air you understanding and using dask in every-day situations" - should that be: "practical advice to aid your understanding and application of dask in everyday situations"
  • "The layout of hte tutorial will be as follows" - fix 'the'

Comments on 02:

  • Should the example solution be Foundations-03.py or Foundations-01.py now seeing as the other two aren't used?
  • Picky; this sentence "The following examples show that the kinds of things dask does is not so far removed from normal python programming, typical steps that must be taken to deal with big data." doesn't quite flow right. Perhaps "The following examples show that the kinds of things dask does are not so far removed from normal python programming when dealing with big data"

Hope that helps.

@JohnCrickett

Contributor

commented Feb 9, 2017

Comments on 03:

  • Would be worth a brief explanation of why async is good for debugging?
  • "Some of this monitoring is also available with explicit an progress bar and profiler" - 'an' and 'explicit' the wrong way around?
  • "with the scheduler so cuh that tasks get " remove cuh
  • "There are soma automated options for achieving this" s/soma/some/
  • Most/all images broken
  • It would be useful to have something on debugging and logging for distributed clusters, even if it's just pointers to docs elsewhere.
@martindurant

Member Author

commented Feb 9, 2017

@mrocklin : Dask or dask? The front page of http://dask.pydata.org/en/latest/ contains both.

@martindurant martindurant changed the title (WIP) Rewrite Rewrite Feb 13, 2017

@mrocklin

Member

commented Feb 14, 2017

Notebook 1

multi-core and clusterer -> multi-core and distributed
Mention airflow next to luigi

Graphviz issues may now be resolved on the conda defaults channel.

Docs -> documentation ?

The distributed scheduler is now the recommended engine for executing task work, even on single workstations or laptops.

bag: Python iterators with a functional paradigm... perhaps mention PySpark RDDs here as well

Notebook 02

recommend incx -> x and incy -> y as in the following:

x = inc(15)
y = inc(30)

You may want to replace the current gif with this gif from one of Jim's blogposts.

Thoughts on using the README rather than the Dockerfile? Docker can be a scary word for some.

For the wordcount exercise it seems like the intended approach is to split on every line of the file. I expect this to be very expensive. I wouldn't want students repeating what they learn from this exercise in the wild. The overhead is likely to kill them.

I still find the subprocess, threading, and closure sections to be unnecessary and distracting. I think that a naive student may feel a bit overwhelmed by all of this technology that they're seeing. I also don't think it's clear to them that they don't need to understand these concepts to use Dask.

Notebook 03

I recommend showing them something juicy, like Dask.array, before going into the distributed scheduler. This is for a few reasons:

  1. I think that we have only a short amount of time to demonstrate real value to them before they decide that they're not interested. The foundations notebook took a while, and they're now ready to see some fireworks. I think it's good to interleave foundational knowledge with easy-and-powerful, especially if the audience is not captive.
  2. I think that the distributed scheduler will be much more motivating after they know how to use a dask collection. This will let them quickly build computations and see the diagnostics. The distributed scheduler is really fun once you know a collection, it's tedious if you don't.

This code

total = delayed(add)(delayed(dec)(7), delayed(inc)(6))
total.compute()

Would probably be more approachable to novices if spelled out a bit more

x = delayed(inc)(1)
y = delayed(dec)(2)
total = delayed(add)(x, y)
total.compute()

I would save the concurrent.futures lesson for later on, perhaps after collections. Currently they've only seen dask.delayed. When we see submit and map they might rightfully ask "ok, these seem to do the same thing, why are you showing us the same thing?"
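For context, submit and map here refer to the concurrent.futures-style interface, which dask's distributed Client mirrors. A minimal sketch using only the standard library shows the shape of that API (`inc` is an illustrative function):

```python
from concurrent.futures import ThreadPoolExecutor

def inc(x):
    return x + 1

with ThreadPoolExecutor(max_workers=2) as ex:
    future = ex.submit(inc, 1)             # schedule a single call
    print(future.result())                 # 2
    results = list(ex.map(inc, range(4)))  # parallel map over an iterable
    print(results)                         # [1, 2, 3, 4]
```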

Notebook 04 Bags

I like that this notebook is focused around processing data. I think that we should move it (or something like it) further ahead in the progression.

Notebook 05 arrays

An image appears to be broken (at least when rendering in github)

The name of the make_cluster function is somewhat ambiguous. A novice reader may confuse this with making a cluster of computers. Perhaps make_clustering_data or something similar?

Is snakeviz guaranteed to be on the student's computer?

Notebook 06 Dataframes

I'm somewhat against showing them the dask graph early on. I think that it's scary and not very informative.

Notebook 07 Storage

This looks like it's more about storage for dataframes. Perhaps include this in the title.

Client() may cause some grief if they still have a client running from a previous notebook. Things will still work, but the bokeh web diagnostics will not be accurate.

It might be faster to skip the HDF discussion and go straight to Parquet. I would prefer to avoid showing users approaches that we don't want them to use.

@mrocklin

Member

commented Feb 14, 2017

I suspect that most of my concerns above come from having taught tutorials or classes in very time-constrained situations (conferences) or with audiences with a low attention span (some undergraduates). I'm generally critical about any content that is not essential and I'm wary of spending significant amounts of time in between viscerally satisfying sections.

@mrocklin

Member

commented Feb 22, 2017

Graphviz issues may now be resolved on the conda defaults channel.

I like the separation of content into the "Appendix: Further detail and examples" section. That seems like a good compromise between being thorough and being concise.

Bag

# This starts some dask workers;
# be sure to close other notebooks first

Closing notebook windows doesn't kill the kernels. If they have a distributed cluster running somewhere they'll still run into issues here. They have to restart the kernel if they have called Client() before (though they may not have done so yet).

Also, given the way things are rearranged, they may not need the Client yet. They may be able to get by with the default multiprocessing scheduler? This sounds nicer to start with.

Imperative

I still think that a better name for this could be found. The term "Imperative" means something completely different to most people than what I think we mean here.

This section refers to custom graphs, which we haven't yet covered by this point:

Originally we parallelized this by constructing a dask graph explicitly

dsk = {'a': 1, 
       'b': (inc, 'a'),
       
       'x': 10,
       'y': (inc, 'x'),
       
       'z': (add, 'b', 'y')}

This continues on for a few exercises in this section
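For reference, a hand-built graph like the one quoted above is executed with a scheduler `get` function. A sketch, assuming dask is installed (`inc` and `add` are plain functions defined here for illustration):

```python
from dask import get  # synchronous scheduler; handy for debugging

def inc(x):
    return x + 1

def add(x, y):
    return x + y

dsk = {'a': 1,
       'b': (inc, 'a'),
       'x': 10,
       'y': (inc, 'x'),
       'z': (add, 'b', 'y')}

# Evaluate the graph, asking for the key 'z'
print(get(dsk, 'z'))  # 13
```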

Distributed

This sentence seems oddly formed:

The scheduler In a terminal, type the following:

Before: The top line gives the address at which the scheduler is waiting...
After: The top line gives the address, 192.168.0.1:8786, at which the scheduler is waiting...

I've stopped using the distributed namespace in public facing documents (it's a bad name), instead preferring from dask.distributed import Client.

The following, again, won't work unless students actively close kernels running clusters. Fortunately, at this point in the tutorial they probably won't have done so (especially if we don't use dask.distributed in the earlier Bag notebook). Additionally students will probably also run into issues if they have created the cluster manually as shown above.

# be sure to close other notebooks first

I don't think there is any reason to mention loopback. Generally I think that we should avoid exposing technicalities like this to students.

The scheduler is listening on the local loopback network

Array

Thoughts on introducing Dataframes before arrays? (not pushing for this, just asking)

Why from_array(..., lock=True)?

Advanced distributed

It would be nice not to dive into the project hierarchy like distributed.diagnostics.progress. This makes the tutorial brittle to future internal shifts of code.

Using persist is pretty important for using the distributed scheduler well. It would be nice to see it get more play here.
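A minimal sketch of the persist idea, shown on a delayed object (it also runs with the local schedulers; with a distributed Client active, the results would be kept in worker memory instead):

```python
from dask import delayed

@delayed
def double(x):
    return 2 * x

total = delayed(sum)([double(i) for i in range(5)])

# persist() starts the computation now and keeps the result in
# memory, so subsequent compute() calls on it are cheap
cached = total.persist()
print(cached.compute())  # 20
```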

@mrocklin

Member

commented Feb 22, 2017

Whoops, misfire on the comment here. Will edit it and update when it's complete.

@mrocklin

Member

commented Feb 22, 2017

Updated.

Generally I think that this tutorial would benefit from being run through a few times, first by a potential instructor alone, then by an instructor with a new student. I think that one would identify a number of speedbumps this way.

@martindurant

Member Author

commented Feb 22, 2017

"Generally I think that this tutorial would benefit from being run through a few times" - but I should make these changes, merge, and then wait for further comments/changes?

@mrocklin

Member

commented Feb 22, 2017

@martindurant

Member Author

commented Feb 22, 2017

Where did you find the dsk/imperative section?

@mrocklin

Member

commented Feb 22, 2017

Oh I see. I clicked "View" on the imperative notebook in the diff, but it is only in the diff because it has been removed. My apologies.

@martindurant

Member Author

commented Feb 22, 2017

"Thoughts on introducing Dataframes before arrays?" - It may be true that the audience is more likely to be business analysts rather than scientists... but in every training progression I've ever seen, arrays come before dataframes, as the latter is the more specialised form. If we left arrays last, there would be a good chance of them not being covered at all.

I was thinking of adding a section (not in this PR) about more involved data analysis processes, and this could be weather data (xarray) for arrays, taxi for dataframes, census for visualisation. All those examples exist elsewhere.

@martindurant martindurant merged commit 81fe01e into dask:master Feb 23, 2017

@martindurant martindurant deleted the martindurant:rewrite branch Feb 23, 2017
