Dask optimizations #392

Merged

merged 13 commits into bccp:master on Sep 18, 2017

Conversation

nickhand (Member) commented Sep 9, 2017

This adds two main features:

  1. A set_options context manager (as in dask) to specify the default dask chunk size and cache size; see the sketch after this list. I've also increased the default chunk size from 1e5 to 5e6 -- the dask docs recommend chunks holding between 10 and 100 MB to avoid scheduler overhead, and a 5e6-element chunk of float64 data is about 40 MB. I definitely see overhead delays when reading large randoms catalogs on my laptop with the old chunk size.
  2. #391: An optimized selection procedure for selecting subsets of CatalogSource objects -- the selection task is inserted into the task graph at the earliest point at which the CatalogSource size is fixed, so subsequent operations on columns are automatically performed on the selected subset.
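
A minimal sketch of the intended usage (import paths are assumed; the keywords dask_cache_size and dask_chunk_size are the two globals this PR adds, and set_options mirrors dask's own):

    from nbodykit import set_options
    from nbodykit.lab import UniformCatalog

    # temporarily override the defaults; the previous values are
    # restored when the block exits
    with set_options(dask_chunk_size=5e6, dask_cache_size=5e9):
        cat = UniformCatalog(1000, 1.0, seed=42)
        pos = cat['Position'].compute()

    # called outside a with block, set_options updates the globals permanently
    set_options(dask_chunk_size=1e5)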

nickhand added some commits Sep 8, 2017

add a set_options context manager, similar to dask's
- the current global variables are the dask cache size and the dask chunk size
- increase the dask chunk size to 5e6 from 1e5; this appears to reduce overhead for large data sets and should still be small enough to fit in memory
add an optimized selection function for dask arrays, with a test
This builds on the idea that the size of the catalog is fixed after a certain point in the dask task graph. So when we perform a selection, we can insert the selection task directly after the point in the graph where the size last changed; the subsequent operations are then performed only on the selected data.

This saves a significant amount of time -- for example, when computing "Position" on the randoms catalog after slicing to a single redshift bin.
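
A sketch of that effect from the user's side (UniformCatalog and boolean-mask selection both appear in this PR; the slab fraction here is purely illustrative):

    from nbodykit.lab import UniformCatalog

    cat = UniformCatalog(1000, 1.0, seed=42)

    # a dask boolean mask selecting a thin slab of the box
    mask = cat['Position'][:, 0] < 0.1

    # the selection task is spliced into the dask graph immediately
    # after the last task that changed the catalog size
    subset = cat[mask]

    # so computing a column here only touches the rows that survive the
    # mask, instead of computing the full column and slicing at the end
    pos = subset['Position'].compute()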

@nickhand nickhand requested a review from rainwoodman Sep 9, 2017

nickhand (Member) commented Sep 13, 2017

Any thoughts on this @rainwoodman ?

        dsk.update(dsk2)
        chunks = tuple(blockdims) + arr.chunks[ndim:]
        return da.Array(dsk, name, chunks, dtype=arr.dtype)

    class ColumnAccessor(da.Array):

rainwoodman (Member) commented Sep 13, 2017

This requires dask > 0.14.2; I just found that out yesterday.

nickhand (Member) commented Sep 13, 2017

I'll pin the version in requirements.txt.

    if isinstance(sel, da.Array):
        sel = self.catalog.compute(sel)
    try:
        d = optimized_selection(self, sel)

rainwoodman (Member) commented Sep 13, 2017

This try/except looks strange -- could you catch a DaskGraphOptimizationFailure instead?

nickhand (Member) commented Sep 13, 2017

Yeah, that is a clearer solution.
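
A minimal sketch of the agreed-upon pattern (optimized_selection and the da.Array check come from the diff above; the exception class body and the fallback branch are assumptions for illustration):

    import dask.array as da

    class DaskGraphOptimizationFailure(Exception):
        """Raised when the selection cannot be spliced into the task graph."""

    class ColumnAccessor(da.Array):

        def __getitem__(self, sel):
            # a dask boolean mask must be evaluated before the graph
            # can be rewritten
            if isinstance(sel, da.Array):
                sel = self.catalog.compute(sel)
            try:
                # optimized path: insert the selection right after the
                # last size-changing task in the graph
                d = optimized_selection(self, sel)
            except DaskGraphOptimizationFailure:
                # fall back to dask's default selection behavior
                d = da.Array.__getitem__(self, sel)
            return d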

    @@ -6,6 +6,33 @@
    import dask
    dask.set_options(get=dask.get)
    _globals = {}

rainwoodman (Member) commented Sep 13, 2017

_global_options?

nickhand (Member) commented Sep 13, 2017

Yes, I'll change it.

    s = UniformCatalog(1000, 1.0, seed=42)
    # slice once
    subset1 = s[:20]

rainwoodman (Member) commented Sep 13, 2017

Is this :20 per rank, or :20 of the entire dataset?

rainwoodman (Member) commented Sep 13, 2017

Maybe we should define a gslice(a, b, c) that represents a global slice across ranks? It would be useful for abundance matching a sorted catalog.

nickhand (Member) commented Sep 13, 2017

Yes, currently the slicing is all local. A global slice would be nice; I think I already have some code started for that. I'll open a new PR for it.
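
A rough sketch of what such a helper could look like (gslice is hypothetical, as proposed above; cat.comm and cat.size are the usual per-rank CatalogSource attributes, and the MPI bookkeeping is illustrative only):

    import numpy as np

    def gslice(cat, start, stop, step=1):
        """Select rows [start:stop:step] indexed globally across all ranks."""
        comm = cat.comm
        # the global offset of this rank's first local row
        sizes = comm.allgather(cat.size)
        offset = sum(sizes[:comm.rank])
        # global indices of the locally held rows
        index = offset + np.arange(cat.size)
        # keep the local rows that fall inside the global slice
        mask = (index >= start) & (index < stop) & ((index - start) % step == 0)
        return cat[mask]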

    CurrentMPIComm.set(comm)
    with set_options(dask_cache_size=5e9, dask_chunk_size=75):

rainwoodman (Member) commented Sep 13, 2017

Can I use set_options not as a context manager? The name doesn't suggest it is a context manager at all.

rainwoodman (Member) commented Sep 13, 2017

If it cannot be used except as a context manager, then it could be renamed to option_context:

    with nbkit.option_context(....) as ctx:
        ...
        ...

nickhand (Member) commented Sep 13, 2017

Yes, it will set the globals if it is not called as a context manager. It largely follows dask's set_options class.
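
A condensed sketch of that dual-use pattern, following dask's set_options (the _global_options name matches the rename suggested above; the details are illustrative):

    _global_options = {}

    class set_options(object):
        """Set global defaults: permanent when called bare, temporary
        when used as a context manager."""

        def __init__(self, **kwargs):
            # stash the old values so __exit__ can restore them
            self.old = _global_options.copy()
            _global_options.update(kwargs)  # takes effect immediately

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            _global_options.clear()
            _global_options.update(self.old)

Called bare, the update in __init__ simply sticks; inside a with block, __exit__ restores the previous values on exit.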

rainwoodman (Member) commented Sep 13, 2017

I guess it probably came from numpy's formatting options -- though I am still a bit reserved about their choice of name: a context named like an action. But it looks like we are stuck with such names.

@nickhand nickhand merged commit c3079b6 into bccp:master Sep 18, 2017

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage decreased (-0.05%) to 95.349%
rainwoodman (Member) commented on nbodykit/base/catalogmesh.py in 4e41aab, Nov 10, 2017

This and line 211 are causing errors when the data set doesn't fit into memory -- do you remember anything about them?

nickhand (Member) replied Nov 10, 2017

I think this is a remnant of when the dask optimizations were added. I suspect the dask engine isn't doing what we want it to in terms of selection... I think we can revert to the old logic if we need to.

rainwoodman (Member) replied Nov 10, 2017

OK, I'll file a PR.
