Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add `db.map`, `Bag.starmap`, fix `Bag.map` #2339

Merged
merged 8 commits into from May 16, 2017

Conversation

Projects
None yet
3 participants
@jcrist
Copy link
Member

jcrist commented May 15, 2017

  • Implement db.map, a mapping function that has the same semantics as
    builtins.map, with the following extensions:

    • Supports keyword arguments
    • Non-bag arguments are "broadcast" across all function calls
  • Implement db.Bag.starmap. This is similar to itertools.starmap, except it also supports keyword arguments.

  • Deprecate magic splatting behavior in db.Bag.map, pointing users to use db.Bag.starmap instead.

  • Add support for positional arguments to db.Bag.map (just calls db.map).

Fixes #2332
Fixes #1906
Fixes #1572

jcrist added some commits May 15, 2017

Add db.map
Implement `db.map`, a mapping function that has the same semantics as
`builtins.map`, with the following extensions:
- Supports keyword arguments
- Non-bag arguments are "broadcast" across all function calls
Add Bag.starmp
Similar to `itertools.starmap`, except also supports keyword arguments.
Note that all ``Bag`` arguments must have the same number of partitions.

This comment has been minimized.

Copy link
@mrocklin

mrocklin May 15, 2017

Member

Perhaps "must be partitioned identically". This operation will not realign partitions.

[5, 7, 9, 11, 13]
Non-bag arguments are "broadcast" across all calls to the mapped
function:

This comment has been minimized.

Copy link
@mrocklin

mrocklin May 15, 2017

Member

Perhaps drop quotes around broadcast.

>>> db.map(myadd, b, b.max()).compute()
[4, 5, 6, 7, 8]
"""
name = 'map-%s-%s' % (funcname(func), tokenize(func, args, kwargs))

This comment has been minimized.

Copy link
@mrocklin

mrocklin May 15, 2017

Member

Do we need to splat out the args and kwargs here?

This comment has been minimized.

Copy link
@jcrist

jcrist May 15, 2017

Author Member

No, tokenize traverses these as is.

args2.append((itertools.repeat, a.name))
dsk.update(a.dask)
else:
args2.append((itertools.repeat, a))

This comment has been minimized.

Copy link
@mrocklin

mrocklin May 15, 2017

Member

Thoughts on other non-bag scalar objects like delayed or dataframe.Scalar?

return (f(*a, **k) for a, k in zip(zip(*args), kw_iter))
return (f(**k) for k in kw_iter)

return map(f, *args)

This comment has been minimized.

Copy link
@mrocklin

mrocklin May 15, 2017

Member

Thoughts on raising a runtime error when partitions are mismatched?

This comment has been minimized.

Copy link
@jcrist

jcrist May 15, 2017

Author Member

Added.

jcrist added some commits May 15, 2017

Respond to comments:
- Support `dask.Delayed` as argument to bag functions
- Update docstrings
@jcrist

This comment has been minimized.

Copy link
Member Author

jcrist commented May 15, 2017

One caveat here is that we have breaking behavior for varargs functions (which will no-longer be splatted by default). This was really hard to deprecate smoothly, I'm not sure of a better way. I do think that due to how inconsistent the existing behavior was that it probably is infrequently used, but it still is breaking behavior.

@jcrist

This comment has been minimized.

Copy link
Member Author

jcrist commented May 16, 2017

Planning to merge later today if no comment.

args2.append((itertools.repeat, a.key))
dsk.update(a.dask)
else:
args2.append((itertools.repeat, a))

This comment has been minimized.

Copy link
@mrocklin

mrocklin May 16, 2017

Member

OK, I'm going to be a bit picky here. Any thoughts on how to generalize this further? For example I can imagine putting a Future or dataframe Scalar object here? I don't know of a way to check for futures except by an explicit type check (and I'm not saying that we should do this, users can also wrap in delayed if they need this) but other objects that have a .dask attribute may be relevant. Alternatively, we might consider starting to replace Item, Scalar, etc, by delayed when possible.

Nothing to do here, just commenting for general discussion.

This comment has been minimized.

Copy link
@jcrist

jcrist May 16, 2017

Author Member

I think that the use case for dask.dataframe.scalar or dask.array with zero dimensions or a Future would need to wait on use-cases. It was easy to add support for Delayed, less so for the others as they'd involve imports that may not be available (hard to check if pandas isn't installed, and don't want to import pandas unnecessarily).

I'd rather wait to add support for other types until a user asks for it, but up to you.

This comment has been minimized.

Copy link
@mrocklin

mrocklin May 16, 2017

Member

Agreed. Just bringing it up. I think that an alternative approach would be to look for a .dask and .key attribute or something. This would be a new convention though. I'm generally +1 on this PR as is.

@jcrist jcrist merged commit f7a44ad into dask:master May 16, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@jcrist jcrist deleted the jcrist:bag-map branch May 16, 2017

mrocklin added a commit to mrocklin/dask that referenced this pull request Jun 6, 2017

Optimize out reify calls in to_delayed
This has two major changes:

1.  The lazify optimization traverses down lists.  This is important
    after changes in dask#2339
2.  Call optimize when we convert from bags to delayed objects

See dask#2423

mrocklin added a commit to mrocklin/dask that referenced this pull request Jun 6, 2017

Optimize out reify calls in to_delayed
This has two major changes:

1.  The lazify optimization traverses down lists.  This is important
    after changes in dask#2339
2.  Call optimize when we convert from bags to delayed objects

See dask#2423

mrocklin added a commit that referenced this pull request Jun 6, 2017

Optimize out reify calls in to_delayed (#2425)
* Optimize out reify calls in to_delayed

This has two major changes:

1.  The lazify optimization traverses down lists.  This is important
    after changes in #2339
2.  Call optimize when we convert from bags to delayed objects

See #2423

* Optimize in Array/DataFrame.to_delayed

* Use bag optimizations within to_textfiles

@sinhrks sinhrks added this to the 0.15.0 milestone Aug 30, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.