Implemented logic to add extra arguments to apply #3256
Conversation
Note this change could break existing code because meta becomes just an extra keyword with no positional meaning.
mrocklin
left a comment
This looks pretty good to me. I left a couple of minor comments. Can I also ask you to add a note in the changelog at docs/source/changelog.rst ? If this is your first time then you will also have to add your name and a link at the bottom of that document.
dask/dataframe/tests/test_groupby.py
Outdated
```python
    return df.assign(b=df.b - df.b.mean() + c * d)

assert_eq(df.groupby('a').apply(func, 1),
          ddf.groupby('a').apply(func, 1))
```
Can we also test the keyword argument `d=` in another line?
dask/dataframe/tests/test_groupby.py
Outdated
```python
ddf = dd.from_pandas(df, npartitions=3)

pytest.raises(Exception, lambda: df.groupby('does_not_exist'))
pytest.raises(Exception, lambda: df.groupby('a').does_not_exist)
```
Should these raise particular kinds of exceptions, like `AttributeError`?
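For illustration, the suggestion could look like the sketch below. It uses `pytest.raises` with specific exception types; the `Frame` class is a made-up stand-in, and the exact types pandas/dask raise here are an assumption to verify against the library.

```python
import pytest

# Hypothetical stand-in for a DataFrame-like object; it raises KeyError
# on unknown item access, like a dict, and AttributeError on unknown
# attribute access, like any plain Python object.
class Frame:
    def __getitem__(self, key):
        raise KeyError(key)

def test_specific_exceptions():
    frame = Frame()
    # Asserting a specific type catches regressions that a bare
    # pytest.raises(Exception, ...) would silently accept.
    with pytest.raises(KeyError):
        frame['does_not_exist']
    with pytest.raises(AttributeError):
        frame.does_not_exist
```

The narrower the expected exception, the more the test actually pins down the behavior under discussion.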
```diff
 @insert_meta_param_description(pad=12)
-def apply(self, func, meta=no_default):
+def apply(self, func, *args, **kwargs):
```
Is there any reasonable way to check that args[0] is in fact a meta and give a deprecation warning? In Python 3 you can have keyword-only arguments; that would work too, to avoid API breakage.
Precisely, this really bugs me.
But even with keyword-only arguments, I can totally see people doing it this way:

```python
df.groupby().apply(f, arg1, arg2)
```

and failing miserably.
Another way is to make `meta` a required argument.
I'm fine with having things fail here; people really should be using keyword arguments by name instead of by position.
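A minimal sketch of the keyword-only alternative discussed above (this is not dask's actual implementation; `NO_DEFAULT` and the `'inferred'` placeholder are made-up names): making `meta` keyword-only means every positional argument is unambiguously forwarded to `func`, and old code passing `meta` positionally fails loudly instead of silently treating it as an argument to `func`.

```python
NO_DEFAULT = object()  # hypothetical sentinel, stands in for dask's no_default

# Keyword-only `meta`: *args always belongs to `func`, so there is no
# ambiguity between a positional meta and a positional argument.
def apply(func, *args, meta=NO_DEFAULT, **kwargs):
    if meta is NO_DEFAULT:
        meta = 'inferred'  # placeholder for dask's real meta inference
    return func(*args, **kwargs), meta

result, meta = apply(lambda a, b: a + b, 1, 2)
# result == 3 and meta == 'inferred'
```

Under the old signature, `apply(func, some_meta)` would bind `some_meta` to `meta`; with the keyword-only form it is passed to `func`, so misuse surfaces immediately as an error in `func` rather than as a silently wrong meta.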
@mrocklin thanks for reviewing. There's also an extra issue that bugs me. Let's say we pass a Future in args or kwargs because you have a large argument: I did have some issues, especially with meta estimation. I will add a test case to illustrate the issue and maybe we can find a way.
Yes, I agree both that that is important, and that it may be challenging. There are probably two challenges here:
@mrocklin I will make the test pass; for now I'll skip the metadata issue.
dask/dataframe/tests/test_groupby.py
Outdated
```python
c = 1
d = 2
c_scalar = dd.core.Scalar({'my-scalar': c}, 'my-scalar', int)
d_scalar = dd.core.Scalar({'my-scalar': d}, 'my-scalar', int)
```
Rather than use internal API I recommend that we use something like `df.a.sum()`. This will make this test less brittle to internal changes and make it easier for novice maintainers to understand.
Nice test!
dask/dataframe/tests/test_groupby.py
Outdated
```python
d = 2

c_scalar = _make_scalar(c)
d_scalar = _make_scalar(d)
```
I suggest the following instead:

```python
c_scalar = ddf.a.sum()
d_scalar = ddf.b.mean()
```
I'm having some issues dealing with delayed kwargs. I believe these don't get replaced in `apply_and_enforce`; I used a workaround, let me know if it looks OK.
Have you taken a look at `to_task_dask`?

```python
def to_task_dask(expr):
    """Normalize a python object and merge all sub-graphs.

    - Replace ``Delayed`` with their keys
    - Convert literals to things the schedulers can handle
    - Extract dask graphs from all enclosed values

    Parameters
    ----------
    expr : object
        The object to be normalized. This function knows how to handle
        ``Delayed``s, as well as most builtin python types.

    Returns
    -------
    task : normalized task to be run
    dask : a merged dask graph that forms the dag for this task

    Examples
    --------
    >>> a = delayed(1, 'a')
    >>> b = delayed(2, 'b')
    >>> task, dask = to_task_dask([a, b, 3])
    >>> task  # doctest: +SKIP
    ['a', 'b', 3]
    >>> dict(dask)  # doctest: +SKIP
    {'a': 1, 'b': 2}
    >>> task, dask = to_task_dask({a: 1, b: 2})
    >>> task  # doctest: +SKIP
    (dict, [['a', 1], ['b', 2]])
    >>> dict(dask)  # doctest: +SKIP
    {'a': 1, 'b': 2}
    """
```
@mrocklin It's possible to replace the kwargs handling with a task using `to_task_dask`. There's however another test that conflicts with both strategies. Is there any specific reason for the test, or can I edit it?
It seems perfectly reasonable to change that test. My guess is that the intent behind that test is that we maintain keyword arguments in the task graph directly in some easy-to-interpret way, rather than including them as closures or something else more exotic.
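As a toy illustration of the normalization idea being discussed (this is not dask's code; `Lazy` and `to_task_graph` are made-up names standing in for `Delayed`/`Scalar` and `to_task_dask`): walk a nested structure, replace lazy objects with their keys, and merge their graphs, so that args/kwargs containing lazy values can be embedded directly in a task.

```python
# Toy stand-in for a lazy dask object: a key plus the graph that
# produces the value for that key.
class Lazy:
    def __init__(self, key, graph):
        self.key = key
        self.graph = graph

def to_task_graph(expr):
    """Return (task, graph): expr with Lazy objects replaced by their
    keys, plus the merged graph of everything encountered."""
    if isinstance(expr, Lazy):
        return expr.key, dict(expr.graph)
    if isinstance(expr, (list, tuple)):
        tasks, graph = [], {}
        for item in expr:
            task, subgraph = to_task_graph(item)
            tasks.append(task)
            graph.update(subgraph)
        return tasks, graph
    if isinstance(expr, dict):
        # Represent a dict as a rebuildable task: (dict, [[k, v], ...]),
        # mirroring the easy-to-interpret form mentioned above.
        items, graph = to_task_graph([[k, v] for k, v in expr.items()])
        return (dict, items), graph
    return expr, {}

a = Lazy('a', {'a': 1})
task, graph = to_task_graph([a, 3])
# task == ['a', 3], graph == {'a': 1}
```

Because the result is a plain task plus a mergeable graph, keyword arguments stay visible in the graph rather than being hidden inside a closure.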
This looks good to me, but @jcrist might have a better eye for issues here. Let's give him 24 hours to see if he has time to respond (he may be out of contact for the next few days).
jcrist
left a comment
Apologies for the delayed review; overall this looks good to me.
dask/dataframe/core.py
Outdated
```diff
 func : function
     Function applied to each partition.
-args, kwargs :
+args, kwargs : Scalar, Delayed or object
```
args and kwargs can contain these types, but aren't these types themselves. I'd remove the type note here and move it to the description below. Something like: "Arguments and keywords may contain ``Scalar``, ``Delayed``, or regular python objects."
dask/dataframe/core.py
Outdated
```diff
     Ensures the output has the same columns, even if empty."""
-    df = func(*args, **kwargs)
+    df = func(*args, **dict(kwargs))
```
Why this change? I don't think this should be necessary.
It was something I forgot to clean up; I'll remove it.
dask/dataframe/core.py
Outdated
```python
name = '{0}-{1}'.format(name, token)

from .multi import _maybe_align_partitions
args, args_dasks = _process_lazy_args(args)
```
I think this can be moved down to where the call to `to_task_dask` is with no issues (since alignment also ignores `Delayed` objects). I'd prefer to group those together if possible.
LGTM. Thanks @gabrielelanaro! Merging.
Passes `flake8 dask`; documented in `docs/source/changelog.rst` for all changes, and one of the `docs/source/*-api.rst` files for new API.