WIP: Feature/dataframe aggregate (implements #1619) #1678
mrocklin merged 23 commits into dask:master from chmp:feature/dataframe-aggregate
Conversation
dask/dataframe/groupby.py
Outdated
        next_id, simple_impl[func])
elif func == 'var':
    raise NotImplementedError()
The functions in dask/dataframe/groupby.py like _var_chunk may be of value here.
Thanks for the pointer. I will probably reimplement it to be able to reuse intermediates across different reductions, if that is OK with you?
As long as single-groupby-aggregation performance isn't seriously affected or made significantly more complex then, yes, modifying existing code sounds great.
dask/dataframe/groupby.py
Outdated
AggArgs = collections.namedtuple('AggArgs', ['chunk_funcs', 'aggregate_funcs',
                                             'finalizers'])
Would a dictionary do ok here? Most dask development has relatively few custom classes. Namedtuples are pretty benign, I agree, but the majority of the codebase is pretty barebones.
I'll replace the namedtuple with a dict.
dask/dataframe/groupby.py
Outdated
    partial(_apply_func_to_column, int_sum, int_sum, M.sum),
    partial(_apply_func_to_column, int_count, int_count, M.sum)
],
finalizers=[(result_column, partial(_finalize_mean, int_sum, int_count))],
There is a non-trivial cost to using partial functions in a distributed setting. They are harder to serialize and harder to reason about when people need to debug the graphs themselves (which sometimes core developers do need to do).
This isn't a blocking request (fine if it's too hard), but if it's possible to put these arguments directly into the graph rather than implicitly with a partial, then that would be better.
Thanks for the explanation. To be honest, I was wondering about this design choice when reading the dask code base; it makes perfect sense.
I'll put the arguments directly into the graph. My proposal would be to reverse the arguments to _apply_func_to_column, convert all partial arguments to keyword arguments, and to replace the partials with pairs of function and kwargs.
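To illustrate the proposal, here is a minimal sketch of the two styles. The helper name and its signature are hypothetical stand-ins, not the actual code from this PR; the point is only that a `(function, kwargs)` pair keeps the arguments visible when inspecting graph tasks, whereas a partial hides them.

```python
from functools import partial

def apply_func_to_column(df_like, column=None, func=None):
    # hypothetical stand-in: apply ``func`` to one column of a frame-like object
    return func(df_like[column])

# before: arguments baked into an opaque partial
task_before = partial(apply_func_to_column, column='x', func=sum)

# after: function and keyword arguments stored explicitly as a pair
task_after = (apply_func_to_column, {'column': 'x', 'func': sum})

data = {'x': [1, 2, 3]}
func, kwargs = task_after
assert task_before(data) == func(data, **kwargs) == 6
```

Both forms compute the same result; the explicit pair is just easier to serialize and debug.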
dask/dataframe/tests/test_groupby.py
Outdated
assert eq(pdf.groupby('a').agg(spec), ddf.groupby('a').agg(spec))

spec = 'mean'
assert eq(pdf.groupby('a').agg(spec), ddf.groupby('a').agg(spec))
It might be nice to put all of the specs into a list and then loop over them. Or perhaps use a @pytest.parametrize testing fixture.
I will go with pytest.parametrize (and fix the failing tests due to the recent changes on master).
'max': (M.max, M.max),
'count': (M.count, M.sum),
'size': (M.size, M.sum),
}
You might want to see the very recent work of tree reductions here: #1663
This ends up having a significant performance impact for some groupbys. It would be nice (eventually) to have the same logic in aggregate. This would require adding another intermediate combine function to each of these tuples.
Perfect, I wasn't aware that it had already been merged. The dask development has quite a rapid pace to it, if I may say.
I'll rebase and add tree reductions to aggregate.
I'm wondering: after the rebase, am I not already using tree reductions? Since I do not set split_every=False in apply_concat_apply, the tree reduction code should run as is. In any case, I will expose the split_every argument on aggregate and double-check whether the default of combine == aggregate works for the reductions used in agg.
Right, according to the docstring it looks like the intermediate combine function defaults to the final aggregate function if none is provided. @jcrist thoughts on this default? An alternative default would be to use split_every=False if users don't provide a combine function, which might be safer for user-defined functions.
Regardless, I think it would be safer here to explicitly include a combine function if one is known, so for example:
# 'count': (M.count, M.sum)        # before
'count': (M.count, M.sum, M.sum)   # after
I went with that default because it matches the interface of tree reductions in bag and array. Many of the reductions defined here also have identical aggregate and combine functions, so it's slightly more concise to use this default :). My thought is that custom reductions are advanced enough that we can put the burden on users to properly use this method.
For all supported aggregations, combine == aggregate. While I would be happy to add the corresponding tuple elements and dictionary entries in the _build_ ... functions, I feel it would needlessly clutter the code. However, I think adding the corresponding keyword arguments for apply_concat_apply would make the assumption of combine == aggregate explicit.
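To make the chunk/combine/aggregate split concrete, here is a pure-Python sketch of a two-level tree reduction for 'count' (the dask analogue would use M.count as chunk and M.sum as both combine and aggregate; the function names and data here are illustrative only):

```python
def chunk(partition):
    # per-partition step: count the elements (analogue of M.count)
    return len(partition)

def combine(intermediates):
    # merge intermediate counts (analogue of M.sum)
    return sum(intermediates)

# combine == aggregate holds for all the aggregations discussed here
aggregate = combine

partitions = [[1, 2], [3], [4, 5, 6]]
level1 = [chunk(p) for p in partitions]              # per-partition counts
level2 = [combine(level1[:2]), combine(level1[2:])]  # pairwise combine step
total = aggregate(level2)                            # final aggregate: 6
```

With split_every, dask inserts as many combine levels as needed; when combine and aggregate coincide, only the final call differs in position, not in logic.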
I rebased to master and addressed your comments (namedtuples, partial, pytest parametrize). Further, I changed the semantics of the funcs argument. Next, I will tackle tree reductions.
dask/dataframe/groupby.py
Outdated
# generated it is reused in subsequent id requests.
ids = it.count()
# TODO: add func, input_column to intermediate name for easier debugging
id_map = collections.defaultdict(lambda: 'intermediate-{}'.format(next(ids)))
Currently, the skipping of recomputation is implemented by a somewhat arcane combination of itertools.count and defaultdict. What do you think about implementing it as a separate class? This would also make it simpler to generate easy-to-interpret intermediate names.
Are these names used as keys in the task graph or just as column names? Is there a convenient deterministic way to create these names?
Generally I avoid classes, probably somewhat irrationally.
Rightly so: now that I think about it, this functionality can also be easily implemented with a closure.
The keys are only used as column names in intermediate data frames.
With regard to generating these names deterministically: I am somewhat certain that prefixing the repr of the column key with the function name should yield a unique key, as long as the separator is not used inside the function name. However, I thought it better to be safe by adding a unique component to the name.
A possible implementation could be:
def _make_agg_id_factory():
    integers = it.count()
    ids = {}

    def factory(func, column):
        try:
            return ids[func, column]
        except KeyError:
            new_id = '{}-{!s}-{!r}'.format(next(integers), func, column)
            return ids.setdefault((func, column), new_id)

    return factory
Regarding the failed travis check: I cannot see any place where I would have changed code that is relevant to that test. Are there any known issues? Maybe it is some very unlikely combination of the random numbers generated. Update: it seems to be a stochastic failure of the test.
assert len(no_mean_aggs) == len(with_mean_aggs)

assert len(no_mean_finalizers) == len(no_mean_spec)
assert len(with_mean_finalizers) == len(with_mean_spec)
It would be nice to check some additional things about the graphs produced here:
- Do they work well with split_every? The dask.dataframe.utils.assert_max_deps function might be useful here.
- Do they produce deterministic key names? Is df.groupby(...).aggregate(...).dask == df.groupby(...).aggregate(...).dask for equivalent values of ...?
- Are intermediates shared? Perhaps by checking that the lengths of the resulting dask graphs are shorter than one might expect from a naive handling.
I will add these tests. Check 3 should always be satisfied: irrespective of the chosen aggregations, only a single dataframe is generated per partition. Therefore, the number of elements inside the dask graphs should always be the same. Only the columns added to the dataframes depend on the argument to aggregate.
Check 3 should always be satisfied. Irrespective of the chosen aggregations only a single dataframe will be generated per partition. Therefore, the number of elements inside the dask graphs should always be the same.
Very cool. Besides being more convenient I anticipate that this will have some very nice performance benefits :)
assert_eq(pdf.groupby(['a', 'd']).agg(spec),
          ddf.groupby(['a', 'd']).agg(spec))
After adapting the tests, I found a couple of problems:
- using series as groupby arguments fails miserably
- pandas uses additional magic where it tries to cast the result back to its original type. The relevant code can be found in _GroupBy._try_cast (see also "Intermittent failure in pivot_table test" #1685). While it is definitely possible to ensure the final data frame has the correct types after .compute(), such a step would also mean that there is no way of knowing the correct meta object before performing the actual aggregation.
- .groupby(...).agg('size') returns a series, for inconsistency's sake
My recent commits address issues 1 and 3. For Issue 2 however, i.e., the recasting of the result, I don't see any way to fix it.
dask/dataframe/groupby.py
Outdated
else:
    levels = 0

# TODO: add normed spec as an additional part to the token?
Are there any specifics I need to watch out for when generating the token, or does apply-concat-apply ensure that the resulting dask keys are unique?
The apply_concat_apply function should handle things here
index = list(index.columns)

self.index = index
self.index = _normalize_index(df, index)
Here, I added support for lists of series as a grouper, which was previously not supported. This logic is required to handle df.groupby([df['a'], df['b']]) consistently with pandas. The recursion in _normalize_index is not truly necessary (since lists of lists should not be valid groupers), but it was the easiest way to implement it.
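The idea can be sketched roughly as follows. This is a hedged illustration, not the actual dask implementation: the function name `normalize_index` and the rule "a series that is a column of the frame is replaced by its name" are simplifying assumptions.

```python
import pandas as pd

def normalize_index(df, index):
    # hypothetical sketch of normalizing groupby arguments
    if isinstance(index, list):
        # recursion is not strictly necessary (lists of lists are not valid
        # groupers), but it is the easiest way to reuse the scalar logic
        return [normalize_index(df, item) for item in index]
    elif isinstance(index, pd.Series) and index.name in df.columns:
        # a series that is one of df's own columns reduces to its name
        return index.name
    else:
        return index

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
assert normalize_index(df, [df['a'], df['b']]) == ['a', 'b']
assert normalize_index(df, 'a') == 'a'
```

A real implementation would additionally need to keep series that do not correspond to columns of the frame (e.g. transformed series) as-is, which the fallthrough branch does here.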
dask/dataframe/groupby.py
Outdated
meta = _aggregate_meta(self.obj._meta, meta_groupby, 0, chunk_funcs,
                       aggregate_funcs, finalizers)

if not isinstance(self.index, list):
Here, I'm adding multiple series as varargs to the chunk functions. This way no additional logic for aligning the different series is required. The chunk functions have been adapted accordingly.
agg_dask1 = get_agg_dask(result1)
agg_dask2 = get_agg_dask(result2)

core_agg_dask1 = {k: v for (k, v) in agg_dask1.dask.items()
map_partitions used in the finalize step passes the meta dataframe. Therefore, == can no longer be used for comparison.
In travis.ci it looks like there is an encoding issue with one of the files and some flake8 issues generally. Nothing major though.
dask/dataframe/groupby.py
Outdated
meta_groupby = pd.Series([], dtype=bool, index=self.obj._meta.index)
meta = _aggregate_meta(self.obj._meta, meta_groupby, 0, chunk_funcs,
                       aggregate_funcs, finalizers)
It looks like this function is only a few lines and is only used here. It might be easier to read if this function was just inlined here instead.
dask/dataframe/groupby.py
Outdated
elif isinstance(self.index, list):
    group_columns = {i for i in self.index
                     if isinstance(i, tuple) or np.isscalar(i)}
Does the input order not matter here?
The group_columns columns are only used to create the list of non_group_columns. Since only __contains__ is checked, the order of group_columns is irrelevant.
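A tiny illustration of this point (column names are made up): since group_columns is only probed with `in`, an unordered set suffices, and the order of the derived non_group_columns follows the original column order, not the set.

```python
# group_columns is only used for membership tests, so a set works fine
group_columns = {'a', 'd'}
all_columns = ['a', 'b', 'c', 'd']

# the resulting order comes from all_columns, not from the set
non_group_columns = [c for c in all_columns if c not in group_columns]
assert non_group_columns == ['b', 'c']
```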
dask/dataframe/groupby.py
Outdated
# TODO: add normed spec as an additional part to the token?
token = 'aggregate'
finalize_token = '{}-finalize'.format(token)
Also, it might be nicer to just inline 'aggregate-finalize' directly in the map_partitions call.
Removing small logic like this should make it easier for other developers to maintain this function in the future.
I assumed additional logic was required. Since this is not the case, I removed both tokens and inlined them directly into the aca and map_partitions calls.
def _normalize_spec(spec, non_group_columns):
    """
    Return a list of ``(result_column, func, input_column)`` tuples.
Can I ask for a quick Examples section here, with a couple of worked examples to help future developers quickly understand how this function operates?
Examples
--------
>>> _normalize_spec(simple, inputs)
... simple outputs ...
I added further explanation to the docstring and different examples that should cover the functionality.
dask/dataframe/groupby.py
Outdated
#
###############################################################

def _make_agg_id_factory():
Should this function be tested independently?
I added an extra test.
dask/dataframe/groupby.py
Outdated
def _groupby_apply_funcs(df, *index, **kwargs):
    """
    TODO: document args
Add full description of this function with all its arguments.
dask/dataframe/groupby.py
Outdated
--------

>>> _normalize_spec('mean', ['a', 'b', 'c'])
[('a', 'mean', 'a'), ('b', 'mean', 'b'), ('c', 'mean', 'c')]
These should be unindented to line up with the normal text on the left. We should also drop the newline after the ----.
http://dask.readthedocs.io/en/latest/develop.html#docstrings
dask/dataframe/groupby.py
Outdated
...                 'b': {'e': 'count', 'f': 'var'}},
...                 ['a', 'b', 'c'])
[(('a', 'mean'), 'mean', 'a'), (('a', 'size'), 'size', 'a'),
 (('b', 'e'), 'count', 'b'), (('b', 'f'), 'var', 'b')]
These are super-informative by the way. Thanks!
dask/dataframe/groupby.py
Outdated
funcs: list of result-column, function, keyword-argument triples
    The list of functions that are applied on the grouped data frame.
    Has to be passed as a keyword argument.
This is looking pretty good to me. It would be nice to have someone more familiar with dask.dataframe take a look at it before merging. @jcrist if you have time tomorrow this could use your review.
Example:
# previously supported:
df['a'] -> 'a'
# additionally supported
[df['a'], df['b']] -> ['a', 'b']
- inline _aggregate_meta
- add arg docs, examples for _normalize_spec
- add arg docs for _groupby_apply_funcs
- inline token arguments in aggregate
First (test) usage seemed to work nicely. Is it intentional to support only DataFrameGroupBy ATM? (fine by me); maybe put in a …
So the rename [32] is not propagating. I am not sure this is directly related to this PR.
Regarding the missing aggregate for …
dask/dataframe/groupby.py
Outdated
integers = it.count()
ids = {}

def factory(func, column):
Why not just use dask.base.tokenize? This provides deterministic keys, without the need for a cache.
tokenize(func, column)
That was missing knowledge about the dask internals on my part. Using tokenize is a good idea; I will change this part.
The rename is not related to my code. dask does not seem to handle multilevel columns too well (for example, df['A', 'mean'] does not work either). A failing test for the rename functionality could be:

def test_renames_multilevelcolumns():
    pdf = pd.DataFrame({('A', '0'): [1, 2, 2], ('B', 1): [1, 2, 3]})
    ddf = dd.from_pandas(pdf, npartitions=3)
    pdf.columns = ['foo', 'bar']
    ddf.columns = ['foo', 'bar']
    assert_eq(pdf, ddf)
    # E  AssertionError: Index are different
    # E
    # E  Index classes are not equivalent
    # E  [left]:  Index(['foo', 'bar'], dtype='object')
    # E  [right]: MultiIndex(levels=[['A', 'B'], [1, '0']],
    # E                      labels=[[0, 1], [1, 0]])

The culprit is the rename in core._rename; it can be reproduced with pandas alone:

def test_renames_pandas():
    pdf = pd.DataFrame({('A', '0'): [1, 2, 2], ('B', '1'): [1, 2, 3]})
    expected = pdf.copy()
    expected.columns = ['foo', 'bar']
    # df.rename(columns=dict(zip(df.columns, columns))) in core._rename
    actual = pdf.rename(columns=dict(zip(pdf.columns, ['foo', 'bar'])))
    assert_eq(expected, actual)
    # result:
    # E  AssertionError: DataFrame.columns are different
    # E
    # E  DataFrame.columns classes are not equivalent
    # E  [left]:  Index(['foo', 'bar'], dtype='object')
    # E  [right]: MultiIndex(levels=[['A', 'B'], ['0', '1']],
    # E                      labels=[[0, 1], [0, 1]])

However, as of now, I have no good idea how to fix this.
@chmp in pandas you do this as …
I posted a comment on the …
I think that we can safely ignore the multi-named columns issue for this PR. It looks to me like the only outstanding issue is to use tokenize rather than the factory naming solution. Any other comments?
My most recent change addresses the two open comments. However, currently series-aggregate does not work with multiple groupers, i.e., …. For reference: when I remove all checks in …
#
###############################################################

def _make_agg_id(func, column):
    return '{!s}-{!s}-{}'.format(func, column, tokenize(func, column))
Are func and column always strings? If so then perhaps just func + '-' + column?
Not necessarily. In principle, columns can be any valid python type (that is hashable, as far as I know). In particular, integers, booleans, and tuples come to mind.
mrocklin left a comment:
This looks pretty good to me
ddf = dd.from_pandas(pdf, npartitions=10)

assert_eq(pdf.groupby(grouper(pdf)).agg(spec),
          ddf.groupby(grouper(ddf)).agg(spec, split_every=split_every))
Thanks. After your little nudge towards pytest.mark.parametrize, I became a fan of it :)
dask/dataframe/tests/test_groupby.py
Outdated
['sum', 'mean', 'min', 'max', 'count', 'size', 'std', 'var'],
'sum', 'size',
])
@pytest.mark.parametrize('split_every', [None])  # [False, None])
Why not test split_every = False here?
An oversight: I commented it out to increase the test speed and forgot to re-enable it. I'll run the travis tests on a branch in my repo and push the changes after everything is green.
Feel free to use travis.ci here. There isn't a big need to isolate your …
After I enabled travis for my fork, I only have to push into a branch that is not connected to this PR. Then travis runs the tests in isolation from the dask repo. Since dask is already integrated with travis, this whole procedure is very convenient :)
Any further comments on this PR?
Alrighty. Merging. Thanks for all of the effort on this one @chmp! Nice work.
After the great experience with #1666, I started to implement dd.DataFrame(...).groupby(...).agg(...) on a recent, particularly long, train ride. The implementation is pretty much feature-complete; however, support for var and std is still missing. As of now, there are no changes to the existing dask code base for implementing this feature. Since this results in some code duplication and there are further design considerations to be discussed, I wanted to open a PR early to ask for feedback. I hope opening such a big issue as a PR without prior announcement is OK.

Implementation notes

The implementation supports all documented argument values of pandas aggregate:
- .agg('mean')
- .agg(['mean', 'sum'])
- .agg({'b': 'mean', 'c': 'max'})
- .agg({'b': {'c': 'mean'}, 'c': {'a': 'max', 'a': 'min'}})

The aggregation functions sum, min, max, count, size, and mean are currently supported. When function objects are used, they are matched by name against known aggregates. Support for std and var is still missing.

I followed closely the implementation of the existing aggregate functions. However, I added an additional finalize step to make it easy to support tree aggregations (since this topic seems to be under discussion). This design results in three stages: chunk functions, aggregate functions, and finalizers. In aggregate, aca (chunk functions + aggregate functions) is followed by map_partitions (finalizers).

Each stage applies multiple function objects to the (grouped) input dataframe and returns a fragment of the result dataframe. The full result is obtained by concatenating all fragments (_groupby_apply_funcs and _agg_finalize).

Most of the heavy lifting is performed by generating function objects that are applied to the dataframe and return fragments of the result dataframe (_build_agg_args). Since the aggregate and finalize steps work on intermediate results of prior phases, agg forms an implicit expression graph. As of now, each aggregate generates its own independent graph by writing anonymous columns in intermediate dataframes (next_id). An alternative would be to use fixed names for intermediates and thereby reuse them between aggregates. For example, .agg(['sum', 'mean', 'count']) would only compute one sum and one count per input column.

Currently, all aggregations are implemented by calling individual aggregate functions on grouped dataframes (following existing code). Another option would be to call the pandas agg implementation to generate intermediates. Depending on the performance of the pandas implementation, this change could improve overall performance.

Implements #1619.