
WIP: Feature/dataframe aggregate (implements #1619) #1678

Merged
mrocklin merged 23 commits into dask:master from chmp:feature/dataframe-aggregate
Oct 26, 2016

Conversation

@chmp
Contributor

@chmp chmp commented Oct 18, 2016

After the great experience with #1666, I started to implement dd.DataFrame(...).groupby(...).agg(...) on a recent, particularly long, train ride. The implementation is pretty much feature-complete; however, support for var and std is still missing. As of now, there are no changes to the existing dask code base for implementing this feature. Since this fact results in some code duplication, and since there are further design considerations to be discussed, I wanted to open a PR early to ask for feedback. I hope opening such a big issue as a PR without prior announcement is OK.

Implementation notes

The implementation supports all documented argument values of pandas aggregate:

  • function (.agg('mean'))
  • list of functions (.agg(['mean', 'sum']))
  • dict of columns -> functions (.agg({'b': 'mean', 'c': 'max'}))
  • nested dict of names -> dicts of functions (.agg({'b': {'c': 'mean'}, 'c': {'a': 'max', 'b': 'min'}}))
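For reference, these argument forms can be exercised against plain pandas. A minimal sketch (the nested-dict renaming form is omitted here, since later pandas versions deprecated it):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1],
                   'b': [1.0, 2.0, 3.0, 4.0],
                   'c': [5, 6, 7, 8]})
g = df.groupby('a')

single = g.agg('mean')                      # one function applied to every column
multi = g.agg(['mean', 'sum'])              # yields MultiIndex columns (column, function)
per_col = g.agg({'b': 'mean', 'c': 'max'})  # per-column functions
```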

The aggregation functions sum, min, max, count, size, and mean are currently supported. When function objects are used, they are matched by name against known aggregates. Support for std and var is still missing.

I closely followed the implementation of the existing aggregate functions. However, I added an additional finalize step to make it easy to support tree aggregations (since this topic seems to be under discussion). This design results in three stages: chunk functions, aggregate functions, and finalizers. In aggregate, aca (chunk functions + aggregate functions) is followed by map_partitions (finalizers).
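The three stages can be sketched by hand with plain pandas (an illustrative toy, not the dask code itself): a grouped mean built from per-partition sum/count intermediates that are reduced and then finalized.

```python
import pandas as pd

parts = [pd.DataFrame({'g': [0, 1], 'x': [1.0, 2.0]}),
         pd.DataFrame({'g': [0, 1], 'x': [3.0, 4.0]})]

# chunk: each partition produces intermediate columns (here: sum and count)
chunks = [p.groupby('g')['x'].agg(['sum', 'count']) for p in parts]

# aggregate: concatenate the intermediates and reduce them per group
inter = pd.concat(chunks).groupby(level=0).sum()

# finalize: combine the reduced intermediates into the requested statistic
mean = inter['sum'] / inter['count']
```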

Each stage applies multiple function objects to the (grouped) input dataframe and returns a fragment of the result dataframe. The full result is obtained by concatenating all fragments (_groupby_apply_funcs and _agg_finalize).

Most of the heavy lifting is performed by generating function objects that are applied to the dataframe and return fragments of the result dataframe (_build_agg_args). Since aggregate and finalize steps work on intermediate results of prior phases, agg forms an implicit expression graph. As of now, each aggregate generates its own independent graph by writing anonymous columns in intermediate dataframes (next_id). An alternative would be to use fixed names for intermediates and thereby reuse them between aggregates. For example, .agg(['sum', 'mean', 'count']) would only compute one sum and one count per input-column.
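A toy version of that reuse, with a hypothetical shared_intermediates helper (not the PR's actual code): mean is finalized from the same sum and count intermediates the other aggregates request, so each intermediate is computed only once.

```python
import pandas as pd

def shared_intermediates(grouped, specs):
    # hypothetical sketch: collect the intermediates each aggregate needs,
    # compute every distinct intermediate exactly once, then finalize
    needed = {}
    for spec in specs:
        if spec in ('sum', 'count'):
            needed[spec] = None
        elif spec == 'mean':
            needed['sum'] = None
            needed['count'] = None
    inter = {name: grouped.agg(name) for name in needed}
    out = {}
    for spec in specs:
        if spec == 'mean':
            out[spec] = inter['sum'] / inter['count']  # finalize from shared parts
        else:
            out[spec] = inter[spec]
    return out

df = pd.DataFrame({'g': [0, 0, 1], 'x': [1.0, 2.0, 6.0]})
res = shared_intermediates(df.groupby('g')['x'], ['sum', 'mean', 'count'])
```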

Currently, all aggregations are implemented by calling individual aggregate functions on grouped dataframes (following existing code). Another option would be to call the pandas agg implementation to generate intermediates. Depending on the performance of the pandas implementation, this change could improve overall performance.

Implements #1619.

Member

@mrocklin mrocklin left a comment

Hey, nice work. This looks like a very good start. I've left some comments below. cc'ing @jreback who might be able to use this functionality immediately and @jcrist who has done similar work recently.

next_id, simple_impl[func])

elif func == 'var':
raise NotImplementedError()
Member

The functions in dask/dataframe/groupby.py like _var_chunk may be of value here.

Contributor Author

Thanks for the pointer. I will probably reimplement it to be able to reuse intermediates across different reductions, if that is OK with you?

Member

As long as single-groupby-aggregation performance isn't seriously affected or made significantly more complex then, yes, modifying existing code sounds great.



AggArgs = collections.namedtuple('AggArgs', ['chunk_funcs', 'aggregate_funcs',
'finalizers'])
Member

Would a dictionary do ok here? Most dask development has relatively few custom classes. Namedtuples are pretty benign, I agree, but the majority of the codebase is pretty barebones.

Contributor Author

@chmp chmp Oct 19, 2016

I'll replace the namedtuple with a dict.

partial(_apply_func_to_column, int_sum, int_sum, M.sum),
partial(_apply_func_to_column, int_count, int_count, M.sum)
],
finalizers=[(result_column, partial(_finalize_mean, int_sum, int_count))],
Member

There is a non-trivial cost to using partial functions in a distributed setting. They are harder to serialize and harder to reason about when people need to debug the graphs themselves (which sometimes core developers do need to do).

This isn't a blocking request (fine if it's too hard) but if its possible to put these arguments directly into the graph rather than implicitly with a partial then that would be better.

Contributor Author

@chmp chmp Oct 19, 2016

Thanks for the explanation. To be honest, I was wondering when reading the dask code base about this design choice, makes perfect sense.

I'll put the arguments directly into the graph. My proposal would be to reverse the arguments to _apply_func_to_column, convert all partial arguments to keyword arguments, and to replace the partials with pairs of function and kwargs.

Member

Sounds good to me

assert eq(pdf.groupby('a').agg(spec), ddf.groupby('a').agg(spec))

spec = 'mean'
assert eq(pdf.groupby('a').agg(spec), ddf.groupby('a').agg(spec))
Member

It might be nice to put all of the specs into a list and then loop over them. Or perhaps use a @pytest.parametrize testing fixture.

Contributor Author

I will go with pytest.parametrize (and fix the failing tests due to the recent changes on master).

'max': (M.max, M.max),
'count': (M.count, M.sum),
'size': (M.size, M.sum),
}
Member

You might want to see the very recent work of tree reductions here: #1663

This ends up having a significant performance impact for some groupbys. It would be nice (eventually) to have the same logic in aggregate. This would require adding another intermediate combine function to each of these tuples.

Contributor Author

Perfect, I wasn't aware that it had already been merged. Dask development has quite a rapid pace to it, if I may say.

I'll rebase and add tree reductions to aggregate.

Contributor Author

I'm wondering: after the rebase, aren't I using tree reductions already? Since I do not set split_every=False in apply_concat_apply, the tree reduction code should run as is. In any case, I will expose the split_every argument on aggregate and double-check whether the default of combine == aggregate works for the reductions used in agg.

Member

Right, according to the docstring it looks like the intermediate combine function defaults to the final aggregate function if none is provided. @jcrist thoughts on this default? An alternative default would be to use split_every=False if users don't provide a combine function, which might be safer for user-defined functions.

Regardless, I think it would be safer here to explicitly include a combine function if one is known, so for example:

# 'count': (M.count, M.sum)  #  before
  'count': (M.count, M.sum, M.sum)  # after
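Hand-computing the chunk/combine/aggregate split for count makes the role of the extra tuple element concrete (a toy sketch, not dask's tree-reduction machinery):

```python
# four partitions of raw data
parts = [[1, 2], [3], [4, 5], [6, 7, 8]]

chunked = [len(p) for p in parts]                # chunk: per-partition counts
combined = [sum(chunked[:2]), sum(chunked[2:])]  # combine: one intermediate tree level
total = sum(combined)                            # aggregate: final reduction
```

For count, combine and aggregate are both plain sums over the intermediates, which is exactly why the combine == aggregate default is safe here.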

Member

I went with that default because it matches the interface of tree reductions in bag and array. Many of the reductions defined here also have identical aggregate and combine functions, so it's slightly more concise to use this default :). My thought is that custom reductions are advanced enough that we can put the burden on users to properly use this method.

Member

OK

Contributor Author

For all supported aggregations, combine == aggregate. While I would be happy to add the corresponding tuple elements / dictionary entries in the _build_ ... functions, I feel it would needlessly clutter the code. However, I think adding the corresponding keyword arguments to apply_concat_apply would make the combine == aggregate assumption explicit.

@chmp
Contributor Author

chmp commented Oct 19, 2016

I rebased onto master and addressed your comments (namedtuples, partial, pytest parametrize). Further, I changed the semantics of the funcs argument in _groupby_apply_funcs: now, each function returns exactly one column. This change allowed me to skip recomputing intermediate sums. For certain aggregate calls this should reduce the communication / computation overhead by a factor of two or more; a specific example would be df.groupby(...).agg(['mean', 'sum', 'count']). Finally, I also implemented var and std.

Next, I will tackle tree reductions.

# generated it is reused in subsequent id requests.
ids = it.count()
# TODO: add func, input_column to intermediate name for easier debugging
id_map = collections.defaultdict(lambda: 'intermediate-{}'.format(next(ids)))
Contributor Author

Currently, skipping recomputation is implemented by a somewhat arcane combination of itertools.count and defaultdict. What do you think about implementing it as a separate class? That would also make it simpler to generate easy-to-interpret intermediate names.

Member

Are these names used as keys in the task graph or just as column names? Is there a convenient deterministic way to create these names?

Generally I avoid classes, probably somewhat irrationally.

Contributor Author

Rightly so: now that I think about it, this functionality can also be easily implemented with a closure.

The keys are only used as column names in intermediate data frames.

With regard to generating these names deterministically: I am fairly certain that prefixing the repr of the column key with the function name would yield a unique key, as long as the separator is not used inside the function name. However, I thought it better to make sure by adding a unique component to the name.

A possible implementation could be:

def _make_agg_id_factory():
    integers = it.count()
    ids = {}

    def factory(func, column):
        try:
            return ids[func, column]

        except KeyError:
            new_id = '{}-{!s}-{!r}'.format(next(integers), func, column)
            return ids.setdefault((func, column), new_id)

    return factory

@chmp
Contributor Author

chmp commented Oct 19, 2016

Regarding the failed travis check: I cannot see any place where I changed code that is relevant to that test. Are there any known issues? Maybe it is some very unlikely combination of the random numbers generated.

Update: it seems to be a stochastic failure of the pivot_table test. After pushing 3fd0ef9 it worked again.

assert len(no_mean_aggs) == len(with_mean_aggs)

assert len(no_mean_finalizers) == len(no_mean_spec)
assert len(with_mean_finalizers) == len(with_mean_spec)
Member

It would be nice to check some additional things about the graphs produced here:

  1. Do they work well with split_every. The dask.dataframe.utils.assert_max_deps function might be useful here
  2. Do they produce deterministic key names. Is df.groupby(...).aggregate(...).dask == df.groupby(...).aggregate(...).dask for equivalent values of ...?
  3. Are intermediates shared? Perhaps by checking that the lengths of the resulting dask graphs are shorter than one might expect from a naive handling.

Contributor Author

I will add these tests. Check 3 should always be satisfied. Irrespective of the chosen aggregations, only a single dataframe will be generated per partition. Therefore, the number of elements inside the dask graphs should always be the same. Only the columns added to the dataframes depend on the argument to aggregate.

Member

Check 3 should always be satisfied. Irrespective of the chosen aggregations, only a single dataframe will be generated per partition. Therefore, the number of elements inside the dask graphs should always be the same.

Very cool. Besides being more convenient I anticipate that this will have some very nice performance benefits :)


assert_eq(pdf.groupby(['a', 'd']).agg(spec),
ddf.groupby(['a', 'd']).agg(spec))

Contributor Author

@chmp chmp Oct 20, 2016

After adapting the tests, I found a couple of problems:

  1. using series as groupby arguments fails miserably
  2. pandas uses additional magic where it tries to cast the result back to its original type. The relevant code can be found in _GroupBy._try_cast (see also Intermittent failure in pivot_table test #1685). While it is definitely possible to ensure the final data frame has the correct types after .compute(), such a step would also mean that there is no way of knowing the correct meta object before performing the actual aggregation
  3. .groupby(...).agg('size') returns a series, for inconsistency's sake

Contributor Author

@chmp chmp Oct 23, 2016

My recent commits address issues 1 and 3. For Issue 2 however, i.e., the recasting of the result, I don't see any way to fix it.

else:
levels = 0

# TODO: add normed spec as an additional part to the token?
Contributor Author

Are there any specifics I need to watch out for when generating the token, or does apply-concat-apply ensure that the resulting dask keys are unique?

Member

The apply_concat_apply function should handle things here

index = list(index.columns)

self.index = index
self.index = _normalize_index(df, index)
Contributor Author

Here, I added support for lists of series as a grouper, which was previously not supported. This logic is required to handle df.groupby([df['a'], df['b']]) consistently with pandas. The recursion in _normalize_index is not strictly necessary (since lists of lists should not be valid groupers), but it was the easiest way to implement it.
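The pandas behaviour being matched here, as a small sketch (index alignment of the grouping series is exactly what the normalization has to preserve):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1], 'b': [0, 1, 1], 'x': [1.0, 2.0, 3.0]})

# grouping by a list of series rather than by column names
res = df.groupby([df['a'], df['b']])['x'].sum()
```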

meta = _aggregate_meta(self.obj._meta, meta_groupby, 0, chunk_funcs,
aggregate_funcs, finalizers)

if not isinstance(self.index, list):
Contributor Author

Here, I'm adding multiple series as varargs to the chunk functions. This way, no additional logic for aligning the different series is required. The chunk functions have been adapted accordingly.

agg_dask1 = get_agg_dask(result1)
agg_dask2 = get_agg_dask(result2)

core_agg_dask1 = {k: v for (k, v) in agg_dask1.dask.items()
Contributor Author

map_partitions used in the finalize step passes the meta dataframe. Therefore, == can no longer be used for comparison.

@mrocklin
Member

In travis.ci it looks like there is an encoding issue with one of the files and some flake8 issues generally. Nothing major though.


meta_groupby = pd.Series([], dtype=bool, index=self.obj._meta.index)
meta = _aggregate_meta(self.obj._meta, meta_groupby, 0, chunk_funcs,
aggregate_funcs, finalizers)
Member

It looks like this function is only a few lines and is only used here. It might be easier to read if this function was just inlined here instead.

Contributor Author

I inlined it.


elif isinstance(self.index, list):
group_columns = {i for i in self.index
if isinstance(i, tuple) or np.isscalar(i)}
Member

Does the input order not matter here?

Contributor Author

@chmp chmp Oct 23, 2016

group_columns is only used to create the list of non_group_columns. Since only __contains__ is checked, the order of group_columns is irrelevant.


# TODO: add normed spec as an additional part to the token?
token = 'aggregate'
finalize_token = '{}-finalize'.format(token)
Member

Also, it might be nicer to just inline 'aggregate-finalize' directly in the map_partitions call.

Removing small logic like this should make it easier for other developers to maintain this function in the future.

Contributor Author

I assumed additional logic was required. Since this is not the case, I removed both tokens and inlined them directly into the aca and map_partitions calls.


def _normalize_spec(spec, non_group_columns):
"""
Return a list of ``(result_column, func, input_column)`` tuples.
Member

Can I ask for a quick Examples section here with a couple of worked examples to help future developers quickly understand how this function operates.

Examples
--------

>>> _normalize_spec(simple, inputs)
... simple outputs ...

Contributor Author

I added further explanation to the docstring and different examples that should cover the functionality.
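A stripped-down re-implementation of that normalization, as a hypothetical normalize_spec function (illustrative only; the real logic lives in dask/dataframe/groupby.py):

```python
def normalize_spec(spec, columns):
    """Hypothetical sketch: expand an agg spec into
    (result_column, func, input_column) triples."""
    if isinstance(spec, str):
        # single function: apply to every column
        return [(col, spec, col) for col in columns]
    if isinstance(spec, list):
        # list of functions: MultiIndex-style result columns
        return [((col, func), func, col) for col in columns for func in spec]
    if isinstance(spec, dict):
        result = []
        for col, funcs in spec.items():
            if isinstance(funcs, str):
                result.append((col, funcs, col))
            elif isinstance(funcs, list):
                result.extend(((col, f), f, col) for f in funcs)
            else:  # nested dict: result name -> function
                result.extend(((col, name), f, col) for name, f in funcs.items())
        return result
    raise TypeError('unsupported spec: {!r}'.format(spec))
```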

#
###############################################################

def _make_agg_id_factory():
Member

Should this function be tested independently?

Contributor Author

@chmp chmp Oct 23, 2016

I added an extra test.


def _groupby_apply_funcs(df, *index, **kwargs):
"""
TODO: document args
Member

Lingering TODO here

Contributor Author

I added a full description of this function with all its arguments.

--------

>>> _normalize_spec('mean', ['a', 'b', 'c'])
[('a', 'mean', 'a'), ('b', 'mean', 'b'), ('c', 'mean', 'c')]
Member

These should be unindented to line up with the normal text on the left. We should also drop the newline after the ----.

http://dask.readthedocs.io/en/latest/develop.html#docstrings

... 'b': {'e': 'count', 'f': 'var'}},
... ['a', 'b', 'c'])
[(('a', 'mean'), 'mean', 'a'), (('a', 'size'), 'size', 'a'),
(('b', 'e'), 'count', 'b'), (('b', 'f'), 'var', 'b')]
Member

These are super-informative by the way. Thanks!

funcs: list of result-column, function, keyword-argument triples
The list of functions that are applied to the grouped data frame.
Has to be passed as a keyword argument.

Member

Should drop endlines here

@mrocklin
Member

This is looking pretty good to me. It would be nice to have someone more familiar with dask.dataframe take a look at it before merging. @jcrist if you have time tomorrow this could use your review.

@jreback
Contributor

jreback commented Oct 24, 2016

first (test) usage, seemed to work nicely.

intentional to support only DataFrameGroupby ATM? (fine by me; maybe put in a NotImplementedError on SeriesGroupby usage though)

@jreback
Contributor

jreback commented Oct 24, 2016


In [27]: ddf = dd.from_pandas(pd.DataFrame({'A' : [1,2,3], 'B' : [1, 2, 3]}), npartitions=3)

In [28]: ddf
Out[28]: dd.DataFrame<from_pa..., npartitions=2, divisions=(0, 1, 2)>

In [29]: ddf.compute()
Out[29]:
   A  B
0  1  1
1  2  2
2  3  3

In [30]: ddf2 = ddf.groupby('A').agg(['sum', 'mean'])

In [31]: ddf2.compute()
Out[31]:
    B
  sum mean
A
1   1  1.0
2   2  2.0
3   3  3.0

In [32]: ddf2.columns = ['foo', 'bar']

In [33]: ddf2.compute()
Out[33]:
    B
  sum mean
A
1   1  1.0
2   2  2.0
3   3  3.0

In [35]: import dask

In [36]: dask.__version__
Out[36]: '0.11.1+54.g7f411f8'

so the rename [32] is not propagating

In [37]: ddf3 = dd.concat([ddf.groupby('A').sum(), ddf.groupby('A').mean()], axis=1)
C:\Anaconda\envs\bmc3\lib\site-packages\dask\dataframe\multi.py:653: UserWarning: Concatenating dataframes with unknown divisions.
We're assuming that the indexes of each dataframes are aligned
.This assumption is not generally safe.
  warn("Concatenating dataframes with unknown divisions.\n"

In [38]: ddf3.compute()
Out[38]:
   B    B
A
1  1  1.0
2  2  2.0
3  3  3.0

In [39]: ddf3.columns = ['foo', 'bar']

In [40]: ddf3.compute()
Out[40]:
   bar  bar
A
1    1  1.0
2    2  2.0
3    3  3.0

So I am not sure this is directly in this PR

@chmp
Contributor Author

chmp commented Oct 24, 2016

Regarding the missing aggregate for Series: this is a mere oversight, since it did not come up in my use case. It should be easy to add, I'll have a look at it.

integers = it.count()
ids = {}

def factory(func, column):
Member

Why not just use dask.base.tokenize? This provides deterministic keys, without the need for cache.

tokenize(func, column)

Contributor Author

@chmp chmp Oct 24, 2016

Missing knowledge about the dask internals. Using tokenize is a good idea, I will change this part.

@chmp
Contributor Author

chmp commented Oct 24, 2016

The rename is not related to my code. dask does not seem to handle multilevel columns too well (for example, df['A', 'mean'] does not work either). A failing test for the rename functionality could be:

def test_renames_multilevelcolumns():
    pdf = pd.DataFrame({('A', '0') : [1,2,2], ('B', 1) : [1, 2, 3]})
    ddf = dd.from_pandas(pdf, npartitions=3)

    pdf.columns = ['foo', 'bar']
    ddf.columns = ['foo', 'bar']

    assert_eq(pdf, ddf)

# E       AssertionError: Index are different
# E       
# E       Index classes are not equivalent
# E       [left]:  Index(['foo', 'bar'], dtype='object')
# E       [right]: MultiIndex(levels=[['A', 'B'], [1, '0']],
# E                  labels=[[0, 1], [1, 0]])

The culprit is the dask.dataframe.core._rename implementation:

def test_renames_pandas():
    pdf = pd.DataFrame({('A', '0') : [1,2,2], ('B', '1') : [1, 2, 3]})

    expected = pdf.copy()
    expected.columns = ['foo', 'bar']

    # df.rename(columns=dict(zip(df.columns, columns))) in core._rename
    actual = pdf.rename(columns=dict(zip(pdf.columns, ['foo', 'bar'])))

    assert_eq(expected, actual)

# result:
# E       AssertionError: DataFrame.columns are different
# E       
# E       DataFrame.columns classes are not equivalent
# E       [left]:  Index(['foo', 'bar'], dtype='object')
# E       [right]: MultiIndex(levels=[['A', 'B'], ['0', '1']],
# E                  labels=[[0, 1], [0, 1]])

However, as of now, I have no good idea how to fix this.

@jreback
Contributor

jreback commented Oct 24, 2016

@chmp in pandas you do this as df.loc[('A', 'mean')]

@chmp
Contributor Author

chmp commented Oct 24, 2016

I posted a comment on the __getitem__ issue.

@mrocklin
Member

I think that we can safely ignore the multi-named columns issue for this PR. It looks to me like the only outstanding issue it to use tokenize rather than the factory naming solution.

Any other comments?

@chmp
Contributor Author

chmp commented Oct 25, 2016

My most recent change addresses the two open comments:

  • The factory for intermediate ids is gone. The unique part of the id comes now from tokenize.
  • Series aggregate is implemented.

However, series aggregate currently does not work with multiple groupers, i.e., s.groupby([s1, s2]).aggregate(...). The reason is a limitation of the groupby implementation for series in dask master: when I comment out the check in SeriesGroupby.__init__, the short-circuit branches fail, whereas the full aggregate code works as intended. I could have a look at this issue as part of this PR; however, I would prefer to open a separate issue to address this topic.
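In plain pandas the multi-grouper series call works, which is the behaviour the dask series path should eventually match (small sketch):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
g1 = pd.Series([0, 0, 1, 1])
g2 = pd.Series([0, 1, 0, 1])

# series aggregate with multiple series groupers
res = s.groupby([g1, g2]).agg('sum')
```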

For reference, when I remove all checks in SeriesGroupby.__init__, the test results are

dask/dataframe/tests/test_groupby.py::test_series_aggregate__examples[grouper0-None-spec0] XPASS
dask/dataframe/tests/test_groupby.py::test_series_aggregate__examples[grouper0-None-spec1] XPASS
dask/dataframe/tests/test_groupby.py::test_series_aggregate__examples[grouper0-None-spec2] XPASS
dask/dataframe/tests/test_groupby.py::test_series_aggregate__examples[grouper0-None-sum] xfail
dask/dataframe/tests/test_groupby.py::test_series_aggregate__examples[grouper0-None-size] xfail

#
###############################################################
def _make_agg_id(func, column):
return '{!s}-{!s}-{}'.format(func, column, tokenize(func, column))
Member

Are func and column always strings? If so then perhaps just func + '-' + column?

Contributor Author

@chmp chmp Oct 25, 2016

Not necessarily. In principle, columns can be any valid (hashable) python type. In particular, integers, booleans, and tuples come to mind.
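Hence the unique token suffix: non-string column keys such as tuples or ints could otherwise collide once stringified. A deterministic stand-in sketch, using hashlib in place of dask.base.tokenize:

```python
import hashlib

def make_agg_id(func, column):
    # deterministic stand-in for dask.base.tokenize: equal inputs always
    # hash to the same token, so no per-call counter or cache is needed
    token = hashlib.md5(repr((func, column)).encode()).hexdigest()
    return '{!s}-{!s}-{}'.format(func, column, token[:8])
```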

Member

@mrocklin mrocklin left a comment

This looks pretty good to me

ddf = dd.from_pandas(pdf, npartitions=10)

assert_eq(pdf.groupby(grouper(pdf)).agg(spec),
ddf.groupby(grouper(ddf)).agg(spec, split_every=split_every))
Member

Nice test!

Contributor Author

Thanks. After your little nudge towards pytest.mark.parametrize, I became a fan of it :)

['sum', 'mean', 'min', 'max', 'count', 'size', 'std', 'var'],
'sum', 'size',
])
@pytest.mark.parametrize('split_every', [None])#[False, None])
Member

Why not test split_every = False here?

Contributor Author

An oversight: I commented it out to increase the test speed and forgot to comment it back in. I'll run the travis tests on a branch in my repo and push the changes after everything is green.

@mrocklin
Member

Feel free to use travis.ci here. There isn't a big need to isolate your testing if you don't find it convenient.

@chmp
Contributor Author

chmp commented Oct 25, 2016

After I enabled travis for my fork, I only have to push to a branch that is not connected to this PR; travis then runs the tests in isolation from the dask repo. Since dask is already integrated with travis, this whole procedure is very convenient :)

@mrocklin
Member

Any further comments on this PR?

@mrocklin
Member

Alrighty. Merging. Thanks for all of the effort on this one @chmp ! Nice work.

@mrocklin mrocklin merged commit 6481410 into dask:master Oct 26, 2016
@sinhrks sinhrks added this to the 0.12.0 milestone Nov 7, 2016