
[REVIEW] Initial version of non-groupby _take_last #4736

Merged

Conversation

@beckernick
Contributor

commented Apr 25, 2019

Summary of Changes

  • Refactors _take_last to remove the groupby.last() usage

    • Uses a two-tiered approach to find the last valid value in order to maintain speed. The new approach is significantly faster than the groupby.last() approach at large data sizes and comparable at smaller sizes (see the sketch at the end of this summary).
  • Replaces explicit pandas references with a check for whether the object is dataframe-like, and uses the appropriate Series constructor based on the type.

  • Existing tests passed

  • Added and passed test for empty partitions scenario

  • Passes flake8 dask

This closes #4732 and will also close #4743.
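
For reference, a minimal sketch of the two-tiered lookup (the _last_valid helper referenced later in this thread). The exact structure here is an assumption based on the description; it is not the merged code:

import pandas as pd

def _last_valid(s):
    # Tier 1: cheap Python loop over (at most) the last ten rows via .iloc
    for i in range(1, min(len(s), 10) + 1):
        val = s.iloc[-i]
        if not pd.isnull(val):
            return val
    # Tier 2: lots of trailing nulls, so fall back to a full scan
    non_null = s[s.notna()]
    if len(non_null):
        return non_null.iloc[-1]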

@mrocklin
Member

left a comment

In principle this looks fine to me. I did make a bunch of small nitpicky requests, though (hope you don't mind), mostly around reducing indirection for ease of future review.

@beckernick

Contributor Author

commented Apr 25, 2019

I'd still like to remove the explicit pd.Series if possible, as I think it may eventually cause cumulative aggregations to break for cudf when we create DataFrame methods. For now, I believe it shouldn't break anything in cudf, as cumulative aggregations are only at the Series level.

Is there a general way to do this in dask or do you think the best approach would be to look at _cum_agg (the only place this function is used) to see if it really needs this information as a Series?

Also making a change per your most recent suggestion.

@mrocklin

Member

commented Apr 25, 2019

I'd still like to remove the explicit pd.Series if possible as I think it may cause cumulative aggregations to break for cudf eventually when we create DataFrame methods

Right, so it sounds like we need a way to get a Series type from either a DataFrame or a Series. This might be something like:

if is_series_like(obj):
    typ = type(obj)
elif is_dataframe_like(obj):
    typ = type(obj.iloc[0])

Alternatively, we could ask cudf to support the addition of Pandas series objects, which probably isn't a bad idea anyway.

@beckernick

Contributor Author

commented Apr 25, 2019

Would the first approach you've listed make cudf a dependency of dask? I'd think this is not something we would want to do. I'm not sure if storing the type in a variable lets us do anything downstream to create the correct Series (cudf or pandas) without bringing in cudf, but I may be unaware of a function/method that would help here (some kind of constructor wrapper based on the dataframe backend, perhaps?). I'd love to learn more.

@mrocklin

Member

commented Apr 25, 2019

Would the first approach you've listed make cudf a dependency of dask? I'd think this is not something we would want to do. I'm not sure if storing the type in a variable lets us do anything downstream to create the correct Series (cudf or pandas) without bringing in cudf, but I may be unaware of a function/method that would help in this

No, it would not force a dependency on cudf. This would grab the type of the object we were given dynamically at runtime. The only way the type will be a cudf.Series object is if the input was a cudf object, so we'll obviously have already imported cudf by that time.

That being said, making cudf robust to interacting with pandas series objects is probably a good thing to do anyway.

@beckernick

Contributor Author

commented Apr 25, 2019

I agree. Making cudf robust to interacting with pandas does sound like a good idea.

To make sure I'm following what you mean, are you saying that something like:

if is_dataframe_like(obj):
    typ = type(obj.iloc[0])
    if typ is cudf-like:  # pseudocode
        cudf.Series(obj)

will be fine, because the only situation in which the type would be cudf.Series is one where we already have cudf and can thus use cudf.Series? That's also why we couldn't use isinstance(obj, cudf.Series) here: we won't have cudf.Series in every scenario. If so, we're now on the same page!

@mrocklin

Member

commented Apr 25, 2019

I'm saying this

if is_series_like(obj):
    typ = type(obj)
elif is_dataframe_like(obj):
    typ = type(obj.iloc[0])

return typ(out, index=a.columns)

@beckernick

Contributor Author

commented Apr 25, 2019

Wow. Learned something great here. I was not aware that type had this kind of behavior.

@mrocklin

Member

commented Apr 25, 2019

import pandas as pd

s = pd.Series([1, 2, 3])
typ = type(s)
assert typ is pd.Series

so this is the same:

pd.Series([1, 2, 3]) == typ([1, 2, 3])
increase range end by 1 to reach first row, switch explicit pandas reference to dynamically grab type and construct an appropriate Series

@beckernick beckernick changed the title [WIP] Initial version of non-groupby _take_last [REVIEW] Initial version of non-groupby _take_last Apr 25, 2019

# in each column
if is_dataframe_like(a):
    # create Series of appropriate backend dataframe library
    series_typ = type(a.iloc[0])

@mrocklin

mrocklin Apr 26, 2019

Member

Hrm, this will fail if there are empty partitions, as in this example.

import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8]})
ddf = dd.from_pandas(df, npartitions=4)
ddf[ddf.x < 5].cumsum().compute()

(sorry for not noticing this before)

Maybe type(a.head(0).sum())? Actually no, cudf doesn't support this. We'll need to find some cheap way to get the Series type of a DataFrame object that always works :/

@beckernick

beckernick Apr 26, 2019

Author Contributor

Good catch. Will think about this.

@beckernick

beckernick Apr 26, 2019

Author Contributor

Could we leverage the fact that slices behave differently? Instead of grabbing with a.iloc[0], perhaps we could grab with a.iloc[:0] or a.iloc[0:1]? This appears to solve the indexing issue, but it results in a downstream issue with the meta check in is_dataframe_like. Looking into it. This actually gives us a DataFrame, so it's not quite what we need, and that's what causes the meta check issue.
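
For illustration, a quick demo of the difference (my own sketch, plain pandas):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
type(df.iloc[0])    # pandas Series, but raises IndexError on an empty frame
type(df.iloc[0:1])  # pandas DataFrame, even when df is empty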

@beckernick

beckernick Apr 26, 2019

Author Contributor

Also, as a note, it looks like the original _take_last also fails when there are empty partitions, so this will be good to fix.

...
/Users/nickbecker/NVIDIA/dask/dask/dataframe/core.py in _take_last(a, skipna)
   4213         last_row = a.groupby(group_dummy).last()
   4214         if isinstance(a, pd.DataFrame):  # TODO: handle explicit pandas reference
-> 4215             return pd.Series(last_row.values[0], index=a.columns)
   4216         else:
   4217             return last_row.values[0]

IndexError: index 0 is out of bounds for axis 0 with size 0

@beckernick

beckernick Apr 26, 2019

Author Contributor

@mrocklin , the following snippet appears to solve the empty partition issue:

if is_dataframe_like(a):
    # create Series of appropriate backend dataframe library
    series_typ = type(a.loc[0:1, a.columns[0]])  # if it's dataframe-like, it must have at least one column
    if a.empty:
        return series_typ([])  # we don't want to index this since it's empty
    return series_typ({col: _last_valid(a[col]) for col in a.columns}, index=a.columns)

@mrocklin

mrocklin Apr 26, 2019

Member

If this failed before then we can probably leave this as is and raise an issue.

@mrocklin

mrocklin Apr 26, 2019

Member

Ah, seems like you just did this in #4743

@beckernick

beckernick Apr 26, 2019

Author Contributor

Yep. I'll defer to you on whether you want to leave this as is (since it was already failing on dataframes with empty partitions) or try to fix it in this PR.

The above snippet does solve the problem for cumsum and cumprod, but still causes issues with cummin and cummax when it tries to compare a non-empty series with an empty series:

...

/dask/dask/dataframe/methods.py in cummin_aggregate(x, y)
    147 def cummin_aggregate(x, y):
    148     if isinstance(x, (pd.Series, pd.DataFrame)):
--> 149         return x.where((x < y) | x.isnull(), y, axis=x.ndim - 1)
    150     else:       # scalar
    151         return x if x < y else y

/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1674
   1675         elif isinstance(other, ABCSeries) and not self._indexed_same(other):
-> 1676             raise ValueError("Can only compare identically-labeled "
   1677                              "Series objects")
   1678

ValueError: Can only compare identically-labeled Series objects
ipdb> self
x    0
dtype: int64
ipdb> other
Series([], dtype: float64)
ipdb>

Maybe we could name the empty series after the column name, but that doesn't solve the situation you described, where the dataframe is empty and has no columns.
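
For context, a minimal repro of that comparison failure (my own sketch, assuming plain pandas):

import pandas as pd

x = pd.Series([0], name='x')        # last value from a non-empty partition
y = pd.Series([], dtype='float64')  # result from an empty partition

# pandas refuses to align the two element-wise:
x.where((x < y) | x.isnull(), y)    # ValueError: Can only compare identically-labeled Series objects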

@mrocklin

mrocklin Apr 26, 2019

Member

I'd be happy if you wanted to include it. You might also include a test with the currently passing cum-reductions (and, if you wanted to get fancy, xfailed parameters for the failing ones).

@pytest.mark.parametrize('func', [
    M.cumsum,
    M.cumprod,
    pytest.param(M.cummin, marks=[pytest.mark.xfail(reason="...")]),
    ...
])
def test_cum_reduction_empty_partitions(func):
    df = ...
    ...

(I could be totally wrong about the pytest.param syntax, by the way. I would check existing tests.)

@beckernick

beckernick Apr 26, 2019

Author Contributor

I'll include the snippet and add tests for the empty partition case so that we can at least pass cumsum/cumprod with empty partitions, which is an improvement on the current situation. I'll match the xfail style for cummin/cummax on empty partitions to the style in the other tests.

@mrocklin

Member

commented Apr 26, 2019

Merging this tomorrow if there are no further comments.

@mrocklin mrocklin merged commit 8ce1ab7 into dask:master Apr 27, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@mrocklin

Member

commented Apr 27, 2019

This is in. Thanks @beckernick !

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Change cum-aggregation last-nonnull-value algorithm (dask#4736)
Previously we would use groupby.last in order to get the last non-null value of a partition.  This was suboptimal in three ways:

-  Less advanced dataframe libraries, like cudf, might not implement groupby-last, and so couldn't use this algorithm
-  It was somewhat expensive
-  It failed if there were empty partitions

Our new approach tries two paths:

-  First, use a normal Python for loop and `.iloc` on the last ten rows of a column.  This is faster than the groupby.last call and only uses iloc, which is somewhat more likely to be around.
-  If that didn't work (there are lots of nulls), then use `s[s.notna()][-1]`, which does a full scan but doesn't iterate in Python

Thomas-Z added a commit to Thomas-Z/dask that referenced this pull request May 17, 2019
