Adds distinct on #1159
@@ -97,8 +102,17 @@ class Distinct(Expr):
    >>> from blaze.compute.python import compute
    >>> sorted(compute(e, data))
    [('Alice', 100, 1), ('Bob', 200, 2)]

    Using a subset
You have to put a space after these comments, otherwise Sphinx won't render them correctly.
done
pull in master after i merge #1161, numba changed the way they
c865ccd to 923f592
@@ -199,6 +199,8 @@ def compute_up(t, x, **kwargs):


@dispatch(Distinct, np.ndarray)
def compute_up(t, x, **kwargs):
    if t.on:
        raise ValueError('numpy backend cannot specify what to distinct on')
maybe say 'numpy backend cannot specify what column to distinct on'
ceb4d60 to 581e5c5
note: I need to get the string conversion for the pandas round trip and add tests for np and probably the other error cases before this is ready.
d5dfad4 to adf9158
And it's passing!
 def compute_up(t, df, **kwargs):
-    return df.drop_duplicates().reset_index(drop=True)
+    return df.drop_duplicates(
+        subset=t.on if t.on else None,
minor style comment: should these be t.on or None?
ah yes, I was previously doing a check against None
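For readers following the style point above, here is a minimal sketch of why the two spellings behave the same. It assumes `on` is a tuple of column names, an empty tuple, or `None`, which the diff suggests but does not confirm; `subset_conditional` and `subset_or` are hypothetical names used only for illustration.

```python
def subset_conditional(on):
    # Spelling used in the diff: fall back to None when `on` is falsy.
    return on if on else None


def subset_or(on):
    # Spelling suggested in review: `or` returns its left operand when
    # that operand is truthy, otherwise its right operand.
    return on or None


# Hypothetical inputs: no columns, one column, several columns, None.
cases = [(), ('name',), ('name', 'id'), None]
results = [(subset_conditional(c), subset_or(c)) for c in cases]
```

The two only differ from an explicit `is None` check, which would pass an empty tuple through unchanged instead of turning it into `None`.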
    raises=NotImplementedError,
    reason='cannot specify columns to distinct on yet',
)
def test_distinct_in(rdd):
should this be test_distinct_on?
oops, yup
if getattr(arr.dtype, 'names', None) is not None:
    return pd.DataFrame.from_records(arr).drop_duplicates(
        subset=t.on if t.on else None,
    ).reset_index(drop=True).value.astype(arr.dtype)
i think you might not be covering this in tests since value isn't an attribute of DataFrames
also, you probably want to do to_records(index=False).astype(arr.dtype) here because values will use a lot more memory if you have different types in your DataFrame.
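A small sketch of the round trip suggested above, to make the difference concrete. The sample record array and its field names are made up for illustration, and this is not the code that ended up in the PR; it only shows that `values` collapses mixed columns into one object array while `to_records(index=False).astype(arr.dtype)` keeps one typed field per column.

```python
import numpy as np
import pandas as pd

# Hypothetical structured input with a string field and an integer field.
arr = np.array(
    [('Alice', 100), ('Bob', 200), ('Alice', 100)],
    dtype=[('name', 'U5'), ('amount', 'i8')],
)

df = pd.DataFrame.from_records(arr).drop_duplicates().reset_index(drop=True)

# `values` has to fit every column into a single 2-D array, so a string
# column next to an integer column produces a generic object array.
as_values = df.values

# `to_records` keeps one field per column; `astype` casts the fields back
# to the dtype of the original input.
as_records = df.to_records(index=False).astype(arr.dtype)
```

Here `as_values.dtype` is `object`, whereas `as_records` has the same field names as `arr` and holds the two distinct rows.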
 @dispatch(Distinct, np.ndarray)
-def compute_up(t, x, **kwargs):
-    return np.unique(x)
+def compute_up(t, arr, _recarray_distinct=recarray_distinct, **kwargs):
any reason not to just call recarray_distinct in the body of the function?
this is just a slight perf increase.
i don't think it's necessary. what kind of perf increase are we talking? saving the cost of a global lookup?
Yeah, it swaps out a global lookup for a local lookup. It's not a huge gain but it does slightly cut down the cost of calling the function.
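To illustrate the trade-off discussed in this thread, here is a minimal sketch of the default-argument trick in isolation. `helper`, `call_via_global`, and `call_via_default` are hypothetical stand-ins, not names from the PR; the point is only that the default is bound once at definition time and then read as a local variable.

```python
import timeit


def helper(x):
    return x + 1


def call_via_global(x):
    # `helper` is looked up in the module globals on every call.
    return helper(x)


def call_via_default(x, _helper=helper):
    # `_helper` was bound when the function was defined, so inside the
    # body it is an ordinary local variable.
    return _helper(x)


# The code objects show where each name is resolved.
global_names = call_via_global.__code__.co_names        # global/attribute names
local_names = call_via_default.__code__.co_varnames     # local variable names

if __name__ == "__main__":
    # Rough timing only; the numbers vary by machine and Python version.
    print(timeit.timeit(lambda: call_via_global(1), number=100_000))
    print(timeit.timeit(lambda: call_via_default(1), number=100_000))
```

One design cost worth weighing against the small speedup: the extra keyword becomes part of the function's signature, and later rebinding of the global name is no longer picked up by the function.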
ok this looks good to me