New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds distinct on #1159

Merged
merged 12 commits into from Jul 13, 2015

Conversation

Projects
None yet
2 participants
@llllllllll
Member

llllllllll commented Jul 7, 2015

No description provided.

@llllllllll llllllllll added this to the 0.8.2 milestone Jul 7, 2015

@llllllllll llllllllll added postgresql and removed sql labels Jul 7, 2015

@@ -97,8 +102,17 @@ class Distinct(Expr):
>>> from blaze.compute.python import compute
>>> sorted(compute(e, data))
[('Alice', 100, 1), ('Bob', 200, 2)]
Using a subset

This comment has been minimized.

@cpcloud

cpcloud Jul 7, 2015

Member

You have to put a space after these comments otherwise sphinx won't render them correctly.

This comment has been minimized.

@llllllllll
@cpcloud

This comment has been minimized.

Member

cpcloud commented Jul 7, 2015

pull in master after i merge #1161, numba changed the way they __repr__d functions so i went ahead and wrote some proper tests instead of depending on doctest

@llllllllll llllllllll force-pushed the quantopian:distinct branch 2 times, most recently from c865ccd to 923f592 Jul 7, 2015

@@ -199,6 +199,8 @@ def compute_up(t, x, **kwargs):
@dispatch(Distinct, np.ndarray)
def compute_up(t, x, **kwargs):
if t.on:
raise ValueError('numpy backend cannot specify what to distinct on')

This comment has been minimized.

@cpcloud

cpcloud Jul 7, 2015

Member

maybe say numpy backend cannot specify what column to distinct on

@dispatch(Distinct, Series)
def compute_up(t, s, **kwargs):
if t.on:
raise ValueError('series distinct cannot specify what to distinct on')

This comment has been minimized.

@cpcloud

cpcloud Jul 7, 2015

Member

Maybe have a message here that indicates that this operation doesn't make sense

This comment has been minimized.

@llllllllll

llllllllll Jul 7, 2015

Member

Yeah, I will try to think of some shared language for this an the ndarray case. Any ideas?

This comment has been minimized.

@llllllllll

llllllllll Jul 7, 2015

Member

Actually, I think that this is only shared when the ndarray is a vector not a matrix

This comment has been minimized.

@cpcloud

cpcloud Jul 7, 2015

Member

the ndarray case makes sense for struct and recarrays but not for anything else so sharing the language doesn't make sense in every case.

i'd do something like this:

if getattr(data.dtype, 'names', None) is None:  # single dtyped array
    raise ValueError('operation does not make sense')
else:  # struct/recarray
    raise NotImplementedError('makes sense, but numpy does not have an API for this operation')

This comment has been minimized.

@llllllllll

llllllllll Jul 7, 2015

Member

For the recarray case we could just throw it into a df and then pull the underlying array back out

This comment has been minimized.

@cpcloud

cpcloud Jul 7, 2015

Member

i hope that doesn't seem like a lot of extra work. i think we should try to provide error messages as soon and as accurate as possible

This comment has been minimized.

@cpcloud

cpcloud Jul 7, 2015

Member

For the recarray case we could just throw it into a df and then pull the underlying array back out

Yep, that sounds good. The only issue is converting strings back to their original fixed length dtype, but that shouldn't be too difficult

This comment has been minimized.

@llllllllll

llllllllll Jul 7, 2015

Member

This is not too much, I agree that error messages are very important for the usability of a project.

@llllllllll llllllllll force-pushed the quantopian:distinct branch 2 times, most recently from ceb4d60 to 581e5c5 Jul 7, 2015

@llllllllll

This comment has been minimized.

Member

llllllllll commented Jul 8, 2015

note: I need to get the string conversion for the pandas round trip and add tests for np and probably the other error cases before this is ready.

@llllllllll llllllllll added the wip label Jul 8, 2015

@llllllllll llllllllll force-pushed the quantopian:distinct branch 2 times, most recently from d5dfad4 to adf9158 Jul 8, 2015

@cpcloud cpcloud modified the milestones: 0.8.2, 0.8.3 Jul 9, 2015

@llllllllll llllllllll force-pushed the quantopian:distinct branch from adf9158 to f82d56c Jul 9, 2015

@llllllllll llllllllll removed the wip label Jul 9, 2015

@llllllllll

This comment has been minimized.

Member

llllllllll commented Jul 9, 2015

And it's passing!

def compute_up(t, df, **kwargs):
return df.drop_duplicates().reset_index(drop=True)
return df.drop_duplicates(
subset=t.on if t.on else None,

This comment has been minimized.

@cpcloud

cpcloud Jul 9, 2015

Member

minor style comment: should these be t.on or None?

This comment has been minimized.

@llllllllll

llllllllll Jul 9, 2015

Member

ah yes, then I previously was doing a check against None

raises=NotImplementedError,
reason='cannot specify columns to distinct on yet',
)
def test_distinct_in(rdd):

This comment has been minimized.

@cpcloud

cpcloud Jul 10, 2015

Member

should this be test_distinct_on?

This comment has been minimized.

@llllllllll

llllllllll Jul 10, 2015

Member

oops, yup

if getattr(arr.dtype, 'names', None) is not None:
return pd.DataFrame.from_records(arr).drop_duplicates(
subset=t.on if t.on else None,
).reset_index(drop=True).value.astype(arr.dtype)

This comment has been minimized.

@cpcloud

cpcloud Jul 10, 2015

Member

i think you might not be covering this in tests since value isn't an attribute of DataFrames

This comment has been minimized.

@cpcloud

cpcloud Jul 10, 2015

Member

also, you probably want to do to_records(index=False).astype(arr.dtype) here because values will use a lot more memory if you have different types in your DataFrame.

@llllllllll llllllllll force-pushed the quantopian:distinct branch from c872a56 to 381daa1 Jul 11, 2015

@dispatch(Distinct, np.ndarray)
def compute_up(t, x, **kwargs):
return np.unique(x)
def compute_up(t, arr, _recarray_distinct=recarray_distinct, **kwargs):

This comment has been minimized.

@cpcloud

cpcloud Jul 13, 2015

Member

any reason not to just call recarray_distinct in the body of the function?

This comment has been minimized.

@llllllllll

llllllllll Jul 13, 2015

Member

this is just a slight perf increase.

This comment has been minimized.

@cpcloud

cpcloud Jul 13, 2015

Member

i don't think it's necessary. what kind perf increase are we talking? saving the cost of a global lookup?

This comment has been minimized.

@llllllllll

llllllllll Jul 13, 2015

Member

Yeah, it swaps out a global lookup for a local lookup. It's not a huge gain but it does slightly cut down the cost of calling the function.

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jul 13, 2015

ok this looks good to me

llllllllll added a commit that referenced this pull request Jul 13, 2015

@llllllllll llllllllll merged commit f3c5fe6 into blaze:master Jul 13, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@llllllllll llllllllll deleted the quantopian:distinct branch Jul 13, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment