Adds distinct on #1159

Merged
merged 12 commits into from Jul 13, 2015

Conversation

llllllllll
Member

No description provided.

@@ -97,8 +102,17 @@ class Distinct(Expr):
    >>> from blaze.compute.python import compute
    >>> sorted(compute(e, data))
    [('Alice', 100, 1), ('Bob', 200, 2)]

    Using a subset
Member

You have to put a blank line after these comments, otherwise sphinx won't render them correctly.

Member Author

done
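
For context on the rendering point in this thread: in reStructuredText a paragraph runs until the next blank line, so a `>>>` prompt placed directly under a line of prose is absorbed into that paragraph instead of being rendered as a doctest block. A minimal sketch of the layout, using a hypothetical `distinct_rows` helper that is not part of blaze:

```python
def distinct_rows(rows, on=None):
    """Return unique rows, keeping the first occurrence of each key.

    Hypothetical helper for illustration only; it is not blaze's Distinct.

    Using a subset

    >>> distinct_rows([('Alice', 100), ('Alice', 200), ('Bob', 300)], on=[0])
    [('Alice', 100), ('Bob', 300)]
    """
    seen = set()
    out = []
    for row in rows:
        # Key on the selected positions, or on the whole row when `on` is empty.
        key = tuple(row[i] for i in on) if on else row
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```

The blank line between "Using a subset" and the `>>>` prompt is the part that matters for Sphinx; the helper body only exists so the doctest has something to run.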

Member

cpcloud commented Jul 7, 2015

pull in master after i merge #1161. numba changed the way they __repr__ functions, so i went ahead and wrote some proper tests instead of depending on doctest

llllllllll force-pushed the distinct branch 2 times, most recently from c865ccd to 923f592 on July 7, 2015 21:14

@@ -199,6 +199,8 @@ def compute_up(t, x, **kwargs):

@dispatch(Distinct, np.ndarray)
def compute_up(t, x, **kwargs):
    if t.on:
        raise ValueError('numpy backend cannot specify what to distinct on')
Member

maybe say 'numpy backend cannot specify what column to distinct on'
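
A rough sketch of the guard under discussion, with the reworded message. The function below stands in for the numpy `compute_up` and takes `on` as a plain argument instead of reading it from an expression `t`; the name `distinct_ndarray` is hypothetical and this is not the code from the PR.

```python
import numpy as np


def distinct_ndarray(x, on=()):
    """Hypothetical stand-in for the numpy backend's Distinct handling."""
    if on:
        # A plain ndarray has no named columns to select from.
        raise ValueError(
            'numpy backend cannot specify what column to distinct on'
        )
    # np.unique returns the sorted unique elements of the array.
    return np.unique(x)
```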

llllllllll force-pushed the distinct branch 2 times, most recently from ceb4d60 to 581e5c5 on July 7, 2015 23:32
@llllllllll
Member Author

note: I still need to get the string conversion working for the pandas round trip, and add tests for np and probably the other error cases, before this is ready.

llllllllll added the wip label on Jul 8, 2015
llllllllll force-pushed the distinct branch 2 times, most recently from d5dfad4 to adf9158 on July 8, 2015 22:34
cpcloud modified the milestones: 0.8.2, 0.8.3 on Jul 9, 2015
llllllllll removed the wip label on Jul 9, 2015
@llllllllll
Member Author

And it's passing!

 def compute_up(t, df, **kwargs):
-    return df.drop_duplicates().reset_index(drop=True)
+    return df.drop_duplicates(
+        subset=t.on if t.on else None,
Member

minor style comment: should these be t.on or None?

Member Author

ah yes, I was previously doing a check against None
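
To illustrate the style point: `on or None` and `on if on else None` give the same result, since both map an empty selection to `None`, which `drop_duplicates` treats as "use all columns". The sketch below assumes `on` is a sequence of column names; `distinct_frame` and the sample data are hypothetical and not taken from the PR.

```python
import pandas as pd


def distinct_frame(df, on=()):
    """Hypothetical stand-in for the pandas backend's Distinct handling."""
    # An empty selection becomes None, so every column is compared.
    subset = list(on) or None
    return df.drop_duplicates(subset=subset).reset_index(drop=True)


df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice'],
    'amount': [100, 200, 100],
    'id': [1, 2, 3],
})

all_columns = distinct_frame(df)            # every row differs in `id`
by_name = distinct_frame(df, on=['name'])   # first row per name is kept
```

Here `all_columns` keeps all three rows, while `by_name` keeps only the first Alice row and the Bob row.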

    raises=NotImplementedError,
    reason='cannot specify columns to distinct on yet',
)
def test_distinct_in(rdd):
Member

should this be test_distinct_on?

Member Author

oops, yup

if getattr(arr.dtype, 'names', None) is not None:
    return pd.DataFrame.from_records(arr).drop_duplicates(
        subset=t.on if t.on else None,
    ).reset_index(drop=True).value.astype(arr.dtype)
Member

i think you might not be covering this in tests since value isn't an attribute of DataFrames

Member

also, you probably want to do to_records(index=False).astype(arr.dtype) here because values will use a lot more memory if you have different types in your DataFrame.
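
A small sketch of the difference between the two round trips, assuming a made-up structured array with one integer and one float field. `DataFrame.values` produces a single 2-D array with one common dtype (an object array once strings are mixed in), whereas `to_records(index=False)` keeps a dtype per field and can then be cast back to the input dtype.

```python
import numpy as np
import pandas as pd

# Assumed sample data: a structured array with an int and a float field.
arr = np.array(
    [(1, 100.0), (2, 200.0), (1, 100.0)],
    dtype=[('id', 'i8'), ('amount', 'f8')],
)

df = pd.DataFrame.from_records(arr).drop_duplicates().reset_index(drop=True)

# .values builds one 2-D array with a single common dtype for every column.
dense = df.values

# to_records keeps one dtype per field; index=False leaves the index out.
records = df.to_records(index=False).astype(arr.dtype)
```

In this example `dense` is upcast to float64 for both columns, while `records` still has an integer `id` field and a float `amount` field.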

 @dispatch(Distinct, np.ndarray)
-def compute_up(t, x, **kwargs):
-    return np.unique(x)
+def compute_up(t, arr, _recarray_distinct=recarray_distinct, **kwargs):
Member

any reason not to just call recarray_distinct in the body of the function?

Member Author

this is just a slight perf increase.

Member

i don't think it's necessary. what kind of perf increase are we talking? saving the cost of a global lookup?

Member Author

Yeah, it swaps out a global lookup for a local lookup. It's not a huge gain but it does slightly cut down the cost of calling the function.
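
The global-versus-local lookup being discussed can be seen by inspecting compiled code objects. This is a generic sketch with hypothetical names (`helper`, `via_global`, `via_default`), not the PR's `recarray_distinct`:

```python
def helper(x):
    return x + 1


def via_global(x):
    # `helper` is resolved through the module globals on every call.
    return helper(x)


def via_default(x, _helper=helper):
    # `_helper` was bound once, at definition time, and is a local here.
    return _helper(x)


# The compiled code objects show where each name is looked up.
global_names = via_global.__code__.co_names       # names resolved globally
local_names = via_default.__code__.co_varnames    # arguments and locals
```

The trade-off is that the default-argument form adds an extra keyword to the signature and keeps calling the original function even if the global name is later rebound, which is part of why the reviewer questions whether the small saving is worth it.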

Member

cpcloud commented Jul 13, 2015

ok this looks good to me

llllllllll added a commit that referenced this pull request Jul 13, 2015
llllllllll merged commit f3c5fe6 into blaze:master on Jul 13, 2015
llllllllll deleted the distinct branch on July 13, 2015 16:19