Adds distinct on #1159

llllllllll · 2015-07-07T17:40:26Z

No description provided.

cpcloud · 2015-07-07T19:27:56Z

blaze/expr/collections.py

@@ -97,8 +102,17 @@ class Distinct(Expr):
    >>> from blaze.compute.python import compute
    >>> sorted(compute(e, data))
    [('Alice', 100, 1), ('Bob', 200, 2)]
+
+    Using a subset


You have to put a space after these comments, otherwise Sphinx won't render them correctly.
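One way to read this comment, assuming "space" means a blank line between the prose line and the doctest that follows it: in reStructuredText a doctest block has to start its own paragraph, so a `>>>` line directly under text is folded into that text. The function below is a hypothetical stand-in, not the Blaze `Distinct` docstring; it only illustrates the layout.

```python
def distinct_rows(rows, on=None):
    """Return rows with duplicates removed, keeping first occurrences.

    >>> distinct_rows([('Alice', 100), ('Alice', 100), ('Bob', 200)])
    [('Alice', 100), ('Bob', 200)]

    Using a subset

    >>> distinct_rows([('Alice', 100), ('Alice', 300)], on=[0])
    [('Alice', 100)]
    """
    # `on` is a list of positional indices here (an assumption made for
    # this sketch); None means "compare whole rows".
    seen = set()
    out = []
    for row in rows:
        key = row if on is None else tuple(row[i] for i in on)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```

The blank line after "Using a subset" is what keeps the second example a separate doctest block.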

cpcloud · 2015-07-07T20:55:28Z

Pull in master after I merge #1161. numba changed the way its functions are __repr__'d, so I went ahead and wrote some proper tests instead of depending on doctest.

cpcloud · 2015-07-07T21:19:05Z

blaze/compute/numpy.py

@@ -199,6 +199,8 @@ def compute_up(t, x, **kwargs):

 @dispatch(Distinct, np.ndarray)
 def compute_up(t, x, **kwargs):
+    if t.on:
+        raise ValueError('numpy backend cannot specify what to distinct on')


Maybe say 'numpy backend cannot specify what column to distinct on'.

llllllllll · 2015-07-08T04:05:32Z

note: I need to get the string conversion for the pandas round trip and add tests for np and probably the other error cases before this is ready.

llllllllll · 2015-07-09T17:01:39Z

And it's passing!

cpcloud · 2015-07-09T19:07:59Z

blaze/compute/pandas.py

 def compute_up(t, df, **kwargs):
-    return df.drop_duplicates().reset_index(drop=True)
+    return df.drop_duplicates(
+        subset=t.on if t.on else None,


Minor style comment: should these be t.on or None?

Ah yes, I was previously doing a check against None.
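A small sketch of why the two spellings are interchangeable here, assuming `t.on` is an empty tuple (or None) when no columns are given. That default is an assumption; the diff does not show it. The helper names are hypothetical.

```python
def subset_ternary(on):
    # Spelling used in the diff under review.
    return on if on else None


def subset_or(on):
    # Spelling suggested in the review comment.
    return on or None


# Both forms map every falsy value to None and pass truthy values through.
for on in ((), None, ('name',), ('name', 'amount')):
    assert subset_ternary(on) == subset_or(on)
```

The `or` form is shorter; the two only differ from an explicit `is None` check when `on` can be falsy without being None, such as an empty tuple.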

cpcloud · 2015-07-10T18:48:24Z

blaze/compute/tests/test_spark.py

+    raises=NotImplementedError,
+    reason='cannot specify columns to distinct on yet',
+)
+def test_distinct_in(rdd):


should this be test_distinct_on?

cpcloud · 2015-07-10T22:41:36Z

blaze/compute/numpy.py

+        if getattr(arr.dtype, 'names', None) is not None:
+            return pd.DataFrame.from_records(arr).drop_duplicates(
+                subset=t.on if t.on else None,
+            ).reset_index(drop=True).value.astype(arr.dtype)


I think you might not be covering this in tests, since value isn't an attribute of DataFrames.

Also, you probably want to do to_records(index=False).astype(arr.dtype) here, because values will use a lot more memory if you have different types in your DataFrame.
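The difference between the two conversions can be sketched as follows. The array contents and field names are made up for illustration; only `values`, `to_records(index=False)`, and `astype(arr.dtype)` come from the comment above.

```python
import numpy as np
import pandas as pd

# Hypothetical structured array with one duplicated row.
arr = np.array(
    [(1, 10.0), (1, 10.0), (2, 20.0)],
    dtype=[('id', 'i4'), ('amount', 'f8')],
)

df = pd.DataFrame.from_records(arr).drop_duplicates().reset_index(drop=True)

# .values builds one 2-D block with a single common dtype, so mixed
# columns are upcast and the field names are lost.
as_values = df.values

# to_records keeps one typed field per column; astype then restores the
# field types of the input array.
as_records = df.to_records(index=False).astype(arr.dtype)

print(as_values.dtype, as_values.shape)
print(as_records.dtype.names, as_records.tolist())
```

With an object column in the frame, `values` would fall back to an object array for every column, which is where the extra memory comes from.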

cpcloud · 2015-07-13T13:13:20Z

blaze/compute/numpy.py

 @dispatch(Distinct, np.ndarray)
-def compute_up(t, x, **kwargs):
-    return np.unique(x)
+def compute_up(t, arr, _recarray_distinct=recarray_distinct, **kwargs):


Any reason not to just call recarray_distinct in the body of the function?

This is just a slight perf increase.

I don't think it's necessary. What kind of perf increase are we talking? Saving the cost of a global lookup?

Yeah, it swaps out a global lookup for a local lookup. It's not a huge gain, but it does slightly cut down the cost of calling the function.
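The trick under discussion, sketched with hypothetical names (`helper`, `call_global`, `call_local` are not from the PR): binding a module-level function as a default argument makes it a local variable inside the function, so CPython resolves it with a fast local load instead of a global dictionary lookup.

```python
import timeit


def helper(x):
    return x + 1


def call_global(x):
    # `helper` is looked up in the module globals on every call.
    return helper(x)


def call_local(x, _helper=helper):
    # `_helper` is bound once at definition time and read as a local.
    return _helper(x)


if __name__ == '__main__':
    # Timings vary by machine and interpreter version; no numbers are claimed.
    print('global:', timeit.timeit(lambda: call_global(1), number=100000))
    print('local: ', timeit.timeit(lambda: call_local(1), number=100000))
```

The cost is an extra keyword parameter in the signature that callers could override by accident, which is the trade-off the thread is weighing.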

cpcloud · 2015-07-13T16:12:37Z

ok this looks good to me


llllllllll added expression extension sql pandas labels Jul 7, 2015

llllllllll added this to the 0.8.2 milestone Jul 7, 2015

llllllllll added postgresql and removed sql labels Jul 7, 2015

cpcloud reviewed Jul 7, 2015

llllllllll force-pushed the distinct branch 2 times, most recently from c865ccd to 923f592 July 7, 2015 21:14

cpcloud reviewed Jul 7, 2015

llllllllll force-pushed the distinct branch 2 times, most recently from ceb4d60 to 581e5c5 July 7, 2015 23:32

llllllllll added the wip label Jul 8, 2015

llllllllll force-pushed the distinct branch 2 times, most recently from d5dfad4 to adf9158 July 8, 2015 22:34

cpcloud modified the milestones: 0.8.2, 0.8.3 Jul 9, 2015

Joe Jevnik added 7 commits July 9, 2015 10:59

ENH: Adds distinct on

5aee80d

DOC: Adds docstring example for distinct on

1a1f713

TST: don't import from another test with pytest

b538610

ENH: Adds recarray support and fixes error messages

00694ee

ENH: preserve numpy string types

86185d0

TST: Adds tests for numpy distinct on

700d3f7

TST: Adds test for failure case in pyspark distinct on

f82d56c

llllllllll force-pushed the distinct branch from adf9158 to f82d56c July 9, 2015 14:59

llllllllll removed the wip label Jul 9, 2015

cpcloud reviewed Jul 9, 2015

STY: use or instead of ternary

a6d5b8c

cpcloud reviewed Jul 10, 2015

MAINT: test_distinct_in -> test_distinct_on

d469b9d

cpcloud reviewed Jul 10, 2015

BUG: fix numpy structured arrays and add test case for distinct

381daa1

llllllllll force-pushed the distinct branch from c872a56 to 381daa1 July 11, 2015 21:56

Joe Jevnik added 2 commits July 11, 2015 18:15

ENH: Better validation of fields in distinct on clauses

65105e8

MAINT: remove extra stackframe in isidentical method

fa660f3

cpcloud reviewed Jul 13, 2015

llllllllll added a commit that referenced this pull request Jul 13, 2015

Merge pull request #1159 from quantopian/distinct

f3c5fe6

Adds distinct on

llllllllll merged commit f3c5fe6 into blaze:master Jul 13, 2015

llllllllll deleted the distinct branch July 13, 2015 16:19
