Random sampling #1410
Conversation
Implemented for the Python backend. Can be improved performance-wise.
```python
@@ -842,6 +842,24 @@ def compute_up(t, s, **kwargs):
    return s.order_by(*cols)


@dispatch(Sample, sa.Table)
def compute_up(t, s, **kwargs):
    n = t.n if t.n is not None else int(t.frac * s.count().scalar())
```
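To make the intent of the hunk above concrete, here is a standalone sketch (not Blaze's actual implementation) of sampling `n` rows from a SQLAlchemy table by resolving `frac` against a row count and then using `ORDER BY random() LIMIT n`. The `people` table and its contents are made up for illustration; an in-memory SQLite database stands in for a real backend.

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()
people = sa.Table(
    "people", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("name", sa.String),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(people.insert(), [{"name": "n%d" % i} for i in range(100)])

def sample_table(conn, table, n=None, frac=None):
    # Resolve the sample size: an explicit n wins, otherwise frac * row count.
    # Note the count is a separate round trip, which is exactly the IO
    # overhead discussed below.
    if n is None:
        total = conn.execute(
            sa.select(sa.func.count()).select_from(table)
        ).scalar()
        n = int(frac * total)
    # ORDER BY random() LIMIT n: portable, but costs a full sort of the table.
    query = sa.select(table).order_by(sa.func.random()).limit(n)
    return conn.execute(query).fetchall()

with engine.connect() as conn:
    rows = sample_table(conn, people, frac=0.1)
    print(len(rows))  # 10
```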
This is going to actually execute a SQL query at this point when you call `.scalar()`. Can we look into pushing this into a subselect to reduce the IO overhead? Also, should we special-case Postgres >= 9.5, which has the TABLESAMPLE clause on selects?
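For reference, a sketch of what the Postgres >= 9.5 special case might look like, assuming SQLAlchemy's `tablesample()` construct (available since SQLAlchemy 1.1) is acceptable. The table definition is made up for illustration, and the query is only compiled here, not executed:

```python
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql

metadata = sa.MetaData()
people = sa.Table(
    "people", metadata,
    sa.Column("id", sa.Integer),
    sa.Column("name", sa.String),
)

# BERNOULLI still scans every row but samples each independently;
# SYSTEM is faster (page-level) at the cost of clumpier samples.
sampled = sa.tablesample(people, sa.func.bernoulli(10))  # roughly 10 percent
query = sa.select(sampled.c.id, sampled.c.name)

sql = str(query.compile(dialect=postgresql.dialect()))
print(sql)
```

Note this samples by percentage, not by row count, so the `frac` case maps onto it directly while the `n` case would still need a count or a `LIMIT`.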
Yes, there's a lot of room here for performance improvements. I'll look into the subselect. Related questions: is sqla sophisticated enough to translate `s.count()` into an O(1) operation if the underlying table is indexed? Is this dialect / engine specific? Or does sqla take the safe approach and always do a table scan regardless?
I don't think sqlalchemy is going to decide any of that. The way to check this would be to output the SQL query (you can use `blaze.utils.literal_compile` for this to get any bind params formatted in) and then feed it to `EXPLAIN` in your db. This will show the query plan, which should include any index use.
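The suggested workflow can be sketched without Blaze: render the query with bind parameters inlined (here via SQLAlchemy's `literal_binds` compile flag, which is the mechanism `blaze.utils.literal_compile` builds on) and feed it to the database's `EXPLAIN`. An in-memory SQLite database is used for illustration, so the variant is `EXPLAIN QUERY PLAN`:

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()
people = sa.Table(
    "people", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
)
metadata.create_all(engine)

# A count with a bind parameter (id > 5), so literal_binds has work to do.
count_query = (
    sa.select(sa.func.count()).select_from(people).where(people.c.id > 5)
)
sql = str(count_query.compile(compile_kwargs={"literal_binds": True}))
print(sql)  # the 5 is inlined, not a placeholder

# Ask the database for the plan; look for index use in the output.
with engine.connect() as conn:
    plan = conn.execute(sa.text("EXPLAIN QUERY PLAN " + sql)).fetchall()
for row in plan:
    print(row)
```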
```python
@dispatch(Sample, Sequence)
def compute_up(t, seq, **kwargs):
    nsamp = t.n if t.n is not None else int(t.frac * len(seq))
    return random.sample(seq, min(nsamp, len(seq)))
```
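The Sequence-backend logic above can be exercised as a standalone function (the `sample_sequence` name is made up for this sketch): the `min(nsamp, len(seq))` clamp means asking for more rows than exist returns a full random permutation instead of raising `ValueError`.

```python
import random

def sample_sequence(seq, n=None, frac=None):
    # Mirror of the dispatch body: explicit n wins, else frac of the length.
    nsamp = n if n is not None else int(frac * len(seq))
    # Clamp so oversampling degrades to a full shuffle, not an error.
    return random.sample(seq, min(nsamp, len(seq)))

data = list(range(10))
print(len(sample_sequence(data, n=3)))    # 3
print(len(sample_sequence(data, n=50)))   # 10: clamped to len(seq)
print(sorted(sample_sequence(data, frac=1.0)) == data)  # True
```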
I wonder if an optimization might be to just be the identity if `n >= len(seq)`. Is the contract that `sample` is in random order, or just that it is a random sampling?
I'm fine with this as-is, since both Pandas' `sample` and `random.sample` implement random order when `n == len(seq)`, and it provides a way for users to randomize the order of all rows if that's desired.
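A quick check of the behavior being relied on here: when the sample size equals the input length, `random.sample` returns all of the elements in random order, i.e. a full shuffle rather than the identity.

```python
import random

random.seed(0)  # seeded only so this illustration is repeatable
data = list(range(8))
shuffled = random.sample(data, len(data))
print(sorted(shuffled) == data)  # True: same elements, order randomized
```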
Okay, that sounds good. I just wanted to make sure this was explicitly chosen.
Mimics Pandas' `df.sample()` interface. The main use case is to randomly downsample to a cached dataset, and then use that as the basis for faster exploration and expression building.

So far, the Python and SQL backends are implemented; the Pandas / Dask backends still have to be implemented.