Random sampling #1410

Merged
merged 13 commits into blaze:master from kwmsmith:feature/sampling on Feb 12, 2016

Conversation

2 participants
@kwmsmith
Member

kwmsmith commented Feb 9, 2016

Mimics Pandas' df.sample() interface.

The main use case is randomly downsampling to a cached dataset, then using that as the basis for faster exploration and expression building.

So far, the Python and SQL backends are implemented; still have to implement the Pandas / Dask backend.
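For readers unfamiliar with the pandas interface being mimicked, the n/frac split works roughly like this (a minimal sketch using the stdlib; the function name and exact signature here are illustrative, not the PR's code):

```python
import random

def sample(seq, n=None, frac=None):
    # Mirrors the n/frac choice in pandas' df.sample(): exactly one of
    # n (a row count) or frac (a fraction of rows) picks the sample size.
    if (n is None) == (frac is None):
        raise ValueError("specify exactly one of n or frac")
    size = n if n is not None else int(frac * len(seq))
    return random.sample(seq, min(size, len(seq)))

rows = list(range(100))
assert len(sample(rows, n=5)) == 5
assert len(sample(rows, frac=0.1)) == 10
```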

kwmsmith added some commits Feb 8, 2016

Initial version of random sampling.
Implemented for Python backend.
Random sampling on SQL backend.
Can be improved performance-wise.

@kwmsmith kwmsmith changed the title from WIP: random sampling to Random sampling Feb 10, 2016

@@ -842,6 +842,24 @@ def compute_up(t, s, **kwargs):
return s.order_by(*cols)
@dispatch(Sample, sa.Table)
def compute_up(t, s, **kwargs):
n = t.n if t.n is not None else int(t.frac * s.count().scalar())

@llllllllll

llllllllll Feb 12, 2016

Member

This is going to actually execute a SQL query at this point when you call .scalar(). Can we look into pushing this into a subselect to reduce the IO overhead? Also, should we special-case postgres >= 9.5, which has the TABLESAMPLE clause on selects?
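For context, the PostgreSQL 9.5+ clause mentioned here samples at the storage level instead of ordering and limiting. A fraction-to-percentage sketch (the helper name and string-building approach are illustrative, not part of this PR):

```python
def tablesample_query(table, frac, method="BERNOULLI"):
    # PostgreSQL >= 9.5 only: TABLESAMPLE takes a percentage (0-100), so
    # the 0..1 fraction used by the expression must be scaled up. SYSTEM
    # samples whole pages (faster); BERNOULLI samples rows (more uniform).
    return "SELECT * FROM {} TABLESAMPLE {} ({})".format(
        table, method, frac * 100.0)

assert tablesample_query("events", 0.5) == \
    "SELECT * FROM events TABLESAMPLE BERNOULLI (50.0)"
```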

@kwmsmith

kwmsmith Feb 12, 2016

Member

Yes, there's a lot of room here for performance improvements. I'll look into the subselect. Related questions: is sqla sophisticated enough to translate s.count() into an O(1) operation if the underlying table is indexed? Is this dialect / engine specific? Or does sqla take the safe approach and always do a table scan regardless?

@llllllllll

llllllllll Feb 12, 2016

Member

I don't think sqlalchemy is going to decide any of that. The way to check this would be to output the SQL query (you can use blaze.utils.literal_compile for this to get any bind params formatted in) and then feed it to EXPLAIN in your db. This will show the query plan, which should include any index use.
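The suggested workflow, illustrated with SQLite's EXPLAIN QUERY PLAN rather than a Postgres EXPLAIN (the table and query here are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, x REAL)")
# Feed the compiled query to EXPLAIN QUERY PLAN; the plan text shows
# whether the count is satisfied by an index or needs a full table scan.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT count(*) FROM t").fetchall()
print(plan)
```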

if n is not None and frac is not None:
raise ValueError("n ({}) and frac ({}) cannot both be specified.".format(n, frac))
if n is not None:
n = int(n)

@llllllllll

llllllllll Feb 12, 2016

Member

Should we maybe use operator.index to reject things like float and Decimal?

@kwmsmith

kwmsmith Feb 12, 2016

Member

Yes, and good to know about op.index, thanks for pointing it out.
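A sketch of the operator.index approach being agreed on here (validate_n is a hypothetical helper name, not the merged code):

```python
import operator

def validate_n(n):
    # operator.index accepts ints (and anything implementing __index__)
    # but raises TypeError for float, Decimal, etc., whereas int()
    # would silently truncate them.
    n = operator.index(n)
    if n < 1:
        raise ValueError("n must be positive, given {}".format(n))
    return n

assert validate_n(5) == 5
try:
    validate_n(5.0)
except TypeError:
    pass  # float rejected, as intended
```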

raise ValueError("n must be positive, given {}".format(n))
if frac is not None:
frac = float(frac)
if frac < 0.0 or frac > 1.0:

@llllllllll

llllllllll Feb 12, 2016

Member

if not 0.0 <= frac <= 1.0: might be clearer.

@kwmsmith

kwmsmith Feb 12, 2016

Member

Agreed.

if frac is not None:
frac = float(frac)
if frac < 0.0 or frac > 1.0:
raise ValueError("0 < frac < 1.0, given {}".format(frac))

@llllllllll

llllllllll Feb 12, 2016

Member

Could we prefix this with "sample requires" so it is clear what the lhs is showing? Also, I think those should be inclusive bounds.

@kwmsmith

kwmsmith Feb 12, 2016

Member

Agreed.
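Combining both review suggestions, the frac check might look like this (a sketch with a hypothetical helper name, not the merged code):

```python
def validate_frac(frac):
    frac = float(frac)
    # Inclusive bounds, with a "sample requires" prefix so the error
    # message makes clear what the constraint applies to.
    if not 0.0 <= frac <= 1.0:
        raise ValueError(
            "sample requires 0.0 <= frac <= 1.0, given {}".format(frac))
    return frac

assert validate_frac("0.25") == 0.25
assert validate_frac(1.0) == 1.0  # inclusive upper bound
```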

@dispatch(Sample, Sequence)
def compute_up(t, seq, **kwargs):
nsamp = t.n if t.n is not None else int(t.frac * len(seq))
return random.sample(seq, min(nsamp, len(seq)))

@llllllllll

llllllllll Feb 12, 2016

Member

I wonder if an optimization might be to just be identity if n >= len(seq). Is the contract that sample is in random order, or just that it is a random sampling?

@kwmsmith

kwmsmith Feb 12, 2016

Member

I'm fine with this as-is: Pandas' sample and random.sample both return results in random order when n == len(seq), and it gives users a way to randomize the order of all rows if that's desired.

@llllllllll

llllllllll Feb 12, 2016

Member

Okay, that sounds good. I just wanted to make sure this was explicitly chosen.
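The behavior being agreed on, demonstrated with stdlib random.sample:

```python
import random

rows = [1, 2, 3, 4, 5]
# Sampling all rows returns the same elements in (usually) shuffled
# order, so a full-length sample doubles as a row-order randomizer.
shuffled = random.sample(rows, len(rows))
assert sorted(shuffled) == rows
```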

@@ -517,6 +517,10 @@ def test_head():
assert str(compute(t.head(2), s)) == str(select(s).limit(2))
def test_sample():
assert str(compute(t.sample(n=5), s)) == str(select(s).order_by(sa.func.random()).limit(5))

@llllllllll

llllllllll Feb 12, 2016

Member

Should we have a test with frac also?
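A frac-based test might look like the following (a backend-free sketch using stdlib random rather than the SQL backend; sample_frac is a hypothetical helper, not the PR's code):

```python
import random

def sample_frac(rows, frac):
    # frac-based sampling: the size is int(frac * len(rows)), matching
    # what the n-based path computes once the row count is known.
    return random.sample(rows, int(frac * len(rows)))

def test_sample_frac():
    rows = list(range(10))
    result = sample_frac(rows, 0.5)
    assert len(result) == 5
    assert set(result) <= set(rows)

test_sample_frac()
```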

kwmsmith added a commit that referenced this pull request Feb 12, 2016

@kwmsmith kwmsmith merged commit 13fd21a into blaze:master Feb 12, 2016

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+2.0%) to 90.591%

@kwmsmith kwmsmith deleted the kwmsmith:feature/sampling branch Feb 12, 2016
