
Random sampling #1410

Merged · 13 commits · Feb 12, 2016

Conversation

@kwmsmith (Member) commented Feb 9, 2016

Mimics Pandas' df.sample() interface.

The main use case is to randomly downsample into a cached dataset, and then use that as the basis for faster exploration and expression building.

So far, the Python and SQL backends are implemented; the Pandas / Dask backends still have to be implemented.
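For reference, the Pandas interface being mimicked supports both a fixed row count and a fraction of the table. A minimal sketch (the DataFrame here is made-up toy data, not from this PR):

```python
import pandas as pd

# Toy frame standing in for a real dataset (hypothetical data).
df = pd.DataFrame({"x": range(10)})

# Sample a fixed number of rows, or a fraction of the table.
by_n = df.sample(n=3)
by_frac = df.sample(frac=0.5)
print(len(by_n), len(by_frac))  # 3 5
```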

@kwmsmith changed the title from "WIP: random sampling" to "Random sampling" on Feb 10, 2016
@@ -842,6 +842,24 @@ def compute_up(t, s, **kwargs):
return s.order_by(*cols)


@dispatch(Sample, sa.Table)
def compute_up(t, s, **kwargs):
n = t.n if t.n is not None else int(t.frac * s.count().scalar())
Member:
This is actually going to execute a SQL query at this point, when you call .scalar(). Can we look into pushing this into a subselect to reduce the IO overhead? Also, should we special-case Postgres >= 9.5, which has the TABLESAMPLE clause on selects?
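One way to avoid the extra round trip is to push the count into a scalar subquery inside the LIMIT, so the database computes the sample size itself in a single statement. A minimal sketch using the stdlib sqlite3 module (the table and fraction are made up; on Postgres >= 9.5, TABLESAMPLE would be another option entirely):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])

# The COUNT(*) runs inside the same statement as the sample, so no
# separate query (and round trip) is needed just to compute n.
rows = conn.execute(
    """
    SELECT x FROM t
    ORDER BY RANDOM()
    LIMIT (SELECT CAST(:frac * COUNT(*) AS INTEGER) FROM t)
    """,
    {"frac": 0.1},
).fetchall()
print(len(rows))  # 10
```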

Member Author (@kwmsmith):

Yes, there's a lot of room here for performance improvements. I'll look into the subselect. Related questions: is sqla sophisticated enough to translate s.count() into an O(1) operation if the underlying table is indexed? Is this dialect / engine specific? Or does sqla take the safe approach and always do a table scan regardless?

Member:

I don't think sqlalchemy is going to decide any of that. The way to check would be to output the SQL query (you can use blaze.utils.literal_compile for this, to get any bind params formatted in) and then feed it to EXPLAIN in your db. That will show the query plan, which should include any index use.

@dispatch(Sample, Sequence)
def compute_up(t, seq, **kwargs):
nsamp = t.n if t.n is not None else int(t.frac * len(seq))
return random.sample(seq, min(nsamp, len(seq)))
Member:

I wonder if an optimization might be to just return the sequence unchanged (identity) when n >= len(seq). Is the contract that sample returns results in random order, or just that it is a random sampling?

Member Author (@kwmsmith):

I'm fine with this as-is: both Pandas' sample and random.sample return results in random order when n == len(seq), and it gives users a way to randomize the order of all rows if that's desired.
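The n == len(seq) behavior relied on here can be checked directly (a minimal sketch):

```python
import random

seq = list(range(5))

# Sampling len(seq) elements returns a full random permutation,
# per random.sample's documented behavior; the input is left unchanged.
out = random.sample(seq, len(seq))
print(sorted(out))  # [0, 1, 2, 3, 4]
```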

Member:

Okay, that sounds good. I just wanted to make sure this was explicitly chosen.

kwmsmith added a commit that referenced this pull request Feb 12, 2016
@kwmsmith kwmsmith merged commit 13fd21a into blaze:master Feb 12, 2016
@kwmsmith kwmsmith deleted the feature/sampling branch February 12, 2016 23:10