Random sampling #1410
Conversation
Implemented for the Python backend. Can be improved performance-wise.
```python
@@ -842,6 +842,24 @@ def compute_up(t, s, **kwargs):
    return s.order_by(*cols)


@dispatch(Sample, sa.Table)
def compute_up(t, s, **kwargs):
    n = t.n if t.n is not None else int(t.frac * s.count().scalar())
```
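To make the intent of the hunk above concrete, here is a standalone sketch (not Blaze's actual implementation) of sampling `n` rows from a SQLAlchemy table by resolving `frac` against a row count and then using `ORDER BY random() LIMIT n`. The `people` table and its contents are made up for illustration; an in-memory SQLite database stands in for a real backend.

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()
people = sa.Table(
    "people", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("name", sa.String),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(people.insert(), [{"name": "n%d" % i} for i in range(100)])

def sample_table(conn, table, n=None, frac=None):
    # Resolve the sample size: an explicit n wins, otherwise frac * row count.
    # Note the count is a separate round trip, which is exactly the IO
    # overhead discussed below.
    if n is None:
        total = conn.execute(
            sa.select(sa.func.count()).select_from(table)
        ).scalar()
        n = int(frac * total)
    # ORDER BY random() LIMIT n: portable, but costs a full sort of the table.
    query = sa.select(table).order_by(sa.func.random()).limit(n)
    return conn.execute(query).fetchall()

with engine.connect() as conn:
    rows = sample_table(conn, people, frac=0.1)
    print(len(rows))  # 10
```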
This is going to actually execute a SQL query at this point when you call `.scalar()`. Can we look into pushing this into a subselect to reduce the IO overhead? Also, should we special-case Postgres >= 9.5, which has the TABLESAMPLE clause on selects?
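For reference, a sketch of what the Postgres >= 9.5 special case might look like, assuming SQLAlchemy's `tablesample()` construct (available since SQLAlchemy 1.1) is acceptable. The table definition is made up for illustration, and the query is only compiled here, not executed:

```python
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql

metadata = sa.MetaData()
people = sa.Table(
    "people", metadata,
    sa.Column("id", sa.Integer),
    sa.Column("name", sa.String),
)

# BERNOULLI still scans every row but samples each independently;
# SYSTEM is faster (page-level) at the cost of clumpier samples.
sampled = sa.tablesample(people, sa.func.bernoulli(10))  # roughly 10 percent
query = sa.select(sampled.c.id, sampled.c.name)

sql = str(query.compile(dialect=postgresql.dialect()))
print(sql)
```

Note this samples by percentage, not by row count, so the `frac` case maps onto it directly while the `n` case would still need a count or a `LIMIT`.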
Yes, there's a lot of room here for performance improvements. I'll look into the subselect. Related questions: is sqla sophisticated enough to translate `s.count()` into an O(1) operation if the underlying table is indexed? Is this dialect / engine specific? Or does sqla take the safe approach and always do a table scan regardless?
I don't think sqlalchemy is going to decide any of that. The way to check this would be to output the SQL query (you can use `blaze.utils.literal_compile` for this to get any bind params formatted in) and then feed it to `EXPLAIN` in your db. This will show the query plan, which should include any index use.
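The suggested workflow can be sketched without Blaze: render the query with bind parameters inlined (here via SQLAlchemy's `literal_binds` compile flag, which is the mechanism `blaze.utils.literal_compile` builds on) and feed it to the database's `EXPLAIN`. An in-memory SQLite database is used for illustration, so the variant is `EXPLAIN QUERY PLAN`:

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()
people = sa.Table(
    "people", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
)
metadata.create_all(engine)

# A count with a bind parameter (id > 5), so literal_binds has work to do.
count_query = (
    sa.select(sa.func.count()).select_from(people).where(people.c.id > 5)
)
sql = str(count_query.compile(compile_kwargs={"literal_binds": True}))
print(sql)  # the 5 is inlined, not a placeholder

# Ask the database for the plan; look for index use in the output.
with engine.connect() as conn:
    plan = conn.execute(sa.text("EXPLAIN QUERY PLAN " + sql)).fetchall()
for row in plan:
    print(row)
```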
```python
@dispatch(Sample, Sequence)
def compute_up(t, seq, **kwargs):
    nsamp = t.n if t.n is not None else int(t.frac * len(seq))
    return random.sample(seq, min(nsamp, len(seq)))
```
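The Sequence-backend logic above can be exercised as a standalone function (the `sample_sequence` name is made up for this sketch): the `min(nsamp, len(seq))` clamp means asking for more rows than exist returns a full random permutation instead of raising `ValueError`.

```python
import random

def sample_sequence(seq, n=None, frac=None):
    # Mirror of the dispatch body: explicit n wins, else frac of the length.
    nsamp = n if n is not None else int(frac * len(seq))
    # Clamp so oversampling degrades to a full shuffle, not an error.
    return random.sample(seq, min(nsamp, len(seq)))

data = list(range(10))
print(len(sample_sequence(data, n=3)))    # 3
print(len(sample_sequence(data, n=50)))   # 10: clamped to len(seq)
print(sorted(sample_sequence(data, frac=1.0)) == data)  # True
```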
I wonder if an optimization might be to just be the identity if `n >= len(seq)`. Is the contract that `sample` is in random order, or just that it is a random sampling?
I'm fine with this as-is, since both Pandas' `sample` and `random.sample` implement random order when `n == len(seq)`, and it provides a way for users to randomize the order of all rows if that's desired.
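A quick check of the behavior being relied on here: when the sample size equals the input length, `random.sample` returns all of the elements in random order, i.e. a full shuffle rather than the identity.

```python
import random

random.seed(0)  # seeded only so this illustration is repeatable
data = list(range(8))
shuffled = random.sample(data, len(data))
print(sorted(shuffled) == data)  # True: same elements, order randomized
```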
Okay, that sounds good. I just wanted to make sure this was explicitly chosen.
Mimics Pandas' `df.sample()` interface. The main use case is to randomly downsample to a cached dataset, and then use that as the basis for faster exploration and expression building.

So far, the Python and SQL backends are implemented; the Pandas / Dask backends still have to be implemented.