Increase speed and parallelism of the limit algorithm and implement descending sorting #75
Conversation
```python
df = dd.from_delayed(
    reversed([self.reverse_partition(p) for p in df.to_delayed()])
)
```
You may want to become familiar with the `df.partitions` iterable:

```python
dd.concat(df.partitions[::-1])
```
Going between delayed and dataframes introduces some inconvenience in graph-handling.
Uh nice! That looks awesome. Yes, this is definitely much better.
At some point I think that we'll want to go through and remove all of the to/from_delayed
calls. This isn't critical, but it's nice to do from a performance standpoint (happy to go into this in more depth if you like).
Thanks for your advice @mrocklin! Really appreciated. I would actually be very interested in your thoughts on this part.
You probably see it right away, but what I would like to achieve is basically two things:
- call a function on every partition, while also having the partition index as an input argument
- calculate `partition_borders` only when executing the DAG, not already when creating it (this is what I did before this PR, and it slows down the process)

So far, the `delayed` function was all I could come up with, but I definitely think there is a better way.
I have created an issue to fix this, so we can go on with this PR.
```python
# We do a (hopefully) very quick check: if the first partition
# is already enough, we will just use this
first_partition_length = df.map_partitions(lambda x: len(x)).to_delayed()[0]
first_partition_length = first_partition_length.compute()
```
```python
len(df.partitions[0])
```
Codecov Report

```
@@           Coverage Diff           @@
##              main      #75  +/-  ##
========================================
  Coverage   100.00%  100.00%
========================================
  Files           34       34
  Lines         1383     1383
  Branches       185      189   +4
========================================
  Hits          1383     1383
```

Continue to review the full report at Codecov.
This PR introduces four changes: