
Increase speed and parallelism of the limit algorithm and implement descending sorting #75

Merged
merged 7 commits into main from feature/faster-limiting on Nov 15, 2020

Conversation

nils-braun (Collaborator) commented Nov 6, 2020

This PR introduces four changes:

  • it simplifies the logic for extracting the piece of the Dask dataframe that starts at OFFSET and ends at LIMIT.
  • the dataframe no longer needs to be computed up front, but only at the actual calculation. The computed dataframe used to be needed for getting the partition boundaries, but these can now also be calculated "lazily".
  • as there is no need for a full recalculation, I introduced a quick shortcut: if the first partition is already enough, just return it. This is a very typical use case when you just do a "LIMIT 10".
  • it also allows for descending sorting on the first column.
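The slicing logic behind the first two points can be sketched in plain pandas (hypothetical helper name, not the PR's actual code): each partition only needs to know how many rows come before it to map the global [OFFSET, OFFSET + LIMIT) window into its own local coordinates.

```python
import pandas as pd

def limit_partition(partition, rows_before, offset, limit):
    """Keep the rows of this partition that fall into the
    global window [offset, offset + limit)."""
    local_start = max(offset - rows_before, 0)
    local_end = max(min(offset + limit - rows_before, len(partition)), 0)
    return partition.iloc[local_start:local_end]

# Three "partitions" of 4 rows each; request OFFSET 5, LIMIT 4.
partitions = [pd.DataFrame({"x": range(i * 4, (i + 1) * 4)}) for i in range(3)]
offset, limit = 5, 4

pieces, rows_before = [], 0
for partition in partitions:
    pieces.append(limit_partition(partition, rows_before, offset, limit))
    rows_before += len(partition)

result = pd.concat(pieces)
print(result["x"].tolist())  # [5, 6, 7, 8]
```

Since each partition only needs its own cumulative offset, the same per-partition function can run lazily inside the dask graph instead of requiring an up-front computation.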

@nils-braun nils-braun changed the title [WIP] Increase speed and parallelism of the limit algorithm Increase speed and parallelism of the limit algorithm and implement descending sorting Nov 9, 2020
Comment on lines 79 to 81
df = dd.from_delayed(
    reversed([self.reverse_partition(p) for p in df.to_delayed()])
)

You may want to become familiar with the df.partitions iterable.

dd.concat(df.partitions[::-1])

Going between delayed and dataframes introduces some inconvenience in graph-handling.

Collaborator Author


Uh nice! That looks awesome. Yes, this is definitely much better.


At some point I think that we'll want to go through and remove all of the to/from_delayed calls. This isn't critical, but it's nice to do from a performance standpoint (happy to go into this in more depth if you like).

Collaborator Author


Thanks for your advice @mrocklin! Really appreciated. I would actually be really interested in your thoughts about this part.

You probably see it right away, but what I would like to achieve is basically two things:

  1. call a function on every partition, while also having the partition index as an input argument
  2. calculate partition_borders only when executing the DAG, not already when creating it (this is what I did before this PR, and it slows down the process).

So far, the delayed function was all I could come up with, but I definitely think there is a better way.

Collaborator Author


I have created an issue to fix this, so we can go on with this PR.

# We do a (hopefully) very quick check: if the first partition
# is already enough, we will just use this
first_partition_length = df.map_partitions(lambda x: len(x)).to_delayed()[0]
first_partition_length = first_partition_length.compute()


len(df.partitions[0])

@codecov-io commented Nov 15, 2020

Codecov Report

Merging #75 (2bc8920) into main (8abc48c) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main       #75   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           34        34           
  Lines         1383      1383           
  Branches       185       189    +4     
=========================================
  Hits          1383      1383           
Impacted Files Coverage Δ
dask_sql/physical/rel/logical/sort.py 100.00% <100.00%> (ø)


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@nils-braun nils-braun merged commit 02e2dad into main Nov 15, 2020
@nils-braun nils-braun deleted the feature/faster-limiting branch November 15, 2020 21:19