-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-14975: [Python] Dataset.sort_by #14976
Conversation
amol-
commented
Dec 15, 2022
•
edited
Loading
edited
- Closes: [Python] Add Dataset.sort_by #14975
- Proof of concept using an ExecPlan
- Add test to filter and then sort to confirm lazy filtering works with sorting.
|
res = _pc()._exec_plan._sort_source(self, output_type=Table, | ||
sort_options=_pc().SortOptions( | ||
sort_keys=sorting, **kwargs | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any advantage is using the exec plan version here, instead of the current version? (but I suppose also no disadvantage)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't benchmark it, I can if you want to have numbers, but I guessed that it's faster to do the whole operation in C++ once than having to sort indices and create a temporary arrays of indices just to then take the actual values from the table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future reference, at the time I replaced this with a manual sort_indices call in a follow-up PR (#15268, for enabling this in a minimal build). But so it also appears that the Acero-based implementation is currently actually slower, see #34249 (comment) for some analysis (so as long as that is not solved, eg by allowing configuring the batch size, we should keep this manual implementation for Table/RecordBatch.sort_by).
Benchmark runs are scheduled for baseline = 0f5b8dd and contender = 387e95a. 387e95a is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
@github-actions crossbow submit example-python-minimal-build-* |
Revision: 3dbe53b Submitted crossbow builds: ursacomputing/crossbow @ actions-78101888a5
|
@amol- It seems that this breaks https://github.com/ursacomputing/crossbow/actions/runs/3772771994/jobs/6413886549#step:3:7216
Could you fix this? |
* Closes: apache#14975 - [x] Proof of concept using an ExecPlan - [x] Add test to filter and then sort to confirm lazy filtering works with sorting. Authored-by: Alessandro Molina <amol@turbogears.org> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…by to fix minimal tests
Fixing this in #15268 |
…by to fix minimal tests (apache#15268) * Closes: apache#14976 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>