Rewrite FileStream in terms of Morsel API #21342
816d243 to 3346af7
| /// This groups together ready planners, ready morsels, the active reader, | ||
| /// pending planner I/O, the remaining files and limit, and the metrics | ||
| /// associated with processing that work. | ||
| pub(super) struct ScanState { |
This is the new inner state machine for FileStream
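To illustrate the idea of grouping the scan's queues and limit into a single state struct, here is a minimal sketch. It is not the `ScanState` from this PR: `ToyScanState`, `Morsel`, and `next_batch` are hypothetical names, a "file" is just a pre-split list of morsels, and the active reader, pending planner I/O, and metrics are left out.

```rust
use std::collections::VecDeque;

/// Hypothetical unit of work: a small batch of row values.
type Morsel = Vec<u32>;

/// Toy state machine (an assumption for illustration, not the PR's type).
struct ToyScanState {
    /// Files not yet planned; each "file" is already split into morsels.
    remaining_files: VecDeque<Vec<Morsel>>,
    /// Morsels that are ready to be emitted.
    ready_morsels: VecDeque<Morsel>,
    /// Rows still allowed by the limit, if there is one.
    remaining_limit: Option<usize>,
}

impl ToyScanState {
    fn new(files: Vec<Vec<Morsel>>, limit: Option<usize>) -> Self {
        Self {
            remaining_files: files.into(),
            ready_morsels: VecDeque::new(),
            remaining_limit: limit,
        }
    }

    /// Returns the next batch, or `None` once files or the limit run out.
    fn next_batch(&mut self) -> Option<Vec<u32>> {
        loop {
            if self.remaining_limit == Some(0) {
                return None;
            }
            if let Some(mut morsel) = self.ready_morsels.pop_front() {
                // Apply the limit, then account for the rows we emit.
                if let Some(limit) = self.remaining_limit.as_mut() {
                    morsel.truncate(*limit);
                    *limit -= morsel.len();
                }
                if morsel.is_empty() {
                    continue;
                }
                return Some(morsel);
            }
            // Nothing ready: "plan" the next file into ready morsels.
            let file = self.remaining_files.pop_front()?;
            self.ready_morsels.extend(file);
        }
    }
}
```

Keeping all of this in one struct means the outer stream only has to call one method per poll, which is the kind of indentation and complexity relief the separate module aims for.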
| use std::sync::Arc; | ||
| use std::sync::mpsc::{self, Receiver, TryRecvError}; | ||
| /// Adapt a legacy [`FileOpener`] to the morsel API. |
This is an adapter so that existing FileOpeners keep their current behavior.
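The adapter idea can be sketched roughly as follows. Every name here (`LegacyOpener`, `Morsel`, `Morselizer`, `OpenerAdapter`, `read_all`) is a hypothetical stand-in, not the DataFusion API; the sketch only assumes that a legacy opener reads a whole file in one call and that the morsel-style interface returns units of work per file.

```rust
use std::sync::Arc;

/// Hypothetical legacy interface: one call reads a whole file.
trait LegacyOpener {
    fn open(&self, path: &str) -> Vec<String>;
}

/// Hypothetical morsel: a deferred unit of work that yields rows when run.
struct Morsel {
    run: Box<dyn FnOnce() -> Vec<String>>,
}

/// Hypothetical morsel-style interface: split a file into units of work.
trait Morselizer {
    fn plan(&self, path: &str) -> Vec<Morsel>;
}

/// Adapter: exposes a legacy opener as a plan with exactly one morsel per
/// file, so the rows produced are the same as with the legacy call.
struct OpenerAdapter {
    opener: Arc<dyn LegacyOpener>,
}

impl Morselizer for OpenerAdapter {
    fn plan(&self, path: &str) -> Vec<Morsel> {
        let opener = Arc::clone(&self.opener);
        let path = path.to_string();
        vec![Morsel {
            run: Box::new(move || opener.open(&path)),
        }]
    }
}

/// Example legacy opener that fabricates two rows per file.
struct EchoOpener;

impl LegacyOpener for EchoOpener {
    fn open(&self, path: &str) -> Vec<String> {
        vec![format!("{path}:row0"), format!("{path}:row1")]
    }
}

/// Runs every morsel planned for `path` and concatenates the rows.
fn read_all(morselizer: &dyn Morselizer, path: &str) -> Vec<String> {
    morselizer
        .plan(path)
        .into_iter()
        .flat_map(|morsel| (morsel.run)())
        .collect()
}
```

Wrapping the whole file in a single morsel trades away any finer-grained parallelism, but it lets code written against the old interface run unchanged on the new one.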
| @@ -0,0 +1,556 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
This is test infrastructure for writing the snapshot tests.
| return Poll::Ready(Some(Err(err))); | ||
| } | ||
| } | ||
| FileStreamState::Scan { scan_state: queue } => { |
I moved the inner state machine into a separate module/struct to keep indentation under control and to encapsulate the complexity somewhat.
| assert!(err.contains("FileStreamBuilder invalid partition index: 1")); | ||
| } | ||
| /// Verifies the simplest morsel-driven flow: one planner produces one |
Here are tests showing the sequence of calls to the various morsel APIs. I intend to use this framework to show how work can migrate from one stream to another.
b5c452a to d5a1f74
run benchmarks
🤖 Benchmark running (GKE). Comparing alamb/file_stream_split (d5a1f74) to 1e93a67 (merge-base) diff using: clickbench_partitioned
| all-features = true | ||
| [features] | ||
| backtrace = ["datafusion-common/backtrace"] |
I added this while debugging why the tests failed on CI but not locally: when this feature flag was on, the error messages got mangled.
I added a crate-level feature that enables the corresponding feature in datafusion-common so I could reproduce the error locally.
🤖 Benchmark running (GKE). Comparing alamb/file_stream_split (d5a1f74) to 1e93a67 (merge-base) diff using: tpch
🤖 Benchmark running (GKE). Comparing alamb/file_stream_split (d5a1f74) to 1e93a67 (merge-base) diff using: tpcds
d5a1f74 to b2c9bd6
🤖 Benchmark completed (GKE): tpch, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch
Stacked on
Which issue does this PR close?
Rationale for this change
The Morsel API allows for finer-grained parallelism (and I/O). It is important for FileStream to work in terms of the Morsel API so that future features (such as work stealing) can build on it.
What changes are included in this PR?
Are these changes tested?
Yes, by existing functional and benchmark tests, as well as new functional tests.
Are there any user-facing changes?
No (not yet)