-
I think you could do what you describe with `futures::stream::flatten` (to combine all the partition streams into a single stream) and then `for_each`:

https://docs.rs/futures/0.3.28/futures/stream/trait.StreamExt.html#method.flatten
https://docs.rs/futures/0.3.28/futures/stream/trait.StreamExt.html#method.for_each

Something like this (untested):

```rust
let streams = df.execute_stream_partitioned().await?;
// combine all partition streams into a single logical one
let stream = futures::stream::iter(streams).flatten();
// call the callback for each record batch as it arrives
stream
    .for_each(|batch| async move {
        // your per-batch callback here...
    })
    .await;
```
-
Actually, if you want the values from each partition as they come in (rather than in partition order), you may need to use `flatten_unordered`: https://docs.rs/futures/0.3.28/futures/stream/trait.StreamExt.html#method.flatten_unordered
-
I'm trying to query some Parquet files on S3, and use DataFusion's partitioned-streams API to pass all the record batches to a callback function as they become available, like below:
However, what I'd really like to do is get a stream of rows across all threads that match the query and pass them to the callback as soon as they're found, so the callback gets called with the row data every time a new row matches, instead of waiting for the partition to finish collecting rows.
Is this possible, and if so, how can I do it with DataFusion? From stepping into the execute_stream_partitioned() function, it looks like I may have to write my own physical plan and plug it into DataFusion, but that feels like a lot of work for a Rust/DataFusion newbie, so I'm hoping there's an easier way or an API I can hook into.
Thanks!