Skip to content

FileStream start_next_file() is not measured in compute time. #20571

@EmilyMatt

Description

@EmilyMatt

Describe the bug

start_next_file returns a FileOpenFuture, however nothing guarantees that the open() function, which returns that future, doesn't take processing time.
This time should be represented in time scanning total

fn start_next_file(&mut self) -> Option<Result<(FileOpenFuture, Vec<ScalarValue>)>> {
        let part_file = self.file_iter.pop_front()?;

        let partition_values = part_file.partition_values.clone();
        let _timer = self.file_stream_metrics.time_scanning_total.metrics.timer(); // <- this line is missing
        Some(
            self.file_opener
                .open(part_file)
                .map(|future| (future, partition_values)),
        )
    }

An alternative solution would be to remove Result from this function's signature, and return a futures::future::lazy which will perform the opening, and call .try_flatten() on that nested future.
That way this will only happen when polled first

To Reproduce

Add a FileOpener which sleeps a couple seconds before returning the file open future

Expected behavior

This should be clearly represented by the scan time total, in practice, given enough CPU work, the opening file time can actually appear larger than total scan time

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions