Add files_processed and files_scanned metrics to FileStreamMetrics #20592
adriangb merged 10 commits into apache:main
Conversation
Track file-level progress in FileStream with two new counters:

- files_opened: incremented when a file is successfully opened
- files_scanned: incremented when a file's reader stream is fully consumed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename `files_opened` metric to `files_processed` so it reflects all files assigned to the partition, not just those that were opened. When a LIMIT terminates the stream early, the remaining files (including any prefetched next file) are counted so that `files_processed` always equals the total partition file count. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the files_processed metric so it only increments when we are truly done with a file (consumed, errored+skipped, or limit), not at file-open time. This makes the metric semantics match the name. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Incremented when a file is successfully opened, complementing the existing files_processed (done with file) and files_scanned (fully consumed) metrics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pairs better with files_opened and avoids confusion with files_scanned. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the review @getChan. The goal of this metric was precisely to record when we are done with a file, so really it's the implementation that was wrong. I've fixed it now. I also added a
@comphead @gabotechs any chance one of you could take a look at this?
```rust
    batch
} else {
    let batch = batch.slice(0, *remain);
    // Count this file, the prefetched next file
    // (if any), and all remaining files we will
    // never open.
```
I read it twice and probably need some help understanding what files are being counted here. If they were never opened, why do we need to count them as closed?
This is the case where we hit a LIMIT. Maybe `files_processed` is a better term? I want this to be useful for tracking the progress of a query; IMO the best metric we have now is "how many files have we completed work on (for any reason) out of all of the files we can possibly look at?".
yeah, processed would be more intuitive I'd say
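To make the counting being discussed concrete, here is a minimal sketch. The function and parameter names are hypothetical, not the actual FileStream code:

```rust
/// Hypothetical sketch: when a LIMIT terminates the scan early, the
/// current file, the prefetched next file (if any), and every file we
/// will never open are all counted toward `files_processed`, so the
/// metric still converges to the partition's total file count.
/// Names are illustrative, not the real DataFusion implementation.
fn count_on_limit(files_processed: &mut usize, prefetched_next: bool, remaining_files: usize) {
    *files_processed += 1                // the file we are reading now
        + usize::from(prefetched_next)   // the prefetched next file, if any
        + remaining_files;               // files we will never open
}

fn main() {
    // Partition with 6 files; LIMIT hit while reading file 2,
    // file 3 already prefetched, files 4-6 never opened.
    let mut files_processed = 1; // file 1 was fully consumed earlier
    count_on_limit(&mut files_processed, true, 3);
    assert_eq!(files_processed, 6); // equals the partition's total file count
    println!("files_processed = {files_processed}");
}
```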
```rust
/// assigned to this partition.
pub files_closed: Count,
/// Count of files completely scanned (reader stream fully consumed).
pub files_scanned: Count,
```
so the file can be opened, but not scanned?
Yes, e.g. if FilePruner discards it based on stats.
Would it be more explanatory to see how many files were skipped instead of closed? Intuitively I would expect number of opened files == number of closed files, unless we are debugging a connection leak. And skipped-by-stats would make more sense?
I think the two invariants are:
- number of open files >= number of closed files
- number of open files == number of closed files at the beginning and end of execution
Regarding `files_scanned` vs. `files_skipped`: we already have metrics for file_groups_prunned_statistics. I don't really need the `files_scanned` metric, so let me remove it.
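The two invariants above can be expressed as a quick sketch (hypothetical struct and field names, not the actual FileStream code):

```rust
/// Hypothetical sketch of the open/closed invariants discussed above.
/// The names are illustrative, not the real FileStreamMetrics fields.
#[derive(Default)]
struct FileCounters {
    files_opened: usize,
    files_closed: usize,
}

impl FileCounters {
    /// Holds at every point during execution.
    fn invariant_during(&self) -> bool {
        self.files_opened >= self.files_closed
    }
    /// Holds at the beginning and end of execution.
    fn invariant_at_rest(&self) -> bool {
        self.files_opened == self.files_closed
    }
}

fn main() {
    let mut c = FileCounters::default();
    assert!(c.invariant_at_rest()); // before execution: 0 == 0

    c.files_opened += 1;            // mid-execution: one file is open
    assert!(c.invariant_during());
    assert!(!c.invariant_at_rest());

    c.files_closed += 1;            // after execution: all files closed
    assert!(c.invariant_at_rest());
}
```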
Summary
- Add a `files_processed` counter to `FileStreamMetrics`, incremented for every file assigned to the partition, whether it was opened, pruned (returned an empty stream), or skipped due to a LIMIT. When the stream completes, this equals the total number of files in the partition.
- Add a `files_opened` counter to `FileStreamMetrics`, incremented as soon as we consider a file for processing (either actually opened, discarded because of a LIMIT or stats, etc.).

Motivation
These metrics enable tracking query progress during long-running scans. Today, there is no way to monitor how far along a file scan is. The existing `FileStreamMetrics` only provide:

- Timing metrics (`time_elapsed_opening`, `time_elapsed_scanning_total`, etc.): these measure duration but don't indicate progress. You can't tell whether a scan is 10% or 90% done from elapsed time alone.
- Error counters (`file_open_errors`, `file_scan_errors`): these only count failures, not successful progress.
- `output_rows` or `bytes_scanned` (from `BaselineMetrics`): counts rows emitted, but since we don't know upfront how many rows will be emitted in total this is a poor metric, i.e. it never converges to 100% if there are filters, etc.

In contrast,
`files_processed` and `files_opened`, combined with the known number of files in `file_groups`, give a clear progress indicator: `files_processed / total_files`. This is the most natural and reliable way to track scan progress since the file count is known at plan time. Depending on what users plan to do with the metric they can pick `files_opened / total_files` (leading metric) or `files_processed / total_files` (lagging metric).

Test plan
- `file_stream` tests pass (8/8)
- `cargo check -p datafusion-datasource` compiles cleanly

🤖 Generated with Claude Code
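The progress calculation described in the Motivation section can be sketched as follows. `FileStreamProgress` and its fields are illustrative, not the actual DataFusion API:

```rust
/// Hypothetical sketch of deriving a progress indicator from the two
/// counters: `files_opened` is the leading metric, `files_processed`
/// the lagging one. Names are illustrative, not the real API.
struct FileStreamProgress {
    total_files: usize,     // known at plan time from file_groups
    files_opened: usize,    // leading metric
    files_processed: usize, // lagging metric
}

impl FileStreamProgress {
    /// Fraction of files we have at least started working on.
    fn leading(&self) -> f64 {
        self.files_opened as f64 / self.total_files as f64
    }
    /// Fraction of files we are completely done with.
    fn lagging(&self) -> f64 {
        self.files_processed as f64 / self.total_files as f64
    }
}

fn main() {
    let p = FileStreamProgress {
        total_files: 10,
        files_opened: 4,
        files_processed: 3,
    };
    println!(
        "leading = {:.0}%, lagging = {:.0}%",
        100.0 * p.leading(),
        100.0 * p.lagging()
    ); // prints "leading = 40%, lagging = 30%"
}
```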