
Add files_processed and files_scanned metrics to FileStreamMetrics #20592

Merged
adriangb merged 10 commits into apache:main from pydantic:file-open-metrics
Mar 4, 2026
Conversation

@adriangb
Contributor

@adriangb adriangb commented Feb 27, 2026

Summary

  • Add files_processed counter to FileStreamMetrics, incremented for every file assigned to the partition — whether it was opened, pruned (returned an empty stream), or skipped due to a LIMIT. When the stream completes, this equals the total number of files in the partition.
  • Add files_opened counter to FileStreamMetrics, incremented when a file is successfully opened (a file pruned by stats or cut off by a LIMIT is never opened, so it is counted only in files_processed).

Motivation

These metrics enable tracking query progress during long-running scans. Today, there is no way to monitor how far along a file scan is. The existing FileStreamMetrics only provide:

  • Timing metrics (time_elapsed_opening, time_elapsed_scanning_total, etc.) — these measure duration but don't indicate progress. You can't tell whether a scan is 10% or 90% done from elapsed time alone.
  • Error counters (file_open_errors, file_scan_errors) — these only count failures, not successful progress.
  • output_rows or bytes_scanned (from BaselineMetrics) — counts rows emitted, but since we don't know upfront how many rows will be emitted in total this is a poor metric, i.e. it never converges to 100% if there are filters, etc.

In contrast, files_processed and files_opened combined with the known number of files in file_groups give a clear progress indicator: files_processed / total_files. This is the most natural and reliable way to track scan progress since the file count is known at plan time. Depending on what users plan to do with the metric they can pick files_opened / total_files (leading metric) or files_processed / total_files (lagging metric).
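As an illustration (not part of the PR, and not the actual FileStreamMetrics API), a progress read-out built from these counters might look like the following sketch; the struct and field names are hypothetical stand-ins:

```rust
// Hypothetical snapshot of the counters described above for one partition.
struct ScanProgress {
    files_opened: usize,    // leading indicator: files we have started reading
    files_processed: usize, // lagging indicator: files we are fully done with
    total_files: usize,     // known at plan time from file_groups
}

impl ScanProgress {
    // Multiply before dividing so small integer inputs stay exact in f64.
    fn leading_pct(&self) -> f64 {
        self.files_opened as f64 * 100.0 / self.total_files as f64
    }
    fn lagging_pct(&self) -> f64 {
        self.files_processed as f64 * 100.0 / self.total_files as f64
    }
}

fn main() {
    let p = ScanProgress { files_opened: 7, files_processed: 5, total_files: 10 };
    println!("{:.0}% opened, {:.0}% processed", p.leading_pct(), p.lagging_pct());
}
```

Because total_files is fixed at plan time, the lagging ratio converges to exactly 100% when the scan finishes, unlike row-based ratios.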

Test plan

  • Existing file_stream tests pass (8/8)
  • cargo check -p datafusion-datasource compiles cleanly

🤖 Generated with Claude Code

Track file-level progress in FileStream with two new counters:
- files_opened: incremented when a file is successfully opened
- files_scanned: incremented when a file's reader stream is fully consumed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions bot added the datasource (Changes to the datasource crate) label Feb 27, 2026
adriangb and others added 2 commits February 27, 2026 12:11
Rename `files_opened` metric to `files_processed` so it reflects
all files assigned to the partition, not just those that were opened.
When a LIMIT terminates the stream early, the remaining files
(including any prefetched next file) are counted so that
`files_processed` always equals the total partition file count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@adriangb adriangb changed the title Add files_opened and files_scanned metrics to FileStreamMetrics Add files_processed and files_scanned metrics to FileStreamMetrics Mar 1, 2026
@getChan left a comment
Contributor

Thanks for the improvement.

Small concern: files_processed is incremented on open success (and on the LIMIT short-circuit), not on completion,
so I considered renaming it to reflect this better, but couldn't find a clearer name.

adriangb and others added 3 commits March 2, 2026 16:32
Move the files_processed metric so it only increments when we are
truly done with a file (consumed, errored+skipped, or limit), not
at file-open time. This makes the metric semantics match the name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Incremented when a file is successfully opened, complementing the
existing files_processed (done with file) and files_scanned (fully
consumed) metrics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pairs better with files_opened and avoids confusion with
files_scanned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@adriangb
Contributor Author

adriangb commented Mar 2, 2026

Thanks for the review @getChan. The goal of this metric was precisely to record when we are done with a file, so really it's the implementation that was wrong. I've fixed it now. I also added a files_opened metric to complete the picture.
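
To make the final semantics concrete, here is a self-contained model (plain Rust, simplified stand-in names, not the actual FileStream code) of where the two counters tick under each outcome a file can have:

```rust
// Illustrative model of the increment points: files_opened bumps when a
// reader is created; files_processed bumps once we are done with the file
// for any reason (fully consumed, pruned by stats, or cut off by a LIMIT).
#[derive(Default)]
struct Counters {
    files_opened: usize,
    files_processed: usize,
}

enum Outcome {
    Consumed,      // reader stream fully drained
    PrunedByStats, // FilePruner discarded it; never opened
    CutByLimit,    // opened, then abandoned when the LIMIT was hit
}

fn scan(files: &[Outcome]) -> Counters {
    let mut c = Counters::default();
    for f in files {
        match f {
            // Done with the file without ever opening it.
            Outcome::PrunedByStats => c.files_processed += 1,
            Outcome::Consumed | Outcome::CutByLimit => {
                c.files_opened += 1;    // reader created
                c.files_processed += 1; // and we are now done with the file
            }
        }
    }
    // When the stream completes, files_processed equals the partition's
    // total file count, while files_opened may be smaller.
    assert_eq!(c.files_processed, files.len());
    assert!(c.files_opened <= c.files_processed);
    c
}

fn main() {
    use Outcome::*;
    let c = scan(&[Consumed, PrunedByStats, CutByLimit]);
    println!("opened={} processed={}", c.files_opened, c.files_processed);
}
```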

@adriangb
Contributor Author

adriangb commented Mar 2, 2026

@comphead @gabotechs any chance one of you could take a look at this?

```rust
    batch
} else {
    let batch = batch.slice(0, *remain);
    // Count this file, the prefetched next file
    // (if any), and all remaining files we will
    // never open.
```
Contributor

```rust
// Count this file, the prefetched next file
// (if any), and all remaining files we will
// never open.
```

I read it twice and probably need some help to understand what files are being counted here. If they were never opened, why do we need to count them as closed?

Contributor Author

This is if we hit a LIMIT. Maybe files_processed is a better term? I want this to be useful to track the progress of a query, IMO the best metric we have now is "how many files have we completed work on (for any reason) out of all of the files we can possibly look at?".

Contributor

yeah, processed would be more intuitive I'd say

Contributor Author

updated!

```rust
/// assigned to this partition.
pub files_closed: Count,
/// Count of files completely scanned (reader stream fully consumed).
pub files_scanned: Count,
```
Contributor

so the file can be opened, but not scanned?

Contributor Author

Yes, e.g. if FilePruner discards it based on stats.

Contributor

Would it be more explanatory to see how many files were skipped instead of closed? Intuitively I would expect number of open files == number of closed files, unless we are debugging a connection leak. Wouldn't "skipped by stats" make more sense?

Contributor Author

I think the two invariants are:

  • number of open files >= number of closed files
  • number of open files == number of closed files at the beginning and end of execution

Regarding files_scanned vs. files_skipped: we already have metrics for file_groups_prunned_statistics. I don't really need the files_scanned metric so let me remove it.

@comphead left a comment
Contributor

Thanks @adriangb

@adriangb adriangb added this pull request to the merge queue Mar 4, 2026
Merged via the queue into apache:main with commit 028e351 Mar 4, 2026
28 checks passed