
Add files_processed and files_scanned metrics to FileStreamMetrics #20592

Merged
adriangb merged 10 commits into apache:main from pydantic:file-open-metrics
Mar 4, 2026
Conversation

@adriangb
Contributor

@adriangb adriangb commented Feb 27, 2026

Summary

  • Add files_processed counter to FileStreamMetrics, incremented for every file assigned to the partition — whether it was opened, pruned (returned an empty stream), or skipped due to a LIMIT. When the stream completes, this equals the total number of files in the partition.
  • Add files_opened counter to FileStreamMetrics, incremented when a file is successfully opened (a file pruned by stats or cut off by a LIMIT is never opened, so it is counted only in files_processed).

Motivation

These metrics enable tracking query progress during long-running scans. Today, there is no way to monitor how far along a file scan is. The existing FileStreamMetrics only provide:

  • Timing metrics (time_elapsed_opening, time_elapsed_scanning_total, etc.) — these measure duration but don't indicate progress. You can't tell whether a scan is 10% or 90% done from elapsed time alone.
  • Error counters (file_open_errors, file_scan_errors) — these only count failures, not successful progress.
  • output_rows or bytes_scanned (from BaselineMetrics) — counts rows emitted, but since we don't know upfront how many rows will be emitted in total this is a poor metric, i.e. it never converges to 100% if there are filters, etc.

In contrast, files_processed and files_opened combined with the known number of files in file_groups give a clear progress indicator: files_processed / total_files. This is the most natural and reliable way to track scan progress since the file count is known at plan time. Depending on what users plan to do with the metric they can pick files_opened / total_files (leading metric) or files_processed / total_files (lagging metric).
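As an illustration (not part of the PR, and not the actual FileStreamMetrics API), a progress read-out built from these counters might look like the following sketch; the struct and field names are hypothetical stand-ins:

```rust
// Hypothetical snapshot of the counters described above for one partition.
struct ScanProgress {
    files_opened: usize,    // leading indicator: files we have started reading
    files_processed: usize, // lagging indicator: files we are fully done with
    total_files: usize,     // known at plan time from file_groups
}

impl ScanProgress {
    // Multiply before dividing so small integer inputs stay exact in f64.
    fn leading_pct(&self) -> f64 {
        self.files_opened as f64 * 100.0 / self.total_files as f64
    }
    fn lagging_pct(&self) -> f64 {
        self.files_processed as f64 * 100.0 / self.total_files as f64
    }
}

fn main() {
    let p = ScanProgress { files_opened: 7, files_processed: 5, total_files: 10 };
    println!("{:.0}% opened, {:.0}% processed", p.leading_pct(), p.lagging_pct());
}
```

Because total_files is fixed at plan time, the lagging ratio converges to exactly 100% when the scan finishes, unlike row-based ratios.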

Test plan

  • Existing file_stream tests pass (8/8)
  • cargo check -p datafusion-datasource compiles cleanly

🤖 Generated with Claude Code

Track file-level progress in FileStream with two new counters:
- files_opened: incremented when a file is successfully opened
- files_scanned: incremented when a file's reader stream is fully consumed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions bot added the datasource (Changes to the datasource crate) label Feb 27, 2026
adriangb and others added 2 commits February 27, 2026 12:11
Rename `files_opened` metric to `files_processed` so it reflects
all files assigned to the partition, not just those that were opened.
When a LIMIT terminates the stream early, the remaining files
(including any prefetched next file) are counted so that
`files_processed` always equals the total partition file count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@adriangb adriangb changed the title Add files_opened and files_scanned metrics to FileStreamMetrics Add files_processed and files_scanned metrics to FileStreamMetrics Mar 1, 2026
@getChan left a comment
Contributor

Thanks for the improvement.

Small concern: files_processed is incremented on open success (and on the LIMIT short-circuit), not on completion,
so I considered renaming it to reflect this better, but couldn't find a clearer name.

adriangb and others added 3 commits March 2, 2026 16:32
Move the files_processed metric so it only increments when we are
truly done with a file (consumed, errored+skipped, or limit), not
at file-open time. This makes the metric semantics match the name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Incremented when a file is successfully opened, complementing the
existing files_processed (done with file) and files_scanned (fully
consumed) metrics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pairs better with files_opened and avoids confusion with
files_scanned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@adriangb
Contributor Author

adriangb commented Mar 2, 2026

Thanks for the review @getChan. The goal of this metric was precisely to record when we are done with a file, so really it's the implementation that was wrong. I've fixed it now. I also added a files_opened metric to complete the picture.
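
To make the final semantics concrete, here is a self-contained model (plain Rust, simplified stand-in names, not the actual FileStream code) of where the two counters tick under each outcome a file can have:

```rust
// Illustrative model of the increment points: files_opened bumps when a
// reader is created; files_processed bumps once we are done with the file
// for any reason (fully consumed, pruned by stats, or cut off by a LIMIT).
#[derive(Default)]
struct Counters {
    files_opened: usize,
    files_processed: usize,
}

enum Outcome {
    Consumed,      // reader stream fully drained
    PrunedByStats, // FilePruner discarded it; never opened
    CutByLimit,    // opened, then abandoned when the LIMIT was hit
}

fn scan(files: &[Outcome]) -> Counters {
    let mut c = Counters::default();
    for f in files {
        match f {
            // Done with the file without ever opening it.
            Outcome::PrunedByStats => c.files_processed += 1,
            Outcome::Consumed | Outcome::CutByLimit => {
                c.files_opened += 1;    // reader created
                c.files_processed += 1; // and we are now done with the file
            }
        }
    }
    // When the stream completes, files_processed equals the partition's
    // total file count, while files_opened may be smaller.
    assert_eq!(c.files_processed, files.len());
    assert!(c.files_opened <= c.files_processed);
    c
}

fn main() {
    use Outcome::*;
    let c = scan(&[Consumed, PrunedByStats, CutByLimit]);
    println!("opened={} processed={}", c.files_opened, c.files_processed);
}
```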

@adriangb
Contributor Author

adriangb commented Mar 2, 2026

@comphead @gabotechs any chance one of you could take a look at this?

```rust
    batch
} else {
    let batch = batch.slice(0, *remain);
    // Count this file, the prefetched next file
    // (if any), and all remaining files we will
    // never open.
```
Contributor

```rust
// Count this file, the prefetched next file
// (if any), and all remaining files we will
// never open.
```

I read it twice and probably need some help to understand what files are being counted here. If they were never opened, why do we need to count them as closed?

Contributor Author

This is if we hit a LIMIT. Maybe files_processed is a better term? I want this to be useful to track the progress of a query, IMO the best metric we have now is "how many files have we completed work on (for any reason) out of all of the files we can possibly look at?".

Contributor

yeah, processed would be more intuitive I'd say

Contributor Author

updated!

```rust
/// assigned to this partition.
pub files_closed: Count,
/// Count of files completely scanned (reader stream fully consumed).
pub files_scanned: Count,
```
Contributor

so the file can be opened, but not scanned?

Contributor Author

Yes, e.g. if FilePruner discards it based on stats.

Contributor

Would it be more explanatory to see how many files were skipped instead of closed? Intuitively I would expect number of open files == number of closed files, unless we are debugging a connection leak. Wouldn't "skipped by stats" make more sense?

Contributor Author

I think the two invariants are:

  • number of open files >= number of closed files
  • number of open files == number of closed files at the beginning and end of execution

Regarding files_scanned vs. files_skipped: we already have metrics for file_groups_prunned_statistics. I don't really need the files_scanned metric so let me remove it.

@comphead left a comment
Contributor

Thanks @adriangb

@adriangb adriangb added this pull request to the merge queue Mar 4, 2026
Merged via the queue into apache:main with commit 028e351 Mar 4, 2026
28 checks passed