Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose top level stats in scans #227

Merged
merged 8 commits into from
Jun 25, 2024
Merged

Conversation

nicklan
Copy link
Collaborator

@nicklan nicklan commented May 27, 2024

  1. Adds a "stats" column to the EngineData we produce during scans
  2. Extracts, parses, and calls back with stats when using visit_scan_files
  3. Exposes this via ffi

Only supports numRecords for the moment, but I've kept this as a struct so we can add future stats without breaking apis

@nicklan nicklan force-pushed the expose-stats branch 2 times, most recently from 20355e6 to f4445b6 Compare May 30, 2024 21:31
pub struct Stats {
pub num_records: u64,
#[serde(default = "default_true")]
pub tight_bounds: bool,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it help an engine to know about tight bounds? I guess it tells whether the logical rowcount could be less than the reported physical rowcount, but I wouldn't ever expect a deletion vector to suppress more than a small fraction of rows?

Also: Where do we draw the line on what stats to expose? I guess we stop here because the min/max/nullcount fields are (a) difficult to expose because schema-specific; and (b) arguably it's kernel's job to deal with them, not engine's?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I had just started looking at the stats stuff when I wrote this PR, and thought that tight_bounds had implications for the value of this stat, but it doesn't. Rather, the presence of a deletion vector matters.

So I've removed that field, and added a has_vector() -> bool method on DvInfo, and noted what the stat means in a doc comment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call it has_dv or has_deletion_vector to be self documenting?

kernel/src/scan/state.rs Outdated Show resolved Hide resolved
@nicklan nicklan requested a review from scovich June 19, 2024 22:00
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a "good enough" start -- we can always iterate later as use cases proliferate.

const DvInfo* dv_info,
const CStringMap* partition_values) {
printf("file: %.*s\n", (int)path.len, path.ptr);
printf("file: %.*s (size: %" PRId64 ", num_records:", (int)path.len, path.ptr, size);
if (stats) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: should this code also print dv info?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm considering cffi-test.c "deprecated". As soon as #203 merges, I'll remove cffi-test.c and switch our CI over to run the more complete example there, which does look at the dv info

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@nicklan nicklan merged commit 4b25f40 into delta-incubator:main Jun 25, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants