Clarify / Document metrics contract #767

Closed
edgarRd opened this issue Feb 3, 2020 · 3 comments

Comments

@edgarRd
Contributor

edgarRd commented Feb 3, 2020

The metrics contract is unclear from the implementation alone. It is not defined in the spec, and Parquet is the only format with fully implemented metrics. While working on ORC metrics, I found it hard to tell what contract is expected, since file formats seem to implement this differently. For instance:

  • Map<Integer, Long> valueCounts() - it's not clear whether this method counts only non-null values or also counts null and repeated values. Judging by TestMetrics, value counts include null and repeated values, which would make them essentially the same as the row count, except for nested structures (e.g. lists, maps). However, this is not defined anywhere.

This issue is to track the discussion about the expected metrics contract and get a clear definition.

@aokolnychyi
Contributor

aokolnychyi commented Feb 4, 2020

I think the spec partially covers this (in the description of the data_file struct), but I agree we should restate it.

It seems we used to have distinct_counts but it is deprecated now. @rdblue, do you remember why?

@aokolnychyi
Contributor

I actually missed #768.

@rdblue
Contributor

rdblue commented Feb 5, 2020

Yes, value counts include null values. I think Anton wrote good tests to validate the behavior here.
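To make the discussed semantics concrete, here is a minimal sketch of value counts that include nulls and repeated values. It is a toy model, not Iceberg's implementation: the `Row` class, the field ids 1 and 2, and the choice that a null list contributes no element values are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ValueCountsSketch {
  // Hypothetical field ids for a toy schema:
  // 1 = top-level column "id", 2 = element of a list column "tags".
  static final int ID_FIELD = 1;
  static final int TAGS_ELEMENT_FIELD = 2;

  // Toy row type; both fields may be null, and the list may hold nulls.
  static final class Row {
    final Long id;
    final List<String> tags;

    Row(Long id, List<String> tags) {
      this.id = id;
      this.tags = tags;
    }
  }

  // Counts every value per field id, including nulls and repeats.
  static Map<Integer, Long> valueCounts(List<Row> rows) {
    Map<Integer, Long> counts = new HashMap<>();
    counts.put(ID_FIELD, 0L);
    counts.put(TAGS_ELEMENT_FIELD, 0L);
    for (Row row : rows) {
      // A top-level column has one value per row, null or not.
      counts.merge(ID_FIELD, 1L, Long::sum);
      // Assumption: a null list contributes no element values.
      if (row.tags != null) {
        counts.merge(TAGS_ELEMENT_FIELD, (long) row.tags.size(), Long::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    List<Row> rows = Arrays.asList(
        new Row(1L, Arrays.asList("a", "b", "a")),   // repeated element "a"
        new Row(null, Arrays.asList("c", null)),     // null id, null element
        new Row(3L, null));                          // null list
    System.out.println("rows=" + rows.size() + " valueCounts=" + valueCounts(rows));
  }
}
```

In this sketch the top-level column's value count equals the row count (3), while the list element count (5) differs from it, which is the nested-structure case raised in the issue description.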

We deprecated distinct_counts because it wasn't useful. It doesn't do much good to know how many distinct values are in a single file unless you can combine that across files. To do that, we need more information than just the distinct count. We need a data sketch.
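The point about combining across files can be illustrated with a small sketch. It uses an exact `HashSet` as a stand-in for a mergeable summary; a real data sketch would be a compact, approximate structure instead. The class and method names are hypothetical and not part of Iceberg.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DistinctCountMergeSketch {
  // Per-file distinct count: a single number, with no record of which values were seen.
  static long distinctCount(List<String> values) {
    return new HashSet<>(values).size();
  }

  // Combining requires the values themselves (or a mergeable summary of them).
  // A HashSet is used here as an exact stand-in for a data sketch.
  static long mergedDistinctCount(List<String> fileA, List<String> fileB) {
    Set<String> union = new HashSet<>(fileA);
    union.addAll(fileB);
    return union.size();
  }

  public static void main(String[] args) {
    List<String> fileA = Arrays.asList("a", "b", "c");
    List<String> overlapping = Arrays.asList("b", "c", "d");
    List<String> disjoint = Arrays.asList("x", "y", "z");

    // Both candidate second files report the same per-file distinct count...
    System.out.println(distinctCount(overlapping) + " vs " + distinctCount(disjoint));
    // ...yet the combined distinct count differs depending on the overlap.
    System.out.println(mergedDistinctCount(fileA, overlapping));
    System.out.println(mergedDistinctCount(fileA, disjoint));
  }
}
```

Two files that each report three distinct values can together hold anywhere from three to six distinct values, so the per-file numbers alone cannot be summed or otherwise combined into a table-level figure.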
