The metrics contract is unclear from the implementation. It is not defined in the spec, and Parquet is the only format with fully implemented metrics. While working on ORC metrics, I found it hard to tell what contract is expected, since file formats seem to implement this differently. For instance:
Map<Integer, Long> valueCounts() - it's not clear whether this method counts only non-null values or also null and repeated values. Judging by TestMetrics, value counts include null and repeated values, which would make them nearly the same as the row count, except for nested structures (e.g. lists, maps). However, this is not defined.
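To make that reading concrete, here is a minimal sketch of how a value count could differ from a row count once a column is nested. The class and method names are hypothetical and are not part of the Iceberg API. Treating a null or empty list as contributing zero values is an assumption, since the contract under discussion does not settle it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ValueCountSketch {
    // Flat column: every row contributes exactly one value, nulls included,
    // so the value count equals the row count.
    static long countFlat(List<Integer> column) {
        return column.size();
    }

    // List column: every element contributes one value, nulls included.
    // Assumption: a null or empty list contributes no values.
    static long countListElements(List<List<Integer>> column) {
        long count = 0;
        for (List<Integer> row : column) {
            if (row != null) {
                count += row.size();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Three rows, one null and one repeated value.
        List<Integer> flat = Arrays.asList(1, null, 1);
        // Three rows holding 3, 0 and 1 elements respectively.
        List<List<Integer>> nested = Arrays.asList(
            Arrays.asList(1, 2, null),
            Collections.<Integer>emptyList(),
            Arrays.asList(3));

        System.out.println("flat rows=" + flat.size() + " values=" + countFlat(flat));
        System.out.println("nested rows=" + nested.size() + " values=" + countListElements(nested));
    }
}
```

Under these assumptions the flat column has 3 rows and 3 values, while the list column has 3 rows but 4 values. That is the case where value count and row count diverge.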
This issue is to track the discussion about the expected metrics contract and get a clear definition.
Yes, value counts include null values. I think Anton wrote good tests to validate the behavior here.
We deprecated distinct_counts because it wasn't useful. It doesn't do much good to know how many distinct values are in a single file unless you can combine that across files. To do that, we need more information than just the distinct count. We need a data sketch.
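A small sketch of why per-file distinct counts do not combine. The class and method names are hypothetical. An exact `HashSet` stands in for a real data sketch purely for illustration. An actual sketch would be a compact, approximate structure, but the point is the same: the summary has to be mergeable.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DistinctCountSketch {
    // Merges two per-file summaries. The exact set is a stand-in for a
    // mergeable data sketch; it is not how a real sketch is stored.
    static Set<String> merge(Set<String> a, Set<String> b) {
        Set<String> merged = new HashSet<>(a);
        merged.addAll(b);
        return merged;
    }

    public static void main(String[] args) {
        Set<String> fileA = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> fileB = new HashSet<>(Arrays.asList("b", "c", "d"));

        // With only the two counts, the combined distinct count could be
        // anywhere between the larger count and the sum of both.
        int lowerBound = Math.max(fileA.size(), fileB.size());
        int upperBound = fileA.size() + fileB.size();

        // With mergeable summaries, the combined count can be computed.
        int combined = merge(fileA, fileB).size();

        System.out.println("bounds from counts alone: " + lowerBound + ".." + upperBound);
        System.out.println("combined distinct count: " + combined);
    }
}
```

Here each file has 3 distinct values. The counts alone only bound the combined answer between 3 and 6, while merging the summaries gives 4.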