
null_count is omitted for large columns in parquet files #3574

@mdibaiee

Description


Describe the bug, including details regarding any error messages, version, and platform.

Currently, ParquetMetadataConverter.java has a guard that skips writing all statistics, both min/max and null_count, when the stats are larger than the max allowed size under truncation. The rationale makes sense for omitting min/max, but null_count is a fixed-size count and can be written to the file regardless of how large the column values are. See the code below:

public static Statistics toParquetStatistics(
    org.apache.parquet.column.statistics.Statistics stats, int truncateLength) {
  Statistics formatStats = new Statistics();
  // Don't write stats larger than the max size rather than truncating. The
  // rationale is that some engines may use the minimum value in the page as
  // the true minimum for aggregations and there is no way to mark that a
  // value has been truncated and is a lower bound and not in the page.
  if (!stats.isEmpty() && withinLimit(stats, truncateLength)) {
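
To make the suggestion concrete, here is a minimal, self-contained sketch of the behavior I have in mind. It does not use the real parquet-java classes: `ColumnStats`, `FormatStats`, `fitsInLimit`, and `toFormatStats` are hypothetical stand-ins, and the size check is a simplified assumption rather than the actual `withinLimit` logic. The only point is that null_count is set outside the size guard, while min/max stay inside it.

```java
public class NullCountSketch {

    /** Hypothetical stand-in for the in-memory column statistics. */
    static final class ColumnStats {
        final byte[] min;
        final byte[] max;
        final long numNulls;

        ColumnStats(byte[] min, byte[] max, long numNulls) {
            this.min = min;
            this.max = max;
            this.numNulls = numNulls;
        }
    }

    /** Hypothetical stand-in for the serialized statistics; a null field means "not written". */
    static final class FormatStats {
        byte[] min;
        byte[] max;
        Long nullCount;
    }

    /** Simplified assumption: min and max together must fit in maxStatsSize bytes. */
    static boolean fitsInLimit(ColumnStats stats, int maxStatsSize) {
        return (long) stats.min.length + stats.max.length <= maxStatsSize;
    }

    static FormatStats toFormatStats(ColumnStats stats, int maxStatsSize) {
        FormatStats out = new FormatStats();
        // null_count is a fixed-size integer, so it is written unconditionally.
        out.nullCount = stats.numNulls;
        // min/max are still omitted (not truncated) when they are too large.
        if (fitsInLimit(stats, maxStatsSize)) {
            out.min = stats.min;
            out.max = stats.max;
        }
        return out;
    }

    public static void main(String[] args) {
        // A column whose min/max values are far larger than the limit, with no nulls.
        ColumnStats large = new ColumnStats(new byte[5000], new byte[5000], 0L);
        FormatStats out = toFormatStats(large, 4096);
        System.out.println("min/max written: " + (out.min != null && out.max != null));
        System.out.println("null_count written: " + (out.nullCount != null));
    }
}
```

With this shape, a reader of the file can still tell "zero nulls" apart from "null count unknown" for large columns, which is the information that appears to be missing in the Snowflake case below. The real fix would also need to keep the existing `stats.isEmpty()` handling, which this sketch leaves out.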

The missing null_count metadata sometimes causes downstream consumers of the Parquet files to fail. For example, in Snowflake we are seeing the following kind of error:

non-nullable column without default has null values according to file statistics

Component(s)

Core
