null_count is omitted for large columns in parquet files

### Describe the bug, including details regarding any error messages, version, and platform.

Currently, [ParquetMetadataConverter.java](https://github.com/apache/parquet-java/blob/7be05b4702df78ae0c0c6b44adc6b7b7af2d931f/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java) has a guard that skips writing all statistics, both min/max and `null_count`, when the stats exceed the maximum size allowed under truncation. That rationale makes sense for min/max, but `null_count` is a fixed-size count and could still be written to the file no matter how large the column values are. See the code below:

https://github.com/apache/parquet-java/blob/7be05b4702df78ae0c0c6b44adc6b7b7af2d931f/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L800-L807

The missing `null_count` metadata sometimes causes downstream consumers of the parquet files to fail. For example, in Snowflake we are seeing errors like the following:

```
non-nullable column without default has null values according to file statistics
```
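
A minimal sketch of the behavior being requested, assuming simplified stand-in types rather than the real parquet-java classes: `ColumnStats`, `FormatStats`, and the size rule inside this `withinLimit` are hypothetical and exist only for illustration, and the empty-stats check from the real guard is left out. The point is only that `null_count` is set outside the size guard while min/max stay inside it.

```java
// Illustration only: these are NOT the parquet-java classes.
final class NullCountSketch {

    /** Hypothetical stand-in for the in-memory column statistics. */
    static final class ColumnStats {
        final byte[] min;
        final byte[] max;
        final long numNulls;

        ColumnStats(byte[] min, byte[] max, long numNulls) {
            this.min = min;
            this.max = max;
            this.numNulls = numNulls;
        }
    }

    /** Hypothetical stand-in for the statistics written to the file footer. */
    static final class FormatStats {
        byte[] min;      // null means "not written"
        byte[] max;      // null means "not written"
        Long nullCount;  // null means "not written"
    }

    /** Assumed size rule: min and max together must fit in maxBytes. */
    static boolean withinLimit(ColumnStats stats, int maxBytes) {
        return stats.min.length + stats.max.length <= maxBytes;
    }

    static FormatStats toFormatStats(ColumnStats stats, int maxBytes) {
        FormatStats out = new FormatStats();
        // null_count is a fixed-size integer, so it is written regardless of value size.
        out.nullCount = stats.numNulls;
        // min/max keep the existing behavior: oversized values are omitted, not truncated.
        if (withinLimit(stats, maxBytes)) {
            out.min = stats.min;
            out.max = stats.max;
        }
        return out;
    }

    public static void main(String[] args) {
        ColumnStats large = new ColumnStats(new byte[5000], new byte[5000], 3L);
        FormatStats written = toFormatStats(large, 4096);
        System.out.println("nullCount written: " + (written.nullCount != null));
        System.out.println("min/max written: " + (written.min != null));
    }
}
```

With this shape, a reader that relies on `null_count` still gets it for large columns, while engines that treat min/max as exact values never see truncated bounds.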

### Component(s)

Core

	public static Statistics toParquetStatistics(
			org.apache.parquet.column.statistics.Statistics stats, int truncateLength) {
		Statistics formatStats = new Statistics();
		// Don't write stats larger than the max size rather than truncating. The
		// rationale is that some engines may use the minimum value in the page as
		// the true minimum for aggregations and there is no way to mark that a
		// value has been truncated and is a lower bound and not in the page.
		if (!stats.isEmpty() && withinLimit(stats, truncateLength)) {
