Describe the bug, including details regarding any error messages, version, and platform.
Currently, in ParquetMetadataConverter.java, a guard skips writing all statistics, both min/max and null_count, when the stats exceed the maximum allowed size. The rationale makes sense for omitting min/max; however, null_count could still be written to the file regardless of the size of the min/max values. See the code below:
public static Statistics toParquetStatistics(
    org.apache.parquet.column.statistics.Statistics stats, int truncateLength) {
  Statistics formatStats = new Statistics();
  // Don't write stats larger than the max size rather than truncating. The
  // rationale is that some engines may use the minimum value in the page as
  // the true minimum for aggregations and there is no way to mark that a
  // value has been truncated and is a lower bound and not in the page.
  if (!stats.isEmpty() && withinLimit(stats, truncateLength)) {
The missing null_count metadata sometimes causes downstream consumers of the Parquet files to fail. For example, in Snowflake we are seeing errors like the following:
non-nullable column without default has null values according to file statistics
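One possible shape of a fix is sketched below. It is only an illustration under stated assumptions: `ColumnStats`, `FormatStats`, and `convert` are hypothetical stand-ins, not the real parquet-java classes or the actual `toParquetStatistics` signature, and the `isEmpty()` check from the real guard is left out for brevity. The idea is that null_count is always written, while min/max are still omitted (not truncated) when they exceed the limit.

```java
import java.util.Optional;

public class NullCountSketch {

    /** Hypothetical stand-in for in-memory column statistics (not the parquet-java class). */
    static final class ColumnStats {
        final byte[] min;
        final byte[] max;
        final long nullCount;

        ColumnStats(byte[] min, byte[] max, long nullCount) {
            this.min = min;
            this.max = max;
            this.nullCount = nullCount;
        }

        /** Simplified size check: both min and max must fit within maxLength bytes. */
        boolean withinLimit(int maxLength) {
            return min.length <= maxLength && max.length <= maxLength;
        }
    }

    /** Hypothetical stand-in for the written statistics; an empty Optional means "not written". */
    static final class FormatStats {
        Optional<byte[]> min = Optional.empty();
        Optional<byte[]> max = Optional.empty();
        Optional<Long> nullCount = Optional.empty();
    }

    static FormatStats convert(ColumnStats stats, int maxLength) {
        FormatStats out = new FormatStats();

        // null_count is a single integer, so its size does not depend on the
        // column values. Write it unconditionally.
        out.nullCount = Optional.of(stats.nullCount);

        // Keep the existing behaviour for min/max: omit them entirely rather
        // than truncating when they are larger than the limit.
        if (stats.withinLimit(maxLength)) {
            out.min = Optional.of(stats.min);
            out.max = Optional.of(stats.max);
        }
        return out;
    }

    public static void main(String[] args) {
        // 64-byte min/max values against a 16-byte limit: min/max are dropped,
        // but null_count is still present.
        ColumnStats big = new ColumnStats(new byte[64], new byte[64], 3L);
        FormatStats written = convert(big, 16);
        System.out.println("min/max written: " + written.min.isPresent());
        System.out.println("null_count written: " + written.nullCount.isPresent());
    }
}
```

With this split, a reader that relies on null_count (as the Snowflake error above suggests) would still find it even when min/max are too large to store.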
Component(s)
Core
Code reference: parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java, lines 800 to 807 in 7be05b4