
Fix collection of bounds for small decimals in ParquetMetrics #131

Merged: 2 commits into apache:master on Mar 18, 2019

Conversation

aokolnychyi (Contributor):

This PR resolves #125.

ParquetMetrics uses ParquetConversions$fromParquetPrimitive, which assumes that decimals are always represented as binary in Parquet. That assumption does not hold: the Parquet spec also allows decimals to be stored as INT32, INT64, or FIXED_LEN_BYTE_ARRAY, depending on precision.

As a consequence, Iceberg might collect invalid lower/upper bounds, which can lead to skipping the wrong files. See the issue description for an example.

import static com.netflix.iceberg.Files.localInput;
import static com.netflix.iceberg.Files.localOutput;
import static com.netflix.iceberg.types.Conversions.fromByteBuffer;
import static com.netflix.iceberg.types.Types.*;
aokolnychyi (Contributor Author):
I usually prefer to avoid wildcard imports, but we really do use every data type within these tests, so it seems reasonable here.

Contributor:

I'm okay with this in tests, but I don't generally mind large import blocks because they are maintained by the IDE and are good for context. I'd prefer expanding this but it's up to you.

aokolnychyi (Contributor Author):

I don't have a strong opinion here. I'll expand the imports.

@@ -86,9 +86,9 @@ public static Metrics fromMetadata(ParquetMetadata metadata) {
       Types.NestedField field = fileSchema.asStruct().field(fieldId);
       if (field != null && stats.hasNonNullValue()) {
         updateMin(lowerBounds, fieldId,
-            fromParquetPrimitive(field.type(), stats.genericGetMin()));
+            fromParquetPrimitive(field.type(), column.getPrimitiveType(), stats.genericGetMin()));
Contributor:

Nit: continuation lines should be indented 4 spaces from the start of the statement.

aokolnychyi (Contributor Author):

It would no longer fit into 100 characters per line. Do we even have a requirement to fit within 100 characters per line? I've seen a couple of places where this is not respected.

Contributor:

That's probably a mistake then. Thanks for fixing this.

rdblue (Contributor) commented Mar 15, 2019:

@aokolnychyi, can you comment on how this fixes the problem? Why doesn't the previous call, for example Literals.of((Long) value).to(DecimalType.of(9, 2)), work?

aokolnychyi (Contributor Author):
@rdblue Let's assume we have 3.50 as Decimal(10, 2), which is represented as INT64 in Parquet.

file schema: table 
--------------------------------------------------------------------------------
decimalCol:  OPTIONAL INT64 O:DECIMAL R:0 D:1

row group 1: RC:1 TS:75 OFFSET:4 
--------------------------------------------------------------------------------
decimalCol:   INT64 GZIP DO:0 FPO:4 SZ:91/75/0.82 VC:1 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 3.50, max: 3.50, num_nulls: 0]

Once we read the footer and call stats.genericGetMin() for this decimal column, we will get 350. This value must be properly scaled (handled by converterFromParquet) before creating a literal.

Right now, Literals.of(350L).to(DecimalType.of(10, 2)) will give us 350.00 instead of 3.50.
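
For illustration, a minimal sketch of the scaling issue using only the JDK (the class and variable names are made up for the example; this is not the actual ParquetConversions code):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalScalingExample {
  public static void main(String[] args) {
    // Parquet stores a decimal as its unscaled value: 3.50 with scale 2 is written as 350 (INT64 here).
    long unscaledMin = 350L;
    int scale = 2;

    // Treating 350 as the decimal value itself and only adjusting the scale afterwards yields 350.00.
    BigDecimal wrong = new BigDecimal(unscaledMin).setScale(scale);

    // Applying the column's scale to the unscaled value yields the intended 3.50.
    BigDecimal right = new BigDecimal(BigInteger.valueOf(unscaledMin), scale);

    System.out.println(wrong);  // 350.00
    System.out.println(right);  // 3.50
  }
}
```

Passing column.getPrimitiveType() into fromParquetPrimitive gives the converter access to the column's decimal metadata, so the unscaled value can be converted correctly before the literal is created.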

checkFieldMetrics(1, schema, metrics, 2, 2, null, null);
}

private <T> void checkFieldMetrics(int fieldId, Schema schema, Metrics metrics,
Contributor:

I think these tests would be more readable if this were broken into two: assertCounts and assertBounds. That way it is clear when calling them that the metrics match the value and null counts, without looking at the implementation. Similarly, it would be clear that the bounds are equal to the given values.

That also allows you to avoid passing in so many arguments. The schema becomes unnecessary because you pass in the field or field type and field ID. As it is now, this gets the field by ID and then accesses its field ID, which is awkward.
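
A sketch of what the two helpers could look like, assuming JUnit's Assert, java.nio.ByteBuffer, the Metrics accessors, and the statically imported Conversions.fromByteBuffer shown above (the exact signatures are illustrative, not necessarily what ended up in the PR):

```java
private void assertCounts(int fieldId, long valueCount, long nullValueCount, Metrics metrics) {
  // Value and null counts are keyed by field ID in the collected metrics.
  Assert.assertEquals(valueCount, (long) metrics.valueCounts().get(fieldId));
  Assert.assertEquals(nullValueCount, (long) metrics.nullValueCounts().get(fieldId));
}

private <T> void assertBounds(int fieldId, Type type, T lowerBound, T upperBound, Metrics metrics) {
  // Bounds are stored as serialized ByteBuffers and may be absent for a column.
  ByteBuffer lower = metrics.lowerBounds().get(fieldId);
  ByteBuffer upper = metrics.upperBounds().get(fieldId);
  Assert.assertEquals(lowerBound, lower == null ? null : fromByteBuffer(type, lower));
  Assert.assertEquals(upperBound, upper == null ? null : fromByteBuffer(type, upper));
}
```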

import static com.netflix.iceberg.types.Types.NestedField.optional;
import static com.netflix.iceberg.types.Types.NestedField.required;

public class TestParquetMetrics {
Contributor:

A test case that is missing is what happens when there are multiple row groups in the file that are merged. We don't have to fix that in this PR, but it would be nice to have a test for it eventually.

aokolnychyi (Contributor Author):

I've created #132 so that we don't forget.

firstRecord.put("longCol", 5L);
firstRecord.put("floatCol", 2.0F);
firstRecord.put("doubleCol", 2.0D);
firstRecord.put("decimalCol", new BigDecimal("3.50"));
Contributor:

I think this should test all Parquet decimal storage types: int, long, fixed, and binary. Can you add tests for values other than long?

aokolnychyi (Contributor Author):

I've added a separate test for decimals stored as int/long/fixed. Any ideas on how to generate files where decimals are represented as binary? TestMetricsRowGroupFilterTypes also verifies only int/long/fixed.

Contributor:

The storage is determined by the decimal precision. Here's how we do it in other tests: https://github.com/apache/incubator-iceberg/blob/master/spark/src/test/java/com/netflix/iceberg/spark/data/AvroDataTest.java#L55-L57
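
For reference, a sketch of decimal fields whose precision forces each physical storage type (field IDs and names are made up for the example; the precision thresholds follow the Parquet spec, and the exact choice for large precisions depends on the writer):

```java
import com.netflix.iceberg.Schema;
import com.netflix.iceberg.types.Types;

import static com.netflix.iceberg.types.Types.NestedField.required;

public class DecimalStorageExample {
  // precision <= 9 fits in an INT32, precision <= 18 fits in an INT64,
  // anything larger needs FIXED_LEN_BYTE_ARRAY (or BINARY).
  static final Schema DECIMALS = new Schema(
      required(1, "dec_as_int", Types.DecimalType.of(9, 2)),
      required(2, "dec_as_long", Types.DecimalType.of(18, 2)),
      required(3, "dec_as_fixed", Types.DecimalType.of(38, 10)));
}
```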

rdblue merged commit 7169e21 into apache:master on Mar 18, 2019
rdblue (Contributor) commented Mar 18, 2019:

Looks good to me. Thanks @aokolnychyi!

rdblue pushed a commit to rdblue/iceberg that referenced this pull request Apr 10, 2019
rdblue pushed a commit to rdblue/iceberg that referenced this pull request May 14, 2019
Successfully merging this pull request closes the linked issue: ParquetMetrics computes wrong lower/upper bounds for small decimals (#125)