
Fix collection of bounds for small decimals in ParquetMetrics #131

Merged: 2 commits into apache:master on Mar 18, 2019

Conversation

aokolnychyi (Contributor):

This PR resolves #125.

ParquetMetrics uses ParquetConversions$fromParquetPrimitive, which assumes that decimals are always represented as binary in Parquet. That assumption does not hold: the Parquet spec also allows decimals to be stored as INT32, INT64, or FIXED_LEN_BYTE_ARRAY, depending on precision.

As a consequence, Iceberg might collect invalid lower/upper bounds, which can lead to skipping the wrong files. See the issue description for an example.

import static com.netflix.iceberg.Files.localInput;
import static com.netflix.iceberg.Files.localOutput;
import static com.netflix.iceberg.types.Conversions.fromByteBuffer;
import static com.netflix.iceberg.types.Types.*;
aokolnychyi (Contributor Author):
I usually prefer to avoid wildcard imports, but we really do use every data type within these tests, so it seems reasonable here.

Contributor:

I'm okay with this in tests, but I don't generally mind large import blocks because they are maintained by the IDE and are good for context. I'd prefer expanding this but it's up to you.

aokolnychyi (Contributor Author):

I don't have a strong opinion here. I'll expand the imports.

@@ -86,9 +86,9 @@ public static Metrics fromMetadata(ParquetMetadata metadata) {
       Types.NestedField field = fileSchema.asStruct().field(fieldId);
       if (field != null && stats.hasNonNullValue()) {
         updateMin(lowerBounds, fieldId,
-            fromParquetPrimitive(field.type(), stats.genericGetMin()));
+            fromParquetPrimitive(field.type(), column.getPrimitiveType(), stats.genericGetMin()));
Contributor:

Nit: continuation lines should be indented 4 spaces from the start of the statement.

aokolnychyi (Contributor Author):

It would no longer fit into 100 characters per line. Do we even have a requirement to fit within 100 characters per line? I've seen a couple of places where this is not respected.

Contributor:

That's probably a mistake then. Thanks for fixing this.

rdblue (Contributor) commented Mar 15, 2019:

@aokolnychyi, can you comment on how this fixes the problem? Why doesn't the previous call, for example Literals.of((Long) value).to(DecimalType.of(9, 2)), work?

aokolnychyi (Contributor Author):
@rdblue Let's assume we have 3.50 as Decimal(10, 2), which is represented as INT64 in Parquet.

file schema: table 
--------------------------------------------------------------------------------
decimalCol:  OPTIONAL INT64 O:DECIMAL R:0 D:1

row group 1: RC:1 TS:75 OFFSET:4 
--------------------------------------------------------------------------------
decimalCol:   INT64 GZIP DO:0 FPO:4 SZ:91/75/0.82 VC:1 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 3.50, max: 3.50, num_nulls: 0]

Once we read the footer and call stats.genericGetMin() for this decimal column, we will get 350. This value must be properly scaled (handled by converterFromParquet) before creating a literal.

Right now, Literals.of(350L).to(DecimalType.of(10, 2)) will give us 350.00 instead of 3.50.
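
For illustration, a minimal sketch of the scaling issue using only the JDK (the class and variable names are made up for the example; this is not the actual ParquetConversions code):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalScalingExample {
  public static void main(String[] args) {
    // Parquet stores a decimal as its unscaled value: 3.50 with scale 2 is written as 350 (INT64 here).
    long unscaledMin = 350L;
    int scale = 2;

    // Treating 350 as the decimal value itself and only adjusting the scale afterwards yields 350.00.
    BigDecimal wrong = new BigDecimal(unscaledMin).setScale(scale);

    // Applying the column's scale to the unscaled value yields the intended 3.50.
    BigDecimal right = new BigDecimal(BigInteger.valueOf(unscaledMin), scale);

    System.out.println(wrong);  // 350.00
    System.out.println(right);  // 3.50
  }
}
```

Passing column.getPrimitiveType() into fromParquetPrimitive gives the converter access to the column's decimal metadata, so the unscaled value can be converted correctly before the literal is created.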

checkFieldMetrics(1, schema, metrics, 2, 2, null, null);
}

private <T> void checkFieldMetrics(int fieldId, Schema schema, Metrics metrics,
Contributor:

I think these tests would be more readable if this were broken into two: assertCounts and assertBounds. That way it is clear when calling them that the metrics match the value and null counts, without looking at the implementation. Similarly, it would be clear that the bounds are equal to the given values.

That also allows you to avoid passing in so many arguments. The schema becomes unnecessary because you pass in the field or field type and field ID. As it is now, this gets the field by ID and then accesses its field ID, which is awkward.
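
A sketch of what the two helpers could look like, assuming JUnit's Assert, java.nio.ByteBuffer, the Metrics accessors, and the statically imported Conversions.fromByteBuffer shown above (the exact signatures are illustrative, not necessarily what ended up in the PR):

```java
private void assertCounts(int fieldId, long valueCount, long nullValueCount, Metrics metrics) {
  // Value and null counts are keyed by field ID in the collected metrics.
  Assert.assertEquals(valueCount, (long) metrics.valueCounts().get(fieldId));
  Assert.assertEquals(nullValueCount, (long) metrics.nullValueCounts().get(fieldId));
}

private <T> void assertBounds(int fieldId, Type type, T lowerBound, T upperBound, Metrics metrics) {
  // Bounds are stored as serialized ByteBuffers and may be absent for a column.
  ByteBuffer lower = metrics.lowerBounds().get(fieldId);
  ByteBuffer upper = metrics.upperBounds().get(fieldId);
  Assert.assertEquals(lowerBound, lower == null ? null : fromByteBuffer(type, lower));
  Assert.assertEquals(upperBound, upper == null ? null : fromByteBuffer(type, upper));
}
```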

import static com.netflix.iceberg.types.Types.NestedField.optional;
import static com.netflix.iceberg.types.Types.NestedField.required;

public class TestParquetMetrics {
Contributor:

A test case that is missing is what happens when there are multiple row groups in the file that are merged. We don't have to fix that in this PR, but it would be nice to have a test for it eventually.

aokolnychyi (Contributor Author):

I've created #132 so that we don't forget.

firstRecord.put("longCol", 5L);
firstRecord.put("floatCol", 2.0F);
firstRecord.put("doubleCol", 2.0D);
firstRecord.put("decimalCol", new BigDecimal("3.50"));
Contributor:

I think this should test all Parquet decimal storage types: int, long, fixed, and binary. Can you add tests for values other than long?

aokolnychyi (Contributor Author):

I've added a separate test for decimals stored as int/long/fixed. Any ideas on how to generate files where decimals are represented as binary? TestMetricsRowGroupFilterTypes also verifies only int/long/fixed.

Contributor:

The storage is determined by the decimal precision. Here's how we do it in other tests: https://github.com/apache/incubator-iceberg/blob/master/spark/src/test/java/com/netflix/iceberg/spark/data/AvroDataTest.java#L55-L57
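
For reference, a sketch of decimal fields whose precision forces each physical storage type (field IDs and names are made up for the example; the precision thresholds follow the Parquet spec, and the exact choice for large precisions depends on the writer):

```java
import com.netflix.iceberg.Schema;
import com.netflix.iceberg.types.Types;

import static com.netflix.iceberg.types.Types.NestedField.required;

public class DecimalStorageExample {
  // precision <= 9 fits in an INT32, precision <= 18 fits in an INT64,
  // anything larger needs FIXED_LEN_BYTE_ARRAY (or BINARY).
  static final Schema DECIMALS = new Schema(
      required(1, "dec_as_int", Types.DecimalType.of(9, 2)),
      required(2, "dec_as_long", Types.DecimalType.of(18, 2)),
      required(3, "dec_as_fixed", Types.DecimalType.of(38, 10)));
}
```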

rdblue merged commit 7169e21 into apache:master on Mar 18, 2019
rdblue (Contributor) commented Mar 18, 2019:

Looks good to me. Thanks @aokolnychyi!

rdblue pushed a commit to rdblue/iceberg that referenced this pull request Apr 10, 2019
rdblue pushed a commit to rdblue/iceberg that referenced this pull request May 14, 2019
Successfully merging this pull request closes the linked issue: ParquetMetrics computes wrong lower/upper bounds for small decimals (#125)