
Collect lower/upper bounds for nested struct fields in ParquetMetrics #136

Merged (2 commits) into apache:master on Mar 20, 2019

Conversation

aokolnychyi
Contributor

This PR enables collection of lower/upper bounds for nested struct fields in ParquetMetrics.

The test is pretty simple because TestParquetMetrics already has tests for map/list elements as well as for all supported data types.

This resolves #78.

Type currentType = schema.asStruct();

while (pathIterator.hasNext()) {
  if (currentType == null || !currentType.isStructType()) return false;

Style: control flow should always use { and }.
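As a rough standalone sketch of what the check in the diff is doing (this is not Iceberg's actual types API — the schema here is modeled as plain nested `Map`s, and `isStructPath` is a hypothetical name), the traversal walks each segment of a column path and bails out as soon as it would have to cross a non-struct level:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class NestedFieldCheck {
  // Schema modeled as nested Maps: a Map value stands in for a struct,
  // anything else (e.g. a Class token) stands in for a leaf, map, or list type.
  @SuppressWarnings("unchecked")
  static boolean isStructPath(Map<String, Object> schema, List<String> path) {
    Object currentType = schema;
    Iterator<String> pathIterator = path.iterator();
    while (pathIterator.hasNext()) {
      // mirror the PR's check: stop if the current level is not a struct
      if (!(currentType instanceof Map)) {
        return false;
      }
      currentType = ((Map<String, Object>) currentType).get(pathIterator.next());
      if (currentType == null) {
        return false; // unknown field name
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String, Object> location = Map.of("lat", Double.class, "long", Double.class);
    Map<String, Object> schema = Map.of("name", String.class, "location", location);
    System.out.println(isStructPath(schema, List.of("location", "lat"))); // true
    System.out.println(isStructPath(schema, List.of("name", "missing"))); // false
  }
}
```

This is why bounds can be collected for `location.lat` (struct-only nesting) but not for paths that pass through a map or list level.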


rdblue commented Mar 19, 2019

Looks great to me other than one style problem. @prodeezy has also been working in this area, so I'd like to hear what he thinks, too.

@aokolnychyi
Contributor Author

@prodeezy it would be great to do an end-to-end test with your work to see that everything works as expected.

prodeezy commented Mar 20, 2019

Thanks for this PR @aokolnychyi. I ran an end-to-end test with this patch applied on the latest code in master.

  1. Used a local Spark instance that includes the struct-filter pushdown feature and verified that the filter is pushed down to the data source:
scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat = 101.123 ").explain()
== Physical Plan ==
*(1) Project [age#0, name#1, friends#2, location#3]
+- *(1) Filter (isnotnull(location#3) && (location#3.lat = 101.123))
   +- *(1) ScanV2 iceberg[age#0, name#1, friends#2, location#3] (Filters: [isnotnull(location#3), (location#3.lat = 101.123)], Options: [path=iceberg-people-struct-metrics,paths=[]])

  2. Created Parquet data using these metrics and verified that the struct's leaf-field metrics are now stored:
avro-tools tojson iceberg-people-struct-metrics/metadata/e4f66767-8baa-4dee-8b3d-56a0c1d99464-m0.avro  | jq



    "lower_bounds": {
      "array": [
        {
          "key": 1,
          "value": "\u0013\u0000\u0000\u0000"
        },
        {
          "key": 2,
          "value": "Andy"
        },
        {
          "key": 7,
          "value": "\u001dZd;ßGY@"
        },
        {
          "key": 8,
          "value": " \u001a/Ý$�4@"
        }
      ]
    },
    "upper_bounds": {
      "array": [
        {
          "key": 1,
          "value": "\u001e\u0000\u0000\u0000"
        },
        {
          "key": 2,
          "value": "Michael"
        },
        {
          "key": 7,
          "value": "\u0012�ÀÊ¡ýe@"
        },
        {
          "key": 8,
          "value": "¶óýÔx)I@"
        }
      ]
    },
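The bound values in the dump above are binary single-value encodings keyed by field id (here key 1 and key 2 line up with the `age` and `name` columns). As a quick sanity check (my own decoding sketch, not part of the PR), the 4-byte int bounds decode little-endian to 19 and 30, and the 8-byte lower bound for key 7 decodes to roughly 101.123, which is consistent with the `location.lat` filter results shown below:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DecodeBounds {
  // int bounds are stored as 4-byte little-endian values
  static int decodeInt(byte[] bytes) {
    return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }

  // double bounds are stored as 8-byte little-endian IEEE 754 values
  static double decodeDouble(byte[] bytes) {
    return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getDouble();
  }

  public static void main(String[] args) {
    // "\u0013\u0000\u0000\u0000" and "\u001e\u0000\u0000\u0000" from the dump
    System.out.println(decodeInt(new byte[] {0x13, 0, 0, 0})); // 19
    System.out.println(decodeInt(new byte[] {0x1e, 0, 0, 0})); // 30
    // "\u001dZd;ßGY@" as raw bytes, i.e. the lower bound for key 7
    System.out.println(decodeDouble(
        new byte[] {0x1d, 0x5a, 0x64, 0x3b, (byte) 0xdf, 0x47, 0x59, 0x40})); // ≈ 101.123
  }
}
```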
  3. Applied my struct-filtering patch and ran filters on the above table:


scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat = 101.123 ").show()
+---+----+--------------------+-----------------+
|age|name|             friends|         location|
+---+----+--------------------+-----------------+
| 30|Andy|[Josh -> 10, Bisw...|[101.123, 50.324]|
+---+----+--------------------+-----------------+

scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat < 101.123 ").show()
+---+----+-------+--------+
|age|name|friends|location|
+---+----+-------+--------+
+---+----+-------+--------+

scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat > 200 ").show()
+---+----+-------+--------+
|age|name|friends|location|
+---+----+-------+--------+
+---+----+-------+--------+
  • Verified that struct filters don't fail and that expression evaluation in InclusiveMetricsEvaluator uses the lower/upper bounds (checked with debug breakpoints)
  • Verified that out-of-bounds cases for eq, lt, gt, etc. skip files/row groups appropriately
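The skip behavior above can be illustrated with a toy version of inclusive bounds evaluation (my own sketch, not InclusiveMetricsEvaluator's actual API): a file may contain a matching row only if the predicate is satisfiable somewhere in the file's [lower, upper] range. The values below use 101.123 as the observed lower bound for `location.lat` and a hypothetical upper bound of 174.9:

```java
public class BoundsEval {
  // A file *might* contain matching rows only if the predicate is
  // satisfiable within the file's inclusive [lower, upper] range.
  static boolean mightMatchEq(double lower, double upper, double v) {
    return v >= lower && v <= upper;
  }

  static boolean mightMatchLt(double lower, double upper, double v) {
    return lower < v; // some value strictly below v could exist in the file
  }

  static boolean mightMatchGt(double lower, double upper, double v) {
    return upper > v; // some value strictly above v could exist in the file
  }

  public static void main(String[] args) {
    double lower = 101.123;
    double upper = 174.9; // hypothetical upper bound for illustration
    System.out.println(mightMatchEq(lower, upper, 101.123)); // true: read the file
    System.out.println(mightMatchLt(lower, upper, 101.123)); // false: skip (empty result above)
    System.out.println(mightMatchGt(lower, upper, 200.0));   // false: skip (empty result above)
  }
}
```

Note the evaluation is inclusive: it can only prove a file contains no matches, never that it does, so a `true` result still requires reading the file.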

prodeezy commented Mar 20, 2019

Functionally, this patch works end to end along with the struct filter fix. Nice work!

@prodeezy left a comment


Minor nit about the comment on allowing struct nesting. LGTM otherwise.

@@ -105,6 +107,22 @@ public static Metrics fromMetadata(ParquetMetadata metadata) {
toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
}

// we allow struct nesting, but not maps or arrays

The comment could be a bit more descriptive about the fact that this check also precludes structs containing maps or arrays, and vice versa.


I think the wording here is okay. I'd rather merge now than wait for an update here. I'd be happy to merge a clarification PR though.

rdblue merged commit c383dd8 into apache:master on Mar 20, 2019

rdblue commented Mar 20, 2019

Merged. Thanks @aokolnychyi for fixing this and @prodeezy for the review!


dbtsai (Member) commented Apr 5, 2019

FYI @prodeezy, I have restarted the work on apache/spark#22573 and will try to have it merged by Spark 3.0.
