
Collect lower/upper bounds for nested struct fields in ParquetMetrics #136

Merged (2 commits) into apache:master on Mar 20, 2019

Conversation

aokolnychyi
Contributor

This PR enables collection of lower/upper bounds for nested struct fields in ParquetMetrics.

The test is pretty simple because TestParquetMetrics already has tests for map/list elements as well as for all supported data types.

This resolves #78.

Type currentType = schema.asStruct();

while (pathIterator.hasNext()) {
  if (currentType == null || !currentType.isStructType()) return false;

Style: control flow should always use { and }.
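As a rough standalone sketch of what the check in the diff is doing (this is not Iceberg's actual types API — the schema here is modeled as plain nested `Map`s, and `isStructPath` is a hypothetical name), the traversal walks each segment of a column path and bails out as soon as it would have to cross a non-struct level:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class NestedFieldCheck {
  // Schema modeled as nested Maps: a Map value stands in for a struct,
  // anything else (e.g. a Class token) stands in for a leaf, map, or list type.
  @SuppressWarnings("unchecked")
  static boolean isStructPath(Map<String, Object> schema, List<String> path) {
    Object currentType = schema;
    Iterator<String> pathIterator = path.iterator();
    while (pathIterator.hasNext()) {
      // mirror the PR's check: stop if the current level is not a struct
      if (!(currentType instanceof Map)) {
        return false;
      }
      currentType = ((Map<String, Object>) currentType).get(pathIterator.next());
      if (currentType == null) {
        return false; // unknown field name
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String, Object> location = Map.of("lat", Double.class, "long", Double.class);
    Map<String, Object> schema = Map.of("name", String.class, "location", location);
    System.out.println(isStructPath(schema, List.of("location", "lat"))); // true
    System.out.println(isStructPath(schema, List.of("name", "missing"))); // false
  }
}
```

This is why bounds can be collected for `location.lat` (struct-only nesting) but not for paths that pass through a map or list level.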


rdblue commented Mar 19, 2019

Looks great to me other than one style problem. @prodeezy has also been working in this area, so I'd like to hear what he thinks, too.

@aokolnychyi
Contributor Author

@prodeezy it would be great to do an end-to-end test with your work to see that everything works as expected.

prodeezy commented Mar 20, 2019

Thanks for this PR @aokolnychyi. I ran an end-to-end test with this patch applied on the latest code in master.

  1. Used a local Spark instance that includes the struct-filter pushdown feature and verified that the filter is pushed down to the data source:
scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat = 101.123 ").explain()
== Physical Plan ==
*(1) Project [age#0, name#1, friends#2, location#3]
+- *(1) Filter (isnotnull(location#3) && (location#3.lat = 101.123))
   +- *(1) ScanV2 iceberg[age#0, name#1, friends#2, location#3] (Filters: [isnotnull(location#3), (location#3.lat = 101.123)], Options: [path=iceberg-people-struct-metrics,paths=[]])

  2. Created Parquet data using these metrics and verified that the struct's leaf-field metrics are now stored:
avro-tools tojson iceberg-people-struct-metrics/metadata/e4f66767-8baa-4dee-8b3d-56a0c1d99464-m0.avro  | jq



    "lower_bounds": {
      "array": [
        {
          "key": 1,
          "value": "\u0013\u0000\u0000\u0000"
        },
        {
          "key": 2,
          "value": "Andy"
        },
        {
          "key": 7,
          "value": "\u001dZd;ßGY@"
        },
        {
          "key": 8,
          "value": " \u001a/Ý$�4@"
        }
      ]
    },
    "upper_bounds": {
      "array": [
        {
          "key": 1,
          "value": "\u001e\u0000\u0000\u0000"
        },
        {
          "key": 2,
          "value": "Michael"
        },
        {
          "key": 7,
          "value": "\u0012�ÀÊ¡ýe@"
        },
        {
          "key": 8,
          "value": "¶óýÔx)I@"
        }
      ]
    },
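The bound values in the dump above are binary single-value encodings keyed by field id (here key 1 and key 2 line up with the `age` and `name` columns). As a quick sanity check (my own decoding sketch, not part of the PR), the 4-byte int bounds decode little-endian to 19 and 30, and the 8-byte lower bound for key 7 decodes to roughly 101.123, which is consistent with the `location.lat` filter results shown below:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DecodeBounds {
  // int bounds are stored as 4-byte little-endian values
  static int decodeInt(byte[] bytes) {
    return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }

  // double bounds are stored as 8-byte little-endian IEEE 754 values
  static double decodeDouble(byte[] bytes) {
    return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getDouble();
  }

  public static void main(String[] args) {
    // "\u0013\u0000\u0000\u0000" and "\u001e\u0000\u0000\u0000" from the dump
    System.out.println(decodeInt(new byte[] {0x13, 0, 0, 0})); // 19
    System.out.println(decodeInt(new byte[] {0x1e, 0, 0, 0})); // 30
    // "\u001dZd;ßGY@" as raw bytes, i.e. the lower bound for key 7
    System.out.println(decodeDouble(
        new byte[] {0x1d, 0x5a, 0x64, 0x3b, (byte) 0xdf, 0x47, 0x59, 0x40})); // ≈ 101.123
  }
}
```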
  3. Applied my struct-filtering patch and ran filters on the above table:


scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat = 101.123 ").show()
+---+----+--------------------+-----------------+
|age|name|             friends|         location|
+---+----+--------------------+-----------------+
| 30|Andy|[Josh -> 10, Bisw...|[101.123, 50.324]|
+---+----+--------------------+-----------------+

scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat < 101.123 ").show()
+---+----+-------+--------+
|age|name|friends|location|
+---+----+-------+--------+
+---+----+-------+--------+

scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat > 200 ").show()
+---+----+-------+--------+
|age|name|friends|location|
+---+----+-------+--------+
+---+----+-------+--------+
  • Verified that struct filters don't fail and that expression evaluation in InclusiveMetricsEvaluator uses the lower/upper bounds (checked with debug breakpoints)
  • Verified that out-of-bounds cases for eq, lt, gt, etc. skip files/row groups appropriately
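The skip behavior above can be illustrated with a toy version of inclusive bounds evaluation (my own sketch, not InclusiveMetricsEvaluator's actual API): a file may contain a matching row only if the predicate is satisfiable somewhere in the file's [lower, upper] range. The values below use 101.123 as the observed lower bound for `location.lat` and a hypothetical upper bound of 174.9:

```java
public class BoundsEval {
  // A file *might* contain matching rows only if the predicate is
  // satisfiable within the file's inclusive [lower, upper] range.
  static boolean mightMatchEq(double lower, double upper, double v) {
    return v >= lower && v <= upper;
  }

  static boolean mightMatchLt(double lower, double upper, double v) {
    return lower < v; // some value strictly below v could exist in the file
  }

  static boolean mightMatchGt(double lower, double upper, double v) {
    return upper > v; // some value strictly above v could exist in the file
  }

  public static void main(String[] args) {
    double lower = 101.123;
    double upper = 174.9; // hypothetical upper bound for illustration
    System.out.println(mightMatchEq(lower, upper, 101.123)); // true: read the file
    System.out.println(mightMatchLt(lower, upper, 101.123)); // false: skip (empty result above)
    System.out.println(mightMatchGt(lower, upper, 200.0));   // false: skip (empty result above)
  }
}
```

Note the evaluation is inclusive: it can only prove a file contains no matches, never that it does, so a `true` result still requires reading the file.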

prodeezy commented Mar 20, 2019

Functionally, this patch works end to end along with the struct filter fix. Nice work!

@prodeezy left a comment


Minor nit about the comment on allowing struct nesting. LGTM otherwise.

@@ -105,6 +107,22 @@ public static Metrics fromMetadata(ParquetMetadata metadata) {
toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
}

// we allow struct nesting, but not maps or arrays

The comment could be a bit more descriptive about the fact that this check also precludes structs containing maps or arrays, and vice versa.


I think the wording here is okay. I'd rather merge now than wait for an update here. I'd be happy to merge a clarification PR though.

rdblue merged commit c383dd8 into apache:master on Mar 20, 2019

rdblue commented Mar 20, 2019

Merged. Thanks @aokolnychyi for fixing this and @prodeezy for the review!


dbtsai (Member) commented Apr 5, 2019

FYI @prodeezy, I have restarted the work on apache/spark#22573 and will try to have it merged by Spark 3.0.
