
Convert Spark In filter to iceberg IN Expression #749

Merged: 3 commits into apache:master on Feb 3, 2020

Conversation

jun-he (Collaborator) commented on Jan 26, 2020:

Issue #748
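
The gist of the change, as a rough sketch (assumed shape, not the PR's exact code): map Spark's `In` source filter onto Iceberg's `Expressions.in` predicate. The real conversion lives in Iceberg's `SparkFilters` helper and handles the other filter types too; the class and method names below are illustrative only.

```java
import java.util.Arrays;

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.spark.sql.sources.In;

public class InFilterConversion {
  // Simplified conversion: Spark's In filter exposes the column name and the
  // literal values; Expressions.in builds the matching Iceberg IN predicate.
  // Null handling and the other Spark filter types are omitted here.
  static Expression convertIn(In filter) {
    return Expressions.in(filter.attribute(), Arrays.asList(filter.values()));
  }
}
```

With this in place, a Spark predicate like `WHERE id IN (1, 2, 3)` can be pushed down to Iceberg for file pruning instead of being applied row by row after the scan.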

jun-he requested a review from rdblue on January 26, 2020.
jun-he (Collaborator, Author) commented on Jan 26, 2020:

@rdblue, can you help review this? Thanks!

@@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, PartitionSpec spec, String udf,

   private List<Record> testRecords(org.apache.avro.Schema avroSchema) {
     return Lists.newArrayList(
-        record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), "junction"),
+        record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), "junction"),
rdblue (Contributor) commented:

Why was it necessary to change these values? This doesn't change the hour partitions the values are stored in.

jun-he (Collaborator, Author) replied:

@rdblue It is because the java.sql.Timestamp constructor takes a milliseconds time value. There is a deprecated java.sql.Timestamp constructor that takes year, month, date, hour, minute, second, and nanos, but then we would also need to take care of the timezone issue (a Java Timestamp is always UTC).

So, to avoid the deprecated method and keep the test straightforward, I just updated the two records to millisecond precision.
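
For context, a minimal JDK-only sketch (not from the PR) of the precision gap being described: the long-based constructor carries milliseconds since the epoch, while Timestamp.valueOf parses fractional seconds down to nanoseconds but interprets the string in the JVM's local zone.

```java
import java.sql.Timestamp;

public class TimestampPrecision {
  public static void main(String[] args) {
    // Milliseconds since the epoch (UTC): 2017-12-22T09:20:44.294Z.
    // Sub-millisecond digits like the ".294658" above cannot be expressed.
    Timestamp millis = new Timestamp(1513934444294L);
    System.out.println(millis.getNanos()); // 294000000 (millisecond precision)

    // valueOf keeps the microseconds, but parses the string in the JVM's
    // default time zone, which is the timezone issue mentioned above.
    Timestamp micros = Timestamp.valueOf("2017-12-22 09:20:44.294658");
    System.out.println(micros.getNanos()); // 294658000 (microseconds preserved)
  }
}
```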

rdblue (Contributor) replied:

These don't use the constructor that only supports milliseconds, so these should be precise to microseconds. But the test only uses Timestamp to create a filter that gets converted to a partition filter, so the timestamps used to create the Spark filter and these timestamps shouldn't need to match. Doesn't the test pass if these are unchanged?

jun-he (Collaborator, Author) commented on Feb 3, 2020:

The test fails because the partitions picked in the tests have only one value each (the lower and upper bounds are equal), so the Timestamp must match exactly.

To avoid changing those values, I will update the test to use the 2017-12-21T15 partition, which contains two records, so any Timestamp between them will match.
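
To illustrate why (a hypothetical helper, not Iceberg's actual pruning code): a file can match a timestamp predicate only when the value falls within the file's [lower, upper] column bounds, so a single-record file demands an exact hit, while a two-record file accepts anything in between.

```java
public class BoundsPruningSketch {
  // Stand-in for min/max pruning: a data file may match an equality predicate
  // only if the value lies within its [lower, upper] bounds (micros since epoch).
  static boolean mayContain(long lowerMicros, long upperMicros, long valueMicros) {
    return lowerMicros <= valueMicros && valueMicros <= upperMicros;
  }

  public static void main(String[] args) {
    long t = 1513870200000000L; // 2017-12-21T15:30:00Z in microseconds

    // One record: lower == upper, so anything but an exact match is pruned out.
    System.out.println(mayContain(t, t, t + 1));                // false

    // Two records a minute apart: any timestamp between them still matches.
    System.out.println(mayContain(t, t + 60_000_000L, t + 1));  // true
  }
}
```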

rdblue (Contributor) replied:

Thanks, @jun-he! I think that's a better solution to the problem.

rdblue merged commit 905637d into apache:master on Feb 3, 2020.
jun-he deleted the jun/in-SparkFilter branch on February 4, 2020.