
Convert Spark In filter to iceberg IN Expression #749

Merged: 3 commits into apache:master on Feb 3, 2020

Conversation

jun-he (Collaborator) commented on Jan 26, 2020:

Issue #748
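
The gist of the change, as a rough sketch (assumed shape, not the PR's exact code): map Spark's `In` source filter onto Iceberg's `Expressions.in` predicate. The real conversion lives in Iceberg's `SparkFilters` helper and handles the other filter types too; the class and method names below are illustrative only.

```java
import java.util.Arrays;

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.spark.sql.sources.In;

public class InFilterConversion {
  // Simplified conversion: Spark's In filter exposes the column name and the
  // literal values; Expressions.in builds the matching Iceberg IN predicate.
  // Null handling and the other Spark filter types are omitted here.
  static Expression convertIn(In filter) {
    return Expressions.in(filter.attribute(), Arrays.asList(filter.values()));
  }
}
```

With this in place, a Spark predicate like `WHERE id IN (1, 2, 3)` can be pushed down to Iceberg for file pruning instead of being applied row by row after the scan.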

jun-he requested a review from rdblue on January 26, 2020.
jun-he (Collaborator, Author) commented on Jan 26, 2020:

@rdblue, can you help review this? Thanks!

@@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, PartitionSpec spec, String udf,

   private List<Record> testRecords(org.apache.avro.Schema avroSchema) {
     return Lists.newArrayList(
-        record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), "junction"),
+        record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), "junction"),
rdblue (Contributor) commented:

Why was it necessary to change these values? This doesn't change the hour partitions the values are stored in.

jun-he (Collaborator, Author) replied:

@rdblue It is because the java.sql.Timestamp constructor takes a milliseconds time value. There is a deprecated java.sql.Timestamp constructor that takes year, month, date, hour, minute, second, and nanos, but then we would also need to take care of the timezone issue (a Java Timestamp is always UTC).

So, to avoid the deprecated method and keep the test straightforward, I just updated the two records to millisecond precision.
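
For context, a minimal JDK-only sketch (not from the PR) of the precision gap being described: the long-based constructor carries milliseconds since the epoch, while Timestamp.valueOf parses fractional seconds down to nanoseconds but interprets the string in the JVM's local zone.

```java
import java.sql.Timestamp;

public class TimestampPrecision {
  public static void main(String[] args) {
    // Milliseconds since the epoch (UTC): 2017-12-22T09:20:44.294Z.
    // Sub-millisecond digits like the ".294658" above cannot be expressed.
    Timestamp millis = new Timestamp(1513934444294L);
    System.out.println(millis.getNanos()); // 294000000 (millisecond precision)

    // valueOf keeps the microseconds, but parses the string in the JVM's
    // default time zone, which is the timezone issue mentioned above.
    Timestamp micros = Timestamp.valueOf("2017-12-22 09:20:44.294658");
    System.out.println(micros.getNanos()); // 294658000 (microseconds preserved)
  }
}
```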

rdblue (Contributor) replied:

These don't use the constructor that only supports milliseconds, so these should be precise to microseconds. But the test only uses Timestamp to create a filter that gets converted to a partition filter, so the timestamps used to create the Spark filter and these timestamps shouldn't need to match. Doesn't the test pass if these are unchanged?

jun-he (Collaborator, Author) commented on Feb 3, 2020:

The test fails because the partitions picked in the tests have only one value each (the lower and upper bounds are equal), so the Timestamp must match exactly.

To avoid changing those values, I will update the test to use the 2017-12-21T15 partition, which contains two records, so any Timestamp between them will match.
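
To illustrate why (a hypothetical helper, not Iceberg's actual pruning code): a file can match a timestamp predicate only when the value falls within the file's [lower, upper] column bounds, so a single-record file demands an exact hit, while a two-record file accepts anything in between.

```java
public class BoundsPruningSketch {
  // Stand-in for min/max pruning: a data file may match an equality predicate
  // only if the value lies within its [lower, upper] bounds (micros since epoch).
  static boolean mayContain(long lowerMicros, long upperMicros, long valueMicros) {
    return lowerMicros <= valueMicros && valueMicros <= upperMicros;
  }

  public static void main(String[] args) {
    long t = 1513870200000000L; // 2017-12-21T15:30:00Z in microseconds

    // One record: lower == upper, so anything but an exact match is pruned out.
    System.out.println(mayContain(t, t, t + 1));                // false

    // Two records a minute apart: any timestamp between them still matches.
    System.out.println(mayContain(t, t + 60_000_000L, t + 1));  // true
  }
}
```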

rdblue (Contributor) replied:

Thanks, @jun-he! I think that's a better solution to the problem.

rdblue merged commit 905637d into apache:master on Feb 3, 2020.
jun-he deleted the jun/in-SparkFilter branch on February 4, 2020.