[SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date #20621
Conversation
Test build #87485 has finished for PR 20621 at commit

Test build #87488 has finished for PR 20621 at commit
```scala
val dateTry = Try {
  // try and parse the date; if no exception occurs, this is a candidate to be resolved as
  // DateType
  DateTimeUtils.getThreadLocalDateFormat.parse(raw)
}
```
Ah, so the root cause is more specific to `SimpleDateFormat`, because it allows invalid dates like `2018-01-01-04` to be parsed fine.
Actually, the `parse` method of all `DateFormat` implementations allows extra characters after a valid date: https://docs.oracle.com/javase/7/docs/api/java/text/DateFormat.html#parse(java.lang.String)
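The documented behavior is easy to see with a plain `SimpleDateFormat`, outside Spark entirely (a standalone sketch; the object name is just for the demo):

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

object LenientParseDemo {
  def main(args: Array[String]): Unit = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    // DateFormat.parse(String) only consumes the prefix it can match,
    // so the trailing "AAA" is silently ignored instead of raising a
    // ParseException
    val d = fmt.parse("2018-02-15AAA")
    println(fmt.format(d)) // prints 2018-02-15
  }
}
```

This is why a parse-only check cannot reject values like `2018-02-15AAA`.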
Is this a short-cut? It seems OK to always go to the cast path.
I don't think it is enough to always go with the cast path, since it allows many formats/strings not allowed by the parse method. Thus I think it is not safe to avoid the parse method.
```scala
// We need to check that we can cast the raw string since we later can use Cast to get
// the partition values with the right DataType (see
// org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
val dateOption = Option(Cast(Literal(raw), DateType).eval())
```
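To illustrate why both checks are combined, here is a minimal plain-Scala analogue of the guard (hypothetical helper names; `java.sql.Date.valueOf` merely stands in for Spark's `Cast` to `DateType`, on the assumption that both reject strings that are not fully valid dates):

```scala
import java.text.SimpleDateFormat
import scala.util.Try

object DateInference {
  private val fmt = new SimpleDateFormat("yyyy-MM-dd")

  // Stand-in for Option(Cast(Literal(raw), DateType).eval()): in Spark the
  // cast evaluates to null (here None) when the whole string is not a valid
  // date; java.sql.Date.valueOf throws on trailing garbage.
  private def castToDate(raw: String): Option[java.sql.Date] =
    Try(java.sql.Date.valueOf(raw)).toOption

  // The fix: infer DateType only when BOTH the format parse succeeds
  // AND the cast of the full string yields a non-null value.
  def inferAsDate(raw: String): Boolean =
    Try(fmt.parse(raw)).isSuccess && castToDate(raw).isDefined

  def main(args: Array[String]): Unit = {
    println(inferAsDate("2018-02-15"))    // true
    println(inferAsDate("2018-02-15AAA")) // false: parse succeeds, cast fails
  }
}
```

The parse keeps inference as narrow as before, while the cast check guarantees the later `Cast` in `inferPartitioning` cannot produce `null`.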
Can we add `require(dateOption.isDefined)` with some explicit comments?
Sure, but aren't these comments enough? Could you please provide some suggestions about how you would like to improve them, i.e. what is missing or unclear? Thanks.
I mean simply something like:

```scala
// Disallow date type if the cast returned null blah blah
require(dateOption.isDefined)
```

Nothing special. I am fine with not adding it, too.
Test build #87508 has finished for PR 20621 at commit
Is this a blocker-level regression? When did we introduce it?

It sounds like Spark 2.2 already has this bug, but Spark 2.1 is still fine. This causes an incorrect result.

@gatorsmile thanks for checking. Yes, Spark 2.2 is affected too, so I am not sure whether this should be considered a blocker regression. But I think we should fix it as soon as possible nonetheless.
```scala
data.write.partitionBy("date_month", "date_hour").parquet(path.getAbsolutePath)
val input = spark.read.parquet(path.getAbsolutePath)
checkAnswer(input.select("id", "date_month", "date_hour", "data"),
```
Shall we also check the schema to make sure that field is of string type?
I think it is not necessary, but I will add this check, thanks.
```scala
val unescapedRaw = unescapePathName(raw)
// try and parse the date; if no exception occurs, this is a candidate to be resolved as
// TimestampType
DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
```
Can we save the parsing here? I think casting a string to `TimestampType` will also check whether it can be parsed as a timestamp.
I don't think so, because the cast tolerates various timestamp formats which `parse` doesn't support (please check the comment on `DateTimeUtils.stringToTimestamp`). So I would not consider it safe to remove this; in any case, it may introduce unintended behavior changes.
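For instance, `stringToTimestamp` tolerates an ISO-style `'T'` separator that the thread-local format rejects, so a cast-only check would widen what gets inferred as `TimestampType`. A standalone sketch (the pattern string `yyyy-MM-dd HH:mm:ss.S` is an assumption about what the thread-local format uses):

```scala
import java.text.SimpleDateFormat
import scala.util.Try

object TimestampFormatDemo {
  def main(args: Array[String]): Unit = {
    // Assumed to mirror the thread-local timestamp pattern
    val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S")
    // The 'T' separator fails the format parse: the literal space in the
    // pattern must match exactly, even in lenient mode
    println(Try(fmt.parse("2018-02-15T10:00:00")).isSuccess)   // false
    println(Try(fmt.parse("2018-02-15 10:00:00.0")).isSuccess) // true
  }
}
```

Replacing the parse with a cast would therefore change which partition values are inferred as timestamps, not just reject the invalid ones.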
Since this changes the behavior of `PartitioningUtils.parsePartitions`, doesn't it change the result of another path in `inferPartitioning`?
Sorry, I am not sure I fully understood your question; could you elaborate a bit more, please?
`inferPartitioning` will use `PartitioningUtils.parsePartitions` to infer the partition spec if there is no `userPartitionSchema`. It is used by `DataSource.sourceSchema`. It seems this change makes partition directories that were previously parseable no longer parseable. Will it change the behavior of other code paths?
Yes, you are right. The only change introduced is that some values which were previously wrongly inferred as dates will now be inferred as strings. Everything else works as before.
Test build #87526 has finished for PR 20621 at commit
Thanks, merging to master!

Thanks @cloud-fan. Sorry, since this seems a bug to me, why wasn't this backported to branch-2.3 too? Thanks.

It's not a very serious bug; I'd like to hold it until 2.3 is released. We may have it in 2.3.1.
…o Date

Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#20621 from mgaido91/SPARK-23436.
…an be casted to Date

This PR is to backport #20621 to branch-2.3.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #20764 from gatorsmile/backport23436.
What changes were proposed in this pull request?

Before the patch, Spark could infer as Date a partition value which cannot be cast to Date (this can happen when there are extra characters after a valid date, like `2018-02-15AAA`). When this happens and the input format has metadata which defines the schema of the table, then `null` is returned as the value for the partition column, because the `cast` operator used in `PartitioningAwareFileIndex.inferPartitioning` is unable to convert the value. The PR checks in the partition inference that values can be cast to Date and Timestamp before inferring those datatypes for them.

How was this patch tested?

added UT