
[SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date #20621

Closed
wants to merge 5 commits

Conversation

mgaido91
Contributor

What changes were proposed in this pull request?

Before this patch, Spark could infer DateType for a partition value that cannot actually be cast to Date (this can happen when there are extra characters after a valid date, like `2018-02-15AAA`).

When this happens and the input format has metadata that defines the schema of the table, `null` is returned as the value for the partition column, because the `Cast` operator used in `PartitioningAwareFileIndex.inferPartitioning` is unable to convert the value.

The PR makes partition inference check that values can actually be cast to Date or Timestamp before inferring those data types for them.
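For illustration, here is a minimal, self-contained sketch of the idea; `inferDate` is a hypothetical helper, not the patch itself, and `java.sql.Date.valueOf` stands in for Spark's `Cast`:

```scala
import java.sql.Date
import java.text.SimpleDateFormat
import scala.util.Try

// Hypothetical standalone helper (not the patch code): a raw partition value
// becomes a DateType candidate only if it both parses with the date format
// and survives a strict conversion, standing in for Cast(Literal(raw), DateType).
def inferDate(raw: String): Option[Date] = {
  val format = new SimpleDateFormat("yyyy-MM-dd")
  for {
    _ <- Try(format.parse(raw)).toOption // lenient: accepts trailing garbage
    d <- Try(Date.valueOf(raw)).toOption // strict: rejects "2018-02-15AAA"
  } yield d
}

// inferDate("2018-02-15")    => Some(2018-02-15)
// inferDate("2018-02-15AAA") => None (previously mis-inferred as a date)
```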

How was this patch tested?

added UT

@mgaido91
Contributor Author

cc @cloud-fan @HyukjinKwon @viirya

@SparkQA

SparkQA commented Feb 15, 2018

Test build #87485 has finished for PR 20621 at commit 2f05ab8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 15, 2018

Test build #87488 has finished for PR 20621 at commit 6b56408.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val dateTry = Try {
  // try and parse the date, if no exception occurs this is a candidate to be resolved as
  // DateType
  DateTimeUtils.getThreadLocalDateFormat.parse(raw)
Member

Ah, so the root cause is specific to SimpleDateFormat, because it allows invalid dates like `2018-01-01-04` to be parsed fine ..

Contributor Author

Actually, every `DateFormat.parse` allows extra characters after a valid date: https://docs.oracle.com/javase/7/docs/api/java/text/DateFormat.html#parse(java.lang.String).
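To make that contract concrete, a tiny sketch using only the JDK API and the examples from this thread:

```scala
import java.text.SimpleDateFormat

// DateFormat.parse(String) only requires that a prefix of the input parses;
// trailing characters are silently ignored rather than rejected.
val fmt = new SimpleDateFormat("yyyy-MM-dd")
fmt.parse("2018-02-15AAA") // succeeds: returns Feb 15, 2018
fmt.parse("2018-01-01-04") // also succeeds: the "-04" suffix is ignored
```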

Contributor

Is this a shortcut? It seems OK to always go through the cast path.

Contributor Author

I don't think it is enough to always go with the cast path, since it allows many formats/strings that the parse method does not. Thus I don't think it is safe to skip the parse method.
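For example, assuming the Spark 2.x `DateTimeUtils` API (a sketch, not part of the patch), the cast path accepts partial dates that the `yyyy-MM-dd` format rejects:

```scala
import java.text.SimpleDateFormat
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.unsafe.types.UTF8String

// The cast's underlying conversion accepts shortened date forms...
DateTimeUtils.stringToDate(UTF8String.fromString("2018"))    // Some(...)
DateTimeUtils.stringToDate(UTF8String.fromString("2018-02")) // Some(...)

// ...while the strict "yyyy-MM-dd" format rejects them outright:
new SimpleDateFormat("yyyy-MM-dd").parse("2018")             // throws ParseException
```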

// We need to check that we can cast the raw string since we later can use Cast to get
// the partition values with the right DataType (see
// org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
val dateOption = Option(Cast(Literal(raw), DateType).eval())
Member

Can we add `require(dateOption.isDefined)` with an explicit comment?

Contributor Author

Sure, but aren't these comments enough? Could you suggest how you would like to improve them, i.e., what is missing or unclear? Thanks.

Member

I mean, simply something like:

// Disallow DateType if the cast returned null
require(dateOption.isDefined)

Nothing special. I am fine with not adding it, too.

@SparkQA

SparkQA commented Feb 16, 2018

Test build #87508 has finished for PR 20621 at commit 6274537.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Feb 16, 2018

Is this a blocker-level regression? When did we introduce it?

@gatorsmile
Member

gatorsmile commented Feb 16, 2018

It sounds like Spark 2.2 already has this bug, but Spark 2.1 is still fine. This causes an incorrect result.

@mgaido91
Contributor Author

@gatorsmile thanks for checking. Yes, Spark 2.2 is affected too, so I am not sure whether this should be considered a blocker regression. Nonetheless, I think we should fix it as soon as possible.


data.write.partitionBy("date_month", "date_hour").parquet(path.getAbsolutePath)
val input = spark.read.parquet(path.getAbsolutePath)
checkAnswer(input.select("id", "date_month", "date_hour", "data"),
Contributor

Shall we also check the schema to make sure that field is of string type?

Contributor Author

I don't think it is necessary, but I will add this check, thanks.
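Presumably something along these lines (a sketch; the exact assertions are an assumption, not the final test code):

```scala
import org.apache.spark.sql.types.StringType

// Values with trailing garbage must stay strings instead of being inferred
// as dates (which produced nulls before the fix).
assert(input.schema("date_month").dataType == StringType)
assert(input.schema("date_hour").dataType == StringType)
```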

val unescapedRaw = unescapePathName(raw)
// try and parse the date, if no exception occurs this is a candidate to be resolved as
// TimestampType
DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
Member

Can we skip the parsing here? I think casting a string to TimestampType will also check whether it can be parsed as a timestamp.

Contributor Author

I don't think so: the cast tolerates various timestamp formats that `parse` doesn't support (please check the comment on `DateTimeUtils.stringToTimestamp`). So I don't consider it safe to remove this; it may introduce unintended behavior changes.
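A sketch of the asymmetry, again assuming the Spark 2.x `DateTimeUtils` API:

```scala
import java.util.TimeZone
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.unsafe.types.UTF8String

val tz = TimeZone.getTimeZone("UTC")
// The cast path accepts several shapes, including a date-only string...
DateTimeUtils.stringToTimestamp(UTF8String.fromString("2018-02-15T10:00:00"), tz) // Some(...)
DateTimeUtils.stringToTimestamp(UTF8String.fromString("2018-02-15"), tz)          // Some(...)
// ...that the thread-local "yyyy-MM-dd HH:mm:ss" format would fail to parse,
// so cast-validity alone is a weaker filter than parse followed by cast.
```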

Member

Since this changes the behavior of PartitioningUtils.parsePartitions, doesn't it change the result of another path in inferPartitioning?

Contributor Author

Sorry, I am not sure I fully understood your question. Could you elaborate a bit more, please?

Member

inferPartitioning uses PartitioningUtils.parsePartitions to infer the partition spec when there is no userPartitionSchema, and it is used by DataSource.sourceSchema. It seems this change makes partition directories that were previously parseable no longer parse. Will it change the behavior of other code paths?

Contributor Author

Yes, you are right. The only change introduced is that some values which were previously wrongly inferred as dates will now be inferred as strings. Everything else works as before.
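As a concrete illustration of the net effect (hypothetical path and column name, run in a `spark-shell` session):

```scala
import org.apache.spark.sql.functions.lit
import spark.implicits._

val path = "/tmp/spark23436_demo" // hypothetical location
Seq(1, 2).toDF("id")
  .withColumn("part", lit("2018-02-15AAA"))
  .write.partitionBy("part").parquet(path)

spark.read.parquet(path).printSchema()
// Before the fix: `part` is inferred as date and its values read back as null.
// After the fix:  `part` is inferred as string and "2018-02-15AAA" is preserved.
```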

@SparkQA

SparkQA commented Feb 17, 2018

Test build #87526 has finished for PR 20621 at commit 8698f4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Thanks, merging to master!

asfgit closed this in 651b027 on Feb 20, 2018
@mgaido91
Contributor Author

Thanks @cloud-fan. Since this seems like a bug to me, why wasn't it backported to branch-2.3 too? Thanks.

@cloud-fan
Contributor

It's not a very serious bug, so I'd like to hold it until 2.3 is released. We may include it in 2.3.1.

gatorsmile pushed a commit to gatorsmile/spark that referenced this pull request Mar 8, 2018
[SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date

Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#20621 from mgaido91/SPARK-23436.
asfgit pushed a commit that referenced this pull request Mar 9, 2018
[SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date

This PR is to backport #20621 to branch-2.3.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #20764 from gatorsmile/backport23436.