
[SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date #20621

Closed
wants to merge 5 commits

Conversation

mgaido91
Contributor

What changes were proposed in this pull request?

Before this patch, Spark could infer DateType for a partition value that cannot actually be cast to Date (this can happen when there are extra characters after a valid date, like `2018-02-15AAA`).

When this happens and the input format has metadata that defines the schema of the table, `null` is returned as the value for the partition column, because the `Cast` operator used in `PartitioningAwareFileIndex.inferPartitioning` is unable to convert the value.

The PR makes partition inference check that values can actually be cast to Date or Timestamp before inferring those data types for them.
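For illustration, here is a minimal, self-contained sketch of the idea; `inferDate` is a hypothetical helper, not the patch itself, and `java.sql.Date.valueOf` stands in for Spark's `Cast`:

```scala
import java.sql.Date
import java.text.SimpleDateFormat
import scala.util.Try

// Hypothetical standalone helper (not the patch code): a raw partition value
// becomes a DateType candidate only if it both parses with the date format
// and survives a strict conversion, standing in for Cast(Literal(raw), DateType).
def inferDate(raw: String): Option[Date] = {
  val format = new SimpleDateFormat("yyyy-MM-dd")
  for {
    _ <- Try(format.parse(raw)).toOption // lenient: accepts trailing garbage
    d <- Try(Date.valueOf(raw)).toOption // strict: rejects "2018-02-15AAA"
  } yield d
}

// inferDate("2018-02-15")    => Some(2018-02-15)
// inferDate("2018-02-15AAA") => None (previously mis-inferred as a date)
```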

How was this patch tested?

added UT

@mgaido91
Contributor Author

cc @cloud-fan @HyukjinKwon @viirya

@SparkQA

SparkQA commented Feb 15, 2018

Test build #87485 has finished for PR 20621 at commit 2f05ab8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 15, 2018

Test build #87488 has finished for PR 20621 at commit 6b56408.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val dateTry = Try {
  // try and parse the date, if no exception occurs this is a candidate to be resolved as
  // DateType
  DateTimeUtils.getThreadLocalDateFormat.parse(raw)
Member

Ah, so the root cause is specific to SimpleDateFormat, because it allows invalid dates like `2018-01-01-04` to be parsed fine ..

Contributor Author

Actually, every `DateFormat.parse` allows extra characters after a valid date: https://docs.oracle.com/javase/7/docs/api/java/text/DateFormat.html#parse(java.lang.String).
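To make that contract concrete, a tiny sketch using only the JDK API and the examples from this thread:

```scala
import java.text.SimpleDateFormat

// DateFormat.parse(String) only requires that a prefix of the input parses;
// trailing characters are silently ignored rather than rejected.
val fmt = new SimpleDateFormat("yyyy-MM-dd")
fmt.parse("2018-02-15AAA") // succeeds: returns Feb 15, 2018
fmt.parse("2018-01-01-04") // also succeeds: the "-04" suffix is ignored
```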

Contributor

Is this a shortcut? It seems OK to always go through the cast path.

Contributor Author

I don't think it is enough to always go with the cast path, since it allows many formats/strings that the parse method does not. Thus I don't think it is safe to skip the parse method.
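For example, assuming the Spark 2.x `DateTimeUtils` API (a sketch, not part of the patch), the cast path accepts partial dates that the `yyyy-MM-dd` format rejects:

```scala
import java.text.SimpleDateFormat
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.unsafe.types.UTF8String

// The cast's underlying conversion accepts shortened date forms...
DateTimeUtils.stringToDate(UTF8String.fromString("2018"))    // Some(...)
DateTimeUtils.stringToDate(UTF8String.fromString("2018-02")) // Some(...)

// ...while the strict "yyyy-MM-dd" format rejects them outright:
new SimpleDateFormat("yyyy-MM-dd").parse("2018")             // throws ParseException
```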

// We need to check that we can cast the raw string since we later can use Cast to get
// the partition values with the right DataType (see
// org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
val dateOption = Option(Cast(Literal(raw), DateType).eval())
Member

Can we add `require(dateOption.isDefined)` with an explicit comment?

Contributor Author

Sure, but aren't these comments enough? Could you suggest how you would like to improve them, i.e., what is missing or unclear? Thanks.

Member

I mean, simply something like:

// Disallow DateType if the cast returned null
require(dateOption.isDefined)

Nothing special. I am fine with not adding it, too.

@SparkQA

SparkQA commented Feb 16, 2018

Test build #87508 has finished for PR 20621 at commit 6274537.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Feb 16, 2018

Is this a blocker-level regression? When did we introduce it?

@gatorsmile
Member

gatorsmile commented Feb 16, 2018

It sounds like Spark 2.2 already has this bug, but Spark 2.1 is still fine. This causes an incorrect result.

@mgaido91
Contributor Author

@gatorsmile thanks for checking. Yes, Spark 2.2 is affected too, so I am not sure whether this should be considered a blocker regression. Nonetheless, I think we should fix it as soon as possible.


data.write.partitionBy("date_month", "date_hour").parquet(path.getAbsolutePath)
val input = spark.read.parquet(path.getAbsolutePath)
checkAnswer(input.select("id", "date_month", "date_hour", "data"),
Contributor

Shall we also check the schema to make sure that field is of string type?

Contributor Author

I don't think it is necessary, but I will add this check, thanks.
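Presumably something along these lines (a sketch; the exact assertions are an assumption, not the final test code):

```scala
import org.apache.spark.sql.types.StringType

// Values with trailing garbage must stay strings instead of being inferred
// as dates (which produced nulls before the fix).
assert(input.schema("date_month").dataType == StringType)
assert(input.schema("date_hour").dataType == StringType)
```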

val unescapedRaw = unescapePathName(raw)
// try and parse the date, if no exception occurs this is a candidate to be resolved as
// TimestampType
DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw)
Member

Can we skip the parsing here? I think casting a string to TimestampType will also check whether it can be parsed as a timestamp.

Contributor Author

I don't think so: the cast tolerates various timestamp formats that `parse` doesn't support (please check the comment on `DateTimeUtils.stringToTimestamp`). So I don't consider it safe to remove this; it may introduce unintended behavior changes.
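A sketch of the asymmetry, again assuming the Spark 2.x `DateTimeUtils` API:

```scala
import java.util.TimeZone
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.unsafe.types.UTF8String

val tz = TimeZone.getTimeZone("UTC")
// The cast path accepts several shapes, including a date-only string...
DateTimeUtils.stringToTimestamp(UTF8String.fromString("2018-02-15T10:00:00"), tz) // Some(...)
DateTimeUtils.stringToTimestamp(UTF8String.fromString("2018-02-15"), tz)          // Some(...)
// ...that the thread-local "yyyy-MM-dd HH:mm:ss" format would fail to parse,
// so cast-validity alone is a weaker filter than parse followed by cast.
```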

Member

Since this changes the behavior of PartitioningUtils.parsePartitions, doesn't it change the result of another path in inferPartitioning?

Contributor Author

Sorry, I am not sure I fully understood your question. Could you elaborate a bit more, please?

Member

inferPartitioning uses PartitioningUtils.parsePartitions to infer the partition spec when there is no userPartitionSchema, and it is used by DataSource.sourceSchema. It seems this change makes partition directories that were previously parseable no longer parse. Will it change the behavior of other code paths?

Contributor Author

Yes, you are right. The only change introduced is that some values which were previously wrongly inferred as dates will now be inferred as strings. Everything else works as before.
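As a concrete illustration of the net effect (hypothetical path and column name, run in a `spark-shell` session):

```scala
import org.apache.spark.sql.functions.lit
import spark.implicits._

val path = "/tmp/spark23436_demo" // hypothetical location
Seq(1, 2).toDF("id")
  .withColumn("part", lit("2018-02-15AAA"))
  .write.partitionBy("part").parquet(path)

spark.read.parquet(path).printSchema()
// Before the fix: `part` is inferred as date and its values read back as null.
// After the fix:  `part` is inferred as string and "2018-02-15AAA" is preserved.
```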

@SparkQA

SparkQA commented Feb 17, 2018

Test build #87526 has finished for PR 20621 at commit 8698f4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Thanks, merging to master!

asfgit closed this in 651b027 on Feb 20, 2018
@mgaido91
Contributor Author

Thanks @cloud-fan. Since this seems like a bug to me, why wasn't it backported to branch-2.3 too? Thanks.

@cloud-fan
Contributor

It's not a very serious bug, so I'd like to hold it until 2.3 is released. We may include it in 2.3.1.

gatorsmile pushed a commit to gatorsmile/spark that referenced this pull request Mar 8, 2018
[SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date

Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#20621 from mgaido91/SPARK-23436.
asfgit pushed a commit that referenced this pull request Mar 9, 2018
[SPARK-23436][SQL] Infer partition as Date only if it can be casted to Date

This PR is to backport #20621 to branch-2.3.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #20764 from gatorsmile/backport23436.