Skip to content

[SPARK-31212][SQL][2.4] Fix Failure of casting the '1000-02-29' string to the date type#28445

Closed
tianshizz wants to merge 1 commit intoapache:branch-2.4from
tianshizz:SPARK-31212
Closed

[SPARK-31212][SQL][2.4] Fix Failure of casting the '1000-02-29' string to the date type#28445
tianshizz wants to merge 1 commit intoapache:branch-2.4from
tianshizz:SPARK-31212

Conversation

@tianshizz
Copy link
Contributor

What changes were proposed in this pull request?

Use GregorianCanlendar.isLeapYear() as suggested in SPARK-31212 to fix parsing date string such as 1000-02-29

Why are the changes needed?

Fix a bug that leap years in Julian calendar can't be parsed correctly in v2.4

Does this PR introduce any user-facing change?

No

How was this patch tested?

added a unit test that would fail in the current branch

@tianshizz
Copy link
Contributor Author

attempted to fix the bug in branch-2.4, as discussed in #28443

cc @MaxGekk @HyukjinKwon

@kiszk
Copy link
Member

kiszk commented May 4, 2020

Could you please add [2.4] into the title like this?

@HyukjinKwon HyukjinKwon changed the title [SPARK-31212][SQL] Fix Failure of casting the '1000-02-29' string to the date type in 2.4 [SPARK-31212][SQL][2.4] Fix Failure of casting the '1000-02-29' string to the date type May 4, 2020
@HyukjinKwon
Copy link
Member

@HyukjinKwon
Copy link
Member

cc @cloud-fan too.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented May 4, 2020

Test build #122244 has finished for PR 28445 at commit ce2470a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


private val threadLocalUtcGregorianCalendar = new ThreadLocal[GregorianCalendar] {
override def initialValue(): GregorianCalendar = {
new GregorianCalendar(TimeZoneUTC)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does timezone matter here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the implementation (https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/util/GregorianCalendar.java#L819), I don't think timezone would make a difference here.

@cloud-fan
Copy link
Contributor

I believe there are more bugs in Spark 2.4 as it has a lot of homemade datetime processing code. I'm not sure you can safely process datetime values before 1582 with Spark 2.4, even with this fix.

@tianshizz
Copy link
Contributor Author

@HyukjinKwon thanks for updating the title. I did change the place that you pointed to. According to the jira ticket, I believe that's the only place we need to fix?

@tianshizz
Copy link
Contributor Author

@cloud-fan I agree that there might be more time related bugs in 2.4. Do you think it would be good to fix this specific bug in this pr and file more tickets as we go? I'm very new to the community and would love to take on something like this to sweep the time related bugs with some guidance from more tenured members here.

@tianshizz
Copy link
Contributor Author

tianshizz commented May 4, 2020

[error] running /home/jenkins/workspace/SparkPullRequestBuilder@2/build/sbt -Phadoop-2.6 -Phive-thriftserver -Phive -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest hive-thriftserver/test avro/test mllib/test hive/test repl/test catalyst/test sql/test sql-kafka-0-10/test examples/test ; process was terminated by signal 9

the build failed due to a unrelated error I think. Could someone help me retest?

@kiszk
Copy link
Member

kiszk commented May 4, 2020

retest this please

@SparkQA
Copy link

SparkQA commented May 4, 2020

Test build #122274 has finished for PR 28445 at commit ce2470a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

@MaxGekk what's your opinion? I'm fine with this fix but I won't encourage people to spend much time fixing datetime related bugs in 2.4. The datetime part is completely rewritten in 3.0.

@tianshizz
Copy link
Contributor Author

@MaxGekk bump to see how you feel about this pr.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented May 6, 2020

Okay, I had some time to investigate the related items here. I think there are some more places to fix, e.g. here.

Let's don't fix these in branch-2.4 only because:

  • These were "fixed" in the master branch by switching the calendar. I think we should take this fix was superseded by it in the master. We can't backport it to branch-2.4 because it's too much breaking change, though.

  • It's pretty risky. Given the history of fixing it in the master, fixing one caused a bunch of other related issues. These new issues have been fixed and found over one year, and it's still in progress. I would like to avoid to keep fixing by completely different approaches in 2.4 compared to 3.0.

  • It will causes a bunch of behaviour changes. Yes, these will be bug fixes but I tend to be more conservative. See also Bug compatibility.

Considering these risks, let's don't land these fixes. We can more conservatively just document these in Spark 2.4 specifically given the potential maintenance overhead and technical difficulty, if we should.

@tianshizz
Copy link
Contributor Author

@HyukjinKwon Thanks for the detailed explanation. Agree that it's probably better to keep these bugs in 2.4 instead of fixing. For the next step, should we update the ticket with the reasoning and close it as won't fix?
cc @MaxGekk if he has different opinions.

@tianshizz tianshizz closed this May 6, 2020
@HyukjinKwon
Copy link
Member

Yeah, let's close it as wont'fix. Thanks @tianshizz.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

Comments