Skip to content

[SPARK-29362][SQL] Move timestamp expressions and tests to separate files#26031

Closed
MaxGekk wants to merge 5 commits intoapache:masterfrom
MaxGekk:date-time-refactoring
Closed

[SPARK-29362][SQL] Move timestamp expressions and tests to separate files#26031
MaxGekk wants to merge 5 commits intoapache:masterfrom
MaxGekk:date-time-refactoring

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Oct 5, 2019

What changes were proposed in this pull request?

In the PR, I propose refactoring of date and timestamp expressions by:

  • Moving all timestamp expressions from datetimeExpressions.scala to timestampExpressions.scala
  • Renaming datetimeExpressions.scala to dateExpressions.scala
  • Moving all timestamp expressions tests from DateExpressionsSuite.scala to TimestampExpressionsSuite.scala.
  • Moving all timestamp functions tests to from DateFunctionsSuite.scala to TimestampFunctionsSuite.scala.

Why are the changes needed?

The datetimeExpressions.scala file has been becoming large. Its size is more than 2000 lines, at the moment. Also it contains 2 kind of expressions - date expressions and timestamp expressions. To make easier navigation and maintainability of date-time expressions, it would be nice to separate the expressions. The same reason is applicable for timestamp expressions and functions.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By existing tests from the test suite DateExpressionsSuite and TimestampExpressionSuite, and by new test suites TimestampExpressionsSuite and TimestampFunctionsSuite.

@SparkQA
Copy link

SparkQA commented Oct 5, 2019

Test build #111801 has finished for PR 26031 at commit cd713e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class TimestampFunctionsSuite extends QueryTest with SharedSparkSession

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 5, 2019

@dongjoon-hyun @srowen @cloud-fan @HyukjinKwon Could you take a look at the PR, please.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kind of neutral on it. Yeah it's a big file but there are lots of date/time functions, and they're related.
I usually wonder if this will cause a few merge conflict problems, and how much it's worth in comparison.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 5, 2019

I am just worry IntelliJ IDEA heats my laptop so much when I change this file that I can lost it at some moment. ;-)

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Oct 5, 2019

I also agree with @srowen .

cc @gatorsmile since there was a similar discussion on the other big SQL files. IIRC, it ended up with a negative conclusion. In any cases, this is a case-by-case decision.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 5, 2019

Yeah it's a big file but there are lots of date/time functions, and they're related.

Large file is only one reason. Another one is date expressions and timestamp expressions in datetimeExpressions.scala are not related. They don't depend on each other. This is the only one place where we mixed 2 types, probably because of historical reasons.

... this will cause a few merge conflict problems ...

It seems this is main concern, right? As far as I understand, if a private Spark fork has additional date or timestamp expressions, this will cause some conflicts. ... but such conflicts could be easily resolved by keeping an expression in the same place (date expression) or just move it to timestampExpressions.scala. Is it a real problem? Especially taking into account that such private expressions are placed to a separate package like com.databricks....

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Oct 5, 2019

@MaxGekk . Since you were not there, I'll give you more context. In addition to them, the last blocker concern of the last discussion was not about resolving conflicts. It's about weakening the traceability due to the loss of the commit history on that file. At Spark 2.0, we did many refactoring like this PR, and this removes the line-by-line commit history and increased the chance of bugs. That was the reason why the PMCs are reluctant to do that again. And, basically, I've heard that comment from @gatorsmile .

@ExpressionDescription(
usage = "_FUNC_() - Returns the current date at the start of query evaluation.",
since = "1.5.0")
case class CurrentDate(timeZoneId: Option[String] = None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, we will not do such a file split. It will make us [especially for new contributors] very hard to track the change history. For example, this function, we can get the change history in IntelliJ
image

@gatorsmile
Copy link
Member

I would agree with @dongjoon-hyun . I have to say -1 for such a PR, like what I did in the other similar PRs. Reading our change history is the most efficient way to learn our code base. I did it almost everyday.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 6, 2019

Before closing this PR, let's look at moved expression CurrentTimestamp from datetimeExpression.scala to timestampExpressions.scala.
Screen Shot 2019-10-06 at 08 18 36
Screen Shot 2019-10-06 at 08 20 23
IntelliJ IDEA can show the blame history as well:
Screen Shot 2019-10-06 at 08 22 26
Screen Shot 2019-10-06 at 08 24 06

@srowen
Copy link
Member

srowen commented Oct 6, 2019

Does it preserve git history for both files? I'd imagine at most one of them can be a rename, so the other appears all new.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 6, 2019

Does it preserve git history for both files?

@srowen Git (and IntelliJ IDEA) is able to track history for both files. I see original history (and blame annotations) in the renamed file (datetimeExpressions.scala -> dateExpressions.scala) as is. In the new file timestampExpressions.scala, I am as the last author of whole file but you can see the git log and blames as I showed above. When you select Annotate Previous Revision, IDEA opens parent revision on the original file datetimeExpressions.scala. Somehow it recognizes the original places where the code moved from. The same for history new file timestampExpressions.scala. Its history does not start from scratch (see the second screenshot above).

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 6, 2019

Here is a related answer https://stackoverflow.com/a/40466759 . The command below show blame annotation of copied code:

 git blame -C ./timestampExpressions.scala

@srowen
Copy link
Member

srowen commented Oct 6, 2019

Interesting, so you can get git to notice that the file contents were copied. It isn't the default? so I assume IJ wouldn't automatically do this, but it's available. I'd still be overall neutral. Does it really cause problems in IJ? didn't seem too hard to deal with when I opened it, but maybe complex ops with git take a while?

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 6, 2019

Interesting, so you can get git to notice that the file contents were copied. It isn't the default?

Yes, you just need to pass additional options to git -C (https://git-scm.com/docs/git-blame#Documentation/git-blame.txt--Cltnumgt) or -M(https://git-scm.com/docs/git-blame#Documentation/git-blame.txt--Mltnumgt) otherwise it will show who copied the code.

but maybe complex ops with git take a while?

Less than 2 seconds:

 $ time git blame -C ./timestampExpressions.scala > /dev/null
git blame -C ./timestampExpressions.scala > /dev/null  1.21s user 0.09s system 99% cpu 1.312 total

@dongjoon-hyun
Copy link
Member

That's good. Let me check that too, @MaxGekk .

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Oct 6, 2019

@MaxGekk . GitHub doesn't support that by default. Please see you new file in GitHub. Everything is under one commit.

In this case, I'm still -1.

@srowen
Copy link
Member

srowen commented Oct 6, 2019

Yes, and I was also wondering about IJ. I get that you can do this on the command line, which is great. I was also asking what the IJ perf problem is.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 6, 2019

I am closing this PR.

@MaxGekk MaxGekk closed this Oct 6, 2019
@dongjoon-hyun
Copy link
Member

Thank you for the decision, @MaxGekk .

@cloud-fan
Copy link
Contributor

This is a tradeoff that we need to evaluate. We should not blindly reject any cross-file code movement simply because it makes commit tracking harder. Users can still find the code movement commit and continue to look at the commits of the old file. It's still doable but harder.

For this particular case, I think datetimeExpressions.scala is OK to hold the date and timestamp expressions. It's better to split into 2 files as it has grown too big, we should do it if we are going to add many more datetime expressions later.

@MaxGekk MaxGekk deleted the date-time-refactoring branch October 15, 2019 19:57
@HyukjinKwon
Copy link
Member

We should not blindly reject any cross-file code movement simply because it makes commit tracking harder

Yea, I think so too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants