
[SPARK-10520][SQL] Allow average out of DateType #28754

Closed
wants to merge 5 commits

Conversation

Fokko
Contributor

@Fokko Fokko commented Jun 8, 2020

This allows us to take the average of a DateType column.

https://jira.apache.org/jira/browse/SPARK-10520

Under the hood, we take the average of the days since epoch and convert the result back to a date. This requires DateType values to be castable to double so the average can be computed.

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to data type mismatch: function average requires numeric types, not DateType;
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
	at org.apache.spark.sql.
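The idea can be sketched outside Spark roughly as follows (a minimal Python sketch of the approach, not the actual Catalyst implementation; `average_date` and `EPOCH` are illustrative names):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def average_date(dates):
    """Average dates by averaging their days-since-epoch values."""
    # Cast each date to a number, analogous to casting DateType to Double.
    days = [(d - EPOCH).days for d in dates]
    avg_days = sum(days) / len(days)
    # Convert the averaged day count back to a date (truncating the fraction).
    return EPOCH + timedelta(days=int(avg_days))

print(average_date([date(2020, 1, 1), date(2020, 1, 3)]))  # 2020-01-02
```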

To keep the PRs nice and small, I've split this out of #28554.

What changes were proposed in this pull request?

Why are the changes needed?

This allows us to take an average of a Date column. This is required to include dates in the summary.

Does this PR introduce any user-facing change?

Yes. If you cast a Date to a Double, it will return the days since epoch instead of null. This is required to compute the average of the days.

How was this patch tested?

Using unit tests and the data frame test suite.

@Fokko
Contributor Author

Fokko commented Jun 8, 2020

Required for: #28754

@@ -40,10 +40,17 @@ case class Average(child: Expression) extends DeclarativeAggregate with Implicit

override def children: Seq[Expression] = child :: Nil

- override def inputTypes: Seq[AbstractDataType] = Seq(NumericType)
+ override def inputTypes: Seq[AbstractDataType] = Seq(NumericType, DateType)
Member

@Fokko, before you go further, can we check other DBMSes as references? I would like to avoid having a variant behaviour in Spark alone compared to other DBMSes ...

Contributor Author

@Fokko Fokko Jun 9, 2020

Sure, that makes sense. See the details below; let me know if I'm missing something, but I don't think there is real consensus on the subject.

Postgres

In Postgres, the cast is simply unsupported:

postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR:  cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
               ^

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR:  cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
               ^

The way to get the number of days since the epoch is:

postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
date_part 
-----------
    18422
(1 row)

MySQL

MySQL automatically converts it to a number in YYYYMMDD form:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
|                                    20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)

Converting to an int is not allowed:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
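As a side note (a hypothetical Python sketch, not taken from the PR), MySQL's YYYYMMDD encoding would not be usable for averaging, whereas days since epoch is:

```python
from datetime import date, timedelta

def yyyymmdd(d):
    # The numeric form MySQL produces when casting DATE to decimal.
    return d.year * 10000 + d.month * 100 + d.day

def epoch_days(d):
    return (d - date(1970, 1, 1)).days

d1, d2 = date(2019, 12, 31), date(2020, 1, 2)

# Averaging the YYYYMMDD encoding lands between "months" that don't exist:
print((yyyymmdd(d1) + yyyymmdd(d2)) / 2)  # 20195666.5 -- not a valid date

# Averaging days since epoch yields the true midpoint:
mid = (epoch_days(d1) + epoch_days(d2)) // 2
print(date(1970, 1, 1) + timedelta(days=mid))  # 2020-01-01
```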

BigQuery

Unsupported

(screenshot: the cast is rejected in the BigQuery console)

https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_rules

Excel

The greatest DBMS of them all:

(screenshot: Excel converts the date to a serial day number)

Which is the number of days since the 01-01-1900 epoch :)

Contributor Author

@HyukjinKwon what are your thoughts on this? Can we move this forward?

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jun 27, 2020

Test build #124570 has finished for PR 28754 at commit 533dd8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko
Contributor Author

Fokko commented Jun 27, 2020

Hmm, weird that the test is failing. I've just pulled in the latest master to retrigger the tests.

@dongjoon-hyun
Member

Thanks, @Fokko .

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jun 28, 2020

Test build #124574 has finished for PR 28754 at commit 9fba69e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class MapOutputCommitMessage
  • sealed trait LogisticRegressionSummary extends ClassificationSummary
  • class _ClassificationSummary(JavaWrapper):
  • class _TrainingSummary(JavaWrapper):
  • class _BinaryClassificationSummary(_ClassificationSummary):
  • class LinearSVCModel(_JavaClassificationModel, _LinearSVCParams, JavaMLWritable, JavaMLReadable,
  • class LinearSVCSummary(_BinaryClassificationSummary):
  • class LinearSVCTrainingSummary(LinearSVCSummary, _TrainingSummary):
  • class LogisticRegressionSummary(_ClassificationSummary):
  • class LogisticRegressionTrainingSummary(LogisticRegressionSummary, _TrainingSummary):
  • class BinaryLogisticRegressionSummary(_BinaryClassificationSummary,
  • case class Hour(child: Expression, timeZoneId: Option[String] = None) extends GetTimeField
  • case class Minute(child: Expression, timeZoneId: Option[String] = None) extends GetTimeField
  • case class Second(child: Expression, timeZoneId: Option[String] = None) extends GetTimeField
  • trait GetDateField extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant
  • case class DayOfYear(child: Expression) extends GetDateField
  • case class Year(child: Expression) extends GetDateField
  • case class YearOfWeek(child: Expression) extends GetDateField
  • case class Quarter(child: Expression) extends GetDateField
  • case class Month(child: Expression) extends GetDateField
  • case class DayOfMonth(child: Expression) extends GetDateField
  • case class DayOfWeek(child: Expression) extends GetDateField
  • case class WeekDay(child: Expression) extends GetDateField
  • case class WeekOfYear(child: Expression) extends GetDateField
  • case class TimeFormatters(date: DateFormatter, timestamp: TimestampFormatter)
  • case class CoalesceBucketsInSortMergeJoin(conf: SQLConf) extends Rule[SparkPlan]
  • class StateStoreConf(

@dongjoon-hyun
Member

The failure looks relevant. Could you take a look, @Fokko?

  • org.apache.spark.sql.catalyst.expressions.CastSuite.SPARK-16729 type checking for casting to date type
  • org.apache.spark.sql.catalyst.expressions.AnsiCastSuite.SPARK-16729 type checking for casting to date type

This is days since epoch
@Fokko
Contributor Author

Fokko commented Jun 28, 2020

That looks relevant, @dongjoon-hyun, thanks for pointing it out. I've removed the check since it is allowed to cast from/to date. The cast is asserted by newly added tests.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124611 has finished for PR 28754 at commit bbf72c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @Fokko .
Since the PR is at least working and SPARK-10520 has been a long-standing issue, could you send an email to dev@spark with a summary of your findings (#28754 (comment))?

@HyukjinKwon
Member

HyukjinKwon commented Aug 10, 2020

BTW, it looks like most DBMSes don't allow this. I checked quickly and it seems the ANSI standard doesn't allow it either.

@Fokko
Contributor Author

Fokko commented Aug 20, 2020

@Fokko
Contributor Author

Fokko commented Sep 3, 2020

@dongjoon-hyun @HyukjinKwon Given the discussion on the devlist, is this something that we can move forward?

@HyukjinKwon
Member

So will we only allow this in the summary API?

@cloud-fan
Contributor

I believe the conclusion is to do the date average manually in the summary method (cast to int, run average, and cast back to date). I don't think we should allow the average function to accept date input.
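That workaround could look roughly like this outside Spark (a minimal Python sketch, assuming NULLs are skipped as SQL's AVG does; `summary_avg_date` is an illustrative name, not from the codebase):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def summary_avg_date(column):
    """Cast to int, run the average (skipping NULLs), and cast back to date,
    done in the summary code rather than in the Average expression itself."""
    days = [(d - EPOCH).days for d in column if d is not None]
    if not days:
        return None  # AVG over an all-NULL column is NULL
    return EPOCH + timedelta(days=round(sum(days) / len(days)))

print(summary_avg_date([date(2020, 6, 1), None, date(2020, 6, 3)]))  # 2020-06-02
```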

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 13, 2020
@github-actions github-actions bot closed this Dec 14, 2020