
[SPARK-10520][SQL] Allow average out of DateType #28754

Closed
wants to merge 5 commits

Conversation

Fokko
Contributor

@Fokko Fokko commented Jun 8, 2020

This allows us to take the average of a DateType column.

https://jira.apache.org/jira/browse/SPARK-10520

Under the hood, we take the average of the days since epoch and convert the result back to a date. This requires DateType values to be castable to double so the average can be computed.

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to data type mismatch: function average requires numeric types, not DateType;
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
	at org.apache.spark.sql.
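The idea can be sketched outside Spark roughly as follows (a minimal Python sketch of the approach, not the actual Catalyst implementation; `average_date` and `EPOCH` are illustrative names):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def average_date(dates):
    """Average dates by averaging their days-since-epoch values."""
    # Cast each date to a number, analogous to casting DateType to Double.
    days = [(d - EPOCH).days for d in dates]
    avg_days = sum(days) / len(days)
    # Convert the averaged day count back to a date (truncating the fraction).
    return EPOCH + timedelta(days=int(avg_days))

print(average_date([date(2020, 1, 1), date(2020, 1, 3)]))  # 2020-01-02
```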

To keep the PRs nice and small, I've split this out of #28554.

What changes were proposed in this pull request?

Why are the changes needed?

This allows us to take an average of a Date column. This is required to include dates in the summary.

Does this PR introduce any user-facing change?

Yes. If you cast a Date to a Double, it will return the days since epoch instead of null. This is required to compute the average of the days.

How was this patch tested?

Using unit tests and the data frame test suite.

@Fokko
Contributor Author

Fokko commented Jun 8, 2020

Required for: #28754

@@ -40,10 +40,17 @@ case class Average(child: Expression) extends DeclarativeAggregate with Implicit

override def children: Seq[Expression] = child :: Nil

- override def inputTypes: Seq[AbstractDataType] = Seq(NumericType)
+ override def inputTypes: Seq[AbstractDataType] = Seq(NumericType, DateType)
Member

@Fokko, before you go further, can we check other DBMSes as references? I would like to avoid having a variant behaviour in Spark alone compared to other DBMSes ...

Contributor Author

@Fokko Fokko Jun 9, 2020

Sure, that makes sense. See the details below; let me know if I'm missing something, but I don't think there is real consensus on the subject.

Postgres

In Postgres, the cast is simply unsupported:

postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR:  cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
               ^

postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR:  cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
               ^

The way to get the number of days since the epoch is:

postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
date_part 
-----------
    18422
(1 row)

MySQL

MySQL automatically converts it to a number in YYYYMMDD form:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
|                                    20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)

Converting to an int is not allowed:

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1

mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
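As a side note (a hypothetical Python sketch, not taken from the PR), MySQL's YYYYMMDD encoding would not be usable for averaging, whereas days since epoch is:

```python
from datetime import date, timedelta

def yyyymmdd(d):
    # The numeric form MySQL produces when casting DATE to decimal.
    return d.year * 10000 + d.month * 100 + d.day

def epoch_days(d):
    return (d - date(1970, 1, 1)).days

d1, d2 = date(2019, 12, 31), date(2020, 1, 2)

# Averaging the YYYYMMDD encoding lands between "months" that don't exist:
print((yyyymmdd(d1) + yyyymmdd(d2)) / 2)  # 20195666.5 -- not a valid date

# Averaging days since epoch yields the true midpoint:
mid = (epoch_days(d1) + epoch_days(d2)) // 2
print(date(1970, 1, 1) + timedelta(days=mid))  # 2020-01-01
```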

BigQuery

Unsupported

(screenshot: the cast is rejected in the BigQuery console)

https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_rules

Excel

The greatest DBMS of them all:

(screenshot: Excel converts the date to a serial day number)

Which is the number of days since the 01-01-1900 epoch :)

Contributor Author

@HyukjinKwon what are your thoughts on this? Can we move this forward?

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jun 27, 2020

Test build #124570 has finished for PR 28754 at commit 533dd8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko
Contributor Author

Fokko commented Jun 27, 2020

Hmm, weird that the test is failing. I've just pulled in the latest master to retrigger the tests.

@dongjoon-hyun
Member

Thanks, @Fokko .

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jun 28, 2020

Test build #124574 has finished for PR 28754 at commit 9fba69e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class MapOutputCommitMessage
  • sealed trait LogisticRegressionSummary extends ClassificationSummary
  • class _ClassificationSummary(JavaWrapper):
  • class _TrainingSummary(JavaWrapper):
  • class _BinaryClassificationSummary(_ClassificationSummary):
  • class LinearSVCModel(_JavaClassificationModel, _LinearSVCParams, JavaMLWritable, JavaMLReadable,
  • class LinearSVCSummary(_BinaryClassificationSummary):
  • class LinearSVCTrainingSummary(LinearSVCSummary, _TrainingSummary):
  • class LogisticRegressionSummary(_ClassificationSummary):
  • class LogisticRegressionTrainingSummary(LogisticRegressionSummary, _TrainingSummary):
  • class BinaryLogisticRegressionSummary(_BinaryClassificationSummary,
  • case class Hour(child: Expression, timeZoneId: Option[String] = None) extends GetTimeField
  • case class Minute(child: Expression, timeZoneId: Option[String] = None) extends GetTimeField
  • case class Second(child: Expression, timeZoneId: Option[String] = None) extends GetTimeField
  • trait GetDateField extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant
  • case class DayOfYear(child: Expression) extends GetDateField
  • case class Year(child: Expression) extends GetDateField
  • case class YearOfWeek(child: Expression) extends GetDateField
  • case class Quarter(child: Expression) extends GetDateField
  • case class Month(child: Expression) extends GetDateField
  • case class DayOfMonth(child: Expression) extends GetDateField
  • case class DayOfWeek(child: Expression) extends GetDateField
  • case class WeekDay(child: Expression) extends GetDateField
  • case class WeekOfYear(child: Expression) extends GetDateField
  • case class TimeFormatters(date: DateFormatter, timestamp: TimestampFormatter)
  • case class CoalesceBucketsInSortMergeJoin(conf: SQLConf) extends Rule[SparkPlan]
  • class StateStoreConf(

@dongjoon-hyun
Member

The failure looks relevant. Could you take a look, @Fokko?

  • org.apache.spark.sql.catalyst.expressions.CastSuite.SPARK-16729 type checking for casting to date type
  • org.apache.spark.sql.catalyst.expressions.AnsiCastSuite.SPARK-16729 type checking for casting to date type

This is days since epoch
@Fokko
Contributor Author

Fokko commented Jun 28, 2020

That looks relevant, @dongjoon-hyun, thanks for pointing it out. I've removed the check since it is allowed to cast from/to date. The cast is asserted by newly added tests.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124611 has finished for PR 28754 at commit bbf72c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @Fokko .
Since the PR is at least working and SPARK-10520 has been a long-standing issue, could you send an email to dev@spark with a summary of your findings (#28754 (comment))?

@HyukjinKwon
Member

HyukjinKwon commented Aug 10, 2020

BTW, it looks like most DBMSes don't allow this. I checked quickly and it seems the ANSI standard doesn't allow it either.

@Fokko
Contributor Author

Fokko commented Aug 20, 2020

@Fokko
Contributor Author

Fokko commented Sep 3, 2020

@dongjoon-hyun @HyukjinKwon Given the discussion on the devlist, is this something that we can move forward?

@HyukjinKwon
Member

So will we only allow this in the summary API?

@cloud-fan
Contributor

I believe the conclusion is to do the date average manually in the summary method (cast to int, run average, and cast back to date). I don't think we should allow the average function to accept date input.
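That workaround could look roughly like this outside Spark (a minimal Python sketch, assuming NULLs are skipped as SQL's AVG does; `summary_avg_date` is an illustrative name, not from the codebase):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def summary_avg_date(column):
    """Cast to int, run the average (skipping NULLs), and cast back to date,
    done in the summary code rather than in the Average expression itself."""
    days = [(d - EPOCH).days for d in column if d is not None]
    if not days:
        return None  # AVG over an all-NULL column is NULL
    return EPOCH + timedelta(days=round(sum(days) / len(days)))

print(summary_avg_date([date(2020, 6, 1), None, date(2020, 6, 3)]))  # 2020-06-02
```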

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 13, 2020
@github-actions github-actions bot closed this Dec 14, 2020