[SPARK-10520][SQL] Allow average out of DateType #28754
Required for: #28754 |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -40,10 +40,17 @@ case class Average(child: Expression) extends DeclarativeAggregate with Implicit

   override def children: Seq[Expression] = child :: Nil

-  override def inputTypes: Seq[AbstractDataType] = Seq(NumericType)
+  override def inputTypes: Seq[AbstractDataType] = Seq(NumericType, DateType)
@Fokko, before you go further, can we check other DBMSes as references? I would like to avoid having a variant behaviour in Spark alone compared to other DBMSes ...
Sure, that makes sense. See the details below, and let me know if I'm missing something. I don't think there is a real consensus on the subject.
Postgres
In Postgres, casting a date to a numeric type is simply unsupported:
postgres@366ecc8a0fb9:/$ psql
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.
postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
ERROR: cannot cast type date to numeric
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
^
postgres=# SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
ERROR: cannot cast type date to integer
LINE 1: SELECT CAST(CAST('2020-01-01' AS DATE) AS integer);
^
The way to get the number of days since epoch is:
postgres=# SELECT EXTRACT(DAYS FROM (now() - '1970-01-01'));
date_part
-----------
18422
(1 row)
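For comparison, the same days-since-epoch value can be reproduced outside the database. This is a minimal Python sketch using only the standard library, assuming the Unix epoch of 1970-01-01; the helper names `days_since_epoch` and `from_days_since_epoch` are made up for illustration and are not part of Postgres or Spark.

```python
from datetime import date, timedelta

# Assumption: the Unix epoch, the same origin used in the query above.
EPOCH = date(1970, 1, 1)


def days_since_epoch(d: date) -> int:
    """Return the number of whole days between the epoch and d."""
    return (d - EPOCH).days


def from_days_since_epoch(n: int) -> date:
    """Inverse of days_since_epoch: turn a day count back into a date."""
    return EPOCH + timedelta(days=n)


print(days_since_epoch(date(2020, 1, 1)))  # 18262
print(from_days_since_epoch(18422))        # 2020-06-09
```

The second call suggests that the value 18422 returned by the query corresponds to 2020-06-09.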
MySQL
MySQL converts the date automatically to a number in YYYYMMDD format:
mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS decimal);
+---------------------------------------------+
| CAST(CAST('2020-01-01' AS DATE) AS decimal) |
+---------------------------------------------+
| 20200101 |
+---------------------------------------------+
1 row in set (0.00 sec)
Casting with `AS int` or `AS bigint` fails with a syntax error (MySQL's CAST expects SIGNED or UNSIGNED as the integer target type):
mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS int);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'int)' at line 1
mysql> SELECT CAST(CAST('2020-01-01' AS DATE) AS bigint);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'bigint)' at line 1
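The two numeric encodings behave very differently under averaging. The sketch below is only an illustration in plain Python, not MySQL or Spark code; the helper `yyyymmdd` is hypothetical and mimics the YYYYMMDD number that MySQL returns above.

```python
from datetime import date

EPOCH = date(1970, 1, 1)  # assumption: Unix epoch as the day-count origin


def yyyymmdd(d: date) -> int:
    """Encode a date as a YYYYMMDD integer, like the MySQL cast above."""
    return d.year * 10000 + d.month * 100 + d.day


dates = [date(2020, 1, 31), date(2020, 2, 1)]

# Averaging the YYYYMMDD encoding gives a number that is not a calendar date.
naive_mean = sum(yyyymmdd(d) for d in dates) / len(dates)

# Averaging day counts stays on a linear scale.
mean_days = sum((d - EPOCH).days for d in dates) / len(dates)

print(naive_mean)  # 20200166.0
print(mean_days)   # 18292.5
```

The first result would read as "January 66th", while the second falls halfway between the two input days.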
BigQuery
Unsupported
https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_rules
Excel
The greatest DBMS of them all:
Which is the epoch since 01-01-1900 :)
For Avro it is milliseconds since epoch:
https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/reflect/DateAsLongEncoding.java
For Parquet it is days since epoch:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
Also ORC is based around days since Epoch:
https://github.com/apache/orc/blob/master/java/core/src/java/org/threeten/extra/chrono/HybridDate.java
Also with this, we keep parity with the Catalyst type :)
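Since the formats above differ only in unit, converting between them is a matter of scaling. A minimal sketch, assuming UTC and ignoring time zones entirely; the function names are illustrative and do not come from Avro, Parquet, or ORC.

```python
MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def days_to_millis(days: int) -> int:
    """Convert days since epoch to milliseconds since epoch (midnight UTC)."""
    return days * MILLIS_PER_DAY


def millis_to_days(millis: int) -> int:
    """Convert milliseconds since epoch to days since epoch.

    Floor division keeps instants before the epoch on the earlier day.
    """
    return millis // MILLIS_PER_DAY


print(days_to_millis(18262))          # 1577836800000
print(millis_to_days(1577836800000))  # 18262
print(millis_to_days(-1))             # -1
```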
@HyukjinKwon what are your thoughts on this? Can we move this forward?
This allows us to take an average of date types. Under the hood, we take the average of the days since epoch and convert it back to a date. This requires the date to be cast to a double to perform the average.
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to data type mismatch: function average requires numeric types, not DateType;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at org.apache.spark.sql.
Retest this please. |
Test build #124570 has finished for PR 28754 at commit
|
Hmm, weird that the test is failing. I've just pulled in the latest master to retrigger the tests. |
Thanks, @Fokko . |
Retest this please. |
Test build #124574 has finished for PR 28754 at commit
|
The failure looks relevant. Could you take a look, @Fokko?
|
This is days since epoch
That looks relevant, @dongjoon-hyun. Thanks for pointing that out. I've removed the check, since casting from/to date is now allowed. The cast is covered by the newly added tests.
Retest this please. |
Test build #124611 has finished for PR 28754 at commit
|
Hi, @Fokko . |
BTW, it looks like most DBMSes don't allow this. I checked quickly, and it seems ANSI doesn't allow it either.
I've sent a mail to the devlist: http://apache-spark-developers-list.1001551.n3.nabble.com/Allow-average-out-of-a-Date-td30038.html |
@dongjoon-hyun @HyukjinKwon Given the discussion on the devlist, is this something that we can move forward? |
So will we only allow in |
I believe the conclusion is to manually do date average in the |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
This allows us to make an average out of DateType.
https://jira.apache.org/jira/browse/SPARK-10520
Under the hood, we take an average of the days since epoch, and convert it to date again. This requires the date object to be cast to a double to perform the average.
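The approach described above can be sketched outside Spark. This is a plain-Python illustration, not the Catalyst implementation in this PR: the function name `average_date` is hypothetical, and flooring a fractional mean to a whole day is an assumption, since the description does not say how the result is rounded.

```python
import math
from datetime import date, timedelta
from typing import Iterable, Optional

EPOCH = date(1970, 1, 1)  # days-since-epoch origin


def average_date(dates: Iterable[date]) -> Optional[date]:
    """Average dates by averaging their day counts as floats.

    Returns None for an empty input, similar to how a SQL average
    over no rows yields NULL.
    """
    day_counts = [float((d - EPOCH).days) for d in dates]  # "cast to double"
    if not day_counts:
        return None
    mean_days = sum(day_counts) / len(day_counts)
    # Assumption: round down to a whole day when converting back to a date.
    return EPOCH + timedelta(days=math.floor(mean_days))


print(average_date([date(2020, 1, 1), date(2020, 1, 3)]))  # 2020-01-02
print(average_date([date(2020, 1, 1), date(2020, 1, 2)]))  # 2020-01-01
print(average_date([]))                                    # None
```

The second call shows the effect of the rounding assumption: the mean of 18262.5 days is floored to 18262, so the result is the earlier of the two dates.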
To keep the PRs nice and small, I've split this out of #28554.
What changes were proposed in this pull request?
Why are the changes needed?
This allows us to take an average of a Date column. This is required to include dates in the summary.
Does this PR introduce any user-facing change?
Yes. If you cast a Date to a Double, it will return the days since epoch instead of null. This is required to compute the average of the days.
How was this patch tested?
Using unit tests and the data frame test suite.