New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17854][SQL] rand/randn allows null/long as input seed #15432
Conversation
860c177
to
7fa7db2
Compare
Test build #66737 has finished for PR 15432 at commit
|
Test build #66738 has finished for PR 15432 at commit
|
hm - maybe we should just cast any NullType input into some concrete type defined by an ExpectsInputTypes expression? |
@rxin yes, I just wanted to avoid changing a lot. Will try to fix it in that way (at least) to show how it actually look like. |
@rxin, I updated the codes and also updated the PR description. Could you please check if my change makes sense? |
Test build #66815 has finished for PR 15432 at commit
|
Oh, my bad. Will fix it up. |
Test build #66821 has finished for PR 15432 at commit
|
FWIW, the cases below are fine in MySQL too: mysql> SELECT RAND(CAST(2 AS UNSIGNED));
+---------------------------+
| RAND(CAST(2 AS UNSIGNED)) |
+---------------------------+
| 0.6555866465490187 |
+---------------------------+
1 row in set (0.00 sec)
mysql> SELECT RAND(CAST(NULL AS UNSIGNED));
+------------------------------+
| RAND(CAST(NULL AS UNSIGNED)) |
+------------------------------+
| 0.15522042769493574 |
+------------------------------+
1 row in set (0.00 sec) |
Not sure whether you realize it. Since this PR changes the input parm of Now, users can do something like select rand(cast(9 / 4 as int)) from src My suggestion is to always add new test cases whenever you made an external change like this. It can help reviewers decide whether this PR is good or not. |
Since you are running mysql, the output of rand(0) is the same as rand(null)? |
@gatorsmile Yes (for #15432 (comment)), it is and sure, I should add more tests. I actually intended to show how it looks like. |
I should have added |
@rxin How does it look like? Let me add some cases for constant folding @gatorsmile has shown, Just to make sure, in case of MySQL it works fine. mysql> SELECT RAND(CAST(1/1 AS UNSIGNED));
+-----------------------------+
| RAND(CAST(1/1 AS UNSIGNED)) |
+-----------------------------+
| 0.40540353712197724 |
+-----------------------------+
1 row in set (0.00 sec)
mysql> SELECT RAND(CAST(1/152 AS UNSIGNED));
+-------------------------------+
| RAND(CAST(1/152 AS UNSIGNED)) |
+-------------------------------+
| 0.15522042769493574 |
+-------------------------------+
1 row in set (0.00 sec) |
I have a very general comment about the work you are working. Like what we are doing for the |
@gatorsmile Thanks for your feedback. Could you please be a little bit more specific? Do you expect some researches on the argument of If so, it'd be nicer if we have such reseaches on other JIRAs or PRs so that I (and other contributoers) can refer when we make a change on such thing. Is there a great example we already have maybe? |
(Oh, I am making a comment via my phone. Sorry for occasional closing and reopening here..) |
Let me show you an example: This is the official document of
When we defining the expected behavior, we need to consider how most the mainstream data store behave. After we delivering the fix, it is hard to change it again. |
That is a great reference. However, is this the function described in a standard? I guess it is different for each implementation of database. For example,
MySQL treats it as 0 rather than returning In case of PostgreSQL, it seems there is both functions for this, I think I have checked other examples enough. Do we usually have such explanations and tests of all the DBMS, Oracle, MySQL, SQL Server, Hive, DB2, Informix and PostgresSQL and mentions in ANSI standard in PRs and JIRAs? It can be problematic if we don't comply the standard which all other implementations follow but I think it'd be fine if other databases have different implementations. I am sure I am taking every look for other PRs time to time and trying to make mine sensible but I don't think we always have references from all other DBMS and explanations from ANSI standard. It is hard to change it again and that is why I am asking to review. |
Initially, this JIRA was only handling |
Unfortunately, not all the things have a standard to follow. That is why I suggested you to do a research about it. Like Oracle, it does not have such a function in their SQL-function list: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions001.htm Since you are doing the change in |
Strictly, the JIRA describes handling
Also, I would like to add the edge cases here but I'd like to avoid PR is being hold. As not all the things have a standard to follow, we can define the behaviour here. I don't have access to Oracle and DB2. Do you think Hive, PostgreSQL and MySQL (+Oracle and DB2 you just gave) examples are not enough? |
At first, we do not strictly follow Hive. You can easily find many in Spark. I do not think this is an urgent JIRA, right? Like what @srowen replied in the JIRA, he does not think this is a bug. The existing output message looks reasonable to me too.
Setting the seed as DB2 and Oracle have free versions to download. You can easily install the docker versions. You also can google their documentation. What we need to do at first is to do an investigation to save the times of all the other reviewers; otherwise, they have to do it too. |
Not urgent but in my experience such PRs have been being held. So, I am trying to fix the problem specified in the JIRA only rather than fixing others together. @srowen said "I'm not even sure that's a bug.." but "... reasonable to try to follow it.". At least, all the implementations of DB2, MySQL, Hive and PostgreSQL do not throw an exception but it defines its own behaviour and it'd be sensible to follow the majority (two of identified examples, Hive and MySQL) or define our own behaviour. |
94322cd
to
a523302
Compare
@@ -1728,4 +1728,29 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { | |||
val df = spark.createDataFrame(spark.sparkContext.makeRDD(rows), schema) | |||
assert(df.filter($"array1" === $"array2").count() == 1) | |||
} | |||
|
|||
test("SPARK-17854: rand/randn allows null and long as input seed") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this into sql query test suite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
Test build #68108 has finished for PR 15432 at commit
|
Test build #68109 has finished for PR 15432 at commit
|
Test build #68115 has finished for PR 15432 at commit
|
|
||
def this() = this(Utils.random.nextLong()) | ||
def this() = this(Literal(Utils.random.nextLong())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not specifying the data type here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The complier seems complaining if we specify the return type in def this
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, do you mean the type for literal for example as below?
Literal(Utils.random.nextLong(), LongType)
If you think it is beneficial because it at least does not do the type dispatch once, will fix here. Also, I can sweep the usages in functions.scala
in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I think we should explicitly specify the type, if possible. This is my personal preference.
Not sure whether it worths a new PR to change all of them in functions.scala
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, makes sense. I will fix them here first.
@@ -87,6 +87,10 @@ case class Rand(seed: Long) extends RDG { | |||
} | |||
} | |||
|
|||
object Rand { | |||
def apply(seed: Long): Rand = Rand(Literal(seed)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same here?
|
||
def this() = this(Utils.random.nextLong()) | ||
def this() = this(Literal(Utils.random.nextLong())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same here?
LGTM except a few minor comments. |
Thanks @gatorsmile! |
Test build #68185 has finished for PR 15432 at commit
|
retest this please |
Test build #68196 has finished for PR 15432 at commit
|
retest this please |
Test build #68201 has finished for PR 15432 at commit
|
## What changes were proposed in this pull request? This PR proposes `rand`/`randn` accept `null` as input in Scala/SQL and `LongType` as input in SQL. In this case, it treats the values as `0`. So, this PR includes both changes below: - `null` support It seems MySQL also accepts this. ``` sql mysql> select rand(0); +---------------------+ | rand(0) | +---------------------+ | 0.15522042769493574 | +---------------------+ 1 row in set (0.00 sec) mysql> select rand(NULL); +---------------------+ | rand(NULL) | +---------------------+ | 0.15522042769493574 | +---------------------+ 1 row in set (0.00 sec) ``` and also Hive does according to [HIVE-14694](https://issues.apache.org/jira/browse/HIVE-14694) So the codes below: ``` scala spark.range(1).selectExpr("rand(null)").show() ``` prints.. **Before** ``` Input argument to rand must be an integer literal.;; line 1 pos 0 org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:444) ``` **After** ``` +-----------------------+ |rand(CAST(NULL AS INT))| +-----------------------+ | 0.13385709732307427| +-----------------------+ ``` - `LongType` support in SQL. In addition, it make the function allows to take `LongType` consistently within Scala/SQL. In more details, the codes below: ``` scala spark.range(1).select(rand(1), rand(1L)).show() spark.range(1).selectExpr("rand(1)", "rand(1L)").show() ``` prints.. **Before** ``` +------------------+------------------+ | rand(1)| rand(1)| +------------------+------------------+ |0.2630967864682161|0.2630967864682161| +------------------+------------------+ Input argument to rand must be an integer literal.;; line 1 pos 0 org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465) at ``` **After** ``` +------------------+------------------+ | rand(1)| rand(1)| +------------------+------------------+ |0.2630967864682161|0.2630967864682161| +------------------+------------------+ +------------------+------------------+ | rand(1)| rand(1)| +------------------+------------------+ |0.2630967864682161|0.2630967864682161| +------------------+------------------+ ``` ## How was this patch tested? Unit tests in `DataFrameSuite.scala` and `RandomSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15432 from HyukjinKwon/SPARK-17854. (cherry picked from commit 340f09d) Signed-off-by: Sean Owen <sowen@cloudera.com>
Merged to master/2.1 |
## What changes were proposed in this pull request? This PR proposes `rand`/`randn` accept `null` as input in Scala/SQL and `LongType` as input in SQL. In this case, it treats the values as `0`. So, this PR includes both changes below: - `null` support It seems MySQL also accepts this. ``` sql mysql> select rand(0); +---------------------+ | rand(0) | +---------------------+ | 0.15522042769493574 | +---------------------+ 1 row in set (0.00 sec) mysql> select rand(NULL); +---------------------+ | rand(NULL) | +---------------------+ | 0.15522042769493574 | +---------------------+ 1 row in set (0.00 sec) ``` and also Hive does according to [HIVE-14694](https://issues.apache.org/jira/browse/HIVE-14694) So the codes below: ``` scala spark.range(1).selectExpr("rand(null)").show() ``` prints.. **Before** ``` Input argument to rand must be an integer literal.;; line 1 pos 0 org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:444) ``` **After** ``` +-----------------------+ |rand(CAST(NULL AS INT))| +-----------------------+ | 0.13385709732307427| +-----------------------+ ``` - `LongType` support in SQL. In addition, it make the function allows to take `LongType` consistently within Scala/SQL. In more details, the codes below: ``` scala spark.range(1).select(rand(1), rand(1L)).show() spark.range(1).selectExpr("rand(1)", "rand(1L)").show() ``` prints.. **Before** ``` +------------------+------------------+ | rand(1)| rand(1)| +------------------+------------------+ |0.2630967864682161|0.2630967864682161| +------------------+------------------+ Input argument to rand must be an integer literal.;; line 1 pos 0 org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465) at ``` **After** ``` +------------------+------------------+ | rand(1)| rand(1)| +------------------+------------------+ |0.2630967864682161|0.2630967864682161| +------------------+------------------+ +------------------+------------------+ | rand(1)| rand(1)| +------------------+------------------+ |0.2630967864682161|0.2630967864682161| +------------------+------------------+ ``` ## How was this patch tested? Unit tests in `DataFrameSuite.scala` and `RandomSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#15432 from HyukjinKwon/SPARK-17854.
What changes were proposed in this pull request?
This PR proposes
rand
/randn
acceptnull
as input in Scala/SQL andLongType
as input in SQL. In this case, it treats the values as0
.So, this PR includes both changes below:
null
supportIt seems MySQL also accepts this.
and also Hive does according to HIVE-14694
So the codes below:
prints..
Before
After
LongType
support in SQL.In addition, it make the function allows to take
LongType
consistently within Scala/SQL.In more details, the codes below:
prints..
Before
After
How was this patch tested?
Unit tests in
DataFrameSuite.scala
andRandomSuite.scala
.