[SPARK-17854][SQL] rand/randn allows null/long as input seed #15432

HyukjinKwon · 2016-10-11T09:29:02Z

What changes were proposed in this pull request?

This PR proposes that rand/randn accept null as the input seed in Scala/SQL and LongType as the input seed in SQL. A null seed is treated as 0.

So, this PR includes both changes below:

null support

It seems MySQL also accepts this.

mysql> select rand(0);
+---------------------+
| rand(0)             |
+---------------------+
| 0.15522042769493574 |
+---------------------+
1 row in set (0.00 sec)

mysql> select rand(NULL);
+---------------------+
| rand(NULL)          |
+---------------------+
| 0.15522042769493574 |
+---------------------+
1 row in set (0.00 sec)

Hive does as well, according to HIVE-14694.

So the code below:

spark.range(1).selectExpr("rand(null)").show()

prints:

Before

  Input argument to rand must be an integer literal.;; line 1 pos 0
org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:444)

After

  +-----------------------+
  |rand(CAST(NULL AS INT))|
  +-----------------------+
  |    0.13385709732307427|
  +-----------------------+
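
As a rough illustration of the seed handling described in this section, the sketch below normalises a seed to a Long and treats null as 0. It is not the code in this PR: `NullSeedSketch`, `resolveSeed` and the error text are hypothetical, and `java.util.Random` is only a stand-in generator, so the numbers it produces will not match the tables above.

```scala
import java.util.Random

object NullSeedSketch {
  // Hypothetical helper: normalise a user-supplied seed to a Long.
  // null becomes 0L (the behaviour proposed in this PR), an Int is widened,
  // and a Long is passed through unchanged.
  def resolveSeed(seed: Any): Long = seed match {
    case null    => 0L
    case i: Int  => i.toLong
    case l: Long => l
    case other   =>
      // Illustrative message only; not the message Spark emits.
      throw new IllegalArgumentException(s"unsupported seed: $other")
  }

  // java.util.Random is an assumption standing in for the real generator.
  def rand(seed: Any): Double = new Random(resolveSeed(seed)).nextDouble()
}
```

In this model `NullSeedSketch.rand(null)` and `NullSeedSketch.rand(0)` return the same value, and so do `rand(1)` and `rand(1L)`.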

LongType support in SQL.

In addition, this makes the function accept LongType consistently across Scala and SQL.

In more detail, the code below:

spark.range(1).select(rand(1), rand(1L)).show()
spark.range(1).selectExpr("rand(1)", "rand(1L)").show()

prints:

Before

+------------------+------------------+
|           rand(1)|           rand(1)|
+------------------+------------------+
|0.2630967864682161|0.2630967864682161|
+------------------+------------------+


Input argument to rand must be an integer literal.;; line 1 pos 0
org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465)
at

After

+------------------+------------------+
|           rand(1)|           rand(1)|
+------------------+------------------+
|0.2630967864682161|0.2630967864682161|
+------------------+------------------+

+------------------+------------------+
|           rand(1)|           rand(1)|
+------------------+------------------+
|0.2630967864682161|0.2630967864682161|
+------------------+------------------+

How was this patch tested?

Unit tests in DataFrameSuite.scala and RandomSuite.scala.

SparkQA · 2016-10-11T11:41:43Z

Test build #66737 has finished for PR 15432 at commit 860c177.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-11T11:50:00Z

Test build #66738 has finished for PR 15432 at commit 7fa7db2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-10-11T18:32:44Z

hm - maybe we should just cast any NullType input into some concrete type defined by an ExpectsInputTypes expression?
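
One way to picture this suggestion is a small rewrite rule that wraps a NullType input in a cast to the type the expression expects. The sketch below uses tiny stand-in classes, not Catalyst's real ones: `NullCoercionSketch`, `coerceNull`, `sql` and `typeName` are hypothetical, and the expected input type is assumed to be an integer so that the rendered form matches the `rand(CAST(NULL AS INT))` header shown in the PR description.

```scala
object NullCoercionSketch {
  // Minimal stand-ins for Catalyst concepts; none of these are Spark classes.
  sealed trait DataType
  case object NullType extends DataType
  case object IntegerType extends DataType
  case object LongType extends DataType
  case object DoubleType extends DataType

  sealed trait Expr { def dataType: DataType }
  case class Literal(value: Any, dataType: DataType) extends Expr
  case class Cast(child: Expr, dataType: DataType) extends Expr
  case class Rand(child: Expr) extends Expr {
    val dataType: DataType = DoubleType
    // Assumption: the type a null seed should be cast to.
    val inputType: DataType = IntegerType
  }

  // Rewrite rule: an untyped NULL input is wrapped in a cast to the expected type.
  def coerceNull(e: Expr): Expr = e match {
    case r @ Rand(Literal(null, NullType)) => r.copy(child = Cast(r.child, r.inputType))
    case other                             => other
  }

  def typeName(t: DataType): String = t match {
    case NullType    => "NULL"
    case IntegerType => "INT"
    case LongType    => "BIGINT"
    case DoubleType  => "DOUBLE"
  }

  // Renders an expression roughly the way a column header would show it.
  def sql(e: Expr): String = e match {
    case Literal(null, _) => "NULL"
    case Literal(v, _)    => v.toString
    case Cast(c, t)       => s"CAST(${sql(c)} AS ${typeName(t)})"
    case Rand(c)          => s"rand(${sql(c)})"
  }
}
```

Applying `coerceNull` to `Rand(Literal(null, NullType))` and rendering the result with `sql` gives `rand(CAST(NULL AS INT))`, while an already typed seed is left untouched.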

HyukjinKwon · 2016-10-11T22:48:13Z

@rxin Yes, I just wanted to avoid changing a lot. I will try to fix it that way, at least to show what it actually looks like.

HyukjinKwon · 2016-10-12T12:37:24Z

@rxin, I updated the code and the PR description. Could you please check whether my change makes sense?

SparkQA · 2016-10-12T14:01:33Z

Test build #66815 has finished for PR 15432 at commit 6f8f3f3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class RDG extends UnaryExpression with ExpectsInputTypes with Nondeterministic
- case class Rand(child: Expression) extends RDG
- case class Randn(child: Expression) extends RDG

HyukjinKwon · 2016-10-12T14:19:53Z

Oh, my bad. Will fix it up.

SparkQA · 2016-10-12T16:40:33Z

Test build #66821 has finished for PR 15432 at commit a99f674.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-10-12T17:37:45Z

FWIW, the cases below are fine in MySQL too:

mysql> SELECT RAND(CAST(2 AS UNSIGNED));
+---------------------------+
| RAND(CAST(2 AS UNSIGNED)) |
+---------------------------+
|        0.6555866465490187 |
+---------------------------+
1 row in set (0.00 sec)

mysql> SELECT RAND(CAST(NULL AS UNSIGNED));
+------------------------------+
| RAND(CAST(NULL AS UNSIGNED)) |
+------------------------------+
|          0.15522042769493574 |
+------------------------------+
1 row in set (0.00 sec)

gatorsmile · 2016-10-12T18:07:47Z

Not sure whether you realize it, but since this PR changes the input parameter of Rand and Randn, you are also changing the externally supported behavior.

Now, users can do something like:

select rand(cast(9 / 4 as int)) from src

My suggestion is to always add new test cases whenever you make an external change like this. It helps reviewers decide whether the PR is good or not.
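
The concern about non-literal seeds such as `cast(9 / 4 as int)` can be sketched as a constant-folding check: a seed is accepted only if it can be evaluated without looking at any row. Everything below is a hypothetical toy model, not Spark code. `SeedFoldingSketch`, `foldable`, `eval` and `seedOf` are invented names, `Div` is assumed to be fractional division, and `CastToInt` is assumed to truncate.

```scala
object SeedFoldingSketch {
  sealed trait Expr
  case class IntLit(v: Int) extends Expr
  case class Div(left: Expr, right: Expr) extends Expr // assumed fractional division
  case class CastToInt(child: Expr) extends Expr       // assumed to truncate toward zero
  case class Column(name: String) extends Expr         // per-row value, never constant

  // An expression is foldable when it contains no column references.
  def foldable(e: Expr): Boolean = e match {
    case IntLit(_)    => true
    case Div(l, r)    => foldable(l) && foldable(r)
    case CastToInt(c) => foldable(c)
    case Column(_)    => false
  }

  // Evaluates a foldable expression; columns cannot be evaluated without a row.
  def eval(e: Expr): Double = e match {
    case IntLit(v)    => v.toDouble
    case Div(l, r)    => eval(l) / eval(r)
    case CastToInt(c) => eval(c).toInt.toDouble
    case Column(n)    => throw new IllegalStateException(s"column $n needs a row")
  }

  // Returns the folded seed, or an error for seeds that vary per row.
  def seedOf(e: Expr): Either[String, Long] =
    if (foldable(e)) Right(eval(e).toLong)
    else Left("seed must be a constant expression")
}
```

Under these assumptions `seedOf(CastToInt(Div(IntLit(9), IntLit(4))))` yields `Right(2)`, and a seed that refers to a column is rejected. A test for the new external behavior would pin down exactly this kind of case.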

gatorsmile · 2016-10-12T18:09:05Z

Since you are running MySQL: is the output of rand(0) the same as that of rand(null)?

HyukjinKwon · 2016-10-12T18:21:26Z

@gatorsmile Yes (for #15432 (comment)), it is. And sure, I should add more tests. I actually intended to show what it looks like.

HyukjinKwon · 2016-10-12T18:22:56Z

Maybe I should have added [WIP]. If it looks okay in general, I will follow your suggestions and also add the case you gave.

HyukjinKwon · 2016-10-13T01:23:50Z

@rxin How does it look? If it looks okay, I will also add some cases for the constant folding that @gatorsmile has shown, as well as RAND(CAST(2 AS INT)) and RAND(CAST(NULL AS INT)).

Just to make sure: in MySQL this works fine.

mysql> SELECT RAND(CAST(1/1 AS UNSIGNED));
+-----------------------------+
| RAND(CAST(1/1 AS UNSIGNED)) |
+-----------------------------+
|         0.40540353712197724 |
+-----------------------------+
1 row in set (0.00 sec)

mysql> SELECT RAND(CAST(1/152 AS UNSIGNED));
+-------------------------------+
| RAND(CAST(1/152 AS UNSIGNED)) |
+-------------------------------+
|           0.15522042769493574 |
+-------------------------------+
1 row in set (0.00 sec)

gatorsmile · 2016-10-13T01:43:40Z

I have a very general comment about the work you are doing. For the LIKE operation, we investigated the ANSI standard and all the mainstream data stores, including Oracle, MySQL, SQL Server, Hive, DB2, Informix and Postgres. Could you do a similar thing here? Thank you!

HyukjinKwon · 2016-10-13T02:05:12Z

@gatorsmile Thanks for your feedback. Could you please be a little more specific? Do you expect some research on the argument of the rand function in a standard, plus a check of the DBMSs you listed above?

If so, it would be nicer to have such research in other JIRAs or PRs so that I (and other contributors) can refer to it when we make a change like this. Is there a good example we already have?

HyukjinKwon · 2016-10-13T02:06:10Z

(Oh, I am commenting from my phone. Sorry for the occasional closing and reopening here.)

gatorsmile · 2016-10-13T03:02:38Z

Let me show you an example:
https://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_bif_rand.html

This is the official documentation of rand in DB2 for z/OS. It describes the behavior of rand as follows:

If numeric-expression is specified, it is used as the seed value. The argument must be an expression that returns a value of a built-in integer data type (SMALLINT or INTEGER). The value must be between 0 and 2,147,483,646.
The result can be null; if the argument is null, the result is the null value.
RAND(0) is processed the same as RAND().

When defining the expected behavior, we need to consider how most of the mainstream data stores behave. After we deliver the fix, it is hard to change it again.
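
To make the difference between the two documented behaviors concrete, the sketch below models a null seed as `None` and contrasts the DB2 rule quoted above (null in, null out) with the MySQL behavior shown in the PR description (null treated as 0). It is only an illustration under stated assumptions: `NullSeedPolicies`, `nullPropagating` and `nullAsZero` are hypothetical names, and `java.util.Random` stands in for each system's generator, so no actual DB2 or MySQL output is reproduced.

```scala
import java.util.Random

object NullSeedPolicies {
  // Stand-in generator; an assumption, not what any of these systems use.
  private def draw(seed: Long): Double = new Random(seed).nextDouble()

  // DB2-style rule as quoted: a null argument gives a null result (None here).
  def nullPropagating(seed: Option[Long]): Option[Double] = seed.map(draw)

  // MySQL-style behavior: a null seed behaves like seed 0.
  def nullAsZero(seed: Option[Long]): Option[Double] =
    Some(draw(seed.getOrElse(0L)))
}
```

The two policies agree for every non-null seed and differ only on null, which is the case this PR has to decide.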

HyukjinKwon · 2016-10-13T03:31:30Z

That is a great reference. However, is this function described in a standard? I guess it differs for each database implementation. For example,

The result can be null; if the argument is null, the result is the null value.

MySQL treats it as 0 rather than returning a null value. Also, I gave references for both MySQL and Hive in the PR description. Can we just define the behaviour here? Do we have a target DBMS to follow? If so, it would be great if this were mentioned in a JIRA (am I missing a JIRA we already have?). As I recall, it is usually Hive, PostgreSQL and MySQL.

In the case of PostgreSQL, it seems there are two functions for this, random() and setseed(). This works differently from MySQL and also from DB2 (judging from the comment you left). So, I left it out here.

I think I have checked enough other examples. Do we usually have such explanations and tests for all the DBMSs (Oracle, MySQL, SQL Server, Hive, DB2, Informix and PostgreSQL), plus mentions of the ANSI standard, in PRs and JIRAs?

It can be problematic if we don't comply with a standard that all other implementations follow, but I think it is fine when other databases have different implementations.

I do look at other PRs from time to time and try to make mine sensible, but I don't think we always have references from all other DBMSs and explanations from the ANSI standard.

It is hard to change it again, and that is why I am asking for a review.

HyukjinKwon · 2016-10-13T03:37:27Z

Initially, this JIRA only handled null as a seed. If you are both worried about the change here, I would like to make the PR smaller, as initially suggested.

gatorsmile · 2016-10-13T04:04:49Z

Unfortunately, not everything has a standard to follow. That is why I suggested that you do some research on it. Oracle, for example, does not have such a function in its SQL function list: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions001.htm

Since you are changing rand, I think you can check whether the existing rand behaves as expected and add the missing test cases if needed. This JIRA only tries to cover an edge case of the seed number. Why not check whether we handle all the cases appropriately? Then we do not need to submit more small fixes for rand, right?

HyukjinKwon · 2016-10-13T04:10:22Z

Strictly speaking, the JIRA describes handling null, and we might not have to generalize the cases further.

it will failed when do select rand(null)

Also, I would like to add the edge cases here, but I would like to avoid the PR being held up.

As not everything has a standard to follow, we can define the behaviour here. I don't have access to Oracle and DB2. Do you think the Hive, PostgreSQL and MySQL examples (plus the Oracle and DB2 ones you just gave) are not enough?

gatorsmile · 2016-10-13T04:19:24Z

First, we do not strictly follow Hive. You can easily find many such cases in Spark. I do not think this is an urgent JIRA, right? As @srowen replied in the JIRA, he does not think this is a bug. The existing output message looks reasonable to me too.

Input argument to rand must be an integer literal.;; line 1 pos 0

Setting the seed as null also looks weird to me.

DB2 and Oracle have free versions to download. You can easily install the Docker versions. You can also google their documentation. What we need to do first is an investigation, to save the time of all the other reviewers; otherwise, they have to do it too.

HyukjinKwon · 2016-10-13T04:29:05Z

It is not urgent, but in my experience such PRs tend to be held up. So I am trying to fix only the problem specified in the JIRA rather than fixing others together.

@srowen said "I'm not even sure that's a bug.." but "... reasonable to try to follow it.".

At least, none of the implementations in DB2, MySQL, Hive and PostgreSQL throws an exception; each defines its own behaviour. It would be sensible to follow the majority (two of the identified examples, Hive and MySQL) or to define our own behaviour.

rxin · 2016-11-04T06:02:46Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

@@ -1728,4 +1728,29 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
    val df = spark.createDataFrame(spark.sparkContext.makeRDD(rows), schema)
    assert(df.filter($"array1" === $"array2").count() == 1)
  }
+
+  test("SPARK-17854: rand/randn allows null and long as input seed") {


move this into sql query test suite

SparkQA · 2016-11-04T06:30:17Z

Test build #68108 has finished for PR 15432 at commit a523302.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-04T08:03:50Z

Test build #68109 has finished for PR 15432 at commit b432355.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-04T09:49:41Z

Test build #68115 has finished for PR 15432 at commit 160ea54.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-11-05T04:47:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala


-  def this() = this(Utils.random.nextLong())
+  def this() = this(Literal(Utils.random.nextLong()))


Why not specify the data type here?

The compiler seems to complain if we specify the return type in def this.

Oh, do you mean the type for the literal, for example as below?

Literal(Utils.random.nextLong(), LongType)

If you think it is beneficial because it at least avoids doing the type dispatch once, I will fix it here. I can also sweep the usages in functions.scala in another PR.

Yeah. I think we should explicitly specify the type, if possible. This is my personal preference.

Not sure whether it is worth a new PR to change all of them in functions.scala.

Yes, makes sense. I will fix them here first.
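
The "type dispatch" mentioned in this thread can be pictured with a toy literal class: a one-argument factory has to inspect the runtime class of the value to pick a type, whereas the two-argument form states the type up front. The classes below are hypothetical stand-ins in `LiteralTypeSketch`, not Catalyst's `Literal`, and only model the idea.

```scala
object LiteralTypeSketch {
  sealed trait DataType
  case object IntegerType extends DataType
  case object LongType extends DataType

  // Two-argument form: the caller states the type explicitly.
  case class Literal(value: Any, dataType: DataType)

  object Literal {
    // One-argument form: the type is inferred by matching on the runtime value.
    def apply(v: Any): Literal = v match {
      case i: Int  => Literal(i, IntegerType)
      case l: Long => Literal(l, LongType)
      case other   => throw new IllegalArgumentException(s"unsupported literal: $other")
    }
  }
}
```

In this model both forms build the same value, for example `Literal(42L)` equals `Literal(42L, LongType)`; the explicit form just skips the pattern match and documents the intended type at the call site.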

gatorsmile · 2016-11-05T04:47:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala

@@ -87,6 +87,10 @@ case class Rand(seed: Long) extends RDG {
  }
 }

+object Rand {
+  def apply(seed: Long): Rand = Rand(Literal(seed))


The same here?

gatorsmile · 2016-11-05T04:47:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala


-  def this() = this(Utils.random.nextLong())
+  def this() = this(Literal(Utils.random.nextLong()))


The same here?

gatorsmile · 2016-11-05T04:51:58Z

LGTM except a few minor comments.

HyukjinKwon · 2016-11-05T05:49:15Z

Thanks @gatorsmile!

SparkQA · 2016-11-05T07:19:28Z

Test build #68185 has finished for PR 15432 at commit 9b9a49f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-11-05T08:36:40Z

retest this please

SparkQA · 2016-11-05T09:34:17Z

Test build #68196 has finished for PR 15432 at commit 9b9a49f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-11-05T09:37:37Z

retest this please

SparkQA · 2016-11-05T12:29:20Z

Test build #68201 has finished for PR 15432 at commit 9b9a49f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Author: hyukjinkwon <gurwls223@gmail.com>
Closes #15432 from HyukjinKwon/SPARK-17854.
(cherry picked from commit 340f09d)
Signed-off-by: Sean Owen <sowen@cloudera.com>

srowen · 2016-11-06T14:12:05Z

Merged to master/2.1


HyukjinKwon force-pushed the SPARK-17854 branch from 860c177 to 7fa7db2 Compare October 11, 2016 09:32

HyukjinKwon closed this Oct 11, 2016

HyukjinKwon reopened this Oct 11, 2016

HyukjinKwon changed the title ~~[SPARK-17854][SQL] rand/randn allows null as input seed~~ [SPARK-17854][SQL] rand/randn allows null/long as input seed Oct 12, 2016

HyukjinKwon closed this Oct 13, 2016

HyukjinKwon reopened this Oct 13, 2016

HyukjinKwon added 6 commits November 4, 2016 13:32

Add test cases for constant folding and improve documentation

fec5f42

Add some more cases for exceptions

7bc0a19

Improve documentation

9c56094

Improve exception message and documentation

30179d8

Fix the test too

3283d3a

Add examples for null as an argument

a523302

HyukjinKwon force-pushed the SPARK-17854 branch from 94322cd to a523302 Compare November 4, 2016 05:31

Fix the tests accordingly

b432355

rxin requested changes Nov 4, 2016

Move the tests into sql query test suit

160ea54

gatorsmile reviewed Nov 5, 2016

Specify the date type for literals

9b9a49f

asfgit closed this in 340f09d Nov 6, 2016

HyukjinKwon deleted the SPARK-17854 branch January 2, 2018 03:43

