[SPARK-17854][SQL] rand/randn allows null/long as input seed #15432

Closed
wants to merge 12 commits into from


HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Oct 11, 2016

What changes were proposed in this pull request?

This PR proposes that rand/randn accept null as the input seed in Scala/SQL and LongType as the input seed in SQL. A null seed is treated as 0.

So, this PR includes both changes below:

  • null support

    It seems MySQL also accepts this.

    mysql> select rand(0);
    +---------------------+
    | rand(0)             |
    +---------------------+
    | 0.15522042769493574 |
    +---------------------+
    1 row in set (0.00 sec)
    
    mysql> select rand(NULL);
    +---------------------+
    | rand(NULL)          |
    +---------------------+
    | 0.15522042769493574 |
    +---------------------+
    1 row in set (0.00 sec)

    Hive also accepts this, according to HIVE-14694.

    So the code below:

    spark.range(1).selectExpr("rand(null)").show()

    prints:

    Before

      Input argument to rand must be an integer literal.;; line 1 pos 0
    org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0
    at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465)
    at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:444)
    

    After

      +-----------------------+
      |rand(CAST(NULL AS INT))|
      +-----------------------+
      |    0.13385709732307427|
      +-----------------------+
    
  • LongType support in SQL.

    In addition, this makes the function accept LongType consistently across Scala and SQL.

    In more detail, the code below:

    spark.range(1).select(rand(1), rand(1L)).show()
    spark.range(1).selectExpr("rand(1)", "rand(1L)").show()

    prints:

    Before

    +------------------+------------------+
    |           rand(1)|           rand(1)|
    +------------------+------------------+
    |0.2630967864682161|0.2630967864682161|
    +------------------+------------------+
    
    
    Input argument to rand must be an integer literal.;; line 1 pos 0
    org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0
    at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465)
    at
    

    After

    +------------------+------------------+
    |           rand(1)|           rand(1)|
    +------------------+------------------+
    |0.2630967864682161|0.2630967864682161|
    +------------------+------------------+
    
    +------------------+------------------+
    |           rand(1)|           rand(1)|
    +------------------+------------------+
    |0.2630967864682161|0.2630967864682161|
    +------------------+------------------+
    

How was this patch tested?

Unit tests in DataFrameSuite.scala and RandomSuite.scala.
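
The seed rules in the description above can be sketched outside Spark as a small normalization step. This is only an illustration under stated assumptions, not the code in this PR: `SeedSketch` and `normalizeSeed` are hypothetical names, and the error text is made up for the example.

```scala
// Hypothetical sketch of the seed rules in the PR description; not Spark code.
object SeedSketch {
  // null is treated as seed 0, Int seeds are widened to Long, Long seeds pass through.
  def normalizeSeed(seed: Any): Long = seed match {
    case null    => 0L
    case i: Int  => i.toLong
    case l: Long => l
    case other   =>
      // Illustrative message only; the real AnalysisException text may differ.
      throw new IllegalArgumentException(s"Unsupported seed: $other")
  }
}
```

With this sketch, `SeedSketch.normalizeSeed(null)` and `SeedSketch.normalizeSeed(0)` resolve to the same seed, which mirrors the MySQL transcripts where `rand(NULL)` and `rand(0)` return the same value.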

@SparkQA

SparkQA commented Oct 11, 2016

Test build #66737 has finished for PR 15432 at commit 860c177.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 11, 2016

Test build #66738 has finished for PR 15432 at commit 7fa7db2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Oct 11, 2016

hm - maybe we should just cast any NullType input into some concrete type defined by an ExpectsInputTypes expression?
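
The idea in this comment, casting an untyped NULL input to a concrete type that the expression declares, can be sketched with a toy model. Everything here is an assumption for illustration: `DType`, `Lit`, `Cast` and `coerceNull` are hypothetical stand-ins, not Spark's `DataType`, `Literal`, `Cast` or its type-coercion rules.

```scala
// Toy model only: these types imitate the idea, they are not Spark's classes.
object NullCoercionSketch {
  sealed trait DType
  case object NullT extends DType
  case object IntT extends DType
  case object LongT extends DType

  sealed trait Expr { def dtype: DType }
  final case class Lit(value: Any, dtype: DType) extends Expr
  final case class Cast(child: Expr, dtype: DType) extends Expr

  // If the input is an untyped NULL, wrap it in a cast to the first accepted
  // type; otherwise leave the input untouched.
  def coerceNull(input: Expr, accepted: Seq[DType]): Expr =
    accepted.headOption match {
      case Some(target) if input.dtype == NullT => Cast(input, target)
      case _                                    => input
    }
}
```

Under this model, a NULL literal passed to an expression that accepts `IntT` first becomes `Cast(Lit(null, NullT), IntT)`, which is the same shape as the `rand(CAST(NULL AS INT))` column name in the "After" output of the PR description.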

@HyukjinKwon
Member Author

@rxin Yes, I just wanted to avoid changing too much. I will try fixing it that way, at least to show what it would actually look like.

@HyukjinKwon HyukjinKwon reopened this Oct 11, 2016
@HyukjinKwon HyukjinKwon changed the title [SPARK-17854][SQL] rand/randn allows null as input seed [SPARK-17854][SQL] rand/randn allows null/long as input seed Oct 12, 2016
@HyukjinKwon
Member Author

@rxin, I updated the code and the PR description. Could you please check whether my change makes sense?

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66815 has finished for PR 15432 at commit 6f8f3f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class RDG extends UnaryExpression with ExpectsInputTypes with Nondeterministic
    • case class Rand(child: Expression) extends RDG
    • case class Randn(child: Expression) extends RDG

@HyukjinKwon
Member Author

Oh, my bad. Will fix it up.

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66821 has finished for PR 15432 at commit a99f674.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 12, 2016

FWIW, the cases below are fine in MySQL too:

mysql> SELECT RAND(CAST(2 AS UNSIGNED));
+---------------------------+
| RAND(CAST(2 AS UNSIGNED)) |
+---------------------------+
|        0.6555866465490187 |
+---------------------------+
1 row in set (0.00 sec)

mysql> SELECT RAND(CAST(NULL AS UNSIGNED));
+------------------------------+
| RAND(CAST(NULL AS UNSIGNED)) |
+------------------------------+
|          0.15522042769493574 |
+------------------------------+
1 row in set (0.00 sec)

@gatorsmile
Member

Not sure whether you realize it, but since this PR changes the input parameter of Rand and Randn, it also changes what is supported externally.

Now, users can do something like

select rand(cast(9 / 4 as int)) from src

My suggestion is to always add new test cases whenever you make an external change like this. It helps reviewers decide whether the PR is good or not.
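
To make the point concrete, here is a toy evaluator for a constant seed expression such as `cast(9 / 4 as int)`. It is a sketch under assumptions, not Spark's constant folding: `SeedExpr`, `eval` and `seedOf` are hypothetical names, division is modeled as floating-point, the cast truncates toward zero, and a NULL result falls back to seed 0 as the PR description proposes.

```scala
// Toy constant evaluation for seed expressions; not Spark's optimizer.
object FoldableSeedSketch {
  sealed trait SeedExpr
  final case class Num(value: Double) extends SeedExpr
  case object NullLit extends SeedExpr
  final case class Div(left: SeedExpr, right: SeedExpr) extends SeedExpr
  final case class CastToInt(child: SeedExpr) extends SeedExpr

  // None models SQL NULL; NULL operands propagate through Div and CastToInt.
  def eval(e: SeedExpr): Option[Double] = e match {
    case Num(v)       => Some(v)
    case NullLit      => None
    case Div(l, r)    => for (a <- eval(l); b <- eval(r)) yield a / b
    case CastToInt(c) => eval(c).map(d => d.toInt.toDouble) // truncates toward zero
  }

  // A NULL seed falls back to 0, following the PR description.
  def seedOf(e: SeedExpr): Long = eval(e).map(_.toLong).getOrElse(0L)
}
```

In this model `seedOf(CastToInt(Div(Num(9), Num(4))))` evaluates 9 / 4 to 2.25 and truncates it to seed 2, while `seedOf(CastToInt(NullLit))` yields seed 0. Cases like these are the kind of input that new tests would need to cover.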

@gatorsmile
Member

Since you are running MySQL: is the output of rand(0) the same as that of rand(null)?

@HyukjinKwon
Member Author

@gatorsmile Yes (for #15432 (comment)), it is. And sure, I should add more tests. I actually intended to show what it looks like first.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 12, 2016

Maybe I should have added [WIP]. If it looks okay in general, I will follow your suggestions and also add the case you gave.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 13, 2016

@rxin How does it look? If it looks okay, I will also add some cases for the constant folding @gatorsmile has shown, RAND(CAST(2 AS INT)) and RAND(CAST(NULL AS INT)).

Just to make sure: this works fine in MySQL.

mysql> SELECT RAND(CAST(1/1 AS UNSIGNED));
+-----------------------------+
| RAND(CAST(1/1 AS UNSIGNED)) |
+-----------------------------+
|         0.40540353712197724 |
+-----------------------------+
1 row in set (0.00 sec)

mysql> SELECT RAND(CAST(1/152 AS UNSIGNED));
+-------------------------------+
| RAND(CAST(1/152 AS UNSIGNED)) |
+-------------------------------+
|           0.15522042769493574 |
+-------------------------------+
1 row in set (0.00 sec)

@gatorsmile
Member

I have a very general comment about the work you are doing. For the LIKE operation, we investigated the ANSI standard and all the mainstream data stores, including Oracle, MySQL, SQL Server, Hive, DB2, Informix and Postgres. Could you do a similar thing here? Thank you!

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 13, 2016

@gatorsmile Thanks for your feedback. Could you please be a little more specific? Do you expect some research on the argument of the rand function in a standard, plus a check of the DBMSs you listed above?

If so, it would be nicer to have such research in other JIRAs or PRs so that I (and other contributors) can refer to it when we make a change like this. Is there a good example we already have?

@HyukjinKwon HyukjinKwon reopened this Oct 13, 2016
@HyukjinKwon
Member Author

(Oh, I am making a comment via my phone. Sorry for occasional closing and reopening here..)

@gatorsmile
Member

gatorsmile commented Oct 13, 2016

Let me show you an example:
https://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_bif_rand.html

This is the official document of rand in DB2 z/OS. Below is about the behavior of rand:

  1. If numeric-expression is specified, it is used as the seed value.
  2. The argument must be an expression that returns a value of a built-in integer data type (SMALLINT or INTEGER). The value must be between 0 and 2,147,483,646.
  3. The result can be null; if the argument is null, the result is the null value.
  4. RAND(0) is processed the same as RAND().

When defining the expected behavior, we need to consider how most of the mainstream data stores behave. After we deliver the fix, it is hard to change it again.
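
The two null-seed policies discussed in this thread can be contrasted in a short sketch. It is illustrative only: `scala.util.Random` stands in for each system's generator (so the values will not match the MySQL or DB2 output), the method names are hypothetical, and the DB2 range check and its rule that RAND(0) equals RAND() are deliberately not modeled.

```scala
import scala.util.Random

// Contrast of two null-seed policies; the generator is a stand-in, not any DBMS's.
object NullSeedPolicySketch {
  // DB2-style, per the documentation quoted above: a null argument yields a null result.
  def nullInNullOut(seed: Option[Long]): Option[Double] =
    seed.map(s => new Random(s).nextDouble())

  // MySQL-style, per the transcripts in this thread: a null seed behaves like seed 0.
  def nullAsZero(seed: Option[Long]): Double =
    new Random(seed.getOrElse(0L)).nextDouble()
}
```

Here `Option[Long]` models a nullable seed. The first policy propagates `None`, while the second returns the same value for `None` and `Some(0L)`, which is the behaviour this PR proposes for Spark.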

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 13, 2016

That is a great reference. However, is this function described in a standard? I guess it differs for each database implementation. For example,

The result can be null; if the argument is null, the result is the null value.

MySQL treats it as 0 rather than returning a null value. Also, I gave references for both MySQL and Hive in the PR description. Can we just define the behaviour here? Do we have a target DBMS to follow? If so, it would be great if that were mentioned in a JIRA (am I missing a JIRA we already have?). As I recall, it is usually Hive, PostgreSQL and MySQL.

In the case of PostgreSQL, it seems there are two functions for this, random() and setseed(). This works differently from MySQL and also from DB2 (judging from the comment you left), so I left it out here.

I think I have checked enough other examples. Do we usually include such explanations and tests for all the DBMSs (Oracle, MySQL, SQL Server, Hive, DB2, Informix and PostgreSQL), along with mentions of the ANSI standard, in PRs and JIRAs?

It can be problematic if we don't comply with a standard that all other implementations follow, but I think it is fine when other databases have differing implementations.

I do look at other PRs from time to time and try to make mine sensible, but I don't think we always have references from all other DBMSs and explanations from the ANSI standard.

It is hard to change it again, and that is why I am asking for review.

@HyukjinKwon
Member Author

Initially, this JIRA only handled null as a seed. If you are both worried about the change here, I would like to make the PR smaller, as initially suggested.

@gatorsmile
Member

Unfortunately, not everything has a standard to follow. That is why I suggested that you do some research on it. Oracle, for example, does not have such a function in its SQL function list: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions001.htm

Since you are changing rand, I think you can check whether the existing rand behaves as expected and add the missing test cases if needed. This JIRA only tries to cover one edge case of the seed number. Why not check whether we handle all the cases appropriately? Then we do not need to submit more small fixes for rand, right?

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 13, 2016

Strictly speaking, the JIRA describes handling null, and we might not have to generalize the cases further.

it will failed when do select rand(null)

Also, I would like to add the edge cases here, but I'd like to avoid the PR being held up.

As not everything has a standard to follow, we can define the behaviour here. I don't have access to Oracle and DB2. Do you think the Hive, PostgreSQL and MySQL examples (plus the Oracle and DB2 ones you just gave) are not enough?

@gatorsmile
Member

First, we do not strictly follow Hive. You can easily find many such cases in Spark. I do not think this is an urgent JIRA, right? As @srowen replied in the JIRA, he does not think this is a bug. The existing output message looks reasonable to me too.

Input argument to rand must be an integer literal.;; line 1 pos 0

Setting the seed as null also looks weird to me.

DB2 and Oracle have free versions to download. You can easily install the Docker versions. You can also google their documentation. What we need to do first is an investigation that saves the time of all the other reviewers; otherwise, they have to do it too.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 13, 2016

Not urgent, but in my experience such PRs tend to get held up. So I am trying to fix only the problem specified in the JIRA rather than fixing others along with it.

@srowen said "I'm not even sure that's a bug.." but "... reasonable to try to follow it.".

At least, none of the DB2, MySQL, Hive and PostgreSQL implementations throws an exception; each defines its own behaviour. It would be sensible to follow the majority (two of the identified examples, Hive and MySQL) or to define our own behaviour.

@@ -1728,4 +1728,29 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
val df = spark.createDataFrame(spark.sparkContext.makeRDD(rows), schema)
assert(df.filter($"array1" === $"array2").count() == 1)
}

test("SPARK-17854: rand/randn allows null and long as input seed") {
Contributor

move this into sql query test suite

Member Author

Sure!

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68108 has finished for PR 15432 at commit a523302.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68109 has finished for PR 15432 at commit b432355.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68115 has finished for PR 15432 at commit 160ea54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


-def this() = this(Utils.random.nextLong())
+def this() = this(Literal(Utils.random.nextLong()))
Member

Why not specifying the data type here?

Member Author

The compiler seems to complain if we specify the return type in def this.

Member Author

Oh, do you mean the type for literal for example as below?

Literal(Utils.random.nextLong(), LongType)

If you think it is beneficial because it at least avoids one type dispatch, I will fix it here. I can also sweep the usages in functions.scala in another PR.
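
The "type dispatch" mentioned here can be illustrated with a toy literal class. This is a sketch under assumptions, not Spark's `Literal`: `Lit` and `DType` are hypothetical, and the one-argument factory only handles the three cases needed for the example.

```scala
// Toy literal with and without an explicit type; not Spark's Literal.
object LiteralDispatchSketch {
  sealed trait DType
  case object IntT extends DType
  case object LongT extends DType
  case object NullT extends DType

  final case class Lit(value: Any, dtype: DType)

  object Lit {
    // One-argument form: infers the type by matching on the runtime value.
    def apply(value: Any): Lit = value match {
      case null    => Lit(null, NullT)
      case i: Int  => Lit(i, IntT)
      case l: Long => Lit(l, LongT)
      case other   => throw new IllegalArgumentException(s"Unsupported literal: $other")
    }
  }
}
```

Both `Lit(7L)` and `Lit(7L, LongT)` produce the same value in this model. The two-argument form states the type at the call site and skips the runtime match, which is the trade-off being discussed.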

Member

Yeah. I think we should explicitly specify the type, if possible. This is my personal preference.

Not sure whether it is worth a new PR to change all of them in functions.scala.

Member Author

Yes, makes sense. I will fix them here first.

@@ -87,6 +87,10 @@ case class Rand(seed: Long) extends RDG {
}
}

object Rand {
def apply(seed: Long): Rand = Rand(Literal(seed))
Member

The same here?
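
For context, the companion `apply` in the snippet above lets callers that pass a plain Long keep working once the case class takes an expression. A toy version of that pattern, with hypothetical names (`RandLike`, `LongLit`) rather than Spark's classes, might look like this:

```scala
// Toy version of a companion apply that preserves the old Long-based call style.
object RandCompanionSketch {
  sealed trait Expr
  final case class LongLit(value: Long) extends Expr

  // The seed is now an expression rather than a raw Long.
  final case class RandLike(child: Expr)

  object RandLike {
    // Keeps call sites that pass a plain Long compiling after the signature change.
    def apply(seed: Long): RandLike = RandLike(LongLit(seed))
  }
}
```

In this sketch `RandLike(42L)` and `RandLike(LongLit(42L))` build equal values, so existing call sites do not have to wrap the seed themselves.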


-def this() = this(Utils.random.nextLong())
+def this() = this(Literal(Utils.random.nextLong()))
Member

The same here?

@gatorsmile
Member

LGTM except a few minor comments.

@HyukjinKwon
Member Author

Thanks @gatorsmile!

@SparkQA

SparkQA commented Nov 5, 2016

Test build #68185 has finished for PR 15432 at commit 9b9a49f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 5, 2016

Test build #68196 has finished for PR 15432 at commit 9b9a49f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 5, 2016

Test build #68201 has finished for PR 15432 at commit 9b9a49f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Nov 6, 2016
Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15432 from HyukjinKwon/SPARK-17854.

(cherry picked from commit 340f09d)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@srowen
Member

srowen commented Nov 6, 2016

Merged to master/2.1

@asfgit asfgit closed this in 340f09d Nov 6, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#15432 from HyukjinKwon/SPARK-17854.
@HyukjinKwon HyukjinKwon deleted the SPARK-17854 branch January 2, 2018 03:43