[SPARK-4829] [SQL] add rule to fold count(expr) if expr is not null #3676

adrian-wang · 2014-12-11T11:28:00Z

No description provided.

SparkQA · 2014-12-11T11:35:33Z

Test build #24364 has started for PR 3676 at commit dc5765b.

This patch merges cleanly.

SparkQA · 2014-12-11T12:40:09Z

Test build #24364 has finished for PR 3676 at commit dc5765b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-11T12:40:12Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24364/

liancheng · 2014-12-12T01:26:40Z

This LGTM, but I would like to share some findings related to the semantics of COUNT(expr). It seems that Hive has a bug here, and Spark SQL behaves differently from Hive.

The Hive language manual says:

count(expr) - Returns the number of rows for which the supplied expression is non-NULL

but this doesn't conform to the following results (tested under Hive 0.13.1):

-- The test table `src1(key INT, value STRING)` is the one we used in Spark SQL `TestHiveContext`.
-- The table consists of 25 rows, among which 10 `key`s are `NULL`.

CREATE TABLE src1(key INT, value STRING);
LOAD DATA LOCAL INPATH 'data/files/kv3.txt' INTO TABLE src1;

SELECT COUNT(key) FROM src1
WHERE key IS NOT NULL;              -- => 15, reasonable

SELECT COUNT(NULL) FROM src1;       -- => 0, reasonable

SELECT COUNT(1) FROM src1;          -- => 25, reasonable, 1 is never `NULL`

SELECT COUNT(key + 1) FROM src1;    -- => 15, reasonable since `NULL + 1` is `NULL`.

SELECT COUNT(key) FROM src1;        -- => 25, huh?

CREATE TABLE tmp AS
SELECT CAST(key AS STRING), value
FROM src1;

SELECT COUNT(key) FROM tmp;         -- => 15, hm...

I'm not sure whether Hive has something equivalent to the StructField.nullable field in Spark SQL, but it seems that Hive always treats INT columns as non-nullable even if the underlying data may contain NULL. As a result, COUNT(expr) doesn't seem to check the actual data for NULL when expr is a single column whose data type is considered non-nullable.
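To make the expected numbers concrete, here is a minimal Python sketch of the documented COUNT(expr) semantics. It is a toy model, not Hive or Spark SQL code: `count_expr` and `add_one` are hypothetical helpers, `None` stands in for NULL, and the key values are made up so that the table has the same shape as `src1` (25 rows, 10 NULL keys).

```python
# Toy model of SQL COUNT(expr): count the rows where expr evaluates to non-NULL.
# None stands in for SQL NULL. These helpers are illustrative only.

def count_expr(rows, expr):
    """Return the number of rows for which expr(row) is not None."""
    return sum(1 for row in rows if expr(row) is not None)

def add_one(value):
    """NULL-propagating `value + 1`."""
    return None if value is None else value + 1

# 25 rows, 10 of which have a NULL key (key values are invented for illustration).
rows = [{"key": k, "value": "v"} for k in list(range(15)) + [None] * 10]

count_key = count_expr(rows, lambda r: r["key"])                    # COUNT(key)
count_one = count_expr(rows, lambda r: 1)                           # COUNT(1)
count_key_plus_one = count_expr(rows, lambda r: add_one(r["key"]))  # COUNT(key + 1)
count_null = count_expr(rows, lambda r: None)                       # COUNT(NULL)

# Skipping the NULL check, as if `key` were declared non-nullable,
# simply counts every row.
count_key_unchecked = sum(1 for _ in rows)
```

Under this model `count_key` is 15, while `count_key_unchecked` is 25, which matches the surprising Hive 0.13.1 result for `SELECT COUNT(key) FROM src1`.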

On the other hand, Spark SQL looks good. Here is a sample hive/console session:

scala> sql("SELECT COUNT(key) FROM src1").collect()
...
res2: Array[org.apache.spark.sql.Row] = Array([15])     // <- Reasonable

scala> table("src1").printSchema
root
 |-- key: integer (nullable = true)
 |-- value: string (nullable = true)

Notice that we consider all fields read from the Hive Metastore nullable, since data can be dumped in arbitrarily without any validation.
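The interaction between nullability and the folding idea in this PR's title can be sketched with a tiny expression tree. This is a hedged Python illustration, not the Catalyst rule from the patch: the `Literal`, `Column`, `Add`, `Count` classes and `fold_count` are hypothetical stand-ins, and rewriting to `COUNT(1)` is an assumption about what "folding" means here.

```python
from dataclasses import dataclass
from typing import Any, Union

@dataclass(frozen=True)
class Literal:
    value: Any

    @property
    def nullable(self) -> bool:
        # A literal can only be NULL if it is the NULL literal itself.
        return self.value is None

@dataclass(frozen=True)
class Column:
    name: str
    # Default mirrors treating every metastore field as nullable.
    nullable: bool = True

@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"

    @property
    def nullable(self) -> bool:
        # NULL + x is NULL, so the sum is nullable if either side is.
        return self.left.nullable or self.right.nullable

Expr = Union[Literal, Column, Add]

@dataclass(frozen=True)
class Count:
    child: Expr

def fold_count(agg: Count) -> Count:
    """Rewrite COUNT(expr) to COUNT(1) only when expr can never be NULL."""
    if not agg.child.nullable:
        return Count(Literal(1))
    return agg
```

With `key` marked nullable, `fold_count(Count(Column("key")))` leaves the aggregate untouched, so the per-row NULL check still happens. Only a column explicitly declared with `nullable=False` is folded, which is why a wrong non-nullable assumption would produce an incorrect count.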

adrian-wang · 2014-12-12T02:39:33Z

Thanks for such a detailed review! I checked your query with hive-0.14.0, and the bug no longer exists.

liancheng · 2014-12-12T03:42:10Z

Ah cool, couldn't open JIRA this morning to check whether it's a known issue :)

marmbrus · 2014-12-12T06:57:10Z

Thanks! Merged to master.

add rule to fold count(expr) if expr is not null

dc5765b

asfgit closed this in 41a3f93 Dec 12, 2014
