@adrian-wang
Contributor

No description provided.

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24364 has started for PR 3676 at commit dc5765b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 11, 2014

Test build #24364 has finished for PR 3676 at commit dc5765b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24364/

@liancheng
Contributor

This LGTM, but I would like to share some findings related to the semantics of COUNT(expr). It seems that Hive has a bug here, and Spark SQL behaves differently from Hive.

The Hive language manual says:

count(expr) - Returns the number of rows for which the supplied expression is non-NULL

but this doesn't conform to the following results (tested under Hive 0.13.1):

-- The test table `src1(key INT, value STRING)` is the one we used in Spark SQL `TestHiveContext`.
-- The table consists of 25 rows, among which 10 `key`s are `NULL`.

CREATE TABLE src1(key INT, value STRING);
LOAD DATA LOCAL INPATH 'data/files/kv3.txt' INTO TABLE src1;

SELECT COUNT(key) FROM src1
WHERE key IS NOT NULL;              -- => 15, reasonable

SELECT COUNT(NULL) FROM src1;       -- => 0, reasonable

SELECT COUNT(1) FROM src1;          -- => 25, reasonable, 1 is never `NULL`

SELECT COUNT(key + 1) FROM src1;    -- => 15, reasonable since `NULL + 1` is `NULL`.

SELECT COUNT(key) FROM src1;        -- => 25, huh?

CREATE TABLE tmp AS
SELECT CAST(key AS STRING) AS key, value
FROM src1;

SELECT COUNT(key) FROM tmp;         -- => 15, hm...
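The documented COUNT(expr) semantics quoted above can be sketched in plain Scala, without Hive or Spark. This is only an illustrative model: `Src1Row`, `CountSemantics`, and the generated rows are hypothetical stand-ins (25 rows, 10 of them with a NULL key), not the actual contents of kv3.txt.

```scala
// Hypothetical stand-in for src1: 25 rows, 10 of which have a NULL key.
// A NULL key is modeled as None; the real kv3.txt data is not reproduced here.
final case class Src1Row(key: Option[Int], value: String)

object CountSemantics {
  val rows: Seq[Src1Row] =
    (1 to 15).map(i => Src1Row(Some(i), s"val_$i")) ++
      (1 to 10).map(i => Src1Row(None, s"null_key_$i"))

  // COUNT(expr): the number of rows for which expr evaluates to a non-NULL value.
  def count[A](rows: Seq[Src1Row])(expr: Src1Row => Option[A]): Long =
    rows.count(r => expr(r).isDefined).toLong

  def main(args: Array[String]): Unit = {
    println(count(rows)(_.key))                   // COUNT(key)     => 15
    println(count(rows)(_.key.map(_ + 1)))        // COUNT(key + 1) => 15, NULL + 1 is NULL
    println(count(rows)(_ => Some(1)))            // COUNT(1)       => 25, 1 is never NULL
    println(count(rows)(_ => Option.empty[Int]))  // COUNT(NULL)    => 0
  }
}
```

Under this model COUNT(key) can only be 15 for such a table, which is why the 25 reported by Hive 0.13.1 looks wrong.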

I'm not sure whether Hive has something equivalent to the StructField.nullable field in Spark SQL, but it seems that it always treats INT as non-nullable even if the underlying data may contain NULL. As a result, COUNT(expr) doesn't check the actual data for NULL when expr is a single column whose data type is considered non-nullable.

On the other hand, Spark SQL looks good. Here is a sample hive/console session:

scala> sql("SELECT COUNT(key) FROM src1").collect()
...
res2: Array[org.apache.spark.sql.Row] = Array([15])     // <- Reasonable

scala> table("src1").printSchema
root
 |-- key: integer (nullable = true)
 |-- value: string (nullable = true)

Note that we consider all fields read from the Hive Metastore nullable, since data can be dumped in arbitrarily without any validation.
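To illustrate why a conservative nullability flag matters, here is a minimal plain-Scala sketch of the hypothesis raised in this comment. It is not Hive's or Spark SQL's actual code: `Column`, `countTrustingFlag`, and `asNullable` are hypothetical names, and the data is made up to match the 15 non-NULL / 10 NULL split of `src1`.

```scala
// Hypothetical model: a column carries a declared nullability flag next to its data.
final case class Column(name: String, declaredNullable: Boolean, values: Seq[Option[Int]])

object NullabilityShortcut {
  // A COUNT(col) that trusts the declared flag: if the column claims to be
  // non-nullable, it skips the per-row NULL check and returns the row count.
  def countTrustingFlag(col: Column): Long =
    if (!col.declaredNullable) col.values.size.toLong
    else col.values.count(_.isDefined).toLong

  // Conservative policy: treat a column loaded from an external catalog as nullable.
  def asNullable(col: Column): Column = col.copy(declaredNullable = true)

  def main(args: Array[String]): Unit = {
    // 15 non-NULL keys and 10 NULL keys, but the flag wrongly says "non-nullable".
    val key = Column(
      name = "key",
      declaredNullable = false,
      values = (1 to 15).map(Option(_)) ++ Seq.fill(10)(Option.empty[Int])
    )

    println(countTrustingFlag(key))             // 25: the flag is trusted, NULLs are counted
    println(countTrustingFlag(asNullable(key))) // 15: every row is actually checked
  }
}
```

Marking everything nullable costs a per-row check, but it keeps the result correct when the stored data does not honor the declared type constraints.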

@adrian-wang
Contributor Author

Thanks for such a detailed review! I checked your query with hive-0.14.0, and the bug no longer exists.

@liancheng
Contributor

Ah cool, couldn't open JIRA this morning to check whether it's a known issue :)

@marmbrus
Contributor

Thanks! Merged to master.

@asfgit asfgit closed this in 41a3f93 Dec 12, 2014