
[SPARK-16429][SQL] Include StringType columns in describe() #14095

Closed
wants to merge 4 commits into from

Conversation

@dongjoon-hyun (Member) commented Jul 7, 2016

What changes were proposed in this pull request?

Currently, Spark's describe(columns) supports StringType columns when they are named explicitly. However, describe() without arguments returns a Dataset covering only the numeric columns. This PR includes StringType columns in the no-argument describe().

Background

scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+

Before

scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|              24.5|
| stddev|7.7781745930520225|
|    min|                19|
|    max|                30|
+-------+------------------+

After

scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+-------+                                            
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
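
Note that count, min, and max are well-defined for StringType columns, while mean and stddev are not, which is why those cells are null. As a minimal sketch (not the PR's implementation), the same per-column statistics can be reproduced with the public aggregate functions; this assumes a running spark-shell (so a `spark` session exists) and the bundled people.json example file.

import org.apache.spark.sql.functions._

val people = spark.read.json("examples/src/main/resources/people.json")

// count/min/max operate on strings directly; mean and stddev are computed
// over an explicit cast to double, so a non-numeric string column simply
// yields null, matching the "After" table above.
people.agg(
  count(col("name")).as("count"),
  mean(col("name").cast("double")).as("mean"),
  stddev(col("name").cast("double")).as("stddev"),
  min(col("name")).as("min"),
  max(col("name")).as("max")
).show()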

How was this patch tested?

Pass Jenkins with an updated test case.

@@ -228,6 +228,15 @@ class Dataset[T] private[sql](
}
}

private[sql] def aggregatableColumns: Seq[Expression] = {
Contributor:

`private` rather than `private[sql]`?

Member Author:

That would be better.
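
For illustration only (a hypothetical class, not Spark code), the difference the comment refers to: `private` limits access to the enclosing class, while `private[sql]` additionally exposes the member to everything under the org.apache.spark.sql package.

package org.apache.spark.sql

class Example {
  private def classOnly(): Unit = ()           // visible only inside Example
  private[sql] def sqlPackageWide(): Unit = () // also visible elsewhere in org.apache.spark.sql
}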

@dongjoon-hyun (Member Author)

Thank you for the fast review, @rxin. I updated it.

@SparkQA commented Jul 7, 2016

Test build #61929 has finished for PR 14095 at commit df2edd7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Oh, it's documented behavior:

Computes statistics for **numeric** columns

@SparkQA commented Jul 7, 2016

Test build #61930 has finished for PR 14095 at commit b6673cb.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun changed the title from "[SPARK-16429][SQL] Include StringType columns in Scala/Python describe()" to "[SPARK-16429][SQL] Include StringType columns in describe()" on Jul 7, 2016
@rxin (Contributor) commented Jul 8, 2016

Can you fix Python?

@dongjoon-hyun (Member Author)

Oh, sure!

@rxin (Contributor) commented Jul 8, 2016

And also update the documentation.

@dongjoon-hyun (Member Author)

Of course!

@dongjoon-hyun (Member Author)

I fixed Python/R and the docs accordingly, and tested locally.

.filter(f => f.dataType.isInstanceOf[NumericType] || f.dataType.isInstanceOf[StringType])
.map { n =>
queryExecution.analyzed.resolveQuoted(n.name, sparkSession.sessionState.analyzer.resolver)
.get
Contributor:

is it possible that this would fail?

Member Author:

This is a direct extension of line 225 of the existing numericColumns.

https://github.com/apache/spark/pull/14095/files#diff-7a46f10c3cedbf013cf255564d9483cdR225

Member Author:

Do you mean the failure of resolveQuoted?

Member Author:

It will not fail because the names come from schema.fields.
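
Putting the two diff fragments above together, the helper under discussion looks roughly like the sketch below (a paraphrase of the diff, not the merged code verbatim, with the access modifier already changed to plain `private` as agreed earlier; NumericType and StringType come from org.apache.spark.sql.types). Because every name is taken from this Dataset's own schema.fields, resolveQuoted always finds the attribute, so the .get cannot throw here.

private def aggregatableColumns: Seq[Expression] = {
  schema.fields
    .filter(f => f.dataType.isInstanceOf[NumericType] || f.dataType.isInstanceOf[StringType])
    .map { n =>
      // Resolve each schema field name against the analyzed plan of this Dataset.
      queryExecution.analyzed
        .resolveQuoted(n.name, sparkSession.sessionState.analyzer.resolver)
        .get
    }
}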

@SparkQA commented Jul 8, 2016

Test build #61962 has finished for PR 14095 at commit 8915adb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Hi, @rxin.
It's ready for review again.

@SparkQA commented Jul 8, 2016

Test build #61965 has finished for PR 14095 at commit fa4d3b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jul 8, 2016

Thanks - merging in master.

@asfgit closed this in 142df48 on Jul 8, 2016
@dongjoon-hyun (Member Author)

Thank you for merging, @rxin.
