[SPARK-7243][SQL] Contingency Tables for DataFrames #5842
Conversation
Test build #31585 has finished for PR 5842 at commit
/** Generate a table of frequencies for the elements of two columns. */
private[sql] def crossTabulate(df: DataFrame, col1: String, col2: String): DataFrame = {
  val tableName = s"${col1}_$col2"
  val distinctVals = df.select(countDistinct(col1), countDistinct(col2)).collect().head
This implementation triggers multiple jobs. I'm thinking about the following approach:
- Get distinct values from col2 and create a value-to-index map.
- Aggregate by col1. For each value in col1, generate a Row object and fill in counts.
- Assign the table schema.
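The shape of that approach can be sketched in plain Python (a toy illustration only, not the PR's Scala code; `crosstab_sketch` and all names in it are hypothetical):

```python
from collections import defaultdict

def crosstab_sketch(pairs):
    """pairs: iterable of (col1_value, col2_value) tuples."""
    pairs = list(pairs)
    # Step 1: distinct values of col2, mapped to column indices.
    col2_vals = sorted({c2 for _, c2 in pairs})
    col2_index = {v: i for i, v in enumerate(col2_vals)}
    # Step 2: aggregate by col1; one row of counts per distinct col1 value.
    rows = defaultdict(lambda: [0] * len(col2_vals))
    for c1, c2 in pairs:
        rows[c1][col2_index[c2]] += 1
    # Step 3: the "schema": a header naming the first column after both
    # input columns, followed by the distinct col2 values as column names.
    header = ["col1_col2"] + col2_vals
    return header, dict(rows)
```

The real implementation does the aggregation distributed and only builds Rows at the end; this sketch just shows the value-to-index pivot.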
Test build #31610 has finished for PR 5842 at commit
/** Generate a table of frequencies for the elements of two columns. */
private[sql] def crossTabulate(df: DataFrame, col1: String, col2: String): DataFrame = {
  val tableName = s"${col1}_$col2"
  val distinctCol2 = df.select(col2).distinct.orderBy(col2).collect()
it might be faster to collect and then sort, rather than sort and collect
The first implementation uses multiple passes. We need two passes any way, either on the original columns or on the pair counts. The latter may be better.
- val counts = select(col1, col2).rdd.countByValue().cache()
- Get distinct values from col2: counts.map(_._1._2).distinct().collect(). I'm not sure whether ordering by counts is useful here.
- GroupBy col1 in counts and create an RDD of Row. And then apply the schema.
Test build #31647 has finished for PR 5842 at commit
/** Generate a table of frequencies for the elements of two columns. */
private[sql] def crossTabulate(df: DataFrame, col1: String, col2: String): DataFrame = {
  val tableName = s"${col1}_$col2"
  val distinctCol2 = df.select(col2).distinct.collect().sortBy(_.get(0).toString)
btw - wouldn't a more efficient way to run this be to do groupBy(col1, col2).count(), and then pivot the table?
that way we only need one pass over the data.
That's what I did first. Xiangrui thought this would be more efficient.
mhmm I'm not sure if I agree. Doing it this way requires 2 passes, and also does not rely on the underlying execution engine. The physical execution will get faster over time, and we definitely want to take advantage of that.
I'm happy to implement it both ways. Check my first commit. If you both think that's ok, or a combination of both ideas is better, I'd be happy to implement it. I think my first implementation had two passes as well. I think you might need two passes to pivot properly. Maybe we can do something smarter. cc @mengxr
The first implementation uses multiple passes. We need two passes any way, either on the original columns or on the pair counts. The latter may be better.
- val counts = select(col1, col2).rdd.countByValue().cache().
- Get distinct values from col2: counts.map(_._1._2).distinct().collect(). I'm not sure whether ordering by counts is useful here.
- GroupBy col1 in counts and create an RDD of Row. And then apply the schema.
use dataframe's own group by and count; don't use the rdd one. df one will get a lot faster with tungsten over time.
@mengxr what did you mean that we need two passes? If I understand this correctly, it is simply
"select col1, col2, count(*) from table group by col1, col2"
and then pivot the result to put col2 as the column name? This is one pass.
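The single-pass idea can be sketched in plain Python (a toy illustration, not Spark code; `grouped_counts` and `pivot` are hypothetical names): one grouped count over the data, then a driver-side pivot of the (usually small) result that turns col2 values into column names.

```python
from collections import Counter

def grouped_counts(pairs):
    # One pass over the data: the analogue of
    # "select col1, col2, count(*) from table group by col1, col2".
    return Counter(pairs)

def pivot(counts):
    # Local pivot of the grouped result: rows keyed by col1,
    # one count column per distinct col2 value, missing pairs filled with 0.
    col2_vals = sorted({c2 for (_, c2) in counts})
    table = {
        c1: [counts.get((c1, c2), 0) for c2 in col2_vals]
        for c1 in sorted({c1 for (c1, _) in counts})
    }
    return table, col2_vals
```

The pivot touches only the grouped pairs, not the raw data, which is why the whole computation is a single pass over the table.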
Interesting, this comment only shows up in the diff. +1 on the single-pass approach. The driver could be the bottleneck, but we are not expecting large amounts of data for crosstab.
@brkyvz I had an offline discussion with Reynold. For the first version, let's implement the local version in a single pass, which should cover most of the use cases. The steps would be
Test build #31700 has finished for PR 5842 at commit
retest this please
Test build #31703 has finished for PR 5842 at commit
Test build #31704 has finished for PR 5842 at commit
/** Generate a table of frequencies for the elements of two columns. */
private[sql] def crossTabulate(df: DataFrame, col1: String, col2: String): DataFrame = {
  val tableName = s"${col1}_$col2"
  val counts = df.groupBy(col1, col2).agg(col(col1), col(col2), count("*")).collect()
maybe use take to avoid running out of memory.
do we know how much we are going to take though?
It doesn't matter. you can set a max number (maybe 1 million). If the dataset has less than that, it will just return the entire dataset (at a slightly higher cost to run multiple jobs).
Would taking 1e8 have such a high cost if there are, for example, only 100 in total? The reason I chose 1e8 was 1e4 * 1e4, basically the limit we put on the number of columns.
@brkyvz can you also update the PR description?
""" | ||
Computes a pair-wise frequency table of the given columns. Also known as a contingency | ||
table. The number of distinct values for each column should be less than 1e5. The first | ||
column of each row will be the distinct values of `col1` and the column names will be the |
Document the first column name. 1e5 -> 1e4.
Test build #31740 has finished for PR 5842 at commit
Test build #31733 has finished for PR 5842 at commit
/** Generate a table of frequencies for the elements of two columns. */
private[sql] def crossTabulate(df: DataFrame, col1: String, col2: String): DataFrame = {
  val tableName = s"${col1}_$col2"
In my previous comment, I meant that this tableName is not documented. Users need to know the name of the first column to operate on it.
minor: It would be good to check pandas' or R's naming for this column and follow one of them.
Pandas and R have the concept of row names, which we currently don't have. We have to use the first column as the "row names".
/** Generate a table of frequencies for the elements of two columns. */
private[sql] def crossTabulate(df: DataFrame, col1: String, col2: String): DataFrame = {
  val tableName = s"${col1}_$col2"
  val counts = df.groupBy(col1, col2).agg(col(col1), col(col2), count("*")).take(1e8.toInt)
Check the size of counts. If it is 1e8, throw a warning.
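The suggested guard is a common take-with-cap pattern; here is a plain-Python sketch of it (a hypothetical helper, not the PR's code; the real code uses Spark's take and a logged warning):

```python
import itertools
import warnings

MAX_ROWS = int(1e8)  # the cap discussed above; a follow-up lowers it

def take_with_warning(rows, limit=MAX_ROWS):
    """Collect up to `limit` rows; warn if the cap was hit,
    since the result may then be truncated."""
    taken = list(itertools.islice(rows, limit))
    if len(taken) == limit:
        warnings.warn(
            "crosstab reached the %d-row cap; the result may be incomplete" % limit
        )
    return taken
```

Note that a dataset with exactly `limit` rows also triggers the warning; that false positive is the price of not doing a separate count.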
@brkyvz can you submit a follow up pr to reduce 1e8 to 1e6? 1e8 is too large.
Test build #31769 has finished for PR 5842 at commit
retest this please
Test build #31774 has finished for PR 5842 at commit
Thanks. I'm merging this.
Computes a pair-wise frequency table of the given columns. Also known as cross-tabulation. cc mengxr rxin

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5842 from brkyvz/df-cont and squashes the following commits:
a07c01e [Burak Yavuz] addressed comments v4.1
ae9e01d [Burak Yavuz] fix test
9106585 [Burak Yavuz] addressed comments v4.0
bced829 [Burak Yavuz] fix merge conflicts
a63ad00 [Burak Yavuz] addressed comments v3.0
a0cad97 [Burak Yavuz] addressed comments v3.0
6805df8 [Burak Yavuz] addressed comments and fixed test
939b7c4 [Burak Yavuz] lint python
7f098bc [Burak Yavuz] add crosstab pyTest
fd53b00 [Burak Yavuz] added python support for crosstab
27a5a81 [Burak Yavuz] implemented crosstab

(cherry picked from commit 8055411)
Signed-off-by: Reynold Xin <rxin@databricks.com>