
[SPARK-30532] DataFrameStatFunctions to work with TABLE.COLUMN syntax #27916

Closed
kachayev wants to merge 8 commits into apache:master from kachayev:fix-spark-30532

Conversation

@kachayev
Contributor

What changes were proposed in this pull request?

DataFrameStatFunctions now works correctly with fully qualified column names (table.column syntax) by properly resolving the names instead of relying on field names from the schema, notably:

  • approxQuantile
  • freqItems
  • cov
  • corr

(other functions from DataFrameStatFunctions already work correctly).

See code examples below.
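
To illustrate the idea behind the fix, here is a minimal sketch (assuming a local SparkSession; the object name is illustrative and the exact patch may differ): raw schema field names carry no alias qualifier, so a plain string lookup can never match a qualified name, whereas the analyzer resolves it against the dataset aliases.

```scala
import org.apache.spark.sql.SparkSession

object QualifiedNameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("qualified-name-sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table1")
    val df2 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table2")
    val dfx = df2.crossJoin(df1)

    // Schema field names carry no qualifier, so a plain string
    // comparison cannot match "table1.num":
    println(dfx.schema.fieldNames.mkString(", "))          // num, num
    println(dfx.schema.fieldNames.contains("table1.num"))  // false

    // The analyzer, by contrast, resolves the qualified name
    // against the dataset aliases:
    println(dfx.col("table1.num").expr)

    spark.stop()
  }
}
```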

Why are the changes needed?

With the current implementation, some stat functions cannot be used at all when joining datasets that share column names.

Does this PR introduce any user-facing change?

Yes. Before the change, the following code would fail with AnalysisException.

scala> val df1 = sc.parallelize(0 to 10).toDF("num").as("table1")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]

scala> val df2 = sc.parallelize(0 to 10).toDF("num").as("table2")
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]

scala> val dfx = df2.crossJoin(df1)
dfx: org.apache.spark.sql.DataFrame = [num: int, num: int]

scala> dfx.stat.approxQuantile("table1.num", Array(0.1), 0.0)
res0: Array[Double] = Array(1.0)

scala> dfx.stat.corr("table1.num", "table2.num")
res1: Double = 1.0

scala> dfx.stat.cov("table1.num", "table2.num")
res2: Double = 11.0

scala> dfx.stat.freqItems(Array("table1.num", "table2.num"))
res3: org.apache.spark.sql.DataFrame = [table1.num_freqItems: array<int>, table2.num_freqItems: array<int>]

How was this patch tested?

Corresponding unit tests are added to DataFrameStatSuite.scala (marked as "SPARK-30532").
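
As a hedged sketch of what such a test could look like (the class and test names are illustrative, the scaffolding follows the suite's usual SharedSparkSession setup, and the asserted values are the ones reported in the REPL session above):

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

class DataFrameStatSuiteSketch extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("SPARK-30532: stat functions work with fully qualified column names") {
    val df1 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table1")
    val df2 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table2")
    val dfx = df2.crossJoin(df1)

    // Expected values taken from the examples in the PR description.
    assert(dfx.stat.approxQuantile("table1.num", Array(0.1), 0.0).sameElements(Array(1.0)))
    assert(dfx.stat.corr("table1.num", "table2.num") == 1.0)
    assert(dfx.stat.cov("table1.num", "table2.num") == 11.0)
  }
}
```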

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Mar 15, 2020

Test build #119807 has finished for PR 27916 at commit 45c1d8b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kachayev
Contributor Author

Fixed tests.

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119830 has finished for PR 27916 at commit 2f24ae0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kachayev
Contributor Author

The tests failed because of something that seems completely unrelated (in Hive). Digging deeper to understand why.

@HyukjinKwon
Member

retest this please

Contributor

Can we merge these tests into one? Then we can share the code below:

val df1 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table1")
val df2 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table2")
val dfx = df2.crossJoin(df1)

Contributor Author

Sure, I will submit an update shortly.

Contributor Author

Done!

@SparkQA

SparkQA commented Mar 19, 2020

Test build #120044 has finished for PR 27916 at commit 2f24ae0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 22, 2020

Test build #120149 has finished for PR 27916 at commit 37f441e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kachayev
Contributor Author

Looks like the tests failed for the same reason as before. I could not replicate this locally (the local build and tests are fine).

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Mar 23, 2020

Test build #120176 has finished for PR 27916 at commit 37f441e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Mar 23, 2020

Test build #120185 has finished for PR 27916 at commit 37f441e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Mar 23, 2020

Test build #120207 has finished for PR 27916 at commit 37f441e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 30, 2020

Test build #120561 has finished for PR 27916 at commit dd9f258.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kachayev
Contributor Author

@cloud-fan Merged with the latest master. The tests now pass (they previously failed due to unrelated issues).

@cloud-fan
Contributor

thanks, merging to master/3.0!

cloud-fan closed this in 22bb6b0 on Mar 30, 2020
cloud-fan pushed a commit that referenced this pull request Mar 30, 2020
Closes #27916 from kachayev/fix-spark-30532.

Authored-by: Oleksii Kachaiev <kachayev@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 22bb6b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27916 from kachayev/fix-spark-30532.

Authored-by: Oleksii Kachaiev <kachayev@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>