
[SPARK-30532] DataFrameStatFunctions to work with TABLE.COLUMN syntax #27916

Closed
kachayev wants to merge 8 commits into apache:master from kachayev:fix-spark-30532

Conversation

@kachayev
Contributor

What changes were proposed in this pull request?

DataFrameStatFunctions now works correctly with fully qualified column names (table.column syntax) by properly resolving the names instead of relying on field names from the schema, notably:

  • approxQuantile
  • freqItems
  • cov
  • corr

(other functions from DataFrameStatFunctions already work correctly).

See code examples below.
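
To illustrate the idea behind the fix, here is a minimal sketch (assuming a local SparkSession; the object name is illustrative and the exact patch may differ): raw schema field names carry no alias qualifier, so a plain string lookup can never match a qualified name, whereas the analyzer resolves it against the dataset aliases.

```scala
import org.apache.spark.sql.SparkSession

object QualifiedNameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("qualified-name-sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table1")
    val df2 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table2")
    val dfx = df2.crossJoin(df1)

    // Schema field names carry no qualifier, so a plain string
    // comparison cannot match "table1.num":
    println(dfx.schema.fieldNames.mkString(", "))          // num, num
    println(dfx.schema.fieldNames.contains("table1.num"))  // false

    // The analyzer, by contrast, resolves the qualified name
    // against the dataset aliases:
    println(dfx.col("table1.num").expr)

    spark.stop()
  }
}
```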

Why are the changes needed?

With the current implementation, some stat functions cannot be used at all when joining datasets that share column names.

Does this PR introduce any user-facing change?

Yes. Before the change, the following code would fail with AnalysisException.

scala> val df1 = sc.parallelize(0 to 10).toDF("num").as("table1")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]

scala> val df2 = sc.parallelize(0 to 10).toDF("num").as("table2")
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]

scala> val dfx = df2.crossJoin(df1)
dfx: org.apache.spark.sql.DataFrame = [num: int, num: int]

scala> dfx.stat.approxQuantile("table1.num", Array(0.1), 0.0)
res0: Array[Double] = Array(1.0)

scala> dfx.stat.corr("table1.num", "table2.num")
res1: Double = 1.0

scala> dfx.stat.cov("table1.num", "table2.num")
res2: Double = 11.0

scala> dfx.stat.freqItems(Array("table1.num", "table2.num"))
res3: org.apache.spark.sql.DataFrame = [table1.num_freqItems: array<int>, table2.num_freqItems: array<int>]

How was this patch tested?

Corresponding unit tests are added to DataFrameStatSuite.scala (marked as "SPARK-30532").
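
As a hedged sketch of what such a test could look like (the class and test names are illustrative, the scaffolding follows the suite's usual SharedSparkSession setup, and the asserted values are the ones reported in the REPL session above):

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

class DataFrameStatSuiteSketch extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("SPARK-30532: stat functions work with fully qualified column names") {
    val df1 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table1")
    val df2 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table2")
    val dfx = df2.crossJoin(df1)

    // Expected values taken from the examples in the PR description.
    assert(dfx.stat.approxQuantile("table1.num", Array(0.1), 0.0).sameElements(Array(1.0)))
    assert(dfx.stat.corr("table1.num", "table2.num") == 1.0)
    assert(dfx.stat.cov("table1.num", "table2.num") == 11.0)
  }
}
```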

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Mar 15, 2020

Test build #119807 has finished for PR 27916 at commit 45c1d8b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kachayev
Contributor Author

Fixed tests.

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119830 has finished for PR 27916 at commit 2f24ae0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kachayev
Contributor Author

The tests failed because of something that seems completely unrelated (in Hive). Digging deeper to understand why.

@HyukjinKwon
Member

retest this please

Contributor

Can we merge these tests into one? Then we can share the code below:

val df1 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table1")
val df2 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table2")
val dfx = df2.crossJoin(df1)

Contributor Author

Sure, I will submit an update shortly.

Contributor Author

Done!

@SparkQA

SparkQA commented Mar 19, 2020

Test build #120044 has finished for PR 27916 at commit 2f24ae0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 22, 2020

Test build #120149 has finished for PR 27916 at commit 37f441e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kachayev
Contributor Author

Looks like the tests failed for the same reason as before. I could not replicate this locally (the local build and tests are fine).

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Mar 23, 2020

Test build #120176 has finished for PR 27916 at commit 37f441e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Mar 23, 2020

Test build #120185 has finished for PR 27916 at commit 37f441e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Mar 23, 2020

Test build #120207 has finished for PR 27916 at commit 37f441e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 30, 2020

Test build #120561 has finished for PR 27916 at commit dd9f258.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kachayev
Contributor Author

@cloud-fan Merged with the latest master. The tests now pass (they previously failed due to unrelated issues).

@cloud-fan
Contributor

thanks, merging to master/3.0!

cloud-fan closed this in 22bb6b0 on Mar 30, 2020
cloud-fan pushed a commit that referenced this pull request Mar 30, 2020
Closes #27916 from kachayev/fix-spark-30532.

Authored-by: Oleksii Kachaiev <kachayev@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 22bb6b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27916 from kachayev/fix-spark-30532.

Authored-by: Oleksii Kachaiev <kachayev@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>