Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-28286][SQL][PYTHON][TESTS] Convert and port 'pivot.sql' into UDF test base #25122

Closed
wants to merge 10 commits into from

Conversation

chitralverma
Copy link
Contributor

@chitralverma chitralverma commented Jul 11, 2019

What changes were proposed in this pull request?

This PR adds some tests converted from pivot.sql to test UDFs following the combination guide in SPARK-27921.

Diff comparing to 'pivot.sql'

diff --git a/sql/core/src/test/resources/sql-tests/results/pivot.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-pivot.sql.out
index 9a8f783da4..cb9e4d736c 100644
--- a/sql/core/src/test/resources/sql-tests/results/pivot.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-pivot.sql.out
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 32
+-- Number of queries: 30
 
 
 -- !query 0
@@ -40,14 +40,14 @@ struct<>
 
 -- !query 3
 SELECT * FROM (
-  SELECT year, course, earnings FROM courseSales
+  SELECT udf(year), course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 3 schema
-struct<year:int,dotNET:bigint,Java:bigint>
+struct<CAST(udf(cast(year as string)) AS INT):int,dotNET:bigint,Java:bigint>
 -- !query 3 output
 2012   15000   20000
 2013   48000   30000
@@ -56,7 +56,7 @@ struct<year:int,dotNET:bigint,Java:bigint>
 -- !query 4
 SELECT * FROM courseSales
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (2012, 2013)
 )
 -- !query 4 schema
@@ -71,11 +71,11 @@ SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings), avg(earnings)
+  udf(sum(earnings)), udf(avg(earnings))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 5 schema
-struct<year:int,dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_avg(CAST(earnings AS BIGINT)):double,Java_sum(CAST(earnings AS BIGINT)):bigint,Java_avg(CAST(earnings AS BIGINT)):double>
+struct<year:int,dotNET_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,dotNET_CAST(udf(cast(avg(cast(earnings as bigint)) as string)) AS DOUBLE):double,Java_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,Java_CAST(udf(cast(avg(cast(earnings as bigint)) as string)) AS DOUBLE):double>
 -- !query 5 output
 2012   15000   7500.0  20000   20000.0
 2013   48000   48000.0 30000   30000.0
@@ -83,10 +83,10 @@ struct<year:int,dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_avg(CAST(earn
 
 -- !query 6
 SELECT * FROM (
-  SELECT course, earnings FROM courseSales
+  SELECT udf(course) as course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 6 schema
@@ -100,23 +100,23 @@ SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings), min(year)
+  udf(sum(udf(earnings))), udf(min(year))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 7 schema
-struct<dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_min(year):int,Java_sum(CAST(earnings AS BIGINT)):bigint,Java_min(year):int>
+struct<dotNET_CAST(udf(cast(sum(cast(cast(udf(cast(earnings as string)) as int) as bigint)) as string)) AS BIGINT):bigint,dotNET_CAST(udf(cast(min(year) as string)) AS INT):int,Java_CAST(udf(cast(sum(cast(cast(udf(cast(earnings as string)) as int) as bigint)) as string)) AS BIGINT):bigint,Java_CAST(udf(cast(min(year) as string)) AS INT):int>
 -- !query 7 output
 63000  2012    50000   2012
 
 
 -- !query 8
 SELECT * FROM (
-  SELECT course, year, earnings, s
+  SELECT course, year, earnings, udf(s) as s
   FROM courseSales
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR s IN (1, 2)
 )
 -- !query 8 schema
@@ -135,11 +135,11 @@ SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings), min(s)
+  udf(sum(earnings)), udf(min(s))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 9 schema
-struct<year:int,dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_min(s):int,Java_sum(CAST(earnings AS BIGINT)):bigint,Java_min(s):int>
+struct<year:int,dotNET_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,dotNET_CAST(udf(cast(min(s) as string)) AS INT):int,Java_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,Java_CAST(udf(cast(min(s) as string)) AS INT):int>
 -- !query 9 output
 2012   15000   1       20000   1
 2013   48000   2       30000   2
@@ -152,7 +152,7 @@ SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings * s)
+  udf(sum(earnings * s))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 10 schema
@@ -167,7 +167,7 @@ SELECT 2012_s, 2013_s, 2012_a, 2013_a, c FROM (
   SELECT year y, course c, earnings e FROM courseSales
 )
 PIVOT (
-  sum(e) s, avg(e) a
+  udf(sum(e)) s, udf(avg(e)) a
   FOR y IN (2012, 2013)
 )
 -- !query 11 schema
@@ -182,7 +182,7 @@ SELECT firstYear_s, secondYear_s, firstYear_a, secondYear_a, c FROM (
   SELECT year y, course c, earnings e FROM courseSales
 )
 PIVOT (
-  sum(e) s, avg(e) a
+  udf(sum(e)) s, udf(avg(e)) a
   FOR y IN (2012 as firstYear, 2013 secondYear)
 )
 -- !query 12 schema
@@ -195,7 +195,7 @@ struct<firstYear_s:bigint,secondYear_s:bigint,firstYear_a:double,secondYear_a:do
 -- !query 13
 SELECT * FROM courseSales
 PIVOT (
-  abs(earnings)
+  udf(abs(earnings))
   FOR year IN (2012, 2013)
 )
 -- !query 13 schema
@@ -210,7 +210,7 @@ SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings), year
+  udf(sum(earnings)), year
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 14 schema
@@ -225,7 +225,7 @@ SELECT * FROM (
   SELECT course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (2012, 2013)
 )
 -- !query 15 schema
@@ -240,11 +240,11 @@ SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  ceil(sum(earnings)), avg(earnings) + 1 as a1
+  udf(ceil(udf(sum(earnings)))), avg(earnings) + 1 as a1
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 16 schema
-struct<year:int,dotNET_CEIL(sum(CAST(earnings AS BIGINT))):bigint,dotNET_a1:double,Java_CEIL(sum(CAST(earnings AS BIGINT))):bigint,Java_a1:double>
+struct<year:int,dotNET_CAST(udf(cast(CEIL(cast(udf(cast(sum(cast(earnings as bigint)) as string)) as bigint)) as string)) AS BIGINT):bigint,dotNET_a1:double,Java_CAST(udf(cast(CEIL(cast(udf(cast(sum(cast(earnings as bigint)) as string)) as bigint)) as string)) AS BIGINT):bigint,Java_a1:double>
 -- !query 16 output
 2012   15000   7501.0  20000   20001.0
 2013   48000   48001.0 30000   30001.0
@@ -255,7 +255,7 @@ SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(avg(earnings))
+  sum(udf(avg(earnings)))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 17 schema
@@ -272,7 +272,7 @@ SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, year) IN (('dotNET', 2012), ('Java', 2013))
 )
 -- !query 18 schema
@@ -289,7 +289,7 @@ SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, s) IN (('dotNET', 2) as c1, ('Java', 1) as c2)
 )
 -- !query 19 schema
@@ -306,7 +306,7 @@ SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, year) IN ('dotNET', 'Java')
 )
 -- !query 20 schema
@@ -319,7 +319,7 @@ Invalid pivot value 'dotNET': value data type string does not match pivot column
 -- !query 21
 SELECT * FROM courseSales
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (s, 2013)
 )
 -- !query 21 schema
@@ -332,7 +332,7 @@ cannot resolve '`s`' given input columns: [coursesales.course, coursesales.earni
 -- !query 22
 SELECT * FROM courseSales
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (course, 2013)
 )
 -- !query 22 schema
@@ -343,151 +343,118 @@ Literal expressions required for pivot values, found 'course#x';
 
 
 -- !query 23
-SELECT * FROM (
-  SELECT course, year, a
-  FROM courseSales
-  JOIN yearsWithComplexTypes ON year = y
-)
-PIVOT (
-  min(a)
-  FOR course IN ('dotNET', 'Java')
-)
--- !query 23 schema
-struct<year:int,dotNET:array<int>,Java:array<int>>
--- !query 23 output
-2012   [1,1]   [1,1]
-2013   [2,2]   [2,2]
-
-
--- !query 24
-SELECT * FROM (
-  SELECT course, year, y, a
-  FROM courseSales
-  JOIN yearsWithComplexTypes ON year = y
-)
-PIVOT (
-  max(a)
-  FOR (y, course) IN ((2012, 'dotNET'), (2013, 'Java'))
-)
--- !query 24 schema
-struct<year:int,[2012, dotNET]:array<int>,[2013, Java]:array<int>>
--- !query 24 output
-2012   [1,1]   NULL
-2013   NULL    [2,2]
-
-
--- !query 25
 SELECT * FROM (
   SELECT earnings, year, a
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR a IN (array(1, 1), array(2, 2))
 )
--- !query 25 schema
+-- !query 23 schema
 struct<year:int,[1, 1]:bigint,[2, 2]:bigint>
--- !query 25 output
+-- !query 23 output
 2012   35000   NULL
 2013   NULL    78000
 
 
--- !query 26
+-- !query 24
 SELECT * FROM (
-  SELECT course, earnings, year, a
+  SELECT course, earnings, udf(year) as year, a
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, a) IN (('dotNET', array(1, 1)), ('Java', array(2, 2)))
 )
--- !query 26 schema
+-- !query 24 schema
 struct<year:int,[dotNET, [1, 1]]:bigint,[Java, [2, 2]]:bigint>
--- !query 26 output
+-- !query 24 output
 2012   15000   NULL
 2013   NULL    30000
 
 
--- !query 27
+-- !query 25
 SELECT * FROM (
   SELECT earnings, year, s
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR s IN ((1, 'a'), (2, 'b'))
 )
--- !query 27 schema
+-- !query 25 schema
 struct<year:int,[1, a]:bigint,[2, b]:bigint>
--- !query 27 output
+-- !query 25 output
 2012   35000   NULL
 2013   NULL    78000
 
 
--- !query 28
+-- !query 26
 SELECT * FROM (
   SELECT course, earnings, year, s
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, s) IN (('dotNET', (1, 'a')), ('Java', (2, 'b')))
 )
--- !query 28 schema
+-- !query 26 schema
 struct<year:int,[dotNET, [1, a]]:bigint,[Java, [2, b]]:bigint>
--- !query 28 output
+-- !query 26 output
 2012   15000   NULL
 2013   NULL    30000
 
 
--- !query 29
+-- !query 27
 SELECT * FROM (
   SELECT earnings, year, m
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR m IN (map('1', 1), map('2', 2))
 )
--- !query 29 schema
+-- !query 27 schema
 struct<>
--- !query 29 output
+-- !query 27 output
 org.apache.spark.sql.AnalysisException
 Invalid pivot column 'm#x'. Pivot columns must be comparable.;
 
 
--- !query 30
+-- !query 28
 SELECT * FROM (
   SELECT course, earnings, year, m
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, m) IN (('dotNET', map('1', 1)), ('Java', map('2', 2)))
 )
--- !query 30 schema
+-- !query 28 schema
 struct<>
--- !query 30 output
+-- !query 28 output
 org.apache.spark.sql.AnalysisException
 Invalid pivot column 'named_struct(course, course#x, m, m#x)'. Pivot columns must be comparable.;
 
 
--- !query 31
+-- !query 29
 SELECT * FROM (
-  SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, "x" as x, "d" as d, "w" as w
+  SELECT course, earnings, udf("a") as a, udf("z") as z, udf("b") as b, udf("y") as y,
+  udf("c") as c, udf("x") as x, udf("d") as d, udf("w") as w
   FROM courseSales
 )
 PIVOT (
-  sum(Earnings)
+  udf(sum(Earnings))
   FOR Course IN ('dotNET', 'Java')
 )
--- !query 31 schema
+-- !query 29 schema
 struct<a:string,z:string,b:string,y:string,c:string,x:string,d:string,w:string,dotNET:bigint,Java:bigint>
--- !query 31 output
+-- !query 29 output
 a      z       b       y       c       x       d       w       63000   50000

How was this patch tested?

Tested as guided in SPARK-27921.

@chitralverma
Copy link
Contributor Author

chitralverma commented Jul 11, 2019

@HyukjinKwon I've raised this PR as a WIP till I incorporate your comments. I had some doubts regarding the tests in pivot.sql and was hoping you could clear it for me.

While porting 'pivot.sql', I ran the command below on the original sql and it fails when running for configs spark.sql.codegen.wholeStage=true,spark.sql.codegen.factoryMode=CODEGEN_ONLY

build/sbt "sql/test-only *SQLQueryTestSuite -- -z pivot.sql"

On inspection it seems like there is some discrepancy while handling the null values when passing through the udf. For Scala its expecting null, for Python its expecting None but the golden files contains nan. Thus the match is failing.

This error persists in the port also. As per the guide, I tried looking for a related Jira but couldn't find one, so I thought I'd run this by you first before creating one.

Stacktrace:

5:21:42.536 ERROR org.apache.spark.sql.SQLQueryTestSuite: Error using configs: spark.sql.codegen.wholeStage=true,spark.sql.codegen.factoryMode=CODEGEN_ONLY
[info] - udf/udf-pivot.sql - Regular Python UDF *** FAILED *** (24 seconds, 575 milliseconds)
[info]   Expected "Java	2012	20000	[nan
[info]   Java	2013	nan	30000
[info]   dotNET	2012	15000	nan
[info]   dotNET	2013	nan]	48000", but got "Java	2012	20000	[None
[info]   Java	2013	None	30000
[info]   dotNET	2012	15000	None
[info]   dotNET	2013	None]	48000" Result did not match for query #8
[info]   SELECT * FROM (
[info]     SELECT course, year, earnings, udf(s) as s
[info]     FROM courseSales
[info]     JOIN years ON year = y
[info]   )
[info]   PIVOT (
[info]     udf(sum(earnings))
[info]     FOR s IN (1, 2)
[info]   ) (SQLQueryTestSuite.scala:333)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at org.scalatest.Assertions.assertResult(Assertions.scala:1003)

5:21:17.912 ERROR org.apache.spark.sql.SQLQueryTestSuite: Error using configs: spark.sql.codegen.wholeStage=true,spark.sql.codegen.factoryMode=CODEGEN_ONLY
[info] - udf/udf-pivot.sql - Scala UDF *** FAILED *** (25 seconds, 411 milliseconds)
[info]   Expected "Java	2012	20000	n[an
[info]   Java	2013	nan	30000
[info]   dotNET	2012	15000	nan
[info]   dotNET	2013	nan]	48000", but got "Java	2012	20000	n[ull
[info]   Java	2013	null	30000
[info]   dotNET	2012	15000	null
[info]   dotNET	2013	null]	48000" Result did not match for query #8
[info]   SELECT * FROM (
[info]     SELECT course, year, earnings, udf(s) as s
[info]     FROM courseSales
[info]     JOIN years ON year = y
[info]   )
[info]   PIVOT (
[info]     udf(sum(earnings))
[info]     FOR s IN (1, 2)
[info]   ) (SQLQueryTestSuite.scala:333)

Any help will be appreciated. Thanks,

@HyukjinKwon
Copy link
Member

ok to test

@HyukjinKwon
Copy link
Member

add to whitelist

@HyukjinKwon
Copy link
Member

The null <> None issue should be due to different results from Python and Scala side. I described a bit of context in the PR description of #25069 for now ..

@HyukjinKwon
Copy link
Member

I discussed similar issue at #25098 with @skonto FWIW.

@SparkQA
Copy link

SparkQA commented Jul 12, 2019

Test build #107571 has finished for PR 25122 at commit 9be50d5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chitralverma
Copy link
Contributor Author

chitralverma commented Jul 12, 2019

@HyukjinKwon Thanks for your review comments, I've replied to them.

I feel the suggested workaround(this and this) will not work in this case as certain udf's like sum return null if aggregated on empty datasets (SPARK-20346).

As per the guide in SPARK-27921 , do you suggest I create a separate Jira for this and comment the problematic test cases for now ?

@SparkQA
Copy link

SparkQA commented Jul 12, 2019

Test build #107592 has finished for PR 25122 at commit b0336e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

@chitralverma, #25130 is merged. Can you rebase and sync against the current master?

@HyukjinKwon
Copy link
Member

retest this please

@HyukjinKwon
Copy link
Member

Looks fine otherwise if the tests pass. I will take another look before merging it in.

@SparkQA
Copy link

SparkQA commented Jul 18, 2019

Test build #107813 has finished for PR 25122 at commit b0336e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chitralverma
Copy link
Contributor Author

@HyukjinKwon I'm merging the changes and testing. will ping you once its done. thanks

@chitralverma
Copy link
Contributor Author

retest this please

@chitralverma chitralverma changed the title [SPARK-28286][SQL][PYTHON][TESTS][WIP] Convert and port 'pivot.sql' into UDF test base [SPARK-28286][SQL][PYTHON][TESTS] Convert and port 'pivot.sql' into UDF test base Jul 18, 2019
@SparkQA
Copy link

SparkQA commented Jul 18, 2019

Test build #107838 has finished for PR 25122 at commit efb5ece.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chitralverma
Copy link
Contributor Author

chitralverma commented Jul 18, 2019

@HyukjinKwon you can review this now. thanks.

I've also updated the diff in the OP

)
PIVOT (
udf(sum(earnings))
FOR course IN ('dotNET', 'Java')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can test it by myself but I thought you know the results - just out of curiosity, what do we get if we do FOR udf(course)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FOR udf(course) in line 29 will result in a org.apache.spark.sql.catalyst.parser.ParseException

@SparkQA
Copy link

SparkQA commented Jul 18, 2019

Test build #107839 has finished for PR 25122 at commit ae938b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

Thanks for working on this, @chitralverma

@chitralverma
Copy link
Contributor Author

Thanks for merging this @HyukjinKwon. This was my first contribution, looking forward to doing more. :D

@chitralverma chitralverma deleted the SPARK-28286 branch July 18, 2019 14:57
@SparkQA
Copy link

SparkQA commented Jul 18, 2019

Test build #107849 has finished for PR 25122 at commit 46120b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 18, 2019

Test build #107851 has finished for PR 25122 at commit f979a47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Thank YOU @chitralverma for staying focused on each diff here and writing a PR even without nits :D. Nowdays, those details are a key.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants