[SPARK-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive by SaurabhChawla100 · Pull Request #33679 · apache/spark

SaurabhChawla100 · 2021-08-08T13:14:15Z

What changes were proposed in this pull request?

Add support in Spark for grouping by a map-type column in the scenario that already works in Hive.
In Hive, this scenario works fine:

describe extended complex2;
OK
id                  string 
c1                  map<int, string>   
Detailed Table Information Table(tableName:complex2, dbName:default, owner:abc, createTime:1627994412, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:c1, type:map<int,string>, comment:null)], location:/user/hive/warehouse/complex2, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1

select * from complex2;
OK
1 {1:"u"}
2 {1:"u",2:"uo"}
1 {1:"u",2:"uo"}
Time taken: 0.363 seconds, Fetched: 3 row(s)

Working scenario in Hive:

select id, c1, count(*) from complex2 group by id, c1;
OK
1 {1:"u"} 1
1 {1:"u",2:"uo"} 1
2 {1:"u",2:"uo"} 1
Time taken: 1.621 seconds, Fetched: 3 row(s)

Failing scenario in Hive:
When the map type is present in an aggregate expression
select id, max(c1), count(*) from complex2 group by id, c1; 
FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or complex type containing map<>.

In Spark, however, grouping by a map column fails even when the map column is used in the select list without any aggregation, which is the scenario that works in Hive.

scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
org.apache.spark.sql.AnalysisException: expression spark_catalog.default.complex2.`c1` cannot be used as a grouping expression because its data type map<int,string> is not an orderable data type.;
Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
+- SubqueryAlias spark_catalog.default.complex2
 +- HiveTableRelation [`default`.`complex2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2], Partition Cols: []]
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
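
The error message above says the grouping column is rejected because its type is not orderable. As a rough illustration of that kind of check, here is a minimal sketch in plain Scala. The `SimpleType` hierarchy, `isOrderableSketch`, and `checkGrouping` are hypothetical names invented for this example and are not Spark's classes or API; the only behavior taken from this PR is that a map type is treated as not orderable.

```scala
// Hypothetical, simplified type model; these are NOT Spark's DataType classes.
sealed trait SimpleType
case object IntT extends SimpleType
case object StringT extends SimpleType
final case class ArrayT(element: SimpleType) extends SimpleType
final case class StructT(fields: List[SimpleType]) extends SimpleType
final case class MapT(key: SimpleType, value: SimpleType) extends SimpleType

// A type is orderable when it is atomic, or when all of its nested types are orderable.
// Maps are treated as not orderable, as in the error message above.
def isOrderableSketch(t: SimpleType): Boolean = t match {
  case IntT | StringT  => true
  case ArrayT(e)       => isOrderableSketch(e)
  case StructT(fields) => fields.forall(isOrderableSketch)
  case MapT(_, _)      => false
}

// Returns Left(message) for the first grouping column whose type is not orderable,
// or Right(()) when every grouping column passes the check.
def checkGrouping(cols: List[(String, SimpleType)]): Either[String, Unit] =
  cols.collectFirst {
    case (name, t) if !isOrderableSketch(t) =>
      s"expression $name cannot be used as a grouping expression"
  }.toLeft(())
```

Under this model, `checkGrouping(List("id" -> StringT, "c1" -> MapT(IntT, StringT)))` yields a `Left`, mirroring the failure for `group by id, c1`.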

Why are the changes needed?

Spark should support this scenario, in which a grouping expression may have a map type as long as no aggregate expression references that map column. This helps users migrate from Hive to Spark.
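
The distinction between grouping by a map and aggregating over one can be illustrated outside Spark. The sketch below uses plain Scala collections as an analogy only; it says nothing about how Spark represents or compares map values internally. The `Rec` case class is a hypothetical stand-in for a row of the `complex2` table shown earlier.

```scala
// Hypothetical stand-in for a row of complex2: (id, c1).
final case class Rec(id: String, c1: Map[Int, String])

val rows = Seq(
  Rec("1", Map(1 -> "u")),
  Rec("2", Map(1 -> "u", 2 -> "uo")),
  Rec("1", Map(2 -> "uo", 1 -> "u")) // same entries as {1:"u",2:"uo"}, written in another order
)

// Grouping only needs equality and hashing on the key, which Scala maps provide,
// so (id, c1) can serve as a grouping key.
val counts: Map[(String, Map[Int, String]), Int] =
  rows.groupBy(r => (r.id, r.c1)).map { case (key, group) => key -> group.size }

// By contrast, rows.map(_.c1).max does not compile, because the standard library
// defines no implicit Ordering for Map[Int, String]. This parallels max(c1)
// being rejected while group by c1 is accepted.
```

Here `counts` has three entries, each with a count of 1, matching the three result rows of the Hive query.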

After the code change:

scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
+---+-----------------+--------+                                                
| id|               c1|count(1)|
+---+-----------------+--------+
|  1|         {1 -> u}|       1|
|  2|{1 -> u, 2 -> uo}|       1|
|  1|{1 -> u, 2 -> uo}|       1|
+---+-----------------+--------+

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test and also tested the scenario using spark-shell.

AmplabJenkins · 2021-08-08T13:14:30Z

Can one of the admins verify this patch?

HyukjinKwon · 2021-08-09T02:27:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala


  /**
-   * Returns true iff the data type can be ordered (i.e. can be sorted).
+   * Returns true if the data type can be ordered (i.e. can be sorted).


"iff" is an abbreviation of "if and only if".

HyukjinKwon · 2021-08-09T02:28:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala

+   * Returns true if the data type can be ordered (i.e. can be sorted).
   */
-  def isOrderable(dataType: DataType): Boolean = dataType match {
+  def isOrderable(dataType: DataType,


Should we fix #31967 first?

@HyukjinKwon - Thanks for checking this PR. Yes, we can wait for PR #32552. The fix there will work with group by, order by, and partition by in a window.

c21 · 2021-08-09T05:20:24Z

I thought @maropu is still working on this? (#32552)

SaurabhChawla100 · 2021-08-09T05:57:06Z

> I thought @maropu is still working on this? (#32552)

I was not aware that there is already a JIRA for this map issue. Yes, that PR (#32552) will fix the use case that I am addressing in this PR.

github-actions · 2021-11-18T00:10:31Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the SQL label Aug 8, 2021

SaurabhChawla100 changed the title ~~[Spark 36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive~~ [Spark-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive Aug 8, 2021

SaurabhChawla100 changed the title ~~[Spark-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive~~ [SPARK-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive Aug 8, 2021

SaurabhChawla100 force-pushed the SPARK-36452 branch from 2b4bf6f to fc95f3f on August 8, 2021 15:51

Add the support for Maptype in the Group by in spark Sql

a989080

SaurabhChawla100 force-pushed the SPARK-36452 branch from fc95f3f to 8db4d3a on August 8, 2021 19:22

add the unit test for the map column in group by

e6505d1

SaurabhChawla100 force-pushed the SPARK-36452 branch from 8db4d3a to e6505d1 on August 8, 2021 19:23

HyukjinKwon reviewed Aug 9, 2021

update the comment

a09e37f

github-actions bot added the Stale label Nov 18, 2021

github-actions bot closed this Nov 19, 2021

SaurabhChawla100 wants to merge 3 commits into apache:master from SaurabhChawla100:SPARK-36452

4 participants
