[SPARK-39041][SQL] Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly #36373

yaooqinn · 2022-04-27T09:17:57Z

What changes were proposed in this pull request?

This PR is mainly a refactoring. It prepares the thrift server to support both TimestampNTZ and TimestampLTZ in the future.

Spark's Hive dependencies fall into two categories:

As a client, for accessing the Hive metastore, storage, etc. This is currently v2.3.9. It is better to stay on a stable, lower version so that newer Hive metastore servers can still serve it through backward compatibility.
As a server, for being accessed by Hive JDBC/Thrift clients such as beeline. This is currently v3.1.2. It is better to use a higher version here so that more clients are supported.

The problem is that we currently convert Spark results to org.apache.hadoop.hive.serde2.thrift.Type first and then to org.apache.hive.service.rpc.thrift.TTypeId. The former does not have two timestamp types; specifically, it lacks TIMESTAMPLOCALTZ_TYPE.

To avoid this, we take a shortcut and map Spark results to the thrift schema and rowset directly.

This also avoids some unnecessary memory copies when converting from one type to the other.

Most of this functionality has been exercised in apache/kyuubi for several years.
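The type-mapping argument above can be sketched with simplified stand-in types. None of the names below are the real Spark or Hive classes; they only model an intermediate enum with a single timestamp entry and a target enum with two. Note that this PR itself still maps both Spark timestamp types to TIMESTAMP_TYPE, so the `direct` mapping shows what a direct conversion makes possible later, not what is merged here.

```scala
// Simplified stand-ins, NOT the real Spark/Hive classes.
sealed trait SparkType
case object TimestampLtz extends SparkType // stands in for Spark's TimestampType
case object TimestampNtz extends SparkType // stands in for Spark's TimestampNTZType
case object Str extends SparkType          // stands in for Spark's StringType

// Stand-in for the intermediate serde-side enum: only one timestamp entry.
sealed trait SerdeType
case object SerdeTimestamp extends SerdeType
case object SerdeString extends SerdeType

// Stand-in for the RPC-side enum: has a separate local-time-zone entry.
sealed trait RpcTypeId
case object RpcTimestamp extends RpcTypeId
case object RpcTimestampLocalTz extends RpcTypeId
case object RpcString extends RpcTypeId

object Mapping {
  // First hop: both Spark timestamp types collapse into one entry.
  def toSerde(t: SparkType): SerdeType = t match {
    case TimestampLtz | TimestampNtz => SerdeTimestamp
    case Str => SerdeString
  }

  // Second hop: the distinction is already lost and cannot be recovered.
  def serdeToRpc(t: SerdeType): RpcTypeId = t match {
    case SerdeTimestamp => RpcTimestamp
    case SerdeString => RpcString
  }

  def twoHop(t: SparkType): RpcTypeId = serdeToRpc(toSerde(t))

  // Direct mapping: free to pick a different RPC type id for each Spark type.
  def direct(t: SparkType): RpcTypeId = t match {
    case TimestampLtz => RpcTimestampLocalTz
    case TimestampNtz => RpcTimestamp
    case Str => RpcString
  }
}
```

With the two-hop path, `Mapping.twoHop(TimestampLtz)` and `Mapping.twoHop(TimestampNtz)` are indistinguishable, while `Mapping.direct` can keep them apart.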

Why are the changes needed?

To support more Spark data types through Hive JDBC.

Does this PR introduce any user-facing change?

How was this patch tested?

Existing unit tests should be sufficient.


yaooqinn · 2022-04-27T09:23:21Z

...r/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala

+        Map(
+          TCLIServiceConstants.PRECISION -> TTypeQualifierValue.i32Value(d.precision),
+          TCLIServiceConstants.SCALE -> TTypeQualifierValue.i32Value(d.scale)).asJava
+      case _ => Collections.emptyMap[String, TTypeQualifierValue]()


If char/varchar can be seen via the result set metadata, we should also provide the char length here so that clients can take advantage of it.
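A minimal sketch of what that could look like, assuming char/varchar types do reach this code path. The object name, the helper name `toQualifiers`, and the two char/varchar cases are hypothetical and not part of this PR. The decimal case mirrors the hunk above, and `TCLIServiceConstants.CHARACTER_MAXIMUM_LENGTH` is assumed to be the appropriate qualifier key.

```scala
import java.util.Collections
import scala.collection.JavaConverters._

import org.apache.hive.service.rpc.thrift.{TCLIServiceConstants, TTypeQualifierValue}
import org.apache.spark.sql.types.{CharType, DataType, DecimalType, VarcharType}

object QualifierSketch {
  // Hypothetical helper: builds the thrift type qualifiers for one column type.
  def toQualifiers(typ: DataType): java.util.Map[String, TTypeQualifierValue] = typ match {
    // Same shape as the decimal case in the hunk above.
    case d: DecimalType =>
      Map(
        TCLIServiceConstants.PRECISION -> TTypeQualifierValue.i32Value(d.precision),
        TCLIServiceConstants.SCALE -> TTypeQualifierValue.i32Value(d.scale)).asJava
    // Illustrative additions: expose the declared length of char/varchar columns.
    case CharType(length) =>
      Map(TCLIServiceConstants.CHARACTER_MAXIMUM_LENGTH ->
        TTypeQualifierValue.i32Value(length)).asJava
    case VarcharType(length) =>
      Map(TCLIServiceConstants.CHARACTER_MAXIMUM_LENGTH ->
        TTypeQualifierValue.i32Value(length)).asJava
    case _ => Collections.emptyMap[String, TTypeQualifierValue]()
  }
}
```

A client reading the column metadata could then look up the length under the same key instead of treating the column as an unbounded string.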

yaooqinn · 2022-04-27T09:24:45Z

...r/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala

+    case StringType => TTypeId.STRING_TYPE
+    case _: DecimalType => TTypeId.DECIMAL_TYPE
+    case DateType => TTypeId.DATE_TYPE
+    // TODO: Shall use TIMESTAMPLOCALTZ_TYPE, keep AS-IS now for


TimestampType is still converted to TTypeId.TIMESTAMP_TYPE in this PR.

yaooqinn · 2022-04-27T09:29:48Z

sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/CLIServiceClient.java

@@ -35,7 +36,7 @@ public SessionHandle openSession(String username, String password)
  }

  @Override
-  public RowSet fetchResults(OperationHandle opHandle) throws HiveSQLException {
+  public TRowSet fetchResults(OperationHandle opHandle) throws HiveSQLException {


CLIServiceClient and its subclasses are used only in our tests, so the changes here will not affect real-world clients.


cloud-fan · 2022-04-28T13:38:23Z

sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/RowSetUtils.scala

+
+  private def toTColumnValue(
+      ordinal: Int,
+      row: Row,


Not related to this PR, but it seems it would be more efficient to convert InternalRow directly to TRowSet.

cloud-fan · 2022-05-12T14:50:06Z

sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSession.java

-  RowSet fetchResults(OperationHandle opHandle, FetchOrientation orientation,
-      long maxRows, FetchType fetchType) throws HiveSQLException;
+  TRowSet fetchResults(OperationHandle opHandle, FetchOrientation orientation,
+                       long maxRows, FetchType fetchType) throws HiveSQLException;


The previous indentation was correct.

cloud-fan · 2022-05-12T14:55:55Z

...r/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala

+    // TODO: Shall use TIMESTAMPLOCALTZ_TYPE, keep AS-IS now for
+    // unnecessary behavior change
+    case TimestampType => TTypeId.TIMESTAMP_TYPE
+    case TimestampNTZType => TTypeId.TIMESTAMP_TYPE


Maybe I'm missing some context here. Looking at RowSetUtils, datetime types use String as the result. Where do we use TTypeId?

yaooqinn · 2022-05-12

This is for the rowset schema. On the JDBC client side, a caller can use getString to get the raw string, or getObject to get a Java object; the latter is where the type id is used.

cloud-fan

LGTM except for one question


yaooqinn · 2022-05-13T02:36:25Z

Thanks, merged to master.

[SPARK-39041][SQL] Mapping Spark Query ResultSet/Schema to TRowSet/TT…

2b1aa5f

…ableSchema directly


github-actions bot added the SQL label Apr 27, 2022

[SPARK-39041][SQL] Mapping Spark Query ResultSet/Schema to TRowSet/TT…

e4d464a

…ableSchema directly


add comments

44669db


cloud-fan approved these changes May 12, 2022


Update sql/hive-thriftserver/src/main/java/org/apache/hive/service/cl…

2ac9444

…i/session/HiveSession.java

yaooqinn closed this in c82af8d May 13, 2022

yaooqinn deleted the SPARK-39041 branch May 13, 2022 02:36

pan3793 mentioned this pull request May 30, 2022

[Improvement] Decouple Kyuubi Hive JDBC from Hive Serde apache/kyuubi#2782

Closed

3 tasks
