Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-39041][SQL] Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly #36373

Closed
wants to merge 4 commits into from

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Apr 27, 2022

What changes were proposed in this pull request?

The PR is mainly refactoring, aiming to support TimestampNTZ/LTZ at the same time in the future for the thrift server.

As we all know, in spark, we have hive dependencies, which can be classified into two types:

  • as a client, for accessing hive metastore/storage, etc, which is now v2.3.9, better to stay in a stable low version to be supported by higher hive metastore servers with backward compatibility
  • as a server, for being accessed by Hive JDBC/Thrift client, e.g. beeline, which is now v3.1.2, better to have a higher version to support more clients

The problem here is that we now convert spark results to org.apache.hadoop.hive.serde2.thrift.Type first and then to org.apache.hive.service.rpc.thrift.TTypeId. the former does not have 2 timestamp types, namely, doesn't have TIMESTAMPLOCALTZ_TYPE.

To avoid this, we take a shortcut to map spark results to thrift schema and rowset directly.

Besides, it also can avoid some unnecessary memory copies from type to type.

Most functionalities have been verified in apache/kyuubi for several years.

Why are the changes needed?

for supporting more spark datatypes through hive jdbc

Does this PR introduce any user-facing change?

How was this patch tested?

existing ut shall be enough

Map(
TCLIServiceConstants.PRECISION -> TTypeQualifierValue.i32Value(d.precision),
TCLIServiceConstants.SCALE -> TTypeQualifierValue.i32Value(d.scale)).asJava
case _ => Collections.emptyMap[String, TTypeQualifierValue]()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if char/varchar can be seen via resultset metadata, we shall also provide the char length here for client to take advantage of it

case StringType => TTypeId.STRING_TYPE
case _: DecimalType => TTypeId.DECIMAL_TYPE
case DateType => TTypeId.DATE_TYPE
// TODO: Shall use TIMESTAMPLOCALTZ_TYPE, keep AS-IS now for
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TimestampType keeps being converted to TTypeId.TIMESTAMP_TYPE in this PR

@@ -35,7 +36,7 @@ public SessionHandle openSession(String username, String password)
}

@Override
public RowSet fetchResults(OperationHandle opHandle) throws HiveSQLException {
public TRowSet fetchResults(OperationHandle opHandle) throws HiveSQLException {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLIServiceClient and its child is used in our test only, changes here will not have impacts on real-world clients

@github-actions github-actions bot added the SQL label Apr 27, 2022

private def toTColumnValue(
ordinal: Int,
row: Row,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this PR, but it seems more efficient if we can convert InternalRow directly to TRowSet

RowSet fetchResults(OperationHandle opHandle, FetchOrientation orientation,
long maxRows, FetchType fetchType) throws HiveSQLException;
TRowSet fetchResults(OperationHandle opHandle, FetchOrientation orientation,
long maxRows, FetchType fetchType) throws HiveSQLException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous indentation was correct.

// TODO: Shall use TIMESTAMPLOCALTZ_TYPE, keep AS-IS now for
// unnecessary behavior change
case TimestampType => TTypeId.TIMESTAMP_TYPE
case TimestampNTZType => TTypeId.TIMESTAMP_TYPE
Copy link
Contributor

@cloud-fan cloud-fan May 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing some context here. Looking at RowSetUtils, datetime types are using String as result. Where do we use TTypeId?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for rowset schema. jdbc client side can call getString to get the raw string or getObject to get an java object where it is used

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one question

@yaooqinn yaooqinn closed this in c82af8d May 13, 2022
@yaooqinn yaooqinn deleted the SPARK-39041 branch May 13, 2022 02:36
@yaooqinn
Copy link
Member Author

thanks, merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants