
[LIVY-622][LIVY-623][LIVY-624][LIVY-625][Thrift] Support GetFunctions, GetSchemas, GetTables, GetColumns in Livy thrift server #194

Status: Closed (wants to merge 5 commits)

@yiheng (Contributor) commented Aug 8, 2019

What changes were proposed in this pull request?

In this patch, we add the implementations of GetSchemas, GetFunctions, GetTables, and GetColumns in Livy Thrift server.

https://issues.apache.org/jira/browse/LIVY-622
https://issues.apache.org/jira/browse/LIVY-623
https://issues.apache.org/jira/browse/LIVY-624
https://issues.apache.org/jira/browse/LIVY-625

How was this patch tested?

Added new unit tests and an integration test, and ran them together with the existing tests.

@@ -427,8 +427,8 @@ abstract class ThriftCLIService(val cliService: LivyCLIService, val serviceName:
  override def GetSchemas(req: TGetSchemasReq): TGetSchemasResp = {
    val resp = new TGetSchemasResp
    try {
-     val opHandle = cliService.getSchemas(
-       new SessionHandle(req.getSessionHandle), req.getCatalogName, req.getSchemaName)
+     val opHandle = cliService.getSchemas(createSessionHandle(req.getSessionHandle),
yiheng (Contributor, author) commented:

Create a session handle with the real protocol version of this session. The original code used version_v1 as the default, which does not satisfy the protocol-version check when generating the Thrift result set (see the linked code).
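The fix above can be sketched in miniature: instead of defaulting every handle to the oldest protocol version, look up the version that was actually negotiated when the session was opened. This is a minimal, hypothetical model; the names `ProtocolVersion`, `SessionHandles`, and the map-based registry are illustrative only and not Livy's actual classes.

```java
// Hypothetical sketch of the createSessionHandle fix: use the protocol
// version negotiated at open time rather than a hard-coded V1 default.
enum ProtocolVersion { V1, V8 }

class SessionHandles {
    // Illustrative registry of sessionId -> negotiated protocol version.
    private final java.util.Map<String, ProtocolVersion> negotiated =
        new java.util.HashMap<>();

    void register(String sessionId, ProtocolVersion v) {
        negotiated.put(sessionId, v);
    }

    // Mirrors the intent of createSessionHandle(...): return the stored
    // version, falling back to V1 only when the session is unknown.
    ProtocolVersion versionFor(String sessionId) {
        return negotiated.getOrDefault(sessionId, ProtocolVersion.V1);
    }
}
```

With the original `new SessionHandle(...)` construction, every handle behaved like the unknown-session fallback here, which is what broke result-set generation.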

@@ -44,6 +45,40 @@ abstract class MetadataOperation(sessionHandle: SessionHandle, opType: Operation
    if (orientation.equals(FetchOrientation.FETCH_FIRST)) {
      rowSet.setRowOffset(0)
    }
-   rowSet
+   rowSet.extractSubset(maxRows)
yiheng (Contributor, author) commented:

Fixes the issue where the metadata result set is infinite: a new rowSet containing only the subset data is generated, and the row offset of the original rowSet is advanced.
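The semantics behind the `extractSubset(maxRows)` call can be sketched with a simplified model: each fetch must return at most `maxRows` rows and advance the offset, otherwise a client fetching in a loop never sees an empty batch and never terminates. This `RowSet` class is a hypothetical simplification, not Livy's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the fetch fix. The real class lives in the Livy
// thrift server; only the offset-advancing behavior is modeled here.
class RowSet {
    private final List<Object[]> rows;
    private int rowOffset; // index of the first row not yet fetched

    RowSet(List<Object[]> rows) { this.rows = rows; }

    void setRowOffset(int offset) { this.rowOffset = offset; }

    // Returns a new RowSet with up to maxRows rows and moves this
    // rowSet's offset forward, mirroring extractSubset(maxRows).
    RowSet extractSubset(int maxRows) {
        int end = Math.min(rowOffset + maxRows, rows.size());
        List<Object[]> subset = new ArrayList<>(rows.subList(rowOffset, end));
        rowOffset = end;
        return new RowSet(subset);
    }

    int numRows() { return rows.size(); }
}
```

Returning `rowSet` directly (the old code) handed the full, never-shrinking result back on every fetch; with the subset extraction, repeated fetches eventually return an empty batch and the client stops.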

@yiheng (Contributor, author) commented Aug 8, 2019

Unlike the Spark thrift server, we use the Spark catalog to fetch the metadata instead of a Hive client, to avoid too strong a binding between Livy and Hive. @mgaido91

@yiheng yiheng closed this Aug 9, 2019
@yiheng yiheng reopened this Aug 9, 2019
@yiheng yiheng closed this Aug 9, 2019
@yiheng yiheng reopened this Aug 9, 2019
@mgaido91 (Contributor) left a comment:

Could you please also test your patch using SQuirreL or a similar tool, to verify that the metadata information is retrieved correctly? It would be great to include screenshots showing that it works.

Thanks for your contribution!

for (Row r : rows) {
schemas.add(new Object[]{
r.getString(0),
""
mgaido91 (Contributor) commented:

Why do we need this? If it is always empty, it makes little sense to return it, doesn't it?

yiheng (Contributor, author) commented:

This follows the Spark thrift server: the schema catalog field is not supported in Spark, so an empty string is returned here. This is the related Spark code: code1, code2.

Use a meaningful variable to hold the value (code).

}
}

public static Integer getColumnSize(org.apache.spark.sql.types.DataType type) {
mgaido91 (Contributor) commented:

Where did you take this and the following methods from?

yiheng (Contributor, author) commented:

They're from the Spark thrift server. Please find the related code here.
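For context, `getColumnSize` feeds the JDBC `COLUMN_SIZE` field of GetColumns results: a precision for numeric types, and null when the size is unbounded or unknown. The sketch below is a loose simplification keyed on type names; the real method ported from the Spark thrift server works on `org.apache.spark.sql.types.DataType`, and the specific sizes here are illustrative assumptions, not the exact values Spark returns.

```java
// Hedged sketch of getColumnSize: JDBC COLUMN_SIZE is the maximum number
// of decimal digits a numeric type can represent, and null for types
// without a fixed size (strings, binaries, complex types).
class ColumnSizes {
    static Integer getColumnSize(String typeName) {
        switch (typeName) {
            case "ByteType":    return 3;  // signed 8-bit: up to 3 digits
            case "ShortType":   return 5;  // signed 16-bit: up to 5 digits
            case "IntegerType": return 10; // signed 32-bit: up to 10 digits
            case "LongType":    return 19; // signed 64-bit: up to 19 digits
            case "FloatType":   return 7;  // assumed decimal precision
            case "DoubleType":  return 15; // assumed decimal precision
            default:            return null; // unbounded or unknown size
        }
    }
}
```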

@yiheng (Contributor, author) commented Aug 13, 2019

@mgaido91 I tested with beeline and squirrel-sql. Please note that there are some issues in the existing metadata operations (getCatalog/getTableTypes/getTypeInfo); I raised another patch, #197, to fix them.

After fixing these issues, here are the screenshots:

[three screenshots attached]

@yiheng (Contributor, author) commented Aug 13, 2019

I have addressed the review comments. @mgaido91 could you start another round of code review? Thanks!

@codecov-io commented Aug 14, 2019

Codecov Report

Merging #194 into master will increase coverage by 40.28%. The diff coverage is n/a.

@@              Coverage Diff              @@
##             master     #194       +/-   ##
=============================================
+ Coverage     28.33%   68.62%   +40.28%     
- Complexity      343      912      +569     
=============================================
  Files           100      100               
  Lines          5679     5679               
  Branches        855      855               
=============================================
+ Hits           1609     3897     +2288     
+ Misses         3739     1224     -2515     
- Partials        331      558      +227
Impacted Files Coverage Δ Complexity Δ
...main/scala/org/apache/livy/server/LivyServer.scala 35.96% <0%> (+0.98%) 11% <0%> (ø) ⬇️
...rver/src/main/scala/org/apache/livy/LivyConf.scala 95.87% <0%> (+1.03%) 21% <0%> (+3%) ⬆️
.../main/scala/org/apache/livy/server/WebServer.scala 53.33% <0%> (+1.66%) 10% <0%> (+1%) ⬆️
...la/org/apache/livy/server/batch/BatchSession.scala 86.17% <0%> (+2.12%) 14% <0%> (ø) ⬇️
...la/org/apache/livy/utils/SparkProcessBuilder.scala 54.44% <0%> (+2.22%) 11% <0%> (+1%) ⬆️
.../scala/org/apache/livy/sessions/SessionState.scala 61.11% <0%> (+2.77%) 2% <0%> (ø) ⬇️
...e/livy/server/interactive/InteractiveSession.scala 69.11% <0%> (+3.66%) 44% <0%> (+2%) ⬆️
...org/apache/livy/server/recovery/SessionStore.scala 80% <0%> (+5%) 10% <0%> (ø) ⬇️
...ain/scala/org/apache/livy/server/JsonServlet.scala 38.46% <0%> (+5.76%) 18% <0%> (+4%) ⬆️
.../apache/livy/server/batch/CreateBatchRequest.scala 68.75% <0%> (+6.25%) 19% <0%> (+1%) ⬆️
... and 70 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e7f23e0...a25488c.

// The initialization needs to be lazy so that it does not block when the instance is created
protected lazy val rscClient = {
// This call is blocking: we are waiting for the session to be ready.
sessionManager.getLivySession(sessionHandle).client.get
A Contributor commented:

Shall we check that client is not null, e.g. require(client != null)?

yiheng (Contributor, author) commented:

From this code, it seems that if we cannot get a session, an error will be thrown in the Livy session manager.

@jerryshao (Contributor) left a comment:

LGTM, just some small issues.

@jerryshao (Contributor) commented:

LGTM, merging to master branch, thanks for the contribution.

@jerryshao jerryshao closed this in cae9d97 Aug 16, 2019
@mgaido91 (Contributor) left a comment:

I think there are still some critical open points in this PR even though it was merged. I'd suggest continuing the discussion and creating a follow-up that fixes the remaining issues.

GetFunctionsOperation.SCHEMA
}

private def convertFunctionName(name: String): String = {
mgaido91 (Contributor) commented:

Sorry, I still don't understand why we need this method. Could you explain it to me?

yiheng (Contributor, author) commented:

This is ported from here. The basic reason is that Spark uses a regex to filter the function names (see here and here), so we need to convert the SQL wildcard into a regex.
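The wildcard-to-regex conversion described above can be sketched as follows: SQL metadata patterns use `%` (any sequence of characters) and `_` (any single character), while Spark's function lookup filters by regex. This is a minimal sketch under that assumption; the class name and the quoting of other regex metacharacters are choices of this example, not necessarily what `convertFunctionName` in the patch does.

```java
import java.util.regex.Pattern;

// Illustrative conversion of a SQL LIKE-style metadata pattern into a
// Java regex: '%' -> ".*", '_' -> '.', everything else quoted literally.
class FunctionNamePattern {
    static String convertFunctionName(String pattern) {
        StringBuilder sb = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            switch (c) {
                case '%': sb.append(".*"); break;
                case '_': sb.append('.');  break;
                default:
                    // escape characters that are special in a regex
                    sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return sb.toString();
    }
}
```

For example, the pattern `to%` would then match function names such as `to_date` via `String.matches`.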

/**
* MetadataOperation is the base class for operations which do not perform any call on Spark side
*
* @param sessionHandle
mgaido91 (Contributor) commented:

What does this mean? There is no description at all, and the names of the parameters can already be read from the method signature...

yiheng (Contributor, author) commented:

Let me remove it.

val maxRows = maxRowsL.toInt
val results = rscClient.submit(new FetchCatalogResultJob(sessionId, jobId, maxRows)).get()

val rowSet = ThriftResultSet.apply(getResultSetSchema, protocolVersion)
mgaido91 (Contributor) commented:

Why not ThriftResultSet.apply(results)?

yiheng (Contributor, author) commented Aug 23, 2019:

results is a List<Object[]>.

import org.apache.livy.Job;
import org.apache.livy.JobContext;

public class CleanupCatalogResultJob implements Job<Boolean> {
mgaido91 (Contributor) commented:

I don't understand why we need this and the new state, instead of reusing the existing one for statements.

yiheng (Contributor, author) commented Aug 23, 2019:

Currently, I fetch the metadata objects from the Spark SessionCatalog API and construct the result set on the Livy server side. So the schema and types of the StatementState are useless in this case, and Iterator<Row> does not quite fit the data being sent.

Maybe we could change it to construct the ResultSet in SparkCatalogJob and return it directly to the client from the Livy server. That would reduce the number of SparkCatalogOperation classes. Is this what you mean?

mgaido91 (Contributor) commented:

> Maybe we can change to construct the ResultSet on SparkCatalogJob

This should definitely be done. We should always transfer a ResultSet over the wire, since it is a compressed representation of the data compared to plain arrays of objects.

yiheng (Contributor, author) commented:

OK, I will submit a PR to refactor the code.

try {
rscClient.submit(new GetTablesJob(
convertSchemaPattern(schemaName),
convertIdentifierPattern(tableName, datanucleusFormat = true),
mgaido91 (Contributor) commented:

I am not sure how you decided whether to set datanucleusFormat to true or false in the various calls; could you explain?

yiheng (Contributor, author) commented:

When passing the pattern to the Spark SessionCatalog API (e.g. to list tables), datanucleusFormat is set to true. convertPattern then replaces % with *; the SessionCatalog requires the * wildcard and converts it to .* internally (see here).

When filtering objects by the pattern in the Livy code (e.g. to list columns), datanucleusFormat is set to false, and % is replaced by .*, since we use a regex to filter the names (see here).
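The two modes of the conversion described above can be sketched side by side: the same SQL `%` wildcard becomes `*` when the pattern is handed to Spark's SessionCatalog, and `.*` when Livy filters names with a regex itself. This is a deliberately minimal sketch; escaping and `_` handling are omitted, and the class name is hypothetical.

```java
// Minimal sketch of the datanucleusFormat flag described above.
// true  -> catalog-style wildcard ('*') for Spark's SessionCatalog
// false -> regex wildcard (".*") for Livy-side name filtering
class MetadataPattern {
    static String convertPattern(String pattern, boolean datanucleusFormat) {
        // Only '%' is handled here; '_' and escaping are omitted for brevity.
        return pattern.replace("%", datanucleusFormat ? "*" : ".*");
    }
}
```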

mgaido91 added a commit that referenced this pull request Aug 30, 2019
…Set in catalog operations

## What changes were proposed in this pull request?

This is a followup of #194 which addresses all the remaining concerns. The main changes are:

 - reverting the introduction of a state specific for catalog operations;
 - usage of `ResultSet` to send over the wire the data for catalog operations too.

## How was this patch tested?

existing modified UTs

Author: Marco Gaido <mgaido@apache.org>

Closes #217 from mgaido91/LIVY-622_followup.