
[LIVY-502] Remove dependency on hive-exec #117

Closed
wants to merge 8 commits into from

Conversation


@mgaido91 mgaido91 commented Sep 28, 2018

What changes were proposed in this pull request?

This PR removes the dependency on hive-exec. The only Hive modules used after this PR are hive-service-rpc and hive-service. This drastically reduces the number of JARs needed by the thriftserver module.

The PR takes the Hive classes we were using and adapts them where necessary (or simplifies them when we don't need something) in order to work in the Livy thriftserver.

Most of the classes are migrated as-is, replacing all occurrences of HiveConf, HiveSession and other Hive-specific classes. Only one class, AuthFactory, has logic quite different from Hive's, because we handle the UGI differently from Hive (we are not running the thriftserver in a standalone JVM).

The functionalities we are taking from Hive are:

  • the thrift protocol endpoints (the classes in the cli package), i.e. the classes handling the lower-level details of the communication with the client;
  • the authentication layer (the classes in the auth package, used by the ones in cli); this PR leaves LDAP and PAM out of scope, as they are not trivial to port and are not needed for a working solution. We can add them later;
  • the classes in the operation package, which are used to answer metadata queries from the JDBC driver.
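The migration pattern described above (forked classes keep Hive's shape but drop HiveConf and HiveSession) can be sketched roughly as follows; all names here are simplified, hypothetical stand-ins rather than the exact code in this PR:

```scala
// Illustrative sketch only: a metadata operation that, in Hive, would
// receive a HiveSession/HiveConf, rewritten to depend only on the small
// handle type it actually needs. All types below are hypothetical stubs.
case class SessionHandle(id: String)
case class Schema(fields: Seq[String])

abstract class MetadataOperation(val sessionHandle: SessionHandle) {
  // Each operation declares its fixed result schema instead of deriving
  // it through HiveConf-driven factories.
  protected def resultSchema: Schema
  def run(): Schema = resultSchema
}

// A GetCatalogs-style call: the JDBC driver only needs the TABLE_CAT column.
class GetCatalogsOperation(handle: SessionHandle)
    extends MetadataOperation(handle) {
  override protected def resultSchema: Schema = Schema(Seq("TABLE_CAT"))
}
```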

How was this patch tested?

existing UTs + manual tests

@mgaido91

cc @jerryshao @vanzin


codecov-io commented Sep 28, 2018

Codecov Report

Merging #117 into master will increase coverage by 41.36%.
The diff coverage is 97.01%.


@@              Coverage Diff              @@
##             master     #117       +/-   ##
=============================================
+ Coverage     27.02%   68.38%   +41.36%     
- Complexity      330      883      +553     
=============================================
  Files           100      100               
  Lines          5518     5583       +65     
  Branches        832      832               
=============================================
+ Hits           1491     3818     +2327     
+ Misses         3706     1227     -2479     
- Partials        321      538      +217
Impacted Files Coverage Δ Complexity Δ
...a/org/apache/livy/server/ThriftServerFactory.scala 0% <ø> (ø) 0 <0> (ø) ⬇️
...rver/src/main/scala/org/apache/livy/LivyConf.scala 95.85% <100%> (+3.6%) 21 <0> (+3) ⬆️
...main/scala/org/apache/livy/server/LivyServer.scala 35.51% <33.33%> (+0.35%) 9 <0> (ø) ⬇️
.../main/scala/org/apache/livy/server/WebServer.scala 53.33% <0%> (+1.66%) 10% <0%> (+1%) ⬆️
...la/org/apache/livy/utils/SparkProcessBuilder.scala 54.44% <0%> (+2.22%) 11% <0%> (+1%) ⬆️
...la/org/apache/livy/server/batch/BatchSession.scala 85.39% <0%> (+2.24%) 13% <0%> (ø) ⬇️
.../scala/org/apache/livy/sessions/SessionState.scala 61.11% <0%> (+2.77%) 2% <0%> (ø) ⬇️
.../apache/livy/server/batch/CreateBatchRequest.scala 65.62% <0%> (+3.12%) 18% <0%> (ø) ⬇️
... and 74 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 39fa887...eacb6e6.


@vanzin vanzin left a comment


I left a few comments, but most are things that are better looked at later. I think as a first step getting the Hive code in, with the minimal amount of changes necessary, is better.

I need a second look, but overall take another look at whether the things you're forking really need to be forked. Having a more verbose PR description explaining what's being forked would also be good.

if (opState.isTerminal) {
// Cancel should be a no-op
// Cancel should be a no-op in either cases

either case

schema.addToFieldSchemas(fieldSchema)
schema
}
val LOG_SCHEMA: Array[DataType] = Array(BasicDataType("string"))

Since you're touching this, indentation looks wrong.

}

def isAllowedToUse(user: String, session: InteractiveSession): Boolean = {
session.owner == user || accessManager.checkModifyPermissions(user)
}

override def stop(): Unit = {
info("Shutting down HiveServer2")

HiveServer2?

* This class is derived from Hive's one.
*/
private[auth] class TUGIAssumingTransportFactory(
val wrapped: TTransportFactory, val ugi: UserGroupInformation) extends TTransportFactory {

One arg per line.

/**
* CallbackHandler for SASL DIGEST-MD5 mechanism.
*/
// This code is pretty much completely based on Hadoop's SaslRpcServer.SaslDigestCallbackHandler -

Put this inside the javadoc comment?

}
} catch {
case eL: IllegalStateException =>
throw new SaslException ("Invalid message format", eL)

nit: no space before ( (also in a few other places here and in other classes)

* If the proxy user name is provided then check privileges to substitute the user.
*/
@throws[HiveSQLException]
private def getProxyUser(

This is another area where I think things should eventually be merged with the main server code, e.g. checking AccessManager.checkImpersonation instead of the custom configs + verifyProxyAccess used here.

Contributor Author

I agree. As before, I'd do this in a separate PR, in order to focus carefully on how to change these parts since they are sensitive.

* This class is ported from Hive. We cannot reuse Hive's one because we need to use
* `LivyCLIService`, `LivyConf` and `AuthFactory` instead of Hive's ones.
*/
class ThriftHttpCLIService(

One thing that might be nice is to just mount this under a namespace in the main Livy HTTP server. But not sure whether Hive clients support that (or require the server to be on "/").

Contributor Author

I am not sure about this. Here we have a dedicated thread pool which is separate from Livy's HTTP server. I am not sure it is feasible; we'd need to try.

* serviceUGI, which GSS-API will extract information from.
* In case of a SPNego request we use the httpUGI, for the authenticating service tickets.
*/
private def doKerberosAuth(request: HttpServletRequest): String = {

It would be nice at some point to check whether the hadoop-auth module (which provides a server filter, and is used by the main server) can replace this code.

Contributor Author

Mmh... IIRC there was no filter able to do exactly the same thing that is done here. We might have to adapt/extend one, and if we cannot just reuse it, I am not sure the switch is worth it.

* GetTypeInfoOperation.
*
*/
class GetTypeInfoOperation(sessionHandle: SessionHandle)

Any specific reason why you have to fork this class, and others in this package?

@mgaido91 mgaido91 Oct 2, 2018

Not sure about your question:

  • If the question is "what do we need them for?": they are used when answering metadata calls from the JDBC driver;
  • If the question is "can't we reuse Hive's ones?": no, because Hive's classes use HiveSession and HiveConf, and we need to get rid of those.

Let me know if this doesn't answer.


I missed the HiveConf dependency when looking at the Hive code. Well, sigh.


mgaido91 commented Oct 2, 2018

Thanks for the review @vanzin. I have updated the PR description; let me know if it still needs to be improved.

I think as a first step getting the Hive code in, with the minimal amount of changes necessary, is better.

I am not sure: the only difference from that was moving from Java to Scala, which saves many back-and-forth conversions of collections; keeping the Java code would have added unneeded code only to handle those conversions.

overall take another look at whether the things you're forking really need to be forked

I am pretty sure that nothing there is unneeded.

@mgaido91 mgaido91 changed the title [WIP][LIVY-502] Remove dependency on hive-exec [LIVY-502] Remove dependency on hive-exec Oct 8, 2018
@mgaido91

@jerryshao @vanzin any more comments on this?


@vanzin vanzin left a comment


I think this is mostly ok. I still think there's stuff to clean up here, but that can be done later. Main thing is getting rid of hive-exec and other sources of conflicts...

import org.apache.hive.service.cli.operation.Operation
import org.apache.hadoop.security.{SecurityUtil, UserGroupInformation}
import org.apache.hive.service.ServiceException
import org.apache.hive.service.cli.{FetchOrientation, FetchType, GetInfoType, GetInfoValue, HiveSQLException, OperationHandle, SessionHandle}

Just use a wildcard for long import lists...

// The following configs are the same present in Hive
val THRIFT_RESULTSET_DEFAULT_FETCH_SIZE =
Entry("livy.server.thrift.resultset.default.fetch.size", 1000)
val THRIFT_SPNEGO_PRINCIPAL = Entry("livy.server.thrift.authentication.spnego.principal", "")

Won't this be exactly the same as AUTH_KERBEROS_PRINCIPAL?

The spnego principal must be HTTP, so it's not like they can be different.

Which also means the keytab config will have to match, too.

Contributor Author

yes, I thought about that too but I was not sure, so I kept this. Let me remove it and replace it with AUTH_KERBEROS_PRINCIPAL then. Thanks.
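For context on the exchange above: since a SPNEGO principal is required to use the HTTP service name, a dedicated thrift key could not meaningfully differ from the server-wide one. A hypothetical livy.conf fragment illustrating this (key names assumed from Livy's Kerberos auth settings; verify against livy.conf.template for your version):

```
# Hypothetical example only. The SPNEGO principal must use the HTTP
# service name, so a separate thrift-specific principal/keytab pair
# would necessarily duplicate the server-wide one.
livy.server.auth.type = kerberos
livy.server.auth.kerberos.principal = HTTP/_HOST@EXAMPLE.COM
livy.server.auth.kerberos.keytab = /etc/security/keytabs/spnego.service.keytab
```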

}
addService(thriftCLIService)
super.init(livyConf)
Runtime.getRuntime.addShutdownHook(new Thread("LivyThriftServer Shutdown") {

Is this necessary? Seems like if needed, it's something that the main server class should handle (and then shut down all loaded plugins).

Contributor Author

Yes, I moved it to the main server class, thanks.

* The class is taken from Hive's `HadoopThriftAuthBridge.Server`. It bridges Thrift's SASL
* transports to Hadoop's SASL callback handlers and authentication classes.
*
* This class is based on Hive's one.

Redundant comment.

// TODO: add LDAP and PAM when supported
val isAuthOther = authTypeStr.equalsIgnoreCase(AuthTypes.NONE.getAuthName) ||
authTypeStr.equalsIgnoreCase(AuthTypes.CUSTOM.getAuthName)


too many empty lines

import org.apache.livy.thriftserver.SessionInfo
import org.apache.livy.thriftserver.auth.{AuthenticationProvider, AuthFactory}

class ThriftHttpServlet(

Comment about where this comes from? (Even if it's similar to the other classes above.)

import org.apache.livy.thriftserver.types.{BasicDataType, Field, Schema}

/**
* GetCatalogsOperation.

Not a super useful doc.

import org.apache.livy.thriftserver.types.{BasicDataType, Field, Schema}

/**
* GetTableTypesOperation.

Same.

caseSensitive: Boolean, searchable: Short, unsignedAttribute: Boolean, numPrecRadix: Option[Int])

/**
* GetTypeInfoOperation.

Same.


@mgaido91

Thanks for the review @vanzin! I addressed your comments.

I still think there's stuff to clean up here

Not sure what you mean here exactly, but I think that after this patch we may start working with the Hive community to make some things easier to reuse. E.g., we cannot reuse all the auth classes because they depend on HiveConf; we may propose to the Hive community to have them accept in the constructor the parameters they need instead of the HiveConf, so we can reuse them. This could significantly reduce the code size here.
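The idea of having Hive's auth classes accept the values they need instead of a HiveConf could look roughly like this hypothetical Scala sketch; the class and parameter names are illustrative, not an actual Hive or Livy API:

```scala
// Hypothetical sketch of the proposed refactoring. Today the Hive class
// is coupled to its configuration object, e.g.:
//   class AuthFactory(conf: HiveConf) { ... }
// If it instead took the few values it reads from that conf, both Hive
// and Livy could construct it from their own configuration types:
class AuthFactory(
    authType: String,
    kerberosPrincipal: Option[String],
    kerberosKeytab: Option[String]) {

  // Example of logic that no longer needs HiveConf lookups.
  def isKerberosEnabled: Boolean = authType.equalsIgnoreCase("KERBEROS")
}

// Livy could then build it from LivyConf-derived values, sharing one
// implementation instead of maintaining a fork.
```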

@mgaido91 mgaido91 closed this Nov 29, 2018
@mgaido91 mgaido91 reopened this Nov 29, 2018

vanzin commented Nov 30, 2018

Merging to master.
