Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-4323 Wrapped Get/ListHDFS hadoop operations in ugi.doAs calls #2360

Closed
wants to merge 4 commits into from

Conversation

jtstorck
Copy link
Contributor

NIFI-3472 Removed explicit relogin code from HDFS/Hive/HBase components and updated SecurityUtils.loginKerberos to use UGI.loginUserFromKeytab. This brings those components in line with daemon-process-style usage, made possible by NiFi's InstanceClassloader isolation. Relogin (on ticket expiry/connection failure) can now be properly handled by hadoop-client code implicitly.

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

@@ -51,7 +50,8 @@ public static synchronized UserGroupInformation loginKerberos(final Configuratio
Validate.notNull(keyTab);

UserGroupInformation.setConfiguration(config);
return UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal.trim(), keyTab.trim());
UserGroupInformation.loginUserFromKeytab(principal.trim(), keyTab.trim());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should document in bold language precisely why this change is so critical. That the method that was being used did not set the static loginContext in the UGI object meant that its utility once a relogin/expiry hits was nearly nil is critical and we should document this foreverever.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we should also probably in that comment explain why the ticket renewal threads to attempt to force explicit renewals could be problematic/increase chances of race conditions. Specifically the subject within the UGI could be loggedout by our explicit renewal attempts while at the same time a hadoop operation occurring could kick off the Hadoop client to relogin but the subject would have been cleared/in an unexpected state. The UGI class passes the subject to the underlying jdk kerb handling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added javadoc to several methods that are affected by this change. Will update the PR with these changes.

@joshelser
Copy link
Member

use UGI.loginUserFromKeytab. This brings those components in line with daemon-process-style usage, made possible by NiFi's InstanceClassloader isolation. Relogin (on ticket expiry/connection failure) can now be properly handled by hadoop-client code implicitly.

Interesting. I'm trying to understand how to state this problem outside the context of NiFi (because the fix still confounds me). It seems like what's being stated is: when UserGroupInformation.loginUserFromKeytabAndReturnUGI() is used, the implicit re-login code called inside of the HDFS client code (e.g. Client$Connection#handleSaslConnectionFailure()) does the wrong thing. Looking at this again:

if (UserGroupInformation.isLoginKeytabBased()) {
  UserGroupInformation.getLoginUser().reloginFromKeytab();
} else if (UserGroupInformation.isLoginTicketBased()) {
  UserGroupInformation.getLoginUser().reloginFromTicketCache();
}

Seems to paint a pretty clear picture as to why, when the loginUser isn't the one we're executing the call as, the relogin fails. Makes me wonder why HDFS isn't doing a getCurrentUser() instead of the getLoginUser()...

@joshelser
Copy link
Member

Corollary: can this be reproduced outside the context of NiFi now that a solution in NiFi was observed? Maybe that's a pending action item? :)

@joewitt
Copy link
Contributor

joewitt commented Dec 23, 2017

I think the primary observed scenario was that when using {code}UserGroupInformation.loginUserFromKeytabAndReturnUGI(){code} the UGI static member 'loginUser' was not being set. The 'user' object's 'login' was being set in the created UGI instance (non static). As a result explicit logins were working but implicit (hadoop initiated ones) appeared not be. We could get the lockup in that case to happen within 30 mins or so. Once we switched to using the loginFromKeytab/getCurrentUser the static member variable of UGI was set and the flow would work for hours then again lockup. Now, keep in mind here we had created a thread that ran every 60 seconds to do an explicit relogin and the hadoop client which would automatically do it when necessary. In tracing through that scenario what it appears was happening is a race condition on the 'subject' object whereby we'd do a relogin which clears out the subject (as a logout is called) and in that time we likely conducted a hadoop operation which triggered hadoop subject login but the context was wiped and so it ended up in the promptForName. That is the working theory.

The results so far are very promising.

One should be able to pretty easily replicate what NiFi is/was doing outside of NiFi in so far as interaction with HDFS.

@jomach
Copy link

jomach commented Jan 2, 2018

So The Fix is removing the Nifi Thread that renews the Kerberos Token ?
From what I noticed until now using the UserGroupInformation class is that if I call the method loginUserFromKeytab it will be set for the complete JVM, not allowing scenarios where Nifi needs to connect to two different Hadoop clusters with Kerberos. If we use loginUserFromKeytabAndReturnUGI it will be contained in that returned Object. I looked at the code verz quickly and it seems that we removed the Thread that takes care of renewing the token. Could we leverage something like: UserGroupInformation.checkTGTAndReloginFromKeytab or we could trigger UserGroupInformation.spawnAutoRenewalThreadForUserCreds using UserGroupInformation.loginUserFromSubject

NIFI-3472 NIFI-4350 Removed explicit relogin code from HDFS/Hive/HBase components and updated SecurityUtils.loginKerberos to use UGI.loginUserFromKeytab. This brings those components in line with daemon-process-style usage, made possible by NiFi's InstanceClassloader isolation.  Relogin (on ticket expiry/connection failure) can now be properly handled by hadoop-client code implicitly.
@jtstorck
Copy link
Contributor Author

jtstorck commented Jan 2, 2018

@jomach Yes, this PR will remove the explicit relogin attempts from Hadoop/HBase/Hive components. This will allow the hadoop libraries we use to handle relogin implicitly, since NiFi's explicit relogin attempts were creating race conditions in UGI and underlying classes, and not all of the code NiFi depends on or uses is thread-safe. We lessen this issue by using NiFi's instance-based classloading, which brings me to the next point.

Regarding UGI.loginUserFromKeytab, NiFi employs classloader isolation on a per-component basis. This means that each instance of a PutHDFS processor (for example) has its own classloader by which the hadoop libraries are loaded. Since the UGI instance maintains the state of the login configuration, authenticated Subject, etc, due to the classloader isolation, that state will be separate from instantiations of UGI in other components. Loosely speaking, they are considered different types since they were loaded by different classloaders, and their state will not be shared. This allows NiFi to use UGI.loginUserFromKeytab and the instances of components that use UGI can be considered "daemon processes".

UGI.spawnAutoRenewalThreadForUserCreds is only started (implicitly by the UGI class itself) if the login was done from the ticket cache, and NiFi explicitly wants to use the keytab during authentication, not the ticket cache. NiFi uses keytabs so that it can function in a multi-tenant environment. With kinit, only one principal would be able to be authenticated since it's done via an OS user, and we'd like to avoid that.

@jomach
Copy link

jomach commented Jan 2, 2018

@jtstorck

hadoop libraries we use to handle relogin implicitly

Can you give an Example which libraries handle this implicitly?
Regarding spawnAutoRenewalThreadForUserCreds, you are 100% correct I digged a little further and I was missing the !isKeytab on the if. This seems to be a general issue with Hadoop and not Nifi it self but we need to live with it.

@jtstorck
Copy link
Contributor Author

jtstorck commented Jan 2, 2018

@jomach Anything that uses org.apache.hadoop.ipc.Client should handle implicit relogin, through a method that handles connection failures: org.apache.hadoop.ipc.Client.handleSaslConnectionFailure, which resides in hadoop-common.

org.apache.hadoop.hdfs.DistributedFileSystem, in hadoop-hdfs, uses the Client to connect to HDFS. If any failure occurs when making the connection, the first thing it does is use UGI to relogin. With the knowledge of this, all the explicit NiFi relogin logic has been removed in this PR.

@jomach
Copy link

jomach commented Jan 2, 2018

Yes, the rpc clients renew the tickets. I agreed with this change now. Hope that others can understand that better now.

thanks @jtstorck

…reads and usage of UGI.loginUserFromKeytab

Readded Relogin Period property to AbstractHadoopProcessor, and updated its documentation to indicate that it is now a deprecated property
Additional cleanup of code that referenced relogin periods
Marked KerberosTicketRenewer is deprecated
@risdenk
Copy link

risdenk commented Jan 3, 2018

Change looks very promising. I'm a huge +1 for removing the explicit relogin thread and corresponding relogin interval setting. The wrapping in doas also matches the doas code that Elasticsearch uses the the HDFS snapshot repository.

For the -Djavax.security.auth.useSubjectCredsOnly=true recommendation, is this just to ensure that this is never set to false? I thought the default was true anyway. Is there a known case where this is set to false by default?

@joewitt
Copy link
Contributor

joewitt commented Jan 3, 2018

Yeah the docs suggest it will be true by default but in the environment(s) we were running against it was consistently being set to false which opened us up to this risk. By explicitly setting it to true for NiFi we're correctly asserting that under no circumstance should the NiFi app ever be asked to go grab user supplied input for Kerberos logins.

@joewitt
Copy link
Contributor

joewitt commented Jan 3, 2018

You can pretty easily verify it in your environment by running

jinfo

If you see that property set and equal to false then there is a good chance this will correct it.

I ran that against a live NiFi JVM PID and it came back with that property set and false. Now, with this PR, it comes back set and true and the prompt for name code path never executes.

@asfgit asfgit closed this in 42a1ee0 Jan 3, 2018
@jtstorck jtstorck deleted the NIFI-4323 branch January 3, 2018 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants