NIFI-997: Periodically Renew Kerberos Tickets #97

Closed
rickysaltzer wants to merge 3 commits into apache:master from rickysaltzer:kerberos-renewal

Conversation

@rickysaltzer
Contributor

Adding a patch to renew the ticket every 4 hours to avoid inactive Kerberos tickets. This was an issue found when running Kerberos-enabled Hadoop processors for a long period of time. This should technically have been handled by the Hadoop library, but for unknown reasons the renewal thread inside Hadoop doesn't seem to be doing it.

This patch is fairly simple, and applies to all Hadoop processors since it's implemented on AbstractHadoopProcessor. The Kerberos ticket age is checked against a threshold (4 hours is a safe bet) when getFileSystem() is called. If the age exceeds the threshold, we re-login using the UserGroupInformation class before passing back the filesystem.
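A minimal sketch of the age check described above (class and field names here are illustrative, not the literal patch; the real code lives in AbstractHadoopProcessor and calls the UserGroupInformation re-login APIs when the check trips):

```java
// Sketch only: track when we last logged in, and flag when the ticket
// age crosses the renewal threshold so getFileSystem() can re-login.
public class TicketRenewalSketch {
    static final long TICKET_RENEWAL_THRESHOLD_MS = 4 * 60 * 60 * 1000L; // 4 hours

    private final long lastLoginTimeMs;

    TicketRenewalSketch(long lastLoginTimeMs) {
        this.lastLoginTimeMs = lastLoginTimeMs;
    }

    // Called from getFileSystem(): should we re-login before handing back the FS?
    boolean isTicketOld(long nowMs) {
        return (nowMs - lastLoginTimeMs) >= TICKET_RENEWAL_THRESHOLD_MS;
    }
}
```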

- Renew ticket every 4 hours to avoid inactive Kerberos tickets.
@trkurc
Contributor

trkurc commented Oct 20, 2015

I've started digging into this. Testing is the major challenge.

@trkurc
Contributor

trkurc commented Oct 23, 2015

The good news: it seems to work when I tested it.

One comment on style:
Having a tuple in a tuple (for what I think is effectively a triple) for hdfsResources is a little hard to read. Has it gotten big enough to be its own class with getters, for readability?

I really had to think about what this was doing, and whether the first getValue() could return null:

hdfsResources.get().getValue().getValue()

This is very evident here, where you created a method to read the UserGroupInformation and then replicated the work of that method on the line below:

    if (getUserGroupInformation() != null && isTicketOld()) {
        renewKerberosTicket(hdfsResources.get().getValue().getValue());
    }
...

    protected UserGroupInformation getUserGroupInformation() {
        return hdfsResources.get().getValue().getValue();
    }
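The suggestion above, roughly: replace the nested tuples with a small holder class. (Field types are Object stand-ins here to keep the sketch self-contained; in NiFi they would be Configuration, FileSystem, and UserGroupInformation.)

```java
// Holder class replacing what reads as a tuple-in-a-tuple today.
// Object stand-ins substitute for the real Hadoop types.
public class HdfsResourcesSketch {

    static class HdfsResources {
        private final Object configuration;
        private final Object fileSystem;
        private final Object userGroupInformation;

        HdfsResources(Object configuration, Object fileSystem, Object userGroupInformation) {
            this.configuration = configuration;
            this.fileSystem = fileSystem;
            this.userGroupInformation = userGroupInformation;
        }

        Object getConfiguration() { return configuration; }
        Object getFileSystem() { return fileSystem; }
        Object getUserGroupInformation() { return userGroupInformation; }
    }
}
```

With this, `hdfsResources.get().getValue().getValue()` becomes `hdfsResources.get().getUserGroupInformation()`, which reads at a glance.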

Other comments:

I was trying to reason about your error handling if ugi.checkTGTAndReloginFromKeytab threw an IOException. If this were greenfield code, I'd say the behavior I'd expect is for getFileSystem to throw an IOException, but that would change the method signature and all the processors that use it (a breaking API change). Swallowing the exception and then exponentially backing off on a retry almost seems more logical, but I think it is worth discussing.
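If the swallow-and-retry route were taken, the backoff itself is simple; a sketch (the doubling factor and the cap are assumptions on my part, not anything in the patch):

```java
// Exponential backoff helper: double the retry delay after each failed
// re-login attempt, up to an assumed cap.
public class ReloginBackoff {
    static final long INITIAL_DELAY_MS = 1_000L;
    static final long MAX_DELAY_MS = 10 * 60 * 1_000L; // assumed 10 minute cap

    static long nextDelayMs(long currentDelayMs) {
        return Math.min(currentDelayMs * 2, MAX_DELAY_MS);
    }
}
```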

Is there a reason you divide the millisecond times by 1000 for comparison with seconds, instead of multiplying the seconds by 1000? You're losing precision on the timestamp.
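Concretely, the precision-preserving direction is to scale the seconds threshold up and compare in milliseconds (a sketch, not the patch code):

```java
// Multiply the seconds threshold up to milliseconds and compare there,
// so the millisecond timestamps are used at full precision.
public class PrecisionSketch {
    static boolean exceedsThreshold(long elapsedMs, long thresholdSec) {
        return elapsedMs >= thresholdSec * 1000L;
    }
}
```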

Did you think about trying to get the retry interval be a property instead of being hardcoded?

@rickysaltzer
Contributor Author

I agree with you on the tuple within a tuple; it is pretty confusing. I'll work on moving these into their own class. You're right, error handling is a bit tricky in this area of the code if we want to avoid breaking API changes. I did the division by 1000 since the threshold is in seconds, but I can change that. Precision isn't a big deal here, since the renewal isn't millisecond- (or even second-) sensitive.

@rickysaltzer
Contributor Author

@trkurc would you mind testing this latest commit? It makes the renewal period configurable now.

@rickysaltzer
Contributor Author

I will squash these changes before we commit. Keeping a revision history in this pull request for clarity.

@trkurc
Contributor

trkurc commented Oct 23, 2015

@rickysaltzer - I don't have my kerberos hadoop running atm, but here are some comments:

This variable should be camelCase, rather than ALL_CAPS.

private long TICKET_RENEWAL_THRESHOLD_SEC;

I think hdfsResources.set(null); should probably be hdfsResources.set(new HdfsResources(null, null, null)); ... Walking through the code by hand, won't getUserGroupInformation() in getFileSystem() NPE when you're not using Kerberos?
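The NPE concern, sketched with Object stand-ins for the Hadoop types: if the AtomicReference holds null, the unconditional dereference in getUserGroupInformation() blows up, whereas a never-null holder makes it safe even in the non-Kerberos case.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the suggested fix: reset to a holder full of nulls instead
// of a null holder, so getUserGroupInformation() can always dereference.
public class NullSafeResetSketch {
    static class HdfsResources {
        final Object configuration, fileSystem, userGroupInformation;
        HdfsResources(Object configuration, Object fileSystem, Object userGroupInformation) {
            this.configuration = configuration;
            this.fileSystem = fileSystem;
            this.userGroupInformation = userGroupInformation;
        }
    }

    final AtomicReference<HdfsResources> hdfsResources =
            new AtomicReference<>(new HdfsResources(null, null, null));

    Object getUserGroupInformation() {
        // Safe: the holder itself is never null, even without Kerberos.
        return hdfsResources.get().userGroupInformation;
    }
}
```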

Will send more when I have a chance to build and run.

@rickysaltzer
Contributor Author

Good catch, I'll fix that.

@rickysaltzer
Contributor Author

I don't think we would have NPE'd because of the order that getFileSystem() ends up being called, but it's good to take care of anyway just in case.

@trkurc
Contributor

trkurc commented Oct 24, 2015

My Kerberos Hadoop VM has gone south. I've got to rebuild it.

@trkurc
Contributor

trkurc commented Oct 24, 2015

Well, I think I discovered an (unrelated) bug in the hdfs PutFile processor: it will pass a flow file on to success even if the write fails due to SASL problems.

https://issues.apache.org/jira/browse/NIFI-1062

@trkurc
Contributor

trkurc commented Oct 24, 2015

@rickysaltzer one thing I missed is that the HDFSResources class should be static

@trkurc
Contributor

trkurc commented Oct 25, 2015

It looks like, due to a bug in the Hadoop 2.6.0 client code (on a Java 8 JVM only?), your code doesn't try to reload the ticket.

What checkTGTAndReloginFromKeytab() does:

  public synchronized void checkTGTAndReloginFromKeytab() throws IOException {
    if (!isSecurityEnabled()
        || user.getAuthenticationMethod() != AuthenticationMethod.KERBEROS
        || !isKeytab)
      return;

Here is how isKeytab is set in 2.6.0:
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L597

This returns false for me, always!

Here is what 2.6.1 does:
https://github.com/apache/hadoop/blob/e286512a7143427f2975ec92cdc4fad0a093a456/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L623

I'm going to try a java 7 vm.

@trkurc
Contributor

trkurc commented Oct 25, 2015

And it totally works in java 7.

@trkurc
Contributor

trkurc commented Oct 25, 2015

Also of note: Hadoop 2.6 really doesn't like it when max_life < 10m in your kdc.conf (2.7 looks like it may be better).
https://github.com/apache/hadoop/blob/release-2.6.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L1154

@asfgit asfgit closed this in f2c4f2d Oct 25, 2015
JPercivall pushed a commit to JPercivall/nifi that referenced this pull request Apr 23, 2018
This closes apache#97.

Signed-off-by: Aldrin Piri <aldrin@apache.org>