-
Notifications
You must be signed in to change notification settings - Fork 8.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HADOOP-18167. Add metrics to track delegation token secret manager op… #4092
Conversation
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick scan without looking into the the details of IOStatistics.
if (metrics != null) { | ||
metrics.addStoreToken(Time.monotonicNow() - start); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A metric will still be recorded when storeToken fails. Is this the same pattern used for other operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can help us correlate how changes in latency affect the whole operation. For example, if calling storeToken times out after x seconds, we should be able to recognize it by looking at this metric. That's also the pattern that I see with RpcMetrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be happier if the addStoreToken was only updated when it was successful. We should track failures separately with a counter. (I don't think we need to track the operation that had the exception, just track the count of exceptions that happened getting delegation tokens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made changes to only increment the storeToken, updateToken and removeToken metrics when the operation was successful. I also added a new counter "tokenFailure" that is updated when the operation is not successful.
long start = Time.monotonicNow(); | ||
try { | ||
updateToken(id, info); | ||
} finally { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as above, emitting metrics when there is a failure.
final static String UPDATE_TOKEN_STAT = "updateToken"; | ||
final static String REMOVE_TOKEN_STAT = "removeToken"; | ||
|
||
final MetricsRegistry registry = new MetricsRegistry("DelegationTokenSecretManagerMetrics"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This registry is unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mainly adding it now to reduce the risk of someone changing the registry name. If we don't add this, metrics will be registered with the class name ("DelegationTokenSecretManagerMetrics"). However, if someone decides to add the registry in the future, they could choose a different name and break something. I prefer to just have it available if someone ever needs it. I've seen the same pattern in other metrics classes (ClientSCMMetrics, SharedCacheUploaderMetrics,etc), but I'm open to change this if you see a bigger problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. A minor change would be to add "LOG.debug("Initialized {}", registry);" like in ClientSCMMetrics just so the IDE and potential our style check steps in testing don't fail.
The creation of the registry can also be part of the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good. As I mentioned, please move the metrics updates when the operation is successful. Also please add a counter for the number of exceptions.
if (metrics != null) { | ||
metrics.addStoreToken(Time.monotonicNow() - start); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be happier if the addStoreToken was only updated when it was successful. We should track failures separately with a counter. (I don't think we need to track the operation that had the exception, just track the count of exceptions that happened getting delegation tokens.
🎊 +1 overall
This message was automatically generated. |
} | ||
|
||
public void addStoreToken(boolean success, long value) { | ||
if (success) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than have three methods add the success with separate code paths for true/false. It would be nicer to have a single addFailure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added method addTokenFailure so we can call it separately
} catch (IOException ioe) { | ||
LOG.error("Could not store token " + formatTokenId(identifier) + "!!", | ||
ioe); | ||
} finally { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with try finally, is that if the code throws an exception after an exception in the body, the original exception is lost, which makes it very hard to debug.
Please add the positive call into the body and the failure count in the ioexception handler. It will mean that we don't count the runtime exceptions, but I think that is ok in this context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've move the call to track metrics in the body and the exception handler. If the code in the exception handler throws another exception we would still lose the original exception though.
🎊 +1 overall
This message was automatically generated. |
@@ -50,6 +59,9 @@ | |||
import org.slf4j.Logger; | |||
import org.slf4j.LoggerFactory; | |||
|
|||
import static org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to expand this or to use the full name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've expanded this
LOG.debug("Initialized {}", registry); | ||
} | ||
|
||
public void addStoreToken(long value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addTimeStoreToken()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And probably time and the unit instead of plain "value"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the argument final too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I've updated method names, argument name and made it final
ioStatistics.addTimedOperation(STORE_TOKEN_STAT, value); | ||
} | ||
|
||
public void addUpdateToken(long value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addTimeUpdateToken() and same for the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
public void testDelegationTokenSecretManagerMetrics() throws Exception { | ||
TestDelegationTokenSecretManager dtSecretManager = | ||
new TestDelegationTokenSecretManager(24*60*60*1000, | ||
10*1000,1*1000,3600000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spaces after the commas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extend the 3600000 too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
|
||
try { | ||
dtSecretManager.cancelToken(token, "JobTracker"); | ||
Assert.fail("Expected exception"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use LamdaTestUtils#intercept
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very helpful. Thanks!
} | ||
|
||
public DelegationTokenSecretManagerMetrics() { | ||
ioStatistics = iostatisticsStore() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this specific case, I would skip the static import and just do:
IOStatisticsBinding.iostatisticsStore()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed static import
generateDelegationToken(dtSecretManager, "SomeUser", "JobTracker"); | ||
Assert.assertEquals(1, dtSecretManager.metrics.tokenFailure.value()); | ||
|
||
LambdaTestUtils.intercept(Exception.class, () -> dtSecretManager.renewToken(token, "JobTracker")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to be a little more specific than just Exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, if on failure that renew call sleeps for a few hundred millis, you could assert that the duration of the failure was measured/counted. IOStatisticAssertions helps there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this to expect IOException, since that's the exception we throw in TestFailureDelegationTokenSecretManager. We're currently not tracking duration of failures per a previous comment by @omalley
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good; i've made some suggestions about how to use IOStatisticsBinding to wrap operations with duration tracking of both success and failure...sometimes it's good to differentiate the two
@@ -96,6 +108,10 @@ private String formatTokenId(TokenIdent id) { | |||
* Access to currentKey is protected by this object lock | |||
*/ | |||
private DelegationKey currentKey; | |||
/** | |||
* Metrics to track token management operations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add a trailing .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -430,10 +447,13 @@ protected synchronized byte[] createPassword(TokenIdent identifier) { | |||
DelegationTokenInformation tokenInfo = new DelegationTokenInformation(now | |||
+ tokenRenewInterval, password, getTrackingIdIfEnabled(identifier)); | |||
try { | |||
long start = Time.monotonicNow(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IOStatisticsStore
is a DurationTrackerFactory
; you can use it in IOStatisticsBinding invocations to have it track min/mean/max duration of ops. distinguishing success and failure times
* and publishes them through the metrics interfaces. | ||
*/ | ||
@Metrics(about="Delegation token secret manager metrics", context="token") | ||
static class DelegationTokenSecretManagerMetrics implements IOStatisticsSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a DurationTrackerFactory and you can supply duration trackers to IOStatisticsBinding; note also there's a PairedDurationTrackerFactory
which you could wire up the iostats store. maybe this would be the time/place to add a DurationTrackerFactory for updating hadoop metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the DurationTrackerFactory changes on a separate PR? We're might not need to track failure durations here
try { | ||
dtSecretManager.startThreads(); | ||
|
||
Assert.assertEquals(0, dtSecretManager.metrics.storeToken.lastStat().numSamples()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is invoked enough it is worth factoring into its own assertion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored to group the assertions and the invocation
generateDelegationToken(dtSecretManager, "SomeUser", "JobTracker"); | ||
Assert.assertEquals(1, dtSecretManager.metrics.tokenFailure.value()); | ||
|
||
LambdaTestUtils.intercept(Exception.class, () -> dtSecretManager.renewToken(token, "JobTracker")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, if on failure that renew call sleeps for a few hundred millis, you could assert that the duration of the failure was measured/counted. IOStatisticAssertions helps there
🎊 +1 overall
This message was automatically generated. |
*/ | ||
@Metrics(about="Delegation token secret manager metrics", context="token") | ||
static class DelegationTokenSecretManagerMetrics implements IOStatisticsSource { | ||
private static final Logger LOG = LoggerFactory.getLogger(DelegationTokenSecretManagerMetrics.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is giving check style issues from Yetus.
final IOStatisticsStore ioStatistics; | ||
|
||
@Metric("Rate of storage of delegation tokens and latency (milliseconds)") | ||
MutableRate storeToken; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is giving warnings in checkstyle but I believe is needed right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made this private and added accessor methods
...op-common/src/test/java/org/apache/hadoop/security/token/delegation/TestDelegationToken.java
Outdated
Show resolved
Hide resolved
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
Assert.assertEquals(expectedCount - 1, metric.lastStat().numSamples()); | ||
Assert.assertEquals(expectedCount - 1, stat.getSamples()); | ||
T returnedObject = callable.call(); | ||
Assert.assertEquals(expectedCount, metric.lastStat().numSamples()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are doing a static import of assertEquals, so no need to do Assert.assertEquals() and assertEquals() is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated to use assertEquals() only
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM
#4092) * HADOOP-18167. Add metrics to track delegation token secret manager operations
@hchaverri @omalley I found that your submission in HADOOP-18167 may cause an online bug in Yarn. The code you submitted will cause the DelegationTokenSecretManagerMetrics to be initialized twice in Yarn ResourceManager and throw an exception. Please check YARN-11123 |
Thanks @slfan1989 . There was another report earlier and created HADOOP-18222. A fix is being reviewed on #4266 |
apache#4092) * HADOOP-18167. Add metrics to track delegation token secret manager operations
apache#4092) * HADOOP-18167. Add metrics to track delegation token secret manager operations
…erations
Description of PR
How was this patch tested?
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?