[SPARK-33440][CORE] Use current timestamp with warning log in HadoopFSDelegationTokenProvider when the issue date for token is not set up properly #30366
Conversation
I dug into the test code to add a test, but I haven't found any existing tests on the calculation. I'd need to mock HadoopFSDelegationTokenProvider in some way (mocking, subclassing), which also requires changing some methods from …
cc. @vanzin @squito @tgravescs @jerryshao Appreciate the review. Thanks in advance!
Oh, also cc. @gaborgsomogyi as he worked on this part before.
Another possible improvement (though I'm not 100% sure about the risk) would be considering … This improvement would also help address another possible issue where the problematic token identifier is the only one in use. Spark will correctly calculate the next renewal date the first time (the minimum interval is effectively the expiration date, so it's correct after adding 0L), but Spark also caches the interval, which is "unchanged", and adding 0L to the cached interval won't change the next renewal date; hence it eventually falls earlier than the current timestamp, triggering the same problem. Please review this additional possible improvement as well. I'll incorporate it once reviewers double-check the idea and say it's OK to go.
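To make the cached-interval concern above concrete, here is a minimal sketch in plain Scala (hypothetical helper and made-up numbers, not Spark's actual code): once the interval is cached, adding it to a 0L issue date keeps producing a date near the 1970 epoch, which is always behind the current time.

```scala
// Sketch: why a cached renewal interval combined with an unset (0L) issue date
// never catches up to the current time. Simplified, illustrative only.
object CachedIntervalSketch {
  // Mirrors the shape of `issueDate + interval` in the provider (simplified).
  def nextRenewalDate(issueDate: Long, cachedInterval: Long): Long =
    issueDate + cachedInterval

  def main(args: Array[String]): Unit = {
    val cachedInterval = 86400048L // ~1 day, cached on the first calculation
    val broken = nextRenewalDate(0L, cachedInterval)
    // ~1 day after the 1970 epoch: always earlier than `now`, so the renewal
    // fires immediately, over and over.
    println(s"next renewal: $broken, in the past: ${broken < System.currentTimeMillis()}")
  }
}
```

Because the cached interval is never recomputed, the result stays constant while the wall clock moves on, which is what keeps the renewal date permanently in the past.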
Btw, I guess the token renewal interval is at least hours in practice. If I'm right, it doesn't sound like a crazy idea to always call token.renew() to get the "correct" next renewal dates and pick the minimum value. This would eliminate almost all the safeguards and simplify the logic greatly, with the trade-off that Spark has to make these calls on each token update. Does this make sense?
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #131036 has finished for PR 30366 at commit …
@HeartSaVioR thanks for the detailed description and analysis, that is very helpful.
Thanks, @HeartSaVioR . BTW, it would be great if we can have a test case with mocking.
```scala
        val nextRenewalDateForToken = issueDate + interval
        logDebug(s"Next renewal date is $nextRenewalDateForToken for token ${kind.toString}")
        nextRenewalDateForToken
      }.filterNot(_ < currentTime)
```
Can we hit the situation where `_ < currentTime` holds for all tokens? Do we need to distinguish this case from the case where `tokenRenewalIntervals` is empty?
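A sketch of the situation this question points at (a hypothetical simplification, not the provider's actual code): if every computed renewal date is behind `currentTime`, the `filterNot` leaves an empty collection, which the caller must distinguish from an empty `tokenRenewalIntervals` before taking a minimum.

```scala
// Sketch: filtering out stale renewal dates can leave nothing to schedule on.
object RenewalDateFilterSketch {
  // Drops renewal dates already in the past; returns None when *all* dates
  // were filtered out, so the caller can handle that case explicitly.
  def validRenewalDates(dates: Seq[Long], currentTime: Long): Option[Long] =
    dates.filterNot(_ < currentTime) match {
      case Seq() => None          // every token had a stale renewal date
      case valid => Some(valid.min)
    }
}
```

For example, with a single token whose computed renewal date is `86400048L` and a present-day `currentTime`, the result is `None` rather than a minimum in the past.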
```diff
@@ -230,6 +230,8 @@ private[spark] class HadoopDelegationTokenManager(
     val now = System.currentTimeMillis
     val ratio = sparkConf.get(CREDENTIALS_RENEWAL_INTERVAL_RATIO)
     val delay = (ratio * (nextRenewal - now)).toLong
+    logInfo(s"Calculated delay on renewal is $delay, based on next renewal $nextRenewal " +
+      s"and the ratio $ratio")
```
Although the log message has a timestamp, it would be good to include the `now` value alongside `nextRenewal` in this log line, because the two are subtracted and the result can be negative.
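To illustrate that point with made-up numbers (a hypothetical helper mirroring the shape of the delay computation, not the actual code): when `nextRenewal` is behind `now`, the subtraction goes negative, so logging `now` next to `nextRenewal` lets a reader explain the sign of the delay from the log line alone.

```scala
// Sketch: delay = ratio * (nextRenewal - now); negative when nextRenewal < now.
object RenewalDelaySketch {
  def delay(nextRenewal: Long, now: Long, ratio: Double): Long =
    (ratio * (nextRenewal - now)).toLong
}
```

With `nextRenewal = 86400048L` (the broken value from the PR description) and a present-day `now`, the delay is a large negative number.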
Thanks for the comment.
Yes, that's configurable, but I agree it only happens on a timescale of hours. The problematic token also produced an expiration time 7 days out, even longer than the Hadoop delegation token's. The problem is simply that Spark believes a token identifier will have a valid issue date, whereas that's not "guaranteed" for every implementation. Once the precondition is broken, the calculation goes completely off the rails. There is a safeguard, but the safeguard also makes things worse. That's why I had to change the approach. Before reaching this approach, I had fixed the issue by simply discarding an invalid next renewal date (one earlier than now), but that really felt like just adding a workaround. (That workaround also can't handle the new case I mentioned in the comment above.) And given we're here, I'd like to collect some ideas to make this "concrete" while tolerating the known cases of broken assumptions.
@dongjoon-hyun In the meanwhile, I'll try to look into how to add the test.
I just changed the PR to draft, as I want to make sure we consider a better change (than the PR's current approach) that accounts for my comments above. Reviewers, please treat the code change as a reference only; go through the comments below and share your opinion. Thanks in advance!
I see another problematic case where there are two token identifiers (say, tokens A and B), where token B's identifier provides an issue date of 0L (same as the case in the PR description) but token B requires a shorter renewal period than token A. In this case, we can't simply ignore the renewal period calculation for token B. I'm also in doubt about caching the renewal interval. Suppose the same case, but token A has an intermittent issue and ends up throwing an exception when token.renew() is called for it. This is "swallowed" without any log message, and the renewal interval for token A is never reconsidered.
AFAIU the problem is that a valid issue date is not filled in. I think this is a bug, because the whole concept is based on token provider voting. All providers give back the time at which they require renewal, and the security manager takes the lowest one. This means a couple of providers obtain tokens a little bit earlier, but I don't think this should change. Just a question: without a valid issue date, how should the security manager know when exactly the token needs to be renewed?
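The voting concept described above can be sketched like this (hypothetical names; the real manager aggregates per-provider renewal times similarly in spirit, not with this exact code):

```scala
// Sketch: each provider "votes" the time its token needs renewal; the manager
// schedules for the lowest (earliest) vote, so some tokens are simply
// re-obtained a bit earlier than strictly necessary.
object ProviderVotingSketch {
  def earliestRenewal(votes: Seq[Long]): Option[Long] =
    if (votes.isEmpty) None else Some(votes.min)
}
```

This is also why one bad vote (a renewal date computed from a 0L issue date) can drag the whole schedule toward the past: the minimum always wins.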
Looks like Spark requires two preconditions to make renewal scheduling work correctly: …
This is from the case where only 1) is fulfilled. In that case, we can still schedule based on the "absolute" renewal date from …
I've had a deeper look, and the title was a little bit misleading to me. I thought we intended to change the framework itself, but it seems to affect only the HadoopFS provider. This makes the situation less invasive, because the whole logic stays the same and only some provider internals need to change. Though I think it would be … I think not filling in such fields is a bug in the client itself, but I feel the pain from the user's perspective.
Ah yes, you're right. I should have clarified the boundary of the change: only the HadoopFS provider is affected. Let me add that information to the PR title. Regarding considering … Btw, we're also opening up the possibility of fixing this on the Hadoop side, as the problematic identifier was from KNOX and it inherited the abstract class in hadoop-aws. Once we fix it, we could just keep the precondition that the issue date should be valid. I'll try to deal with the Hadoop side first and revisit afterwards.
FYI, fixing the S3 identifier's invalid issue date is tracked via https://issues.apache.org/jira/browse/HADOOP-17379. The question remains: would we still want to make the code resilient to the bad token identifier, given there are existing Hadoop versions which contain the bug? If the answer is yes, we could weigh the options we enumerated. IMHO, if the answer is no, we can close this PR and mark the issue as "Won't track here" or similar.
Good to hear the bug will be fixed on Hadoop side.
My personal opinion is yes and no. I don't think we should solve every problem token providers generate, especially because the change can be either risky (using …
Sorry for my delay, I got busy with other things. Yeah, fixing the Hadoop side is great. It's been a long time since I looked at tokens, so I tried to look at the Hadoop code to refresh my memory. I would say it doesn't hurt for us to protect against an issue date of 0 by using now and then logging a warning that we are doing so. That is what I thought the Hadoop secret managers did with it on renewal. I'm assuming the issueDate never changed from 0? Meaning it started at 0 and even after renewal it stayed 0, right? Do you know what class was renewing?
Could you point to some Hadoop code so I can understand your point?
Do we re-obtain these tokens, or renew the already existing ones as well?
According to the author's opinion, I dismissed my review comments. We can add a test case later if we need.
Could you review this, @mridulm and @tgravescs ?
AFAIK we don't have any guard around that. If the result of … We finally calculate the schedule interval for the next renewal via …
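A sketch of the guard being discussed (a simplified, hypothetical helper, not the actual Spark code): clamping a negative delay to 0 is exactly what turns a bad next-renewal date into an immediate, and therefore continuous, renewal.

```scala
// Sketch: disallowing negative delays by clamping to 0 means "schedule
// immediately", so a persistently-negative delay becomes a tight renewal loop.
object ScheduleGuardSketch {
  def scheduleDelay(delay: Long): Long = math.max(0L, delay)
}
```

A positive delay passes through unchanged; any negative delay collapses to 0, and since the underlying calculation is unchanged on the next round, the delay stays negative and the loop never breaks.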
Test build #131375 has finished for PR 30366 at commit …
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test status failure
Test build #131376 has finished for PR 30366 at commit …
cc. @tgravescs @mridulm Friendly reminder.
LGTM from my side too. Adding a test would be hellishly complicated (if possible at all).
I think you should also log when the issueDate is greater than the time on the local machine. As discussed, fixing it up is hard, but at least printing "clock mismatch" is a useful thing to see in the logs when you hit problems.
OK, I think that's a good addition. Updated. @tgravescs I'm sorry, but could you take a quick look again? Thanks in advance.
Thanks; hopefully we won't ever see this in the logs, but at least when it causes renewal problems there'll be something in the logs to point the blame at clock settings in whatever issued the token.
Test build #131836 has finished for PR 30366 at commit …
@tgravescs Friendly reminder.
Test build #131945 has finished for PR 30366 at commit …
Adding the log message looks fine.
…SDelegationTokenProvider when the issue date for token is not set up properly

### What changes were proposed in this pull request?

This PR proposes to use the current timestamp, with a warning log, when the issue date for a token is not set up properly. The next section explains the rationale in detail.

### Why are the changes needed?

Unfortunately, not every implementation respects the `issue date` in `AbstractDelegationTokenIdentifier`, which Spark relies on in its calculations. The default value of the issue date is 0L, which is far from the actual issue date; this breaks the calculation of the next renewal date under some circumstances, leading to a 0 interval (immediate) when rescheduling token renewal.

In HadoopFSDelegationTokenProvider, Spark calculates the token renewal interval as below:

https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala#L123-L134

The interval is calculated as `token.renew() - identifier.getIssueDate`, which yields the correct interval assuming both `token.renew()` and `identifier.getIssueDate` produce correct values, but it goes wrong when `identifier.getIssueDate` provides 0L (the default value), like below:

```
20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 1603175657000 for token S3ADelegationToken/IDBroker
20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 86400048 for token HDFS_DELEGATION_TOKEN
```

We pick the minimum value as a safety guard (so in this case, `86400048` is picked), but here the safety guard has an unintended bad impact.

https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala#L58-L71

Spark takes the interval calculated above (the "minimum" of the intervals) and blindly adds it to each token's issue date to calculate the next renewal date for that token, picking the "minimum" value again. In the problematic case, the value would be `86400048` (86400048 + 0), which is far smaller than the current timestamp.

https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala#L228-L234

The current timestamp is then subtracted from the next renewal date to get the interval, which is multiplied by the configured ratio to produce the final schedule interval. In the problematic case, this value goes negative.

https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala#L180-L188

There's a safety guard to disallow negative values, but it simply clamps to 0, meaning "schedule immediately". This triggers the next calculation of the next renewal date and schedule interval, which produces the same result, hence updating the delegation token immediately and continuously.

As we fetch the token just before the calculation happens, the actual issue date is likely only slightly earlier, so it's not that dangerous to use the current timestamp as the issue date for a token whose issue date has not been set up properly. Still, it's better not to leave the token implementation as it is, so we log a warning message to let end users consult with the token implementer.

### Does this PR introduce _any_ user-facing change?

Yes. End users won't encounter the tight loop of token renewal scheduling after this PR. From end users' perspective, there's nothing they need to change.

### How was this patch tested?

Manually tested with the problematic environment.

Closes #30366 from HeartSaVioR/SPARK-33440.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(cherry picked from commit f5d2165)
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
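The behavior described in the commit message can be summarized with a sketch like the following (simplified: `Identifier`, `effectiveIssueDate`, and the println-based logging are stand-ins for illustration, not the actual Spark classes or methods): keep a sane issue date, warn on an apparent clock mismatch, and fall back to the current timestamp, with a warning, when the issue date was never set.

```scala
// Sketch of the fallback logic: use the identifier's issue date when it looks
// sane; otherwise warn and fall back to `now` so the interval math stays valid.
object IssueDateFallbackSketch {
  // Stand-in for the token identifier's issue date field.
  final case class Identifier(issueDate: Long)

  def effectiveIssueDate(id: Identifier, now: Long): Long =
    if (id.issueDate > now) {
      // Clock mismatch in whatever issued the token: warn, but keep the date.
      println(s"WARN: issue date ${id.issueDate} is later than $now; possible clock mismatch")
      id.issueDate
    } else if (id.issueDate <= 0) {
      // Issue date never set (default 0L): fall back to the current timestamp.
      println(s"WARN: issue date not set up properly; using current timestamp $now")
      now
    } else {
      id.issueDate
    }
}
```

With this shape, a 0L issue date yields `now` (plus a warning) instead of poisoning the renewal-interval calculation, while valid and future-dated issue dates pass through unchanged.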
Thanks all for reviewing! Merged into master and branch-3.0.