[FLINK-12472][yarn] Support setting attemptFailuresValidityInterval o… #8400
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request.
Automated Checks
Last check on commit 5517d47 (Sat Aug 28 11:20:32 UTC 2021)
Warnings:
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
Thanks for opening this PR @jiasheng55. I think this feature has already been added with https://issues.apache.org/jira/browse/FLINK-2790. In fact, we call this method at https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/AbstractYarnClusterDescriptor.java#L1293. The problem is that this value is currently not directly configurable. I think this is something we should change by using the APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL.
Thanks for updating this PR @jiasheng55. Changes look good now. Merging.
…f jobs on Yarn This closes apache#8400.
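@jiasheng55 Did you test this? You set the default to 10s, and the YARN max attempts has to be at least 2, right? That means that unless the application fails twice within 10s, the app master will restart forever...
@langdangjushi Please calm down... You can look at the detailed changes under "Files changed". Before this change,
@jiasheng55 Oh, so that was inherited from the earlier code. 10 seconds really is unreasonable for Flink on YARN... it can easily lead to endless restarts.
@jiasheng55 I tested this myself by manually killing the AM process. I used the default 10s and set the number of attempts to 2. I killed the AM process once and saw it restart; a little over ten minutes later I killed it again and the application failed outright, which is not what you described.
My YARN version is 2.6.0, and the setting has no effect. In flink-conf.yaml I set yarn.application-attempt-failures-validity-interval: -1 (any value behaves the same, including ten minutes) and yarn.application-attempts: 2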
hi @chen-yu342 I tried to set |
…f jobs on Yarn
What is the purpose of the change
YARN lets users set attemptFailuresValidityInterval for an application, so that failures which happened outside this interval are not counted toward the application-attempts limit.
ApplicationSubmissionContext Doc
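The semantics described above can be illustrated with a small, self-contained model. This is a simplified sketch, not YARN's or Flink's actual code: the class name AttemptWindowSketch and its methods are hypothetical, and the model assumes that an interval of zero or less means every failure counts, and that the application fails once the counted failures reach the maximum number of attempts.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the validity-interval semantics.
// It is NOT the ResourceManager implementation.
final class AttemptWindowSketch {

    // Returns how many recorded failures still count toward the attempt limit at nowMs.
    static int countedFailures(List<Long> failureTimesMs, long nowMs, long validityIntervalMs) {
        if (validityIntervalMs <= 0) {
            // Assumption: a non-positive interval disables the window, so all failures count.
            return failureTimesMs.size();
        }
        long windowStartMs = nowMs - validityIntervalMs;
        int counted = 0;
        for (long failureTimeMs : failureTimesMs) {
            // Only failures that happened inside the window are counted.
            if (failureTimeMs > windowStartMs) {
                counted++;
            }
        }
        return counted;
    }

    // Returns true if the counted failures have reached the attempt limit.
    static boolean applicationFails(
            List<Long> failureTimesMs, long nowMs, long validityIntervalMs, int maxAttempts) {
        return countedFailures(failureTimesMs, nowMs, validityIntervalMs) >= maxAttempts;
    }

    public static void main(String[] args) {
        // Two application master failures, ten minutes apart.
        List<Long> failures = Arrays.asList(0L, 600_000L);
        // With a 10 s window only the most recent failure counts.
        System.out.println(applicationFails(failures, 600_000L, 10_000L, 2));
        // With the window disabled (-1) both failures count.
        System.out.println(applicationFails(failures, 600_000L, -1L, 2));
    }
}
```

Under this model, a short window combined with failures that are spaced far apart never reaches the attempt limit, which is the restart behaviour debated in the comments above.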
Brief change log
Add a config option to support this feature.
Verifying this change
All the existing integration tests.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)
Documentation