
[FLINK-12472][yarn] Support setting attemptFailuresValidityInterval of jobs on Yarn #8400

Closed
wants to merge 1 commit

Conversation

@jiasheng55 (Contributor) commented May 10, 2019


What is the purpose of the change

YARN has a feature that lets users set attemptFailuresValidityInterval for an application, so that failures which happened outside of this interval are not counted against the application attempts. For example, with a maximum of 2 attempts and a validity interval of 10 minutes, two AM failures more than 10 minutes apart each leave the failure count at 1 and the application is restarted again, while two failures within the same 10-minute window exhaust the attempts and the application fails.

ApplicationSubmissionContext Doc

attemptFailuresValidityInterval. The default value is -1. When attemptFailuresValidityInterval (in milliseconds) is set to a value > 0, failures which happen outside of the validity interval are not taken into the failure count. If the failure count reaches maxAppAttempts, the application is failed.

Brief change log

Add a config option to support this feature.
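A minimal sketch of what such an option could look like, assuming it follows Flink's usual ConfigOptions pattern; the key and the 10s default are taken from the discussion further down in this thread, the class name is only illustrative, and the authoritative definition is in this PR's "Files changed" tab.

    // Illustrative sketch only; see the PR diff for the real definition.
    import org.apache.flink.configuration.ConfigOption;
    import static org.apache.flink.configuration.ConfigOptions.key;

    public class YarnAttemptFailureOptionSketch {

        /**
         * Window (in milliseconds) within which YARN counts application-attempt failures
         * towards the maximum number of attempts; failures outside the window are ignored.
         * A value <= 0 disables the window, so every failure is counted.
         */
        public static final ConfigOption<Long> APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL =
                key("yarn.application-attempt-failures-validity-interval")
                        .defaultValue(10000L)
                        .withDescription(
                                "Time window in milliseconds within which YARN counts application"
                                        + " attempt failures towards yarn.application-attempts.");
    }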

Verifying this change

All the existing integration tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (docs)

@flinkbot (Collaborator) commented May 10, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 5517d47 (Sat Aug 28 11:20:32 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@tillrohrmann (Contributor) left a comment


Thanks for opening this PR @jiasheng55. I think this feature has already been added with https://issues.apache.org/jira/browse/FLINK-2790. In fact we call this method https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/AbstractYarnClusterDescriptor.java#L1293. The problem is that this value is currently not directly configurable. I think this is something we should change by using the APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL.
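For illustration, a minimal sketch of the wiring described above, written against the plain YARN API; Flink's actual code goes through a reflection helper in AbstractYarnClusterDescriptor so that it still compiles against Hadoop versions older than 2.6, and the option key and 10s default here are taken from this thread rather than from the merged change.

    import org.apache.flink.configuration.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

    final class ValidityIntervalWiringSketch {

        static void configure(ApplicationSubmissionContext appContext, Configuration flinkConfiguration) {
            // Read the user-facing option instead of a hard-coded value.
            long validityIntervalMillis = flinkConfiguration.getLong(
                    "yarn.application-attempt-failures-validity-interval", 10000L);

            // setAttemptFailuresValidityInterval is available on Hadoop/YARN 2.6.0 and later.
            appContext.setAttemptFailuresValidityInterval(validityIntervalMillis);
        }
    }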

@tillrohrmann self-assigned this May 13, 2019
@tillrohrmann (Contributor) left a comment


Thanks for updating this PR @jiasheng55. Changes look good now. Merging.

tillrohrmann pushed a commit to jiasheng55/flink that referenced this pull request May 13, 2019
@langdangjushi commented

@jiasheng55 Have you actually tested this?? The default you wrote is 10s, and yarn max attempts has to be at least 2, which means that if there are not 2 failures within 10s, the application master will keep restarting forever...

@jiasheng55 (Contributor, Author) commented

@langdangjushi Please calm down... If you look at the detailed changes under "Files changed", you will see that before this change the value passed to reflector.setAttemptFailuresValidityInterval was AkkaUtils.getTimeout(flinkConfiguration).toMillis(), i.e. 10s.
During the discussion with the community, the 10s default was kept so that this PR would not change the existing default behavior of jobs.
If you think this default value should be changed, you can raise your suggestion in a Flink issue or on the mailing list to push the change forward :)
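To make the explanation above concrete, a small sketch of where the pre-PR 10s comes from: AkkaUtils.getTimeout reads the akka.ask.timeout setting, which defaults to 10 s. The class and method names here are illustrative; only the AkkaUtils call is quoted from the comment above.

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.akka.AkkaUtils;

    final class OldDefaultSketch {

        static long preExistingValidityIntervalMillis(Configuration flinkConfiguration) {
            // Pre-PR behavior: the validity interval was simply the Akka ask timeout,
            // i.e. 10000 ms unless akka.ask.timeout is overridden.
            return AkkaUtils.getTimeout(flinkConfiguration).toMillis();
        }
    }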

@langdangjushi commented

@jiasheng55 Ah, so that is a problem inherited from earlier code... a 10-second default for Flink on YARN really is unreasonable, it can easily lead to endless restarts.

@zhengyifanvp commented

@jiasheng55 In my own test I manually killed the AM process. I used the default 10s and set the number of attempts to 2. The first time I killed the AM process it restarted; when I killed it again more than ten minutes later, the application failed directly, which is not what you described.

@chen-yu342 commented Jan 20, 2022

@zhengyifanvp This requires YARN 2.6.0 at minimum; https://github.com/jiasheng55/flink/blob/5517d47329aa62c0046db3df1109fc273f07051a/flink-yarn/src/main/java/org/apache/flink/yarn/AbstractYarnClusterDescriptor.java#L1379

My YARN version is exactly 2.6.0 and it does not work. In flink-conf.yaml I set yarn.application-attempt-failures-validity-interval: -1 (it behaves the same no matter what value I set, ten minutes as well) and yarn.application-attempts: 2,
but it has no effect: the application still restarts endlessly.

@bgeng777 (Contributor) commented
Hi @chen-yu342, I tried setting yarn.application-attempt-failures-validity-interval with -D yarn.application-attempt-failures-validity-interval=-1 in my YARN (2.8.5) cluster and it works as expected. Maybe you can try the -D option, or double-check in the JM log whether the flink-conf.yaml setting is actually picked up.
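For reference, the two ways of setting the option that come up in this thread, first as flink-conf.yaml entries and then as the dynamic property used above; -1 is just the value from the reports here.

    # flink-conf.yaml
    yarn.application-attempt-failures-validity-interval: -1
    yarn.application-attempts: 2

    # or as a dynamic property when submitting the job
    -D yarn.application-attempt-failures-validity-interval=-1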
