-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34053][INFRA] Cancel the previous build #31104
Conversation
@dongjoon-hyun Can I ask for review? |
.github/workflows/build_and_test.yml
Outdated
path: ./build/.actions/n1hility/cancel-previous-runs | ||
ref: 953c92201f368370112ea2754545cb4468d89f12 # v2 | ||
fetch-depth: 1 | ||
- uses: ./build/.actions/n1hility/cancel-previous-runs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Token is needed here: https://github.com/apache/spark/pull/31104/checks?check_run_id=1673964946#step:4:15
See how the token was declared: https://github.com/apache/spark/pull/31098/files#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2R94
Could you please add it to see if the action would be able to cancel your previous runs for this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This token is not needed. ;-) Each action gets a token by default in a input named "token". The action/checkout works the same way. See: https://github.com/actions/checkout/blob/c952173edf28a2bd22e1a4926590c1ac39630461/src/input-helper.ts#L109
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Token is required for cancel-previous-runs: https://github.com/n1hility/cancel-previous-runs/blob/178b93a12fb2731212e48c2cc5b7e37b937fd339/action.yml#L5-L7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value is passed to all actions by default. See for another example: JamesIves/github-pages-deploy-action#530
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mik-laj, please double-check the link I highlight: https://github.com/apache/spark/pull/31104/checks?check_run_id=1673964946#step:4:15
It literally says that n1hility/cancel-previous-runs
fails and it requires authentication:
Run ./build/.actions/n1hility/cancel-previous-runs
with:
env:
MODULES_TO_TEST: core, unsafe, kvstore, avro, network-common, network-shuffle, repl, launcher, examples, sketch, graphx
EXCLUDED_TAGS:
INCLUDED_TAGS:
HADOOP_PROFILE: hadoop3.2
HIVE_PROFILE: hive2.3
CONDA_PREFIX: /usr/share/miniconda
GITHUB_PREV_SHA: 50d17ba02600601f8f63fcd55c2bf70865f9dc6e
GITHUB_INPUT_BRANCH:
[@octokit/rest] `const Octokit = require("@octokit/rest")` is deprecated. Use `const { Octokit } = require("@octokit/rest")` instead
Branch is patch-1, repo is spark, and owner is apache, and id is 474259497
(node:1987) UnhandledPromiseRejectionWarning: Error: Parameter token or opts.auth is required
at Function.getAuthString (/home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:5899:19)
at Function.getOctokitOptions (/home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:5864:29)
at new GitHub (/home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:5848:22)
at /home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:1492:25
at Generator.next (<anonymous>)
at /home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:1448:71
at new Promise (<anonymous>)
at module.exports.131.__awaiter (/home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:1444:12)
at cancelDuplicates (/home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:1491:12)
at /home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:1600:13
(node:1987) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:1987) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Apparently, that does not look like a workable action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I checked carefully and it looks like this action must be run as workflow_run.
If you use forks, you should create a separate "Cancelling" workflow_run triggered workflow. The workflow_run should be responsible for all canceling actions. The examples below show the possible ways the action can be utilized.
https://github.com/potiuk/cancel-workflow-runs#usage
I am rewriting my PR now to use workflow_run. I will also ask @potiuk for a review when I'm finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It must be a workflow_run action. Ant the 'cancel-workflow-run' action of mine has all the support for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The n1hility action I based my action initially on should be deprecated as it has no good support for that. I actually created a ticket to either deprecate it or merge my changes: n1hility/cancel-previous-runs#7 (I am happy my action to disapear and merge it back) but I have not heard from the author despite nagging several times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a new change.s I passed token and used workflow_run event.
I might be slightly late, however, https://github.com/potiuk/cancel-workflow-runs seems to be a more up to date action, and the documentation is way better there. A bonus point is that @potiuk is PMC for Apache Airflow ( https://people.apache.org/phonebook.html?uid=potiuk ) |
@mik-laj, mind mentioning the diff from the fork (https://github.com/potiuk/cancel-workflow-runs) and the original (https://github.com/n1hility/cancel-previous-runs) in the PR description? Also, it would be great if we can mention how it was tested at "How was this patch tested?". If the same way with the plugin was manually tested via another repo, that's fine too. |
ok to test |
add to whitelist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe shall we rename build_and_test_workflow_run.yml
to something like cancel_duplicate_workflow_runs.yml
.
Otherwise, looks making sense to me.
@HyukjinKwon I renamed this file. |
Thanks @mik-laj for addressing my comments. |
Let me leave it to @dongjoon-hyun to take another look. I am he are pretty active in the maintenance of GitHub Actions, and I would like to make sure having his thought too. |
Ohh. I reverted one change by mistake, but already reverted. Now we use the action from the potiuk / repository and the file has the correct name. |
Kubernetes integration test starting |
Kubernetes integration test starting |
Kubernetes integration test status success |
Kubernetes integration test status success |
Test build #133881 has finished for PR 31104 at commit
|
Test build #133883 has finished for PR 31104 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM, too. Thank you, @mik-laj , @HyukjinKwon , @vlsi , @potiuk .
Merged to master.
… Actions ### What changes were proposed in this pull request? This is kind of a followup of #31104 but I decided to track it separately with a separate JIRA. Currently the jobs are being canceled in main repo branches. If a commit is merged, for example, to master branch before the test finishes, it cancels the previous builds. This is a problem because we cannot, for example, detect logical conflict properly. We should only cancel the jobs in PRs: ![Screen Shot 2021-01-11 at 3 22 24 PM](https://user-images.githubusercontent.com/6477701/104152015-c7f04b80-5421-11eb-9e40-6b0a0e5b8442.png) This PR proposes to don't do this in the main repo branch commits but only do it in PRs. ### Why are the changes needed? - To keep the test coverage - To run the test in the synced master branch instead of relying on the builds made in each PR with an outdated master branch - To detect test failures from logical conflicts from merging two conflicting PRs at the same time. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? I manually tested in - HyukjinKwon#27 - HyukjinKwon#28 I added Yi Wu as a co-author since he helped verifying the current fix in the PR above. I checked that it does not cancel in the main repo branch: ![Screen Shot 2021-01-11 at 3 58 52 PM](https://user-images.githubusercontent.com/6477701/104153656-3afbc100-5426-11eb-9309-85f6f4fd9ff3.png) I checked it cancels in PRs: ![Screen Shot 2021-01-11 at 3 58 45 PM](https://user-images.githubusercontent.com/6477701/104153658-3d5e1b00-5426-11eb-89f7-786c3ae6849a.png) Closes #31121 from HyukjinKwon/SPARK-34065. Lead-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
I enabled this only for the jobs triggered by PRs but there seems a bug in the plugin. If the PR triggers the job cancelation, it cancels all the workflows in the master branch too. |
OMG. Thank you for disabling it. |
Not all - but 'all but the latest one'. This is by design. This is not a bug. The plugin works 'aggressively' in So for every PR and every branch (including The assumption is that if you merged several merges to master, you really want to run tests on the latest version of this - all the previous merges are not 'valid' any more - they are superseded by the latest build. In case of Airflow it saves hours of needless builds.. Is this a problem for Spark ? |
The reason why it is so 'aggressive' is that when you have queues, it's very likely that your "cancel runs' are also queued, so we want to make sure that we cancel everything that we know is a duplicate, whenever any of the 'cancel' runs gets executed. The plugin can be easily extended to exclude certain branches from this cancelling if you think this is something you need, but I'd like to ask first - what do you loose if those master builds are cancelled if there is another master build already scheduled to run ? |
Actually the feature is already there - you can skip certain event types from being cancelled (in this case skipping the 'push' event will skip cancelling any of the runs that were originated by the "push" event (i.e. all master merges). I will make a PR that should enable it now and feel free to merge it if this is what you really want. |
This is due to security reasons. Otherwise any person could make a PR where they could change the workflow and do whatever they want to your repository. Workflow_run always runs wtith the workflow that is defined in your master, not in the PR. In order to test any change in the workflow_run you have to push it as master (I, for example, test such changes by pushing the PR branch as "master" to my own fork
The mode of cancelling 'cancelAllDuplicates` does exactly this. It does not even look at the original branch/PR that triggered it.
|
The PR is here: #31153 |
Sure, I used the same approach at https://github.com/apache/spark/blob/master/.github/workflows/test_report.yml to handle the security issue by leveraging I see, the new approach sounds much better. Thanks for working on this. |
…flows Changes the configuration of the cancel workflow action to skip schedule/push events from canceling. This has the effect that duplicates of all direct pushes (master merges or direct pushes to the spark repository are not cancelled) ### What changes were proposed in this pull request? Update to CI cancel policy to skip direct pushes. Duplicates will only be cancelled for Pull Requests. ### Why are the changes needed? [Apparenlty](#31104 (comment)) the aggressive behavior of the cancel action which also cancels duplicate master builds is too agressive for spark community. This change spares merges to master and scheduled builds from duplicate checking (as a result all merges to master will be always build to completion). The previous behavior of the action was that in case of subsequent merges to master, only the latest one was guaranteed to complete. Other changes could be cancelled before they complete to save job queue. ### Does this PR introduce _any_ user-facing change? No, except if the master builds were somehow facing the users (but it's unlikely taking into account the ASF release policy). There was a potential that some changes that could be detected by specific master merge failing could be detected later (in one of the subsequent builds) which could make investigation of the root cause of failure a bit more difficult, because it could have been introduced in one of the commits between two completed builds merges. But this is at most impacting the timeline of release close to release itself, not the release itself. ### How was this patch tested? This configuration parameter has been tested earlier in Airflow. We used it initially, but since our master builds are heavy and we have extensive tests in the PRs and investigation for failed builds is not as difficult we found that limiting the strain on Github Action and cancelling master builds was more important for the health of the whole ASF community and we removed that configuration. Tested in https://github.com/potiuk/spark/runs/1688506527?check_suite_focus=true#step:2:46 where the action found other master builds in progress but did not add them as candidates to cancel. Closes #31153 from potiuk/skip-schedule-push-branches. Authored-by: Jarek Potiuk <potiuk@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
You are welcome! Happy to help! Also the action has quite more sophisticated ways of cancelling the builds. I think at Apache Airflow we are saving at least 30% (in some cases even more) of the build time by pretty aggressive cancelling. For example when any of our jobs fails at the "static check" failure we also cancel the workflow aggressively as well. This is important because we have 5-10 minutes long 'static checks' and 40-50 minutes tests. but we want to start tests as soon as possible rather than waiting for static checks to complete. So we want to start them in parallel, but then if any of the static checks fails, we want to cancel all running tests because we know the change will have to be re-pushed anyway. Unfortunately GA only allow such 'fail-fast' in matrix-jobs, so I had to add this option to the cancel action. This way we save even more of the precious build time for the whole ASF community. |
Similar to: #31098 apache/calcite#2318 (solution suggestted by @vlsi - apache/pulsar#9154 (comment))
I used the action, which was maintained by potiuk instead of the original author, for two reasons:
What changes were proposed in this pull request?
This PR aims to reduce the GitHub Action usage by cancelling the previous build.
Why are the changes needed?
In most case, the last commit is meaningful.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Due to the nature of the change, testing of this change is difficult.
https://docs.github.com/en/free-pro-team@latest/actions/reference/events-that-trigger-workflows#workflow_run
However, you can see on my fork that this action is triggered.
https://github.com/mik-laj/spark/actions?query=workflow%3A%22Cancelling+Duplicates%22
I also asked the author of this action to review this change - @potiuk (PMC of Apache Airflow) and I have a positive review.