Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34053][INFRA] Cancel the previous build #31104

Closed
wants to merge 4 commits into from

Conversation

mik-laj
Copy link
Member

@mik-laj mik-laj commented Jan 9, 2021

Similar to: #31098 apache/calcite#2318 (solution suggestted by @vlsi - apache/pulsar#9154 (comment))

I used the action, which was maintained by potiuk instead of the original author, for two reasons:

If you use forks, you should create a separate "Cancelling" workflow_run triggered workflow. The workflow_run should be responsible for all canceling actions. The examples below show the possible ways the action can be utilized.

What changes were proposed in this pull request?

This PR aims to reduce the GitHub Action usage by cancelling the previous build.

Why are the changes needed?

In most case, the last commit is meaningful.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Due to the nature of the change, testing of this change is difficult.

Note: This event will only trigger a workflow run if the workflow file is on the default branch.

https://docs.github.com/en/free-pro-team@latest/actions/reference/events-that-trigger-workflows#workflow_run

However, you can see on my fork that this action is triggered.
https://github.com/mik-laj/spark/actions?query=workflow%3A%22Cancelling+Duplicates%22

I also asked the author of this action to review this change - @potiuk (PMC of Apache Airflow) and I have a positive review.

@github-actions github-actions bot added the INFRA label Jan 9, 2021
@mik-laj
Copy link
Member Author

mik-laj commented Jan 9, 2021

@dongjoon-hyun Can I ask for review?

path: ./build/.actions/n1hility/cancel-previous-runs
ref: 953c92201f368370112ea2754545cb4468d89f12 # v2
fetch-depth: 1
- uses: ./build/.actions/n1hility/cancel-previous-runs
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Token is needed here: https://github.com/apache/spark/pull/31104/checks?check_run_id=1673964946#step:4:15

See how the token was declared: https://github.com/apache/spark/pull/31098/files#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2R94

Could you please add it to see if the action would be able to cancel your previous runs for this PR?

Copy link
Member Author

@mik-laj mik-laj Jan 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This token is not needed. ;-) Each action gets a token by default in a input named "token". The action/checkout works the same way. See: https://github.com/actions/checkout/blob/c952173edf28a2bd22e1a4926590c1ac39630461/src/input-helper.ts#L109

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value is passed to all actions by default. See for another example: JamesIves/github-pages-deploy-action#530

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mik-laj, please double-check the link I highlight: https://github.com/apache/spark/pull/31104/checks?check_run_id=1673964946#step:4:15

It literally says that n1hility/cancel-previous-runs fails and it requires authentication:

Run ./build/.actions/n1hility/cancel-previous-runs
  with:
  env:
    MODULES_TO_TEST: core, unsafe, kvstore, avro, network-common, network-shuffle, repl, launcher, examples, sketch, graphx
    EXCLUDED_TAGS: 
    INCLUDED_TAGS: 
    HADOOP_PROFILE: hadoop3.2
    HIVE_PROFILE: hive2.3
    CONDA_PREFIX: /usr/share/miniconda
    GITHUB_PREV_SHA: 50d17ba02600601f8f63fcd55c2bf70865f9dc6e
    GITHUB_INPUT_BRANCH: 
[@octokit/rest] `const Octokit = require("@octokit/rest")` is deprecated. Use `const { Octokit } = require("@octokit/rest")` instead

Branch is patch-1, repo is spark, and owner is apache, and id is 474259497
(node:1987) UnhandledPromiseRejectionWarning: Error: Parameter token or opts.auth is required
    at Function.getAuthString (/home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:5899:19)
    at Function.getOctokitOptions (/home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:5864:29)
    at new GitHub (/home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:5848:22)
    at /home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:1492:25
    at Generator.next (<anonymous>)
    at /home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:1448:71
    at new Promise (<anonymous>)
    at module.exports.131.__awaiter (/home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:1444:12)
    at cancelDuplicates (/home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:1491:12)
    at /home/runner/work/spark/spark/build/.actions/n1hility/cancel-previous-runs/dist/index.js:1600:13
(node:1987) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:1987) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Apparently, that does not look like a workable action.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I checked carefully and it looks like this action must be run as workflow_run.

If you use forks, you should create a separate "Cancelling" workflow_run triggered workflow. The workflow_run should be responsible for all canceling actions. The examples below show the possible ways the action can be utilized.

https://github.com/potiuk/cancel-workflow-runs#usage

I am rewriting my PR now to use workflow_run. I will also ask @potiuk for a review when I'm finished.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It must be a workflow_run action. Ant the 'cancel-workflow-run' action of mine has all the support for that.

Copy link
Member

@potiuk potiuk Jan 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The n1hility action I based my action initially on should be deprecated as it has no good support for that. I actually created a ticket to either deprecate it or merge my changes: n1hility/cancel-previous-runs#7 (I am happy my action to disapear and merge it back) but I have not heard from the author despite nagging several times

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a new change.s I passed token and used workflow_run event.

@vlsi
Copy link

vlsi commented Jan 9, 2021

I might be slightly late, however, https://github.com/potiuk/cancel-workflow-runs seems to be a more up to date action, and the documentation is way better there.

A bonus point is that @potiuk is PMC for Apache Airflow ( https://people.apache.org/phonebook.html?uid=potiuk )

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jan 10, 2021

@mik-laj, mind mentioning the diff from the fork (https://github.com/potiuk/cancel-workflow-runs) and the original (https://github.com/n1hility/cancel-previous-runs) in the PR description?

Also, it would be great if we can mention how it was tested at "How was this patch tested?". If the same way with the plugin was manually tested via another repo, that's fine too.

@HyukjinKwon
Copy link
Member

ok to test

@HyukjinKwon
Copy link
Member

add to whitelist

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe shall we rename build_and_test_workflow_run.yml to something like cancel_duplicate_workflow_runs.yml.

Otherwise, looks making sense to me.

@mik-laj
Copy link
Member Author

mik-laj commented Jan 10, 2021

@HyukjinKwon I renamed this file.

@HyukjinKwon
Copy link
Member

Thanks @mik-laj for addressing my comments.

@HyukjinKwon
Copy link
Member

Let me leave it to @dongjoon-hyun to take another look. I am he are pretty active in the maintenance of GitHub Actions, and I would like to make sure having his thought too.

@mik-laj
Copy link
Member Author

mik-laj commented Jan 10, 2021

Ohh. I reverted one change by mistake, but already reverted. Now we use the action from the potiuk / repository and the file has the correct name.

@SparkQA
Copy link

SparkQA commented Jan 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38470/

@SparkQA
Copy link

SparkQA commented Jan 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38472/

@SparkQA
Copy link

SparkQA commented Jan 10, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38470/

@SparkQA
Copy link

SparkQA commented Jan 10, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38472/

@SparkQA
Copy link

SparkQA commented Jan 10, 2021

Test build #133881 has finished for PR 31104 at commit 1792fc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 10, 2021

Test build #133883 has finished for PR 31104 at commit a30c6c3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM, too. Thank you, @mik-laj , @HyukjinKwon , @vlsi , @potiuk .
Merged to master.

HyukjinKwon added a commit that referenced this pull request Jan 11, 2021
… Actions

### What changes were proposed in this pull request?

This is kind of a followup of #31104 but I decided to track it separately with a separate JIRA.

Currently the jobs are being canceled in main repo branches. If a commit is merged, for example, to master branch before the test finishes, it cancels the previous builds. This is a problem because we cannot, for example, detect logical conflict properly. We should only cancel the jobs in PRs:

![Screen Shot 2021-01-11 at 3 22 24 PM](https://user-images.githubusercontent.com/6477701/104152015-c7f04b80-5421-11eb-9e40-6b0a0e5b8442.png)

This PR proposes to don't do this in the main repo branch commits but only do it in PRs.

### Why are the changes needed?

- To keep the test coverage
- To run the test in the synced master branch instead of relying on the builds made in each PR with an outdated master branch
- To detect test failures from logical conflicts from merging two conflicting PRs at the same time.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

I manually tested in
- HyukjinKwon#27
- HyukjinKwon#28

I added Yi Wu as a co-author since he helped verifying the current fix in the PR above.

I checked that it does not cancel in the main repo branch:

![Screen Shot 2021-01-11 at 3 58 52 PM](https://user-images.githubusercontent.com/6477701/104153656-3afbc100-5426-11eb-9309-85f6f4fd9ff3.png)

I checked it cancels in PRs:

![Screen Shot 2021-01-11 at 3 58 45 PM](https://user-images.githubusercontent.com/6477701/104153658-3d5e1b00-5426-11eb-89f7-786c3ae6849a.png)

Closes #31121 from HyukjinKwon/SPARK-34065.

Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon
Copy link
Member

I enabled this only for the jobs triggered by PRs but there seems a bug in the plugin. If the PR triggers the job cancelation, it cancels all the workflows in the master branch too.

@HyukjinKwon
Copy link
Member

I disabled this workflow for now.

Screen Shot 2021-01-12 at 9 47 31 AM

Let me know if this is fixed in the upstream.

@dongjoon-hyun
Copy link
Member

OMG. Thank you for disabling it.

@potiuk
Copy link
Member

potiuk commented Jan 12, 2021

I enabled this only for the jobs triggered by PRs but there seems a bug in the plugin. If the PR triggers the job cancelation, it cancels all the workflows in the master branch too.

Not all - but 'all but the latest one'. This is by design.

This is not a bug. The plugin works 'aggressively' in cancelAllDuplicates mode - whenever 'cancelAllDuplicates' is run, it will cancel all 'duplicated' runs (all runs that match the same tuple: ('branch', 'oirigin repo', 'event') - regardles which branch it is.

So for every PR and every branch (including master) it will only leave the 'latest build` running. All other will be cancelled. This is by design. The most recent build from master should not be cancelled.

The assumption is that if you merged several merges to master, you really want to run tests on the latest version of this - all the previous merges are not 'valid' any more - they are superseded by the latest build.

In case of Airflow it saves hours of needless builds.. Is this a problem for Spark ?

@potiuk
Copy link
Member

potiuk commented Jan 12, 2021

The reason why it is so 'aggressive' is that when you have queues, it's very likely that your "cancel runs' are also queued, so we want to make sure that we cancel everything that we know is a duplicate, whenever any of the 'cancel' runs gets executed.

The plugin can be easily extended to exclude certain branches from this cancelling if you think this is something you need, but I'd like to ask first - what do you loose if those master builds are cancelled if there is another master build already scheduled to run ?

@potiuk
Copy link
Member

potiuk commented Jan 12, 2021

Actually the feature is already there - you can skip certain event types from being cancelled (in this case skipping the 'push' event will skip cancelling any of the runs that were originated by the "push" event (i.e. all master merges). I will make a PR that should enable it now and feel free to merge it if this is what you really want.

@HyukjinKwon
Copy link
Member

@potiuk, I already tried such fix at #31121 but it did not work. The cancelation workflow triggered by a PR canceled workflow runs in master branch, which looks not making sense much. If a user open a PR, then it cancels the jobs in the master branch .. ?

@potiuk
Copy link
Member

potiuk commented Jan 12, 2021

  1. The cancelation worklfow must run in master branch in order to utlise GITHUB_TOKEN that has "write" capabilities (otherwise you have no rights to cancel workflo). This is how 'workflow_run' type of even works - it always runs using master code. See this stack-overflow question for example : https://stackoverflow.com/questions/63343937/how-to-use-the-github-actions-workflow-run-event#:~:text=The%20new%20workflow_run%20event%20enables,token%20as%20well%20as%20secrets.

This is due to security reasons. Otherwise any person could make a PR where they could change the workflow and do whatever they want to your repository. Workflow_run always runs wtith the workflow that is defined in your master, not in the PR. In order to test any change in the workflow_run you have to push it as master (I, for example, test such changes by pushing the PR branch as "master" to my own fork github push -f potiuk <branch>:master (where potiuk is the remote which is my own private fork).

  1. Yes. The action is 'aggressive` because it is only useful if you are in the situation where your queue is "full" which means that any new action (including the cancel one) might take hours to even start executing. So if 'any' of the cancel workflows starts running - even from PRs that were created hours ago (just to remind - it's always using a master code) - this action looks at ALL builds and cancel ALL duplicate ones. That by default includes all "push" events (any merge to master is 'push', "schedule" events, "pull_request" - both coming from forks and pull requests that are done inside the repo.

The mode of cancelling 'cancelAllDuplicates` does exactly this. It does not even look at the original branch/PR that triggered it.

  1. The change I am opening changes it's behaviour - it will not look for duplicates when there is a "push" or "schedule" event. Which means it will be cancelling ALL duplicates in ALL PRs opened that have more than one build in progress. This means that whenever cancel workflow has a chance to run, only the latest PR push will continue executing, others will be cancelled.

@potiuk
Copy link
Member

potiuk commented Jan 12, 2021

The PR is here: #31153

@HyukjinKwon
Copy link
Member

Sure, I used the same approach at https://github.com/apache/spark/blob/master/.github/workflows/test_report.yml to handle the security issue by leveraging workflow_run when workflow_run feature first came out. I am aware of that.

I see, the new approach sounds much better. Thanks for working on this.

HyukjinKwon pushed a commit that referenced this pull request Jan 12, 2021
…flows

Changes the configuration of the cancel workflow action to skip schedule/push events from canceling. This has the effect that duplicates of all direct pushes (master merges or direct pushes to the spark repository are not cancelled)

### What changes were proposed in this pull request?

Update to CI cancel policy to skip direct pushes. Duplicates will only be cancelled for Pull Requests.

### Why are the changes needed?

[Apparenlty](#31104 (comment)) the aggressive behavior of the cancel action which also cancels duplicate master builds is too agressive for spark community. This change spares merges to master and scheduled builds from duplicate checking (as a result all merges to master will be always build to completion).

The previous behavior of the action was that in case of subsequent merges to master, only the latest one was guaranteed to complete. Other changes could be cancelled before they complete to save job queue.

### Does this PR introduce _any_ user-facing change?

No, except if the master builds were somehow facing the users (but it's unlikely taking into account the ASF release policy).
There was a potential that some changes that could be detected by specific master merge failing could be detected later (in one of the subsequent builds) which could make investigation of the root cause of failure a bit more difficult, because it could have been introduced in one of the commits between two completed builds merges. But this is at most impacting the timeline of release close to release itself, not the release itself.

### How was this patch tested?

This configuration parameter has been tested earlier in Airflow. We used it initially, but since our master builds are heavy and we have extensive tests in the PRs and investigation for failed builds is not as difficult we found that limiting the strain on Github Action and cancelling master builds was more important for the health of the whole ASF community and we removed that configuration.

Tested in https://github.com/potiuk/spark/runs/1688506527?check_suite_focus=true#step:2:46 where the action found other master builds in progress but did not add them as candidates to cancel.

Closes #31153 from potiuk/skip-schedule-push-branches.

Authored-by: Jarek Potiuk <potiuk@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@potiuk
Copy link
Member

potiuk commented Jan 12, 2021

I see, the new approach sounds much better. Thanks for working on this.

You are welcome! Happy to help!

Also the action has quite more sophisticated ways of cancelling the builds. I think at Apache Airflow we are saving at least 30% (in some cases even more) of the build time by pretty aggressive cancelling.

For example when any of our jobs fails at the "static check" failure we also cancel the workflow aggressively as well.

This is important because we have 5-10 minutes long 'static checks' and 40-50 minutes tests. but we want to start tests as soon as possible rather than waiting for static checks to complete. So we want to start them in parallel, but then if any of the static checks fails, we want to cancel all running tests because we know the change will have to be re-pushed anyway.

Unfortunately GA only allow such 'fail-fast' in matrix-jobs, so I had to add this option to the cancel action.

This way we save even more of the precious build time for the whole ASF community.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants