Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(controller): Workflow stop and resume by node didn't properly support offloaded nodes. Fixes #2543 #2548

Merged
merged 9 commits into from May 1, 2020

Conversation

markterm
Copy link
Contributor

@markterm markterm commented Mar 30, 2020

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I have written unit and/or e2e tests for my change. PRs without these are unlike to be merged.
  • Optional. I've added My organization is added to the USERS.md.
  • I've signed the CLA and required builds are green.

Fixes #2621

@markterm
Copy link
Contributor Author

Am starting off by running some builds with the increased timeouts to look at the logs.

@codecov
Copy link

codecov bot commented Mar 31, 2020

Codecov Report

Merging #2548 into master will increase coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2548      +/-   ##
==========================================
+ Coverage   11.16%   11.19%   +0.02%     
==========================================
  Files          83       83              
  Lines       32673    32645      -28     
==========================================
+ Hits         3649     3653       +4     
+ Misses      28525    28495      -30     
+ Partials      499      497       -2     
Impacted Files Coverage Δ
workflow/common/convert.go 78.57% <0.00%> (-10.00%) ⬇️
workflow/validate/validate.go 74.13% <0.00%> (-0.28%) ⬇️
persist/sqldb/migrate.go 0.00% <0.00%> (ø)
pkg/apis/workflow/v1alpha1/workflow_types.go 10.93% <0.00%> (+0.02%) ⬆️
workflow/executor/executor.go 4.41% <0.00%> (+0.03%) ⬆️
workflow/controller/steps.go 65.35% <0.00%> (+0.46%) ⬆️
workflow/controller/operator.go 61.62% <0.00%> (+0.58%) ⬆️
workflow/controller/dag.go 51.25% <0.00%> (+0.63%) ⬆️
workflow/templateresolution/context.go 69.46% <0.00%> (+4.28%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ffc43ce...ca3c4d6. Read the comment docs.

@markterm
Copy link
Contributor Author

markterm commented Apr 1, 2020

I'm currently unable to progress with this as I can't reproduce locally and the CI build is not executing.

@alexec
Copy link
Contributor

alexec commented Apr 5, 2020

Try merging master in?

test/e2e/cli_test.go Outdated Show resolved Hide resolved
@alexec alexec changed the title Suspend test fix, fixes #2543 fix: Suspend test fix. Fixes #2543 Apr 6, 2020
@markterm markterm force-pushed the suspend-test-fix branch 2 times, most recently from e7d340b to 961000b Compare April 11, 2020 10:17
@markterm
Copy link
Contributor Author

@alexec it looks like the issue here was related to node offload. Why is there a different set of tests in CLIWithServerSuite as compared to CLISuite? It seems we would want to run this test in both.

@simster7
Copy link
Member

simster7 commented Apr 11, 2020

@alexec it looks like the issue here was related to node offload. Why is there a different set of tests in CLIWithServerSuite as compared to CLISuite? It seems we would want to run this test in both.

The difference is that in CLIWithServerSuite the Argo Server is given access to the Offload Node Status Repo, while in the CLISuite it is not. If your test needs access to the repo, add it to CLIWithServerSuite, if it doesn't add it to CLISuite. You can also add a:

if !s.Persistence.Enabled() {
    s.T().SkipNow()
}

check to protect tests that you only want to run in an environment where the Argo Server has (or doesn't have) persistence enabled.

I just did something similar, take a look at https://github.com/argoproj/argo/pull/2645/files#diff-4f326ca1865929af6dabf0119581b09cR473 for an example.

@markterm
Copy link
Contributor Author

markterm commented Apr 12, 2020 via email

@markterm
Copy link
Contributor Author

@alexec Can you please tell me if this approach is ok before I move onto #2850

(and do you know why ci/circleci: test didn't run?)

alexec
alexec previously requested changes Apr 27, 2020
Copy link
Contributor

@alexec alexec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right in thinking there was a bug in stopping workflows? If so, can the title of the PR be updated to reflect this?

@markterm
Copy link
Contributor Author

The issue was workflow stop and resume by node didn't properly support offloaded nodes, I'll change the name.

@markterm markterm changed the title fix: Suspend test fix. Fixes #2543 fix: Workflow stop and resume by node didn't properly support offloaded nodes. Fixes #2543 Apr 27, 2020
@markterm
Copy link
Contributor Author

@alexec FYI I have just rebased on master.

@alexec alexec changed the title fix: Workflow stop and resume by node didn't properly support offloaded nodes. Fixes #2543 fix(controller): Workflow stop and resume by node didn't properly support offloaded nodes. Fixes #2543 Apr 29, 2020
wf.Status.Nodes = nil
}

err = packer.CompressWorkflowIfNeeded(wf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ordering is incorrect, CompressWorkflowIfNeeded should always come before repo.Save.

This file misses unit tests. Would you please be able to add them? "No - it is too hard" would be a fine answer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of the ordering, wouldn't that result in it always compressing the nodes if they're too big, whereas currently if offload is enabled for the workflow then it won't try to compress the nodes.

By missing tests, are you talking about adding a test to util_test.go that covers the updateWorkflowNodeByKey function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be able to see this pattern elsewhere - compression can fail - then we offload

@markterm
Copy link
Contributor Author

@alexec please take another look now.

@alexec
Copy link
Contributor

alexec commented Apr 30, 2020

Please fix lint error:

workflow/util/util.go:486:27: Error return value of `compressAndOffloadNodes` is not checked (errcheck)
 
			compressAndOffloadNodes(wf, repo, nodes)
 

@alexec
Copy link
Contributor

alexec commented Apr 30, 2020

Can you login to https://app.circleci.com/? I think you might need to do that to prevent the unauthorized error.

@markterm
Copy link
Contributor Author

@alexec I get a 404 at https://app.circleci.com but I have logged into https://circleci.com/dashboard - unfortunately it didn't help.

@alexec
Copy link
Contributor

alexec commented Apr 30, 2020

OK. I've just merged a fix to master that I think will fix it. Can you sync with master please?

@sonarcloud
Copy link

sonarcloud bot commented Apr 30, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 0 Security Hotspots to review)
Code Smell A 2 Code Smells

74.1% 74.1% Coverage
0.0% 0.0% Duplication

@markterm
Copy link
Contributor Author

@alexec
Looks like that worked 👍

@alexec alexec merged commit 7d8818c into argoproj:master May 1, 2020
@alexec
Copy link
Contributor

alexec commented May 1, 2020

Boom! Merged! 🚀

alexec pushed a commit that referenced this pull request May 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix TestNodeSuspendResume
4 participants