Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Upgrade test on GKE #10422

Merged
merged 7 commits into from
Mar 19, 2020
Merged

CI: Upgrade test on GKE #10422

merged 7 commits into from
Mar 19, 2020

Conversation

raybejjani
Copy link
Contributor

@raybejjani raybejjani commented Mar 3, 2020

The commits are somewhat independent.

  • Prefer helm options passed into RunHelm. This allows overriding the global overrides (e.g. force node-init to off when running preflight).
    • Ensure node-init is off when running preflight in the Update/Upgrade tests
  • Parse the agent.image tag and set registry and tag, if they are not otherwise set and .image is a full image URL.
  • Ensure preflight.image always matches agent.image. This accounts for overrides of the image.
  • Cleanup left-over preflight pods in the Update/Upgrade tests.

This change is Reviewable

@raybejjani raybejjani added wip area/CI Continuous Integration testing issue or flake labels Mar 3, 2020
@raybejjani raybejjani requested a review from a team as a code owner March 3, 2020 09:58
@maintainer-s-little-helper
Copy link

Release note label not set, please set the appropriate release note.

2 similar comments
@maintainer-s-little-helper
Copy link

Release note label not set, please set the appropriate release note.

@maintainer-s-little-helper
Copy link

Release note label not set, please set the appropriate release note.

@raybejjani
Copy link
Contributor Author

test-gke K8sUpdates*

@coveralls
Copy link

coveralls commented Mar 3, 2020

Coverage Status

Coverage increased (+0.002%) to 45.611% when pulling 9a94cf3 on raybejjani:gke-upgrade-test into dbec3fd on cilium:master.

@raybejjani
Copy link
Contributor Author

test-gke K8sDemosTest*

@raybejjani
Copy link
Contributor Author

test-gke K8s*

@raybejjani
Copy link
Contributor Author

raybejjani commented Mar 3, 2020

@raybejjani
Copy link
Contributor Author

raybejjani commented Mar 3, 2020

test-gke K8s* (https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/108/ it passed!)

@raybejjani
Copy link
Contributor Author

test-gke K8s*

return imageURL, version
}

func GetLatestCiliumVersion() (imageURL, version string) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported function GetLatestCiliumVersion should have comment or be unexported

@@ -172,6 +172,40 @@ func init() {
}
}

func GetStableCiliumVersion() (imageURL, version string) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported function GetStableCiliumVersion should have comment or be unexported

@raybejjani
Copy link
Contributor Author

test-gke K8s*

@raybejjani
Copy link
Contributor Author

test-me-please

@raybejjani
Copy link
Contributor Author

test-me-please

@raybejjani
Copy link
Contributor Author

test-gke K8s*

@raybejjani
Copy link
Contributor Author

test-gke K8sUpdates*

@raybejjani
Copy link
Contributor Author

test-focus K8sUpdates*

@raybejjani
Copy link
Contributor Author

test-gke K8sUpdates*

@raybejjani
Copy link
Contributor Author

test-focus K8sUpdates*

@raybejjani
Copy link
Contributor Author

test-gke K8sUpdates*

@raybejjani
Copy link
Contributor Author

test-me-please

@raybejjani
Copy link
Contributor Author

test-gke

Previously, we unconditionally applied helm options from integration
overrides. This meant that specific invocatons could not override the
overrides. This is an issue with GKE because it requires options like
node-init to be enabled, but this then conflicts with some setup steps
in the K8sUpdates test, in particular disabling the node-init pods when
generating preflight-only yaml.
The code now prefers the most local options, passed in to the function.
This shouldn't change any existing invocations since none overrode the
global overrides previously.

Signed-off-by: Ray Bejjani <ray@isovalent.com>
We have special handling of `.image` helm variables. They are considered
fully-qualified URLs if they include a '/' (in our defaults, the
registry includes the user/org prefix instead of the image).

Tests can assume that that setting only one of registry, image or tag
(often tag) will result in usable image URLs. This is true except in the
upgrade test, where the stable image is internet-sourced and the target
image is the locally built one. In the case where we pass in a
`--cilium.image` option that is a fully-qualified URL this is incorrect.
This manifests most visibly when running tests against hosted k8s, like
GKE or EKS, because the to-test image (i.e. the target upgrade image in
the update/upgrade tests) is built and hosted outside the cluster.

When given a fully-qualified image URL value we also set the
corresponding values for registry and tag.

Signed-off-by: Ray Bejjani <ray@isovalent.com>
cilium-preflight's primary purpose is to preload the cilium images into
the cluster, so the image used should always match the target agent
image.

The update/upgrade tests attempt to mimic the instructions
we give for upgrades but did not set `preflight.image`. This resulted in
the cilium:latest being used (or other arbitrary combiniations with the
value of `global.tag`) and the preflight load step would fail.

We now initialize, globally, `preflight.image` to match `agent.image`.
This allows helm generation for preflight to select the correct image.

Signed-off-by: Ray Bejjani <ray@isovalent.com>
Since we use helm install in the update/upgrade test, preflight isn't
expected to be running. When we re-run a failed test, this triggers a
helm error.

Similarly, whatever agent.image is set to during cleanup
should also be the value for preflight. It is now set appropriately.

Signed-off-by: Ray Bejjani <ray@isovalent.com>
When running on integrations with overrides that enable node-init,
K8sUpdateTest would `helm install node-init`, without later removing the
node-init pods.

This caused helm errors when the test was re-run because of the leaked
pods. The test utility functions now delete node-init when cleaning up
and disable it  when running the preflight pods.

Signed-off-by: Ray Bejjani <ray@isovalent.com>
The update/upgrade tests deleted the DNS deployment at various points in
order to force the pods to restart with new cilium instances. This left
the cluster in a bad state on failure (without DNS), or could cause the
left-over DNS deployment to be the wrong one (the test did not account
for how DNS was setup globally).

Signed-off-by: Ray Bejjani <ray@isovalent.com>
@raybejjani
Copy link
Contributor Author

test-gke K8s(Updates)|(FQDNTest)|(ServicesTest).*

@raybejjani
Copy link
Contributor Author

test-focus K8s(Updates)|(FQDNTest)|(ServicesTest).*

@raybejjani
Copy link
Contributor Author

test-gke K8s(Updates)|(FQDNTest)|(ServicesTest).*

@raybejjani
Copy link
Contributor Author

test-focus K8s(Updates)|(FQDNTest)|(ServicesTest).*

We sometimes have DNS pods in backoffs longer than the timeouts in
tests. Restarting the pods should force them to come online sooner, if
at all.

Signed-off-by: Ray Bejjani <ray@isovalent.com>
@raybejjani
Copy link
Contributor Author

test-gke K8s(Updates)|(FQDNTest)|(ServicesTest).*

@raybejjani
Copy link
Contributor Author

test-focus K8s(Updates)|(FQDNTest)|(ServicesTest).*

@raybejjani
Copy link
Contributor Author

test-gke K8s(Updates)|(FQDNTest)|(ServicesTest).*

@raybejjani
Copy link
Contributor Author

raybejjani commented Mar 19, 2020

test-focus K8s(Updates)|(FQDNTest)|(ServicesTest).* (success https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/18044/console)

@raybejjani
Copy link
Contributor Author

test-gke K8s(Updates)|(FQDNTest)|(ServicesTest).*

@raybejjani
Copy link
Contributor Author

raybejjani commented Mar 19, 2020

test-gke K8s(Updates)|(FQDNTest)|(ServicesTest).* (https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/171/console 1 fail in K8sServicesTest Checks service across nodes with L7 policy Tests NodePort with L7 Policy)

@raybejjani
Copy link
Contributor Author

raybejjani commented Mar 19, 2020

@raybejjani
Copy link
Contributor Author

raybejjani commented Mar 19, 2020

test-gke (1 fail, the managed etcd flake https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/172/consoleFull)

@nebril
Copy link
Member

nebril commented Mar 19, 2020

Let's ship it.

@raybejjani raybejjani merged commit 84dd81c into cilium:master Mar 19, 2020
@raybejjani raybejjani deleted the gke-upgrade-test branch March 19, 2020 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants