
Accepted conformance tests run without full test suite #759

Closed
MarcelMue opened this issue Oct 30, 2019 · 12 comments

@MarcelMue
Contributor

While looking through past conformance test results for comparison with our own, I noticed that some results show that the full test suite was not run.

Here is a set of notable examples in the 1.15 conformance tests:

  • sap-cp-aws log
  • sap-cp-azure log
  • sap-cp-openstack log
  • gardener-azure log
  • gardener-aws log
  • gardener-gcp log
  • gardener-openstack log

All of these tests have in common that they ran 212 test cases instead of the 215 test cases the official suite runs. Additionally, the misconfiguration is visible right at the beginning of the log, because the line

Conformance test: not doing test setup.

does not appear when running the official conformance e2e test. Here is a valid example from DigitalOcean for comparison: https://raw.githubusercontent.com/cncf/k8s-conformance/master/v1.15/digitalocean/e2e.log
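
For anyone who wants to spot-check a submitted log themselves, a quick grep along these lines makes both symptoms visible (this assumes the standard ginkgo output format and is only a sketch on my side, not an official check):

# present in the incomplete runs, absent from an official conformance run
grep "Conformance test: not doing test setup." e2e.log

# the ginkgo summary line shows how many specs actually ran, e.g. "Ran 212 of ... Specs"
grep -E "Ran [0-9]+ of [0-9]+ Specs" e2e.log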

This raises several questions for me:

  1. How come these runs were accepted as Kubernetes-conformant if the logs quite clearly state that they did not run the full test suite?
  2. How can the process be improved to prevent issues like this from occurring by accident?

In general this seems to me like an unfortunate, accidental error; however, it lessens the value of Certified Kubernetes, in my mind at least.
I do not believe that the parties submitting these conformance tests acted in bad faith, but I do think that the verification process for conformance tests should be improved.

@MarcelMue MarcelMue changed the title Accepted conformance tests run without full test suit Accepted conformance tests run without full test suite Oct 30, 2019
@timothysc
Member

I don't have time to dig into why there is a delta. @johnSchnake can you take a quick look to see what the delta is here?

@johnSchnake
Contributor

Making notes as I review this:

  • The line Conformance test: not doing test setup. comes from hack/ginkgo-e2e.sh, which makes me assume this wasn't generated by running Sonobuoy (which is fine, just FYI). Sonobuoy is recommended but not required.
  • The instructions for the first item in your list (sap-cp-aws) refer to a URL that returns a 404. They indicate that the tests were run via Sonobuoy using some stored YAML manifest. In theory that would be an acceptable way to run the tests (you don't have to use the CLI), but again the logs don't seem to indicate the run went through Sonobuoy (which doesn't use the ginkgo-e2e.sh script).
  • I ran sonobuoy run --plugin-env e2e.E2E_DRYRUN=true --kube-conformance-image-version=v1.15.2 to check the list of conformance tests for that version and confirmed (as stated in the ticket) that 215 tests are in the conformance suite, even though only 212 were run in these results.
  • The missing tests in the first case (we'd have to confirm them for the others) are:
    • [sig-apps] Job should delete a job [Conformance]
    • [sig-apps] ReplicationController should surface a failure condition on a common issue like exceeded quota [Conformance]
    • [sig-network] DNS should provide DNS for ExternalName services [Conformance]

I will take a look and see if I can find a reason why those tests wouldn't have been selected, either by that hack script or for some other reason.

What needs to happen IMO going forward is:

  • The logs should print the focus/skip values used for ginkgo. This is printed in the pod logs but not as part of the e2e.log output.
  • A validation script should automate the checks that need to happen (number of tests passed/skipped, and that the server and test versions match); a rough sketch follows below.
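
As a rough illustration of what such a validation script could check (the grep patterns assume the usual ginkgo summary lines in e2e.log; this is a sketch, not a finished tool):

#!/usr/bin/env bash
# validate-log.sh <e2e.log> <expected-test-count> -- sketch only
LOG="$1"; EXPECTED="$2"

# how many specs actually ran, taken from the final ginkgo summary line
RAN=$(grep -Eo "Ran [0-9]+ of [0-9]+ Specs" "$LOG" | tail -n1 | awk '{print $2}')
if [ "$RAN" != "$EXPECTED" ]; then
  echo "expected $EXPECTED conformance tests, log shows $RAN"; exit 1
fi

# no failed specs in the summary
grep -q " 0 Failed " "$LOG" || { echo "log reports failed specs"; exit 1; }

# version check would go here: compare the e2e test version printed near the
# top of the log against the reported server version
echo "OK: $RAN tests ran, none failed"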

@hh
Collaborator

hh commented Oct 30, 2019

/assign @hh @johnSchnake

@johnSchnake
Contributor

All 3 of those tests were also added to Conformance in v1.15, but the SHA in the logs for the test matches the SHA in my test run with 215 tests, so I don't think that could account for it.

@OlegLoewen You were the one who submitted at least some of these; could you clarify how these runs were obtained? They indicate they used a Sonobuoy YAML manifest which hasn't existed for some time and also wouldn't have matched the version of tests used here.

@hh
Collaborator

hh commented Oct 30, 2019

@johnSchnake and I will take a closer look at this together tomorrow

@MarcelMue
Contributor Author

MarcelMue commented Oct 30, 2019

@johnSchnake and I will take a closer look at this together tomorrow

Let me know if I can be helpful in any way.

I looked through the docs for conformance tests and noticed that the Reviewing Guidelines only make a loose statement about the number of test cases, in point 5: https://github.com/cncf/k8s-conformance/blob/master/reviewing.md#technical-requirements
Would it make sense to pin the expected number of test cases? Changing the number of test cases would then also require an update to the reviewing guidelines, but some verification through CI, as you mentioned previously, would be superior.

Additionally, I noticed that the wording around Sonobuoy usage is quite vague. I understand from what @johnSchnake wrote that it is only recommended; however, the content description for a PR here makes it seem like Sonobuoy is required (which was my understanding so far).

Some guidelines around which execution methods for the test cases are actually allowed would help in debugging situations like this, IMO. I am not sure how feasible that would be, as I have no overview of the execution methods currently in use other than Sonobuoy.

@OlegLoewen
Contributor

OlegLoewen commented Oct 31, 2019

@johnSchnake Thanks for pointing that out, I have found the root cause of the missing test cases. It was an unfortunate misconfiguration of our test case list file. I will fix it soon and reupload the test results. Let me explain how these test results were obtained:

First of all, the description of how to reproduce the sap-cp-aws test results is outdated. I will update it soon. Further below I also describe in detail how we currently run the conformance tests (and others).

We are continuously working on the test coverage of gardener, which is why readme descriptions like the one mentioned get outdated pretty fast. At the beginning we were using Sonobuoy, which was fine for a while. But since we wanted to test additional k8s e2e test cases besides conformance, and we wanted to automate test runs to ensure quality, we introduced Test Machinery and a kubetest wrapper, which allow us to test clusters in a structured and consistent way. Currently we are running over 260 e2e test suites a day (on different landscapes, k8s versions, cloud providers and operating systems), which covers 1099 unique test cases and 113025 executed test cases in total per day. We are uploading conformance test results to Testgrid (conformance, others) on a daily basis for k8s versions 1.10 to 1.16 and multiple cloud providers.

The reason we were forced to build a kubetest wrapper was to gain more control over test case grouping and test result evaluation. With it we now have:

  • The possibility to define custom test case groups with individual tagging like fast, slow, flaky, etc., because we sometimes want custom grouping and because the existing tags in test case names are not reliable. E.g. some test cases are slow although they don't have a [Slow] tag, and the same goes for [Flaky].
  • Filtering of false-positive test cases
  • Parsing of the resulting junit.xml files and e2e logs into our JSON format for ingestion into Elasticsearch
  • Annotation of test cases to run only for dedicated cloud providers
  • Merging of the junit.xml files of parallel kubetest executions (this is not used for conformance)

Regarding this discussion, it would be great to have some validation on the Testgrid side, e.g. of test case numbers, to avoid such incomplete executions.

You will be able to reproduce the test results if you run:

# first set KUBECONFIG to your cluster
docker run -ti --rm -v $KUBECONFIG:/mye2e/shoot.config golang:1.13 bash

# run all commands below within container
export E2E_EXPORT_PATH=/tmp/export; export KUBECONFIG=/mye2e/shoot.config; export GINKGO_PARALLEL=false
go get github.com/gardener/test-infra; cd /go/src/github.com/gardener/test-infra
export GO111MODULE=on
go run -mod=vendor ./integration-tests/e2e -debug=true -k8sVersion=1.16.1 -cloudprovider=azure -testcasegroup="conformance"
echo "testsuite finished"

Which internally runs:

kubetest --provider=skeleton  --extract=v1.16.2 --deployment=local --test --check-version-skew=false --test_args=--clean-start --ginkgo.dryRun=false --ginkgo.focus=\\[k8s\\.io\\]\\sSecurity\\sContext\\sWhen\\screating\\sa\\spod\\swith\\sprivileged\\sshould\\srun\\sthe\\scontain... -dump=/tmp/e2e.log

@johnSchnake
Contributor

The config here: https://github.com/gardener/test-infra/blob/68c3d60171fcb7d36b39d86935d14f5ea55064d6/integration-tests/e2e/kubetest/description/1.16/working.json#L3 looks like the link between the testcasegroup you mention and how the focus/skip get set on the actual run, is that correct?

FWIW, for conformance certification runs you should just be setting the focus to [Conformance] and not setting a skip value at all. I understand that in basic CI situations you may want to skip Serial/Slow though (Sonobuoy does something similar to avoid disruptive tests).
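
For illustration, a certification-style invocation derived from the kubetest command you quoted might look roughly like this (same flags as your command, with the focus narrowed to [Conformance] and no skip; treat the exact values as placeholders rather than official instructions):

kubetest --provider=skeleton --extract=v1.16.2 --deployment=local --test \
  --check-version-skew=false \
  --test_args="--ginkgo.focus=\[Conformance\]" \
  --dump=/tmp/results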

Here is the directory for all the v1.15 tests: https://github.com/gardener/test-infra/tree/68c3d60171fcb7d36b39d86935d14f5ea55064d6/integration-tests/e2e/kubetest/description/1.15 and it just seems like those tests aren't listed in any of the working/skip/false-positive files, so I must be misunderstanding some component of how the test list is assembled.

@OlegLoewen
Contributor

OlegLoewen commented Oct 31, 2019

The mentioned test cases weren't conformance-tagged in the previous k8s release. Since I reused the description file of the previous k8s release for 1.15, these test cases were already assigned to the fast/slow groups but not to conformance, which is why they were discarded during the test suite execution. This commit is the quick fix, but I will think of something to prevent this kind of issue in the future. The commit also helps to understand the actual issue/bug.

Regarding the focus and skip fields in the description file:

Our main test case groups are fast, slow and conformance (which can be combined). Test cases of the fast group are executed in parallel with --ginkgo-parallel=8 to get faster results. If we run only the fast group, we still want the fast conformance test cases to run as well, but we don't want any Serial- or Slow-tagged conformance tests. To accomplish this, we use the skip and focus fields.

If we run e2e with -testcasegroup=conformance, then these lines are taken into account to calculate the conformance test cases:

  { "testcase": "[Conformance]", "focus": "Serial|Slow", "groups": ["slow", "conformance"]},
  { "testcase": "[Conformance]", "skip": "Serial|Slow", "groups": ["fast", "conformance"]},

FWIW, for conformance certification runs you should be just setting the focus to [Conformance] and not setting a skip value at all [...]

If this is a requirement and using a list of test cases is not an alternative, we can do this as well.
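
One conceivable description-file entry for that (just a sketch against the schema shown above, with a made-up group name) would select everything tagged [Conformance] without any focus or skip restriction:

  { "testcase": "[Conformance]", "groups": ["conformance-certification"] },

Whether our wrapper treats such an entry exactly like a plain --ginkgo.focus=\[Conformance\] run is something we would still have to verify on our side.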

@johnSchnake
Contributor

I think this raises good questions about what is required for conformance certification.

As I read it, Sonobuoy is optional, but the conformance run should be executed as a single test run rather than multiple runs spliced together. I think that makes the most sense, as the instructions request a single log and junit result file.

You could argue that you want to merge the junit results together, but in that case you'd at least need to upload all of the log files separately.

In addition, allowing N runs to be executed at separate times but combined to meet certification seems like a loosening of the requirements, since clusters could undergo changes between the different runs. Everything is much clearer and more standardized if we just require a single run (though again, parallelizing for CI makes sense).

@hh
Collaborator

hh commented Oct 31, 2019

I agree with @johnSchnake regarding enforcing a specific list of tests in a single test run.

I've created a ticket for a Prow job to run before submissions can be accepted: #760

@MarcelMue
Contributor Author

AFAICT test runs are now automatically checked for completeness. Should this issue be closed?

@cncf-ci cncf-ci closed this as completed May 17, 2021