
Cirrus: Run checks directly on the host #1334

Merged
merged 1 commit into containers:main from drop_podmanmake on Aug 4, 2021

Conversation


@cevich cevich commented Jun 24, 2021

In order to meet achievable deadlines converting from Travis to Cirrus
CI, one significant artifact was carried forward (instead of fixing):

Depending on a --privileged container to execute all/most automated
checks/tests.

Prior attempts to remove this aspect resulted in several test failures.
Fixing the problems was viewed as more time-consuming than simply
preserving this runtime environment.

Time has passed, and the code has since moved on. This commit removes
the legacy need to execute CI operations in a --privileged
container, instead running them directly on the host. At the same time,
the necessary test binaries are obtained from the same container used
for development/local testing purposes. This ensures the two experiences
are virtually always identical.
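
As a rough sketch of the approach described above (not the actual CI code from this PR; the image name and binary path are placeholders), obtaining the test binaries from the shared development container might look like:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical image name; the real one comes from the containers automation-images tooling.
BUILD_IMAGE="${BUILD_IMAGE:-quay.io/example/skopeo-build:latest}"

# Create (but never run) a container from that image, copy the pre-built
# binary out onto the host, then discard the container.  The checks
# themselves then execute directly on the host.
ctr=$(podman create "$BUILD_IMAGE")
trap 'podman rm -f "$ctr" >/dev/null' EXIT
mkdir -p ./bin
podman cp "$ctr:/usr/local/bin/skopeo" ./bin/skopeo   # in-container path is an assumption
./bin/skopeo --version
```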

@cevich cevich marked this pull request as draft June 24, 2021 16:46

cevich commented Jun 24, 2021

Ahh right, most of the tests are skipped when not running in a container; there was something I needed to tweak to get around that...
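
For context, that kind of gate usually looks something like the sketch below; the exact file or variable the skopeo test scripts check is not shown in this thread, so treat this as illustrative only.

```bash
# Illustrative container check (not the project's actual logic): many test
# harnesses gate on /run/.containerenv (podman), /.dockerenv (docker), or an
# environment variable that CI sets explicitly.
if [[ ! -e /run/.containerenv && ! -e /.dockerenv && -z "${CONTAINER:-}" ]]; then
    echo "SKIP: not running inside a container"
    exit 0
fi
```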

@cevich cevich force-pushed the drop_podmanmake branch 2 times, most recently from 8dfc215 to 910bd20 Compare June 24, 2021 19:43

cevich commented Jun 24, 2021

Wait...what?!?!?! It passed??????

@mtrmac PTAL (it's still a prototype).

@cevich cevich requested a review from mtrmac June 25, 2021 13:38

cevich commented Jun 25, 2021

oh! wait, I made a typo, and we probably want developers to be able to build a container for testing in...

@cevich cevich force-pushed the drop_podmanmake branch 2 times, most recently from 0b2848e to e2d1464 Compare June 25, 2021 17:15
@mtrmac mtrmac left a comment

The integration tests fail on OpenShift startup… Assuming that’s not transient (I have now triggered a re-run), debugging that doesn’t look like much fun.

Another sort-of concern: while running a custom container inside a custom VM is rather wasteful, and running CI directly on a VM is certainly nice, for local development it is convenient that make test-integration (and make test-system?) can be run in a container, so it is useful for CI to enforce that those two continue to work inside a container.

So, would running vendor+build+validate+unit directly on the VM, and integration+system in a container be (even possible and) worth the effort?

(Inline review comments on the Dockerfile and hack/make.sh; all resolved.)
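
A minimal sketch of the split suggested above, assuming a single entry-point script keyed off a hypothetical $TEST_FLAVOR variable; the image name and make target names are placeholders, not taken from this PR:

```bash
case "${TEST_FLAVOR:?must be set}" in
    vendor|build|validate|unit)
        # Lightweight checks: run directly on the VM.
        make "$TEST_FLAVOR"
        ;;
    integration|system)
        # Heavier suites: run inside a container, so CI keeps exercising the
        # containerized developer workflow.
        podman run --rm --privileged -v "$PWD:/src:Z" -w /src \
            quay.io/example/skopeo-build:latest \
            make "test-$TEST_FLAVOR"
        ;;
    *)
        echo "unknown TEST_FLAVOR: ${TEST_FLAVOR}" >&2
        exit 1
        ;;
esac
```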

cevich commented Jun 30, 2021

for local development it is convenient

Yes, we need to do both; the challenge is keeping the environments and experience as similar as possible.

Would running vendor+build+validate+unit directly on the VM, and integration+system in a container be (even possible and) worth the effort?

There's no reason we can't/shouldn't do both in parallel (on separate VMs).

@cevich cevich force-pushed the drop_podmanmake branch 3 times, most recently from 1953bfa to a792502 Compare June 30, 2021 15:01

cevich commented Jun 30, 2021

The openshift thingie seems to be failing to start due to SSL/TLS crap:

transport: http2Client.notifyError got notified that the client transport was broken remote error: tls: bad certificate.

But I also see a fair number of what look like errors binding to port 8443 of the local IP... I wonder if maybe something in the GCP networking setup is (somehow) blocking this? I would think the system/kernel is smart enough to recognize that as the local address and not actually send anything out.


mtrmac commented Jun 30, 2021

The openshift thingie seems to be failing to start due to SSL/TLS crap:

transport: http2Client.notifyError got notified that the client transport was broken remote error: tls: bad certificate.

Yeah. I don’t know what that’s about :/ It’s quite possible that a networking difference changes the detected primary IP address inconsistently in some parts of the code vs. others, or something like that.

But I also see a fair number of what look like errors binding to port 8443 of the local IP...

Binding or connecting? Connecting retries are expected; that's just newPortChecker (func newPortChecker(c *check.C, port int) (portOpen <-chan bool, terminate chan<- bool)) waiting for the server to start up.

Binding to 0.0.0.0:8443 seems to succeed in those logs; also the “remote error” strings at least hint in the direction that a “remote” connection was actually established. Do you mean that it fails from time to time?
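
For readers without the Go source at hand, the connect-retry behaviour amounts to roughly the following shell approximation (the port is the one discussed here; the timeout values are made up):

```bash
# Poll until something accepts connections on the OpenShift master port, or give up.
port=8443
for _ in $(seq 1 300); do
    if timeout 1 bash -c "exec 3<>/dev/tcp/127.0.0.1/$port" 2>/dev/null; then
        echo "port $port is open"
        break
    fi
    sleep 1   # each failed attempt corresponds to one connect retry in the logs
done
```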

I would think the system/kernel is smart enough to recognize that as the local address and not actually send anything out.

Yes — OTOH OpenShift might be trying to be “useful” on the local network; we just run an openshift start master (in this old version that didn’t use dozens of separate containers) with no configuration at all, and I don’t really know much at all what that does.


cevich commented Jul 2, 2021

quite possible that a networking difference changes the detected primary IP address inconsistently

Oh, good thought. I can check this manually by comparing the logs from a job run from master.

Binding to 0.0.0.0:8443 seems to succeed in those logs;

Interesting, I didn't notice that.

OTOH OpenShift might be trying to be “useful” on the local network

Hmmm, so the VMs definitely have TWO IP addresses. One is internal-network only, the other is external. IIRC, the logs mentioned the 10.x.x.x (internal) network. That should work, but maybe not. We're using a very vanilla/default GCP networking setup; it's possible the default is to block inter-VM communications.

In any case, I think this is going to require some hands-on work.
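
Some hands-on checks along those lines, purely as a sketch (standard iproute2/curl tooling assumed on the VM; the /healthz path is an assumption about the OpenShift master):

```bash
# Which addresses does the VM actually have, and what is listening on 8443?
ip -brief addr show
ss -tlnp | grep 8443

# Is the master reachable via loopback and via the internal 10.x address?
# A difference between the two would point at the GCP network setup.
for addr in 127.0.0.1 "$(hostname -I | awk '{print $1}')"; do
    curl -k --max-time 5 "https://$addr:8443/healthz" || echo "unreachable: $addr"
done
```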


cevich commented Jul 20, 2021

this old version that didn’t use dozens of separate containers

@mtrmac QQ: Is the purpose of this to verify Skopeo can talk to the OpenShift registry? Is there a (testing) reason why skopeo needs to remain compatible with an unsupported OpenShift?

The reason I ask is... might it be easier to get a newer/supported registry to test against rather than fight/fix the old one?


mtrmac commented Jul 20, 2021

QQ: Is the purpose of this to verify Skopeo can talk to the OpenShift registry?

Yes.

Is there a (testing) reason why skopeo needs to remain compatible with an unsupported OpenShift?

Somewhat.

We need the registry for testing two things:

  • The X-Registry-Supports-Signatures API, a registry-native way of storing signatures, independent of OpenShift’s k8s API. That one is, AFAIK, supported by the OpenShift integrated registry in current versions as well, and it is much easier to use than external sigstores, so it’s something we fairly strongly need to keep functional, i.e. keep testing. (It’s fairly reasonable to argue that testing against old OpenShift will over time be less representative than testing against currently-supported versions, so if testing against the current full-cluster monsters isn’t feasible, we could just give up and test against completely fake mock servers instead, and get coverage similar to what we have now. It’s arguable but it’s not obviously the right call, and anyway replacing the current tests with the fake-mock-server tests would be a somewhat sizable effort.)
  • The preceding OpenShift-k8s-specific signature API, using the atomic: transport. That is still shipped but fairly strongly deprecated (it was created before the X-R-S-S API, so we keep it either to avoid breaking old atomic: users, or to allow reading/writing signatures for OpenShift versions that support the atomic: API but not X-R-S-S), and it is less and less relevant over time. OpenShift has already changed its API so that atomic: does not work against recent servers; we keep it only to access the old ones. At some point, I don’t know how to decide when, we could just cut our losses and stop testing atomic:, or even remove it altogether. (A rough CLI sketch of both signature paths follows this list.)
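
A rough CLI sketch of the two paths, for concreteness; the registry locations, project/stream names, and signing identity are placeholders, and the atomic: reference format shown is approximate rather than copied from the test suite:

```bash
# X-Registry-Supports-Signatures: sign while copying to a registry that
# implements the extension (e.g. the OpenShift integrated registry).
skopeo copy --sign-by security@example.com \
    docker://quay.io/example/busybox:latest \
    docker://registry.example.com:5000/myproject/busybox:signed

# Deprecated atomic: transport: the same image addressed through the
# OpenShift (k8s) signature API instead of the registry-native one.
skopeo copy --sign-by security@example.com \
    docker://quay.io/example/busybox:latest \
    atomic:myproject/busybox:signed
```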

In an ideal world, which I don’t expect to happen any time soon, we would teach github.com/distribution/distribution the X-R-S-S API, and test against that — a simple registry server is much smaller and easier to run. But nobody has worked on such a server, to my knowledge; and even in that case we would probably want to ensure interoperability with the OpenShift implementation of X-R-S-S. (… and in a truly ideal world, we would have testing coverage of most of the major registry products, but that’s another problem space entirely.)


cevich commented Jul 20, 2021

Okay, thanks for the insights.

Yeah, mocking it would be a major undertaking and additional maintenance; much better to use the real deal. Too bad it's such a behemoth.

There are also now registries on both GitHub and GitLab, in addition to Quay and Google Cloud. I assume testing those wouldn't add much value, but please correct me if I'm wrong.


cevich commented Jul 21, 2021

Okay, I set up a POC container build over in the c/automation_images repo. Next I will work to incorporate its use into both CI and the Makefile.

@cevich cevich changed the title Cirrus: Run checks directly on the host [WIP] Cirrus: Run checks directly on the host Jul 21, 2021
@cevich cevich force-pushed the drop_podmanmake branch 2 times, most recently from 75729a2 to 363e1fa Compare July 21, 2021 20:03
(Inline review comments on the Makefile; resolved.)

cevich commented Jul 22, 2021

@mtrmac I've taught make how to look up and use the container image shared with CI. This involves some compromises (pulling & running two container images), though it's a one-time cost (only pulled if absent) and the images are relatively small. The big benefit here is that CI and local development will always remain in sync (with an option to override for trying out a newer image). Please clone this PR and try out the Makefile so far (make docs-in-container, validate, shell, test-unit, test-integration, etc.). I know some of the tests are still broken, but I'd like your blessing on the "look and feel" aspect.
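
For reference, the intended developer flow with those targets would look roughly like this; the image-override variable name is illustrative, not necessarily what the Makefile ends up calling it:

```bash
# Each target runs inside the same container image CI uses, pulled on first use.
make validate
make test-unit
make test-integration
make docs-in-container

# Drop into a shell in that image for interactive debugging.
make shell

# Hypothetical override to try a newer (or locally built) image.
make test-unit SKOPEO_CONTAINER_IMAGE=quay.io/example/skopeo-build:next
```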

Meanwhile...let me see about fixing some of these docker-rate-limit problems, they're starting to drive me crazy 😠


cevich commented Aug 3, 2021

Is it possible that the home directory does not exist, or is not writeable?

We're running as root, so I doubt that simple case is the problem, but could it be an SELinux issue maybe?

I'm hitting all kinds of problems in podman land with these same VM images, due to policy updates: containers/podman#10522

Specifically, any volume mounts with :ro or those missing a :Z or :z are causing all kinds of problems with registry containers.
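
For anyone hitting the same thing, the difference is roughly the one below (a generic illustration with a stock registry image, not the exact mounts used by these tests):

```bash
# Likely to be denied by SELinux: the bind-mounted directory keeps its host
# label, which the container process may not be allowed to read or write.
podman run --rm -v "$PWD/registry-data:/var/lib/registry" \
    -p 5000:5000 registry:2

# Relabel the content for this container (:Z = private label, :z = shared),
# which is what the updated policy effectively requires.
podman run --rm -v "$PWD/registry-data:/var/lib/registry:Z" \
    -p 5000:5000 registry:2
```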

I'm having a really (REALLY) hard time parsing all the openshift/registry output mixed in with the test output. Is there any way we could send the openshift stuff into a file instead of printing it all? It's really easy for me to tell Cirrus to attach a file after the task runs (pass or fail).


cevich commented Aug 3, 2021

Confirmed: $HOME is set to /root, the permissions seem correct (0755), and the SELinux type is admin_home_t:s0.


mtrmac commented Aug 3, 2021

Huh. Is it possible that the home directory does not exist, or is not writeable?

Close enough of a guess, it seems.

Confirmed: $HOME is set to /root, the permissions seem correct (0755), and the SELinux type is admin_home_t:s0.

That’s not quite what I see. The home directory, per getpwnam(3) / /etc/passwd, is indeed /root, but $HOME is either unset or empty.

See #1403, which includes various debugging printouts, and where the integration tests pass when run with HOME=/root make ….


mtrmac commented Aug 3, 2021

I'm having a really (REALLY) hard time parsing all the openshift/registry output mixed in with the test output. Is there any way we could send the openshift stuff into a file instead of printing it all? It's really easy for me to tell Cirrus to attach a file after the task runs (pass or fail).

I’m probably thinking about this incorrectly — I think those log streams have identifiable prefixes, so it seems possible to filter them out after-the-fact, using tools if necessary. So far it seemed convenient enough to me to have everything in the same log stream, usually in the natural order (“starting registry” … registry start-up logs … “waiting for registry start-up finished”); separating some of those logs out into external files would almost every time require looking up one or more of those files (presumably the main log would contain paths?), which seems to be more work to me. Is it the visual noise on each individual line? Too many useless log lines? The fact that everything is mixed in a single stream, regardless of what the individual lines look like? Something entirely different? Some combination of factors?

Admittedly:

  • It is definitely confusing that when start-up of the first cluster fails, we don’t tear it down and then the other test setup fails with a “port is already in use”. That should be fixable with a little effort.
  • The http2Client.notifyError got notified that the client transport was broken remote error: tls: bad certificate logs certainly spam the log, and IIRC weren’t there when writing the tests originally. But if the tests pass anyway, I’m not too tempted to dig deep into this old version of software to understand what that’s about; it is likely to be a non-trivial effort, and the outcome of that investigation is unlikely to be reusable for any modern software.


cevich commented Aug 3, 2021

So far it seemed convenient enough to me to have everything in the same log stream

That's fine too, it's just hard on my eyes. But if it makes sense to you and you can make sense of it, I'm fine just being the lever-puller.

Some combination of factors?

From the perspective of someone who doesn't see it much / ever, it's hard to visually distinguish between test harness output and subject (skopeo) output and tooling (openshift, registry, etc.) output.

IIRC weren’t there when writing the tests originally

The horrendous number of redundant messages from openshift/etcd ("blah blah tls cert invalid blah blah") is also really confusing. But yes, this can be a problem for another day and PR. I too am concerned, as I don't remember seeing them before; maybe they were simply not shown before (somehow)?

the integration tests pass when run with HOME=/root

Oh wait, this just struck me. When the tests run, it's not a login shell, but in a get_ci_vm.sh environment, it is. This may be exactly the problem, and I remember having it in tests on the other repos. Let me copy-paste the fix...
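
Not the actual copy-pasted fix, just a sketch of the kind of guard being described, assuming it lands in the CI entry-point script:

```bash
# Non-login shells under the CI agent can leave $HOME unset or empty even
# though /etc/passwd knows the right value; fall back to the passwd entry.
if [[ -z "${HOME:-}" ]]; then
    HOME="$(getent passwd "$(id -un)" | cut -d: -f6)"
    export HOME
fi
```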


mtrmac commented Aug 3, 2021

So far it seemed convenient enough to me to have everything in the same log stream

That's fine too, it's just hard on my eyes. But if it makes sense to you and you can make sense of it, I'm fine just being the lever-puller.

Some combination of factors?

From the perspective of someone who doesn't see it much / ever, it's hard to visually distinguish between test harness output and subject (skopeo) output and tooling (openshift, registry, etc.) output.

I think there are basically two kinds of asynchronous output: Log line: comes from OpenShift, and Output registry-$port {stdout,stderr} comes from the registry processes. It should be easy enough to make the two more similar, or use a more prominent log prefix, or something like that.
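
Filtering after the fact based on those two prefixes could then be as simple as the following (prefix spellings taken from the description above, so adjust them to whatever the logs actually emit):

```bash
# Split one Cirrus task log into test output and asynchronous service output.
grep -vE '^(Log line:|Output registry-[0-9]+ )' ci-task.log > test-only.log
grep -E  '^(Log line:|Output registry-[0-9]+ )' ci-task.log > services.log
```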

Some other synchronous commands log output. Skopeo itself typically doesn’t log output, unless the test fails and the output is included in an assertion message.

Probably none of that is 100% consistent, as the code grew over time.


I can’t really remember any more, and of course past problems are not that much of an indication of future problems, but it seems reasonable that those noisy Output /usr/local/bin/oc oc create -f - stderr: waiting lines were added for a reason, and losing the relative timing of those vs. other output might be bad. OTOH it’s also quite possible that those log lines were added only while developing the log catching implementation and were never relevant for timing; my commit message doesn’t say.

I don’t feel strongly about the output format… I guess PRs are welcome?


The horrendous number of redundant messages from openshift/etcd ("blah blah tls cert invalid blah blah") is also really confusing. But yes, this can be a problem for another day and PR. I too am concerned, as I don't remember seeing them before; maybe they were simply not shown before (somehow)?

My first guess is that this has something to do with recent Go only looking for host names in subjectAltName instead of CN. But that’s really just a guess with no supporting evidence at all.

(Inline review comment on the Makefile; resolved.)
@mtrmac mtrmac mentioned this pull request Aug 4, 2021

cevich commented Aug 4, 2021

Huh...well it seems (somehow) the tests passed. I didn't notice anything funny in the output, but I'm a bit skeptical. Let me push this small Makefile fixup and let's see if it passes again...

@cevich cevich force-pushed the drop_podmanmake branch 2 times, most recently from f68c889 to dec382f Compare August 4, 2021 13:36
@cevich cevich changed the title [WIP] Cirrus: Run checks directly on the host Cirrus: Run checks directly on the host Aug 4, 2021
@cevich cevich force-pushed the drop_podmanmake branch 2 times, most recently from 69552c5 to 53a3d99 Compare August 4, 2021 13:42

cevich commented Aug 4, 2021

Well, okay then it passed again. Removing the draft status, let's get on with final review before something breaks 😁

@cevich cevich marked this pull request as ready for review August 4, 2021 14:24
@mtrmac mtrmac left a comment

Thanks for all the work!

I’m really only worried about the /etc/os-release check question; the rest are non-blocking nits.

(Inline review comments on the Makefile, contrib/cirrus/runner.sh, and .cirrus.yml; resolved.)
In order to meet achievable deadlines converting from Travis to Cirrus
CI, one significant artifact was carried forward (instead of fixing):

Depending on a `--privileged` container to execute all/most automated
checks/tests.

Prior attempts to remove this aspect resulted in several test failures.
Fixing the problems was viewed as more time-consuming than simply
preserving this runtime environment.

Time has passed, and the code has since moved on.  This commit removes
the legacy need to execute CI operations in a `--privileged`
container, instead running them directly on the host.  At the same time,
the necessary test binaries are obtained from the same container used
for development/local testing purposes.  This ensures the two
experiences are virtually always identical.

Signed-off-by: Chris Evich <cevich@redhat.com>

cevich commented Aug 4, 2021

@mtrmac thanks for all your help on this too, couldn't have done it w/o you. What a doozy of a PR! 😅

@mtrmac mtrmac left a comment

LGTM. Thanks again!


mtrmac commented Aug 4, 2021

FWIW https://cirrus-ci.com/build/6216764259303424 lists some “Issues found while parsing configuration” — most look like stylistic preferences rather than something that needs fixing. I have no idea about the severity (or it might have been a well-considered long-standing decision); I just happened to randomly notice that field (well, the whole page, actually) for the first time today.


cevich commented Aug 4, 2021

some “Issues found while parsing configuration” — most look like stylistic preferences rather than something that needs fixing.

Ugh, I know it does that everywhere now. For some reason they really don't like how we use alias + name in jobs, but we literally have to because of a bug they (supposedly) can't fix. It's safe to ignore for now.

@mtrmac mtrmac merged commit 678682f into containers:main Aug 4, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023