Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(tests): wait for router to reload before testing config change #119

Closed
wants to merge 1 commit into from

Conversation

slack
Copy link
Member

@slack slack commented Mar 22, 2016

Test: "can set and list environment variables"
Jenkins job: https://ci.deis.io/job/workflow-beta1-test-e2e/17/

It looks like we are hitting the app before the router has had a chance to reload, causing a 404:

00:15:33.957   can set and list environment variables
00:15:33.958   /go/src/github.com/deis/workflow-e2e/tests/config_test.go:55
00:15:35.220 $ deis login http://deis.10.187.248.89.xip.io --username=test-81 --password=asdf1234
00:15:35.273 Logged in as test-81
00:15:35.278 $ deis apps:create test-23966460 
00:15:35.464 Creating Application... ...���done, created test-23966460
00:15:35.468 Git remote deis added
00:15:35.468 remote available at ssh://git@deis.10.187.248.89.xip.io:2222/test-23966460.git
00:15:35.478 $ GIT_SSH=/tmp/deis-workflow-home943620212/.ssh/git-ssh git push deis master

[REMOVED]

00:15:46.424 remote: running [git gc] in directory /home/git/test-23966460.git        
00:15:46.430 To ssh://git@deis.10.187.248.89.xip.io:2222/test-23966460.git
00:15:46.430  * [new branch]      master -> master
00:15:46.430 $ deis config:set POWERED_BY=midi-chlorians
00:15:49.709 Creating config... ...���o..���.o.���..o���...���o..���.o.���..o���...���done
00:15:49.709 
00:15:49.745 === test-23966460 Config
00:15:49.745 POWERED_BY      midi-chlorians
00:15:49.761 $ deis config:list -a test-23966460
00:15:49.793 === test-23966460 Config
00:15:49.793 POWERED_BY      midi-chlorians
00:15:49.803 $ curl -sL "http://test-23966460.10.187.248.89.xip.io"; echo
00:15:49.874 <html>
00:15:49.874 <head><title>404 Not Found</title></head>
00:15:49.875 <body bgcolor="white">
00:15:49.875 <center><h1>404 Not Found</h1></center>
00:15:49.875 <hr><center>nginx/1.9.6</center>
00:15:49.875 </body>
00:15:49.875 </html>
00:15:49.875 

Looking at the router logs, there is a reload, but we immediately see the 404 later in the logs:

[22/Mar/2016:04:35:09 +0000] - 10.184.0.9 - - - 201 - "POST /v2/apps/ HTTP/1.1" - 522 - "-" - "Deis Client v2.0.0-dev" - "~^deis\x5C.(?<domain>.+)$" - 10.187.244.132:80 - deis.10.187.248.89.xip.io - 0.175 - 0.175
2016/03/22 04:35:16 WARNING: Field "router.deis.io/certificates" value "" does not satisfy constraint /(?i)^((([a-z0-9]+(-[a-z0-9]+)*)|((\*\.)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}):([a-z0-9]+(-[a-z0-9]+)*)(\s*,\s*)?)+$/ -- skipping this field and using default value "map[]".
2016/03/22 04:35:16 INFO: Router configuration has changed in k8s.
2016/03/22 04:35:16 INFO: Reloading nginx...
2016/03/22 04:35:16 INFO: nginx reloaded.
[22/Mar/2016:04:35:23 +0000] - 10.184.0.9 - - - 201 - "POST /v2/apps/test-23966460/config/ HTTP/1.1" - 597 - "-" - "Deis Client v2.0.0-dev" - "~^deis\x5C.(?<domain>.+)$" - 10.187.244.132:80 - deis.10.187.248.89.xip.io - 3.261 - 3.261
[22/Mar/2016:04:35:23 +0000] - 10.184.0.9 - - - 200 - "GET /v2/apps/test-23966460/config/ HTTP/1.1" - 585 - "-" - "Deis Client v2.0.0-dev" - "~^deis\x5C.(?<domain>.+)$" - 10.187.244.132:80 - deis.10.187.248.89.xip.io - 0.033 - 0.034
[22/Mar/2016:04:35:23 +0000] - 10.184.0.9 - - - 200 - "GET /v2/apps/test-23966460/config/ HTTP/1.1" - 595 - "-" - "Deis Client v2.0.0-dev" - "~^deis\x5C.(?<domain>.+)$" - 10.187.244.132:80 - deis.10.187.248.89.xip.io - 0.022 - 0.023
[22/Mar/2016:04:35:23 +0000] - 10.184.0.9 - - - 404 - "GET / HTTP/1.1" - 322 - "-" - "curl/7.35.0" - "_" - - - test-23966460.10.187.248.89.xip.io - - - 0.000

There are further router reloads at (but it is difficult to tell what the router knows about at any given moment):

2016/03/22 04:37:06 INFO: Router configuration has changed in k8s.
2016/03/22 04:39:46 INFO: Router configuration has changed in k8s.
2016/03/22 04:39:56 INFO: Router configuration has changed in k8s.
2016/03/22 04:40:26 INFO: Router configuration has changed in k8s.

@vdice
Copy link
Member

vdice commented Mar 22, 2016

I am conflicted b/c I want the tests to be reliable/dependable but I also am hesitant to add too many test 'workarounds' in instances like these because here, for example, the cli has returned and a user would reasonably assume they can begin interacting with it immediately (w/o any wait). Maybe a compromise would be add this as well as a //TODO to revisit/look into a potential router or other issue? (Specifically around Looking at the router logs, there is a reload, but we immediately see the 404 later in the logs)

@slack
Copy link
Member Author

slack commented Mar 22, 2016

I think if we set an application HEALTHCHECK_URL for the app, controller won't return to the calling component until that check passes. Can you verify @helgi

Without a healthcheck we would need to have controller validate that the app and router are up by guessing.

@helgi
Copy link
Contributor

helgi commented Mar 22, 2016

Setting HEALTHCHECK_URL means we are injecting our own instead of using k8s default ones that "do stuff" (not much really) and longer timeouts and other things are put in place.

Controller will always verify that k8s has green lit all the probes, the default ones or ours.

Right now the controller has no insight into when the router is ready and if that hostname is routable. The router is on a 10 second loop I believe to wait for new namespaces and routable information to come up

@slack
Copy link
Member Author

slack commented Mar 22, 2016

Understand the router issue. The only way to fix that is to also have controller hit the app through the router as the final check (most likely to the HEALTHCHECK_URL).

We need to define the no-check behavior for k8s. I'm betting it just leans on Pod state.

I still think HEALTHCHECK_URL in test is a "good idea" (and will push the total loop > 10s so router will have a chance to reload).

@helgi
Copy link
Contributor

helgi commented Mar 22, 2016

Given we ditch the concept of a platform domain as a known entity by Controller then this is not the quickest of fixes. If we can figure out how we want to derive that or if we want to hit the router IP (using xip.io) by querying k8s about it via the controller then that's a doable check and something we used to do in v1.

I'm looking at the no-check behaviour now

@slack
Copy link
Member Author

slack commented Mar 22, 2016

Hitting the router svc address with the appropriate Host header would be sufficient.

@helgi
Copy link
Contributor

helgi commented Mar 22, 2016

Yeah, that's fair. Please open up a ticket for it since I do agree it's a good idea, if nothing else than a last check after all of k8s has gone green based on our own checks.

@helgi
Copy link
Contributor

helgi commented Mar 22, 2016

For the no-check version see http://kubernetes.io/docs/user-guide/pod-states/

LivenessProbe: indicates whether the container is live, i.e. still running. The LivenessProbe hints to the kubelet when a container is unhealthy. If the LivenessProbe fails, the kubelet will kill the container and the container will be subjected to it’s RestartPolicy. The default state of Liveness before the initial delay is Success. The state of Liveness for a container when no probe is provided is assumed to be Success.

ReadinessProbe: indicates whether the container is ready to service requests. If the ReadinessProbe fails, the endpoints controller will remove the pod’s IP address from the endpoints of all services that match the pod. Thus, the ReadinessProbe is sometimes useful to signal to the endpoints controller that even though a pod may be running, it should not receive traffic from the proxy (e.g. the container has a long startup time before it starts listening or the container is down for maintenance). The default state of Readiness before the initial delay is Failure. The state of Readiness for a container when no probe is provided is assumed to be Success.

So no probes = assumed state for both is Success

@helgi
Copy link
Contributor

helgi commented Apr 4, 2016

This shouldn't be required anymore if the controller changes are doing their thing

@mboersma
Copy link
Member

mboersma commented Apr 8, 2016

Is this time.Sleep call still needed as a workaround for test stability?

@slack
Copy link
Member Author

slack commented Apr 13, 2016

Closing this, probably not needed anymore.

@slack slack closed this Apr 13, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants