This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

engine: TestScheduleMachineOf fails with gRPC turned on #1712

Closed
dongsupark opened this issue Dec 2, 2016 · 0 comments

Comments

@dongsupark
Contributor

dongsupark commented Dec 2, 2016

TestScheduleMachineOf fails intermittently when enable_grpc is set to true. When it fails, fleetd logs the following:

fleetd[94]: transport: http2Client. notifyError got notified that the client transport was broken EOF.
fleetd[94]: DEBUG agent.go:87: HeartbeatJobs tick
fleetd[94]: DEBUG unit_state.go:222: Pushing UnitState(ping.0.service) to Registry: &unit.UnitState{LoadState:"loaded", ActiveState:"active", SubState:"running", MachineID:"10c94346e797450d94d3e9b0f657cae4", UnitHash:"b4f04719a20575e450899ac57097fc6d8ac12cc4", UnitName:"ping.0.service"}
fleetd[94]: DEBUG unit_state.go:222: Pushing UnitState(pong.0.service) to Registry: &unit.UnitState{LoadState:"loaded", ActiveState:"active", SubState:"running", MachineID:"10c94346e797450d94d3e9b0f657cae4", UnitHash:"a38421afebf0062e438f25c03c5f24e8202a3a70", UnitName:"pong.0.service"}
fleetd[94]: DEBUG unit_state.go:222: Pushing UnitState(ping.0.service) to Registry: &unit.UnitState{LoadState:"loaded", ActiveState:"active", SubState:"running", MachineID:"10c94346e797450d94d3e9b0f657cae4", UnitHash:"b4f04719a20575e450899ac57097fc6d8ac12cc4", UnitName:""}
fleetd[94]: DEBUG unit_state.go:222: Pushing UnitState(pong.0.service) to Registry: &unit.UnitState{LoadState:"loaded", ActiveState:"active", SubState:"running", MachineID:"10c94346e797450d94d3e9b0f657cae4", UnitHash:"a38421afebf0062e438f25c03c5f24e8202a3a70", UnitName:""}
fleetd[94]: ERROR registrymux.go:166: Retry to connect to new engine: dial tcp 172.18.1.1:50059: getsockopt: connection refused
fleetd[94]: DEBUG mux.go:56: HTTP GET /fleet/v1/machines?alt=json
fleetd[94]: ERROR registrymux.go:166: Retry to connect  to new engine: dial tcp 172.18.1.1:50059: getsockopt: connection refused
fleetd[94]: ERROR registrymux.go:166: Retry to connect  to new engine: dial tcp 172.18.1.1:50059: getsockopt: connection refused

And finally:

--- FAIL: TestScheduleMachineOf (fleet.conf=[enable_grpc=true]) (31.66s)
	scheduling_test.go:132: failed to find 6 active units within 15s (last found: 2) 

Reproducing this issue requires several conditions.
First, the functional tests must run on a host with little free memory. With plenty of free memory, the failure is hard to reproduce.
Second, even in such an environment, the functional tests must be run at least 3 to 4 times before the error shows up. A single run is not sufficient.
Third, running only one particular test is normally not sufficient. Run the whole test suite, without the '--run' option, to make the failure easier to reproduce.
Sometimes TestScheduleReplace fails instead of TestScheduleMachineOf, though I'm not sure these two failures have the same cause.

After a long investigation, I think this issue is a regression from ecb121a ("registry/rpc: use simpleBalancer instead of ClientConn.State()"), although I'm not completely sure yet. Let me prepare a fix.

@dongsupark dongsupark self-assigned this Dec 2, 2016
dongsupark pushed a commit to endocode/fleet that referenced this issue Dec 2, 2016
When gRPC is turned on, TestScheduleMachineOf sometimes fails, as the
engine becomes unreachable with the following error messages:

====
transport: http2Client. notifyError got notified that the client transport was broken EOF.
ERROR registrymux.go:166: Retry to connect to new engine: dial tcp 172.18.1.1:50059: getsockopt: connection refused
ERROR registrymux.go:166: Retry to connect to new engine: dial tcp 172.18.1.1:50059: getsockopt: connection refused
ERROR registrymux.go:166: Retry to connect to new engine: dial tcp 172.18.1.1:50059: getsockopt: connection refused
====

This must have been a regression from commit ecb121a ("registry/rpc: use
simpleBalancer instead of ClientConn.State()"). Remove the additional
check with IsRegistryReady, in order to avoid the occasional case of
the engine being unreachable.

Fixes coreos#1712