-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add keepalive to agents #2294
Add keepalive to agents #2294
Conversation
cmd/drone-agent/main.go
Outdated
@@ -80,6 +80,16 @@ func main() { | |||
Name: "healthcheck", | |||
Usage: "enables the healthcheck endpoint", | |||
}, | |||
cli.StringFlag{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use DurationFlag
here and then remove the time.ParseDuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will implement this
cmd/drone-agent/agent.go
Outdated
conn, err := grpc.Dial( | ||
c.String("server"), | ||
grpc.WithInsecure(), | ||
grpc.WithPerRPCCredentials(&credentials{ | ||
username: c.String("username"), | ||
password: c.String("password"), | ||
}), | ||
grpc.WithKeepaliveParams(agentKeepalive), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should be able to simplify like this:
grpc.WithKeepaliveParams(
keepalive.ClientParameters{
Timeout: c.Duration("keepalive-timeout"),
Time: c.Duration("keepalive-time"),
},
),
If c.Duration
is empty it will return the zero value, so this snippet here:
agentKeepalive := keepalive.ClientParameters{}
if dur := c.Duration("keepalive-time"); dur != 0 {
agentKeepalive.Time = dur
}
should behave the same as this snippet here:
agentKeepalive := keepalive.ClientParameters{
Time: c.Duration("keepalive-time"),
}
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my only concern is I'm trying to maintain default values for Time
(infinity) and Timeout
(20s).
grpc.WithKeepaliveParams(
keepalive.ClientParameters{
Timeout: c.Duration("keepalive-timeout"),
Time: c.Duration("keepalive-time"),
},
),
would return 0
for both as defaults instead if user doesn't specify keep alive config. Let me know if I'm wrong and I can implement it that way. Otherwise I think I'll change what I have to:
agentKeepalive := keepalive.ClientParameters{}
if dur := c.Duration("keepalive-time"); dur != 0 {
agentKeepalive.Time = dur
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will still work as expected. There is no nil
value for time.Duration
so the grpc code will likely use infinity if the value is 0. I will look through the code to verify, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, looks like if c.Duration
returns a zero value, the kubernetes defaults will kick in
var defaultClientKeepaliveTime = infinity
...
if kp.Time == 0 {
kp.Time = defaultClientKeepaliveTime
}
if kp.Timeout == 0 {
kp.Timeout = defaultClientKeepaliveTimeout
}
@jmccann awesome! This one having been haunting me for a couple of weeks since 0.8.x |
@jmccann Thanks for this, from the short time I have been testing it it seems to work just fine over two swarm networks (server running on one swarm, the agents running on another swarm) There's just one thing, though I think it's just about aesthetics, however I thought I would point it out anyway in case it would have another side effect I am not aware of: {
"pending": null,
"running": null,
"stats": {
"worker_count": 43,
"pending_count": 0,
"running_count": 0,
"completed_count": 0
}
} I have three workers/agents running and the keepalives seem to behave as if new agents kept joining the server edit: Well, I seem to have a hit a sweet spot where none of the workers were connected at the time, i now have worker count at 52 (and still increasing) but a build stuck in pending |
This would tell me that something is terminating your TCP connections (loadbalancers, etc) without sending the TCP RST. The server therefore thinks the connection is still open, as it has not received the close event. There is a server-side keepalive message that can be configured. This would enable bi-directional keepalive pings and would likely prevent this sort of problem grpc/grpc-go#1119 |
@bradrydzewski shouldn't this also close #2246 ? |
I did not manage to get it working on a swarm behind nginx proxy using keepalives from #2297 . So if what @bradrydzewski mentioned above is not configurable in the current build I do not think the issue is resolved. I ended up migrating the drone agent+server outside of swarm at the end. |
* feat: [CDE-195]: Run VS Code Web as non-root user. * feat: [CDE-195]: read devcontainer.json directly for gitness
This adds keep alive to agents to send pings back to server during build inactivity.
This has helped me address issues between server<>agent when there are no builds for a period of time and firewalls kill the inactive connection. May be helpful for other server<>agent comm issues as well.
I'm in progress of testing this specific iteration. I had tested with hardcoded values and it worked beautifully. Wanted to get this out there for review at least. :)