Dynamic config2 #16

Closed · wants to merge 14 commits into from

Conversation

@igor47 (Collaborator) commented Oct 18, 2013

same as #10 but rebased on current master

@pcarrier @brndnmtthws

@ghost assigned brndnmtthws Oct 18, 2013
@igor47 (Collaborator, Author) commented Oct 18, 2013

warning: DO NOT MERGE THIS!

@brndnmtthws

I added a new thing recently:

  • when an ephemeral service client disconnects, the service is cleaned up
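For readers unfamiliar with EventMachine's connection lifecycle, here is a minimal sketch of that cleanup pattern. This is not the PR's actual code; `ServiceRegistry` and the port are made-up stand-ins for nerve's internal service list and listen address:

```ruby
# Hypothetical sketch of per-connection ephemeral services (not the PR's code).
# ServiceRegistry is an assumed stand-in for nerve's internal service list.
require 'eventmachine'
require 'json'

class EphemeralServiceConnection < EM::Connection
  # Assume the client sends one JSON service definition per connection.
  def receive_data(data)
    config = JSON.parse(data)
    @service_name = config['name']
    ServiceRegistry.add(@service_name, config)
    send_data("registered #{@service_name}\n")
  rescue JSON::ParserError => e
    send_data("error: #{e.message}\n")
    close_connection_after_writing
  end

  # unbind fires when the client disconnects; the ephemeral service is
  # cleaned up here, mirroring the behavior described above.
  def unbind
    ServiceRegistry.remove(@service_name) if @service_name
  end
end

EM.run { EM.start_server('127.0.0.1', 1025, EphemeralServiceConnection) }
```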

@igor47 (Collaborator, Author) commented Oct 22, 2013

sorry for the delay; in general, i'm not opposed to the concept here, but i'm not a fan of the implementation. i would like a cleaner separation of concerns, with a cleaner interface for dynamically adding and removing services in the service watcher, and a better-documented, separated-out server. i'm also not sure that eventmachine is the right approach. i would not merge this as-is.

also, i think @pcarrier has additional concerns about the concept as a whole. pierre, can you elaborate?

@brndnmtthws

I'm not overly attached to this implementation--it's just a minimum viable product. The intention is to be able to make progress on other things while you guys decide what the overall direction is for dynamic service discovery. We need functionality like this, but it doesn't necessarily need to be done this way.

That said, we've been using this for a few weeks now without any trouble.

@igor47 (Collaborator, Author) commented Oct 30, 2013

we talked in person about how to move forward on this. @brndnmtthws is going to (a) split out the server code, (b) improve the interface for setting up additional watchers internally, and (c) make the TCP protocol for service registration more resilient to failure modes we haven't anticipated yet.

this should all happen within the next week.

@brndnmtthws

I've made progress on this, but I'm stuck on an issue with JRuby right now. I can't seem to get the eventmachine jar to load correctly without forcing it to use the Ruby version of eventmachine rather than the Java one.

@brndnmtthws

Bump. Added some documentation to the readme.

@brndnmtthws

@igor47 @pcarrier

Bump!

Also, this is a bit of a duplicate of #17 since it resolves some JRuby issues.

Pierre Carrier and others added 6 commits March 7, 2014 12:44
Scenario:

Nerve terminates prematurely (perhaps because of a disconnection from a ZK
server).

It gets restarted by runit.

It then runs its health checks and finds that the service is up.

It tries to create the node, fails because the node already exists,
and so just sets its value.

The session for the previous instance of nerve expires, and the node,
being ephemeral and attached to that session, disappears.

Nerve doesn't do anything about that unless the service goes down and then
back up, at which point the node gets recreated.
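A defensive way to handle this race is to delete and recreate the node whenever it already exists, so that the ephemeral node is attached to the current session rather than the dying one. A minimal sketch, assuming the zk gem's API; the path and data are made up:

```ruby
require 'zk'

zk   = ZK.new('localhost:2181')
path = '/nerve/services/my_service/host_0001'   # made-up path
data = '{"host":"10.0.0.1","port":31124}'

begin
  zk.create(path, data, mode: :ephemeral)
rescue ZK::Exceptions::NodeExists
  # The node is left over from the previous session. Just setting its value
  # would leave it attached to that session, so it vanishes on expiry.
  # Deleting and recreating attaches it to *our* session instead.
  zk.delete(path)
  zk.create(path, data, mode: :ephemeral)
end
```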
Expose a TCP interface for changing the nerve configuration at runtime.
@brndnmtthws

Updated and rebased. Might be completely broken now, however.

Can we get this merged please? Thanks.

@brndnmtthws

Still need this, fyi.

@brndnmtthws

Yep, still important.

@brndnmtthws

@airbnb/sre

Yep, still need this.

The original PR (#10) was opened 5 months ago. What's the problem?

@igor47 (Collaborator, Author) commented Mar 13, 2014

what does "might be completely broken now" mean?

@brndnmtthws

You changed a bunch of things and ignored this PR, so for all I know this branch is broken now. I'll try and test it today, but it's rather frustrating.

@brndnmtthws

I've fixed the breakage; everything works again.

@pcarrier (Contributor)

I haven't ignored this, I've suggested running multiple nerve instances to achieve the same effect.

You've ignored my recommendation.

@brndnmtthws

Your suggestion was never ignored. We discussed it, and our team decided this was the best path.

@nelgau commented Mar 27, 2014

I've observed that this version of Nerve fails to announce services: it logs that a service is "up" even though no mention of that host can be found in ZooKeeper.

Consider host i-72a9b25c:

  • Kafka is running on port 31124.
  • The last logs from Nerve (nerve-0.5.3-test.jar; built from this branch) were:
I, [2014-03-26T21:09:46.253000 #26106]  INFO -- Nerve::ServiceCheck::TcpServiceCheck: nerve: service check kafka-h1 tcp-xx.xx.xxx.xxx:31124 initial check returned true
I, [2014-03-26T21:09:46.268000 #26106]  INFO -- Nerve::ServiceWatcher: nerve: service kafka-h1 is now up

However, ZooKeeper has no i-72a9b25c registered; restarting Nerve corrected the issue.
A second instance had the same pathology, and it was also corrected by a restart.

I received a stack trace from @brndnmtthws on Tuesday that might be of some help to track down the cause of the issue: https://gist.github.com/nelgau/aad34287936abce3ca08

When our ZK node is deleted or otherwise changes, reset it to ensure we
don't drop our node.
@brndnmtthws

I tracked the problem @nelgau mentioned above down to a bug in how nerve handles ZK nodes. I've pushed a patch here:

4d5690a

With ZooKeeper, in the event of a session timeout (or other interruption), you need to recreate ephemeral nodes. Nerve now correctly recreates the nodes when they are lost.
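As a rough illustration of that recreate-on-loss behavior (a sketch against the zk gem, not the patch itself; the path and data are made up):

```ruby
require 'zk'

zk   = ZK.new('localhost:2181')
path = '/nerve/services/my_service/host_0001'   # made-up path
data = '{"host":"10.0.0.1","port":31124}'

register = lambda do
  begin
    zk.create(path, data, mode: :ephemeral)
  rescue ZK::Exceptions::NodeExists
    # already present; nothing to do
  end
end

# Watch the node: if it gets deleted (e.g. an old session expired),
# recreate it, then re-arm the watch.
zk.register(path) do |event|
  register.call if event.node_deleted?
  zk.exists?(path, watch: true)
end

# After a session expiry the client reconnects with a fresh session;
# recreate the node on (re)connection as well.
zk.on_connected { register.call }

register.call
zk.exists?(path, watch: true)
```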

@jolynch (Collaborator) commented Sep 14, 2015

I'm going to close this because I don't think dynamic nerve/synapse config reloads should be the goal going forward. If our current hitless system turns out not to scale (very possible given the design), we'll reconsider and work on this then.

Basically, you can always restart nerve without dropping traffic by starting two nerves and then cycling between them. I actually think the system is easier to understand this way because the config is the source of truth for what nerve/synapse are doing.
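For illustration, the cycling approach might look like the sketch below. The config paths, the `--config` invocation, and the sleep-based synchronization are all assumptions, not nerve's documented interface:

```ruby
# Sketch of a hitless "restart" by cycling two nerve instances.
# Paths, flags, and timings here are assumptions.
old_pid = Process.spawn('nerve', '--config', '/etc/nerve/nerve-old.yml')
sleep 30  # give the first instance time to register its services

# Start a second nerve with the new config; registrations now overlap,
# so consumers never see an empty service set.
new_pid = Process.spawn('nerve', '--config', '/etc/nerve/nerve-new.yml')
sleep 30  # wait until the new instance has registered too

# Stop the old instance; its ephemeral nodes disappear with its session.
Process.kill('TERM', old_pid)
Process.waitpid(old_pid)
```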

jolynch closed this Sep 14, 2015