einhorn should monitor errors on upgrade #2

Closed

ebroder opened this issue Jun 14, 2012 · 6 comments

@ebroder
Contributor

ebroder commented Jun 14, 2012

Ran into a problem when upgrading the child processes to a new version: the new children errored out immediately (due to a bug). Because they error pre-ACK, the old children stick around and handle requests (yay!)

However, einhorn continuously tries to start the new children. It should possibly back off, and almost certainly notify somehow. If I hadn't happened to be doing maintenance on the server in question, who knows how long we would have gone without noticing.

@ebroder
Contributor Author

ebroder commented Jun 17, 2012

@gdb: At least from a monitoring perspective, what do you think about adding a last_upgraded element to einhorn's state? That way you can have a monitoring script that warns if any child isn't the current version once (Time.now - last_upgraded) is greater than some threshold.

Doesn't solve the underlying problem of getting into this stuck state, but it at least gives you an easy way to automatically detect when you should be worried about it.
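
For concreteness, the check I'm imagining is roughly the following (just a sketch; last_upgraded, the children hash, and the spinup_time field are hypothetical names rather than einhorn's actual state layout, and the threshold is arbitrary):

# Hypothetical monitoring check: once a grace period after the last upgrade
# has passed, every child should have been spun up after the upgrade.
GRACE_PERIOD = 300 # seconds

def upgrade_healthy?(state)
  last_upgraded = state[:last_upgraded]
  return true if Time.now - last_upgraded < GRACE_PERIOD
  state[:children].values.all? { |child| child[:spinup_time] >= last_upgraded }
end

A cron job could run something like that against the state dump and alert when it returns false.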

@gdb
Contributor

gdb commented Jun 17, 2012

Yeah, I like the last_upgraded idea.

I think adding some sort of flapping detection would also be good, though it's unclear to me exactly what the semantics should be and how it should be reported. I guess you could do something along the lines of watching for workers dying pre-ACK and then putting the result in your state. Depends on exactly what we want to defend against, I guess.
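
To sketch the semantics I have in mind (none of these names exist in einhorn today; this is purely illustrative):

# Illustrative flapping detector: count consecutive children that die before
# ACKing, reset the count whenever a child does ACK, and expose the numbers
# in the state dump so external monitoring can see them.
class FlapDetector
  FLAP_THRESHOLD = 5 # arbitrary

  def initialize
    @consecutive_preack_deaths = 0
  end

  # Called when a child ACKs successfully.
  def note_ack
    @consecutive_preack_deaths = 0
  end

  # Called when a child exits before ACKing.
  def note_preack_death
    @consecutive_preack_deaths += 1
  end

  def flapping?
    @consecutive_preack_deaths >= FLAP_THRESHOLD
  end

  # Merged into the state dump.
  def to_state
    { consecutive_preack_deaths: @consecutive_preack_deaths, flapping: flapping? }
  end
end

Whether flapping? should trigger a spawn backoff, a log line, or both is the part that still seems open to me.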

@ebroder
Contributor Author

ebroder commented Jun 17, 2012

What's the right way for me to get the state in a machine-inspectable way over the control socket? The 'state' command gives me a string from .inspect - am I expected to... eval that or something? (That would sketch me out, but it's moot anyway, because it doesn't seem to work: e.g., Time objects aren't eval'able.)

@gdb
Contributor

gdb commented Jun 17, 2012

I can see a few possible answers:

  • Have a separate machine_readable_state command, which returns a JSON dump of the state, or
  • Return a YAML dump rather than .inspect for state, or
  • Return a JSON dump for state

I think the third one is superior, though it has the disadvantage of not exposing type information (e.g. symbol keys vs. string keys) as well as .inspect does. That information is mostly useful for debugging, though, so maybe there should also be an inspect_state command or something.
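
To make the symbol-vs-string point concrete: a plain JSON round trip stringifies hash keys and symbol values, e.g.

[1] pry(main)> JSON.load(JSON.dump({:foo => 1, :bar => :baz}))
=> {"foo"=>1, "bar"=>"baz"}

so the consumer can't tell what the original types were.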

@ebroder
Contributor Author

ebroder commented Jun 17, 2012

Hmm, JSON on its own doesn't seem to be sufficient here. In particular, Time objects dumped to JSON generate their .inspect representation, which is still not useful for easy parsing:

[3] pry(main)> JSON.load(JSON.dump(Time.now))
=> "Sun Jun 17 12:53:53 -0700 2012"

YAML.load(YAML.dump(Time.now)) does return a Time object, although YAML can also do things like serialize arbitrary types, which is a bit sketch:

[5] pry(main)> puts Set.new.to_yaml
--- !ruby/object:Set
hash: {}

We could pre- and post-process the state dict to coerce timestamps to seconds-since-epoch and back, but you can't really do that without encoding details about the keys in the state dict, which I'm not wild about.
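
For concreteness, the coercion I mean would look something like this sketch (TIME_KEYS is the part I'm not wild about, since it bakes knowledge of the state layout into the encoder, and the key name here is just an illustration):

require 'json'

# Sketch only: coerce known Time values to epoch seconds before dumping,
# and back to Time objects after loading.
TIME_KEYS = ['last_upgraded'] # illustrative, not einhorn's actual state keys

def encode_state(state)
  dumped = state.dup
  TIME_KEYS.each { |k| dumped[k] = dumped[k].to_i if dumped[k] }
  JSON.dump(dumped)
end

def decode_state(json)
  state = JSON.load(json)
  TIME_KEYS.each { |k| state[k] = Time.at(state[k]) if state[k] }
  state
end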

I think I'd lean towards using YAML, and probably hide it under a separate command. Thoughts?

@gdb
Contributor

gdb commented Jun 18, 2012

I've actually been meaning to get rid of JSON from the Einhorn protocol; it's the only gem dependency. If YAML allows arbitrary code execution, I'd rather not use it. If, however, it's safe, then we could switch to it.
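
If we did go the YAML route, the question is whether the loader can be locked down. As a minimal sketch, assuming a Psych version that provides YAML.safe_load (which not every Ruby will have), the arbitrary-type case from the Set example above gets rejected rather than instantiated:

require 'yaml'

# safe_load accepts plain scalars, arrays, and hashes, but refuses
# ruby-object tags like !ruby/object:Set unless they're whitelisted.
YAML.safe_load("--- !ruby/object:Set\nhash: {}\n")
# => raises Psych::DisallowedClass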

gdb added a commit that referenced this issue Sep 27, 2012
…ails

Helps with #2, though it doesn't do much besides stop trying to spin up tons of processes.

gdb added a commit that referenced this issue Sep 27, 2012
This makes it possible to write external monitoring (cf #2)