
Fix various things in Solr startup #136

Merged: rzezeski merged 5 commits into master from er-port-in-use-crash on Jul 25, 2013

Conversation

@coderoshi (Contributor)

This was originally written to address #134. Mainly, it changed wait_for_solr so that it no longer throws, since the throw crashed yz_solr_proc:init and thus stopped the whole node. It turns out that even without the throw, anything but {ok, State} from init will cause the supervision tree to shut down completely. It's worse than that: according to @beerriot, the init code has a default of 5s to start, so the wait_for_solr timeout isn't actually even honored.

The new plan is to have the Riak node shut down fully if the Solr process cannot stay up. @beerriot will be making a PR to move the start out of the init function. Until then, various fixes have been made around the Solr startup process. The main changes were to avoid throwing/exiting inside the process (using gen_server's stop directive instead) and to move the Solr proc directly under yz_sup, to make sure the max restart frequency can actually be reached and cause the Riak node to shut down.
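The "stop directive instead of throwing" change can be sketched as a gen_server whose init/1 returns a stop tuple. This is a minimal illustration, not the actual Yokozuna code; start_solr/1, the state record, and the error reason are hypothetical stand-ins.

```erlang
-module(yz_solr_proc_sketch).
-behaviour(gen_server).
-export([start_link/1, init/1, handle_call/3, handle_cast/2]).

-record(state, {port}).

start_link(Args) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Args, []).

init(Args) ->
    case start_solr(Args) of
        {ok, Port} ->
            {ok, #state{port = Port}};
        {error, Reason} ->
            %% No throw/exit: returning {stop, Reason} lets the supervisor
            %% treat this as a failed child start and apply its restart policy.
            {stop, {solr_start_failed, Reason}}
    end.

%% Stub for illustration; the real code opens a port to the Solr JVM.
start_solr(_Args) ->
    {error, eaddrinuse}.

handle_call(_Msg, _From, S) -> {reply, ok, S}.
handle_cast(_Msg, S) -> {noreply, S}.
```

With this shape, a failing Solr start surfaces to the supervisor as an ordinary child-start failure rather than an uncontrolled crash of init.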

@ghost ghost assigned coderoshi Jul 10, 2013
@rzezeski (Contributor)

This PR is meant to address #134.

Review comment on src/yz_sup.erl (outdated), Contributor:

This gives me an uneasy feeling. This process flag is being applied to the caller of start_link which in this case happens to be Yokozuna's application master. I'm not so sure we should be playing process flag games with a process we don't own.

Furthermore, I noticed this seems to cause issues when stopping. I don't think it's caused by this PR. Rather, when calling stop, the yz_events server gets a gen_event exit from riak_core_ring_events for which it has no handle_info clause, so a function_clause error is thrown. This in turn trips the trap exit and the "Yokozuna had a problem starting..." message.

OH... wow, yeah, this is bad. This call never returns because of the receive below. It blocks because the receive executes on the application master process. That process doesn't wait for a message from the application it's starting; the app start is synchronous. So the only things that will trip this receive are a) a crash or b) some other message sent to the application master, but that is OTP land and nothing we should touch. Thus just dropping the message like we do below could have other unintended consequences.
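For readers following along, the anti-pattern being called out looks roughly like this (a hypothetical reconstruction, not the exact diff): everything in start_link/0 runs in the caller's process, which here is Yokozuna's application master.

```erlang
start_link() ->
    %% BAD: this flag is set on the *caller* (the application master),
    %% not on the supervisor being started.
    process_flag(trap_exit, true),
    {ok, Pid} = supervisor:start_link({local, yz_sup}, yz_sup, []),
    %% BAD: this receive blocks the application master. Nothing is
    %% expected to arrive here except a crash, so start_link never
    %% returns normally and the synchronous app start hangs.
    receive
        {'EXIT', Pid, Reason} ->
            {error, Reason}
    end.
```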

@rzezeski (Contributor)

I'm -1 on this PR because of the process flag change in yz_sup. This goes against the grain of OTP, and I've already found that it causes start_link to not return until a crash. Perhaps a failing init is meant to ripple through the system and stop all applications? The application startup process in Riak is still a bit magic to me; I don't know if it's a reltool thing, a supervisor thing, or what.

I'm starting to wonder if doing this work in the init function is the wrong thing to do in the first place. I notice that if a Solr instance is already up, you don't actually fail in init; instead you get past init and then get a handle_info call with an exit status of 1 from the port because it couldn't bind to the socket port. It gets past init because the already-running Solr instance gives a 200 response to the yz_solr:cores call.

When this type of failure happens, the worker is just restarted over and over like a typical OTP supervision tree. This is "the Erlang way", but it seems like keeping track of Solr restarts and shutting down only the Yokozuna application might make more sense. I'm not sure. One problem right now is that the restart frequencies aren't set well, so in this case you get infinite restart attempts rather than bubbling up the supervisor chain and stopping the whole node.
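The restart-frequency point can be made concrete with a supervisor init/1. The MaxR/MaxT values below are illustrative, not what the PR ships:

```erlang
init([]) ->
    MaxR = 3,   %% allow at most 3 restarts...
    MaxT = 10,  %% ...within any 10-second window
    SolrProc = {yz_solr_proc,
                {yz_solr_proc, start_link, []},
                permanent, 5000, worker, [yz_solr_proc]},
    %% If yz_solr_proc crashes more than MaxR times in MaxT seconds,
    %% this supervisor itself terminates and the failure bubbles up
    %% the tree instead of looping on restarts forever.
    {ok, {{one_for_one, MaxR, MaxT}, [SolrProc]}}.
```

With bounded intensity, a Solr that can never bind its port exhausts the budget quickly and the shutdown propagates, which is the behavior this PR is trying to reach.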

@coderoshi (Contributor, Author)

According to the OTP supervisor init process, if any children fail on init, the whole supervisor is meant to fail. When the supervisor fails, the yokozuna application fails, and when that fails, the failure chains up to the whole node.

I chose to do the work in init because the alternative is to start the supervisor and then separately attach the child supervisors one by one, which is kind of a weird workflow I've never seen replicated in Riak.

I'd love to stop the launch of the actual yokozuna application; however, application:start/2 only specs two return values: {ok, Pid} and {error, Reason}. Sending an error propagates the failure and takes down the whole node, but claiming the process started OK is also not exactly true. This was the crux of my unease with the whole idea of catching a failure on startup.
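As a reference point, the application behaviour callback offers only the two outcomes described above; a sketch (the yz_app/yz_sup names follow the Yokozuna convention, but the body is illustrative):

```erlang
%% Application callback module sketch.
start(_StartType, _StartArgs) ->
    case yz_sup:start_link() of
        {ok, Pid} ->
            {ok, Pid};        %% claims "started ok", even if Solr is shaky
        {error, Reason} ->
            {error, Reason}   %% for a permanent application, this stops the node
    end.
```

There is no third, "started but degraded" return, which is why neither branch is fully honest when Solr fails at boot.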

coderoshi and others added 3 commits July 22, 2013 16:03
The main purpose of this commit was to avoid calling
`proplists:get_value` on an undefined port (Erlang port, not socket).
This would happen when the Solr process has a non-zero exit,
e.g. because the specific socket port number is already bound.

Some additional fixes I made along the way:

* Move `getpid` function in API section where it belongs.

* Fix the exit status match in `handle_info`.

* Check for a regular port exit msg in `handle_info`.

* Only close the port in `terminate` if the pid is not `undefined`.
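The handle_info and terminate fixes listed above might look like the following clauses (the state record and reason terms are assumptions, not the exact Yokozuna code): match the port's exit_status message and a plain 'EXIT' from the port, stop instead of poking at an undefined port, and guard terminate accordingly.

```erlang
%% Solr's JVM exited; Status is its OS exit code (e.g. non-zero when
%% the socket port number was already bound).
handle_info({Port, {exit_status, Status}}, S = #state{port = Port}) ->
    {stop, {solr_exited, Status}, S};
%% The Erlang port itself went away.
handle_info({'EXIT', Port, Reason}, S = #state{port = Port}) ->
    {stop, {port_exit, Reason}, S};
handle_info(_Msg, S) ->
    {noreply, S}.

%% Only close the port if it was actually opened.
terminate(_Reason, #state{port = undefined}) ->
    ok;
terminate(_Reason, #state{port = Port}) ->
    catch port_close(Port),
    ok.
```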
@coderoshi (Contributor, Author)

Where is this used?

@coderoshi (Contributor, Author)

Never mind: yz_monitor_solr.

Move the solr proc server under yz_sup so that the max restart frequency will actually be reached and shut down the Riak node. The thinking is that if Solr cannot run, then Riak should shut down to avoid missing index data and other badness. That is, better to have a downed node than one that is partially up.
@coderoshi (Contributor, Author)

+1

rzezeski added a commit that referenced this pull request Jul 25, 2013
Fix various things in Solr startup
@rzezeski rzezeski merged commit d3a2042 into master Jul 25, 2013
@coderoshi coderoshi removed their assignment Feb 4, 2015