
Fix various things in Solr startup #136

Merged: rzezeski merged 5 commits into master from er-port-in-use-crash on Jul 25, 2013

Conversation

@coderoshi (Contributor)

This was originally written to address #134. Mainly, it changed wait_for_solr so that it no longer throws, since the throw crashed yz_solr_proc:init and thus stopped the whole node. It turns out that even without the throw, anything but {ok, State} from init will cause the supervision tree to shut down completely. It's worse than that: according to @beerriot, the init code has a default of 5s to start, so the wait_for_solr timeout isn't actually even honored.

The new plan is to have the Riak node shut down fully if the Solr process cannot stay up. @beerriot will be making a PR to move the start out of the init function. Until then, various fixes have been made around the Solr startup process. The main changes were to avoid throwing/exiting inside the process (using gen_server's stop directive instead) and to move the Solr proc directly under yz_sup, to make sure the max restart frequency can actually be reached and cause the Riak node to shut down.
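The "stop directive instead of throwing" change can be sketched as a gen_server whose init/1 returns a stop tuple. This is a minimal illustration, not the actual Yokozuna code; start_solr/1, the state record, and the error reason are hypothetical stand-ins.

```erlang
-module(yz_solr_proc_sketch).
-behaviour(gen_server).
-export([start_link/1, init/1, handle_call/3, handle_cast/2]).

-record(state, {port}).

start_link(Args) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Args, []).

init(Args) ->
    case start_solr(Args) of
        {ok, Port} ->
            {ok, #state{port = Port}};
        {error, Reason} ->
            %% No throw/exit: returning {stop, Reason} lets the supervisor
            %% treat this as a failed child start and apply its restart policy.
            {stop, {solr_start_failed, Reason}}
    end.

%% Stub for illustration; the real code opens a port to the Solr JVM.
start_solr(_Args) ->
    {error, eaddrinuse}.

handle_call(_Msg, _From, S) -> {reply, ok, S}.
handle_cast(_Msg, S) -> {noreply, S}.
```

With this shape, a failing Solr start surfaces to the supervisor as an ordinary child-start failure rather than an uncontrolled crash of init.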

@ghost ghost assigned coderoshi Jul 10, 2013
@rzezeski (Contributor)

This PR is meant to address #134.

Review comment on src/yz_sup.erl (outdated), Contributor:

This gives me an uneasy feeling. This process flag is being applied to the caller of start_link which in this case happens to be Yokozuna's application master. I'm not so sure we should be playing process flag games with a process we don't own.

Furthermore, I noticed this seems to cause issues when stopping. I don't think it's caused by this PR. Rather, when calling stop, the yz_events server gets a gen_event exit from riak_core_ring_events for which it has no handle_info clause, so a function_clause error is thrown. This in turn trips the trap exit and the "Yokozuna had a problem starting..." message.

OH... wow, yeah, this is bad. This call never returns because of the receive below. It blocks because the receive executes on the application master process. That process doesn't wait for a message from the application it's starting; the app start is synchronous. So the only things that will trip this receive are a) a crash or b) some other message sent to the application master, but that is OTP land and nothing we should touch. Thus just dropping the message like we do below could have other unintended consequences.
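For readers following along, the anti-pattern being called out looks roughly like this (a hypothetical reconstruction, not the exact diff): everything in start_link/0 runs in the caller's process, which here is Yokozuna's application master.

```erlang
start_link() ->
    %% BAD: this flag is set on the *caller* (the application master),
    %% not on the supervisor being started.
    process_flag(trap_exit, true),
    {ok, Pid} = supervisor:start_link({local, yz_sup}, yz_sup, []),
    %% BAD: this receive blocks the application master. Nothing is
    %% expected to arrive here except a crash, so start_link never
    %% returns normally and the synchronous app start hangs.
    receive
        {'EXIT', Pid, Reason} ->
            {error, Reason}
    end.
```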

@rzezeski (Contributor)

I'm -1 on this PR because of the process flag change in yz_sup. This goes against the grain of OTP, and I've already found that it causes start_link to not return until a crash. Perhaps a failing init is meant to ripple through the system and stop all applications? The application startup process in Riak is still a bit magic to me; I don't know if it's a reltool thing, a supervisor thing, or what.

I'm starting to wonder if doing this work in the init function is the wrong thing to do in the first place. I notice that if a Solr instance is already up, you don't actually fail in init; instead you get past init and then get a handle_info call with an exit status of 1 from the port because it couldn't bind to the socket port. It gets past init because the already-running Solr instance gives a 200 response to the yz_solr:cores call.

When this type of failure happens, the worker is just restarted over and over like a typical OTP supervision tree. This is "the Erlang way", but it seems like keeping track of Solr restarts and shutting down only the Yokozuna application might make more sense. I'm not sure. One problem right now is that the restart frequencies aren't set well, so in this case you get infinite restart attempts rather than bubbling up the supervisor chain and stopping the whole node.
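The restart-frequency point can be made concrete with a supervisor init/1. The MaxR/MaxT values below are illustrative, not what the PR ships:

```erlang
init([]) ->
    MaxR = 3,   %% allow at most 3 restarts...
    MaxT = 10,  %% ...within any 10-second window
    SolrProc = {yz_solr_proc,
                {yz_solr_proc, start_link, []},
                permanent, 5000, worker, [yz_solr_proc]},
    %% If yz_solr_proc crashes more than MaxR times in MaxT seconds,
    %% this supervisor itself terminates and the failure bubbles up
    %% the tree instead of looping on restarts forever.
    {ok, {{one_for_one, MaxR, MaxT}, [SolrProc]}}.
```

With bounded intensity, a Solr that can never bind its port exhausts the budget quickly and the shutdown propagates, which is the behavior this PR is trying to reach.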

@coderoshi (Contributor, Author)

According to the OTP supervisor init process, if any children fail on init, the whole supervisor is meant to fail. When the supervisor fails, the yokozuna application fails, and when that fails, the failure chains up to the whole node.

I chose to do the work in init because the alternative is to start the supervisor and then separately attach the child supervisors one by one, which is kind of a weird workflow I've never seen replicated in Riak.

I'd love to stop the launch of the actual yokozuna application; however, application:start/2 only specs two return values: {ok, Pid} and {error, Reason}. Sending an error propagates the failure and takes down the whole node, but claiming the process started OK is also not exactly true. This was the crux of my unease with the whole idea of catching a failure on startup.
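As a reference point, the application behaviour callback offers only the two outcomes described above; a sketch (the yz_app/yz_sup names follow the Yokozuna convention, but the body is illustrative):

```erlang
%% Application callback module sketch.
start(_StartType, _StartArgs) ->
    case yz_sup:start_link() of
        {ok, Pid} ->
            {ok, Pid};        %% claims "started ok", even if Solr is shaky
        {error, Reason} ->
            {error, Reason}   %% for a permanent application, this stops the node
    end.
```

There is no third, "started but degraded" return, which is why neither branch is fully honest when Solr fails at boot.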

coderoshi and others added 3 commits July 22, 2013 16:03
The main purpose of this commit was to avoid calling
`proplists:get_value` on an undefined port (Erlang port, not socket).
This would happen when the Solr process has a non-zero exit,
e.g. because the specific socket port number is already bound.

Some additional fixes I made along the way:

* Move `getpid` function in API section where it belongs.

* Fix the exit status match in `handle_info`.

* Check for a regular port exit msg in `handle_info`.

* Only close the port in `terminate` if the pid is not `undefined`.
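The handle_info and terminate fixes listed above might look like the following clauses (the state record and reason terms are assumptions, not the exact Yokozuna code): match the port's exit_status message and a plain 'EXIT' from the port, stop instead of poking at an undefined port, and guard terminate accordingly.

```erlang
%% Solr's JVM exited; Status is its OS exit code (e.g. non-zero when
%% the socket port number was already bound).
handle_info({Port, {exit_status, Status}}, S = #state{port = Port}) ->
    {stop, {solr_exited, Status}, S};
%% The Erlang port itself went away.
handle_info({'EXIT', Port, Reason}, S = #state{port = Port}) ->
    {stop, {port_exit, Reason}, S};
handle_info(_Msg, S) ->
    {noreply, S}.

%% Only close the port if it was actually opened.
terminate(_Reason, #state{port = undefined}) ->
    ok;
terminate(_Reason, #state{port = Port}) ->
    catch port_close(Port),
    ok.
```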
@coderoshi (Contributor, Author)

Where is this used?

@coderoshi (Contributor, Author)

Never mind: yz_monitor_solr.

Move the solr proc server under yz_sup so that the max restart frequency will actually be reached and shut down the Riak node. The thinking is that if Solr cannot run, then Riak should shut down to avoid missing index data and other badness. That is, better to have a downed node than one that is partially up.
@coderoshi (Contributor, Author)

+1

rzezeski added a commit that referenced this pull request Jul 25, 2013
Fix various things in Solr startup
@rzezeski rzezeski merged commit d3a2042 into master Jul 25, 2013
@coderoshi coderoshi removed their assignment Feb 4, 2015