Conversation
This PR is meant to address #134.
src/yz_sup.erl
Outdated
This gives me an uneasy feeling. The process flag is being applied to the caller of start_link, which in this case happens to be Yokozuna's application master. I'm not sure we should be playing process-flag games with a process we don't own.
Furthermore, I noticed this seems to cause issues when stopping. I don't think it's caused by this PR; rather, when stop is called, the yz_events server gets a gen_event exit message from riak_core_ring_events that it has no handle_info clause for, which throws a function_clause error. That, in turn, trips the trap_exit and produces the "Yokozuna had a problem starting..." message.
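A defensive sketch of what a fix for the missing clause could look like (the message shape follows the standard gen_event manager notification; the logging call and exact patterns are assumptions, not Yokozuna's actual code):

```erlang
%% Sketch: a gen_server such as yz_events that registers a handler with
%% riak_core_ring_events will receive a {gen_event_EXIT, Handler, Reason}
%% message if the event manager terminates. Without a matching clause,
%% handle_info/2 raises function_clause and the server crashes.
handle_info({gen_event_EXIT, Handler, Reason}, State) ->
    %% Log and carry on (or return {stop, ...} if the handler is vital).
    error_logger:info_msg("event handler ~p exited: ~p~n", [Handler, Reason]),
    {noreply, State};
handle_info(_Unexpected, State) ->
    %% Catch-all: ignore unknown messages rather than crashing.
    {noreply, State}.
```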
OH...wow, yeah, this is bad. This call never returns because of the receive below. It blocks because the receive executes on the application master process. That process, in turn, doesn't wait for a message from the application it's starting; the app start is synchronous. So the only things that will trip this receive are a) a crash, or b) some other message that might be sent to the application master, but that is OTP land and nothing we should touch. Thus just dropping the message like we do below could have other unintended consequences.
I'm -1 on this PR because of the process flag change in
According to the OTP supervisor startup process, if any child fails in init, the whole supervisor is meant to fail. When the supervisor fails, the yokozuna application fails, and when that fails, the failure chains up to the whole node. I chose to do the work in init because the alternative is to start the supervisor and then separately attach the child supervisors one by one, which is a weird workflow I've never seen replicated in Riak. I'd love to stop the launch of the actual yokozuna application; however, application:start/2 only specs two return values: {ok, Pid} and {error, Reason}. Returning an error propagates the failure and takes down the whole node, but claiming the process started OK is also not exactly true. That was the crux of my unease with the whole idea of catching a failure on startup.
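The failure chain described above can be reproduced with a minimal supervisor whose child fails to start; this is a standalone sketch (module and child names are made up), not Yokozuna code:

```erlang
%% Sketch of the OTP behavior: if a child's start function fails while
%% the supervisor is initializing, supervisor:start_link/2 itself
%% returns an error. For an application started as `permanent`, that
%% error then escalates all the way to node shutdown.
-module(sup_fail_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

init([]) ->
    %% This child's start MFA returns {error, boom} instead of {ok, Pid},
    %% so the supervisor never finishes starting.
    Child = {bad_child,
             {erlang, apply, [fun() -> {error, boom} end, []]},
             permanent, 5000, worker, []},
    {ok, {{one_for_one, 5, 10}, [Child]}}.
```

Calling `sup_fail_sketch:start_link()` returns `{error, {shutdown, {failed_to_start_child, bad_child, boom}}}`, which is exactly the shape of failure that chains up through the application controller.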
The main purpose of this commit was to avoid calling `proplists:get_value` on an undefined port (an Erlang port, not a socket). This would happen when the Solr process had a non-zero exit, e.g. because the specific socket port number was already bound. Some additional fixes I made along the way:

* Move the `getpid` function into the API section where it belongs.
* Fix the exit-status match in `handle_info`.
* Check for a regular port exit message in `handle_info`.
* Only close the port in `terminate` if the pid is not `undefined`.
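The last fix in the list can be sketched as a guarded `terminate` callback; the `#state` record and its `port` field are assumptions standing in for whatever yz_solr_proc actually keeps:

```erlang
%% Sketch of the terminate guard: only close the port if one was
%% actually opened. Calling port_close/1 (or looking up port_info)
%% on `undefined` would crash during shutdown.
-record(state, {port :: port() | undefined}).

terminate(_Reason, #state{port=undefined}) ->
    %% Solr never started, or the port already exited: nothing to close.
    ok;
terminate(_Reason, #state{port=Port}) when is_port(Port) ->
    %% `catch` because the port may have closed between the check
    %% and the call.
    catch port_close(Port),
    ok.
```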
Where is this used?
Never mind, it's used by yz_monitor_solr.
Move the solr proc server under yz_sup so that the max restart frequency will actually be reached and shut down the Riak node. The thinking is that if Solr cannot run, then Riak should shut down to avoid missing index data and other badness. That is, better to have a downed node than one that is partially up.
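A sketch of what this looks like in a yz_sup-style `init/1`; the restart intensity numbers here are illustrative, not Yokozuna's actual values:

```erlang
%% Sketch: yz_solr_proc placed directly under the top-level supervisor.
%% If the child crashes more than MaxR times within MaxT seconds, the
%% supervisor itself exits; for a permanent application that failure
%% escalates and shuts down the whole node, which is the desired
%% "down rather than partially up" behavior.
init([]) ->
    SolrProc = {yz_solr_proc,
                {yz_solr_proc, start_link, []},
                permanent,       %% always restart on abnormal exit
                5000,            %% ms allowed for graceful shutdown
                worker,
                [yz_solr_proc]},
    MaxR = 2,   %% at most 2 restarts ...
    MaxT = 60,  %% ... per 60 seconds, then give up and crash
    {ok, {{one_for_one, MaxR, MaxT}, [SolrProc]}}.
```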
+1
Fix various things in Solr startup
This was originally written to address #134. Mainly, it was to change `wait_for_solr` from throwing, which crashed the `yz_solr_proc:init` function and thus stopped the whole node. It turns out that even without the throw, anything but `{ok, State}` will cause the supervisor tree to shut down completely. Worse, the init code has a default of 5s to start according to @beerriot, so the `wait_for_solr` timeout isn't actually even honored.

The new plan is to cause the Riak node to shut down fully if the Solr process cannot stay up. @beerriot will be making a PR to move the start out of the `init` function. Until then, various fixes have been made around the Solr startup process. The main changes were to avoid throwing/exiting inside the process and instead use the `stop` directive for `gen_server`, and to move the solr proc directly under `yz_sup` in order to make sure max restart frequency can actually be reached and cause the Riak node to shut down.
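The "use the `stop` directive" change can be sketched as `handle_info` clauses that return `{stop, ...}` instead of throwing; the exact message shapes and state fields are assumptions based on how Erlang ports report exits, not the literal yz_solr_proc code:

```erlang
%% Sketch: when a port is opened with the exit_status option, a dying
%% external program (Solr) delivers {Port, {exit_status, Status}}.
%% Returning {stop, Reason, State} lets gen_server run terminate/2 and
%% exit cleanly, so the supervisor sees an abnormal exit and the
%% restart-frequency accounting works as intended.
handle_info({Port, {exit_status, Status}}, #state{port=Port}=State)
  when Status =/= 0 ->
    %% Solr exited abnormally (e.g. its socket port was already bound).
    {stop, {solr_exited, Status}, State};
handle_info({'EXIT', Port, Reason}, #state{port=Port}=State) ->
    %% Regular port exit message (the process traps exits).
    {stop, {port_exit, Reason}, State}.
```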