New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add configuration to fail server startup on non-good status checker #11347
add configuration to fail server startup on non-good status checker #11347
Conversation
cc @Jackie-Jiang, this should be the last of the PRs to handle server startup behavior in relation to realtime table issues. |
Codecov Report
@@ Coverage Diff @@
## master #11347 +/- ##
=============================================
- Coverage 68.48% 49.77% -18.71%
+ Complexity 6535 884 -5651
=============================================
Files 2233 2224 -9
Lines 119952 119908 -44
Branches 18191 18243 +52
=============================================
- Hits 82148 59687 -22461
- Misses 31964 56213 +24249
+ Partials 5840 4008 -1832
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 970 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@@ -478,8 +478,9 @@ private void startupServiceStatusCheck(long endTimeMs) { | |||
long checkIntervalMs = _serverConf.getProperty(Server.CONFIG_OF_STARTUP_SERVICE_STATUS_CHECK_INTERVAL_MS, | |||
Server.DEFAULT_STARTUP_SERVICE_STATUS_CHECK_INTERVAL_MS); | |||
|
|||
Status serviceStatus = ServiceStatus.getServiceStatus(_instanceId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor)
Status serviceStatus = ServiceStatus.getServiceStatus(_instanceId); | |
Status serviceStatus = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way so we'd have the initial status in log message below if for some reason endTimeMs was immediate. but i guess that seems highly unlikely since it's added to current time, and null is valid in the log
serviceStatus, System.currentTimeMillis() - startTimeMs, ServiceStatus.getStatusDescription()); | ||
LOGGER.error(errorMessage); | ||
// Stop the server so that it will be removed from the Helix cluster | ||
stop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest not calling stop()
here since the initialization is not done yet. It should be enough to just:
_adminApiApplication.stop();
_helixManager.disconnect();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's fair
public static final String CONFIG_OF_EXIT_SERVER_ON_INCOMPLETE_STARTUP = | ||
"pinot.server.starter.exitServerOnStartupStatusFailure"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move it next to CONFIG_OF_STARTUP_ENABLE_SERVICE_STATUS_CHECK
public static final String CONFIG_OF_EXIT_SERVER_ON_INCOMPLETE_STARTUP = | |
"pinot.server.starter.exitServerOnStartupStatusFailure"; | |
public static final String CONFIG_OF_STARTUP_EXIT_ON_SERVICE_STATUS_CHECK_FAILURE = | |
"pinot.server.startup.exitOnServiceStatusCheckFailure"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also suggesting changing the key to make it more specific. See the suggested change
@@ -651,8 +663,12 @@ public void stop() { | |||
Server.DEFAULT_SHUTDOWN_ENABLE_RESOURCE_CHECK)) { | |||
shutdownResourceCheck(endTimeMs); | |||
} | |||
_serverQueriesDisabledTracker.stop(); | |||
_realtimeLuceneIndexRefreshState.stop(); | |||
if (_serverQueriesDisabledTracker != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these null checks are technically not being used anymore, but the still feel valid to have in place since it seems folks can subclass this and do whatever they want
// If we exit here, only the _adminApiApplication and _helixManager are initialized, so we only stop them | ||
_adminApiApplication.stop(); | ||
_helixManager.disconnect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability, suggest moving this part into start()
(add a try over startupServiceStatusCheck()
and handle them in the catch clause)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea. let me know what you think now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
public static final String CONFIG_OF_EXIT_SERVER_ON_INCOMPLETE_STARTUP = | ||
"pinot.server.starter.exitServerOnStartupStatusFailure"; | ||
public static final boolean DEFAULT_EXIT_SERVER_ON_INCOMPLETE_STARTUP = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(optional) Suggest making it more specific and matching CONFIG_OF_STARTUP_ENABLE_SERVICE_STATUS_CHECK
public static final String CONFIG_OF_EXIT_SERVER_ON_INCOMPLETE_STARTUP = | |
"pinot.server.starter.exitServerOnStartupStatusFailure"; | |
public static final boolean DEFAULT_EXIT_SERVER_ON_INCOMPLETE_STARTUP = false; | |
public static final String CONFIG_OF_STARTUP_EXIT_ON_SERVICE_STATUS_CHECK_FAILURE = | |
"pinot.server.startup.exitOnServiceStatusCheckFailure"; | |
public static final boolean DEFAULT_STARTUP_EXIT_ON_SERVICE_STATUS_CHECK_FAILURE = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apologies, missed this comment both times
Please also update the pinot doc :) |
will do once #11345 is also merged |
This is a small feature addition to allow servers to stop if they have no reach
GOOD
status by the end of their startup timeout. This is off by default, so it should not have any affect on existing deployments.We have several instances per day in our production clusters where new servers start but they are not fully caught up. We plan to use this in the future to prevent those servers from starting.
This was tested on internal clusters with a very small timeout. We observed the server attempt to start, poll status, then quickly destroy segments and deregister from helix.
Once this and #11345 are merged, I plan to add a new documentation page outlining the many server startup configurations and what is best to set for high availability and accuracy.