Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add configuration to fail server startup on non-good status checker #11347

Merged

Conversation

jadami10
Copy link
Contributor

This is a small feature addition to allow servers to stop if they have no reach GOOD status by the end of their startup timeout. This is off by default, so it should not have any affect on existing deployments.

We have several instances per day in our production clusters where new servers start but they are not fully caught up. We plan to use this in the future to prevent those servers from starting.

This was tested on internal clusters with a very small timeout. We observed the server attempt to start, poll status, then quickly destroy segments and deregister from helix.

Once this and #11345 are merged, I plan to add a new documentation page outlining the many server startup configurations and what is best to set for high availability and accuracy.

@jadami10
Copy link
Contributor Author

cc @Jackie-Jiang, this should be the last of the PRs to handle server startup behavior in relation to realtime table issues.

@codecov-commenter
Copy link

codecov-commenter commented Aug 14, 2023

Codecov Report

Merging #11347 (2a2e7e5) into master (916ac18) will decrease coverage by 18.71%.
Report is 30 commits behind head on master.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master   #11347       +/-   ##
=============================================
- Coverage     68.48%   49.77%   -18.71%     
+ Complexity     6535      884     -5651     
=============================================
  Files          2233     2224        -9     
  Lines        119952   119908       -44     
  Branches      18191    18243       +52     
=============================================
- Hits          82148    59687    -22461     
- Misses        31964    56213    +24249     
+ Partials       5840     4008     -1832     
Flag Coverage Δ
integration <0.01% <0.00%> (?)
integration1 <0.01% <0.00%> (-20.98%) ⬇️
integration2 0.00% <0.00%> (-25.63%) ⬇️
java-11 49.76% <0.00%> (-18.66%) ⬇️
java-17 49.64% <0.00%> (-18.64%) ⬇️
java-20 49.62% <0.00%> (-18.67%) ⬇️
temurin 49.77% <0.00%> (-18.71%) ⬇️
unittests 67.27% <ø> (?)
unittests1 67.27% <ø> (-0.04%) ⬇️
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
.../pinot/server/starter/helix/BaseServerStarter.java 0.00% <0.00%> (-59.24%) ⬇️
...va/org/apache/pinot/spi/utils/CommonConstants.java 22.09% <ø> (ø)

... and 970 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@@ -478,8 +478,9 @@ private void startupServiceStatusCheck(long endTimeMs) {
long checkIntervalMs = _serverConf.getProperty(Server.CONFIG_OF_STARTUP_SERVICE_STATUS_CHECK_INTERVAL_MS,
Server.DEFAULT_STARTUP_SERVICE_STATUS_CHECK_INTERVAL_MS);

Status serviceStatus = ServiceStatus.getServiceStatus(_instanceId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor)

Suggested change
Status serviceStatus = ServiceStatus.getServiceStatus(_instanceId);
Status serviceStatus = null;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way so we'd have the initial status in log message below if for some reason endTimeMs was immediate. but i guess that seems highly unlikely since it's added to current time, and null is valid in the log

serviceStatus, System.currentTimeMillis() - startTimeMs, ServiceStatus.getStatusDescription());
LOGGER.error(errorMessage);
// Stop the server so that it will be removed from the Helix cluster
stop();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest not calling stop() here since the initialization is not done yet. It should be enough to just:

  _adminApiApplication.stop();
  _helixManager.disconnect();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fair

Comment on lines 564 to 565
public static final String CONFIG_OF_EXIT_SERVER_ON_INCOMPLETE_STARTUP =
"pinot.server.starter.exitServerOnStartupStatusFailure";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move it next to CONFIG_OF_STARTUP_ENABLE_SERVICE_STATUS_CHECK

Suggested change
public static final String CONFIG_OF_EXIT_SERVER_ON_INCOMPLETE_STARTUP =
"pinot.server.starter.exitServerOnStartupStatusFailure";
public static final String CONFIG_OF_STARTUP_EXIT_ON_SERVICE_STATUS_CHECK_FAILURE =
"pinot.server.startup.exitOnServiceStatusCheckFailure";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also suggesting changing the key to make it more specific. See the suggested change

@@ -651,8 +663,12 @@ public void stop() {
Server.DEFAULT_SHUTDOWN_ENABLE_RESOURCE_CHECK)) {
shutdownResourceCheck(endTimeMs);
}
_serverQueriesDisabledTracker.stop();
_realtimeLuceneIndexRefreshState.stop();
if (_serverQueriesDisabledTracker != null) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these null checks are technically not being used anymore, but the still feel valid to have in place since it seems folks can subclass this and do whatever they want

Comment on lines 511 to 513
// If we exit here, only the _adminApiApplication and _helixManager are initialized, so we only stop them
_adminApiApplication.stop();
_helixManager.disconnect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability, suggest moving this part into start() (add a try over startupServiceStatusCheck() and handle them in the catch clause)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea. let me know what you think now

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 594 to 596
public static final String CONFIG_OF_EXIT_SERVER_ON_INCOMPLETE_STARTUP =
"pinot.server.starter.exitServerOnStartupStatusFailure";
public static final boolean DEFAULT_EXIT_SERVER_ON_INCOMPLETE_STARTUP = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(optional) Suggest making it more specific and matching CONFIG_OF_STARTUP_ENABLE_SERVICE_STATUS_CHECK

Suggested change
public static final String CONFIG_OF_EXIT_SERVER_ON_INCOMPLETE_STARTUP =
"pinot.server.starter.exitServerOnStartupStatusFailure";
public static final boolean DEFAULT_EXIT_SERVER_ON_INCOMPLETE_STARTUP = false;
public static final String CONFIG_OF_STARTUP_EXIT_ON_SERVICE_STATUS_CHECK_FAILURE =
"pinot.server.startup.exitOnServiceStatusCheckFailure";
public static final boolean DEFAULT_STARTUP_EXIT_ON_SERVICE_STATUS_CHECK_FAILURE = false;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apologies, missed this comment both times

@Jackie-Jiang Jackie-Jiang merged commit 5915aae into apache:master Aug 22, 2023
21 checks passed
@Jackie-Jiang Jackie-Jiang added the Configuration Config changes (addition/deletion/change in behavior) label Aug 22, 2023
@Jackie-Jiang
Copy link
Contributor

Please also update the pinot doc :)

@jadami10
Copy link
Contributor Author

Please also update the pinot doc :)

will do once #11345 is also merged

@Jackie-Jiang Jackie-Jiang added the release-notes Referenced by PRs that need attention when compiling the next release notes label Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Configuration Config changes (addition/deletion/change in behavior) feature release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants