Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require x tservers on start #1158

Merged
merged 3 commits into from
Jan 27, 2020

Conversation

EdColeman
Copy link
Contributor

This adds a check for when the master get lock that there are a configurable number of tservers available. This was some recent discussion on the mailing list concerning this. This pull request is to solicit comments on the approach. I believe that it is complete, but looking for feedback if this meets
requirements.

@keith-turner
Copy link
Contributor

It might make sense to generalize this and avoid tablet assignment and balancing if the number of tablet servers is less than desired. This would handle a case where the master is currently running and 90% of the tservers die.

Below is the code that does assignment.

https://github.com/apache/accumulo/blob/rel/1.9.3/server/master/src/main/java/org/apache/accumulo/master/TabletGroupWatcher.java#L930

Below is the code that does balancing.

https://github.com/apache/accumulo/blob/rel/1.9.3/server/master/src/main/java/org/apache/accumulo/master/Master.java#L1078

Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this PR was targeting 1.9.4. Because it is a new feature, I have some reservations about that. Although our configuration properties aren't strictly "public API", I think it's useful to treat them similarly, if at all possible, and follow the spirit of Semver. Also, this does pose some risk (albeit minimal) of destabilizing a current release in a patch version. So, I think it'd be better if this targeted 2.0 or 2.1.

@EdColeman
Copy link
Contributor Author

The pull request was modified to address issues noted by @keith-turner. The targeting on 2.x is valid - however, there may be interest in back-porting / patching.

Intended to add tests.

Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log messages are nice

/**
* Allows property configuration to block master start-up waiting for a minimum number of tservers
* to register in zookeeper. It also accepts a maximum time to wait - if the time expires, the
* start-up will continue with any tservers available. The following properties are used to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add if the time expires, the start-up will continue with any tservers available to the property documentation.

@ivakegg
Copy link
Contributor

ivakegg commented May 23, 2019

+1

@ivakegg
Copy link
Contributor

ivakegg commented Jun 11, 2019

@jzgithub1 Please test this out on our small cluster

@jzgithub1
Copy link
Contributor

@Manno15 successfully configured our 6 TServer cluster with @EdColeman 's require_x_tservers_on_start branch. The cluster has all of the 6 TServers up and running without a failure on startup. I am looking at the ticket for more exhaustive ways to test these changes. Thank you again @Manno15.

@jzgithub1
Copy link
Contributor

jzgithub1 commented Jun 12, 2019

Put the two new property values in accumulo-site.xml and restarted cluster. There was no problem getting the TServers up and running on the 6 TServer cluster.

From the logs:

[guest@flash accumulo]$ grep master.startup.tserver.avail.min.count *
gc_flash.superheroes.local.debug.log:2019-06-12 14:27:07,713 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
gc_flash.superheroes.local.log:2019-06-12 14:27:07,713 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
master_flash.superheroes.local.debug.log:2019-06-12 14:27:07,626 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
master_flash.superheroes.local.log:2019-06-12 14:27:07,626 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
Binary file monitor_flash.superheroes.local.debug.log matches
monitor_flash.superheroes.local.log:2019-06-12 14:27:04,154 [server.Accumulo] INFO :  master.startup.tserver.avail.min.count = 2
tracer_flash.superheroes.local.debug.log:2019-06-12 14:27:08,322 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
tracer_flash.superheroes.local.log:2019-06-12 14:27:08,322 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
[guest@flash accumulo]$ grep master.startup.tserver.avail.max.wait *
gc_flash.superheroes.local.debug.log:2019-06-12 14:27:07,713 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
gc_flash.superheroes.local.log:2019-06-12 14:27:07,713 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
master_flash.superheroes.local.debug.log:2019-06-12 14:27:07,626 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
master_flash.superheroes.local.log:2019-06-12 14:27:07,626 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
Binary file monitor_flash.superheroes.local.debug.log matches
monitor_flash.superheroes.local.log:2019-06-12 14:27:04,154 [server.Accumulo] INFO :  master.startup.tserver.avail.max.wait = 40
tracer_flash.superheroes.local.debug.log:2019-06-12 14:27:08,322 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
tracer_flash.superheroes.local.log:2019-06-12 14:27:08,322 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40

[guest@flash accumulo]$ grep -in ERROR *
gc_flash.superheroes.local.debug.log:157:2019-06-12 14:27:07,718 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
gc_flash.superheroes.local.log:154:2019-06-12 14:27:07,718 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
master_flash.superheroes.local.debug.log:155:2019-06-12 14:27:07,632 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
master_flash.superheroes.local.log:152:2019-06-12 14:27:07,632 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
Binary file monitor_flash.superheroes.local.debug.log matches
monitor_flash.superheroes.local.log:24:2019-06-12 14:26:49,005 [zookeeper.ClientCnxn] INFO :  Opening socket connection to server raven.superheroes.local/10.0.0.6:2181. Will not attempt to authenticate using SASL (unknown error)
monitor_flash.superheroes.local.log:179:2019-06-12 14:27:04,157 [server.Accumulo] INFO :  table.bloom.error.rate = 0.5%
tracer_flash.superheroes.local.debug.log:156:2019-06-12 14:27:08,327 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
tracer_flash.superheroes.local.log:153:2019-06-12 14:27:08,327 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%

Check for tserver availability when master gets lock.  Adds two properties:

MASTER_STARTUP_TSERVER_AVAIL_MIN_COUNT - the number of desired tservers
MASTER_STARTUP_TSERVER_AVAIL_MAX_WAIT - the max time willing to wait
@EdColeman EdColeman force-pushed the require_x_tservers_on_start branch from a74bbad to 6202be2 Compare June 13, 2019 17:17
EdColeman added a commit to EdColeman/accumulo that referenced this pull request Jun 13, 2019
Adds two parameters:

 - MASTER_STARTUP_TSERVER_AVAIL_MIN_COUNT - sets desried number of tservers
 - MASTER_STARTUP_TSERVER_AVAIL_MAX_WAIT - sets maximum time to wait.

This is the same changes submitted as pull request apache#1158 for 1.9.x, merged to
2.0.  Request apache#1158 will be available as a patch, but is not expected to be merged
becuase of semver requirements.
@ctubbsii ctubbsii removed the v2.0.0 label Jun 13, 2019
@ctubbsii
Copy link
Member

Since #1204 supersedes this one, we can close this now. The PR will still be available, in case somebody wants to grab the patch from it to backport to 1.9.

@ctubbsii ctubbsii closed this Jun 13, 2019
@EdColeman
Copy link
Contributor Author

EdColeman commented Jun 13, 2019

Although this is closed (superseded with 2.0 version) testing this would still be valid (esp. if you're set-up).

One condition that I really wanted to verify is that the blocking is happening correctly at the correct "start-up phase". I believe that the blocking is set to before the operations that perform table assignment and wal recovery, but the are threads that have been started and wanted to verify that they are not attempting work that really should also be blocked.

Two validate this I intended to (using your 6 servers as an example):

  1. Set required servers to 8 and a wait time of say 5 minutes - validate that master start up blocks and that no other activity that is assigning tablets occurs) until the time is reached.

  2. If time allowed, kill a few servers, set the threshold to 6. start master, confirm blocking and no other assignments are occurring and then start the remaining servers and see that start-up then continues without issues.

Additional testing would help those that wish to back-port this to 1.9.x versions.

EdColeman added a commit that referenced this pull request Jun 13, 2019
Adds two parameters:

 - MASTER_STARTUP_TSERVER_AVAIL_MIN_COUNT - sets desried number of tservers
 - MASTER_STARTUP_TSERVER_AVAIL_MAX_WAIT - sets maximum time to wait.

This is the same changes submitted as pull request #1158 for 1.9.x, merged to
2.0.  Request #1158 will be available as a patch, but is not expected to be merged
becuase of semver requirements.
@jzgithub1
Copy link
Contributor

I set required servers to 8 and max wait to 10 minutes. None of the tablet servers come up in GUI after 11 minutes. There are only 6 TServers in the cluster. I guess this is the desired outcome.

@jzgithub1
Copy link
Contributor

I was able able to do test #2 above. Tests are complete for now. @Manno15 has to take the cluster for some other testing now.

@milleruntime
Copy link
Contributor

Reopening to revisit for 1.10.

@milleruntime milleruntime reopened this Jan 21, 2020
@EdColeman EdColeman merged commit ce5085e into apache:1.9 Jan 27, 2020
@EdColeman EdColeman deleted the require_x_tservers_on_start branch May 10, 2021 19:53
@ctubbsii ctubbsii added this to the 1.10.0 milestone Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants