Require x tservers on start #1158
Conversation
server/master/src/main/java/org/apache/accumulo/master/util/TserverStartThreshold.java (review comment outdated, resolved)
It might make sense to generalize this and avoid tablet assignment and balancing if the number of tablet servers is less than desired. This would handle a case where the master is currently running and 90% of the tservers die. Below is the code that does assignment. Below is the code that does balancing.
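A minimal sketch of what such a generalized guard could look like. This is not the assignment or balancing code referenced above; the class and method names are hypothetical placeholders, not the PR's actual code.

```java
// Hypothetical sketch only; not the Accumulo master's real API.
public class AssignmentGuard {

  private final int minTserverCount;

  public AssignmentGuard(int minTserverCount) {
    this.minTserverCount = minTserverCount;
  }

  /**
   * True when enough tablet servers are live to proceed with tablet
   * assignment and balancing; callers would skip those steps otherwise,
   * e.g. when the master is running and 90% of the tservers die.
   */
  public boolean enoughTserversAvailable(int liveTserverCount) {
    return liveTserverCount >= minTserverCount;
  }
}
```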
I noticed this PR was targeting 1.9.4. Because it is a new feature, I have some reservations about that. Although our configuration properties aren't strictly "public API", I think it's useful to treat them similarly, if at all possible, and follow the spirit of Semver. Also, this does pose some risk (albeit minimal) of destabilizing a current release in a patch version. So, I think it'd be better if this targeted 2.0 or 2.1.
The pull request was modified to address the issues noted by @keith-turner. Targeting 2.x is valid; however, there may be interest in back-porting / patching. I intend to add tests.
The log messages are nice
/**
 * Allows property configuration to block master start-up waiting for a minimum number of tservers
 * to register in zookeeper. It also accepts a maximum time to wait - if the time expires, the
 * start-up will continue with any tservers available. The following properties are used to
Should add "if the time expires, the start-up will continue with any tservers available" to the property documentation.
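For illustration, a hedged sketch of the blocking behavior the Javadoc describes: wait until a minimum number of tservers register, or until a maximum wait expires, then continue with whatever is available. The class name and the liveTserverCount supplier here are placeholders; the PR's actual TserverStartThreshold implementation will differ in detail.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

// Sketch only; not the PR's TserverStartThreshold code.
public class StartupTserverWait {

  /**
   * Blocks until at least minCount tservers are registered, or until maxWait
   * expires; if the time runs out, callers continue with whatever tservers
   * are available, as the documentation above describes.
   */
  public static int waitForTservers(IntSupplier liveTserverCount, int minCount,
      long maxWait, TimeUnit unit) throws InterruptedException {
    long deadline = System.nanoTime() + unit.toNanos(maxWait);
    int current = liveTserverCount.getAsInt();
    while (current < minCount && System.nanoTime() - deadline < 0) {
      TimeUnit.SECONDS.sleep(1); // poll interval is arbitrary in this sketch
      current = liveTserverCount.getAsInt();
    }
    return current;
  }
}
```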
server/master/src/main/java/org/apache/accumulo/master/Master.java (review comments outdated, resolved)
+1
@jzgithub1 Please test this out on our small cluster.
@Manno15 successfully configured our 6 TServer cluster with @EdColeman's require_x_tservers_on_start branch. The cluster has all 6 TServers up and running without a failure on startup. I am looking at the ticket for more exhaustive ways to test these changes. Thank you again @Manno15.
Put the two new property values in accumulo-site.xml and restarted the cluster. There was no problem getting the TServers up and running on the 6 TServer cluster. From the logs:

[guest@flash accumulo]$ grep master.startup.tserver.avail.min.count *
gc_flash.superheroes.local.debug.log:2019-06-12 14:27:07,713 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
gc_flash.superheroes.local.log:2019-06-12 14:27:07,713 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
master_flash.superheroes.local.debug.log:2019-06-12 14:27:07,626 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
master_flash.superheroes.local.log:2019-06-12 14:27:07,626 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
Binary file monitor_flash.superheroes.local.debug.log matches
monitor_flash.superheroes.local.log:2019-06-12 14:27:04,154 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
tracer_flash.superheroes.local.debug.log:2019-06-12 14:27:08,322 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
tracer_flash.superheroes.local.log:2019-06-12 14:27:08,322 [server.Accumulo] INFO : master.startup.tserver.avail.min.count = 2
[guest@flash accumulo]$ grep master.startup.tserver.avail.max.wait *
gc_flash.superheroes.local.debug.log:2019-06-12 14:27:07,713 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
gc_flash.superheroes.local.log:2019-06-12 14:27:07,713 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
master_flash.superheroes.local.debug.log:2019-06-12 14:27:07,626 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
master_flash.superheroes.local.log:2019-06-12 14:27:07,626 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
Binary file monitor_flash.superheroes.local.debug.log matches
monitor_flash.superheroes.local.log:2019-06-12 14:27:04,154 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
tracer_flash.superheroes.local.debug.log:2019-06-12 14:27:08,322 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
tracer_flash.superheroes.local.log:2019-06-12 14:27:08,322 [server.Accumulo] INFO : master.startup.tserver.avail.max.wait = 40
[guest@flash accumulo]$ grep -in ERROR *
gc_flash.superheroes.local.debug.log:157:2019-06-12 14:27:07,718 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
gc_flash.superheroes.local.log:154:2019-06-12 14:27:07,718 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
master_flash.superheroes.local.debug.log:155:2019-06-12 14:27:07,632 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
master_flash.superheroes.local.log:152:2019-06-12 14:27:07,632 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
Binary file monitor_flash.superheroes.local.debug.log matches
monitor_flash.superheroes.local.log:24:2019-06-12 14:26:49,005 [zookeeper.ClientCnxn] INFO : Opening socket connection to server raven.superheroes.local/10.0.0.6:2181. Will not attempt to authenticate using SASL (unknown error)
monitor_flash.superheroes.local.log:179:2019-06-12 14:27:04,157 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
tracer_flash.superheroes.local.debug.log:156:2019-06-12 14:27:08,327 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
tracer_flash.superheroes.local.log:153:2019-06-12 14:27:08,327 [server.Accumulo] INFO : table.bloom.error.rate = 0.5%
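For reference, the two properties were set in accumulo-site.xml roughly along these lines. The property names match the log output above; the values shown (2 and 40) mirror this test run, and the unit of max.wait is not confirmed here.

```xml
<!-- Example only; names taken from the logged configuration above. -->
<property>
  <name>master.startup.tserver.avail.min.count</name>
  <value>2</value>
</property>
<property>
  <name>master.startup.tserver.avail.max.wait</name>
  <value>40</value>
</property>
```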
Check for tserver availability when master gets lock. Adds two properties:
- MASTER_STARTUP_TSERVER_AVAIL_MIN_COUNT - the number of desired tservers
- MASTER_STARTUP_TSERVER_AVAIL_MAX_WAIT - the max time willing to wait
a74bbad to 6202be2
Adds two parameters:
- MASTER_STARTUP_TSERVER_AVAIL_MIN_COUNT - sets desired number of tservers
- MASTER_STARTUP_TSERVER_AVAIL_MAX_WAIT - sets maximum time to wait

These are the same changes submitted as pull request apache#1158 for 1.9.x, merged to 2.0. Request apache#1158 will be available as a patch, but is not expected to be merged because of semver requirements.
Since #1204 supersedes this one, we can close this now. The PR will still be available, in case somebody wants to grab the patch from it to backport to 1.9.
Although this is closed (superseded by the 2.0 version), testing this would still be valid (esp. if you're set up). One condition that I really wanted to verify is that the blocking happens correctly at the correct "start-up phase". I believe that the blocking is set to run before the operations that perform table assignment and WAL recovery, but there are threads that have already been started, and I wanted to verify that they are not attempting work that really should also be blocked. To validate this I intended to (using your 6 servers as an example):
Additional testing would help those that wish to back-port this to 1.9.x versions.
Adds two parameters:
- MASTER_STARTUP_TSERVER_AVAIL_MIN_COUNT - sets desired number of tservers
- MASTER_STARTUP_TSERVER_AVAIL_MAX_WAIT - sets maximum time to wait

These are the same changes submitted as pull request #1158 for 1.9.x, merged to 2.0. Request #1158 will be available as a patch, but is not expected to be merged because of semver requirements.
I set required servers to 8 and max wait to 10 minutes. None of the tablet servers came up in the GUI after 11 minutes. There are only 6 TServers in the cluster. I guess this is the desired outcome.
Reopening to revisit for 1.10.
This adds a check, for when the master gets the lock, that a configurable number of tservers are available. There was some recent discussion on the mailing list concerning this. This pull request is to solicit comments on the approach. I believe that it is complete, but I am looking for feedback on whether it meets requirements.
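A rough, self-contained sketch of the start-up ordering being described: acquire the master lock, block on the tserver threshold, then proceed to assignment. All method names here are illustrative placeholders, not the actual Master.java code.

```java
// Sketch of the start-up ordering this PR describes; placeholder names only.
public class MasterStartupSketch {

  public static void main(String[] args) throws InterruptedException {
    acquireMasterLock();       // master obtains the ZooKeeper lock first
    waitForMinimumTservers();  // then blocks on the min-count / max-wait check
    beginTabletAssignment();   // only then does assignment / recovery start
  }

  static void acquireMasterLock() {
    // ZooKeeper lock acquisition elided in this sketch
  }

  static void waitForMinimumTservers() throws InterruptedException {
    // would poll registered tservers until the configured minimum is reached
    // or the configured maximum wait expires (see the earlier wait sketch)
  }

  static void beginTabletAssignment() {
    // tablet assignment and WAL recovery elided in this sketch
  }
}
```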