New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make replication services start when configured #868

Merged
merged 8 commits into from Jan 7, 2019

Conversation

Projects
None yet
5 participants
@milleruntime
Copy link
Contributor

milleruntime commented Jan 2, 2019

  • Only start the replication services in TabletServer if the required
    properties are set at startup

I noticed tserver will always start multiple threads for replication, even if replication name is not configured. As far as I can tell, this is the only required property for replication and if it is not set at startup anyway, there is a good chance not all data will be replicated.

@milleruntime milleruntime requested a review from joshelser Jan 2, 2019

@joshelser

This comment has been minimized.

Copy link
Member

joshelser commented Jan 2, 2019

Pardon my ignorance: do we differentiate properties that can be changed dynamically and those which require a restart? That's my biggest concern -- I think your change would require a config-change+restart instead of just at config-change that we have now.

@ctubbsii

This comment has been minimized.

Copy link
Member

ctubbsii commented Jan 2, 2019

Pardon my ignorance: do we differentiate properties that can be changed dynamically and those which require a restart?

In the past, I don't think we've done a good job at this differentiation. I think there's a natural differentiation, which we could use: properties which can be set in ZooKeeper, and those that can't, but even those that can be set in ZooKeeper, there's not well-defined behavior as to when they will go into effect. This is something that we should think about doing better in future. That's outside the scope of this issue, in the general case, but in this narrow case, I think Josh makes a good point: if you can change the configuration for a replication receiver on a running system, we should probably make sure that the services respond to that configuration change... and we can't do that if they are disabled on startup and don't monitor for configuration changes to re-enable.

@joshelser

This comment has been minimized.

Copy link
Member

joshelser commented Jan 2, 2019

if you can change the configuration for a replication receiver on a running system, we should probably make sure that the services respond to that configuration change... and we can't do that if they are disabled on startup and don't monitor for configuration changes to re-enable.

Yes, this was half of my point. The other half is that we need to have documentation that would tell people that they need to restart Accumulo after making this change, especially since this changes previous semantics. Probably release notes to advertise the immediate change and user manual to tell people for the future that they need to do it.

@ctubbsii

This comment has been minimized.

Copy link
Member

ctubbsii commented Jan 2, 2019

It looks like there's a "fixed Zoo property" concept, which we label on some properties, which indicates things that can be changed in ZK, but won't be recognized until restart of some service/daemon... not sure if that is used in the generated docs for properties... but it seems like it easily could be. (EDIT: turns out "fixed properties" already indicate the need to restart to take effect, so we can just ensure it's properly marked as a "fixed property".)

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 2, 2019

Yeah a fixed property will get a little snippet in the generated doc like gc.port.client does here:
"zk mutable: yes but requires restart of the gc"

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 2, 2019

With this change, we could throw an illegal state exception if the user sets a property on the fly, expecting data to get replicated. I am not sure where that would be done though.

@mikewalch

This comment has been minimized.

Copy link
Member

mikewalch commented Jan 2, 2019

It's easy to make replication.name a fixed property but it looks like this method will need to updated for the docs to say Accumulo needs to be restarted (as right now it will say replication needs to be restarted.)

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 2, 2019

It's easy to make replication.name a fixed property but it looks like this method will need to updated for the docs to say Accumulo needs to be restarted (as right now it will say replication needs to be restarted.)

This makes me wonder if the REPLICATION prefix properties should all be TSERVER. The names would be longer but seems more accurate to me.

@ctubbsii

This comment has been minimized.

Copy link
Member

ctubbsii commented Jan 2, 2019

This makes me wonder if the REPLICATION prefix properties should all be TSERVER. The names would be longer but seems more accurate to me.

If it affects the remote replication receiver service built in to the tserver to act as a destination for replicas, then probably.

@mikewalch

This comment has been minimized.

Copy link
Member

mikewalch commented Jan 2, 2019

This makes me wonder if the REPLICATION prefix properties should all be TSERVER. The names would be longer but seems more accurate to me.

I am not too familiar with replication but it's properties might be used in tserver and master. I think the documentation should not call out the process to be restarted by using the prefix of the property. Either we should make sure that this prefix is an Accumulo process (i.e master, tserver, monitor, etc) or just have a general message in the docs like zk mutable: yes but requires a restart. Users can determine which processes should be restarted.

@joshelser

This comment has been minimized.

Copy link
Member

joshelser commented Jan 2, 2019

I am not too familiar with replication but it's properties might be used in tserver and master.

That's correct. There's logic across master and tabletserver for replication. The intent to have its own prefix was to keep all replication configuration in "one place".

just have a general message in the docs like zk mutable: yes but requires a restart

+1

With this change, we could throw an illegal state exception if the user sets a property on the fly, expecting data to get replicated. I am not sure where that would be done though.

That would be client side validation, right? I suppose it's not unreasonable for this case to let client-side validation be sufficient (e.g. it's not bad if a user circumvents our own code and tries to set a zk-mutable property -- worst case, it's just ineffectual). That sounds like a good idea. Even if not an exception, a WARN would go a long way for operators to realize they straying from the "norm"

@ctubbsii

This comment has been minimized.

Copy link
Member

ctubbsii commented Jan 2, 2019

We can certainly fix the message without fixing the property names, but if it a configuration knob for tserver, then it should be prefixed accordingly. The convention we have is that properties for specific processes are prefixed with the process type, if they affect all of Accumulo, they are prefixed with either general. or instance. (the latter being used for instance-identifying information... information that together establishes the an identity for a cluster and must be the same across the entire cluster in order for each component to identify as part of the same Accumulo instance).

@joshelser

This comment has been minimized.

Copy link
Member

joshelser commented Jan 2, 2019

We can certainly fix the message without fixing the property names

Let's do one thing at a time, please. This discussion doesn't belong on a PR about not starting up services in certain situations.

@ctubbsii

This comment has been minimized.

Copy link
Member

ctubbsii commented Jan 2, 2019

Let's do one thing at a time, please.

Agreed. My thinking is that there are multiple possible follow-on actions which may affect the decision about the initial action. Have we reached consensus on the initial set of actions here? If so, is it the following?

  1. enable/disable replication based on startup config,
  2. used fixed properties to document that changes in ZK require restart, and
  3. clean up wording of said documentation so it's not based on property name

Or something else?

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 2, 2019

Let's do one thing at a time, please.

Agreed. My thinking is that there are multiple possible follow-on actions which may affect the decision about the initial action. Have we reached consensus on the initial set of actions here? If so, is it the following?

  1. enable/disable replication based on startup config,
  2. used fixed properties to document that changes in ZK require restart, and
  3. clean up wording of said documentation so it's not based on property name

Or something else?

I also suggest throwing an error on the client if replication properties are set (through the shell I guess) when the tserver doesn't have a replication.name. Perhaps this could be done in ConfigSanityCheck and print a warning?

@ctubbsii

This comment has been minimized.

Copy link
Member

ctubbsii commented Jan 2, 2019

It's not clear to me how all the replication properties work together... which ones are on the send side and which ones affect the receive side.

It might be best to just add a new fixed property (requires restart), tserver.replicationservices.enabled=true/false and leave all the replication property stuff for future cleanup/improvements.

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 2, 2019

It's not clear to me how all the replication properties work together... which ones are on the send side and which ones affect the receive side.

It might be best to just add a new fixed property (requires restart), tserver.replicationservices.enabled=true/false and leave all the replication property stuff for future cleanup/improvements.

+1 I like this the best. This also won't require changing the documentation for the other properties since it would get generated to say "zk mutable: yes but requires restart of the tserver".

@milleruntime milleruntime force-pushed the milleruntime:repl-startup branch from ae2ebfa to 0c758de Jan 2, 2019

@keith-turner

This comment has been minimized.

Copy link
Contributor

keith-turner commented Jan 3, 2019

Could do the following :

  • If replication.name is set to empty string then don't start up replication related services in the tserver.
  • In the tserver create a timer that checks if replication.name transitions from empty to non-empty then automatically start the services in the tserver. This is similar to how there are timers that check if thread pool sizes need to be adjusted.

This change should hopefully be transparent to the user.

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 3, 2019

Could do the following :

  • If replication.name is set to empty string then don't start up replication related services in the tserver.
  • In the tserver create a timer that checks if replication.name transitions from empty to non-empty then automatically start the services in the tserver. This is similar to how there are timers that check if thread pool sizes need to be adjusted.

This change should hopefully be transparent to the user.

Ok I think this would work the best. And also, the more I thought about it, it doesn't make sense to have a new property (or have a tserver property) so I will go back to how I was checking initially on replication.name.

@milleruntime milleruntime changed the title Make tserver start repl services when configured Make replication services start when configured Jan 3, 2019

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 3, 2019

OK I think these changes work best for all use cases and require no action from users. I made a thread in each Master and tserver that checks every 5 seconds if replication.name is set. Then if it is set, startup all the services. That seemed to be a reasonable time to check since any longer, and the tests began failing.

The only other part in question was whether to kill the threads after starting the services. I didn't see a graceful way to do that with SimpleTimer though.

milleruntime added some commits Jan 2, 2019

Make tserver start repl services when configured
* Only start the replication services in TabletServer if the required
properties are set at startup

@milleruntime milleruntime force-pushed the milleruntime:repl-startup branch from 6aceb77 to 0723eae Jan 3, 2019

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 4, 2019

MultiTserverReplicationIT was failing due to this change (it assumed services were always running) but was easily fixed in 01f89e2. @joshelser were your concerns resolved with my latest changes?

@joshelser

This comment has been minimized.

Copy link
Member

joshelser commented Jan 4, 2019

Taking a look at the latest you have pushed now, @milleruntime .

@joshelser
Copy link
Member

joshelser left a comment

A couple of minor requests for changes, but this looks really close, Mike. Thanks for the ping for another review.

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 4, 2019

All ITs in test/src/main/java/org/apache/accumulo/test/replication are now passing.

@ctubbsii ctubbsii added the v2.0.0 label Jan 5, 2019

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 7, 2019

@joshelser let me know if you are happy with the updates I made on this PR and I will merge.

@joshelser

This comment has been minimized.

Copy link
Member

joshelser commented Jan 7, 2019

@milleruntime one comment up above on SimpleTimer (if you could double-check that?), then I'm good. Thanks for the review ping.

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 7, 2019

one comment up above on SimpleTimer (if you could double-check that?), then I'm good.

I looked into it a bit and replied to your comment. I think we are OK.

@milleruntime

This comment has been minimized.

Copy link
Contributor

milleruntime commented Jan 7, 2019

All ITs in accumulo/test/src/main/java/org/apache/accumulo/test/replication are still passing

@milleruntime milleruntime merged commit 9198818 into apache:master Jan 7, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@milleruntime milleruntime deleted the milleruntime:repl-startup branch Jan 7, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment