Host regex volume chooser for WALs #1607

Open
ivakegg opened this issue May 12, 2020 · 8 comments

Comments

@ivakegg
Contributor

ivakegg commented May 12, 2020

There is a need to tie sets of tservers to specific volumes. This is especially useful if one wants to tie the WALs to a specific volume for performance or space reasons. I suggest that we create a HostRegexVolumeChooser that chooses a volume based on the tserver hostname. So given the following configuration:

general.volume.chooser=org.apache.accumulo.server.fs.PerTableVolumeChooser
general.custom.volume.chooser.logger=org.apache.accumulo.server.fs.HostRegexVolumeChooser

We might then add the following configuration:

general.custom.volume.chooser.hostgroup.A=host[0-4]
general.custom.volume.chooser.hostgroup.A.volumes=hdfs://cluster1/accumulo
general.custom.volume.chooser.hostgroup.B=host[5-9]
general.custom.volume.chooser.hostgroup.B.volumes=hdfs://cluster2/accumulo

This will tie the WALs for host group A (host0 through host4) to hdfs://cluster1/accumulo and the WALs for host group B (host5 through host9) to hdfs://cluster2/accumulo.
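
To make the intended matching concrete, here is a minimal sketch of the hostname-to-volume lookup. It is only an illustration under stated assumptions: the class `HostGroupVolumes` and its methods are hypothetical, it does not use or implement any Accumulo interface, and it assumes the regex must match the whole hostname and that the first matching group wins.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical sketch of host-group matching; not an Accumulo class. */
public class HostGroupVolumes {
  // Insertion order is kept so that the first configured group wins on overlap.
  private final Map<String, Pattern> groupPatterns = new LinkedHashMap<>();
  private final Map<String, List<String>> groupVolumes = new LinkedHashMap<>();

  /** Registers a host group with its hostname regex and candidate volumes. */
  public void addGroup(String name, String hostRegex, List<String> volumes) {
    groupPatterns.put(name, Pattern.compile(hostRegex));
    groupVolumes.put(name, volumes);
  }

  /** Returns the volumes of the first group whose regex matches the full hostname. */
  public Optional<List<String>> volumesFor(String host) {
    for (Map.Entry<String, Pattern> entry : groupPatterns.entrySet()) {
      if (entry.getValue().matcher(host).matches()) {
        return Optional.of(groupVolumes.get(entry.getKey()));
      }
    }
    // No group matched: a real chooser would fall back to some default behavior.
    return Optional.empty();
  }

  public static void main(String[] args) {
    HostGroupVolumes chooser = new HostGroupVolumes();
    chooser.addGroup("A", "host[0-4]", List.of("hdfs://cluster1/accumulo"));
    chooser.addGroup("B", "host[5-9]", List.of("hdfs://cluster2/accumulo"));
    System.out.println(chooser.volumesFor("host3"));
    System.out.println(chooser.volumesFor("host7"));
    System.out.println(chooser.volumesFor("elsewhere"));
  }
}
```

A host that matches no group yields an empty result here; what the real chooser should do in that case is an open design question.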

@ctubbsii
Member

FWIW, the PerTableVolumeChooser property for delegating the logger scope is: general.custom.volume.chooser.logger, rather than volume.chooser.logger, and the other properties would also have the general.custom. prefix.

Would such a chooser be useful for scopes other than the logger scope? The answer may drive the name and whether the implementation should be more general.

@ivakegg
Contributor Author

ivakegg commented May 15, 2020

Yes, such a chooser would be useful for other scopes as well. If we have a set of tservers running on top of a particular HDFS cluster, then it might be beneficial to only choose that cluster for storing RFiles. The implementation would work for both table and "non-table" volume choosing.

So in the table case, the choice configuration is similar to the PreferredVolumeChooser in that the configuration could be per table. I cannot decide whether what constitutes a "hostgroup" should be definable on a table-by-table basis or not.

Also, I updated the description per your observation above.

@ivakegg
Contributor Author

ivakegg commented May 15, 2020

After thinking a little more about this, I think this chooser should extend the PreferredVolumeChooser and hence the properties and defaults should follow the same scheme. Here is the class javadoc I am starting with. Please tell me if this makes sense:

/**
 * A {@link PreferredVolumeChooser} that limits its choices from a given set of
 * options to the subset of those options preferred for a particular table (or
 * non-table) and then for the tserver host making the choice.  If no
 * configuration for the tserver host is defined, then this class delegates to
 * the PreferredVolumeChooser to make the choice.
 *
 * For tables, the configuration of a hostgroup regex is defined as follows
 * (properties later in the list override properties earlier in the list):
 *
 * general.custom.volume.hostregex.{group}
 * table.custom.volume.hostregex.{group}
 *
 * Then the volumes for a given hostgroup are defined using the following
 * properties:
 *
 * general.custom.volume.preferred.default.{group}
 * table.custom.volume.preferred.{group}
 *
 * For non-tables (e.g. the logger scope), the configuration of a hostgroup
 * regex is defined using the following:
 *
 * general.custom.volume.hostregex.{group}
 * general.custom.volume.hostregex.{scope}.{group}
 *
 * Then the volumes for a given hostgroup are defined using the following:
 *
 * general.custom.volume.preferred.default.{group}
 * general.custom.volume.preferred.{scope}.{group}
 *
 */
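
The override order described in the javadoc for the non-table case can be sketched as a "most specific key first" lookup. This is an illustration only: `HostGroupPropertyLookup` and its methods are hypothetical names, and the flat `Map` stands in for however the real configuration would be read.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper illustrating the proposed lookup order; not Accumulo code. */
public class HostGroupPropertyLookup {

  /** Returns the value of the first key that is set; pass keys most specific first. */
  public static Optional<String> firstDefined(Map<String, String> props, String... keys) {
    for (String key : keys) {
      String value = props.get(key);
      if (value != null) {
        return Optional.of(value);
      }
    }
    return Optional.empty();
  }

  /** Host regex for a group in a non-table scope such as "logger". */
  public static Optional<String> scopedHostRegex(
      Map<String, String> props, String scope, String group) {
    return firstDefined(props,
        "general.custom.volume.hostregex." + scope + "." + group,
        "general.custom.volume.hostregex." + group);
  }

  /** Preferred volumes for a group in a non-table scope. */
  public static Optional<String> scopedVolumes(
      Map<String, String> props, String scope, String group) {
    return firstDefined(props,
        "general.custom.volume.preferred." + scope + "." + group,
        "general.custom.volume.preferred.default." + group);
  }

  public static void main(String[] args) {
    Map<String, String> props = Map.of(
        "general.custom.volume.hostregex.A", "host[0-4]",
        "general.custom.volume.hostregex.logger.A", "host[0-2]",
        "general.custom.volume.preferred.default.A", "hdfs://cluster1/accumulo");
    // The logger-specific regex overrides the general one.
    System.out.println(scopedHostRegex(props, "logger", "A"));
    // No logger-specific volumes are set, so the default entry is used.
    System.out.println(scopedVolumes(props, "logger", "A"));
  }
}
```

The table case would follow the same pattern, with the table.custom properties checked before the general.custom ones.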

ivakegg added a commit to ivakegg/accumulo that referenced this issue May 15, 2020
@ctubbsii
Member

I was thinking more about the use case for configuring different servers to choose different volumes and realized that we don't need a new class to achieve this. This can already be easily accomplished by setting PreferredVolumeChooser with different config on different hosts.

There is some convenience in deploying the same config globally while still behaving differently from server to server, but I think that convenience might be marginal. What do you think?

@ivakegg
Contributor Author

ivakegg commented May 19, 2020

I was a little scared to do that because of the configuration comparison mechanism that tries to ensure configurations are consistent across hosts. Are you saying that the volume chooser configurations do not get factored into that comparison? I will try to find that code. In any case, when handling large systems, having inconsistent configurations on different sets of nodes is usually asking for trouble.

@milleruntime
Contributor

> I was a little scared to do that because of the configuration comparison mechanism that tries to ensure configurations are consistent across hosts. Are you saying that the volume chooser configurations do not get factored into that comparison? I will try to find that code. In any case, when handling large systems, having inconsistent configurations on different sets of nodes is usually asking for trouble.

Only the properties that begin with instance have to be the same across an entire cluster.

@ivakegg
Contributor Author

ivakegg commented May 20, 2020

Well, then we will try doing this with local configurations and see how that goes. If that goes well, then I will not need to create this chooser unless you feel it is still worthwhile.

@ivakegg
Contributor Author

ivakegg commented May 29, 2020

The experiment worked well. We simply have the PerTableVolumeChooser being used at the top, and then the PreferredVolumeChooser for the logger. The preferred volumes for the logger are then set differently in the accumulo-site.xml depending on the host. This has the desired effect. That being said, I still think this might be a useful chooser. It would require making the host and port available through the chooser parameters, but that does not look too hard. This is, however, lower priority now, given that our system is working as desired.
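
For anyone wanting to reproduce this, the per-host setup would look roughly like the fragment below. Treat it as a sketch rather than a copy of our config: the property names assume the general.custom.volume.chooser.{scope} and general.custom.volume.preferred.{scope} scheme discussed earlier in this thread and should be checked against your Accumulo version, the volume URIs are reused from the issue description, and the entries are shown in key=value form even though accumulo-site.xml would carry each one as a property element.

```properties
# Identical on every host
general.volume.chooser=org.apache.accumulo.server.fs.PerTableVolumeChooser
general.custom.volume.chooser.logger=org.apache.accumulo.server.fs.PreferredVolumeChooser

# Differs per host: tservers in the first set of hosts
general.custom.volume.preferred.logger=hdfs://cluster1/accumulo

# Tservers in the second set of hosts would instead carry:
# general.custom.volume.preferred.logger=hdfs://cluster2/accumulo
```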


3 participants