Explore supporting external compaction processes in accumulo-cluster script #2138

Closed
keith-turner opened this issue Jun 2, 2021 · 16 comments · Fixed by #2238
@keith-turner
Contributor

The accumulo-cluster script could be modified to support starting coordinator and compactor processes for external compactions if the corresponding host files exist. Not sure if this is a good idea in general, but it seemed worth considering. While trying to determine whether this is worth pursuing, I came up with the following list of pros and cons (a rough sketch of what the script check could look like follows the list). What other things are there to consider?

Pros

  • Makes it easier to start these processes on a cluster for testing.

Cons

  • More to maintain in the accumulo-cluster script; this may only be an issue if there is specialization for compactors.
  • Compactor processes run against a queue. This may not fit well w/ the accumulo-cluster script, so we would need to figure out an elegant way to handle it.
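
As a rough illustration of the kind of check being considered (a minimal sketch; the coordinator/compactor file names and the start_service helper are made up for illustration):

# hypothetical: only start external compaction processes when their host files exist
if [[ -f "${ACCUMULO_CONF_DIR}/coordinator" ]]; then
  for host in $(grep -Ev '^[[:space:]]*(#|$)' "${ACCUMULO_CONF_DIR}/coordinator"); do
    start_service "$host" coordinator    # start_service is a made-up helper name
  done
fi
if [[ -f "${ACCUMULO_CONF_DIR}/compactor" ]]; then
  for host in $(grep -Ev '^[[:space:]]*(#|$)' "${ACCUMULO_CONF_DIR}/compactor"); do
    start_service "$host" compactor      # still leaves open how to pass the queue
  done
fi
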
@keith-turner keith-turner added the enhancement This issue describes a new feature, improvement, or optimization. label Jun 2, 2021
@ctubbsii
Member

ctubbsii commented Jun 3, 2021

Rather than create new hosts files, I would like to see the hosts files consolidated into a single accumulo-cluster.conf file which includes the hosts for all the services the accumulo-cluster script manages (there may be an existing ticket for this... I didn't check before writing this). If host configuration were streamlined into a single config file, it would be easier to specify the startup of additional services, perhaps even with separate config for each host.

@keith-turner
Contributor Author

keith-turner commented Jun 3, 2021

I would like to see the hosts files consolidated into a single accumulo-cluster.conf

That sounds interesting. Any thoughts on the structure, and how compactor+queue info might fit into it? For compactors I was thinking of the following possibilities (the first two are sketched after option 3's example below); I had not considered a single file.

  1. A dir named compactors with a file under it for each queue. Each file's name would be the queue name and its contents would be the hosts on which to start compactors.
  2. A file named compactors whose contents are lines like <host>,<queue>. This would repeat information a lot and seems error-prone.
  3. A file named compactors whose contents have sections, where each section header is a queue name and its contents are hosts. Like
[Q1]
host1
host2

[Q2]
host3
host4

(comment edited by @ctubbsii to view format of option 3 better, and to number the options, to make it easier to reference)
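
For comparison, a minimal sketch of how options 1 and 2 might look on disk (hypothetical queue and host names, separate directories used only so both examples can be run side by side):

# option 1: a compactors/ directory with one file per queue (file name = queue, contents = hosts)
mkdir -p option1/compactors
printf 'host5\nhost6\n' > option1/compactors/q1
printf 'host7\nhost8\n' > option1/compactors/q2

# option 2: a single compactors file with one <host>,<queue> pair per line
mkdir -p option2
printf 'host5,q1\nhost6,q1\nhost7,q2\nhost8,q2\n' > option2/compactors
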

@ctubbsii
Member

ctubbsii commented Jun 3, 2021

Of the three, option 3 is closest to what I was thinking. However, if it were done in a single file, it would strongly depend on the format of that file. If it were YAML, it might look something like:

managers:
 - host1

gc:
 - host1

tservers:
 - host2
 - host3

compaction:
  coordinator: host4
  queues:
   -
     name: queue1
     workers: [ host5, host6 ]
   -
     name: queue2
     workers:
      - host7
      - host8
      - host9

@dlmarion
Contributor

Would you expect that yamllint and yq be prerequisites for installing Accumulo so that we could use them to validate and parse the yaml file?

@keith-turner
Contributor Author

I have not done anything w/ yaml using command line tools. I wonder if there are tools that would make it possible to convert the existing files to yaml in a few lines of bash. If so, the script could detect the old files and just print an error message w/ the suggested commands to run to convert them.
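
A minimal sketch of such a conversion, assuming the existing one-host-per-line files and ignoring comments and blank lines (file and directory names here are illustrative):

# convert the old per-service host files into a single YAML file
out=cluster.yaml
: > "$out"
for f in managers monitors gc tservers; do
  [[ -f "conf/$f" ]] || continue
  echo "$f:" >> "$out"
  grep -Ev '^[[:space:]]*(#|$)' "conf/$f" | sed 's/^/  - /' >> "$out"
done
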

@dlmarion
Contributor

dlmarion commented Jun 23, 2021

Install yamllint and yq:

sudo yum install yamllint
wget https://github.com/mikefarah/yq/releases/download/v4.9.6/yq_linux_amd64 -O yq

So, for this file:

---
managers:
  - localhost1
  - localhost2

monitors:
  - localhost1
  - localhost2

tracer: # EDIT by @ctubbsii : this should be 'tracer' not 'trace', since this refers to the accumulo-tracer service.
  - localhost

gc:
  - localhost

tservers:
  - localhost1
  - localhost2
  - localhost3
  - localhost4

# compaction:
#   coordinators:
#     - localhost1
#     - localhost2
#   compactors:
#     - queue: q1
#       hosts:
#         - localhost1
#         - localhost2
#     - queue: q2
#       hosts:
#         - localhost1
#         - localhost2

These commands work:

yamllint -s hosts.yaml      # 0 == success, anything else is error and malformed
managers=$(yq eval '.managers | join(",")' hosts.yaml) 
monitors=$(yq eval '.monitors | join(",")' hosts.yaml)
gc=$(yq eval '.gc | join(",")' hosts.yaml)
tservers=$(yq eval '.tservers | join(",")' hosts.yaml)
yq eval '.compaction' hosts.yaml     #  null --> means it's commented out
coordinators=$(yq eval '.compaction.coordinators | join(",")' hosts.yaml)
queues=$(yq eval '[.compaction.compactors.[].queue] | join(",")' hosts.yaml)
queue1_hosts=$(yq eval '.compaction.compactors.[] | select (.queue == "q1") | .hosts | join(",")' hosts.yaml)
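
Building on those, a sketch of how the script might consume the extracted values (my assumption, using the same hosts.yaml):

# split the comma-separated queue list and look up the hosts for each queue
IFS=',' read -ra queue_names <<< "$queues"
for q in "${queue_names[@]}"; do
  hosts=$(yq eval ".compaction.compactors.[] | select(.queue == \"$q\") | .hosts | join(\",\")" hosts.yaml)
  echo "queue $q -> hosts $hosts"
done
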

@ctubbsii
Member

Would you expect that yamllint and yq be prerequisites for installing Accumulo so that we could use them to validate and parse the yaml file?

I've never used these. I'm not sure what we'd use to parse. However, I don't think it'd be unreasonable to add a yaml parser dependency for the accumulo-cluster script. (Note: I specified that it'd be for the accumulo-cluster script, because I really think of that as a separate cluster-management utility that ships alongside Accumulo, but is not part of Accumulo. It's optional, and users may wish to use a completely different cluster management mechanism. These scripts are only included in the tarball because we've trained users to expect some kind of cluster management scripts ... aka 'start-all' ... from the early days. My preference is that they live in a separate repo, or under a contrib folder. I mention this here only to clarify the distinction between what is "Accumulo" and what is "accumulo-cluster", so as to keep the interface between the two clean and avoid tight coupling.)

@ctubbsii
Member

I am concerned about adding yq as a dependency, since it's not conveniently packaged for RPM-based distros. I wonder if shyaml might be better, which wraps pyyaml.
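
For reference, shyaml reads YAML from stdin; a minimal sketch against the example file above (my assumption of how it would be used, with the compaction section uncommented) might look like:

# get-values prints one list element per line; paste joins them with commas
tservers=$(shyaml get-values tservers < hosts.yaml | paste -sd, -)
coordinators=$(shyaml get-values compaction.coordinators < hosts.yaml | paste -sd, -)
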

@dlmarion
Contributor

It looks like installing shyaml and pyyaml requires first installing python and pip, and in some environments you may need to specially configure pip. yq is a single binary. It looks like rpm is the only format in which it's not packaged, although I found easy instructions for building one. It can be installed on CentOS / RHEL via Snap: https://snapcraft.io/install/yq/rhel

@ctubbsii
Member

If you do sudo dnf install shyaml, it automatically pulls in dependencies. You don't need to install pip at all, and since dnf/yum already depends on python, it will already be installed. That said, I already use snaps for a few things, so I didn't have a problem installing the snap for yq. So, whatever you think would be easiest for users. If you prefer, we could also just do JSON instead of YAML, only because there's a ton of parsers already available.

We could also do something in Java. We already have commons-configuration2 on the classpath, and it supports YAML. We already call Java in the accumulo-cluster script for ZooZap. It wouldn't be hard to write something small that parses our config file for us and extracts the relevant bits.

@dlmarion
Contributor

If you prefer, we could also just do JSON instead of YAML, only because there's a ton of parsers already available.

JSON would work too. Unfortunately, JSON does not allow comments in files, so you would need two different files depending on whether or not you are using external compactions. The examples below are with and without external compactions.

{
  "managers": [
    "localhost1",
    "localhost2"
    ],
  "monitors": [
    "localhost1",
    "localhost2"
    ],
  "tracers": [
    "localhost1"
    ],
  "gc": [
    "localhost1"
    ],
  "tservers": [
    "localhost1",
    "localhost2",
    "localhost3",
    "localhost4"
    ],
  "compaction": {
    "coordinators": [
      "localhost1",
      "localhost2"
      ],
    "compactors": [
      {
        "queue": "q1",
        "hosts": [
          "localhost1",
          "localhost2"
          ]
      },
      {
        "queue": "q2",
        "hosts": [
          "localhost1",
          "localhost2"
          ]
      }
    ]
  }
}

{
  "managers": [
    "localhost1",
    "localhost2"
    ],
  "monitors": [
    "localhost1",
    "localhost2"
    ],
  "tracers": [
    "localhost1"
    ],
  "gc": [
    "localhost1"
    ],
  "tservers": [
    "localhost1",
    "localhost2",
    "localhost3",
    "localhost4"
    ]
}

yq is an extension of jq, and I believe that jq packages are more widely available (see https://stedolan.github.io/jq/download/). The following commands using jq work (very similar to the yq examples above):

managers=$(jq '.managers | join(",")' hosts.json)
monitors=$(jq '.monitors | join(",")' hosts.json)
gc=$(jq '.gc | join(",")' hosts.json)
tracers=$(jq '.tracers | join(",")' hosts.json)
tservers=$(jq '.tservers | join(",")' hosts.json)
jq '.compaction' hosts.json  # null means it does not exist
coordinators=$(jq '.compaction.coordinators | join(",")' hosts-compaction.json)
queues=$(jq '.compaction.compactors | [.[].queue] | join(",")' hosts-compaction.json)
queue1_hosts=$(jq '.compaction.compactors | .[] | select(.queue == "q1") | .hosts | join(",")' hosts-compaction.json)
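
One small caveat with jq (not mentioned above): without -r it prints JSON-encoded strings, so the assignments above capture values wrapped in double quotes. Adding -r emits raw text, which is usually what a script wants:

# -r outputs the raw (unquoted) string
managers=$(jq -r '.managers | join(",")' hosts.json)
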

@ctubbsii
Member

JSON is uglier, but more ubiquitous. Ultimately, I think I'd prefer YAML, but if there are things we can do to make it depend on less (such as having a bit of java code using one of our existing dependencies) to make the bar lower, that would be nice. If not, I think I'd still rather users get yq or shyaml themselves than force the ugliness of JSON on them.

We could also do simple properties files, but the schema would be a little bit frustrating. Commons-configuration2 could easily manage it, though... but if we're going that route, it'd be better to just use YAML, since commons-configuration2 can handle that too.

manager.hosts=managerhost1,managerhost2
...
compaction.coordinator.1=server1
compaction.coordinator.2=server2
compaction.queues.q1.compactors=server3,server4
compaction.queues.q2.compactors=server5,server6
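
Even without commons-configuration2, a flat properties format like that is easy enough to pick apart in bash; a rough sketch (the file name is hypothetical):

# pull one property's value and split it on commas
q1_compactors=$(grep '^compaction.queues.q1.compactors=' accumulo-cluster.properties | cut -d= -f2-)
IFS=',' read -ra q1_hosts <<< "$q1_compactors"
echo "q1 compactor hosts: ${q1_hosts[*]}"
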

@keith-turner
Contributor Author

keith-turner commented Jul 7, 2021

One approach to supporting yaml is to write a Java program that parses the yaml and emits text that is easy to deal with in bash. This avoids the dependency on the yq or shyaml commands, while possibly introducing a yaml dep in java. Then we could do something like the following in the accumulo-cluster bash script.

accumulo org.apache.accumulo.core.cluster.YamlExtract <cluster conf file>

The java program could extract specific information based on arguments, OR convert the yaml to CSV like the following, which is easy to process in bash.

tserver,host1
tserver,host2
tserver,host3
manager,host4
manager,host7
gc,host5
monitor,host6
compactor,host1,Q1
compactor,host2,Q2
compactor,host3,Q2
compactor,host3,Q2
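
For example, a sketch of how the bash side might consume that CSV (start_service and the cluster.yaml file name are made up for illustration):

# third field is only present for compactor lines
while IFS=, read -r type host queue; do
  case "$type" in
    compactor) start_service "$host" compactor "$queue" ;;
    *)         start_service "$host" "$type" ;;
  esac
done < <(accumulo org.apache.accumulo.core.cluster.YamlExtract cluster.yaml)
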

Not sure if it's better to make the java program smart w/ lots of options, or make it dumb and have it spit out something like CSV, moving more logic to bash. Making the java program smarter would look like the following, where the list of tservers is extracted from the yaml file.

$ accumulo org.apache.accumulo.core.cluster.YamlExtract --server-type tserver <cluster conf file>
host1
host2
host3
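
The smarter variant is also easy to consume from bash; for example (a sketch, again with an assumed cluster.yaml file name):

# capture the newline-delimited host list in an array
mapfile -t tserver_hosts < <(accumulo org.apache.accumulo.core.cluster.YamlExtract --server-type tserver cluster.yaml)
for host in "${tserver_hosts[@]}"; do
  echo "would start a tserver on $host"
done
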

@ctubbsii
Member

ctubbsii commented Jul 7, 2021

@keith-turner that's basically what I was thinking, since we probably have everything we need on the class path already in lib/. I wasn't sure which way to go either, but I think I prefer keeping the bash simpler by having arguments. Emitting lists of things as newline- or space-delimited makes it easier to loop over, if necessary, in bash. (But we'd want to avoid making it so smart that it's effectively another yq or shyaml implementation... we're not trying to write a generic utility.)

@keith-turner
Contributor Author

@keith-turner that's basically what I was thinking, since we probably have everything we need on the class path already in lib/.

I didn't see your earlier message; I see it now. It seems like doing the heavy lifting in Java is the best way to go, especially since we already have the needed deps on the classpath.

I think I prefer keeping the bash simpler by having arguments.

I think I do too. I also agree that we want to avoid creating something like yq or shyaml in Java. The Java code and options should be tailored to the accumulo cluster file format.

dlmarion added a commit to dlmarion/accumulo that referenced this issue Aug 20, 2021
Modified the cluster start/stop scripts to use a new file (cluster.yml)
for defining the hosts that will run the different server components. Added
a class (and test) that parses the yaml file into a form that is usable by
the scripts. Added ability to specify and start/stop the external compaction
server processes.

Closes apache#2138
@dlmarion
Contributor

@keith-turner @ctubbsii - take a look at the draft PR and let me know if this is what you had in mind. I have not tested it yet.

@dlmarion dlmarion self-assigned this Aug 27, 2021
dlmarion added a commit that referenced this issue Sep 30, 2021
Modified the cluster start/stop scripts to use a new file (cluster.yaml)
for defining the hosts that will run the different server components. Added
a class (and test) that parses the yaml file into a form that is usable by
the scripts. Added ability to specify and start/stop the external compaction
server processes.

Closes #2138


Co-authored-by: Keith Turner <kturner@apache.org>
@ctubbsii ctubbsii added this to To do in 2.1.0 via automation Sep 30, 2021
@ctubbsii ctubbsii moved this from To do to Done in 2.1.0 Sep 30, 2021