Explore supporting external compaction processes in accumulo-cluster script #2138

Closed
keith-turner opened this issue Jun 2, 2021 · 16 comments · Fixed by #2238
@keith-turner
Contributor

The accumulo-cluster script could be modified to support starting coordinator and compactor processes for external compactions if the corresponding host files exist. Not sure if this is a good idea in general, but it seemed worth considering. While trying to determine whether this is worth pursuing, I came up with the following list of pros and cons (a rough sketch of what the script check could look like follows the list). What other things are there to consider?

Pros

  • Makes it easier to start these processes on a cluster for testing.

Cons

  • More to maintain in the accumulo-cluster script; this may only be an issue if there is specialization for compactors.
  • Compactor processes run against a queue. This may not fit well w/ the accumulo-cluster script, so we would need to figure out an elegant way to handle it.
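
As a rough illustration of the kind of check being considered (a minimal sketch; the coordinator/compactor file names and the start_service helper are made up for illustration):

# hypothetical: only start external compaction processes when their host files exist
if [[ -f "${ACCUMULO_CONF_DIR}/coordinator" ]]; then
  for host in $(grep -Ev '^[[:space:]]*(#|$)' "${ACCUMULO_CONF_DIR}/coordinator"); do
    start_service "$host" coordinator    # start_service is a made-up helper name
  done
fi
if [[ -f "${ACCUMULO_CONF_DIR}/compactor" ]]; then
  for host in $(grep -Ev '^[[:space:]]*(#|$)' "${ACCUMULO_CONF_DIR}/compactor"); do
    start_service "$host" compactor      # still leaves open how to pass the queue
  done
fi
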
@keith-turner keith-turner added the enhancement This issue describes a new feature, improvement, or optimization. label Jun 2, 2021
@ctubbsii
Member

ctubbsii commented Jun 3, 2021

Rather than create new hosts files, I would like to see the hosts files consolidated into a single accumulo-cluster.conf file which includes the hosts for all the services the accumulo-cluster script manages (there may be an existing ticket for this... I didn't check before writing this). If host configuration were streamlined into a single config file, it would be easier to specify the startup of additional services, perhaps even with separate config for each host.

@keith-turner
Contributor Author

keith-turner commented Jun 3, 2021

I would like to see the hosts files consolidated into a single accumulo-cluster.conf

That sounds interesting. Any thoughts on the structure, and how compactor+queue info might fit into it? For compactors I was thinking of the following possibilities (the first two are sketched after option 3's example below); I had not considered a single file.

  1. A dir named compactors with a file under it for each queue. Each file's name would be the queue name and its contents would be the hosts on which to start compactors.
  2. A file named compactors whose contents are lines like <host>,<queue>. This would repeat information a lot and seems error-prone.
  3. A file named compactors whose contents have sections, where each section header is a queue name and its contents are hosts. Like
[Q1]
host1
host2

[Q2]
host3
host4

(comment edited by @ctubbsii to view format of option 3 better, and to number the options, to make it easier to reference)
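
For comparison, a minimal sketch of how options 1 and 2 might look on disk (hypothetical queue and host names, separate directories used only so both examples can be run side by side):

# option 1: a compactors/ directory with one file per queue (file name = queue, contents = hosts)
mkdir -p option1/compactors
printf 'host5\nhost6\n' > option1/compactors/q1
printf 'host7\nhost8\n' > option1/compactors/q2

# option 2: a single compactors file with one <host>,<queue> pair per line
mkdir -p option2
printf 'host5,q1\nhost6,q1\nhost7,q2\nhost8,q2\n' > option2/compactors
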

@ctubbsii
Member

ctubbsii commented Jun 3, 2021

Of the three, option 3 is closest to what I was thinking. However, if it were done in a single file, it would strongly depend on the format of that file. If it were YAML, it might look something like:

managers:
 - host1

gc:
 - host1

tservers:
 - host2
 - host3

compaction:
  coordinator: host4
  queues:
   -
     name: queue1
     workers: [ host5, host6 ]
   -
     name: queue2
     workers:
      - host7
      - host8
      - host9

@dlmarion
Contributor

Would you expect that yamllint and yq be prerequisites for installing Accumulo so that we could use them to validate and parse the yaml file?

@keith-turner
Contributor Author

I have not done anything w/ yaml using command line tools. I wonder if there are tools that would make it possible to convert the existing files to yaml in a few lines of bash. If so, the script could detect the old files and just print an error message w/ the suggested commands to run to convert them.
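
A minimal sketch of such a conversion, assuming the existing one-host-per-line files and ignoring comments and blank lines (file and directory names here are illustrative):

# convert the old per-service host files into a single YAML file
out=cluster.yaml
: > "$out"
for f in managers monitors gc tservers; do
  [[ -f "conf/$f" ]] || continue
  echo "$f:" >> "$out"
  grep -Ev '^[[:space:]]*(#|$)' "conf/$f" | sed 's/^/  - /' >> "$out"
done
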

@dlmarion
Contributor

dlmarion commented Jun 23, 2021

Install yamllint and yq:

sudo yum install yamllint
wget https://github.com/mikefarah/yq/releases/download/v4.9.6/yq_linux_amd64 -O yq

So, for this file:

---
managers:
  - localhost1
  - localhost2

monitors:
  - localhost1
  - localhost2

tracer: # EDIT by @ctubbsii : this should be 'tracer' not 'trace', since this refers to the accumulo-tracer service.
  - localhost

gc:
  - localhost

tservers:
  - localhost1
  - localhost2
  - localhost3
  - localhost4

# compaction:
#   coordinators:
#     - localhost1
#     - localhost2
#   compactors:
#     - queue: q1
#       hosts:
#         - localhost1
#         - localhost2
#     - queue: q2
#       hosts:
#         - localhost1
#         - localhost2

These commands work:

yamllint -s hosts.yaml      # 0 == success, anything else is error and malformed
managers=$(yq eval '.managers | join(",")' hosts.yaml) 
monitors=$(yq eval '.monitors | join(",")' hosts.yaml)
gc=$(yq eval '.gc | join(",")' hosts.yaml)
tservers=$(yq eval '.tservers | join(",")' hosts.yaml)
yq eval '.compaction' hosts.yaml     #  null --> means it's commented out
coordinators=$(yq eval '.compaction.coordinators | join(",")' hosts.yaml)
queues=$(yq eval '[.compaction.compactors.[].queue] | join(",")' hosts.yaml)
queue1_hosts=$(yq eval '.compaction.compactors.[] | select (.queue == "q1") | .hosts | join(",")' hosts.yaml)
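
Building on those, a sketch of how the script might consume the extracted values (my assumption, using the same hosts.yaml):

# split the comma-separated queue list and look up the hosts for each queue
IFS=',' read -ra queue_names <<< "$queues"
for q in "${queue_names[@]}"; do
  hosts=$(yq eval ".compaction.compactors.[] | select(.queue == \"$q\") | .hosts | join(\",\")" hosts.yaml)
  echo "queue $q -> hosts $hosts"
done
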

@ctubbsii
Member

Would you expect that yamllint and yq be prerequisites for installing Accumulo so that we could use them to validate and parse the yaml file?

I've never used these. I'm not sure what we'd use to parse. However, I don't think it'd be unreasonable to add a yaml parser dependency for the accumulo-cluster script. (Note: I specified that it'd be for the accumulo-cluster script, because I really think of that as a separate cluster-management utility that ships alongside Accumulo, but is not part of Accumulo. It's optional, and users may wish to use a completely different cluster management mechanism. These scripts are only included in the tarball because we've trained users to expect some kind of cluster management scripts ... aka 'start-all' ... from the early days. My preference is that they live in a separate repo, or under a contrib folder. I mention this here only to clarify the distinction between what is "Accumulo" and what is "accumulo-cluster", so as to keep the interface between the two clean and avoid tight coupling.)

@ctubbsii
Member

I am concerned about adding yq as a dependency, since it's not conveniently packaged for RPM-based distros. I wonder if shyaml might be better, which wraps pyyaml.
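
For reference, shyaml reads YAML from stdin; a minimal sketch against the example file above (my assumption of how it would be used, with the compaction section uncommented) might look like:

# get-values prints one list element per line; paste joins them with commas
tservers=$(shyaml get-values tservers < hosts.yaml | paste -sd, -)
coordinators=$(shyaml get-values compaction.coordinators < hosts.yaml | paste -sd, -)
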

@dlmarion
Contributor

It looks like installing shyaml and pyyaml requires first installing python and pip, and in some environments you may need to specially configure pip. yq is a single binary. It looks like rpm is the only format in which it's not packaged, although I found easy instructions for building one. It can be installed on CentOS / RHEL via Snap: https://snapcraft.io/install/yq/rhel

@ctubbsii
Member

If you do sudo dnf install shyaml, it automatically pulls in dependencies. You don't need to install pip at all, and since dnf/yum already depends on python, it will already be installed. That said, I already use snaps for a few things, so I didn't have a problem installing the snap for yq. So, whatever you think would be easiest for users. If you prefer, we could also just do JSON instead of YAML, only because there's a ton of parsers already available.

We could also do something in Java. We already have commons-configuration2 on the classpath, and it supports YAML. We already call Java in the accumulo-cluster script for ZooZap. It wouldn't be hard to write something small that parses our config file for us and extracts the relevant bits.

@dlmarion
Contributor

If you prefer, we could also just do JSON instead of YAML, only because there's a ton of parsers already available.

JSON would work too. Unfortunately, JSON does not allow comments in files, so you would need two different files depending on whether or not you are using external compactions. The examples below are with and without external compactions.

{
  "managers": [
    "localhost1",
    "localhost2"
    ],
  "monitors": [
    "localhost1",
    "localhost2"
    ],
  "tracers": [
    "localhost1"
    ],
  "gc": [
    "localhost1"
    ],
  "tservers": [
    "localhost1",
    "localhost2",
    "localhost3",
    "localhost4"
    ],
  "compaction": {
    "coordinators": [
      "localhost1",
      "localhost2"
      ],
    "compactors": [
      {
        "queue": "q1",
        "hosts": [
          "localhost1",
          "localhost2"
          ]
      },
      {
        "queue": "q2",
        "hosts": [
          "localhost1",
          "localhost2"
          ]
      }
    ]
  }
}

{
  "managers": [
    "localhost1",
    "localhost2"
    ],
  "monitors": [
    "localhost1",
    "localhost2"
    ],
  "tracers": [
    "localhost1"
    ],
  "gc": [
    "localhost1"
    ],
  "tservers": [
    "localhost1",
    "localhost2",
    "localhost3",
    "localhost4"
    ]
}

yq is an extension of jq, and I believe that jq packages are more widely available (see https://stedolan.github.io/jq/download/). The following commands using jq work (very similar to the yq examples above):

managers=$(jq '.managers | join(",")' hosts.json)
monitors=$(jq '.monitors | join(",")' hosts.json)
gc=$(jq '.gc | join(",")' hosts.json)
tracers=$(jq '.tracers | join(",")' hosts.json)
tservers=$(jq '.tservers | join(",")' hosts.json)
jq '.compaction' hosts.json  # null means it does not exist
coordinators=$(jq '.compaction.coordinators | join(",")' hosts-compaction.json)
queues=$(jq '.compaction.compactors | [.[].queue] | join(",")' hosts-compaction.json)
queue1_hosts=$(jq '.compaction.compactors | .[] | select(.queue == "q1") | .hosts | join(",")' hosts-compaction.json)
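
One small caveat with jq (not mentioned above): without -r it prints JSON-encoded strings, so the assignments above capture values wrapped in double quotes. Adding -r emits raw text, which is usually what a script wants:

# -r outputs the raw (unquoted) string
managers=$(jq -r '.managers | join(",")' hosts.json)
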

@ctubbsii
Member

JSON is uglier, but more ubiquitous. Ultimately, I think I'd prefer YAML, but if there are things we can do to make it depend on less (such as having a bit of java code using one of our existing dependencies) to make the bar lower, that would be nice. If not, I think I'd still rather users get yq or shyaml themselves than force the ugliness of JSON on them.

We could also do simple properties files, but the schema would be a little bit frustrating. Commons-configuration2 could easily manage it, though... but if we're going that route, it'd be better to just use YAML, since commons-configuration2 can handle that too.

manager.hosts=managerhost1,managerhost2
...
compaction.coordinator.1=server1
compaction.coordinator.2=server2
compaction.queues.q1.compactors=server3,server4
compaction.queues.q2.compactors=server5,server6
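
Even without commons-configuration2, a flat properties format like that is easy enough to pick apart in bash; a rough sketch (the file name is hypothetical):

# pull one property's value and split it on commas
q1_compactors=$(grep '^compaction.queues.q1.compactors=' accumulo-cluster.properties | cut -d= -f2-)
IFS=',' read -ra q1_hosts <<< "$q1_compactors"
echo "q1 compactor hosts: ${q1_hosts[*]}"
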

@keith-turner
Contributor Author

keith-turner commented Jul 7, 2021

One approach to supporting yaml is to write a Java program that parses the yaml and emits text that is easy to deal with in bash. This avoids the dependency on the yq or shyaml commands, while possibly introducing a yaml dep in java. Then we could do something like the following in the accumulo-cluster bash script.

accumulo org.apache.accumulo.core.cluster.YamlExtract <cluster conf file>

The java program could extract specific information based on arguments, OR convert the yaml to CSV like the following, which is easy to process in bash.

tserver,host1
tserver,host2
tserver,host3
manager,host4
manager,host7
gc,host5
monitor,host6
compactor,host1,Q1
compactor,host2,Q2
compactor,host3,Q2
compactor,host3,Q2
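
For example, a sketch of how the bash side might consume that CSV (start_service and the cluster.yaml file name are made up for illustration):

# third field is only present for compactor lines
while IFS=, read -r type host queue; do
  case "$type" in
    compactor) start_service "$host" compactor "$queue" ;;
    *)         start_service "$host" "$type" ;;
  esac
done < <(accumulo org.apache.accumulo.core.cluster.YamlExtract cluster.yaml)
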

Not sure if it's better to make the java program smart w/ lots of options, or make it dumb and have it spit out something like CSV, moving more logic to bash. Making the java program smarter would look like the following, where the list of tservers is extracted from the yaml file.

$ accumulo org.apache.accumulo.core.cluster.YamlExtract --server-type tserver <cluster conf file>
host1
host2
host3
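
The smarter variant is also easy to consume from bash; for example (a sketch, again with an assumed cluster.yaml file name):

# capture the newline-delimited host list in an array
mapfile -t tserver_hosts < <(accumulo org.apache.accumulo.core.cluster.YamlExtract --server-type tserver cluster.yaml)
for host in "${tserver_hosts[@]}"; do
  echo "would start a tserver on $host"
done
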

@ctubbsii
Member

ctubbsii commented Jul 7, 2021

@keith-turner that's basically what I was thinking, since we probably have everything we need on the class path already in lib/. I wasn't sure which way to go either, but I think I prefer keeping the bash simpler by having arguments. Emitting lists of things as newline- or space-delimited makes it easier to loop over, if necessary, in bash. (But we'd want to avoid making it so smart that it's effectively another yq or shyaml implementation... we're not trying to write a generic utility.)

@keith-turner
Contributor Author

@keith-turner that's basically what I was thinking, since we probably have everything we need on the class path already in lib/.

I didn't see your earlier message; I see it now. It seems like doing the heavy lifting in Java is the best way to go, especially since we already have the needed deps on the classpath.

I think I prefer keeping the bash simpler by having arguments.

I think I do too. I also agree that we want to avoid creating something like yq or shyaml in Java. The Java code and options should be tailored to the accumulo cluster file format.

dlmarion added a commit to dlmarion/accumulo that referenced this issue Aug 20, 2021
Modified the cluster start/stop scripts to use a new file (cluster.yml)
for defining the hosts that will run the different server components. Added
a class (and test) that parses the yaml file into a form that is usable by
the scripts. Added ability to specify and start/stop the external compaction
server processes.

Closes apache#2138
@dlmarion
Contributor

@keith-turner @ctubbsii - take a look at the draft PR and let me know if this is what you had in mind. I have not tested it yet.

@dlmarion dlmarion self-assigned this Aug 27, 2021
dlmarion added a commit that referenced this issue Sep 30, 2021
Modified the cluster start/stop scripts to use a new file (cluster.yaml)
for defining the hosts that will run the different server components. Added
a class (and test) that parses the yaml file into a form that is usable by
the scripts. Added ability to specify and start/stop the external compaction
server processes.

Closes #2138


Co-authored-by: Keith Turner <kturner@apache.org>
@ctubbsii ctubbsii added this to To do in 2.1.0 via automation Sep 30, 2021
@ctubbsii ctubbsii moved this from To do to Done in 2.1.0 Sep 30, 2021