Merge coll sysinfo metadata log #2005

portante · 2020-11-21T16:27:00Z

Merge collect sysinfo and metadata log

Merge pbench-collect-sysinfo and pbench-metadata-log operations and behaviors into the Tool Meister framework.

To make it easier to review, this PR was broken into 4 commits:

* All the changes to the agent/bench-scripts directory without unit tests (14 files)
* All the changes to the agent/util-scripts and lib/pbench/agent hierarchies (23 files)
* UML Sequence Diagram updates (1 file)
* All testing related updates (100+ files)

Once reviewed, we merged them back down to a single commit.

Description of combined changes

The last set of ssh commands issued by the pbench-agent that are not involved in orchestrating the Tool Meisters, outside of those issued by the bench-scripts, are for hostname -s data collection on all the registered remote tool hosts.

The Tool Meisters are now given the responsibility for collecting information about their environment, including the version of the pbench agent and a more comprehensive collection of host name information, returning that with their startup acknowledgement payload. This change also addresses a previous bug where registered host labels were not being used in the on-disk data capture directories as they were before.
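As a rough illustration of the kind of host name information a Tool Meister could gather locally instead of relying on a remote `hostname -s` over ssh, here is a minimal sketch. The function name and the dictionary keys are hypothetical and are not taken from the actual Tool Meister code.

```python
import socket


def collect_hostname_info() -> dict:
    """Gather host name details for a startup acknowledgement payload.

    The key names below are illustrative assumptions, not the real payload.
    """
    full = socket.gethostname()
    return {
        # Host name cut at the first dot, which is what `hostname -s` reports.
        "hostname_short": full.split(".", 1)[0],
        # Host name exactly as the kernel reports it.
        "hostname": full,
        # Fully qualified name as resolved by the local resolver.
        "hostname_fqdn": socket.getfqdn(),
    }
```

A payload built this way can be returned alongside the agent version, so the receiving side never has to ssh to the host to ask for its name.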

This collected data set is now recorded by the Tool Data Sink, after restructuring the action flow so that clients send action payloads to the Tool Data Sink, which then forwards those payloads to the Tool Meisters. This gives the Tool Data Sink the ability to collect all the Tool Meister information and write it to disk before acknowledging to the client that startup is complete.
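The gatekeeper flow can be pictured with a small in-process sketch. Everything here is hypothetical: the class name, the payload fields, and the use of plain callables in place of the real transport between client, Tool Data Sink, and Tool Meisters. It only shows the ordering: forward, collect, record, then acknowledge.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a Tool Meister: accepts an action payload and
# returns a reply dictionary containing a "status" field.
ToolMeister = Callable[[dict], dict]


class DataSinkSketch:
    """Illustrative gatekeeper that sits between a client and Tool Meisters."""

    def __init__(self, tool_meisters: Dict[str, ToolMeister]):
        self.tool_meisters = tool_meisters
        self.recorded: List[dict] = []  # stands in for "written to disk"

    def handle(self, payload: dict) -> str:
        # 1. Forward the client's payload to every Tool Meister.
        replies = {host: tm(payload) for host, tm in self.tool_meisters.items()}
        # 2. Record what came back before answering the client.
        self.recorded.append({"action": payload["action"], "replies": replies})
        # 3. Only now acknowledge the client.
        ok = all(r.get("status") == "success" for r in replies.values())
        return "success" if ok else "failure"
```

Because the client only ever talks to the sink, the sink sees every reply and can persist it before the client is told that the action completed.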

We add an entirely new section in the metadata.log file which is per-host/per-tool, recording the tool options and the output from the tool install checks (if any). This section supersedes the previous tool options in the per-host sections, though we leave those options there for compatibility for now.
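To make the idea of a per-host/per-tool section concrete, here is a minimal sketch using Python's `configparser`, assuming an INI-style file. The section naming scheme (`tools/<host>/<tool>`) and the option names are assumptions made for illustration; the real layout of `metadata.log` may differ.

```python
import configparser


def add_tool_section(
    mdlog: configparser.ConfigParser,
    host: str,
    tool: str,
    options: str,
    install_check_output: str = "",
) -> str:
    """Record one tool's options and install-check output for one host.

    Returns the (hypothetical) section name that was written.
    """
    section = f"tools/{host}/{tool}"  # assumed naming scheme
    mdlog[section] = {
        "options": options,
        "install_check_output": install_check_output,
    }
    return section


# Interpolation is disabled so that a literal '%' in tool options is kept as-is.
mdlog = configparser.ConfigParser(interpolation=None)
add_tool_section(mdlog, "host-a.example.com", "mpstat", "--interval=3")
```

Keeping the tool data in its own section means the older per-host sections can stay untouched for compatibility while new readers look at the per-tool section.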

We add graceful handling of bad PUT requests from Tool Meisters to the Tool Data Sink.

With this refactoring, we also collapse all the "sysinfo", "init", and "end" actions into the respective pbench-tool-meister-start and pbench-tool-meister-stop interfaces to simplify the CLI behaviors.

To make this possible, we add a new Client API used by the CLI interfaces pbench-tool-meister-start and pbench-tool-meister-stop.

We have also undertaken a major refactoring of pbench-sysinfo-dump to remove the dependency on base for environment variables. As a result of this work, we drop the stockpile configuration entirely, as it is no longer required.

We have enhanced the Tool Meister tests to use persistent tools, and enhanced the unit test framework to properly clean up the Tool Meister test environment when tests fail.

We have updated the UML sequence diagram describing the Tool Meister operation to add the init/end/sysinfo steps, and the fact that the Tool Data Sink is the gatekeeper of actions, forwarding them on to the Tool Meisters.

Notes

Okay, this is where Python 3 falls down: replacing command line utilities available via bash with Python 3 modules.

Turns out you can't use shutil.copytree() inside a container to replace cp -RL.

That code attempts to use os.setxattr() at the lowest level to copy all the attributes properly. But when not running as a real root user in a container, you can't copy all attributes, and a "Permission denied" exception is raised.

The original pbench-metadata-log code just used cp -rL and it worked both in and out of a container. So we just invoke that command directly again from the new Python code in the tool_data_sink.py module.
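A minimal sketch of that approach, assuming a POSIX `cp` is available on the `PATH`. The helper name is hypothetical and is not the function used in `tool_data_sink.py`.

```python
import subprocess
from pathlib import Path


def copy_tree_following_links(src: Path, dest: Path) -> None:
    """Copy the directory tree at src to dest with `cp -rL`.

    Symbolic links are dereferenced, so dest contains regular files.
    Raises subprocess.CalledProcessError if cp exits with a non-zero status.
    """
    subprocess.run(
        ["cp", "-rL", str(src), str(dest)],
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
```

Passing the command as an argument list avoids shell quoting problems with unusual path names, and `check=True` turns a failed copy into an exception the caller can report.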

agent/base

agent/bench-scripts/pbench-dbench

agent/bench-scripts/pbench-fio

agent/bench-scripts/pbench-linpack

agent/bench-scripts/pbench-migrate

agent/bench-scripts/pbench-netperf

agent/bench-scripts/pbench-trafficgen

agent/bench-scripts/pbench-uperf

agent/util-scripts/pbench-tool-meister-client

agent/util-scripts/pbench-tool-meister-stop

agent/util-scripts/tool-meister/pbench-sysinfo-dump

lib/pbench/agent/utils.py

portante

Still a work-in-progress, as we need to implement the pbench-metadata-log functionality, perhaps as a callable API.

agent/base

lib/pbench/agent/tool_meister_client.py

lib/pbench/agent/tool_data_sink.py

The last set of `ssh` commands issued by the pbench-agent that are not involved in orchestrating the Tool Meisters, outside of those issued by the `bench-scripts`, are for `hostname -s` data collection on all the registered remote tool hosts.

The Tool Meisters are now given the responsibility for collecting information about their environment, including the version of the pbench agent and a more comprehensive collection of host name information, returning that with their startup acknowledgement payload. This change also addresses a previous bug where registered host labels were not being used in the on-disk data capture directories as they were before.

This collected data set is now recorded by the Tool Data Sink, after restructuring the action flow so that clients send action payloads to the Tool Data Sink, which then forwards those payloads to the Tool Meisters. This gives the Tool Data Sink the ability to collect all the Tool Meister information and write it to disk before acknowledging to the client that startup is complete.

We add an entirely new section in the `metadata.log` file which is per-host/per-tool, recording the tool options and the output from the tool install checks (if any). This section supersedes the previous tool options in the per-host sections, though we leave those options there for compatibility for now.

We add graceful handling of bad PUT requests from Tool Meisters to the Tool Data Sink.

With this refactoring, we also collapse all the "sysinfo", "init", and "end" actions into the respective `pbench-tool-meister-start` and `pbench-tool-meister-stop` interfaces to simplify the CLI behaviors.

To make this possible, we add a new `Client` API used by the CLI interfaces `pbench-tool-meister-start` and `pbench-tool-meister-stop`.

We have also undertaken a major refactoring of `pbench-sysinfo-dump` to remove the dependency on `base` for environment variables. As a result of this work, we drop the stockpile configuration entirely, as it is no longer required.

The Tool Meister unit tests were enhanced to use persistent tools, and the unit test framework now properly cleans up the Tool Meister test environment when tests fail.

We moved the `pbench-sysinfo-dump` CLI command to the `agent/util-scripts/tool-meister` directory so that users won't see that internal command in their `PATH`.

We updated the UML sequence diagram to reflect the code, adding the `init`/`end`/`sysinfo` actions, and the fact that the Tool Data Sink is now the gatekeeper of actions, forwarding them on to the Tool Meisters instead of the client sending actions directly to the Tool Meisters.

Since we no longer have the CLI command `pbench-collect-sysinfo`, we drop it from the unit tests and replace it with more appropriate tests. To that end:

* tests 25 - 30 now just invoke `pbench-verify-sysinfo-options`
* tests 54 & 55 just invoke the `--help` option on `pbench-tool-meister-start` and `-stop`
* tests 23 & 24 are dropped

The Tool Meisters need to start their persistent tools first before the collectors can run correctly. The PCP pmlogger collectors wait for the remote pmcds to start listening on the expected ports. We have to pass along the "init" action to the Tool Meisters before we set up the PCP collectors.

The `pbench-tool-meister-stop` command will now return an error when the `end` operation fails, or if we can't create the directory for the `end` operation to work.

We always wait for the local Tool Data Sink and local Tool Meister to exit before killing the Redis server, regardless of the success or failure of the `terminate` operation.
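The shutdown ordering described for `pbench-tool-meister-stop` can be sketched as follows. The four callables are hypothetical placeholders for the real operations; the sketch only illustrates that a failed `end` or `terminate` produces an error status, while the wait and the Redis shutdown still happen in order.

```python
from typing import Callable


def stop_sequence(
    end_op: Callable[[], bool],
    terminate_op: Callable[[], bool],
    wait_for_local_exit: Callable[[], None],
    kill_redis: Callable[[], None],
) -> int:
    """Return 0 on success, 1 if the end or terminate operation failed."""
    ret = 0
    if not end_op():
        ret = 1
    try:
        if not terminate_op():
            ret = 1
    finally:
        # Runs whether terminate succeeded, failed, or raised: first wait for
        # the local Tool Data Sink and Tool Meister, then stop Redis.
        wait_for_local_exit()
        kill_redis()
    return ret
```

Putting the wait and the Redis shutdown in the `finally` clause keeps the ordering fixed even when `terminate` fails, so the local processes are not cut off from Redis while they are still exiting.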

Maxusmusti

Re-tested with local and remote node, transient tools + persistent tools, works as expected

dbutenhof

Doesn't appear that GitHub thinks anything significant has changed since my last review, so let's get it in.

portante added this to the v0.71 milestone Nov 21, 2020

portante self-assigned this Nov 21, 2020

portante force-pushed the merge-coll-sysinfo-metadata-log branch 2 times, most recently from 9af8657 to 048a86a November 23, 2020 20:35

portante commented Nov 23, 2020

agent/base (outdated)

portante commented Nov 23, 2020

agent/bench-scripts/pbench-dbench

portante commented Nov 23, 2020

agent/bench-scripts/pbench-fio