
Conversation

@DarrylWong (Contributor) commented May 15, 2025

This change adds cgroup v2 as a prometheus scrape target and starts using it in the failure-injection/smoke-test roachtest. See commits for details.

Fixes: #144052
Release note: none


Example from a read + write disk stall on node 1. (n2 sees no reads because we don't evict the stores for it)

[image: Grafana dashboard showing read/write bytes during the disk stall on n1]

@cockroach-teamcity (Member) commented
This change is Reviewable

@DarrylWong (Contributor, Author) commented May 15, 2025

I do have a working example of this test using the clusterstats package to query for read/write bytes if we really want to go that route. But IMO there's so much boilerplate required by clusterstats that I don't think it's necessarily a cleaner solution than what we have right now.

@DarrylWong (Contributor, Author) commented

I suspect that this happens when the node is overloaded. I noticed it happens less when I run the entire smoke test (not just disk stalls) and never (at least so far) when I rebase off of #144868 to give the cluster more time to recover after restarts.

@DarrylWong force-pushed the add-cgroup-exporter branch 3 times, most recently from 7e55be5 to f7c6376 on May 15, 2025 17:13
@DarrylWong marked this pull request as ready for review May 15, 2025 17:16
@DarrylWong requested a review from a team as a code owner May 15, 2025 17:16
@DarrylWong requested review from herkolategan and srosenberg and removed request for a team May 15, 2025 17:16
@herkolategan (Collaborator) left a comment

Nice work. Just a comment on forking the cgroup exporter repo so we don't flake if it disappears off github.

}

func InstallGoVersion(ctx context.Context, l *logger.Logger, c *SyncedCluster, v string) error {
retryRunOpts := WithNodes(c.Nodes).WithRetryOpts(retry.Options{
Collaborator commented:

Curious why more than the default retry options are required here?

// The first retry is after 5s, the second and final is after 25s
var DefaultRetryOpt = &retry.Options{
	InitialBackoff: 5 * time.Second,
	Multiplier:     5,
	MaxBackoff:     1 * time.Minute,
	// This will run `runWithMaybeRetry` a total of 3 times.
	MaxRetries: 2,
}

@DarrylWong (Contributor, Author) commented

Copied and pasted from

var canaryRetryOptions = retry.Options{
	InitialBackoff: 10 * time.Second,
	Multiplier:     2,
	MaxBackoff:     5 * time.Minute,
	MaxRetries:     10,
}

that the old installGolang function was using. It's very old code though, I think I'll remove it and go with the default.

git clone %[2]s %[1]s
fi
cd %[1]s && git checkout 97b83d6d495b3cb6f959a4368fd93ac342d23706`,
"cgroup-exporter", "https://github.com/arianvp/cgroup-exporter.git")); err != nil {
Collaborator commented:

I wonder if we should rather fork this repo. Could probably do release builds on it then as well with GitHub Actions to maybe avoid the build steps + go installation. Just a thought, but I think forking should still be done to avoid the case where the repo is removed or discontinued, which would end up failing our suite.

@DarrylWong (Contributor, Author) commented

I wonder if we should rather fork this repo

Sounds good, I sent a ticket to fork it.

Could probably do release builds then on it as well with github actions to maybe avoid the build steps + go installation.

Do you know how the binaries for node_exporter were created? (manually?):

rm -rf node_exporter && mkdir -p node_exporter && curl -fsSL \
		https://storage.googleapis.com/cockroach-test-artifacts/prometheus/node_exporter-%s.linux-${ARCH}.tar.gz |
		tar zxv --strip-components 1 -C node_exporter

I don't expect this repo to change very much, so it seems like a nightly build would be overkill (although maybe I misunderstand what GitHub Actions would do).

Member commented:

@golgeek has previously built a really nice (go) packager used exactly for this purpose... maybe we could resurrect it? :)

Do you know how the binaries for node_exporter were created? (manually?):

Copied from the official release: https://github.com/prometheus/prometheus/releases

Contributor commented:

@golgeek has previously built a really nice (go) packager used exactly for this purpose... maybe we could resurrect it? :)

Created an issue: #147352.

We could easily host a .deb package repository on GCP's Artifact Registry with precompiled dependencies for the amd64, arm64, and s390x architectures.

"go1.22.2.linux-s390x.tar.gz": "2b39019481c28c560d65e9811a478ae10e3ef765e0f59af362031d386a71bfef",
}

func InstallGoVersion(ctx context.Context, l *logger.Logger, c *SyncedCluster, v string) error {
@herkolategan (Collaborator) commented May 20, 2025

The default install tool still has:

	"go": `
sudo apt-get update;
sudo apt-get install -y golang-go;`,

Curious if we should still keep that around; maybe it serves the purpose of installing the latest for the OS?
Mostly mentioning because I tried it the other day and it hung indefinitely, but it might just have been a fluke.

@DarrylWong (Contributor, Author) commented

maybe serves the purpose of installing the latest for the OS?

Yeah exactly. Works for me, although I opted against using it since by default it installs:

go version
go version go1.18.1 linux/amd64

I audited the usages of the default install tool and it looks like only the cdc tests use it, to run a webhook-server. There's no reason why it has to use 1.18; it's just running a very simple Go script.

Based on that, I removed the default install tool completely, and switched the cdc tests to install go1.23.

@DarrylWong (Contributor, Author) commented

I do wonder if an "install the same go version that cockroachdb uses" option would be useful. Not because they need to be the same, but more so we avoid running these archaic versions because no one manually bumped it.

func (t *failureSmokeTest) annotateGrafana(
ctx context.Context, l *logger.Logger, c cluster.Cluster, text string,
) error {
if err := c.AddGrafanaAnnotation(ctx, l, grafana.AddAnnotationRequest{
Collaborator commented:

Nit: can probably fire off the requests (one below) in parallel with errgroup or similar to reduce the overhead.
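
A minimal sketch of that suggestion using the standard golang.org/x/sync/errgroup package (the two annotate callbacks are hypothetical placeholders, not the actual roachtest helpers):

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// annotateBoth fires both annotation requests concurrently and returns
// the first error encountered, if any.
func annotateBoth(
	ctx context.Context, text string,
	addExternal, addInternal func(context.Context, string) error,
) error {
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return addExternal(ctx, text) })
	g.Go(func() error { return addInternal(ctx, text) })
	return g.Wait()
}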

@DarrylWong (Contributor, Author) commented

Done!

}

promCfg := &prometheus.Config{}
promCfg.WithPrometheusNode(c.WorkloadNode().InstallNodes()[0]).
Collaborator commented:

Nit / question: this assumes the cluster always has a workload node?

@DarrylWong (Contributor, Author) commented

Not sure if I follow your question here. This is only used by the failure-injection roachtest(s) which both provision a workload node. I do have some opinions on how we can make spinning up prometheus much less boilerplate, although that seems out of the scope of this PR.

@DarrylWong force-pushed the add-cgroup-exporter branch 2 times, most recently from a993a94 to d479960 on May 20, 2025 18:46
@DarrylWong requested review from a team as code owners May 20, 2025 18:46
@DarrylWong force-pushed the add-cgroup-exporter branch 4 times, most recently from 800965a to c3e2861 on May 21, 2025 15:46
if err != nil {
return err
}
req.Tags = append(req.Tags, "roachtest")
Member commented:

Remind me why the tags are different between AddGrafanaAnnotation and AddInternalGrafanaAnnotation.

@DarrylWong (Contributor, Author) commented

In the context of roachtest, the centralized service contains data from many runs, while the internal service should only persist for one run. In the former, we need the cluster id, test id, etc. to differentiate which annotation belongs to which test/cluster.

For the internal service, while we don't need that level of granularity, we still need something to filter on due to how grafana annotations work [1]. I could modify all the internal service dashboards to be compatible with cluster id, but it seemed like overkill for something with no benefit.

[1] You can either add annotations to a specific dashboard, where they always show up, or add them to a workspace and query for them. We go with the latter approach because we have dozens of dashboards and it would be cumbersome to add them all to the request.
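
To make the tag-based querying concrete, here is a minimal sketch of posting a tagged annotation via Grafana's HTTP annotations API (the URL, token, and tag values are placeholders; the roachprod helpers presumably wrap something along these lines):

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// addTaggedAnnotation posts an annotation tagged so that dashboards can
// query for it by tag instead of being listed individually in the request.
func addTaggedAnnotation(grafanaURL, token, text string, tags []string) error {
	body, err := json.Marshal(map[string]any{
		"text": text,
		"tags": tags, // e.g. "roachtest", cluster id, test id
	})
	if err != nil {
		return err
	}
	req, err := http.NewRequest("POST", grafanaURL+"/api/annotations", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("adding annotation: unexpected status %s", resp.Status)
	}
	return nil
}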

// Consider an installation failure to be a flake which is out of
// our control. This should be rare.
err := c.Install(ctx, t.L(), ct.sinkNodes, "go")
err := c.InstallGoVersion(ctx, t.L(), ct.sinkNodes, "1.23.7")
Member commented:

We should use the one from go.mod to make this future-proof.

@DarrylWong (Contributor, Author) commented

Gave it a stab. It seems like there's no built-in Go way to grab the version from go.mod; you have to parse the file itself. E.g., GitHub Actions also seems to resort to parsing the go.mod file.

Unfortunately I think this breaks if there's no go.mod file, e.g. if you are running the binary standalone, away from the repo. I added a fallback to use runtime.Version(), but I'm on the fence about whether this would silently cause unexpected behavior.
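
A minimal sketch of that approach, assuming a plain scan of go.mod for the `go` directive with a runtime.Version() fallback (illustrative only, not the actual roachprod code):

import (
	"bufio"
	"os"
	"runtime"
	"strings"
)

// goVersionFromGoMod returns the version in the `go` directive of the given
// go.mod. If the file can't be read (e.g. the binary is running standalone,
// away from the repo), it falls back to the toolchain that built this binary.
func goVersionFromGoMod(path string) string {
	fallback := strings.TrimPrefix(runtime.Version(), "go")
	f, err := os.Open(path)
	if err != nil {
		return fallback
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if v, ok := strings.CutPrefix(line, "go "); ok {
			return strings.TrimSpace(v)
		}
	}
	return fallback
}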

#!/bin/bash
set -e -o pipefail
wget https://go.dev/dl/go1.23.7.linux-amd64.tar.gz -O /tmp/go.tar.gz
sudo rm -rf /usr/local/go
Member commented:

Good riddance.

// instance.
func (t *failureSmokeTest) annotateGrafana(c cluster.Cluster, rt test.Test, text string) error {
errGroup := rt.NewErrorGroup()
errGroup.Go(func(ctx context.Context, l *logger.Logger) error {
Member commented:

Kinda redundant... a helper to the rescue? E.g., c.AddGrafanaAnnotation(..., all).

@DarrylWong (Contributor, Author) commented

Ok, I refactored c.AddGrafanaAnnotation so that it uses functional options, which includes an option to choose the service.
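
For context, a minimal sketch of the functional options pattern described here (the option names and the service selector are illustrative placeholders, not the actual cluster API):

// AnnotationOption configures how an annotation request is sent.
type AnnotationOption func(*annotationConfig)

type annotationConfig struct {
	internal bool     // target the test's own Grafana instead of the centralized service
	tags     []string // tags appended to the request
}

// WithInternalService targets the cluster's internal Grafana/Prometheus instance.
func WithInternalService() AnnotationOption {
	return func(c *annotationConfig) { c.internal = true }
}

// WithTags appends extra tags to the annotation.
func WithTags(tags ...string) AnnotationOption {
	return func(c *annotationConfig) { c.tags = append(c.tags, tags...) }
}

// addGrafanaAnnotation applies the options before building the request.
func addGrafanaAnnotation(text string, opts ...AnnotationOption) error {
	var cfg annotationConfig
	for _, opt := range opts {
		opt(&cfg)
	}
	// ... build the annotation request from cfg and send it to the chosen service ...
	return nil
}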

WithGrafanaDashboardJSON(dashboard.CgroupIOGrafanaJSON).
WithCgroupExporter(c.CRDBNodes().InstallNodes())

if err = c.StartGrafana(ctx, promSetupLogger, promCfg); err != nil {
Member commented:

Do we really need to spin up internal prometheus and grafana? The change in createLabels adds the dynamic scrape target (via promhelper) to the centralized instance.

@DarrylWong (Contributor, Author) commented

My original goal was to switch the cgroup smoke tests over to use the internal prom instance. However, I put that on hold since I wasn't satisfied with the implementation using clusterstats. I do still want to switch it over eventually, so I figured I might as well keep it in to "smoke test" it visually.

EbpfExporterPort = 9435
// EbpfExporterMetricsPath is the path that EbpfExporter serves metrics on.
EbpfExporterMetricsPath = "/metrics"
// CgroupExporterPort is the port that EbpfExporter listens on.
Member commented:

Nope :) Remnants of copy/paste.

// EbpfExporterMetricsPath is the path that EbpfExporter serves metrics on.
EbpfExporterMetricsPath = "/metrics"
// CgroupExporterPort is the port that EbpfExporter listens on.
CgroupExporterPort = 9101
Member commented:

This one is technically taken, although I have no clue what Bacula Director is/does:

cat /etc/services |grep 9101
bacula-dir	9101/tcp			# Bacula Director

@DarrylWong (Contributor, Author) commented

ubuntu@darrylcockroachlabscom-test-0001:~$ cat /etc/services | grep 9101
bacula-dir	9101/tcp			# Bacula Director
ubuntu@darrylcockroachlabscom-test-0001:~$ cat /etc/services | grep 9102
bacula-fd	9102/tcp			# Bacula File Daemon
ubuntu@darrylcockroachlabscom-test-0001:~$ cat /etc/services | grep 9103
bacula-sd	9103/tcp			# Bacula Storage Daemon
ubuntu@darrylcockroachlabscom-test-0001:~$ cat /etc/services | grep 9104

9104 it is.
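
For illustration, the same /etc/services check done programmatically (a sketch, not part of this PR):

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// portRegistered reports whether the given TCP port is already listed in
// /etc/services, mirroring the manual grep above.
func portRegistered(port int) (bool, error) {
	f, err := os.Open("/etc/services")
	if err != nil {
		return false, err
	}
	defer f.Close()
	needle := fmt.Sprintf("%d/tcp", port)
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // ignore comments
		}
		if strings.Contains(line, needle) {
			return true, nil
		}
	}
	return false, scanner.Err()
}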

@DarrylWong force-pushed the add-cgroup-exporter branch 2 times, most recently from 5763f01 to c6abf08 on May 22, 2025 21:01
@DarrylWong force-pushed the add-cgroup-exporter branch from c6abf08 to f303912 on May 23, 2025 15:00

blathers-crl bot commented May 23, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

This change adds cgroup as a scrape target in prometheus for
the internal grafana service. It uses the cgroup_exporter library
imported from https://github.com/arianvp/cgroup-exporter.
@DarrylWong force-pushed the add-cgroup-exporter branch from 77d05bb to 7b325ac on May 28, 2025 14:50
This change uses the newly added cgroup exporter to spin up
a prometheus instance in the failure-injection/smoke-test roachtest.
Currently this is only used for visual validation through the
internal grafana dashboard, but in the future the test should
use the prom instance for performance assertions.
This adds two small changes to the cgroup disk stall smoke
test in an effort to deflake it.
@DarrylWong force-pushed the add-cgroup-exporter branch from 7b325ac to 6acc011 on May 28, 2025 15:39
@rafiss removed the request for review from a team June 6, 2025 03:41