
CASSANDRASC-111 Improve observability in Sidecar with Dropwizard metrics #102

Merged — 26 commits merged into apache:trunk on Mar 23, 2024

Conversation

@sarankk (Contributor) commented Feb 29, 2024

No description provided.

@sarankk changed the title from "Improve observability in Sidecar with Dropwizard metrics" to "CASSANDRASC-111 Improve observability in Sidecar with Dropwizard metrics" on Feb 29, 2024
@yifan-c (Contributor) left a comment:

First, thank you for adding a metrics framework!

I only skimmed through the first dozen files, but the patch becomes hard to follow because it contains both the new framework and changes to the existing metrics publishing.

I would suggest trimming this patch to contain only the new framework, which is the most valuable part, so we can focus on getting the framework right first. Then you can create follow-up patches that migrate the existing metrics.

Let me know if that makes sense to you.

src/main/dist/conf/sidecar.yaml — two review threads (outdated, resolved)
@sarankk force-pushed the add_metrics branch 2 times, most recently from 1c7e0dd to 177033f (March 12, 2024 21:18)
@sarankk (Contributor, Author) commented Mar 13, 2024:

Thanks for the review, Yifan. Removed the changes to existing metrics.

@sarankk requested a review from @yifan-c on March 13, 2024 00:07
@sarankk requested a review from @yifan-c on March 20, 2024 22:42
Comment on lines -139 to +171
promise.future().onComplete(context.succeedingThenComplete());
promise.future().onComplete(context.failing(v -> {
Contributor:
The assertion change makes sense. How was the test passing before? hmm...

@sarankk (Author):
Yes, I saw a flaky test on CircleCI. Locally it passed when I ran all tests together; if I ran just that test, it always failed.
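For context, a minimal sketch of the two vertx-junit5 assertion styles being swapped above: context.succeedingThenComplete() completes the test only if the future succeeds, while context.failing(...) asserts that it fails and hands the cause to the handler. The assertion body below is hypothetical; it assumes org.assertj.core.api.Assertions.assertThat.

promise.future().onComplete(context.failing(cause -> context.verify(() -> {
    // assert on the expected failure, then mark the test as complete
    assertThat(cause).isNotNull();
    context.completeNow();
})));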

Comment on lines +37 to +39
public class NamedMetric<T extends Metric>
{
public final String canonicalName;
public final T metric;
Contributor:
Can you add a test case for NamedMetric that asserts the metric retrieved with canonicalName is the same metric instance?
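A minimal sketch of such a test, assuming a plain Dropwizard MetricRegistry (com.codahale.metrics), the builder usage seen elsewhere in this patch, and a terminating build() call; the domain and metric name below are made up:

@Test
void metricRegisteredUnderCanonicalNameIsSameInstance()
{
    MetricRegistry registry = new MetricRegistry();
    NamedMetric<Meter> named = NamedMetric.builder(registry::meter)
                                          .withDomain("sidecar.test")           // assumed domain
                                          .withName("named_metric_roundtrip")   // assumed name
                                          .build();                             // assuming the builder exposes build()
    // looking the metric up by its canonical name should return the exact same Meter instance
    assertThat(registry.getMeters().get(named.canonicalName)).isSameAs(named.metric);
}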

diskUsageHighErrors
= NamedMetric.builder(instanceMetricRegistry::meter)
.withDomain(DOMAIN)
.withName("disk_usage_high_errors")
Contributor:
It is the same as insufficient_staging_space. Can you remove this metric and collect with that one instead?

@sarankk (Author), Mar 21, 2024:
It is not the same currently. The code in the upload handler that checks whether sufficient space is available is different from DiskprotectionHandler. If we want to use the same metric, we can modify SSTableUploadHandler.

@frankgh (Contributor) left a comment:
This is a great contribution, thanks for the patch. I've added some comments.

src/main/dist/conf/sidecar.yaml — review thread (outdated, resolved)
rateLimitedCalls
= NamedMetric.builder(metricRegistry::meter)
.withDomain(DOMAIN)
.withName("throttled_429")
Contributor:
Should we make these names constants of the class with public visibility?

@sarankk (Author), Mar 22, 2024:
We expose the full name so it can be used as the metric name during testing. I don't see a use case for just the name suffix elsewhere, wdyt?
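For illustration, a hedged sketch of the reviewer's suggestion; the constant name is hypothetical and the build() call is assumed:

// hypothetical public constant for the name suffix
public static final String THROTTLED_429 = "throttled_429";

rateLimitedCalls = NamedMetric.builder(metricRegistry::meter)
                              .withDomain(DOMAIN)
                              .withName(THROTTLED_429)
                              .build();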

@@ -116,10 +119,15 @@ public void handleInternal(RoutingContext context,
// accept the upload.
httpRequest.pause();

InstanceMetrics instanceMetrics = metadataFetcher.instance(host).metrics();
Contributor:
InstanceMetadata can be null; we usually return 503 Service Unavailable when this occurs. I think we would run into an NPE here.

@sarankk (Author), Mar 22, 2024:
In that case we can have instance(host) throw an unavailable exception. Otherwise we will have to do a null check everywhere before retrieving metrics.

@sarankk (Author), Mar 22, 2024:
It currently does: throw new NoSuchElementException("Instance id " + id + " not found")

Contributor:
We already handle nulls coming from the metadata fetcher elsewhere in the codebase, so I think we should continue handling them.

@sarankk (Author):
ok
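A minimal sketch of the null-handling convention being discussed, assuming the handler has access to the RoutingContext and Netty's io.netty.handler.codec.http.HttpResponseStatus; the exact exception type and message used in the codebase may differ:

InstanceMetadata instance = metadataFetcher.instance(host);
if (instance == null)
{
    // follow the existing convention: respond with 503 instead of letting an NPE propagate
    context.fail(HttpResponseStatus.SERVICE_UNAVAILABLE.code(),
                 new NoSuchElementException("Instance for host " + host + " not found"));
    return;
}
InstanceMetrics instanceMetrics = instance.metrics();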

@@ -101,6 +108,11 @@ public void execute(Promise<Void> promise)

// join always waits until all its futures are completed and will not fail as soon as one of the futures fails
Future.join(futures)
.onComplete(v -> {
int instancesUp = instancesConfig.instances().size() - instanceDown.get();
Contributor:
You don't really need an AtomicInteger to keep track of the down instances; you can check the results instead:

v.result().causes().stream().filter(Objects::nonNull).count()

@sarankk (Author):

But what if the futures fail for some other reason, other than what is caught in the catch block?

Contributor:

How is that possible? We explicitly set the promise to either failed or succeeded.

Contributor:

AtomicInteger instanceDown provides better readability without sacrificing performance. I am leaning toward the current implementation.

@sarankk (Author), Mar 22, 2024:

I tried these changes; we have to process the value of the result to get the cause. Maintaining the count outside seems simpler, wdyt?
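For reference, a hedged sketch of the counting approach that was kept; the per-instance health check (checkInstanceHealth) and its failure handling are hypothetical, and only the counting pattern is the point:

AtomicInteger instanceDown = new AtomicInteger();
List<Future<Void>> futures = instancesConfig.instances()
                                            .stream()
                                            .map(instance -> checkInstanceHealth(instance)               // hypothetical check
                                                             .onFailure(t -> instanceDown.incrementAndGet()))
                                            .collect(Collectors.toList());

// join waits for every future, whether it succeeds or fails
Future.join(futures)
      .onComplete(v -> {
          int instancesUp = instancesConfig.instances().size() - instanceDown.get();
          // report instancesUp and instanceDown.get() through the metrics framework here
      });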

@@ -66,7 +68,7 @@ class RestoreJobManagerTest
@BeforeEach
void setup()
{
Injector injector = Guice.createInjector(new MainModule());
Injector injector = Guice.createInjector(Modules.override(new MainModule()).with(new TestModule()));
Contributor:
Why do we need the test module here? The metrics should be provided in the main module.

@sarankk (Author):

It is needed here because the test doesn't provide all the classes we need; it only provides mock(RestoreJobConfiguration.class). But to build the Vert.x instance we need the sidecar configuration object, and with MainModule the test throws an error saying the yaml was not found. TestModule provides SidecarConfiguration, so we avoid the yaml error.
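For context, a hedged sketch of what the override achieves: TestModule is assumed to supply an in-memory SidecarConfiguration so MainModule's yaml lookup is bypassed. The provider body is hypothetical; only the Guice (com.google.inject) override pattern is taken from the diff.

public class TestModule extends AbstractModule
{
    @Provides
    @Singleton
    public SidecarConfiguration sidecarConfiguration()
    {
        // hypothetical in-memory configuration; the real TestModule may build it differently
        return SidecarConfigurationImpl.builder().build();
    }
}

// in the test setup, TestModule's bindings win over MainModule's
Injector injector = Guice.createInjector(Modules.override(new MainModule()).with(new TestModule()));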

Contributor:

Maybe we need to add these fields to the yaml?

@@ -124,6 +123,7 @@ void tearDown() throws InterruptedException
void failsWhenKeyStoreIsNotConfigured()
{
builder.sslConfiguration(SslConfigurationImpl.builder().enabled(true).build());
vertx = vertx();
Contributor:

Why is this change needed?

@sarankk (Author):

Creating a Vert.x instance now needs SidecarConfiguration to get the metrics options, so I changed this. In these tests the configuration is updated inside each test, and the Vert.x instance is built after that configuration update.

@sarankk (Author):

It is not needed for all test cases, for example some of the failure test cases, but I changed it consistently for easier test reading.
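For illustration, a hedged sketch of why the Vert.x instance now depends on the configuration: the Dropwizard metrics options (io.vertx.ext.dropwizard.DropwizardMetricsOptions) are derived from SidecarConfiguration before Vert.x is built. The option values below are made up.

DropwizardMetricsOptions metricsOptions = new DropwizardMetricsOptions()
                                          .setEnabled(true)
                                          .setRegistryName("cassandra_sidecar"); // assumed registry name
Vertx vertx = Vertx.vertx(new VertxOptions().setMetricsOptions(metricsOptions));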

@sarankk (Contributor, Author) commented Mar 22, 2024:

Thanks for the review, @frankgh. Addressed your comments.

@sarankk requested a review from @frankgh on March 22, 2024 20:32
@frankgh (Contributor) left a comment:

+1 Thanks for addressing the comments.

@frankgh merged commit 056faad into apache:trunk on Mar 23, 2024