
CASSANDRASC-111 Improve observability in Sidecar with Dropwizard metrics #102

Merged — 26 commits merged into apache:trunk on Mar 23, 2024

Conversation

@sarankk (Contributor) commented Feb 29, 2024

No description provided.

@sarankk changed the title from "Improve observability in Sidecar with Dropwizard metrics" to "CASSANDRASC-111 Improve observability in Sidecar with Dropwizard metrics" on Feb 29, 2024
@yifan-c (Contributor) left a comment:

First, thank you for adding a metrics framework!

I only skimmed through the first dozen files, but the patch becomes hard to follow because it contains both the new framework and changes to the existing metrics publishing.

I would suggest trimming this patch to contain only the new framework, which is the most valuable part, so we can focus on getting the framework right first. Then you can create follow-up patches that migrate the existing metrics.

Let me know if that makes sense to you.

src/main/dist/conf/sidecar.yaml — two review threads (outdated, resolved)
@sarankk force-pushed the add_metrics branch 2 times, most recently from 1c7e0dd to 177033f (March 12, 2024 21:18)
@sarankk (Contributor, Author) commented Mar 13, 2024:

Thanks for the review, Yifan. Removed the changes to existing metrics.

@sarankk requested a review from @yifan-c on March 13, 2024 00:07
@sarankk requested a review from @yifan-c on March 20, 2024 22:42
Comment on lines -139 to +171
promise.future().onComplete(context.succeedingThenComplete());
promise.future().onComplete(context.failing(v -> {
Contributor:
The assertion change makes sense. How was the test passing before? hmm...

@sarankk (Author):
Yes, I saw a flaky test on CircleCI. Locally it passed when I ran all tests together; if I ran just that test, it always failed.
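For context, a minimal sketch of the two vertx-junit5 assertion styles being swapped above: context.succeedingThenComplete() completes the test only if the future succeeds, while context.failing(...) asserts that it fails and hands the cause to the handler. The assertion body below is hypothetical; it assumes org.assertj.core.api.Assertions.assertThat.

promise.future().onComplete(context.failing(cause -> context.verify(() -> {
    // assert on the expected failure, then mark the test as complete
    assertThat(cause).isNotNull();
    context.completeNow();
})));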

Comment on lines +37 to +39
public class NamedMetric<T extends Metric>
{
public final String canonicalName;
public final T metric;
Contributor:
Can you add a test case for NamedMetric that asserts the metric retrieved with canonicalName is the same metric instance?
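A minimal sketch of such a test, assuming a plain Dropwizard MetricRegistry (com.codahale.metrics), the builder usage seen elsewhere in this patch, and a terminating build() call; the domain and metric name below are made up:

@Test
void metricRegisteredUnderCanonicalNameIsSameInstance()
{
    MetricRegistry registry = new MetricRegistry();
    NamedMetric<Meter> named = NamedMetric.builder(registry::meter)
                                          .withDomain("sidecar.test")           // assumed domain
                                          .withName("named_metric_roundtrip")   // assumed name
                                          .build();                             // assuming the builder exposes build()
    // looking the metric up by its canonical name should return the exact same Meter instance
    assertThat(registry.getMeters().get(named.canonicalName)).isSameAs(named.metric);
}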

diskUsageHighErrors
= NamedMetric.builder(instanceMetricRegistry::meter)
.withDomain(DOMAIN)
.withName("disk_usage_high_errors")
Contributor:
It is the same as insufficient_staging_space. Can you remove this metric and collect with that one instead?

@sarankk (Author), Mar 21, 2024:
It is not the same currently. The code in the upload handler that checks whether sufficient space is available is different from DiskprotectionHandler. If we want to use the same metric, we can modify SSTableUploadHandler.

@frankgh (Contributor) left a comment:
This is a great contribution, thanks for the patch. I've added some comments.

src/main/dist/conf/sidecar.yaml — review thread (outdated, resolved)
rateLimitedCalls
= NamedMetric.builder(metricRegistry::meter)
.withDomain(DOMAIN)
.withName("throttled_429")
Contributor:
Should we make these names constants of the class with public visibility?

@sarankk (Author), Mar 22, 2024:
We expose the full name so it can be used as the metric name during testing. I don't see a use case for just the name suffix elsewhere, wdyt?
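For illustration, a hedged sketch of the reviewer's suggestion; the constant name is hypothetical and the build() call is assumed:

// hypothetical public constant for the name suffix
public static final String THROTTLED_429 = "throttled_429";

rateLimitedCalls = NamedMetric.builder(metricRegistry::meter)
                              .withDomain(DOMAIN)
                              .withName(THROTTLED_429)
                              .build();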

@@ -116,10 +119,15 @@ public void handleInternal(RoutingContext context,
// accept the upload.
httpRequest.pause();

InstanceMetrics instanceMetrics = metadataFetcher.instance(host).metrics();
Contributor:
InstanceMetadata can be null; we usually return 503 Service Unavailable when this occurs. I think we would run into an NPE here.

@sarankk (Author), Mar 22, 2024:
In that case we can have instance(host) throw an unavailable exception. Otherwise we will have to do a null check everywhere before retrieving metrics.

@sarankk (Author), Mar 22, 2024:
It currently does: throw new NoSuchElementException("Instance id " + id + " not found")

Contributor:
We already handle nulls coming from the metadata fetcher elsewhere in the codebase, so I think we should continue handling them.

@sarankk (Author):
ok
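A minimal sketch of the null-handling convention being discussed, assuming the handler has access to the RoutingContext and Netty's io.netty.handler.codec.http.HttpResponseStatus; the exact exception type and message used in the codebase may differ:

InstanceMetadata instance = metadataFetcher.instance(host);
if (instance == null)
{
    // follow the existing convention: respond with 503 instead of letting an NPE propagate
    context.fail(HttpResponseStatus.SERVICE_UNAVAILABLE.code(),
                 new NoSuchElementException("Instance for host " + host + " not found"));
    return;
}
InstanceMetrics instanceMetrics = instance.metrics();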

@@ -101,6 +108,11 @@ public void execute(Promise<Void> promise)

// join always waits until all its futures are completed and will not fail as soon as one of the futures fails
Future.join(futures)
.onComplete(v -> {
int instancesUp = instancesConfig.instances().size() - instanceDown.get();
Contributor:
You don't really need an AtomicInteger to keep track of the down instances; you can check the results instead:

v.result().causes().stream().filter(Objects::nonNull).count()

@sarankk (Author):

But what if the futures fail for some other reason, other than what is caught in the catch block?

Contributor:

How is that possible? We explicitly set the promise to either failed or succeeded.

Contributor:

AtomicInteger instanceDown provides better readability without sacrificing performance. I am leaning toward the current implementation.

@sarankk (Author), Mar 22, 2024:

I tried these changes; we have to process the value of the result to get the cause. Maintaining the count outside seems simpler, wdyt?
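For reference, a hedged sketch of the counting approach that was kept; the per-instance health check (checkInstanceHealth) and its failure handling are hypothetical, and only the counting pattern is the point:

AtomicInteger instanceDown = new AtomicInteger();
List<Future<Void>> futures = instancesConfig.instances()
                                            .stream()
                                            .map(instance -> checkInstanceHealth(instance)               // hypothetical check
                                                             .onFailure(t -> instanceDown.incrementAndGet()))
                                            .collect(Collectors.toList());

// join waits for every future, whether it succeeds or fails
Future.join(futures)
      .onComplete(v -> {
          int instancesUp = instancesConfig.instances().size() - instanceDown.get();
          // report instancesUp and instanceDown.get() through the metrics framework here
      });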

@@ -66,7 +68,7 @@ class RestoreJobManagerTest
@BeforeEach
void setup()
{
Injector injector = Guice.createInjector(new MainModule());
Injector injector = Guice.createInjector(Modules.override(new MainModule()).with(new TestModule()));
Contributor:
Why do we need the test module here? The metrics should be provided in the main module.

@sarankk (Author):

It is needed here because the test doesn't provide all the classes we need; it only provides mock(RestoreJobConfiguration.class). But to build the Vert.x instance we need the sidecar configuration object, and with MainModule the test throws an error saying the yaml was not found. TestModule provides SidecarConfiguration, so we avoid the yaml error.
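For context, a hedged sketch of what the override achieves: TestModule is assumed to supply an in-memory SidecarConfiguration so MainModule's yaml lookup is bypassed. The provider body is hypothetical; only the Guice (com.google.inject) override pattern is taken from the diff.

public class TestModule extends AbstractModule
{
    @Provides
    @Singleton
    public SidecarConfiguration sidecarConfiguration()
    {
        // hypothetical in-memory configuration; the real TestModule may build it differently
        return SidecarConfigurationImpl.builder().build();
    }
}

// in the test setup, TestModule's bindings win over MainModule's
Injector injector = Guice.createInjector(Modules.override(new MainModule()).with(new TestModule()));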

Contributor:

Maybe we need to add these fields to the yaml?

@@ -124,6 +123,7 @@ void tearDown() throws InterruptedException
void failsWhenKeyStoreIsNotConfigured()
{
builder.sslConfiguration(SslConfigurationImpl.builder().enabled(true).build());
vertx = vertx();
Contributor:

Why is this change needed?

@sarankk (Author):

Creating a Vert.x instance now needs SidecarConfiguration to get the metrics options, so I changed this. In these tests the configuration is updated inside each test, and the Vert.x instance is built after that configuration update.

@sarankk (Author):

It is not needed for all test cases, for example some of the failure test cases, but I changed it consistently for easier test reading.
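For illustration, a hedged sketch of why the Vert.x instance now depends on the configuration: the Dropwizard metrics options (io.vertx.ext.dropwizard.DropwizardMetricsOptions) are derived from SidecarConfiguration before Vert.x is built. The option values below are made up.

DropwizardMetricsOptions metricsOptions = new DropwizardMetricsOptions()
                                          .setEnabled(true)
                                          .setRegistryName("cassandra_sidecar"); // assumed registry name
Vertx vertx = Vertx.vertx(new VertxOptions().setMetricsOptions(metricsOptions));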

@sarankk (Contributor, Author) commented Mar 22, 2024:

Thanks for the review, @frankgh. Addressed your comments.

@sarankk requested a review from @frankgh on March 22, 2024 20:32
@frankgh (Contributor) left a comment:

+1 Thanks for addressing the comments.

@frankgh merged commit 056faad into apache:trunk on Mar 23, 2024