New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CASSANDRASC-111 Improve observability in Sidecar with Dropwizard metrics #102
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, thank you for adding a metrics framework!
I only skimmed through the first dozen files. But it becomes hard to follow, as this patch contains both new framework and changes to the existing metric publish.
I would suggest to trim this patch to contain only the new framework, which is the most valuable. And we can focus on getting the framework right first.
Then, you can create the follow-up patches that migrate the existing metrics.
Let me know if it makes sense to you.
src/main/java/org/apache/cassandra/sidecar/metrics/MetricName.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/cassandra/sidecar/metrics/MetricName.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/cassandra/sidecar/metrics/instance/InstanceMetricRegistry.java
Outdated
Show resolved
Hide resolved
1c7e0dd
to
177033f
Compare
Thanks for the review Yifan, removed changes to existing metrics |
src/main/java/org/apache/cassandra/sidecar/config/VertxMetricsConfiguration.java
Show resolved
Hide resolved
promise.future().onComplete(context.succeedingThenComplete()); | ||
promise.future().onComplete(context.failing(v -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion change makes sense. How was the test passing before? hmm...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I saw a flaky test on circleci. Locally it was passing when I ran all tests together, if I ran just that test it was always failing
src/main/java/org/apache/cassandra/sidecar/metrics/NamedMetric.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/cassandra/sidecar/metrics/NamedMetric.java
Outdated
Show resolved
Hide resolved
public class NamedMetric<T extends Metric> | ||
{ | ||
public final String canonicalName; | ||
public final T metric; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test case for NamedMetric
to assert the retrieved metric with canonicalName
is the same metric?
src/main/java/org/apache/cassandra/sidecar/metrics/NamedMetric.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/cassandra/sidecar/metrics/instance/StreamSSTableComponentMetrics.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/cassandra/sidecar/metrics/instance/StreamSSTableComponentMetrics.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/cassandra/sidecar/metrics/instance/UploadSSTableComponentMetrics.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/cassandra/sidecar/metrics/instance/UploadSSTableComponentMetrics.java
Outdated
Show resolved
Hide resolved
diskUsageHighErrors | ||
= NamedMetric.builder(instanceMetricRegistry::meter) | ||
.withDomain(DOMAIN) | ||
.withName("disk_usage_high_errors") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is the same as insufficient_staging_space. can you remove this metric? and collect with that one instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not the same currently. Code in upload handler to check if sufficient space is there, is different from DiskprotectionHandler
. If we want to use same metric, we can modify SSTableUploadHandler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great contribution. Thanks for the patch. I've added some comments
src/main/java/org/apache/cassandra/sidecar/cluster/instance/InstanceMetadata.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/cassandra/sidecar/cluster/instance/InstanceMetadataImpl.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/cassandra/sidecar/config/yaml/MetricsConfigurationImpl.java
Show resolved
Hide resolved
rateLimitedCalls | ||
= NamedMetric.builder(metricRegistry::meter) | ||
.withDomain(DOMAIN) | ||
.withName("throttled_429") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we make these names constants of the class with public visibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we expose the full name, if we want to use the metric name during testing. I don't see a use case for just the name extension elsewhere, wdyt?
src/main/java/org/apache/cassandra/sidecar/routes/DiskSpaceProtectionHandler.java
Outdated
Show resolved
Hide resolved
@@ -116,10 +119,15 @@ public void handleInternal(RoutingContext context, | |||
// accept the upload. | |||
httpRequest.pause(); | |||
|
|||
InstanceMetrics instanceMetrics = metadataFetcher.instance(host).metrics(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InstanceMetadata can be null, we usually return 503 Unavailable when this occurs. I think we run into the issue of an NPE here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case we can have instance(host)
throw unavailable exception. Else we will have to do null check everywhere before retrieving metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does it currently throw new NoSuchElementException("Instance id " + id + " not found")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We handle nulls coming from metadata fetcher in the codebase already , so I think we should continue handling them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -101,6 +108,11 @@ public void execute(Promise<Void> promise) | |||
|
|||
// join always waits until all its futures are completed and will not fail as soon as one of the future fails | |||
Future.join(futures) | |||
.onComplete(v -> { | |||
int instancesUp = instancesConfig.instances().size() - instanceDown.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't really need an AtomicInteger to keep track of the down instances. You can check the results instead.
v.result().causes().stream().filter(Objects::nonNull).count()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what if the futures failed due to some other reason, other than what is caught in catch
block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is that possible? we explicitly set the promise to either failed or succeeded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AtomicInteger instanceDown
provides better readability w/o perf sacrifice. I am leaning to the current implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried these changes, we have to process the value of result to get cause. Maintaining the count outside seems simpler, wdyt?
@@ -66,7 +68,7 @@ class RestoreJobManagerTest | |||
@BeforeEach | |||
void setup() | |||
{ | |||
Injector injector = Guice.createInjector(new MainModule()); | |||
Injector injector = Guice.createInjector(Modules.override(new MainModule()).with(new TestModule())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need the test module here? the metrics should be provided in the main module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here it is needed because, the test doesn't provide all the classes we need. It only provides mock(RestoreJobConfiguration.class)
. But for building Vert.x instance we need sidecar configuration object. The test throws error saying, yaml not found if we use MainModule
. TestModule
provides SidecarConfiguration
hence we avoid yaml error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we need to add these fields to the yaml?
@@ -124,6 +123,7 @@ void tearDown() throws InterruptedException | |||
void failsWhenKeyStoreIsNotConfigured() | |||
{ | |||
builder.sslConfiguration(SslConfigurationImpl.builder().enabled(true).build()); | |||
vertx = vertx(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this change needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since now creating a Vert.x instance needs SidecarConfiguration
to get metrics options. Changed this. Because in these tests, the configuration is updated inside test as per each test. After the configuration update, building the vert.x instance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not needed for all tests cases, example some of the failure test cases. But changed it consistently for easier test reading
Thanks for the review @frankgh addressed your comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 Thanks for addressing the comments.
No description provided.