
Cassandra: monitoring #39

Closed
jazzl0ver opened this issue Feb 20, 2018 · 30 comments

Comments

@jazzl0ver
Collaborator

Would like to open discussion on this topic. Some suggestions:

  1. Add the Jolokia jar to the C* containers. It allows metrics to be fetched easily via HTTP requests.
  2. Make it configurable (at C* service creation and C* update-service) to auto-create CloudWatch alarms for the most important metrics (like latency and free disk space); a rough sketch of what that could look like follows below.
  3. Create a new service to run the TICK stack (https://github.com/influxdata/sandbox) in a single ECS task, which will retrieve metrics from C* (Telegraf), feed them into InfluxDB, and provide ready-to-use dashboards for various metrics (Chronograf).

Please, share your thoughts.

http://jolokia.org/agent/jvm.html
https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/
http://cassandra.apache.org/doc/latest/operating/metrics.html
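
For suggestion 2, a rough sketch of what an auto-created alarm could look like with the AWS CLI; the namespace, metric name, dimension, and threshold below are placeholders, not names FireCamp actually publishes:

    # Placeholder sketch: alarm when mean read latency stays high for 3 periods.
    # Namespace, metric name, dimension, and threshold are illustrative only.
    aws cloudwatch put-metric-alarm \
        --alarm-name cassandra-node1-read-latency-high \
        --namespace "FireCamp/Cassandra" \
        --metric-name cassandraClientRequest_Latency_Mean \
        --dimensions Name=ServiceMember,Value=cassandra-node1 \
        --statistic Average \
        --period 60 \
        --evaluation-periods 3 \
        --threshold 50000 \
        --comparison-operator GreaterThanThreshold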

@JuniusLuo
Contributor

Cassandra metrics can be collected using nodetool, JConsole, or JMX; see the Cassandra monitoring documentation. The Datadog blog you posted mentions the same methods, and Datadog itself collects Cassandra metrics in one of these ways.

The initial plan is to follow the standard way (nodetool) to get Cassandra metrics and send the metrics/alerts to CloudWatch Metrics/Alarms. We will not add any additional library to the Cassandra container. We could have a general policy-based framework that allows the customer to customize the policy, such as the metric collection interval, the metrics to collect, etc. The framework will schedule the task(s) accordingly. The task(s) will use nodetool to connect to the Cassandra nodes, fetch the metrics, and send them to CloudWatch. The task will end after that, so it only consumes resources while it is running. Each service could define its own monitoring task.
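
As a rough illustration of that flow (not the actual implementation), such a task could do something like the following; the host, namespace, and metric name are placeholders:

    # Sketch only: read the pending ReadStage tasks from nodetool tpstats
    # and push the value to CloudWatch. Host, namespace, and metric name are placeholders.
    PENDING=$(nodetool -h cassandra-node1 tpstats | awk '$1 == "ReadStage" {print $3}')
    aws cloudwatch put-metric-data \
        --namespace "FireCamp/Cassandra" \
        --metric-name ReadStagePendingTasks \
        --dimensions ServiceMember=cassandra-node1 \
        --value "$PENDING"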

We will define standard metrics/alarm APIs and have one implementation for CloudWatch. In the future, we could easily add implementations for Azure/GCP and others, which may use TICK.

@jazzl0ver
Collaborator Author

Thanks for the detailed explanation of your point! I agree that injecting a side library does not sound ideal, but:

  1. The Jolokia library is a wrapper around JMX counters. It simply adds the option to collect them through plain HTTP requests, so no local Java is needed to query them, and the service that queries the C* metrics needs far fewer resources (see the sketch after this list). And it's open source.
  2. nodetool is a Java app, so it starts slowly and cannot query multiple metrics at once.
  3. nodetool's output has to be parsed before the metrics can be pushed into CloudWatch. How are you going to deal with that? The output might also change slightly in future C* releases, which would require extra work to adapt the parser.
  4. I'm not sure DataStax uses nodetool to collect metrics for OpsCenter, since it installs a proprietary agent on every C* node and likely queries that agent to get metrics.
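
To make the Jolokia point concrete, here is a minimal sketch of reading one Cassandra metric over HTTP, assuming the Jolokia JVM agent is attached to Cassandra and listening on its default port 8778:

    # Sketch: read one Cassandra MBean over HTTP via the Jolokia agent
    # (assumes the agent listens on its default port 8778 on the node).
    curl -s "http://cassandra-node1:8778/jolokia/read/org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency"
    # The JSON response carries the MBean attributes (Mean, Count, percentiles, ...).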

Regarding the monitoring task: do you really think it's a good idea to continuously start and stop it? With a 1-minute collection interval, for example, it seems the task would be started again only moments after it was stopped, simply because a lot of metrics must be collected, processed, and pushed into CloudWatch.

@JuniusLuo
Contributor

Good point! There are many existing monitoring solutions. We will explore the existing solutions first and leverage open source as much as possible. We will only consider building our own solution if we cannot find a suitable one.

Jolokia is a good tool. It actually supports a proxy mode. We could test whether the proxy mode works for Cassandra.
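
For reference, a proxy-mode request is a JSON POST that names the target JMX URL. A minimal sketch, assuming a Jolokia WAR agent is deployed as the proxy at jolokia-proxy:8080 and Cassandra exposes JMX on its default port 7199:

    # Sketch of a Jolokia proxy-mode read; the proxy host and Cassandra node name are placeholders.
    curl -s -X POST http://jolokia-proxy:8080/jolokia/ \
        -H "Content-Type: application/json" \
        -d '{
          "type": "read",
          "mbean": "org.apache.cassandra.metrics:type=Storage,name=Load",
          "target": { "url": "service:jmx:rmi:///jndi/rmi://cassandra-node1:7199/jmxrmi" }
        }'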

Telegraf is a good project. It supports collecting metrics for many services and can send the metrics to CloudWatch. It may be a better framework than collectd. This blog has a good comparison.

@JuniusLuo
Contributor

For the monitoring task, it would be OK to keep it running continuously; "monitoring service" would be a better name than "task", assuming a framework such as Telegraf has only a small memory footprint.

It would also not be a problem to keep the task short-lived. Collecting the metrics of one Cassandra (or other service) node will be fast unless something is wrong, for example if Cassandra itself is stuck in GC. The metric data is small, so processing it and sending it to something like CloudWatch would be fast as well; the whole collection/handling should not take more than a few seconds. But it requires more work on the scheduling framework, so we could start with the long-running monitoring service first.

@JuniusLuo
Contributor

It turns out adding Jolokia to the Cassandra container is the simplest way. Telegraf is also supported; monitoring Cassandra, Redis, and ZooKeeper is supported. You can create a Telegraf service for the Cassandra service and see the metrics in CloudWatch. Please take a look and share your comments/suggestions.

Note: currently Cassandra keyspaces and tables are not monitored; the system keyspaces alone introduce more than 1000 metrics. Further enhancements will be added to monitor the user keyspaces.

@jazzl0ver
Collaborator Author

jazzl0ver commented Mar 16, 2018

That's great news!! What is the upgrade path? If possible, I'd rather not re-create our Cassandra services.

Please, add Telegraf service creation tutorial to the Wiki.

And how do we restrict the Telegraf container's memory?

@JuniusLuo
Contributor

Yep, upgrade will be supported for services created with 0.9.4 and 0.9.3.

Telegraf itself does not restrict memory, but we can leverage the container max memory/CPU limits. You can set max-memory and max-cpuunits when creating the Telegraf service; this sets the max memory and CPU for the container. If Telegraf exceeds the max memory, the container will be killed.

@jazzl0ver
Collaborator Author

Could you please share the options for setting max memory in the Telegraf service creation command?

@JuniusLuo
Contributor

"max-memory" and "max-cpuunits"

@JuniusLuo
Contributor

Looks like the CLI help does not include these options. Will add them.

@jazzl0ver
Collaborator Author

Have you updated the manage server and the CLI? Looks like the CLI is still the old one:

-rwxrwxr-x junius/junius 7648808 2018-03-14 04:32 firecamp-service-cli

@JuniusLuo
Contributor

The CLI does include the "max-memory" option; it's just that the help output, e.g. firecamp-service-cli -op=create-service --help, does not show it. You can still use it.

@jazzl0ver
Collaborator Author

I'm sorry - I meant the absence of the telegraf service type:

# ./firecamp-service-cli -region=us-east-1 -cluster=firecamp-prod -op=create-service -help
Usage: firecamp-service-cli -op=create-service
...
  -service-type string
        The catalog service type: mongodb|postgresql|cassandra|zookeeper|kafka|kafkamanager|redis|couchdb|consul|elasticsearch|kibana|logstash
...

@JuniusLuo
Contributor

Oops, uploaded the latest CLI.

@jazzl0ver
Collaborator Author

Works very well, thank you!

  1. It would be great to have some storage metrics, like free/total space available.
  2. It would also be great to have some summarized metrics per cluster (like latency, for example). Is that possible?

@JuniusLuo
Contributor

Unfortunately, Cassandra does not provide these. For #1, the Cassandra storage Load metric provides "Total disk space used (in bytes) for this node", but not the free space on the node. In a later release, we could integrate with CloudWatch to create an alarm when the used space reaches some threshold of the total data volume size.
For #2, Cassandra only provides per-node metrics and does not aggregate across nodes. You could easily create a dashboard of "cassandraClientRequest_Latency_Mean" for all nodes. Would that be enough?

@jazzl0ver
Collaborator Author

Yeah, that's a good solution, thank you! I'll create a separate issue for the CloudWatch alarm on used space.

@jazzl0ver
Collaborator Author

Do you know why some metrics are not available?
For example, the Streaming metrics (http://cassandra.apache.org/doc/latest/operating/metrics.html).

@JuniusLuo
Contributor

Yes, not all metrics are monitored; the Streaming metrics, CQL metrics, DroppedMessage metrics, etc. are left out. If you think some metric is important and want it added, please let us know. Thanks.

@jazzl0ver
Collaborator Author

Well, I think aggregation of metrics across all keyspaces and tables would be good to have. The Streaming and DroppedMessage metrics also seem important.

@JuniusLuo
Contributor

Sounds good. We could add these 3 metrics.

@jazzl0ver jazzl0ver reopened this Mar 26, 2018
@jazzl0ver
Collaborator Author

It would be great to have an option to update the list of fetched metrics according to one's requirements. For example, at the moment around 100 metrics per node are fetched from Cassandra, while I need just a few.

@JuniusLuo
Contributor

There are lots of Cassandra metrics. How would you want to configure it?

I am not sure this is really necessary. Collecting 100 metrics per node would not impact Cassandra, as the metric data is very small. If you only care about a few, you can easily filter them in CloudWatch. It would be better to collect the important metrics; when something goes wrong, we may get some hints from them.

@jazzl0ver
Collaborator Author

It's not about impacting C*, it's about money: custom metrics cost ($0.30 x 100 = $30 per node per month), and it's not very wise to pay for things you don't really need.
I thought we could have a file with the list of metrics (one per line) that could be uploaded to the Telegraf service through a firecamp-manager-cli update call. It would replace the current metrics list with the new one. Another "get" call might return the current list of metrics.

@JuniusLuo
Contributor

I see. This makes sense. CloudWatch is not cheap.

@JuniusLuo
Contributor

The commit is in. You can put all the custom metrics in one file and pass the file with "-tel-metrics-file=pathtofile" when creating the service.

Please pay attention to the data format in the metrics file. Each line contains one metric. Every metric should be wrapped in quotation marks and end with a comma, except the last metric, which should not end with a comma. Example:

    "/org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency",
    "/org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency",
    "/org.apache.cassandra.metrics:type=Storage,name=Load"

@JuniusLuo
Contributor

We just published the 0.9.5 release, which supports Telegraf. You can try the latest firecamp quickstart.

If you have a Cassandra service created with the 0.9.4 release, you can follow the upgrade guide to upgrade the cluster. However, there is one limitation: you will have to stop all services before the upgrade. The upgrade will take around 10 minutes. Upgrade will be further enhanced in the next release.

@jazzl0ver
Collaborator Author

That's great! Thanks for the implementation as well as for the upgrade feature!
If I'm running the "latest" release, what are my steps to upgrade correctly?

@JuniusLuo
Contributor

Upgrade is not supported for the "latest" release. There is no way to know what needs to be upgraded between commits of the latest release.

@cloudstax
Owner

Custom metrics are supported for Cassandra. Closing this issue.
