[#158456560] Add a lot of metrics to the collector #3

keymon · 2018-06-21T13:48:01Z

What?

We want to add multiple metrics to the collector.
We will add metrics from cloudwatch [1] and Postgres metrics collected via SQL.

[1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/rds-metricscollected.html

In this PR we added multiple metrics we found useful. We were not sure about
some of the cloudwatch ones.

We also added the ability to tag metrics, which allows us to attach dynamic values such as the table name (table_name).

We did not implement metrics for query latency or slow queries, as that seems
more complicated: we would need to enable slow query logging in the postgres
parameter groups and read the logs using the AWS API.

We unit tested the new metrics for postgres. We did not see the point of
testing the new cloudwatch ones, but they can be tested manually.

How to review?

Code review
Check if the metrics make sense
Check if the naming of metrics and tags makes sense.

The SQL based metrics can be tested in a docker container locally.

The cloudwatch metrics can be tested by running the collector locally:

Edit example-config.json to point to an existing CF deployment with RDS instances.
Load your AWS creds.
Run go run main.go -stdoutEmitter -config example_config.json

Who?

Anyone but me

Add a bunch of additional basic metrics as listed in [1]. We do not test these, as there is little point in doing so when we mock the service. In this commit, we add the metrics that are most valuable in our opinion. [1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/rds-metricscollected.html

We want to add dynamic tags like database name, table name and so on to the metrics retrieved using SQL queries. For that we extend the SQL connector so that it treats any additional column not defined as `MetricQueryMeta` as a tag and adds it to the returned metrics.
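A minimal sketch of the "extra columns become tags" idea described above. The `Metric` type, the `rowToMetrics` function and the way a row is represented are all assumptions made for illustration; they are not the collector's actual code.

```go
package main

import "fmt"

// Metric is a hypothetical, simplified metric type used only for this sketch.
type Metric struct {
	Key   string
	Value float64
	Tags  map[string]string
}

// rowToMetrics converts one result row into metrics. Columns listed in
// metricColumns become metric values; every other column is treated as a tag.
// All metrics produced from the same row share the same tags map.
func rowToMetrics(row map[string]interface{}, metricColumns []string) ([]Metric, error) {
	isMetric := make(map[string]bool, len(metricColumns))
	for _, c := range metricColumns {
		isMetric[c] = true
	}

	tags := map[string]string{}
	for col, v := range row {
		if !isMetric[col] {
			tags[col] = fmt.Sprint(v)
		}
	}

	metrics := make([]Metric, 0, len(metricColumns))
	for _, c := range metricColumns {
		raw, ok := row[c]
		if !ok {
			return nil, fmt.Errorf("column %q missing from row", c)
		}
		var value float64
		switch n := raw.(type) {
		case int64:
			value = float64(n)
		case float64:
			value = n
		default:
			return nil, fmt.Errorf("column %q has unsupported type %T", c, raw)
		}
		metrics = append(metrics, Metric{Key: c, Value: value, Tags: tags})
	}
	return metrics, nil
}

func main() {
	row := map[string]interface{}{
		"dbname":     "mydb",
		"table_name": "films",
		"seq_scan":   int64(7),
	}
	metrics, err := rowToMetrics(row, []string{"seq_scan"})
	if err != nil {
		panic(err)
	}
	fmt.Println(metrics[0].Key, metrics[0].Value, metrics[0].Tags["table_name"])
}
```

With this shape, adding a new tag to a metric only requires adding a column to the query.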

We want to include the dbname tag with the current database name, because we might allow multiple DBs per service instance in the future. Currently we only offer one DB per instance, which is the one returned by `brokerinfo.ConnectionString()`, so we only return `current_database()`. We change the test to create a temporary test DB with a random name and delete it after the test. We had to kill all active connections using SQL commands before dropping that database, because golang's sql.DB keeps a pool of connections and we did not find a way to close them explicitly.
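The cleanup step above could look roughly like the following sketch. The query text and the helper names (`terminateQuery`, `dropDatabaseStatement`, `quoteIdent`) are assumptions, not the statements used in the test suite; the `pid` column of `pg_stat_activity` assumes PostgreSQL 9.2 or later.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent quotes a PostgreSQL identifier by wrapping it in double quotes
// and doubling any embedded double quote.
func quoteIdent(name string) string {
	return `"` + strings.Replace(name, `"`, `""`, -1) + `"`
}

// terminateQuery ends every backend connected to the database passed as $1,
// except the session that runs the query itself.
const terminateQuery = `SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = $1 AND pid <> pg_backend_pid()`

// dropDatabaseStatement builds the DROP statement for a temporary test DB.
// DROP DATABASE does not accept bind parameters, so the name is quoted instead.
func dropDatabaseStatement(dbName string) string {
	return "DROP DATABASE IF EXISTS " + quoteIdent(dbName)
}

func main() {
	// Hypothetical random test database name.
	name := "test_db_1234"
	fmt.Println(terminateQuery)
	fmt.Println(dropDatabaseStatement(name))
}
```

Both statements would be executed from a connection to a different database (for example `postgres`), first the termination query with the test DB name as its parameter, then the DROP statement.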

The deadlocks metric counts how many deadlocks have occurred per DB. A deadlock means that one transaction had to be cancelled. We add a test that simulates a deadlock. We only report `current_database()` for the time being as our broker only offers one DB per instance, but we might allow more later. We include the tag `dbname`. The example code to simulate the deadlock has been borrowed from: https://www.depesz.com/2012/02/09/waiting-for-9-2-deadlock-counter/

We get the seq_scan and idx_scan metrics generated by the stats collector for each table. We add the table_name tag. The dbname is forced to be `current_database()` but we might emit for all the DBs if we offer more databases per instance. We add an index to the films table and run queries that do a sequential scan and an index scan to check the counters.

Return the max_connections allowed in the postgres server. This number is set by configuration in the RDS parameter group[1], and it is constant. We report this value in the marketplace description, but it might be handy for the tenant to create dynamic dashboards and alerts, rather than hardcode values. [1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html

max_tx_age: a high maximum transaction age indicates trouble[1]: transactions lasting too long, transactions not closed properly, slow queries or operations, etc. This metric reports the maximum transaction age at the moment the query is run, so it will likely miss transactions shorter than the metric collection interval. [1] https://russ.garrett.co.uk/2015/10/02/postgres-monitoring-cheatsheet/
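To illustrate why a point-in-time sample misses short transactions, here is a small simulation. The query in `maxTxAgeQuery` is only one possible way to compute the value and is an assumption, as are the `tx` type, `sampleMaxAge` and the 30 second interval.

```go
package main

import (
	"fmt"
	"time"
)

// maxTxAgeQuery is a hypothetical query for the metric: the age in seconds of
// the oldest transaction open at the moment the query runs (0 if there is none).
const maxTxAgeQuery = `SELECT COALESCE(EXTRACT(EPOCH FROM max(now() - xact_start)), 0) AS max_tx_age
FROM pg_stat_activity
WHERE xact_start IS NOT NULL`

// tx is a transaction that was open during the half-open interval [start, end),
// expressed as offsets from the beginning of the simulation.
type tx struct{ start, end time.Duration }

// sampleMaxAge returns the largest age among the transactions open at instant t,
// which is what a point-in-time query would observe.
func sampleMaxAge(txs []tx, t time.Duration) time.Duration {
	var oldest time.Duration
	for _, x := range txs {
		if x.start <= t && t < x.end {
			if age := t - x.start; age > oldest {
				oldest = age
			}
		}
	}
	return oldest
}

func main() {
	txs := []tx{
		{start: 5 * time.Second, end: 20 * time.Second},   // short: starts and ends between two samples
		{start: 10 * time.Second, end: 100 * time.Second}, // long running: seen by several samples
	}
	interval := 30 * time.Second // hypothetical collection interval
	for t := time.Duration(0); t <= 90*time.Second; t += interval {
		fmt.Printf("t=%v max_tx_age=%v\n", t, sampleMaxAge(txs, t))
	}
}
```

The 15 second transaction never shows up in any sample, while the long-running one is reported with a growing age at every collection.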

keymon added 17 commits June 21, 2018 14:37

Add --rm to docker run

9d184a2

cloudwatch: Add more RDS specific metrics

3861553

Add more metrics available in cloudwatch for RDS[1]. We add them in a separate commit as we are not sure about the real value of these metrics. [1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/rds-metricscollected.html

Allow emit metrics with tags

e65ac27

Add Tags as an attribute of the metrics.Metric object, which would be added when emitting the metric with loggregator.
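As a rough sketch of what a tagged metric could look like: only the existence of a `Tags` attribute comes from the commit message. The other field names and the `withTags` helper are assumptions for illustration, and the loggregator emission itself is not shown.

```go
package main

import "fmt"

// Metric is a hypothetical stand-in for the collector's metrics.Metric type.
// Only the Tags field is taken from the commit message; the rest is assumed.
type Metric struct {
	Key   string
	Value float64
	Unit  string
	Tags  map[string]string
}

// withTags returns a copy of m whose tags are the union of m.Tags and extra.
// Values in extra win on conflict. The tags map of the input is not modified.
func withTags(m Metric, extra map[string]string) Metric {
	merged := make(map[string]string, len(m.Tags)+len(extra))
	for k, v := range m.Tags {
		merged[k] = v
	}
	for k, v := range extra {
		merged[k] = v
	}
	m.Tags = merged
	return m
}

func main() {
	m := Metric{Key: "connections", Value: 3, Unit: "conn"}
	m = withTags(m, map[string]string{"source": "sql"})
	fmt.Println(m.Key, m.Tags["source"])
}
```

Copying the map keeps collectors from accidentally sharing and mutating each other's tags.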

Add "source" tag to the current collectors

c7bbbd2

We add a source tag to each existing collector: source=cloudwatch or source=sql.

sql_collector: Test no error with no rows

72f1c7b

postgres: idx_scan includes index_name tag

1b4937e

Use the pg_stat_user_indexes view to enrich the metric with the index name.

postgres: table_size per table

30e72a5

Report the table_size in bytes per table. We include the table_name tag. The dbname tag is forced to be `current_database()` but we might emit for all the DBs if we offer more databases per instance.

postgres: Add more metrics from pg_stat_database

570ef97

Add counters of committed/rolled-back transactions, read/hit block counters, read/write time and temporary bytes. See [1]. [1] https://www.postgresql.org/docs/9.2/static/monitoring-stats.html#PG-STAT-DATABASE-VIEW
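A sketch of how such a query could be assembled. The column names are the ones documented for the `pg_stat_database` view; which of them the collector actually reads, and the names `statDatabaseColumns` and `buildStatDatabaseQuery`, are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// statDatabaseColumns lists pg_stat_database columns that match the counters
// named in the commit message. The selection is an assumption.
var statDatabaseColumns = []string{
	"xact_commit",    // committed transactions
	"xact_rollback",  // rolled-back transactions
	"blks_read",      // disk blocks read
	"blks_hit",       // blocks found in the buffer cache
	"blk_read_time",  // ms spent reading blocks (needs track_io_timing)
	"blk_write_time", // ms spent writing blocks (needs track_io_timing)
	"temp_bytes",     // bytes written to temporary files
}

// buildStatDatabaseQuery returns a query restricted to the current database.
// The dbname column is not a metric, so a tag-aware collector could emit it
// as a tag on every metric of the row.
func buildStatDatabaseQuery(columns []string) string {
	return "SELECT datname AS dbname, " + strings.Join(columns, ", ") +
		" FROM pg_stat_database WHERE datname = current_database()"
}

func main() {
	fmt.Println(buildStatDatabaseQuery(statDatabaseColumns))
}
```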

postgres: Metric for blocked_connections

4b9ff46

We test this with the deadlock test.

ci/blackbox: Test for any cloudwatch metrics

3508ce7

The CPUUtilization metric takes a while to appear. Consider the test OK if any cloudwatch metric is received.
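The relaxed check could be expressed along these lines, assuming the received metrics carry the source tag added to the cloudwatch collector. The `Metric` type and `anyFromSource` are hypothetical and do not reflect the blackbox test's real code.

```go
package main

import "fmt"

// Metric is a hypothetical, simplified view of a received metric.
type Metric struct {
	Key  string
	Tags map[string]string
}

// anyFromSource reports whether at least one received metric carries the
// given value in its "source" tag. Metrics without tags are skipped.
func anyFromSource(received []Metric, source string) bool {
	for _, m := range received {
		if m.Tags["source"] == source {
			return true
		}
	}
	return false
}

func main() {
	received := []Metric{
		{Key: "connections", Tags: map[string]string{"source": "sql"}},
		{Key: "FreeStorageSpace", Tags: map[string]string{"source": "cloudwatch"}},
	}
	// The test passes as soon as any cloudwatch metric arrives, without
	// waiting for one specific metric such as CPUUtilization.
	fmt.Println(anyFromSource(received, "cloudwatch"))
}
```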

LeePorte merged commit 8c143ad into master Jun 22, 2018

keymon added a commit to alphagov/paas-rds-broker-boshrelease that referenced this pull request Jun 22, 2018

Bump paas-rds-metric-collector to add more metrics

ffc819d

Include alphagov/paas-rds-metric-collector#3

keymon added a commit to alphagov/paas-cf that referenced this pull request Jun 22, 2018

Bump rds-broker release to add metrics

688b0e2

Includes alphagov/paas-rds-metric-collector#3

keymon mentioned this pull request Jun 22, 2018

[#158456560] Bump rds-broker release to add metrics alphagov/paas-cf#1412

Merged

paroxp deleted the more-sql-metrics-158456560 branch June 26, 2018 10:59
