Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#158456560] Add a lot of metrics to the collector #3

Merged
merged 17 commits into from
Jun 22, 2018

Conversation

keymon
Copy link
Contributor

@keymon keymon commented Jun 21, 2018

What?

We want to add multiple metrics to the collector.
We will add metrics from cloudwatch [1] and SQL postgres ones.

[1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/rds-metricscollected.html

In this PR we added multiple metrics we found useful. We were not sure about
some of the cloudwatch ones.

We additionally added the capability of tagging the metrics, which would allow us to tag metrics with dynamic names like table_name

We did not implement metrics for query latency or slow queries, as that seems
more complicated: We need to enable slow query logging in postgres parameter
groups and read the logs using the AWS API.

We unit tested the new metrics for postgres. We did not see the point of
testing the new cloudwatch ones, but they can be tested manually.

How to review?

  1. Code review
  2. Check if the metrics make sense
  3. Check if the naming of metrics and tags makes sense.

The SQL based metrics can be tested in a docker container locally.

The cloudwatch can be tested by running this locally as:

  1. edit example-config.json to point to a existing CF deployment with RDS instances.
  2. load your AWS creds.
  3. run go run main.go -stdoutEmitter -config example_config.json

Who?

Anyone but me

keymon added 17 commits June 21, 2018 14:37
Add a bunch of additional basic metrics as listed in [1]

We do not test these, as there is not such a point because we mock
the service.

In this commit, we add the most valuable metrics in our opinion.

[1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/rds-metricscollected.html
Add more metrics available in cloudwatch for RDS[1]

We add them in a separate commit as we are not sure about the real
value of these metrics.

[1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/rds-metricscollected.html
Add Tags as an attribute of the metrics.Metric object, which would
be added when emitting the metric with loggregator.
we add tags source=cloudwatch and source=sql to each existing collector.
We want to add dynamic tags like database name, table name and so on
to the metrics retrieved using SQL queries.

For that we extend the sql connector, so that it will consider
any additional column not defined as `MetricQueryMeta` as tags,
and add it to the returned metrics.
We want to include the dbname tag with the current name, because
we might allow multiple DBs per service instance in the future.

Currently we only offer one DB per instance, which shall be returned
by the `brokerinfo.ConnectionString()`, so we only return only
`current_database()`.

We change the test to create a temporary test DB with a random DB
and delete it after the test. We had to kill all active connection
before DROP that database using SQL commands, because golang
sql.DB does keep a pool of connections and we did not find a way
to close them explicitely.
The deadlocks counter metric keeps a counter of how many deadlocks
have occurred per DB. A deadlock means that one transaction had to
be cancelled.

We add a test that simulates a deadlock.

We only report `current_database()` for the time being as our broker
only offers one DB per instance, but we might allow more later. We
include the tag `dbname`.

The example code to simulate the deadlock has been borrowed from:
https://www.depesz.com/2012/02/09/waiting-for-9-2-deadlock-counter/
We get the stats collector generated metric seq_scan and idx_scan
for each table. We add the tags for the table_name.

The dbname is forced to be `current_database()` but we might emit
for all the DBs if we offer more databases per instance.

We add a index to the films table and try to run queries that does
a scan and a index to check the counter.
Use the pg_stat_user_indexes tables to enrich the metric with the
index name.
Return the max_connections allowed in the postgred server.
This number is per configuration in the RDS parameter group[1],
and it is constant.

We report this value in the marketplace description, but it
might be handy for the tenant to create dynamic dashboards and
alerts, rather than hardcode values.

[1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html
Report the table_size in bytes per table. We include the table_name
tag.

The dbname tag is forced to be `current_database()` but we might emit
for all the DBs if we offer more databases per instance.
Add counter of commit/rollback transactions, read/hit block couter,
read/write time and temporary bytes.

See [1]

[1] https://www.postgresql.org/docs/9.2/static/monitoring-stats.html#PG-STAT-DATABASE-VIEW
We test this with the deadlock test.
max_tx_age, transactions lasting too long indicate trouble[1]:
transactions lasting too long, transactions not closed properly,
slow queries or operations, etc.

This metric will report the maximum transaction age in the moment
the query is reported, but it will likely miss transactions shorter
than the metric collection interval.

[1] https://russ.garrett.co.uk/2015/10/02/postgres-monitoring-cheatsheet/
The CPUUTilization takes a while to appear. Consider the test
OK if any cloudwatch metric is received.
@LeePorte LeePorte merged commit 8c143ad into master Jun 22, 2018
keymon added a commit to alphagov/paas-rds-broker-boshrelease that referenced this pull request Jun 22, 2018
keymon added a commit to alphagov/paas-cf that referenced this pull request Jun 22, 2018
@paroxp paroxp deleted the more-sql-metrics-158456560 branch June 26, 2018 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants