[#158456560] Add a lot of metrics to the collector #3
Merged
Conversation
Add a number of additional basic metrics as listed in [1]. We do not test these, as there is no point because we mock the service. In this commit we add the metrics we consider most valuable. [1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/rds-metricscollected.html
Add more metrics available in CloudWatch for RDS [1]. We add them in a separate commit because we are not sure about the real value of these metrics. [1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/rds-metricscollected.html
Add Tags as an attribute of the metrics.Metric object, to be attached when emitting the metric via loggregator.
We add the tags source=cloudwatch and source=sql to each existing collector.
We want to add dynamic tags like database name, table name and so on to the metrics retrieved using SQL queries. For that we extend the SQL connector so that any additional column not defined in `MetricQueryMeta` is treated as a tag and added to the returned metrics.
We want to include the dbname tag with the current database name, because we might allow multiple DBs per service instance in the future. Currently we only offer one DB per instance, which is returned by `brokerinfo.ConnectionString()`, so we only report `current_database()`. We change the test to create a temporary test DB with a random name and delete it after the test. We had to kill all active connections before dropping that database using SQL commands, because golang's sql.DB keeps a pool of connections and we did not find a way to close them explicitly.
The deadlocks counter metric counts how many deadlocks have occurred per DB. A deadlock means that one transaction had to be cancelled. We add a test that simulates a deadlock. We only report `current_database()` for the time being, as our broker only offers one DB per instance, but we might allow more later. We include the tag `dbname`. The code to simulate the deadlock has been borrowed from: https://www.depesz.com/2012/02/09/waiting-for-9-2-deadlock-counter/
We collect the stats-collector-generated metrics seq_scan and idx_scan for each table, tagged with table_name. The dbname tag is forced to `current_database()`, but we might emit for all DBs if we offer more databases per instance. In the test we add an index to the films table and run queries that perform a sequential scan and an index scan to check the counters.
Use the pg_stat_user_indexes view to enrich the metric with the index name.
Return the max_connections allowed by the postgres server. This number is set in the RDS parameter group [1] and is constant. We already report this value in the marketplace description, but it might be handy for tenants building dynamic dashboards and alerts rather than hardcoding values. [1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html
Report the table_size in bytes per table. We include the table_name tag. The dbname tag is forced to `current_database()`, but we might emit for all DBs if we offer more databases per instance.
Add counters of commit/rollback transactions, read/hit blocks, read/write time and temporary bytes. See [1]. [1] https://www.postgresql.org/docs/9.2/static/monitoring-stats.html#PG-STAT-DATABASE-VIEW
We test this with the deadlock test.
max_tx_age: transactions lasting too long indicate trouble [1]: transactions not closed properly, slow queries or operations, etc. This metric reports the maximum transaction age at the moment the query runs, but it will likely miss transactions shorter than the metric collection interval. [1] https://russ.garrett.co.uk/2015/10/02/postgres-monitoring-cheatsheet/
The CPUUtilization metric takes a while to appear, so consider the test OK if any CloudWatch metric is received.
keymon added a commit to alphagov/paas-rds-broker-boshrelease that referenced this pull request on Jun 22, 2018
keymon added a commit to alphagov/paas-cf that referenced this pull request on Jun 22, 2018
What?
We want to add multiple metrics to the collector: metrics from CloudWatch [1] and metrics from SQL queries against postgres.
[1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/rds-metricscollected.html
In this PR we added multiple metrics we found useful. We were not sure about the real value of some of the CloudWatch ones.
We additionally added the capability of tagging metrics, which allows us to tag metrics with dynamic names like table_name.
We did not implement metrics for query latency or slow queries, as that seems more complicated: we would need to enable slow-query logging in the postgres parameter groups and read the logs using the AWS API.
We unit tested the new postgres metrics. We did not see the point in testing the new CloudWatch ones, but they can be tested manually.
How to review?
The SQL-based metrics can be tested locally in a docker container.
The CloudWatch metrics can be tested by running the collector locally: edit `example_config.json` to point to an existing CF deployment with RDS instances, then run `go run main.go -stdoutEmitter -config example_config.json`.
Who?
Anyone but me