
[Metricbeat] aws module rds metricset missing cpu usage #30915

Closed
m-standfuss opened this issue Mar 18, 2022 · 10 comments
Labels
aws Enable builds in the CI for aws cloud testing bug Team:Cloud-Monitoring Label for the Cloud Monitoring team

Comments

@m-standfuss

After enabling the rds metricset for the aws module, I see the events coming across, but they are missing CPU values (aws.rds.cpu.total.pct). This is for a normal (non-Aurora) RDS Postgres database.

ConfigMap for the Deployment:

...
  aws_rds.yml: |-
    - module: aws
      period: 60s
      metricsets:
        - rds
      tags_filter:
        - key: "Environment"
          value: "<our environment name here>"
...

Resulting ~/modules.d/aws_rds.yml:

# cat aws_rds.yml
- module: aws
  period: 60s
  metricsets:
    - rds
  tags_filter:
    - key: "Environment"
      value: "stage"

Example event in metricbeat index:

{
  "_index": "metricbeat-7.17.1-2022.03.17-000004",
  "_type": "_doc",
  "_id": "3RnTnX8BTHDvmoFRS7Jb",
  "_version": 1,
  "_score": 1,
  "_source": {
    "@timestamp": "2022-03-18T16:18:00.000Z",
    "agent": {
      "version": "7.17.1",
      "hostname": "xxxx",
      "ephemeral_id": "xxxx",
      "id": "xxxxx",
      "name": "xxxxx",
      "type": "metricbeat"
    },
    "service": {
      "type": "aws"
    },
    "cloud": {
      "availability_zone": "xxxxx",
      "provider": "aws",
      "region": "xxxx",
      "account": {
        "id": "xxxx",
        "name": "xxxxx"
      }
    },
    "aws": {
      "rds": {
        "read_io": {
          "ops_per_sec": 0
        },
        "throughput": {
          "write": 14472.292128464525,
          "network_receive": 1235.1166666666666,
          "network_transmit": 9321.033333333333,
          "read": 0
        },
        "latency": {
          "read": 0,
          "write": 0.00002
        },
        "replica_lag": {
          "sec": 5
        },
        "database_connections": 0,
        "freeable_memory": {
          "bytes": 266002432
        },
        "disk_queue_depth": 0,
        "write_io": {
          "ops_per_sec": 1.1833136114398093
        },
        "maximum_used_transaction_ids": 2565235,
        "oldest_replication_slot_lag": {
          "mb": -1
        },
        "swap_usage": {
          "bytes": 68968448
        },
        "db_instance": {
          "arn": "xxxxxx",
          "status": "available",
          "identifier": "xxxxx",
          "class": "db.t3.micro",
          "engine_name": "postgres"
        },
        "free_storage": {
          "bytes": 7765127168
        },
        "disk_usage": {
          "replication_slot": {
            "mb": 4096
          },
          "transaction_logs": {
            "mb": 2348818432
          }
        }
      },
      "cloudwatch": {
        "namespace": "AWS/RDS"
      },
      "dimensions": {
        "DBInstanceIdentifier": "xxxxxx"
      },
      "tags": {
        "Name": "xxxxx",
        "Role": "xxxxx",
        "Environment": "xxxxxx"
      }
    },
    "event": {
      "dataset": "aws.rds",
      "module": "aws",
      "duration": 9781510696
    },
    "metricset": {
      "name": "rds",
      "period": 60000
    },
    "ecs": {
      "version": "1.12.0"
    },
    "host": {
      "name": "xxxxx"
    }
  },
  "fields": {
    "aws.rds.oldest_replication_slot_lag.mb": [
      -1
    ],
    "aws.rds.db_instance.status": [
      "available"
    ],
    "aws.rds.disk_usage.transaction_logs.mb": [
      2348818432
    ],
    "cloud.availability_zone": [
      "xxxx"
    ],
    "aws.tags.Role": [
      "xxxxx"
    ],
    "service.type": [
      "aws"
    ],
    "aws.rds.db_instance.identifier": [
      "xxxxxx"
    ],
    "agent.type": [
      "metricbeat"
    ],
    "aws.rds.throughput.write": [
      14472.292
    ],
    "event.module": [
      "aws"
    ],
    "aws.rds.db_instance.arn": [
      "xxxxxx"
    ],
    "aws.dimensions.DBInstanceIdentifier": [
      "xxxxxx"
    ],
    "aws.cloudwatch.namespace": [
      "AWS/RDS"
    ],
    "agent.name": [
      "xxxxx"
    ],
    "host.name": [
      "xxxx"
    ],
    "aws.rds.throughput.network_transmit": [
      9321.033
    ],
    "aws.rds.read_io.ops_per_sec": [
      0
    ],
    "aws.rds.write_io.ops_per_sec": [
      1.1833136
    ],
    "aws.rds.disk_usage.replication_slot.mb": [
      4096
    ],
    "cloud.account.name": [
      "xxxxxx"
    ],
    "cloud.region": [
      "xxxxx"
    ],
    "aws.rds.db_instance.class": [
      "db.t3.micro"
    ],
    "aws.rds.replica_lag.sec": [
      5
    ],
    "aws.tags.Name": [
      "xxxxxx"
    ],
    "metricset.period": [
      60000
    ],
    "aws.rds.maximum_used_transaction_ids": [
      2565235
    ],
    "aws.rds.latency.read": [
      0
    ],
    "aws.rds.free_storage.bytes": [
      7765127168
    ],
    "aws.rds.latency.write": [
      0.00002
    ],
    "agent.hostname": [
      "xxxxxxx"
    ],
    "aws.rds.disk_queue_depth": [
      0
    ],
    "aws.rds.throughput.read": [
      0
    ],
    "aws.rds.throughput.network_receive": [
      1235.1167
    ],
    "metricset.name": [
      "rds"
    ],
    "event.duration": [
      9781510696
    ],
    "aws.tags.Environment": [
      "xxxxxx"
    ],
    "aws.rds.db_instance.engine_name": [
      "postgres"
    ],
    "cloud.provider": [
      "aws"
    ],
    "@timestamp": [
      "2022-03-18T16:18:00.000Z"
    ],
    "agent.id": [
      "2f21e11f-0ce9-428e-b4bd-c72ae341c8b6"
    ],
    "cloud.account.id": [
      "xxxxxx"
    ],
    "ecs.version": [
      "1.12.0"
    ],
    "agent.ephemeral_id": [
      "xxx"
    ],
    "agent.version": [
      "7.17.1"
    ],
    "aws.rds.database_connections": [
      0
    ],
    "aws.rds.swap_usage.bytes": [
      68968448
    ],
    "aws.rds.freeable_memory.bytes": [
      266002432
    ],
    "event.dataset": [
      "aws.rds"
    ]
  }
}

For confirmed bugs, please report:

  • Version: image: docker.elastic.co/beats/metricbeat:7.17.1
  • Operating System: official K8s container image
  • Discuss Forum URL:
  • Steps to Reproduce: normal workflow; set up the module with the provided configuration
@mtojek mtojek added the Team:Integrations Label for the Integrations team label Mar 21, 2022
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@mtojek mtojek added needs_team Indicates that the issue/PR needs a Team:* label aws Enable builds in the CI for aws cloud testing labels Mar 21, 2022
@kaiyan-sheng
Contributor

Thanks for reporting this bug. I was able to reproduce it on my side. Also, I found that if I use period: 5m instead of 1m, I am able to collect the CPU usage metric:
Screen Shot 2022-03-21 at 3 09 57 PM

@kaiyan-sheng kaiyan-sheng added bug Team:Cloud-Monitoring Label for the Cloud Monitoring team and removed Team:Integrations Label for the Integrations team labels Mar 21, 2022
@m-standfuss
Author

Ah, thanks. I will probably move it to its own schedule and run it at 5 minutes for now. Thank you for taking a look at it.
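
For reference, a minimal sketch of what that separate schedule could look like, reusing the tags_filter from my original config above with the 5m period from the workaround (the Environment value is just our example):

# rds metricset split into its own module block on a longer schedule
- module: aws
  period: 5m
  metricsets:
    - rds
  tags_filter:
    - key: "Environment"
      value: "stage"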

@kaiyan-sheng
Contributor

I'm able to reproduce this problem locally with this config:

- module: aws
  period: 1m
  credential_profile_name: elastic-beats
  metricsets:
    - cloudwatch
  regions:
    - us-east-1
  metrics:
    - namespace: AWS/RDS
      name: ["CPUUtilization"]
      statistic: ["Average"]
      dimensions:
        - name: DatabaseClass
          value: db.r5.large

I should be getting the CPUUtilization metric every minute for this specific database class (db.r5.large), but I'm missing a lot of data points.

@kaiyan-sheng
Contributor

It turned out there is a 2-to-3-minute latency between the current timestamp and the last data point from CloudWatch. With the latency configuration parameter added, CPUUtilization metrics are collected as expected.

- module: aws
  period: 1m
  credential_profile_name: elastic-beats
  metricsets:
    - cloudwatch
  regions:
    - us-east-1
  latency: 3m
  metrics:
    - namespace: AWS/RDS
      name: ["CPUUtilization"]
      statistic: ["Average"]
      dimensions:
        - name: DatabaseClass
          value: db.r5.large

So for the original issue, in order to collect all data points for CPU usage with a 1m period, the latency parameter needs to be set.
Here is the config I tested with:

- module: aws
  period: 1m
  credential_profile_name: elastic-beats
  latency: 3m
  metricsets:
    - rds
  regions:
    - us-east-1

With the latency configuration, I can see the CPU usage metric is collected every minute as requested, and the data points match what's in CloudWatch:
Screen Shot 2022-10-06 at 2 47 51 PM
Screen Shot 2022-10-06 at 2 47 36 PM

Without the latency parameter, data points were missing almost every other minute:
Screen Shot 2022-10-06 at 2 30 15 PM

@kaiyan-sheng
Contributor

How do you measure the latency? The easiest way is to open the CloudWatch console and choose the metric you are interested in; in this case, CPUUtilization for RDS instances with the db.r5.large database class. Make sure the period is set to 1 minute and refresh the graph. Find the timestamp of the last data point reported to CloudWatch and compare it with the current timestamp. In this case, the current timestamp is 20:55 and the last data point is at 20:52. The difference between these two timestamps means we have a latency of 3 minutes. That's why I set latency to 3m for metric collection.

Screen Shot 2022-10-06 at 2 56 43 PM

@m-standfuss
Author

@kaiyan-sheng I'm taking a look at implementing this latency fix for our cluster now that we're upgraded to 8. Does the logic work such that if I set the latency fairly high, say 10 minutes, and there are multiple data points in that 10-minute window, it will select the most recent one?

Thank you for implementing a fix!

@m-standfuss
Author

Oh, I just realized that latency is an existing/legacy parameter. I thought it was newly implemented 🤦

@m-standfuss
Author

m-standfuss commented Aug 29, 2023

So even with a more than adequate latency (3+ minutes), I'm not getting all the metrics. I think I might just flip over to using the cloudwatch metricset directly instead of relying on the rds metricset.
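
For what it's worth, a rough sketch of what that direct cloudwatch-metricset config could look like, based on the example @kaiyan-sheng shared above; the region is from that example and the DBInstanceIdentifier value is a placeholder for our instance:

# cloudwatch metricset pulling RDS CPU directly, with latency applied
- module: aws
  period: 1m
  latency: 3m
  metricsets:
    - cloudwatch
  regions:
    - us-east-1
  metrics:
    - namespace: AWS/RDS
      name: ["CPUUtilization"]
      statistic: ["Average"]
      dimensions:
        - name: DBInstanceIdentifier
          value: "<db instance identifier>"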

@kaiyan-sheng
Contributor

Hello @m-standfuss 👋 Could you check the latency in the AWS console, please? You can follow https://docs.elastic.co/en/integrations/aws#latency-causes-missing-metrics for that :) I would be curious to see whether using the cloudwatch module directly works; it should behave the same, though. Thank you!
