
[Metricbeat] aws module rds metricset missing cpu usage #30915

Closed
m-standfuss opened this issue Mar 18, 2022 · 10 comments
Labels
aws Enable builds in the CI for aws cloud testing bug Team:Cloud-Monitoring Label for the Cloud Monitoring team

Comments

@m-standfuss

After enabling the rds metricset for the aws module, I see the events coming across, but they are missing CPU values (aws.rds.cpu.total.pct). This is for a normal (non-Aurora) RDS Postgres database.

ConfigMap for the Deployment:

...
  aws_rds.yml: |-
    - module: aws
      period: 60s
      metricsets:
        - rds
      tags_filter:
        - key: "Environment"
          value: "<our environment name here>"
...

Resulting ~/modules.d/aws_rds.yml:

# cat aws_rds.yml
- module: aws
  period: 60s
  metricsets:
    - rds
  tags_filter:
    - key: "Environment"
      value: "stage"

Example event in metricbeat index:

{
  "_index": "metricbeat-7.17.1-2022.03.17-000004",
  "_type": "_doc",
  "_id": "3RnTnX8BTHDvmoFRS7Jb",
  "_version": 1,
  "_score": 1,
  "_source": {
    "@timestamp": "2022-03-18T16:18:00.000Z",
    "agent": {
      "version": "7.17.1",
      "hostname": "xxxx",
      "ephemeral_id": "xxxx",
      "id": "xxxxx",
      "name": "xxxxx",
      "type": "metricbeat"
    },
    "service": {
      "type": "aws"
    },
    "cloud": {
      "availability_zone": "xxxxx",
      "provider": "aws",
      "region": "xxxx",
      "account": {
        "id": "xxxx",
        "name": "xxxxx"
      }
    },
    "aws": {
      "rds": {
        "read_io": {
          "ops_per_sec": 0
        },
        "throughput": {
          "write": 14472.292128464525,
          "network_receive": 1235.1166666666666,
          "network_transmit": 9321.033333333333,
          "read": 0
        },
        "latency": {
          "read": 0,
          "write": 0.00002
        },
        "replica_lag": {
          "sec": 5
        },
        "database_connections": 0,
        "freeable_memory": {
          "bytes": 266002432
        },
        "disk_queue_depth": 0,
        "write_io": {
          "ops_per_sec": 1.1833136114398093
        },
        "maximum_used_transaction_ids": 2565235,
        "oldest_replication_slot_lag": {
          "mb": -1
        },
        "swap_usage": {
          "bytes": 68968448
        },
        "db_instance": {
          "arn": "xxxxxx",
          "status": "available",
          "identifier": "xxxxx",
          "class": "db.t3.micro",
          "engine_name": "postgres"
        },
        "free_storage": {
          "bytes": 7765127168
        },
        "disk_usage": {
          "replication_slot": {
            "mb": 4096
          },
          "transaction_logs": {
            "mb": 2348818432
          }
        }
      },
      "cloudwatch": {
        "namespace": "AWS/RDS"
      },
      "dimensions": {
        "DBInstanceIdentifier": "xxxxxx"
      },
      "tags": {
        "Name": "xxxxx",
        "Role": "xxxxx",
        "Environment": "xxxxxx"
      }
    },
    "event": {
      "dataset": "aws.rds",
      "module": "aws",
      "duration": 9781510696
    },
    "metricset": {
      "name": "rds",
      "period": 60000
    },
    "ecs": {
      "version": "1.12.0"
    },
    "host": {
      "name": "xxxxx"
    }
  },
  "fields": {
    "aws.rds.oldest_replication_slot_lag.mb": [
      -1
    ],
    "aws.rds.db_instance.status": [
      "available"
    ],
    "aws.rds.disk_usage.transaction_logs.mb": [
      2348818432
    ],
    "cloud.availability_zone": [
      "xxxx"
    ],
    "aws.tags.Role": [
      "xxxxx"
    ],
    "service.type": [
      "aws"
    ],
    "aws.rds.db_instance.identifier": [
      "xxxxxx"
    ],
    "agent.type": [
      "metricbeat"
    ],
    "aws.rds.throughput.write": [
      14472.292
    ],
    "event.module": [
      "aws"
    ],
    "aws.rds.db_instance.arn": [
      "xxxxxx"
    ],
    "aws.dimensions.DBInstanceIdentifier": [
      "xxxxxx"
    ],
    "aws.cloudwatch.namespace": [
      "AWS/RDS"
    ],
    "agent.name": [
      "xxxxx"
    ],
    "host.name": [
      "xxxx"
    ],
    "aws.rds.throughput.network_transmit": [
      9321.033
    ],
    "aws.rds.read_io.ops_per_sec": [
      0
    ],
    "aws.rds.write_io.ops_per_sec": [
      1.1833136
    ],
    "aws.rds.disk_usage.replication_slot.mb": [
      4096
    ],
    "cloud.account.name": [
      "xxxxxx"
    ],
    "cloud.region": [
      "xxxxx"
    ],
    "aws.rds.db_instance.class": [
      "db.t3.micro"
    ],
    "aws.rds.replica_lag.sec": [
      5
    ],
    "aws.tags.Name": [
      "xxxxxx"
    ],
    "metricset.period": [
      60000
    ],
    "aws.rds.maximum_used_transaction_ids": [
      2565235
    ],
    "aws.rds.latency.read": [
      0
    ],
    "aws.rds.free_storage.bytes": [
      7765127168
    ],
    "aws.rds.latency.write": [
      0.00002
    ],
    "agent.hostname": [
      "xxxxxxx"
    ],
    "aws.rds.disk_queue_depth": [
      0
    ],
    "aws.rds.throughput.read": [
      0
    ],
    "aws.rds.throughput.network_receive": [
      1235.1167
    ],
    "metricset.name": [
      "rds"
    ],
    "event.duration": [
      9781510696
    ],
    "aws.tags.Environment": [
      "xxxxxx"
    ],
    "aws.rds.db_instance.engine_name": [
      "postgres"
    ],
    "cloud.provider": [
      "aws"
    ],
    "@timestamp": [
      "2022-03-18T16:18:00.000Z"
    ],
    "agent.id": [
      "2f21e11f-0ce9-428e-b4bd-c72ae341c8b6"
    ],
    "cloud.account.id": [
      "xxxxxx"
    ],
    "ecs.version": [
      "1.12.0"
    ],
    "agent.ephemeral_id": [
      "xxx"
    ],
    "agent.version": [
      "7.17.1"
    ],
    "aws.rds.database_connections": [
      0
    ],
    "aws.rds.swap_usage.bytes": [
      68968448
    ],
    "aws.rds.freeable_memory.bytes": [
      266002432
    ],
    "event.dataset": [
      "aws.rds"
    ]
  }
}

For confirmed bugs, please report:

  • Version: image: docker.elastic.co/beats/metricbeat:7.17.1
  • Operating System: official K8s container image
  • Discuss Forum URL:
  • Steps to Reproduce: normal workflow; set up the module with the provided configuration
@mtojek mtojek added the Team:Integrations Label for the Integrations team label Mar 21, 2022
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@mtojek mtojek added needs_team Indicates that the issue/PR needs a Team:* label aws Enable builds in the CI for aws cloud testing labels Mar 21, 2022
@kaiyan-sheng
Contributor

Thanks for reporting this bug. I was able to reproduce it on my side. Also, I found that if I use period: 5m instead of 1m, I am able to collect the CPU usage metric:
Screen Shot 2022-03-21 at 3 09 57 PM

@kaiyan-sheng kaiyan-sheng added bug Team:Cloud-Monitoring Label for the Cloud Monitoring team and removed Team:Integrations Label for the Integrations team labels Mar 21, 2022
@m-standfuss
Author

Ah, thanks. I will probably move it to its own schedule and run it at 5 minutes for now. Thank you for taking a look at it.
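
For reference, a minimal sketch of what that separate schedule could look like, reusing the tags_filter from my original config above with the 5m period from the workaround (the Environment value is just our example):

# rds metricset split into its own module block on a longer schedule
- module: aws
  period: 5m
  metricsets:
    - rds
  tags_filter:
    - key: "Environment"
      value: "stage"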

@kaiyan-sheng
Contributor

I'm able to reproduce this problem locally with this config:

- module: aws
  period: 1m
  credential_profile_name: elastic-beats
  metricsets:
    - cloudwatch
  regions:
    - us-east-1
  metrics:
    - namespace: AWS/RDS
      name: ["CPUUtilization"]
      statistic: ["Average"]
      dimensions:
        - name: DatabaseClass
          value: db.r5.large

I should be getting the CPUUtilization metric every minute for this specific database class (db.r5.large), but I'm missing a lot of data points.

@kaiyan-sheng
Contributor

It turned out there is a 2-to-3-minute latency between the current timestamp and the last data point from CloudWatch. With the latency configuration parameter added, CPUUtilization metrics are collected as expected.

- module: aws
  period: 1m
  credential_profile_name: elastic-beats
  metricsets:
    - cloudwatch
  regions:
    - us-east-1
  latency: 3m
  metrics:
    - namespace: AWS/RDS
      name: ["CPUUtilization"]
      statistic: ["Average"]
      dimensions:
        - name: DatabaseClass
          value: db.r5.large

So for the original issue, in order to collect all data points for CPU usage with a 1m period, the latency parameter needs to be set.
Here is the config I tested with:

- module: aws
  period: 1m
  credential_profile_name: elastic-beats
  latency: 3m
  metricsets:
    - rds
  regions:
    - us-east-1

With the latency configuration, I can see the CPU usage metric is collected every minute as requested, and the data points match what's in CloudWatch:
Screen Shot 2022-10-06 at 2 47 51 PM
Screen Shot 2022-10-06 at 2 47 36 PM

Without the latency parameter, data points were missing almost every other minute:
Screen Shot 2022-10-06 at 2 30 15 PM

@kaiyan-sheng
Contributor

How do you measure the latency? The easiest way is to open the CloudWatch console and choose the metric you are interested in; in this case, CPUUtilization for RDS instances with the db.r5.large database class. Make sure the period is set to 1 minute and refresh the graph. Find the timestamp of the last data point reported to CloudWatch and compare it with the current timestamp. In this case, the current timestamp is 20:55 and the last data point is at 20:52. The difference between these two timestamps means we have a latency of 3 minutes. That's why I set latency to 3m for metric collection.

Screen Shot 2022-10-06 at 2 56 43 PM

@m-standfuss
Author

@kaiyan-sheng I'm taking a look at implementing this latency fix for our cluster now that we're upgraded to 8. Does the logic work such that if I set the latency fairly high, say 10 minutes, and there are multiple data points in that 10-minute window, it will select the most recent one?

Thank you for implementing a fix!

@m-standfuss
Author

Oh, I just realized that latency is an existing/legacy parameter. I thought it was newly implemented 🤦

@m-standfuss
Author

m-standfuss commented Aug 29, 2023

So even with a more than adequate latency (3+ minutes), I'm not getting all the metrics. I think I might just flip over to using the cloudwatch metricset directly instead of relying on the rds metricset.
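
For what it's worth, a rough sketch of what that direct cloudwatch-metricset config could look like, based on the example @kaiyan-sheng shared above; the region is from that example and the DBInstanceIdentifier value is a placeholder for our instance:

# cloudwatch metricset pulling RDS CPU directly, with latency applied
- module: aws
  period: 1m
  latency: 3m
  metricsets:
    - cloudwatch
  regions:
    - us-east-1
  metrics:
    - namespace: AWS/RDS
      name: ["CPUUtilization"]
      statistic: ["Average"]
      dimensions:
        - name: DBInstanceIdentifier
          value: "<db instance identifier>"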

@kaiyan-sheng
Contributor

Hello @m-standfuss 👋 Could you check the latency in the AWS console, please? You can follow https://docs.elastic.co/en/integrations/aws#latency-causes-missing-metrics for that :) I would be curious to see whether using the cloudwatch module directly works; it should behave the same, though. Thank you!
