[Metricbeat] [GCP] Fix compute metadata #36338

Merged
merged 7 commits into elastic:main on Aug 28, 2023

Conversation

gpop63
Contributor

@gpop63 gpop63 commented Aug 16, 2023

Proposed commit message

A few things I noticed while debugging:

  1. NewMetadataService is invoked once per period set in the config (e.g. 1m), so the computeInstances map is recreated on its own each period; there is no need to remake it again in instance. If the period is 1m, is there any risk of serving stale metadata?

    func NewMetadataService(projectID, zone string, region string, regions []string, opt ...option.ClientOption) (gcp.MetadataService, error) {
        return &metadataCollector{
            projectID:        projectID,
            zone:             zone,
            region:           region,
            regions:          regions,
            opt:              opt,
            computeInstances: make(map[uint64]*computepb.Instance),
        }, nil
    }

  2. ID also calls Metadata, so in timeSeriesGrouped both end up being called:

    func (s *metadataCollector) ID(ctx context.Context, in *gcp.MetadataCollectorInputData) (string, error) {
        metadata, err := s.Metadata(ctx, in.TimeSeries)

    id, err := metadataService.ID(ctx, sdCollectorInputData)
    if err != nil {
        m.Logger().Errorf("error trying to retrieve ID from metric event '%v'", err)
        continue
    }
    metadataCollectorData, err := metadataService.Metadata(ctx, sdCollectorInputData.TimeSeries)
    if err != nil {
        m.Logger().Error("error trying to retrieve labels from metric event")
        continue
    }

  3. The ListTimeSeries call in Metric returns time series for instances from other projects, not only from the project_id specified in the config. The AggregatedList call will not be able to find those instances.

    func (r *metricsRequester) Metric(ctx context.Context, serviceName, metricType string, timeInterval *monitoringpb.TimeInterval, aligner string) timeSeriesWithAligner {
        timeSeries := make([]*monitoringpb.TimeSeries, 0)
        req := &monitoringpb.ListTimeSeriesRequest{
            Name:     "projects/" + r.config.ProjectID,
            Interval: timeInterval,
            View:     monitoringpb.ListTimeSeriesRequest_FULL,
            Filter:   r.getFilterForMetric(serviceName, metricType),
            Aggregation: &monitoringpb.Aggregation{
                PerSeriesAligner: gcp.AlignersMapToGCP[aligner],
                AlignmentPeriod:  r.config.period,
            },
        }
        it := r.client.ListTimeSeries(ctx, req)

I think the problem is that Metadata is called many times, and clearing the computeInstances map on every call would trigger the compute AggregatedList call repeatedly. This would go on forever, because that API call cannot return metadata for every source of compute metrics: Kubernetes pods and nodes (for which we cannot get metadata) also send compute metrics, not only VM instances.

By not remaking the computeInstances map, we make only one AggregatedList call per period specified in the config (1m).
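The behavior described above can be sketched with stand-in types (the real code uses computepb.Instance and gcp.MetadataService from the Beats GCP module; all names below are hypothetical). Keeping the map for the lifetime of the collector means the expensive AggregatedList lookup happens once per period, no matter how many times ID and Metadata are called:

```go
package main

import "fmt"

// instance stands in for computepb.Instance; only the caching
// behavior of the collector is modeled here.
type instance struct{ name string }

type metadataCollector struct {
	computeInstances map[uint64]*instance
	aggregatedLists  int // counts simulated AggregatedList API calls
}

// getComputeInstances populates the cache on first use. Because the
// collector itself is recreated each period, the cache naturally
// expires with the period (e.g. every 1m).
func (c *metadataCollector) getComputeInstances() map[uint64]*instance {
	if len(c.computeInstances) == 0 {
		c.aggregatedLists++ // the one expensive API call per period
		c.computeInstances[42] = &instance{name: "vm-1"}
	}
	return c.computeInstances
}

// Metadata and ID both consult the cache, so calling them back to
// back (as timeSeriesGrouped does) costs a single API call.
func (c *metadataCollector) Metadata(id uint64) *instance { return c.getComputeInstances()[id] }
func (c *metadataCollector) ID(id uint64) string          { return c.Metadata(id).name }

func main() {
	c := &metadataCollector{computeInstances: make(map[uint64]*instance)}
	_ = c.ID(42)       // triggers the single AggregatedList
	_ = c.Metadata(42) // served from the cache
	fmt.Println(c.aggregatedLists)
}
```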

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots


Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Aug 16, 2023
@gpop63 gpop63 self-assigned this Aug 16, 2023
@mergify
Contributor

mergify bot commented Aug 16, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @gpop63? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch, where \d is the digit of the minor version

@elasticmachine
Collaborator

elasticmachine commented Aug 16, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-08-28T10:39:06.653+0000

  • Duration: 53 min 11 sec

Test stats 🧪

Test Results
Failed 0
Passed 1533
Skipped 96
Total 1629

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@gpop63 gpop63 added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Aug 16, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Aug 16, 2023
@gpop63 gpop63 marked this pull request as ready for review August 16, 2023 15:49
@gpop63 gpop63 requested a review from a team as a code owner August 16, 2023 15:49
@tdancheva tdancheva self-requested a review August 28, 2023 15:47
@gpop63 gpop63 merged commit 63a147a into elastic:main Aug 28, 2023
21 checks passed
@tdancheva
Contributor

@kaiyan-sheng I unskipped the test in a draft PR and the test fails because it runs on 8.8.2. Do you think it is a good idea to backport this so that we can validate it now instead of waiting for the next release?

@kaiyan-sheng
Contributor

@tdancheva good point! I would consider this as a bug fix and let's backport it.

@kaiyan-sheng kaiyan-sheng added the backport-v8.9.0 Automated backport with mergify label Sep 6, 2023
mergify bot pushed a commit that referenced this pull request Sep 6, 2023
* don't remake map

* only add instances from project_id

* fix ecs cloud region and AZ bug

* Revert "fix ecs cloud region and AZ bug"

This reverts commit 349bf6b.

* Revert "only add instances from project_id"

This reverts commit 89f6ab5.

* add changelog entry

(cherry picked from commit 63a147a)
kaiyan-sheng added a commit that referenced this pull request Sep 6, 2023
* [Metricbeat] [GCP] Fix compute metadata (#36338)

* don't remake map

* only add instances from project_id

* fix ecs cloud region and AZ bug

* Revert "fix ecs cloud region and AZ bug"

This reverts commit 349bf6b.

* Revert "only add instances from project_id"

This reverts commit 89f6ab5.

* add changelog entry

(cherry picked from commit 63a147a)

* Update CHANGELOG.next.asciidoc

---------

Co-authored-by: Gabriel Pop <94497545+gpop63@users.noreply.github.com>
Co-authored-by: kaiyan-sheng <kaiyan.sheng@elastic.co>
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
Labels
backport-v8.9.0 Automated backport with mergify Team:Cloud-Monitoring Label for the Cloud Monitoring team
4 participants