
GCP metadata instance cache is never refreshed but also never used #32734

Closed
endorama opened this issue Aug 18, 2022 · 2 comments · Fixed by #33655
Labels
backport-v7.17.0 Automated backport with mergify backport-v8.4.0 Automated backport with mergify bug Team:Cloud-Monitoring Label for the Cloud Monitoring team

Comments

@endorama
Member

The gcp.compute metricset implements a caching mechanism for GCP instance metadata. This metadata is collected as part of instance data collection.

An investigation into how the current caching works highlighted a major flaw in its invalidation behaviour.

TL;DR: the current cache behaviour is useless and the cache is never actually used. The original reason for having a cache (a high number of calls to the instances.get GCP API) no longer applies, and invalidating the cache is problematic: there is no way to know when data changes in GCP, so there is a high risk of serving stale data. The proposal is therefore to remove the code and reintroduce it later only if needed, which we foresee will not be the case.

We greatly reduced the number of API calls that justified caching, and cache invalidation for this use case is very error prone; can we remove it entirely?

How instance metadata collection works

Details

Instance metadata collection is enabled when exclude_labels is set to false in the metricset configuration.

When enabled, the code instantiates a gcp.MetadataService:

if !m.config.ExcludeLabels {
    if gcpService, err = NewMetadataServiceForConfig(m.config, sdc.ServiceName); err != nil {
        return nil, fmt.Errorf("error trying to create metadata service: %w", err)
    }
}

which in the case of gcp.compute means calling this function:

func NewMetadataService(projectID, zone string, region string, opt ...option.ClientOption) (gcp.MetadataService, error) {
    return &metadataCollector{
        projectID:     projectID,
        zone:          zone,
        region:        region,
        opt:           opt,
        instanceCache: common.NewCache(cacheTTL, initialCacheSize),
        logger:        logp.NewLogger("metrics-compute"),
    }, nil
}

For each collection period, the gcp.MetadataService Metadata function is called to gather instance metadata.

We can assume (knowledge of why this cache was put in place has been lost) that the reason was to avoid calling the instances.get GCP API endpoint on each collection iteration, sparing costly API calls and network latency. This API was called once per instance to retrieve its metadata, which also poses a scaling problem with large numbers of instances.

This behaviour was changed in March 2022 by PR #30154, which replaced the single API call per instance with the instances.aggregatedList GCP API, which returns batched results (up to 500 instances per result page).

So with the current implementation the scaling concerns (performance- and cost-wise) for this operation are greatly reduced: with 1000 instances monitored with metadata collection enabled, Metricbeat now performs 2 API calls instead of 1000.

The cache

Details

The cache, called instanceCache in compute.NewMetadataService is instantiated with:

instanceCache: common.NewCache(cacheTTL, initialCacheSize),

Where values are:

cacheTTL = 30 * time.Second
initialCacheSize = 13

common.Cache is our libbeat common cache implementation.

What is cached

Details

When gcp.MetadataService's Metadata is called, its first step is to gather the current instance metadata (by instance ID), which triggers a cache refresh:

func (s *metadataCollector) instance(ctx context.Context, instanceID string) (*compute.Instance, error) {
s.refreshInstanceCache(ctx)

Iterating over the pages of results, the cache refresh adds each instance returned by the instances.aggregatedList API to instanceCache.

How is cache invalidation wrong

Issue 1 - expired elements are not cleaned up

TL;DR: once filled, the cache is never actually refreshed, because the condition expects an empty cache and that case never happens.

Details

The problem starts with these lines:

// only refresh cache if it is empty
if s.instanceCache.Size() > 0 {
    return
}

This check was implemented to avoid refreshing the cache on every iteration, waiting instead for the cache to be empty before performing a refresh.

But in which case does Size() return 0? Looking at its implementation:

// Size returns the number of elements in the cache. The number includes both
// active elements and expired elements that have not been cleaned up.
func (c *Cache) Size() int {
    c.RLock()
    defer c.RUnlock()
    return len(c.elements)
}

we notice that Size() returns the number of all elements, both active and expired (those that have not yet been cleaned up).

When does clean-up happen? Only when Cleanup() is called, which never happens in the Metricbeat code.

This is the first issue: once filled, the cache is never actually refreshed, because the condition expects an empty cache and that case never happens.

Issue 2 - element expiration time is extended on access

TL;DR: the expiration time of an element is extended on access, so elements would never be cleaned up (and the cache size would never drop to 0 after the first fill).

Details

But there is a second problem: even if the cache were correctly cleaned up, when is an element considered expired?
The get function implementing cached element retrieval may update the element's expiration time, depending on an internal attribute (accessExpire).

The Metricbeat code instantiates the cache with common.NewCache(...), which is implemented as:

func NewCache(d time.Duration, initialSize int) *Cache {
    return newCache(d, true, initialSize, nil, time.Now)
}

The second parameter is accessExpire:

func newCache(d time.Duration, accessExpire bool, initialSize int, l RemovalListener, t clock) *Cache {

which is set to true by default.

This is the second issue: the expiration time of an element is extended on access, so elements would never be cleaned up (and the cache size would never drop to 0 after the first fill).

Issue 3 - cache expires between periods

TL;DR: all elements expire after 30 seconds while we access them every 60 seconds, but because they are never cleaned up, they are not served from the cache and the cache is not refreshed either.

Details

Up to this point it seems the cache is not really working as expected, but there is another wrong behaviour.

The cache timeout is initialized to cacheTTL, which is set to 30 * time.Second (see above).

On GCP, any period shorter than 60 seconds cannot be configured, because that is the minimum valid collection period on the GCP API side.

This is the third issue: all elements expire after 30 seconds while we access them every 60 seconds, but because they are never cleaned up, they are not served from the cache and the cache is not refreshed either.

Issue 4 - MetadataService is instantiated at each period

The cache is instantiated in compute.NewMetadataService (shown above), and this constructor is called for each collection period.

This is the fourth issue: the cache instance is created anew for each data collection, so it always starts empty.

Conclusions

Due to these four issues:

  • once filled, the cache would never be refreshed
  • expired items would never be cleaned up
  • items would never expire, because their expiration time is extended on every access
  • the cache is instantiated fresh at each collection period

the cache behaviour is wrong and the cache is effectively never used.

@endorama endorama added bug backport-v7.17.0 Automated backport with mergify Team:Cloud-Monitoring Label for the Cloud Monitoring team backport-v8.4.0 Automated backport with mergify labels Aug 18, 2022
@gpop63
Contributor

gpop63 commented Nov 7, 2022

Should the current API call still be used after the cache removal, or should we revert to the old API call? Wouldn't the current API call be too expensive to use (without a cache), since it collects data for all instances?

Current:

req := computeService.Instances.AggregatedList(s.projectID)

Previous:

instanceData, err := service.Instances.Get(s.projectID, zone, instanceID).Do()

@endorama
Member Author

endorama commented Nov 7, 2022

The current API is optimized for querying a big list of instances, so I don't think we need to change it.

As of now it is already being used without a cache, so I would first remove the cache (as it is broken code) and then think about ways to reduce its impact (the issue with caching is that it is quite difficult to establish when the cache should be refreshed).
