
vmalert: long query execution in annotations blocks API #6079

Closed
dmitryk-dk opened this issue Apr 8, 2024 · 5 comments
Labels: bug (Something isn't working), vmalert

Comments

@dmitryk-dk
Contributor

Describe the bug

If you use the query template function in the alerting rule annotations and that query takes a long time to execute (say, 1 minute), the other vmalert endpoints do not return a response while it runs.

To Reproduce

Create an alerting group like in the example below:

groups:
  - name: group
    interval: 5s
    eval_offset: 0
    rules:
      - alert: Blocking
        expr: 1 + 1
        annotations:
          summary: '{{ query "1" | first | value }}'

      - alert: Next Alert
        expr: vector(1)
        annotations:
          summary: 'just summary'

Prepare an nginx config:

daemon off;
error_log  nginx_error.log warn;
pid        nginx.pid;

events { }

http {


    log_format   main '$remote_addr [$time_local]  $status '
        '"$request" $body_bytes_sent "$http_referer" '
        '"$http_user_agent" "$http_x_forwarded_for"';

    limit_req_zone $binary_remote_addr zone=delay:1m rate=1r/m;

    server {
        listen       12345 default_server;
        access_log   nginx_access.log main;
        root /dev/null;

        location /datasource {
            return 200 '{"status":"success","isPartial":false,"data":{"resultType":"vector","result":[{"metric":{},"value":[1712340706,"2"]}]},"stats":{"seriesFetched": "0","executionTimeMsec":0}}\n';
        }

        location /api/v2/alerts {
            return 200;
        }

        location /delay-second-request {
            limit_req zone=delay burst=1;
            proxy_pass http://localhost:12345/datasource/;
        }
    }
}

Run nginx:

nginx -p . -e nginx_error.log -c ./bin/nginx.conf

Run vmalert
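For example, vmalert can be pointed at the mock nginx endpoints above. The invocation below is an assumption based on the nginx config and the curl example that follows (the rules file name and the choice of datasource path are hypothetical; adjust them to your setup):

vmalert -rule=group.yml -datasource.url=http://localhost:12345/delay-second-request -notifier.url=http://localhost:12345 -httpListenAddr=:12346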

Then try to curl the /metrics endpoint:

curl http://localhost:12346/metrics

The request will be stuck while the query from the template is being executed; curl will wait for the response.

I think it happens because of this mutex

ar.alertsMu.Lock()
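To illustrate the mechanism, here is a minimal standalone sketch (the rule type, slowQuery, and pendingGauge below are simplified stand-ins, not vmalert code): Exec-style code holds the write lock while the annotation template runs a slow query, so a /metrics-style reader blocks on RLock until the query completes.

package main

import (
	"fmt"
	"sync"
	"time"
)

// rule mimics an AlertingRule: a mutex guarding its alerts.
type rule struct {
	mu     sync.RWMutex
	alerts map[string]string
}

// slowQuery stands in for the `query` template function hitting a slow datasource.
func slowQuery() string {
	time.Sleep(5 * time.Second)
	return "2"
}

// exec mimics Rule.Exec: it takes the write lock and expands annotations under it.
func (r *rule) exec() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.alerts["Blocking"] = slowQuery() // annotation expansion while holding the lock
}

// pendingGauge mimics a gauge callback: it needs the read lock to count alerts.
func (r *rule) pendingGauge() float64 {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return float64(len(r.alerts))
}

func main() {
	r := &rule{alerts: map[string]string{}}
	go r.exec() // rule evaluation starts and grabs the write lock

	time.Sleep(100 * time.Millisecond) // let exec() acquire the lock first
	start := time.Now()
	// This call (like scraping /metrics) blocks until exec() releases the lock.
	fmt.Printf("pending=%v, blocked for %s\n", r.pendingGauge(), time.Since(start))
}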

Version

vmalert v1.99.0

Logs

No response

Screenshots

No response

Used command-line flags

No response

Additional information

No response

@dmitryk-dk added the bug and vmalert labels on Apr 8, 2024
@hagen1778
Collaborator

Looks like vmalert gets blocked here:

ar.metrics.pending = utils.GetOrCreateGauge(fmt.Sprintf(`vmalert_alerts_pending{%s}`, labels),
	func() float64 {
		ar.alertsMu.RLock()
		defer ar.alertsMu.RUnlock()
		var num int
		for _, a := range ar.alerts {
			if a.State == notifier.StatePending {
				num++
			}
		}
		return float64(num)
	})
ar.metrics.active = utils.GetOrCreateGauge(fmt.Sprintf(`vmalert_alerts_firing{%s}`, labels),
	func() float64 {
		ar.alertsMu.RLock()
		defer ar.alertsMu.RUnlock()
		var num int
		for _, a := range ar.alerts {
			if a.State == notifier.StateFiring {
				num++
			}
		}
		return float64(num)
	})

When the /metrics endpoint is called, the metrics package invokes all Gauge callbacks to collect the data. In vmalert's case, those callbacks try to acquire the lock to read the data.

At the same time, the same AlertingRule object could have already acquired the lock in the Exec function, where it is held to update the object's fields. One of these fields is Annotations, which can take extra time to update when it contains the query template function (it executes an arbitrary query configured by the user). In all other cases, when there is no query function in the annotations, updating the fields takes no time.

One solution here is to make the Gauge metrics read atomic values instead of acquiring the lock. This should make metrics collection independent of rule evaluation.
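A minimal sketch of that idea (this is not the actual change from #6129; the counters type below is a hypothetical stand-in): the evaluation loop publishes precomputed counts into atomics, and the gauge callbacks only read them, so scraping /metrics never needs the rule lock.

package main

import (
	"fmt"
	"sync/atomic"
)

// counters hold precomputed alert counts so gauge callbacks never need the rule lock.
type counters struct {
	pending atomic.Int64
	firing  atomic.Int64
}

// update is called at the end of each evaluation, while the rule still owns its own lock.
func (c *counters) update(pending, firing int) {
	c.pending.Store(int64(pending))
	c.firing.Store(int64(firing))
}

// pendingGauge and firingGauge stand in for the callbacks registered for
// vmalert_alerts_pending / vmalert_alerts_firing; they only read atomics.
func (c *counters) pendingGauge() float64 { return float64(c.pending.Load()) }
func (c *counters) firingGauge() float64  { return float64(c.firing.Load()) }

func main() {
	var c counters
	c.update(3, 1) // evaluation publishes fresh counts
	fmt.Println(c.pendingGauge(), c.firingGauge())
}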

@Haleygo
Collaborator

Haleygo commented Apr 15, 2024

The /api/v1/rules and /api/v1/alerts endpoints would be blocked by the same lock problem as described here.

@hagen1778
Collaborator

@dmitryk-dk #6129 has been merged and will be included in the next release.

@valyala
Collaborator

valyala commented Apr 19, 2024

FYI, this bugfix has been included in the v1.97.4 LTS release. It will also be included in the upcoming non-LTS release.

@hagen1778
Collaborator

vmalert no longer blocks the /api/v1/rules, /api/v1/alerts, and /metrics APIs during rule evaluation, starting from the v1.101.0 release.
