
Upgrading v1.63.0 -> v1.65.0 leads to unexpected huge datapoint #1628

Closed
sequix opened this issue Sep 18, 2021 · 4 comments
Labels: question (The question issue)

Comments


sequix commented Sep 18, 2021

Describe the bug

I was upgrading our VictoriaMetrics cluster in the following order:

  1. all vmagents;
  2. vmauth, vminsert, vmstorage, vmselect at the same time, machine by machine (deployed on 3 machines);

After upgrading one machine, I got the following result for a random time series from a random vmselect endpoint (which caused quite a few alerts to be sent...):

[screenshot]

I noticed the vmagent went down at the same instant:

[screenshot]

After upgrading all machines, the huge data point disappeared:

[screenshot]

To Reproduce

Use the following program to send the same write request to different cluster version setups.

// CGO_ENABLED=0 GOARCH=amd64 GOOS=linux go build -o test -ldflags "-s -w" main.go
package main
 
import (
   "fmt"
   "math"
 
   "github.com/go-resty/resty/v2"
   "github.com/golang/snappy"
   "github.com/prometheus/prometheus/prompb"
)
 
// StaleNaN is a special NaN value, which is used as Prometheus staleness mark.
// See https://www.robustperception.io/staleness-and-promql
var StaleNaN = math.Float64frombits(staleNaNBits)
 
const (
   // staleNaNBits is the bit representation of the Prometheus staleness mark (aka stale NaN).
   // This mark is put by Prometheus at the end of time series for improving staleness detection.
   // See https://www.robustperception.io/staleness-and-promql
   staleNaNBits uint64 = 0x7ff0000000000002
)
 
func main() {
   ts := prompb.TimeSeries{}
   ts.Labels = append(ts.Labels, &prompb.Label{Name: "__name__", Value: "foo"})
   // 14 samples
   ts.Samples = []prompb.Sample{
      {Timestamp: 1631788245000, Value: 1660963833},
      {Timestamp: 1631788260000, Value: 1660963833},
      {Timestamp: 1631788275000, Value: 1660976042},
      {Timestamp: 1631788290000, Value: 1660976042},
      {Timestamp: 1631788305000, Value: 1660988566},
      {Timestamp: 1631788320000, Value: 1660988566},
      {Timestamp: 1631788350000, Value: StaleNaN},
      //{Timestamp: 1631788350000, Value: math.Float64frombits(0x80000000000000c0)},
      {Timestamp: 1631788365000, Value: 1661013023},
      {Timestamp: 1631788380000, Value: 1661013023},
      {Timestamp: 1631788395000, Value: 1661025362},
      {Timestamp: 1631788410000, Value: 1661025362},
      {Timestamp: 1631788425000, Value: 1661037610},
      {Timestamp: 1631788440000, Value: 1661037610},
      {Timestamp: 1631788455000, Value: 1661049815},
   }
   //---
   req := prompb.WriteRequest{
      Timeseries: []*prompb.TimeSeries{&ts},
   }
   reqBuf, err := req.Marshal()
   if err != nil {
      panic(err)
   }
   reqBytes := snappy.Encode(nil, reqBuf)
   //---
   client := resty.New().
      SetHeader("Content-Type", "application/x-protobuf").
      SetHeader("Content-Encoding", "snappy").
      OnAfterResponse(func(c *resty.Client, rsp *resty.Response) error {
         if rsp.IsError() {
            return fmt.Errorf("http error, code %d, body %s", rsp.StatusCode(), string(rsp.Body()))
         }
         return nil
      })
   rsp, err := client.NewRequest().SetBody(reqBytes).Post("http://127.0.0.1:8480/insert/0/prometheus/api/v1/write")
   if err != nil {
      fmt.Printf("response error %s\n", err)
   }
   fmt.Printf("response code %d\n", rsp.StatusCode())
   fmt.Println(string(rsp.Body()))
}

The command lines for every setup:

vmstorage &
vminsert -storageNode=127.0.0.1:8400 &
vmselect -storageNode=127.0.0.1:8401 &

And get the results with:

curl -G 'http://127.0.0.1:8481/select/0/prometheus/api/v1/export' -d 'match[]=foo{}'

Results:

storage63 + select63 + insert63
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage63 + select63 + insert65
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage63 + select65 + insert63
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage63 + select65 + insert65
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage65 + select63 + insert63
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,9223372036854776000,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788350000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage65 + select63 + insert65
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,9223372036854776000,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788350000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage65 + select65 + insert63
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,NaN,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788350000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage65 + select65 + insert65
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,NaN,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788350000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

Expected behavior

I do not know exactly what is expected of the lower-version components, but alerts triggered by these huge data points should be avoided.

Maybe an instruction on upgrade order and pitfalls, or a heads-up in the release notes, is needed to articulate this behavior and how to avoid it.


sequix commented Sep 18, 2021

And a question: how does vmselect select which vmstorage to request?

From my reading of the code, vmselect will read from all vmstorage nodes until it gets what it wants?
Unlike vminsert, which uses jumpHash(xxHash(accountID+projectID+labels))?

The reason I ask is that when I query vmselect directly, I get consistent results when refreshing via the button in the corner of Grafana, even though all vmselect instances share the same -storageNode command-line config, which I would expect to give different results, since each query goes to all vmstorage nodes at once?

valyala added the question label on Sep 20, 2021

valyala commented Sep 20, 2021

@sequix, thanks for the detailed analysis of the issue! VictoriaMetrics components emit and process Prometheus staleness marks starting from v1.64.0 - see https://docs.victoriametrics.com/CHANGELOG.html#v1640 and #1526 . Previous versions of VictoriaMetrics (v1.63.0 or lower) just drop the staleness marks during data ingestion (at vminsert), so they don't propagate to vmstorage and aren't visible during querying via vmselect. When vminsert is upgraded from v1.63.0 to v1.64.0 or higher, it starts forwarding Prometheus staleness marks to vmstorage, so they become visible to vmselect during querying. Unfortunately, vmselect prior to v1.64.0 doesn't know how to deal properly with these staleness marks, so it treats them as big numbers (9223372036854776000). This is what was observed during the upgrade in your setup. It is fixed automatically once vmselect is upgraded to v1.64.0 or higher.

So cluster components of VictoriaMetrics should be upgraded in the following order:

  1. Upgrade all the vmselect components, so they know how to deal with staleness markers.
  2. Upgrade vminsert and vmstorage components in any order.
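
For reference, the staleness mark mentioned above is an ordinary float64 NaN with a specific bit pattern, so a component that supports it has to compare the raw bits rather than the value (any comparison against NaN is false). A minimal standalone sketch of such a check (plain Go, not the actual VictoriaMetrics code; staleNaNBits is the same constant as in the reproduction program above):

package main

import (
   "fmt"
   "math"
)

// staleNaNBits is the bit pattern Prometheus uses for staleness marks.
const staleNaNBits uint64 = 0x7ff0000000000002

// isStaleNaN reports whether v is a Prometheus staleness mark.
// A plain NaN check is not enough: the mark must be distinguished
// from ordinary NaN values by its exact bit pattern.
func isStaleNaN(v float64) bool {
   return math.Float64bits(v) == staleNaNBits
}

func main() {
   staleNaN := math.Float64frombits(staleNaNBits)
   fmt.Println(isStaleNaN(staleNaN))   // true
   fmt.Println(isStaleNaN(math.NaN())) // false: ordinary NaN
   fmt.Println(isStaleNaN(1661013023)) // false: regular sample value
}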

And a question: how does vmselect select which vmstorage to request?
From my reading of the code, vmselect will read from all vmstorage nodes until it gets what it wants?
Unlike vminsert, which uses jumpHash(xxHash(accountID+projectID+labels))?

Correct! vmselect queries all the vmstorage nodes listed in the -storageNode command-line flag. This is needed because the data required for a query can be located on any vmstorage node, and vmselect cannot determine the particular vmstorage nodes to query for each incoming request.
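
For context, the jumpHash(xxHash(...)) mentioned in the question presumably refers to the Jump Consistent Hash algorithm by Lamping and Veach, used to pick one vmstorage node per time series on the insert path. A rough illustrative sketch under that assumption (the series-key serialization below is made up for illustration; it is not the exact encoding vminsert uses):

package main

import (
   "fmt"

   "github.com/cespare/xxhash/v2"
)

// jumpHash maps a 64-bit key onto one of `buckets` nodes using the
// Jump Consistent Hash algorithm (Lamping & Veach, 2014).
func jumpHash(key uint64, buckets int32) int32 {
   var b, j int64 = -1, 0
   for j < int64(buckets) {
      b = j
      key = key*2862933555777941757 + 1
      j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
   }
   return int32(b)
}

func main() {
   // Hypothetical serialization of accountID, projectID and sorted labels;
   // the real vminsert encoding differs.
   seriesKey := []byte("0|0|__name__=foo,instance=host-1")
   storageNodes := int32(3)

   nodeIdx := jumpHash(xxhash.Sum64(seriesKey), storageNodes)
   fmt.Printf("series routed to vmstorage node #%d of %d\n", nodeIdx, storageNodes)
}

Because the node index depends only on the hashed series key and the number of nodes, vminsert instances sharing the same -storageNode list route a given series to the same node, while vmselect, as noted above, fans each query out to all nodes.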


sequix commented Sep 20, 2021

That clears up most of my confusion, but it does not explain this test result if vminsert v1.63.0 does not propagate StaleNaN to vmstorage. So my guess is that vminsert before v1.64.0 just sends any data on to vmstorage, whether it is a StaleNaN or not?

storage65 + select63 + insert63
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,9223372036854776000,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788350000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}


sequix commented Sep 28, 2021


Seems so, closing this.

sequix closed this as completed on Sep 28, 2021