
Upgrading v1.63.0 -> v1.65.0 leads to unexpected huge datapoint #1628

Closed
sequix opened this issue Sep 18, 2021 · 4 comments
Labels: question (The question issue)

Comments


sequix commented Sep 18, 2021

Describe the bug

I was upgrading our VictoriaMetrics cluster in the following order:

  1. all vmagents;
  2. vmauth, vminsert, vmstorage, vmselect at the same time, machine by machine (deployed on 3 machines);

After upgrading one machine, I got the following result for a random time series from a random vmselect endpoint (which caused quite a few alerts to be sent...):

[screenshot]

I noticed the vmagent went down at the same instant:

[screenshot]

After upgrading all machines, the huge data point disappeared:

[screenshot]

To Reproduce

Use the following program to send the same write request to different cluster version setups.

// CGO_ENABLED=0 GOARCH=amd64 GOOS=linux go build -o test -ldflags "-s -w" main.go
package main
 
import (
   "fmt"
   "math"
 
   "github.com/go-resty/resty/v2"
   "github.com/golang/snappy"
   "github.com/prometheus/prometheus/prompb"
)
 
// StaleNaN is a special NaN value, which is used as Prometheus staleness mark.
// See https://www.robustperception.io/staleness-and-promql
var StaleNaN = math.Float64frombits(staleNaNBits)
 
const (
   // staleNaNBits is the bit representation of the Prometheus staleness mark (aka stale NaN).
   // This mark is put by Prometheus at the end of time series for improving staleness detection.
   // See https://www.robustperception.io/staleness-and-promql
   staleNaNBits uint64 = 0x7ff0000000000002
)
 
func main() {
   ts := prompb.TimeSeries{}
   ts.Labels = append(ts.Labels, &prompb.Label{Name: "__name__", Value: "foo"})
   // 14 samples
   ts.Samples = []prompb.Sample{
      {Timestamp: 1631788245000, Value: 1660963833},
      {Timestamp: 1631788260000, Value: 1660963833},
      {Timestamp: 1631788275000, Value: 1660976042},
      {Timestamp: 1631788290000, Value: 1660976042},
      {Timestamp: 1631788305000, Value: 1660988566},
      {Timestamp: 1631788320000, Value: 1660988566},
      {Timestamp: 1631788350000, Value: StaleNaN},
      //{Timestamp: 1631788350000, Value: math.Float64frombits(0x80000000000000c0)},
      {Timestamp: 1631788365000, Value: 1661013023},
      {Timestamp: 1631788380000, Value: 1661013023},
      {Timestamp: 1631788395000, Value: 1661025362},
      {Timestamp: 1631788410000, Value: 1661025362},
      {Timestamp: 1631788425000, Value: 1661037610},
      {Timestamp: 1631788440000, Value: 1661037610},
      {Timestamp: 1631788455000, Value: 1661049815},
   }
   //---
   req := prompb.WriteRequest{
      Timeseries: []*prompb.TimeSeries{&ts},
   }
   reqBuf, err := req.Marshal()
   if err != nil {
      panic(err)
   }
   reqBytes := snappy.Encode(nil, reqBuf)
   //---
   client := resty.New().
      SetHeader("Content-Type", "application/x-protobuf").
      SetHeader("Content-Encoding", "snappy").
      OnAfterResponse(func(c *resty.Client, rsp *resty.Response) error {
         if rsp.IsError() {
            return fmt.Errorf("http error, code %d, body %s", rsp.StatusCode(), string(rsp.Body()))
         }
         return nil
      })
   rsp, err := client.NewRequest().SetBody(reqBytes).Post("http://127.0.0.1:8480/insert/0/prometheus/api/v1/write")
   if err != nil {
      fmt.Printf("response error %s\n", err)
   }
   fmt.Printf("response code %d\n", rsp.StatusCode())
   fmt.Println(string(rsp.Body()))
}

The command lines for every setup:

vmstorage &
vminsert -storageNode=127.0.0.1:8400 &
vmselect -storageNode=127.0.0.1:8401 &

And get the results with:

curl -G 'http://127.0.0.1:8481/select/0/prometheus/api/v1/export' -d 'match[]=foo{}'

Results:

storage63 + select63 + insert63
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage63 + select63 + insert65
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage63 + select65 + insert63
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage63 + select65 + insert65
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage65 + select63 + insert63
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,9223372036854776000,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788350000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage65 + select63 + insert65
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,9223372036854776000,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788350000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage65 + select65 + insert63
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,NaN,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788350000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

storage65 + select65 + insert65
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,NaN,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788350000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}

Expected behavior

I do not know exactly what is expected of the lower-version components, but alerts triggered by these huge data points should be avoided.

Maybe an instruction on upgrade order and pitfalls, or a heads-up in the release notes, is needed to articulate this behavior and how to avoid it.


sequix commented Sep 18, 2021

And a question: how does vmselect select which vmstorage to request?

From my reading of the code, vmselect will read from all vmstorage nodes until it gets what it wants?
Unlike vminsert, which uses jumpHash(xxHash(accountID+projectID+labels))?

The reason I ask is that when I query vmselect directly, I get consistent results when refreshing via the button in the corner of Grafana, even though all vmselect instances share the same -storageNode command-line config, which I would expect to give different results, since each query goes to all vmstorage nodes at once?

valyala added the question label on Sep 20, 2021

valyala commented Sep 20, 2021

@sequix, thanks for the detailed analysis of the issue! VictoriaMetrics components emit and process Prometheus staleness marks starting from v1.64.0 - see https://docs.victoriametrics.com/CHANGELOG.html#v1640 and #1526 . Previous versions of VictoriaMetrics (v1.63.0 or lower) just drop the staleness marks during data ingestion (at vminsert), so they don't propagate to vmstorage and aren't visible during querying via vmselect. When vminsert is upgraded from v1.63.0 to v1.64.0 or higher, it starts forwarding Prometheus staleness marks to vmstorage, so they become visible to vmselect during querying. Unfortunately, vmselect prior to v1.64.0 doesn't know how to deal properly with these staleness marks, so it treats them as big numbers (9223372036854776000). This is what was observed during the upgrade in your setup. It is fixed automatically once vmselect is upgraded to v1.64.0 or higher.

So cluster components of VictoriaMetrics should be upgraded in the following order:

  1. Upgrade all the vmselect components, so they know how to deal with staleness markers.
  2. Upgrade vminsert and vmstorage components in any order.
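
For reference, the staleness mark mentioned above is an ordinary float64 NaN with a specific bit pattern, so a component that supports it has to compare the raw bits rather than the value (any comparison against NaN is false). A minimal standalone sketch of such a check (plain Go, not the actual VictoriaMetrics code; staleNaNBits is the same constant as in the reproduction program above):

package main

import (
   "fmt"
   "math"
)

// staleNaNBits is the bit pattern Prometheus uses for staleness marks.
const staleNaNBits uint64 = 0x7ff0000000000002

// isStaleNaN reports whether v is a Prometheus staleness mark.
// A plain NaN check is not enough: the mark must be distinguished
// from ordinary NaN values by its exact bit pattern.
func isStaleNaN(v float64) bool {
   return math.Float64bits(v) == staleNaNBits
}

func main() {
   staleNaN := math.Float64frombits(staleNaNBits)
   fmt.Println(isStaleNaN(staleNaN))   // true
   fmt.Println(isStaleNaN(math.NaN())) // false: ordinary NaN
   fmt.Println(isStaleNaN(1661013023)) // false: regular sample value
}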

And a question: how does vmselect select which vmstorage to request?
From my reading of the code, vmselect will read from all vmstorage nodes until it gets what it wants?
Unlike vminsert, which uses jumpHash(xxHash(accountID+projectID+labels))?

Correct! vmselect queries all the vmstorage nodes listed in the -storageNode command-line flag. This is needed because the data required for a query can be located on any vmstorage node, and vmselect cannot determine the particular vmstorage nodes to query for each incoming request.
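
For context, the jumpHash(xxHash(...)) mentioned in the question presumably refers to the Jump Consistent Hash algorithm by Lamping and Veach, used to pick one vmstorage node per time series on the insert path. A rough illustrative sketch under that assumption (the series-key serialization below is made up for illustration; it is not the exact encoding vminsert uses):

package main

import (
   "fmt"

   "github.com/cespare/xxhash/v2"
)

// jumpHash maps a 64-bit key onto one of `buckets` nodes using the
// Jump Consistent Hash algorithm (Lamping & Veach, 2014).
func jumpHash(key uint64, buckets int32) int32 {
   var b, j int64 = -1, 0
   for j < int64(buckets) {
      b = j
      key = key*2862933555777941757 + 1
      j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
   }
   return int32(b)
}

func main() {
   // Hypothetical serialization of accountID, projectID and sorted labels;
   // the real vminsert encoding differs.
   seriesKey := []byte("0|0|__name__=foo,instance=host-1")
   storageNodes := int32(3)

   nodeIdx := jumpHash(xxhash.Sum64(seriesKey), storageNodes)
   fmt.Printf("series routed to vmstorage node #%d of %d\n", nodeIdx, storageNodes)
}

Because the node index depends only on the hashed series key and the number of nodes, vminsert instances sharing the same -storageNode list route a given series to the same node, while vmselect, as noted above, fans each query out to all nodes.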


sequix commented Sep 20, 2021

That clears up most of my confusion, but it does not explain this test result if vminsert v1.63.0 does not propagate StaleNaN to vmstorage. So my guess is that vminsert before v1.64.0 just sends any data on to vmstorage, whether it is a StaleNaN or not?

storage65 + select63 + insert63
{"metric":{"__name__":"foo"},"values":[1660963833,1660963833,1660976042,1660976042,1660988566,1660988566,9223372036854776000,1661013023,1661013023,1661025362,1661025362,1661037610,1661037610,1661049815],"timestamps":[1631788245000,1631788260000,1631788275000,1631788290000,1631788305000,1631788320000,1631788350000,1631788365000,1631788380000,1631788395000,1631788410000,1631788425000,1631788440000,1631788455000]}


sequix commented Sep 28, 2021


Seems so, closing this.

sequix closed this as completed on Sep 28, 2021