vmselect unreasonable duplicate time series error #1501

Closed
LiuPacific opened this issue Jul 28, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@LiuPacific

1. Describe the bug

The following query fails with an unexpected duplicate time series error:
world * on(myname) group_left(hair) tpmetric

The right-hand expression of the binary operator returns two time series, but one of them is already stale: querying it directly returns an empty result. The query still fails with a duplicate time series error.
In terms of code, the value of the stale time series is NaN, and mergeNonOverlappingTimeseries has not handled this case since git commit b473c21915d27bbf1b64d485ab0c757fc76f494d, titled "app/vmselect/promql: do not merge time series during requests to /api/v1/query".

2. Version

tag v1.63.0-cluster

3. Used command-line flags

vmselect program arguments

-storageNode=localhost:8401  -storageNode=localhost:8501 -tls=false -httpListenAddr=0.0.0.0:18481

4. Reproduce

4.1. source data

  • left timeseries: world{myname="tp0"}
{"metric":{"__name__":"world","myname":"tp0"},"values":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],"timestamps":[1626401155000,1626401185000,1626401215000,1626401245000,1626401275000,1626401305000,1626401335000,1626401365000,1626401395000,1626401425000,1626401455000,1626401485000,1626401515000,1626401545000,1626401575000,1626401605000,1626401635000,1626401665000,1626401695000,1626401725000,1626401755000,1626401785000]}

1626401155 -> 1626401785

  • right timeseries0: tpmetric{address="beijing", hair="black", myname="tp0"}
{"metric":{"__name__":"tpmetric","myname":"tp0","address":"beijing","hair":"black"},"values":[2,3,4,2,5,7,5,9,6,5],"timestamps":[1626401155000,1626401185000,1626401215000,1626401245000,1626401275000,1626401305000,1626401335000,1626401365000,1626401395000,1626401425000]}

1626401155 -> 1626401425

  • right timeseries1: tpmetric{address="shenzhen", hair="black", myname="tp0"}
{"metric":{"__name__":"tpmetric","myname":"tp0","address":"shenzhen","hair":"black"},"values":[1,1,1,1,1,1,1,1,1,1,1,1],"timestamps":[1626401455000,1626401485000,1626401515000,1626401545000,1626401575000,1626401605000,1626401635000,1626401665000,1626401695000,1626401725000,1626401755000,1626401785000]}

1626401455 -> 1626401785
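The "first -> last" ranges listed under each series can be checked by decoding the JSON lines. The following is a minimal sketch: the `exportLine` struct and the `spanSeconds` helper are hypothetical names introduced here for illustration, and the constant holds the right timeseries0 line copied from above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// exportLine mirrors the fields of the JSON lines shown above.
// The type name is hypothetical; only the field names come from the data.
type exportLine struct {
	Metric     map[string]string `json:"metric"`
	Values     []float64         `json:"values"`
	Timestamps []int64           `json:"timestamps"` // milliseconds
}

// spanSeconds returns the first and last sample timestamps in seconds.
func spanSeconds(line string) (first, last int64, err error) {
	var l exportLine
	if err := json.Unmarshal([]byte(line), &l); err != nil {
		return 0, 0, err
	}
	if len(l.Timestamps) == 0 {
		return 0, 0, fmt.Errorf("no samples in line")
	}
	return l.Timestamps[0] / 1000, l.Timestamps[len(l.Timestamps)-1] / 1000, nil
}

// beijing is the "right timeseries0" line from the source data above.
const beijing = `{"metric":{"__name__":"tpmetric","myname":"tp0","address":"beijing","hair":"black"},"values":[2,3,4,2,5,7,5,9,6,5],"timestamps":[1626401155000,1626401185000,1626401215000,1626401245000,1626401275000,1626401305000,1626401335000,1626401365000,1626401395000,1626401425000]}`

func main() {
	first, last, err := spanSeconds(beijing)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d -> %d\n", first, last)
}
```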

4.2. request

  • expr
world * on(myname) group_left(hair) tpmetric
  • url
http://localhost:18481/select/5/prometheus/api/v1/query?query=world * on(myname) group_left(hair) tpmetric&time=1626401785&nocache=1
  • http response
{
    "status": "error",
    "errorType": "422",
    "error": "error when executing query=\"world * on(myname) group_left(hair) tpmetric\" for (time=1626401785000, step=300000): cannot evaluate \"world * on (myname) group_left (hair) tpmetric\": duplicate time series on the right side of `* on (myname) group_left (hair)`: {address=\"beijing\", hair=\"black\", myname=\"tp0\"} and {address=\"shenzhen\", hair=\"black\", myname=\"tp0\"}"
}

The request returns the duplicate time series error for query times between 1626401455 and 1626402025.
However, the response of http://localhost:18481/select/5/prometheus/api/v1/query?query=tpmetric{address="beijing", hair="black", myname="tp0"} &time=1626401725&nocache=1 has already been empty since 1626401725.

4.3. log

2021-07-27T14:01:52.876Z        warn    app/vmselect/main.go:523        error in "/select/5/prometheus/api/v1/query?query=world%20*%20on(myname)%20group_left(hair)%20tpmetric&time=1626401785&nocache=1": error when executing query="world * on(myname) group_left(hair) tpmetric" for (time=1626401785000, step=300000): cannot evaluate "world * on (myname) group_left (hair) tpmetric": duplicate time series on the right side of `* on (myname) group_left (hair)`: {address="shenzhen", hair="black", myname="tp0"} and {address="beijing", hair="black", myname="tp0"}

4.4. time summary

tpmetric{address="beijing", hair="black", myname="tp0"} has been empty since 1626401725.

series / event                from              to
world                         1626401155        1626401785
tpmetric{"beijing"}           1626401155        1626401425
tpmetric{"shenzhen"}          1626401455        1626401785
duplicate timeseries error    1626401455        1626402025
tpmetric{"beijing"} empty     since 1626401725
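The gaps between these timestamps can be compared with the step=300000 (milliseconds) shown in the error message. The sketch below only does the arithmetic on the numbers from this report; the constant and function names are hypothetical, and it makes no claim about how vmselect picks its lookbehind window.

```go
package main

import "fmt"

// Timestamps in seconds, copied from the time summary above.
const (
	lastBeijingSample = 1626401425 // last sample of tpmetric{address="beijing"}
	beijingEmptySince = 1626401725 // direct query for the beijing series is empty from here
	errorUntil        = 1626402025 // last query time reported with the duplicate error
	stepSeconds       = 300        // step=300000 ms from the error message
)

// gap returns the distance in seconds between two timestamps.
func gap(from, to int64) int64 {
	return to - from
}

func main() {
	// Distance from the last beijing sample to the point where it reads as empty.
	fmt.Println(gap(lastBeijingSample, beijingEmptySince) == stepSeconds)
	// Distance from "beijing is empty" to the end of the duplicate error range.
	fmt.Println(gap(beijingEmptySince, errorUntil) == stepSeconds)
}
```

Both comparisons hold, so the duplicate error keeps being reported for one more step after the beijing series already reads as empty on its own.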

5. guess

5.1. NaN timeseries cannot be deleted in doInternal()

http://localhost:18481/select/5/prometheus/api/v1/query?query=world * on(myname) group_left(hair) tpmetric&time=1626401785&nocache=1

app/vmselect/promql.rollupLast at rollup.go: doInternal

func (rc *rollupConfig) doInternal(dstValues []float64, tsm *timeseriesMap, values []float64, timestamps []int64) []float64 {
...
		rfa.values = values[i:j] //i:9, j:9, rfa.values: `[]`
...
		value := f(rfa) // f: rollupLast
...
	return dstValues //value: `[NaN]`

5.2. the change in mergeNonOverlappingTimeseries logic

mergeNonOverlappingTimeseries skips the NaN handling when both time series have no more than 2 values each: it returns false early and never reaches the math.IsNaN(v) check followed by continue.

func mergeNonOverlappingTimeseries(dst, src *timeseries) bool {
	...
	// Do not merge time series with too small number of datapoints.
	// This can be the case during evaluation of instant queries (alerting or recording rules).
	// See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1141
	if len(srcValues) <= 2 && len(dstValues) <= 2 {
		return false
	}
	// Time series can be merged. Merge them.
	for i, v := range srcValues {
		if math.IsNaN(v) {
			continue
		}
		dstValues[i] = v
	}
	return true
}

5.3. the right time series whose value is NaN is added to tsExisting in groupJoin()

I added some code to work around it temporarily by skipping right-side time series whose only value is NaN.

app/vmselect/promql.groupJoin at binary_op.go groupJoin

func groupJoin(singleTimeseriesSide string, be *metricsql.BinaryOpExpr, rvsLeft, rvsRight, tssLeft, tssRight []*timeseries) ([]*timeseries, []*timeseries, error) {
	...
	for _, tsLeft := range tssLeft {
		...
		bb := bbPool.Get()
		for _, tsRight := range tssRight {


			// >>> Temporary fix added by me: skip right-side series whose only value is NaN.
			if len(tsRight.Values) == 1 && math.IsNaN(tsRight.Values[0]) {
				continue
			}
			// <<< End of temporary fix.
...
@LiuPacific changed the title from "unreasonable duplicate time series error" to "vmselect unreasonable duplicate time series error" on Jul 28, 2021
@hagen1778
Collaborator

@valyala could you please take a look?

@valyala added the bug label on Aug 15, 2021
@valyala
Collaborator

valyala commented Aug 15, 2021

@LiuPacific , could you check whether the issue is fixed in the latest commits of master and cluster branches? VictoriaMetrics and vmagent gained support for Prometheus staleness markers - see this comment. Now VictoriaMetrics should handle stale time series for disappeared scrape targets in the same way as Prometheus does. Note that the stale time series handling works only for newly ingested samples after the upgrade of VictoriaMetrics and vmagent to the latest commits in master and cluster branches.

The parent issue - #1526

@valyala
Collaborator

valyala commented Aug 15, 2021

FYI, VictoriaMetrics and vmagent gained support for Prometheus staleness markers starting from the release v1.64.0.

@LiuPacific
Author

It works for me. Thank you all.
