Downsampling data #36

Open · iamalryz opened this issue May 27, 2019 · 30 comments

@iamalryz commented May 27, 2019

Is there an ability to downsample data?

For example, I need to store raw metrics for 1 month, and metrics aggregated by 30 minutes for 1 year.

@valyala (Contributor) commented May 27, 2019

VictoriaMetrics doesn't provide automatic downsampling at the moment, but it may be implemented using the following approach:

  • Run multiple VictoriaMetrics instances (or clusters) with distinct retentions, since each VictoriaMetrics instance works with a single retention.
  • Periodically scrape the required downsampled data via the /federate API from the instance with raw data and store it in the instance with the longer retention.

We are planning to add recording rules to VictoriaMetrics with the ability to export the recorded data into external storage, such as another VictoriaMetrics instance with a longer retention.
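
As an illustration of the /federate approach above, a minimal sketch (not an official recipe): a separate Prometheus instance scrapes the /federate endpoint of the short-retention VictoriaMetrics on a coarse interval and forwards the samples to the long-retention instance via remote_write. The hostnames, job name, 30m interval and match[] selector below are assumptions for illustration only.

global:
  scrape_interval: 30m          # the coarse scrape interval acts as the downsampling step

scrape_configs:
  - job_name: downsample-federate
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{__name__=~".+"}'    # narrow this selector to the series worth keeping long-term
    static_configs:
      - targets: ['short-retention-vm:8428']

remote_write:
  - url: http://long-retention-vm:8428/api/v1/write

Because /federate returns the latest value of each matching series, this effectively keeps one raw sample per series per 30 minutes rather than an aggregate.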

side notes

Downsampling is usually used for two purposes:

  • reducing query time over long time ranges
  • reducing the required storage size

VictoriaMetrics is optimized for both cases.

So VictoriaMetrics works quite well without downsampling. An additional benefit is that you can drill down into old data over small time ranges without precision loss.

@thulle commented Jun 9, 2019

I'm commenting partly as a +1 on this being a requested feature, and partly to answer questions posted on reddit by @valyala about reasons to use VM.

Being able to downsample data like RRD, combined with the current storage efficiency for large amounts of time series, and being able to query this downsampled data transparently is, IMHO, the feature set required for VM to become the obvious choice for devops/dashboards.
By transparently I mean that older data should be returned in lower resolution, without the querier having to do anything differently. This would allow debugging issues using high-resolution data as they happen and viewing trends over long periods just by changing the time range in a dashboard, instead of having to modify queries & datasources in existing/pre-made dashboards or storing huge amounts of data.

@tenmozes mentioned this issue Jul 27, 2019

@jujo1 commented Oct 2, 2019

There appear to be limited updates on this topic. The need for downsampling (in particular OHLC bars) is paramount for use cases involving financial price data. The objective is to store data long term at high frequency and, most commonly, to use the data at varying lower frequencies.

@valyala (Contributor) commented Oct 6, 2019

@jujo1 , did you try using rollup_candlestick(m[d]) on the raw data? This function returns four time series per input time series - open, high, low and close. See https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/ExtendedPromQL for details.
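
For instance, a hedged sketch of such a query, where price_usd is a hypothetical metric name and the 30m window is an arbitrary bar size:

# Returns four series per input series, labeled open, high, low and close,
# computed over 30-minute windows.
rollup_candlestick(price_usd[30m])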

Could you answer the following questions so we can better understand your use case:

  1. What is the interval between data points in a single time series? Note that the minimal interval between data points supported by VictoriaMetrics is 1 millisecond.
  2. How many unique time series are queried by a single query, on average and at maximum?
  3. What is the average and the maximum query interval? (hour, day, week, month, year, etc.)
  4. Could you provide a slow query for your case, so we could optimize it without resorting to downsampling?

VictoriaMetrics is able to scan up to 50 million data points per second per CPU core, and the performance scales almost linearly with the number of CPU cores. For instance, 20 CPU cores can result in a scan speed of up to 1 billion data points per second.

Example calculation: if new points for each time series arrive every 100ms, then a single time series for one year would contain 10*3600*24*365=315M data points. This means that VictoriaMetrics could process up to 3 such time series per second on 20 CPU cores for a year-long time range.

In the meantime you can create a custom script which would periodically fetch downsampled OHLC numbers from the raw data and put them into a separate VictoriaMetrics instance for fast query processing over long time ranges.

@valyala (Contributor) commented Apr 28, 2020

FYI, it is possible to (ab)use deduplication for simple downsampling. For instance, if the scrape interval for the ingested data is 15s while -dedup.minScrapeInterval is set to 5m, then VictoriaMetrics will leave only a single sample per 5m interval in the storage. See these docs for details.
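
A minimal sketch of that setup for a single-node deployment (the binary name and the 24-month retention value are illustrative, not recommendations):

# Data is scraped every 15s, but only roughly one sample per 5-minute interval is kept.
# -retentionPeriod here is in months and is shown only to illustrate pairing dedup with a long retention.
./victoria-metrics-prod -dedup.minScrapeInterval=5m -retentionPeriod=24

Note that this keeps a single raw sample per interval rather than an aggregate, in line with the deduplication semantics described above.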

@valyala (Contributor) commented Jun 5, 2020

FYI, vmalert supports recording rules starting from v1.37.0.

@bojleros commented Aug 13, 2020

Hi, would you please give at least a rough estimate of when the downsampling feature will be released? It is very nice that VM is capable of storing such huge amounts of series, but we'd prefer to use our resources in a more optimal way. In general, old data is far less likely to be queried, and even when it is queried, 5m aggregates are enough instead of full resolution. Current data is likely to be queried quite often, so downsampling means more resources for the most important data -> more efficiency.

@valyala (Contributor) commented Aug 13, 2020

Would you please give at least a rough estimate of when the downsampling feature will be released?

We have already designed a draft architecture for multi-level downsampling and are going to add it to VictoriaMetrics during the next 6 months.

@bojleros commented Aug 19, 2020

@valyala What is the overall plan? Are we going to be able to specify additional retention thresholds so datapoints will be downsampled as soon as their age breaches those thresholds? Would this be possible in simple single-node deployments?

@raags commented Aug 31, 2020

Does downsampling here mean an aggregation across the sampled time range? How will the new data point be calculated?

For example, in Graphite every metric can be configured with an aggregation function (min, max, avg, sum, etc.) that is applied to the sample period, i.e. if 1 min goes down to 1 hour, then http_requests_total should use sum(1h), while http_requests_duration should use avg(1h).

@valyala (Contributor) commented Sep 4, 2020

What is the overall plan? Are we going to be able to specify additional retention thresholds so datapoints will be downsampled as soon as their age breaches those thresholds? Would this be possible in simple single-node deployments?

The draft config will look like the following:

-downsample 1d:1m,7d:5m,30d:1h

This means that downsampling is applied in the following way:

  • Samples with timestamps closer than 1d (one day) to the current time aren't downsampled
  • Samples with timestamps between 1d and 7d from the current time are downsampled to a 1m (one minute) interval
  • Samples with timestamps between 7d and 30d from the current time are downsampled to a 5m interval
  • Older samples are downsampled to a 1h interval.

The functionality will be available in both single-node and cluster versions of VictoriaMetrics.

Does downsampling here mean an aggregation across the sampled time range? How will the new data point be calculated?

Downsampling means leaving only the first point on each configured interval. There is no aggregation.

in Graphite every metric can be configured with an aggregation function (min, max, avg, sum, etc.) that is applied to the sample period, i.e. if 1 min goes down to 1 hour, then http_requests_total should use sum(1h), while http_requests_duration should use avg(1h).

This type of downsampling is easy to misconfigure and is hard to reason about, so we decided to postpone its implementation until after the simple downsampling mentioned above is implemented.

@hekmon (Contributor) commented Sep 4, 2020

That is great news!

So it seems it is a kind of dynamic deduplication window? How will dedup work once downsampling is here? Will it be merged with DS, since DS can be configured to act as dedup?

@valyala (Contributor) commented Sep 7, 2020

How will dedup work once downsampling is here? Will it be merged with DS, since DS can be configured to act as dedup?

The -dedup.minScrapeInterval command-line flag will be left for backwards compatibility, while simple downsampling will re-use the deduplication code.

@razlo commented Jan 7, 2021

Hi @valyala,
Any news regarding the downsampling implementation?

Thanks! 🙏

@valyala (Contributor) commented Jan 7, 2021

The simple downsampling mentioned in #36 (comment) is planned to be implemented in Q1 2021.

@isavcic commented Feb 1, 2021

Downsampling means leaving only the first point on each configured interval. There is no aggregation.

Sorry, but this makes no sense in real-life scenarios. That first point might be an outlier value or even a null, thus skewing the downsampled period. avg() makes more sense for general use and is the default in other TSDBs.

This type of downsampling is easy to misconfigure and is hard to reason about

Can you please elaborate? I can think of literally zero situations where I would want the first point in a period to represent the larger period. Anything else makes more sense to me.

@vainkop commented Feb 4, 2021

Too bad downsampling is not available yet.
I was about to propose a switch from InfluxDB to VM, but downsampling is a requirement, so it's a blocker.

@hdost commented Feb 8, 2021

Is there a design proposed for this already? If so, how can we help?

@sc0rp10 commented May 24, 2021

hi, @valyala!
Is there any news about downsampling?

@valyala (Contributor) commented May 24, 2021

Downsampling is the top priority right now. The ETA is June 2021.

@acruise commented Jul 2, 2021

Sorry, but this makes no sense in real-life scenarios. That first point might be an outlier value or even a null, thus skewing the downsampled period. avg() makes more sense for general use and is the default in other TSDBs.

Agreed, downsampling must aggregate in order to be useful. It'd be OK to only support simple aggregates like min/max/avg/count/sum at first.

One extra wrinkle is that some aggregates should preserve the necessary statistics: e.g. to avoid an average-of-averages error, you really want the finer-grained side of the downsampled dataset to include not just the computed average, but also the (count, sum) it was derived from. For example, a window holding one sample of value 10 and a window holding nine samples of value 0 have per-window averages of 10 and 0, so the average of averages is 5, while the true average over all ten samples is (10+0)/(1+9) = 1; keeping (count, sum) per window preserves the exact answer.

@odinsy commented Jul 5, 2021

Downsampling is the top priority right now. The ETA is June 2021.

Any news? :)

@jinlongwang commented Jul 6, 2021

Will this ability only be available in the Enterprise Edition?

@isavcic commented Jul 6, 2021

Will this ability only be available in the Enterprise Edition?

This man is onto something here.

@isavcic commented Jul 6, 2021

if enough_people_need_the_feature && they_ask_for_it_for_a_long_time {
    make_it_an_enterprise_feature()
}

@valyala (Contributor) commented Jul 6, 2021

Sorry, but this makes no sense in real-life scenarios. That first point might be an outlier value or even a null, thus skewing the downsampled period. avg() makes more sense for general use and is the default in other TSDBs.

While the avg_over_time() function may work well as a downsampling function for gauges, it doesn't work for counters in the general case. It is better to take the first sample on each downsampling interval for counters instead, since this doesn't introduce a calculation error, e.g. rate(m[d]) and increase(m[d]) would return the same results before and after downsampling for the downsampled interval d.

Additional notes:

  • Samples downsampled with avg cannot be downsampled over a bigger interval after that, since the average of averages doesn't equal the real average.
  • The avg loses spikes (aka min and max values) on the downsampled interval. The spikes can contain important information in some cases. For instance, they can help answer questions like: what was the maximum memory usage for the app on the given time range?
  • Every type of downsampling loses a part of the original information. The most effective downsampling function is histogram_over_time() from MetricsQL, since it preserves the original value distribution and its results can be used for multi-level downsampling. See more information about VictoriaMetrics histograms. Unfortunately, this function generates tens of output series for every input series, so the downsampled series can take more disk space than the original series.
  • Most users expect that downsampling is performed individually per each time series. But this rapidly changes with new monitoring types such as Kubernetes monitoring, which can generate new series at a high rate during each deployment or any k8s cluster state change (aka high churn rate). Per-series downsampling can be ineffective at reducing disk space and improving query performance for high-churn-rate series, since it doesn't reduce the number of series and cannot significantly reduce the number of samples per series (because the original series are short).
  • Both VictoriaMetrics and Prometheus return the last original sample value for each step interval requested via /api/v1/query_range, i.e. they use the analogue of the last_over_time() function for downsampling during querying. See more details about this here. The planned function for downsampling in VictoriaMetrics - first_over_time() - aligns better with this behavior of Prometheus and VictoriaMetrics, i.e. graphs over the original data and over the downsampled data will look similar.

As you can see, it is hard to provide a generic implementation that covers the majority of edge cases. So we decided to start with a basic implementation aligned with the dynamic downsampling used in Prometheus and VictoriaMetrics during querying.

VictoriaMetrics already provides a tool that can be set up for custom downsampling - vmalert. It allows setting up an arbitrary downsampling type via recording rules. These rules are evaluated at a regular interval against the -datasource.url (e.g. against VictoriaMetrics), and the evaluation results are sent to -remoteWrite.url. Note that -datasource.url and -remoteWrite.url can point to different databases with different settings, including the retention period. In this way it is possible to set up downsampling only for long-term metrics, while leaving the rest of the metrics in short-term storage. See the sketch below.
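
A hedged sketch of such a setup (the metric names, the 30m interval, the file name and the URLs are assumptions, not recommendations). First, a recording-rules file that aggregates a couple of hypothetical series over 30-minute windows:

# downsample.rules.yml - evaluated by vmalert on the interval set per group.
groups:
  - name: downsample-30m
    interval: 30m
    rules:
      - record: node_memory_Active_bytes:avg_30m
        expr: avg_over_time(node_memory_Active_bytes[30m])
      - record: http_requests_total:increase_30m
        expr: increase(http_requests_total[30m])

Then vmalert evaluates the rules against the short-retention instance and writes the results to the long-retention instance:

./vmalert-prod \
  -rule=downsample.rules.yml \
  -datasource.url=http://short-retention-vm:8428 \
  -remoteWrite.url=http://long-retention-vm:8428

Only the recorded series reach the long-retention instance, so the raw series remain subject to the short retention.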

Will this ability only be available in the Enterprise Edition?

Downsampling will initially be available in the enterprise package, as promised on the enterprise page. Later it may be backported to the community edition of VictoriaMetrics.

@ahmadalli commented Jul 7, 2021

Could you provide an estimate of when this would be available in the community edition?

@valyala (Contributor) commented Jul 7, 2021

Could you provide an estimate of when this would be available in the community edition?

This depends on how smoothly the downsampling feature works in the enterprise edition. There is no ETA for this.

@LightCastlePro commented Oct 22, 2021

Could you provide an estimate of when this would be available in the community edition?

This depends on how smoothly the downsampling feature works in the enterprise edition. There is no ETA for this.

Hi valyala, can downsampling work with avg_over_time, sum_over_time, sum, max, etc.?

@joaopaulobdac (Contributor) commented Oct 29, 2021

Hi everyone, does VictoriaMetrics/MetricsQL have something like LTTB from the TimescaleDB toolkit? https://github.com/timescale/timescaledb-toolkit/blob/main/docs/lttb.md
