TSDB numeric compression #92045
Conversation
This doc values format is meant to be used for TSDB. It provides compression using delta encoding, GCD encoding and bit packing. The expectation is that such encodings work well for fields whose values are monotonically increasing, like the timestamp field and counter metric fields.
This makes it easier to experiment with different encoding block sizes, and maybe also to make the block size a configurable parameter.
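To make the three steps concrete, here is a minimal, self-contained sketch in plain Java of how delta encoding, GCD encoding and bit packing compose on a block of increasing values. It is an illustration only, not the PR's actual encoder; the class name and sample values are made up.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Illustration only (not the PR's encoder): delta-encode a block of increasing
// longs, factor out the greatest common divisor of the deltas, and see how few
// bits per value the bit-packing step would then need.
public class TsdbEncodingSketch {

    public static void main(String[] args) {
        // Hypothetical timestamps sampled every 10 units, monotonically increasing.
        long[] block = {1_000_000L, 1_000_010L, 1_000_020L, 1_000_030L, 1_000_040L};

        // Delta encoding: store each value as the difference from the previous one
        // (the first value would be stored separately as the block's base).
        long[] deltas = new long[block.length];
        for (int i = 1; i < block.length; i++) {
            deltas[i] = block[i] - block[i - 1];
        }

        // GCD encoding: if all deltas share a common divisor, divide it out and
        // store the divisor once per block.
        long gcd = 0;
        for (long d : deltas) {
            gcd = BigInteger.valueOf(gcd).gcd(BigInteger.valueOf(d)).longValueExact();
        }
        if (gcd > 1) {
            for (int i = 0; i < deltas.length; i++) {
                deltas[i] /= gcd;
            }
        }

        // Bit packing: the remaining small values only need a handful of bits each.
        long max = Arrays.stream(deltas).max().orElse(0);
        int bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(max));
        System.out.println("deltas=" + Arrays.toString(deltas)
            + " gcd=" + gcd + " bitsPerValue=" + bitsPerValue);
    }
}
```

With a timestamp sampled at a regular interval, the whole block collapses to the base value, a GCD of 10 and one bit per packed delta, which is where the size wins for monotonically increasing fields come from.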
Pinging @elastic/es-analytics-geo (Team:Analytics)
Hi @salvatore-campagna, I've created a changelog YAML for you.
Neat! I'm super curious what this does to our benchmarks - the size, and, a little, the speed.
@Override
public void addNumericField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
    meta.writeInt(field.number);
NOTE: here I am not writing the type of the field (NUMERIC) because this format only works for numeric fields, so I spare some space by not writing a value that would always be the same. I am wondering, anyway, whether not writing it might have an impact in terms of BWC in case we later introduce compression for other field types.
I would suggest keeping it: we are talking about a single integer per segment per field, so the impact on space efficiency is negligible, and we could more easily add support for other types of doc values later on.
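For illustration, a hedged sketch of what keeping the marker could look like, modeled on how Lucene's default doc values consumer tags each field entry; the `NUMERIC` constant is an assumption here, not necessarily this PR's actual code:

```java
// Hypothetical sketch (not the PR's actual code): write a per-field type byte
// after the field number, mirroring Lucene90DocValuesConsumer, so that other
// doc values types could be added to this format later without a BWC break.
private static final byte NUMERIC = 0; // assumed marker constant

@Override
public void addNumericField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
    meta.writeInt(field.number);
    meta.writeByte(NUMERIC);
    // ... rest of the numeric encoding ...
}
```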
Could you consider adding a zstd compression lib to the libs modules, like the lz4 lib? The zstd compression ratio is much higher than lz4's on TSDB metrics.
Hi @weizijun, the idea is to start with this compression because it is what we believe is most effective for numeric fields. zstd and lz4 are more generic compression algorithms, which we expect to be less effective here. We might consider those in the future anyway, maybe for other fields.
@rockdaboot I think it will make sense to expand this to non-TSDB data streams at some point too. I'm not sure exactly how: whether it will be enabled in a selective way, e.g. only on fields that are part of the index sort, or across all fields. But I can certainly imagine this being useful outside of TSDB. The main question I'm curious about, which we haven't evaluated yet, is how this change affects query performance.
Just from my perspective, I'd like to see the selective way, e.g. an "encoding" property in the field mapping, with a good default. Anyway, if you open a discussion/issue around non-TSDB field encodings, please cc me 😃
Is there any reason why we should not have a way to define the encoding for each field with an additional mapping parameter? Something like the `encoding` property suggested above.
We ran the benchmark again based on the latest changes. The baseline is the same as yesterday's benchmark run. This is a filtered result just showing the difference in size of the doc values fields:
The tx/rx fields (for example …)
@martijnvg and I had a quick chat about disk usage with this new doc-values format. Quoting a comment I made earlier on this PR:
We should be able to guarantee that this format would never cause significant regressions regarding disk usage by updating …
@jpountz do you have any pointer to where this is done in Lucene? I would like to have a look at it and understand how to adapt it to 28, 40, 48 and 56 as well.
OK, I see this is used in Lucene to read/write postings lists in Lucene90PostingsReader/Lucene90PostingsWriter... but I don't see any specific optimization for 28, 40, 48 and 56.
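For context, a small sketch of related machinery that does handle those widths: Lucene's `DirectWriter` bit-packing utility rounds the required width up to a fixed set of supported sizes (1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 40, 48, 56, 64 bits per value). This assumes Lucene core on the classpath; the class name and sample values are illustrative, and it is not a claim about what this PR actually uses.

```java
import org.apache.lucene.store.ByteBuffersDataOutput;
import org.apache.lucene.util.packed.DirectWriter;

// Illustrative only: DirectWriter.bitsRequired rounds up to the nearest
// supported width, so a maximum value that needs 27 bits is packed at 28
// bits per value (one of the widths discussed above).
public class DirectWriterSketch {
    public static void main(String[] args) throws Exception {
        long[] values = {3, 70_000_000L, 12_345, 999};           // max needs 27 bits
        int bitsPerValue = DirectWriter.bitsRequired(70_000_000L);
        System.out.println("bits per value: " + bitsPerValue);   // prints 28

        ByteBuffersDataOutput out = new ByteBuffersDataOutput();
        DirectWriter writer = DirectWriter.getInstance(out, values.length, bitsPerValue);
        for (long v : values) {
            writer.add(v);
        }
        writer.finish();
        System.out.println("packed size in bytes: " + out.size());
    }
}
```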
I will put this DocValuesFormat behind a feature flag, so as to avoid releasing it without proper testing, and add throughput benchmarks too.
I still need to add a benchmark.
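A minimal sketch of the kind of guard meant here, assuming a JVM system property is used as the switch; the property name and helper class below are hypothetical, not the actual Elasticsearch feature-flag plumbing:

```java
// Hypothetical sketch: only expose the new doc values format when an opt-in
// system property is set, e.g. -Des.tsdb_doc_values_format_feature_flag_enabled=true.
// The property name and this helper class are made up for illustration.
public final class TsdbDocValuesFormatFeatureFlag {

    private static final boolean ENABLED = Boolean.parseBoolean(
        System.getProperty("es.tsdb_doc_values_format_feature_flag_enabled", "false"));

    private TsdbDocValuesFormatFeatureFlag() {}

    public static boolean isEnabled() {
        return ENABLED;
    }
}
```

The codec would then only select the new format for a field when `isEnabled()` returns true, so released builds keep the default doc values format until the flag is removed.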
LGTM 👍
I think the micro benchmarks can be added in a followup change.
docs/changelog/92045.yaml
Outdated
area: TSDB
type: feature
issues: []
highlight:
Before the FF date we need to check whether this codec is still behind a feature flag, so that we don't mention something in the changelog that is only usable via a feature flag.
@elasticsearchmachine run elasticsearch-ci/part-1 please
@martijnvg @jpountz I will merge this and add microbenchmarks in another PR.
This PR introduces a new doc values format meant to be used by TSDB. It
provides compression using delta encoding, gcd encoding and bit-packing
encoding. The expectation is that such encodings work well for numeric
fields whose values are monotonically increasing, like the timestamp field
and counter metric fields.