Mark Gorilla codec on non-float columns as suspicious #45376
Reasons:
1. The original Gorilla paper (*) proposed a compression scheme for pairs of timestamps and double-precision FP values. ClickHouse's Gorilla codec only implements compression of the latter, and it does not impose any data type restrictions.
   - Data types other than Float* or (U)Int* (e.g. Decimal, Point etc.) are definitely not supposed to be used with Gorilla.
   - (U)Int* types are debatable. The paper only considers integers stored as FP values, a practical use case for which Gorilla works well. Standalone integers are not considered, which makes them at least suspicious.
2. Achieve consistency with FPC, another specialized floating-point time-series codec, which rejects non-float data.
3. On practical datasets, ZSTD is often "good enough" (**), so it should be okay to disincentivize non-ZSTD codecs a little bit. If needed, the Delta and DoubleDelta codecs are viable alternatives for slowly changing (time-series-like) integer sequences.

Since on-prem and hosted users may still have Gorilla-compressed non-float data, this combination is only deprecated for now. No warning or error will be emitted. Users are encouraged to migrate Gorilla-compressed non-float data to an alternative codec. It is planned to treat Gorilla-compressed non-float columns as "suspicious" six months after this commit (i.e. in v23.6). Even then, it will still be possible to set "allow_suspicious_codecs = true" and read and write Gorilla-compressed non-float data.

(*) Sec. 4.1.2, "Gorilla restricts the value element in its tuple to a double floating point type.", https://doi.org/10.14778/2824032.2824078
(**) https://clickhouse.com/blog/optimize-clickhouse-codecs-compression-schema
@rschu1ze IIRC, we allow all codecs (including suspicious ones) for ATTACH TABLE queries and on server startup. So why not turn the check on by default?
@alexey-milovidov Thanks, you are right ... I verified locally that server startup and ATTACH TABLE don't care about suspicious_codecs. So basically, when we make Gorilla-on-nonfloats suspicious right now, the creation of such tables will only work with ...
Test failures are all unrelated or fixed in master.
@alexey-milovidov Approve + merge? Thanks!
Is it ok for Nullable(Float)? Array(Float)? Tuple(Float, Float)?
Fixed that, thanks.
This PR marks the Gorilla codec on non-float columns as a "suspicious codec".
Reasons:
Achieve consistency with FPC, another specialized floating-point timeseries codec, which rejects non-float data.
On practical datasets, ZSTD is often "good enough", so it should be okay to disincentivize non-ZSTD codecs a little bit. If needed, the Delta and DoubleDelta codecs are viable alternatives for slowly changing (i.e. time-series-like) integer sequences.
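A minimal sketch of moving an integer column off Gorilla to one of the alternatives named above. The table name `metrics` and column name `counter` are hypothetical, and the pairing of DoubleDelta with ZSTD is only one possible choice, not a recommendation from this PR:

```sql
-- Hypothetical table and column names, for illustration only.
-- Switch an integer column from Gorilla to DoubleDelta followed by ZSTD.
ALTER TABLE metrics
    MODIFY COLUMN counter UInt64 CODEC(DoubleDelta, ZSTD);
```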
This PR disallows the creation of new Gorilla-compressed non-float columns unless the setting "allow_suspicious_codecs" is "true". Existing data of this kind will continue to load just fine because the setting "allow_suspicious_codecs" is ignored during server startup and in ATTACH TABLE.
Side note: Gorilla-compressed non-float columns should not be very common in practice since the documentation always listed Gorilla as a specialized codec for "series of floating-point values".
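The behavior described above can be sketched as follows. The table names are hypothetical, and the comments state what this PR's description implies should happen rather than verified output:

```sql
-- Float column: Gorilla remains allowed with default settings.
CREATE TABLE gorilla_float (ts DateTime, v Float64 CODEC(Gorilla))
ENGINE = MergeTree ORDER BY ts;

-- Integer column: expected to be rejected as a suspicious codec
-- while allow_suspicious_codecs is at its default (off).
CREATE TABLE gorilla_int (ts DateTime, v UInt64 CODEC(Gorilla))
ENGINE = MergeTree ORDER BY ts;

-- Opting in explicitly should make the same statement succeed.
SET allow_suspicious_codecs = 1;
CREATE TABLE gorilla_int (ts DateTime, v UInt64 CODEC(Gorilla))
ENGINE = MergeTree ORDER BY ts;
```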
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Mark Gorilla compression on columns of non-Float* type as suspicious.