Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark Gorilla codec on non-float columns as suspicous #45376

Merged
merged 7 commits into from
Jan 28, 2023

Conversation

rschu1ze
Copy link
Member

@rschu1ze rschu1ze commented Jan 17, 2023

This PR marks the Gorilla codec on non-float columns as "suspicious codec".

Reasons:

  1. The original Gorilla paper proposed a compression schema for pairs of time stamps and double-precision FP values (see "Gorilla: a fast, scalable, in-memory time series database", Sec. 4.1.2, "Gorilla restricts the value element in its tuple to a double floating point type."). ClickHouse's Gorilla codec only implements compression of the latter and it does not impose any data type restrictions.
  • Data types != Float* or (U)Int* (e.g. Decimal, Point etc.) are definitely not supposed to be used with Gorilla.
  • (U)Int* types are debatable. The paper only considers integers-represented-as-FP-values, a practical use case for which Gorilla works well. Standalone integers are not considered which makes them at least suspicious.
  1. Achieve consistency with FPC, another specialized floating-point timeseries codec, which rejects non-float data.

  2. On practical datasets, ZSTD is often "good enough" so it should be okay to disincentive non-ZSTD codecs a little bit. If needed, Delta and DoubleDelta codecs are viable alternative for slowly changing (i.e. time-series-like) integer sequences.

This PR disallows the creation of new Gorilla-compressed non-float columns unless setting "allow_suspicious_codecs" is "true". Existing data of this kind will continue to load just fine because setting "allow_suspicous_codecs" is ignored during server startup and in ATTACH TABLE.

Side note: Gorilla-compressed non-float columns should not be very common in practice since the documentation always listed Gorilla as a specialized codec for "series of floating-point values".

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Mark Gorilla compression on columns of non-Float* type as suspicious.

@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-backward-incompatible Pull request with backwards incompatible changes label Jan 17, 2023
@rschu1ze rschu1ze changed the title Disallow Gorilla codec on non-float columns Deprecate Gorilla codec on non-float columns Jan 20, 2023
@rschu1ze rschu1ze marked this pull request as ready for review January 20, 2023 17:25
@rschu1ze rschu1ze force-pushed the block-non-float-gorilla-v2 branch 2 times, most recently from cbf5cc5 to 67912f8 Compare January 20, 2023 17:29
Reasons:

1. The original Gorilla paper proposed a compression schema for pairs of
   time stamps and double-precision FP values. ClickHouse's Gorilla
   codec only implements compression of the latter and it does not
   impose any data type restrictions.
   - Data types != Float* or (U)Int* (e.g. Decimal, Point etc.) are
     definitely not supposed to be used with Gorilla.
   - (U)Int* types are debatable. The paper only considers
     integers-stored-as-FP-values, a practical use case for which
     Gorilla works well. Standalone integers are not considered which
     makes them at least suspicious.

2. Achieve consistency with FPC, another specialized floating-point
   timeseries codec, which rejects non-float data.

3. On practical datasets, ZSTD is often "good enough" (**) so it should
   be okay to disincentive non-ZSTD codecs a little bit. If needed,
   Delta and DoubleDelta codecs are viable alternative for slowly
   changing (time-series-like) integer sequences.

Since on-prem and hosted users may still have Gorilla-compressed
non-float data, this combination is only deprecated for now. No warning
or error will be emitted. Users are encouraged to migrate
Gorilla-compressed non-float data to an alternative codec. It is planned
to treat Gorilla-compressed non-float columns as "suspicious" six months
after this commit (i.e. in v23.6). Even then, it will still be possible
to set "allow_suspicious_codecs = true" and read and write
Gorilla-compressed non-float data.

(*) Sec. 4.1.2, "Gorilla restricts the value element in its tuple to a
    double floating point type.", https://doi.org/10.14778/2824032.2824078

(**) https://clickhouse.com/blog/optimize-clickhouse-codecs-compression-schema
@alexey-milovidov
Copy link
Member

@rschu1ze IIRC, we allow all codecs (including suspicious) for ATTACH TABLE queries and on server startup. So, why don't turn the check on by default?

@alexey-milovidov alexey-milovidov self-assigned this Jan 24, 2023
@rschu1ze
Copy link
Member Author

@alexey-milovidov Thanks, you are right ... I verified locally that server startup and ATTACH TABLE don't care about suspicious_codecs. So basically, when we make Gorilla-on-nonfloats suspicious right now, the creation of such tables will only work with allow_suspicous_codecs = true but all existing use cases will continue to work. Cool, then I'll change this PR, it will simplify things.

@rschu1ze rschu1ze changed the title Deprecate Gorilla codec on non-float columns Mark Gorilla codec on non-float columns as suspicous Jan 24, 2023
@rschu1ze rschu1ze removed the pr-backward-incompatible Pull request with backwards incompatible changes label Jan 24, 2023
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-backward-incompatible Pull request with backwards incompatible changes label Jan 24, 2023
@rschu1ze
Copy link
Member Author

Test failures are all unrelated or fixed in master.

@rschu1ze rschu1ze removed the pr-backward-incompatible Pull request with backwards incompatible changes label Jan 26, 2023
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-improvement Pull request with some product improvements label Jan 26, 2023
@rschu1ze
Copy link
Member Author

@alexey-milovidov Approve + merge? Thanks!

@alexey-milovidov
Copy link
Member

Is it ok for Nullable(Float)? Array(Float)? Tuple(Float, Float)?

@rschu1ze
Copy link
Member Author

Is it ok for Nullable(Float)? Array(Float)? Tuple(Float, Float)?

Fixed that, thanks.

@rschu1ze rschu1ze force-pushed the block-non-float-gorilla-v2 branch 2 times, most recently from 679ff39 to d721b92 Compare January 27, 2023 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-improvement Pull request with some product improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants