[RFC] Replacing merge tree new engine #41005
Actually, that implements #2074 (comment), and reimplements in a clearer / more explicit way the hackish thing used by MaterializedMySQL - see #4006 (comment) + the ReadFinalForExternalReplicaStorage code. Several questions:
That will require applying FINAL, won't it? (BTW, we are currently working on that: #39463)
Why do you want to keep it?
That means the FINAL logic would differ from the merge logic; I'm not sure that's a good idea. Also: what do you think about using a single version value as described in the issue? I.e.
Pros: fewer columns to read, and the logic should be very simple. Cons: it's a bit confusing to have negative versions, and ordering by absolute value is (a bit) more complex.
Let's avoid adding negative versions; it's too tricky and misleading.
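To make the trade-off concrete, here is a minimal sketch of the single signed-version idea being discussed (and ultimately rejected): a negative version marks a delete, and rows are ordered by absolute value. This is illustration only, not how ReplacingMergeTree is actually implemented; the function name is hypothetical.

```python
def latest_state(rows):
    """rows: list of (key, signed_version) tuples.
    For each key, keep the entry with the largest |version|; if that
    winning version is negative, the row counts as deleted and is hidden."""
    latest = {}
    for key, ver in rows:
        if key not in latest or abs(ver) > abs(latest[key]):
            latest[key] = ver
    # rows whose winning version is negative (a delete) are filtered out
    return {k: v for k, v in latest.items() if v > 0}

rows = [("a", 1), ("a", -2),   # "a" inserted at version 1, deleted at version 2
        ("b", 3), ("b", 4)]    # "b" updated from version 3 to 4
print(latest_state(rows))      # {'b': 4}
```

The "ordering by absolute value is more complex" objection shows up directly in the `abs()` calls, which every comparison and filter would need to carry.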
Yes, it would be applied for FINAL.
For two reasons: first, because it would allow inserting in an out-of-order way.
Agreed, managing a version that can be negative would be way too complex when we order and filter (and in other cases).
ok.
I hope you're aware that Distributed is not able to apply FINAL or do the Replacing / Collapsing etc. logic across shards.
That means you can't simply modify ReplacingSortedTransform to make it respect the sign column unconditionally. But maybe you can make it adjustable, controlled by some flag passed in the constructor of ReplacingSortedTransform/ReplacingSortedAlgorithm: if ReplacingSortedTransform is called from a merge, it should work like now, and if it's called from SELECT ... FINAL, then it should additionally look at the sign column. You certainly can't apply that at the prewhere stage like there, because in that case the last row would be filtered out before the replacing logic is applied, so you may end up showing the previous state of the row (before it was deleted). Another alternative is injecting that into the WHERE stage, but ReplacingSortedTransform may be better. @tavplubix, is that ok for you? @youennL-cs, can you try to implement that? Also add merges to your test (i.e. OPTIMIZE ... FINAL). And you will need to check other places where ReplacingSortedTransform / ReplacingSortedAlgorithm is used - for example
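The constructor-flag suggestion above can be modeled in a few lines. This is a rough sketch of the intended semantics, not the real C++ classes: with the flag off (merge path) the algorithm behaves like today's ReplacingSortedAlgorithm and keeps the newest row per key, deleted marker included; with the flag on (SELECT ... FINAL path) rows whose delete marker won are additionally hidden. The flag name `observe_deletes` is hypothetical.

```python
def replacing_collapse(rows, observe_deletes=False):
    """rows: list of (key, version, is_deleted) in insertion order.
    observe_deletes=False models the merge path, True models FINAL."""
    winners = {}
    for key, ver, deleted in rows:
        prev = winners.get(key)
        # ties on version are won by the later row (insertion order)
        if prev is None or ver >= prev[0]:
            winners[key] = (ver, deleted)
    result = []
    for key, (ver, deleted) in winners.items():
        if observe_deletes and deleted:
            continue  # FINAL hides the row whose winning state is "deleted"
        result.append((key, ver, deleted))
    return sorted(result)

parts = [("x", 1, 0), ("x", 2, 1), ("y", 1, 0)]
# merge keeps the delete marker on disk:
print(replacing_collapse(parts))                        # [('x', 2, 1), ('y', 1, 0)]
# FINAL additionally drops the deleted key:
print(replacing_collapse(parts, observe_deletes=True))  # [('y', 1, 0)]
```

Note why the prewhere approach fails in this model: filtering `is_deleted = 1` rows *before* the replacing step would remove the delete marker for "x", letting the stale `("x", 1, 0)` row win.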
I would suggest 0 and 1 (1 is for insert and 0 is for delete). It will make filtering faster and more natural.
This is really questionable, because deleted rows will never be cleaned up (btw, it's one of the major disadvantages of
We should avoid AST rewriting during query execution.
I also thought about the same. And maybe calling that column sign is strange in that case?
Yep.
I also didn't get the point about shards. But out-of-order things on different replicas can be the issue: replica 1 merges those parts -> removes the row; replica 2 merges those parts -> keeps 1006. Now they exchange parts, and the record with version 1006 comes back.
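The resurrection hazard described above only bites if deleted rows are physically eliminated during merges. A toy illustration, with hypothetical names and the version numbers from the example, assuming the merge *were* allowed to drop deleted rows:

```python
def merge(parts, drop_deleted):
    """parts: list of (key, version, is_deleted). Keep newest row per key;
    if drop_deleted, physically remove rows whose winning state is deleted."""
    winners = {}
    for key, ver, deleted in parts:
        if key not in winners or ver >= winners[key][0]:
            winners[key] = (ver, deleted)
    return sorted((k, v, d) for k, (v, d) in winners.items()
                  if not (drop_deleted and d))

part_old    = [("id1", 1006, 0)]  # the insert
part_delete = [("id1", 1007, 1)]  # the later delete

# replica 1 merges both parts and eliminates the deleted row entirely:
r1 = merge(part_old + part_delete, drop_deleted=True)  # -> []
# replica 2 has only the older part, so its merge keeps version 1006:
r2 = merge(part_old, drop_deleted=True)                # -> [('id1', 1006, 0)]

# after the replicas exchange parts, the delete marker no longer exists
# anywhere, and version 1006 comes back:
print(merge(r1 + r2, drop_deleted=True))               # [('id1', 1006, 0)]
```

Keeping the delete marker on disk (`drop_deleted=False`) avoids this: the `("id1", 1007, 1)` row survives every partial merge and keeps winning.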
That would be cool, but I'm not yet sure how to do that without overcomplicating things too much. Maybe based on the part age? I.e., if we merge parts older than X minutes / hours, we are allowed to eliminate removed rows? But that cleaning of old rows can be added later.
I thought about that a bit - actually the only safe scenarios of deletion are:
Maybe we can just have a switch like keep_deleted_rows for the table, defaulting to 1, which can be set to 0 if the end user is sure that inserts go in order.
Hello @filimonov @tavplubix, I'm joining the discussion; @youennL-cs and I are colleagues :) Regarding the confusion around shards and out-of-orderness: the concepts got mixed, and what @youennL-cs meant was the issue that can exist between unmerged parts within a partition. We assume that records sharing the same PK reside in the same shard. Referring to the example:
we could face an analogous issue across unmerged parts, if replica 1 were part 1 and replica 2 were part 2 instead. As a result, merged part 1 would keep one record:
Therefore, we believe it is important not to remove deleted rows. If our assumption of data locality is wrong, then the same issue indeed appears across nodes as well.
IMO that could be a nice tradeoff. Maybe "the partition is old enough" could also be made configurable by the user?
Yep, but if inserts into that old partition were going out of order and we merge only some subset of its parts, it seems like the same problem as above is possible.
But checking inside the merge whether it collected all the parts of the partition is not something natural... :| Maybe we can add some manual command, like
?
Do you mean letting cancelled rows stay on disk unless the cleanup is manually launched?
Yep
No
Okay, that sounds good! Any tips for the implementation? Is everyone involved in the discussion aligned with this strategy?
Hello. Great discussion. Deleting rows is subtle. If there is an OPTIMIZE WITH CLEANUP command, I suggest we consider implementing a way to run it automatically after some period of time. After all, if it's safe for users to do it after some [configurable] period of time, it is safe for ClickHouse to do it. The alternative is that users will need to implement it themselves, like PostgreSQL VACUUM in days of yore. It was a major headache. There are use cases, like running TRUNCATE on an upstream table, that can generate very large numbers of deletes that will just hang around in ClickHouse. Users should not have to think about this case.
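The age-gated variant floated earlier in the thread combines naturally with this: only let a merge physically drop deleted rows when every source part is older than a configurable threshold, on the assumption that out-of-order inserts for such old data no longer happen. A minimal sketch of that guard; the function and parameter names are hypothetical, not actual ClickHouse settings.

```python
def may_cleanup(part_ages_seconds, min_age_for_cleanup=3600):
    """Allow eliminating deleted rows in a merge only if ALL source parts
    are 'old enough' that late out-of-order inserts are assumed impossible.
    A single fresh part vetoes cleanup for the whole merge."""
    return all(age >= min_age_for_cleanup for age in part_ages_seconds)

print(may_cleanup([7200, 5400]))  # True: every part is over an hour old
print(may_cleanup([7200, 120]))   # False: a fresh part may still be racing
```

The "all parts" condition is the important bit: dropping delete markers while any recent part might still contain an older version of the same key reintroduces the resurrection problem discussed above.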
I agree with this approach. |
@tavplubix ? Wrap-up:

-- is_deleted is a simple 0/1 UInt8 flag, simpler than sign
CREATE TABLE (...) Engine=ReplacingMergeTree(version, is_deleted)
SETTINGS clean_deleted_rows='never';
-- 'never' is the default / 'always' can be enabled if inserts come in order;
-- in the future other modes can be added

OPTIMIZE TABLE ... FINAL WITH CLEANUP
Would love to have something like that too... The problem is that currently there is no straightforward & generic way of getting the deletion timestamp. We can not base it on the part timestamp (those are shifted with every merge). We could try to somehow use the value of the version column, for example allowing automatic deletion of rows whose version is much older than the current version. That can work if a timestamp or the offset in the replication log is used as the version, but for that we need to extract / know the 'latest' version somehow. It can be stored at the table level; it's quite easy to keep filled while inserts are coming, but it can be problematic to recover after a server restart without re-reading the column.
@filimonov It seems to me that it's not that easy to implement and will likely bring some interesting edge cases to investigate. Maybe we could consider it for a V2 iteration, keeping this one as an experimental feature? Maybe more ideas will land on the table too.
Also related: #41817
In general LGTM, but the failed integration tests need to be fixed.
if (arg_cnt - arg_num == 2 && !engine_args[arg_cnt - 1]->as<ASTLiteral>() && is_extended_storage_def)
{
    if (!tryGetIdentifierNameInto(engine_args[arg_cnt - 1], merging_params.is_deleted_column))
        throw Exception(ErrorCodes::BAD_ARGUMENTS, "is_deleted column name must be an unquoted string {}", verbose_help_message);
Suggested change:
- throw Exception(ErrorCodes::BAD_ARGUMENTS, "is_deleted column name must be an unquoted string {}", verbose_help_message);
+ throw Exception(ErrorCodes::BAD_ARGUMENTS, "is_deleted column name must be an identifier {}", verbose_help_message);
@tavplubix, is this stress test considered a blocker too?
No, this failure is unrelated - #45372
@youennL-cs @gontarzpawel maybe let's think about cleanup of older parts? Maybe something similar to #35836 can work.
Also, can you please send the PR with updated documentation? |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Enrichment of the existing ReplacingMergeTree engine to allow inserting duplicates. It leverages the power of both ReplacingMergeTree and CollapsingMergeTree in one MergeTree engine. Deleted data is not returned when queried, but it is not removed from disk either.
Change details
Following the discussion with the ClickHouse support team (@melvynator and the engineering team), we propose an extension of the ReplacingMergeTree engine with collapsing: ReplacingCollapsingMergeTree. In this version, we enrich the existing ReplacingMergeTree engine; however, it could also be a fully new engine.
Our proposal adds an extra sign column (possible values: -1 / 1) to the ReplacingMergeTree.
The goal is to take advantage of ReplacingMergeTree and CollapsingMergeTree features in one mergeTree engine to allow insertion of duplicates.
This extra sign column is optional; however, if it is enabled, the version column becomes mandatory. This "new engine" remains backward compatible with previous versions of ReplacingMergeTree, since the extra column is an option at table creation.
No matter the operation on the data, the version must be increased. If two inserted rows have the same version number, the last inserted one is kept.
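The tie-breaking rule just stated can be pinned down with a tiny sketch (illustration of the intended semantics, not engine code): when versions are equal, insertion order decides.

```python
def keep_last(rows):
    """rows: list of (key, version, payload) in insertion order.
    Keep, per key, the row with the highest version; on a version tie,
    the later-inserted row wins (hence >= rather than >)."""
    kept = {}
    for key, ver, payload in rows:
        if key not in kept or ver >= kept[key][0]:
            kept[key] = (ver, payload)
    return kept

rows = [("k", 5, "first"), ("k", 5, "second"), ("k", 4, "stale")]
print(keep_last(rows))  # {'k': (5, 'second')}
```

Using `>` instead of `>=` would silently keep the *first* of two equal-version rows, which contradicts the rule above, so the comparison direction matters.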
It allows:
From our perspective, deleted data must be filtered out when queried, not removed from disk. When partitioned across several shards, data can have several versions on different partitions. If rows with a -1 sign were physically deleted, we would lose the information about deleted data, which could eventually lead to incorrect KPIs.
Required help
We don't know how to filter out data whose sign is set to -1 at query time. We don't want to delete data from disk when a -1 sign is received; we want to filter it out when queried and not return it, while keeping the information.