Fix materialized column as sharding key #28637
Conversation
@@ -126,21 +132,6 @@ void DistributedSink::consume(Chunk chunk)

auto ordinary_block = getPort().getHeader().cloneWithColumns(chunk.detachColumns());

if (!allow_materialized)
Here was a previous solution for avoiding the error Cannot insert column, because it is MATERIALIZED column after a materialized column is calculated and then sent to a remote table. The previous solution just removed materialized columns from blocks before sending them to remote tables. It helped with the error but produced another problem: it made ClickHouse calculate a materialized column twice - on the initiator host and on a shard - and also made it difficult to use a materialized column as a sharding key.
So I decided to try another approach: always send a materialized column to the shards and also send the setting insert_allow_materialized_columns forced to true (no matter what the value of insert_allow_materialized_columns on the initiator actually is). Thus a shard will not calculate values for materialized columns by itself; it will always use the values calculated on the initiator.
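For illustration, a minimal sketch of the scenario this targets, written in the style of the integration tests further down (node1 and the cluster name 'test_cluster' are taken from those tests; the table and column names here are made up, this is not code from the PR):

node1.query("CREATE TABLE local ON CLUSTER 'test_cluster' (x Int32, shard_key Int32 MATERIALIZED x % 2) ENGINE = MergeTree() ORDER BY x")
node1.query("CREATE TABLE dist ON CLUSTER 'test_cluster' (x Int32, shard_key Int32 MATERIALIZED x % 2) ENGINE = Distributed('test_cluster', currentDatabase(), local, shard_key)")
# With the approach described above, shard_key is computed once on the initiator and
# forwarded to the shards together with insert_allow_materialized_columns forced on,
# so a MATERIALIZED column can act as the sharding key.
node1.query("INSERT INTO dist (x) VALUES (1), (2), (3), (4)")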
> Here was a previous solution for avoiding the error Cannot insert column, because it is MATERIALIZED column after a materialized column is calculated and then sent to a remote table. The previous solution just removed materialized columns from blocks before sending them to remote tables. It helped with the error but produced another problem: it made ClickHouse calculate a materialized column twice - on the initiator host and on a shard - and also made it difficult to use a materialized column as a sharding key.

Indeed.

> So I decided to try another approach: always send a materialized column to the shards and also send the setting insert_allow_materialized_columns forced to true (no matter what the value of insert_allow_materialized_columns on the initiator actually is). Thus a shard will not calculate values for materialized columns by itself; it will always use the values calculated on the initiator.
The reason for the previous solution was that some local table may have a dictGet that makes sense only on the local node, regardless of whether it was in the INSERT'ed block or not (#23349 (comment)); after this patch it will not be possible to restrict such cases, but personally I'm fine with this change (it's a pretty specific use case anyway).
So sometimes we may want a materialized column to be calculated in a distributed table, and sometimes we may want it to be calculated on a shard. It's getting complicated. Maybe we should add a new setting for distributed tables named calculate_defaults_on_shards to cover more cases.
Let me try to write out the possible cases.
insert_allow_materialized_columns=0
Any write of a materialized column to any table (including Distributed) will fail.
But the problem is that if the Distributed table has the same MATERIALIZED column as the underlying table, then it will materialize it, and later it will try to write it into the underlying table and fail, without #23349.
And so in this case the MATERIALIZED column will indeed be calculated twice, on the initiator and on the shards, and the result calculated on the initiator will simply be discarded.
create table data (key Int, value Int materialized 100) engine=Null();
create table dist (key Int, value Int materialized 100) engine=Distributed(test_cluster_two_shards, currentDatabase(), data, key);
INSERT INTO dist VALUES (1);
[localhost] 2021.09.09 16:57:13.641736 [ 22781 ] {64e81fd5-4f69-48c5-aba3-4f776f47b336} <Debug> DistributedBlockOutputStream: default.dist (876d7f85-aa5a-4c1d-876d-7f85aa5a8c1d): column value will be removed, because it is MATERIALIZED
...
insert_allow_materialized_columns=1
With this setting set, the MATERIALIZED column will be passed to the underlying storage, so it will not be calculated multiple times, only on the initiator.
But there are some caveats here (mentioned in the previous comments).
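A small sketch of this case, assuming a test-harness node like node1 from the test fragments below and the data/dist tables from the SQL example above (illustrative only, not code from this PR):

# With the setting enabled on the INSERT, the `value` computed by the Distributed
# table on the initiator should be forwarded to the underlying `data` table rather
# than recalculated on the shard.
node1.query("INSERT INTO dist (key) VALUES (1)",
            settings={"insert_allow_materialized_columns": 1})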
distributed_materialize_defaults_on_shards
You suggested calculate_defaults_on_shards, but I took the liberty of renaming it, since this looks better to me.
If this setting is set (distributed_materialize_defaults_on_shards=1), the MATERIALIZED column will always be calculated only on the initiator, and so the INSERT will fail without insert_allow_materialized_columns=1.
Maybe this setting should instead forbid INSERTs without all columns, including materialized ones (since it expects that the Distributed table has already filled them); this would make it strict and ensure that the Distributed table has the same structure (since if the structure does not match, in terms of some new materialized columns, the INSERT will fail). Thoughts?
P.S. Actually, if someone wants strict behavior here, it is better to add MATERIALIZED columns only in the Distributed table or only in the underlying table.
P.S. 00952_insert_into_distributed_with_materialized_colum should catch your change, but it uses the same default expression in the Distributed and underlying tables, hence it fails to do so.
I've changed my solution to make it more cautious. I put back your code removing materialized columns if insert_allow_materialized_columns=0, but now I do that after the sharding key is calculated. It seems the final version should solve all the current problems without breaking anything.
@azat Can you take a look please?
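For readers following the discussion, a rough sketch of that ordering (plain Python for illustration only; the actual change is in the C++ DistributedSink, and the function and parameter names here are invented):

def split_for_shards(rows, materialized_columns, num_shards, sharding_key_fn, allow_materialized):
    # 1. Compute the shard for each row while the materialized columns are still
    #    present, so a MATERIALIZED column can serve as the sharding key.
    shard_of_row = [sharding_key_fn(row) % num_shards for row in rows]
    # 2. Only afterwards, if materialized columns are not allowed in the insert,
    #    strip them from the rows that get forwarded to the shards.
    if not allow_materialized:
        rows = [{name: value for name, value in row.items()
                 if name not in materialized_columns} for row in rows]
    return list(zip(shard_of_row, rows))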
The failures (
/// The columns list in the original INSERT query is incorrect because inserted blocks are transformed
/// to the form of the sample block of the Distributed table. So we rewrite it and add all columns from
/// the sample block instead.
Oops, looks like the comment is missing now, although it looks useful.
node1.query("CREATE TABLE dist ON CLUSTER 'test_cluster' (x Int32, y Int32 DEFAULT x + 100, z Int32 DEFAULT x + y) ENGINE = Distributed('test_cluster', currentDatabase(), local, y)") | ||
node1.query("CREATE TABLE local ON CLUSTER 'test_cluster' (x Int32, y Int32 DEFAULT x + 200, z Int32 DEFAULT x - y) ENGINE = MergeTree() ORDER BY y") | ||
|
||
for insert_sync in [0, 1]: |
This can be done in a better way, using the @pytest.mark.parametrize decorator.
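For example, a possible shape of that suggestion (a sketch only: the started_cluster fixture, the module-level node1, and the settings and error text are assumptions based on the surrounding test fragments, not the actual code from this PR):

import pytest

@pytest.mark.parametrize("insert_sync", [0, 1])
def test_materialized_column_disallow_insert_materialized(started_cluster, insert_sync):
    # Each value of insert_sync becomes its own test case instead of a loop iteration.
    settings = {"insert_distributed_sync": insert_sync, "insert_allow_materialized_columns": 0}
    expected_error = "..."  # placeholder for the error text checked by the real test
    assert expected_error in node1.query_and_get_error(
        "INSERT INTO TABLE dist (x, y) VALUES (1, 11), (2, 22), (3, 33)", settings=settings)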
assert expected_error in node1.query_and_get_error("INSERT INTO TABLE dist (x, y) VALUES (1, 11), (2, 22), (3, 33)", settings=settings)
# Almost the same as the previous test `test_materialized_column_disallow_insert_materialized`, but the sharding key has different values. |
This is just to ensure that partitioning by sharding key indeed works, right?
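One way such a test can check that rows actually land on different shards (a sketch, assuming node1/node2 cluster fixtures and the dist/local tables defined in the test fragments above):

node1.query("INSERT INTO dist (x) VALUES (1), (2), (3), (4)",
            settings={"insert_distributed_sync": 1})
# If the sharding key takes different values, both shards should receive some rows.
assert int(node1.query("SELECT count() FROM local")) > 0
assert int(node2.query("SELECT count() FROM local")) > 0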
Backport #28637 to 21.10: Fix materialized column as sharding key
Backport #28637 to 21.9: Fix materialized column as sharding key
Backport #28637 to 21.8: Fix materialized column as sharding key
Backport #28637 to 21.7: Fix materialized column as sharding key
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category:
Changelog entry:
Allow using a materialized column as the sharding key in a distributed table even if insert_allow_materialized_columns=0.