-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Referential dependencies for RESTORE #43834
Referential dependencies for RESTORE #43834
Conversation
@@ -37,8 +37,8 @@ StorageSystemTables::StorageSystemTables(const StorageID & table_id_) | |||
{"data_paths", std::make_shared<DataTypeArray>(std::make_shared<DataTypeString>())}, | |||
{"metadata_path", std::make_shared<DataTypeString>()}, | |||
{"metadata_modification_time", std::make_shared<DataTypeDateTime>()}, | |||
{"dependencies_database", std::make_shared<DataTypeArray>(std::make_shared<DataTypeString>())}, | |||
{"dependencies_table", std::make_shared<DataTypeArray>(std::make_shared<DataTypeString>())}, | |||
{"dependent_views_database", std::make_shared<DataTypeArray>(std::make_shared<DataTypeString>())}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this break queries accessing the old names? If so, is it really worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to rename it, and this is a system table so it can be changed from time to time.
But it's not necessary to do that right now in this PR, so I think I'll move this change to a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to rename it, and this is a system table so it can be changed from time to time.
Yes, but it's still a pain from the user point of view as it requires different queries for different CH version (and CH doesn't support SQL scripts).
If you are thinking about renaming it, why not take the opportunity to improve the API with other types? Maybe it should be an array of tuples (database, table) so you don't have to be doing arrayZip
to use get the pair.
ecdeb4e
to
c565430
Compare
c565430
to
202eb9a
Compare
6711f26
to
cff43d1
Compare
b7b1da3
to
37546d2
Compare
when restoring tables from a backup.
37546d2
to
730034a
Compare
730034a
to
e3186b6
Compare
will this PR resolve #39416 ? |
I expect so. |
@@ -65,7 +65,9 @@ SELECT dictGet('test_01155_ordinary.dict', 'x', 'after renaming database'); | |||
SELECT database, substr(name, 1, 10) FROM system.tables WHERE database like 'test_01155_%'; | |||
|
|||
-- Move tables back | |||
SET check_table_dependencies=0; -- Otherwise we'll get error "test_01155_atomic.dict depends on test_01155_ordinary.dist" in the next line. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is incorrect because there's no table test_01155_atomic.dict
when we rename test_01155_ordinary
to test_01155_atomic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment should be
Otherwise we'll get error "test_01155_ordinary.dict depends on test_01155_ordinary.dist" in the next line.
I'll change that in my next PR.
@@ -14,6 +14,7 @@ | |||
#include <Common/MultiVersion.h> | |||
#include <Common/OpenTelemetryTraceContext.h> | |||
#include <Common/RemoteHostFilter.h> | |||
#include <Common/ThreadPool.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This include looks redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ThreadPool
is used in this header
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found only two usages: two functions return ThreadPool &
, but it does not require the definition
/// TO target_table (for materialized views) | ||
if (to_table.database.empty()) | ||
to_table.database = data.default_database; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not be empty, because the query is normalized . Btw, can default_database
be different from current_database
? (Seems like it can)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It must be the same for the global context. I just thought maybe it's better to check dependencies for a backup in the query context, not in the global context, so probably I'll replace that with the current database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the code so now it uses the current database (#44158)
QualifiedTableName as_table{create.as_database, create.as_table}; | ||
if (!as_table.table.empty()) | ||
{ | ||
/// AS table_name | ||
if (as_table.database.empty()) | ||
as_table.database = data.default_database; | ||
data.dependencies.emplace(as_table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we should not count it as a referential dependency, because we always replace AS table
clause with normal table definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought maybe someone will try to forge a backup manually so I tried to support maximum of what's easy to support, and not to rely on that all the queries are normalized too much. Because a backup works with create queries, and not attach queries. Though it's very likely that all queries inside a backup are normalized (because those queries are basically the output of SHOW CREATE TABLE
)
/// ASTTableExpression represents a reference to a table in SELECT query. | ||
/// DDLDependencyVisitor should handle ASTTableExpression because some CREATE queries can contain SELECT queries after AS | ||
/// (for example, CREATE VIEW). | ||
void visitTableExpression(const ASTTableExpression & expr, DDLDependencyVisitor::Data & data) | ||
{ | ||
if (!expr.database_and_table_name) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a test with queries like CREATE VIEW ... AS SELECT ... FROM (SELECT ... FROM table)
? (I guess it will just visit another ASTTableExpression
, and it will work fine, but it's good to have more complex tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the test in #44158
if (table_engine.name == "Buffer") | ||
extractDatabaseAndTableNameFromArguments(table_engine, data, 0, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to add a test for this as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the test before but it was removed because I needed to shorten this PR. I'll add it again in the next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the test in #44158
void visitTableEngine(const ASTFunction & table_engine, DDLDependencyVisitor::Data & data) | ||
{ | ||
/// It can be table/dictionary from default database or XML dictionary, but we cannot distinguish it here. | ||
qualified_name.database = data.default_database; | ||
if (table_engine.name == "Dictionary") | ||
extractQualifiedTableNameFromArgument(table_engine, data, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also handle Distributed
and Merge
engines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Distributed
- yes, in the next PR. I'm not sure about Merge
(because of regexp)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Distributed
engine is handled in #44158
void log() const; | ||
|
||
private: | ||
struct Node : public std::enable_shared_from_this<Node> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason why enable_shared_from_this
is required is
nodes.erase(node->shared_from_this());
Consider using "heterogeneous lookup": https://godbolt.org/z/YKn96GzWh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that in #44158
@vitlibar @tavplubix could you look to is "head" version is build for master github branch? |
if (remove_isolated_tables && dependency_node->dependencies.empty() && dependency_node->dependents.empty()) | ||
{ | ||
removeNode(dependency_node); | ||
if (table_node == dependency_node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand correctly that it's possible only when a table depends on itself? I would prefer to forbid it and do not add dependencies like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. But it looks like there is no difference actually between a view selecting from itself and a view selecting from a second view which selects from the first view. So I think the graph can handle it like any other cyclic dependency.
} | ||
} | ||
|
||
if (remove_isolated_tables && !table_node_removed && table_node->dependencies.empty() && table_node->dependents.empty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
table_node->dependencies.empty()
is always true
if (table_id.hasUUID() && node->storage_id.hasUUID() && (table_id.uuid != node->storage_id.uuid)) | ||
return nullptr; /// UUID is different, it's not the node we're looking for. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is it possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If everything is fine that shouldn't happen. I've added a comment and a warning here (#44158).
while (!nodes_to_process.empty()) | ||
{ | ||
size_t old_num_sorted = nodes_sorted_by_level_lazy.size(); | ||
|
||
for (auto it = nodes_to_process.begin(); it != nodes_to_process.end();) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it's at least O(V^2)
in the worst case. Imagine a graph that looks like a linked list (A -> B -> C -> D -> ...). It will remove only one node on each iteration of the outer loop. But the inner loop will do nodes_to_process.size()
iterations each time. So it will be V + (V-1) + (V-2) + (V-3) + ... = O(V^2)
iterations in total. But the previous topsort implementation was O(V + E)
if I'm not mistaken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it's even O(V^3)
if we add A->C, A->D, A->E, ..., B->D, B->E, ... edges to the linked list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the code so now it's O(V + E)
again.
Fix clang tidy errors introduced in #43834
auto need_exclude_dependency = [this, remove_loaded](const QualifiedTableName & dependency_name, const DependenciesInfo & info) | ||
for (const auto & [table_name, table_metadata] : metadata.parsed_tables) | ||
{ | ||
auto new_loading_dependencies = getLoadingDependenciesFromCreateQuery(global_context, table_name, table_metadata.ast); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will traverse all ASTs in single thread, but maybe it's okay
{ | ||
const auto & dependents = referential_dependencies.getDependents(removing_table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should add another setting for checking referential dependencies, but I'm not sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general LGTM
Thank you for the review, I'm working on your comments and will create a new PR soon. |
@vitlibar please refer new PR in this comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been looking at the code changes due to a related issue (#44496, but not introduced in this PR) and it seems to be it's missing several functions
|
||
ssize_t DDLMatcherBase::getPositionOfTableNameArgument(const ASTFunction & function) | ||
{ | ||
if (function.name == "joinGet" || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be missing joinGetOrNull
and could be greatly simplified (and grouped logically) by doing something like:
if (functionIsJoinGet(function.name) || functionIsDictGet(function.name))
function.name.starts_with("dictGet")) | ||
return 0; | ||
|
||
if (Poco::toLower(function.name) == "in") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a similar fashion, what about notIn, nullIn, globalIn, etc? Should this be functionIsInOrGlobalInOperator(function.name)
?
{ | ||
assert(false); | ||
return; | ||
if (function.name == "joinGet" || function.name == "dictHas" || function.name == "dictIsIn" || function.name.starts_with("dictGet")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Having this logic repeated seems wrong and I think using the functions in misc.h is much cleaner.
Changelog category:
Changelog entry:
Implement referential dependencies and use them to create tables in the correct order while restoring from a backup.