Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Referential dependencies for RESTORE #43834

Merged

Conversation

vitlibar
Copy link
Member

@vitlibar vitlibar commented Nov 30, 2022

Changelog category:

  • Improvement

Changelog entry:

Implement referential dependencies and use them to create tables in the correct order while restoring from a backup.

@vitlibar vitlibar added the do not test disable testing on pull request label Nov 30, 2022
@tavplubix tavplubix self-assigned this Nov 30, 2022
@nikitamikhaylov nikitamikhaylov added the force tests The label does nothing, NOOP, None, nil label Nov 30, 2022
@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Nov 30, 2022
@nikitamikhaylov nikitamikhaylov removed the force tests The label does nothing, NOOP, None, nil label Nov 30, 2022
@@ -37,8 +37,8 @@ StorageSystemTables::StorageSystemTables(const StorageID & table_id_)
{"data_paths", std::make_shared<DataTypeArray>(std::make_shared<DataTypeString>())},
{"metadata_path", std::make_shared<DataTypeString>()},
{"metadata_modification_time", std::make_shared<DataTypeDateTime>()},
{"dependencies_database", std::make_shared<DataTypeArray>(std::make_shared<DataTypeString>())},
{"dependencies_table", std::make_shared<DataTypeArray>(std::make_shared<DataTypeString>())},
{"dependent_views_database", std::make_shared<DataTypeArray>(std::make_shared<DataTypeString>())},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this break queries accessing the old names? If so, is it really worth it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to rename it, and this is a system table so it can be changed from time to time.
But it's not necessary to do that right now in this PR, so I think I'll move this change to a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to rename it, and this is a system table so it can be changed from time to time.

Yes, but it's still a pain from the user point of view as it requires different queries for different CH version (and CH doesn't support SQL scripts).

If you are thinking about renaming it, why not take the opportunity to improve the API with other types? Maybe it should be an array of tuples (database, table) so you don't have to be doing arrayZip to use get the pair.

@vitlibar vitlibar removed the do not test disable testing on pull request label Dec 1, 2022
@vitlibar vitlibar marked this pull request as ready for review December 1, 2022 15:24
@vitlibar vitlibar changed the title Referential dependencies Referential dependencies for RESTORE Dec 1, 2022
@vitlibar vitlibar force-pushed the referential-dependencies branch 2 times, most recently from b7b1da3 to 37546d2 Compare December 1, 2022 21:11
@nikitamikhaylov nikitamikhaylov merged commit e1f7f04 into ClickHouse:master Dec 2, 2022
@Slach
Copy link
Contributor

Slach commented Dec 2, 2022

will this PR resolve #39416 ?

@vitlibar
Copy link
Member Author

vitlibar commented Dec 2, 2022

will this PR resolve #39416 ?

I expect so.

@@ -65,7 +65,9 @@ SELECT dictGet('test_01155_ordinary.dict', 'x', 'after renaming database');
SELECT database, substr(name, 1, 10) FROM system.tables WHERE database like 'test_01155_%';

-- Move tables back
SET check_table_dependencies=0; -- Otherwise we'll get error "test_01155_atomic.dict depends on test_01155_ordinary.dist" in the next line.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is incorrect because there's no table test_01155_atomic.dict when we rename test_01155_ordinary to test_01155_atomic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment should be

Otherwise we'll get error "test_01155_ordinary.dict depends on test_01155_ordinary.dist" in the next line.

I'll change that in my next PR.

@@ -14,6 +14,7 @@
#include <Common/MultiVersion.h>
#include <Common/OpenTelemetryTraceContext.h>
#include <Common/RemoteHostFilter.h>
#include <Common/ThreadPool.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This include looks redundant

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ThreadPool is used in this header

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found only two usages: two functions return ThreadPool &, but it does not require the definition

@vitlibar vitlibar deleted the referential-dependencies branch December 2, 2022 15:30
Comment on lines +24 to +26
/// TO target_table (for materialized views)
if (to_table.database.empty())
to_table.database = data.default_database;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not be empty, because the query is normalized . Btw, can default_database be different from current_database? (Seems like it can)

Copy link
Member Author

@vitlibar vitlibar Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be the same for the global context. I just thought maybe it's better to check dependencies for a backup in the query context, not in the global context, so probably I'll replace that with the current database.

Copy link
Member Author

@vitlibar vitlibar Dec 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the code so now it uses the current database (#44158)

Comment on lines +30 to +36
QualifiedTableName as_table{create.as_database, create.as_table};
if (!as_table.table.empty())
{
/// AS table_name
if (as_table.database.empty())
as_table.database = data.default_database;
data.dependencies.emplace(as_table);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we should not count it as a referential dependency, because we always replace AS table clause with normal table definition.

Copy link
Member Author

@vitlibar vitlibar Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought maybe someone will try to forge a backup manually so I tried to support maximum of what's easy to support, and not to rely on that all the queries are normalized too much. Because a backup works with create queries, and not attach queries. Though it's very likely that all queries inside a backup are normalized (because those queries are basically the output of SHOW CREATE TABLE)

Comment on lines +40 to +46
/// ASTTableExpression represents a reference to a table in SELECT query.
/// DDLDependencyVisitor should handle ASTTableExpression because some CREATE queries can contain SELECT queries after AS
/// (for example, CREATE VIEW).
void visitTableExpression(const ASTTableExpression & expr, DDLDependencyVisitor::Data & data)
{
if (!expr.database_and_table_name)
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a test with queries like CREATE VIEW ... AS SELECT ... FROM (SELECT ... FROM table)? (I guess it will just visit another ASTTableExpression, and it will work fine, but it's good to have more complex tests)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the test in #44158

Comment on lines +175 to +176
if (table_engine.name == "Buffer")
extractDatabaseAndTableNameFromArguments(table_engine, data, 0, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to add a test for this as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the test before but it was removed because I needed to shorten this PR. I'll add it again in the next PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the test in #44158

Comment on lines +170 to +173
void visitTableEngine(const ASTFunction & table_engine, DDLDependencyVisitor::Data & data)
{
/// It can be table/dictionary from default database or XML dictionary, but we cannot distinguish it here.
qualified_name.database = data.default_database;
if (table_engine.name == "Dictionary")
extractQualifiedTableNameFromArgument(table_engine, data, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also handle Distributed and Merge engines?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Distributed - yes, in the next PR. I'm not sure about Merge (because of regexp)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Distributed engine is handled in #44158

void log() const;

private:
struct Node : public std::enable_shared_from_this<Node>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason why enable_shared_from_this is required is

nodes.erase(node->shared_from_this());

Consider using "heterogeneous lookup": https://godbolt.org/z/YKn96GzWh

Copy link
Member Author

@vitlibar vitlibar Dec 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that in #44158

@Slach
Copy link
Contributor

Slach commented Dec 2, 2022

@vitlibar @tavplubix could you look to
#39416 (comment)

is "head" version is build for master github branch?

if (remove_isolated_tables && dependency_node->dependencies.empty() && dependency_node->dependents.empty())
{
removeNode(dependency_node);
if (table_node == dependency_node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that it's possible only when a table depends on itself? I would prefer to forbid it and do not add dependencies like this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But it looks like there is no difference actually between a view selecting from itself and a view selecting from a second view which selects from the first view. So I think the graph can handle it like any other cyclic dependency.

}
}

if (remove_isolated_tables && !table_node_removed && table_node->dependencies.empty() && table_node->dependents.empty())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

table_node->dependencies.empty() is always true

Comment on lines +250 to +251
if (table_id.hasUUID() && node->storage_id.hasUUID() && (table_id.uuid != node->storage_id.uuid))
return nullptr; /// UUID is different, it's not the node we're looking for.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is it possible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If everything is fine that shouldn't happen. I've added a comment and a warning here (#44158).

Comment on lines +555 to +559
while (!nodes_to_process.empty())
{
size_t old_num_sorted = nodes_sorted_by_level_lazy.size();

for (auto it = nodes_to_process.begin(); it != nodes_to_process.end();)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it's at least O(V^2) in the worst case. Imagine a graph that looks like a linked list (A -> B -> C -> D -> ...). It will remove only one node on each iteration of the outer loop. But the inner loop will do nodes_to_process.size() iterations each time. So it will be V + (V-1) + (V-2) + (V-3) + ... = O(V^2) iterations in total. But the previous topsort implementation was O(V + E) if I'm not mistaken.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it's even O(V^3) if we add A->C, A->D, A->E, ..., B->D, B->E, ... edges to the linked list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the code so now it's O(V + E) again.

@nickitat
Copy link
Member

nickitat commented Dec 3, 2022

#43911

rschu1ze added a commit that referenced this pull request Dec 4, 2022
Fix clang tidy errors introduced in #43834
auto need_exclude_dependency = [this, remove_loaded](const QualifiedTableName & dependency_name, const DependenciesInfo & info)
for (const auto & [table_name, table_metadata] : metadata.parsed_tables)
{
auto new_loading_dependencies = getLoadingDependenciesFromCreateQuery(global_context, table_name, table_metadata.ast);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will traverse all ASTs in single thread, but maybe it's okay

{
const auto & dependents = referential_dependencies.getDependents(removing_table);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add another setting for checking referential dependencies, but I'm not sure

Copy link
Member

@tavplubix tavplubix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general LGTM

@vitlibar
Copy link
Member Author

vitlibar commented Dec 5, 2022

In general LGTM

Thank you for the review, I'm working on your comments and will create a new PR soon.

@Slach
Copy link
Contributor

Slach commented Dec 5, 2022

@vitlibar please refer new PR in this comments

Copy link
Member

@Algunenano Algunenano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been looking at the code changes due to a related issue (#44496, but not introduced in this PR) and it seems to be it's missing several functions


ssize_t DDLMatcherBase::getPositionOfTableNameArgument(const ASTFunction & function)
{
if (function.name == "joinGet" ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be missing joinGetOrNull and could be greatly simplified (and grouped logically) by doing something like:

if (functionIsJoinGet(function.name) || functionIsDictGet(function.name))

function.name.starts_with("dictGet"))
return 0;

if (Poco::toLower(function.name) == "in")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a similar fashion, what about notIn, nullIn, globalIn, etc? Should this be functionIsInOrGlobalInOperator(function.name)?

{
assert(false);
return;
if (function.name == "joinGet" || function.name == "dictHas" || function.name == "dictIsIn" || function.name.starts_with("dictGet"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Having this logic repeated seems wrong and I think using the functions in misc.h is much cleaner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-improvement Pull request with some product improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants