Avoid deadlocks on JOIN Engine tables #29544

Merged (8 commits) on Oct 12, 2021

Conversation

Algunenano (Member)

Changelog category (leave one):

  • Bug Fix (user-visible misbehaviour in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Avoid deadlocks when reading and writing on JOIN Engine tables at the same time

Detailed description / Documentation draft:

The patch ended up requiring way more changes than I wanted. I decided to replace the shared_mutex with an RWLock for several reasons:

  • The main one: it allows getting a READ lock after you've already gotten one, even if another request/thread has asked for a write lock in the meantime. This is what caused the deadlock in Deadlock with JOIN Engine #29485 (see the sketch after this list).
  • RWLock has profile events, so it's easier to track
  • RWLock has timeouts, which is a nice way to avoid deadlocks requiring a database restart (as you can't kill a query that's forever waiting for a mutex).
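
As a minimal illustration of that first point, here is a self-contained sketch (plain std::shared_mutex, not the actual StorageJoin code) of the pattern that deadlocked: a query takes a read lock, a write queues up, and the same query then asks for a second read lock. With a writer-preferring lock implementation the second read waits behind the writer and nobody can make progress.

```cpp
#include <shared_mutex>
#include <thread>
#include <chrono>
#include <iostream>

// Illustrative only: the same lock is taken twice for reading by the "query"
// thread, with a writer queueing up in between. On writer-preferring
// implementations the second read waits for the writer, the writer waits for
// the first read lock to be released, and neither can make progress.
std::shared_mutex table_mutex;

int main()
{
    using namespace std::chrono_literals;

    std::thread reader([]
    {
        std::shared_lock first_read(table_mutex);   // e.g. taken early in the SELECT
        std::this_thread::sleep_for(100ms);         // give the writer time to queue up
        std::shared_lock second_read(table_mutex);  // e.g. taken again later: may block forever
        std::cout << "reader finished\n";
    });

    std::thread writer([]
    {
        std::this_thread::sleep_for(50ms);          // start after the first read lock is held
        std::unique_lock write(table_mutex);        // e.g. an INSERT into the JOIN table
        std::cout << "writer finished\n";
    });

    reader.join();
    writer.join();
}
```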

Since RWLock requires a query_id to work effectively, I had to add a context parameter in multiple places. The two functions where I didn't add it, and used RWLockImpl::NO_QUERY instead, were totalRows and totalBytes, as their declaration is shared with 10 other places; it could still be done if we wanted to.

The added test fails all the time on my system pre-change (usually even in the second iteration of the loop), but I had to increase the size of the table until it was reliably deadlocking the database, as it's purely time-based.

Fixes #29485
The bug affects all stable releases so it would be great to have it backported to 21.8+ at least. Let me know if I can help there if the proposed solution is accepted.
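
For reference, a rough sketch of the locking pattern the description above implies (not an excerpt from the patch): take the table lock through RWLock with the query id when a Context is available, fall back to RWLockImpl::NO_QUERY otherwise, and turn a timeout into an error instead of waiting forever. The helper name and the exact getLock/LockHolder signatures are assumptions here.

```cpp
// Hypothetical helper, assuming RWLockImpl::getLock(type, query_id, timeout)
// returns an empty holder on timeout; names and signatures are illustrative.
RWLockImpl::LockHolder tryLockTimedWithContext(
    const RWLock & lock, RWLockImpl::Type type, ContextPtr context, std::chrono::milliseconds timeout)
{
    /// Use the query id when we have a context; NO_QUERY otherwise
    /// (e.g. totalRows/totalBytes, which don't receive a context in this patch).
    const String query_id = context ? context->getInitialQueryId() : RWLockImpl::NO_QUERY;

    auto holder = lock->getLock(type, query_id, timeout);
    if (!holder)
        throw Exception("Lock attempt timed out! Possible deadlock avoided.", ErrorCodes::DEADLOCK_AVOIDED);
    return holder;
}
```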

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Sep 29, 2021
@vdimir vdimir self-assigned this Sep 30, 2021
@vdimir (Member) left a comment


Test 01732_race_condition_storage_join_long.sh is interesting with these changes. Now all queries in this test should succeed.
Would some of them fail with a timeout error after the change?

Comment on lines 748 to 749
throw DB::Exception("addJoinedBlock called when HashJoin locked to prevent updates",
ErrorCodes::LOGICAL_ERROR);
Member

It's not changed in this PR, but I wonder why it's LOGICAL_ERROR. Can it be reached if we perform an INSERT into a JOIN Engine table during a SELECT?

UPD: because the lock is already acquired in StorageJoin::insertBlock, while storage_join_lock can only be set from StorageJoin::getJoinLocked, which is used for SELECTing data.

@Algunenano (Member Author)

Test 01732_race_condition_storage_join_long.sh is interesting with these changes. Now all queries in this test should succeed.
Would some of them fail with a timeout error after the change?

Before the change (master) the queries would get stuck forever if a deadlock happened (a read starts and gets the lock, a write starts and asks for the write lock, the first read tries to get another read lock). The lock_acquire_timeout setting doesn't have any effect in master because it uses a shared mutex that doesn't respect the timeout.

With this change, the issue should not appear anymore, as RWLock allows you to jump the lock queue if you already hold a read lock and request another. This is what fixes the bug. The timeouts only happen in the tests because I set them too low for the sanitizers; if they were large enough, the queries should eventually go through. I'm not testing that the timeout occurs for low enough values of lock_acquire_timeout because there isn't a reliable way that I know of to trigger it.
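
To make the "jump the lock queue" behaviour concrete, below is a toy model (not ClickHouse's RWLock, which additionally tracks ownership groups, timeouts and profile events) of a read-write lock that remembers which query ids already hold a read lock and lets them re-acquire it even while a writer is queued; new queries still wait behind the writer.

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

// Toy sketch of a query-aware read-write lock: a query that already holds a
// read lock may take another read lock immediately, even if a writer is
// waiting; queries without a read lock queue up behind the writer as usual.
class QueryAwareRWLock
{
public:
    void lockRead(const std::string & query_id)
    {
        std::unique_lock lk(m);
        // A query that already holds a read lock "jumps the queue".
        if (readers_per_query[query_id] == 0)
            cv.wait(lk, [&] { return !writer_active && waiting_writers == 0; });
        ++readers_per_query[query_id];
        ++active_readers;
    }

    void unlockRead(const std::string & query_id)
    {
        std::unique_lock lk(m);
        if (--readers_per_query[query_id] == 0)
            readers_per_query.erase(query_id);
        if (--active_readers == 0)
            cv.notify_all();
    }

    void lockWrite()
    {
        std::unique_lock lk(m);
        ++waiting_writers;
        cv.wait(lk, [&] { return !writer_active && active_readers == 0; });
        --waiting_writers;
        writer_active = true;
    }

    void unlockWrite()
    {
        std::unique_lock lk(m);
        writer_active = false;
        cv.notify_all();
    }

private:
    std::mutex m;
    std::condition_variable cv;
    std::map<std::string, int> readers_per_query;
    int active_readers = 0;
    int waiting_writers = 0;
    bool writer_active = false;
};
```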

@Algunenano (Member Author)

Oops, I see that you were talking about a different test and not the one added in the PR.

Would some of them fail with a timeout error after the change?

Not unless the server is extremely slow for other reasons, and I would say that would be a desirable outcome (respecting the settings instead of blocking forever). In the test itself there are only 4 concurrent queries (3 reads and an insert), so I think it's highly unlikely that they would need to wait 120 seconds to acquire the lock.

@vdimir (Member)

vdimir commented Oct 1, 2021

Not unless the server is extremely slow for other reasons, and I would say that would be a desirable outcome (respecting the settings instead of blocking forever). In the test itself there are only 4 concurrent queries (3 reads and an insert), so I think it's highly unlikely that they would need to wait 120 seconds to acquire the lock.

If I'm not mistaken, the lock is acquired for the whole SELECT query: we get StorageJoin in HashJoin, which holds the lock during execution. For INSERT the lock is acquired per block.

Another question: is RWLock used only because it supports timeouts? Could the problem be solved with std::timed_mutex, for instance? As #29485 (comment) says, RWLock is a bit specific.

Finally, have you researched how difficult it would be to solve the problem without a timeout? E.g. by taking the mutex in the correct order, changing the logic a bit? What is the main issue here?

@Algunenano (Member Author)

If I'm not mistaken, the lock is acquired for the whole SELECT query: we get StorageJoin in HashJoin, which holds the lock during execution. For INSERT the lock is acquired per block.

For SELECT queries the lock is acquired multiple times: the first time, I think, happens during InterpreterSelectQuery, and then again when read() is called. There might be more.

Another question: is RWLock used only because it supports timeouts? Could the problem be solved with std::timed_mutex, for instance? As #29485 (comment) says, RWLock is a bit specific.

No, the main reason was to avoid the deadlock due to acquiring the same lock multiple times (with somebody else getting in between). shared_timed_mutex would not fix that. Anything else that allows this would work in this situation.

Finally, have you researched how difficult it would be to solve the problem without a timeout? E.g. by taking the mutex in the correct order, changing the logic a bit? What is the main issue here?

I think it's doable, ideally by just taking the mutex at the start (InterpreterSelectQuery) and never acquiring it again under any circumstance. Two situations show how tricky that would be:

  • If you use joinGet or JOIN multiple times in different subqueries, you would need to make sure you only try to take the mutex once and don't request it again, as otherwise a write query can race in between them, which can't be fixed by std mutexes (that I know of).
  • If you use joinGet over a table and, at the same time, read the size of that table via system.tables, you would also try to take the mutex twice, once at the start and once at runtime as far as I can see. I'm not sure what would happen if you run several such queries concurrently, each doing an INSERT into the join table from that.

So, from what's available, I couldn't see a better option; but I'm not saying there isn't one.

@Algunenano (Member Author)

I can't access the logs of the failed tests as they point to an internal URL. Are they related to changes in this PR?

@vdimir (Member)

vdimir commented Oct 7, 2021

I can't access the logs of the failed tests as they point to an internal URL. Are they related to changes in this PR?

Seems that it isn't related; the task just timed out (I see a similar issue in other PRs).

No, the main reason was to avoid the deadlock due to acquiring the same lock multiple times (with somebody else getting in between). shared_timed_mutex would not fix that. Anything else that allows this would work in this situation.

What about std::recursive_timed_mutex?

@Algunenano (Member Author)

Seems that it isn't related; the task just timed out (I see a similar issue in other PRs).

Thanks!

What about std::recursive_timed_mutex?

It only provides exclusive ownership, so you wouldn't be able to have multiple read queries running at the same time.
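
A small self-contained example of that limitation: std::recursive_timed_mutex lets the same thread lock again and supports try_lock_for, but ownership is exclusive, so two concurrent "readers" serialise (and one of them times out here).

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

// std::recursive_timed_mutex allows the same thread to lock again and
// supports try_lock_for, but ownership is exclusive: two "read" threads
// cannot hold it at the same time, so concurrent SELECTs would serialise.
std::recursive_timed_mutex rmutex;

void read_query(const char * name)
{
    using namespace std::chrono_literals;
    if (!rmutex.try_lock_for(100ms))
    {
        std::cout << name << ": timed out waiting for the other reader\n";
        return;
    }
    // Re-locking from the same thread is fine (recursive), unlike shared_mutex.
    rmutex.lock();
    std::this_thread::sleep_for(200ms);   // simulate reading the table
    rmutex.unlock();
    rmutex.unlock();
    std::cout << name << ": done\n";
}

int main()
{
    std::thread a(read_query, "reader A");
    std::thread b(read_query, "reader B");
    a.join();
    b.join();
}
```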

@vdimir (Member)

vdimir commented Oct 11, 2021

Functional stateless tests flaky check (address)
02033_join_engine_deadlock
Test runs too long (> 60s). Make it faster.

Let's mark it as long

@Algunenano (Member Author)
