server: fix a race in tenant creation #107666

Merged
merged 1 commit into from
Jul 27, 2023


@lidorcarmel lidorcarmel commented Jul 26, 2023

Previously, scanTenantsForRunnableServices() did not hold the mutex while SELECTing the existing tenant names, so the following interleaving was possible:

  • scanTenantsForRunnableServices() sees that only the system tenant exists
  • createServerEntryLocked() then adds another tenant while holding the mutex
  • scanTenantsForRunnableServices() takes the lock and stops the tenant that was just created, because its now-stale result says only the system tenant should be alive (which is wrong)

This patch changes scanTenantsForRunnableServices() to take the mutex before SELECTing for the existing tenants in order to avoid the race.
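
For readers unfamiliar with the code, here is a minimal, self-contained sketch of the interleaving and of the lock ordering this patch adopts. It is not the actual CockroachDB code: `controller`, `tenantTable`, `scanRacy`, `scanLocked`, and the `between` hook are hypothetical names, and an in-memory map stands in for the SELECT.

```go
package main

import (
	"fmt"
	"sync"
)

// tenantTable is a hypothetical stand-in for the table the real code SELECTs from.
type tenantTable struct {
	mu    sync.Mutex
	names map[string]struct{}
}

func (t *tenantTable) add(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.names[name] = struct{}{}
}

// snapshot returns a copy of the tenant names known at the time of the call.
func (t *tenantTable) snapshot() map[string]struct{} {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make(map[string]struct{}, len(t.names))
	for n := range t.names {
		out[n] = struct{}{}
	}
	return out
}

// controller is a hypothetical, simplified server controller.
type controller struct {
	table   *tenantTable
	mu      sync.Mutex          // guards servers
	servers map[string]struct{} // tenant servers currently running
}

func newController() *controller {
	c := &controller{
		table:   &tenantTable{names: map[string]struct{}{}},
		servers: map[string]struct{}{},
	}
	c.create("system")
	return c
}

// create registers a tenant and starts its server while holding the mutex.
func (c *controller) create(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.table.add(name)
	c.servers[name] = struct{}{}
}

// stopUnexpectedLocked stops every server not listed in want. Caller holds c.mu.
func (c *controller) stopUnexpectedLocked(want map[string]struct{}) {
	for name := range c.servers {
		if _, ok := want[name]; !ok {
			delete(c.servers, name)
		}
	}
}

// scanRacy mirrors the old ordering: look up first, lock second.
// between is a test hook that runs inside the unprotected window.
func (c *controller) scanRacy(between func()) {
	want := c.table.snapshot()
	between()
	c.mu.Lock()
	defer c.mu.Unlock()
	c.stopUnexpectedLocked(want)
}

// scanLocked mirrors the ordering in this patch: lock first, then look up.
func (c *controller) scanLocked() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.stopUnexpectedLocked(c.table.snapshot())
}

func (c *controller) running(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.servers[name]
	return ok
}

func main() {
	racy := newController()
	racy.scanRacy(func() { racy.create("app") })
	fmt.Println("racy scan keeps app:", racy.running("app")) // false

	fixed := newController()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); fixed.create("app") }()
	go func() { defer wg.Done(); fixed.scanLocked() }()
	wg.Wait()
	fmt.Println("locked scan keeps app:", fixed.running("app")) // true
}
```

In the locked variant the creation either completes before the scan takes the mutex (so the lookup sees the new tenant) or waits until the scan has finished, so the new server survives in both orders. The cost is that the mutex is held for the full duration of the lookup.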

Epic: none
Fixes: #107434
Fixes: #107343
Fixes: #107154

Release note: None

@lidorcarmel lidorcarmel requested review from a team as code owners July 26, 2023 20:19
blathers-crl bot commented Jul 26, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity

This change is Reviewable

@lidorcarmel lidorcarmel requested a review from knz July 26, 2023 20:19
@lidorcarmel lidorcarmel added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Jul 26, 2023
@stevendanna

Possibly also fixes:

#107343
#107154


@knz knz left a comment

Good catch

@knz knz added the db-cy-23 label Jul 27, 2023

knz commented Jul 27, 2023

let's merge this - i need it in a different PR too!

bors r+


craig bot commented Jul 27, 2023

This PR was included in a batch that timed out, it will be automatically retried

@adityamaru

Also fixes the following I think:
#107686
#107687


craig bot commented Jul 27, 2023

Build failed (retrying...):


craig bot commented Jul 27, 2023

Build succeeded:

@craig craig bot merged commit 68e43c8 into cockroachdb:master Jul 27, 2023
6 of 7 checks passed
craig bot pushed a commit that referenced this pull request Jul 28, 2023
107820: db-console: delete unused vars and enforce eslint rule r=maryliag a=xinhaoz

This commit turns the eslint rule no-unused-vars to errors. It removes all unused vars in the db-console application.

Epic: none

Release note: None

107824: server: prevent deadlocks in server orchestration r=lidorcarmel,andrewbaptist a=knz

Fixes #107564.
Fixes #107791.
Supersedes #107666.

The previous fix in this area (5ca5703) correctly identified the case where `createServerEntryLocked()` was called concurrently with `scanTenantsForRunnableServices()`, in which case we ran the risk of immediately tearing down the new server because it hadn't been picked up by `getExpectedRunningTenants()`.

However, the fix was incorrect: it caused the controller mutex to be held for the duration of `getExpectedRunningTenants()`, which can itself hang. In that case, a cascading failure could result.

This patch replaces that fix, while still solving the original problem, by considering for removal only the entries that existed prior to the call to `getExpectedRunningTenants()`. No mutex needs to be held during that call.
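
The approach described above can be sketched roughly as follows. This is an illustrative reduction, not the code from the patch: `controller`, `scan`, and the `expected` callback are hypothetical names, with `expected` standing in for `getExpectedRunningTenants()`.

```go
package main

import (
	"fmt"
	"sync"
)

// controller is a hypothetical, simplified server controller.
type controller struct {
	mu      sync.Mutex          // guards servers
	servers map[string]struct{} // tenant servers currently running
}

// create starts a server entry while holding the mutex.
func (c *controller) create(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.servers[name] = struct{}{}
}

func (c *controller) running(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.servers[name]
	return ok
}

// scan stops servers that are no longer expected, but only those that
// already existed before the (possibly slow) expected() lookup started.
func (c *controller) scan(expected func() map[string]struct{}) {
	// Step 1: briefly take the lock to record which entries exist right now.
	c.mu.Lock()
	before := make([]string, 0, len(c.servers))
	for name := range c.servers {
		before = append(before, name)
	}
	c.mu.Unlock()

	// Step 2: run the lookup with no lock held, so a hang here cannot
	// block other users of the mutex.
	want := expected()

	// Step 3: remove only pre-existing entries that are not expected.
	// Entries created during step 2 are not in before, so they are left alone.
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, name := range before {
		if _, ok := want[name]; !ok {
			delete(c.servers, name)
		}
	}
}

func main() {
	c := &controller{servers: map[string]struct{}{"system": {}, "stale": {}}}
	c.scan(func() map[string]struct{} {
		// Simulate a tenant being created while the lookup is in flight.
		// This call would deadlock if scan held c.mu across the lookup.
		c.create("app")
		return map[string]struct{}{"system": {}}
	})
	fmt.Println(c.running("system"), c.running("stale"), c.running("app")) // true false true
}
```

The stale entry is still stopped, the entry created mid-lookup survives, and the creation inside the callback does not block because the mutex is free while the lookup runs.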

Release note: None
Epic: CRDB-28893

Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 db-cy-23
5 participants