Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CASSANDRA-18747-4.1: Fix race condition in Schema #2713

Conversation

jacek-lewandowski
Copy link
Contributor

@jacek-lewandowski jacek-lewandowski commented Sep 21, 2023

No description provided.

@jacek-lewandowski jacek-lewandowski changed the title Cassandra 18747 4.1 CASSANDRA-18747-4.1: Fix race condition in Schema Sep 22, 2023
@bereng
Copy link
Contributor

bereng commented Sep 25, 2023

Ar you planning on adding a test for the specific scenario? Also you can run the pre-commit CI job rather than the separate one which is more cumbersome and some jobs got missed.

@jacek-lewandowski
Copy link
Contributor Author

jacek-lewandowski commented Sep 25, 2023

I don't think I can add a test for this scenario due to changes I've made. To prove that, I needed to add a sleep between updating one and the other field. With those changes such a test makes no sense.

@bereng
Copy link
Contributor

bereng commented Sep 26, 2023

^Couldn't you inject a sleep + getVersionUnsafe() + getVersion() to test the former is out of sync and the latter is safe to use?

@jacek-lewandowski
Copy link
Contributor Author

So you mean that piece of code in DescriptionStatement right? yeah, I can try

@bereng
Copy link
Contributor

bereng commented Sep 27, 2023

It seems like MigrationCoordinatorTest failed on the repeats. I checked JIRA and Butler and they both seem clean. So it's worth investigating imo

@jacek-lewandowski
Copy link
Contributor Author

Yeah, I know, I'll run repeated test for 4.1 today

@bereng
Copy link
Contributor

bereng commented Sep 27, 2023

^Ok so it failed on 4.1 as well. A ticket for that one needs to be created and I'll get back to this review shortly

@bereng
Copy link
Contributor

bereng commented Sep 28, 2023

+1 pending increasing timeouts

@jacek-lewandowski
Copy link
Contributor Author

is it ok now?

@bereng
Copy link
Contributor

bereng commented Sep 29, 2023

+1 now we need the rest of branches and CI :-)

Copy link
Contributor

@JeremiahDJordan JeremiahDJordan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me. Race conditions are hard.
Only comment is I wouldn’t remove the helper functions if possible.

src/java/org/apache/cassandra/schema/Schema.java Outdated Show resolved Hide resolved
src/java/org/apache/cassandra/schema/Schema.java Outdated Show resolved Hide resolved
Copy link
Contributor

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some nit picks.

Looks correct but I have one concern, while DescribeStatement will be improved and this could be considered a bug, we are changing behavior in a patch version and incurring synchronization cost. No? Also, wasn't this working that way even before 4.1?

An explicit test for the new holder is missing, but we will see the Jenkins failures disappearing after commit. I believe that is all we need in this case.
CI LGTM
My comments on the PR to be considered for all branches (except the 4.1 explicit one around change of behavior)

src/java/org/apache/cassandra/schema/Schema.java Outdated Show resolved Hide resolved
src/java/org/apache/cassandra/schema/Schema.java Outdated Show resolved Hide resolved
src/java/org/apache/cassandra/schema/Schema.java Outdated Show resolved Hide resolved
src/java/org/apache/cassandra/schema/Schema.java Outdated Show resolved Hide resolved
@jacek-lewandowski jacek-lewandowski force-pushed the CASSANDRA-18747-4.1 branch 2 times, most recently from bf2d127 to 550d8e9 Compare October 13, 2023 09:54
@@ -246,9 +253,10 @@ private Keyspace maybeRemoveKeyspaceInstance(String keyspaceName, Consumer<Keysp
}
}

@Deprecated(since = "4.1", forRemoval = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you can simply remove it.

@@ -770,12 +770,12 @@ public static Stream<Keyspace> allExisting()

public static Iterable<Keyspace> nonSystem()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one seems to be use only in Keyspace.drain where we should use nonLocalStrategy() so we should be able to remove it.

@@ -280,7 +280,7 @@ private static Pair<Long, Long> totalNonSystemTablesSize(Predicate<SSTableReader
{
long total = 0;
long filtered = 0;
for (String keyspace : Schema.instance.getNonSystemKeyspaces().names())
for (String keyspace : Schema.instance.distributedKeyspaces().names())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we are lying for TableMetrics too :-(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we address that in 5.0 and change metrics a bit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just a statement. We can let that for later.

Fixed the inconsistency between distributedKeyspaces and distributedAndLocalKeyspaces
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants