Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CASSANDRA-18732 Baseline Diagnostic vtables for Accord #3305

Merged
merged 1 commit into from
May 28, 2024

Conversation

maedhroz
Copy link
Contributor

@maedhroz maedhroz commented May 13, 2024

Copy link
Contributor

@dcapwell dcapwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the migration states table, one of the reasons for using the list/map collections is also better documentation. text is flexible for us but once people start parsing (tools) then we can't actually change things... so we are stuck with the format...

src/java/org/apache/cassandra/utils/PojoToString.java Outdated Show resolved Hide resolved
{
private MigrationStates(String keyspace)
{
super(parse(keyspace,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do I know the migration is done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does a range move from "migrating" -> "migrated"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know it's done when migrated_ranges is the whole ring and should have a single entry. Right now once a migration state has been created for a table it never goes away.

{
private CommandStoreCache(String keyspace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be useful to include instance as well, either as its own table or a column here... having a low hit rate doesn't let you know what's going on, need to see which instances are contributing to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there even any reason to populate this w/ the global Stats? Would it make more sense to expose the Stats from commandCache, timestampsForKeyCache, and commandsForKeyCache by store ID instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the global one is def nice to have, but if you are trying to debug an issue you kinda need to know which "instance" is having the issues... if command cache is 99% cache hit and the global miss rate is 50%, you know something is going on with the CFK caches (but which one?!?!)

The system_metrics vtables also expose the JMX metrics, which is the JVM level (includes each store)

TL;DR - I personally feel that cmd/cfk cache level is more useful than global and a global is nice but not required (slight duplicate with JMX table)

{
private MigrationStates(String keyspace)
{
super(parse(keyspace,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -696,6 +705,27 @@ private static void assertMigrationState(String tableName, ConsensusMigrationTar
});
}

private static void assertVtableMigrationState(String tableName, List<Range<Token>> migratedRanges, List<Range<Token>> migratingRanges) throws JsonProcessingException
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: need to get rid of JsonProcessingException

assertTrue(vtableResult.hasNext());
Row tableState = vtableResult.next();
List<String> vtableMigratedRanges = tableState.getList("migrated_ranges");
assertEquals(migratedRanges, vtableMigratedRanges.stream().map(Range::fromString).collect(Collectors.toList()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aweisberg It looks like this assertion is a bit flakey. I haven't been able to reproduce locally, but a couple times in CI...

junit.framework.AssertionFailedError: expected:<[]>  but was:<[(-1,9223372036854775807]]>
	at org.apache.cassandra.distributed.test.accord.AccordMigrationTest.assertVtableMigrationState(AccordMigrationTest.java:715)
	at org.apache.cassandra.distributed.test.accord.AccordMigrationTest.assertMigrationState(AccordMigrationTest.java:633)
	at org.apache.cassandra.distributed.test.accord.AccordMigrationTest.lambda$testPaxosToAccordCAS$13(AccordMigrationTest.java:404)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how that range has moved to "migrated", unless I'm looking at the wrong keyspace/table...or there was something I'm waiting for that I shouldn't (as this looks a lot like I'm getting a later state than expected)

Comment on lines 93 to 106
List<AccordStateCache.ImmutableStats> commandStats =
AsyncChains.getBlockingAndRethrow(stores.map(store -> ((AccordCommandStore) store.commandStore()).commandCache().statsSnapshot()));
for (int storeID : stores.ids())
addRow(commandStats.get(storeID), result, storeID, "command");

List<AccordStateCache.ImmutableStats> commandsForKeyStats =
AsyncChains.getBlockingAndRethrow(stores.map(store -> ((AccordCommandStore) store.commandStore()).commandsForKeyCache().statsSnapshot()));
for (int storeID : stores.ids())
addRow(commandsForKeyStats.get(storeID), result, storeID, "commands_for_key");

List<AccordStateCache.ImmutableStats> timestampsForKeyStats =
AsyncChains.getBlockingAndRethrow(stores.map(store -> ((AccordCommandStore) store.commandStore()).timestampsForKeyCache().statsSnapshot()));
for (int storeID : stores.ids())
addRow(timestampsForKeyStats.get(storeID), result, storeID, "timestamps_for_key");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why schedule on the same stores 3 times when you could do this once? this would be 3x faster to do

patch by Caleb Rackliffe; reviewed by David Capwell and Ariel Weisberg for CASSANDRA-18732
@maedhroz maedhroz merged commit f473646 into apache:cep-15-accord May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants