Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-25771][connectors][Cassandra][tests][flaky] Reduce cassandra testContainer load to avoid timeouts and in general flakiness issues #19059

Merged
merged 10 commits into from
Mar 15, 2022

Conversation

echauchot
Copy link
Contributor

@echauchot echauchot commented Mar 11, 2022

I took a look at the history of flaky test and found that:
testCassandraBatchTupleFormat failure https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=32816&view=results
testCassandraScalaTuplePartialColumnUpdate failure https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=32571&view=resultstestCassandraScalaTuplePartialColumnUpdate
testCassandraTableSink failure https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=32272

They fail because the keyspace does not exist while executing create table (in before) whereas the keyspace is created in before class. My guess it that drop tables (in after) is not atomic and drops the keyspace in between. Hence my commit for making it atomic.

The other failures I see are timeouts in only testRetrialAndDropTables method. But this test is no more relevant as the table names are now dynamic. Simply remove this test

R: @XComp

@flinkbot
Copy link
Collaborator

flinkbot commented Mar 11, 2022

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

…updates the tableID and it is called even in case or retrials. So, we need to drop only the created table for this test (with the related tableID)
…tRetrialAndDropTables because it is no more relevant as the table names are now dynamic"

This reverts commit b9eda68.
@echauchot
Copy link
Contributor Author

echauchot commented Mar 14, 2022

@zentol, I analyzed the problem. OperationTimeoutException definitely happens at the client side (Cassandra driver) under load in all the tests but the timeouts raised seem to be not enough. So I reduced the load by avoiding "IF EXIST" in CQL requests and more important I replaced drop keyspace by drop table. Indeed, each Before updates the tableID and it is called even in case or retrials. So, we need to drop only the created table for this test (with the same tableID) and not the whole keyspace. That will reduce the overall load on cassandra (so I hope no more timeouts) but also fix the keyspace does not exist flaky issue.

Finally I put testRetrialDropTable back to check that everything still works in case of retrial (drop the single table instead of keyspace etc...)

I did 10 local runs of the whole CassandraITCase with no error.

@echauchot
Copy link
Contributor Author

echauchot commented Mar 14, 2022

@zentol Actually, dropTable() is no more even needed because each test including retrials use a different table name now. I also removed the leftover flink_initial table. I don't know what it was for.
So we lower the load even more.
PTAL

CREATE_TABLE_QUERY.replace(TABLE_NAME_VARIABLE, TABLE_NAME_PREFIX + "initial"));
/*
session.execute(
CREATE_TABLE_QUERY.replace(TABLE_NAME_VARIABLE, TABLE_NAME_PREFIX + "initial"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just remove it, no need for a comment. I don't remember why we had this in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that is what I already did in the latest commits

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no you didn't, its still there :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, my bad. Done

@echauchot
Copy link
Contributor Author

@flinkbot run azure

@echauchot
Copy link
Contributor Author

echauchot commented Mar 15, 2022

@zentol I addressed all your comments. It should be ok now. PTAL

@zentol zentol merged commit 3bb6fa3 into apache:master Mar 15, 2022
@echauchot
Copy link
Contributor Author

thanks @zentol for reviewing and merging !

@echauchot echauchot changed the title [FLINK-25771][connectors][Cassandra][tests][flaky] Remove testRetrialAndDropTables and make drop tables atomic [FLINK-25771][connectors][Cassandra][tests][flaky] Reduce cassandra testContainer load to avoid timeouts and in general flakiness issues Mar 15, 2022
@echauchot echauchot deleted the FLINK-25771-timeouts2 branch March 15, 2022 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants