-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-9188] CassandraIO split performance improvement - cache size of the table #10701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… table Splitting CassandraIO source into multiple sources works fast as it uses one connection pool to Cassandra cluster but after that dataflow.worker.WorkerCustomSources is calling CassandraSource.getEstimatedSizeBytes for each source which setups and tears down connection to Cassandra cluster to calculate same size of table.
|
Retest it please |
|
Retest it please |
|
Retest it please |
boyuanzz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking care this! Could you please also add unit tests to CassandraIOTest?
sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java
Outdated
Show resolved
Hide resolved
sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java
Outdated
Show resolved
Hide resolved
sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java
Outdated
Show resolved
Hide resolved
|
Retest it please |
stankiewicz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added tests and comments and fixed sizing logic (sum of split sizes roughly equals size of original source)
|
Retest it please |
|
retest it please |
1 similar comment
|
retest it please |
|
retest this please |
1 similar comment
|
retest this please |
|
Java_Examples_Dataflow is broken probably because of dataflow service. |
|
retest this please |
1 similar comment
|
retest this please |
|
Run Spotless PreCommit |
2 similar comments
|
Run Spotless PreCommit |
|
Run Spotless PreCommit |
|
All tests passed. I'll go ahead to merge this PR. |
Splitting CassandraIO source into multiple sources works fast as it uses one connection pool to Cassandra cluster but after that dataflow.worker.WorkerCustomSources is calling CassandraSource.getEstimatedSizeBytes for each source which setups and tears down connection to Cassandra cluster to calculate same size of table. This optimization introduces caching of size internally just to avoid additional queries.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.