[BEAM-9188] CassandraIO split performance improvement - cache size of the table #10701

stankiewicz · 2020-01-28T15:24:37Z

Splitting a CassandraIO source into multiple sources is fast because it uses one connection pool to the Cassandra cluster. After that, however, dataflow.worker.WorkerCustomSources calls CassandraSource.getEstimatedSizeBytes for each source, and every call sets up and tears down a connection to the Cassandra cluster to calculate the same table size. This optimization caches the size internally to avoid the additional queries.
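To illustrate the idea, here is a minimal sketch of memoizing an expensive size estimate so that repeated calls do not query the cluster again. It is not the code from this PR: `CachedSizeSource` and `sizeQuery` are hypothetical names, and the `LongSupplier` stands in for whatever query opens a connection and computes the table size.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for a source that memoizes an expensive size estimate. */
final class CachedSizeSource {
  // Assumption: this supplier represents the costly query that connects to the cluster.
  private final LongSupplier sizeQuery;
  // Null until the first estimate has been computed.
  private Long cachedSizeBytes;

  CachedSizeSource(LongSupplier sizeQuery) {
    this.sizeQuery = sizeQuery;
  }

  /** Returns the cached size, running the expensive query only on the first call. */
  synchronized long getEstimatedSizeBytes() {
    if (cachedSizeBytes == null) {
      cachedSizeBytes = sizeQuery.getAsLong();
    }
    return cachedSizeBytes;
  }
}
```

With this shape, a caller that asks for the estimate once per derived source triggers a single query in total, provided the derived sources share or are handed the cached value.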

boyuanzz · 2020-01-28T20:36:17Z

Retest it please

boyuanzz · 2020-01-28T21:06:05Z

Retest it please

boyuanzz · 2020-01-28T22:13:49Z

Retest it please

boyuanzz

Thanks for taking care of this! Could you please also add unit tests to CassandraIOTest?

sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java

boyuanzz · 2020-01-29T00:08:55Z

Retest it please

stankiewicz

I've added tests and comments, and fixed the sizing logic (the sum of the split sizes roughly equals the size of the original source).
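The sizing property mentioned above can be sketched as follows. This is an illustration only, assuming the total estimate is divided evenly across splits; `SplitSizing` and `perSplitSizes` are hypothetical names and the actual CassandraIO logic may weight splits differently (for example by token range).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: divides a total size estimate across a number of splits. */
final class SplitSizing {
  /** Returns one size per split; the sizes always sum to totalBytes. */
  static List<Long> perSplitSizes(long totalBytes, int numSplits) {
    if (numSplits <= 0) {
      throw new IllegalArgumentException("numSplits must be positive");
    }
    List<Long> sizes = new ArrayList<>(numSplits);
    long base = totalBytes / numSplits;
    long remainder = totalBytes % numSplits;
    for (int i = 0; i < numSplits; i++) {
      // Spread the remainder over the first splits so nothing is lost to rounding.
      sizes.add(base + (i < remainder ? 1 : 0));
    }
    return sizes;
  }
}
```

A unit test for this property would sum the per-split sizes and compare the result with the estimate of the unsplit source.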

boyuanzz · 2020-01-29T20:58:28Z

Retest it please

stankiewicz · 2020-01-30T09:29:22Z

retest it please

boyuanzz · 2020-01-30T18:45:04Z

retest it please

boyuanzz · 2020-01-30T18:45:53Z

retest this please

boyuanzz · 2020-01-30T19:21:51Z

retest this please

boyuanzz · 2020-01-30T21:12:22Z

Java_Examples_Dataflow is broken, probably because of the Dataflow service.
Please fix Spotless.

stankiewicz · 2020-01-31T08:58:48Z

retest this please

boyuanzz · 2020-01-31T18:38:12Z

retest this please

boyuanzz · 2020-01-31T18:39:40Z

Run Spotless PreCommit

boyuanzz · 2020-01-31T18:53:12Z

Run Spotless PreCommit

boyuanzz · 2020-01-31T22:03:19Z

Run Spotless PreCommit

boyuanzz · 2020-01-31T22:36:45Z

All tests passed. I'll go ahead and merge this PR.
Thanks for your contribution!

stankiewicz requested a review from boyuanzz January 28, 2020 15:24

spotless fixes

eaf347c

boyuanzz mentioned this pull request Jan 28, 2020

[BEAM-9188] Dataflow's WorkerCustomSources improvement - parallelize creation of Derived Sources (splitting) #10685

Closed

3 tasks

boyuanzz reviewed Jan 28, 2020

sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java (outdated)

sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java (outdated)

boyuanzz reviewed Jan 29, 2020

sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java (outdated)

comments and tests

9fe2cec

stankiewicz commented Jan 29, 2020

spottless

4785031

boyuanzz merged commit 94ca187 into apache:master Jan 31, 2020
