[BEAM-245] Add Cassandra IO #592

jbonofre · 2016-07-06T05:34:47Z

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

Make sure the PR title is formatted like:
[BEAM-<Jira issue #>] Description of pull request
Make sure tests pass via mvn clean verify. (Even better, enable
Travis-CI on your fork and ensure the whole test matrix passes).
Replace <Jira issue #> in the title with the actual Jira issue
number, if there is one.
If this contribution is large, please file an Apache
Individual Contributor License Agreement.

Initial version of CassandraIO.

TODO:

fix and enable the tests (Cassandra daemon related)
usage of entity should be optional and the source should be able to return a PCollection<Row>

jbonofre · 2016-08-30T17:20:22Z

R: @jkff

jbonofre · 2016-11-09T18:34:29Z

Rebased, AutoValue use, etc. Not fully ready for review anyway.

jkff

High-level notes:

Audit the code for uses of raw types: I found several. There should be zero uses of raw types in the entire PR - if you feel that a particular use is necessary, let's discuss.
For initial splitting, please reuse ByteKeyRange rather than hand-implementing similar logic.
Add logging to places where something really important happens (e.g. especially various fallback cases when splitting)
Please use SourceTestUtils to test your Source.
I'm confused by the sink implementation: it seems to be similar to the Sink/WriteOperation API, but I don't see why this is necessary: can't it be done with a simple DoFn?
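
To illustrate the raw-types note above, here is a minimal sketch in plain Java. The class and method names are hypothetical and unrelated to the PR's code. It only shows why a raw List defers a type error to runtime, while a parameterized List<String> rejects the same mistake at compile time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not code from the PR.
class RawTypeAudit {

  // Raw type: the compiler cannot check what goes into the list.
  @SuppressWarnings({"rawtypes", "unchecked"})
  static List rawNames() {
    List names = new ArrayList();
    names.add("alice");
    names.add(42); // compiles, but is almost certainly a bug
    return names;
  }

  // Parameterized type: the same mistake is rejected at compile time.
  static List<String> typedNames() {
    List<String> names = new ArrayList<>();
    names.add("alice");
    // names.add(42); // would not compile
    return names;
  }

  // Returns true if reading every element as a String fails at runtime.
  @SuppressWarnings({"rawtypes", "unchecked"})
  static boolean rawListFailsAtRuntime() {
    List<String> names = rawNames(); // unchecked conversion from the raw type
    try {
      for (String name : names) {
        name.length(); // the implicit cast to String fails on the Integer
      }
      return false;
    } catch (ClassCastException e) {
      return true;
    }
  }
}
```

Compiling with -Xlint:rawtypes and -Xlint:unchecked (without the suppressions) is one way to find such uses during an audit.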

jbonofre · 2016-11-11T06:13:58Z

@jkff thanks for the update. Regarding your comment:

yes, I will fix the raw types
ok, thanks for the tip
agree, I will
yes, as I did in ElasticsearchIO, generally speaking, the tests should be improved
agree, I will simplify with a "regular" DoFn

jbonofre · 2016-12-18T17:03:58Z

Rebased and implemented:

Reimplement sink using "regular" DoFn approach.
Fixed the getEstimatedSizeBytes() and splitIntoBundles() methods.
Fixed test using CassandraUnit.

TODO:

Add logging in key places.
Use of ByteKeyRange (I have to evaluate).
Use of SourceTestUtils in the test.
Maybe set Mapper optional (I have to think about that).

However, the PR is now in a better shape IMHO.

asfbot · 2016-12-18T17:24:47Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6062/
--none--

asfbot · 2016-12-18T17:26:04Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6063/
--none--

jbonofre · 2016-12-19T07:05:46Z

Travis is failing because Cassandra 3.7 is Java 8 only. Do we need Java 7 compliant dependencies? I guess so.

jbonofre · 2016-12-19T18:15:02Z

I improved the test. However, as Cassandra only works with Java 8, I have to execute the tests only with Java 8 (not Java 7). I will do as we do in the java8 examples.

asfbot · 2016-12-19T18:50:33Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6071/
--none--

asfbot · 2016-12-20T06:53:39Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6101/
--none--

asfbot · 2016-12-20T07:03:55Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6102/
--none--

echauchot · 2016-12-23T18:01:05Z

JB, for splitIntoBundles you use the token range from -2^63 to 2^63, but these values are specific to Cassandra's Murmur3Partitioner. It is the default since Cassandra v1.2, but the user can change it to RandomPartitioner, etc., and in that case the range values won't be the same. IMHO, rather than using absolute values, it might be better to read all the token ranges from the system.size_estimates table (fields rangeStart and rangeEnd). That way the ranges will be dynamic. But that system table is only available in Cassandra v2.1.5+. Besides, this is what the Cassandra connector for Spark does.
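
A small sketch of why hardcoded bounds are partitioner-specific, in plain Java with no Cassandra dependency. The class name is hypothetical, and the bounds are the commonly documented ones for each partitioner (signed 64-bit tokens for Murmur3Partitioner, 0 to 2^127 - 1 for RandomPartitioner); treat them as assumptions to verify against the cluster. It does not implement the suggestion above of reading ranges from system.size_estimates; it only shows that the same splitting logic needs different inputs per partitioner.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not code from the PR.
class TokenRangeSplitter {
  // Assumed token bounds per partitioner (verify against your Cassandra version).
  static final BigInteger MURMUR3_MIN = BigInteger.valueOf(Long.MIN_VALUE); // -2^63
  static final BigInteger MURMUR3_MAX = BigInteger.valueOf(Long.MAX_VALUE); // 2^63 - 1
  static final BigInteger RANDOM_MIN = BigInteger.ZERO;
  static final BigInteger RANDOM_MAX =
      BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE); // 2^127 - 1

  /** Splits [min, max] into contiguous, non-overlapping inclusive ranges. */
  static List<BigInteger[]> split(BigInteger min, BigInteger max, int splits) {
    if (splits < 1 || min.compareTo(max) > 0) {
      throw new IllegalArgumentException("need splits >= 1 and min <= max");
    }
    BigInteger size = max.subtract(min).add(BigInteger.ONE);
    BigInteger step = size.divide(BigInteger.valueOf(splits));
    if (step.signum() == 0) {
      throw new IllegalArgumentException("more splits than tokens in the range");
    }
    List<BigInteger[]> ranges = new ArrayList<>();
    BigInteger start = min;
    for (int i = 0; i < splits; i++) {
      // The last range absorbs any remainder so the whole ring is covered.
      BigInteger end = (i == splits - 1) ? max : start.add(step).subtract(BigInteger.ONE);
      ranges.add(new BigInteger[] {start, end});
      start = end.add(BigInteger.ONE);
    }
    return ranges;
  }
}
```

Calling split(MURMUR3_MIN, MURMUR3_MAX, n) and split(RANDOM_MIN, RANDOM_MAX, n) yields entirely different ranges, which is why bounds fixed for one partitioner would miss or misplace data under another.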

echauchot · 2016-12-23T18:06:51Z

Another thing about the write: it is currently not batched. I think it would be better to use the batch API (BatchStatement) instead of asynchronously writing record by record.

doanduyhai · 2016-12-23T18:40:11Z

@echauchot

No, don't use BatchStatement. It'll kill your performance. Batch statements are there for denormalization, when you need to update several tables together.

Async insert is the best solution for performance

jbonofre · 2016-12-24T07:23:07Z

Good point for the partitioner. I implemented that way because it's the default in Cassandra. I will improve this.
For the usage of BatchStatement, I don't think it's a good idea, especially with the Mapper. I prefer a sync approach with optionally a batching on the IO (just storing number of entities or rows).
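
One possible middle ground between the positions above, sketched in plain Java: issue individual writes asynchronously, but cap the number in flight and wait for them at flush time. This is not what the PR implements. BoundedAsyncWriter is a hypothetical class, and the writeAsync function is a stand-in for whatever asynchronous save call the driver offers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch; writeAsync stands in for a driver's async save call.
class BoundedAsyncWriter<T> {
  private final Function<T, CompletableFuture<Void>> writeAsync;
  private final int maxInFlight;
  private final List<CompletableFuture<Void>> inFlight = new ArrayList<>();

  BoundedAsyncWriter(Function<T, CompletableFuture<Void>> writeAsync, int maxInFlight) {
    if (maxInFlight < 1) {
      throw new IllegalArgumentException("maxInFlight must be >= 1");
    }
    this.writeAsync = writeAsync;
    this.maxInFlight = maxInFlight;
  }

  /** Starts one write; blocks only when too many writes are pending. */
  void write(T element) {
    inFlight.add(writeAsync.apply(element));
    if (inFlight.size() >= maxInFlight) {
      flush();
    }
  }

  /** Waits for all pending writes; rethrows the first failure, if any. */
  void flush() {
    try {
      CompletableFuture.allOf(inFlight.toArray(new CompletableFuture<?>[0])).join();
    } finally {
      inFlight.clear();
    }
  }
}
```

In a DoFn-based sink, write would presumably be called per element and flush once the bundle is finished, so no write is left pending when the bundle commits.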

asfbot · 2016-12-24T08:10:06Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6245/
--none--

asfbot · 2016-12-24T09:59:16Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6246/
--none--

asfbot · 2016-12-24T10:27:27Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6247/
--none--

echauchot · 2016-12-26T09:04:56Z

Did not know about BatchStatement, thanks @doanduyhai

echauchot · 2016-12-27T09:07:53Z

I'll work on the partitioner part, and also on splitIntoBundles and getEstimatedSize, because these parts are linked.

jkff

Thanks!

jkff · 2016-12-28T02:25:27Z

sdks/java/io/cassandra/pom.xml

+          <source>1.7</source>
+          <target>1.7</target>
+          <compilerArgs>
+            <arg>-Xlint:all</arg>


Why all these disabled warnings? If you're adding a new module, might as well make it lint-clean.

It's just the same configuration as in the parent pom.

jkff · 2016-12-28T02:25:43Z

sdks/java/io/cassandra/pom.xml

+  <description>IO to read and write with Apache Cassandra database</description>
+
+  <properties>
+    <compiler.error.flag></compiler.error.flag>


What does this do?

It's to disable -Werror which fails with Cassandra on Java7 (Cassandra requires Java8).

jkff · 2016-12-28T02:26:02Z

sdks/java/io/cassandra/pom.xml

+            <arg>-Xlint:-varargs</arg>
+          </compilerArgs>
+          <showWarnings>true</showWarnings>
+          <showDeprecation>false</showDeprecation>


Likewise. Don't think there's a strong reason to use deprecated APIs in this connector.

As previous comment, it's duplication of the parent pom configuration updated for Cassandra.

jkff · 2016-12-28T02:26:57Z

sdks/java/io/cassandra/pom.xml

+
+  <profiles>
+    <profile>
+      <!-- Skip tests on Java7 as Cassandra requires Java8 -->


Do you mean the embedded test instance of Cassandra, or do you mean the Cassandra API? Please clarify in the comment.

It's the embedded Cassandra instance used in the test, which requires Java8. I will update the comment accordingly.

jkff · 2016-12-28T02:29:26Z

sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java

+ *
+ * <p>CassandraIO provides a source to read and returns a bounded collection of entities as {@code
+ * PCollection<Entity>}.
+ * An entity is built by Cassandra mapper based on a POJO containing annotations.


Just want to confirm whether this is how people usually interact with Cassandra, i.e. whether they'll find this familiar. E.g. is this similar to how the Spark cassandra connector works?

If not, you may want to adjust the API to let the user supply an arbitrary row mapper function, like in JdbcIO, and allow using entity mappers as just a special case of that.

The Spark cassandra connector also uses an Entity AFAIK.

So, my preferred option is to use an Entity. However, for convenience, if the user doesn't provide an Entity, they will get a PCollection of Row (I'm implementing this part now).

jkff · 2016-12-28T02:44:02Z

sdks/java/io/cassandra/src/test/java/org/apache/beam/sdk/io/cassandra/CassandraIOTest.java

+  @Before
+  public void startCassandra() throws Exception {
+    EmbeddedCassandraServerHelper.startEmbeddedCassandra("/cassandra.yaml",
+        "target/cassandra", 30000);


Do not use target/ - use a TempDirectory rule.
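
JUnit's TemporaryFolder rule is the usual way to do this in a test. As a dependency-free sketch of the same idea, the hypothetical helper below creates a throwaway directory with java.nio.file and deletes it on close. The class name and the "data" and "hints" subdirectory names are illustrative assumptions; how the resulting path is handed to the embedded Cassandra instance is not shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper; not code from the PR.
class TempDataDir implements AutoCloseable {
  private final Path root;

  private TempDataDir(Path root) {
    this.root = root;
  }

  /** Creates a fresh directory under the system temp location. */
  static TempDataDir create(String prefix) {
    try {
      return new TempDataDir(Files.createTempDirectory(prefix));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Creates (if needed) and returns a subdirectory such as "data" or "hints". */
  Path subdirectory(String name) {
    try {
      return Files.createDirectories(root.resolve(name));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Deletes the directory tree, children before parents. */
  @Override
  public void close() {
    try (Stream<Path> paths = Files.walk(root)) {
      paths.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Unlike a fixed target/cassandra path, each run gets its own directory, so repeated or parallel test runs do not see each other's files.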

jkff · 2016-12-28T02:44:22Z

sdks/java/io/cassandra/src/test/java/org/apache/beam/sdk/io/cassandra/CassandraIOTest.java

+  @Test
+  @Category(NeedsRunner.class)
+  public void testRead() throws Exception {
+    Pipeline pipeline = TestPipeline.create();


TestPipeline is now a @Rule.

Using @Rule for TestPipeline causes:

java.io.NotSerializableException: org.apache.beam.sdk.testing.TestPipeline

when I do MapElements.via(new SimpleFunction()).

It's because CassandraIOTest is Serializable (required for the functions) but not TestPipeline. I'm testing transient there.
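
The failure described above can be reproduced without Beam. In this minimal sketch, all names are hypothetical: NotSerializableHelper stands in for TestPipeline, and BrokenTest and FixedTest stand in for the test class. An anonymous function that uses state of its enclosing instance drags that instance into serialization, so a non-serializable field on it fails unless it is marked transient.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical reproduction; not code from the PR.
class TransientFieldDemo {
  /** Stand-in for a non-serializable helper such as a test pipeline. */
  static class NotSerializableHelper {}

  interface SerializableFn extends Serializable {
    String apply(String in);
  }

  /** Serializable class with a non-serializable, non-transient field. */
  static class BrokenTest implements Serializable {
    final NotSerializableHelper helper = new NotSerializableHelper();
    final String prefix = "row-";

    SerializableFn fn() {
      // Uses 'prefix', so the anonymous class references the enclosing instance.
      return new SerializableFn() {
        @Override
        public String apply(String in) {
          return prefix + in;
        }
      };
    }
  }

  /** Same class, but the helper is transient and therefore skipped. */
  static class FixedTest implements Serializable {
    final transient NotSerializableHelper helper = new NotSerializableHelper();
    final String prefix = "row-";

    SerializableFn fn() {
      return new SerializableFn() {
        @Override
        public String apply(String in) {
          return prefix + in;
        }
      };
    }
  }

  /** Returns false if Java serialization rejects the object graph. */
  static boolean canSerialize(Object o) {
    try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
      out.writeObject(o);
      return true;
    } catch (NotSerializableException e) {
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Here canSerialize(new BrokenTest().fn()) fails while canSerialize(new FixedTest().fn()) succeeds. Note that a transient field is not restored on deserialization, which is acceptable only if the deserialized copy never touches it.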

jkff · 2016-12-28T02:45:08Z

sdks/java/io/cassandra/src/test/resources/cassandra.yaml

@@ -0,0 +1,1073 @@
+#


Clean up this file - remove cruft, unnecessary settings and unnecessary comments?

jkff · 2016-12-28T02:45:25Z

sdks/java/io/cassandra/src/test/resources/cassandra.yaml

+# Directory where Cassandra should store hints.
+# If not set, the default directory is $CASSANDRA_HOME/data/hints.
+# hints_directory: /var/lib/cassandra/hints
+hints_directory: target/cassandra/hints


This should also use the temp directory.

jkff · 2016-12-28T02:46:27Z

sdks/java/io/cassandra/src/test/resources/cassandra.yaml

+# data_file_directories:
+#     - /var/lib/cassandra/data
+data_file_directories:
+      - target/cassandra/data


Likewise here and line 217. I hope there's a better way to tell Cassandra where to write. I also hope it has defaults for most of the options declared in this file...

jbonofre · 2017-06-02T16:32:14Z

retest this please

coveralls · 2017-06-02T18:03:21Z

Changes Unknown when pulling 8581685 on jbonofre:BEAM-245-CASSANDRA into apache:master.

ssisk

This looks good for the IO ITs, thanks for making sure the tests can run repeatedly!

ssisk · 2017-06-02T22:18:47Z

sdks/java/io/cassandra/src/test/java/org/apache/beam/sdk/io/cassandra/CassandraIOIT.java

+public class CassandraIOIT implements Serializable {
+
+  private static IOTestPipelineOptions options;
+  private static String writeTableName;


not used any more

jbonofre · 2017-06-06T05:55:48Z

Agree to merge?

jbonofre · 2017-06-06T06:05:26Z

Rebased.

coveralls · 2017-06-06T07:19:07Z

Coverage decreased (-0.007%) to 70.61% when pulling fc618c2 on jbonofre:BEAM-245-CASSANDRA into 6d64c6e on apache:master.

jbonofre · 2017-06-06T11:44:57Z

The build passed:

2017-06-06T07:19:20.608 [INFO] ------------------------------------------------------------------------
2017-06-06T07:19:20.608 [INFO] BUILD SUCCESS
2017-06-06T07:19:20.608 [INFO] ------------------------------------------------------------------------

but Jenkins complains.

I'm launching a new build.

jbonofre · 2017-06-06T11:45:03Z

retest this please

coveralls · 2017-06-06T13:07:57Z

Coverage remained the same at 70.617% when pulling fc618c2 on jbonofre:BEAM-245-CASSANDRA into 6d64c6e on apache:master.

jbonofre · 2017-06-06T17:42:40Z

@jkff @ssisk good for you guys?

jkff · 2017-06-06T21:10:17Z

Was good for me already, and seems @ssisk is good too. Go ahead and merge, thanks!

jkff · 2017-06-06T21:10:35Z

Please squash the commits though.

nevillelyh · 2017-06-21T17:33:57Z

So if I understand the writer code correctly, createWriter is called per DoFn instance, e.g. creating one connection per CPU per worker, and then doing blocking mapper.save(entity) per element?

jbonofre force-pushed the BEAM-245-CASSANDRA branch from f1c042a to 7b2c5b6 Compare August 30, 2016 17:03

jbonofre force-pushed the BEAM-245-CASSANDRA branch from 4c4a03b to 4f0125c Compare November 9, 2016 18:28

jkff reviewed Nov 9, 2016

jbonofre force-pushed the BEAM-245-CASSANDRA branch 2 times, most recently from 3f51ada to 12184b8 Compare December 18, 2016 16:59

jbonofre force-pushed the BEAM-245-CASSANDRA branch from 12184b8 to 0e46ccd Compare December 19, 2016 18:14

jbonofre force-pushed the BEAM-245-CASSANDRA branch from 0b516e5 to f057ed5 Compare December 20, 2016 06:16

jbonofre force-pushed the BEAM-245-CASSANDRA branch from f057ed5 to 371ccb5 Compare December 24, 2016 07:43

jbonofre force-pushed the BEAM-245-CASSANDRA branch from d93c3fd to 371ccb5 Compare December 24, 2016 10:01

jkff requested changes Dec 28, 2016

ssisk approved these changes Jun 2, 2017

jbonofre added 11 commits June 6, 2017 07:42

[BEAM-245] Add CassandraIO.read()

ae97696

[BEAM-245] Add CassandraIO.write()

0f6f5eb

[BEAM-245] Update CassandraIO version

f0ac6c3

[BEAM-245] Add IT to CassandraIO

bcb3d17

[BEAM-245] Flag CassandraIO as experimental

dff9d9d

[BEAM-245] Add more control on the Cassandra cluster builder

6e79639

[BEAM-245] Fix javadoc

5127692

[BEAM-245] Fix withHost() precondition on CassandraIO.read(), fix IT and add IT write cleanup

67d14eb

[BEAM-245] Fix the IO and improved the IT

578bcac

[BEAM-245] Fix CassandraIOIT entity read

3cba68f

[BEAM-245] Remove unused write table name in the CassandraIOIT

fc618c2

jbonofre force-pushed the BEAM-245-CASSANDRA branch from 8581685 to fc618c2 Compare June 6, 2017 06:05

asfgit closed this in c189d5c Jun 7, 2017

jbonofre deleted the BEAM-245-CASSANDRA branch June 7, 2017 06:06

nevillelyh mentioned this pull request Jun 21, 2017

Proper Cassandra IO spotify/scio#603

Closed

johnjcasey pushed a commit to johnjcasey/beam that referenced this pull request Feb 8, 2023

Merge pull request apache#592 Update beam-site for release 2.15.0

b4c435a

Update beam-site for release 2.15.0

pl04351820 pushed a commit to pl04351820/beam that referenced this pull request Dec 20, 2023

fix(deps): Require proto-plus >=1.20.5 (apache#593)

2281290

* fix(deps): Require proto-plus >=1.20.5 In proto-plus 1.20.5, the protobuf dependency is pinned to <4.0.0dev Fix apache#592
