New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-245] Add Cassandra IO #592
Conversation
f1c042a
to
7b2c5b6
Compare
R: @jkff |
4c4a03b
to
4f0125c
Compare
Rebased, AutoValue use, etc. Not fully ready for review anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High-level notes:
- Audit the code for uses of raw types: I found several. There should be zero uses of raw types in the entire PR - if you feel that a particular use is necessary, let's discuss.
- For initial splitting, please reuse ByteKeyRange rather than hand-implementing similar logic.
- Add logging to places where something really important happens (e.g. especially various fallback cases when splitting)
- Please use SourceTestUtils to test your Source.
- I'm confused by the sink implementation: it seems to be similar to the Sink/WriteOperation API, but I don't see why this is necessary: can't it be done with a simple DoFn?
@jkff thanks for the update. Regarding your comment:
|
3f51ada
to
12184b8
Compare
Rebased and implemented:
TODO:
However, the PR is now in a better shape IMHO. |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Travis is failing because Cassandra 3.7 is Java8 only. Do we need Java7 compliant dependencies ? I guess so. |
12184b8
to
0e46ccd
Compare
I improved the test. However, as Cassandra only works with Java 8, I have to execute the tests only with Java 8 (not Java 7). I will do as we do in the java8 examples. |
Refer to this link for build results (access rights to CI server needed): |
0b516e5
to
f057ed5
Compare
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
JB, for the splitIntoBundles you use token range from -2^63 to 2^63 but these values are for Cassandra Murmur3Partitioner partitioner. This is the default Cassandra v1.2+ but it can be changed by the user to RandomPartitioner, ... in that case the values for the ranges won't be the same. IMHO, it might be better rather than using absolute values, to read all the token ranges in system.size_estimates table in fields rangeStart and rangeEnd. That way ranges it will be dynamic. But that system table is only available in Cassandra v2.1.5+. Besides, this is what cassandra connector for spark does |
Another thing, for the write: currently, it is not batched. It would be better I think to use batch API (BatchStatement) in place of asynchronously writing record by record. |
No, don't use BatchStatement. It'll kill your performance. Batch Statements are there for denormalization, when you need to update together some tables. Async insert is the best solution for performance |
Good point for the partitioner. I implemented that way because it's the default in Cassandra. I will improve this. |
f057ed5
to
371ccb5
Compare
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
d93c3fd
to
371ccb5
Compare
Refer to this link for build results (access rights to CI server needed): |
Did not know about BatchStatement, thanks @doanduyhai |
I'll work on the partitionner part, and also on splitIntoBundles and getEstimatedSize because these parts are linked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
sdks/java/io/cassandra/pom.xml
Outdated
<source>1.7</source> | ||
<target>1.7</target> | ||
<compilerArgs> | ||
<arg>-Xlint:all</arg> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why all these disabled warnings? If you're adding a new module, might as well make it lint-clean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just the same configuration as in the parent pom.
sdks/java/io/cassandra/pom.xml
Outdated
<description>IO to read and write with Apache Cassandra database</description> | ||
|
||
<properties> | ||
<compiler.error.flag></compiler.error.flag> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to disable -Werror
which fails with Cassandra on Java7 (Cassandra requires Java8).
sdks/java/io/cassandra/pom.xml
Outdated
<arg>-Xlint:-varargs</arg> | ||
</compilerArgs> | ||
<showWarnings>true</showWarnings> | ||
<showDeprecation>false</showDeprecation> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise. Don't think there's a strong reason to use deprecated APIs in this connector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As previous comment, it's duplication of the parent pom configuration updated for Cassandra.
sdks/java/io/cassandra/pom.xml
Outdated
|
||
<profiles> | ||
<profile> | ||
<!-- Skip tests on Java7 as Cassandra requires Java8 --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the embedded test instance of Cassandra, or do you mean the Cassandra API? Please clarify in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the embedded Cassandra instance used in the test, which requires Java8. I will update the comment accordingly.
* | ||
* <p>CassandraIO provides a source to read and returns a bounded collection of entities as {@code | ||
* PCollection<Entity>}. | ||
* An entity is built by Cassandra mapper based on a POJO containing annotations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to confirm whether this is how people usually interact with Cassandra, i.e. whether they'll find this familiar. E.g. is this similar to how the Spark cassandra connector works?
If not, you may want to adjust the API to let the user supply an arbitrary row mapper function, like in JdbcIO, and allow using entity mappers as just a special case of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Spark cassandra connector also uses an Entity AFAIK.
So, my preferred option is to use an Entity. However, for convenience, if the user doesn't provide an Entity, then, he will get a PCollection of Row (implementing this part now).
@Before | ||
public void startCassandra() throws Exception { | ||
EmbeddedCassandraServerHelper.startEmbeddedCassandra("/cassandra.yaml", | ||
"target/cassandra", 30000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not use target/ - use a TempDirectory rule.
@Test | ||
@Category(NeedsRunner.class) | ||
public void testRead() throws Exception { | ||
Pipeline pipeline = TestPipeline.create(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestPipeline is now a @Rule
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using @Rule
for TestPipeline
causes:
java.io.NotSerializableException: org.apache.beam.sdk.testing.TestPipeline
when I do MapElements.via(new SimpleFunction())
.
It's because CassandraIOTest
is Serializable
(required for the functions) but not TestPipeline
. I'm testing transient
there.
@@ -0,0 +1,1073 @@ | |||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clean up this file - remove cruft, unnecessary settings and unnecessary comments?
# Directory where Cassandra should store hints. | ||
# If not set, the default directory is $CASSANDRA_HOME/data/hints. | ||
# hints_directory: /var/lib/cassandra/hints | ||
hints_directory: target/cassandra/hints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also use the temp directory.
# data_file_directories: | ||
# - /var/lib/cassandra/data | ||
data_file_directories: | ||
- target/cassandra/data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise here and line 217. I hope there's a better way to tell Cassandra where to write. I also hope it has defaults for most of the options declared in this file...
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good for the IO ITs, thanks for making sure the tests can run repeatedly!
public class CassandraIOIT implements Serializable { | ||
|
||
private static IOTestPipelineOptions options; | ||
private static String writeTableName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not used any more
…and add IT write cleanup
Agree to merge ? |
8581685
to
fc618c2
Compare
Rebased. |
The build passed:
but Jenkins complaints. I'm launching a new build. |
retest this please |
Was good for me already, and seems @ssisk is good too. Go ahead and merge, thanks! |
Please squash the commits though. |
So if I understand the writer code correctly, |
Update beam-site for release 2.15.0
* fix(deps): Require proto-plus >=1.20.5 In proto-plus 1.20.5, the protobuf dependency is pinned to <4.0.0dev Fix apache#592
Be sure to do all of the following to help us incorporate your contribution
quickly and easily:
[BEAM-<Jira issue #>] Description of pull request
mvn clean verify
. (Even better, enableTravis-CI on your fork and ensure the whole test matrix passes).
<Jira issue #>
in the title with the actual Jira issuenumber, if there is one.
Individual Contributor License Agreement.
Initial version of CassandraIO.
TODO:
PCollection<Row>