CASSANDRA-18692 Fix bulk writes with Buffered RowBufferMode #13
Conversation
When setting the `BUFFERED` `RowBufferMode` as part of the `WriterOption`s, `org.apache.cassandra.spark.bulkwriter.RecordWriter` ignores that configuration and instead uses the batch size to determine when to finalize an SSTable and start writing a new one if more rows are available. This commit fixes `org.apache.cassandra.spark.bulkwriter.RecordWriter#checkBatchSize` to take the configured `RowBufferMode` into account: with the `UNBUFFERED` `RowBufferMode` the SSTable's batch size is checked during writes, and with `BUFFERED` that check has no effect.

Co-authored-by: Doug Rohrer <doug@therohrers.org>
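As a rough illustration of the intended behavior, here is a minimal, self-contained sketch; the class, field, and method names are assumptions for illustration, not the actual `RecordWriter` code:

```java
// Sketch of the described fix (assumed names, not the real RecordWriter):
// the per-SSTable batch-size check only applies in UNBUFFERED mode; in BUFFERED
// mode CQLSSTableWriter's own buffer size decides when data is flushed.
enum RowBufferMode { UNBUFFERED, BUFFERED }

class BatchSizeCheckSketch
{
    private final RowBufferMode rowBufferMode;
    private final long maxBatchSize;

    BatchSizeCheckSketch(RowBufferMode rowBufferMode, long maxBatchSize)
    {
        this.rowBufferMode = rowBufferMode;
        this.maxBatchSize = maxBatchSize;
    }

    // Returns true when the current SSTable should be finalized and a new one started.
    boolean shouldFinalizeSSTable(long rowsWrittenToCurrentSSTable)
    {
        if (rowBufferMode == RowBufferMode.BUFFERED)
        {
            return false; // BUFFERED: never split on batch size
        }
        return rowsWrittenToCurrentSSTable >= maxBatchSize; // UNBUFFERED: honor the batch size
    }
}
```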
+1 LGTM.
yifan-c left a comment
some nits. Looks good to me.
```java
if (sstableWriter != null)
{
    finalizeSSTable(streamSession, partitionId, sstableWriter, batchNumber, batchSize);
```
nit: reset sstableWriter to null after finalizeSSTable
Given this is essentially the last step before we're done with the `sstableWriter`, we don't really need to set it to null here - this RecordWriter instance should be collected as soon as the upload/commits finish, and we close the `sstableWriter` on finalize, so we should be good without nulling it out.
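For context, a small sketch of the lifecycle this reply describes; the types and method bodies are assumptions, not the bulk writer's actual code:

```java
import java.io.IOException;

// Illustrative only: finalize closes the writer itself, so the field does not
// need to be nulled out afterwards.
class RecordWriterLifecycleSketch
{
    interface SSTableWriter extends AutoCloseable
    {
        @Override
        void close() throws IOException; // flushes and closes the SSTable files
    }

    private SSTableWriter sstableWriter;

    void setWriter(SSTableWriter writer)
    {
        this.sstableWriter = writer;
    }

    void finalizeSSTable() throws IOException
    {
        sstableWriter.close(); // the writer is closed inside finalize
        upload(sstableWriter); // stream/commit the finished SSTable
        // No `sstableWriter = null` here: this instance is discarded once the
        // uploads and commits complete.
    }

    private void upload(SSTableWriter writer)
    {
        // upload/commit elided in this sketch
    }
}
```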
```java
}
else if (rowBufferMode == RowBufferMode.BUFFERED)
{
    builder.withBufferSizeInMB(bufferSizeMB);
```
nit: maybe decide on a valid value range for `bufferSizeMB` and validate it; `CQLSSTableWriter` accepts whatever int value it is given.
For the upper bound, I think 1/2 of `spark.executor.memory` is a good limit.
I think, for now, we leave this as just a configuration option... Validating it against the Spark environment and picking a reasonable upper bound would take some experimentation (and would likely involve more than just the `executor.memory` setting, as there are other config settings that deal with memory, like memory overhead).
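A hedged sketch of the validation being discussed, using the 1/2-of-executor-memory upper bound from the suggestion above; the method name and parameters are hypothetical, and (per the reply) this is not what the patch actually does:

```java
// Illustrative only: validate bufferSizeMB against a lower bound of 1 MB and an
// upper bound of half of the executor memory, as suggested in the review comment.
final class BufferSizeValidationSketch
{
    static int validateBufferSizeMB(int bufferSizeMB, long executorMemoryBytes)
    {
        long upperBoundMB = executorMemoryBytes / 2 / (1024 * 1024); // 1/2 of spark.executor.memory, in MB
        if (bufferSizeMB < 1 || bufferSizeMB > upperBoundMB)
        {
            throw new IllegalArgumentException("bufferSizeMB must be between 1 and " + upperBoundMB
                                               + " MB, but was " + bufferSizeMB);
        }
        return bufferSizeMB;
    }

    private BufferSizeValidationSketch() {}
}
```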