SstFileWriter performance #2668
Yeah, I'd expect better write throughput too, especially given that without params we should be using cheap (CPU-wise) block compression. I'll try to reproduce that.
@h-1 Just to point out: implementing comparators in Java has quite some performance overhead, and we are aware of that. There are several improvements planned in that area. For the moment, if you need a custom comparator, you could implement just the comparator in C++ and set it from Java. Also I would point out that
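For reference, a minimal sketch (hypothetical, in the same Scala style as the benchmarks below) of avoiding a Java-side comparator entirely by selecting the built-in native bytewise comparator:
import org.rocksdb.{BuiltinComparator, EnvOptions, Options, RocksDB, Slice, SstFileWriter}
object NativeComparatorSketch extends App {
  RocksDB.loadLibrary()
  // The built-in comparator runs entirely in C++, so no JNI callback happens per key comparison.
  val options = new Options().setComparator(BuiltinComparator.BYTEWISE_COMPARATOR)
  val writer = new SstFileWriter(new EnvOptions, options)
  writer.open("/tmp/native-comparator.sst") // hypothetical output path
  writer.put(new Slice("key1".getBytes), new Slice("value1".getBytes))
  writer.finish()
}
A fully custom comparator in C++ would additionally need its own JNI binding; the sketch above only avoids the per-comparison JNI overhead of a Java comparator such as org.rocksdb.util.BytewiseComparator, which is called back from C++ for every comparison.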
@mikhail-antonov Thanks for the quick reply. No, I haven't tried loading the test file into memory. This is just a test; the real data is much larger. And buffReader is a regular Java BufferedReader. Thanks for the help!
@adamretter Thanks for the tip. I read the Java
Hi @mikhail-antonov, I am sure you're very busy, but have you had a chance to look into it? Thank you for the help.
Using RocksDB 5.7.2, I did a straightforward benchmark on generated data rather than reading from a file. The benchmark has three parts (depths 1-3), each doing one more step per record:
- depth 1: only generate the key/value strings
- depth 2: additionally wrap them in Slice objects
- depth 3: additionally add them to the SstFileWriter
Since 1 GB of data at 30 bytes per record is ca. 34M records, I ended up with ca. 83 seconds for the depth-3 test on my laptop (2/4 physical/virtual cores, 10 GB RAM, SSD). The most important insights:
Here is the detailed test output:
And here is the full (Scala) code for the benchmark:
import java.nio.file.{Files, Paths}
import java.util.Locale
import org.rocksdb._
import org.rocksdb.util.DirectBytewiseComparator
object RocksSST extends App {
  Locale.setDefault(Locale.US)
  val sstFileWriter = new SstFileWriter(new EnvOptions, new Options)
  val filename = "/tmp/rocksdbsst"
  sstFileWriter.open(filename)
  val n = 34000000 // ca. number of records in 1 GB data of 30 bytes each
  for (depth <- 1 to 3) {
    val starttime = System.currentTimeMillis()
    var lastlogtime = System.currentTimeMillis()
    var logeach = 1000000
    println("#"*100 + s"\ndepth $depth: started...")
    for (id <- 1 to n) {
      val key = {val number = id.toString; "0" * (15 - number.length) + number} // 6x faster than "%015d".format(id)
      val value = key
      if (depth > 1) {
        val keySlice = new Slice(key.getBytes)
        val valueSlice = new Slice(value.getBytes)
        if (depth > 2) {
          sstFileWriter.add(keySlice, valueSlice)
        }
      }
      if (id % logeach == 0) {
        println(f"depth $depth%d: $id%d of $n%d (${id * 100.0 / n}%.1f%%, records/sec overall: ${id * 1000.0 / (System.currentTimeMillis - starttime)}%.0f, current: ${logeach * 1000.0 / (System.currentTimeMillis - lastlogtime)}%.0f)")
        lastlogtime = System.currentTimeMillis
      }
    }
    if (depth > 2)
      sstFileWriter.finish()
    val duration = System.currentTimeMillis() - starttime
    println(f"depth $depth%d: duration ${duration / 1000.0}%.2f secs, that is ${n * 1000.0 / duration}%.0f records/sec overall")
    if (depth > 2)
      println(f"size of output file '$filename': ${Files.size(Paths.get(filename)) / 1048576.0}%.2f MB")
  }
}
CC @scv119
Thanks @zawlazaw for sharing such insights. @adamretter, are we expecting such a huge overhead from creating the slice? Is it because of memcpy?
(EDIT: added DirectBuffer variants) I did another benchmark that solely focuses on creating slices - Slice and DirectSlice. It first creates a single large byte-array that concatenates 70 million 15-byte records. Then it proceeds in one of the variants A-G shown in the code below.
Comments on the non-direct approaches (A-C): According to stackoverflow, the invocation of a parameterless function via JNI takes ca. 6 ns on a reasonable machine; other sources report 10-40 ns. Thus, most of the time it takes to create a Slice comes from passing the data. Later in that topic, somebody passed "some arguments" via JNI within 400 ns.
Comments on the direct approaches (D-G): I already ran a variant of the very first benchmark above (coming soon...) that shows how using DirectBuffer this way can significantly improve the performance of sst writes. However, there are also two issues with the current implementation of DirectBuffer that I will explain then.
Main result: Here are the detailed results of the slice-only benchmark:
And here is the (Scala) code:
import org.rocksdb._
import java.nio.ByteBuffer
object SliceBenchmark extends App {
  RocksDB.loadLibrary()
  val n = 70000000
  val recordLen = 15
  val data = Array.ofDim[Byte](n * recordLen)
  println("created data-array of length " + data.length)
  val reusedByteBuffer = ByteBuffer.allocateDirect(4096) // allow for up to 4kB of data (although not required, here)
  val reusedSlice = new DirectSlice(reusedByteBuffer, 15) // fix the length to 15 bytes (cannot be changed later, in the current implementation)
  for (variant <- 'A' to 'G') {
    println("#"*100 + s"\nvariant $variant: started...")
    val starttime = System.currentTimeMillis()
    var lastlogtime = starttime
    var logeach = 10000000
    for (id <- 1 to n) {
      variant match {
        case 'A' => // do simply nothing...
        case 'B' => // create a java-internal copy of the relevant fragment of the data
          System.arraycopy(data, (id-1) * recordLen, Array.ofDim[Byte](recordLen), 0, recordLen)
        case 'C' => // create a Slice from a java-internal copy
          val record = Array.ofDim[Byte](recordLen)
          System.arraycopy(data, (id-1) * recordLen, record, 0, recordLen)
          new Slice(record)
        case 'D' => // create a new ByteBuffer and copy data into it
          val newByteBuffer = ByteBuffer.allocateDirect(recordLen)
          newByteBuffer.put(data, (id-1) * recordLen, recordLen)
        case 'E' => // create a new ByteBuffer and a new DirectSlice of the desired size
          val newByteBuffer = ByteBuffer.allocateDirect(recordLen)
          newByteBuffer.put(data, (id-1) * recordLen, recordLen)
          new DirectSlice(newByteBuffer, recordLen)
        case 'F' => // re-use the existing ByteBuffer, but create a new DirectSlice of the desired size
          reusedByteBuffer.clear()
          reusedByteBuffer.put(data, (id-1) * recordLen, recordLen)
          new DirectSlice(reusedByteBuffer, recordLen)
        case 'G' => // re-use the existing ByteBuffer and the existing DirectSlice
          reusedByteBuffer.clear()
          reusedByteBuffer.put(data, (id-1) * recordLen, recordLen)
          reusedSlice // do something with it... (its content is affected by writing to reusedByteBuffer, however, in the current implementation its size cannot be changed !)
      }
      if (id % logeach == 0) {
        println(f"variant $variant: $id%d of $n%d (${id * 100.0 / n}%.1f%%, records/sec overall: ${id * 1000.0 / (System.currentTimeMillis - starttime)}%.0f, current: ${logeach * 1000.0 / (System.currentTimeMillis - lastlogtime)}%.0f)")
        lastlogtime = System.currentTimeMillis
      }
    }
    val duration = System.currentTimeMillis() - starttime
    println(f"variant $variant%c: duration ${duration / 1000.0}%.2f secs, that is ${n * 1000.0 / duration}%.0f records/sec overall")
  }
}
Note that I edited my above post on the Slice-creation benchmark by including DirectSlice. As a follow-up, here is the final benchmark of writing sst files much faster than before, which solves the specific original issue. The trick is to use DirectSlice in the right way (which currently only works if all keys have the same length, and similarly for all values).
In order to not measure input data creation, we first create an array of 35M key-value pairs (keys and values of 15 bytes each, giving ca. 1 GB in total). Then there are two variants:
A) non-direct: write the sst file by creating a new Slice for each key and each value => ca. 450k records/sec (same as in the very first benchmark)
B) direct: write the sst file by re-using a single ByteBuffer as well as a single DirectSlice for all keys (and similarly for values) => ca. 2.3M records/sec!
Thus, with the new approach the total time goes down from 79 seconds to 15 seconds. Speed is now probably no longer determined by JNI, and may be on par with C++. Here is the output of the benchmark:
And here is the (Scala) code:
import org.rocksdb._
import java.nio.ByteBuffer
object DirectRocksSST extends App {
  val sstFileWriter = new SstFileWriter(new EnvOptions, new Options)
  val filename = "/tmp/rocks.sst"
  sstFileWriter.open(filename)
  val sstFileWriterDirect = new SstFileWriter(new EnvOptions, new Options)
  val filenameDirect = "/tmp/rocks-direct.sst"
  sstFileWriterDirect.open(filenameDirect)
  val n = 35000000 // ca. number of records (15 bytes per key and per value) in 1 GB data
  val data = Array.ofDim[Byte](n * 30)
  print("filling data-array of length " + data.length + "...")
  for (id <- 0 until n) {
    val kv = {val number = id.toString; "0" * (15 - number.length) + number}.getBytes // 6x faster than "%015d".format(id)
    System.arraycopy(kv, 0, data, 30*id, 15)
    System.arraycopy(kv, 0, data, 30*id+15, 15)
  }
  println("done")
  // re-use these ByteBuffers for DirectSlice
  val keyByteBuffer = ByteBuffer.allocateDirect(4096)
  val valueByteBuffer = ByteBuffer.allocateDirect(4096)
  // re-use these DirectSlices
  val directKeySlice = new DirectSlice(keyByteBuffer, 15)
  val directValueSlice = new DirectSlice(valueByteBuffer, 15)
  // re-use these byte-arrays for non-direct Slice
  val keyArray = Array.ofDim[Byte](15)
  val valueArray = Array.ofDim[Byte](15)
  for (variant <- 'A' to 'B') {
    val starttime = System.currentTimeMillis()
    var lastlogtime = System.currentTimeMillis()
    var logeach = 5000000
    println("#"*100 + s"\nvariant $variant: started...")
    val keyArray = Array.ofDim[Byte](15)
    val valueArray = Array.ofDim[Byte](15)
    for (id <- 0 until n) {
      variant match {
        case 'A' => // non-direct
          System.arraycopy(data, 30*id, keyArray, 0, 15)
          val keySlice = new Slice(keyArray)
          System.arraycopy(data, 30*id+15, valueArray, 0, 15)
          val valueSlice = new Slice(valueArray)
          sstFileWriter.put(keySlice, valueSlice)
        case 'B' => // direct
          keyByteBuffer.clear() // not really necessary, only for safety in case C++ modified position/limit
          keyByteBuffer.put(data, 30*id, 15)
          keyByteBuffer.flip()
          valueByteBuffer.clear() // not really necessary, only for safety in case C++ modified position/limit
          valueByteBuffer.put(data, 30*id+15, 15)
          valueByteBuffer.flip()
          sstFileWriterDirect.put(directKeySlice, directValueSlice)
      }
      if (id > 0 && id % logeach == 0) {
        println(f"variant $variant%c: $id%d of $n%d (${id * 100.0 / n}%.1f%%, records/sec overall: ${id * 1000.0 / (System.currentTimeMillis - starttime)}%.0f, current: ${logeach * 1000.0 / (System.currentTimeMillis - lastlogtime)}%.0f)")
        lastlogtime = System.currentTimeMillis
      }
    }
    variant match {
      case 'A' => sstFileWriter.finish()
      case 'B' => sstFileWriterDirect.finish()
    }
    val duration = System.currentTimeMillis() - starttime
    println(f"variant $variant%c: duration ${duration / 1000.0}%.2f secs, that is ${n * 1000.0 / duration}%.0f records/sec overall")
  }
}
}
Although this solves the precise original issue, it also raises a question and suggests some future improvements of the Java classes Slice and DirectSlice.
First, a question, since I cannot fully trace the internal data flow after calling put() on the SstFileWriter:
Question: Are the memory regions that are wrapped by the ByteBuffers used for passing data to SstFileWriter.put(...) no longer accessed by C++ as soon as the call to SstFileWriter.put(...) returns? Or may there be some asynchronous later read-access from some C++ internals that would be affected if we modify the ByteBuffers from Java immediately after the call returns? A positive answer to this question is absolutely mandatory in order to guarantee that a single ByteBuffer and DirectSlice may be re-used.
As for the improvements of the Java classes Slice and DirectSlice, I will create a separate issue to discuss them - once somebody gives a positive answer to the above question. Here, a teaser: DirectSlice should provide a way to modify its size parameter after its creation (in order to re-use the same instance for multiple data transfers of different sizes), or even better, it should stick to the standard position/limit approach of ByteBuffer.
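As a stop-gap until such an API exists, here is a hypothetical sketch (class and method names are mine, not part of RocksJava) of handling variable-length records with the current constructor by caching one reusable ByteBuffer/DirectSlice pair per occurring length. It assumes the answer to the above question is positive, i.e. C++ does not touch the buffer after put() returns:
import java.nio.ByteBuffer
import scala.collection.mutable
import org.rocksdb.DirectSlice
// Hypothetical helper: one reusable direct ByteBuffer + DirectSlice per record length.
class DirectSliceCache {
  private val cache = mutable.Map.empty[Int, (ByteBuffer, DirectSlice)]
  // Returns a DirectSlice covering exactly bytes.length bytes, backed by a reused direct buffer.
  def sliceFor(bytes: Array[Byte]): DirectSlice = {
    val (buffer, slice) = cache.getOrElseUpdate(bytes.length, {
      val buf = ByteBuffer.allocateDirect(bytes.length)
      (buf, new DirectSlice(buf, bytes.length)) // length is fixed per cached instance
    })
    buffer.clear()
    buffer.put(bytes)
    buffer.flip()
    slice
  }
}
// Usage: keep separate caches for keys and values so they never share a buffer, e.g.
//   sstFileWriterDirect.put(keyCache.sliceFor(keyBytes), valueCache.sliceFor(valueBytes))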
Sorry for writing so much, but I just realized that the master branch now contains
Hello again... I just implemented
=> Suggestion to @h-1: since your keys/values are of fixed size, you can use the above workaround of re-using the DirectSlice until the other approaches are released (someday). Side remark: since the change from
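Condensed from the benchmark above, the fixed-size reuse pattern looks roughly like this (the record length of 15 bytes, the file name, and the object name are just example values):
import java.nio.ByteBuffer
import org.rocksdb.{DirectSlice, EnvOptions, Options, SstFileWriter}
object FixedSizeReuseSketch extends App {
  val writer = new SstFileWriter(new EnvOptions, new Options)
  writer.open("/tmp/reuse.sst")
  // one direct buffer and one DirectSlice per side, created once and re-used for every record
  val keyBuf = ByteBuffer.allocateDirect(15)
  val valBuf = ByteBuffer.allocateDirect(15)
  val keySlice = new DirectSlice(keyBuf, 15)
  val valSlice = new DirectSlice(valBuf, 15)
  def write(key: Array[Byte], value: Array[Byte]): Unit = {
    keyBuf.clear(); keyBuf.put(key); keyBuf.flip()
    valBuf.clear(); valBuf.put(value); valBuf.flip()
    writer.put(keySlice, valSlice) // the buffers may be overwritten again once put() returns
  }
  // keys must be exactly 15 bytes here and added in sorted order
  write("%015d".format(1).getBytes, "%015d".format(1).getBytes)
  write("%015d".format(2).getBytes, "%015d".format(2).getBytes)
  writer.finish()
}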
Final comment & repeated question: the above way of working with re-used
@zawlazaw Thanks for working on this. It's interesting to me that
Personally, if the ByteBuffer provides better throughput, I think we should switch over to it. To answer your question, it's safe to change the memory once the SstFileWriter.put/get/merge call has finished.
Thanks, @zawlazaw, for the extensive performance testing. I just gave a test case that had fixed length to isolate issues; actual data will have varying lengths. Looking at the Java performance, I think I will go with C++ for now. Though I still have not tested performance in C++, if you have any numbers on C++, please do share. Thanks again for working on this issue.
@scv119 Here are some differences between the two approaches.
Access at C-side: If you pass
Memory allocation type: I am not sure about hardware details, but
So, a crucial aspect when using
@h-1 Unfortunately, I have no C-benchmark on that, but I am very interested in such, too - as well as in benchmarking the read-performance of
Closing this via automation due to lack of activity. If you'd like this to be re-opened, please just comment here and we'll open it back up.
I was testing SstFileWriter with 5.5.5. Below is the summary.
I have a sorted 1 GB text file, keys are 15 bytes, values are also 15 bytes.
This takes ~70s of write time:
val sstFileWriter = new SstFileWriter(new EnvOptions, new Options())
This takes ~140s of write time:
val sstFileWriter = new SstFileWriter(new EnvOptions, new Options(), new BytewiseComparator(new ComparatorOptions))
rest of the code.
I understand there are parameter options we can set to optimize for write performance, but even with default parameters, I was expecting the write performance to be better than what I got. Am I doing something wrong?
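For reference, a minimal sketch of the kind of loading loop described above; the input path, the tab-separated line format, and all variable names are assumptions, not the original code:
import java.io.{BufferedReader, FileReader}
import org.rocksdb.{EnvOptions, Options, Slice, SstFileWriter}
object LoadSortedFileSketch extends App {
  val sstFileWriter = new SstFileWriter(new EnvOptions, new Options())
  sstFileWriter.open("/tmp/loaded.sst")
  // assumed input: one record per line as "key<TAB>value", already sorted by key
  val buffReader = new BufferedReader(new FileReader("/tmp/input.txt"))
  var line = buffReader.readLine()
  while (line != null) {
    val Array(key, value) = line.split("\t", 2)
    sstFileWriter.add(new Slice(key.getBytes), new Slice(value.getBytes)) // add() in 5.5.x; newer versions also offer put()
    line = buffReader.readLine()
  }
  buffReader.close()
  sstFileWriter.finish()
}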