ByteBufferLib, RecordBuffer: use bulk get/put APIs #1800
Conversation
@kinow not really; the function is short-lived and the amount of memory allocated hardly goes beyond a few kilobytes with the datasets I've tested.
Interesting! There was a reason I asked; I created issue #1803 for this.
This is looking good. I tried the various loaders. Loading into Fuseki is, roughly, "loader=basic", to avoid an upload ruining the response for queries running at the same time. The bulk loaders like "phased" and "parallel" are standalone and don't support concurrent use. I'll do some more testing with different data sizes and different warm-ups. @lucasvr - for those timings, did you run a warm-up phase?
Hi @afs. I have always tested with a brand new db directory (…)
Just to check one detail - it was a new server for each timing run? Levels of optimization kick in as the run happens, so warming up also causes the JIT to run. I think the optimizer spots patterns and can, for example, skip the bounds checking in a loop when it knows the index cannot be out of bounds.
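(Editor's aside: a minimal illustration, not Jena code, of the loop shape that HotSpot's range-check elimination typically targets.)

```java
// Illustrative only: when the loop bound is the array's own length, the JIT
// can prove every access is in bounds and hoist the checks out of the loop,
// rather than checking on every iteration.
static long sum(byte[] a) {
    long total = 0;
    for (int i = 0; i < a.length; i++) {
        total += a[i];   // no per-element bounds check once optimized
    }
    return total;
}
```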
The code profiler reported an excessive number of calls to `ByteBufferLib.bbcopy2()`. Does the JIT depend on any particular version of the JDK/JRE? Based on my observations it didn't seem to identify the hottest execution path and optimize it.
Some optimization happens only after a large number of calls to a function, so there may be further optimizations that have not occurred. The exact optimizations are JVM-dependent and get better with every release. (I've only used OpenJDK.) The B+tree implementation was tested by generating random sequences, building a tree, then deleting it in another random order, while checking internal consistency. This was run tens of millions of times while printing a dot every so many iterations of create/delete. The dots started appearing quite slowly, then a bit quicker, then quicker again as the various compile/optimize steps happened.
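(Editor's note: for anyone wanting to see this effect, here is a small self-contained sketch -- an assumed harness, not the Jena test code -- that makes warm-up visible: per-batch times typically drop over the first few batches as the JIT compiles and then optimizes the hot loop.)

```java
// WarmupDemo: time identical batches of work. Early batches run interpreted
// or lightly compiled; later ones are fully JIT-optimized, so the per-batch
// time printed usually shrinks over the first several iterations.
public class WarmupDemo {
    public static void main(String[] args) {
        byte[] src = new byte[64 * 1024];
        byte[] dst = new byte[64 * 1024];
        for (int batch = 0; batch < 20; batch++) {
            long t0 = System.nanoTime();
            for (int i = 0; i < 10_000; i++) {
                for (int j = 0; j < src.length; j++) {
                    dst[j] = src[j];   // hand-written copy loop, a JIT target
                }
            }
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println("batch " + batch + ": " + ms + " ms");
        }
    }
}
```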
Interesting, thanks for letting me know.
Generally, an improvement with the ByteBufferLib changes. Adding the RecordBuffer changes causes some slowdown compared to just the ByteBufferLib changes.

Setup: current Jena development codebase (same as 4.7.0 for TDB2), with the ByteBufferLib changes, and with the RecordBuffer changes, writing to an NVMe M2 SSD. The machine CPU has 8 performance cores (2 threads each) and 4 efficiency cores (12th Gen Intel® Core™ i7-12700 × 20 threads). The test is loading BSBM data with tdb2.tdbloader, running each of 3 loaders twice in one JVM (6 runs per JVM). Data sizes: 1m, 5m, 25m, 50m, and 100m. "basic" is approximately what a bulk load into a running Fuseki server is doing.

Time data for bulk loading:
It looks like the ByteBufferLib changes on their own are valuable. The impact on the parallel loader is minimal. The parallel loader is not usually suitable for desktop- or portable-class machines with Intel CPUs from before the big-little architecture (before 11th generation).
@lucasvr - if you remove the RecordBuffer changes, the PR can go ahead with just the ByteBufferLib improvements. Alternatively, I can take the PR diff and sort it out (but then it would not have your name as the git author).
See apache#1800 for details.
Hi @afs. Apologies for the late response. I have reverted the changes to RecordBuffer.java and updated the PR. Please let me know how it goes.
@lucasvr -- the diff looks good.
CPU profiling of a brand new installation of jena-fuseki shows that ~75% of the time spent by `s-put` (when ingesting Turtle tuples via SPARQL) relates to the execution of `ByteBufferLib.bbcopy2()` -- often as part of B+tree operations such as `BPTreeNode.internalDelete()`. `bbcopy2()` and `bbcopy1()` copy data from a source ByteBuffer to a destination ByteBuffer by reading/writing one byte at a time, which is not very efficient. In addition, `bbcopy1()` makes poor use of data cache prefetches, as it iterates the ByteBuffers in reverse order.

This commit replaces the implementation of these two functions with one that reads the input data in bulk into a dynamically allocated byte array and then writes its contents to the destination buffer using a bulk `put()` operation.

The speedup gains introduced by these code changes are consistent regardless of the number of triples being ingested into Jena:

Input file: 1.2GB with 1.75M triples
Original ingestion time: 544 secs
After changes to bbcopy: 454 secs (1.19x speedup)

Input file: 21MB with 154k triples
Original ingestion time: 7.4 secs
After changes to bbcopy: 6.0 secs (1.24x speedup)

Refs apache#1800
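(Editor's note: for reference, a minimal sketch of the bulk get/put pattern the commit describes. This is an illustration under assumed names; the actual `ByteBufferLib` code differs in its details.)

```java
import java.nio.ByteBuffer;

public class BulkCopySketch {
    // Copy `length` bytes from src[srcPos..] to dst[dstPos..] by staging
    // through a temporary byte[]: one bulk get() plus one bulk put(),
    // instead of a byte-at-a-time loop.
    static void bulkCopy(ByteBuffer src, int srcPos, ByteBuffer dst, int dstPos, int length) {
        byte[] tmp = new byte[length];
        ByteBuffer s = src.duplicate();   // duplicate() leaves src's position/limit untouched
        s.position(srcPos);
        s.get(tmp, 0, length);            // bulk read into the staging array
        ByteBuffer d = dst.duplicate();
        d.position(dstPos);
        d.put(tmp, 0, length);            // bulk write from the staging array
    }
}
```

Staging through the array also makes copies within a single buffer safe when the source and destination regions overlap, which is presumably why the original `bbcopy1()` iterated in reverse order.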
Done!
Thank you!
CPU profiling of a brand new installation of jena-fuseki shows that ~75% of the time spent by `s-put` (when ingesting Turtle tuples via SPARQL) relates to the execution of `ByteBufferLib.bbcopy2()` -- often as part of B+tree operations such as `BPTreeNode.internalDelete()`. `bbcopy2()` and `bbcopy1()` copy data from a source ByteBuffer to a destination ByteBuffer by reading/writing one byte at a time, which is not very efficient. In addition, `bbcopy1()` makes poor use of data cache prefetches, as it iterates the ByteBuffers in reverse order.

This commit replaces the implementation of these two functions with one that reads the input data in bulk into a dynamically allocated byte array and then writes its contents to the destination buffer using a bulk `put()` operation. The same approach is used to improve the performance of another bottleneck in `RecordBuffer::compare()`.

The speedup gains introduced by these code changes are consistent regardless of the number of triples being ingested into Jena:
Input file: 1.2GB with 1.75M triples
Original ingestion time: 544 secs
After changes to bbcopy: 454 secs (1.19x speedup)
After changes to bbcopy and compare(): 388 secs (1.40x speedup)
Input file: 21MB with 154k triples
Original ingestion time: 7.4 secs
After changes to bbcopy: 6.0 secs (1.24x speedup)
After changes to bbcopy and compare(): 5.23 secs (1.43x speedup)
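
(Editor's note: the `RecordBuffer::compare()` change itself isn't shown in this thread. The following is a hedged sketch of how the same staging idea could apply to a comparison, assuming fixed-width records compared as unsigned bytes; the method name and layout are illustrative, not the real Jena code.)

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class RecordCompareSketch {
    // Compare two fixed-width records stored in one buffer: bulk-read each
    // record into a byte[] and compare with Arrays.compareUnsigned (Java 9+),
    // rather than fetching and comparing one byte at a time.
    static int compareRecords(ByteBuffer bb, int idx1, int idx2, int recLen) {
        byte[] r1 = new byte[recLen];
        byte[] r2 = new byte[recLen];
        ByteBuffer b = bb.duplicate();    // leaves bb's position/limit untouched
        b.position(idx1 * recLen);
        b.get(r1, 0, recLen);             // bulk read of record idx1
        b.position(idx2 * recLen);
        b.get(r2, 0, recLen);             // bulk read of record idx2
        // Unsigned lexicographic order, matching a byte-wise comparison loop.
        return Arrays.compareUnsigned(r1, 0, recLen, r2, 0, recLen);
    }
}
```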