ByteBufferLib, RecordBuffer: use bulk get/put APIs #1800

Merged
afs merged 1 commit into apache:main from lucasvr:main
Mar 31, 2023

Conversation

@lucasvr
Contributor

@lucasvr lucasvr commented Mar 16, 2023

CPU profiling of a brand new installation of jena-fuseki shows that ~75% of the time spent by s-put (when ingesting Turtle tuples via SPARQL) relates to the execution of ByteBufferLib.bbcopy2() -- often as part of B+tree operations such as BPTreeNode.internalDelete().

bbcopy2() and bbcopy1() copy data from a source ByteBuffer to a destination ByteBuffer by reading/writing one byte at a time, which is not very efficient. In addition, bbcopy1() makes poor use of data cache prefetching, as it iterates the ByteBuffers in reverse order.

This commit replaces the implementation of these two functions with one that reads the input data in bulk into a dynamically allocated byte array and then writes its contents to the destination buffer using a bulk put() operation. The same approach is used to improve the performance of another bottleneck in RecordBuffer::compare().
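The bulk-copy idea described above can be sketched as follows. This is an illustrative sketch, not the actual Jena code: the class and method names (BulkCopySketch, bbcopy) are hypothetical, and the real ByteBufferLib signatures may differ.

```java
import java.nio.ByteBuffer;

public class BulkCopySketch {
    // Copy 'length' bytes from src starting at srcStart into dst starting at
    // dstStart, using one bulk get and one bulk put instead of a
    // byte-at-a-time loop.
    static void bbcopy(ByteBuffer src, int srcStart, ByteBuffer dst, int dstStart, int length) {
        byte[] tmp = new byte[length];   // short-lived staging array
        ByteBuffer s = src.duplicate();  // duplicate so the caller's position is untouched
        s.position(srcStart);
        s.get(tmp, 0, length);           // one bulk read
        ByteBuffer d = dst.duplicate();
        d.position(dstStart);
        d.put(tmp, 0, length);           // one bulk write
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5});
        ByteBuffer dst = ByteBuffer.allocate(5);
        bbcopy(src, 1, dst, 0, 3);
        System.out.println(dst.get(0) + "," + dst.get(1) + "," + dst.get(2)); // 2,3,4
    }
}
```

Since Java 13, ByteBuffer also offers absolute bulk get/put overloads that take an index, which would avoid the duplicate-and-position dance shown here.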

The speedup gains introduced by these code changes are consistent regardless of the number of triples being ingested into Jena:

Input file: 1.2GB with 1.75M triples
Original ingestion time: 544 secs
After changes to bbcopy: 454 secs (1.19x speedup)
After changes to bbcopy and compare(): 388 secs (1.40x speedup)

Input file: 21MB with 154k triples
Original ingestion time: 7.4 secs
After changes to bbcopy: 6.0 secs (1.24x speedup)
After changes to bbcopy and compare(): 5.23 secs (1.43x speedup)

Member

@kinow kinow left a comment


From what I could understand from the code and issue description, the change looks good to me. @lucasvr was there any increase in memory use from creating the bbBytes in compare() to avoid calling ByteBuffer#get inside the loop?

@lucasvr
Contributor Author

lucasvr commented Mar 16, 2023

@kinow not really; the function is short-lived and the amount of memory allocated hardly goes beyond a few kilobytes with the datasets I've tested.
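For illustration, the compare-side change amounts to staging both record regions into short-lived byte arrays and comparing those in bulk, rather than calling ByteBuffer#get per byte inside the loop. This is a hypothetical sketch (CompareSketch, compareRegions), not the actual RecordBuffer code.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class CompareSketch {
    // Compare 'length' bytes of two ByteBuffers starting at the given offsets,
    // treating bytes as unsigned (as record keys typically require).
    static int compareRegions(ByteBuffer a, int aStart, ByteBuffer b, int bStart, int length) {
        byte[] aBytes = new byte[length];
        byte[] bBytes = new byte[length];
        ByteBuffer ad = a.duplicate(); ad.position(aStart); ad.get(aBytes);  // one bulk read
        ByteBuffer bd = b.duplicate(); bd.position(bStart); bd.get(bBytes);  // one bulk read
        return Arrays.compareUnsigned(aBytes, bBytes);  // lexicographic unsigned compare
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.wrap(new byte[] {0x10, (byte) 0xFF});
        ByteBuffer b = ByteBuffer.wrap(new byte[] {0x10, 0x01});
        System.out.println(compareRegions(a, 0, b, 0, 2) > 0); // true: 0xFF > 0x01 unsigned
    }
}
```

The staging arrays are exactly the short-lived allocations the question above refers to; they become garbage as soon as the comparison returns.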

@afs
Member

afs commented Mar 17, 2023

Interesting!

There was a reason for bbcopy1 and bbcopy2 when they were written, but I cannot remember what it was.
It may even be that modern hardware/modern Java prefers allocating a temporary buffer.

I created issue #1803 for this.

@afs
Member

afs commented Mar 20, 2023

This is looking good. I tried the various tdb2.tdbloader choices. I got a ~10% improvement for loader=phased (the default from the command line) and loader=basic. Test data was 25e6 BSBM triples. Less on loader=parallel, but that is not surprising as it is not typically CPU limited.

Loading into Fuseki is, roughly, "loader=basic" to avoid an upload ruining the response for queries running at the same time. The bulk loaders like "phased" and "parallel" are standalone and don't support concurrent use.

I'll do some more testing with different data sizes and different warm-ups.

@lucasvr - for those timings, did you run a warm-up phase?

@lucasvr
Contributor Author

lucasvr commented Mar 20, 2023

Hi @afs. I have always tested with a brand new db directory (rm -rf) and with the input file in the page cache (cat file.ttl > /dev/null).

@afs
Member

afs commented Mar 20, 2023

Just to check one detail - was it a new server for each timing run?

Levels of optimization kick in as the run happens, so warming up also causes the JIT to run. I think the optimizer spots patterns and can, for example, skip the bounds checking in a loop when it knows the index cannot be out of bounds.

@lucasvr
Contributor Author

lucasvr commented Mar 20, 2023

The code profiler, which reported an excessive number of calls to bbcopy2(), sampled a single Fuseki process for several minutes as I ingested different TTL files via SPARQL. Only afterwards did I benchmark different instances of Fuseki by launching a new server for each run.

Does JIT depend on any particular version of JDK/JRE? Based on my observations it didn't seem to identify the hottest execution path and optimize bbcopy2().

@afs
Member

afs commented Mar 21, 2023

Some optimization happens only after a large number of calls to a function. So there may be further optimizations that have not occurred.

The exact optimizations will be JVM dependent and get better with every release. (I've only used OpenJDK.)

The B+tree implementation was tested by generating random sequences, building a tree, then deleting it in another random order, while checking internal consistency. This was run tens of millions of times while printing a dot every so many iterations of create/delete. The dots started appearing quite slowly, then a bit quicker, then quicker again as various compile/optimize steps happened.
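A harness along those lines might look like the following. This is a sketch, not the actual Jena test code: a TreeMap stands in for the B+tree, and the class and method names are hypothetical. The point is the shape of the workload - random-order create, different random-order delete, a consistency check, and a progress dot every so many rounds.

```java
import java.util.*;

public class RandomInsertDeleteHarness {
    // Build the structure in one random key order, delete in a different
    // random order, and verify it ends up empty each round.
    static boolean runRounds(int rounds, int keysPerRound, long seed) {
        Random rnd = new Random(seed);
        NavigableMap<Integer, Integer> tree = new TreeMap<>(); // stand-in for the B+tree
        for (int round = 0; round < rounds; round++) {
            List<Integer> keys = new ArrayList<>();
            for (int i = 0; i < keysPerRound; i++) keys.add(rnd.nextInt());
            Collections.shuffle(keys, rnd);
            for (int k : keys) tree.put(k, k);   // create in random order
            Collections.shuffle(keys, rnd);
            for (int k : keys) tree.remove(k);   // delete in another random order
            if (!tree.isEmpty()) return false;   // consistency check failed
            if (round % 100 == 0) System.out.print("."); // progress dot
        }
        return true;
    }

    public static void main(String[] args) {
        boolean ok = runRounds(1_000, 100, 42L);
        System.out.println();
        System.out.println(ok ? "OK" : "FAILED");
    }
}
```

With the real B+tree and tens of millions of rounds, the speed-up in dot output over time is exactly the JIT compile/optimize behaviour described above.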

@lucasvr
Contributor Author

lucasvr commented Mar 21, 2023

Interesting, thanks for letting me know.

@afs
Member

afs commented Mar 24, 2023

Generally, improvement with ByteBufferLib changes.

Adding the RecordBuffer change causes some slowdown compared to just the ByteBufferLib changes.
This is probably because the optimizer can already avoid the bounds checking in RecordBuffer's ByteBuffer.get calls, and RecordBuffer operates on small data items, so the allocation cost of the array is significant.

Setup: current Jena development codebase (same as 4.7.0 for TDB2), with ByteBufferLib changes and with
ByteBufferLib and RecordBuffer changes.

Writing to NVMe M2 SSD.
Reading from data file nt.gz on HDD.
It is slower to put the data on the same SSD as the output database - the parser is faster than all the figures shown and is not the limiting factor.

The machine CPU has 8 performance cores (2 threads each) and 4 efficiency cores (12th Gen Intel® Core™ i7-12700 × 20 threads).

The test is loading BSBM data with tdb2.tdbloader, running each of 3 loaders twice in one JVM (6 runs per JVM).
The results are from the second run; no class loading, and some JIT optimization would already have happened.

Data sizes: 1m, 5m, 25m, 50m, and 100m.

"basic" is approximately what a bulk load into a running Fuseki server is doing.

Time data for bulk loading
1 million triples
main:
Basic:    7.842 seconds : Triples = 1,000,312 : Rate = 127,558 /s
Phased:   3.897 seconds : Triples = 1,000,312 : Rate = 256,688 /s
Parallel: 2.793 seconds : Triples = 1,000,312 : Rate = 358,150 /s

ByteBuffer:
Basic:    6.292 seconds : Triples = 1,000,312 : Rate = 158,982 /s
Phased:   3.262 seconds : Triples = 1,000,312 : Rate = 306,656 /s
Parallel: 2.754 seconds : Triples = 1,000,312 : Rate = 363,221 /s

ByteBuffer and Record Buffer:
Basic:    6.704 seconds : Triples = 1,000,312 : Rate = 149,211 /s
Phased:   3.415 seconds : Triples = 1,000,312 : Rate = 292,917 /s
Parallel: 2.720 seconds : Triples = 1,000,312 : Rate = 367,762 /s

ByteBuffer faster than ByteBuffer+RecordBuffer

Comparing existing and byte buffer changes:
Basic faster
Phased faster
Parallel unaffected

------------------------

5m
main:
Basic:    39.224 seconds : Triples = 5,000,599 : Rate = 127,488 /s
Phased:   19.032 seconds : Triples = 5,000,599 : Rate = 262,747 /s
Parallel: 13.148 seconds : Triples = 5,000,599 : Rate = 380,332 /s

ByteBuffer:
Basic:    33.248 seconds : Triples = 5,000,599 : Rate = 150,403 /s
Phased:   16.070 seconds : Triples = 5,000,599 : Rate = 311,176 /s
Parallel: 13.233 seconds : Triples = 5,000,599 : Rate = 377,889 /s

ByteBuffer and Record Buffer:
Basic:    34.614 seconds : Triples = 5,000,599 : Rate = 144,468 /s
Phased:   16.635 seconds : Triples = 5,000,599 : Rate = 300,607 /s
Parallel: 13.307 seconds : Triples = 5,000,599 : Rate = 375,787 /s

ByteBuffer+RecordBuffer slower than just ByteBuffer

Comparing existing and byte buffer changes:
Basic faster
Phased faster
Parallel slightly slower

------------------------

25m
main:
Basic:    141.306 seconds : Triples = 24,997,044 : Rate = 176,900 /s
Phased:    75.321 seconds : Triples = 24,997,044 : Rate = 331,874 /s
Parallel:  63.707 seconds : Triples = 24,997,044 : Rate = 392,375 /s

ByteBuffer:
Basic:    139.969 seconds : Triples = 24,997,044 : Rate = 178,590 /s
Phased:    75.875 seconds : Triples = 24,997,044 : Rate = 329,450 /s
Parallel:  64.162 seconds : Triples = 24,997,044 : Rate = 389,593 /s

ByteBuffer and record buffer:
Basic:    150.354 seconds : Triples = 24,997,044 : Rate = 166,255 /s
Phased:    79.437 seconds : Triples = 24,997,044 : Rate = 314,678 /s
Parallel:  65.110 seconds : Triples = 24,997,044 : Rate = 383,920 /s

ByteBuffer+RecordBuffer slowest (could be GC related)

Comparing existing and byte buffer changes:
Basic faster
Phased about the same
Parallel slightly slower

------------------------

50m
main:
Basic:    428.839 seconds : Triples = 50,005,630 : Rate = 116,607 /s
Phased:   194.982 seconds : Triples = 50,005,630 : Rate = 256,463 /s
Parallel: 132.348 seconds : Triples = 50,005,630 : Rate = 377,834 /s

ByteBuffer:
Basic:    357.125 seconds : Triples = 50,005,630 : Rate = 140,023 /s
Phased:   164.474 seconds : Triples = 50,005,630 : Rate = 304,034 /s
Parallel: 129.619 seconds : Triples = 50,005,630 : Rate = 385,789 /s

ByteBuffer and record buffer:
Basic:    379.890 seconds : Triples = 50,005,630 : Rate = 131,632 /s
Phased:   171.425 seconds : Triples = 50,005,630 : Rate = 291,706 /s
Parallel: 132.076 seconds : Triples = 50,005,630 : Rate = 378,613 /s

ByteBuffer+RecordBuffer slower than just ByteBuffer

Comparing existing and byte buffer changes:
Basic faster
Phased faster
Parallel faster

------------------------

100m:

main:
Basic:     875.759 seconds : Triples = 100,000,748 : Rate = 114,188 /s
Phased:    390.673 seconds : Triples = 100,000,748 : Rate = 255,970 /s
Parallel:  268.516 seconds : Triples = 100,000,748 : Rate = 372,420 /s

ByteBuffer:
Basic:     729.727 seconds : Triples = 100,000,748 : Rate = 137,039 /s
Phased:    334.163 seconds : Triples = 100,000,748 : Rate = 299,257 /s
Parallel:  257.353 seconds : Triples = 100,000,748 : Rate = 388,574 /s

ByteBuffer and Record Buffer:
Basic:     766.970 seconds : Triples = 100,000,748 : Rate = 130,384 /s
Phased:    345.600 seconds : Triples = 100,000,748 : Rate = 289,354 /s
Parallel:  258.614 seconds : Triples = 100,000,748 : Rate = 386,680 /s

Record buffer slower than byte buffer

Basic faster
Phased faster
Parallel slightly faster

@afs
Member

afs commented Mar 24, 2023

It looks like the ByteBufferLib changes on their own are valuable.

The impact on the parallel loader is minimal. The parallel loader is not usually suitable for desktop- or portable-class machines with Intel CPUs from before the big-little architecture (before 11th generation).

@afs
Member

afs commented Mar 27, 2023

@lucasvr - if you remove the RecordBuffer part, we can apply this PR.

Alternatively, I can take the PR diff and sort it out (but it does not have your name as the git-author).

@afs
Member

afs commented Mar 30, 2023

PR #1819 is the ByteBufferLib improvements from this PR.

I measured the speed improvement of #1819 and got a similar gain in performance.

lucasvr added a commit to lucasvr/jena that referenced this pull request Mar 30, 2023
@lucasvr
Contributor Author

lucasvr commented Mar 30, 2023

> @lucasvr - if you remove the RecordBuffer part, we can apply this PR.
>
> Alternatively, I can take the PR diff and sort it out (but it does not have your name as the git-author).

Hi @afs. Apologies for the late response. I have reverted the changes to RecordBuffer.java and updated the PR. Please let me know how it goes.

@afs
Member

afs commented Mar 30, 2023

@lucasvr -- the diff looks good.
Could you squash the 2 commits into one commit with commit message updated e.g. "ByteBufferLib: use bulk get/put APIs"? Thanks

CPU profiling of a brand new installation of jena-fuseki shows that
~75% of the time spent by `s-put` (when ingesting Turtle tuples via
SPARQL) relates to the execution of `ByteBufferLib.bbcopy2()` -- often
as part of B+tree operations such as `BPTreeNode.internalDelete()`.

`bbcopy2()` and `bbcopy1()` perform data copy from a source ByteBuffer
to a destination ByteBuffer by reading/writing one byte at a time, which
is not very efficient. In addition, `bbcopy1()` makes a poor use of data
cache prefetches, as it iterates the ByteBuffers in reverse order.

This commit replaces the implementation of these two functions by one
that reads the input data in bulk mode into a dynamically allocated byte
array and then writes its contents to the destination buffer using a bulk
`put()` operation.

The speedup gains introduced by these code changes are consistent
regardless of the number of triples being ingested into Jena:

Input file: 1.2GB with 1.75M triples
Original ingestion time: 544 secs
After changes to bbcopy: 454 secs (1.19x speedup)

Input file: 21MB with 154k triples
Original ingestion time: 7.4 secs
After changes to bbcopy: 6.0 secs (1.24x speedup)

Refs apache#1800
@lucasvr
Contributor Author

lucasvr commented Mar 31, 2023

> Could you squash the 2 commits into one commit with commit message updated e.g. "ByteBufferLib: use bulk get/put APIs"? Thanks

Done!

@afs afs merged commit 4933549 into apache:main Mar 31, 2023
@afs
Member

afs commented Mar 31, 2023

Thank you!
