ByteBufferLib, RecordBuffer: use bulk get/put APIs #1800

Merged
afs merged 1 commit into apache:main from lucasvr:main
Mar 31, 2023

Conversation

@lucasvr
Contributor

@lucasvr lucasvr commented Mar 16, 2023

CPU profiling of a brand new installation of jena-fuseki shows that ~75% of the time spent by s-put (when ingesting Turtle tuples via SPARQL) relates to the execution of ByteBufferLib.bbcopy2() -- often as part of B+tree operations such as BPTreeNode.internalDelete().

bbcopy2() and bbcopy1() copy data from a source ByteBuffer to a destination ByteBuffer by reading/writing one byte at a time, which is not very efficient. In addition, bbcopy1() makes poor use of data cache prefetching, as it iterates the ByteBuffers in reverse order.

This commit replaces the implementation of these two functions with one that reads the input data in bulk into a dynamically allocated byte array and then writes its contents to the destination buffer using a bulk put() operation. The same approach is used to improve the performance of another bottleneck in RecordBuffer::compare().
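The bulk-copy idea described above can be sketched as follows. This is an illustrative sketch, not the actual Jena code: the class and method names (BulkCopySketch, bbcopy) are hypothetical, and the real ByteBufferLib signatures may differ.

```java
import java.nio.ByteBuffer;

public class BulkCopySketch {
    // Copy 'length' bytes from src starting at srcStart into dst starting at
    // dstStart, using one bulk get and one bulk put instead of a
    // byte-at-a-time loop.
    static void bbcopy(ByteBuffer src, int srcStart, ByteBuffer dst, int dstStart, int length) {
        byte[] tmp = new byte[length];   // short-lived staging array
        ByteBuffer s = src.duplicate();  // duplicate so the caller's position is untouched
        s.position(srcStart);
        s.get(tmp, 0, length);           // one bulk read
        ByteBuffer d = dst.duplicate();
        d.position(dstStart);
        d.put(tmp, 0, length);           // one bulk write
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5});
        ByteBuffer dst = ByteBuffer.allocate(5);
        bbcopy(src, 1, dst, 0, 3);
        System.out.println(dst.get(0) + "," + dst.get(1) + "," + dst.get(2)); // 2,3,4
    }
}
```

Since Java 13, ByteBuffer also offers absolute bulk get/put overloads that take an index, which would avoid the duplicate-and-position dance shown here.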

The speedup gains introduced by these code changes are consistent regardless of the number of triples being ingested into Jena:

Input file: 1.2GB with 1.75M triples
Original ingestion time: 544 secs
After changes to bbcopy: 454 secs (1.19x speedup)
After changes to bbcopy and compare(): 388 secs (1.40x speedup)

Input file: 21MB with 154k triples
Original ingestion time: 7.4 secs
After changes to bbcopy: 6.0 secs (1.24x speedup)
After changes to bbcopy and compare(): 5.23 secs (1.43x speedup)

Member

@kinow kinow left a comment


From what I could understand from the code and issue description, the change looks good to me. @lucasvr was there any increase in memory use from creating the bbBytes in compare() to avoid calling ByteBuffer#get inside the loop?

@lucasvr
Contributor Author

lucasvr commented Mar 16, 2023

@kinow not really; the function is short-lived and the amount of memory allocated hardly goes beyond a few kilobytes with the datasets I've tested.
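For illustration, the compare-side change amounts to staging both record regions into short-lived byte arrays and comparing those in bulk, rather than calling ByteBuffer#get per byte inside the loop. This is a hypothetical sketch (CompareSketch, compareRegions), not the actual RecordBuffer code.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class CompareSketch {
    // Compare 'length' bytes of two ByteBuffers starting at the given offsets,
    // treating bytes as unsigned (as record keys typically require).
    static int compareRegions(ByteBuffer a, int aStart, ByteBuffer b, int bStart, int length) {
        byte[] aBytes = new byte[length];
        byte[] bBytes = new byte[length];
        ByteBuffer ad = a.duplicate(); ad.position(aStart); ad.get(aBytes);  // one bulk read
        ByteBuffer bd = b.duplicate(); bd.position(bStart); bd.get(bBytes);  // one bulk read
        return Arrays.compareUnsigned(aBytes, bBytes);  // lexicographic unsigned compare
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.wrap(new byte[] {0x10, (byte) 0xFF});
        ByteBuffer b = ByteBuffer.wrap(new byte[] {0x10, 0x01});
        System.out.println(compareRegions(a, 0, b, 0, 2) > 0); // true: 0xFF > 0x01 unsigned
    }
}
```

The staging arrays are exactly the short-lived allocations the question above refers to; they become garbage as soon as the comparison returns.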

@afs
Member

afs commented Mar 17, 2023

Interesting!

There was a reason for bbcopy1 and bbcopy2 when they were written, but I cannot remember what it was.
It may even be that modern hardware/modern Java prefers allocating a temporary buffer.

I created issue #1803 for this.

@afs
Member

afs commented Mar 20, 2023

This is looking good. I tried the various tdb2.tdbloader choices. I got a ~10% improvement for loader=phased (the default from the command line) and loader=basic. Test data was 25e6 BSBM triples. Less on loader=parallel, but that is not surprising as it is not typically CPU limited.

Loading into Fuseki is, roughly, "loader=basic" to avoid an upload ruining the response for queries running at the same time. The bulk loaders like "phased" and "parallel" are standalone and don't support concurrent use.

I'll do some more testing with different data sizes and different warm-ups.

@lucasvr - for those timings, did you run a warm-up phase?

@lucasvr
Contributor Author

lucasvr commented Mar 20, 2023

Hi @afs. I have always tested with a brand new db directory (rm -rf) and with the input file in the page cache (cat file.ttl > /dev/null).

@afs
Member

afs commented Mar 20, 2023

Just to check one detail - was it a new server for each timing run?

Levels of optimization kick in as the run happens, so warming up also causes the JIT to run. I think the optimizer spots patterns and can, for example, skip the bounds checking in a loop when it knows the index cannot be out of bounds.

@lucasvr
Contributor Author

lucasvr commented Mar 20, 2023

The code profiler, which reported an excessive number of calls to bbcopy2(), sampled a single Fuseki process for several minutes as I ingested different TTL files via SPARQL. Only afterwards did I benchmark different instances of Fuseki by launching a new server for each run.

Does JIT depend on any particular version of JDK/JRE? Based on my observations it didn't seem to identify the hottest execution path and optimize bbcopy2().

@afs
Member

afs commented Mar 21, 2023

Some optimization happens only after a large number of calls to a function. So there may be further optimizations that have not occurred.

The exact optimizations will be JVM dependent and get better with every release. (I've only used OpenJDK.)

The B+tree implementation was tested by generating random sequences, building a tree, then deleting it in another random order, while checking internal consistency. This was run tens of millions of times while printing a dot every so many iterations of create/delete. The dots started appearing quite slowly, then a bit quicker, then quicker again as various compile/optimize steps happened.
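A harness along those lines might look like the following. This is a sketch, not the actual Jena test code: a TreeMap stands in for the B+tree, and the class and method names are hypothetical. The point is the shape of the workload - random-order create, different random-order delete, a consistency check, and a progress dot every so many rounds.

```java
import java.util.*;

public class RandomInsertDeleteHarness {
    // Build the structure in one random key order, delete in a different
    // random order, and verify it ends up empty each round.
    static boolean runRounds(int rounds, int keysPerRound, long seed) {
        Random rnd = new Random(seed);
        NavigableMap<Integer, Integer> tree = new TreeMap<>(); // stand-in for the B+tree
        for (int round = 0; round < rounds; round++) {
            List<Integer> keys = new ArrayList<>();
            for (int i = 0; i < keysPerRound; i++) keys.add(rnd.nextInt());
            Collections.shuffle(keys, rnd);
            for (int k : keys) tree.put(k, k);   // create in random order
            Collections.shuffle(keys, rnd);
            for (int k : keys) tree.remove(k);   // delete in another random order
            if (!tree.isEmpty()) return false;   // consistency check failed
            if (round % 100 == 0) System.out.print("."); // progress dot
        }
        return true;
    }

    public static void main(String[] args) {
        boolean ok = runRounds(1_000, 100, 42L);
        System.out.println();
        System.out.println(ok ? "OK" : "FAILED");
    }
}
```

With the real B+tree and tens of millions of rounds, the speed-up in dot output over time is exactly the JIT compile/optimize behaviour described above.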

@lucasvr
Contributor Author

lucasvr commented Mar 21, 2023

Interesting, thanks for letting me know.

@afs
Member

afs commented Mar 24, 2023

Generally, improvement with ByteBufferLib changes.

Adding the RecordBuffer change causes some slowdown compared to just the ByteBufferLib changes.
This is probably because the optimizer can already avoid the bounds checking in RecordBuffer's ByteBuffer.get calls, and RecordBuffer operates on small data items, so the allocation cost of the array is significant.

Setup: current Jena development codebase (same as 4.7.0 for TDB2), with ByteBufferLib changes and with
ByteBufferLib and RecordBuffer changes.

Writing to NVMe M2 SSD.
Reading from data file nt.gz on HDD.
It is slower to put the data on the same SSD as the output database - the parser is faster than all the figures shown and is not the limiting factor.

The machine CPU has 8 performance cores (2 threads each) and 4 efficiency cores (12th Gen Intel® Core™ i7-12700 × 20 threads).

The test is loading BSBM data with tdb2.tdbloader, running each of 3 loaders twice in one JVM (6 runs per JVM).
The results are from the second run; no class loading, and some JIT optimization would already have happened.

Data sizes: 1m, 5m, 25m, 50m, and 100m.

"basic" is approximately what a bulk load into a running Fuseki server is doing.

Time data for bulk loading
1 million triples
main:
Basic:    7.842 seconds : Triples = 1,000,312 : Rate = 127,558 /s
Phased:   3.897 seconds : Triples = 1,000,312 : Rate = 256,688 /s
Parallel: 2.793 seconds : Triples = 1,000,312 : Rate = 358,150 /s

ByteBuffer:
Basic:    6.292 seconds : Triples = 1,000,312 : Rate = 158,982 /s
Phased:   3.262 seconds : Triples = 1,000,312 : Rate = 306,656 /s
Parallel: 2.754 seconds : Triples = 1,000,312 : Rate = 363,221 /s

ByteBuffer and Record Buffer:
Basic:    6.704 seconds : Triples = 1,000,312 : Rate = 149,211 /s
Phased:   3.415 seconds : Triples = 1,000,312 : Rate = 292,917 /s
Parallel: 2.720 seconds : Triples = 1,000,312 : Rate = 367,762 /s

ByteBuffer faster than ByteBuffer+RecordBuffer

Comparing existing and byte buffer changes:
Basic faster
Phased faster
Parallel unaffected

------------------------

5m
main:
Basic:    39.224 seconds : Triples = 5,000,599 : Rate = 127,488 /s
Phased:   19.032 seconds : Triples = 5,000,599 : Rate = 262,747 /s
Parallel: 13.148 seconds : Triples = 5,000,599 : Rate = 380,332 /s

ByteBuffer:
Basic:    33.248 seconds : Triples = 5,000,599 : Rate = 150,403 /s
Phased:   16.070 seconds : Triples = 5,000,599 : Rate = 311,176 /s
Parallel: 13.233 seconds : Triples = 5,000,599 : Rate = 377,889 /s

ByteBuffer and Record Buffer:
Basic:    34.614 seconds : Triples = 5,000,599 : Rate = 144,468 /s
Phased:   16.635 seconds : Triples = 5,000,599 : Rate = 300,607 /s
Parallel: 13.307 seconds : Triples = 5,000,599 : Rate = 375,787 /s

ByteBuffer+RecordBuffer slower than just ByteBuffer

Comparing existing and byte buffer changes:
Basic faster
Phased faster
Parallel slightly slower

------------------------

25m
main:
Basic:    141.306 seconds : Triples = 24,997,044 : Rate = 176,900 /s
Phased:    75.321 seconds : Triples = 24,997,044 : Rate = 331,874 /s
Parallel:  63.707 seconds : Triples = 24,997,044 : Rate = 392,375 /s

ByteBuffer:
Basic:    139.969 seconds : Triples = 24,997,044 : Rate = 178,590 /s
Phased:    75.875 seconds : Triples = 24,997,044 : Rate = 329,450 /s
Parallel:  64.162 seconds : Triples = 24,997,044 : Rate = 389,593 /s

ByteBuffer and record buffer:
Basic:    150.354 seconds : Triples = 24,997,044 : Rate = 166,255 /s
Phased:    79.437 seconds : Triples = 24,997,044 : Rate = 314,678 /s
Parallel:  65.110 seconds : Triples = 24,997,044 : Rate = 383,920 /s

ByteBuffer+RecordBuffer slowest (could be GC related)

Comparing existing and byte buffer changes:
Basic faster
Phased about the same
Parallel slightly slower

------------------------

50m
main:
Basic:    428.839 seconds : Triples = 50,005,630 : Rate = 116,607 /s
Phased:   194.982 seconds : Triples = 50,005,630 : Rate = 256,463 /s
Parallel: 132.348 seconds : Triples = 50,005,630 : Rate = 377,834 /s

ByteBuffer:
Basic:    357.125 seconds : Triples = 50,005,630 : Rate = 140,023 /s
Phased:   164.474 seconds : Triples = 50,005,630 : Rate = 304,034 /s
Parallel: 129.619 seconds : Triples = 50,005,630 : Rate = 385,789 /s

ByteBuffer and record buffer:
Basic:    379.890 seconds : Triples = 50,005,630 : Rate = 131,632 /s
Phased:   171.425 seconds : Triples = 50,005,630 : Rate = 291,706 /s
Parallel: 132.076 seconds : Triples = 50,005,630 : Rate = 378,613 /s

ByteBuffer+RecordBuffer slower than just ByteBuffer

Comparing existing and byte buffer changes:
Basic faster
Phased faster
Parallel faster

------------------------

100m:

main:
Basic:     875.759 seconds : Triples = 100,000,748 : Rate = 114,188 /s
Phased:    390.673 seconds : Triples = 100,000,748 : Rate = 255,970 /s
Parallel:  268.516 seconds : Triples = 100,000,748 : Rate = 372,420 /s

ByteBuffer:
Basic:     729.727 seconds : Triples = 100,000,748 : Rate = 137,039 /s
Phased:    334.163 seconds : Triples = 100,000,748 : Rate = 299,257 /s
Parallel:  257.353 seconds : Triples = 100,000,748 : Rate = 388,574 /s

ByteBuffer and Record Buffer:
Basic:     766.970 seconds : Triples = 100,000,748 : Rate = 130,384 /s
Phased:    345.600 seconds : Triples = 100,000,748 : Rate = 289,354 /s
Parallel:  258.614 seconds : Triples = 100,000,748 : Rate = 386,680 /s

Record buffer slower than byte buffer

Basic faster
Phased faster
Parallel slightly faster

@afs
Member

afs commented Mar 24, 2023

It looks like the ByteBufferLib changes on their own are valuable.

The impact on the parallel loader is minimal. The parallel loader is not usually suitable for desktop- or portable-class machines with Intel CPUs from before the big-little architecture (before 11th generation).

@afs
Member

afs commented Mar 27, 2023

@lucasvr - if you remove the RecordBuffer part, we can apply this PR.

Alternatively, I can take the PR diff and sort it out (but it does not have your name as the git-author).

@afs
Member

afs commented Mar 30, 2023

PR #1819 is the ByteBufferLib improvements from this PR.

I measured the speed improvement of #1819 and got a similar gain in performance.

lucasvr added a commit to lucasvr/jena that referenced this pull request Mar 30, 2023
@lucasvr
Contributor Author

lucasvr commented Mar 30, 2023

> @lucasvr - if you remove the RecordBuffer part, we can apply this PR.
>
> Alternatively, I can take the PR diff and sort it out (but it does not have your name as the git-author).

Hi @afs. Apologies for the late response. I have reverted the changes to RecordBuffer.java and updated the PR. Please let me know how it goes.

@afs
Member

afs commented Mar 30, 2023

@lucasvr -- the diff looks good.
Could you squash the 2 commits into one commit with commit message updated e.g. "ByteBufferLib: use bulk get/put APIs"? Thanks

CPU profiling of a brand new installation of jena-fuseki shows that
~75% of the time spent by `s-put` (when ingesting Turtle tuples via
SPARQL) relates to the execution of `ByteBufferLib.bbcopy2()` -- often
as part of B+tree operations such as `BPTreeNode.internalDelete()`.

`bbcopy2()` and `bbcopy1()` perform data copy from a source ByteBuffer
to a destination ByteBuffer by reading/writing one byte at a time, which
is not very efficient. In addition, `bbcopy1()` makes a poor use of data
cache prefetches, as it iterates the ByteBuffers in reverse order.

This commit replaces the implementation of these two functions by one
that reads the input data in bulk mode into a dynamically allocated byte
array and then writes its contents to the destination buffer using a bulk
`put()` operation.

The speedup gains introduced by these code changes are consistent
regardless of the number of triples being ingested into Jena:

Input file: 1.2GB with 1.75M triples
Original ingestion time: 544 secs
After changes to bbcopy: 454 secs (1.19x speedup)

Input file: 21MB with 154k triples
Original ingestion time: 7.4 secs
After changes to bbcopy: 6.0 secs (1.24x speedup)

Refs apache#1800
@lucasvr
Contributor Author

lucasvr commented Mar 31, 2023

> Could you squash the 2 commits into one commit with commit message updated e.g. "ByteBufferLib: use bulk get/put APIs"? Thanks

Done!

@afs afs merged commit 4933549 into apache:main Mar 31, 2023
@afs
Member

afs commented Mar 31, 2023

Thank you!
