Improvements for handling many small entries #3559

Open · 4 of 6 tasks · merlimat opened this issue Oct 20, 2022 · 2 comments
Comments

merlimat (Contributor) commented Oct 20, 2022

The BK data path is very efficient when processing large entries and it's generally able to saturate the disk and network IO in these cases.

By contrast, when handling a large number of very small entries, several inefficiencies cause the CPU to become the bottleneck because of the per-entry overhead.

There are several pieces of low-hanging fruit to tackle to improve performance:

  • Reduce contention in message passing
  • Reduce contention in the journal & force-write queues
  • Improve the OrderedExecutor performance
  • Reduce the number of buffers allocated per entry written/read

For each entry being written in a ledger we are using 4 ByteBuf instances:

  1. The entry payload (this gets passed in to BK client)
  2. The checksum
  3. The serialized `AddRequest`
  4. The 4 byte size header

These buffers are passed to Netty, which performs a scatter/gather writev over all of them without copying.
Allocating and managing all these buffers is expensive. There is overhead in:

  • Refcounting
  • Recycler to get the ByteBuf instances and put them back in the pool
  • ByteBuf pool arena to handle allocations/deallocation
  • Inter-thread synchronization: these buffers are normally allocated in one thread and deallocated in a different thread
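The scatter/gather write mentioned above can be illustrated with plain `java.nio` (Netty does the equivalent internally via writev over the buffers of a composite). This is a minimal sketch, not BookKeeper code; `GatherWriteDemo` and its method names are hypothetical, and a `Pipe` stands in for the network socket:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.charset.StandardCharsets;

public class GatherWriteDemo {
    // Write header and payload as separate buffers in a single gathering
    // write call, without first copying them into one contiguous buffer.
    static long gatherWrite(Pipe.SinkChannel sink, ByteBuffer header, ByteBuffer payload)
            throws IOException {
        return sink.write(new ByteBuffer[] { header, payload });
    }

    public static void main(String[] args) throws IOException {
        Pipe pipe = Pipe.open();
        ByteBuffer header = ByteBuffer.wrap(new byte[] {0, 0, 0, 5}); // 4-byte size header
        ByteBuffer payload = ByteBuffer.wrap("hello".getBytes(StandardCharsets.US_ASCII));
        long written = gatherWrite(pipe.sink(), header, payload);
        System.out.println(written); // 9 = 4 header bytes + 5 payload bytes
    }
}
```

The gathering write avoids a copy, but as the issue notes, each extra buffer still pays the allocation, refcounting, and recycling costs listed above.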

To make matters worse, while the checksum is computed only once, the AddRequest is serialized again each time we write it on a connection.
E.g., with write-quorum=3, we end up using (2 * 3) + 1 = 7 ByteBuf instances per entry.
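The per-entry buffer count above can be written as a small model: with the serialized request and the frame-size header allocated per connection, and the payload shared across connections. This is only a sketch of the arithmetic in the issue; the class and method names are hypothetical:

```java
public class BufferCountEstimate {
    // Per-connection buffers: the serialized AddRequest + the 4-byte frame-size
    // header. The entry payload itself is shared across all connections.
    static int byteBufsPerEntry(int writeQuorum) {
        int perConnection = 2;  // serialized AddRequest + size header
        int shared = 1;         // the entry payload
        return perConnection * writeQuorum + shared;
    }

    public static void main(String[] args) {
        System.out.println(byteBufsPerEntry(3)); // 7, matching the (2 * 3) + 1 figure
    }
}
```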

Finally, while for big entries it is very important to avoid copying the payload, for small entries the overhead of maintaining a ByteBufList is greater than the cost of copying the payload into a single buffer.
For that we should do:

  1. If the entry is big -> keep using ByteBufList, with one buffer for all the headers and a second buffer referencing the payload, with no copy.
  2. If the entry is small -> allocate a single buffer and copy the headers and the payload into it.
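The two cases above can be sketched with `java.nio.ByteBuffer` standing in for Netty's ByteBuf. The class name and the threshold value are hypothetical (the real cutoff would be chosen by benchmarking); the point is only the shape of the decision:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

public class SmallEntryConsolidation {
    // Hypothetical cutoff; in practice this would be tuned by benchmarking.
    static final int SMALL_ENTRY_THRESHOLD = 1024;

    static List<ByteBuffer> encode(ByteBuffer header, ByteBuffer payload) {
        if (payload.remaining() >= SMALL_ENTRY_THRESHOLD) {
            // Big entry: keep header and payload as separate buffers, no copy.
            return Arrays.asList(header, payload);
        }
        // Small entry: one allocation, copy headers and payload into it.
        ByteBuffer merged = ByteBuffer.allocate(header.remaining() + payload.remaining());
        merged.put(header).put(payload);
        merged.flip();
        return Arrays.asList(merged);
    }
}
```

For a small entry this trades one memcpy for the refcounting, recycling, and cross-thread deallocation costs of the extra buffers, which is the trade-off the issue describes.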

Pending changes:

  • Avoid extra buffer to prepend frame size #3560 — add the 4-byte frame-size header when serializing the request, instead of relying on a separate Netty filter
  • Consolidate buffer for small entries on read-response
  • Serialize only once and consolidate small entries for add requests
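The first pending change above (folding the frame-size header into the request serialization) can be sketched as follows. This is an illustration with `java.nio.ByteBuffer`, not the actual #3560 patch; the class and method names are hypothetical, and the separate-filter approach it replaces corresponds to a Netty `LengthFieldPrepender`-style outbound handler:

```java
import java.nio.ByteBuffer;

public class FramedSerializer {
    // Reserve 4 bytes up front, write the serialized request body after them,
    // then backfill the length -- one buffer instead of a separate size-header
    // buffer prepended by a downstream handler.
    static ByteBuffer serializeWithFrameSize(byte[] requestBody) {
        ByteBuffer buf = ByteBuffer.allocate(4 + requestBody.length);
        buf.position(4);                      // skip the frame-size slot
        buf.put(requestBody);                 // serialized request body
        buf.putInt(0, requestBody.length);    // backfill the 4-byte size header
        buf.flip();
        return buf;
    }
}
```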
lhotari (Member) commented Oct 21, 2022

I wonder if the changes could skew USE metrics (BP-44)?

merlimat (Author) commented
If the timing trace is enabled on the executor, it will be measured and collected in exactly the same way as before.

In #3546 I've added a few more metrics that make it easier to chart submitted/completed/failed/rejected tasks.

The "failed" task counter in particular would be very useful to detect unexpected exceptions that might leave a future hanging.
