
OAK-10453 - Pipelined strategy: enforce size limit on memory taken by objects in the queue between download and transform thread #1130

Merged
tihom88 merged 27 commits into apache:trunk from nfsantos:OAK-10453
Oct 5, 2023

Conversation

@nfsantos
Contributor

@nfsantos nfsantos commented Sep 22, 2023

In the Pipelined strategy, the queue between the mongo-dump and the transform stage keeps a list of documents downloaded from Mongo. In the current implementation, this queue is bounded by a limit on the maximum number of documents that it can hold. However, this proved to be fragile because Mongo documents can vary widely in size: if the queue size is tuned for small documents, it will consume too much memory when the documents are large.

This PR changes the way this queue is bounded to take into account the memory used by the Mongo documents in the queue. It introduces the following configuration properties:

  • oak.indexer.pipelined.mongoDocQueueReservedMemoryMB (Default: 128) - How much memory is reserved for the queue between the download and transform stages.
  • oak.indexer.pipelined.mongoDocBatchSizeMB (Default: 4) - The approximate maximum size of a batch of documents.
  • oak.indexer.pipelined.mongoDocBatchMaxNumberOfDocuments (Default: 10000) - The maximum number of documents in a batch.

A batch of Mongo documents is added to the queue as soon as it reaches either the maximum memory usage or the maximum number of documents.
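The batching rule above can be sketched as follows. This is an illustrative stand-in, not the actual Oak class; the names, types, and the hard-coded limits (taken from the defaults listed above) are assumptions for the sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: accumulate documents into a batch and emit the batch to the queue
// when either the memory limit or the document-count limit is reached.
public class DocBatcher {
    static final long MAX_BATCH_BYTES = 4L * 1024 * 1024; // mongoDocBatchSizeMB = 4
    static final int MAX_BATCH_DOCS = 10_000;             // mongoDocBatchMaxNumberOfDocuments

    private final List<Object> docs = new ArrayList<>();
    private long batchBytes = 0;

    /** Returns the completed batch if adding this document filled it, otherwise null. */
    public List<Object> add(Object doc, long estimatedSizeBytes) {
        docs.add(doc);
        batchBytes += estimatedSizeBytes;
        if (batchBytes >= MAX_BATCH_BYTES || docs.size() >= MAX_BATCH_DOCS) {
            List<Object> full = new ArrayList<>(docs);
            docs.clear();
            batchBytes = 0;
            return full;
        }
        return null;
    }
}
```

Bounding by estimated bytes rather than by document count is what makes the queue's memory footprint predictable regardless of document size.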

Estimation of the size of a Mongo document

Estimating the size of the in-memory representation of a Mongo document can be done by traversing the document structure, but this would be too slow to do on the download thread, as it requires a series of if branches with type tests for every field of every document.

The PR takes a different approach, by taking advantage of the support for custom codecs in the Mongo driver.

Before this PR, we relied on the default codec of the Mongo driver, which parsed the BSON responses received from the server into a series of BasicDBObjects. These objects were then passed to the transform thread, which created a NodeDocument object (Oak API) and, from it, a series of NodeStateEntry instances.

This PR adds a new custom codec that parses the BSON stream directly to a NodeDocument, bypassing the creation of BasicDBObjects. This reduces the amount of work that the transform thread has to do, since it avoids the conversion from BasicDBObjects to NodeDocument.

Additionally, the custom codec has logic to estimate the size of the NodeDocument that it creates from the stream. This additional calculation is very cheap, because it is done at the moment when the size of each field is readily available.
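A rough model of this incremental accounting, decoupled from the driver: as each field is decoded, a running estimate is updated from sizes already at hand, so no second traversal is needed. The per-type costs below are illustrative heuristics, not Oak's actual numbers:

```java
// Sketch: accumulate an in-memory size estimate while fields are decoded,
// instead of traversing the finished document afterwards.
public class SizeEstimator {
    private long estimate = 0;

    public void onString(String name, String value) {
        // rough cost: object headers plus ~2 bytes per char for name and value
        estimate += 32 + 2L * name.length() + 2L * value.length();
    }

    public void onLong(String name) {
        estimate += 16 + 2L * name.length(); // boxed long plus field name
    }

    public long estimatedBytes() {
        return estimate;
    }
}
```

In the PR, this bookkeeping lives inside the custom codec's decode path, so the estimate is produced as a side effect of parsing the BSON stream.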

This change seems to have a neutral to slightly positive effect on the speed of the download thread.

Additional changes:

  • Adds a hook to receive events from the Mongo Java driver for commands sent and received, with a listener that logs at TRACE level the commands issued by the mongo-dump thread. This is useful to get metrics on the time that this thread spends waiting for Mongo and processing requests.
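The driver's hook for this is com.mongodb.event.CommandListener, whose callbacks receive CommandStartedEvent/CommandSucceededEvent/CommandFailedEvent objects. The sketch below shows the timing logic only; the small Started class is a self-contained stand-in for the driver's event types, and the names are assumptions, not the PR's actual listener:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: correlate command-started and command-succeeded events by request id
// to measure how long each Mongo command took.
public class CommandTimer {
    /** Stand-in for the driver's CommandStartedEvent. */
    public static class Started {
        final int requestId; final String commandName; final long nanos;
        public Started(int requestId, String commandName, long nanos) {
            this.requestId = requestId; this.commandName = commandName; this.nanos = nanos;
        }
    }

    private final Map<Integer, Started> inFlight = new ConcurrentHashMap<>();

    public void commandStarted(Started e) {
        inFlight.put(e.requestId, e);
    }

    /** Returns elapsed nanoseconds for the matching started event, or -1 if unknown. */
    public long commandSucceeded(int requestId, long nowNanos) {
        Started s = inFlight.remove(requestId);
        return s == null ? -1 : nowNanos - s.nanos;
    }
}
```

In the real listener the elapsed time would be written to the TRACE log rather than returned.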

@nfsantos nfsantos marked this pull request as ready for review September 25, 2023 06:35
Contributor

@fabriziofortino fabriziofortino left a comment


LGTM

@tihom88
Contributor

tihom88 commented Oct 5, 2023

In this approach we are adding some processing to the mongo download thread, which is the main bottleneck. Would it make sense to do these transformations on the download task thread?

@tihom88 tihom88 merged commit 0b6e538 into apache:trunk Oct 5, 2023
@nfsantos
Contributor Author

nfsantos commented Oct 5, 2023

In this approach we are adding some processing to the mongo download thread, which is the main bottleneck. Would it make sense to do these transformations on the download task thread?

The Mongo driver uses the thread that calls the Mongo iterator to perform the request to the server and to deserialize the response, so it is the mongo-dump thread that does all this work; there is no separate mongo download thread here.

Before this PR, the download and transform threads would do this work, where the BSON stream is provided by Mongo to the codec:

  • mongo-dump - BSON stream -> BasicDBObject (OOTB codec)
  • transform thread - BasicDBObject -> NodeDocument -> NodeStateEntry -> serialize to buffers

Now:

  • mongo-dump - BSON stream -> NodeDocument (custom codec)
  • transform thread - NodeDocument -> NodeStateEntry -> serialize to buffers

And the transformation BSON stream -> NodeDocument takes the same or less time than the conversion to BasicDBObject.

Overall, there is less work on the transform thread, while the mongo-dump thread does the same work as before. I did not find a way to pull the BSON stream -> X conversion out of the mongo-dump thread; this is controlled by the Mongo Java driver.

@nfsantos nfsantos deleted the OAK-10453 branch October 5, 2023 06:40
