[Connectors Python] Preemptively check bulk batch size against chunk_max_mem_size and perform pre-append flush by Jan-Kazlouski-elastic · Pull Request #4012 · elastic/connectors

Jan-Kazlouski-elastic · 2026-05-12T17:48:15Z

Closes https://github.com/elastic/search-team/issues/14453

Sink._run previously checked the batch thresholds after appending each
new doc, meaning a batch could be dispatched only once it had already exceeded
chunk_mem_size. For workloads with docs near the limit (e.g. five 1–2 MiB docs
against a 5 MiB ceiling) this produced bulk requests larger than configured,
risking 413/OOM errors downstream.

This PR moves the threshold check to fire pre-emptively: before adding a
doc, if the prospective batch size (bulk_size + doc_size) would exceed
chunk_mem_size — or the batch would exceed chunk_size entries — the current
batch is flushed first, then the new doc starts a fresh batch. An `if batch`
guard ensures an oversized single doc is still dispatched on its own rather
than producing an empty flush.
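
The difference between the two check placements can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in connectors/es/sink.py: `batch_pre_append` and `batch_post_append` are hypothetical helpers, each doc is represented only by its size in MiB, and the chunk_size entry cap is left out for brevity.

```python
def batch_pre_append(doc_sizes, chunk_mem_size):
    """Group doc sizes into batches, flushing *before* an append would overflow.

    Hypothetical helper for illustration; sizes and the cap share one unit.
    """
    batches = []
    batch, bulk_size = [], 0
    for doc_size in doc_sizes:
        # Flush first if adding this doc would push the batch past the cap.
        # The `if batch` part avoids dispatching an empty batch when a single
        # doc is larger than the cap on its own.
        if batch and bulk_size + doc_size > chunk_mem_size:
            batches.append(batch)
            batch, bulk_size = [], 0
        batch.append(doc_size)
        bulk_size += doc_size
    if batch:  # trailing flush for whatever is left
        batches.append(batch)
    return batches


def batch_post_append(doc_sizes, chunk_mem_size):
    """Previous behaviour: check the threshold only after appending."""
    batches = []
    batch, bulk_size = [], 0
    for doc_size in doc_sizes:
        batch.append(doc_size)
        bulk_size += doc_size
        if bulk_size >= chunk_mem_size:
            batches.append(batch)
            batch, bulk_size = [], 0
    if batch:
        batches.append(batch)
    return batches


print(batch_post_append([1, 1, 1, 1, 2], 5))  # [[1, 1, 1, 1, 2]] -> one 6 MiB bulk
print(batch_pre_append([1, 1, 1, 1, 2], 5))   # [[1, 1, 1, 1], [2]]
print(batch_pre_append([1, 17, 1], 5))        # [[1], [17], [1]]
```

The last call shows the oversized-doc case: the 17 MiB doc cannot be split, so it travels alone, and the batch accumulated before it is flushed first.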

Changes

- connectors/es/sink.py: relocate and tighten the flush condition in Sink._run so dispatched bulks stay at or below chunk_mem_size.
- tests/test_sink.py: add four async tests covering
  - the bug-report scenario (1,1,1,1,2 MiB vs. 5 MiB ceiling → two batches, never oversized),
  - chunk_size boundary flushing,
  - oversized single doc handled without an empty pre-flush,
  - trailing flush behavior unchanged when no in-loop flush triggers.
- NOTICE.txt: refresh idna (3.13 → 3.14) and propcache (0.4.1 → 0.5.2) entries to match current pinned dependencies.

Testing

Functional tests were run twice: once without the fix (on main) and once with it. The only extra change made for these runs was replacing get_mib_size with get_size so the logs show more precise sizes.

config.yml has chunk_max_mem_size: 1 #MiB

Command:

DATA_SIZE=medium MAX_DURATION=1800 REFRESH_RATE=2 make -C app/connectors_service ftest NAME=dir 2>&1 | tee ftest-dir.log

Before fix:
There are three batches dispatched above 1.0 MiB, and none of them are single-doc batches — proving the size cap is being violated by the multi-doc batching path itself.

[FMWK][17:20:30][DEBUG] [...] Task 3 - Sending a batch of 110 ops -- 1.1570510864MiB

[FMWK][17:20:33][DEBUG] [...] Task 1 - Sending a batch of 262 ops -- 1.1061477661MiB

[FMWK][17:20:33][DEBUG] [...] Task 1 - Sending a batch of 84 ops -- 17.8958511353MiB

110 ops = 55 docs packed into a 1.16 MiB bulk.
262 ops = 131 docs packed into a 1.11 MiB bulk.
84 ops = 42 docs packed into a 17.9 MiB bulk — almost 18× the configured limit, with plenty of room to have split the batch before the offending doc landed in it.

No multi-doc batch exceeds the limit in the fixed version:

Exactly one batch is dispatched above 1.0 MiB, and it is a single-document batch: the unavoidable case that the new code explicitly preserves via the if batch: guard.

[FMWK][17:03:31][DEBUG] [...] Task 2 - Sending a batch of 2 ops -- 17.4230194092MiB

2 ops = 1 doc. A single 17.4 MiB document cannot be split, so it must be sent on its own; the fixed code correctly flushes the previously accumulated batch first (see line 6683 immediately before it, which dispatches the queued 88 ops at 0.53 MiB) and then sends the oversized doc by itself.

Every other batch in the fixed log stays at or below ~1.0 MiB — the largest non-single-doc batches are values like 0.9718 MiB, 0.9711 MiB, 0.9125 MiB, etc., all comfortably under the ceiling.
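
A check like the one done by hand above can be scripted. The sketch below is an assumption-laden illustration: `oversized_batches` is a hypothetical helper, the regular expression is derived only from the log lines quoted in this description, and the single-doc test assumes every doc is an index/update (2 ops per doc), as in the analysis above.

```python
import re

# Pattern for the debug lines quoted above; the layout is taken from those
# samples and may differ in other versions of the framework.
BATCH_LINE = re.compile(r"Sending a batch of (\d+) ops -- ([\d.]+)MiB")


def oversized_batches(log_lines, limit_mib):
    """Return (ops, size_mib, is_single_doc) for every batch above the limit.

    Assumes 2 ops per doc (index/update only); a batch containing deletes
    would need a different rule to detect the single-doc case.
    """
    found = []
    for line in log_lines:
        match = BATCH_LINE.search(line)
        if not match:
            continue
        ops, size_mib = int(match.group(1)), float(match.group(2))
        if size_mib > limit_mib:
            found.append((ops, size_mib, ops == 2))
    return found


before = [
    "[FMWK][17:20:30][DEBUG] [...] Task 3 - Sending a batch of 110 ops -- 1.1570510864MiB",
    "[FMWK][17:20:33][DEBUG] [...] Task 1 - Sending a batch of 84 ops -- 17.8958511353MiB",
]
after = [
    "[FMWK][17:03:31][DEBUG] [...] Task 2 - Sending a batch of 2 ops -- 17.4230194092MiB",
]
for ops, size_mib, single in oversized_batches(before + after, limit_mib=1.0):
    print(f"{ops} ops, {size_mib:.2f} MiB, single doc: {single}")
```

Only oversized batches flagged as multi-doc indicate the bug; an oversized single-doc batch is the expected, unavoidable case.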

Checklists

Pre-Review Checklist

this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check config.yml.example)
this PR has a meaningful title
this PR links to all relevant github issues that it fixes or partially addresses
this PR has a thorough description
Covered the changes with automated tests
Tested the changes locally
Added a label for each target release version
For bugfixes: backport safely to all minor branches still receiving patch releases
Considered corresponding documentation changes

Release Note

[Optional] Fix Elasticsearch sink occasionally dispatching bulk requests larger than the
configured chunk_mem_size, which could trigger 413 Request Entity Too Large
or memory pressure on the cluster. Batches are now flushed pre-emptively so any
single bulk stays within the configured size and memory limits.

- Implement logic to flush the current batch before adding a new document if it exceeds the defined size or memory limits. This ensures that bulk requests remain within the specified constraints.
- Adjust the handling of oversized documents to ensure they are sent individually when necessary, maintaining the integrity of batch processing.

This change improves the efficiency and reliability of document handling in the Elasticsearch Sink.

- Introduce new tests to validate the behavior of the Sink's batch processing, ensuring it correctly handles memory overflow, chunk size boundaries, and oversized documents.
- Implement helper functions to facilitate the creation of mock queues and sink instances for testing.

These enhancements improve the reliability and correctness of the Sink's document handling in various scenarios.

- Bump `idna` from 3.13 to 3.14
- Bump `propcache` from 0.4.1 to 0.5.2

These updates ensure that the project is using the latest versions of these dependencies, which may include important bug fixes and improvements.

Jan-Kazlouski-elastic · 2026-05-13T06:25:29Z

@coderabbitai full review

Copilot

Pull request overview

This PR updates the Elasticsearch Sink batching logic to preemptively flush bulks before adding a new document when the next append would exceed configured batch memory limits, aiming to prevent bulk requests from surpassing chunk_mem_size and triggering downstream 413/OOM issues.

Changes:

Adjust Sink._run flush logic to evaluate thresholds before appending a new doc.
Add async tests covering memory-based pre-flush behavior, chunk-size boundary flushing, oversized single-doc behavior, and trailing flush behavior.
Refresh NOTICE.txt dependency entries for idna and propcache.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
app/connectors_service/connectors/es/sink.py	Moves flush condition to a pre-append check to keep bulk requests within `chunk_mem_size`.
app/connectors_service/tests/test_sink.py	Adds tests for the new preemptive flush behavior and related edge cases.
app/connectors_service/NOTICE.txt	Updates third-party notice versions for `idna` and `propcache`.
libs/connectors_sdk/NOTICE.txt	Updates third-party notice versions for `idna` and `propcache`.


artem-shelkovnikov

I don't really remember why we did not do it in the first place, I vaguely remember thinking about it and something stopped me from doing it.

I thought about this change throughout my day and could not find the reason not to merge it, but I'd love to hear a second opinion from somebody from @elastic/search-extract-and-transform just in case. It's a bit of a complicated topic.

…tively-check-the-batch-size

# Conflicts:
# app/connectors_service/NOTICE.txt
# libs/connectors_sdk/NOTICE.txt

- Update the batch processing logic to preemptively check the prospective entry count before flushing, ensuring that the batch size does not exceed the defined `chunk_size` during mixed operations (delete, index, update).
- Introduce new helper functions for creating update and delete documents in tests.
- Add a test case to validate the behavior of the Sink when handling mixed operations, ensuring that batch sizes remain within limits.

These changes improve the reliability of document handling in the Elasticsearch Sink, particularly when dealing with varying operation types.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

- Simplify batch dispatch logic by consolidating document creation helper functions for better readability and maintainability.
- Improve pre-append and post-append checks to ensure batches are dispatched correctly based on size and memory constraints, enhancing performance and reducing latency.
- Update tests to validate the new logic, ensuring that batch sizes remain within defined limits during mixed operations and that oversized documents are handled appropriately.

These changes optimize the document handling process in the Elasticsearch Sink, ensuring efficient and reliable batch processing.

- Introduce new helper functions for document ID extraction and queue management to improve test readability and maintainability.
- Update test cases to ensure proper flushing behavior before exceeding chunk size and memory limits, validating the handling of mixed operations and oversized documents.
- Rename test functions for clarity, reflecting their purpose more accurately.

These changes optimize the testing framework for the Elasticsearch Sink, ensuring robust validation of batch processing logic.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

seanstory

One nit, but lgtm. Great fix!

seanstory · 2026-05-14T20:58:08Z

                    self._logger.warning(f"Skip document {doc} as '_id' is missing.")
                    continue
+                # Flush before adding if this doc would overflow either cap.
+                # `_bulk_op` emits 1 entry for deletes and 2 for index/update,


Yes, but, I don't think that's really what we want to be counting. The metadata object in an index/update is trivially small. Like:

{ "index" : { "_index" : "test", "_id" : "1" } }

I don't think we need to count that as part of the "chunk size".

I think we can do len(batch) + 1 > self.chunk_size...

And I think that's consistent with the old behavior where:

batch.extend(self._bulk_op(doc, operation))

is always adding 1 element to batch, even though doc and operation are technically two separate objects.

Jan-Kazlouski-elastic · 2026-05-15

I respectfully disagree. First of all, I would like to set the record straight in terms of vocabulary.

Term	Meaning
doc/file	one source record
entry/line	one element in `batch`; `_bulk_op` produces 1 for delete (action line), 2 for index/update (action line + source line)
`len(batch)`	counts entries (since `batch.extend(...)` flattens)
`chunk_size`	historically `len(batch) >= chunk_size`, i.e. caps entries, not docs

`len(batch) + 1 > chunk_size` doesn't fix the bug

chunk_size=4, sequence delete + index + index:

Step	`len` before	Pre `len+1 > 4`?	After append	Post `len >= 4`?
delete (+1)	0, skip	–	1	no
index (+2)	1	no	3	no
index (+2)	3	no	5	yes → ships 5-entry batch

The post-check fires after the cap is already broken; it can dispatch but can't un-append. `+1` undercounts every index/update by one; the pre-check has to predict the actual extend delta, which is 1 or 2.

This also makes the logs confusing: elasticsearch.bulk.chunk_size is n, yet the user sees logs like:

Task 1 - Sending a batch of n+1 ops
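
The trace above can be replayed in a few lines. This is a sketch under stated assumptions rather than the Sink code: `simulate` is a hypothetical helper that tracks only the number of entries in the batch, and `predict_entries` stands for whatever delta the pre-check assumes the next op will add.

```python
def simulate(entry_counts, chunk_size, predict_entries):
    """Replay the pre-/post-append checks over a sequence of ops.

    `entry_counts` holds the number of bulk entries each op adds
    (1 for delete, 2 for index/update). `predict_entries(n)` is what the
    pre-check assumes the next op will add. Returns dispatched batch lengths.
    Hypothetical helper for illustration only.
    """
    shipped, length = [], 0
    for entries in entry_counts:
        # Pre-append check (skipped while the batch is empty).
        if length and length + predict_entries(entries) > chunk_size:
            shipped.append(length)
            length = 0
        length += entries  # batch.extend(...) adds 1 or 2 entries
        # Post-append check: can dispatch, but cannot un-append.
        if length >= chunk_size:
            shipped.append(length)
            length = 0
    if length:
        shipped.append(length)  # trailing flush
    return shipped


ops = [1, 2, 2]  # delete, index, index
print(simulate(ops, 4, lambda n: 1))  # [5]: the cap of 4 is exceeded
print(simulate(ops, 4, lambda n: n))  # [3, 2]: every batch stays within 4
```

With the constant `+1` prediction the third op slips past the pre-check and the post-check ships a 5-entry batch; predicting the real delta flushes the 3-entry batch first.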

Clear separation of source docs and Elasticsearch entries

Judging from

> metadata object in an index/update is trivially small

> adding 1 element to batch, even though doc and operation are technically two separate objects

> I don't think we need to count that as part of the "chunk size".

I believe your actual proposal is that chunk_size should mean docs, not entries:

It's a behavior change: existing operators tuned against entries, so switching silently ~doubles batch size for index/update workloads.

The right implementation is a dedicated doc counter, not a magic +1:

    docs_in_batch += 1  # after extend
    if docs_in_batch + 1 > self.chunk_size or ...:  # pre
    if docs_in_batch >= self.chunk_size or ...:  # post

That would be more understandable to the user:

1000 new docs result in a batch of 1000 docs, which is actually 2000 entries (operation + source).

1000 deleted docs result in a batch of 1000 entries (operation only).

That should be part of a separate enhancement PR, not a bug fix.
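
To make the docs-versus-entries arithmetic concrete, here is a small sketch. `bulk_op` is a hypothetical stand-in for `_bulk_op`, and the constants, index name, and payload shapes are assumptions chosen for illustration; only the 1-entry-for-delete, 2-entries-for-index/update rule comes from the discussion above.

```python
OP_INDEX, OP_UPDATE, OP_DELETE = "index", "update", "delete"  # assumed constants


def bulk_op(doc, operation):
    """Hypothetical stand-in for `_bulk_op`: one action line, plus a source
    line for index/update. Payload shapes here are illustrative assumptions."""
    action = {operation: {"_index": "test", "_id": doc["id"]}}
    if operation == OP_DELETE:
        return [action]  # a delete has no source line
    if operation == OP_UPDATE:
        return [action, {"doc": doc}]
    return [action, doc]


def count(docs, operation):
    """Return (number of docs, number of entries in the flattened batch)."""
    batch = []
    for doc in docs:
        batch.extend(bulk_op(doc, operation))  # extend flattens into entries
    return len(docs), len(batch)


docs = [{"id": str(i), "title": f"doc {i}"} for i in range(1000)]
print(count(docs, OP_INDEX))   # (1000, 2000)
print(count(docs, OP_DELETE))  # (1000, 1000)
```

Because `extend` flattens the returned list, `len(batch)` measures entries, which is why a cap expressed in docs would need its own counter.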

Proposal

Keep entry-counting in this PR: it preserves the existing contract and only fixes the overshoot. entries = 1 if op == OP_DELETE else 2 isn't a magic number; it mirrors _bulk_op one line above, so both move together if it ever changes.

> metadata object in an index/update is trivially small

Actually, the metadata line is essentially the same size for all three ops.

> I don't think we need to count that as part of the "chunk size"

Not counting metadata would stop us from counting delete operations, because the metadata line is the only part of a delete operation.

> even though doc and operation are technically two separate objects

operation is just an operation enum; it is not added separately, but that is a nitpick.

seanstory · 2026-05-15

> counts entries (since batch.extend(...) flattens)

🤦 you're right, and this was the key bit that I missed. My brain read this as equivalent to append.

> I believe your actual proposal is that chunk_size should mean docs, not entries

Yep. And I'd always been under the impression that's what this was. And I expect that was the author's impression too. But you're right, even if this has long been a misunderstanding on my/our side, changing it now would be a behavior change. I agree with your conclusion, we should ignore my comment and leave this as you implemented it.

Thanks for pushing back. 🫡

…tively-check-the-batch-size

github-actions · 2026-05-15T14:07:36Z

💔 Failed to create backport PR(s)

Status	Branch	Result
❌	9.5	The branch "9.5" is invalid or doesn't exist
✅	9.4	#4025
✅	9.3	#4026

Successful backport PRs will be merged automatically after passing CI.

To backport manually run:
backport --pr 4012 --autoMerge --autoMergeMethod squash

…chunk_max_mem_size and perform pre-append flush (#4012) (#4026)

Backports the following commits to 9.3:
- [Connectors Python] Preemptively check bulk batch size against chunk_max_mem_size and perform pre-append flush (#4012)

Co-authored-by: Jan-Kazlouski-elastic <jan.kazlouski@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

…chunk_max_mem_size and perform pre-append flush (#4012) (#4025)

Backports the following commits to 9.4:
- [Connectors Python] Preemptively check bulk batch size against chunk_max_mem_size and perform pre-append flush (#4012)

Co-authored-by: Jan-Kazlouski-elastic <jan.kazlouski@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

…ce-max-document-size-config

Integrates upstream PR #4012 (preemptive `chunk_max_mem_size` flush in `Sink._run`) on top of the local `max_text_document_size` cap.

Conflicts in `connectors/es/sink.py`:
- `Sink._run` cap-check + pre-flush ordering: build `ops = self._bulk_op(doc, operation)` once, then run the text cap check (drop + `continue` if oversized text doc), then the upstream prospective pre-flush, then stats accounting, then `batch.extend(ops)` and the upstream post-append flush. Replaced upstream's inline `entries = 1 if OP_DELETE else 2` with `len(ops)` since the bulk op list is already built.

Conflicts in `tests/test_sink.py`:
- Both branches added a top-level `_make_sink` helper with incompatible signatures. Renamed the local cap-test factory to `_make_cap_sink` and kept upstream's generic `_make_sink(queue, *, chunk_size, chunk_mem_size)` unchanged so the new chunk-flush tests need no edits.
- Local cap tests switched to upstream's `_queue_yielding` helper to drop the duplicate `_queue_with_items`.

Co-authored-by: Cursor <cursoragent@cursor.com>

Jan-Kazlouski-elastic added 3 commits May 12, 2026 14:36

Update NOTICE.txt to reflect dependency version bumps

fa20b5c


Jan-Kazlouski-elastic requested a review from artem-shelkovnikov May 12, 2026 17:48

Jan-Kazlouski-elastic self-assigned this May 12, 2026

Jan-Kazlouski-elastic requested a review from a team as a code owner May 12, 2026 17:48

Jan-Kazlouski-elastic added v9.5.0 v9.4.1 v9.3.5 labels May 12, 2026

github-actions Bot added the auto-backport label May 12, 2026

Update NOTICE.txt

56cadbb

Jan-Kazlouski-elastic requested a review from Copilot May 13, 2026 06:25

Copilot started reviewing on behalf of Jan-Kazlouski-elastic May 13, 2026 06:26

Copilot AI reviewed May 13, 2026


Comment thread app/connectors_service/connectors/es/sink.py Outdated

Comment thread app/connectors_service/tests/test_sink.py

artem-shelkovnikov reviewed May 13, 2026


Jan-Kazlouski-elastic added 2 commits May 14, 2026 17:24

Merge remote-tracking branch 'origin/main' into jan-kazlouski/pre-emp…

9ebe2f7


Jan-Kazlouski-elastic requested review from artem-shelkovnikov and Copilot May 14, 2026 15:40

Copilot started reviewing on behalf of Jan-Kazlouski-elastic May 14, 2026 15:40

Copilot AI reviewed May 14, 2026


Comment thread app/connectors_service/connectors/es/sink.py Outdated

Comment thread app/connectors_service/connectors/es/sink.py Outdated

Jan-Kazlouski-elastic added 2 commits May 14, 2026 20:17

Jan-Kazlouski-elastic requested a review from Copilot May 14, 2026 18:33

Copilot started reviewing on behalf of Jan-Kazlouski-elastic May 14, 2026 18:33

Copilot AI reviewed May 14, 2026


seanstory approved these changes May 14, 2026


Merge remote-tracking branch 'origin/main' into jan-kazlouski/pre-emp…

1e27978

…tively-check-the-batch-size

Jan-Kazlouski-elastic merged commit 51cfce3 into main May 15, 2026
2 checks passed

Jan-Kazlouski-elastic deleted the jan-kazlouski/pre-emptively-check-the-batch-size branch May 15, 2026 14:07

Jan-Kazlouski-elastic merged 9 commits into main from jan-kazlouski/pre-emptively-check-the-batch-size
