Reindex - option for batch memory size not just document count #57535

Open
costin opened this issue Jun 2, 2020 · 1 comment
Labels: :Distributed/Reindex, >enhancement, Team:Distributed

Comments

@costin
Member

costin commented Jun 2, 2020

When dealing with large documents, the Reindex API has no dedicated option to keep batch memory from blowing up. The closest is size:

size
(Optional, integer) The number of documents to index per batch. Use when indexing from remote to ensure that the batches fit within the on-heap buffer, which defaults to a maximum size of 100 MB.
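
For context, size is set under source in the reindex request body. The following is only a minimal sketch of such a body for a reindex from remote; the host and index names are placeholders, and build_reindex_body is a hypothetical helper, not part of any client library.

```python
import json


def build_reindex_body(remote_host, source_index, dest_index, batch_size):
    """Build a reindex-from-remote request body (hypothetical helper).

    batch_size maps to source.size: a document count per batch,
    which says nothing about how many bytes the batch holds.
    """
    return {
        "source": {
            "remote": {"host": remote_host},  # placeholder host
            "index": source_index,
            "size": batch_size,  # documents per batch, not bytes
        },
        "dest": {"index": dest_index},
    }


# Placeholder values, for illustration only.
body = build_reindex_body("http://otherhost:9200", "source-index", "dest-index", 10)
print(json.dumps(body, indent=2))
```

Note that the only knob here counts documents; nothing in the body bounds the size in bytes of a batch.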

However, that is not enough: a few large documents can push a batch well beyond 100 MB, no matter how small the other documents in the batch are.
For example:

send message failed [channel: Netty4TcpChannel{localAddress=/1.2.3.4:12345, remoteAddress=/6.7.8.9:98765}]
java.lang.IllegalArgumentException: ReleasableBytesStreamOutput cannot hold more than 2GB of data
	at org.elasticsearch.common.io.stream.BytesStreamOutput.ensureCapacity(BytesStreamOutput.java:156) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.io.stream.ReleasableBytesStreamOutput.ensureCapacity(ReleasableBytesStreamOutput.java:70) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.io.stream.BytesStreamOutput.writeBytes(BytesStreamOutput.java:90) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.CompressibleBytesOutputStream.writeBytes(CompressibleBytesOutputStream.java:85) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:461) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.bytes.BytesReference.writeTo(BytesReference.java:90) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.io.stream.StreamOutput.writeBytesReference(StreamOutput.java:205) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.search.SearchHit.writeTo(SearchHit.java:856) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.search.SearchHits.writeTo(SearchHits.java:257) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.search.fetch.FetchSearchResult.writeTo(FetchSearchResult.java:102) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:70) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:53) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:107) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:93) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.OutboundHandler$SendContext.get(OutboundHandler.java:140) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.OutboundHandler.internalSendMessage(OutboundHandler.java:78) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:70) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:748) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:732) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:64) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:47) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:30) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.search.SearchService$3.doRun(SearchService.java:381) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.7.jar:6.8.7]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_242]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_242]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]

The only workaround in this case is to keep decreasing size until the batch memory size is acceptable. Without a dedicated option for that, it takes a lot of guesswork.
It would be useful to have a companion to size, say max_memory or memory_size, that limits the memory of a batch.
A reindex batch would then be bounded not just by its number of documents but also by its memory size.
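
To make the request concrete, here is a rough sketch of batching bounded by both a document count and a byte budget. It is an illustration only, not how Elasticsearch builds reindex batches: batch_by_count_and_bytes, max_docs and max_bytes are hypothetical names, with max_bytes standing in for the proposed max_memory / memory_size option, and documents are modelled as raw byte strings.

```python
from typing import Iterable, Iterator, List


def batch_by_count_and_bytes(
    docs: Iterable[bytes], max_docs: int, max_bytes: int
) -> Iterator[List[bytes]]:
    """Yield batches that stay within max_docs and, where possible, max_bytes.

    Hypothetical sketch: a single document larger than max_bytes is still
    emitted, alone in its own batch, since it cannot be split.
    """
    batch: List[bytes] = []
    batch_bytes = 0
    for doc in docs:
        doc_bytes = len(doc)
        # Close the current batch if adding this doc would break either limit.
        if batch and (len(batch) >= max_docs or batch_bytes + doc_bytes > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch


docs = [b"a" * 10, b"b" * 10, b"c" * 50, b"d" * 5, b"e" * 5]
batches = list(batch_by_count_and_bytes(docs, max_docs=3, max_bytes=25))
sizes = [[len(d) for d in b] for b in batches]
print(sizes)  # [[10, 10], [50], [5, 5]]
```

The 50-byte document exceeds the 25-byte budget on its own, so it ends up in a batch by itself rather than dragging its neighbours over the limit. In the stack trace above the failure occurs while the search hits are being serialized for the response, so an actual option would presumably have to be enforced where the batch is fetched, not on the client as in this sketch.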

costin added the >enhancement, needs:triage, and :Distributed/Reindex labels Jun 2, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Reindex)

elasticmachine added the Team:Distributed label Jun 2, 2020
romseygeek removed the needs:triage label Jun 3, 2020