[Reporting] Add support of chunked export #108485
Conversation
Pinging @elastic/kibana-app-services (Team:AppServices)
This is looking really great @dokmic! Very exciting to see this moving forward. I have just a couple of thoughts I'd like to get your feedback on, I left minor comments in the code (for the most part this looks really solid).
I wonder whether the 100mb default chunk size is too large. I'd defer to Tim here, but to improve memory use on Kibana instances I think smaller chunks might yield better results. I've not tested this, just guessing, because we don't know what else Kibana might be handling at any given moment. I'm also not sure how big a report needs to be to hit this chunk limit, but it seems very large. With Chromium effectively putting the entire report into memory when writing already, I suspect we can extract more gains at write/generation time if we chunk more aggressively.

As a follow-up to this work, I wonder if there are UX edge cases that should be handled, e.g.:
set<any>({}, 'body.hits.hits.0._source', {
  jobtype: 'pdf',
  output: {
    content: '12',
Is the size value (set in the tests above) optional?
It is optional in the typings, so I guess it is optional for some old reports.
size should only be undefined for any report that isn't completed or completed_with_warnings.
expect(client.search).toHaveBeenCalledTimes(3);
});

it('should decode every chunk separately', async () => {
suggestion: maybe we could combine "should decode base64 encoded content" and "should decode every chunk separately" into just "should decode every chunk separately", as they seem to overlap quite a bit.
Those are two separate cases. The first one also covers the old reports, which we don't break down into chunks.
return hits?._source?.output.content;
}

private isRead() {
might be a bit clearer:
- private isRead() {
+ private hasReadAllContent() {
The class is about working with the content, and that's already in its name, so repeating it in the method name seems a bit redundant to me. I would rather keep it as is. It is a private method anyway.
_write(chunk: Buffer | string, _encoding: string, callback: Callback) {
  this.buffer += typeof chunk === 'string' ? chunk : chunk.toString();
  callback();

private async clear() {
This looks like it will only delete the non-parent documents currently. It might be more accurate to name this function clearNonHeadChunks, or should we update it to also set the parent chunk content to ''? Just a thought, not necessary to change.
That makes sense. I've renamed that to removeChunks since we are not really clearing them.
}

private isRead() {
  return this.jobSize != null && this.bytesRead >= this.jobSize;
just curious, when will jobSize be null | undefined?
That's defined in the report document typings. I guess that might be empty.
The typings are that way to support pending jobs, I believe
/**
 * @note The Elasticsearch `http.max_content_length` includes the whole POST body,
 * but the update/index request also contains JSON-serialized query parameters.
 * A 1Kb span should be enough for that.
This should be fine, since id, index, if_primary_term, if_seq_no should all fit in that 1kb space.
This is the default.
Users do not have direct access to the
This is great! It's really cool being able to create exports that span multiple documents.
The code is working very well and was written clearly - thank you for that.
One point of feedback I have is on the logging. When debug logging is on, the chunking needs to be a bit more visible for troubleshooting purposes.
Otherwise, LGTM once PR is green.
I reviewed the code changes and tested CSV chunking by tweaking a line of code:
- const maxContentSize = get(settings, 'http.max_content_length', '100mb');
+ const maxContentSize = get(settings, 'http.max_content_lengf', '4mb');
const maxChunkSize = await this.getMaxChunkSize();

while (this.buffer.byteLength >= maxChunkSize) {
  await this.flush(maxChunkSize);
Can we add a debug line that prints out every time a chunk of output gets flushed?
I have added more extensive logging that should help with debugging. Thanks for pointing this out 👍
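For illustration, here is a minimal sketch of what flushing with a debug line could look like; the ChunkedWritable name, the flushChunk callback, and the Logger shape are assumptions for this sketch, not the PR's actual code.

```ts
import { Writable } from 'stream';

type Logger = { debug: (message: string) => void };

// Sketch only: buffer incoming data and flush a chunk whenever the buffer
// grows past the maximum chunk size, logging every flush at debug level.
class ChunkedWritable extends Writable {
  private buffer = Buffer.alloc(0);

  constructor(
    private readonly maxChunkSize: number,
    private readonly flushChunk: (chunk: Buffer) => Promise<void>,
    private readonly logger: Logger
  ) {
    super();
  }

  async _write(chunk: Buffer | string, _encoding: string, callback: (error?: Error | null) => void) {
    try {
      const data = typeof chunk === 'string' ? Buffer.from(chunk) : chunk;
      this.buffer = Buffer.concat([this.buffer, data]);

      while (this.buffer.byteLength >= this.maxChunkSize) {
        const head = this.buffer.subarray(0, this.maxChunkSize);
        this.buffer = this.buffer.subarray(this.maxChunkSize);

        this.logger.debug(`Flushing a chunk of ${head.byteLength} bytes`);
        await this.flushChunk(head);
      }

      callback();
    } catch (error) {
      callback(error as Error);
    }
  }
}
```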
That was indeed a bug 👍 I completely missed that part. I have fixed that by performing content overwrite since the delete operation should remain separate if we decide to add more storage types.
@tsullivan There is an easier way to generate large reports. We can create an index pattern
> That was indeed a bug 👍 I completely missed that part. I have fixed that by performing content overwrite since the delete operation should remain separate if we decide to add more storage types.
Great, thanks for addressing that and adding a test.
}

private async getJobContentEncoding() {
  if (!this.jobContentEncoding) {
We should make encoding an option and remove the dependency on reporting. This way the stream will be reusable outside of reporting (and will be simpler).
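One way this could look, as a rough sketch only: the encoding is passed in as a constructor option instead of being looked up from the reporting job. The ContentStreamOptions name and the 'base64' | 'raw' union are assumptions, not the final API.

```ts
type ContentEncoding = 'base64' | 'raw';

interface ContentStreamOptions {
  encoding?: ContentEncoding; // defaults to 'raw'
}

// Sketch only: the stream owns its encoding and no longer needs to know
// anything about reporting job types.
class ContentStream {
  private readonly encoding: ContentEncoding;

  constructor(options: ContentStreamOptions = {}) {
    this.encoding = options.encoding ?? 'raw';
  }

  protected decode(content: string): Buffer {
    return this.encoding === 'base64' ? Buffer.from(content, 'base64') : Buffer.from(content);
  }

  protected encode(buffer: Buffer): string {
    return this.encoding === 'base64' ? buffer.toString('base64') : buffer.toString();
  }
}
```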
* Decouple job error fetching from the content stream
* Encapsulate content encoding and decoding in the content stream
* Move report size calculation from task runners
* Remove configuration check from the reporting diagnostics
* Add support of chunked export
💔 Build Failed

Failed CI Steps / Test Failures:
* Kibana Pipeline / general / Chrome X-Pack UI Functional Tests: x-pack/test/functional/apps/dashboard_mode/dashboard_view_mode·js: dashboard mode Dashboard View Mode Dashboard viewer "before all" hook: Create dashboard only mode user for "shows only the dashboard app link"
* Kibana Pipeline / general / Chrome X-Pack UI Functional Tests: x-pack/test/functional/apps/security/users·js: security app users should show the default roles
Metrics [docs]: Async chunks, Page load bundle; Unknown metric groups: API count, API count missing comments
Summary
This pull request adds support for chunked exports in the reporting plugin and resolves #18322.
Chunk Structure
Chunks are stored in separate documents in the same index as the report document. Chunks refer to their parent in the parent_id field and define their order in the output.chunk field.
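To make the layout concrete, here is a simplified sketch of what the head document and one of its chunks could look like. The document IDs and the surrounding fields are illustrative; only parent_id and output.chunk reflect the structure described above.

```ts
// Illustrative only: a trimmed-down view of a report's head document and one
// of its chunk documents stored in the same index.
const headDocument = {
  _id: 'report-123',
  _source: {
    jobtype: 'printable_pdf',
    output: {
      content: '<first slice of the report content>',
      size: 157286400, // total report size in bytes
    },
  },
};

const chunkDocument = {
  _id: 'report-123.1',
  _source: {
    parent_id: 'report-123', // points back to the head document
    output: {
      content: '<next slice of the report content>',
      chunk: 1, // position of this chunk within the report
    },
  },
};
```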
Chunk Size
The feature doesn't require any configuration to be turned on. It uses the http.max_content_length cluster setting to determine the chunk size. If the setting is not defined, the report will be split into 100mb chunks. Since this option covers the entire POST body, the content part will be smaller than the option's value: 1Kb is always subtracted for the serialized JSON structure of the body. For PNG and PDF exports, the content is Base64-encoded, so the chunk size will be ⌊(100mb - 1kb) / 4⌋ * 3 ≈ 75mb. For CSV exports, it will be ⌊(100mb - 1kb) / 2⌋ ≈ 50mb, since we assume that every character can double in size when serialized.
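The arithmetic above can be sketched roughly as follows; the function name and the exact rounding are illustrative assumptions, but the idea is to subtract the 1Kb span from http.max_content_length and then account for the content encoding.

```ts
const ONE_KB = 1024;

// Rough sketch of the chunk-size arithmetic described above, not the exact implementation.
function getMaxChunkSize(maxContentLength: number, encoding: 'base64' | 'raw'): number {
  const usable = maxContentLength - ONE_KB; // leave ~1Kb for the serialized request body

  return encoding === 'base64'
    ? Math.floor(usable / 4) * 3 // base64 inflates 3 raw bytes into 4 characters
    : Math.floor(usable / 2); // assume every character may double when serialized
}

// With the 100mb default: ~75mb for PNG/PDF (base64) and ~50mb for CSV.
getMaxChunkSize(100 * 1024 * 1024, 'base64'); // 78642432 bytes
getMaxChunkSize(100 * 1024 * 1024, 'raw'); // 52428288 bytes
```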
Writing
Every chunk is encoded separately to provide backward compatibility with already existing reports and to reduce memory pressure. That way, encoded chunks can be streamed to Elasticsearch one by one, and the memory they occupied can be released.
Reading
The content stream reads chunks until it reaches output.size bytes or meets an empty or non-existing chunk. The latter means either a malformed report or an incorrect report size value.
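A minimal sketch of that read loop is shown below; the fetchChunk callback, the base64-per-chunk decoding, and the error message are assumptions made for illustration, not the actual implementation.

```ts
// Sketch only: pull chunks one by one until the reported size has been read,
// failing if an empty or missing chunk is met before that point.
async function* readContent(
  fetchChunk: (index: number) => Promise<string | undefined>,
  jobSize: number
): AsyncGenerator<Buffer> {
  let bytesRead = 0;

  for (let index = 0; bytesRead < jobSize; index++) {
    const content = await fetchChunk(index);

    if (!content) {
      // Either a malformed report or an incorrect report size value.
      throw new Error('Unable to read the remaining report content.');
    }

    const buffer = Buffer.from(content, 'base64'); // every chunk is decoded separately
    bytesRead += buffer.byteLength;

    yield buffer;
  }
}
```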
Memory Usage
The content stream buffers all written data until it reaches the chunk size. After the chunk is written, the memory is released.
Reads are performed lazily, and all read data is passed directly down to the consumers/listeners. Because of that, large reports can be downloaded without putting any pressure on memory.
Testing
Make sure to allocate enough heap when running Elasticsearch; otherwise, it will crash with an OutOfMemory exception when handling large requests. It needs at least 2Gb (-Xmx2g) for 100Mb requests.
Checklist
For maintainers
Release Notes
Removed the limit on report size by chunking large reports across multiple documents. This feature also reduces memory usage in the reporting plugin by streaming the report data directly instead of keeping it in memory.