Simplify InputStreamResponseTransformer subscription feedback #16

Closed · wants to merge 2 commits

Conversation

@grddev (Contributor) commented Apr 22, 2020

The previous approach attempted to throttle the S3 download by measuring the bytes received, since we can only request chunks of unknown size from the subscription, rather than a specified number of bytes. While the previous commit avoided the 10x buffering, the ramp-up in capacity becomes slow, and doing something better would require additional bookkeeping and guesswork.

Instead of trying to track the number of in-flight bytes, this settles for just keeping 30 chunks in flight. When running both locally and in our production environment, the typical chunk size seems to land around 1500 bytes, which means this effectively aims for a target buffer of about 45 kB in flight. This is much smaller than the previous 32 MB, but given that we want to process a CSV result line by line, 45 kB should typically be enough to cover at least a handful of lines. It is also not clear how the chunk size is derived, or whether it will stay consistent; if it were bumped to 8 kB, the 30 in-flight requests would correspond to about 240 kB instead.
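As a rough sketch of the fixed-window idea (this is not the actual InputStreamResponseTransformer code; the class and method names below are made up), a Reactive Streams subscriber can prime the subscription with 30 requests and then ask for one replacement each time a chunk is consumed:

```java
import java.nio.ByteBuffer;
import org.reactivestreams.Subscriber;
import org.reactivestreams.Subscription;

// Hypothetical sketch: keep a fixed number of chunks in flight,
// independent of how many bytes each chunk happens to carry.
class FixedWindowSubscriber implements Subscriber<ByteBuffer> {
    private static final int TARGET_IN_FLIGHT = 30;
    private Subscription subscription;

    @Override
    public void onSubscribe(Subscription s) {
        subscription = s;
        s.request(TARGET_IN_FLIGHT); // prime the window with 30 chunks
    }

    @Override
    public void onNext(ByteBuffer chunk) {
        deliverToStream(chunk); // hand the chunk to the InputStream side
    }

    // Called by the consuming side once a chunk has been fully read;
    // requesting one replacement keeps roughly 30 chunks outstanding.
    void onChunkConsumed() {
        subscription.request(1);
    }

    @Override
    public void onError(Throwable t) { /* propagate to the reader */ }

    @Override
    public void onComplete() { /* signal end of stream */ }

    private void deliverToStream(ByteBuffer chunk) { /* queue for the InputStream */ }
}
```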

The first version of this just reduced the fetch size from ten to one, which ought to work around the problem, but it's not really clear to me whether the extra logic brings much additional value over that simpler approach.

In short, the problem with the previous version was that the throttling didn't take into account how much we had already requested, and thus it would over-request data by at least 10x.

The somewhat longer version is that whenever we receive (or finish) a chunk of the response and we are below the threshold, we request 10 more chunks. With the threshold set to 32 MB and the typical chunk size around 1500 bytes, we will have processed well over 22k chunks before we reach the threshold. Each and every one of those times, we request 10 additional chunks (plus another 10 if the chunk was non-empty and we finished it before reaching the threshold). This means we sign up for 10x the number of chunks required to reach the threshold, and we will buffer up to that amount, as there is no way of telling the producer we want less. Assuming the chunk sizes stay consistent, this behavior is independent of the actual chunk size.
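In sketch form, the failure mode looks roughly like this (the names and structure are reconstructed from the description above, not copied from the original code):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;
import org.reactivestreams.Subscription;

// Reconstruction of the over-requesting pattern described above;
// names and constants are assumptions, not the original source.
class ByteThresholdFeedback {
    private static final long THRESHOLD = 32L * 1024 * 1024; // 32 MB
    private final AtomicLong buffered = new AtomicLong();
    private Subscription subscription;

    void onNext(ByteBuffer chunk) {
        buffered.addAndGet(chunk.remaining());
        // ... queue the chunk for the InputStream side ...
        if (buffered.get() < THRESHOLD) {
            // Fires for every one of the ~22k chunks needed to fill
            // 32 MB at ~1500 bytes/chunk, with no accounting for what
            // is already outstanding: ~220k chunks requested in total.
            subscription.request(10);
        }
    }
}
```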

By changing the number of chunks we request to 1, we only grow the number of in-flight chunks when finishing a chunk, but since we check the threshold at that point as well, and also only increase by 1 there, we can't really overshoot.
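Continuing the sketch above (same assumed names), the first version's change amounts to:

```java
// Same assumed skeleton as above, with the fetch size reduced to 1.
void onNext(ByteBuffer chunk) {
    buffered.addAndGet(chunk.remaining());
    // ... queue the chunk for the InputStream side ...
    if (buffered.get() < THRESHOLD) {
        subscription.request(1); // replace the chunk we just received
    }
}

// Called once the reader has fully consumed a chunk of the given size.
void onChunkConsumed(int size) {
    if (buffered.addAndGet(-size) < THRESHOLD) {
        subscription.request(1); // the only place the window can grow
    }
}
```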

One potential drawback of this approach is that the ramp-up in buffer size depends on the rate and amount of data processed.
@grddev self-assigned this Apr 22, 2020
@stenlarsson (Contributor) left a comment

Nice! 👍

@grddev force-pushed the input-stream-response-transformer-feedback branch from 0e28052 to 3f80e73 on April 28, 2020 at 14:17
@grddev (Contributor, Author) commented Apr 28, 2020

As noted in #19, this turned out to be too slow, and picking a reasonable hardcoded limit for the number of chunks is difficult without knowing the internals of the aws-sdk, so I'm closing this in favor of #19.

@grddev closed this Apr 28, 2020