Simplify InputStreamResponseTransformer subscription feedback #16

Closed · wants to merge 2 commits

Conversation

@grddev (Contributor) commented Apr 22, 2020

The previous approach attempted to throttle the S3 download by measuring the bytes received, since we can only request chunks of unknown size from the subscription, rather than a specified number of bytes. While the previous commit avoided the 10x buffering, the ramp-up in capacity becomes slow, and doing something better would require additional bookkeeping and guesswork.

Instead of trying to track the number of in-flight bytes, this settles for just keeping 30 chunks in flight. When running both locally and in our production environment, the typical chunk size seems to land around 1500 bytes, which means this effectively aims for a target buffer of about 45 kB in flight. This is much smaller than the previous 32 MB, but given that we want to process a CSV result line by line, 45 kB should typically be enough to cover at least a handful of lines. It is also not clear how the chunk size is derived, or whether it will stay consistent; if it were bumped to 8 kB, the 30 in-flight requests would correspond to about 240 kB instead.
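As a rough sketch of the fixed-window idea (this is not the actual InputStreamResponseTransformer code; the class and method names below are made up), a Reactive Streams subscriber can prime the subscription with 30 requests and then ask for one replacement each time a chunk is consumed:

```java
import java.nio.ByteBuffer;
import org.reactivestreams.Subscriber;
import org.reactivestreams.Subscription;

// Hypothetical sketch: keep a fixed number of chunks in flight,
// independent of how many bytes each chunk happens to carry.
class FixedWindowSubscriber implements Subscriber<ByteBuffer> {
    private static final int TARGET_IN_FLIGHT = 30;
    private Subscription subscription;

    @Override
    public void onSubscribe(Subscription s) {
        subscription = s;
        s.request(TARGET_IN_FLIGHT); // prime the window with 30 chunks
    }

    @Override
    public void onNext(ByteBuffer chunk) {
        deliverToStream(chunk); // hand the chunk to the InputStream side
    }

    // Called by the consuming side once a chunk has been fully read;
    // requesting one replacement keeps roughly 30 chunks outstanding.
    void onChunkConsumed() {
        subscription.request(1);
    }

    @Override
    public void onError(Throwable t) { /* propagate to the reader */ }

    @Override
    public void onComplete() { /* signal end of stream */ }

    private void deliverToStream(ByteBuffer chunk) { /* queue for the InputStream */ }
}
```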

The first version of this just reduced the fetch size from ten to one, which ought to work around the problem, but it's not really clear to me whether the extra logic brings much additional value over that simpler approach.

In short, the problem with the previous version was that the throttling didn't take into account how much we had already requested, and thus it would over-request data by at least 10x.

The somewhat longer version is that whenever we receive (or finish) a chunk of the response and we are below the threshold, we request 10 more chunks. With the threshold set to 32 MB and the typical chunk size around 1500 bytes, we will have processed well over 22k chunks before we reach the threshold. Each and every one of those times, we request 10 additional chunks (plus another 10 if the chunk was non-empty and we finished it before reaching the threshold). This means we sign up for 10x the number of chunks required to reach the threshold, and we will buffer up to that amount, as there is no way of telling the producer we want less. Assuming the chunk sizes stay consistent, this behavior is independent of the actual chunk size.
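In sketch form, the failure mode looks roughly like this (the names and structure are reconstructed from the description above, not copied from the original code):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;
import org.reactivestreams.Subscription;

// Reconstruction of the over-requesting pattern described above;
// names and constants are assumptions, not the original source.
class ByteThresholdFeedback {
    private static final long THRESHOLD = 32L * 1024 * 1024; // 32 MB
    private final AtomicLong buffered = new AtomicLong();
    private Subscription subscription;

    void onNext(ByteBuffer chunk) {
        buffered.addAndGet(chunk.remaining());
        // ... queue the chunk for the InputStream side ...
        if (buffered.get() < THRESHOLD) {
            // Fires for every one of the ~22k chunks needed to fill
            // 32 MB at ~1500 bytes/chunk, with no accounting for what
            // is already outstanding: ~220k chunks requested in total.
            subscription.request(10);
        }
    }
}
```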

By changing the number of chunks we request to 1, we only grow the number of in-flight chunks when finishing a chunk, but since we check the threshold at that point as well, and also only increase by 1 there, we can't really overshoot.
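Continuing the sketch above (same assumed names), the first version's change amounts to:

```java
// Same assumed skeleton as above, with the fetch size reduced to 1.
void onNext(ByteBuffer chunk) {
    buffered.addAndGet(chunk.remaining());
    // ... queue the chunk for the InputStream side ...
    if (buffered.get() < THRESHOLD) {
        subscription.request(1); // replace the chunk we just received
    }
}

// Called once the reader has fully consumed a chunk of the given size.
void onChunkConsumed(int size) {
    if (buffered.addAndGet(-size) < THRESHOLD) {
        subscription.request(1); // the only place the window can grow
    }
}
```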

One potential drawback of this approach is that the ramp-up in buffer size depends on the rate and amount of data processed.
@grddev self-assigned this Apr 22, 2020
@stenlarsson (Contributor) left a comment

Nice! 👍

@grddev force-pushed the input-stream-response-transformer-feedback branch from 0e28052 to 3f80e73 on April 28, 2020 at 14:17
@grddev (Contributor, Author) commented Apr 28, 2020

As noted in #19, this turned out to be too slow, and picking a reasonable hardcoded limit for the number of chunks is difficult without knowing the internals of the aws-sdk, so I'm closing this in favor of #19.

@grddev closed this Apr 28, 2020