
Support for pipeable streams #588

Closed
h2non opened this issue Apr 23, 2015 · 11 comments
Labels
closing-soon: This issue will automatically close in 4 days unless further comments are made.
feature-request: A feature should be added or improved.
needs-major-version: Can only be considered for the next major release.

Comments

@h2non

h2non commented Apr 23, 2015

I don't understand why this isn't supported yet. Streams are to node/io.js what classes are to C++.

It's especially useful for S3:

fs.createReadStream('image.jpg')
  .pipe(s3.putObject(params, function (err, data) {}))

Are there definitely no plans to support it?

@lsegal
Contributor

lsegal commented Apr 23, 2015

@h2non streams are supported in the API, just not via pipe. You can easily do:

var stream = fs.createReadStream('image.jpg');
s3.putObject({Bucket: ..., Body: stream}, function (err, data) {});

This format is more conducive to many of the other features provided by the SDK, namely parameter validation and bound parameters. Another important point is that S3 streams must be rewindable for retry support and signing in certain regions, which is not well supported by Node.js streams.

That said, supporting pipe natively is something we could look at. We currently have no plans to support this, but at the same time, we are not definitively opposed to the idea. If you wanted to put together a pull request that implemented this syntax sugar which aliases the above syntax, we would be happy to take a look!

@lsegal lsegal added the feature-request A feature should be added or improved. label Apr 27, 2015
@h2non h2non changed the title Support for pipable streams Support for pipeable streams May 4, 2015
@h2non
Author

h2non commented May 4, 2015

Thanks for the reply.

Sounds reasonable. I see your point, and it would definitely be great to introduce support for a pipeable interface. The main challenge here is keeping the existing interface consistent.

As far as I've seen after a brief code review, putObject() (and indeed almost all of the service API methods) returns an instance of Request, so maybe extending the prototype chain with the _write() method required by node.js writable streams could be enough. The client should check whether the body is already defined (when it's required) for the given request, in which case the request is ready to go over the network; otherwise it could wait until the input stream's chunks of data are ready.

I wrote a trivial example to demonstrate a possible approach (though it breaks the current interface):
https://gist.github.com/h2non/b30e7e6114415cbef172

It uses a small module which acts as a wrapper.
I would be happy to lend a hand if you're seriously considering this feature.
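
For the record, here's a rough sketch of the kind of wrapper I have in mind (hypothetical S3WriteStream name, not the gist's actual code; it naively buffers the whole body in memory before dispatching):

var stream = require('stream');
var util = require('util');

// Hypothetical wrapper: collects the chunks piped into it and dispatches
// putObject() once the source stream has ended.
function S3WriteStream(s3, params) {
  stream.Writable.call(this);
  this.s3 = s3;
  this.params = params;   // Bucket, Key, etc.; Body is filled in on finish
  this.chunks = [];
  var self = this;
  this.on('finish', function () {
    self.params.Body = Buffer.concat(self.chunks);
    self.s3.putObject(self.params, function (err, data) {
      if (err) self.emit('error', err);
      else self.emit('uploaded', data);
    });
  });
}
util.inherits(S3WriteStream, stream.Writable);

// _write() is what makes the object a valid pipe target
S3WriteStream.prototype._write = function (chunk, encoding, callback) {
  this.chunks.push(chunk);
  callback();
};

// Usage (hypothetical bucket/key):
// fs.createReadStream('image.jpg')
//   .pipe(new S3WriteStream(s3, { Bucket: 'my-bucket', Key: 'image.jpg' }));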

@lsegal
Contributor

lsegal commented May 4, 2015

Unfortunately we can't really consider any new functionality for this SDK that introduces breaking changes. Also, we would still have to answer questions about how this approach would deal with retryability and multiple passes over the body payload (for signing and other customizations). Without an answer to those concerns, I don't think we would recommend this approach for general usage.

That said, it's possible this kind of a project might just be best suited as a "third-party library" that extends the SDK within the npm ecosystem. If it became a popular library, we could consider it for inclusion in the next major version of the SDK when we can make breaking changes. How does that sound?

@h2non
Author

h2non commented May 4, 2015

Thanks for the fast reply. Let's separate things.

Forget about the module. I just made it to satisfy my own needs (which are partially related to the topic we're discussing here, but not built specifically for this purpose). Ideally no third-party library should be needed to accomplish this feature.

Never break interfaces: 100% agreed, that's an essential software principle. Note that I was just pointing to a representative example, not a real production-focused implementation, and it should be considered only as an inspirational starting point.

I can see your point about extending the SDK with some sort of plugin or middleware, but I believe that's simply not the right fit for the feature we're discussing here, since it's a low-level feature and probably too coupled to the internal state logic to be fully externalized.

The point I was making is, indeed, the idea of implementing a writable-stream-compatible interface in the SDK Request prototype chain. I don't think this breaks any of the existing interface contract. The only open question is how to handle and dispatch the request until all the chunks have been received (in the case that they come from a readable stream via pipe).
I don't entirely follow your point about the retries... I would guess that once you have the complete buffer of the body you can cache it and perform the required operations over it whenever you want, and all of these operations are encapsulated and transparent to the consumer.

Obviously my knowledge about the internal SDK core is very limited, so what's your analysis about this and possible constraints from a more low-level point of view? Thanks

@lsegal
Contributor

lsegal commented May 4, 2015

the idea of implementing a writable-stream-compatible interface in the SDK Request prototype chain. I don't think this breaks any of the existing interface contract.

Relying on a stream-like interface would break the current contract, since "retries" are part of the SDK contract (this is just one example), and streams are inherently incompatible with retryability because they cannot be rewound and replayed. More on this below.

I don't entirely follow your point about the retries...

Node.js streams are uni-directional streams that have no native concept of seeking or length. Basically they simply read or write data and that's it. Unfortunately this is not sufficient for the functionality we expose. Seeking is necessary for most of the SDK's concerns, namely signing payloads, checksum validation, and potentially retrying failed requests, since in all of these cases we need to read the stream multiple times (at least twice, in order to sign and then send the payload). Length is also required for payloads sent to Amazon S3. Using bare streams would disable the above-mentioned features that the SDK natively supports, all of which are core features of the SDK.
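
To make the "read it more than once" requirement concrete, here is a toy illustration (not SDK code, just the shape of the problem): with a Buffer the two passes are trivial, but with a one-shot stream the bytes consumed by the first pass are gone.

var crypto = require('crypto');

// First pass: hash the payload so it can be signed.
function sign(body) {
  return crypto.createHash('sha256').update(body).digest('hex');
}

// Second pass: actually transmit the payload.
function send(body, signature) {
  console.log('sending', body.length, 'bytes, signature', signature);
}

var body = new Buffer('hello world');  // rewindable: it's just memory
send(body, sign(body));                // two full passes, no problem
// With a generic Readable (an HTTP response, a socket), whatever sign()
// consumed would no longer be available for send().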

I would guess that once you have the complete buffer of the body you can cache it and perform the required operations over it whenever you want, and all of these operations are encapsulated and transparent to the consumer.

This isn't quite feasible. Many payloads, especially for interfaces that we want to use streaming operations for (namely S3), can be extremely large (500MB, 1GB, or larger), and it would not be reasonable (or possible) to buffer these streams entirely into memory in order to operate on them. Buffering into memory also defeats the purpose of streaming, i.e., to keep memory usage low. Basically, if we exposed a generic streaming interface, it would come with the caveat that many of the core SDK features would be turned off for this stream, including but not limited to signature version 4 signing.

what's your analysis about this and possible constraints from a more low-level point of view?

Effectively the analysis has always been that Node.js streams are an insufficient abstraction to support our use case. We need to support seekable streams-- something we use in all other SDKs-- so that we can support reading a payload multiple times without buffering the entire payload into memory. Reading the payload multiple times is necessary for signing, generating extra checksums (a 2nd read pass in input), sending, verifying the response body for checksums, and possibly retrying if a request error occurred, which potentially requires another pass at signing the request, and definitely another one at sending it. In total, we may read through a payload as many as 15 times in a single request cycle (in the case of 3 retries)-- this is simply not possible with the standard Node.js stream interface. We work around this with file streams (fs.createReadStream) via a hack that allows us to re-open the stream, but this approach would not be applicable to the general case, and so we would never be able to promote this generic interface for the general case.
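
As a rough illustration of that file-stream workaround (a sketch of the idea only, not the SDK's actual internals):

var fs = require('fs');

// Because the file path (and offset) is known, a brand-new stream can be
// opened for every pass: one for signing, one for sending, one per retry.
function freshBodyStream(path) {
  return fs.createReadStream(path);
}

var passForSigning = freshBodyStream('image.jpg');
var passForSending = freshBodyStream('image.jpg');
// A generic Readable (socket, HTTP response) offers no equivalent "re-open".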

Hope that explains some things!

@h2non
Author

h2non commented May 4, 2015

Hope that explains some things!

Absolutely!

This isn't quite feasible. Many payloads, especially for interfaces that we want to use streaming operations for (namely S3), can be extremely large (500MB, 1GB, or larger)

You're totally right. It's simply not viable. Indeed, V8's maximum allocated memory is limited to about 1 GB.

Node.js streams are uni-directional streams that have no native concept of seeking or length.

In node/io.js there are duplex streams, so technically they're bi-directional if we consider them as a single abstract data type. For instance, network sockets use both directions.

Effectively the analysis has always been that Node.js streams are an insufficient abstraction to support our use case. We need to support seekable streams

That's true, but you can discover the byte length via _write in a writable stream on every buffer chunk (as you're already doing when using fs.createReadStream), while discarding the buffered data, which only needs to remain in memory for a limited amount of time.

To simplify things a bit:
Essentially I believe the problem here is not the mechanism for reading data from a stream, which already exists and works fine; it's about finding a way to plug in different interfaces with the same behavior.

I think that currently we have:
ReadableStream -> Body -> Request Flow

And the idea is to have something like:
ReadableStream | WritableStream (collector) | ReadableStream -> Body -> Request Flow
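
A rough sketch of what I mean by the collector (illustration only: it just counts bytes while passing every chunk through untouched):

var stream = require('stream');
var fs = require('fs');

// Pass-through collector: forwards each chunk unchanged (so downstream still
// sees a readable body) while recording the total byte length.
var collector = new stream.Transform({
  transform: function (chunk, encoding, callback) {
    this.byteLength = (this.byteLength || 0) + chunk.length;
    callback(null, chunk);
  }
});

fs.createReadStream('image.jpg').pipe(collector);
// The collector would then be handed over as the Body (hypothetical params):
// s3.putObject({ Bucket: 'my-bucket', Key: 'image.jpg', Body: collector }, cb);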

Thank you for the reply!

PS: To be honest, I feel the potential benefit of this feature may not justify the time invested discussing/thinking about it, but it's definitely interesting to dig into.

@achselschweisz

Hi,

+1 for adding piping support

I, too, was excited at first to see the upload method taking readable streams as parameters, and immediately changed the implementation of one of our applications to use that.

FYI: the application takes a readable stream and pipes it to several writable streams at (virtually) the same time, writing to files, to S3, and to other destinations (all implemented as Transform streams).

Anyway, as you probably guessed, stuff didn't work as expected, since the aws-sdk internally consumes the buffers (you are adding a readable listener in the "send" method) in the readable stream which leaves the other pipes with 0 bytes of data.

However,

Effectively the analysis has always been that Node.js streams are an insufficient abstraction to
support our use case. We need to support seekable streams-- something we use in all other SDKs--
so that we can support reading a payload multiple times without buffering the entire payload into
memory. Reading the payload multiple times is necessary for signing, generating extra checksums
(a 2nd read pass in input), ....

I am sorry to say, but I have to call bull$hit on this :-) This is exactly what Transform streams can be used for, after all. We've been doing this internally with "uploadPart" so far (buffer the part you want to retry / sign / etc.).
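
Roughly what that looks like (a sketch with hypothetical names and a hard-coded part size; I'm using a plain Writable here for brevity, a Transform works the same way, and flushing the final partial part on 'finish' is left out):

var stream = require('stream');
var util = require('util');

var PART_SIZE = 5 * 1024 * 1024;  // S3 multipart minimum part size

// Buffers incoming data; once at least PART_SIZE bytes have piled up, hands a
// rewindable Buffer to the onPart callback, which can sign / retry it at will.
function PartBuffer(onPart) {
  stream.Writable.call(this);
  this.onPart = onPart;
  this.chunks = [];
  this.size = 0;
  this.partNumber = 0;
}
util.inherits(PartBuffer, stream.Writable);

PartBuffer.prototype._write = function (chunk, encoding, callback) {
  this.chunks.push(chunk);
  this.size += chunk.length;
  if (this.size >= PART_SIZE) {
    this.partNumber += 1;
    this.onPart(Buffer.concat(this.chunks), this.partNumber);
    this.chunks = [];
    this.size = 0;
  }
  callback();
};

// Usage (hypothetical): each completed part goes to uploadPart as a Buffer.
// readable.pipe(new PartBuffer(function (part, partNumber) {
//   s3.uploadPart({ Bucket: ..., Key: ..., UploadId: ..., PartNumber: partNumber, Body: part }, cb);
// }));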

Piping streams is one of the most powerful and useful features node / io.js have to offer; I reckon all aws-sdk-js users would benefit greatly if this were supported.

Cheerio!

@achselschweisz

Btw: a quick workaround for piping a readable stream into several writables, while also handing the readable stream over to the "upload" method, is to pipe the original readable stream into another readable (Transform) stream (e.g. the stream.PassThrough class), then hand that reference to the upload method:

var stream = require('stream');
var fs = require('fs');
var readable = fs.createReadStream('xxx');
var passThrough = new stream.PassThrough();
var s3ref = readable.pipe(passThrough);

// suppose we have Writable streams "A" and "B", then we could do:

readable.pipe(A);
readable.pipe(B);
s3client.upload({ Body: s3ref });

This way the buffer of s3ref will get consumed w/o interfering with the original readable stream.

@jeskew jeskew added the needs-major-version Can only be considered for the next major release label Jan 18, 2017
@Mic75

Mic75 commented May 19, 2018

@achselschweisz Your workaround works like a charm. I am using it to upload from one host (with the request package) directly to an S3 bucket.

I don't understand the necessity of piping the readable stream into the passThrough, though.
For instance, any idea why this code isn't working?

request.get('somehostname')
  .on('response', response => upload({Bucket: bucket, Key: 'test-hub/UnityHubSetup.exe', Body: response}));

@achselschweisz

achselschweisz commented Jul 25, 2018

@Mic75 Sorry it's been some time since your question:
Quoting myself here:

Anyway, as you probably guessed, stuff didn't work as expected, since the aws-sdk internally
consumes the buffers (you are adding a readable listener in the "send" method) in the readable
stream which leaves the other pipes with 0 bytes of data.
This way the buffer of s3ref will get consumed w/o interfering with the original readable stream.

So basically, by using your own stream (or the pass-through), which you can pipe into the aws-sdk, you still have control over it even though the stream on the aws-sdk's end is being consumed.
The authors are basically buggering up the way stream piping works, hence the need for "something in between" :-)
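
Applied to your request example, it would look roughly like this (assuming s3client is an instantiated S3 client as in my earlier snippet, and the bucket name is hypothetical):

var stream = require('stream');
var request = require('request');

var passThrough = new stream.PassThrough();

// Pipe the HTTP response into our own stream first...
request.get('somehostname').pipe(passThrough);

// ...then hand that stream to the SDK, which is free to consume it.
s3client.upload(
  { Bucket: 'my-bucket', Key: 'test-hub/UnityHubSetup.exe', Body: passThrough },
  function (err, data) {
    if (err) console.error(err);
    else console.log('upload done', data);
  }
);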

@github-actions

Greetings! We’re closing this issue because it has been open a long time and hasn’t been updated in a while and may not be getting the attention it deserves. We encourage you to check if this is still an issue in the latest release and if you find that this is still a problem, please feel free to comment or open a new issue.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label May 14, 2020
@h2non h2non closed this as completed May 14, 2020