Return real streams (previously a request for real promises) #13

Closed
ForbesLindesay opened this Issue Dec 20, 2012 · 55 comments

7 participants

@ForbesLindesay

At the moment the AWS object is 'like a promise'. For why that's a terrible thing, I'd advise you to read @domenic's rant You're missing the point of promises. In short, what you need to do is provide a .then method which complies with the promises-aplus spec. It's not too complicated, and it would make that 'promise like' api into an absolutely killer 'promise' api.

@lsegal

We've decided to drop the promise API altogether and opt for a much simpler event based callback system which was just merged in via #22. The event based refactor allows for EventEmitter events registered on the request object. The initial goal here was to allow for success/fail callbacks, but this can now be achieved with something like .on('success', callback) instead.

If promises are needed, it makes much more sense to go to the many already-available libraries like promise.js or Q as mentioned in #2.

@lsegal lsegal closed this Jan 3, 2013
@ForbesLindesay

An event emitter that emits success and error events is still just a really degenerate promise. You just have an even more different API. I'd suggest depending on one of the existing promises-a+ compliant libraries:

Most of those libraries are tiny and provide really simple APIs to let you build non-degenerate promises out of the requests you're already making. I can't see anybody wanting to use the event API as it stands. If you don't want to support promises you might as well just support callbacks, because the only use I see for the event based API is using it to build a real promise. You can see an example of this being done in dynostore (the database layer API for jepso-ci)

@lsegal

We already support callbacks as implemented in #2. The event based system is just extra sugar to allow for specific events that are not compatible with a callback/promise based system (like data streaming events). I don't think it's wise for us to make an opinionated decision on which promise library to depend on when there are so many options out there. Our current callback-style interface should allow any of those promise libraries to be plugged in and used if needed.

@ForbesLindesay

I'm not asking you to support choosing something specific and opinionated, I'm asking you to adopt a standard. I realize you support callbacks already, I'm simply making the point that 'success' events don't add anything.

If you use any of the aforementioned libraries, the resulting promises can be assimilated by any of the other libraries.

e.g. if there was an aws function getFromAWS() which returned a promise (using any of the above libraries) then:

var res = promise()
  .then(function () {
    return getFromAWS();
  });

res is then a promise of the same type as that returned by promise(), but for the result of the AWS service function. So you can do:

var promise = require('my-promise-library-that-conforms-to-promises-a-plus');
var res = promise()
  .then(function () {
    return getQpromise('foo');
  })
  .then(function (foo) {
    return getWhenPromise(foo);
  })
  .then(function (bar) {
    return getVowPromise(bar);
  })
  .then(function (url) {
    return getFromAWS(url);
  });

res would then be a promise of the type used by 'my-promise-library-that-conforms-to-promises-a-plus' for the result of that chain of asyncronous operations. They also all play nicely in terms of helper functions like all:

Q.all([getFromAWS(url), getFromVowsAPI(url2), getFromWhenAPI(url3)])
  .spread(function (res1, res2, res3) {
    //use the result of 3 different types of promise here
  });

A lot of effort went into producing a promises/A+ spec that everyone can agree on, and it needs to be adopted at this level in addition to being used in promise libraries.

@ForbesLindesay

In addition, if you are using events for data type partial results, you should really be using streams. See why you should use streams. They provide a standardized method for handling streaming data, including helpful things like back-pressure and the ability to pass through a number of transformation streams. This is needed for writing high performance node.js applications.

@domenic

Yes, the event-based API is pretty non-Nodey. (Well, actually, it's Node 0.1.x-ey.) Either use streams or (err, result) callbacks as the base. Then maybe use promises on top of that.

@trevorrowe
Amazon Web Services member

@domenic AWS has only a pair of operations (S3 GetObject and Glacier GetJobOutput) that return raw data as the http response payload. The current event-based API exists to provide access to everything we get back from AWS. It would be very simple to wrap the returned request and create a read stream from it.

stream = s3.getObject(params).createReadStream();

Or are you looking for something like this?

s3.getObject(params, function(err, stream) {
  // ...
});

I want to make sure I understand correctly what it is you are looking for. We purposely have not jumped in and created any higher level abstractions around streams yet.

@ForbesLindesay

If you look at a library like the very popular request you can see a typical examle of something that works like a stream. Normally you'd expect something like:

stream = s3.getObject(params);

That way you can for example have a method in an express app that sends the stream straight to the client:

app.get('/s3-object', function (req, res) {
  s3.getObject(params).pipe(req);
});

Since streams are also event emitters themselves, there should be no problem emitting additional events as needed for the headers etc. (indeed request does this).

As for the APIs that don't stream, since you're not planning to support real promises, you shouldn't really make them event emitters as that turns them into sudo-promises, which is much worse.

@lsegal

@ForbesLindesay, I am a little confused about the need to "adopt a standard" for promises. It seems like it should not be the concern of the AWS SDK to implement, return, or deal with promises in any way. It's very trivial to bake your favourite promise library into the API with a helper method, or even by extending the API:

AWS.Request.prototype.promise = function() {
  var deferred = require('q').defer();
  this.on('complete', function(resp) {
    if (resp.error) deferred.reject(resp.error);
    else deferred.resolve(resp.data);
  });
  this.send();
  return deferred.promise;
};

You can use this quite easily:

db = new AWS.DynamoDB({region:'us-east-1'});
db.client.listTables().promise().
  then(function(data) {
    console.log(data);
  }, function(err) {
    console.log(err);
  }).then(...);

Now you have a "real" promise, and there's no need for the SDK to take on extra dependencies for users who have no need for a promise API. The above snippet could make a great aws-sdk-promises plugin, and would keep concerns of the core library separate from having to adopt any specific interface.

@trevorrowe
Amazon Web Services member

@ForbesLindesay I see the value of s3.client.getObject returning a stream. That said, what if I also want access to other relevant information available from the response (e.g. content type, etag, metadata, etc). What is wrong with having the req object define a method that returns a stream?

req = s3.client.getObject(params);
req.on('success', function(resp) {
  // access interesting attributes about the object returned
  resp.data.Etag;
  resp.data.ContentType;
  resp.data.Metadata['custom-attribute']; //
});
stream = req.createStream();

If you don't care about the other object attributes you could simply:

s3.client.getObject(params).createStream().pipe(...)
@ForbesLindesay

@lsegal Yes, but you have a .on('complete', function (result) method, which is almost identical to the .then method of a real promise, so why not just provide a real .then method? Or alternatively, why maintain the sudo-promise API you currently do? Either have the upkeep and the benefit of returning promises, or don't have the upkeep or the benefits. At the moment you have the upkeep without the benefits.

@domenic

As @ForbesLindesay says in his latest comment, the problem is your nonstandard promises, i.e. your event emitters with a pair of success/failure events. Since you've clarified you didn't mean those to be promises, the original goal of this issue is no longer relevant. Now the discussion has turned toward removing the pseudo-promise interface in favor of real streams or simply callbacks.

@domenic

@trevorrowe streams are allowed to and encouraged to have other properties. See for example every stream in node core.

@ForbesLindesay

@trevorrowe The problem with that is that it's superfluous. Because streams are event emitters, with the API I'm suggesting you could simply write:

stream = s3.client.getObject(params);

stream.on('success', function(resp) {
  // access interesting attributes about the object returned
  resp.data.Etag;
  res.contentType(resp.data.ContentType);
  res.header('custom-attribute', resp.data.Metadata['custom-attribute']); //
});

stream.pipe(res);
@trevorrowe
Amazon Web Services member

@ForbesLindesay The request is not a stream object, it is a request object. It carries context of the request parameters, the http request headers, uri, payload, endpoint, etc. I think it would be better to create a readable stream from the request object. Here is a simple example:

AWS.Request.prototype.createReadableStream = function() {
  s = new require('stream');
  stream = new s.Stream();
  stream.readable = true;

  var req = this;
  stream.on = function(e, callback) {
    req.on(eventName, callback);
  });

  this.on('httpData', function(data) {
    stream.emit('data', data);
  });

  this.on('httpDone', function(data) {
    stream.emit('end');
    stream.emit('close');
    stream.readable = false;
  });

  this.on('error', function(err) {
    stream.emit('error', err);
  });

  this.send();

  return stream;
};

Using the above addition you could do something very similar to you example:

stream = s3.client.getObject(params).createReadableStream();
stream.pipe(require('fs').createWriteStream('./output'));

If you want access to the "other response data" you could do this instead:

req = s3.client.getObject(params);

req.on('success', function(resp) {
  // access interesting attributes about the object returned
  resp.data.Etag;
  resp.data.ContentType;
  resp.data.Metadata;
});

// now get a stream from the request
req.createReadableStream().pipe(('fs').createWriteStream('./output'));

This could be expanded to support creating writable streams for operations like S3 PutObject.

I think the primary difference between this example and the proposed API you gave is that the pair of AWS operations (S3 GetObject and Glacier GetJobOutput) would instead return the streams directly instead of the request object. Is this correct? Thoughts?

@trevorrowe
Amazon Web Services member

As @domenic pointed out, the request is not intended to be a promise. It is a low-level representation of a request object. It can be used to construct a promise, create a stream, etc. If you do not need access to the low-level events emitted by request (validate, build, sign, send, httpHeaders, httpData, HttpDone, retry, error, success, complete) then it is possible to pass a traditional node callback function to the operations.

Any similarities between the events emitted from request to those emitted by a promise object should be ignored.

@ForbesLindesay

Yes, they would return the stream directly, rather than the request object, but that stream could also behave exactly like the current request object. What you're doing currently just makes it look like something that was developed without an understanding of the node.js ecosystem.

I think it's a shame given how close your request objects come to being promises, that you're choosing not to make them into promises (a promise can still be an event emitter), but that decision is understandable. Returning an object that is a stream that fires 'httpData' instead of 'data' and 'httpDone' instead of 'end' is insane and doesn't inherit from stream is insane.

If you were to:

  1. Rename the 'httpData' event to 'data' and 'httpDone' to 'end'.
  2. Automatically start the request if either 'data' or 'end' are listened to.
  3. Inherit from var Stream = require('stream');

That last step is simply a case of:

AWS.Request.prototype = Object.create(require('stream').prototype);

You would have something that looked like it fitted in with node.js rather than worked against node.js

@ForbesLindesay

The close event is optional, so you don't need to fire it.

Streams can be duplex (readable and writable) so that's how PutObject should work.

@lsegal

@ForbesLindesay:

Returning an object that is a stream that fires 'httpData' instead of 'data' and 'httpDone' instead of 'end' is insane and doesn't inherit from stream is insane.

This is why we should not make the Request object a stream, because this would be insane. @trevorrowe pointed this out: Request objects are not, and should not, be Streams. Making them as such would make no conceptual sense. Requests contain context for a service request and are much more than just a handle to a socket to read and write from. Yes, they share similar concepts with Streams, but only through composition-- a request is composed of a Stream; it is not, itself, a Stream. Smashing these two objects into one would be confusing from an architectural point of view, and I think it would cause a lot of user confusion as well. The idea to createReadStream() in order to get back a proper Steam object makes sense and seems perfectly in line with how Node's fs module works. It also properly separates these composable objects and keeps two very separate and well encapsulated components.

I also think it's important to take a step back and realize that the lifecycle of a request to AWS is much more than just an HTTP socket stream that you read and write from. There is also plenty of logic in building, signing, retrying and parsing data from responses that is baked into the whole request lifecycle that seems to be taken for granted here. These are interesting parts of a request lifecycle that we are trying to expose to users through events, and make the system much more composable and extensible. A promises interface is much too simplistic to answer these questions, this is why we created a much more robust event-based system. The more simplistic promises interface can be much more easily derived from the event-based system than the other way around.

As a sidenote, prefixing the HTTP portion of the lifecycle with "http" was done to conceptually separate these events for our users. Furthermore, emitting the "end" event for the end of the HTTP stream would be wrong, because the end of the HTTP request does not signify the end of the request (from a logical point of view, it's actually less than half of the full lifecycle). If anything, the "end" of the request is on "complete", but we explicitly decided against naming this end to avoid confusion with Streams (recall: Requests are not Streams). Ditto for "data", as there are actually two kinds of "datas" in a request (the raw HTTP data and then the parsed response XML/JSON data).

I think looking at the bigger picture is important, here. Providing streaming operations and promise interfaces can be useful things, but there are better places for these things than at the low-level client APIs.

@ForbesLindesay

Node.js Stream != Low level Socket Stream

You can have a stream of JSON objects for example. They represent things that can be processed in parts and where back-pressure may be useful to improve scalability and reduce unnecessary buffering.

@ForbesLindesay

A promise alone is too simplistic, but they can easilly be used as a mixin on other objects. Let's forget the promises though. The stream issue is much more important. It may well only make sense for the S3 API which should then use the raw binary data of the blob for the stream.

Read the fs API more carefully, what your proposing is not similar. They have one method createReadStream(params...) you have a two stage process. It's that extra step I don't like. I also dont like that requests are lazy (you have to explicitly start them) as that is not the typical way node.js works.

@lsegal

I understand that Streams are not inherently low level, but the Request object is inherently low level. Streaming parsed JSON is a completely orthogonal concern to the Request class, and can be implemented independently of the Request in higher level APIs (you would actually be operating entirely on the Response object).

As far as one or two stage: I don't see this as significant. I don't believe the value of Node is that you can do everything with a single one-step method call (and that is not even true). Separating concerns is a much more valuable concept, and I think even Node developers would agree that preferring composition over inheritance is ideal.

Finally, initializing a request when a listener registers to the 'data' event simply cannot be done. This approach might work in the simple case of piping content directly to a stream, but it fails miserably in a more complex scenario, i.e., if multiple objects want to register for the 'data' event (one debug logger, one stream).

However-- to clarify: we are in agreement that we should add better Streaming support. We just merged in #3 which addresses parts of these issues, and @trevorrowe's streaming snippet can be added to give better stories for streaming raw content. Streaming parsed XML/JSON will most likely be left up to higher level APIs, though.

@ForbesLindesay

Fortunately there are other AWS libraries. You've ended up building a high level API that simply fails to interoperable with the rest of node.js.

@lsegal

Can you list concrete examples of where the SDK is not interoperable with "the rest of node.js"? I have shown that you can quite easily interop with the promise interface using a simple wrapper, and @trevorrowe has provided an easy adapter method that returns a stream from a request (something which we agree we would like to incorporate into the API). Are there any technical limitations with these solutions? It would be much more useful if we had examples of concrete cases where the solutions proposed do not work.

@ForbesLindesay

I should have said "directly interoperable" you've presented shims that work, but we shouldn't need shims.

@lsegal

I don't consider the create[Read|Write]Stream() proposal to be a shim since the Request object is not actually a Stream object (and is not intended to be). From a technical perspective, there is no difference to calling stream.pipe(out) vs. stream = req.createReadStream().pipe(); it is simply a stylistic distinction.

As far as promises go, yes, this is a shim, but I don't believe this wrapper code belongs in the core library. Given that users who are interested in promises will already have a library of choice pulled into their applications, it is not correct for us to make assumptions about which promise library to use, especially since writing a wrapper is trivial and could be published as a separate package for those users who want a promise interface. It's also important to note that part of the reason promises were dropped from the SDK was the overwhelming feedback from users in #2 stating that promises were, in fact, no longer idiomatic in Node, and it was something they were simply not interested in seeing in the library. And you can always add it in if you really do want it.

@domenic

@lsegal I am flabbergasted by the misunderstanding of streams that you are displaying here. Have you looked at the http module? The one that comes with node.js? Statements like

This is why we should not make the Request object a stream, because this would be insane. @trevorrowe pointed this out: Request objects are not, and should not, be Streams.

or

I understand that Streams are not inherently low level, but the Request object is inherently low level.

or

I also think it's important to take a step back and realize that the lifecycle of a request to AWS is much more than just an HTTP socket stream that you read and write from

are laughable in the face of http.ServerRequest, for just one example. You might as well say "a HTTP request is not just a TCP socket stream you read from"---in both cases, you're missing the point entirely.

If you'd find it helpful, I can try to carve out time for a point-by-point rebuttal, but I think first it would be valuable if you familiarized yourself with the Node.js core library and the idioms embodied there.

Note how the HTTP request, in addition to being a readable stream, has properties like method, url, headers, etc.


You think this is a stylistic difference, but this is actually a very deep one. You are creating an entirely new and bizarre abstraction in this "request" object, which behaves like no other request object in the Node.js ecosystem. It represents a network I/O operation, but is not a stream---this is a clear-cut WAT.

It is very accurate to say that you've provided a shim for streams on top of the library, since the core abstraction is some weird object that represents, in the end, your fundamental misunderstanding as to how I/O and network abstractions in Node work, and streams are just something you can get by adapting that weird object via the shim method.

@ForbesLindesay is right that, if you guys are displaying this level of deep misunderstanding of streams in specific and the Node.js ecosystem in general, moving on to other libraries whose authors actually understand is going to be the right move.

@Raynos

To reinforce.

If you have an api that returns an object which conceptually represents a stream, i.e. it asynchronously gives you data in multiple chunks, where each chunk is a Buffer or string and eventually terminates with some kind of end symbol then it should be an instance of require("stream").

Or if your api returns an object which conceptually you can apply back pressure to then it should also be a stream

If you have a conceptual stream of non-Buffer / string data then your allowed to use what you deem to be a more suitable abstraction if you wish (i.e. gozala/reducers)

It should be noted that an event emitter with arbitary events is not a more suitable abstraction.

My personal objection to your API not being a stream is that:

  • your custom events are not documented
  • it is trivial to infer how to read your data if I inspect an AWSRequest instance and see a pipe method.
  • common composable abstractions exist for a reason. I have a tool chain of high order functions that work cleanly with (err, data) callbacks and streams.
  • Writing a wrapper that makes your module a valid stream is silly and fragments the aws developer community. I will do it if necessary

It should also be noted that a createStream() method is a reasonable enough compromise although unnecessary.

@lsegal

@Raynos:

your custom events are not documented

These changes were just merged in yesterday-- the documentation on http://docs.amazonwebservices.com/AWSJavaScriptSDK/latest/frames.html represents the last release (v0.9.1-pre.2), not what is in GitHub. There will be published documentation for all of the events emitted by the request object come the next release.

it is trivial to infer how to read your data if I inspect an AWSRequest instance and see a pipe method.

This is where things get complicated. How do you define "the data"? Only for streaming operations (a handful of methods on Amazon S3 and Glacier) is the raw HTTP body equivalent to "the data" (and even then, it's not quite the only data). In most other cases, the actual data will be post processed, will NOT be string/buffer data, and will be much more useful than the raw data that would be retrieved by pipe(). In the case of non-streaming operations (the vast majority of AWS operations), we do not want users using pipe() to stream raw data, as this short-circuits most of the valuable work the SDK does to extract data from responses. In other words, unless you're streaming files to/from S3, you probably don't want to be calling pipe() anyway.

@Raynos

@lsegal

In the case of S3 I'm most likely reading a remote file and only care about the file. So I would expect an API very similar to fs.

For the rest of your APIs I can't comment other then to say that if it does back pressure it should probably be a node stream.

@ForbesLindesay

As for S3 it's almost always going to be a file. It may well be post-processed, but those processes are going to want to treat it as a stream so they can apply back-pressure and so that they don't have to buffer large chunks. Doing post-processing as a stream also helps ensure that the post-processing can be re-used in other streaming interfaces (such as the excellent windows azure SDK).

As an example, say I was reading a text file from AWS and I wanted to force the result of a text blob being read form S3 to be in all lower case

var pass = require('pass-stream');//see https://npmjs.org/package/pass-stream
function toLowerStream() {
  return pass(function (data) { this.queueWrite(data.toString().toLowerCase()); });
}
var lowerCaseFile = AWS.getBlob(id).pipe(toLowerStream());
var lowerCaseFile = WindowsAzure.getBlob(id).pipe(toLowerStream());
var lowerCaseFile = fs.openReadStream(id).pipe(toLowerStream());

For the other APIs there are two different classes of people you need to cater to:

  1. Those who can use your API directly and are happy with the passed response, these people may want a progress event, but probably just use the callback API as they probably don't want anything else.
  2. People who are going to want to do the parsing of the strings themselves. They may well have a streaming JSON parser for example. These are people after extreme performance, they just want a stream and are deliberately bypassing a lot of the guts of your API as it's (correctly) built for maintainability over performance.
@domenic

It sounds like part of the problem is you're trying to use the same abstraction for all AWS APIs, instead of using different ones for S3 and for others.

@trevorrowe
Amazon Web Services member

@domenic I agree. Currently we are trying to use the same abstractions for all AWS operations at the client level. Please notice that we have grouped API operations for each service into its own Client class. The client class is nested beneath the service class (e.g. AWS.S3.Client, AWS.EC2.Client). This was done purposefully so higher level abstractions for each service could be latter added to the service class.

This intentional grouping and nesting is why you need to do the following:

s3 = new AWS.S3();
s3.client.someOperation(params)

// instead of simply
s3.someOperation(params)

Would everyone's concerns be eliminated if this existed?

s3 = new AWS.S3();
s3.upload(params).pipe(...) // internally uses s3.client.putObject
s3.download(params).pipe(...) // internally uses s3.client.getObject

Please note, in the example above the methods names could be anything that makes sense. The suggested upload and download are arbitrary.

I want to stress. There are currently only 3 streaming upload operations (S3 PutObject, UploadPart and Glacier UploadArchive) and only 2 streaming download operations (S3 GetObject and Glacier GetJobOutput). The hundreds and hundreds of other API requests would not be well served as a Stream object. We feel there is value in consistency at the low-level (i.e. Client class) where all of the methods have the same return type.

The one-to-one mapping of API operations and their parameters is an explicit design choice. This allows us to rapidly update the API configuration for each service client keeping them lock step with the service updates. During this developer preview we plan to expand the coverage of low-level clients across more services. Are we interested in higher level abstractions? Absolutely yes. We simply have not spent many cycles on them.

@domenic

@trevorrowe that does look quite reasonable. And your concerns about 1-1 mapping of API operations and parameters are sensible, although I am not sure that necessitates having the same return type for every operation.

@ForbesLindesay

OK, you're close with that last one. There are two issues with it, the first is simple and I imagine is just a typo:

You pipe from a read stream to a write stream

s3 = new AWS.S3();
someOtherStream.pipe(s3.upload(params)) // internally uses s3.client.putObject
s3.download(params).pipe(someOtherStream) // internally uses s3.client.getObject

Backpressure

You haven't made explicit your intention to support back-pressure (in fact I don't think @trevorrowe or @lsegal have even used the word in this discussion). Without support for back-pressure you can end up with vast un-necessary buffers. This would seriously impact scaleability.

The frustrating thing, is that if you take your slightly odd event API out, you must at some point be making a raw HTTP or TCP request. That request will be returning a stream that will support back-pressure and thus be ideal for this use case. Because this underlying Stream is where you want to start when building this API, it's not helpful to build it on top of your API, it would be easier to start from scratch.

Keeping Consistency

You've expressed your desire to keep a consistent API across all your requests. So I implore you to give me what you view as the likely use cases across your APIs. I envisage there being precisely two different ways people want to extract data from your API and precisely two ways they want to send data to your API.

Sending data to your API

With the exception of file upload to S3/Glacier, the vast majority of people will want to just pass a few arguments to a function and have it upload their data. For S3/Glacier they will want to pipe a file to it:

//Use case 1

AWS.serviceName.methodName(params...);
//where params contains structured data or small
//blobs of text

//Use case 2

stream.pipe(AWS.serviceName.methodName(params...);
//where stream is data to send to the service
//and params... represents any metadata

Retrieving data from your API

With the exception of file download/upload from S3/Glacier, the vast majority of people will want to just pass a callback function to the method. It should take exactly 2 arguments (no more) the first of which should be an error or null.

//Use case 1

AWS.serviceName.methodName(params..., function callback(err, res) {
  if (err) throw err;//or handle it properly
  ...
});

For S3/Glacier storage people will want to have a stream object returned:

//Use case 2

AWS.serviceName.methodName(params...}).pipe(stream);
//where stream is a writeable stream and 
//params are the metadata to help locate the item to download

A small minority of people will want to retrieve the results of one of the other APIs as a stream which they will parse themselves (there are standard libraries to parse streams:

//Use case 2

AWS.serviceName.methodName(params...}).pipe(parserStream).pipe(stream);
//where stream is a writeable stream and 
//parserStream is a duplex stream that parses the results
//params represent a query

N.B. deliberately use case 2 again

Uploading a blob in more detail

The result of a blob upload is not in fact a readable stream, that would be insane, but it is a collection of events so that makes perfect sense: It can just be the writable stream (exactly like in the core fs module)

var file = fs.createReadStream();
var uploadOperation = file.pipe(AWS.serviceName.methodName(params...));
//uploadOperation is a writeable stream 
//It is the stream returned by AWS.serviceName.methodName(params...)
//It has been extended for metadata

uploadOperation.on('aws-metadata', function (metadata) {
  ...
});
uploadOperation.on('end', function () {
  //stream end event fired when upload complete
});

Summary

This presents a consistent, low-level api:

  • Methods return their results either using a callback function or, if there is no callback function, as a readable stream.
  • Methods can accept input either as parameters or by being written to as a writable stream.

This low level API creates a suitable base for all high-level abstractions, but mostly it's actually sane enough that you wouldn't need massive abstraction layers on top of it, it could be used directly (something I wouldn't say about the current API).

@ForbesLindesay

P.S. as @domenic points out, having a 1-1 mapping with the API is ideal, but make the implementation of that API streams where appropriate and callbacks where appropriate, your sudo-streams have nothing to do with the AWS API and everything to do with not fully understanding how node operates.

P.S. I think it's fair to say this should be an open issue (I'll update the title).

@lsegal lsegal reopened this Jan 4, 2013
@Termina1

Actually, it is very frustrating that I can't use streaming for uploading files to S3.

@domenic

@Termina1 thankfully knox has got you covered.

@ForbesLindesay

👍 for knox, they should just make that the "official" S3 client and be done with it 😄

@Termina1

@domenic Thanks!

@lsegal

@Termina1 @domenic @ForbesLindesay just to clarify, you can use the aws-sdk to stream PUT operations to S3, you simply need to pass a stream object as the body parameter:

var fs = require('fs');
var stream = fs.createReadStream('/path/to/file');
s3.putObject({Bucket:'bucket', Key:'key', Body:stream}, function() { ... });

For streams that are not FS objects, you will need to define a .length property on the stream object to tell the SDK how large the payload is, but we were just discussing exposing a ContentLength property in #94 to allow this as a separate parameter. That would make the story very similar to knox.

@domenic

@lsegal is there any way you could get the S3 team to support chunked transfer-encoding, so we don't have to do silly extra steps to discover the content length? That would be... amazing.

Note that multipart upload doesn't work for this, since it only works for files 5 MB and larger.

@ForbesLindesay

👍 for chunked transfer-encoding

@mitar

So are streams supported for getObject or are they not? I am not sure what is current state of this?

@lsegal

@mitar streams are supported for payloads in both directions. For getObject you would call .createReadStream() to return a stream wrapper for the request object like so:

var fs = require('fs');
var writeStream = fs.createWriteStream('/path/to/outfile');
var readStream = s3.getObject(params).createReadStream();
readStream.pipe(writeStream);

We are working on revamping our guides to make these examples more prominent-- until then, you can see this in our API docs for the AWS.Request object.

@ForbesLindesay

Also, no word yet on whether these streams implement back pressure?

@lsegal

@ForbesLindesay as Node moves to the new "streams2" API, it is no longer necessary for library writers to support back-pressure. This is something that is now supported by the underlying streams implementation with the new internal high/low water mark settings. Effectively, the stream API is now poll instead of push, meaning the consumers now read whatever they can (and want to) instead of what they are forced to. Pause and resume are no longer used in the new API, so this simplification means we do not have to support back-pressure at the SDK level going forward. That's a great thing for us and all other library developers implementing custom streams or stream wrappers.

@ForbesLindesay

That's simply not true. You need to support back pressure by not pulling unnecessary data down. What's changed, is that buffering is automatic. Back pressure still requires you to not read additional data. I should be able to effortlessly write a node.js proxy that serves a 4GB file from Amazon S3 and runs on a node.js drone that has only 50MB of memory. If we assume the user downloading the file is on a really slow wi-fi connection:

Buffering won't work because it'll completely run out of memory.

Back pressure will work, because as the destination stops requesting data from the proxy, the proxy will stop requesting data from Amazon S3 and everyone will be happy.

You do get this for free if you directly expose the underlying streams that you get from the core APIs. You don't get this for free if you wrap the underlying streams in something that looks like a stream but isn't, then convert that non-stream back into a stream.

If you need to support transformations on the stream, just do with with something like https://npmjs.org/package/through. If at some point you need to buffer up all that data and provide a callback API, use https://npmjs.org/package/concat-stream. Never create something that looks a bit like a stream, but isn't.

@lsegal

@ForbesLindesay I think the buffering issue you raised was an oversight of the implementation. I've just pushed a fix in 91e5964 that addresses buffering. Previously we were taking data events right from our HttpClient stream which reverted it to the old-style stream. We just didn't get around to making the stream2 switch everywhere.

This change corrects that and respects the new readable event everywhere, so that our createReadStream() only calls read() on the underlying HTTP socket when you do. This should implicitly support the backpressure functionality you've been talking about.

A few notes:

You don't get this for free if you wrap the underlying streams in something that looks like a stream but isn't, then convert that non-stream back into a stream.

We are not wrapping the stream in something that looks like a stream but isn't. The stream we are exposing is (and always was) a bona fide stream.Readable (or Stream before Node 0.10.0) that implements the proper _read protocol. The problem was that prior to the above commit we were not fully respecting the streams2 protocol. Please correct me if I'm wrong, but the streams2 protocol should hide all backpressure details from stream implementors, as the stream implementors never actually try to read anything; the consumers do. Therefore we do actually get backpressure for free by implementing this protocol properly. Basically, in a Node 0.10.x client, our streams do not buffer data-- we only read when the consumer asks for bytes through the readable event. If a starved process stops asking for bytes, we should stop trying to read from the underlying HTTP stream.

Let us know if we're still missing anything here, otherwise I think this issue is set to be closed.

@ForbesLindesay

I've added a number of line comments for things that could be vastly improved. What's there should work, but it could be a lot cleaner and simpler.

You also really ought to make your above example into:

var fs = require('fs');
var writeStream = fs.createWriteStream('/path/to/outfile');
var readStream = s3.createReadStream(params);
readStream.pipe(writeStream);

That would be much more node-like.

And writing should also be as simple:

var fs = require('fs');
var writeStream = s3.createWriteStream(params);
var readStream = fs.createReadStream('/path/to/infile');
readStream.pipe(writeStream);

There is a piped event you can listen for to get a reference to any stream that's been piped to you.

@lsegal

Thanks for the inline comments @ForbesLindesay, they look helpful. We will look through them to see what we can fix up and simplify. It's also good to see sign off from members of the community when we are doing things right, even if they might not immediately be perfect.

As for your examples, I'm not sure this is the right approach for our SDK long-term. I can think of a few problems with that API:

  1. Your proposed "createReadStream()" method operation does not map to a specific operation on the S3 service, which deviates from our explicit design of keeping the Service objects (previously "Clients") a 1:1 mapping of the actual service operations (you can see these service definitions in our lib/services/api/ directory which we use to build up-to-date service objects).

  2. Following from (1), it is non-obvious that "createReadStream" refers to "getObject". More importantly, a service might have multiple operations that can stream a payload of data (S3 and Glacier already do, if you include multi-part operations). In those cases, our API will be confusing to users trying to operate on those "secondary" streaming operations. In short, there is more than one kind of streaming operation that can exist per service, so it does not make sense to map this one stream call for the entire service. This goes similarly for createWriteStream and putObject.

  3. The proposal is specific to the S3 service (or specific to a service), which means we would have to opt-in to this kind of functionality for other services (like Glacier, which also supports streaming). The benefit of the createReadStream living on AWS.Request is that you can pipe HTTP data from any operation on any service without us having to explicitly support it. This is much more of a composable interface, IMO, and supports many more of AWS's use cases without any extra manual code. I think higher level APIs can do more to make these service objects easier to use, but those interfaces should live outside the S3 service object and be a composable interface in the same way the 'through' module or 'concat-stream' modules live outside stream and are composable.

Finally, I don't know that our current interface is less node-like. It might be slightly more verbose, but not less node-like. If you are comparing it to 'fs', this might be a little unfair. Let me explain why:

If you think about the API in terms of creating streams from resources, createReadStream should give you a stream for a given resource. I think we both would agree on this interpretation (yes? no?). The difference here is that there is only one "fs" object, therefore it makes sense to say fs.createReadStream() and never instantiate a new fs (it's basically a singleton representation of your file system). Services, however, have multiple resources that can be turned into streams, specifically, all of their operations. In other words, we stream operations, not services. If it were possible to access multiple "filesystems" in the same way that we create operation requests, I think it would be completely idiomatic/node-like to say something like fs('/dev/sda1').createReadStream('/path/to/file'), but the lack of this functionality means there is no node-like example in the standard library, and for that reason, our API seems out of place.

Is there something I'm missing here, or is it really just the fact that we are not creating a stream from a single service object (something that seems like a "singleton")?

@ForbesLindesay

OK, my problem is essentially that the API is a little bit backwards. You should make Stream the low level default interface and callback an optional "High Level" interface. request is a really nice example of an API that can either support a stream, or a callback interface. You can use the API in one of two ways:

var fs = require('fs');
var request = require('request');

//stream api
request('http://example.com/foo.json').pipe(fs.createWriteStream('foo.json'))

//callback api
request('http://example.com/foo.json', function (err, res) {
  if (err) throw err;
  console.dir(JSON.parse(res.body.toString()));
}

This gives you a really clean, idiomatic, streaming API, or a simple buffering one when that's more appropriate.

Translating this to your API, you could do:

var fs = require('fs');

//stream api
s3.getObject(params).pipe(fs.createWriteStream('foo.json'))

//callback api
s3.getObject(params, function (err, res) {
  if (err) throw err;
  console.dir(JSON.parse(res.body.toString()));
}

Alternatively you could follow the example of knox which is a hugely popular S3 library. Judging by stars it probably has more users currently than this library.

They have chosen to create API methods such that it's a single method call still but there are separate APIs for get and getFile which return a stream and a buffer respectively. I think this API could be improved upon slightly, but it's worth looking to for inspiration. This is somewhat paralleling the core libraries that have api.foo and api.fooSync for pretty much all file system APIs. You could do s3.getObject(params) and s3.getObjectStream(params). One of the key things that makes for a lot of ugly code is that rather than just making the request and then returning the result, you're first attaching and then detaching lots of event handlers.

@ForbesLindesay

I'd actually propose that you just have your core shared code just be a function which takes the request parameters and returns the response as a stream. Substack's hyperquest would make an excellent starting point. You can then get a callback with the response object, but you only ever get a stream for the actual body of the response.

To do buffering, you then just use concat-stream which buffers all the data and gives you a callback.

This way you split making buffered requests into two stages:

  1. get a stream
  2. buffer the stream and parse the result

And your streaming interfaces can be really clean because they can just skip step 2.

@ForbesLindesay

One other thing to consider: This is a really big library for node.js generally it's preferable that libraries are of a more similar size to knox so a better idea if possible would be to break this up into lots of separate modules.

  • an aws-core package, this would contain shared code useful to anyone interacting with any service from AWS. It would have docs and tests but would generally only be intended to be used by library authors.
  • a package for each service that aws provides (e.g. aws-s3, aws-sqs, aws-sns etc.). Virtually all of these would depend on aws-core
  • a single core package package that depends on the latest stable version of each of the other libraries and exposes one unified interface. This would be for people building applications that have a lot of integration with AWS and therefore find it more convenient to just install them in one go.

The advantages of this would be huge. At the moment if I have a small library that just happens to support interacting with an AWS SQS, it might end up requiring the entirety of the AWS library. That becomes a lot of wasted time spent installing code I won't ever use. You might also release a new version of this library, which fixes a bug that only impacts EC2. I'd then have to update the dependencies in all my libraries even though they have no interaction with EC2 or risk dependent applications ending up needing to download multiple different copies of the AWS API.

If you keep your modules small, people can create lots of neat little libraries that just wrap your core and provide a slightly nicer API or an API slightly more optimized to one specific use case.

@lsegal

Thanks @ForbesLindesay. All the feedback you've provided is certainly appreciated. I think at this point it seems like we've moved beyond the original issue that was reported, namely returning real streams. Although the API might have some rough edges for certain use cases, and there is some optimization / refactoring that can be done on the internals, it seems that we are indeed providing real stream objects that can be streamed from as normal. The internals can be refactored over time without affecting compatibility, and the rough edges can be worked down. I believe this issue can be closed (we've met this goal), but I also believe we should certainly keep iterating on this. At this point, I actually think the best way for you to help us iterate would be through pull requests that implement the modifications you are suggesting. That way we can compare the pros and cons with real examples rather than hypotheticals.

On a sidenote, I should point out that @trevorrowe and I have already discussed and taken serious looks at what it would take to split the SDK up into multiple packages. We've also heard that requests from others, and it's not lost on us. There are some pros and cons with multiple packages, but the option is still on the table. I don't think further discussion of that topic belongs in this issue, though.

Again, thank you to everyone who took part in this discussion!

@lsegal lsegal closed this May 10, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment