async's forEach insn't actually asynchronous #232

Closed
brianmaissy opened this Issue Feb 7, 2013 · 8 comments

4 participants

@brianmaissy

Unless I'm simply delirious after staying up all night thinking about asynchronous function calls, I think I've found a huge bug in how async fundamentally works. Consider the following test, which I will not commit to any repository for moral reasons, but feel free to copy and paste into test-async.js and run (the constant 40 might need to be modified to fit the speed of your javascript interpreter):

exports['horrible'] = function(test){
    function fib(n){
        if(n<=1) return 1;
        return fib(n-1) + fib(n-2);
    }
    async.parallel([
        function(callback){
            fib(40);
            test.equal(1, 1); // A
            callback();
        },
        function(callback){
            fib(40);
            test.notEqual(1, 2); // B
            callback();
        }
    ], function(error, result){
        test.ok(true); // C
        test.done();
    });
    test.same(1, 1); // D
};

Pardon my extremely hackish logging techniques, and notice the order in which the assertions are logged by the testing engine: A, B, C, D.

What we notice first is that async.parallel doesn't return until all its tasks are finished. That doesn't seem right: I want them to run asynchronously in parallel, invoking the callback when they are all finished, and in the mean time, allow me to go about my business. The correct ordering, it seems to me, should be D, A, B, C

But that's not the worst of it. If we look deeper, if t is the time necessary to execute fib(40), async.parallel blocks for a full 2t straight! Instead it should block for t, release the event loop for a cycle, and then block for another t (which is how the built-in browser forEach operates). This is an extreme example to demonstrate the point, but imagine a real use case where each function blocks for 50ms and I'm executing a list of 50 of them. Async will cause my browser to noticeably hang, unnecessarily!

Luckily, I think this can be fixed, primarily with some carefully placed usages of nextTick, but it will require changes in multiple places, including possibly no longer using the built in Array.prototype.forEach. Assuming I'm not crazy and @caolan agrees that this is an issue, I'll start with writing failing tests that demonstrate the issue without hanging the browser, and then attempt to solve it.

@caolan
Owner

It's a common misconception that the async library will magically make your synchronous JavaScript asynchronous, alas, it only contains tools for composing function with are already asynchronous. In order to run your example in parallel you'd have to fire up multiple node.js processes or web-workers since JavaScript is single-threaded... that said, I think what you're suggesting is that it just releases control to the event loop after calling each synchronous function. I think using synchronous functions with async is an abuse of the library but people seem determined to do it, so perhaps we should placate them since the performance implications are small. The iterators called in series should already release control to the event loop, it's the ones operating in parallel which currently don't... to keep it predictable we could also wait for a nextTick before calling any async functions to make them consistently asynchronous, but who are these people running sync functions as async iterators??

@brianmaissy

I clarify that I mean calling functions asynchronously (a matter of control flow) and not executing functions asynchronously (of course node executes all your code synchronously). It's not primarily an issue of performance, but an issue of conforming to the convention that a function that invokes a callback (or an iterator, I would posit) invokes it asynchronously (on the next tick of the event loop), the result being releasing control to the event loop each time.

As I said, my example is exaggerated, but the principle still holds. You're right that people shouldn't use async (or node for that matter) for long-running synchronous tasks. But someone who uses async for tasks that are at the core asynchronous, might call functions that do some amount of synchronous set-up before calling the asynchronous function.

function getData(callback){
  parameter = // some minor but necessary synchronous javascript
  trulyAsynchronousIOCall(parameter, callback);
}

// a long of array of these 'mostly asynchronous' tasks
var tasks = [getData, getData, ...];

async.parallel(tasks, callback);

I don't think you would call this an abuse of async, because each call to getData in tasks does its primary work asynchronously. However, it has a synchronous component, in the sense of taking some time (depending on what is done to calculate parameter) between when it is called and when it returns. I'm pretty sure a lot of people use async this way, and unless you would rather them all use it like this:

function getData(callback){
  nextTick(function(){
    parameter = // some minor but necessary synchronous javascript
    trulyAsynchronousIOCall(parameter, callback);
  }
}

I would expect the event loop is released each time, so async.parallel returns before its tasks are run, and other synchronous code can run while they are being invoked.

@tleach

It's not primarily an issue of performance, but an issue of conforming to the convention that a function that invokes a callback (or an iterator, I would posit) invokes it asynchronously (on the next tick of the event loop)

Not sure I agree that this is the convention. The whole point of the node callback idiom is to recognize that certain tasks (e.g. network/file IO) would typically involve blocking a thread while the CPU does very little and so are therefore good candidates for being executed asynchronously. Because those types of tasks don't involve synchronous code execution it is a natural consequence that their callbacks are invoked on a later loop iteration, but it is not an enforced convention.

FTIW I don't think async should be making any assumptions about when callbacks are run. You might want it on a later loop iteration, others may not. If you want your synchronous code to run in a subsequent event loop then put it in a nextTick block.

As @caolan says, async is designed as a control flow library for asynchronous code. It's not intended to arbitrarily postpone synchronous code to an later loop iteration.

@brianmaissy

The essential issue here is whether or not one expects callbacks to be called asynchronously. Apparently @caolan, you don't, based on the reason you closed issue 215:

I'd expect the callback to be called synchronously.

You're right that in practice I don't think I've seen people call their callbacks asynchronously, but this article encourages doing so as a convention, and Tim Caswell appears to agree (or at least did a year ago). Although in practice I don't think I've seen people do it, it makes a lot of sense to me.

Another way of framing the issue: why the inconsistency that async.series returns before executing its tasks, but async.parallel executes its tasks before returning? Of course if they are all purely asynchronous it can't matter, and that might make all the difference.

@brianmaissy

Just corresponded with Tim by email, he still thinks returning before invoking callbacks is a good idea. At the very least, an API should be consistent in it behavior. It bothers me that async.series returns quickly, but async.parallel doesn't.

The cost of doing so might be a slight performance hit, but I can think of all the following benefits:
1. consistency with the Node APIs
2. callbacks appear more like events in the sense of event-driven programming - the callbacks are logically new execution stacks
3. stack traces won't be cluttered with irrelevant frames or blow up from recursion
4. exception handling stops working unexpectedly (as in 215)
5. one can use an object returned by a function in a callback passed to that function

In any case it's a matter of convention and API design, so If @caolan chooses not to adopt this convention of returning before calling iterators and callbacks, we should clearly document that, including instructing the users of async to explicitly wrap synchronous functions in nextTick if they want async to return immediately.

@brianmaissy

@caolan let me know your thoughts if you want to pursue this issue further, otherwise feel free to close this.

@yarcowang

Yes, i agree with @caolan . The same time, i think @brianmaissy is right in theory.
But it is beyond async itself, and even a single nodejs process.
what about put async.parallel into another process if D is not rely on the result and the parallel process will cost a lot of time? ( http://nodejs.org/api/child_process.html)

@brianmaissy

That's possible, but besides the point here I think.

@caolan caolan closed this Mar 28, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment