Update control-flow-part-iii.
creationix committed Jul 27, 2010
1 parent 246cbb8 commit 6683fe2
Showing 6 changed files with 197 additions and 271 deletions.
216 changes: 40 additions & 176 deletions articles/control-flow-part-iii.markdown
@@ -1,219 +1,83 @@
Title: Control Flow in Node Part III
Author: Tim Caswell
Date: Mon Feb 15 2010 09:21:11 GMT-0600 (CST)
Node: v0.1.102

While working on my quest to make async programming easier, or at least bearable, I discovered that often in programming you work with a set of data and want to do things on all the items in that set at once.

This article will explain a way to do async `filter` and `map` functions where the callback to `map` or `filter` is an async operation itself. I will compare the simple task of reading all the files in a directory into memory in both sync and async style programming.

**UPDATE** This article has been heavily updated to use callbacks and new node APIs. See the past revisions in the panel to the right for the original Promise/Continuable based article.

## The Blocking Way

In a synchronous programming language where I/O is blocking, this task is very straightforward and can be done in node as long as you understand the consequences. Node exposes `Sync` versions of many of its I/O functions for the special cases where you don't care about performance and would rather have the much easier coding style (like server startup).

For this example we will need three methods from the `fs` package. We need `readdir` to get a listing of files in a directory, `stat` to test the results (we only want files, not directories), and `readFile` to read the contents to memory.

Solving the problem is very straightforward using sync style coding:

<control-flow-part-iii/sync-loaddir.js>

Since the commands are sync we are able to use the built in `filter` and `map` from `Array.prototype` on the array returned by `fs.readdirSync`.

This is extremely easy to code, but has dangerous side effects. The program stalls while waiting for the blocking `fs` operations to finish. Since CPUs are very fast compared to other hardware (like hard drives), the CPU sits wasted when it could be busy serving another client's request if this were part of a hot running event loop.

Obviously this is not optimal. Nothing is done in parallel. Many CPU cycles are wasted.

## The Non-Blocking Way

They say that in computer science there is always a give and take when comparing different algorithms. The pro to synchronous coding style is that it's very easy to read and write. The con is that it's very inefficient. That's why most programming languages need threads to achieve any level of concurrency, but node is able to do quite a bit on a single threaded platform.

(Yes, I'm aware of coroutines, but in JavaScript, where everything is so mutable, they don't work well and are about the same as multi-threading complexity-wise. See [the archives] for information on Node's experiment with this idea.)

To make the comparison simple, I'll do the same thing, but using non-blocking apis and callbacks. An initial implementation of our `loaddir` function would be this:

<control-flow-part-iii/async-loaddir.js>

Yikes! That is almost four times as long and indented several times deeper. I know it's a trade-off, but at this point I'm thinking I'll return to [Ruby][] with [clusters of servers][] on the backend to handle concurrency.

### Map and Filter Helpers for Async Code

Since map and filter are common tasks in programming and that's what we really want here, let's write some helpers to make this beast of code a little smaller.

Here is a `map` helper. It takes an array, an iterator function, and a callback. The iterator itself is an async function that takes a callback.

<control-flow-part-iii/helpers.js#map>
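
If you don't have the linked `helpers.js` in front of you, a `map` helper in the node callback convention might look roughly like this (an assumed sketch, not necessarily the article's exact code):

```javascript
// Sketch of a map helper in the node (err, result) callback style.
// `fn` is called once per item; results keep their original positions
// even though the callbacks may arrive in any order.
function map(array, fn, callback) {
  var counter = array.length;
  var results = new Array(array.length);
  if (counter === 0) { callback(null, results); return; }
  array.forEach(function (item, index) {
    fn(item, function (err, result) {
      if (err) { callback(err); return; }
      results[index] = result;
      counter--;
      if (counter === 0) {
        callback(null, results);
      }
    });
  });
}
```

(A production version would also guard against calling back twice when several items fail.)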

And here is a filter helper. It works the same, but removes items that don't pass the filter.

<control-flow-part-iii/helpers.js#filter>
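
Again as a rough sketch (assumed, not the linked file verbatim), the filter helper records each boolean answer by index and keeps only the items that passed:

```javascript
// Sketch of a filter helper in the node callback style. `fn` answers
// (err, keep) for each item; the final array preserves input order.
function filter(array, fn, callback) {
  var counter = array.length;
  var keep = new Array(array.length);
  if (counter === 0) { callback(null, []); return; }
  array.forEach(function (item, index) {
    fn(item, function (err, ok) {
      if (err) { callback(err); return; }
      keep[index] = ok;
      counter--;
      if (counter === 0) {
        callback(null, array.filter(function (item, index) {
          return keep[index];
        }));
      }
    });
  });
}
```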

Now with our helpers, let's try the async version again to see how much shorter we can make it:

<control-flow-part-iii/async-loaddir2.js>

That code is much shorter and easier to read. Since `fs.readFile` and our `callback` are themselves async functions following the node convention, we can pass them directly as the second and third arguments to the `helpers.map` call. This is one of the payoffs of sticking to a common callback pattern.

Also, now that the code is executing in parallel, we can issue a stat call for all the files in a directory at once and then collect the results as they come in. But with this version, not a single `readFile` can execute until all the `stat` calls finish. In an ideal world, the program would start reading the file as soon as it knows it's a file and not a directory.

### Combined Filter and Map Helper

Often you will want to filter and then map on the same data set. Let's make a combined `filterMap` helper and see how it helps:

<control-flow-part-iii/helpers.js#filtermap>
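
Sketched the same way (assumed, not the linked file verbatim), the combined helper is barely bigger than `map`: it collects results by index and drops the `undefined` slots at the end:

```javascript
// Sketch of a combined filterMap helper. `fn` passes back a mapped
// value, or nothing at all (undefined) to drop the item entirely.
function filterMap(array, fn, callback) {
  var counter = array.length;
  var results = new Array(array.length);
  if (counter === 0) { callback(null, []); return; }
  array.forEach(function (item, index) {
    fn(item, function (err, result) {
      if (err) { callback(err); return; }
      results[index] = result; // undefined means "filtered out"
      counter--;
      if (counter === 0) {
        callback(null, results.filter(function (result) {
          return typeof result !== 'undefined';
        }));
      }
    });
  });
}
```

The one caveat of this scheme is that a legitimate result of `undefined` is indistinguishable from "drop this item".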

Now with this combined helper, let's write a truly parallel `loaddir` function:

<control-flow-part-iii/async-loaddir3.js>

Here we issue all the `stat` commands at once and, as each comes back, check whether it is a file; if so, we fire off the `readFile` command right away. If not, we output `undefined`, signifying to `filterMap` that we're not interested in that entry. When the `readFile` command comes back, we send the file contents to the helper. Once every item has produced either `undefined` or some data, the helper knows it's done and gives us the result.

## Conclusion and Source Code

While it is a tradeoff in code complexity vs performance, with a little thinking and some good libraries, we can make async programming manageable enough to be understandable while taking full advantage of the parallel nature of non-blocking IO in node.

All source code used in these examples is linked to on the right side of the page or in the upper-right corner of the code snippets.

**UPDATE** I've since made a general purpose callback library called [Step]. While it doesn't include map and filter helpers, it does have the more useful parallel and group helpers.

[Ruby]: http://www.ruby-lang.org/
[clusters of servers]: http://unicorn.bogomips.org/
[even longer]: http://github.com/creationix/howtonode.org/blob/master/articles/control-flow-part-iii/program.js#L84
[github]: http://github.com/creationix/howtonode.org/tree/master/articles/control-flow-part-iii/
[promises]: http://nodejs.org/api.html#_tt_events_promise_tt
[Search the mailing list]: http://groups.google.com/group/nodejs/search?group=nodejs&q=wait
[the archives]: http://groups.google.com/group/nodejs/search?group=nodejs&q=wait
[Step]: /step-of-conductor
37 changes: 37 additions & 0 deletions articles/control-flow-part-iii/async-loaddir.js
@@ -0,0 +1,37 @@
var fs = require('fs');

// Here is the async version without helpers
function loaddir(path, callback) {
  fs.readdir(path, function (err, filenames) {
    if (err) { callback(err); return; }
    var realfiles = [];
    var count = filenames.length;
    filenames.forEach(function (filename) {
      fs.stat(filename, function (err, stat) {
        if (err) { callback(err); return; }
        if (stat.isFile()) {
          realfiles.push(filename);
        }
        count--;
        if (count === 0) {
          var results = [];
          realfiles.forEach(function (filename) {
            fs.readFile(filename, function (err, data) {
              if (err) { callback(err); return; }
              results.push(data);
              if (results.length === realfiles.length) {
                callback(null, results);
              }
            });
          });
        }
      });
    });
  });
}

// And it's used like this
loaddir(__dirname, function (err, result) {
if (err) throw err;
console.dir(result);
});
24 changes: 24 additions & 0 deletions articles/control-flow-part-iii/async-loaddir2.js
@@ -0,0 +1,24 @@
var fs = require('fs'),
    helpers = require('./helpers');

// Here is the async version with filter and map helpers:
function loaddir(path, callback) {
  fs.readdir(path, function (err, filenames) {
    if (err) { callback(err); return; }
    helpers.filter(filenames, function (filename, done) {
      fs.stat(filename, function (err, stat) {
        if (err) { done(err); return; }
        done(null, stat.isFile());
      });
    }, function (err, filenames) {
      if (err) { callback(err); return; }
      helpers.map(filenames, fs.readFile, callback);
    });
  });
}

// And it's used like this
loaddir(__dirname, function (err, result) {
if (err) throw err;
console.dir(result);
});
25 changes: 25 additions & 0 deletions articles/control-flow-part-iii/async-loaddir3.js
@@ -0,0 +1,25 @@
var fs = require('fs'),
    helpers = require('./helpers');

// Here is the async version with a combined filter and map helper:
function loaddir(path, callback) {
  fs.readdir(path, function (err, filenames) {
    if (err) { callback(err); return; }
    helpers.filterMap(filenames, function (filename, done) {
      fs.stat(filename, function (err, stat) {
        if (err) { done(err); return; }
        if (stat.isFile()) {
          fs.readFile(filename, done);
          return;
        }
        done();
      });
    }, callback);
  });
}

// And it's used like this
loaddir(__dirname, function (err, result) {
if (err) throw err;
console.dir(result);
});
