generalize test suite and add better documentation
Dave Pacheco committed Jul 30, 2012
1 parent 445a5b3 commit aefbc95
Showing 10 changed files with 187 additions and 79 deletions.
7 changes: 6 additions & 1 deletion Makefile
@@ -22,7 +22,7 @@ NPM = npm
#
# Files
#
JS_FILES := $(shell find lib tests -name '*.js')
JS_FILES := $(shell find lib tests survey -name '*.js')
JSL_CONF_NODE = tools/jsl.node.conf
JSL_FILES_NODE = $(JS_FILES)
JSSTYLE_FILES = $(JS_FILES)
@@ -34,6 +34,11 @@ JSSTYLE_FILES = $(JS_FILES)
all:
	$(NPM) install

test:
	tests/tst.strsplit.sh
	tests/tst.strpatterns.js
	@echo All tests passed.

DISTCLEAN_FILES += node_modules

include ./Makefile.targ
66 changes: 55 additions & 11 deletions README.md
@@ -2,17 +2,62 @@

## strsplit(str, pattern[, limit])

Splits a string `str` into at most `limit` fields using the pattern `pattern` as
a delimiter. The pattern may be either a string or a regular expression. The
returned value is an array of fields.
Splits a string `str` into fields using `pattern` as the separator, which may be
either a string or a regular expression.

If `limit` is unspecified or zero, any number of fields may be returned, and the
behavior is exactly identical to `str.split(pattern)`.
If `limit` is a positive number, the pattern will be applied at most `limit - 1`
times and the returned array will have at most `limit` elements. The last
element will contain all of `str` beyond the last separator. (This is unlike
the JavaScript standard String.split method, which also provides a `limit`
argument to control the number of returned fields. String.split always applies
the pattern as many times as possible, and only returns the first `limit`
fields, so the rest of the input is lost. See Notes below for details.)

`strsplit(str, pattern, limit)` behaves just like `str.split(pattern, limit)`,
with one important exception: the string is split at most `limit - 1` times,
so the last returned element contains the contents of all subsequent fields.
String.split truncates all such fields.
If `limit` is unspecified, negative, or zero, then there is no limit on the
number of matches or returned fields. Additionally, if `limit` is zero,
trailing empty fields are discarded.
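
As a rough illustration of the `limit` rules above, here is a minimal sketch. `splitAtMost` is a hypothetical name and not part of this module; the sketch assumes a RegExp pattern and an ES2015+ runtime (for `RegExp.prototype.flags`), and it leaves out the discarding of trailing empty fields for a zero limit.

```javascript
// Hypothetical helper illustrating the documented limit semantics.
// This is a sketch under the assumptions stated above, not the module's code.
function splitAtMost(str, regex, limit) {
    // Work on a global copy so that exec() advances through the string.
    var flags = regex.flags.indexOf('g') === -1 ? regex.flags + 'g' : regex.flags;
    var re = new RegExp(regex.source, flags);
    var fields = [];
    var last = 0;
    var match;

    // A positive limit allows at most limit - 1 separator matches;
    // any other value (unspecified, zero, negative) means no bound.
    while (!(limit > 0) || fields.length < limit - 1) {
        match = re.exec(str);
        if (match === null)
            break;
        if (match[0].length === 0) {
            // Skip empty matches so the loop always makes progress.
            re.lastIndex++;
            continue;
        }
        fields.push(str.slice(last, match.index));
        last = re.lastIndex;
    }

    // Everything after the last separator used becomes the final field.
    fields.push(str.slice(last));
    return fields;
}

console.log(splitAtMost('alpha bravo charlie delta', /\s+/, 3));
// [ 'alpha', 'bravo', 'charlie delta' ]
console.log('alpha bravo charlie delta'.split(/\s+/, 3));
// [ 'alpha', 'bravo', 'charlie' ]
```

The second call shows the built-in behavior for contrast: the remainder of the input is dropped rather than kept in the last field.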

It's often desirable to skip leading empty fields as well, as awk(1) and bash(1)
do in processing fields. To do this, use String.trim before calling strsplit.
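
The effect of trimming first can be seen with the built-in `String.prototype.split` alone, so this small sketch does not depend on strsplit. The input line is a made-up example of right-aligned output, not taken from a real tool.

```javascript
// Hypothetical line with leading whitespace, as produced by tools that
// right-align their first column.
var line = '  501 86008 ttys000 -bash';

// Without trimming, the leading whitespace yields an empty first field.
var raw = line.split(/\s+/);

// Trimming first gives the awk-like result.
var trimmed = line.trim().split(/\s+/);

console.log(raw);      // [ '', '501', '86008', 'ttys000', '-bash' ]
console.log(trimmed);  // [ '501', '86008', 'ttys000', '-bash' ]
```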


## Examples

Split a colon-separated list (e.g., a line from /etc/passwd):

    > strsplit('nobody:*:-2:-2:Nobody User:/var/empty:/usr/bin/false', ':');
    [ 'nobody', '*', '-2', '-2', 'Nobody User', '/var/empty', '/usr/bin/false' ]

Split a whitespace-separated list (e.g., output from "ps"):

    > strsplit('86008 ttys000 0:00.05 -bash', /\s+/);
    [ '86008', 'ttys000', '0:00.05', '-bash' ]

Similarly, split a line into words:

    > strsplit('How about a game of chess?', /\s+/)
    [ 'How', 'about', 'a', 'game', 'of', 'chess?' ]

Some tabular data formats allow the last field to contain the delimiter. The
reader is expected to know how many fields there are to avoid getting confused.
The number of fields can be specified with the `limit` argument:

    > /* 4 Fields: Games, Wins, Losses, Team Name */
    > strsplit('101 55 46 San Francisco Giants', ' ', 4);
    [ '101', '55', '46', 'San Francisco Giants' ]

See [node-strtab](https://github.com/davepacheco/node-tab) for a higher-level
interface to read and write tabular data.


## Notes

As described above, `strsplit` is similar to `String.split`, but limits the
number of times the pattern is matched rather than simply the number of matched
fields returned. If you actually want only the first N matches, then specify no
limit and call `slice` on the result (or just use String.split). If `limit` is
negative or unspecified, the behavior is exactly identical to
`str.split(pattern)`.
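
For the "first N fields only" case mentioned above, a short sketch using only built-in methods shows that the two suggested approaches agree. The variable names are illustrative.

```javascript
var input = 'alpha bravo charlie delta';

// Built-in limit: split everywhere, keep only the first two fields.
var viaLimit = input.split(' ', 2);

// Equivalent: split with no limit, then slice the result.
var viaSlice = input.split(' ').slice(0, 2);

console.log(viaLimit);  // [ 'alpha', 'bravo' ]
console.log(viaSlice);  // [ 'alpha', 'bravo' ]
```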

By comparison, here's String.split:

@@ -24,5 +69,4 @@ and here's strsplit:
    > strsplit('alpha bravo charlie delta', ' ', 3)
    [ 'alpha', 'bravo', 'charlie delta' ]

This is the behavior implemented by `split` in Perl, Java, and other
environments.
This is the behavior implemented by `split` in Perl, Java, and Python.
14 changes: 12 additions & 2 deletions lib/strsplit.js
@@ -8,7 +8,7 @@ function strsplit(str, pattern, limit)
{
	var i, rv, last, match;

	if (arguments.length < 3 || !limit)
	if (arguments.length < 3 || limit == -1)
		return (str.split(pattern));

	if (limit == 1)
@@ -24,7 +24,7 @@ function strsplit(str, pattern, limit)

	rv = [];
	last = 0;
	for (i = 0; i < limit - 1; i++) {
	for (i = 0; limit === 0 || i < limit - 1; i++) {
		match = pattern.exec(str);

		if (!match)
@@ -36,5 +36,15 @@ function strsplit(str, pattern, limit)
	}

	rv.push(str.substr(last));

	if (limit === 0) {
		for (i = rv.length - 1; i > 0; i--) {
			if (rv[i].length > 0)
				break;
		}

		rv = rv.slice(0, i + 1);
	}

	return (rv);
}
7 changes: 5 additions & 2 deletions package.json
@@ -1,6 +1,6 @@
{
	"name": "strsplit",
	"version": "0.0.1",
	"version": "1.0.0",
	"description": "split a string by a regular expression",
	"main": "lib/strsplit.js",
	"repository": {
@@ -9,5 +9,8 @@
	},
	"keywords": [ "string", "split" ],
	"author": "Dave Pacheco <dap@joyent.com>",
	"license": "MIT"
	"license": "MIT",
	"scripts": {
		"test": "make test"
	}
}
7 changes: 5 additions & 2 deletions survey/Makefile
@@ -1,4 +1,4 @@
all: java.csv perl.csv python.csv
all: java.csv perl.csv python.csv js-strsplit.csv

java.csv: testcases.csv | StringSplitTest.class
	java StringSplitTest < $< > $@
@@ -12,5 +12,8 @@ perl.csv: testcases.csv strsplit.pl
python.csv: testcases.csv strsplit.py
	./strsplit.py < $< > $@

js-strsplit.csv: testcases.csv strsplit.js
	./strsplit.js < $< > $@

clean:
	rm -f java.csv perl.csv python.csv StringSplitTest.class
	rm -f java.csv perl.csv python.csv js-strsplit.csv StringSplitTest.class
27 changes: 12 additions & 15 deletions survey/README.md
@@ -9,6 +9,9 @@ Specifically, this is:
* Python: re.split. While the "split" method on strings may be more common, it
does not handle regular expressions, while the Java and Perl counterparts do.

For comparison, there's also a test case for this implementation of "strsplit"
in JavaScript.

The test cases here test both a simple string as a splitter (a space) and a
simple regular expression (`\s+`, indicating some non-zero number of whitespace
characters), as well as various values of the optional "limit" parameter.
@@ -25,18 +28,12 @@ identical. The Python implementation differs in a few ways:
* Java and Perl strip trailing empty fields when "limit" is 0. Python never
strips trailing empty fields.
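
The trailing-field difference noted in the list above can be sketched in JavaScript as follows. `stripTrailingEmpty` is a hypothetical helper written for illustration, not code from this repository; how an entirely empty result should be treated is not covered by the text here, and this sketch simply returns an empty array in that case.

```javascript
// Hypothetical helper: drop empty strings from the end of a field list,
// which is what Java and Perl are described as doing when "limit" is 0.
function stripTrailingEmpty(fields) {
    var end = fields.length;

    // Walk back from the end past every empty field.
    while (end > 0 && fields[end - 1] === '')
        end--;

    return fields.slice(0, end);
}

// Two trailing separators produce two trailing empty fields.
var unstripped = 'a b  '.split(' ');
console.log(unstripped);                      // [ 'a', 'b', '', '' ]
console.log(stripTrailingEmpty(unstripped));  // [ 'a', 'b' ]
```

Empty fields in the middle of the list are left alone; only the run at the end is removed.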

The point of all of this is to figure out what a new JavaScript "strsplit"
should do. JavaScript has a "split" method, but it behaves substantially
different than all of these implementations when "limit" is specified.
This implementation of "strsplit" for JavaScript will mirror the Java and Perl
implementation, as the differences in Python do not seem substantial or better.

The remaining use case that would be nice to address is splitting fields in
textual input like awk(1) does. This doesn't work directly in any of these
implementations when the first field has leading whitespace (as in the case of
ps(1) output, for example, and any other tool that right-aligns some fields).
In that case, these tools return an empty first field, while awk and bash would
skip the leading whitespace. Python's string split does handle this if you
specify a separator of None (or don't specify one), but this seems pretty ad
hoc. It seems to make more sense to have consumers simply call trim() first, if
that's what they want.
JavaScript has a "split" method, but it behaves substantially differently from
all of these implementations when "limit" is specified. This implementation of
"strsplit" for JavaScript mirrors the Java and Perl implementations, since
Python's differences seem neither substantial nor better.

The remaining use case that would be nice to address is splitting fields the way
awk(1) and bash(1) do, which is to strip leading whitespace. Python's *string*
split also does this, but only if you specify None as the pattern. strsplit
doesn't support this; just trim the string first if you want that behavior.
40 changes: 40 additions & 0 deletions survey/strsplit.js
@@ -0,0 +1,40 @@
#!/usr/bin/env node

var mod_fs = require('fs');

var strsplit = require('../lib/strsplit');
var buffer = '';

process.stdin.resume();
process.stdin.on('data', function (chunk) {
	var i, line;

	buffer += chunk.toString('utf8');

	i = buffer.indexOf('\n');
	while (i != -1) {
		line = buffer.substr(0, i);
		buffer = buffer.substr(i + 1);
		i = buffer.indexOf('\n');
		processLine(line);
	}
});

function processLine(line)
{
	var parts, rv;

	if (line.length === 0 || line[0] == '#') {
		console.log(line);
		return;
	}

	parts = line.split(';');
	if (parts.length != 3) {
		console.log('line garbled: %s', line);
		return;
	}

	rv = strsplit(parts[2], new RegExp(parts[1]), Math.floor(parts[0]));
	console.log(rv.join(';'));
}
46 changes: 0 additions & 46 deletions tests/strsplit.js

This file was deleted.

33 changes: 33 additions & 0 deletions tests/tst.strpatterns.js
@@ -0,0 +1,33 @@
#!/usr/bin/env node

/*
* JavaScript-special test cases. These are required to test string patterns
* rather than regular expressions.
*/

var mod_assert = require('assert');
var strsplit = require('../lib/strsplit');

/* basic case with string pattern (uses String.split) */
mod_assert.deepEqual([ 'one', 'two', 'three' ],
strsplit('one two three', ' '));

/* limit ineffective, simple cases */
mod_assert.deepEqual([ 'one', 'two', 'three', 'four', 'five' ],
    strsplit('one two three four five', ' ', 10));
mod_assert.deepEqual([ 'one', 'two\tthree', 'four', 'five' ],
    strsplit('one two\tthree four five', ' ', 10));

/* limit ineffective, empty fields */
mod_assert.deepEqual([ 'one', 'two', '', 'three' ],
    strsplit('one two  three', ' ', 10));

/* regexp escaping for string patterns */
mod_assert.deepEqual([ 'one', 'two', 'three' ],
strsplit('one.two.three', '.', 10));
mod_assert.deepEqual([ '', '', '', '', 'two.three' ],
strsplit('one.two.three', /./, 5));

/* limit effective */
mod_assert.deepEqual([ 'one', 'two three' ],
strsplit('one two three', ' ', 2));
19 changes: 19 additions & 0 deletions tests/tst.strsplit.sh
@@ -0,0 +1,19 @@
#!/bin/bash

#
# The main test suite for strsplit is to run the body of test cases in
# ../survey/ and compare the output to that of Java and Perl, whose
# implementations we intend to mirror exactly. errexit will cause this script
# to exit with failure if any of these operations fail.
#
set -o errexit

surveydir=$(dirname $0)/../survey

set -o xtrace
make -C $surveydir perl.csv java.csv js-strsplit.csv
diff $surveydir/js-strsplit.csv $surveydir/perl.csv > /dev/null
diff $surveydir/js-strsplit.csv $surveydir/java.csv > /dev/null
set +o xtrace

echo "Test PASSED"
