-
-
Notifications
You must be signed in to change notification settings - Fork 208
fix: isWellFormed #792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: isWellFormed #792
Conversation
This is imho not acceptable for fairness reasons. as the test is basically from @cesco69. You should try to give him atleast the credentials for the test. IMHO checkout his PR by doing |
This comment was marked as off-topic.
This comment was marked as off-topic.
Ah, ok. :) Yeah, i thought so too. I guess we need to revert it completely |
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was marked as outdated.
This comment was marked as outdated.
I have made a benchmark for test only TLTR: isWellFormed with a short regex (only ASCII escape) seem faster. I have compared:
On my PC (cpu: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz)
BENCHMARKI have published the benchmark online
index.mjs import { run, bench, summary, do_not_optimize } from "mitata";
/**
* NEW IMPLEMENTATION OF asString
*
* Uses String.prototype.isWellFormed() to skip regex check for surrogate pairs
*/
const ASCII_ESCAPE = /[\u0000-\u001f\u0022\u005c]/;
const asStringNew = (str) => {
const len = str.length;
if (len === 0) {
return '""';
} else if (len < 42) {
// magically escape strings for json
// relying on their charCodeAt
// everything below 32 needs JSON.stringify()
// every string that contain surrogate needs JSON.stringify()
// 34 and 92 happens all the time, so we
// have a fast case for them
let result = "";
let last = -1;
let point = 255;
for (let i = 0; i < len; i++) {
point = str.charCodeAt(i);
if (
point === 0x22 || // '"'
point === 0x5c // '\'
) {
last === -1 && (last = 0);
result += str.slice(last, i) + "\\";
last = i;
} else if (point < 32 || (point >= 0xd800 && point <= 0xdfff)) {
// The current character is non-printable characters or a surrogate.
return JSON.stringify(str);
}
}
return (
(last === -1 && '"' + str + '"') || '"' + result + str.slice(last) + '"'
);
} else if (len < 5000 && str.isWellFormed() && ASCII_ESCAPE.test(str) === false) {
// Only use the regular expression for shorter input. The overhead is otherwise too much.
return '"' + str + '"';
} else {
return JSON.stringify(str);
}
};
/**
* OLD IMPLEMENTATION OF asString
*
* Uses a regex to check for surrogate pairs
*/
const STR_ESCAPE = /[\u0000-\u001f\u0022\u005c\ud800-\udfff]/;
const asStringOld = (str) => {
const len = str.length;
if (len === 0) {
return '""';
} else if (len < 42) {
// magically escape strings for json
// relying on their charCodeAt
// everything below 32 needs JSON.stringify()
// every string that contain surrogate needs JSON.stringify()
// 34 and 92 happens all the time, so we
// have a fast case for them
let result = "";
let last = -1;
let point = 255;
for (let i = 0; i < len; i++) {
point = str.charCodeAt(i);
if (
point === 0x22 || // '"'
point === 0x5c // '\'
) {
last === -1 && (last = 0);
result += str.slice(last, i) + "\\";
last = i;
} else if (point < 32 || (point >= 0xd800 && point <= 0xdfff)) {
// The current character is non-printable characters or a surrogate.
return JSON.stringify(str);
}
}
return (
(last === -1 && '"' + str + '"') || '"' + result + str.slice(last) + '"'
);
} else if (len < 5000 && STR_ESCAPE.test(str) === false) {
// Only use the regular expression for shorter input. The overhead is otherwise too much.
return '"' + str + '"';
} else {
return JSON.stringify(str);
}
};
/**
* BENCHMARKS CASES
*/
const shortString = 'a'.repeat(10)
const longString = 'a'.repeat(1000)
const shortDirtyString = shortString + '\n'
const longDirtyString = longString + '\n'
const surrogate = '\uD800';
const escapeChar = '"';
const shortSurrogateString = 'a'.repeat(10) + surrogate;
const shortQuoteString = 'a'.repeat(10) + escapeChar;
const longSurrogateString = 'a'.repeat(1000) + surrogate;
const longQuoteString = 'a'.repeat(1000) + escapeChar;
const args = {
str: [
'',
shortString, longString,
shortDirtyString, longDirtyString,
shortSurrogateString, shortQuoteString,
longSurrogateString, longQuoteString
]
};
/**
* BENCHMARKS IMPLEMENTATIONS
*/
summary(() => {
bench("asStringNew", function* xNew(state) {
const str = state.get('str');
yield () => do_not_optimize(asStringNew(str));
}).gc("inner").args(args);
bench("asStringOld", function* xOld(state) {
const str = state.get('str');
yield () => do_not_optimize(asStringOld(str));
}).gc("inner").args(args);
});
await run({
format: "mitata",
colors: true,
throw: true,
}); run with node --expose-gc --allow-natives-syntax index.mjs --iterations=50000 RAW RESULTnode --expose-gc --allow-natives-syntax index.mjs --iterations=50000
clk: ~2.96 GHz
cpu: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
runtime: node 20.14.0 (x64-win32)
benchmark avg (min … max) p75 / p99 (min … top 1%)
------------------------------------------- -------------------------------
asStringNew 6.67 ns/iter 7.69 ns ▃ █▂ ▆▅
(2.29 ns … 38.26 ns) 20.02 ns ███████▄▃▂▃▂▃▄▃▁▂▁▁▁▁
gc( 1.75 ms … 6.64 ms) 0.50 b ( 0.17 b… 42.74 b)
asStringNew 120.74 ns/iter 143.55 ns █▃▂ ▃▂▆ ▃
(51.32 ns … 231.05 ns) 202.03 ns ▂▂▂▅███▄▆██████▆▂▂▁▁▂
gc( 1.91 ms … 6.16 ms) 60.07 b ( 6.28 b…144.35 b)
asStringNew 1.04 µs/iter 1.04 µs ▃██▅▂
(916.89 ns … 1.57 µs) 1.48 µs █████▂▄▄▁▂▃▄▃▂▂▂▂▂▂▁▂
gc( 1.79 ms … 5.84 ms) 78.31 b ( 64.19 b…340.62 b)
asStringNew 269.22 ns/iter 321.44 ns ██▅
(125.76 ns … 875.37 ns) 842.99 ns ███▆▆▆▆▃▄▃▁▁▂▁▂▁▂▂▂▁▂
gc( 1.72 ms … 6.89 ms) 62.08 b ( 5.14 b…238.20 b)
asStringNew 6.78 µs/iter 8.08 µs ▅▂ █▂ ▂▂
(4.26 µs … 10.95 µs) 9.87 µs ▄▄██▄▄▄▄██▇▇▄▁██▇▁▁▁▄
gc( 2.18 ms … 7.96 ms) 2.23 kb ( 2.19 kb… 2.23 kb)
asStringNew 702.15 ns/iter 826.42 ns ▇▂▃▃ █
(378.81 ns … 1.38 µs) 1.37 µs ███▅████▆█▇▃▆▂▂▁▂▁▃▃▂
gc( 1.93 ms … 9.78 ms) 205.19 b (142.14 b…473.12 b)
asStringNew 201.91 ns/iter 248.00 ns ▆█▇▄ ▃▅▇ ▃▆▃
(100.61 ns … 357.50 ns) 340.36 ns ▂▄█████████████▆▂▁▃▅▃
gc( 1.78 ms … 6.80 ms) 168.70 b ( 43.10 b…361.37 b)
asStringNew 5.95 µs/iter 6.21 µs █▂
(3.06 µs … 8.44 µs) 7.83 µs ▃▁▁▁▃▃▁▃▆▄▃▇██▁▆▄▁▃▃▄
gc( 2.09 ms … 7.20 ms) 200.28 b (199.85 b…212.28 b)
asStringNew 7.87 µs/iter 8.52 µs █
(5.33 µs … 10.24 µs) 9.85 µs ▃▁▁▁▃▅▅▃▃▇█▃▃▅▃▃▁▅▅▁▅
gc( 2.33 ms … 7.66 ms) 2.23 kb ( 2.23 kb… 2.24 kb)
asStringOld 11.60 µs/iter 12.30 µs ▅█ ▆█
(8.10 µs … 33.60 µs) 16.00 µs ▃▄▄▄▃██████▇▇▃▅▃▄▂▂▂▄
gc( 2.78 ms … 11.86 ms) 519.60 b (504.00 b… 2.64 kb)
asStringOld 177.89 ns/iter 208.23 ns ▂▅█▃▆▄
(86.43 ns … 252.03 ns) 244.36 ns ▂▇▃▅█▇▆▅▇▄▆████████▆▃
gc( 1.86 ms … 8.59 ms) 54.05 b ( 22.36 b…191.69 b)
asStringOld 1.88 µs/iter 2.22 µs ▃ ▆▆█▅ ▅▆
(981.91 ns … 3.37 µs) 2.94 µs ▆▄▄██▄██████████▃▁▃▁▃
gc( 1.83 ms … 9.18 ms) 65.97 b ( 32.11 b…288.79 b)
asStringOld 275.36 ns/iter 361.18 ns ▄█
(168.02 ns … 732.08 ns) 514.65 ns ██▆▅▄▃▅▄▃▃▃▅▃▆▅▄▃▁▁▁▁
gc( 1.76 ms … 6.52 ms) 50.09 b ( 48.11 b…163.04 b)
asStringOld 5.48 µs/iter 6.73 µs █ ▃
(3.36 µs … 7.92 µs) 7.77 µs ██▆▄▆█▁█▁█▁▆▆▄▄█▄▄█▆█
gc( 2.24 ms … 7.49 ms) 2.23 kb ( 2.20 kb… 2.25 kb)
asStringOld 732.46 ns/iter 868.70 ns ▇ ▃ █
(368.90 ns … 1.61 µs) 1.56 µs ▇█████▄▇██▅▂▃▂▃▂▃▁▃▂▂
gc( 1.81 ms … 5.73 ms) 199.43 b ( 92.16 b…360.21 b)
asStringOld 191.50 ns/iter 250.49 ns ██
(119.07 ns … 360.08 ns) 336.96 ns ██▇▆▅▄▅▂▆▃▅▃▄▄▄▆▅▆▂▁▁
gc( 1.76 ms … 10.20 ms) 165.59 b ( 40.81 b…452.06 b)
asStringOld 4.80 µs/iter 5.75 µs █
(3.12 µs … 7.11 µs) 7.04 µs █▃▄▃▆▄▆▄▇▁▆▄▃▆▄▁▃▆▃▆▄
gc( 1.97 ms … 4.95 ms) 199.61 b (173.26 b…222.61 b)
asStringOld 3.43 µs/iter 3.31 µs █
(2.92 µs … 6.92 µs) 6.88 µs ██▅▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂
gc( 1.76 ms … 7.42 ms) 2.23 kb ( 2.21 kb… 2.27 kb)
summary
asStringNew
+1.47…+26.67x faster than asStringOld Side note for @Uzlopak in my previous comment I say it is slower, but it is slow only on browser side The built in benchPR
MASTER from #707
PS sorry for the bad english |
@nigrosimone can you please include a revert of the revert of @climba03003 original commits? I've added tests for all listed regressions |
Signed-off-by: Nigro Simone <nigro.simone@gmail.com>
Merge latest
Done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RSLGTM
To be honest, I cant follow the benchmarks. mitata is not the best benchmark out there. Actually tinybench is afaik the best benchmark out there, which is the reason it is used for benchmarks in nodejs.
If you post that master has 15.000 ops and this PR has 12.000 ops then this PR is slower. But as I said I cant read the mitata benchmarks. Or lets say, I dont want to dig deep into them.
Disclaimer: I am also maintainer of tinybench.
Ok, I can try to make a benchmark with Tiny bench |
@Uzlopak for tinybench asStringNew seem slower Resultnode .\index.mjs
asString
┌─────────┬───────────────┬──────────────────┬───────────────────┬────────────────────────┬────────────────────────┬─────────┐
│ (index) │ Task name │ Latency avg (ns) │ Latency med (ns) │ Throughput avg (ops/s) │ Throughput med (ops/s) │ Samples │
├─────────┼───────────────┼──────────────────┼───────────────────┼────────────────────────┼────────────────────────┼─────────┤
│ 0 │ 'asStringNew' │ '7098.5 ± 0.12%' │ '7100.0 ± 100.00' │ '142612 ± 0.04%' │ '140845 ± 2012' │ 140875 │
│ 1 │ 'asStringOld' │ '6991.4 ± 0.69%' │ '6900.0 ± 400.00' │ '145945 ± 0.05%' │ '144928 ± 7941' │ 143033 Benchmarkimport { Bench } from 'tinybench'
/**
* NEW IMPLEMENTATION OF asString
*
* Uses String.prototype.isWellFormed() to skip regex check for surrogate pairs
*/
const ASCII_ESCAPE = /[\u0000-\u001f\u0022\u005c]/;
const asStringNew = (str) => {
const len = str.length;
if (len === 0) {
return '""';
} else if (len < 42) {
// magically escape strings for json
// relying on their charCodeAt
// everything below 32 needs JSON.stringify()
// every string that contain surrogate needs JSON.stringify()
// 34 and 92 happens all the time, so we
// have a fast case for them
let result = "";
let last = -1;
let point = 255;
for (let i = 0; i < len; i++) {
point = str.charCodeAt(i);
if (
point === 0x22 || // '"'
point === 0x5c // '\'
) {
last === -1 && (last = 0);
result += str.slice(last, i) + "\\";
last = i;
} else if (point < 32 || (point >= 0xd800 && point <= 0xdfff)) {
// The current character is non-printable characters or a surrogate.
return JSON.stringify(str);
}
}
return (
(last === -1 && '"' + str + '"') || '"' + result + str.slice(last) + '"'
);
} else if (len < 5000 && str.isWellFormed() && ASCII_ESCAPE.test(str) === false) {
// Only use the regular expression for shorter input. The overhead is otherwise too much.
return '"' + str + '"';
} else {
return JSON.stringify(str);
}
};
/**
* OLD IMPLEMENTATION OF asString
*
* Uses a regex to check for surrogate pairs
*/
const STR_ESCAPE = /[\u0000-\u001f\u0022\u005c\ud800-\udfff]/;
const asStringOld = (str) => {
const len = str.length;
if (len === 0) {
return '""';
} else if (len < 42) {
// magically escape strings for json
// relying on their charCodeAt
// everything below 32 needs JSON.stringify()
// every string that contain surrogate needs JSON.stringify()
// 34 and 92 happens all the time, so we
// have a fast case for them
let result = "";
let last = -1;
let point = 255;
for (let i = 0; i < len; i++) {
point = str.charCodeAt(i);
if (
point === 0x22 || // '"'
point === 0x5c // '\'
) {
last === -1 && (last = 0);
result += str.slice(last, i) + "\\";
last = i;
} else if (point < 32 || (point >= 0xd800 && point <= 0xdfff)) {
// The current character is non-printable characters or a surrogate.
return JSON.stringify(str);
}
}
return (
(last === -1 && '"' + str + '"') || '"' + result + str.slice(last) + '"'
);
} else if (len < 5000 && STR_ESCAPE.test(str) === false) {
// Only use the regular expression for shorter input. The overhead is otherwise too much.
return '"' + str + '"';
} else {
return JSON.stringify(str);
}
};
/**
* BENCHMARKS CASES
*/
const shortString = 'a'.repeat(10)
const longString = 'a'.repeat(1000)
const shortDirtyString = shortString + '\n'
const longDirtyString = longString + '\n'
const surrogate = '\uD800';
const escapeChar = '"';
const shortSurrogateString = 'a'.repeat(10) + surrogate;
const shortQuoteString = 'a'.repeat(10) + escapeChar;
const longSurrogateString = 'a'.repeat(1000) + surrogate;
const longQuoteString = 'a'.repeat(1000) + escapeChar;
const strs = [
'',
shortString, longString,
shortDirtyString, longDirtyString,
shortSurrogateString, shortQuoteString,
longSurrogateString, longQuoteString
];
/**
* BENCHMARKS IMPLEMENTATIONS
*/
const bench = new Bench({ name: 'asString', time: 1000 })
bench
.add('asStringNew', () => {
for (const str of strs) {
asStringNew(str)
}
})
.add('asStringOld', () => {
for (const str of strs) {
asStringOld(str)
}
})
await bench.run()
console.log(bench.name)
console.table(bench.table()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for fixing the mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There are conflicts now. |
Guys, this implementation seem slower... |
Oh I thought it was fixing something else actually, then yeah maybe let's keep it as is? |
I am closing this, as this part of the code made already too much confusion and friction. We should first make some good benchmarks for strings, with various cases. E.g. I think one of the tests is repeated NUL-character (\u0000), which is basically pointless, because it should have the same performance like a string with length of 1 with one NUL character. So we should make some smart test cases and then balance the performance. Also using tinybench, as it is more correct than any other benchmark engine right now on the market. E.g. one of the tables in benchmark.js is mathematically wrong. mitata is programmatically not correct. |
Side note, It will probably be very difficult for the committee to approve such a niche idea. A dedicated JSON.isWellFormed method would provide a fast, standardized way to check this case. |
Checklist
npm run test && npm run benchmark --if-present
and the Code of conduct
see #792 (comment)
fix #794
fix #793