Hi everybody, my name is Felix Geisendörfer and today I'd like to talk about "Faster than C? Parsing Node.js Streams!".
I have this node.js module called node-mysql. I started it because in mid 2010 there were no MySQL modules for node.js.
Unfortunately or fortunately, depending on how you look at it, I was not a very sane person back then:
Because, if I was a sane person, I probably would have:
- Picked a platform that already had a MySQL client
- Waited for somebody else to do the work
- Created a binding for libmysql
I mean all of these are somewhat reasonable choices, given that my goal at that point was to build a company, not a database client.
So if you would rather listen to a sane person presenting, I completely understand. Now is still a good time to switch tracks.
So for a long time I thought so as well. Shortly after I released my library, other libraries based on libmysql were released, and they vastly outperformed my library in many benchmarks.
Initially I thought I would have to accept this. But I always thought that my library could perform at least a little bit better. So earlier this year I had some time to work on this, and was able to release a new version of my library that was much faster. So much faster in fact, that it was beating all other C based libraries at that time.
Later on I will explain in more detail what I mean by this.
But, before I get started, lets talk about benchmarking. Not because I want to, but because it is kind of impossible to talk about performance without talking about benchmarks.
The first question you may have is: What benchmarking library should I use?
My answer to that is: Probably none, most of them suck.
Now don't get me wrong, there are some nice and convenient libraries out there. However, almost all of them do something very terrible: They mix data collection and analysis into one step.
You know what this reminds me off? Performing SQL queries from within your HTML templates. Sure, it's quick & easy, but it's certainly not the best approach in many cases.
I mean some of the benchmarking libraries out there are really clever, they will do a few warmup rounds on your code before using the results, they will calculate statisticals properties such an mean, median, stand deviation and other things. Some may even draw pretty graphs for you. However, most of them don't produce raw data, and that's a problem.
I'll show you why. Here is a graph comparing my current node-mysql parser with another experimental version I have been hacking on:
Great! It looks like my new parser is 2x as fast as the current one. And the current one is already faster than many libmysql bindings. It must be great, right?
Well, unfortunately this graph is exactly what is wrong with most benchmarks you will see. It's the usual, here look, A is better than B, so you should use that. But it's completely lacking the raw data and more importantly, the proper analysis, that you can only perform if you have the raw data.
So, if this kind of graph is all your benchmarking library can produce, you should throw it away. Instead, all you really need is a standalone script that produces the raw data set for you. In case of this graph, the data set looks like this:
Now we can suddenly do much more with it, than comparing it based on median performance. For example, we can plot the individual data points on a jitter graph like this:
Oh, that's an odd distribution of data points. It seems like the results for both parsers split into two categories: fast and slow. One thing is clear now, this data can't be used to demonstrate anything until we figure out what is going on. So let's plot those data points on a time scale:
Ok, this makes more sense now. It seems like both parsers start out fast, and
after a certain amount of time become slow from one moment to the next. Due
to the sudden nature of the drop, I first suspected that the V8 JIT was
de-optimizing my code after a while. However, I was unable to confirm this
even when starting my benchmark with
Looking at this graph, it seems that there is a correlation between the maximum heap used in between GC cycles, and the throughput of the benchmark. This could indicate a memory leak. However, after the first performance regression, it seems like the heap is no longer growing.
Another look at the heapTotal, which is the total amount of memory allocated by v8, some of it always empty, reveals a similar picture:
As we can see, our performance problem seem to be correlated with v8 deciding to grow the heap total. From this data it is still unclear to me if v8 is making the wrong decision by growing the heap total here, or if there is a problem in my code, causing this performance issue.
Anyway, the whole point of this example was to show you why it's important to have benchmarks producing raw data. If you don't have the raw data, you're never going to to be able to analyze your benchmarks for problems like this, always running the risk of fooling yourself.
So if you want to do any performance related work, in any language really, here is a benchmarking toolchain that is working well for me:
- Create a benchmark producing tab separated data points on stdout
- Cycles per second (Hz) and bytes per second (B/s) are generally good units
- Add plenty of useful meta data (time, memory, hardware, etc.) to each line
- Use the unix
teeprogram to watch your output and record to a file at the same time.
- Use the R Language and ggplot2 to analyze / plot your data as pdfs (Check out RStudio and this tutorial to get started quickly)
- Annotate your PDFs with Skitch or similar
- Use imagemagick to convert your PDFs into PNGs or similar for the web
- Use Makefiles to automate your benchmark -> analysis pipeline
Parsing binary streams in node.js
Now let's talk about parsing binary streams in node.js. If you have not parsed a binary stream in node.js or another plattform before, here is a quick example.
Let's say we want to write a MySQL client.
( Live coding, code available here )
So as you can see, parsing binary data itself is not so hard. What's hard is keeping track of your internal state. Because the parser we just wrote is inherently broken because it does not handle the case where we only receive a partial handshake packet from our server in our first 'data' event. So in my first node-mysql version, I tackled this with a huge state machine / switch statement.