Erlang implementation #10
I was curious how Erlang would compare using binary pattern matching. It's patterned roughly after the Elixir implementation.
Update 2015-05-20: Added the option to run the Erlang implementation in unsafe, binary, or regex mode. Unsafe mode reads the entire dataset into memory, which is the fastest approach but may not work on larger datasets. The other modes read each file line by line instead.
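To illustrate the difference between reading everything into memory and reading line by line, here is a minimal sketch. It is not the code from this pull request: the module name `read_modes`, the function names, and the use of `binary:match/2` to test each line are assumptions chosen for illustration.

```erlang
-module(read_modes).
-export([count_whole/2, count_lines/2]).

%% "Unsafe"-style sketch: load the whole file into memory, split it into
%% lines, and count the lines that contain Needle (a non-empty binary).
count_whole(Path, Needle) ->
    {ok, Bin} = file:read_file(Path),
    Lines = binary:split(Bin, <<"\n">>, [global]),
    length([L || L <- Lines, binary:match(L, Needle) =/= nomatch]).

%% Streaming sketch: read one line at a time, so memory use stays
%% bounded by the longest line (plus the read-ahead buffer).
count_lines(Path, Needle) ->
    {ok, Fd} = file:open(Path, [read, binary, raw, read_ahead]),
    try
        loop(Fd, Needle, 0)
    after
        file:close(Fd)
    end.

%% Hypothetical helper: accumulate the number of matching lines until eof.
loop(Fd, Needle, Acc) ->
    case file:read_line(Fd) of
        {ok, Line} ->
            case binary:match(Line, Needle) of
                nomatch -> loop(Fd, Needle, Acc);
                _Found -> loop(Fd, Needle, Acc + 1)
            end;
        eof ->
            Acc
    end.
```

Both functions should return the same count for the same file. The first trades memory for speed, which matches the caveat above about larger datasets.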
Hardware: MacBook Pro 2.6GHz i7 (quad core) with 16GB RAM and PCIe SSD (very similar to the hardware in the part 2 blog post)
While playing around with the implementation, I wrote three other binary pattern matching implementations which used different strategies (these aren't included in this pull request):
The riak_pipe implementation is interesting as it could be run across multiple nodes and could potentially scale horizontally to handle larger datasets better than some of the other implementations.
@josevalim Yeah, I probably drifted away from the original intent of this project. I saw that the Go implementation used substring pattern matching, so I thought binary pattern matching might also be allowed. Plus, it was too much fun to find out how fast I could get it to go.
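For readers unfamiliar with the distinction, here is a rough sketch of substring search via binary pattern matching next to a regex-based check. The module and function names are hypothetical and this is not the code from the pull request. It only shows the general technique.

```erlang
-module(match_modes).
-export([contains_binary/2, contains_regex/2]).

%% Binary pattern matching sketch: try to match Needle at the head of
%% the binary, otherwise drop one byte and retry.
contains_binary(Bin, Needle) when is_binary(Bin), is_binary(Needle) ->
    scan(Bin, Needle, byte_size(Needle)).

%% Hypothetical helper. Needle and Size are already bound, so the first
%% clause succeeds only if the next Size bytes equal Needle.
scan(Bin, Needle, Size) ->
    case Bin of
        <<Needle:Size/binary, _/binary>> -> true;
        <<_, Rest/binary>> -> scan(Rest, Needle, Size);
        <<>> -> false
    end.

%% Regex sketch: re:run/3 with {capture, none} returns match or nomatch.
contains_regex(Bin, Pattern) ->
    re:run(Bin, Pattern, [{capture, none}]) =:= match.
```

For example, `match_modes:contains_binary(<<"go team go">>, <<"team">>)` and `match_modes:contains_regex(<<"go team go">>, <<"team">>)` both evaluate to `true`.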
I made a few changes so the implementation could be run in 1 of 3 modes:
I'm late to the party, but great job optimizing the Erlang implementation. We've had a lot of language submissions that are optimized far better than my original submissions. But comparisons are becoming apples to oranges at this point because of inconsistencies across the languages (regex, substring, ASCII, Unicode, map reduction, loading everything into memory as opposed to streaming, etc.).
Great job with the
@dimroc let me know if you would like those changes to be ported to Elixir. I would really like to follow in Erlang's footsteps here, otherwise it will generate a whole "Elixir is 5x slower than Erlang" story, which is certainly not true. :)
Thanks for merging this.
Sure @josevalim that would be great. Can you do me two favors though?
You can see the early (early) draft of that post here (minus bar charts): https://github.com/dimroc/blog/blob/source/_drafts/2015-08-31-etl-language-showdown-pt3.md
Hope I'm not asking too much.
Btw, I can explain 2 right now. :)
@potatosalad, feel free to correct me if I got anything wrong. :)
@josevalim After reviewing the code again, that all sounds correct. Thank you for the explanation, it's much more informative than what I wrote in the description above. Also, sorry to have caused you extra work so things wouldn't appear to be "Elixir is 5x slower than Erlang".
Interesting side notes that will hopefully be part of OTP 19:
I ran the original Elixir implementation, which used String.split/2, against the
I was also curious how much of a speedup a split-based solution running with the OTP PR would provide for Elixir. If you're interested, it's on this branch: potatosalad/etl-language-comparison@match_and_split.
However, while fooling around with the implementations, I stumbled upon an even faster method:
Therefore, while there is currently no similar implementation in Erlang, let the record show that Elixir is roughly 2-3x faster than Erlang.
Also, I think 1.1 seconds is the fastest runtime for any of the languages currently posted on this repository (it beats the Rust implementation by roughly 1-2x).