The Erlang implementation has three modes: unsafe, binary, and regex.
- Unsafe reads the entire dataset into memory and then performs binary matching as described in step 2. This is the fastest mode, but it may not work on datasets that are too large to fit in memory.
- Binary uses binary pattern matching with file:read_line/1 (~4.4s)
  - Because binary matches are case sensitive, while the solution must be case insensitive, the first step of the mapper algorithm is to generate every upper- and lower-case permutation of the word being counted.
- Regex uses regular expressions with file:read_line/1 (~6.6s)
Further discussion can be found in this pull request.
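
To make the difference between the modes concrete, here is a minimal sketch of the three approaches. It is not the repository's code: the module name `count_sketch` and all function names are hypothetical. It assumes ASCII words, and it counts substring occurrences rather than whole words. Only `file:read_file/1`, `file:read_line/1`, `binary:compile_pattern/1`, `binary:matches/2`, and `re:run/3` are standard library calls.

```erlang
-module(count_sketch).
-export([case_variants/1, count_binary/2, count_regex/2,
         count_file/2, count_file_unsafe/2]).

%% Every upper/lower-case variant of an ASCII word, as binaries.
%% A word with N letters yields 2^N variants.
case_variants(Word) when is_list(Word) ->
    [list_to_binary(V) || V <- variants(Word)].

variants([]) -> [[]];
variants([C | Rest]) ->
    Tails = variants(Rest),
    %% usort collapses non-letters, whose two forms are identical.
    [[X | T] || X <- lists:usort([upper(C), lower(C)]), T <- Tails].

upper(C) when C >= $a, C =< $z -> C - 32;
upper(C) -> C.

lower(C) when C >= $A, C =< $Z -> C + 32;
lower(C) -> C.

%% Binary mode, per chunk: count non-overlapping matches of any variant.
%% Patterns may be a list of binaries or a compiled pattern.
count_binary(Bin, Patterns) ->
    length(binary:matches(Bin, Patterns)).

%% Regex mode, per chunk: the regex engine handles case folding.
%% Word is used as a regex, so metacharacters are not escaped here.
count_regex(Bin, Word) ->
    case re:run(Bin, Word, [global, caseless]) of
        {match, Matches} -> length(Matches);
        nomatch -> 0
    end.

%% Binary mode over a file, one line at a time.
count_file(Path, Word) ->
    Patterns = binary:compile_pattern(case_variants(Word)),
    {ok, Fd} = file:open(Path, [read, binary]),
    try count_lines(Fd, Patterns, 0)
    after file:close(Fd)
    end.

count_lines(Fd, Patterns, Acc) ->
    case file:read_line(Fd) of
        {ok, Line} ->
            count_lines(Fd, Patterns, Acc + count_binary(Line, Patterns));
        eof ->
            Acc
    end.

%% Unsafe mode: load the whole file into memory, then match once.
count_file_unsafe(Path, Word) ->
    {ok, Bin} = file:read_file(Path),
    count_binary(Bin, binary:compile_pattern(case_variants(Word))).
```

Because the number of variants doubles with each letter, this approach suits short words. For longer words, lowercasing each line before matching may be a better trade-off.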