terashuf implements a quasi-shuffle algorithm for shuffling multi-terabyte text files using limited memory. It is a C++ implementation of this Python script.
Why not GNU sort -R instead?
terashuf has 2 advantages over sort -R:
- terashuf is much, much faster. See benchmark below.
- It handles duplicate lines correctly. sort -R sorts lines by a hash of their contents, so duplicate lines hash to the same value and end up adjacent in the output, which is not desirable in a shuffle! To work around this in sort, the input has to be modified (append an incremental token to each line so that duplicates differ) and the tokens then have to be stripped afterwards. With terashuf none of this is required.
Why not GNU shuf?
shuf does all the shuffling in-memory, which is a no-go for files larger than memory.
For small files, terashuf doesn't write any temporary files and so functions exactly like shuf. In the benchmark below, terashuf marginally outperforms shuf.
The following compares shuffle times for a 20GB Wikipedia dump. terashuf is tested with limited memory and with memory large enough to fit the entire file (in-memory like shuf):
| Command | Memory (GB) | Real Time | User Time | Sys Time |
|---------|-------------|-----------|-----------|----------|
Benchmark run on Xeon E5-2630 @ 2.60GHz with 128GB of RAM.
Note: I'm looking for pull requests implementing the shuf interface so that terashuf can become a drop-in replacement for shuf.
terashuf can be built by calling `make`. It has no dependencies other than the stdlib.
$ ./terashuf < filetoshuffle.txt > shuffled.txt
It reads 2 environment variables:
- TMPDIR: defaults to /tmp if not set.
- MEMORY: defaults to 4.0, meaning use a shuffle buffer of 4 GB.
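These defaults can be read with plain `getenv`; a minimal sketch (the function names below are illustrative, not terashuf's actual code):

```cpp
#include <cstdlib>
#include <string>

// TMPDIR and MEMORY with the defaults documented above.
// Hypothetical helpers for illustration only.
std::string getTmpDir() {
    const char* tmp = std::getenv("TMPDIR");
    return tmp ? std::string(tmp) : std::string("/tmp");
}

double getMemoryGb() {
    const char* mem = std::getenv("MEMORY");
    return mem ? std::atof(mem) : 4.0;  // shuffle buffer size in GB
}
```

So, for example, `MEMORY=16 ./terashuf < filetoshuffle.txt > shuffled.txt` uses a 16 GB shuffle buffer.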
Note: the last line in the file to be shuffled will be ignored if it does not end with a newline marker (\n).
When shuffling very large files, terashuf needs to keep SIZE_OF_FILE_TO_SHUFFLE / MEMORY temporary files open simultaneously. Make sure the maximum number of open file descriptors is set to at least this number. By setting a large file descriptor limit, you ensure that terashuf won't abort a shuffle midway, saving precious researcher time.
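You would normally raise the limit in the shell with `ulimit -n`; on POSIX systems the same thing can be done programmatically with `setrlimit`. A hedged sketch (this is not claimed to be part of terashuf itself):

```cpp
#include <sys/resource.h>

// Raise the soft RLIMIT_NOFILE toward the hard limit so a large shuffle
// doesn't run out of descriptors midway. Illustrative helper, not
// terashuf's actual code.
bool raiseFdLimit(rlim_t needed) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return false;
    if (rl.rlim_cur >= needed) return true;  // already enough
    rl.rlim_cur = needed < rl.rlim_max ? needed : rl.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur >= needed;
}
```

For example, a 2 TB file shuffled with a 4 GB buffer needs roughly 2048 / 4 = 512 temporary files open at once, so the descriptor limit should be at least 512 plus a small margin for stdin, stdout, and stderr.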
terashuf implements a quasi-shuffle as follows:
1. Divide the N input lines into K temporary files of up to L lines each.
2. Shuffle each of the K files (this is done in memory before writing the file).
3. Read one line from each of the K files into a buffer until the buffer has L lines.
4. Shuffle the buffer and write it to output.
5. Repeat steps 3 and 4 until all lines have been written to output.
Pull requests are very welcome!
- Rather than use fixed-length lines in the buffer which wastes memory, use variable-length lines such that all buffer memory is used.
- Implement `--help`
- Implement the `shuf` interface so that terashuf becomes a drop-in replacement
- Add benchmarks
Copyright (c) 2017 Salle, Alexandre email@example.com. All work in this package is distributed under the MIT License.