kv-par-merge-sort

Key-Value Parallel Merge Sort

Sort Pod (key, value) data sets that don't fit in memory.

This crate provides the kv_par_merge_sort library, which enables the user to sort [Chunk]s of (key, value) pairs (AKA entries) via a [SortingPipeline]. The sorted output lands in two files: one for keys and one for values. The keys file is sorted, while the values file is "parallel" to the key file.

More precisely, we denote the input stream as [(k[0], v[0]), (k[1], v[1]), ...]. The final key and value files are laid out as [k[a], k[b], ...] and [v[a], v[b], ...] respectively, such that the key array is sorted. The reason for separate files is to ensure correct data type alignment (for zero-copy reads) without wasting space to padding.

Sorting ~17 GB data set (half a billion entries)

$ time RUST_LOG=debug cargo run --release --example large_data_set -- -o /ssd_data/bench_data/ -t /ssd_data/tmp/
[2022-05-28T08:24:56Z INFO  large_data_set] Random input data set will contain 18 unsorted chunks of at most 28071681 entries each
[2022-05-28T08:25:36Z INFO  large_data_set] Done generating random chunks
[2022-05-28T08:26:00Z INFO  kv_par_merge_sort] Running merge of 16 persisted chunks
[2022-05-28T08:26:01Z INFO  kv_par_merge_sort] All chunks sorted, only merge work remains
[2022-05-28T08:27:02Z INFO  kv_par_merge_sort] Running merge of 3 persisted chunks
[2022-05-28T08:28:30Z INFO  kv_par_merge_sort] Done merging! Performed 2 merge(s) total

real    3m33.830s
user    3m31.733s
sys     0m42.923s

Implementation

To sort an arbitrarily large data set without running out of memory, we must resort to an "external" sorting algorithm that uses the file system for scratch space; we use a parallel merge sort. Each [Chunk] is sorted separately, in parallel and streamed to a pair of files. These files are consumed by a merging thread, which (also in parallel) iteratively merges groups of up to merge_k similarly-sized chunks.

File Handles

WARNING: It's possible to exceed your system's limit on open file handles if [Chunk]s are too small.

Memory Usage

WARNING: If you are running out of memory, make sure you can actually fit max_sort_concurrency [Chunk]s in memory. Also note that [std::env::temp_dir] might actually be an in-memory tmpfs.

File System Usage

This algorithm requires twice the size of the input data in free file system space in order to perform the final merge.

License: MIT OR Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
examples		examples
src		src
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE.Apache-2.0		LICENSE.Apache-2.0
LICENSE.MIT		LICENSE.MIT
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

examples

examples

src

src

.gitignore

.gitignore

Cargo.toml

Cargo.toml

LICENSE.Apache-2.0

LICENSE.Apache-2.0

LICENSE.MIT

LICENSE.MIT

README.md

README.md

Repository files navigation

kv-par-merge-sort

Key-Value Parallel Merge Sort

Sorting ~17 GB data set (half a billion entries)

Implementation

File Handles

Memory Usage

File System Usage

About

Releases

Packages

Languages

License

bonsairobo/kv-par-merge-sort-rs

Folders and files

Latest commit

History

Repository files navigation

kv-par-merge-sort

Key-Value Parallel Merge Sort

Sorting ~17 GB data set (half a billion entries)

Implementation

File Handles

Memory Usage

File System Usage

About

Resources

License

Stars

Watchers

Forks

Languages