
Alternatives for .skf format #17

Closed
johnlees opened this issue Feb 17, 2023 · 6 comments · Fixed by #19
Labels: enhancement

Comments

johnlees commented Feb 17, 2023

Look at blocked/distributed hash maps

johnlees added the enhancement label on Feb 17, 2023
johnlees commented Feb 17, 2023

Initial trial with 28 samples and ~3M k-mers

    // 28 listeria
    // 1 thread 11s, 2.3Gb
    // 8 threads 5s, 3Gb
    // 3228084 k-mers
    // 38Mb skf file

Reading all of the merged ska array's k-mers and variants in (fill), and changing one variant (modify):

Hash fill: 284ms modify: 95ms
sled fill: 21113ms modify: 16751ms 641Mb

With sled compression
sled fill: 37990ms modify: 21814ms 433Mb
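
For reference, a minimal sketch of the kind of fill/modify comparison this is measuring, assuming u64-encoded split k-mers mapping to one variant byte per sample (the key/value shapes and the sled configuration are assumptions, not the actual ska.rust types):

    use std::collections::HashMap;
    use std::time::Instant;

    fn main() -> sled::Result<()> {
        let n_kmers: u64 = 3_000_000; // roughly the 3228084 k-mers above
        let n_samples = 28;

        // Fill and modify an in-memory hash map
        let start = Instant::now();
        let mut table: HashMap<u64, Vec<u8>> = HashMap::with_capacity(n_kmers as usize);
        for kmer in 0..n_kmers {
            table.insert(kmer, vec![b'A'; n_samples]);
        }
        println!("Hash fill: {:?}", start.elapsed());

        let start = Instant::now();
        for vars in table.values_mut() {
            vars[0] = b'T'; // change one variant per k-mer
        }
        println!("Hash modify: {:?}", start.elapsed());

        // The same operations against an on-disk sled tree
        let db = sled::Config::new()
            .path("bench_sled.db")
            .use_compression(false) // set to true for the compressed run
            .open()?;

        let start = Instant::now();
        for kmer in 0..n_kmers {
            db.insert(kmer.to_le_bytes(), vec![b'A'; n_samples])?;
        }
        println!("sled fill: {:?}", start.elapsed());

        let start = Instant::now();
        for kmer in 0..n_kmers {
            if let Some(vars) = db.get(kmer.to_le_bytes())? {
                let mut vars = vars.to_vec();
                vars[0] = b'T';
                db.insert(kmer.to_le_bytes(), vars)?;
            }
        }
        println!("sled modify: {:?}", start.elapsed());
        Ok(())
    }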

johnlees commented Feb 18, 2023

If doing this kind of approach, using a BTreeMap and reading/writing blocks of a given range (and probably async) would be better. Would also want to deserialise without buffering, see: https://serde.rs/stream-array.html.

Some difficulties here are that values are not likely to be equally spaced in the btree, and doing blocked reads with serde isn't easily supported. A Vec<u64> read manually from disk might be a better solution (sketched below).
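
A rough sketch of that manual approach, assuming the values are stored on disk as a flat array of little-endian u64s (the layout and offsets here are hypothetical):

    use std::fs::File;
    use std::io::{Read, Seek, SeekFrom};

    // Read `count` u64 values starting at record `offset`, without
    // deserialising the rest of the file.
    fn read_block(file: &mut File, offset: u64, count: usize) -> std::io::Result<Vec<u64>> {
        file.seek(SeekFrom::Start(offset * 8))?;
        let mut bytes = vec![0u8; count * 8];
        file.read_exact(&mut bytes)?;
        Ok(bytes
            .chunks_exact(8)
            .map(|chunk| u64::from_le_bytes(chunk.try_into().unwrap()))
            .collect())
    }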

I think the algorithm for build_and_merge would be something more like the following (rough sketch after the list):

  1. read a file into a hashmap
  2. merge into the dict in blocks (using seek to get correct parts of file)
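
A rough sketch of that shape; read_sample_kmers, load_block and write_block are hypothetical stand-ins for the real I/O, and the point is that only one key range of the on-disk dictionary is resident at a time:

    use std::collections::BTreeMap;
    use std::io;
    use std::ops::Range;
    use std::path::PathBuf;

    // Hypothetical helpers: build one sample's k-mers, and read/write one
    // key range of the on-disk dictionary (seeking internally).
    fn read_sample_kmers(_file: &PathBuf) -> io::Result<BTreeMap<u64, u8>> { todo!() }
    fn load_block(_range: &Range<u64>) -> io::Result<BTreeMap<u64, Vec<u8>>> { todo!() }
    fn write_block(_range: &Range<u64>, _block: &BTreeMap<u64, Vec<u8>>) -> io::Result<()> { todo!() }

    fn build_and_merge_blocked(files: &[PathBuf], blocks: &[Range<u64>]) -> io::Result<()> {
        for file in files {
            // 1. read a file into an (ordered) in-memory map of k-mer -> variant
            let sample = read_sample_kmers(file)?;
            // 2. merge into the dict in blocks, seeking to the right part of the file
            for block in blocks {
                let mut dict_block = load_block(block)?;
                for (kmer, var) in sample.range(block.clone()) {
                    dict_block.entry(*kmer).or_default().push(*var);
                }
                write_block(block, &dict_block)?;
            }
        }
        Ok(())
    }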

Two more things to try first:

  1. Change build and merge so that files are read and then merged one at a time, rather than all read, then all merged (these reads could then be blocked, to give some threading)
  2. Try the memmapped hashmap to see if it's possible

johnlees commented

The first, and simplest, change to make would be to combine the read and merge steps as they are, i.e. inside the append loop also do the build there. That keeps the current parallelism approach.
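
A sketch of that change, with hypothetical stand-ins for the per-sample and merged dictionaries (not the real SkaDict/MergedSkaDict API); chunks of this loop could still be handed out to threads as now:

    use std::collections::HashMap;

    struct SkaDict { kmers: HashMap<u64, u8> }
    struct MergedSkaDict { kmers: HashMap<u64, Vec<u8>>, n_samples: usize }

    impl SkaDict {
        fn build_from_file(_fasta: &str) -> Self { todo!("count split k-mers for one sample") }
    }

    impl MergedSkaDict {
        fn append(&mut self, sample_idx: usize, other: &SkaDict) {
            let n_samples = self.n_samples;
            for (kmer, var) in &other.kmers {
                let vars = self.kmers.entry(*kmer).or_insert_with(|| vec![b'-'; n_samples]);
                vars[sample_idx] = *var;
            }
        }
    }

    // Build each file's dictionary inside the append loop, so at most one
    // per-sample dictionary is held in memory alongside the merged one.
    fn build_and_merge(files: &[String]) -> MergedSkaDict {
        let mut merged = MergedSkaDict { kmers: HashMap::new(), n_samples: files.len() };
        for (idx, file) in files.iter().enumerate() {
            let single = SkaDict::build_from_file(file);
            merged.append(idx, &single);
        }
        merged
    }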

johnlees commented

Using a memmap odht works and is fast:

    let mut mmap = unsafe { MmapMut::map_mut(&file).unwrap() };
    let mut mmap_hashtable =
        HashTable::<MyConfig, &mut [u8]>::init_in_place(mmap.as_mut(), max_items, load).unwrap();

memmap fill: 199ms modify: 234ms
size = 296M
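
For context, the snippet above assumes an odht Config along these lines (the key/value types here are hypothetical; the fixed-width EncodedKey/EncodedValue arrays are what force the record size to be known at compile time):

    use odht::{Config, FxHashFn};

    struct MyConfig;

    impl Config for MyConfig {
        // Hypothetical fixed-size record: a u64 split k-mer key and a
        // fixed-width block of per-sample variant bytes.
        type Key = u64;
        type Value = [u8; 28];

        type EncodedKey = [u8; 8];
        type EncodedValue = [u8; 28];

        type H = FxHashFn;

        fn encode_key(k: &Self::Key) -> Self::EncodedKey { k.to_le_bytes() }
        fn encode_value(v: &Self::Value) -> Self::EncodedValue { *v }
        fn decode_key(k: &Self::EncodedKey) -> Self::Key { u64::from_le_bytes(*k) }
        fn decode_value(v: &Self::EncodedValue) -> Self::Value { *v }
    }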

But:

  • Record size has to be fixed at compile time, so would work for SkaDict but not MergedSkaDict
  • Without flushing, memory use is the same as the file size.
  • Flushing is awkward, makes things slower, and doesn't decrease memory use

Basically, a memory map seems to be the wrong choice when writing the whole file. If it could be changed to be a BTree of blocks it might work. Going to leave this for now.

johnlees commented

https://github.com/wspeirs/btree looks like it might work here

johnlees reopened this on Jul 11, 2023
johnlees changed the title from "Think about streaming the merge step to save on memory use" to "Alternatives for .skf format" on Jul 11, 2023