
Alternatives for .skf format #17

Closed
johnlees opened this issue Feb 17, 2023 · 6 comments · Fixed by #19
Labels: enhancement

Comments

johnlees commented Feb 17, 2023

Look at blocked/distributed hash maps

johnlees added the enhancement label on Feb 17, 2023
johnlees commented Feb 17, 2023

Initial trial with 28 samples and ~3M k-mers

    // 28 listeria
    // 1 thread 11s, 2.3Gb
    // 8 threads 5s, 3Gb
    // 3228084 k-mers
    // 38Mb skf file

Reading all of the merged ska array's k-mers and variants in (fill), and changing one variant (modify):

Hash fill: 284ms modify: 95ms
sled fill: 21113ms modify: 16751ms 641Mb

With sled compression
sled fill: 37990ms modify: 21814ms 433Mb
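
For reference, a minimal sketch of the kind of fill/modify comparison this is measuring, assuming u64-encoded split k-mers mapping to one variant byte per sample (the key/value shapes and the sled configuration are assumptions, not the actual ska.rust types):

    use std::collections::HashMap;
    use std::time::Instant;

    fn main() -> sled::Result<()> {
        let n_kmers: u64 = 3_000_000; // roughly the 3228084 k-mers above
        let n_samples = 28;

        // Fill and modify an in-memory hash map
        let start = Instant::now();
        let mut table: HashMap<u64, Vec<u8>> = HashMap::with_capacity(n_kmers as usize);
        for kmer in 0..n_kmers {
            table.insert(kmer, vec![b'A'; n_samples]);
        }
        println!("Hash fill: {:?}", start.elapsed());

        let start = Instant::now();
        for vars in table.values_mut() {
            vars[0] = b'T'; // change one variant per k-mer
        }
        println!("Hash modify: {:?}", start.elapsed());

        // The same operations against an on-disk sled tree
        let db = sled::Config::new()
            .path("bench_sled.db")
            .use_compression(false) // set to true for the compressed run
            .open()?;

        let start = Instant::now();
        for kmer in 0..n_kmers {
            db.insert(kmer.to_le_bytes(), vec![b'A'; n_samples])?;
        }
        println!("sled fill: {:?}", start.elapsed());

        let start = Instant::now();
        for kmer in 0..n_kmers {
            if let Some(vars) = db.get(kmer.to_le_bytes())? {
                let mut vars = vars.to_vec();
                vars[0] = b'T';
                db.insert(kmer.to_le_bytes(), vars)?;
            }
        }
        println!("sled modify: {:?}", start.elapsed());
        Ok(())
    }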

johnlees commented Feb 18, 2023

If doing this kind of approach, using a BTreeMap and reading/writing blocks of a given range (and probably async) would be better. Would also want to deserialise without buffering, see: https://serde.rs/stream-array.html.

Some difficulties here are that values are not likely to be equally spaced in the btree, and doing blocked reads with serde isn't easily supported. A Vec<u64> read manually from disk might be a better solution (sketched below).
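
A rough sketch of that manual approach, assuming the values are stored on disk as a flat array of little-endian u64s (the layout and offsets here are hypothetical):

    use std::fs::File;
    use std::io::{Read, Seek, SeekFrom};

    // Read `count` u64 values starting at record `offset`, without
    // deserialising the rest of the file.
    fn read_block(file: &mut File, offset: u64, count: usize) -> std::io::Result<Vec<u64>> {
        file.seek(SeekFrom::Start(offset * 8))?;
        let mut bytes = vec![0u8; count * 8];
        file.read_exact(&mut bytes)?;
        Ok(bytes
            .chunks_exact(8)
            .map(|chunk| u64::from_le_bytes(chunk.try_into().unwrap()))
            .collect())
    }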

I think the algorithm for build_and_merge would be something more like the following (rough sketch after the list):

  1. read a file into a hashmap
  2. merge into the dict in blocks (using seek to get correct parts of file)
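
A rough sketch of that shape; read_sample_kmers, load_block and write_block are hypothetical stand-ins for the real I/O, and the point is that only one key range of the on-disk dictionary is resident at a time:

    use std::collections::BTreeMap;
    use std::io;
    use std::ops::Range;
    use std::path::PathBuf;

    // Hypothetical helpers: build one sample's k-mers, and read/write one
    // key range of the on-disk dictionary (seeking internally).
    fn read_sample_kmers(_file: &PathBuf) -> io::Result<BTreeMap<u64, u8>> { todo!() }
    fn load_block(_range: &Range<u64>) -> io::Result<BTreeMap<u64, Vec<u8>>> { todo!() }
    fn write_block(_range: &Range<u64>, _block: &BTreeMap<u64, Vec<u8>>) -> io::Result<()> { todo!() }

    fn build_and_merge_blocked(files: &[PathBuf], blocks: &[Range<u64>]) -> io::Result<()> {
        for file in files {
            // 1. read a file into an (ordered) in-memory map of k-mer -> variant
            let sample = read_sample_kmers(file)?;
            // 2. merge into the dict in blocks, seeking to the right part of the file
            for block in blocks {
                let mut dict_block = load_block(block)?;
                for (kmer, var) in sample.range(block.clone()) {
                    dict_block.entry(*kmer).or_default().push(*var);
                }
                write_block(block, &dict_block)?;
            }
        }
        Ok(())
    }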

Two more things to try first:

  1. Change build and merge so that files are read and then merged one at a time, rather than all read, then all merged (these reads could then be blocked, to give some threading)
  2. Try the memmapped hashmap to see if it's possible

johnlees commented

The first, and simplest, change to make would be to combine the read and merge steps as they are, i.e. inside the append loop also do the build there. That keeps the current parallelism approach.
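
A sketch of that change, with hypothetical stand-ins for the per-sample and merged dictionaries (not the real SkaDict/MergedSkaDict API); chunks of this loop could still be handed out to threads as now:

    use std::collections::HashMap;

    struct SkaDict { kmers: HashMap<u64, u8> }
    struct MergedSkaDict { kmers: HashMap<u64, Vec<u8>>, n_samples: usize }

    impl SkaDict {
        fn build_from_file(_fasta: &str) -> Self { todo!("count split k-mers for one sample") }
    }

    impl MergedSkaDict {
        fn append(&mut self, sample_idx: usize, other: &SkaDict) {
            let n_samples = self.n_samples;
            for (kmer, var) in &other.kmers {
                let vars = self.kmers.entry(*kmer).or_insert_with(|| vec![b'-'; n_samples]);
                vars[sample_idx] = *var;
            }
        }
    }

    // Build each file's dictionary inside the append loop, so at most one
    // per-sample dictionary is held in memory alongside the merged one.
    fn build_and_merge(files: &[String]) -> MergedSkaDict {
        let mut merged = MergedSkaDict { kmers: HashMap::new(), n_samples: files.len() };
        for (idx, file) in files.iter().enumerate() {
            let single = SkaDict::build_from_file(file);
            merged.append(idx, &single);
        }
        merged
    }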

johnlees commented

Using a memmap odht works and is fast:

    let mut mmap = unsafe { MmapMut::map_mut(&file).unwrap() };
    let mut mmap_hashtable =
        HashTable::<MyConfig, &mut [u8]>::init_in_place(mmap.as_mut(), max_items, load).unwrap();

memmap fill: 199ms modify: 234ms
size = 296M
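
For context, the snippet above assumes an odht Config along these lines (the key/value types here are hypothetical; the fixed-width EncodedKey/EncodedValue arrays are what force the record size to be known at compile time):

    use odht::{Config, FxHashFn};

    struct MyConfig;

    impl Config for MyConfig {
        // Hypothetical fixed-size record: a u64 split k-mer key and a
        // fixed-width block of per-sample variant bytes.
        type Key = u64;
        type Value = [u8; 28];

        type EncodedKey = [u8; 8];
        type EncodedValue = [u8; 28];

        type H = FxHashFn;

        fn encode_key(k: &Self::Key) -> Self::EncodedKey { k.to_le_bytes() }
        fn encode_value(v: &Self::Value) -> Self::EncodedValue { *v }
        fn decode_key(k: &Self::EncodedKey) -> Self::Key { u64::from_le_bytes(*k) }
        fn decode_value(v: &Self::EncodedValue) -> Self::Value { *v }
    }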

But:

  • Record size has to be fixed at compile time, so would work for SkaDict but not MergedSkaDict
  • Without flushing, memory use is the same as the file size.
  • Flushing is awkward, makes things slower, and doesn't decrease memory use

Basically, a memory map seems to be the wrong choice when writing the whole file. If it could be changed to be a BTree of blocks it might work. Going to leave this for now.

johnlees commented

https://github.com/wspeirs/btree looks like it might work here

johnlees reopened this on Jul 11, 2023
johnlees changed the title from "Think about streaming the merge step to save on memory use" to "Alternatives for .skf format" on Jul 11, 2023