Optimise mem usage, disallowing random ref order #10

jakobnissen · 2022-11-21T09:41:25Z

Currently, pycoverm is not particularly memory efficient. It keeps coverages in an name=>Vec IndexMap, then creates a matrix and fills the matrxi from the map.
This allows the order of reference sequences in the BAM files to be in different order, but it consumes a lot of memory. Further, it's not very common to have a situation where BAM files are mapped to the exact same set of references, but in a different order, as BAM reference order is deterministic, and reflects the order of the reference FASTA file.

This commit checks that all BAM files have the same headers in the same order. It then computes an old_refindex => new_refindex map, which can be stored as a single Vec. The coverage matrix can then be preallocated and filled in. This approach is nearly perfectly memory efficient.

The only remaining issue is that the headers might take up a lot of memory, which is unnecessary in the case that the users already know the headers. This might seem trivial, but if there are 10 million sequences that each has to be heap allocated, it can take some time. A solution to this, which I employ in Vamb, is to pass a hash of the headers to the coverage computation. The headers can then be verified without consuming any memory.

Currently, pycoverm is not particularly memory efficient. It keeps coverages in an name=>Vec<f32> IndexMap, then creates a matrix and fills the matrxi from the map. This allows the order of reference sequences in the BAM files to be in different order, but it consumes a lot of memory. Further, it's not very common to have a situation where BAM files are mapped to the exact same set of references, but in a different order, as BAM reference order is deterministic, and reflects the order of the reference FASTA file. This commit checks that all BAM files have the same headers in the same order. It then computes an old_refindex => new_refindex map, which can be stored as a single Vec<u32>. The coverage matrix can then be preallocated and filled in. This approach is nearly perfectly memory efficient.

apcamargo · 2022-11-22T18:33:20Z

Thanks a lot, @jakobnissen! This looks great!

Just found one possible bug:

fn index_map(headers: &[String], names: &HashSet<&str>) -> Vec<u32> {
    // Make sure there are no names in names which are not in headers.
    let mut n_seen: usize = 0;
    let mut result = vec![u32::MAX; names.len()]; // ← should be `let mut result = vec![u32::MAX; headers.len()];`, right?
    let mut new_index: u32 = 0;
    for (i, header) in headers.iter().enumerate() {
        if names.contains(&header.as_str()) {
            result[i] = new_index;
            new_index += 1;
            n_seen += 1
        }
    }
    if n_seen != names.len() {
        panic!("Some names in `names` were not found in headers")
    }
    result
}

Also, I think it would be better if the error message was "Some names in contig_set were not found in headers", since contig_set is the parameter name that is exposed to the user.

jakobnissen · 2022-11-22T18:46:06Z

Good points - you're right. Fixed in this commit.

apcamargo · 2022-11-22T18:48:09Z

Thanks! I'll merge this and get a new version out as soon as possible

Address comments

cd34749

apcamargo merged commit a2d87c8 into apcamargo:main Nov 22, 2022

jakobnissen deleted the memory branch November 22, 2022 18:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimise mem usage, disallowing random ref order #10

Optimise mem usage, disallowing random ref order #10

jakobnissen commented Nov 21, 2022

apcamargo commented Nov 22, 2022

jakobnissen commented Nov 22, 2022

apcamargo commented Nov 22, 2022

Optimise mem usage, disallowing random ref order #10

Optimise mem usage, disallowing random ref order #10

Conversation

jakobnissen commented Nov 21, 2022

apcamargo commented Nov 22, 2022

jakobnissen commented Nov 22, 2022

apcamargo commented Nov 22, 2022