Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise mem usage, disallowing random ref order #10

Merged
merged 2 commits into from
Nov 22, 2022

Conversation

jakobnissen
Copy link
Collaborator

Currently, pycoverm is not particularly memory efficient. It keeps coverages in an name=>Vec IndexMap, then creates a matrix and fills the matrxi from the map.
This allows the order of reference sequences in the BAM files to be in different order, but it consumes a lot of memory. Further, it's not very common to have a situation where BAM files are mapped to the exact same set of references, but in a different order, as BAM reference order is deterministic, and reflects the order of the reference FASTA file.

This commit checks that all BAM files have the same headers in the same order. It then computes an old_refindex => new_refindex map, which can be stored as a single Vec. The coverage matrix can then be preallocated and filled in. This approach is nearly perfectly memory efficient.

The only remaining issue is that the headers might take up a lot of memory, which is unnecessary in the case that the users already know the headers. This might seem trivial, but if there are 10 million sequences that each has to be heap allocated, it can take some time. A solution to this, which I employ in Vamb, is to pass a hash of the headers to the coverage computation. The headers can then be verified without consuming any memory.

Currently, pycoverm is not particularly memory efficient. It keeps coverages
in an name=>Vec<f32> IndexMap, then creates a matrix and fills the matrxi from
the map.
This allows the order of reference sequences in the BAM files to be in different
order, but it consumes a lot of memory. Further, it's not very common to have a
situation where BAM files are mapped to the exact same set of references, but
in a different order, as BAM reference order is deterministic, and reflects the
order of the reference FASTA file.

This commit checks that all BAM files have the same headers in the same order.
It then computes an old_refindex => new_refindex map, which can be stored as a
single Vec<u32>. The coverage matrix can then be preallocated and filled in.
This approach is nearly perfectly memory efficient.
@apcamargo
Copy link
Owner

Thanks a lot, @jakobnissen! This looks great!

Just found one possible bug:

fn index_map(headers: &[String], names: &HashSet<&str>) -> Vec<u32> {
    // Make sure there are no names in names which are not in headers.
    let mut n_seen: usize = 0;
    let mut result = vec![u32::MAX; names.len()]; // ← should be `let mut result = vec![u32::MAX; headers.len()];`, right?
    let mut new_index: u32 = 0;
    for (i, header) in headers.iter().enumerate() {
        if names.contains(&header.as_str()) {
            result[i] = new_index;
            new_index += 1;
            n_seen += 1
        }
    }
    if n_seen != names.len() {
        panic!("Some names in `names` were not found in headers")
    }
    result
}

Also, I think it would be better if the error message was "Some names in contig_set were not found in headers", since contig_set is the parameter name that is exposed to the user.

@jakobnissen
Copy link
Collaborator Author

Good points - you're right. Fixed in this commit.

@apcamargo
Copy link
Owner

Thanks! I'll merge this and get a new version out as soon as possible

@apcamargo apcamargo merged commit a2d87c8 into apcamargo:main Nov 22, 2022
@jakobnissen jakobnissen deleted the memory branch November 22, 2022 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants