Speeding up malva using more threads #8

Open
ivargr opened this issue May 5, 2021 · 2 comments · Fixed by #9

Comments

ivargr commented May 5, 2021

First, thanks for a very nice tool!

I have a couple of questions:

  1. Is it possible to speed up Malva using more threads? I know that KMC can easily use more threads, so the first step of Malva (getting kmers from the reads) can be parallelized, but can malva-geno (computing the signatures and performing the genotyping) also be parallelized in any way?

  2. Assume I want to genotype many samples. I guess one could save a lot of time by not recomputing the variant signatures and reference kmers for each sample (since these should be the same, given the same input variants). It seems that malva-geno currently does this every time a new sample is genotyped. Is it possible to reuse this data so that genotyping new samples becomes faster?

Thanks!

mpre (Member) commented May 5, 2021

Hi @ivargr, thanks for your comments, we are glad you're testing our tool.

Both requests are sensible but unfortunately they are not part of the current version of malva.

The first point might need some major rework and it's not to be expected in the foreseeable future.

Regarding the second point, I just pushed a new branch to this repository that tries to solve it (branch split_main). With that branch you can use malva-geno index to create the index of your reference genome/VCF and then malva-geno call to produce the output VCF that includes the genotypes. You still need to pass all the arguments to both steps since I haven't reworked the interface yet (arguments that don't affect the index/call step are simply ignored). If you use the MALVA script, it also checks whether the index is already available and, if so, skips the indexing step.
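The index-once, call-per-sample flow described above can be sketched as a small dry-run planner. This is only an illustration of the control flow, not the MALVA script itself: the function name `plan_malva_geno`, the `index_marker` path, and the placeholder argument list are assumptions, since the thread names neither the index file nor the exact arguments. Only the `malva-geno index` and `malva-geno call` subcommands come from the comment.

```python
from pathlib import Path
from typing import List, Sequence


def plan_malva_geno(shared_args: Sequence[str], index_marker: Path) -> List[List[str]]:
    """Return the malva-geno commands to run for one sample (dry run).

    shared_args  -- placeholder for the full malva-geno argument list; the
                    comment above says both steps currently take all arguments.
    index_marker -- hypothetical path whose existence means the index was
                    already built; the real index file name is not given here.
    """
    commands: List[List[str]] = []
    if not index_marker.exists():
        # Index missing: build it once from the reference/VCF.
        commands.append(["malva-geno", "index", *shared_args])
    # Genotyping always runs and reuses whatever index is on disk.
    commands.append(["malva-geno", "call", *shared_args])
    return commands


if __name__ == "__main__":
    # Print the planned commands instead of executing them.
    args = ["<all-malva-geno-arguments>"]
    for cmd in plan_malva_geno(args, Path("index.marker")):
        print(" ".join(cmd))
```

For the first sample this plans both steps; once the marker exists, later samples plan only the call step, which is where the time saving for many samples would come from.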

This version is not published on bioconda yet, and it will probably take a while before we are able to clean up the code and push it there, so you'll need to compile it yourself and take care of the dependencies.

Finally, I want to stress that this version is experimental. I tested it on the example we provide in the repo and it works, but I didn't check whether there's a performance hit on big datasets or whether it breaks in some edge cases. The index might also not be portable, since it's basically a serialization of the in-memory index.
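As a general illustration of why a raw dump of in-memory data may not be portable (this is not malva's index format, which is not described in this thread), compare a native layout with an explicit one using Python's `struct` module:

```python
import struct

value = 1

# Native layout ("@"): size and byte order follow the machine and compiler,
# similar in spirit to writing an in-memory structure straight to disk.
native = struct.pack("@l", value)

# Explicit layout ("<q"): always 8 bytes, little-endian, on every platform.
portable = struct.pack("<q", value)

print(len(native), len(portable))
```

The length of `native` can differ between platforms, so a file written that way on one machine may not load correctly on another. An index built and used on the same machine is not affected by this.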

mpre self-assigned this May 5, 2021
mpre added the enhancement (New feature or request) and question (Further information is requested) labels May 5, 2021
mpre linked a pull request May 5, 2021 that will close this issue

ivargr (Author) commented May 6, 2021

Thanks a lot for the quick response!

I've tested the new branch and it seems to work great on the data I have tried it with!

No worries about Malva not being able to multithread fully for now. I was mostly just curious whether it was possible, but it would be a cool improvement in a future version of Malva :)
