parallelize the loading step #13

Closed
arq5x opened this issue Apr 18, 2012 · 0 comments


arq5x commented Apr 18, 2012

Even with all of the Cython optimizations, loading very large (many variants & many samples) VCF files into the DB still takes a tremendous amount of time.

One option for speeding this up would be to use Python's multiprocessing module to assign specific chunks of the VCF file to individual processes. Each process would populate its own temporary version of each table in order to avoid deadlocks. At the end, the final tables would be created by taking a union of the temp tables. Index creation would have to be delayed until the end.
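A rough sketch of that idea, to make the moving parts concrete. It assumes the DB is SQLite and that each worker writes to its own temporary database file; the `variants` schema, the index name, and the helper names (`load_chunk`, `merge_chunks`, `parallel_load`) are made up for illustration and are not the project's actual code.

```python
import os
import sqlite3
import tempfile
from multiprocessing import Pool

# Hypothetical, simplified schema; the real tables have many more columns.
SCHEMA = "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"


def load_chunk(job):
    """Worker: write one chunk of rows into its own private SQLite file."""
    db_path, rows = job
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()
    return db_path


def merge_chunks(final_path, chunk_paths):
    """Union the per-process tables into the final table, then index once."""
    conn = sqlite3.connect(final_path)
    conn.execute(SCHEMA)
    for path in chunk_paths:
        conn.execute("ATTACH DATABASE ? AS chunk", (path,))
        conn.execute("INSERT INTO variants SELECT * FROM chunk.variants")
        conn.commit()  # DETACH is not allowed inside an open transaction
        conn.execute("DETACH DATABASE chunk")
    # Index creation is delayed until all rows are in place.
    conn.execute("CREATE INDEX var_pos_idx ON variants (chrom, pos)")
    conn.commit()
    conn.close()


def parallel_load(final_path, chunks, processes=2):
    """Load each chunk (a list of row tuples) in its own process, then merge."""
    tmp_dir = tempfile.mkdtemp(prefix="vcf_load_")
    jobs = [(os.path.join(tmp_dir, "chunk_%d.db" % i), rows)
            for i, rows in enumerate(chunks)]
    with Pool(processes) as pool:
        chunk_paths = pool.map(load_chunk, jobs)
    merge_chunks(final_path, chunk_paths)
    for path in chunk_paths:
        os.remove(path)
    os.rmdir(tmp_dir)
```

`parallel_load` should be called from under an `if __name__ == "__main__":` guard so that it also works on platforms where multiprocessing spawns fresh interpreters. Since no two processes ever touch the same file, there is no lock contention during the load itself; the merge step is serial.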

Open questions:

  1. What is the best way to assign arbitrary chunks of Records from a VCF file to each process? Does the pysam tabix API support this? We will most likely have to roll our own.
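One possible "roll our own" approach to question 1, sketched under the assumption that the VCF is an uncompressed text file: split the data portion into byte ranges and snap each boundary forward to the next line start, so every record belongs to exactly one chunk. The function names `chunk_offsets` and `read_chunk` are hypothetical. A bgzip-compressed file would need a different strategy, since plain byte offsets do not map onto record boundaries there.

```python
import os


def chunk_offsets(path, n_chunks):
    """Return (start, end) byte ranges covering all data lines of a VCF.

    Header lines (starting with '#') are skipped, and every boundary
    falls on the start of a line.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        # Find where the header ends and the records begin.
        data_start = 0
        while True:
            line = fh.readline()
            if not line.startswith(b"#"):
                break
            data_start = fh.tell()

        bounds = [data_start]
        step = max(1, (size - data_start) // n_chunks)
        for i in range(1, n_chunks):
            fh.seek(data_start + i * step)
            fh.readline()  # discard the rest of the current line
            pos = fh.tell()
            # Keep boundaries strictly increasing and inside the file.
            if bounds[-1] < pos < size:
                bounds.append(pos)
        bounds.append(size)
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]


def read_chunk(path, start, end):
    """Yield the record lines that lie in the byte range [start, end)."""
    with open(path, "rb") as fh:
        fh.seek(start)
        while fh.tell() < end:
            line = fh.readline()
            if not line:
                break
            yield line.decode("utf-8").rstrip("\n")
```

Each `(start, end)` pair could then be handed to one worker process, which calls `read_chunk` and parses only its own records. The chunks are balanced by bytes rather than by record count, which seems good enough when lines are of similar length.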
@arq5x arq5x closed this as completed Mar 9, 2013