parallelize the loading step #13

arq5x · 2012-04-18T18:33:51Z

Even with all of the Cython optimizations, loading very large (many variants & many samples) VCF files into the DB still takes a tremendous amount of time.

One option for speeding this up would be to use Python's multiprocessing module to assign specific chunks of the VCF file to individual processes. Each process would populate its own temporary version of each table in order to avoid deadlocks. At the end, the final tables would be created by taking a union of the temp tables. Index creation would have to be delayed until the end.
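
A minimal sketch of this idea, assuming the DB is SQLite and using a hypothetical one-table schema (`variants`). The names `load_chunk`, `merge_temp_dbs`, and `load_parallel` are made up for illustration and the rows are assumed to be already parsed. Each worker writes to its own temporary database file, and the parent unions them with `ATTACH DATABASE` before building the index:

```python
import multiprocessing
import os
import shutil
import sqlite3
import tempfile

# Hypothetical single-table schema, for illustration only.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS variants "
    "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
)


def load_chunk(task):
    """Worker: write one chunk of already-parsed rows to its own temp DB."""
    db_path, rows = task
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()
    return db_path


def merge_temp_dbs(final_path, temp_paths):
    """Union the per-process tables into the final table, then index it."""
    conn = sqlite3.connect(final_path)
    conn.execute(SCHEMA)
    for path in temp_paths:
        conn.execute("ATTACH DATABASE ? AS chunk", (path,))
        conn.execute("INSERT INTO variants SELECT * FROM chunk.variants")
        conn.commit()  # end the transaction so the chunk can be detached
        conn.execute("DETACH DATABASE chunk")
    # Index creation is delayed until every chunk has been merged.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_chrom_pos ON variants (chrom, pos)"
    )
    conn.commit()
    conn.close()


def load_parallel(chunks, final_path, processes=2):
    """Load each chunk in a separate process, then merge serially."""
    tmp_dir = tempfile.mkdtemp()
    try:
        tasks = [
            (os.path.join(tmp_dir, "chunk_%d.db" % i), rows)
            for i, rows in enumerate(chunks)
        ]
        with multiprocessing.Pool(processes) as pool:
            temp_paths = pool.map(load_chunk, tasks)
        merge_temp_dbs(final_path, temp_paths)
    finally:
        shutil.rmtree(tmp_dir)
```

`load_parallel` should be called from under an `if __name__ == "__main__":` guard so that it also works on platforms where multiprocessing spawns fresh interpreters. The merge itself is still serial, so whether this is a net win would need to be measured.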

Open questions:

What is the best way to assign arbitrary chunks of Records from a VCF file to each process? Does the pysam tabix API support this? We will most likely have to roll our own.
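
One possible "roll our own" approach, sketched under the assumption that the VCF is uncompressed plain text: split the data portion of the file into byte ranges and nudge each boundary forward to the next line start, so that every record falls in exactly one chunk. `chunk_offsets` and `read_chunk` are hypothetical helper names, not part of any existing API, and bgzip-compressed files are not covered here:

```python
import os


def chunk_offsets(path, n_chunks):
    """Return (start, end) byte ranges covering the records of a plain-text VCF."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        # Skip the header: every header line starts with '#'.
        data_start = 0
        while True:
            line = fh.readline()
            if not line.startswith(b"#"):
                break
            data_start = fh.tell()
        if data_start >= size:
            return []  # header only, no records

        step = max(1, (size - data_start) // n_chunks)
        bounds = [data_start]
        for i in range(1, n_chunks):
            target = data_start + i * step
            if target >= size:
                break
            fh.seek(target)
            fh.readline()  # move forward to the next line boundary
            pos = fh.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
        bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def read_chunk(path, start, end):
    """Yield the tab-split fields of each record in the byte range [start, end)."""
    with open(path, "rb") as fh:
        fh.seek(start)
        while fh.tell() < end:
            line = fh.readline()
            if not line:
                break
            yield line.decode().rstrip("\r\n").split("\t")
```

Each `(start, end)` pair could then be handed to a separate process, which would call `read_chunk` on its own range. Chunks are balanced by bytes rather than by record count, which is only a rough proxy for the work per chunk.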

arq5x closed this as completed Mar 9, 2013
