Even with all of the Cython optimizations, loading very large (many variants & many samples) VCF files into the DB still takes a tremendous amount of time.
One option for speeding this up would be to use Python's multiprocessing module to assign specific chunks of the VCF file to individual processes. Each process would populate its own temporary version of each table in order to avoid deadlocks. At the end, the final tables would be created by taking a union of the temp tables. Index creation would have to be delayed until the end.
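A rough sketch of that idea, to make the moving parts concrete. It assumes an SQLite backend and a single toy `variants` table. The names `load_chunk`, `merge_chunks`, `load_parallel` and the column layout are hypothetical, not existing code. Each worker writes to its own database file, so no two processes ever hold a lock on the same file. The parent then attaches each file in turn, copies the rows over, and builds the index last.

```python
import multiprocessing as mp
import os
import sqlite3
import tempfile

# Hypothetical toy schema; the real tables have many more columns.
SCHEMA = "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"


def load_chunk(job):
    """Worker: write one chunk of rows into its own private database file."""
    chunk_id, rows, workdir = job
    path = os.path.join(workdir, f"chunk_{chunk_id}.db")
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()
    return path


def merge_chunks(final_path, chunk_paths):
    """Parent: union the per-process tables, then build the index once."""
    conn = sqlite3.connect(final_path)
    conn.execute(SCHEMA)
    for path in chunk_paths:
        conn.execute("ATTACH DATABASE ? AS chunk", (path,))
        conn.execute("INSERT INTO variants SELECT * FROM chunk.variants")
        conn.commit()  # close the transaction before detaching
        conn.execute("DETACH DATABASE chunk")
    # Index creation is deferred until all rows are in place.
    conn.execute("CREATE INDEX idx_variants_chrom_pos ON variants (chrom, pos)")
    conn.commit()
    conn.close()


def load_parallel(chunks, final_path, processes=2):
    """Fan the chunks out to a pool of workers, then merge their output."""
    with tempfile.TemporaryDirectory() as workdir:
        jobs = [(i, rows, workdir) for i, rows in enumerate(chunks)]
        with mp.Pool(processes) as pool:
            paths = pool.map(load_chunk, jobs)
        merge_chunks(final_path, paths)
```

Calling `load_parallel(chunks, "final.db")` from under an `if __name__ == "__main__":` guard would run the whole thing. Here `chunks` is a list of row lists. In practice each worker would parse its own slice of the VCF rather than receive rows already parsed by the parent.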
Open questions:
What is the best way to assign arbitrary chunks of Records from a VCF file to each process? Does the pysam tabix API support this? We will most likely have to roll our own.
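If we do end up rolling our own, one possible approach is to split the file by byte offset and snap each boundary to the next line start. The sketch below assumes an uncompressed, plain-text VCF. A bgzipped file would need a different strategy, and this does not use pysam or tabix at all. `chunk_offsets` and `read_chunk` are hypothetical helper names.

```python
import os


def chunk_offsets(path, n_chunks):
    """Return (start, end) byte ranges that together cover all data lines."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        # Skip the header: data_start ends up at the first non-'#' line.
        data_start = 0
        while True:
            line = fh.readline()
            if not line.startswith(b"#"):
                break
            data_start = fh.tell()
        step = max(1, (size - data_start) // n_chunks)
        bounds = [data_start]
        for i in range(1, n_chunks):
            target = data_start + i * step
            if target >= size:
                break
            fh.seek(target)
            fh.readline()  # discard the partial line; now at a line start
            pos = fh.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
        bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def read_chunk(path, start, end):
    """Yield the tab-split fields of every record in the byte range."""
    with open(path, "rb") as fh:
        fh.seek(start)
        while fh.tell() < end:
            line = fh.readline()
            if not line:
                break
            yield line.decode().rstrip("\n").split("\t")
```

Each worker would receive one `(start, end)` pair and call `read_chunk` on it, so the parent never has to parse or ship the records itself. The chunks are balanced by bytes, not by record count, which is probably close enough when the lines are of similar length.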