Add Chromosome based file splitting #9
Commit messages (each signed off with "Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>"):
clang format
code organization
clang changes
Add region query benchmark (compiler-research#8)
* query performance
* clang format
* code organization
* clang changes
add chromosome based file splitting
delete example sam file
clang changes
test file changes
clang changes
Added benchmark files.
I have updated the PR with the benchmark results for this implementation.
Okay, do we know what takes us so long?
The SAM-to-BAM conversion is considerably faster because of the indexing. As of now, conversion with our implementation is slower in every respect.
Can you elaborate on what indexing means here?
BAM is BGZF-compressed and “coordinate-sorted.” An index (.bai or .csi) maps genomic positions to file offsets, so tools can seek directly to the blocks that contain a region instead of scanning the whole file. These lines highlight this
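Independently of the lines linked in the comment above, the seek-instead-of-scan idea can be sketched in a few lines. This is a toy illustration only: it is not the .bai/.csi format or the htslib API, it uses uncompressed tab-separated text instead of BGZF blocks, and it treats every record as a single position. The names `build_index`, `query`, and `WINDOW` are hypothetical; the 16 kb window size merely echoes the window size of the BAI linear index.

```python
import io

WINDOW = 16_384  # bases per index window (illustrative choice)


def build_index(stream):
    """Map (chrom, window) -> byte offset of the first record in that window.

    Assumes 'chrom<TAB>pos<TAB>name' records sorted by chrom, then pos.
    """
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        chrom, pos, _ = line.decode().rstrip("\n").split("\t")
        # Keep only the first offset seen for each window.
        index.setdefault((chrom, int(pos) // WINDOW), offset)
    return index


def query(stream, index, chrom, start, end):
    """Return names of records with start <= pos < end, seeking via the index."""
    first, last = start // WINDOW, (end - 1) // WINDOW
    offsets = [index[(chrom, w)] for w in range(first, last + 1) if (chrom, w) in index]
    if not offsets:
        return []
    stream.seek(min(offsets))  # jump straight to the first candidate record
    hits = []
    for raw in stream:
        c, pos, name = raw.decode().rstrip("\n").split("\t")
        if c != chrom or int(pos) >= end:
            break  # sorted input: nothing further can match
        if int(pos) >= start:
            hits.append(name)
    return hits


# Hypothetical demo data, already coordinate-sorted.
records = [("chr1", 100, "r1"), ("chr1", 20_000, "r2"),
           ("chr1", 40_000, "r3"), ("chr2", 50, "r4")]
data = io.BytesIO("".join(f"{c}\t{p}\t{n}\n" for c, p, n in records).encode())
index = build_index(data)
print(query(data, index, "chr1", 16_384, 50_000))  # ['r2', 'r3']
```

The query never reads `r1`: it seeks to the offset stored for the first window overlapping the region and stops as soon as the sorted input passes the end of the region.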
I think we can organize that with ROOT files, too. You can tell ROOT how to store the data, vertically, so that search times are faster. Have you tried to do that?
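One possible reading of "vertically" is column-wise storage, where each field is kept contiguously so a search only touches the fields it needs. The sketch below illustrates that reading with plain Python lists; it does not use ROOT or RNTuple, and the record fields and function names are hypothetical.

```python
# Row-wise layout: every record carries all of its fields together.
rows = [
    {"chrom": "chr1", "pos": 100, "seq": "ACGTACGT"},
    {"chrom": "chr1", "pos": 20_000, "seq": "TTGACCAA"},
    {"chrom": "chr2", "pos": 50, "seq": "GGGCCCAT"},
]


def to_columns(rows):
    """Transpose row-wise records into one list per field (column-wise layout)."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns


def rows_in_region(columns, chrom, start, end):
    """Return indices of records in [start, end) on chrom.

    Only the 'chrom' and 'pos' columns are read; the bulky 'seq'
    column is never touched during the search.
    """
    return [
        i
        for i, (c, p) in enumerate(zip(columns["chrom"], columns["pos"]))
        if c == chrom and start <= p < end
    ]


columns = to_columns(rows)
matches = rows_in_region(columns, "chr1", 0, 1_000)
print(matches)                               # [0]
print([columns["seq"][i] for i in matches])  # ['ACGTACGT']
```

The sequence data is fetched only for the rows that matched, which is the property that can make region searches cheaper in a columnar layout.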
No, but I am working on it with these details. Will update you if I see performance gains.
Force-pushed from 6d3d6bd to 81a68c8.
Have you considered using TBufferMerger? There are some strategies discussed here: https://root-forum.cern.ch/t/producing-root-files-in-parallel/22003
Hello @vgvassilev, I don't think TBufferMerger works well with RNTuple. From the docs I tried to implement
@vgvassilev I implemented threading for the samtools benchmark; this was the result
vgvassilev left a comment
Lgtm!
This PR adds chromosome-based RAM file splitting.
-split option to the samtoramntuple tool that splits SAM files into separate RAM files per chromosome. However, the SAM-to-BAM method is faster.
The result is attached here
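For readers unfamiliar with the splitting step named in the summary, a minimal sketch of grouping SAM alignment records by chromosome follows. It is not the PR's implementation: it works on plain SAM text lines and returns them grouped in memory, rather than writing RAM (RNTuple) files, and the function name `split_sam_by_chromosome` and the sample records are hypothetical. It relies only on two facts from the SAM format: header lines start with `@`, and the reference name (RNAME) is the third tab-separated field.

```python
from collections import defaultdict


def split_sam_by_chromosome(lines):
    """Group SAM alignment lines by RNAME; return (header_lines, {rname: lines})."""
    header = []
    by_chrom = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)  # header lines are shared by every output
            continue
        rname = line.split("\t")[2]  # RNAME column; '*' marks unmapped reads
        by_chrom[rname].append(line)
    return header, dict(by_chrom)


# Hypothetical three-read SAM file.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:1000",
    "read1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tchr2\t20\t60\t4M\t*\t0\t0\tTTGA\tIIII",
    "read3\t0\tchr1\t30\t60\t4M\t*\t0\t0\tGGCC\tIIII",
]

header, groups = split_sam_by_chromosome(sam_lines)
for chrom, records in groups.items():
    print(chrom, len(records))
```

Each group could then be written to its own output together with the shared header, so that a region query on one chromosome only has to open that chromosome's file.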