Conversation

@AdityaPandeyCN

@AdityaPandeyCN AdityaPandeyCN commented Oct 5, 2025

This PR adds Chromosome-Based RAM File Splitting.

  1. Adds a -split option to the samtoramntuple tool that splits a SAM file into separate RAM files, one per chromosome.
  2. Adds benchmarks comparing chromosome-based splitting for BAM and RAM files. They show that the RAM output provides better compression,
    but the SAM-to-BAM method is faster.
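To illustrate what per-chromosome splitting involves, here is a minimal sketch that groups SAM alignment lines by their RNAME field (column 3 in the SAM format). It is not the samtoramntuple implementation: the function names `ReferenceName` and `SplitByChromosome` are hypothetical, and the groups are kept in memory, whereas a real tool would stream each group to its own output file.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Extract the RNAME field (column 3) of a SAM alignment line.
// Returns an empty string for header lines (starting with '@')
// and for lines with fewer than three tab-separated fields.
std::string ReferenceName(const std::string &line) {
   if (line.empty() || line[0] == '@') return "";
   std::istringstream fields(line);
   std::string qname, flag, rname;
   if (!std::getline(fields, qname, '\t') || !std::getline(fields, flag, '\t') ||
       !std::getline(fields, rname, '\t'))
      return "";
   return rname;
}

// Group alignment lines by chromosome name (hypothetical helper).
// Unmapped reads carry RNAME "*" and end up in a group of their own.
std::map<std::string, std::vector<std::string>>
SplitByChromosome(const std::vector<std::string> &lines) {
   std::map<std::string, std::vector<std::string>> groups;
   for (const auto &line : lines) {
      const std::string rname = ReferenceName(line);
      if (!rname.empty()) groups[rname].push_back(line);
   }
   return groups;
}
```

Each key of the returned map would correspond to one output file in a split run.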

The results are attached below:

-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_SamtoolsSplit/100000          6878 ms         14.1 ms           10 reads/s=711.163k/s size_MB=6.25871
BM_SamtoolsSplit/500000          6566 ms         29.2 ms            1 reads/s=17.1186M/s size_MB=30.619
BM_SamtoolsSplit/1000000         7958 ms         26.1 ms            1 reads/s=38.3489M/s size_MB=60.669
BM_ChromosomeSplit/100000        4780 ms         3583 ms            1 reads/s=27.9123k/s size_MB=4.9771
BM_ChromosomeSplit/500000       13316 ms        12077 ms            1 reads/s=41.4011k/s size_MB=24.1665
BM_ChromosomeSplit/1000000      22668 ms        21471 ms            1 reads/s=46.5735k/s size_MB=48.1444
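A note on reading the table: the reads/s counters match the read count divided by the CPU column (for example, 100000 reads / 3.583 s ≈ 27.9k/s for BM_ChromosomeSplit/100000), not by the Time column. For BM_SamtoolsSplit the CPU column is only a few tens of milliseconds, presumably because most of the work happens outside the measured process (an assumption, not confirmed here), so its reads/s values are not directly comparable. A small sketch for deriving wall-clock throughput from the Time column instead; `ReadsPerSecond` is a hypothetical helper:

```cpp
#include <cassert>

// Throughput in reads per second, from a read count and an
// elapsed time in milliseconds (e.g. the "Time" column above).
double ReadsPerSecond(long long reads, double elapsedMs) {
   assert(elapsedMs > 0.0);
   return static_cast<double>(reads) / (elapsedMs / 1000.0);
}
```

With the 1000000-read rows, `ReadsPerSecond(1000000, 7958)` gives roughly 125.7k reads/s for BM_SamtoolsSplit and `ReadsPerSecond(1000000, 22668)` roughly 44.1k reads/s for BM_ChromosomeSplit.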

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

clang format

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

code organization

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

clang changes

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

Add region query benchmark (compiler-research#8)

* query performance

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

* clang format

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

* code organization

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

* clang changes

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

---------

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

add chromosome based file splitting

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

delete example sam file

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

clang changes

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

test file changes

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>

clang changes

Signed-off-by: AdityaPandeyCN <adityapand3y666@gmail.com>
@AdityaPandeyCN
Author

Added benchmark files

@AdityaPandeyCN
Author

I have updated with the benchmark results for this implementation

@vgvassilev

Okay, do we know what takes us so long?

@AdityaPandeyCN
Author

The SAM-to-BAM conversion is considerably faster because of the indexing. For now, conversion with our implementation is slower in every respect.

@vgvassilev

> The SAM-to-BAM conversion is considerably faster because of the indexing. For now, conversion with our implementation is slower in every respect.

Can you elaborate what indexing means here?

@AdityaPandeyCN
Author

AdityaPandeyCN commented Oct 7, 2025

BAM is BGZF-compressed and, for indexing, coordinate-sorted. An index (.bai or .csi) maps genomic positions to file offsets, so tools can seek directly to the blocks that contain a region instead of scanning the whole file.
Their indexing is superior to what we have implemented so far.

These lines highlight this:
https://github.com/AdityaPandeyCN/ramtools/blob/develop/benchmark/chromosome_split_benchmark.cxx#L59C7-L72C8
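To make the idea of an index that maps genomic positions to file offsets concrete, here is a simplified sketch of a window-based (linear) index. It is only loosely modelled on the 16 kbp linear index in .bai files and is neither the real BAI/CSI layout nor the code in the linked benchmark. The `Record` struct and the function names are hypothetical, and the records are assumed to be coordinate-sorted so that file offsets grow with position.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct Record {
   std::uint32_t start;  // 0-based start position on the chromosome
   std::uint32_t end;    // exclusive end position
   std::uint64_t offset; // byte offset of the record in the file
};

constexpr std::uint32_t kWindowBits = 14; // 2^14 = 16384 bp per window
constexpr std::uint64_t kNoOffset = std::numeric_limits<std::uint64_t>::max();

// For every window, store the smallest file offset of any record overlapping it.
std::vector<std::uint64_t> BuildLinearIndex(const std::vector<Record> &records) {
   std::vector<std::uint64_t> index;
   for (const auto &rec : records) {
      assert(rec.end > rec.start);
      const std::size_t first = rec.start >> kWindowBits;
      const std::size_t last = (rec.end - 1) >> kWindowBits;
      if (index.size() <= last) index.resize(last + 1, kNoOffset);
      for (std::size_t w = first; w <= last; ++w)
         if (rec.offset < index[w]) index[w] = rec.offset;
   }
   return index;
}

// Offset from which a reader should start scanning for records that overlap
// a region beginning at regionStart. Empty windows are skipped; kNoOffset
// means no record lies at or after the region start.
std::uint64_t SeekOffset(const std::vector<std::uint64_t> &index, std::uint32_t regionStart) {
   for (std::size_t w = regionStart >> kWindowBits; w < index.size(); ++w)
      if (index[w] != kNoOffset) return index[w];
   return kNoOffset;
}
```

A region query then seeks to `SeekOffset(index, regionStart)` and reads forward until the records start past the region end, rather than scanning the file from the beginning.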

@vgvassilev

> BAM is BGZF-compressed and, for indexing, coordinate-sorted. An index (.bai or .csi) maps genomic positions to file offsets, so tools can seek directly to the blocks that contain a region instead of scanning the whole file. Their indexing is superior to what we have implemented so far.
>
> These lines highlight this: https://github.com/AdityaPandeyCN/ramtools/blob/develop/benchmark/chromosome_split_benchmark.cxx#L59C7-L72C8

I think we can organize that with ROOT files, too. You can tell ROOT how to store the data in the file (vertically) so that search times are faster. Have you tried that?

@AdityaPandeyCN
Author

No, but I am working on it with these details. Will update you if I see performance gains.

@vgvassilev

Have you considered using TBufferMerger? There are some strategies discussed here: https://root-forum.cern.ch/t/producing-root-files-in-parallel/22003

@AdityaPandeyCN
Author

Hello @vgvassilev, I don't think TBufferMerger works well with RNTuple. Following the docs, I tried RNTupleParallelWriter instead, and these were the benchmark results:

----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
BM_SamtoolsSplit/100000                   3685 ms         16.2 ms            1 reads/s=6.15587M/s size_MB=6.25878
BM_SamtoolsSplit/500000                  23859 ms         70.5 ms            1 reads/s=7.08798M/s size_MB=30.6191
BM_SamtoolsSplit/1000000                 21946 ms         82.8 ms            1 reads/s=12.0733M/s size_MB=60.669
BM_ChromosomeSplitThreads/100000/2        3345 ms          915 ms            1 reads/s=109.327k/s size_MB=4.84682 threads=2
BM_ChromosomeSplitThreads/100000/4        2480 ms          553 ms            1 reads/s=180.763k/s size_MB=4.87449 threads=4
BM_ChromosomeSplitThreads/100000/8        1937 ms          524 ms            1 reads/s=190.784k/s size_MB=4.87457 threads=8
BM_ChromosomeSplitThreads/500000/2        6829 ms         2903 ms            1 reads/s=172.252k/s size_MB=22.9747 threads=2
BM_ChromosomeSplitThreads/500000/4        6746 ms         2749 ms            1 reads/s=181.866k/s size_MB=23.2019 threads=4
BM_ChromosomeSplitThreads/500000/8        5753 ms         2691 ms            1 reads/s=185.789k/s size_MB=23.2059 threads=8
BM_ChromosomeSplitThreads/1000000/2      12836 ms         5541 ms            1 reads/s=180.47k/s size_MB=45.2593 threads=2
BM_ChromosomeSplitThreads/1000000/4      11436 ms         5448 ms            1 reads/s=183.569k/s size_MB=45.6385 threads=4
BM_ChromosomeSplitThreads/1000000/8      10894 ms         5344 ms            1 reads/s=187.121k/s size_MB=45.6402 threads=8

@AdityaPandeyCN
Author

AdityaPandeyCN commented Oct 16, 2025

@vgvassilev I implemented threading for the samtools benchmark as well; these were the results:

----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
BM_SamtoolsSplit/100000                   7264 ms         34.2 ms            1 reads/s=2.92049M/s size_MB=6.25878
BM_SamtoolsSplit/500000                  20529 ms          114 ms            1 reads/s=4.3819M/s size_MB=30.6191
BM_SamtoolsSplit/1000000                 31397 ms          161 ms            1 reads/s=6.21878M/s size_MB=60.669
BM_SamtoolsSplitThreaded/100000/2         5543 ms         18.6 ms            1 reads/s=5.38264M/s size_MB=6.2591 threads=2
BM_SamtoolsSplitThreaded/100000/4         2630 ms         18.6 ms            1 reads/s=5.38277M/s size_MB=6.25915 threads=4
BM_SamtoolsSplitThreaded/500000/2         7685 ms         64.8 ms            1 reads/s=7.71797M/s size_MB=30.6194 threads=2
BM_SamtoolsSplitThreaded/500000/4         8498 ms         73.4 ms            1 reads/s=6.81138M/s size_MB=30.6195 threads=4
BM_SamtoolsSplitThreaded/1000000/2       17999 ms          142 ms            1 reads/s=7.05627M/s size_MB=60.6694 threads=2
BM_SamtoolsSplitThreaded/1000000/4       20290 ms          121 ms            1 reads/s=8.27261M/s size_MB=60.6694 threads=4
BM_ChromosomeSplitThreads/100000/2        6510 ms         1323 ms            1 reads/s=75.5897k/s size_MB=4.84642 threads=2
BM_ChromosomeSplitThreads/100000/4        4137 ms          616 ms            1 reads/s=162.44k/s size_MB=4.87285 threads=4
BM_ChromosomeSplitThreads/500000/2       11823 ms         3598 ms            1 reads/s=138.947k/s size_MB=22.9749 threads=2
BM_ChromosomeSplitThreads/500000/4        7491 ms         3234 ms            1 reads/s=154.624k/s size_MB=23.1967 threads=4
BM_ChromosomeSplitThreads/1000000/2      19540 ms         6507 ms            1 reads/s=153.669k/s size_MB=45.2595 threads=2
BM_ChromosomeSplitThreads/1000000/4      22434 ms         5792 ms            1 reads/s=172.641k/s size_MB=45.6304 threads=4

@vgvassilev vgvassilev left a comment

Lgtm!

@vgvassilev vgvassilev merged commit 80d10b0 into compiler-research:develop Oct 16, 2025
3 checks passed