diff --git a/.gitignore b/.gitignore index b12a606..4502af8 100644 --- a/.gitignore +++ b/.gitignore @@ -4,7 +4,6 @@ ramexample.root *.perf *.bam *.sam -*.png *.bam.bai *.root *.root.idx diff --git a/README.md b/README.md index 0540ea5..fee725f 100644 --- a/README.md +++ b/README.md @@ -1,26 +1,131 @@ -ROOT scripts to convert a SAM file to a RAM (ROOT Alignment/Map) file and to work on RAM files. +# RAMTools - ROOT Alignment/Map Format Tools - - To convert a SAM file to a RAM file do: +RAMTools provides efficient tools for converting SAM files to ROOT's modern, columnar RNTuple format (RAM - ROOT Alignment/Map) and working with genomic alignment data. +## Features + +- High-performance SAM to RAM conversion with an RNTuple backend +- Chromosome-based splitting for parallel processing +- Region-based querying capabilities + +## Requirements + +- ROOT 6.26+ +- C++17 compatible compiler +- CMake 3.16+ + +## Quick Start +```bash +# 1. Build the tools +mkdir build && cd build +cmake .. +make -j$(nproc) + +# 2. Convert a SAM file to the RAM format +./tools/samtoramntuple ../test/samexample.sam output.root + +# 3. Query a specific region from the command line +./tools/ramntupleview output.root "chr1:15700-15800" +``` + +## Command-Line Tools + +The primary way to interact with RAMTools is through these command-line executables. + +### SAM to RAM Conversion + +Convert a standard SAM file into the optimized RNTuple-based RAM format. ```bash - $ root - root [0] .x samtoram.C - root [1] .q +# Basic conversion +./tools/samtoramntuple input.sam output.root + +# Split by chromosome for parallel processing +# (Creates output-chr1.root, output-chr2.root, etc.) +./tools/samtoramntuple input.sam output -split ``` - - To test read a RAM file do: +### Region Querying (RNTuple) + +Query a specific genomic region from a RAM file, similar to samtools view. +```bash +# Usage: ./tools/ramntupleview [input.root] "[chromosome]:[start]-[end]" +./tools/ramntupleview output.root "chr1:10150-10300" +``` + +## Benchmark Results + +Tested with HG00154 sample from the 1000 Genomes Project (196M reads, 72GB SAM file): + +![File Size Comparison](./img/file-size-comparison.jpg) + +### Region Query Performance(LZMA Compression) + +| Format | Region | Time (s) | CPU (s) | Reads/sec | Total Reads | +|--------|--------|----------|---------|-----------|-------------| +| **TTree** | Small (100bp) | 4.18 | 1.50 | 4.7 | 7 | +| | Gene (BRCA2) | 1.87 | 1.79 | 24,010 | 42,961 | +| | 10Mb | 41.2 | 39.7 | 75,105 | 2,977,922 | +| | 100Mb | 36.3 | 35.4 | 80,492 | 2,852,438 | +| **RNTuple** | Small (100bp) | 4.10 | 2.55 | 2.7 | 7 | +| | Gene (BRCA2) | 1.20 | 1.15 | 37,308 | 42,961 | +| | 10Mb | 7.40 | 6.67 | 446,331 | 2,977,922 | +| | 100Mb | 6.46 | 6.36 | 448,688 | 2,852,438 | + +### Region Query Performance (LZ4 Compression) + +| Format | Region | Time (s) | CPU (s) | Reads/sec | Total Reads | +|--------|--------|----------|---------|-----------|-------------| +| **TTree** | Small (100bp) | 3.23 | 1.86 | 3.8 | 7 | +| | Gene (BRCA2) | 0.941 | 0.842 | 51,030 | 42,961 | +| | 10Mb | 14.5 | 11.0 | 269,797 | 2,977,922 | +| | 100Mb | 9.05 | 9.05 | 315,088 | 2,852,438 | +| **RNTuple** | Small (100bp) | 2.98 | 1.67 | 4.2 | 7 | +| | Gene (BRCA2) | 1.01 | 0.948 | 45,321 | 42,961 | +| | 10Mb | 7.43 | 6.68 | 445,709 | 2,977,922 | +| | 100Mb | 6.47 | 6.29 | 453,736 | 2,852,438 | + +### Region Query Performance (ZLIB Compression) + +| Format | Region | Time (s) | CPU (s) | Reads/sec | Total Reads | +|--------|--------|----------|---------|-----------|-------------| +| **TTree** | Small (100bp) | 4.19 | 2.31 | 3.0 | 7 | +| | Gene (BRCA2) | 1.22 | 1.11 | 38,661 | 42,961 | +| | 10Mb | 18.2 | 16.3 | 183,021 | 2,977,922 | +| | 100Mb | 14.4 | 14.4 | 197,815 | 2,852,438 | +| **RNTuple** | Small (100bp) | 2.85 | 1.73 | 4.0 | 7 | +| | Gene (BRCA2) | 1.19 | 1.14 | 37,529 | 42,961 | +| | 10Mb | 7.40 | 6.62 | 449,599 | 2,977,922 | +| | 100Mb | 6.49 | 6.41 | 445,148 | 2,852,438 | + +**Key Findings**: +- RNTuple demonstrates **1.4-2.5x faster** query performance for large regions compared to TTree +- LZ4 compression provides the best query performance among all compression algorithms +- For a 100Mb region query: RNTuple processes **453,736 reads/sec** vs TTree+ZLIB's **197,815 reads/sec** + +## TTree Implementation (Legacy) + +ROOT scripts to convert a SAM file to a RAM (ROOT Alignment/Map) file using the older TTree format and to work with those files. + +### Convert SAM to RAM with TTree +```bash +$ root +root [0] .x samtoram.C +root [1] .q +``` +### Read a RAM file (TTree) ```bash - $ root - root [0] .x ramreader.C - root [1] .q +$ root +root [0] .x ramreader.C +root [1] .q ``` - - To view a region, the equivalent of `samtools view bamexample.bam chr1:10150-10300`, do: +### View a specific region (TTree) +To view a region, the equivalent of `samtools view bamexample.bam chr1:10150-10300`: ```bash - $ root - root [0] .x ramview.C("ramexample.root","chr1:10150-10300") - root [1] .q +$ root +root [0] .x ramview.C("ramexample.root","chr1:10150-10300") +root [1] .q ``` diff --git a/img/file-size-comparison.jpg b/img/file-size-comparison.jpg new file mode 100644 index 0000000..1bbbc0c Binary files /dev/null and b/img/file-size-comparison.jpg differ