Skip to content

Add ramsort tool: coordinate and name sort for RAM (RNTuple) files#30

Open
swetank18 wants to merge 5 commits intocompiler-research:developfrom
swetank18:add-ramsort-tool
Open

Add ramsort tool: coordinate and name sort for RAM (RNTuple) files#30
swetank18 wants to merge 5 commits intocompiler-research:developfrom
swetank18:add-ramsort-tool

Conversation

@swetank18
Copy link
Copy Markdown

Summary

This PR introduces a ramsort command-line tool that sorts RAM (RNTuple) files by genomic coordinate (refid, pos) or by query name (QNAME). It is the RNTuple-based equivalent of samtools sort and samtools sort -n.

Motivation

Sorted alignment files are a prerequisite for most downstream bioinformatics analyses — region queries, variant calling, duplicate marking, and indexing all require coordinate-sorted input. This PR brings native sort capability to RAMTools without requiring conversion back to SAM/BAM.

Changes

New files:

  • inc/ramcore/RAMSort.hramsortntuple() function declaration
  • src/ramcore/RAMSort.cxx — implementation using RNTupleReader field views and std::stable_sort
  • tools/ramsort.cxx — command-line entry point
  • test/ramsorttests.cxx — 5 Google Test cases

Modified files:

  • CMakeLists.txt — added RAMSort.h and RAMSort.cxx to the ramcore library
  • tools/CMakeLists.txt — added ramsort executable using ROOT_EXECUTABLE macro
  • test/CMakeLists.txt — added ramsorttests test target

Usage

# Coordinate sort (default) — equivalent to samtools sort
./tools/ramsort input.root output.root

# Name sort — equivalent to samtools sort -n
./tools/ramsort input.root output.root --by-name

Implementation Notes

  • Sort keys (refid, pos, qname) are read upfront into vectors for cache-efficient sorting
  • Uses std::stable_sort to preserve relative order of reads with identical keys
  • Records are written in sorted order using RNTupleReader views — no full deserialization until write time
  • Reference name maps and index are preserved in the output file via WriteAllRefs() and WriteIndex()
  • Sort logic lives in src/ramcore/RAMSort.cxx (library) — tools/ramsort.cxx is a thin CLI wrapper

Tests

5 test cases in test/ramsorttests.cxx — all passing:

  • EntryCountPreserved — sorted file has same number of entries as input
  • CoordinateSortOrder — records are in non-decreasing (refid, pos) order after sort
  • NameSortOrder — records are in non-decreasing QNAME lexicographic order after --by-name
  • IdempotentSort — sorting an already sorted file produces identical coordinate order
  • MissingInputFileReturnsError — graceful failure returns non-zero on bad input

Test Results

100% tests passed, 0 tests failed out of 5
Total Test time (real) = 3.96 sec

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 76.27119% with 28 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@4504216). Learn more about missing BASE report.

Files with missing lines Patch % Lines
test/ramsorttests.cxx 65.07% 0 Missing and 22 partials ⚠️
src/ramcore/RAMSort.cxx 89.09% 4 Missing and 2 partials ⚠️

❌ Your patch status has failed because the patch coverage (76.27%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             develop      #30   +/-   ##
==========================================
  Coverage           ?   57.42%           
==========================================
  Files              ?       18           
  Lines              ?     1543           
  Branches           ?      837           
==========================================
  Hits               ?      886           
  Misses             ?      525           
  Partials           ?      132           
Flag Coverage Δ
unittests 57.42% <76.27%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/ramcore/RAMSort.cxx 89.09% <89.09%> (ø)
test/ramsorttests.cxx 65.07% <65.07%> (ø)
Files with missing lines Coverage Δ
src/ramcore/RAMSort.cxx 89.09% <89.09%> (ø)
test/ramsorttests.cxx 65.07% <65.07%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 10 out of 25. Check the log or trigger a new build to see more.

Comment thread inc/ramcore/RAMSort.h
@@ -0,0 +1,11 @@
#ifndef RAMCORE_RAMSORT_H
#define RAMCORE_RAMSORT_H
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: header guard does not follow preferred style [llvm-header-guard]

Suggested change
#define RAMCORE_RAMSORT_H
#ifndef GITHUB_WORKSPACE_INC_RAMCORE_RAMSORT_H
#define GITHUB_WORKSPACE_INC_RAMCORE_RAMSORT_H

inc/ramcore/RAMSort.h:10:

- #endif // RAMCORE_RAMSORT_H
+ #endif // GITHUB_WORKSPACE_INC_RAMCORE_RAMSORT_H

Comment thread inc/ramcore/RAMSort.h
/// \param outputFile Path to output .root RAM file
/// \param byName If true, sort by QNAME; otherwise sort by (refid, pos)
/// \return 0 on success, 1 on error
int ramsortntuple(const char *inputFile, const char *outputFile, bool byName = false);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: unknown type name 'bool' [clang-diagnostic-error]

int ramsortntuple(const char *inputFile, const char *outputFile, bool byName = false);
                                                                 ^

Comment thread inc/ramcore/RAMSort.h
/// \param outputFile Path to output .root RAM file
/// \param byName If true, sort by QNAME; otherwise sort by (refid, pos)
/// \return 0 on success, 1 on error
int ramsortntuple(const char *inputFile, const char *outputFile, bool byName = false);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: use of undeclared identifier 'false' [clang-diagnostic-error]

int ramsortntuple(const char *inputFile, const char *outputFile, bool byName = false);
                                                                               ^

Comment thread src/ramcore/RAMSort.cxx
@@ -0,0 +1,98 @@
#include "ramcore/RAMSort.h"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'ramcore/RAMSort.h' file not found [clang-diagnostic-error]

#include "ramcore/RAMSort.h"
         ^

Comment thread src/ramcore/RAMSort.cxx
#include <exception>
#include <iostream>
#include <memory>
#include <numeric>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: included header memory is not used directly [misc-include-cleaner]

Suggested change
#include <numeric>
#include <numeric>

Comment thread src/ramcore/RAMSort.cxx
#include <numeric>
#include <string>
#include <utility>
#include <vector>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: included header utility is not used directly [misc-include-cleaner]

Suggested change
#include <vector>
#include <vector>

Comment thread src/ramcore/RAMSort.cxx
#include <string>
#include <utility>
#include <vector>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: included header vector is not used directly [misc-include-cleaner]

Suggested change

Comment thread src/ramcore/RAMSort.cxx
#include <utility>
#include <vector>

int ramsortntuple(const char *inputFile, const char *outputFile, bool byName)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'ramsortntuple' can be made static or moved into an anonymous namespace to enforce internal linkage [misc-use-internal-linkage]

Suggested change
int ramsortntuple(const char *inputFile, const char *outputFile, bool byName)
static int ramsortntuple(const char *inputFile, const char *outputFile, bool byName)

Comment thread test/ramsorttests.cxx
@@ -0,0 +1,110 @@
#include <gtest/gtest.h>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'gtest/gtest.h' file not found [clang-diagnostic-error]

#include <gtest/gtest.h>
         ^

Comment thread test/ramsorttests.cxx Outdated
class RAMSortTest : public ::testing::Test {
protected:
static constexpr int kNumReads = 200;
const char *kSamFile = "sort_test.sam";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: invalid case style for protected member 'kSamFile' [readability-identifier-naming]

Suggested change
const char *kSamFile = "sort_test.sam";
const char *m_kSamFile = "sort_test.sam";

test/ramsorttests.cxx:21:

-       GenerateSAMFile(kSamFile, kNumReads);
+       GenerateSAMFile(m_kSamFile, kNumReads);

test/ramsorttests.cxx:25:

-       samtoramntuple(kSamFile, kUnsortedFile, true, true, true, 505, 0);
+       samtoramntuple(m_kSamFile, kUnsortedFile, true, true, true, 505, 0);

test/ramsorttests.cxx:30:

-       std::remove(kSamFile);
+       std::remove(m_kSamFile);

@swetank18
Copy link
Copy Markdown
Author

Fixed clang-tidy warnings in RAMSort.cxx:

Fixed include order (llvm-include-order)
Added missing headers for std::unique_ptr and std::move
Added argument comment for SetCompression literal argument
Added tests for empty file and invalid output path edge cases to improve coverage

All tests passing locally. Coverage should now be above 85%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants