Ultra-fast nucleotide-level Aho numbering for antibody variable domains.
IggNition replaces ANARCI/HMMER with a purpose-built Rust aligner against pre-numbered human germline V/D/J genes. It assigns Aho scheme positions to every nucleotide in an antibody variable domain sequence, producing a fixed-length coordinate frame suitable for repertoire analysis and AbLLM training.
| | ANARCI (Python/HMMER) | IggNition |
|---|---|---|
| Throughput (single core) | ~4 seq/s | >50,000 seq/s |
| Language | Python + HMMER | Rust |
| Output | amino-acid level | nucleotide level |
| Parallelism | external (shell) | built-in (Rayon) |
pip install iggnition

Pre-compiled wheels are available for Linux (x86_64, aarch64), macOS (Apple Silicon and Intel), and Windows.

import iggnition
df = iggnition.run(
    nt_seq="CAGGTGCAGCTGGTGCAGTCTGGAGCT...",
    aa_seq="QVQLVQSGAE...",
)
print(df)
# shape: (447, 7) ← 149 Aho positions × 3 nucleotides for a heavy chain
# ┌─────────────┬───────┬─────────────┬──────────────┬────────────────┬────────────┬───────────┐
# │ sequence_id ┆ chain ┆ nt_position ┆ aho_position ┆ codon_position ┆ nucleotide ┆ amino_acid│
# │ u32 ┆ str ┆ u32 ┆ u32 ┆ u32 ┆ str ┆ str │

import polars as pl
import iggnition
df = pl.read_csv("airr_table.tsv", separator="\t")
results, errors = iggnition.run(
    df,
    nt_col="sequence",
    aa_col="sequence_aa",
    locus_col="locus",  # e.g. "IGH", "IGK", "IGL"
)

# Paired mode: separate heavy and light chain columns
results, errors = iggnition.run(
    df,
    paired=True,
    nt_col_heavy="sequence:0",
    aa_col_heavy="sequence_aa:0",
    nt_col_light="sequence:1",
    aa_col_light="sequence_aa:1",
)

# FASTA → DataFrame
results, errors = iggnition.run("input.fasta")
# Parquet → Parquet
iggnition.run("input.parquet", output="numbered.parquet")
# TSV → TSV
iggnition.run("input.tsv", output="numbered.tsv")

iggnition.run(df) returns a (results, errors) tuple for batch input.

One row per nucleotide position — the most detailed output:
results, errors = iggnition.run(df)
# shape: (n_sequences × 447, 7) for heavy chains
# columns: sequence_id, chain, nt_position, aho_position, codon_position, nucleotide, amino_acid

| Column | Type | Description |
|---|---|---|
| sequence_id | u32 | Row index from input |
| chain | str | H, K, or L |
| nt_position | u32 | Absolute nucleotide position (1-based) |
| aho_position | u32 | Aho amino acid position (1-based) |
| codon_position | u32 | Position within codon (1, 2, or 3) |
| nucleotide | str | A/T/G/C or - for gaps |
| amino_acid | str | Single-letter AA or - for gaps |

One row per Aho position:
results, errors = iggnition.run(df, per_codon=True)
# columns: sequence_id, chain, aho_position, codon, amino_acid

Wide format is the natural input shape for machine learning models: each sequence becomes a single fixed-length row with one column per nucleotide position (H1…H447 for heavy chains, L1…L444 for light chains). Both Kappa and Lambda light chains use the L prefix.

results, errors = iggnition.run(df, wide=True)

Output shape: (n_sequences, 1 + 447 + 444) for paired heavy + light, with columns:

seq_name | H1 | H2 | … | H447 | L1 | L2 | … | L444
The first column is always seq_name. If you supply name_col, it takes the values from that column; otherwise it contains the integer row index.
By default (human_readable=True) nucleotide positions are stored as single-character strings — what you would expect:
results, errors = iggnition.run(df, wide=True)
# H1 column: "C", "A", "G", "-", ... (pl.Utf8)

For large-scale work (millions of sequences), switch to the raw ASCII byte encoding with human_readable=False. Each position becomes a pl.UInt8 value (65=A, 84=T, 71=G, 67=C, 45=gap), which is the most memory-efficient representation and enables direct numerical operations without any decoding step:

results, errors = iggnition.run(df, wide=True, human_readable=False)
# H1 column: 67, 65, 71, 45, ... (pl.UInt8)
# Decode a column back to characters when needed:
results.with_columns(pl.col("H1").map_elements(chr, return_dtype=pl.Utf8))
# Or work directly with the numbers, e.g. one-hot encode, pass to PyTorch, etc.

Memory note: The UInt8 wide path uses a compact Rust byte-array representation (~2.5 GB for 2.4 M paired antibodies) vs. ~400+ GB peak via a naive pivot on per-nucleotide rows. Always prefer wide=True over post-hoc pivoting.

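
As a rough illustration of working with the byte encoding directly, the sketch below decodes a few ASCII codes back to characters and one-hot encodes them in plain Python. The channel order (A, C, G, T, gap) and the helper name `one_hot` are assumptions made for this example; they are not part of the IggNition API.

```python
# ASCII codes for the five symbols listed above: A=65, C=67, G=71, T=84, gap=45.
ALPHABET = "ACGT-"  # channel order is an arbitrary choice for this sketch
CODE_TO_INDEX = {ord(symbol): index for index, symbol in enumerate(ALPHABET)}


def one_hot(codes):
    """Turn a sequence of ASCII codes into one-hot rows (one row per position)."""
    rows = []
    for code in codes:
        row = [0] * len(ALPHABET)
        row[CODE_TO_INDEX[code]] = 1
        rows.append(row)
    return rows


codes = [67, 65, 71, 45]                 # the example values of the H1 column
decoded = bytes(codes).decode("ascii")   # back to characters
encoded = one_hot(codes)

print(decoded)     # CAG-
print(encoded[0])  # [0, 1, 0, 0, 0]  (C is channel 1)
```

With real data the same lookup would normally be vectorised over the whole UInt8 matrix rather than looped in Python.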
By default paired results are merged so each antibody occupies a single row (H columns + L columns). Set per_chain=True to keep one row per chain:
results, errors = iggnition.run(df, wide=True, per_chain=True)
# One row per chain; a "chain" column indicates H or L.
# H rows have L columns filled with null, and vice versa.

Use name_col to carry an identifier from the input DataFrame into the results as seq_name:

results, errors = iggnition.run(
    df,
    wide=True,
    name_col="clone_id",  # any column in df
)
# First column: seq_name (values from df["clone_id"])
# Then: H1…H447, L1…L444

IggNition ships a complete structural annotation layer accessible without running any alignment. All positions are in the Aho coordinate system.

import iggnition
# CDR / FR boundaries — Aho AA positions (1-based, inclusive)
iggnition.CDR_REGIONS["H"]["CDR3"] # → (109, 137)
iggnition.CDR_REGIONS["L"]["CDR1"] # → (26, 38)
# Same boundaries expressed as IggNition nt positions
iggnition.CDR_REGIONS_NT["H"]["CDR3"] # → (325, 411)
iggnition.CDR_REGIONS_NT["L"]["FR4"] # → (415, 444)

# Convert Aho AA position → 3 IggNition nt positions
iggnition.aho_to_nt(106) # → (316, 317, 318) — the conserved FR3 Cys
# Convert an Aho range → (first_nt, last_nt)
iggnition.aho_range_to_nt(26, 38) # → (76, 114) — VH CDR1
# Which region does an Aho position fall in?
iggnition.region_of(43, "H") # → "FR2"
iggnition.region_of(120, "H") # → "CDR3"
# Convert a wide-format column name → (aho_pos, codon_pos)
iggnition.nt_col_to_aho("H76") # → (26, 1) — first nt of Aho pos 26 (CDR1 start)
iggnition.nt_col_to_aho("L444") # → (148, 3) — last nt of VL
# Build a list of column names for a given CDR/FR (for masking a wide DataFrame)
iggnition.cdr_mask("H", "CDR3") # → ['H325', 'H326', ..., 'H411']
iggnition.cdr_mask("L", "CDR1") # → ['L76', 'L77', ..., 'L114']

Exact Aho positions for absolutely conserved residues:

iggnition.LANDMARKS["H"]["Cys_disulfide_N"]["aho"] # → 23 (IMGT 23 / Kabat ~22)
iggnition.LANDMARKS["H"]["Cys_disulfide_C"]["aho"] # → 106 (IMGT 104 / Kabat 92)
iggnition.LANDMARKS["H"]["Trp_FR2"]["aho"] # → 43 (IMGT 41 / Kabat 36)

These positions are derived directly from the germline database embedded in the binary, so they are guaranteed to be consistent with the numbering IggNition assigns.

# Vernier zone Aho positions (Foote & Winter 1992)
iggnition.VERNIER_ZONE_AHO["H"] # → [2, 27, 29, 30, 43, 54, 55, 56, 66, 68, 70, 77, 92, 93]
iggnition.VERNIER_ZONE_AHO["L"] # → [2, 25, 26, 27, 29, 33, 49, 51, 52, 73, 77, 78, 80]
# Full detail including Kabat cross-reference and notes
iggnition.VERNIER_ZONE["H"][71] # → {"aho": 70, "kabat": 71, "region": "FR3", "note": "..."}
# Chothia canonical structural positions (Al-Lazikani et al. 1997)
iggnition.CHOTHIA_CANONICAL["H"]["CDR1_H1"]
iggnition.CHOTHIA_CANONICAL["L"]["CDR2_L2"]

# Verified Aho ↔ IMGT ↔ Kabat correspondences at landmark positions
iggnition.NUMBERING_CROSSREF["H"]
# [{"aho": 23, "imgt": 23, "kabat_h": 22, "residue": "Cys (intrachain SS, N-terminal)"}, ...]

| Region | VH Aho | VH nt | VK/VL Aho | VK/VL nt |
|---|---|---|---|---|
| FR1 | 1–25 | 1–75 | 1–25 | 1–75 |
| CDR1 | 26–38 | 76–114 | 26–38 | 76–114 |
| FR2 | 39–49 | 115–147 | 39–49 | 115–147 |
| CDR2 | 50–64 | 148–192 | 50–66 | 148–198 |
| FR3 | 65–108 | 193–324 | 67–108 | 199–324 |
| CDR3 | 109–137 | 325–411 | 109–138 | 325–414 |
| FR4 | 138–149 | 412–447 | 139–148 | 415–444 |
VH CDR2 is shorter in the Aho frame (slots 62–64 for insertions) while VK/VL CDR2 has a larger insertion slot (59–66), reflecting the structural reality that H2 loops occupy different insertion positions than L2 loops.
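
To make the table concrete, here is a minimal standalone sketch that looks up the region of an Aho position and derives the nucleotide span of a region from the Aho boundaries. The boundaries are copied from the table above; the names `REGIONS`, `lookup_region`, and `region_nt_span` are hypothetical and only mirror what the library helpers report, they are not the library's own code.

```python
# Aho boundaries (1-based, inclusive) copied from the region table.
REGIONS = {
    "H": [("FR1", 1, 25), ("CDR1", 26, 38), ("FR2", 39, 49), ("CDR2", 50, 64),
          ("FR3", 65, 108), ("CDR3", 109, 137), ("FR4", 138, 149)],
    "L": [("FR1", 1, 25), ("CDR1", 26, 38), ("FR2", 39, 49), ("CDR2", 50, 66),
          ("FR3", 67, 108), ("CDR3", 109, 138), ("FR4", 139, 148)],
}


def lookup_region(aho_pos: int, chain: str) -> str:
    """Return the FR/CDR name that contains a given Aho position."""
    for name, start, end in REGIONS[chain]:
        if start <= aho_pos <= end:
            return name
    raise ValueError(f"Aho position {aho_pos} is outside chain {chain!r}")


def region_nt_span(chain: str, region: str) -> tuple[int, int]:
    """Return (first_nt, last_nt) of a region, three nucleotides per Aho slot."""
    for name, start, end in REGIONS[chain]:
        if name == region:
            return ((start - 1) * 3 + 1, end * 3)
    raise KeyError(region)


# Aho position 65 is already FR3 in VH but still CDR2 in VK/VL.
print(lookup_region(65, "H"), lookup_region(65, "L"))  # FR3 CDR2
print(region_nt_span("H", "CDR3"))                     # (325, 411)
```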
# FASTA → TSV (stdout)
iggnition run input.fasta
# FASTA → TSV (file)
iggnition run input.fasta output.tsv
# Parquet → Parquet
iggnition run input.parquet output.parquet
# TSV (AIRR) → TSV, per-codon
iggnition run input.tsv output.tsv --per-codon
# Wide format, 8 threads, with progress bar
iggnition run input.fasta output.tsv --wide --threads 8 --verbose

| Option | Default | Description |
|---|---|---|
| --per-codon | off | One row per codon instead of per nucleotide |
| --wide | off | Pivot to wide format |
| --nt-col | sequence | NT column name (TSV/Parquet) |
| --aa-col | sequence_aa | AA column name |
| --locus-col | locus | Chain/locus column name |
| --nt-col-heavy | sequence:0 | Heavy chain NT column (paired mode) |
| --aa-col-heavy | sequence_aa:0 | Heavy chain AA column (paired mode) |
| --nt-col-light | sequence:1 | Light chain NT column (paired mode) |
| --aa-col-light | sequence_aa:1 | Light chain AA column (paired mode) |
| --no-aa | off | Auto-translate NT (fallback, emits warning) |
| --threads | all cores | Rayon worker threads |
| --verbose / -v | off | Show progress bar and summary statistics |

| Chain | Max Aho position | Max NT columns |
|---|---|---|
| H (heavy) | 149 | 447 (H1…H447) |
| K (kappa) | 148 | 444 (L1…L444) |
| L (lambda) | 148 | 444 (L1…L444) |
Kappa and Lambda share the L prefix in wide format — they map to the same coordinate frame.
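
The sketch below shows how the wide-format header follows from these dimensions, and how a column name can be split back into an Aho position and a codon position. `MAX_AHO`, `wide_columns`, and `split_column` are illustrative names chosen for this example, not library functions.

```python
MAX_AHO = {"H": 149, "L": 148}  # from the table above; kappa and lambda both use "L"


def wide_columns(prefix: str) -> list[str]:
    """List the wide-format nucleotide column names for one chain prefix."""
    return [f"{prefix}{nt}" for nt in range(1, MAX_AHO[prefix] * 3 + 1)]


def split_column(name: str) -> tuple[str, int, int]:
    """Split e.g. 'H76' into (prefix, Aho position, position within codon)."""
    prefix, nt_pos = name[0], int(name[1:])
    aho_index, codon_index = divmod(nt_pos - 1, 3)  # both 0-based here
    return prefix, aho_index + 1, codon_index + 1


heavy_cols = wide_columns("H")
light_cols = wide_columns("L")
paired_header = ["seq_name"] + heavy_cols + light_cols

print(len(heavy_cols), len(light_cols), len(paired_header))  # 447 444 892
print(split_column("H76"))   # ('H', 26, 1)
print(split_column("L444"))  # ('L', 148, 3)
```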
- All human V/D/J germline genes are pre-numbered with Aho positions and embedded in the binary at compile time — no external database files.
- For each query: find the closest germline via Needleman-Wunsch alignment (amino acid level), then transfer Aho positions from the germline to the query.
- Map each occupied Aho position to its codon from the nucleotide sequence: position N → nucleotides (N-1)*3+1, (N-1)*3+2, (N-1)*3+3. Unoccupied positions become gaps (-).
- Rayon parallelises across sequences in the batch; the Python GIL is never held during alignment.
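
The codon-mapping step can be sketched in a few lines of Python. This is an illustration of the position arithmetic only, assuming the Aho position of each query residue has already been assigned by the alignment step; the function names and the `aho_of_residue` input are hypothetical, and the actual Rust implementation may differ.

```python
def aho_to_nt_positions(aho_pos: int) -> tuple[int, int, int]:
    """Return the three 1-based nucleotide positions of a 1-based Aho position."""
    first = (aho_pos - 1) * 3 + 1
    return (first, first + 1, first + 2)


def fill_nt_frame(nt_seq: str, aho_of_residue: list[int], max_aho: int) -> str:
    """Place each residue's codon in its Aho slot; unoccupied slots stay '-'.

    aho_of_residue[i] is the Aho position already assigned to the i-th
    amino acid of the query (hypothetical input for this sketch).
    """
    frame = ["-"] * (max_aho * 3)
    for i, aho_pos in enumerate(aho_of_residue):
        codon = nt_seq[i * 3 : i * 3 + 3]
        first, _, _ = aho_to_nt_positions(aho_pos)
        for offset, nt in enumerate(codon):
            frame[first - 1 + offset] = nt  # convert 1-based to 0-based index
    return "".join(frame)


# Toy example: three residues assigned to Aho positions 1, 2 and 4 in a 5-slot frame.
toy_frame = fill_nt_frame("CAGGTGCAG", [1, 2, 4], max_aho=5)
print(toy_frame)                 # CAGGTG---CAG---
print(aho_to_nt_positions(106))  # (316, 317, 318)
```

Because every sequence is written into the same fixed-length frame, the output of different sequences can be stacked column by column without further alignment.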
git clone https://github.com/bnemoz/iggnition
cd iggnition
pip install .

License: MIT. If you use IggNition in published research, please cite accordingly.

- Honegger, A. & Plückthun, A. (2001). Yet another numbering scheme for immunoglobulin variable domains. J Mol Biol, 309(3), 657–670.
- Dunbar, J. & Deane, C.M. (2016). ANARCI: antigen receptor numbering and receptor classification. Bioinformatics, 32(2), 298–300.