CODON-TOPO

Codon Geometry Validation & Prediction Engine

What is CODON-TOPO?

CODON-TOPO validates the algebraic structure of genetic codes when encoded as 6-bit binary vectors in GF(2)^6. It provides a complete, reproducible pipeline for the analyses described in:

Robust error-minimization in the genetic code across physicochemical metrics and variant codes: a graph-theoretic analysis in GF(2)^6 Paul Clayworth & Sergey Kornilov (2026). Manuscript prepared for submission to the Journal of Theoretical Biology; PDF compiles from output/manuscript.typ (Elsevier-Harvard reference style). Highlights, CRediT statement, generative-AI-use declaration, and ethical statement are included in the manuscript end-matter.

Key Findings

Status	Count	Highlights
Supported	4	Cross-metric coloring optimality (4 metrics, p ≤ 0.006); per-table preservation (26 of 27 NCBI tables, mean quantile 1.4%; standard-code-proximity audit confirms variant tables are independently optimal); ρ-robustness across the full Hamming graph H(3,4) = K₄ □ K₄ □ K₄; topology-avoidance depletion under both Q₆ (encoding-dependent) and encoding-independent H(3,4) adjacency (RR 0.28–0.33, permutation p ≤ 10⁻⁴, robust to clade exclusion and to both new-disconnection and Δβ₀>0 definitions)
Suggestive	1	tRNA enrichment for reassigned amino acid (worst-case MIS Stouffer p = 0.045 across 24 pairings; 18 tRNAscan-SE–verified genomes); the 4-pairing topology-breaking-restricted subset alone is underpowered (Stouffer p = 0.43)
Exploratory	4	Bit-position bias (deduplicated p = 0.075); mechanism boundary conditions (3-tier: gene duplication / stem shortening / anticodon modification); Atchley F3/Serine convergence; disconnection catalogue (Thr / Leu / Ala / Ser; Trp Table 32 = filtration-only exception)
Rejected	3	Serine min-distance-4 invariant (encoding-dependent); PSL(2,7); holomorphic embedding
Falsified	1	KRAS-Fano clinical prediction (p = 1.0 on n = 1,670 MSK-IMPACT mutations)
Tautological	2	Two-fold bit-5 filtration (encoding-dependent); four-fold prefix filtration

Notation: The full single-nucleotide mutation graph is consistently written as H(3,4) = K₄ □ K₄ □ K₄ (the Hamming graph; 64 vertices, regular degree 9, 288 undirected edges) rather than the ambiguous K₄³. Q₆ is a 192-edge subgraph of H(3,4); the remaining 96 within-nucleotide diagonal edges complete H(3,4). CLI flags retain the legacy k43 spelling (e.g. topology-avoidance-k43) for backward compatibility.

Encoding sensitivity (24 base-to-bit bijection sweep): The Q₆ topology-avoidance result is encoding-dependent — 8 of 24 bijections give a Q₆ candidate-landscape rate near 36% (rather than 73% under the default encoding) and no statistically significant depletion. The H(3,4) result is encoding-independent and is reported as the primary topology-avoidance test; Q₆ is now framed as a coordinate-dependent decomposition. Q₆ remains useful for the ρ-sweep (continuous interpolation between Q₆ and H(3,4)).

Conditional logit (M3 phys+topo) under both topology encodings: Decisively favored over single-feature models. Under encoding-dependent Q₆ topology: ΔAICc(M1→M3) = 108.2, ΔAICc(M2→M3) = 89.1. Under encoding-independent H(3,4) topology (verifying that the result is not an artifact of the Q₆ encoding): ΔAICc(M1→M3_H(3,4)) = 91.3, ΔAICc(M2_H(3,4)→M3_H(3,4)) = 95.1 — both decisive (>10) and similar in magnitude to the Q₆ counterparts. Adding the tRNA-distance proxy (M4) does not improve fit (LR = 0.12, p = 0.73). Spearman ρ between Δ_phys and Δ_topo across the 1,280-move candidate landscape = 0.15 (largely independent predictors). Conditional-logit clade-exclusion sensitivity (per Sengupta et al. 2007, refitting M1-M4 with each major clade dropped) and posterior-predictive validation (observed 0.076 vs simulated 0.077; pp p = 0.60) confirm robustness.

Restricted-candidate sensitivity: Refitting M1-M4 on candidate sets restricted to biologically plausible moves (target AA already accessible at Hamming distance ≤ d) shows the qualitative claim "topology adds value beyond physicochemistry" survives at every threshold tested. Under the primary d=2 filter (≈727 candidates per choice set), ΔAICc(M1→M3) = 60 and ΔAICc(M2→M3) = 77, both well above the conventional ΔAICc>10 reference. Under the most stringent d=1 filter (≈275 candidates), ΔAICc(M1→M3) shrinks to 14 but stays above 10; ΔAICc(M2→M3) stays at 73. The unrestricted ΔAICc magnitudes are upper bounds; the d=2 filter gives a more biologically-calibrated effect size.

Methodological caveats explicitly disclosed in Limitations:

Survivorship bias: cross-sectional NCBI data cannot distinguish "selection against attempting topology-breaking moves" from "selection against the lineages that attempted them"
Independence-of-irrelevant-alternatives (IIA) assumption in conditional logit (used as explanatory rather than predictive tool)
Family-wise multiple-comparison correction within prespecified analysis families (no spurious global-Bonferroni claim)
Tables 1/11 and 27/28 share identical sense-codon mappings (27 NCBI tables = 25 distinct sense-codon colorings)
Per-table block-preserving null is partly dominated by near-standard permutations for variants with few reassignments — addressed by standard-code-proximity audit (Supplement)

Run codon-topo claims for the full hierarchy with p-values and justifications.

Quick Start

Prerequisites

Python: 3.11+
Package manager: uv (recommended) or pip
Optional: R 4.5+ with ggplot2, ggpubr, viridis, patchwork for publication figures
Optional: tRNAscan-SE 2.0.12 for tRNA gene verification

Installation

git clone https://github.com/biostochastics/codontopo.git
cd codontopo

# With uv (recommended)
uv sync --all-extras
uv run codon-topo --help

# With pip
pip install -e ".[dev]"
codon-topo --help

Run the Full Pipeline

# Run everything and generate manuscript_stats.json
codon-topo all --output-dir=./output --seed=135325

# Individual analyses
codon-topo coloring --n=10000          # Coloring optimality Monte Carlo
codon-topo metric-sensitivity          # Cross-metric (Grantham, Miyata, PR, KD)
codon-topo rho-sweep                   # Rho robustness (Q6 -> K4^3)
codon-topo per-table                   # All 27 NCBI translation tables
codon-topo topology-avoidance          # Topology avoidance (Q6)
codon-topo topology-avoidance-k43      # Topology avoidance (K4^3, encoding-independent)
codon-topo condlogit                   # Conditional logit models (M1-M4)
codon-topo condlogit-restricted        # Restricted-candidate sensitivity (delta_trna<=1,2,3)
codon-topo trna                        # tRNA enrichment test
codon-topo mis-analysis                # Maximal independent set analysis
codon-topo phylo-sensitivity           # Clade-exclusion robustness
codon-topo claims                      # View claim hierarchy

Run the CodonSafe Cross-Study Reanalysis

# Requires raw data in data/codonsafe/ (see DATA_MANIFEST.md)
pip install -e ".[codonsafe]"
codon-topo codonsafe

Run the Test Suite

python3.11 -m pytest tests/ -q                    # all tests
python3.11 -m pytest tests/ --cov=codon_topo      # with coverage
python3.11 -m pytest tests/test_regression.py -v   # regression suite (105 tests)

Note: Use python3.11 -m pytest if your system default Python differs from where dev dependencies are installed.

Generate Publication Figures

Rscript src/codon_topo/visualization/R/strengthened_figures.R

Reproducibility

The core design principle: a user who clones this repo should be able to regenerate every number in the manuscript.

# Full reproducibility from scratch
git clone https://github.com/biostochastics/codontopo.git
cd codontopo
uv sync --all-extras
uv run codon-topo all --output-dir=./output --seed=135325
# -> generates output/manuscript_stats.json
# -> manuscript.typ reads all inline statistics from this JSON

The manuscript_stats.json file contains every statistic cited in the paper. The Typst manuscript (output/manuscript.typ) reads this file and renders all inline numbers dynamically:

#let stats = json("manuscript_stats.json")
// All tables and inline stats reference stats.* fields

Random seed: 135325 (all Monte Carlo analyses).

Architecture

codon-topo all
    |
    +-- Filtration (WS1) .................. Two-fold/four-fold degeneracy checks
    +-- Disconnections (WS1) .............. Persistent homology catalogue
    +-- Coloring Optimality (WS1) ......... Block-preserving Monte Carlo
    |     +-- Multi-metric sensitivity .... Grantham, Miyata, PR, KD
    |     +-- Rho robustness sweep ........ Q6 -> H(3,4) interpolation
    |     +-- Per-table optimality ........ 27 NCBI tables + BH-FDR
    |     +-- Per-table proximity audit ... dH-conditional vs unconditional quantile
    |     +-- Score decomposition ......... By nucleotide position
    +-- Reassignment Analysis (WS2) ....... Database, Hamming paths, bit bias
    +-- Topology Avoidance (WS6) .......... Q6 + H(3,4), 2x2 definitions audit,
    |                                       24-encoding sweep, denominator sensitivity
    +-- tRNA Evidence (WS1) ............... Fisher-Stouffer + MIS enumeration
    |                                       + topology-breaking subset (n=4)
    +-- Phylogenetic Sensitivity (WS6) .... Clade-exclusion robustness
    +-- Conditional Logit (WS6) ........... M1-M4 (Q6) + M2k43, M3k43 (H(3,4))
    |     +-- Encoding robustness ......... Q6 vs H(3,4) ΔAICc comparison
    |     +-- Clade-exclusion sensitivity . 7 clade regimes per Sengupta et al. 2007
    |     +-- Restricted-candidate sens. .. delta_trna<=1,2,3 biological-plausibility filter
    |     +-- Posterior predictive ........ Observed vs simulated topology rate
    +-- Depth Calibration (WS3) ........... Epsilon-age correlation
    +-- KRAS-Fano (WS4) .................. cBioPortal enrichment (negative)
    +-- Claims + Catalogue (WS5) .......... 15 claims, evidence grading
    |
    +-> output/manuscript_stats.json ...... Consolidated stats for Typst
    +-> output/*.json ..................... Per-analysis detailed results

Package Structure

Component	Path	Role
CLI	`src/codon_topo/cli.py`	Click-based CLI with 18 subcommands
Encoding	`src/codon_topo/core/encoding.py`	GF(2)^6, Hamming distance, all 24 encodings
Genetic codes	`src/codon_topo/core/genetic_codes.py`	All 27 NCBI translation tables (codes 1-6, 9-16, 21-33)
Filtration	`src/codon_topo/core/filtration.py`	Two-fold (bit-5) and four-fold (prefix) checks
Homology	`src/codon_topo/core/homology.py`	Connected components, disconnection catalogue
Embedding	`src/codon_topo/core/embedding.py`	Root-of-unity map GF(2)^6 -> C^3
Fano	`src/codon_topo/core/fano.py`	XOR triple computation
Coloring optimality	`src/codon_topo/analysis/coloring_optimality.py`	Monte Carlo, rho sweep, per-table, multi-metric
Null models	`src/codon_topo/analysis/null_models.py`	Models A/B/C/C_extended
Reassignment DB	`src/codon_topo/analysis/reassignment_db.py`	Database, Hamming paths, bit-position bias
Topology avoidance	`src/codon_topo/analysis/synbio_feasibility.py`	Q6 + K4^3 tests, phylogenetic sensitivity
Evolutionary simulation	`src/codon_topo/analysis/evolutionary_simulation.py`	Conditional logit M1-M4, order-averaging
tRNA evidence	`src/codon_topo/analysis/trna_evidence.py`	Fisher-Stouffer, MIS via Bron-Kerbosch
CodonSafe	`src/codon_topo/analysis/codonsafe/`	Cross-study reanalysis of 8 recoding datasets
Statistical utils	`src/codon_topo/analysis/statistical_utils.py`	Beta CIs, risk ratios, quantile CIs
Visualization	`src/codon_topo/visualization/`	CSV export + R ggplot2 scripts
Claims	`src/codon_topo/reports/claim_hierarchy.py`	Single source of truth for 15 claims
Catalogue	`src/codon_topo/reports/catalogue.py`	Evidence grading across workstreams

CLI Reference

Command	Description
`codon-topo all`	Run everything, generate `manuscript_stats.json`
`codon-topo filtration`	Two-fold/four-fold filtration checks
`codon-topo disconnections`	Disconnection catalogue (persistent homology)
`codon-topo coloring`	Hypercube coloring Monte Carlo
`codon-topo metric-sensitivity`	Cross-metric sensitivity (4 metrics)
`codon-topo rho-sweep`	Rho robustness (Q6 -> K4^3)
`codon-topo per-table`	Per-table optimality (27 NCBI tables)
`codon-topo decompose`	Score decomposition by nucleotide position
`codon-topo topology-avoidance`	Topology avoidance test (Q6)
`codon-topo topology-avoidance-k43`	Topology avoidance test (K4^3)
`codon-topo condlogit`	Conditional logit model comparison (M1-M4)
`codon-topo condlogit-restricted`	Restricted-candidate-set sensitivity (delta_trna ≤ d)
`codon-topo phylo-sensitivity`	Clade-exclusion sensitivity analysis
`codon-topo trna`	tRNA enrichment test
`codon-topo mis-analysis`	Maximal independent set enumeration
`codon-topo bit-bias`	Bit-position bias test
`codon-topo kras`	KRAS-Fano enrichment test (negative)
`codon-topo codonsafe`	CodonSafe cross-study reanalysis
`codon-topo claims`	View claim hierarchy

All subcommands support --json for machine-readable output. Interactive mode uses rich tables.

Workstreams

WS	Name	CLI commands	Status
WS1	Core Replication	`filtration`, `disconnections`, `coloring`, `metric-sensitivity`, `rho-sweep`, `per-table`, `decompose`	Complete
WS2	Reassignment Directionality	`bit-bias`	Complete
WS3	Evolutionary Depth	(in `all`)	Complete
WS4	KRAS/COSMIC	`kras`	Complete (negative)
WS5	Prediction Catalogue	`claims`	Complete
WS6	Topology & Synbio	`topology-avoidance`, `topology-avoidance-k43`, `condlogit`, `phylo-sensitivity`, `codonsafe`	Complete

Null Models

Model	What it tests	CLI
Freeland-Hurst	Is the coloring optimal? Block-preserving shuffle	`coloring`
Class-size	Weaker null (degeneracy-only, no block contiguity)	`coloring --null=class_size`
Model C	Is the encoding special? All 24 base-to-bit mappings	`disconnections --extended`
Table-preserving permutation	Does evolution avoid topology disruption?	`topology-avoidance`
Conditional logit	Is topology an independent predictor?	`condlogit`

Technology Stack

Python 3.11+, NumPy, SciPy for core computation
click + rich for CLI
pytest + hypothesis for property-based testing (432 tests, >=96% coverage)
ggplot2 + ggpubr (R) for publication figures (300 DPI, colorblind-friendly viridis)
Typst for manuscript typesetting (reads manuscript_stats.json for dynamic stats)
tRNAscan-SE 2.0.12 + Infernal 1.1.4 for tRNA verification (18 genomes across 5 variant codes + 3 standard-code controls)
Biopython for GenBank parsing (CodonSafe reanalysis)

Usage Examples

from codon_topo import (
    codon_to_vector, hamming_distance, STANDARD, get_code,
    analyze_filtration, disconnection_catalogue, embed_codon,
    is_fano_line, fano_partner, monte_carlo_null,
    CLAIM_HIERARCHY, supported_claims,
)

# Encode a codon as a 6-bit vector
codon_to_vector('GGU')  # (1, 1, 1, 1, 0, 1)

# Check the KRAS Fano line (XOR = 0)
is_fano_line('GGU', 'GUU', 'CAC')  # True

# Run the coloring optimality Monte Carlo
result = monte_carlo_null(n_samples=10000, seed=135325)
# {'quantile_of_observed': 0.6, 'p_value_conservative': 0.006, ...}

# Query the claim hierarchy
for claim in supported_claims():
    print(claim.id, claim.evidence_p_value)

Documentation

Document	Purpose
`CLAUDE.md`	AI/contributor guidance
`ARCHITECTURE.md`	Module dependency graph
`data/codonsafe/DATA_MANIFEST.md`	Raw data provenance for cross-study reanalysis

License

Released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You may share and adapt this work with attribution, but commercial use requires a separate license — contact the authors.

To cite, see CITATION.cff or the bibliography entry generated by GitHub's "Cite this repository" button.

Quick Start • CLI Reference • Reproducibility • Architecture

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CODON-TOPO

What is CODON-TOPO?

Key Findings

Quick Start

Prerequisites

Installation

Run the Full Pipeline

Run the CodonSafe Cross-Study Reanalysis

Run the Test Suite

Generate Publication Figures

Reproducibility

Architecture

Package Structure

CLI Reference

Workstreams

Null Models

Technology Stack

Usage Examples

Documentation

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
data		data
output		output
scripts		scripts
src/codon_topo		src/codon_topo
tests		tests
toposafe		toposafe
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
CITATION.cff		CITATION.cff
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

CODON-TOPO

What is CODON-TOPO?

Key Findings

Quick Start

Prerequisites

Installation

Run the Full Pipeline

Run the CodonSafe Cross-Study Reanalysis

Run the Test Suite

Generate Publication Figures

Reproducibility

Architecture

Package Structure

CLI Reference

Workstreams

Null Models

Technology Stack

Usage Examples

Documentation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages