Assembly-exact RefSeq ortholog CDS FASTA, alignment, and variant workflow.
Version: 0.1.5
Author: Chul Lee (chul.bioinfo@gmail.com)
refseq2cds is an assembly-exact NCBI RefSeq ortholog CDS FASTA builder with
optional coordinate matrices, codon alignments, and target-specific variant
outputs.
This repository builds assembly-exact ortholog CDS FASTA files from NCBI RefSeq
assembly annotations and NCBI Gene FTP orthology edges. It supports two
orthology parsing modes: strict N-way singleton extraction across every locked
species, and a reference-gene present-species mode for queries such as human
FOXP2 where only species with an unambiguous 1:1 relationship to the query
gene are retained. It also creates human CDS-position to genomic-position
matrices so codon/site-level results can later be converted to UCSC Genome
Browser BED coordinates. It can also create lightweight tree-free codon
alignments with a MAFFT protein MSA followed by PAL2NAL back-translation.
After codon alignment, refseq2cds variants can detect target-group-specific
amino acid variants and write coordinateable variant events as BED.
refseq2cds is not a de novo orthology inference program and it is not just a
CDS downloader. It treats NCBI Gene orthology records as input evidence, applies
auditable graph filters, selects CDS records from exact RefSeq GCF_* assembly
annotation packages, records rejection reasons, and emits FASTA, metadata,
report, and coordinate-matrix outputs.
This workflow is intended for vertebrates only. Although NCBI also distributes some orthology data outside vertebrates, this package is built and tested for vertebrate RefSeq genome annotation packages, standard nuclear coding genes, and vertebrate-style comparative CDS analyses. Do not use it as-is for insects, plants, fungi, bacteria, or other non-vertebrate projects.
The repository ships with a default 14 primate/Dermoptera manifest, but you can
also provide your own list of GCF_* RefSeq assembly accessions. Human
taxid 9606 is required only if you want human CDS-to-genome coordinate
matrices.
A GCF_* accession is an NCBI RefSeq genome assembly accession. In this
pipeline it answers the question: "Which exact RefSeq assembly annotation should
the CDS sequences come from?"
Example:
GCF_009914755.1 # human T2T-CHM13v2.0 RefSeq assembly
GCF_028858775.2 # chimpanzee RefSeq assembly
This pipeline uses datasets download genome accession <GCF> so the CDS, RNA,
protein, and GFF3 annotation all come from that exact assembly package.
A taxid is the numeric NCBI Taxonomy identifier for a species or taxon. For example:
9606 Homo sapiens
9598 Pan troglodytes
NCBI Gene orthology is keyed by taxid and GeneID, so this pipeline uses
taxids internally to decide whether an ortholog component contains exactly one
gene from each species.
You can look up a taxid from a species name with NCBI Datasets:
refseq2cds download-tools
bin/datasets summary taxonomy taxon "Homo sapiens" \
--report ids_only \
--as-json-linesExpected output:
{"query":["Homo sapiens"],"taxonomy":{"tax_id":9606}}If you start from GCF_* assembly accessions, you usually do not need to look
up taxids manually. refseq2cds manifest-from-gcf queries NCBI and fills them
in for you.
The manifest is the fixed table of species and assemblies for one analysis. It is stored at:
config/species_manifest.tsv
It tells the pipeline:
- which species are included
- which NCBI taxid belongs to each species
- which exact RefSeq
GCF_*assembly to download - what short token should be used in FASTA headers
- optionally, which species is an outgroup for later tree/alignment work
Example:
token taxid scientific_name common_name gcf_accession outgroup
human 9606 Homo sapiens human GCF_009914755.1 false
chimpanzee 9598 Pan troglodytes chimpanzee GCF_028858775.2 false
You usually do not need to write this by hand. Put your GCF accessions in a file and run:
refseq2cds manifest-from-gcf --inputfile gcfs.txt --forceThat command queries NCBI and fills in taxids and species names automatically.
The reference species is the species whose gene symbols are used to name ortholog families and FASTA files. By default this is human:
refseq2cds run --steps all --reference-taxid 9606If you want chimpanzee symbols and filenames instead:
refseq2cds run --steps all --reference-taxid 9598This does not change the orthology graph itself. It changes which species must
be present as the naming anchor and which species provides the output symbols.
Human CDS genomic matrices are a separate optional output and require human
taxid 9606 to be present in the manifest.
In this tool, a strict singleton family means:
- the NCBI orthology graph contains exactly one gene from each manifest taxid
- there are no missing species
- there are no duplicated genes/paralogs within any species for that component
- all retained genes are protein-coding and non-mitochondrial
- every species has one valid CDS selected from its frozen RefSeq assembly
This conservative filter is why some genes are rejected.
Strict singleton mode is best when you want a genome-wide matrix where every
output FASTA has the same species set. Sometimes you instead care about one
reference gene, such as human FOXP2, and want every manifest species that has
a clean ortholog to that gene.
Use:
refseq2cds run \
--steps all \
--orthology-mode reference_gene_1to1_present_species \
--reference-taxid 9606 \
--reference-symbol FOXP2 \
--with-matricesIn this mode, each non-reference species is evaluated independently:
- if the species has no NCBI Gene orthology edge to the reference gene, it is excluded
- if the reference gene maps to multiple genes in that species, the species is excluded
- if the target gene maps back to multiple reference-species genes, the species is excluded
- if the target gene is not present in the locked
GCF_*annotation package, is non-protein-coding, mitochondrial, lacks a valid CDS, or fails CDS QC, it is excluded
The resulting FASTA therefore contains the reference sequence plus only the species that passed the unambiguous 1:1 graph and CDS checks. It does not force all manifest species to be present.
For large manifests, reference-gene mode first uses the NCBI Gene bulk files to
estimate the graph-level 1:1 target species and writes
reports/reference_gene_download_scope.json. Only the reference species and
graph-passing target species need assembly packages. This saves time for sparse
queries, but broadly conserved genes such as FOXP2 can still require hundreds
of large RefSeq annotation packages.
Main CDS outputs:
fastas/{REFERENCE_SYMBOL}.fasta
fastas/{REFERENCE_SYMBOL}.meta.tsv
fastas/manifest.tsv
Reference-gene present-species mode additionally writes a per-query directory:
results/reference_gene_1to1/{REFERENCE_SYMBOL}/
├── reference_gene.tsv
├── ortholog_candidates.tsv
├── ortholog_pass_graph_1to1.tsv
├── ortholog_rejected_graph.tsv
├── ortholog_coverage.json
├── selected_cds.tsv
├── cds_qc.tsv
├── cds_pass.tsv
├── cds_rejected.tsv
├── {REFERENCE_SYMBOL}.reference_1to1.cds.fasta
├── {REFERENCE_SYMBOL}.reference_1to1.meta.tsv
├── {REFERENCE_SYMBOL}.reference_1to1.rejected.tsv
├── summary.json
└── summary.html
Optional human coordinate matrix outputs, when --with-matrices is used and
human taxid 9606 is present in the manifest:
human_cds_matrices/{HUMAN_SYMBOL}.human_cds_genomic_matrix.tsv.gz
human_cds_matrices/manifest.tsv
human_cds_matrices/failed.tsv
Summary reports:
reports/summary.json
reports/summary.html
reports/human_cds_position_matrices.summary.json
Optional MAFFT+PAL2NAL codon alignment outputs:
alignments/mafft_pal2nal/codon/{REFERENCE_SYMBOL}.codon.fasta
alignments/mafft_pal2nal/aa/{REFERENCE_SYMBOL}.aa.fasta
alignments/mafft_pal2nal/aa_aligned/{REFERENCE_SYMBOL}.aa.aln.fasta
alignments/mafft_pal2nal/manifest.tsv
alignments/mafft_pal2nal/failed.tsv
alignments/mafft_pal2nal/summary.json
Optional target-specific variant outputs:
variants/events/{REFERENCE_SYMBOL}.aa_events.tsv.gz
variants/events/{REFERENCE_SYMBOL}.codon_events.tsv.gz
variants/events/{REFERENCE_SYMBOL}.nt_changes.tsv.gz
variants/matrices/{REFERENCE_SYMBOL}.variant_matrix.tsv.gz
variants/bed/{REFERENCE_SYMBOL}.{TARGET_SET}.variant.bed
variants/merged.bed
variants/manifest.tsv
variants/summary.json
In strict singleton mode, each FASTA contains exactly one CDS sequence per locked species. In reference-gene present-species mode, each FASTA contains the reference sequence plus the subset of manifest species that pass the unambiguous 1:1 query-gene checks. Headers are the locked species tokens, such as:
>human
>chimpanzee
>bonobo
...
>Philippine_flying_lemur
- Freeze NCBI Gene FTP files:
gene_orthologs.gzgene_info.gz
- Download exact annotation packages for the frozen
GCF_*assemblies:- CDS FASTA
- protein FASTA
- RNA FASTA
- GFF3
- sequence report
- Parse assembly-exact CDS/transcript/protein/GeneID indexes.
- Build the locked-species pairwise orthology graph from
gene_orthologs.gz. - Apply one orthology parsing mode:
- strict singleton: keep only connected components with exactly one GeneID per locked taxid
- reference-gene present-species: keep species with exactly one target GeneID linked to the query reference GeneID and no reverse M:1 ambiguity
- Reject MT and non-protein-coding genes.
- Select one assembly-derived representative CDS per gene/species.
- QC CDS sequences:
- length multiple of 3
- no internal stop codons
- valid DNA alphabet
- terminal stop removed as CDS normalization
- Write one
{REFERENCE_SYMBOL}.fastaper accepted singleton family. - Optionally write one human CDS genomic-position matrix per accepted family
when
--with-matricesis used and human is present. - Verify the run outputs with
refseq2cds verify. - Optionally align each CDS FASTA with
refseq2cds alignusing MAFFT for protein MSA and PAL2NAL for codon back-translation. - Optionally call target-specific variants with
refseq2cds variants, using the alignment codon maps plus coordinate matrices to write BED rows. - Verify alignment and variant output directories with
refseq2cds verify.
refseq2cds is intended to run on macOS and Linux. Windows is not currently
supported. The refseq2cds download-tools command detects the operating system
and downloads the matching NCBI Datasets CLI binaries into ./bin:
macOS -> https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/mac/
Linux -> https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux/
The automatic downloader currently supports macOS and Linux only. Even on macOS
or Linux, some environments may require extra setup, especially restricted HPC
systems, containers without curl/unzip, non-x86_64 CPU architectures, or
networks that block direct access to NCBI FTP/HTTPS endpoints. In those cases,
install NCBI Datasets CLI manually and place datasets and dataformat in
./bin or on your PATH.
Required command-line tools:
| Dependency | Purpose |
|---|---|
Python >=3.9 |
pipeline runtime |
curl |
download NCBI FTP files and NCBI Datasets binaries |
gzip / unzip |
decompress NCBI inputs |
NCBI Datasets CLI datasets |
download RefSeq assembly annotation packages |
NCBI Datasets CLI dataformat |
optional NCBI report conversion helper |
Required Python packages are declared in pyproject.toml and
requirements.txt:
biopython
dendropy
jinja2
networkx
pandas
pyarrow
Optional alignment tools for refseq2cds align --mode mafft-pal2nal:
| Dependency | Purpose |
|---|---|
| MAFFT | protein multiple sequence alignment |
PAL2NAL (pal2nal.pl) |
convert protein MSA + CDS FASTA into codon alignment |
| Perl | runtime for pal2nal.pl when installed as a Perl script |
Recommended install with conda/mamba:
conda install -c bioconda -c conda-forge mafft pal2nalOn macOS, MAFFT is also available through Homebrew:
brew install mafftHomebrew does not always provide PAL2NAL, so use conda/mamba, a system package
manager such as apt install pal2nal on Debian/Ubuntu, or pass an explicit
script path with --pal2nal /path/to/pal2nal.pl.
Disk requirements depend on the number of input GCF assemblies. The default 14-assembly run currently uses roughly tens of GB because raw NCBI packages and normalized indexes are kept for reproducibility.
git clone https://github.com/chulbioinfo/refseq2cds.git
cd refseq2cdspython3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .Alternatively, without editable install:
python -m pip install -r requirements.txtIn that mode, run commands as python refseq2cds.py ... instead of
refseq2cds ....
Conda/mamba environment:
mamba env create -f environment.yml
conda activate refseq2cdsThe environment file installs Python dependencies plus common command-line tools: NCBI Datasets CLI, MAFFT, and PAL2NAL.
The package can place datasets and dataformat in ./bin:
refseq2cds download-toolsor, if running directly from the repository:
python refseq2cds.py download-toolsThe workflow was tested with NCBI Datasets CLI 18.25.1.
Before downloading anything from NCBI, verify the installation with the packaged mini dataset:
refseq2cds test --example miniThis should finish in seconds. The fixture contains three RefSeq-like species, one accepted positive-strand family, one accepted negative-strand family, a missing-species rejection, a paralog-like rejection, and an internal-stop CDS rejection. It checks graph filtering, CDS QC, FASTA writing, and human coordinate matrix generation.
For development:
python -m pip install -e .[test]
pytest -qRun the full CDS + matrix workflow with the default 14-species manifest:
refseq2cds run --steps all --download-tools --with-matricesIf you installed with conda/mamba, ncbi-datasets-cli is already in the
environment, so --download-tools is optional:
refseq2cds run --steps all --with-matricesBy default, gene symbols and FASTA filenames come from human taxid 9606.
Choose a different reference species with --reference-taxid:
refseq2cds run --steps all --reference-taxid 9598 --with-matricesIn that example, singleton family IDs and FASTA filenames use chimpanzee gene
symbols. Human CDS genomic matrices are still generated if human taxid 9606
is present in the manifest.
Repository-local equivalent:
./run_cds_pipeline.shThis helper script creates .venv when needed, installs Python dependencies,
downloads the matching NCBI Datasets CLI for your OS, and runs the default full
workflow with human matrices.
The first run downloads large NCBI files. Expect tens of GB of local data:
raw/ NCBI bulk files and assembly annotation packages
indexes/ normalized Parquet/TSV indexes
fastas/ final singleton CDS FASTA files
human_cds_matrices/ human CDS-to-genome coordinate matrices
Recommended end-to-end command flow:
| Step | Command | Main input | Main output | Used by |
|---|---|---|---|---|
| 1 | refseq2cds manifest-from-gcf or init-manifest |
GCF list or packaged defaults | config/species_manifest.tsv |
run, verify, align |
| 2 | refseq2cds run --steps all --with-matrices |
manifest, NCBI Gene bulk files, locked GCF packages | fastas/, human_cds_matrices/, reports/ |
verify, align, variants |
| 3 | refseq2cds verify --full --matrix-rows sample |
run output root | JSON pass/fail check | checkpoint before alignment |
| 4 | refseq2cds align --mode mafft-pal2nal --map-token human |
fastas/ |
alignments/mafft_pal2nal/ and codon maps |
variants |
| 5 | refseq2cds variants |
codon alignments, codon maps, coordinate matrices | variants/ variant tables and BED files |
downstream analysis/UCSC |
| 6 | refseq2cds verify --alignment-dir ... --variant-dir ... |
completed output directories | JSON pass/fail check | final run audit |
The directory names are intentionally connected: run writes fastas/ and
human_cds_matrices/; align reads fastas/ and writes
alignments/mafft_pal2nal/; variants reads alignments/mafft_pal2nal/,
alignments/mafft_pal2nal/maps/, and human_cds_matrices/.
After FASTA generation, create lightweight tree-free codon alignments:
refseq2cds align --mode mafft-pal2nal --jobs 4 --threads-per-mafft 2 --map-token humanAfter codon alignment, call target-specific amino acid variants. This example finds human-specific variants while using human only as the coordinate reference for BED conversion:
refseq2cds variants \
--alignment-dir alignments/mafft_pal2nal \
--codon-map-dir alignments/mafft_pal2nal/maps \
--matrix-dir human_cds_matrices \
--coordinate-reference-token human \
--target-token human \
--min-background-informative 5 \
--forceThe coordinate reference token is not a privileged biological comparator during
event calling. The command calls codon sites only when target and background
amino acid state sets are mutually exclusive, then maps only events where the
coordinate reference has a real mappable CDS base. v0.1.5 reports coordinateable
substitution-like and gap-involving amino acid differences together as
bed_event_class=variant; it does not emit separate insertion/deletion BED
classes. Per-gene BED files are also concatenated and coordinate-sorted into
variants/merged.bed for downstream genome-browser and interval workflows. See
docs/variant_detection.md for the matrix and BED rules.
To query one reference gene, for example human FOXP2, across the current
manifest:
refseq2cds run \
--steps all \
--orthology-mode reference_gene_1to1_present_species \
--reference-taxid 9606 \
--reference-symbol FOXP2 \
--with-matrices \
--forceIf the symbol is ambiguous in the reference species, use the NCBI GeneID directly:
refseq2cds run \
--steps all \
--orthology-mode reference_gene_1to1_present_species \
--reference-taxid 9606 \
--reference-gene-id 93986 \
--with-matrices \
--forceThe default minimum output is two sequences, including the reference sequence.
Change this with --min-sequences. Use --exclude-reference only if you want
the reference sequence omitted from the FASTA; most comparative analyses should
keep the reference.
With very broad manifests, check disk space before the full assembly-exact run.
Even though reference-gene mode narrows downloads to graph-level 1:1 candidates,
conserved genes may still need many GCF annotation packages. The download-scope
summary is written to reports/reference_gene_download_scope.json.
For downstream site mapping, write an alignment-to-CDS codon map for the human sequence:
refseq2cds align \
--mode mafft-pal2nal \
--jobs 4 \
--threads-per-mafft 2 \
--map-token humanrefseq2cds verify is a checkpoint command. It does not regenerate data; it
checks that the files produced by earlier steps are internally consistent.
Fast verification of a run generated with --with-matrices:
refseq2cds verifyIf outputs are not in the current directory:
refseq2cds verify --output-root /path/to/run --manifest /path/to/species_manifest.tsvUse --manifest with the exact manifest used for the run. FASTA headers are
checked against the manifest species tokens. In strict singleton mode, each
FASTA must contain every manifest token. In reference-gene present-species
mode, the FASTA may contain a validated subset of manifest tokens.
If you ran without --with-matrices, skip matrix checks:
refseq2cds verify --matrix-rows noneFull FASTA verification plus sampled matrix row counts:
refseq2cds verify --full --matrix-rows sampleFull matrix row-count verification is slower because it reads all gzipped matrix files:
refseq2cds verify --full --matrix-rows fullAfter alignment and variant calling, pass those directories explicitly:
refseq2cds verify \
--full \
--matrix-rows sample \
--alignment-dir alignments/mafft_pal2nal \
--variant-dir variants/human_specificWhat the verify levels mean:
| Option | What is checked | Cost |
|---|---|---|
| default | Python source syntax, required reports, first 200 FASTA files, sampled matrix row counts | fast |
--full |
every FASTA file instead of a 200-file sample | moderate for large runs |
--matrix-rows none |
skip matrix manifests and matrix row counts | use when no matrices were built |
--matrix-rows sample |
count rows for selected matrix files and compare with matrix manifest | fast checkpoint |
--matrix-rows full |
count rows in every gzipped matrix file | slow but strongest matrix audit |
--alignment-dir PATH |
check alignment summary.json, manifest.tsv, failed.tsv, and codon alignment counts; relative paths are resolved under --output-root |
fast |
--variant-dir PATH |
check variant summary.json, manifest.tsv, failures, and BED row counts; relative paths are resolved under --output-root |
fast to moderate |
refseq2cds init-manifest --forceWrites:
config/species_manifest.tsv
Write a file with one RefSeq assembly accession per line:
GCF_009914755.1
GCF_028858775.2
GCF_029289425.2
GCF_029281585.2
GCF_037993035.2-RS_2025_03
GCF_964374335.1/
Then generate config/species_manifest.tsv automatically and run the pipeline:
refseq2cds manifest-from-gcf --inputfile gcfs.txt --download-tools --force
refseq2cds run --steps all --force --with-matricesYou can also pass accessions directly:
refseq2cds manifest-from-gcf \
GCF_009914755.1 GCF_028858775.2 GCF_029289425.2 \
--forceThe command queries NCBI Datasets for taxid, organism name, and accession
metadata. Known species from the default manifest receive readable tokens such
as human, chimpanzee, and bonobo; other species receive sanitized
scientific-name tokens. Annotation-release suffixes such as -RS_2025_03 and
accidental trailing slashes are normalized to the base GCF_<digits>.<version>
accession before querying NCBI.
Duplicate taxids are rejected because NCBI Gene orthology is taxid/GeneID based.
Important: changing the manifest changes the biological input set. Use
refseq2cds run --steps all --force --with-matrices after generating a new
manifest so old indexes/FASTA files are not reused.
When --manifest is omitted, refseq2cds run uses
./config/species_manifest.tsv if it exists in the current working directory;
otherwise it falls back to the packaged default 14-species manifest.
If you want filenames/symbols from a non-human reference species, pass its taxid during the run:
refseq2cds run --steps all --force --reference-taxid 9598 --with-matricesrefseq2cds run --steps preflight,bulk
refseq2cds run --steps assemblies,indexes,edges,singletons
refseq2cds run --steps select,qc,sanity,fastas,reportAvailable stages:
preflight
bulk
assemblies
indexes
edges
singletons
select
qc
sanity
fastas
report
refseq2cds build-matricesOverwrite existing matrices:
refseq2cds build-matrices --forceRun all generated FASTA files:
refseq2cds align --mode mafft-pal2nal --jobs 4 --threads-per-mafft 2Run selected genes:
refseq2cds align --mode mafft-pal2nal --symbols BRCA1 TP53 A1BGUse an explicit PAL2NAL script path:
refseq2cds align \
--mode mafft-pal2nal \
--pal2nal /path/to/pal2nal.plWrite a human codon map for alignment-site to CDS-site conversion:
refseq2cds align --mode mafft-pal2nal --map-token humanAlignment output:
alignments/mafft_pal2nal/codon/*.codon.fasta
alignments/mafft_pal2nal/aa/*.aa.fasta
alignments/mafft_pal2nal/aa_aligned/*.aa.aln.fasta
alignments/mafft_pal2nal/maps/*.codon_map.tsv.gz
alignments/mafft_pal2nal/manifest.tsv
alignments/mafft_pal2nal/failed.tsv
alignments/mafft_pal2nal/summary.json
variants consumes codon alignments, token-specific codon maps, and CDS-to-
genome matrices. It writes per-gene BED tracks plus merged.bed for all
coordinateable variants. The default --alignment-dir matches the default
align output directory:
refseq2cds variants \
--alignment-dir alignments/mafft_pal2nal \
--matrix-dir human_cds_matrices \
--coordinate-reference-token human \
--target-token human \
--min-background-informative 5 \
--output-dir variants/human_specific \
--forceIf --codon-map-dir is omitted, the command first looks for
<alignment-dir>/maps, which is where refseq2cds align --map-token human
writes human codon maps.
refseq2cds summaryFor a separate output directory:
refseq2cds summary --output-root /path/to/runThis repository includes packaging and reviewer-smoke-test scaffolding:
environment.yml reproducible conda/mamba environment
conda-recipe/meta.yaml noarch Python conda/Bioconda recipe
.github/workflows/test.yml GitHub Actions pytest workflow
examples/mini/ offline mini fixture
tests/ pytest tests for CLI and mini matrix logic
CITATION.cff citation metadata
The conda recipe is pinned to a GitHub release tarball and SHA256. During
development, update the recipe only after the GitHub tag exists so the final
release archive checksum can be recorded. For example, after tagging v0.1.5,
download the GitHub release tarball, compute its SHA256, update
conda-recipe/meta.yaml, and then run the local build checks. To test the
current recipe locally:
mamba install -c conda-forge -c bioconda conda-build
conda build conda-recipeFor a Bioconda submission, copy conda-recipe/meta.yaml into
bioconda-recipes/recipes/refseq2cds/meta.yaml.
Additional notes:
docs/options.mdsummarizes commonly used command options.docs/troubleshooting.mdcovers common installation and runtime issues.docs/conda_packaging.mdrecords conda/Bioconda packaging steps.
Each matrix is a gzipped TSV. Example:
human_cds_matrices/BRCA1.human_cds_genomic_matrix.tsv.gz
Important columns:
| Column | Meaning |
|---|---|
family_id |
reference-species symbol + reference-species GeneID family identifier |
human_symbol |
original human gene symbol |
human_GeneID |
NCBI human GeneID |
transcript_accession |
selected human transcript |
protein_accession |
selected human protein |
cds_nt_pos_1based |
1-based nucleotide position in normalized human CDS |
codon_index_1based |
1-based codon index in normalized human CDS |
codon_offset_1based |
1, 2, or 3 within codon |
cds_base_transcript_orientation |
CDS base in transcript orientation |
refseq_accession |
RefSeq genomic sequence accession, e.g. NC_060941.1 |
ucsc_chrom |
UCSC-style chromosome name, e.g. chr17 |
genomic_pos_1based |
1-based genomic coordinate |
bed_start_0based |
BED start coordinate |
bed_end_0based |
BED end coordinate |
strand |
human CDS strand |
gcf_accession |
human assembly accession |
annotation_release_id |
RefSeq annotation release ID |
The per-matrix rows describe human CDS nucleotide positions. The accompanying
human_cds_matrices/manifest.tsv also records the reference naming fields
(reference_symbol, reference_GeneID, reference_taxid, and
reference_token) so non-human reference-species runs remain traceable.
To create BED rows for a codon site, filter by codon_index_1based and emit:
ucsc_chrom bed_start_0based bed_end_0based name score strand
For a nucleotide site, filter by cds_nt_pos_1based.
gzip -dc human_cds_matrices/BRCA1.human_cds_genomic_matrix.tsv.gz | headrefseq2cds.py
workflow/scripts/run_cds_pipeline.py
workflow/scripts/build_human_cds_position_matrices.py
workflow/scripts/align_mafft_pal2nal.py
workflow/scripts/call_target_specific_variants.py
config/species_manifest.tsv
examples/mini/
tests/
docs/
conda-recipe/meta.yaml
environment.yml
requirements.txt
pyproject.toml
run_cds_pipeline.sh
Large generated directories are intentionally ignored by .gitignore:
raw/
indexes/
orthology/
selection/
qc/
fastas/
human_cds_matrices/
reports/
alignments/
gene_orthologs.gzis treated as a pairwise edge list, not a ready-made orthogroup table.- Strict singleton means the induced locked-taxid graph component has exactly one GeneID per species and exactly one gene per locked assembly.
- CDS provenance is assembly-exact: selected CDS records must come from the
frozen
GCF_*annotation package. - Human genomic positions use whichever human
GCF_*assembly is present inconfig/species_manifest.tsv. For the default manifest this isGCF_009914755.1/ T2T-CHM13v2.0, not GRCh38. - Terminal stop codons are removed before FASTA/matrix output and recorded in metadata.
refseq2cds align --mode mafft-pal2nalis a lightweight tree-free alignment mode. It is useful for fast baseline codon alignments and site-index mapping, but it is not a replacement for phylogeny-aware PAGAN2/PRANK analyses when a guide tree is central to the study design.