# CORAL Command-Line Tutorial

This tutorial walks through CORAL's command-line interface: installation checks, the three-taxon and multi-species pipelines, PHYLIP integration, common options, and output files.

## Table of Contents

1. [Installation & Setup](#1-installation--setup)
2. [Three-Taxon Pipeline (Pairwise Sister-Taxa)](#2-three-taxon-pipeline-pairwise-sister-taxa)
3. [Multi-Species Pipeline](#3-multi-species-pipeline)
4. [PHYLIP Integration](#4-phylip-integration)
5. [Common Options](#5-common-options)
6. [Output Files](#6-output-files)

---

## 1. Installation & Setup

### Verify Installation

First, ensure CORAL is installed and external tools are available:

```bash
# Check CORAL installation
coral --help

# Check required external tools
samtools --version
bwa  # Prints usage text when run without arguments
datasets --version

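# If using the multi-species pipeline, verify that PHYLIP is in PATH.
# PHYLIP programs start an interactive session rather than printing help,
# so check for the executable instead of running it:
command -v dnapars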
```

### Required External Tools

CORAL requires the following tools to be installed and available in your PATH:

- **NCBI Datasets CLI** - For downloading genomes
- **SAMtools** - For BAM file processing and pileup generation
- **BWA or BWA-MEM2** - Read aligner (classic BWA is the default; BWA-MEM2 can be selected with `--aligner-name bwa-mem2`)

**Optional (for the multi-species pipeline):**
- **PHYLIP** - Required for the `coral run_multi` and `coral run_phylip` commands (must be in PATH)

See the [README.md](README.md) for detailed installation instructions.
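
The individual checks above can be folded into a single preflight step. The following is a minimal sketch and not part of CORAL: `missing_tools` is a hypothetical helper name, and the tool names are the executables used in the verification commands above.

```bash
# Print each command that is not found in PATH, one per line.
# Returns 0 when everything is present, 1 otherwise.
missing_tools() {
  local rc=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Required tools; append dnapars when using the multi-species pipeline.
if missing=$(missing_tools coral samtools bwa datasets); then
  echo "All required tools found."
else
  echo "Missing from PATH: $(echo $missing)"
fi
```

If you align with BWA-MEM2, check for `bwa-mem2` instead of `bwa`.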

---

## 2. Three-Taxon Pipeline (Pairwise Sister-Taxa)

The `run_single` command processes three taxa: one outgroup, whose genome serves as the alignment reference, and two ingroup species. This implements the pairwise sister-taxa analysis method.

### Basic Usage

```bash
coral run_single \
  --outgroup <outgroup_name> <outgroup_accession> \
  --species <species1_name> <species1_accession> <species2_name> <species2_accession> \
  --output <output_directory>
```

### Example: Saccharomyces Species

```bash
coral run_single \
  --outgroup Saccharomyces_mikatae_IFO_1815 GCF_947241705.1 \
  --species Saccharomyces_paradoxus GCF_002079055.1 \
            Saccharomyces_cerevisiae_S288C GCF_000146045.2 \
  --output ../test_output \
  --mapq 60 \
  --suffix test
```

### Example: Drosophila Species

```bash
coral run_single \
  --outgroup Drosophila_helvetica GCA_963969585.1 \
  --species Drosophila_pseudoobscura GCF_009870125.1 Drosophila_miranda GCF_003369915.1 \
  --output ../test_output \
  --mapq 60 \
  --cores 4
```

### Key Options

- `--outgroup`: Reference species (name and NCBI accession) - **Required**
- `--species`: Two ingroup species (4 arguments: name1, acc1, name2, acc2) - **Required**
- `--output`: Output directory - **Required**
- `--mapq`: Minimum mapping quality threshold (default: 60)
- `--suffix`: Optional suffix appended to run_id (e.g., `_test`)
- `--cores`: Number of CPU cores to use (default: all available)
- `--no-cache`: Force regeneration of intermediate files
- `--quiet`: Disable verbose logging
- `--divergence-time`: Divergence time in years (for mutation rate calculation)

### What It Does

The `run_single` command automatically:

1. Downloads genomes from NCBI
2. Indexes the reference genome
3. Simulates reads from ingroup species
4. Aligns reads to the reference
5. Generates multi-taxa pileup
6. Extracts mutations
7. Normalizes mutation counts
8. Generates plots and summary tables
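
Step 6 depends on assigning each observed difference to a branch. The sketch below shows the standard outgroup-polarization rule for three taxa, which is one plausible reading of the pairwise sister-taxa method; it is not taken from CORAL's source. `classify_site` is a hypothetical helper that takes the outgroup (reference) base and the two ingroup bases at one aligned site.

```bash
# Classify one aligned site from three bases: outgroup, ingroup 1, ingroup 2.
# A change is assigned to an ingroup branch only when the other ingroup
# still matches the outgroup, so the outgroup base is taken as ancestral.
classify_site() {
  local out=$1 in1=$2 in2=$3
  if [ "$in1" = "$out" ] && [ "$in2" = "$out" ]; then
    echo "invariant"
  elif [ "$in2" = "$out" ]; then
    echo "ingroup1 ${out}>${in1}"
  elif [ "$in1" = "$out" ]; then
    echo "ingroup2 ${out}>${in2}"
  else
    # Both ingroups differ from the outgroup: the change cannot be
    # placed on a single ingroup branch by parsimony.
    echo "unresolved"
  fi
}

classify_site A G A   # only ingroup 1 differs from the outgroup
classify_site A A A   # no difference at this site
classify_site A G G   # shared difference, not assigned to either branch
```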

---

## 3. Multi-Species Pipeline

The `run_multi` command processes multiple species using either a Newick tree or a species list.

### Option 1: From Newick Tree

```bash
coral run_multi \
  --newick-tree "<newick_tree_string>" \
  --outgroup <outgroup_name> \
  --output <output_directory> \
  --run-id <run_id>
```

**Example:**

```bash
coral run_multi \
  --newick-tree "(((Drosophila_melanogaster|GCF_000001215.4,Drosophila_sechellia|GCF_000006755.1),Drosophila_mauritiana|GCF_004382145.1),Drosophila_simulans|GCF_016746395.2);" \
  --outgroup Drosophila_simulans \
  --output ../test_output \
  --run-id drosophila_test \
  --mapq 60
```

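**Newick Tree Format:**
- Each leaf is written as `species_name|accession`, with `|` separating the species name from its NCBI accession
- Parentheses group sister lineages, as in standard Newick
- The tree must end with `;`
- Quote the whole string so the shell does not interpret `|`, `(`, `)`, or `;`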
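
A malformed tree string is easy to produce when typing it by hand. The following sketch checks the format rules above before a run. `newick_leaves` is a hypothetical helper, and it assumes the tree has no branch lengths and that names and accessions contain no parentheses, commas, semicolons, or `|` characters.

```bash
# List the leaves of a CORAL-style Newick string as "name<TAB>accession".
# Returns 1 if the string does not end with ';'.
newick_leaves() {
  local tree=$1
  case $tree in
    *';') ;;
    *) echo "error: tree must end with ';'" >&2; return 1 ;;
  esac
  # Each leaf is a run of ordinary characters, a '|', and another such run.
  printf '%s\n' "$tree" | grep -oE '[^(),;|]+\|[^(),;|]+' | tr '|' '\t'
}

newick_leaves "((Drosophila_melanogaster|GCF_000001215.4,Drosophila_sechellia|GCF_000006755.1),Drosophila_simulans|GCF_016746395.2);"
```

Comparing the printed names against the value you plan to pass to `--outgroup` is a quick way to catch spelling mismatches.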

### Option 2: From Species List (JSON)

```bash
coral run_multi \
  --species-list '<json_array>' \
  --outgroup <outgroup_name> \
  --output <output_directory> \
  --run-id <run_id>
```

**Example:**

```bash
coral run_multi \
  --species-list '[["Drosophila_melanogaster","GCF_000001215.4"],["Drosophila_sechellia","GCF_000006755.1"],["Drosophila_mauritiana","GCF_004382145.1"],["Drosophila_simulans","GCF_016746395.2"]]' \
  --outgroup Drosophila_simulans \
  --output ../test_output \
  --run-id drosophila_test \
  --mapq 60
```

**Species List Format:**
- JSON array of `[name, accession]` pairs
- Wrap the array in single quotes so the shell preserves the inner double quotes
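
Hand-writing JSON inside shell quotes is error-prone. One way to avoid that, sketched below, is to keep the species in a two-column text file and generate the array from it. `species_list_json` and `species.tsv` are hypothetical names, and the helper assumes names and accessions contain no whitespace, quotes, or backslashes.

```bash
# Convert whitespace-separated "name accession" lines on stdin
# into a compact JSON array of [name, accession] pairs.
species_list_json() {
  awk 'BEGIN { printf "[" }
       NF >= 2 { printf "%s[\"%s\",\"%s\"]", sep, $1, $2; sep = "," }
       END { print "]" }'
}

printf '%s\n' \
  "Drosophila_melanogaster GCF_000001215.4" \
  "Drosophila_simulans GCF_016746395.2" | species_list_json
```

The result can then be passed as `--species-list "$(species_list_json < species.tsv)"`, which keeps the JSON intact as a single argument.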

### Key Options

- `--newick-tree`: Newick tree string with species and accessions
- `--species-list`: JSON array of species (alternative to newick-tree)
- `--outgroup`: Outgroup species name - **Required**
- `--output`: Output directory - **Required**
- `--run-id`: Custom run identifier (default: `multi_species_run`)
- `--mapq`: Minimum mapping quality threshold (default: 60)
- `--cores`: Number of CPU cores to use
- `--no-cache`: Force regeneration
- `--quiet`: Disable verbose logging

### Important Note

Multi-species analyses in CORAL are experimental. Results should be interpreted with caution and are best used for exploratory analyses or method development.

---

## 4. PHYLIP Integration

The `run_phylip` command runs PHYLIP phylogenetic analysis on mutation matrices generated by the multi-species pipeline.

### Usage

```bash
coral run_phylip \
  --df <path_to_matching_bases.csv.gz> \
  --tree <path_to_annotated_tree.nwk> \
  --mapping <path_to_species_mapping.json>
```

### Example

```bash
coral run_phylip \
  --df ../test_output/drosophila_test/matching_bases.csv.gz \
  --tree ../test_output/drosophila_test/annotated_tree.nwk \
  --mapping ../test_output/drosophila_test/species_mapping.json \
  --phylip-command dnapars \
  --prefix drosophila_phylip
```

### Key Options

- `--df`: Path to `matching_bases.csv.gz` from multi-species pipeline - **Required**
- `--tree`: Path to `annotated_tree.nwk` - **Required**
- `--mapping`: Path to `species_mapping.json` - **Required**
- `--phylip-command`: PHYLIP program to run (default: `dnapars`)
  - Examples: `dnapars`, `dnapenny`
- `--prefix`: Output file prefix (default: `phylip_run`)
- `--input-string`: PHYLIP interactive input (default: `Y\n`)

### Requirements

- PHYLIP must be installed and available in PATH
- Input files must be generated by `coral run_multi`

---

## 5. Common Options

### Caching

By default, CORAL caches intermediate files. If outputs already exist, they are reused.

**Force regeneration:**
```bash
coral run_single ... --no-cache
```

### Verbosity

**Verbose logging (default):**
```bash
coral run_single ... --verbose
```

**Quiet mode:**
```bash
coral run_single ... --quiet
```

### MAPQ Filtering

**High quality (strict):**
```bash
coral run_single ... --mapq 60
```

**Lower quality (more permissive):**
```bash
coral run_single ... --mapq 30
```
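
To see what a threshold does to a dataset, it can help to count alignments at each cutoff. The sketch below assumes `--mapq` compares against the MAPQ column of SAM records (field 5), as `samtools view -q` does; `count_mapq_at_least` is a hypothetical helper, not a CORAL command, and the three records are made up for illustration. `bwa mem` assigns MAPQ values up to 60, so a threshold of 60 keeps only top-scoring alignments.

```bash
# Count SAM records on stdin whose MAPQ (column 5) is >= the threshold.
# Header lines starting with '@' are skipped.
count_mapq_at_least() {
  awk -v min="$1" -F'\t' '!/^@/ && $5 + 0 >= min + 0 { n++ } END { print n + 0 }'
}

# Three made-up records with MAPQ 60, 30 and 0.
sam=$(printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\t*\nr2\t0\tchr1\t200\t30\t4M\t*\t0\t0\tACGT\t*\nr3\t0\tchr1\t300\t0\t4M\t*\t0\t0\tACGT\t*\n')
printf '%s\n' "$sam" | count_mapq_at_least 60
printf '%s\n' "$sam" | count_mapq_at_least 30
```

On real data the same function can read from `samtools view -h <file.bam>`.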

### CPU Cores

**Use all available cores (default, so the flag can be omitted):**
```bash
coral run_single ... --cores $(nproc)
```

**Use specific number:**
```bash
coral run_single ... --cores 4
```
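
`nproc` is part of GNU coreutils and is not present on every system (macOS lacks it by default). A portable way to pick a core count, sketched here with the hypothetical helper `detect_cores`, falls back to `getconf`:

```bash
# Print the number of online CPU cores, falling back to 1 if unknown.
detect_cores() {
  nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Leave one core free for other work, but never go below 1.
cores=$(( $(detect_cores) - 1 ))
[ "$cores" -ge 1 ] || cores=1
echo "$cores"
```

The value can then be passed along as `--cores "$cores"`.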

### Custom Aligner

**Use default BWA (classic):**
```bash
coral run_single ...
```

**Use BWA-MEM2:**
```bash
coral run_single ... --aligner-name bwa-mem2
```

**Use custom aligner command:**
```bash
coral run_single ... \
  --aligner-name minimap2 \
  --aligner-cmd "minimap2 -ax map-ont {ref} {reads} | samtools sort -o {bam}"
```

The command template supports three placeholders: `{ref}` (reference genome), `{reads}` (input reads), and `{bam}` (output BAM file).
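
To make the placeholder mechanism concrete, the sketch below shows one way such a template could be expanded. It is an illustration only: how CORAL performs the substitution internally is not documented here, `expand_aligner_cmd` is a hypothetical helper, and the file names are invented. It assumes the paths contain no `|`, `&`, or backslash characters.

```bash
# Fill {ref}, {reads} and {bam} in an aligner command template.
# Args: template, reference FASTA, reads file, output BAM.
expand_aligner_cmd() {
  printf '%s\n' "$1" |
    sed -e "s|{ref}|$2|g" -e "s|{reads}|$3|g" -e "s|{bam}|$4|g"
}

expand_aligner_cmd \
  'minimap2 -ax map-ont {ref} {reads} | samtools sort -o {bam}' \
  outgroup.fa reads.fq aligned.bam
```

Printing the expanded command this way is a cheap check that a custom template produces the pipeline you intended before starting a long run.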

---

## 6. Output Files

Each run writes its results to a self-contained directory, `<output_dir>/<run_id>/`.

### Directory Structure

```
<output_dir>/
  └── <run_id>/
      ├── *.pileup.gz               # Multi-taxa pileup
      ├── Mutations/                 # Mutation files directory
      │   ├── *_mutations.csv.gz    # Full mutation lists
      │   └── *_mutations.json      # Mutation context counts
      ├── Triplets/                  # Trinucleotide context files
      ├── Tables/                    # Normalized spectra tables
      ├── Plots/                     # Visualization plots
      ├── Intervals/                 # Read interval files
      └── pipeline_timings.json      # Execution timing info
```
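
A quick completeness check after a run can be scripted from the layout above. The sketch below uses a hypothetical helper, `missing_outputs`, and assumes every run produces all of the listed entries, which may not hold for every command.

```bash
# Print the expected entries that are absent from a run directory.
# Returns 0 if nothing is missing, 1 otherwise.
missing_outputs() {
  local run_dir=$1 rc=0 entry
  for entry in Mutations Triplets Tables Plots Intervals pipeline_timings.json; do
    if [ ! -e "$run_dir/$entry" ]; then
      printf '%s\n' "$entry"
      rc=1
    fi
  done
  # The pileup name varies per run, so match it with a glob.
  set -- "$run_dir"/*.pileup.gz
  if [ ! -e "$1" ]; then
    printf '%s\n' "*.pileup.gz"
    rc=1
  fi
  return "$rc"
}

# Path taken from the multi-species example; adjust to your own run.
missing_outputs ../test_output/drosophila_test || echo "run incomplete"
```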

### Key Output Files

**Pileup:**
- `<run_id>.pileup.gz` - Multi-taxa pileup file (in run root)

**Mutations:**
- `Mutations/<taxon1>__<taxon2>__<reference>__mutations.csv.gz` - Full mutation list
- `Mutations/<taxon1>__<taxon2>__<reference>__mutations.json` - Mutation counts

**Mutation File Naming:** Files are named `<taxon1>__<taxon2>__<reference>__mutations.*` where the file contains mutations inferred to have occurred on the branch leading to `<taxon1>` since its divergence from `<taxon2>`, using `<reference>` as the outgroup/reference genome.
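
Because single underscores appear inside taxon names, downstream scripts should split these file names on the double underscore. A minimal sketch follows, with `parse_mutation_filename` as a hypothetical helper and an illustrative file name built from the Saccharomyces example (the taxon order in your own output may differ):

```bash
# Split a mutation file name into its three taxa.
# Prints taxon1, taxon2 and reference on separate lines.
parse_mutation_filename() {
  local base stem
  base=$(basename "$1")
  stem=${base%__mutations.*}   # drop the "__mutations.<ext>" tail
  printf '%s\n' "$stem" |
    awk -F'__' 'NF == 3 { print "taxon1=" $1; print "taxon2=" $2; print "reference=" $3 }'
}

parse_mutation_filename \
  "Mutations/Saccharomyces_paradoxus__Saccharomyces_cerevisiae_S288C__Saccharomyces_mikatae_IFO_1815__mutations.csv.gz"
```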

**Tables:**
- `Tables/normalized_scaled.tsv` - Normalized mutation spectra
- `Tables/collapsed_mutations.tsv` - Raw mutation counts
- `Tables/scaled_raw.tsv` - Scaled raw counts
- `Tables/triplets.tsv` - Trinucleotide context counts

**Plots:**
- `Plots/*.png` - Mutation spectra, coverage, and density plots

For detailed information about output file naming conventions and formats, see [OUTPUT_FORMAT.md](OUTPUT_FORMAT.md).

---

## Quick Reference

### Commands

- `coral run_single` - Three-taxon pipeline (outgroup + 2 ingroups)
- `coral run_multi` - Multi-species pipeline
- `coral run_phylip` - PHYLIP phylogenetic analysis

### Common Options

- `--mapq <int>` - Mapping quality threshold (default: 60)
- `--cores <int>` - Number of CPU cores (default: all available)
- `--no-cache` - Force regeneration of all files
- `--quiet` - Disable verbose logging
- `--suffix <str>` - Append suffix to run_id
- `--run-id <str>` - Custom run identifier (multi-species only)

### Getting Help

```bash
coral --help
coral run_single --help
coral run_multi --help
coral run_phylip --help
```

---

## Additional Resources

- [README.md](README.md) - Main documentation
- [OUTPUT_FORMAT.md](OUTPUT_FORMAT.md) - Detailed output format and naming conventions
- [tutorial_step_by_step.ipynb](tutorial_step_by_step.ipynb) - Python library usage tutorial

