# Module 8: Next-Generation Sequencing Data Formats

## Common NGS Data Formats

Some of the most common NGS Data Formats include:

- FASTA
    - Reference Sequences
    - "Colorspace FASTA" is produced by SOLiD NGS platforms
- FASTQ
    - Raw sequences + base read quality
    - Produced by Illumina NGS platforms
- SAM/BAM/CRAM
    - Aligned sequences + many quality metrics
- VCF
    - Variant Calls
- Standard Flowgram Format (.sff)
    - Produced by Roche 454 (LS454) NGS platforms

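NGS data is typically stored in plain text files (BAM and CRAM are the binary exceptions), and formats such as SAM and VCF store their fields in tab-separated columns. Most of these plain text files are distributed compressed as .gz (gzip) files. gzip uses the same DEFLATE compression as most .zip archives, so file sizes are similar; the practical advantage of .gz is that it compresses a single file as a stream, which many command-line tools can read directly.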

So, it's very important to be able to handle tab-separated text files when handling NGS data.

Windows ends lines with CRLF (`\r\n`), while Linux and macOS use LF (`\n`), so you have to be mindful of that difference when moving files between systems.
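
As a minimal sketch of handling such files, the snippet below reads a gzip-compressed, tab-separated text file and tolerates both `\n` and `\r\n` line endings. The function name `read_tsv_gz` and the file name in the usage comment are hypothetical, and the sketch assumes no field contains a literal tab.

```python
import gzip


def read_tsv_gz(path):
    """Yield each non-empty line of a gzipped, tab-separated file as a list of fields."""
    # newline="" returns line endings untranslated, so we strip them ourselves.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for line in handle:
            line = line.rstrip("\r\n")  # handles both LF and CRLF endings
            if line:
                yield line.split("\t")


# Hypothetical usage:
# for fields in read_tsv_gz("example.tsv.gz"):
#     print(fields)
```

Header or comment lines (for example, lines starting with `#` in VCF) are not treated specially here and would need their own handling.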

### FASTA Format

FASTA is a text format that represents nucleotide or peptide sequences. Each letter represents a single nucleotide or amino acid.

The "sequence name" line is followed by the "sequence data":
- Sequence Name starts with ">"
- A comment line (rarely used) starts with ";"
- **An "N" nucleotide indicates that the base at that position in the reference sequence is not yet known**

Each line should be no longer than 120 characters, though lines are usually 80 characters or fewer. Each sequence line typically has the same fixed width throughout the file (except the last line of each sequence).
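
The layout above can be sketched as a small parser. This is an illustrative sketch, not a replacement for an established library; the function name `parse_fasta` and the example records are made up for demonstration.

```python
def parse_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and (rarely used) comment lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # emit the previous record
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequence data may span many lines
    if name is not None:
        yield name, "".join(chunks)


# Made-up example records; "N" marks bases that are not known.
example = [">chr1 demo", "ACGTN", "NNGT", ";a comment", ">chr2", "TTAA"]
records = list(parse_fasta(example))
# records == [("chr1 demo", "ACGTNNNGT"), ("chr2", "TTAA")]
```

The parser accepts any iterable of strings, so an open file handle can be passed in place of the list.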

FASTA is primarily used for reference or assembled sequences. It is an old file type that was actually optimized for dot matrix printers.

The FAI (FASTA Index) file helps quickly locate a desired sequence.

It is a tab-separated file with one row per sequence and five columns: the chromosome (sequence) name, the length of the chromosome (in bases), the byte offset of the first base, the number of bases per line, and the number of bytes per line (including the newline).
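
To show why these five columns are enough to jump straight to a base, here is a hedged sketch that converts a 0-based position into a byte offset in the FASTA file. The names `FaiEntry`, `parse_fai_line`, and `byte_offset` are hypothetical, and the example index line is invented.

```python
from typing import NamedTuple


class FaiEntry(NamedTuple):
    name: str        # chromosome / sequence name
    length: int      # sequence length in bases
    offset: int      # byte offset of the first base
    line_bases: int  # bases per line
    line_width: int  # bytes per line, including the newline


def parse_fai_line(line):
    """Parse one tab-separated FAI row into a FaiEntry."""
    name, length, offset, line_bases, line_width = line.rstrip("\r\n").split("\t")[:5]
    return FaiEntry(name, int(length), int(offset), int(line_bases), int(line_width))


def byte_offset(entry, position):
    """Return the byte offset in the FASTA file of a 0-based base position."""
    if not 0 <= position < entry.length:
        raise IndexError("position is outside the sequence")
    full_lines, within_line = divmod(position, entry.line_bases)
    return entry.offset + full_lines * entry.line_width + within_line


# Invented example: 250 bases, 60 bases per line, header ">chr1\n" is 6 bytes.
entry = parse_fai_line("chr1\t250\t6\t60\t61\n")
# byte_offset(entry, 125) == 6 + 2 * 61 + 5 == 133
```

A reader could then `seek()` to that offset and read the bases it needs, skipping newline bytes along the way.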

### FASTQ Format

Produced by Illumina NGS platforms, FASTQ stands for "FASTa + Quality." It is one of the de facto standards for **reads**.

There are 4 lines produced per read:
1. @Sequence ID (unique)
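2. Base calls (the read sequence)
3. "+" with optional sequence ID
4. Phred-scaled quality score for each base
    - Encoded as printable ASCII characters by adding 33 to each score (Phred+33), but different platforms may use different offsets/scales (older Illumina pipelines used an offset of 64)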

Paired reads may be identified in line 3, but they are usually provided in two separate FASTQ files, with the two mates of a pair at the same line numbers in each file.
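
The four-line record structure and the quality encoding can be sketched as follows. This assumes strict four-line records and a default offset of 33; the function name `parse_fastq` and the example read are hypothetical.

```python
def parse_fastq(lines, offset=33):
    """Yield (read_id, bases, quality_scores) from an iterable of FASTQ lines.

    Assumes each record is exactly four lines; a trailing incomplete
    record is silently ignored.
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    # Passing the same iterator four times groups the lines in fours.
    for header, bases, plus, quals in zip(stripped, stripped, stripped, stripped):
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(quals):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        scores = [ord(char) - offset for char in quals]  # undo the ASCII offset
        yield header[1:], bases, scores


# Invented example read: "I" is ASCII 73, so its Phred score is 73 - 33 = 40.
example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
reads = list(parse_fastq(example))
# reads == [("read1", "ACGT", [40, 40, 20, 0])]
```

For paired files, iterating two such generators together with `zip()` keeps the mates in step, given that both files list the pairs in the same order.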

### SAM/BAM Format

BAM is the binary version of SAM (Sequence Alignment/Map). It is the de facto standard for mapped/aligned sequence reads.

SAM/BAM consists of two parts:
- Header (information that allows for reproducibility)
    - References the genome information (name, version, URL)
    - Lists the reference sequences (SQ)
    - Provides read group information (RG)
    - Lists programs (PG)
- Alignment section
    - Query template name
    - Bitwise FLAG
    - Reference sequence name (matches an SQ line in the header)
    - Mapping position
    - Mapping quality (Phred)
    - CIGAR string (base alignment)
        - Describes matches, gaps, clipping, and other features of the alignment
    - Paired mate information and template length
        - "=" means the mate maps to the same chromosome, which is the most common case
        - Insert size can only be measured once mapping is done; it is the distance between the mapped ends of the pair
    - Read sequence (SEQ) and base qualities
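
As a rough sketch of the alignment section, the snippet below splits one SAM alignment line into its eleven mandatory tab-separated columns and decodes the bitwise FLAG. The column names follow the SAM specification; the function names, the short flag labels, and the example line are made up for illustration.

```python
# Bit values defined by the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def parse_alignment(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    f = line.rstrip("\r\n").split("\t")
    return {
        "qname": f[0], "flag": int(f[1]), "rname": f[2], "pos": int(f[3]),
        "mapq": int(f[4]), "cigar": f[5], "rnext": f[6], "pnext": int(f[7]),
        "tlen": int(f[8]), "seq": f[9], "qual": f[10],
        "tags": f[11:],  # optional TAG:TYPE:VALUE fields
    }


# Invented example line; RNEXT "=" means the mate is on the same reference.
record = parse_alignment("read1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII\tNM:i:0")
labels = decode_flag(record["flag"])
# labels == ["paired", "proper_pair", "mate_reverse_strand", "first_in_pair"]
```

Header lines, which start with `@`, are not alignment records and should be filtered out before calling `parse_alignment`.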

BAM is compressed (using BGZF, a block-based variant of gzip) to be more compact. BAM files can be sorted by coordinate and then indexed, which makes accessing the reads in a given region very fast. A BAI file is the index for a BAM file.

#### CIGAR (Concise Idiosyncratic Gapped Alignment Report) Strings

| Op | BAM | Description                                                                         |
|----|-----|-------------------------------------------------------------------------------------|
| M  | 0   | Alignment match (can be a match or mismatch)                                        |
| I  | 1   | Insertion to the reference                                                          |
| D  | 2   | Deletion from the reference                                                         |
| *N | 3   | Skipped region from the reference. Indicates a splicing event in TopHat RNA-seq BAMs |
| S  | 4   | Soft clipping (clipped sequences present in SEQ)                                    |
| *H | 5   | Hard clipping (clipped sequences NOT present in SEQ)                                |
| *P | 6   | Padding (silent deletion from padded reference)                                     |
| *= | 7   | Sequence match                                                                      |
| *X | 8   | Sequence mismatch                                                                   |

*rarer/newer
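
To make the table concrete, here is a small sketch that splits a CIGAR string into (length, operation) pairs and sums the operations that consume the reference or the read. The function names are hypothetical; which operations consume which sequence follows the SAM specification.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")


def parse_cigar(cigar):
    """Return a list of (length, op) tuples; "*" (no CIGAR) gives an empty list."""
    if cigar == "*":
        return []
    if not _CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered: M, D, N, = and X consume the reference."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar):
    """Number of bases in SEQ: M, I, S, = and X consume the read."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


# "5S10M2I4D20M": 5 soft-clipped, 10 aligned, 2 inserted, 4 deleted, 20 aligned.
# reference_span("5S10M2I4D20M") == 34 and query_length("5S10M2I4D20M") == 37
```

The reference span is what lets a viewer work out where a read ends: the last covered position is the mapping position plus the span, minus one.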

#### Tools to Read/Write SAM/BAM

- IGV (Integrative Genomics Viewer)
    - GUI tool to view BAM files with reference and/or variant call (VCF) data
    - Powerful visualization tool
- samtools
    - Command-line tools to read/write/modify SAM/BAM files
    - Versatile tool that also has quick visualization of mapped reads on text terminals

## Alignment Quality

Most aligners estimate how reliable an alignment is with a Mapping Quality (MAPQ), which is the Phred-scaled estimate of the probability that the chosen mapping is wrong. As such, a "Q30" mapping quality means the aligner estimates a 1 in 1000 chance that the read is placed incorrectly.
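
The Phred scale relates a quality Q to an error probability P by Q = -10 * log10(P). The short sketch below converts in both directions; the function names are hypothetical.

```python
import math


def phred_to_error_probability(quality):
    """Convert a Phred-scaled quality into the probability of an error."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Convert an error probability (0 < p <= 1) into a Phred-scaled quality."""
    return -10 * math.log10(probability)


# Each step of 10 on the Phred scale is a tenfold change in error probability:
# Q10 -> 1 in 10, Q20 -> 1 in 100, Q30 -> 1 in 1000.
for q in (10, 20, 30):
    print(q, phred_to_error_probability(q))
```

Note that how aligners actually arrive at their mapping qualities differs between tools, so values are best compared within a single aligner rather than across aligners.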

See the Phred score notes in Module 3.