# End-to-End NGS Analysis Pipeline

In this notebook, we demonstrate an end-to-end next-generation sequencing (NGS) analysis pipeline, using Python to drive standard command-line tools (SRA Toolkit, FastQC, Trimmomatic, BWA).

The basic steps are:

1. Download Sequencing Data
2. Quality Control
3. Data Preprocessing
4. Read Alignment
5. Convert SAM to BAM and Sort
6. Variant Calling
7. Annotation
8. Visualization & Reporting

In [33]:
import os
import subprocess
from pathlib import Path
from typing import Union

# Download Data

In [34]:
sra_id = 'SRR31041149'
DATA_DIR = Path('../data')

In [35]:
def download_sra_data(sra_id: str, DATA_DIR):
    """
    Download SRA data using SRA Toolkit

    Parameters:
        sra_id (str): Accession id of sequence
        DATA_DIR: Output directory
    """
    try:
        print(f'Downloading {sra_id}...')
        # use 'prefetch' command to download .sra file
        subprocess.run(f'prefetch {sra_id} -O {DATA_DIR}', shell=True, check=True)

        # convert .sra to .fastq using fastq-dump
        sra_file = f'{DATA_DIR}/{sra_id}/{sra_id}.sra'
        subprocess.run(f'fastq-dump --split-files {sra_file} -O {DATA_DIR}', shell=True, check=True)

        print(f'Downloaded and converted {sra_id} to FASTQ format.')
    except subprocess.CalledProcessError as e:
        print(f'Error during download or conversion: {e}')

In [37]:
subprocess.run(f'prefetch {sra_id} -O {DATA_DIR}', shell=True, check=True)

2024-10-20T04:08:50 prefetch.3.1.1: 1) Resolving 'SRR31041149'...
2024-10-20T04:08:51 prefetch.3.1.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores
2024-10-20T04:08:52 prefetch.3.1.1: 1) 'SRR31041149' is found locally 
2024-10-20T04:08:52 prefetch.3.1.1: 'SRR31041149' has 0 unresolved dependencies


CompletedProcess(args='prefetch SRR31041149 -O ../data', returncode=0)
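
The calls above pass a single command string to the shell. As a minimal sketch of an alternative, the same commands can be built as argument lists, so that paths containing spaces need no quoting and `shell=True` is not required. The helper names `build_prefetch_cmd` and `build_fastq_dump_cmd` are hypothetical; the flags are the ones already used in this notebook.

```python
from pathlib import Path
from typing import List, Union


def build_prefetch_cmd(sra_id: str, output_dir: Union[str, Path]) -> List[str]:
    """Return the prefetch command as an argument list (hypothetical helper)."""
    return ['prefetch', sra_id, '-O', str(output_dir)]


def build_fastq_dump_cmd(sra_file: Union[str, Path],
                         output_dir: Union[str, Path]) -> List[str]:
    """Return the fastq-dump command as an argument list (hypothetical helper)."""
    return ['fastq-dump', '--split-files', '--outdir', str(output_dir), str(sra_file)]


# Building the command does not run anything; it only prepares the arguments.
cmd = build_prefetch_cmd('SRR31041149', 'data')
print(cmd)
```

To execute such a command, pass the list directly: `subprocess.run(cmd, check=True)`.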

# Quality Control Using FastQC

Before running quality control, we need to convert the downloaded `.sra` file to FASTQ format. We use `fastq-dump` from the **SRA Toolkit** for this task.

In [38]:
def convert_sra_to_fastq(sra_file: Union[str, Path], output_dir: Union[str, Path]) -> None:
    """
    Converts an SRA file to FASTQ format using fastq-dump

    Parameters:
        sra_file (str or Path): Path to input .sra file.
        output_dir (str or Path): Directory to save FASTQ files.
    
    Return:
        None
    """
    # ensure output directory exists
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    try:
        print(f'Converting {sra_file} to FASTQ...')
        cmd = f'fastq-dump --split-files --outdir {output_dir} {sra_file}'
        subprocess.run(cmd, shell=True, check=True)
        print('Conversion completed.')
    except subprocess.CalledProcessError as e:
        print(f'Error during conversion: {e}')

In [39]:
# convert sra to fastq
sra_file = DATA_DIR / sra_id / (sra_id + '.sra')
convert_sra_to_fastq(sra_file=sra_file, output_dir=DATA_DIR)


Converting ../data/SRR31041149/SRR31041149.sra to FASTQ...
Read 1127165 spots for ../data/SRR31041149/SRR31041149.sra
Written 1127165 spots for ../data/SRR31041149/SRR31041149.sra
Conversion completed.


For paired-end reads, `--split-files` writes two `.fastq` files: `_1.fastq` holds the forward reads and `_2.fastq` holds the reverse reads.
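
A quick sanity check on paired-end output is that both files contain the same number of reads. The sketch below assumes uncompressed FASTQ with exactly four lines per record; the function names `count_fastq_reads` and `check_pair_counts` are hypothetical.

```python
from pathlib import Path
from typing import Union


def count_fastq_reads(fastq_path: Union[str, Path]) -> int:
    """Count records in an uncompressed FASTQ file, assuming 4 lines per record."""
    with open(fastq_path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f'{fastq_path}: line count {n_lines} is not a multiple of 4')
    return n_lines // 4


def check_pair_counts(forward: Union[str, Path], reverse: Union[str, Path]) -> bool:
    """Return True if the forward and reverse files hold the same number of reads."""
    return count_fastq_reads(forward) == count_fastq_reads(reverse)
```

For this notebook, the call would look like `check_pair_counts(DATA_DIR / f'{sra_id}_1.fastq', DATA_DIR / f'{sra_id}_2.fastq')`.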

In [40]:
def run_fastqc(input_file: Union[str, Path], output_dir: Union[str, Path]) -> None:
    """
    Run FastQC on the input FASTQ file and store the results in the output directory

    Parameters:
        input_file (str or Path): Path to input FASTQ file.
        output_dir (str or Path): Directory to save FastQC report.
    
    Return:
        None
    """
    # create output directory if it doesn't exist
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # run FastQC command
    try:
        print(f'Running FastQC on {input_file}...')
        cmd = f'fastqc {input_file} -o {output_dir}'
        subprocess.run(cmd, shell=True, check=True)
        print(f'FastQC report saved in {output_dir}')
    except subprocess.CalledProcessError as e:
        print(f'Error running FastQC: {e}')


In [41]:
# run fastqc analysis

input_files = [DATA_DIR / (sra_id + '_1.fastq'), DATA_DIR / (sra_id + '_2.fastq')]
QC_OUTPUT_DIR = DATA_DIR / 'results/qc'

for fastq_file in input_files:
    run_fastqc(fastq_file, QC_OUTPUT_DIR)



Running FastQC on ../data/SRR31041149_1.fastq...
null


Started analysis of SRR31041149_1.fastq
Approx 5% complete for SRR31041149_1.fastq
Approx 10% complete for SRR31041149_1.fastq
Approx 15% complete for SRR31041149_1.fastq
Approx 20% complete for SRR31041149_1.fastq
Approx 25% complete for SRR31041149_1.fastq
Approx 30% complete for SRR31041149_1.fastq
Approx 35% complete for SRR31041149_1.fastq
Approx 40% complete for SRR31041149_1.fastq
Approx 45% complete for SRR31041149_1.fastq
Approx 50% complete for SRR31041149_1.fastq
Approx 55% complete for SRR31041149_1.fastq
Approx 60% complete for SRR31041149_1.fastq
Approx 65% complete for SRR31041149_1.fastq
Approx 70% complete for SRR31041149_1.fastq
Approx 75% complete for SRR31041149_1.fastq
Approx 80% complete for SRR31041149_1.fastq
Approx 85% complete for SRR31041149_1.fastq
Approx 90% complete for SRR31041149_1.fastq
Approx 95% complete for SRR31041149_1.fastq


Analysis complete for SRR31041149_1.fastq




FastQC report saved in ../data/results/qc
Running FastQC on ../data/SRR31041149_2.fastq...
null


Started analysis of SRR31041149_2.fastq
Approx 5% complete for SRR31041149_2.fastq
Approx 10% complete for SRR31041149_2.fastq
Approx 15% complete for SRR31041149_2.fastq
Approx 20% complete for SRR31041149_2.fastq
Approx 25% complete for SRR31041149_2.fastq
Approx 30% complete for SRR31041149_2.fastq
Approx 35% complete for SRR31041149_2.fastq
Approx 40% complete for SRR31041149_2.fastq
Approx 45% complete for SRR31041149_2.fastq
Approx 50% complete for SRR31041149_2.fastq
Approx 55% complete for SRR31041149_2.fastq
Approx 60% complete for SRR31041149_2.fastq
Approx 65% complete for SRR31041149_2.fastq
Approx 70% complete for SRR31041149_2.fastq
Approx 75% complete for SRR31041149_2.fastq
Approx 80% complete for SRR31041149_2.fastq
Approx 85% complete for SRR31041149_2.fastq
Approx 90% complete for SRR31041149_2.fastq
Approx 95% complete for SRR31041149_2.fastq


Analysis complete for SRR31041149_2.fastq




FastQC report saved in ../data/results/qc


A report of quality statistics is presented in the output HTML file for each of the paired-end read files.

**Interpreting the HTML Report**
1. **Per Base Sequence Quality**: Shows a box-and-whisker plot of the quality scores at each read position
2. **GC Content**: Verify that it matches expectations (e.g. ~50% for an *E. coli* sample)
3. **Adapter Content**: Flags adapter sequences in the reads; these are removed in the *Trimming and Filtering* step
4. **Sequence Duplication**: High levels of duplication might indicate PCR bias
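
To make two of these metrics concrete, here is a minimal sketch that computes the GC fraction and mean quality score of individual reads. It assumes four-line FASTQ records and Phred+33 quality encoding; it is an illustration, not a reimplementation of FastQC, and the function names are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count('G') + seq.count('C')) / len(seq) if seq else 0.0


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed by default)."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0


example = io.StringIO('@read1\nGGCCAATT\n+\nIIII####\n')
for name, seq, qual in read_fastq(example):
    print(name, gc_fraction(seq), mean_phred(qual))  # → read1 0.5 21.0
```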

# Data Preprocessing (Trimming and Filtering)

If the FastQC report shows adapter contamination, we can use Trimmomatic (a standalone Java tool, separate from the SRA Toolkit) to remove adapters and low-quality bases at the ends of the reads.

In [42]:
def run_trimmomatic(forward_file: Union[str, Path],
                    reverse_file: Union[str, Path],
                    output_dir: Union[str, Path],
                    trimmomatic_path: Union[str, Path],
                    adapter='TruSeq3-PE.fa') -> None:
    """
    Run Trimmomatic to trim adapters and filter low-quality reads

    Parameters:
        forward_file (str or Path): Path to the forward reads FASTQ file.
        reverse_file (str or Path): Path to the reverse reads FASTQ file.
        output_dir (str or Path): Directory to save trimmed FASTQ files.
        trimmomatic_path (str or Path): Path to trimmomatic.jar file.
        adapter (str): File name of the adapter FASTA file in Trimmomatic's adapters directory.
    
    Return:
        None
    """
    # create output directory if it doesn't exist
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # adapter file (this path is specific to the local Trimmomatic install; adjust as needed)
    adapter_file = f'/Users/Akechi/Downloads/Trimmomatic-0.39/adapters/{adapter}'

    cmd = f"java -jar {trimmomatic_path} PE {forward_file} {reverse_file} {output_dir}/trimmed_1P.fastq {output_dir}/unpaired_1U.fastq {output_dir}/trimmed_2P.fastq {output_dir}/unpaired_2U.fastq ILLUMINACLIP:{adapter_file}:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50"
    try:
        print(f'Running Trimmomatic on {forward_file} and {reverse_file}...')
        subprocess.run(cmd, shell=True, check=True)
    except subprocess.CalledProcessError as e:
        print(f'Error running Trimmomatic: {e}')

In [43]:
# run trimmomatic
forward_file = f'../data/{sra_id}_1.fastq'
reverse_file = f'../data/{sra_id}_2.fastq'
output_dir = DATA_DIR / 'results'
trimmomatic_path = '~/Downloads/Trimmomatic-0.39/trimmomatic-0.39.jar'

run_trimmomatic(forward_file, reverse_file, output_dir=output_dir, trimmomatic_path=trimmomatic_path)


Running Trimmomatic on ../data/SRR31041149_1.fastq and ../data/SRR31041149_2.fastq...


TrimmomaticPE: Started with arguments:
 ../data/SRR31041149_1.fastq ../data/SRR31041149_2.fastq ../data/results/trimmed_1P.fastq ../data/results/unpaired_1U.fastq ../data/results/trimmed_2P.fastq ../data/results/unpaired_2U.fastq ILLUMINACLIP:/Users/Akechi/Downloads/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 1127165 Both Surviving: 1020543 (90.54%) Forward Only Surviving: 69010 (6.12%) Reverse Only Surviving: 11107 (0.99%) Dropped: 26505 (2.35%)
TrimmomaticPE: Completed successfully
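
To make the `SLIDINGWINDOW:4:20` and `MINLEN:50` settings in the command above concrete, here is a simplified illustration of the idea: scan the read from the 5' end, cut at the first window of 4 bases whose mean quality drops below 20, then discard the read if fewer than 50 bases remain. This is a sketch of the concept, not Trimmomatic's actual implementation; the function names `sliding_window_cut` and `trim_read` are hypothetical.

```python
from typing import List, Optional, Tuple


def sliding_window_cut(quals: List[int], window: int = 4, threshold: float = 20.0) -> int:
    """Return how many leading bases to keep.

    The read is cut at the start of the first window whose mean quality
    is below the threshold. Reads shorter than the window are kept whole.
    """
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < threshold:
            return start
    return len(quals)


def trim_read(seq: str, quals: List[int], window: int = 4,
              threshold: float = 20.0, min_len: int = 50) -> Optional[Tuple[str, List[int]]]:
    """Apply the sliding-window cut, then drop reads shorter than min_len."""
    keep = sliding_window_cut(quals, window, threshold)
    if keep < min_len:
        return None  # read is discarded, like MINLEN
    return seq[:keep], quals[:keep]


# 60 high-quality bases followed by 10 low-quality bases
quals = [30] * 60 + [10] * 10
trimmed = trim_read('A' * 70, quals)
print(len(trimmed[0]))  # → 59
```

In the example, the window starting at position 59 is the first whose mean (15) falls below 20, so 59 bases are kept and the read survives the length filter.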


# Align Reads to Reference Genome

Since our reads come from *E. coli*, we need to download an *E. coli* reference genome and align the trimmed, paired reads against it.

*E. coli* Reference Genome:
- https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/

## Download Reference Genome

In [44]:
def get_reference_sequence(ftp_url: str, output_dir: Union[str, Path]) -> None:
    """
    Downloads reference sequence using FTP protocol

    Parameters:
        ftp_url (str): Url to reference sequence.
        output_dir (str or Path): Output directory.
    
    Return:
        None
    """
    # get zip file name
    zip_file = ftp_url.split('/')[-1]

    # make output directory if it doesn't exist
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # change into output directory, download reference sequence, and unzip
    cmd = f'cd {output_dir} && wget {ftp_url} && gunzip {zip_file}'
    try:
        print('Downloading reference genome...')
        subprocess.run(cmd, shell=True, check=True)
    except subprocess.CalledProcessError as e:
        print(f'Error downloading reference genome: {e}')
    

In [45]:
# download reference genome
ref_genome_url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz'
output_dir = '../reference_genome'

get_reference_sequence(ref_genome_url, output_dir=output_dir)

Downloading reference genome...


--2024-10-19 21:09:52--  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
           => ‘GCF_000005845.2_ASM584v2_genomic.fna.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.10, 130.14.250.11, 130.14.250.13, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2 ... done.
==> SIZE GCF_000005845.2_ASM584v2_genomic.fna.gz ... 1379902
==> PASV ... done.    ==> RETR GCF_000005845.2_ASM584v2_genomic.fna.gz ... done.
Length: 1379902 (1.3M) (unauthoritative)

     0K .......... .......... .......... .......... ..........  3%  292K 4s
    50K .......... .......... .......... .......... ..........  7%  718K 3s
   100K .......... .......... .......... .......... .......... 11% 9.05M 2s
   150K ......
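
The decompression step can also be done from Python when `gunzip` is not available. The sketch below uses only the standard library; the function name `gunzip_file` is hypothetical. Unlike `gunzip`, it leaves the original `.gz` file in place.

```python
import gzip
import shutil
from pathlib import Path
from typing import Union


def gunzip_file(gz_path: Union[str, Path]) -> Path:
    """Decompress '<name>.gz' next to itself and return the output path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix('')  # drops only the trailing '.gz'
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)  # stream the data instead of loading it all
    return out_path
```

For the file downloaded above, this would be called as `gunzip_file('../reference_genome/GCF_000005845.2_ASM584v2_genomic.fna.gz')`.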

## Index Reference Genome

We index the reference genome so that the alignment software can quickly locate candidate positions for each trimmed, paired read. Without an index, the aligner would have to scan the entire reference genome linearly for every read.

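**Indexing Steps** (conceptually):
1. Break the reference genome into subsequences (e.g. k-mers or suffixes)
2. Build a lookup structure that maps these subsequences to their positions (a hash table in hash-based aligners; BWA and Bowtie2 instead build an FM-index from the Burrows-Wheeler transform and a suffix array)
3. Compress and store auxiliary data to optimize searching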

Index files let the alignment software perform fast lookups instead of a linear search for every read.
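
The lookup idea can be illustrated with a toy k-mer hash index: positions of every k-mer are stored once, and each read is then matched by looking up its first k-mer and verifying the rest. This is only a sketch of the general principle for exact matches; BWA's real index is an FM-index, and the function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the list of positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_exact_matches(read: str, reference: str,
                       index: Dict[str, List[int]], k: int) -> List[int]:
    """Seed with the read's first k-mer, then verify the full read at each candidate."""
    seed = read[:k]
    return [pos for pos in index.get(seed, [])
            if reference[pos:pos + len(read)] == read]


reference = 'ACGTACGTTAGC'
index = build_kmer_index(reference, k=4)
print(find_exact_matches('ACGTTAG', reference, index, k=4))  # → [4]
```

The seed `ACGT` occurs at positions 0 and 4, so only two candidates are checked instead of every position in the reference.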

**Common Indexing Tools**:
- BWA
- Bowtie2
- SAMtools (`samtools faidx` indexes a FASTA file for random access rather than for alignment)

In [59]:
def index_reference_genome(reference_genome: Union[str, Path], bwa_path: Union[str, Path]) -> None:
    """
    Indexes a reference genome using BWA.

    Parameters:
        reference_genome (str or Path): Path to reference genome in FASTA format.
        bwa_path (str or Path): Path to BWA executable
    
    Return:
        None
    """
    # ensure input FASTA file exists
    reference_genome = Path(reference_genome)
    if not reference_genome.is_file():
        raise FileNotFoundError(f'FASTA file not found: {reference_genome}')
    
    # index command
    cmd = f'{bwa_path} index {str(reference_genome)}'

    # perform indexing
    try:
        print(f'Indexing reference genome: {reference_genome.name}')
        subprocess.run(cmd, shell=True, check=True)
    except subprocess.CalledProcessError as e:
        print(f'Error during genome indexing: {e}')

In [58]:
# index reference genome
reference_genome = '../reference_genome/GCF_000005845.2_ASM584v2_genomic.fna'
bwa_path = Path.home() / 'Downloads/bwa/bwa'

index_reference_genome(reference_genome=reference_genome, bwa_path=bwa_path)

Indexing reference genome: GCF_000005845.2_ASM584v2_genomic.fna


[bwa_index] Pack FASTA... 0.02 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.50 seconds elapse.
[bwa_index] Update BWT... 0.01 sec
[bwa_index] Pack forward-only FASTA... 0.01 sec
[bwa_index] Construct SA from BWT and Occ... 0.16 sec
[main] Version: 0.7.18-r1243-dirty
[main] CMD: /Users/akechi/Downloads/bwa/bwa index ../reference_genome/GCF_000005845.2_ASM584v2_genomic.fna
[main] Real time: 0.697 sec; CPU: 0.699 sec
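
After indexing, classic BWA (0.7.x, as used above) writes five files next to the FASTA, each named by appending a suffix to the full file name. A minimal sketch for checking that the index is complete before alignment is shown below; the function name `missing_bwa_index_files` is hypothetical.

```python
from pathlib import Path
from typing import List, Union

# Suffixes appended to the FASTA file name by `bwa index` (BWA 0.7.x)
BWA_INDEX_SUFFIXES = ('.amb', '.ann', '.bwt', '.pac', '.sa')


def missing_bwa_index_files(reference_genome: Union[str, Path]) -> List[Path]:
    """Return the expected BWA index files that do not exist yet."""
    reference_genome = Path(reference_genome)
    candidates = [Path(str(reference_genome) + suffix) for suffix in BWA_INDEX_SUFFIXES]
    return [path for path in candidates if not path.is_file()]
```

An empty list from `missing_bwa_index_files(reference_genome)` means all five index files are present.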


## Align The Reads To Reference Genome

In [None]:
def align_reads_with_bwa(reference_genome: Union[str, Path],
                         read_1: Union[str, Path],
                         read_2: Union[str, Path],
                         output_file: Union[str, Path]='aligned.sam',
                         bwa_path: Union[str, Path]='bwa') -> None:
    """
    Aligns reads to reference genome using BWA-MEM

    Parameters:
        reference_genome (str or Path): Path to the INDEXED reference genome in FASTA format.
        read_1 (str or Path): Path to the first FASTQ file (for single-end or paired-end reads).
        read_2 (str or Path): Path to the second FASTQ file (for paired-end reads).
    """