# Submodule #2: Collect and Prepare Sequence Data

### Primary Objective
Introduce learners to the essential steps for data collection, quality control, and preparation required for phylogenetic analysis. This module focuses on organizing sequence data and ensuring readiness for downstream analysis.

### Overview
- **What You'll Learn:**
  - Explore datasets, including `sequences.fasta`, `sequences_subset.fasta`, and `sequence_sars_cov_PQ230960.fasta`.
  - Prepare sequence files for alignment using Biopython.
  - Perform quality control using tools like FastQC and MultiQC.

- **Tools and Libraries:**
  - Biopython for data preprocessing.
  - FastQC and MultiQC for quality assessment.
  - Cloud-based storage for scalable resource handling.

- **Why It Matters:**
  - Proper preparation ensures accuracy in alignment and tree construction.
  - Quality control avoids errors in downstream phylogenetic analysis.


# Learning Objectives:
In Submodule 2, we build on the fundamental concepts from [**Submodule 1**](./SubModule1.ipynb) to:
- Demonstrate different ways of retrieving sequence datasets.
- Prepare those data for phylogenetic analysis.
- Discuss how cloud solutions support each of these steps.
- Implement quality control checks on the dataset.

----------------------------------------------------------------------------------------------------------------
# Training Plan 


Submodule #1: Understanding the Basics of Phylogenetics

<font color="green"> **Submodule #2: Collect and Prepare Sequence Data and Analysis** </font>

Submodule #3: Construct Phylogenetic Tree

 
Submodule #4: Analyze Phylogenetic Tree

----------------------------------------------------------------------------------------------------------------

## 2.1 Efficient Methods for Sourcing Pathogen Sequences

Efficient sourcing and preparation of sequence data are critical for accurate phylogenetic analysis.

### Steps to Source Pathogen Sequences
1. **Public Databases:**
   - Access public repositories like **GenBank**, **EMBL**, and **DDBJ** to retrieve high-quality, annotated genetic data.
2. **Automated Sequence Retrieval Tools:**
   - Tools like **Entrez Direct** and **Biopython** simplify sequence retrieval and minimize manual errors.
3. **Preprocessing Tools:**
   - Use tools like **Trimmomatic** or **Cutadapt** to remove low-quality reads and adapters from the sequences.
4. **Sequence Alignment:**
   - Use alignment tools like **MAFFT** or **ClustalW** to align sequences, ensuring proper alignment for downstream analysis.
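
The retrieval step can be sketched with NCBI's E-utilities `efetch` endpoint, which is what tools such as Entrez Direct and Biopython call under the hood. This is a minimal illustration using only the standard library, not the tutorial's own retrieval code. The helper names `build_efetch_url` and `fetch_fasta` are hypothetical, and the accession `PQ230960` is taken from the sample file name on the assumption that it is a valid nucleotide accession.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore"):
    """Build an efetch URL that returns one record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(accession, out_path):
    """Download one record and save it to out_path (requires network access)."""
    with urlopen(build_efetch_url(accession)) as response:
        data = response.read()
    with open(out_path, "wb") as handle:
        handle.write(data)
    return out_path


# Building the URL needs no network access; fetch_fasta() performs the download.
url = build_efetch_url("PQ230960")
print(url)
```

Calling `fetch_fasta("PQ230960", "record.fasta")` would perform the actual download. For bulk retrieval, NCBI asks users to respect its rate limits, so a loop over many accessions should pause between requests.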

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Use advanced search options in public databases to filter results based on organism, region, or gene of interest.
</div>

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Note:</strong> 📝 Validate the alignment using metrics like alignment scores to avoid errors during tree construction.
</div>

<div style="padding: 10px; border: 1px solid #ffccbc; border-radius: 5px; background-color: #ffebee;">
    <strong>Alert:</strong>⚠️ Poor preprocessing can lead to errors in alignment and tree construction.
</div>
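
As a rough illustration of checking an alignment before tree construction, the sketch below computes two simple sanity metrics on already-aligned sequences: the overall gap fraction and the pairwise identity. These are stand-ins chosen for brevity, not the scoring used by MAFFT or ClustalW, and the function names and toy sequences are hypothetical.

```python
def gap_fraction(aligned_seqs):
    """Return the share of all alignment cells that are gap characters."""
    total = sum(len(seq) for seq in aligned_seqs)
    gaps = sum(seq.count("-") for seq in aligned_seqs)
    return gaps / total if total else 0.0


def pairwise_identity(seq_a, seq_b):
    """Return the fraction of matching residues over columns with no gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        matches += a == b
    return matches / compared if compared else 0.0


# Toy alignment for illustration only.
aligned = ["ATG-CA", "ATGGCA", "ATGACT"]
print(round(gap_fraction(aligned), 3))
print(pairwise_identity(aligned[0], aligned[2]))
```

An unusually high gap fraction or very low identity between sequences that are expected to be closely related is a hint that preprocessing or alignment parameters deserve a second look.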

## 2.2 Importance of Cloud-Based Storage Solutions

Cloud-based storage solutions simplify the management of large-scale sequence data.

### Benefits of Cloud-Based Solutions
1. **Scalability:** Handle large datasets without local storage constraints.
2. **Accessibility:** Enable real-time collaboration by making datasets accessible from any location.
3. **Cost-Effectiveness:** Pay-as-you-go models minimize costs by charging only for used resources.
4. **Data Security and Backup:** Automatic backups and encryption protect datasets from loss or breaches.

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Note:</strong> 📝 Consider using established platforms like AWS, Google Cloud, or Azure to streamline data storage and processing.
</div>

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Leverage genomic analysis services offered by cloud providers for faster processing.
</div>

## 2.3 Enhancing Analysis with Publicly Available Datasets

Publicly available datasets improve the depth and accuracy of phylogenetic analyses.

### Benefits of Public Datasets
1. **Increased Data Volume:** Larger datasets improve statistical power and robustness.
2. **Comparative Analysis:** Analyze trends across species or geographic regions using existing data.
3. **Validation and Reproducibility:** Public datasets enable reproducible research.
4. **Collaborative Research:** Shared datasets foster collaboration and knowledge-sharing among researchers.

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Note:</strong> 📝  Verify the format and metadata compatibility of public datasets before integrating them into your analysis pipeline.
</div>
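
A format check of this kind can be sketched in a few lines of standard-library Python. The example below looks for duplicate identifiers, empty sequences, and characters outside the IUPAC nucleotide alphabet. The function names and the sample text are hypothetical, and the checks are a starting point rather than a complete validation.

```python
NUCLEOTIDES = set("ACGTUNRYKMSWBDHV-")  # IUPAC nucleotide codes plus gap


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_problems(records):
    """Return a list of human-readable problems found in FASTA records."""
    problems, seen = [], set()
    for header, seq in records:
        seq_id = header.split()[0] if header.strip() else ""
        if not seq_id:
            problems.append("record with empty header")
        elif seq_id in seen:
            problems.append(f"duplicate ID: {seq_id}")
        seen.add(seq_id)
        if not seq:
            problems.append(f"empty sequence: {seq_id}")
        bad = set(seq.upper()) - NUCLEOTIDES
        if bad:
            problems.append(f"unexpected characters in {seq_id}: {''.join(sorted(bad))}")
    return problems


# Illustrative input with three deliberate problems.
sample = ">seq1 sample A\nACGT\nNNAC\n>seq1 duplicate\nACGX\n>seq2\n"
problems = find_problems(parse_fasta(sample.splitlines()))
print(problems)
```

For protein datasets the alphabet would need to be swapped for amino acid codes.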

## Understanding the SARS-CoV-2 Dataset

The **SARS-CoV-2 dataset** contains genetic sequence data for the virus responsible for the COVID-19 pandemic. This dataset is widely used in phylogenetic studies to track mutations, understand evolutionary relationships, and analyze the spread of the virus globally.

### **Key Components of the SARS-CoV-2 Dataset**
1. **Genetic Sequences:**
   - The dataset typically contains nucleotide or protein sequences for various strains of the virus.
   - These sequences are vital for studying how the virus evolves over time and across different regions.

2. **Metadata:**
   - Associated metadata includes information such as:
     - Geographic location of sample collection.
     - Date of sample collection.
     - Variant classification or lineage (e.g., Delta, Omicron).
   - Metadata enhances analysis by providing context for understanding variations across samples.

3. **Applications:**
   - **Phylogenetic Analysis:** Track how different strains of the virus are related.
   - **Mutation Analysis:** Identify specific mutations and their impacts on viral behavior (e.g., transmissibility or resistance to vaccines).
   - **Epidemiological Studies:** Analyze how the virus spreads in specific populations.

### **Sources of SARS-CoV-2 Data**
1. **Nextstrain Project:**
   - A global initiative providing curated datasets for viral evolution studies.
   - SARS-CoV-2 datasets from Nextstrain are preprocessed for downstream phylogenetic analysis.
   - Visit the [Nextstrain documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html) for more details.

2. **NCBI Virus Database:**
   - A comprehensive resource for viral sequences and metadata.
   - The NCBI database allows users to search by strain, variant, or geographic location.
   - Access the NCBI SARS-CoV-2 dataset [here](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=taxid:2697049&USAState_s=SD).

### **Why Use SARS-CoV-2 Data?**
1. **Scientific Relevance:**
   - SARS-CoV-2 remains a high-priority topic for research due to its global impact.
   - Understanding its genetic evolution helps predict future outbreaks and design effective vaccines.

2. **Real-World Applications:**
   - This dataset is used in vaccine development, drug resistance studies, and public health policies.
   - It is also a benchmark for testing bioinformatics tools in a real-world context.

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Always cross-check the integrity and annotation quality of sequences before using them in your analysis.
</div>

## Dataset Description

To perform sequence alignment and phylogenetic analysis, we provide three sequence files in the `data/cov/sequence` directory. These files allow for flexibility based on the computational resources available and the analysis requirements.

### **1. sequences.fasta**
- This file contains the **full SARS-CoV-2 sequence dataset**.
- It is a comprehensive dataset, suitable for large-scale analyses but requires significant computational resources.
- **File Size:** Approximately 4.96 MB.

### **2. sequences_subset.fasta**
- This is a **smaller subset** of the `sequences.fasta` file.
- It is created for testing and tutorial purposes to reduce computational load during demonstrations.
- **File Size:** Approximately 308 KB.

### **3. sequence_sars_cov_PQ230960.fasta**
- This file contains **only one sequence** from the full dataset.
- It is intended for step-by-step demonstrations and quick testing of phylogenetic tools.
- **File Size:** Approximately 30.2 KB.

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Note:</strong> 📝  Use `sequences_subset.fasta` for the tutorial to follow along efficiently. The full `sequences.fasta` file is available for advanced analyses requiring detailed insights. Start with `sequence_sars_cov_PQ230960.fasta` for testing alignment or tree construction steps with minimal processing.
</div>
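
How the provided subset was produced is not stated here. As one possible approach, the sketch below copies the first few records of a FASTA file into a smaller file. The function name `write_subset` is hypothetical, and the demonstration uses temporary files instead of the tutorial data.

```python
import os
import tempfile


def write_subset(src_path, dest_path, max_records):
    """Copy the first max_records FASTA records from src_path to dest_path."""
    written = 0
    with open(src_path) as src, open(dest_path, "w") as dest:
        for line in src:
            if line.startswith(">"):
                if written == max_records:
                    break  # enough records copied
                written += 1
            if written:  # ignore anything before the first header
                dest.write(line)
    return written


# Demonstration on a throwaway three-record file.
with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, "full.fasta")
    subset = os.path.join(tmp, "subset.fasta")
    with open(full, "w") as handle:
        handle.write(">a\nACGT\n>b\nGGCC\nTTAA\n>c\nAAAA\n")
    copied = write_subset(full, subset, max_records=2)
    with open(subset) as handle:
        subset_text = handle.read()

print(copied, subset_text.count(">"))
```

Taking the first records is the simplest choice. A random or lineage-balanced sample may be more representative when the subset is used for more than a demonstration.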

### Data Preparation
**Convert the .fasta file to .fastq format using the following Python code:**


In [12]:
from Bio import SeqIO


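def fasta_to_fastq(fasta_file, fastq_file, quality=40):
    """Write each FASTA record to FASTQ with a constant placeholder Phred score."""
    with open(fastq_file, "w") as output_handle:
        for record in SeqIO.parse(fasta_file, "fasta"):
            # FASTA carries no quality data, so every base gets the same score.
            record.letter_annotations["phred_quality"] = [quality] * len(record.seq)
            SeqIO.write(record, output_handle, "fastq")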


if __name__ == '__main__':
    fasta_file = "./data/cov/sequence/Coronavirus-Orf9-NCBI.fasta"
    fastq_file = "./data/cov/sequence/Coronavirus-Orf9-NCBI.fastq"
    fasta_to_fastq(fasta_file=fasta_file, fastq_file=fastq_file)


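This code converts a `.fasta` file containing SARS-CoV-2 sequence data to `.fastq` format using Biopython. FASTA files carry no per-base quality information, so `fasta_to_fastq` assigns the same placeholder Phred score (40 by default) to every base. A score of 40 corresponds to high-quality base calls, but here it is a constant stand-in, not a measurement, so per-base quality plots from downstream QC tools will reflect the placeholder and not the original sequencing run.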
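
To make the output format concrete, the sketch below builds a single FASTQ record by hand with the standard Phred+33 encoding, in which a score of 40 is written as the character `I`. It also shows the error probability that a Phred score encodes. The function names are hypothetical and the record is a toy example.

```python
def phred_to_char(score, offset=33):
    """Encode a Phred quality score as one ASCII character (Phred+33)."""
    return chr(score + offset)


def error_probability(score):
    """Return the base-call error probability implied by a Phred score."""
    return 10 ** (-score / 10)


def fastq_record(seq_id, sequence, quality=40):
    """Format one four-line FASTQ record with a constant quality score."""
    qualities = phred_to_char(quality) * len(sequence)
    return f"@{seq_id}\n{sequence}\n+\n{qualities}\n"


record = fastq_record("demo", "ACGT")
print(record)
print(error_probability(40))
```

Each record has four lines: the identifier, the sequence, a `+` separator, and one quality character per base.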

## 2.4 Quality Control Checks

Quality control ensures the reliability and accuracy of sequence data.

### Steps for Quality Control
1. **Install and Configure FastQC:**
   - Ensure a **Java Runtime Environment (JRE)** is installed.
   - **Ubuntu Command:** `sudo apt install default-jre`
   - **RedHat Command:** `sudo yum install java-1.8.0-openjdk`
   - Verify installation:
     ```bash
     java -version
     fastqc --version
     ```

2. **Run FastQC:**
   - Analyze the quality of `.fastq` files with the `fastqc` command shown in the **Run FastQC** cell below.

<div style="padding: 10px; border: 1px solid #ffccbc; border-radius: 5px; background-color: #ffebee;">
    <strong>Alert:</strong>⚠️ Address low-quality sequences and high adapter content before proceeding to sequence alignment.
</div>
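
FastQC writes its reports to the directory given with `-o` and expects that directory to exist already. The sketch below creates the directory first and assembles the same command used in this notebook. The helper names are hypothetical, and `run_fastqc` skips the run when `fastqc` is not on the `PATH`.

```python
import shutil
import subprocess
from pathlib import Path


def build_fastqc_command(fastq_path, out_dir, threads=4):
    """Return the fastqc command line as a list of arguments."""
    return ["fastqc", "-t", str(threads), str(fastq_path), "-o", str(out_dir)]


def run_fastqc(fastq_path, out_dir, threads=4):
    """Create the output directory, then run FastQC if it is installed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    if shutil.which("fastqc") is None:
        print("fastqc not found on PATH; skipping run")
        return None
    command = build_fastqc_command(fastq_path, out_dir, threads)
    return subprocess.run(command, check=True)


command = build_fastqc_command(
    "./data/cov/sequence/Coronavirus-Orf9-NCBI.fastq", "./data/cov/qc/"
)
print(" ".join(command))
```

Calling `run_fastqc(...)` with the same two paths would create `./data/cov/qc/` if needed and then start the analysis.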

### Run FastQC

In [13]:
!fastqc -t 4 ./data/cov/sequence/Coronavirus-Orf9-NCBI.fastq -o ./data/cov/qc/

Started analysis of Coronavirus-Orf9-NCBI.fastq
Analysis complete for Coronavirus-Orf9-NCBI.fastq
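
Alongside the HTML report, FastQC produces a zip archive containing a `summary.txt` file with one tab-separated line per analysis module: the status (`PASS`, `WARN`, or `FAIL`), the module name, and the input file name. The sketch below parses text in that layout and lists the modules that need attention. The function names are hypothetical, and the sample text is made up for illustration; it is not output from this run.

```python
def parse_fastqc_summary(text):
    """Map each FastQC module name to its status from summary.txt text."""
    results = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            status, module = parts[0], parts[1]
            results[module] = status
    return results


def flagged_modules(results):
    """Return the modules whose status is anything other than PASS."""
    return [module for module, status in results.items() if status != "PASS"]


# Illustrative summary text, not real FastQC output.
sample_summary = (
    "PASS\tBasic Statistics\tdemo.fastq\n"
    "WARN\tPer base sequence content\tdemo.fastq\n"
    "FAIL\tAdapter Content\tdemo.fastq\n"
)
flagged = flagged_modules(parse_fastqc_summary(sample_summary))
print(flagged)
```

MultiQC, listed among this module's tools, can aggregate several FastQC reports into a single report. It is typically run as `multiqc` followed by the directory that holds the FastQC output.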


## Interactive Quiz

Test your understanding of sequence data preparation and quality control using this interactive quiz:

In [14]:
from jupyterquiz import display_quiz
display_quiz('Quiz/QS2.json')


# More Datasets to explore

### KEGG Dataset
**Dataset 1: KEGG for Phylogenetic Tree**

**KEGG (Kyoto Encyclopedia of Genes and Genomes)** provides a wealth of data for understanding the high-level functions and utilities of biological systems. To download KEGG data:
1. **Access the KEGG dataset:** KEGG provides a website for data retrieval.
    - https://www.genome.jp/kegg/seq/
2. **Download the file:** Download the FASTA file from the "FASTA sequence files" section of that page.
    - More resources on KEGG: https://www.genome.jp/kegg/


## Download KEGG dataset from URL

### Create KEGG folder to store data

In [15]:
import os

# Check whether the directory already exists
kegg_dir_exists = os.path.isdir('./data/kegg')

# If the directory does not exist, create it
if not kegg_dir_exists:
    try:
        os.makedirs('./data/kegg/')
        print("Directory created successfully")
    except OSError as e:
        print(f"An error occurred: {e}")
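
The cell above only prepares the folder. A download step could look like the sketch below, which streams any URL to a local file using the standard library. The helper name `download_file` is hypothetical, and no KEGG file URL is assumed: the reader supplies the link copied from the KEGG page. The demonstration uses a local `file://` URL so that it runs without network access.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download_file(url, dest_path):
    """Stream the content at url into dest_path, creating parent folders."""
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response, open(dest, "wb") as handle:
        shutil.copyfileobj(response, handle)
    return dest


# Demonstration with a local file standing in for a remote FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.fasta"
    source.write_text(">demo\nACGT\n")
    saved = download_file(source.as_uri(), Path(tmp) / "kegg" / "demo.fasta")
    downloaded_text = saved.read_text()

print(downloaded_text)
```

With a real link, the call would take the form `download_file(url, "./data/kegg/<file name>")`, where both the URL and the file name come from the KEGG download page.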

## UniProt Dataset
**Dataset 2: UniProt for Phylogenetic Tree**

UniProt is a comprehensive resource for protein sequence and functional information. To download UniProt data:
1. **Access the UniProt website:** Visit the UniProt website and search for the desired protein sequences.
2. **Retrieve the data:** Download the isoform sequences FASTA file from the website.
    - File name: `uniprot_sprot_varsplic.fasta`

## Download UniProt Dataset from URL

In [16]:
import os

# Check whether the directory already exists
uniport_dir_exists = os.path.isdir('./data/uniport')

# If the directory does not exist, create it
if not uniport_dir_exists:
    try:
        os.makedirs('./data/uniport/')
        print("Directory created successfully")
    except OSError as e:
        print(f"An error occurred: {e}")
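
Once the isoform file is in place, its headers can be split into fields for filtering or labelling tree tips. UniProt FASTA headers follow the pattern `>db|accession|entry_name description OS=... OX=... GN=...`. The sketch below parses that pattern with a regular expression. The function name is hypothetical, and the sample header is invented for illustration; it is not a real UniProt entry.

```python
import re

HEADER_RE = re.compile(
    r"^>?(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry>\S+)\s+(?P<rest>.*)$"
)


def parse_uniprot_header(header):
    """Split a UniProt FASTA header into a dict of fields, or return None."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None
    info = match.groupdict()
    rest = info.pop("rest")
    # Split before each KEY= tag; the first piece is the free-text description.
    fields = re.split(r"\s(?=(?:OS|OX|GN|PE|SV)=)", rest)
    info["description"] = fields[0]
    for field in fields[1:]:
        key, _, value = field.partition("=")
        info[key] = value
    return info


# Invented header in UniProt's layout, for illustration only.
sample_header = ">sp|P00000-2|DEMO_HUMAN Isoform 2 of Demo protein OS=Homo sapiens OX=9606 GN=DEMO"
parsed = parse_uniprot_header(sample_header)
print(parsed)
```

The `OS` (organism name) and `OX` (taxonomy identifier) fields are the ones most often used to group sequences by species.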

### Summary
This notebook covered how to source pathogen sequences from public databases, why cloud-based storage helps with large-scale sequence data, and how the SARS-CoV-2 dataset used in this tutorial is organized. It converted a FASTA file to FASTQ format with Biopython, ran FastQC on the result, and introduced the KEGG and UniProt datasets, including Python scripts that create the `./data/kegg` and `./data/uniport` directories for organizing them. An interactive quiz reinforces the material.

In the next submodule, we will walk through the steps for constructing a phylogenetic tree.