# Submodule 2:
## Collect and Prepare Sequence Data for Analysis

### 2.1 Demonstrate Efficient Methods for Sourcing Pathogen Sequences and Preparing Data for Phylogenetic Analysis
Efficient methods for sourcing pathogen sequences and preparing data include:
* Public Databases: Utilize public repositories like GenBank, EMBL, and DDBJ for sourcing high-quality pathogen sequences. These databases provide comprehensive and well-annotated genetic data.
* Sequence Retrieval Tools: Use tools like Entrez Direct and Biopython to automate the retrieval of sequences from public databases, ensuring efficiency and reducing the risk of manual errors.
* Data Cleaning and Preprocessing: Implement tools such as Trimmomatic for trimming low-quality reads and removing adapters from raw sequence data. This step is crucial for ensuring the integrity of the sequences used in downstream analysis.
* Sequence Alignment: Use alignment tools like MAFFT or ClustalW to align sequences, inserting gaps so that homologous positions line up in the same column. Proper alignment is essential for accurate phylogenetic tree construction.
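
The retrieval step above can be sketched with the Python standard library alone. Entrez Direct and Biopython's `Bio.Entrez` both wrap the NCBI E-utilities service; this minimal sketch builds an `efetch` request directly. The helper names `build_efetch_url` and `fetch_fasta` are hypothetical, and the accession `NC_045512.2` (the SARS-CoV-2 Wuhan-Hu-1 reference genome) is only an illustrative choice, not a sequence used elsewhere in this notebook.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions, db="nuccore"):
    """Return an efetch URL requesting the given accessions as FASTA text."""
    params = {
        "db": db,
        "id": ",".join(accessions),  # several IDs go in one comma-separated field
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(accessions, db="nuccore", timeout=30):
    """Download the records and return them as one FASTA-formatted string."""
    with urlopen(build_efetch_url(accessions, db), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call fetch_fasta(...) to perform the download.
url = build_efetch_url(["NC_045512.2"])
print(url)
```

Separating URL construction from the network call keeps the request easy to inspect and log before any data is transferred. For bulk retrieval, NCBI asks clients to limit their request rate, so a real pipeline would add a delay or an API key.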


### 2.2 Discuss the Importance of Cloud-Based Storage Solutions in Managing Metagenomic Sequence Data
Cloud-based storage solutions are vital for managing metagenomic sequence data due to:
- **Scalability:** Cloud storage offers scalable solutions to handle the vast amounts of data generated by metagenomic studies. It allows researchers to store large datasets without worrying about local storage limitations.
- **Accessibility:** Cloud storage ensures that data can be accessed from anywhere, facilitating collaboration among researchers across different geographical locations.
- **Cost-Effectiveness:** Cloud services often operate on a pay-as-you-go model, making it cost-effective as researchers only pay for the storage and computational resources they use.
- **Data Security and Backup:** Cloud providers offer robust security measures and automatic backups, protecting valuable data from loss or unauthorized access.


### 2.3 Explain How the Incorporation of Publicly Available Datasets from Reputable Metagenomic Databases Enhances the Depth of Analysis
Incorporating publicly available datasets from reputable metagenomic databases enhances analysis by:
- **Increased Data Volume:** Access to a large volume of data increases the statistical power and robustness of analyses, enabling more accurate and comprehensive studies.
- **Comparative Analysis:** Public datasets provide a wealth of comparative data, allowing researchers to identify patterns, variations, and evolutionary trends across different studies and datasets.
- **Validation and Reproducibility:** Using standardized, publicly available data lets other researchers validate and reproduce the results, enhancing the credibility of the findings.
- **Resource Sharing:** Public databases foster a collaborative environment where researchers can share resources, tools, and data, accelerating the pace of scientific discovery.

### KEGG Dataset
**Dataset 1: KEGG for Phylogenetic Tree**

**Downloading the KEGG Dataset**

**KEGG (Kyoto Encyclopedia of Genes and Genomes)** provides a wealth of data for understanding high-level functions and utilities of biological systems. To download KEGG data:
1. Access the KEGG dataset: KEGG provides a website for data retrieval.
    - https://www.genome.jp/kegg/seq/
2. Retrieve the data: the file used here was downloaded from the FASTA sequence files section of the website.
    - More resources on KEGG: https://www.genome.jp/kegg/


## Download KEGG dataset from URL

In [4]:
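# Create the target directory first, then save with -O (capital O names the output file;
# lowercase -o would write wget's log messages to that path instead of the data).
!mkdir -p "./data/kegg"
!wget -O "./data/kegg/br01553.fasta" "https://firebasestorage.googleapis.com/v0/b/reactfirebase-142f5.appspot.com/o/nosi-phylogeny%2Fbr01553.fasta?alt=media&token=08d87736-6f5b-4512-94c8-16531c87aa38"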

## UniProt Dataset
**Dataset 2: UniProt for Phylogenetic Tree**

**Downloading the UniProt Dataset**

UniProt is a comprehensive resource for protein sequence and functional information. To download UniProt data:
1. Access the UniProt website: Visit the UniProt website and search for the desired protein sequences.
2. Retrieve the data: the isoform sequences FASTA file was downloaded from the website.
    - File name: uniprot_sprot_varsplic.fasta

## Download UniProt Dataset from URL

In [None]:
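# Create the target directory first, then save with -O (capital O names the output file;
# lowercase -o would write wget's log messages to that path instead of the data).
!mkdir -p "./data/uniport"
!wget -O "./data/uniport/uniprot_sprot_varsplic.fasta" "https://firebasestorage.googleapis.com/v0/b/reactfirebase-142f5.appspot.com/o/uniprot_sprot_varsplic.fasta?alt=media&token=a7695fee-82a3-4100-983c-5eae5db474e9"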

### SARS-CoV-2 Dataset
The SARS-CoV-2 sequence data was downloaded from the Nextstrain project:
1. Dataset documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
2. The specific file used is: data/nextstrain/sars-cov-2/wuhan-hu-1/proteins/sequences.fasta

## Download SARS-CoV-2 Dataset from URL

In [None]:
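# Create the target directory first, then save with -O (capital O names the output file;
# lowercase -o would write wget's log messages to that path instead of the data).
!mkdir -p "./data/cov"
!wget -O "./data/cov/sequences.fasta" "https://firebasestorage.googleapis.com/v0/b/reactfirebase-142f5.appspot.com/o/nosi-phylogeny%2Fsequences.fasta?alt=media&token=4de7706b-0084-4846-b829-79d8ea35d327"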
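
Before moving on, it can help to confirm that each downloaded file really is FASTA and not, for example, an empty file or an HTML error page. The following is a minimal sketch using only the standard library; `summarize_fasta` is a hypothetical helper, and the demo writes a tiny temporary file so the cell runs without the downloads being present.

```python
import tempfile
from pathlib import Path


def summarize_fasta(path):
    """Return (record_count, residue_count) for a FASTA file.

    Raises ValueError if the file contains no '>' header lines.
    """
    n_records = 0
    n_residues = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                n_records += 1  # each header line starts a new record
            else:
                n_residues += len(line)
    if n_records == 0:
        raise ValueError(f"{path} contains no FASTA records")
    return n_records, n_residues


# Demo on a small temporary file (two records, seven residues in total).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.fasta"
    demo_path.write_text(">seq1\nMKV\nLT\n>seq2\nMA\n")
    demo_summary = summarize_fasta(demo_path)

print(demo_summary)
```

To check the real downloads, call `summarize_fasta("data/cov/sequences.fasta")` and likewise for the KEGG and UniProt files.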

### Data Preparation
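**Convert the .fasta file to .fastq format using the following Python code.** FASTA files carry no quality scores, so the code assigns every residue a constant placeholder Phred quality (40 by default); downstream quality plots therefore reflect this placeholder, not measured base quality.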


In [None]:
from Bio import SeqIO


def fasta_to_fastq(fasta_file, fastq_file, quality=40):
    """Write every FASTA record to FASTQ with a constant placeholder Phred quality."""
    with open(fastq_file, "w") as output_handle:
        for record in SeqIO.parse(fasta_file, "fasta"):
            # FASTQ needs one quality score per residue; FASTA has none, so use a fixed value.
            record.letter_annotations["phred_quality"] = [quality] * len(record.seq)
            SeqIO.write(record, output_handle, "fastq")


if __name__ == "__main__":
    fasta_file = "data/cov/sequences.fasta"
    fastq_file = "data/cov/sequences_converted.fastq"
    fasta_to_fastq(fasta_file=fasta_file, fastq_file=fastq_file)
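
As a rough check on the converted output, the sketch below shows how a Phred score maps to a FASTQ quality character and validates the four-line record structure. It assumes the Sanger encoding (ASCII offset 33), which is what Biopython's `"fastq"` format writes; `phred_to_char` and `validate_fastq_text` are hypothetical helpers, and the demo record is made up for illustration.

```python
def phred_to_char(quality, offset=33):
    """Encode a Phred score as a Sanger-style FASTQ character (ASCII offset 33)."""
    return chr(quality + offset)


def validate_fastq_text(text):
    """Check four-line records with matching sequence/quality lengths.

    Returns the number of records; raises ValueError on a malformed record.
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"bad record markers near line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch near line {i + 1}")
    return len(lines) // 4


# A quality of 40 becomes chr(40 + 33) = chr(73) = "I".
q_char = phred_to_char(40)
demo_record = "@seq1\nMKVLT\n+\n" + q_char * 5 + "\n"
n_records = validate_fastq_text(demo_record)
print(q_char, n_records)
```

For the real file, read `data/cov/sequences_converted.fastq` into a string and pass it to `validate_fastq_text`; every quality line should consist solely of the character `I`.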


### 2.4 Implement Quality Control Checks Using Tools Like MultiQC and FastQC to Ensure Data Integrity
**Using FastQC for Quality Control**
- Follow the installation and setup instructions at:
    - https://raw.githubusercontent.com/s-andrews/FastQC/master/INSTALL.txt
- Install Java first, because FastQC runs on Java.
- Run FastQC from the command line (`-t 4` uses four threads):
    - `fastqc -t 4 ./data/uniport/uniprot_sprot_varsplic.fastq`
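
The section heading also mentions MultiQC, which aggregates FastQC reports into a single summary. The sketch below shows one way the two tools might be chained from Python. It assumes both `fastqc` and `multiqc` are on the `PATH`; the output directory name `qc_reports` and the helper names are hypothetical. Only the command lists are built when the cell runs; call `run_qc(...)` to execute them.

```python
import os
import shutil
import subprocess


def build_qc_commands(fastq_files, out_dir="qc_reports", threads=4):
    """Return the FastQC and MultiQC command lines as argument lists."""
    # fastqc: -t sets the thread count, -o the output directory.
    fastqc_cmd = ["fastqc", "-t", str(threads), "-o", out_dir, *fastq_files]
    # multiqc: scan out_dir for FastQC results and write the summary there too.
    multiqc_cmd = ["multiqc", out_dir, "-o", out_dir]
    return fastqc_cmd, multiqc_cmd


def run_qc(fastq_files, out_dir="qc_reports", threads=4):
    """Run FastQC then MultiQC; return False if either tool is not installed."""
    missing = [tool for tool in ("fastqc", "multiqc") if shutil.which(tool) is None]
    if missing:
        print("Missing tools:", ", ".join(missing))
        return False
    os.makedirs(out_dir, exist_ok=True)  # FastQC expects the output directory to exist
    for cmd in build_qc_commands(fastq_files, out_dir, threads):
        subprocess.run(cmd, check=True)  # raise if a tool exits with an error
    return True


fastqc_cmd, multiqc_cmd = build_qc_commands(
    ["./data/uniport/uniprot_sprot_varsplic.fastq"]
)
print(" ".join(fastqc_cmd))
print(" ".join(multiqc_cmd))
```

Passing the commands as argument lists, not as a single shell string, avoids quoting problems with file paths that contain spaces.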
