# Submodule 2: Collect and Prepare Sequence Data and Analysis

# Learning Objectives
In Submodule 2 we build on the fundamental concepts from [**Submodule 1**](./SubModule1.ipynb) to do the following:
- Demonstrate different ways of obtaining sequence datasets
- Prepare those data for phylogenetic analysis
- Discuss how cloud solutions help with these tasks
- Implement quality control checks on the dataset

----------------------------------------------------------------------------------------------------------------
# Training Plan 


Submodule #1: Understanding the Basics of Phylogenetics

<font color="green"> **Submodule #2: Collect and Prepare Sequence Data and Analysis** </font>

Submodule #3: Construct Phylogenetic Tree

 
Submodule #4: Analyze Phylogenetic Tree

----------------------------------------------------------------------------------------------------------------

### 2.1 Demonstrate Efficient Methods for Sourcing Pathogen Sequences and Preparing Data for Phylogenetic Analysis
Efficient methods for sourcing pathogen sequences and preparing data include:
* Public Databases: Utilize public repositories like GenBank, EMBL, and DDBJ for sourcing high-quality pathogen sequences. These databases provide comprehensive and well-annotated genetic data.
* Sequence Retrieval Tools: Use tools like Entrez Direct and Biopython to automate the retrieval of sequences from public databases, ensuring efficiency and reducing the risk of manual errors.
* Data Cleaning and Preprocessing: Implement tools such as Trimmomatic for trimming low-quality reads and removing adapters from raw sequence data. This step is crucial for ensuring the integrity of the sequences used in downstream analysis.
* Sequence Alignment: Use alignment tools like MAFFT or ClustalW to align sequences, inserting gaps so that homologous positions line up in the same columns. Proper alignment is essential for accurate phylogenetic tree construction.
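
As a small illustration of automated sequence retrieval, the sketch below builds a request URL for the NCBI E-utilities `efetch` service, which is the service that Entrez Direct and Biopython's `Bio.Entrez` wrap. The function names `build_efetch_url` and `fetch_fasta` are hypothetical helpers, and the accession `PQ230960` is simply the one that appears in the subset filename used later in this notebook. Only the URL is built here; calling `fetch_fasta` requires network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions, db="nuccore"):
    """Return an efetch URL requesting FASTA text for the given accessions."""
    params = {
        "db": db,                     # nucleotide database
        "id": ",".join(accessions),   # one or more accession numbers
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(accessions, db="nuccore"):
    """Download the records as one FASTA string (requires network access)."""
    with urlopen(build_efetch_url(accessions, db)) as response:
        return response.read().decode("utf-8")


# Build (but do not fetch) the URL for a single accession
url = build_efetch_url(["PQ230960"])
print(url)
```

For large or repeated requests, NCBI asks users to identify themselves and respect rate limits, so a production workflow would more likely use Entrez Direct or `Bio.Entrez` than raw URLs.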


### 2.2 Discuss the Importance of Cloud-Based Storage Solutions in Managing Metagenomic Sequence Data
Cloud-based storage solutions are vital for managing metagenomic sequence data due to:
- **Scalability:** Cloud storage offers scalable solutions to handle the vast amounts of data generated by metagenomic studies. It allows researchers to store large datasets without worrying about local storage limitations.
- **Accessibility:** Cloud storage ensures that data can be accessed from anywhere, facilitating collaboration among researchers across different geographical locations.
- **Cost-Effectiveness:** Cloud services often operate on a pay-as-you-go model, making it cost-effective as researchers only pay for the storage and computational resources they use.
- **Data Security and Backup:** Cloud providers offer robust security measures and automatic backups, protecting valuable data from loss or unauthorized access.


### 2.3 Explain How the Incorporation of Publicly Available Datasets from Reputable Metagenomic Databases Enhances the Depth of Analysis
Incorporating publicly available datasets from reputable metagenomic databases enhances analysis by:
- **Increased Data Volume:** Access to a large volume of data increases the statistical power and robustness of analyses, enabling more accurate and comprehensive studies.
- **Comparative Analysis:** Public datasets provide a wealth of comparative data, allowing researchers to identify patterns, variations, and evolutionary trends across different studies and datasets.
- **Validation and Reproducibility:** Using standardized, publicly available data ensures that results can be validated and reproduced by other researchers, enhancing the credibility of the findings.
- **Resource Sharing:** Public databases foster a collaborative environment where researchers can share resources, tools, and data, accelerating the pace of scientific discovery.

# Create a Data Directory to Store Data

In [None]:
import os

# Create ./data only if it does not already exist
os.makedirs('./data', exist_ok=True)

This code checks if a directory named data exists in the current working directory. If it does not exist, it creates the directory. This is useful for organizing and storing datasets.

### SARS-CoV-2 Dataset
The SARS-CoV-2 sequence data used here were downloaded from the Nextstrain project:
1. Dataset documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
2. The specific file used is: data/nextstrain/sars-cov-2/wuhan-hu-1/proteins/sequences.fasta
3. Official download from NCBI Virus:
    1. Open: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=taxid:2697049&USAState_s=SD
    2. Choose any sequences you want and download them as a FASTA file.

## Download SARS-CoV-2 Dataset from URL

In [None]:
import os

# Check if the directory exists
cov_dir = os.path.isdir('./data/cov')

# If the directory does not exist, create it
if not cov_dir:
    try:
        os.makedirs('./data/cov/')
        print("Directory created successfully")
    except Exception as e:
        print(f"An error occurred: {e}")

In [None]:
%pwd

### Download SARS-CoV-2 sequence file from the specified URL and save it in the data/cov folder.

In [None]:
!wget -O "./data/cov/sequences.fasta" "https://firebasestorage.googleapis.com/v0/b/reactfirebase-142f5.appspot.com/o/nosi-phylogeny%2Fsequences.fasta?alt=media&token=4de7706b-0084-4846-b829-79d8ea35d327"

#### Download Just a Subset of the Data
Uncomment the command below to download a smaller subset for testing, if required.

In [None]:
# !wget -O "./data/cov/sequence_PQ230960.fasta" "https://firebasestorage.googleapis.com/v0/b/reactfirebase-142f5.appspot.com/o/nosi-phylogeny%2Fsequence_sars_cov_PQ230960.fasta?alt=media&token=ba5ba0c3-a12c-4752-a4bc-63f355761984"
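
After a download it is worth confirming that the file really is a non-empty FASTA file. The following is a minimal sketch using only the standard library; `read_fasta` and `summarize_fasta` are hypothetical helper names, not part of Biopython, and the check runs only if the downloaded file is present.

```python
import os


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta(path):
    """Return the record count and the shortest and longest sequence lengths."""
    lengths = [len(seq) for _, seq in read_fasta(path)]
    return {
        "records": len(lengths),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
    }


fasta_path = "./data/cov/sequences.fasta"
if os.path.exists(fasta_path):
    print(summarize_fasta(fasta_path))
```

A record count of zero, or unexpectedly short sequences, usually means the download failed or returned an error page instead of sequence data.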

### Data Preparation
**Convert the .fasta file to .fastq format using the following Python code:**


In [None]:
!pip install biopython

In [None]:
from Bio import SeqIO


def fasta_to_fastq(fasta_file, fastq_file, quality=40):
    """Write every FASTA record to FASTQ with one constant Phred score per base."""
    with open(fastq_file, "w") as output_handle:
        for record in SeqIO.parse(fasta_file, "fasta"):
            # FASTA carries no quality information, so assign a placeholder score
            record.letter_annotations["phred_quality"] = [quality] * len(record.seq)
            SeqIO.write(record, output_handle, "fastq")


if __name__ == '__main__':
    # The input FASTA file must already exist at this path
    fasta_file = "./data/cov/sequences_subset.fasta"
    fastq_file = "./data/cov/sequences_subset_converted.fastq"
    fasta_to_fastq(fasta_file=fasta_file, fastq_file=fastq_file)


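This code converts a .fasta file containing SARS-CoV-2 sequence data to .fastq format using Biopython. Because FASTA files carry no quality information, the fasta_to_fastq function assigns a placeholder Phred quality score of 40 (the default) to every base. The resulting scores are synthetic, so quality plots produced from this file will be uniformly high and do not reflect real sequencing quality. Note that the download cells above save to sequences.fasta (and, optionally, sequence_PQ230960.fasta), so make sure a file exists at ./data/cov/sequences_subset.fasta or adjust fasta_file before running.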
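
To make the conversion concrete, the sketch below formats a single FASTQ record by hand, using the Phred+33 (Sanger) encoding that Biopython's "fastq" format writes. The helper name `fastq_record` and the example sequence are hypothetical and only illustrate the file layout.

```python
def fastq_record(name, sequence, quality=40):
    """Format one FASTQ record with the same Phred score for every base."""
    # Phred+33 encoding: the quality character is chr(score + 33)
    quality_char = chr(quality + 33)
    return f"@{name}\n{sequence}\n+\n{quality_char * len(sequence)}\n"


print(fastq_record("example", "ACGT"), end="")
# @example
# ACGT
# +
# IIII
```

Each record has four lines: the identifier, the sequence, a `+` separator, and one quality character per base. A score of 40 maps to the character `I`.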

### 2.4 Implement Quality Control Checks Using Tools Like MultiQC and FastQC to Ensure Data Integrity
**Using FastQC for Quality Control**
- FastQC is a Java application. To run it, your system needs a suitable Java Runtime Environment (JRE), so check that one is installed before you try to run FastQC. A number of different JREs are available; the ones that have been tested are the latest Oracle runtime environments and those from the AdoptOpenJDK project (https://adoptopenjdk.net/). Download and install a suitable 64-bit JRE and make sure that the java application is in your path (most installers will take care of this for you).

On Linux, most distributions have Java installed already, so you might not need to do anything. If Java isn't installed, you can add it with:

- Ubuntu / Mint: **sudo apt install default-jre**

- CentOS / Redhat: **sudo yum install java-1.8.0-openjdk**

You can check whether Java is installed by opening the 'cmd' program on Windows, or any shell on Linux, and typing:

**java -version**

You should see something like:

- openjdk version "11.0.2" 2019-01-15
- OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.2+9)
- OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.2+9, mixed mode)
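
The same check can be run from a notebook cell. The sketch below is one possible approach, assuming `java` is looked up on the PATH; `java_version_line` is a hypothetical helper name. Note that `java -version` prints its report to standard error rather than standard output.

```python
import shutil
import subprocess


def java_version_line():
    """Return the first line of `java -version`, or None if Java is unavailable."""
    java = shutil.which("java")
    if java is None:
        return None
    try:
        result = subprocess.run(
            [java, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # The version report is normally written to stderr
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else None


version = java_version_line()
print(version if version else "Java was not found on the PATH")
```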


### Installation of FastQC

### Download and Install FastQC

In [None]:
!conda install bioconda::fastqc -y

### Verify the Installation
To verify that FastQC is installed correctly, you can check its version:

In [None]:
!fastqc --version

### Using FastQC
You can now use FastQC to analyze your sequence data. For example:

In [None]:
!fastqc -t 4 ./data/cov/sequences_subset_converted.fastq
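
FastQC writes an HTML report and a zip archive next to the input file; for the command above these would be named sequences_subset_converted_fastqc.html and sequences_subset_converted_fastqc.zip. The archive contains a tab-separated summary.txt with one PASS, WARN or FAIL status per analysis module. The sketch below reads that summary; `read_fastqc_summary` is a hypothetical helper name, and the example runs only if the archive exists.

```python
import os
import zipfile


def read_fastqc_summary(zip_path):
    """Return {module name: status} parsed from summary.txt inside a FastQC zip."""
    statuses = {}
    with zipfile.ZipFile(zip_path) as archive:
        # summary.txt sits inside a single top-level folder in the archive
        summary_name = next(
            name for name in archive.namelist() if name.endswith("summary.txt")
        )
        with archive.open(summary_name) as handle:
            for raw in handle:
                # Each line is: STATUS <tab> module name <tab> input file name
                parts = raw.decode("utf-8").strip().split("\t")
                if len(parts) >= 2:
                    statuses[parts[1]] = parts[0]
    return statuses


report_zip = "./data/cov/sequences_subset_converted_fastqc.zip"
if os.path.exists(report_zip):
    for module, status in read_fastqc_summary(report_zip).items():
        print(f"{status}\t{module}")
```

Reading the summary programmatically is convenient when many files are checked at once; MultiQC serves the same purpose by aggregating FastQC reports into a single page.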

# More Datasets to Explore

### KEGG Dataset
Dataset 1: KEGG for Phylogenetic Tree

Downloading KEGG Dataset

**KEGG (Kyoto Encyclopedia of Genes and Genomes)** provides a wealth of data for understanding high-level functions and utilities of biological systems. To download KEGG data:
1. Access the KEGG dataset: KEGG provides a website for data retrieval.
    - https://www.genome.jp/kegg/seq/
2. Download the file from the FASTA sequence files section of the website.
    - More resources on KEGG: https://www.genome.jp/kegg/


## Download KEGG Dataset from URL

### Create KEGG Folder to Store Data

In [None]:
import os

# Check if the directory exists
kegg_dir = os.path.isdir('./data/kegg')

# If the directory does not exist, create it
if not kegg_dir:
    try:
        os.makedirs('./data/kegg/')
        print("Directory created successfully")
    except Exception as e:
        print(f"An error occurred: {e}")

In [None]:
!wget -O "./data/kegg/br01553.fasta" "https://firebasestorage.googleapis.com/v0/b/reactfirebase-142f5.appspot.com/o/nosi-phylogeny%2Fbr01553.fasta?alt=media&token=08d87736-6f5b-4512-94c8-16531c87aa38"

## UniProt Dataset
Dataset 2: UniProt for Phylogenetic Tree

Downloading UniProt Dataset

UniProt is a comprehensive resource for protein sequence and functional information. To download UniProt data:
1. Access the UniProt website: Visit the UniProt website and search for the desired protein sequences.
2. Retrieve data: Download the isoform sequences FASTA file from the website.
    - File name: uniprot_sprot_varsplic.fasta

## Download UniProt Dataset from URL

In [None]:
import os

# Check if the directory exists
uniport_dir = os.path.isdir('./data/uniport')

# If the directory does not exist, create it
if not uniport_dir:
    try:
        os.makedirs('./data/uniport/')
        print("Directory created successfully")
    except Exception as e:
        print(f"An error occurred: {e}")

In [None]:
!wget -O "./data/uniport/uniprot_sprot_varsplic.fasta" "https://firebasestorage.googleapis.com/v0/b/reactfirebase-142f5.appspot.com/o/uniprot_sprot_varsplic.fasta?alt=media&token=a7695fee-82a3-4100-983c-5eae5db474e9"
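
The create-a-folder-then-download pattern used for each dataset in this notebook can be wrapped in one reusable function. The following is a minimal sketch using only the standard library instead of `wget`; `download_if_missing` is a hypothetical helper name, and skipping files that already exist is a design choice made here to avoid repeated downloads.

```python
import os
import shutil
from urllib.request import urlopen


def download_if_missing(url, destination):
    """Download url to destination unless the file already exists; return the path."""
    if os.path.exists(destination):
        return destination
    # Create the parent folder (for example ./data/kegg) when needed
    os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
    with urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


# Example call (the URL is the one passed to wget in the corresponding cell):
# download_if_missing(url, "./data/kegg/br01553.fasta")
```

Because an interrupted download would leave a partial file that later calls skip, delete the destination file and rerun if a download fails midway.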