# **University of South Dakota: Phylogenetic Analysis**

## **Submodule #2: Collect and Prepare Sequence Data**

### **Primary Objective**
Introduce learners to the essential steps of data collection and preparation required for phylogenetic analysis. This module focuses on organizing sequence data and ensuring it is ready for downstream analysis.

### **Overview**
- **What You'll Learn:**
  - Explore datasets, including `sequences.fasta`.
    




### **Learning Objectives:**
In Submodule 2, we will build on the fundamental concepts from [**Submodule 1**](./SubModule1.ipynb) to do the following:
- Demonstrate different ways of retrieving sequence datasets
- Prepare those data for phylogenetic analysis


----------------------------------------------------------------------------------------------------------------

### **2.1 Efficient Methods for Sourcing Pathogen Sequences**

Efficient sourcing and preparation of sequence data are critical for accurate phylogenetic analysis.

#### **Steps to Source Pathogen Sequences**
1. **Public Databases:**
   - Access public repositories like *GenBank*, *EMBL*, and *DDBJ* to retrieve high-quality, annotated genetic data.
2. **Automated Sequence Retrieval Tools:**
   - Tools like *Entrez Direct* and *Biopython* simplify sequence retrieval and minimize manual errors.
3. **Sequence Alignment:**
   - Use an alignment tool such as *MAFFT* to align the sequences so that homologous positions line up for downstream analysis.
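
As a rough sketch of how the alignment step might be scripted, the functions below build a basic MAFFT command and run it only if `mafft` is found on the PATH. The function names and file paths are illustrative assumptions, not part of this module's pipeline. The `--auto` option lets MAFFT pick a strategy, and MAFFT writes the alignment to standard output, so the sketch redirects it to a file.

```python
import shutil
import subprocess
from pathlib import Path


def build_mafft_command(input_fasta):
    """Return the argument list for a basic MAFFT run (alignment goes to stdout)."""
    return ["mafft", "--auto", str(input_fasta)]


def align_with_mafft(input_fasta, output_fasta):
    """Run MAFFT if it is installed; return True when an alignment was written."""
    if shutil.which("mafft") is None:
        print("mafft not found on PATH; skipping alignment")
        return False
    with open(output_fasta, "w") as handle:
        # check=True raises CalledProcessError if MAFFT exits with an error
        subprocess.run(build_mafft_command(input_fasta), stdout=handle, check=True)
    return True


# Hypothetical input path, used only to show the resulting command
command = build_mafft_command(Path("data/cov/sequence/sequences.fasta"))
print(" ".join(command))
```

Keeping the command builder separate from the runner makes it easy to inspect the exact command before committing to a long alignment.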

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Use advanced search options in public databases to filter results based on organism, region, or gene of interest.
</div>

### **2.2 Enhancing Analysis with Publicly Available Datasets**

Publicly available datasets improve the depth and accuracy of phylogenetic analyses.

#### **Benefits of Public Datasets**
1. <u>Increased Data Volume</u>:  Larger datasets improve statistical power and robustness.
2. <u>Comparative Analysis</u>:  Analyze trends across species or geographic regions using existing data.
3. <u>Validation and Reproducibility</u>:  Public datasets enable reproducible research.
4. <u>Collaborative Research</u>:  Shared datasets foster collaboration and knowledge-sharing among researchers.

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Note:</strong> 📝  Verify the format and metadata compatibility of public datasets before integrating them into your analysis pipeline.
</div>

### **2.3 Understanding the SARS-CoV-2 Dataset**

The **SARS-CoV-2 dataset** contains genetic sequence data for the virus responsible for the COVID-19 pandemic. This dataset is widely used in phylogenetic studies to track mutations, understand evolutionary relationships, and analyze the spread of the virus globally.

#### **Key Components of the SARS-CoV-2 Dataset**
1. **Genetic Sequences:**
   - The dataset typically contains nucleotide or protein sequences for various strains of the virus.
   - These sequences are vital for studying how the virus evolves over time and across different regions.

2. **Metadata:**
   - Associated metadata includes information such as:
     - Geographic location of sample collection.
     - Date of sample collection.
     - Variant classification or lineage (e.g., Delta, Omicron).
   - Metadata enhances analysis by providing context for understanding variations across samples.

3. **Applications:**
   - <u>Phylogenetic Analysis</u>: Track how different strains of the virus are related.
   - <u>Mutation Analysis</u>: Identify specific mutations and their impacts on viral behavior (e.g., transmissibility or resistance to vaccines).
   - <u>Epidemiological Studies</u>: Analyze how the virus spreads in specific populations.
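
To illustrate how the metadata fields listed above can narrow a dataset, here is a minimal sketch that filters records by location and collection date. The records, accession labels, and the `filter_records` helper are all hypothetical and invented for illustration; real metadata would come from the database export.

```python
from datetime import date

# Hypothetical metadata records, invented for illustration only
records = [
    {"accession": "SAMPLE_1", "location": "USA: South Dakota", "collected": date(2023, 1, 15)},
    {"accession": "SAMPLE_2", "location": "USA: South Dakota", "collected": date(2023, 5, 2)},
    {"accession": "SAMPLE_3", "location": "USA: Iowa", "collected": date(2023, 2, 10)},
    {"accession": "SAMPLE_4", "location": "USA: South Dakota", "collected": date(2023, 3, 31)},
]


def filter_records(records, location_contains, start, end):
    """Keep records whose location mentions the text and whose date is in [start, end]."""
    return [
        r for r in records
        if location_contains in r["location"] and start <= r["collected"] <= end
    ]


selected = filter_records(records, "South Dakota", date(2023, 1, 1), date(2023, 3, 31))
print([r["accession"] for r in selected])
```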

#### **Sources of SARS-CoV-2 Data**
1. **Nextstrain Project:**
   - A global initiative providing curated datasets for viral evolution studies.
   - SARS-CoV-2 datasets from Nextstrain are preprocessed for downstream phylogenetic analysis.
   - Visit the [Nextstrain documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html) for more details.

2. **NCBI Virus Database:**
   - A comprehensive resource for viral sequences and metadata.
   - The NCBI database allows users to search by strain, variant, or geographic location.
   - Access the NCBI SARS-CoV-2 dataset [here](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=taxid:2697049&USAState_s=SD).

#### **Why Use SARS-CoV-2 Data?**
1. **Scientific Relevance:**
   - SARS-CoV-2 remains a high-priority topic for research due to its global impact.
   - Understanding its genetic evolution helps predict future outbreaks and design effective vaccines.

2. **Real-World Applications:**
   - This dataset is used in vaccine development, drug resistance studies, and public health policies.
   - It is also a benchmark for testing bioinformatics tools in a real-world context.

### **2.4 Let’s Do an Example: Downloading SARS-CoV-2 Data from NCBI**
To help you get started, let’s walk through an example of downloading SARS-CoV-2 genetic sequence data from NCBI.


**Step 1: Access NCBI Virus Database**
1. Open your web browser and go to the [NCBI Virus Website](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/)
2. The homepage will show a search bar and various filtering options.
   
**Step 2: Search for SARS-CoV-2 Sequences**
1. In the search bar, type *Severe acute respiratory syndrome coronavirus 2*, or enter the taxonomy ID `2697049`.

2. Use the filters on the left sidebar to refine your search:
   - **Geographic Region**: Scroll down to the Geographic Region section in the left sidebar. Click on Search All Geo Locations, then expand the list for USA. From the USA list, select your specific state or region. For this example, select South Dakota (SD).

   - **Collection Date** (Optional): Select a date range, such as 2023-01-01 to 2023-03-31, to include only sequences collected within a specific timeframe.

     ***Note:*** Choosing a larger date range will result in more data, which may require more computational resources to process. For example:
     1. A full year's data will generate a significantly larger dataset, requiring multi-core CPUs and at least 16 GB of RAM to handle alignment or phylogenetic analysis efficiently.
     2. Selecting a smaller date range, such as two months (e.g., 2023-01-01 to 2023-02-28), will produce a smaller dataset that runs faster and works well on machines with limited resources.

     ***It's recommended to tailor your date range to the scope of your analysis and the computational power available.***
   - **Sequence Type**: Choose "Nucleotide" for genomic sequences.
     
**Step 3: Select the Desired Sequences**
1. Look through the search results and check the metadata, like location, collection date, or lineage, to make sure the sequences match your needs.
2. If you want to select specific sequences, click the checkbox next to them.
3. Alternatively, to download every sequence in the filtered results, skip the checkboxes and go directly to the Download button at the top of the page. This will include all sequences from the filtered search results.


**Step 4: Download the Sequences**
1. Click the Download button at the top of the page.
2. After clicking the Download button, a new window will appear, as shown in the screenshot.

3. In Step 1 of 3, select Nucleotide under the Sequence Data (FASTA format) option. This ensures that the downloaded file contains nucleotide sequences in the standard FASTA format, which is commonly used for bioinformatics analyses.
4. Click Next to proceed.
5. Choose Download All Records to include all the filtered sequences (e.g., 554 records in this example). Alternatively, you can choose Download Selected Records if you have manually selected specific sequences in the results list.
6. After making your choice, click Next to proceed.
7. In this final step, you’ll be asked to choose the FASTA definition line format.
8. Select the Use default option, which includes the Accession and GenBank Title, to proceed with standard formatting for the downloaded sequences.
9. Click the Download button to save the FASTA file to your system.

   The following image provides a step-by-step visual guide to downloading the `sequences.fasta` file. Each screenshot is labeled in sequential order (1, 2, 3) for clarity.
   ![image-ncbi](images/ncbi3.png)
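
The default definition line chosen in the last step combines the accession and the GenBank title. A minimal sketch of splitting such a line into its two parts is shown below; the helper name `parse_defline` is hypothetical, and the example header is a shortened version of a real record header, assuming the two parts are separated by a `|` character.

```python
def parse_defline(line):
    """Split a '>accession | title' definition line into (accession, title)."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    accession, _, title = line[1:].partition("|")
    return accession.strip(), title.strip()


# Shortened example header (the real title also lists the annotated genes)
example = (
    ">PQ649471.1 | Severe acute respiratory syndrome coronavirus 2 "
    "isolate SARS-CoV-2/human/USA/SD-UW-23020804223/2023"
)
accession, title = parse_defline(example)
print(accession)
print(title)
```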

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Always cross-check the integrity and annotation quality of sequences before using them in your analysis.
</div>

#### **Cross-Checking Sequence Integrity & Annotation Quality**

Before analysis, verify the following:

- **File Integrity & Format** – Ensure the FASTA file is correctly formatted, with headers (> identifier) and valid nucleotide bases (A, T, G, C, N).
  
- **Completeness** – Check for truncated sequences or excessive "N" characters. Ensure reasonable sequence length.

- **Annotation Accuracy** – Confirm taxonomy (txid2697049), metadata (e.g., "South Dakota"), and consistency in annotations.

**Quick Check Example:**

In [None]:
!head -n 10 ./data/cov/sequence/sequences.fasta


Suppose it produces results in the following format:

">PQ649471.1 | Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/SD-UW-23020804223/2023 ORF1ab polyprotein (ORF1ab), ORF1a polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds

TCGTCCGTGTTGCAGCCAATCATCAGCACATCTAGGTTTTGTCCGGGTGTGACCGAAAGG
TAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCC
TGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTT
ATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGG
CGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGC
ACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCG
TAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCG
CAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGGTACGGCGCCGA
TCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGA"

- Header Format – The header starts with > and contains a unique accession (PQ649471.1), the organism and isolate name (SARS-CoV-2/human/USA/SD-UW-23020804223/2023), and gene descriptions. It appears well-annotated.
- Sequence Integrity – The lines shown contain only valid nucleotide bases (A, T, G, C), with no "N" characters or gaps. Because `head` displays only the start of the file, this does not guarantee that the rest of the file is clean.
- Metadata Consistency – The title matches known SARS-CoV-2 records, listing genes such as ORF1ab, S, M, and N.
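
Looking at the first few lines covers only one record. A whole-file check along the lines of the criteria above could be sketched as follows. The function names and the thresholds (`min_length`, `max_n_fraction`) are arbitrary assumptions chosen for illustration, and the sample records are made up. Real GenBank sequences may also contain other IUPAC ambiguity codes, which this sketch would flag as unexpected.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_record(header, sequence, min_length=100, max_n_fraction=0.05):
    """Return a list of problems for one record (an empty list means it passed)."""
    problems = []
    invalid = set(sequence) - set("ACGTN")
    if invalid:
        problems.append(f"unexpected characters: {sorted(invalid)}")
    if len(sequence) < min_length:
        problems.append(f"short sequence ({len(sequence)} bases)")
    if sequence and sequence.count("N") / len(sequence) > max_n_fraction:
        problems.append("too many N characters")
    return problems


# Tiny made-up records, for illustration only
sample = ">good_record\nACGTACGTAC\nGGTTAACC\n>bad_record\nACGTNNNNNNXX\n"
for header, sequence in read_fasta(sample):
    print(header, check_record(header, sequence, min_length=10))
```

For the real file, the same functions could be applied to the text read from `./data/cov/sequence/sequences.fasta`, with thresholds adjusted to the expected genome length.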

#### **Alternative CLI Method for Advanced Users (recommended to run in a terminal instead of a Jupyter notebook)**

Downloading FASTA sequences from NCBI using the Command Line Interface (CLI) involves using the Entrez Direct (EDirect) tools. While this method is powerful, there are limitations in achieving highly specific results, such as filtering by virus name, date range, and geographical location.

Use CLI for automation and quick downloads, with the understanding that metadata filtering may not always work perfectly.

**Step 1**: Install Entrez Direct

To use EDirect, first install it with the following command:

In [None]:
!yes | sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
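
After installation, the tools need to be on the PATH of the notebook process before `!esearch` or `!efetch` will run. The sketch below assumes the installer placed the tools in `~/edirect` (its usual default); the helper name `ensure_edirect_on_path` is hypothetical. It adds that directory to the PATH for the current session and reports whether the two commands can be found.

```python
import os
import shutil
from pathlib import Path


def ensure_edirect_on_path(install_dir=None):
    """Add the EDirect directory to PATH if it exists; report which tools are found."""
    # Assumption: the installer used its default location, ~/edirect
    install_dir = Path(install_dir) if install_dir else Path.home() / "edirect"
    current = os.environ.get("PATH", "")
    if install_dir.is_dir() and str(install_dir) not in current.split(os.pathsep):
        os.environ["PATH"] = str(install_dir) + os.pathsep + current
    return {tool: shutil.which(tool) is not None for tool in ("esearch", "efetch")}


status = ensure_edirect_on_path()
print(status)
```

If either value is `False`, the installation probably did not complete or used a different directory.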

**Step 2**: Search for SARS-CoV-2 sequences:


In [None]:
# Create the folder structure for the CLI downloads
import os

cli_dir = './data/CLI'

try:
    # exist_ok=True makes this a no-op when the directory is already there
    os.makedirs(cli_dir, exist_ok=True)
    print(f"Directory ready: {cli_dir}")
except OSError as e:
    print(f"An error occurred: {e}")

In [None]:
!esearch -db nucleotide -query 'txid2697049[Organism:exp] AND South Dakota[All Fields] AND (2023/01/01:2023/03/31[PDAT])' | efetch -format fasta > data/CLI/sequences_25.fasta


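The command above downloads nucleotide sequences for Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2, taxonomy ID 2697049) whose records mention South Dakota and were published between January 1, 2023, and March 31, 2023. Note that `[PDAT]` filters on publication date rather than sample collection date, and `South Dakota[All Fields]` matches that text anywhere in the record, so the results may differ from a web search filtered by collection date and geographic region.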
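
To reuse the same search for another organism, place, or date window, the query string can be assembled programmatically. This is a small sketch; the function name `build_entrez_query` and its parameters are hypothetical, and the default `date_field` simply mirrors the `[PDAT]` tag used in the command above.

```python
def build_entrez_query(taxid, place, start, end, date_field="PDAT"):
    """Assemble an Entrez query string in the same shape as the esearch example."""
    return (
        f"txid{taxid}[Organism:exp] AND {place}[All Fields] "
        f"AND ({start}:{end}[{date_field}])"
    )


query = build_entrez_query(2697049, "South Dakota", "2023/01/01", "2023/03/31")
print(query)
```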

**Potential Issues to Consider:**
Executing the above command without filtering specific data or limiting the number of sequences could result in downloading a large volume of data, potentially overwhelming system memory, storage, or network bandwidth. This may lead to performance issues or even cause the system to crash.
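
One common way to limit the load is to fetch results in fixed-size batches. The sketch below only computes the batch boundaries; each `(start, size)` pair could then be passed as the `retstart` and `retmax` parameters of an E-utilities request. The helper name `batch_ranges` and the batch size of 200 are assumptions for illustration, and 554 is the record count from the web example earlier in this submodule.

```python
def batch_ranges(total, batch_size):
    """Yield (start, size) pairs that cover `total` records in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


batches = list(batch_ranges(554, 200))
print(batches)
```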

#### **Alternative Methods to Download Sequence Files from the API**

In [None]:
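from Bio import Entrez
# NCBI asks E-utilities users to identify themselves; replace the placeholder address
Entrez.email = "your.name@example.com"
stream = Entrez.esearch(db="nucleotide", term="txid2697049[Organism:exp] AND South Dakota[All Fields] AND (2023/01/01:2023/03/31[PDAT])", idtype="acc")
record = Entrez.read(stream)
stream.close()
# Total number of matching records (IdList holds only the first 20 by default)
record["Count"]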

In [None]:
record["IdList"]

In [None]:
import os
from Bio import Entrez

# Fetch the first accession from the search results above
test_id = record["IdList"][0]
filepath = os.path.join("./data", "CLI", test_id + ".fasta")
if not os.path.isfile(filepath):
    # Download the record as plain-text FASTA and save it
    stream = Entrez.efetch(db="nucleotide", id=test_id, rettype="fasta", retmode="text")
    try:
        with open(filepath, "w") as output:
            output.write(stream.read())
    finally:
        stream.close()
    print("Saved")
else:
    print("File already exists; skipping download")

### **2.5 Dataset Description**

The question of why we download data in FASTA format instead of other formats often arises. FASTA is a widely accepted standard format for storing biological sequences, providing only the sequence data without additional metadata like quality scores. This makes it lightweight and easy to work with, especially for tasks like sequence alignment and reference-based analysis. Additionally, FASTA files are universally supported by most bioinformatics tools and workflows. Other formats, such as FASTQ, include quality scores but are larger and primarily used for raw sequencing data, which may not always be necessary for downstream analyses. The simplicity and accessibility of FASTA make it the preferred choice for many applications.
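
To make the difference between the two formats concrete, here is a minimal sketch that converts FASTQ text to FASTA by dropping the quality lines. It assumes the simple layout of exactly four lines per FASTQ record; the function name `fastq_to_fasta` and the two reads are made up for illustration.

```python
def fastq_to_fasta(fastq_text):
    """Convert simple 4-line FASTQ records to FASTA by dropping the quality lines."""
    lines = [line.strip() for line in fastq_text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    fasta_records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at non-blank line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence and quality lengths differ for {header}")
        # FASTA keeps only the identifier and the bases
        fasta_records.append(f">{header[1:]}\n{sequence}\n")
    return "".join(fasta_records)


# Two made-up reads: header, bases, separator, per-base quality string
fastq_example = "@read_1\nACGTACGT\n+\nIIIIHHHH\n@read_2\nGGCCTTAA\n+\nFFFFFFFF\n"
print(fastq_to_fasta(fastq_example), end="")
```

Each FASTQ record shrinks from four lines to two, which is why FASTA files are smaller when quality scores are not needed.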

To perform sequence alignment and phylogenetic analysis, we provide one sequence file in the `data/cov/sequence` directory. You can replace it with a smaller or larger download, depending on the computational resources available and the requirements of your analysis.

##### **1. sequences.fasta**
- We downloaded the `sequences.fasta` file from NCBI. It contains the complete set of SARS-CoV-2 sequences used in this tutorial, collected from 01/01/2023 to 03/31/2023.
- It is suitable for medium-scale analyses, although alignment and tree building on the full file still require substantial computational resources.
- **File Size:** Approximately 9.66 MB.


### **2.6 Other Data Sources Beyond NCBI**
While NCBI is a primary source for nucleotide sequence data, other repositories offer complementary datasets that can enhance phylogenetic studies:

#### **2.6.1 KEGG Dataset:**
Dataset 1: KEGG for Phylogenetic Tree

Downloading KEGG Dataset

KEGG (Kyoto Encyclopedia of Genes and Genomes) provides extensive data for understanding the functions and interactions of biological systems. You can retrieve KEGG data through both the website and API.

1. **Downloading from the KEGG Website:**

To access KEGG data from the website:

- Visit the KEGG Sequence Retrieval page: https://www.genome.jp/kegg/seq/
- Navigate to the FASTA Sequence Files section on the page and download the required files.
- For more resources and datasets, check the [KEGG main website](https://www.genome.jp/kegg/).

2. **Downloading from KEGG API:**

Follow these steps to download data using the KEGG API:

- Step 1: Visit the KEGG website: https://www.genome.jp/kegg/
- Step 2: In the search bar, type "cov" (or the gene/pathway you're interested in).
- Step 3: Click on "KEGG Genes" or any entry of interest. For this example, we select hsa:10495.
- Step 4: Scroll down to find the sequence type. For this example, we choose "NTseq" (nucleotide sequence).
- Step 5: For NT sequence, the API URL will be: https://rest.kegg.jp/get/hsa:10495/ntseq
  
To download the file using wget, use the following command:




In [None]:
!wget -O data/kegg_fasta.fasta "https://rest.kegg.jp/get/hsa:10495/ntseq"


This command will download the NT sequence for the specified gene hsa:10495.

**Breaking Down the Command**

- wget → Command to download files from the web.

- -O data/kegg_fasta.fasta → Saves the output as kegg_fasta.fasta in the data directory.

- "https://rest.kegg.jp/get/hsa:10495/ntseq" → Retrieves the nucleotide sequence for gene hsa:10495 (replace 10495 with another KEGG gene ID to get different sequences).

**Modify this query:**

To download another gene, replace hsa:10495 with your desired gene ID.
You can find KEGG gene IDs [here](https://www.genome.jp/kegg/).
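
The same request can be made from Python instead of `wget`. The sketch below builds the KEGG REST URL for a gene ID and includes a small download helper based on the standard library; the function names are hypothetical, and the download helper is defined but not called here so that the cell runs without network access. Both `ntseq` (nucleotide) and `aaseq` (amino acid) are sequence options of the KEGG `get` operation.

```python
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"


def kegg_sequence_url(gene_id, seq_type="ntseq"):
    """Build the KEGG REST URL for a gene's nucleotide or amino acid sequence."""
    if seq_type not in ("ntseq", "aaseq"):
        raise ValueError("seq_type must be 'ntseq' or 'aaseq'")
    return f"{KEGG_REST}/get/{gene_id}/{seq_type}"


def download_kegg_sequence(gene_id, out_path, seq_type="ntseq"):
    """Download the sequence to out_path and return the number of bytes written."""
    with urlopen(kegg_sequence_url(gene_id, seq_type), timeout=30) as response:
        data = response.read()
    with open(out_path, "wb") as handle:
        handle.write(data)
    return len(data)


print(kegg_sequence_url("hsa:10495"))
```

Calling `download_kegg_sequence("hsa:10495", "data/kegg_fasta.fasta")` would then save the same file as the `wget` command above.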

#### **2.6.2 UniProt Dataset**
Dataset 2: UniProt for Phylogenetic Tree

Downloading UniProt Dataset

UniProt is a comprehensive resource for protein sequences and functional information. You can retrieve data from UniProt through both their website and API.

1. **Downloading from the UniProt Website:**

To download data from the UniProt website:

- Visit the UniProt website: https://www.uniprot.org/.
- In the search bar, type the name of the protein (e.g., "sars-cov-2") or any protein you are interested in.
- For specific protein sequences, click on the protein entry you want, then click the "Share" button. To download multiple entries or all of them, click the "Share" button directly.
- Download the Isoform Sequences in FASTA format from the website.

These datasets can complement NCBI data and provide additional insights for advanced research needs.

2. **Downloading from UniProt API:**

Follow these steps to download data using the UniProt API:

- Step 1: Visit the UniProt website: https://www.uniprot.org/.
- Step 2: In the search bar, type "sars-cov-2" (or any protein name of your interest).
- Step 3: To select specific entries, click on the entry you want, then click the "Share" button.  Alternatively, to select all, click the "Share" button directly.
- Step 4: Click on the "Generate API for URL" option.
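- Step 5: Choose "Compressed" for the download option if you want a gzip-compressed file. The example command below requests an uncompressed file, so its output is plain-text FASTA.
- Step 6: Copy the link provided for the API URL using the streaming endpoint.

To download the desired FASTA file, use the following command: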


In [None]:
!wget -O data/uniprot_fasta.fasta "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=accession%3AA0A0M3Q1Q3+OR+accession%3AQ2FHV1+OR+accession%3AQ8IUC6+OR+accession%3AQ92985"

This will download the specified FASTA file containing the sequences for the given accession numbers.

**Breaking Down the Command**

- "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=..." → Queries the UniProt database.

  
- accession%3AA0A0M3Q1Q3+OR+accession%3AQ2FHV1+OR+... → Requests multiple protein sequences using their accession numbers.

- -O data/uniprot_fasta.fasta → Saves the output.

  
**Modify this query:**

- Replace the accession numbers (A0A0M3Q1Q3, Q2FHV1, etc.) with other UniProt accessions.
- To search for more proteins, visit [UniProt](https://www.uniprot.org/) .
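
Rather than editing the encoded URL by hand, the query can be generated from a list of accessions. This is a minimal sketch using the standard library; the function name `uniprot_fasta_url` is hypothetical, and the accessions are the ones from the command above.

```python
from urllib.parse import urlencode

UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def uniprot_fasta_url(accessions):
    """Build a UniProt streaming URL that returns the given accessions as FASTA."""
    if not accessions:
        raise ValueError("at least one accession is required")
    query = " OR ".join(f"accession:{acc}" for acc in accessions)
    # urlencode turns ':' into %3A and spaces into '+', as in the wget example
    params = urlencode({"format": "fasta", "query": query})
    return f"{UNIPROT_STREAM}?{params}"


url = uniprot_fasta_url(["A0A0M3Q1Q3", "Q2FHV1", "Q8IUC6", "Q92985"])
print(url)
```

The printed URL can be passed to `wget -O data/uniprot_fasta.fasta "<url>"` in the same way as the command above.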


### **Summary**

In this module, we focused on the critical first step in constructing phylogenetic trees: data collection. Learners were introduced to retrieving sequence data from various sources, such as NCBI, KEGG, and UniProt, using appropriate database queries and APIs. We explored methods for interacting with these databases, fetching relevant sequence data, and subsetting it for further analysis. Additionally, we provided an overview of the FASTA file format, demonstrating its structure and significance in bioinformatics. By examining the contents of a FASTA file, learners gained a better understanding of how sequence data is organized and prepared for downstream analysis. This module sets the stage for the next step in phylogenetic analysis, where we will perform sequence alignment and explore different algorithms and tools for constructing phylogenetic trees.



### **Interactive Quiz**

Test your understanding of sequence data preparation and quality control using this interactive quiz:

In [None]:
from IPython.display import IFrame
IFrame("Quiz/QS2.html", width=800, height=350)