# **University of South Dakota: Phylogenetic Analysis**

## **Submodule #4: Analyze Phylogenetic Tree**

### **Primary Objective**
This module will teach you how to analyze and interpret phylogenetic trees to uncover evolutionary insights. You'll learn to:
1. Visualize trees and understand what they reveal about relationships between species or genes.
2. Use tools like BLAST to compare sequences and link them to tree data.

### **Overview**
- **What You'll Learn:**
1. Visualize and interpret phylogenetic trees built with IQ-TREE (a tree-building tool) using iTOL (a customizable online tree viewer).
Example: Understand how bird species in different regions evolved from a common ancestor.
2. Conduct comparative metagenomics using BLAST and Biopython to identify functional relationships.
Example: Compare sequences of antibiotic-resistant bacteria to study the spread of resistance genes.


**Integration with Submodule 3:**

- Interpret the output folder created by Nextclade in Submodule 3.
- Visualize phylogenetic trees from the files generated by IQ-TREE and Nextclade in Submodule 3.


- **Tools and Libraries:**
  - **IQ-TREE and iTOL**: For creating and visualizing phylogenetic trees.
  - **BLAST**: To compare sequences with databases and identify similarities.


- **Why It Matters:**

- **Revealing Evolutionary Relationships**:
Phylogenetic trees help us see how species, genes, or traits are connected through evolution.
Real-world Example: During the COVID-19 pandemic, phylogenetic trees were used to track the evolution of new variants of the virus.



### **Learning Objectives**
In Submodule 4, we will build upon the phylogenetic trees constructed in Submodule 3 to analyze and interpret their significance. This includes:

#### 1. **Interpret and Visually Represent Phylogenetic Trees:**

Understand the structure of phylogenetic trees, including branch lengths, bootstrap values, and evolutionary relationships.
Example: Explain how two species, like humans and chimpanzees, share a common ancestor and how they diverged over time.

#### 2. **Conduct Comparative Metagenomics Analysis:**

Use BLAST and Biopython to compare sequences and extract meaningful biological insights.
Example: Study how genes associated with photosynthesis are conserved across plants and cyanobacteria.


----------------------------------------------------------------------------------------------------------------
### **Training Plan** 

Submodule #1: Understanding the Basics of Phylogenetics

Submodule #2: Collect and Prepare Sequence Data

Submodule #3: Alignment and Phylogenetic Reconstruction

 
<font color="green"> **Submodule #4: Analyze Phylogenetic Tree** </font>

--------------------------------------------------------------------------------------------------------------

In [None]:
## Import All Libraries Used in This Submodule
import os
from Bio import Phylo
import matplotlib.pyplot as plt
import pandas as pd


### **4.1 Interpret and Visually Represent Phylogenetic Trees**

Below, we visualize the phylogenetic trees produced by Nextclade and IQ-TREE. For visualization, we will use:
- iTOL, an external website for exploring the phylogenetic tree interactively, and
- Bio.Phylo, which renders the phylogenetic tree directly in this Jupyter Notebook.

#### **1. Analyzing the Output Folder of Nextclade from Submodule 3**
After running Nextclade in Submodule 3, an output folder is generated. This folder contains several important files essential for downstream analysis. Below is an explanation of the key files to help you understand their purpose:
1.  **nextclade.aligned.fasta:**
- Contains the multiple sequence alignment of your input sequences against a reference genome.
- Ensures sequences are aligned correctly for downstream analyses, such as phylogenetic tree construction.

  
2. **nextclade.auspice.json:**

- A JSON-formatted file designed for tools like Auspice or Nextstrain.
- Includes data for visualizing the phylogenetic tree and associated metadata interactively.

3. **nextclade.cds_translation.E.fasta:**

- Contains translated coding sequences (CDS) for the "E" gene (envelope protein) or other specified regions.
- Useful for analyzing amino acid changes and their potential functional impact.

4. **nextclade.csv:**

- A summary file in CSV format providing:
    - Quality control (QC) results.
    - Clade assignments.
    - Mutation information for each sequence.
- Ideal for quickly reviewing sequence quality and evolutionary classifications.

5. **nextclade.json:**

- A JSON file with a detailed breakdown of sequence alignments, mutations, and metadata.
- It is often used for programmatic analysis or integration into custom workflows.

6. **nextclade.nwk:**

- A Newick-formatted file containing the phylogenetic tree structure.
- This format is widely used for tree visualization in various tools, including Auspice, iTOL, and other phylogenetic software.
- Enables direct visualization or editing of the tree.

7. **nextclade.tsv:**

- A tab-delimited summary file similar to nextclade.csv, but in TSV format.
- Includes sequence information, QC metrics, and mutations.
- Useful for users preferring tab-delimited data for integration with pipelines or for manual inspection in text editors.
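
To make the Newick format used by `nextclade.nwk` more concrete, here is a minimal sketch that parses a small Newick string with `Bio.Phylo`. The tree and the tip names (`sample_A`, `sample_B`, `sample_C`) are made up for illustration and are not taken from the Nextclade output; to try it on real data, pass the path of your `.nwk` file to `Phylo.read` instead of the in-memory handle.

```python
from io import StringIO
from Bio import Phylo

# Hypothetical three-tip tree; a real nextclade.nwk contains your sample names.
newick_text = "((sample_A:0.1,sample_B:0.2):0.3,sample_C:0.5);"

# Phylo.read accepts a file path or a file-like handle.
toy_tree = Phylo.read(StringIO(newick_text), "newick")

# Collect the names of the tips (terminal nodes) in the tree.
tip_names = [tip.name for tip in toy_tree.get_terminals()]
print("Number of tips:", toy_tree.count_terminals())
print("Tip names:", tip_names)

# Print a quick text rendering of the topology.
Phylo.draw_ascii(toy_tree)
```

Each pair of parentheses in the string is one internal node, and the number after each colon is the length of the branch leading to that node or tip.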


  

#### **1.1. Next Steps: Interpret and Visualize Nextclade Output**
 Once you have identified the purpose of each file in the Nextclade output folder, follow these steps to analyze and visualize the data effectively:

**Step 1: Visualize the Phylogenetic Tree from `nextclade.nwk` File Using iTOL**

1. **Download the `nextclade.nwk` File**:
   - Locate the `nextclade.nwk` file from the Nextclade output folder and download it to your computer.

2. **Visit the iTOL Website**:
   - Open your browser and go to the [iTOL website](https://itol.embl.de/).

3. **Upload the Tree File**:
   - Log in to iTOL or create a free account if you don’t have one.
   - Navigate to the **dashboard** and click on the **"Upload Tree"** option.
   - Select the `nextclade.nwk` file and upload it. ![image.png](attachment:afff322f-2241-476f-9078-86ef68ccc20d.png)

4. **View the Tree Image**:
   - Once uploaded, iTOL will display an interactive image of the phylogenetic tree.
   - Explore the branching patterns and relationships between sequences.![image.png](attachment:4c6c8bd1-00b9-443d-8b31-a914645045f2.png)

5. **Use the Filter Options**:
   - iTOL provides various filtering and customization options:
     - Highlight specific clades.
     - Apply metadata-based color coding.
     - Zoom in and out on specific branches.
   - Experiment with these options to gain insights into your tree structure.

6. **Save or Share the Tree**:
   - You can export the tree as an image or share it via a link for collaboration.

---

By following these steps, you can easily visualize and analyze the `nextclade.nwk` file using iTOL.




**Step 2: Visualize the Phylogenetic Tree Using Auspice with `nextclade.auspice.json`**

##### **Instructions:**
1. **Visit the Auspice Website**:
   - Open your browser and navigate to the [Auspice website](https://auspice.us/).
   
2. **Upload the File**:
   - Upload the `./data/cov/phylogenetic_tree/nextclade/nextclade.auspice.json` file generated in the output folder.
   - Auspice will process the file and display a phylogenetic tree visualization.

##### **Example Visualization:**
The resulting visualization might look like the image below, showcasing a phylogenetic tree with various clades, branch points, and evolutionary patterns. ![image.png](attachment:c24a1404-c13e-4132-ba93-7f9891f18dbe.png)

##### **Analysis of the Image:**
1. **Clade Representation**:
   - The tree depicts major SARS-CoV-2 clades such as **Delta**, **Omicron**, and others (e.g., 20A, 22B, 24B).
   - Each clade is represented with nodes (circles) connected by branches, illustrating evolutionary relationships.

2. **Branch Lengths**:
   - Branch lengths indicate the genetic distance between sequences.
   - Short branches suggest high similarity, while longer branches indicate significant genetic divergence.

3. **Color Coding**:
   - The clades are color-coded, helping to distinguish different groups easily.
   - Example: Delta and Omicron clades are visually distinct, aiding in quick identification.

4. **Filtering Options**:
   - The filter options in Auspice allow you to refine the tree view, focusing on specific clades, mutations, or other metadata.
   - Customizable layouts (e.g., radial, rectangular) enable better visualization depending on your analysis goals.

5. **Key Observations**:
   - **Clade Distribution**: The clustering of nodes within a clade shows genetic similarity and can hint at geographical or temporal patterns.
   - **Mutation Events**: Divergent branches may indicate significant mutations or new variants emerging over time.
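
The idea that branch lengths measure genetic distance can be checked numerically. The sketch below uses a made-up toy tree (tips `A`, `B`, `C` are hypothetical) and the `distance` method of `Bio.Phylo`, which sums the branch lengths along the path between two tips. The units of the result depend on how the tree was built, so treat the numbers as relative rather than absolute.

```python
from io import StringIO
from Bio import Phylo

# Hypothetical toy tree with explicit branch lengths.
dist_tree = Phylo.read(StringIO("((A:0.1,B:0.2):0.3,C:0.5);"), "newick")

# distance() sums the branch lengths on the path between two tips.
dist_ab = dist_tree.distance("A", "B")  # path: A -> shared parent -> B
dist_ac = dist_tree.distance("A", "C")  # path: A -> shared parent -> root -> C

print(f"A to B: {dist_ab:.2f}")
print(f"A to C: {dist_ac:.2f}")
```

Because `A` and `B` share a recent parent node, the path between them is shorter than the path from `A` to `C`, which has to pass through the root.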



**Step 3: Information Provided by the `nextclade.tsv` File**

In [None]:
file = "data/cov/phylogenetic_tree/nextclade/nextclade.tsv"
df = pd.read_csv(file, sep='\t')  # Use sep='\t' since it is a TSV file
print(df.head())


The `nextclade.tsv` file provides detailed insights into the analyzed sequences, including:

1. **Sequence Metadata**:
   - Names and identifiers of the sequences.

2. **Clade Assignments**:
   - Classification of sequences into clades (e.g., `22E`, `23F`) and their WHO-recognized names (e.g., `Omicron`).

3. **Quality Control (QC) Metrics**:
   - Overall sequence quality status (`good`, `mediocre`, or `bad`).
   - QC scores for evaluating the reliability of each sequence.

4. **Mutations and Substitutions**:
   - Total number of mutations and nucleotide substitutions in each sequence.

5. **Pango Lineages**:
   - Lineage assignments for each sequence (e.g., `BQ.1.5`, `EG.5.1.16`).

6. **Warnings and Errors**:
   - Any issues detected during the analysis, such as alignment errors or quality concerns.

7. **Primer Binding Changes**:
   - Details of primer-binding site changes relevant for PCR assays, if present.

This file serves as a comprehensive summary of sequence quality, clade assignments, and mutation profiles, aiding in downstream analysis and interpretation.
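
As a rough sketch of how these fields can be summarized with pandas, the example below builds a tiny table in memory. The rows are invented for illustration, and the column names (`seqName`, `clade`, `qc.overallStatus`, `totalSubstitutions`) are assumed to match the header of your `nextclade.tsv`; check `df.columns` on the real file before reusing the code.

```python
import pandas as pd

# Invented rows for illustration; column names are assumed to match nextclade.tsv.
nextclade_toy = pd.DataFrame({
    "seqName": ["s1", "s2", "s3", "s4"],
    "clade": ["22E", "22E", "23F", "21J"],
    "qc.overallStatus": ["good", "mediocre", "good", "bad"],
    "totalSubstitutions": [80, 84, 95, 42],
})

# How many sequences fall into each clade?
clade_counts = nextclade_toy["clade"].value_counts()

# Drop sequences that failed QC before comparing mutation counts.
passing = nextclade_toy[nextclade_toy["qc.overallStatus"] != "bad"]

# Average number of substitutions per clade among the remaining sequences.
mean_subs = passing.groupby("clade")["totalSubstitutions"].mean()

print(clade_counts)
print(mean_subs)
```

Filtering on QC status first is a design choice for this sketch: low-quality sequences can inflate or deflate mutation counts and distort per-clade averages.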



##### **Conclusion from Nextclade Visualization of the Phylogenetic Tree**
Using the outputs from Nextclade, we constructed a phylogenetic tree to analyze the evolutionary relationships between SARS-CoV-2 sequences. Our findings indicate that SARS-CoV-2 has undergone significant evolutionary changes, particularly in the transition from the Delta variant to the Omicron variant and its numerous sublineages. This evolution is characterized by an increasing number of mutations, ranging from approximately 40–60 in Delta to 60–160 in Omicron sublineages such as 21L, 22B, 22D, 23D, 24A, and 24B. These changes highlight the virus’s adaptability through genetic diversification, likely driven by immune pressure, global transmission dynamics, and other selective forces.



------------------------------------------------------------------------------------------------------------------------

#### **2. Visualizing the IQ-TREE Output Using Biopython's Phylo Module**

In **Submodule 3**, we introduced additional tools for phylogenetic tree construction, namely **UShER**, **IQ-TREE**, and **FastTree**, as alternatives to Nextclade. Students can choose any of these tools based on their preferences or specific analysis needs.

Now, we will visualize the phylogenetic tree using the output file **`aligned_sequences.treefile`** generated by the **IQ-TREE** tool. We will use the **Biopython Phylo module** to visualize the tree in this Jupyter Notebook.




In [None]:
# Create the folder for visualization outputs (no error if it already exists)
visualization_dir = './data/cov/visualization'

try:
    os.makedirs(visualization_dir, exist_ok=True)
    print(f"Directory ready: {visualization_dir}")
except OSError as e:
    print(f"An error occurred: {e}")

In [None]:

# File paths
treefile = "./data/cov/phylogenetic_tree/IQ-Tree/aligned_sequences.treefile"
output_path = "./data/cov/phylogenetic_tree/visualization/phylogenetic_tree.svg"

try:
    # Read the tree
    tree = Phylo.read(treefile, "newick")

    # Dynamically calculate figure size based on the number of labels
    num_labels = len(tree.get_terminals())
    fig_width = 50
    fig_height = max(10, num_labels // 2)  # Adjust height to fit the tree

    # Create a figure and axis
    fig, ax = plt.subplots(figsize=(fig_width, fig_height))

    # Draw the tree
    Phylo.draw(tree, do_show=False, axes=ax)

    # Shrink the tip-label font as the number of labels grows to reduce overlap.
    # Phylo.draw writes tip labels as Matplotlib text objects on the axes.
    font_size = max(5, 200 // max(1, num_labels))  # Smaller font for more labels
    for label in ax.texts:
        label.set_fontsize(font_size)

    # Ensure the output directory exists
    os.makedirs(os.path.dirname(output_path), exist_ok=True)

    # Save the figure in SVG format
    fig.savefig(output_path, dpi=200, bbox_inches='tight', format='svg')

    print(f"Tree successfully saved at: {output_path}")

    plt.show()

except FileNotFoundError:
    print(f"Error: The file '{treefile}' does not exist.")
except Exception as e:
    print(f"An error occurred: {e}")




#### **Explanation of the Tree**
The tree generated by IQ-TREE represents evolutionary relationships, with branch lengths corresponding to genetic divergence. Users can interpret the topology to understand lineage splits and evolutionary distances.
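
Besides branch lengths, a tree file may carry support values (for example bootstrap values, if they were requested when the tree was built). The sketch below shows one way to list weakly supported clades with `Bio.Phylo`. The Newick string, tip names, and the cutoff of 70 are assumptions chosen for illustration; the appropriate cutoff depends on the type of support value your tool reports.

```python
from io import StringIO
from Bio import Phylo

# Hypothetical tree with support values written as internal node labels.
newick_with_support = "((A:0.1,B:0.2)95:0.3,(C:0.1,D:0.1)60:0.2);"
support_tree = Phylo.read(StringIO(newick_with_support), "newick")

SUPPORT_THRESHOLD = 70  # illustrative cutoff, not a rule from this tutorial

# Collect internal nodes whose support falls below the cutoff.
weak_clades = []
for clade in support_tree.get_nonterminals():
    if clade.confidence is not None and clade.confidence < SUPPORT_THRESHOLD:
        tips = [tip.name for tip in clade.get_terminals()]
        weak_clades.append((tips, clade.confidence))

print(weak_clades)
```

Groupings that appear in `weak_clades` should be interpreted with caution, since the data support them less consistently than the rest of the tree.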

For customization, users may optionally use iTOL (Interactive Tree of Life), which offers publication-quality visualizations.
Using iTOL is especially recommended as a second option if the phylogenetic tree is not properly visualized, particularly for large datasets, as iTOL provides advanced tools for better clarity and annotation.

**Note:** Students can experiment in the same way with the tree files generated by FastTree and UShER in Submodule 3.

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Use iTOL for publication-quality visualizations and advanced customization options.
</div>

### **4.2 Importance of Visual Representation**

Visual representation is critical for:

1. **Interpreting Results:** Simplifies understanding of evolutionary relationships.
2. **Communication:** Makes findings accessible to a broader audience.
3. **Highlighting Features:** Emphasizes key evolutionary events and patterns.

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Note:</strong> 📝  A well-designed visualization can make complex evolutionary relationships easier to comprehend.
</div>

### **4.3 Conduct Comparative Metagenomics Along Different Branches**

#### **Overview**
Comparative metagenomics involves analyzing genetic content along different branches of a phylogenetic tree. By mapping genetic sequences to branches of the tree, researchers can uncover patterns of variation and evolution across species or strains. This method provides critical insights into how genetic features are distributed and how they have evolved in different lineages.

For instance, variations in genes between closely related species might indicate adaptations to specific environmental pressures or unique evolutionary events. Comparative metagenomics can also help identify conserved genetic elements or functions, highlighting essential genes shared across lineages.
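
One simple way to picture "mapping genetic content onto branches" is to ask what fraction of the tips under each clade carry a given gene. The sketch below does this for a made-up four-tip tree; the tip names and the presence/absence values in `has_gene` are hypothetical and would normally come from a sequence search such as BLAST.

```python
from io import StringIO
from Bio import Phylo

# Hypothetical four-tip tree with two sister clades.
branch_tree = Phylo.read(
    StringIO("((A:0.1,B:0.1):0.2,(C:0.1,D:0.1):0.2);"), "newick"
)

# Hypothetical presence/absence of a gene of interest for each tip.
has_gene = {"A": True, "B": True, "C": False, "D": True}

def gene_fraction(clade):
    """Fraction of tips under `clade` that carry the gene."""
    tips = [tip.name for tip in clade.get_terminals()]
    return sum(has_gene[name] for name in tips) / len(tips)

# Locate each clade through the most recent common ancestor of its tips.
clade_ab = branch_tree.common_ancestor("A", "B")
clade_cd = branch_tree.common_ancestor("C", "D")

print("Clade (A, B):", gene_fraction(clade_ab))
print("Clade (C, D):", gene_fraction(clade_cd))
```

A gene present in every tip of one clade but only some tips of its sister clade is the kind of pattern that can point to conservation in one lineage and loss or gain in another.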

#### **Steps for Comparative Metagenomics**

**1. BLAST:**
BLAST (Basic Local Alignment Search Tool) is a powerful tool for comparing biological sequences, such as DNA or protein sequences. It identifies regions of similarity, helping to link sequences from different species to a common ancestor or highlight differences that have emerged over time.

Using BLAST in comparative metagenomics allows you to:
- Compare query sequences against a reference database.
- Identify homologous sequences, which are indicative of shared ancestry.
- Detect unique sequences that might explain species-specific traits or adaptations.

##### **Example Use Case:**
Suppose you have a set of genetic sequences from species represented in a phylogenetic tree. Using BLAST:
1. You can create a database from the sequences.
2. Compare a query sequence (e.g., a gene of interest) to identify its presence and variation across the branches of the tree.
3. Analyze the results to determine patterns of conservation or divergence.

The outputs of BLAST (alignment scores, e-values, etc.) can then be visualized and interpreted in the context of the phylogenetic tree to draw meaningful biological conclusions.
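
To build intuition for what a percent identity value means, here is a deliberately simplified sketch that compares two sequences that are already aligned. It is not how BLAST computes its statistics (BLAST finds local alignments and counts gaps in the alignment length); the function name `percent_identity` and the example sequences are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned, non-gap columns where the two sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("Aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns in this simplified measure
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0


# One mismatch in eight compared positions.
print(percent_identity("ACGTACGT", "ACGTTCGT"))
# The gap column is ignored, and the remaining four positions all match.
print(percent_identity("ACG-T", "ACGAT"))
```

Two nearly identical viral genomes would score close to 100 with this measure, while unrelated sequences would score much lower.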


**1. Create a BLAST Database**

This tutorial uses the aligned South Dakota sequences in `data/cov/alignment/aligned_sequences.fasta` to create the BLAST database (see Submodule 2).

In [None]:
# Create the folder for the BLAST database (no error if it already exists)
blast_db_dir = './data/cov/blast_db'

try:
    os.makedirs(blast_db_dir, exist_ok=True)
    print(f"Directory ready: {blast_db_dir}")
except OSError as e:
    print(f"An error occurred: {e}")

In [None]:
import time
starttime = time.time()
!makeblastdb -in data/cov/alignment/aligned_sequences.fasta -dbtype nucl -out ./data/cov/blast_db/reference_db
endtime = time.time()
execution_time = endtime - starttime
print(f"Execution Time: {execution_time} seconds")


**Explanation of the `makeblastdb` Command**
1. **-in data/cov/alignment/aligned_sequences.fasta**: Input FASTA file containing the sequences
2. **-dbtype nucl**: Specifies that the database is for nucleotide sequences
3. **-out ./data/cov/blast_db/reference_db**: Output path and base name (prefix) of the BLAST database files
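
Because the `-out` value acts as a file-name prefix, a quick way to confirm that the database was built is to look for the files that share that prefix. The helper below is a minimal sketch: the function name `blast_db_exists` is hypothetical, and it only checks the three core files of a single-volume nucleotide database (`.nhr`, `.nin`, `.nsq`). Very large databases are split into numbered volumes, which this check does not cover.

```python
from pathlib import Path

def blast_db_exists(db_prefix: str) -> bool:
    """Return True if the core nucleotide database files exist for this prefix."""
    core_extensions = (".nhr", ".nin", ".nsq")
    return all(Path(db_prefix + ext).is_file() for ext in core_extensions)


# Prefix used in the makeblastdb command above.
print(blast_db_exists("./data/cov/blast_db/reference_db"))
```

If this prints `False` after running `makeblastdb`, check the command output for errors before moving on to the `blastn` step.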




**Why We Need to Run BLAST**

We have already completed the phylogenetic tree construction and visualization. But what happens if a new sequencing file arrives (for example, FASTQ reads converted to FASTA, the input format `blastn` expects) and you need to determine whether it contains SARS-CoV-2 sequences or is contaminated? This is where `blastn` can help: it compares the new sequences against the database and reports which ones match.

**2. Run BLAST for Comparisons:**

This tutorial uses the `sequences_ca_subset_30.fasta` file from the `data/cov/sequence` folder as the BLAST query. The file contains 30 randomly selected samples from NCBI Virus data, filtered to California from Jan 1, 2023 to Mar 31, 2023.
Run BLAST to compare the query sequences against the database:

In [None]:
import time
starttime = time.time()
!blastn -query ./data/cov/sequence/sequences_ca_subset_30.fasta -db ./data/cov/blast_db/reference_db -out ./data/cov/blast_db/analysis_result_phylogenetic_tree_output_reference_sequences.txt -outfmt 6
endtime = time.time()
execution_time = endtime - starttime
print(f"Execution Time: {execution_time} seconds")

**3. Visualize the Blastn Result:**

Results are saved in `analysis_result_phylogenetic_tree_output_reference_sequences.txt`.

In [None]:
import pandas as pd
import plotly.express as px

# Load the BLAST results into a DataFrame
file_path = "./data/cov/blast_db/analysis_result_phylogenetic_tree_output_reference_sequences.txt"
columns = [
    "Query ID", "Hit ID", "% Identity", "Alignment Length", "Mismatches", "Gap Opens",
    "Query Start", "Query End", "Subject Start", "Subject End", "E-value", "Bit Score"
]

# Read the tab-separated BLAST results
df = pd.read_csv(file_path, sep="\t", names=columns)

# Convert necessary columns to numeric for visualization
df["% Identity"] = pd.to_numeric(df["% Identity"], errors='coerce')
df["Alignment Length"] = pd.to_numeric(df["Alignment Length"], errors='coerce')
df["E-value"] = pd.to_numeric(df["E-value"], errors='coerce')
df["Bit Score"] = pd.to_numeric(df["Bit Score"], errors='coerce')

# Create an interactive scatter plot (Bit Score vs. Alignment Length)
fig_scatter = px.scatter(
    df, 
    x="Alignment Length", 
    y="Bit Score", 
    color="% Identity", 
    hover_data=["Query ID", "Hit ID", "E-value"],
    title="BLAST Results: Bit Score vs Alignment Length",
    labels={"Alignment Length": "Alignment Length", "Bit Score": "Bit Score"}
)

# Create an interactive histogram of % Identity
fig_hist = px.histogram(
    df, 
    x="% Identity", 
    nbins=30, 
    title="Distribution of % Identity in BLAST Results",
    labels={"% Identity": "Percentage Identity", "count": "Frequency"}
)

# Save the figures as images (static image export requires the "kaleido" package)
fig_scatter.write_image("./data/cov/blast_db/bit_score_vs_alignment_length.png")
fig_hist.write_image("./data/cov/blast_db/percentage_identity_distribution.png")


# Show the plots
fig_scatter.show()
fig_hist.show()

#### **Interpret the BLASTN Results**

The BLASTN comparison of SARS-CoV-2 sequences isolated from South Dakota and California between January and March 2023 shows three consistent patterns:

- **High percent identity and bit scores:** The sequences from both regions exhibited percent identities typically above 99%, with very high bit scores. This indicates that the genetic sequences are nearly identical, suggesting that the viral strains circulating in South Dakota and California during this period are highly similar.

- **Low E-values:** The reported E-values are extremely low (approaching zero), confirming that the observed matches are not due to random chance. In practical terms, the alignment between the query sequences (from one region) and the subject sequences (from the other) is statistically very robust.

- **Consistent alignment lengths:** The alignment lengths span nearly the full length of the viral genome segments being compared. This reinforces that not only are key regions similar, but the overall genomic structure is conserved across the isolates from both states.

##### Interpretation:
The BLASTN data strongly suggest that the SARS-CoV-2 strains in South Dakota and California during the first quarter of 2023 are genetically very similar. This high degree of similarity implies a potential common origin or transmission pathway between these regions. It also suggests that the circulating virus variant maintained its genetic integrity across different geographical locations during this period. For public health, this information is valuable as it indicates that interventions (such as vaccination and antiviral strategies) effective in one region are likely to be equally effective in the other, given the genetic uniformity of the virus.

In summary, the BLASTN results provide compelling evidence that the SARS-CoV-2 isolates from South Dakota and California during January through March 2023 share high genetic similarity, reflecting either a common viral lineage or significant inter-regional transmission during that time frame.
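
To back up statements like these with numbers, it helps to reduce the tabular output to one best hit per query and then check it against a cutoff. The sketch below uses the standard column order of BLAST tabular output (`-outfmt 6`). The three result rows are invented for illustration, and the 99% identity cutoff is an assumption for this example, not a recommended value.

```python
from io import StringIO
import pandas as pd

# Standard column names for BLAST tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

# Made-up rows for illustration only; replace with the real results file.
toy_results = StringIO(
    "q1\tref1\t99.90\t29700\t30\t0\t1\t29700\t1\t29700\t0.0\t54000\n"
    "q1\tref2\t99.50\t29700\t148\t0\t1\t29700\t1\t29700\t0.0\t53000\n"
    "q2\tref2\t98.00\t1200\t24\t0\t1\t1200\t50\t1249\t0.0\t2000\n"
)
hits = pd.read_csv(toy_results, sep="\t", names=OUTFMT6_COLUMNS)

# Keep the highest-scoring hit for each query sequence.
best_hits = (
    hits.sort_values("bitscore", ascending=False)
        .drop_duplicates("qseqid")
        .set_index("qseqid")
)

# Flag queries whose best hit falls below an illustrative identity cutoff.
IDENTITY_CUTOFF = 99.0  # assumption for this sketch
low_identity = best_hits[best_hits["pident"] < IDENTITY_CUTOFF]

print(best_hits[["sseqid", "pident", "length", "bitscore"]])
print("Below cutoff:", list(low_identity.index))
```

Queries that land in `low_identity`, or that have short alignment lengths relative to the genome, are the ones worth inspecting more closely before drawing conclusions about similarity between regions.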

<div style="padding: 10px; border: 1px solid #ffccbc; border-radius: 5px; background-color: #ffebee;">
    <strong>Alert:</strong>⚠️ Ensure your sequences are properly formatted and validated before running BLAST to avoid errors.
</div>

### **Summary**
In this module, learners explored the process of interpreting and visualizing phylogenetic trees using various tools and techniques. **Nextclade** outputs were analyzed to understand multiple file formats, including **aligned sequences, JSON metadata, mutation reports, and tree structures in Newick format**.  

Visualization techniques using **iTOL** and **Auspice** were introduced, allowing learners to interact with and interpret evolutionary relationships. Trees reconstructed with **IQ-TREE** in Submodule 3 were visualized directly in the notebook using Biopython's `Bio.Phylo` module.

The module also covered **comparative metagenomics**, where **BLAST** was used to identify genetic similarities and variations across different lineages. Finally, **Biopython** was introduced as an automation tool, streamlining large-scale phylogenetic analysis and making workflows more efficient, reproducible, and scalable.  

### **Interactive Quiz**

Test your understanding of phylogenetic analysis and ancestral state reconstruction:

In [None]:
from IPython.display import IFrame
IFrame("Quiz/QS4.html", width=800, height=350)