# **University of South Dakota: Phylogenetic Analysis**

## **Submodule #1: Understanding the Basics of Phylogenetic Trees**

### **Introduction to Phylogenetics**

Phylogenetics is the study of evolutionary relationships among organisms. It uses genetic data (such as DNA sequences) or phenotypic data (traits such as size or color) to infer these relationships. By comparing similarities and differences, scientists build evolutionary trees (called phylogenetic trees) that show how species descend from common ancestors and have diverged over time.

To illustrate its importance, consider the COVID-19 pandemic. Researchers around the world have used phylogenetic trees to:
 - Track the spread and mutation of SARS-CoV-2.
 - Identify emerging variants and their ancestral relationships.
 - Understand how the virus evolves over time and across populations.

This module builds foundational skills to analyze and visualize phylogenetic trees using Python. By the end, these skills can be applied to real-world biological datasets, including COVID-19 genomic data.






### **Learning Objectives:**

Phylogenetics explores the evolutionary connections and ancestral relationships among organisms, providing insights into their genetic and evolutionary history. This submodule introduces the foundational concepts of phylogenetic trees, helping learners understand their structure, purpose, and applications in biological research. By the end of this module, learners will be able to define and interpret phylogenetic trees, recognize their importance in mapping genetic changes and understanding biodiversity, and apply practical skills to construct and analyze phylogenetic trees using Python for real-world biological data.

- **What You'll Learn:**
    - Basics of phylogenetic trees and their significance.
    - Steps to create and interpret different tree types (Rooted, Unrooted, Cladograms, Phylograms, Dendrograms).
    - Hands-on examples using Python for visualizations.
- **Tools and Libraries:** 
    - `Biopython` for phylogenetic analysis.
    - `Matplotlib` for visualization.
    - Real-world biological data in Newick format.
- **Why It Matters:**
    - Phylogenetic trees provide insights into evolutionary history, biodiversity, and genetic variation.
    - Understanding these trees is critical for applications in genomics, disease research, and conservation biology.



-------------------------------------------------------------------------------------------------------------------

### **1.1 What is a Phylogenetic Tree?**
A phylogenetic tree is a diagram that represents evolutionary relationships among organisms. Just as a family tree traces ancestry among relatives, a phylogenetic tree maps out how species are related through common ancestors, illustrating evolutionary pathways based on genetic, morphological, or molecular data.

Each branch point (or node) in a phylogenetic tree represents a common ancestor, while the branches indicate evolutionary divergence. Organisms that share a recent common ancestor appear closer together, while those that diverged earlier appear farther apart.


Example: Phylogenetic Tree of Primates


<center>
    <img src="images/primates.drawio.png" width="500">
</center>



This phylogenetic tree illustrates the evolutionary relationships among primates. It divides them into apes (hominoids)—including humans, chimpanzees, gorillas, orangutans, and gibbons—and monkeys (cercopithecoids) such as baboons, macaques, and colobus monkeys. Species on closer branches share a more recent common ancestor, with humans and chimpanzees being the most closely related. This tree highlights primate evolution based on genetic and morphological similarities.




### **1.2 Why Are Phylogenetic Trees Important?**

Phylogenetic trees are powerful tools for visualizing and understanding the relationships among organisms, genes, or pathogens. They play a central role in many fields, including evolutionary biology, disease research, and conservation. Here’s a detailed breakdown of their importance, with real-world examples like COVID-19:

1. **Tracing Evolutionary Pathways:** Phylogenetic trees help scientists understand the origins and evolutionary history of species, populations, or pathogens. They reveal how organisms are related through shared common ancestors and show the order in which lineages diverged.
   - **Example: Origins of SARS-CoV-2:** When SARS-CoV-2 first emerged, scientists constructed phylogenetic trees to trace its origins.
     - By comparing the genome of SARS-CoV-2 with those of other coronaviruses, researchers found that its closest known relatives are bat coronaviruses, which suggests that the virus likely originated in bats before reaching humans.
     - Phylogenetic analysis continues to track how the virus evolves, giving rise to variants such as Delta and Omicron and their sublineages.
2. **Mapping Genetic Changes:** Phylogenetic trees allow us to identify and track mutations and genetic divergence over time. This is particularly important for studying pathogens, as mutations can impact their transmissibility, severity, or resistance to treatment.
   - **Example: Tracking SARS-CoV-2 Mutations:** As SARS-CoV-2 spreads, it accumulates mutations in its genome. Scientists use phylogenetic trees to map these changes and understand their significance. For instance:
     - **The Delta variant:** Known for increased transmissibility.
     - **The Omicron variant:** Carries numerous mutations in its spike protein, some of which reduce how well vaccine-induced antibodies neutralize the virus.
   - By placing these changes on a phylogenetic tree, researchers can monitor which lineages are expanding, anticipate how the virus might continue to change, and update vaccines or treatments accordingly.
3. **Understanding Biodiversity:** Phylogenetic trees help us explore how species diversify, adapt, and evolve over time. They show relationships among organisms and help identify patterns of adaptation to different environments.
   - **Example: Bird Species Adaptation:** In evolutionary biology, phylogenetic trees have been used to study Darwin’s finches on the Galápagos Islands. The trees revealed how the finches adapted to different niches, leading to the evolution of distinct beak shapes suited to their food sources. This example highlights how evolutionary pressures drive biodiversity.
4. **Disease Research:** Phylogenetic trees are essential for tracking the spread and evolution of infectious diseases. They help identify the origin of outbreaks, monitor how pathogens evolve, and guide public health responses.
   - **Example: Global Spread of COVID-19:** During the COVID-19 pandemic, phylogenetic trees were used to:
     - **Trace Spread:** Show how SARS-CoV-2 traveled between countries. For example, variants first detected in the UK and South Africa spread globally.
     - **Monitor Evolution:** Identify how and when new variants emerged. For instance, the Alpha variant was first detected in the UK, Beta in South Africa, and Delta in India.
     - **Inform Decisions:** Governments and health agencies used this data to implement travel restrictions, update vaccines, and develop targeted interventions.


### **1.3 How Are Phylogenetic Trees Created?**
Phylogenetic trees are constructed using a combination of genetic data, computational tools, and mathematical models. This section outlines the key components involved in the process, including the types of data used, the sources of this data, the technologies that enable data generation, and the methods used to infer evolutionary relationships.
#### **1.3.1 Data Types and Sources for Phylogenetic Analysis**
Phylogenetic analysis relies on several types of data, which are often sourced from publicly available repositories or generated through sequencing technologies. Below is a breakdown of the primary data types and their common sources:

1. **Genetic Sequences:**
    - **Data Type:** DNA, RNA, or protein sequences.

    - **Role in Phylogenetics:** These sequences are compared across species or viruses to identify similarities and differences, which form the basis for inferring evolutionary relationships.

    - **Common Sources:**
      - **Public Databases**: Repositories like GenBank, EMBL, and DDBJ provide annotated genetic sequences for a wide range of organisms.

      - **Genomic Projects**: Large-scale initiatives, such as the Human Genome Project and the 1000 Genomes Project, generate extensive datasets that are valuable for phylogenetic studies.
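
As a minimal sketch of what "comparing sequences" means in practice, the snippet below counts the proportion of sites at which two already-aligned sequences differ. The sequences and the helper name `p_distance` are made up for illustration; a real analysis would start from sequences downloaded from a repository such as GenBank and aligned first.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned, non-gap sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must be aligned to the same length")
    # Ignore alignment columns where either sequence has a gap
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("No comparable sites")
    mismatches = sum(a != b for a, b in pairs)
    return mismatches / len(pairs)


# Toy aligned sequences (hypothetical, not real data)
toy_sequences = {
    "Species_A": "ATGCTAGCTA",
    "Species_B": "ATGCTAGCTT",
    "Species_C": "ATGGTAGATT",
}

names = list(toy_sequences)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = p_distance(toy_sequences[first], toy_sequences[second])
        print(f"{first} vs {second}: {d:.2f}")
```

Pairs with a smaller proportion of differing sites are candidates for closer relatives, which is the intuition that tree-building methods formalize.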

#### **1.3.2 Sequencing Technologies Enabling Data Generation**
Advances in sequencing technologies have revolutionized the field of phylogenetics by making genetic data more accessible and affordable. Key technologies include:

- **Next-Generation Sequencing (NGS):** NGS allows for the rapid and cost-effective generation of high-quality genetic data from a wide range of organisms. It has democratized access to sequencing, enabling researchers to study non-model organisms and large populations.

- **Third-Generation Sequencing:** Technologies like PacBio and Oxford Nanopore provide long-read sequencing, which is particularly useful for resolving complex genomic regions and improving the accuracy of phylogenetic trees.

These technologies have expanded the scope of phylogenetic studies by providing richer and more diverse datasets.
  
#### **1.3.3 Mathematical Modeling and Tree Construction**
The construction of phylogenetic trees involves applying mathematical models to genetic data to estimate evolutionary relationships. Key steps and methods include:

- **Sequence Alignment:**
  - Genetic sequences are aligned to identify homologous regions, which are then used for comparison.
  - Tools like Clustal Omega and MAFFT are commonly used for this purpose.

- **Model Selection:**
  - Evolutionary models (e.g., Jukes-Cantor, Kimura, or more complex models like GTR) are selected to account for variations in mutation rates and other evolutionary processes.

- **Tree Inference Methods:**
  - Distance-Based Methods: These methods, such as Neighbor-Joining, calculate pairwise distances between sequences and use them to construct trees.
  - Character-Based Methods: Maximum Parsimony and Maximum Likelihood approaches evaluate possible trees and select the one that best explains the observed data.
  - Bayesian Inference: This method uses probabilistic models to estimate the posterior distribution of trees, incorporating prior knowledge and uncertainty.

- **Tree Visualization and Validation:**
  - Constructed trees are visualized using tools like FigTree or iTOL.
  - Bootstrap analysis or posterior probabilities are used to assess the robustness of the tree topology.
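
To make the role of an evolutionary model more concrete, here is a small sketch of the Jukes-Cantor (JC69) correction. It converts an observed proportion of differing sites, p, into an estimated number of substitutions per site using d = -(3/4) ln(1 - 4p/3). The function name is an assumption made for illustration; phylogenetics software normally applies such models internally during tree inference.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """JC69-corrected distance from an observed proportion of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("JC69 is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


# Compare observed and corrected distances for a few illustrative values
for p in (0.05, 0.10, 0.30, 0.50):
    print(f"observed p = {p:.2f} -> JC69 distance = {jukes_cantor_distance(p):.3f}")
```

The corrected distance is always at least as large as the observed one, because several substitutions at the same site can leave only one visible difference (or none).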

#### **1.3.4 Integrating Data and Methods for Accurate Phylogenetics**

The accuracy of phylogenetic trees depends on the quality of the data, the appropriateness of the evolutionary models, and the computational methods used. By integrating diverse data sources, leveraging advanced sequencing technologies, and applying robust mathematical models, researchers can reconstruct evolutionary relationships with greater confidence.
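
The steps described in this section can be sketched end to end with Biopython's `Bio.Phylo.TreeConstruction` module. This is only an illustration under simplifying assumptions: the four sequences are invented and already aligned, the simple `identity` distance is used instead of a fitted evolutionary model, and Neighbor-Joining stands in for whichever inference method a real study would choose.

```python
from Bio import Phylo
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy, already-aligned sequences (hypothetical, not real data)
toy_alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ATGCTAGCTAGG"), id="Taxon_A"),
    SeqRecord(Seq("ATGCTAGCTTGG"), id="Taxon_B"),
    SeqRecord(Seq("ATGGTAGATTGC"), id="Taxon_C"),
    SeqRecord(Seq("TTGGTAGATTCC"), id="Taxon_D"),
])

# Step 1: pairwise distances (1 - fraction of identical sites)
calculator = DistanceCalculator("identity")
distance_matrix = calculator.get_distance(toy_alignment)
print(distance_matrix)

# Step 2: infer a tree from the distance matrix with Neighbor-Joining
constructor = DistanceTreeConstructor()
nj_tree = constructor.nj(distance_matrix)

# Step 3: quick text rendering of the inferred tree
Phylo.draw_ascii(nj_tree)
```

With real data, the alignment would come from a tool such as MAFFT or Clustal Omega, and the resulting tree would be checked with bootstrap support before interpretation.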

### **1.4 Newick Format**
The Newick format is a way of representing tree structures, especially phylogenetic trees, in a simple and compact text format. It is widely used in bioinformatics and evolutionary biology to store and exchange tree data.

**Syntax**

- A tree is represented using parentheses to denote branching and colons to indicate branch lengths.
- Leaf nodes (species or taxa) are labeled with names.
- Internal nodes (ancestral relationships) are represented by nested parentheses.
- The tree ends with a semicolon (;).

**Basic Example**

`(A,B,(C,D));`

This represents a tree in which C and D are grouped together as each other's closest relatives, while A, B, and the (C,D) group all branch from the same node at the base of the tree.

**Example with Branch Lengths**

``(A:0.5, B:0.6, (C:0.7, D:0.8):0.4);``

Here, A, B, C, and D are species.
The numbers after colons (e.g., 0.5, 0.6) represent branch lengths, which often correspond to evolutionary distances.
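
As a quick check on how branch lengths are read, the sketch below loads the example above with Biopython and sums the branch lengths along the path between two leaves. The variable names are illustrative.

```python
from io import StringIO

from Bio import Phylo

newick_text = "(A:0.5, B:0.6, (C:0.7, D:0.8):0.4);"
example_tree = Phylo.read(StringIO(newick_text), "newick")

# List each leaf with the length of the branch leading to it
for leaf in example_tree.get_terminals():
    print(leaf.name, leaf.branch_length)

# The path from A to C crosses three branches: 0.5 (A) + 0.4 (internal) + 0.7 (C)
distance_a_c = example_tree.distance("A", "C")
print(f"Distance between A and C: {distance_a_c:.1f}")

# Text rendering of the tree, no plotting library required
Phylo.draw_ascii(example_tree)
```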

**Reading the Format**

1. Parentheses group related species together.
2. Comma (,) separates branches at the same level.
3. Colon (:) followed by a number indicates branch length.
4. The entire structure ends with a semicolon (;).
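
The reading rules above can be turned into a very small parser. The sketch below is a teaching aid only: it handles unquoted names and optional branch lengths, and ignores quoted labels, comments, and other features of the full format. The function names are hypothetical; in practice, `Bio.Phylo.read` should be used.

```python
import re

SPECIAL = set("(),:;")


def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    # Split into punctuation tokens and name/number tokens, skipping whitespace
    tokens = re.findall(r"[(),:;]|[^(),:;\s]+", text)
    position = 0

    def peek():
        return tokens[position] if position < len(tokens) else None

    def parse_node():
        nonlocal position
        children = []
        if peek() == "(":                 # rule 1: parentheses group children
            position += 1
            children.append(parse_node())
            while peek() == ",":          # rule 2: commas separate siblings
                position += 1
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("Unbalanced parentheses")
            position += 1
        name = None
        if peek() is not None and peek() not in SPECIAL:
            name = peek()
            position += 1
        length = None
        if peek() == ":":                 # rule 3: colon introduces a branch length
            position += 1
            length = float(peek())
            position += 1
        return (name, length, children)

    root = parse_node()
    if peek() != ";":                     # rule 4: the tree ends with a semicolon
        raise ValueError("A Newick string must end with ';'")
    return root


def leaf_names(node):
    """Collect leaf names from left to right."""
    name, _length, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


parsed = parse_newick("(A:0.5, B:0.6, (C:0.7, D:0.8):0.4);")
print(leaf_names(parsed))
```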


### **1.5 Types of Phylogenetic Trees and Their Applications**

#### 1. **Rooted Trees:**
A rooted tree is a type of phylogenetic tree that has a single common ancestor (root) at its base. All other branches in the tree descend from this root, representing evolutionary relationships over time. The direction of branches in a rooted tree is important because it indicates the passage of time and evolutionary divergence of species or variants.

In the context of COVID-19 variant tracking, a rooted tree can be used to show how different variants of the virus evolved from an original strain. The root represents the inferred common ancestor of all sampled variants, and as mutations accumulate over time, new branches (variants) emerge.

**Key Features of a Rooted Tree:**

- Has a single root node that represents the most recent common ancestor of all taxa in the tree.

- Branches represent evolutionary paths, and their lengths may indicate time or genetic distance.

- Nodes (branching points) represent divergence events, such as the emergence of new variants.

- The direction of branches shows the evolutionary history over time.

In [None]:
from Bio import Phylo
from io import StringIO
import matplotlib.pyplot as plt

# Hypothetical rooted tree for COVID-19 variants in Newick format
covid_rooted_tree = "((Alpha:0.2, Delta:0.3):0.5, (Omicron:0.4, Beta:0.6):0.3);"
tree = Phylo.read(StringIO(covid_rooted_tree), "newick")

# Visualization (do_show=False lets us add the title before the figure is displayed)
fig, ax = plt.subplots(figsize=(12, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("Rooted Tree: COVID-19 Variants Evolution", fontsize=14, weight='bold')
plt.show()

- **Interpretation**: This rooted tree helps trace how different variants evolved from a common ancestor.
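
Beyond drawing the tree, a rooted tree can be queried directly. The sketch below reuses the same hypothetical Newick string and asks Biopython for the common ancestor of two variants and for each variant's root-to-tip distance. The variable names are illustrative.

```python
from io import StringIO

from Bio import Phylo

# Same hypothetical rooted tree as in the cell above
covid_rooted_tree = "((Alpha:0.2, Delta:0.3):0.5, (Omicron:0.4, Beta:0.6):0.3);"
rooted = Phylo.read(StringIO(covid_rooted_tree), "newick")

# Node from which Alpha and Delta both descend
ancestor = rooted.common_ancestor("Alpha", "Delta")
sister_names = [leaf.name for leaf in ancestor.get_terminals()]
print("Descendants of the Alpha/Delta ancestor:", sister_names)

# Sum of branch lengths from the root to each variant
depths = {leaf.name: rooted.distance(leaf) for leaf in rooted.get_terminals()}
for name, depth in depths.items():
    print(f"{name}: {depth:.1f} from the root")
```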

In [None]:
from Bio import Phylo
from io import StringIO
import matplotlib.pyplot as plt

# Newick format representing a rooted tree for mammals (branch lengths are illustrative)
rooted_tree = "(((Human:0.1, Chimpanzee:0.1):0.5, Mouse:0.6):0.2, (Dog:0.4, Cat:0.4):0.4);"
tree = Phylo.read(StringIO(rooted_tree), "newick")

# Visualization (do_show=False lets us add the title before the figure is displayed)
fig, ax = plt.subplots(figsize=(12, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("Rooted Tree: Evolutionary Relationships Among Mammals", fontsize=14, weight='bold')
plt.show()


##### **Exercise 1**
##### **Create a Rooted Tree using the following dataset:**
##### Dataset 1: Avian Species Evolution
##### The dataset is provided in Newick format:
##### ((Sparrow:0.3, Crow:0.4):0.2, (Eagle:0.5, Hawk:0.6):0.3);


In [None]:
from IPython.display import IFrame
IFrame("Exercises/answers1.html", width=800, height=350)

In [None]:
## Implement your solution here


#### 2. **Unrooted Tree**

An unrooted tree is a type of phylogenetic tree that does not specify a common ancestor. Instead of showing evolutionary direction over time, it represents genetic relationships between different species or strains without assuming a starting point.

In the context of COVID-19 strain analysis, an unrooted tree can be used to visualize genetic similarity between different virus strains without assuming which one evolved first.

**Key Characteristics of Unrooted Trees**

-  **No Root (Common Ancestor):** Unlike rooted trees, unrooted trees do not show the earliest ancestor of all species/strains.

- **Focus on Genetic Distance:**  The branches indicate genetic similarity or dissimilarity, but they do not show which strain evolved from which.

- **Interpretation Based on Clustering:** Strains that are closer together on the tree are more genetically similar, while those further apart have accumulated more mutations.

- **Common in Genetic Similarity Studies:** Used when we have genetic data but do not know the exact evolutionary history.


**Example: Genetic Similarity Among COVID-19 Strains**
Let's visualize an unrooted phylogenetic tree of different COVID-19 strains using Biopython.


In [None]:
from Bio import Phylo
from io import StringIO
import matplotlib.pyplot as plt

# Hypothetical genetic similarity tree in Newick format (unrooted, star-shaped)
covid_unrooted_tree = "(Alpha:0.1, Beta:0.2, Delta:0.15, Omicron:0.25, Gamma:0.18);"

# Load the tree from the Newick string
tree = Phylo.read(StringIO(covid_unrooted_tree), "newick")

# Phylo.draw uses a rectangular layout, so the left-most node is only a drawing
# anchor here and should not be read as a common ancestor
fig, ax = plt.subplots(figsize=(8, 6))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("Unrooted Tree: Genetic Similarity Among COVID-19 Strains", fontsize=14, weight='bold')

# Display the tree
plt.show()


In [None]:
from Bio import Phylo
from io import StringIO
import matplotlib.pyplot as plt

# Define a more complex unrooted tree in Newick format (illustrative groupings)
newick_unrooted_tree = "(((Human, Chimpanzee), (Dog, Wolf)), ((Cat, Tiger), (Elephant, Horse)), ((Frog, Lizard), (Eagle, Hawk)));"

# Read the tree using Biopython
tree = Phylo.read(StringIO(newick_unrooted_tree), "newick")

# Visualize it; the rectangular layout needs a drawing anchor on the left,
# which should not be interpreted as a common ancestor
fig, ax = plt.subplots(figsize=(10, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("Unrooted Phylogenetic Tree with More Taxa", fontsize=14)
plt.show()


**Explanation of the Tree**

This unrooted tree contains three major groups:

- Primates & Canines: (Human, Chimpanzee) and (Dog, Wolf)
- Felidae & Large Mammals: (Cat, Tiger) and (Elephant, Horse)
- Amphibians, Reptiles & Birds: (Frog, Lizard) and (Eagle, Hawk)

Because the tree is unrooted, it does not identify a common ancestor or a direction of evolution; it only shows how the taxa are grouped relative to one another. The groupings here are illustrative and are not meant as an accurate vertebrate phylogeny.

- **When to Use**: Unrooted trees are helpful when ancestry is unknown or not relevant.
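
One way to see that an unrooted tree carries distance information but no direction is to root it and confirm that the distances between taxa stay the same. The sketch below uses a small made-up tree and Biopython's `root_with_outgroup`; the taxon names and branch lengths are assumptions for illustration.

```python
from io import StringIO

from Bio import Phylo

# Hypothetical unrooted tree (three branches meet at the top-level node)
unrooted_text = "((A:0.1, B:0.2):0.3, C:0.4, D:0.5);"
tree_for_rooting = Phylo.read(StringIO(unrooted_text), "newick")

# Path length between A and C before choosing a root
before = tree_for_rooting.distance("A", "C")

# Place the root on the branch leading to D (D becomes the outgroup)
tree_for_rooting.root_with_outgroup({"name": "D"})

# The same path length after rooting
after = tree_for_rooting.distance("A", "C")
print(f"A to C before rooting: {before:.1f}, after rooting: {after:.1f}")

Phylo.draw_ascii(tree_for_rooting)
```

Choosing a different outgroup changes which node is drawn as the ancestor, not how far apart the taxa are.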

##### **Exercise 2**
##### **Interactive Question: Rooting an Unrooted Tree**
##### Given the following unrooted tree for Microbial Communities:
##### ***(Bacteria, Archaea, Eukaryota);***
##### Tasks:
- Visualize the unrooted tree using Biopython.
- Root the tree using one taxon (for example, "Bacteria") as the outgroup.
- Root the tree again, but this time use "Archaea" as the outgroup.
- Compare the two rooted trees:
  - How do the relationships between Bacteria, Archaea, and Eukaryota change?
  - Which rooting method provides a more meaningful representation, and why?



In [None]:
from IPython.display import IFrame
IFrame("Exercises/answers2.html", width=800, height=350)

In [None]:
## Implement your solution here


#### 3. **Cladograms:**

A cladogram is a type of phylogenetic tree that illustrates the branching order (relationships between species) but does not show branch lengths or evolutionary distances. Instead, it focuses only on shared ancestry and divergence patterns.

**Key Features of a Cladogram**
- **No Branch Lengths:**  Unlike phylograms, cladograms do not represent evolutionary time or genetic differences.

- **Only Branching Order Matters:**  It shows which species are closely related, but not how much they differ genetically.

- **Represents Hypotheses:** It proposes relationships based on shared derived characteristics, which are used to group taxa into clades.

- **Used in Evolutionary Biology:** Often constructed using morphological traits or genetic data.


   

In [None]:
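from Bio import Phylo
from io import StringIO
import matplotlib.pyplot as plt

# Newick format for a primate cladogram (branching order only, no branch lengths)
cladogram = "(((Human, Chimpanzee), Gorilla), Orangutan);"
tree = Phylo.read(StringIO(cladogram), "newick")

# Visualization (do_show=False lets us add the title before the figure is displayed)
fig, ax = plt.subplots(figsize=(10, 6))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("Cladogram: Evolution of Primates", fontsize=14, weight='bold')
plt.show()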

- **Why It Matters**: Cladograms focus on relationships rather than evolutionary time.
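
The difference between a cladogram and a tree with branch lengths can be shown by taking a tree that has lengths and discarding them. In the sketch below, the branch lengths are hypothetical and the variable names are illustrative; only the branching order survives.

```python
from io import StringIO

from Bio import Phylo

# Primate tree with hypothetical branch lengths
tree_with_lengths = "(((Human:0.1, Chimpanzee:0.1):0.05, Gorilla:0.2):0.1, Orangutan:0.3);"
primate_tree = Phylo.read(StringIO(tree_with_lengths), "newick")
order_before = [leaf.name for leaf in primate_tree.get_terminals()]

# Drop every branch length so that only the branching order remains
for clade in primate_tree.find_clades():
    clade.branch_length = None

order_after = [leaf.name for leaf in primate_tree.get_terminals()]
print("Leaf order unchanged:", order_before == order_after)

# Drawn without lengths, every branch gets the same unit length
Phylo.draw_ascii(primate_tree)
```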



In [None]:
from Bio import Phylo
from io import StringIO
import matplotlib.pyplot as plt

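# Newick format for a cladogram (branching order only)
# Lizards and birds are grouped together, with mammals, frogs, and fish branching off earlier
cladogram = "(Fish, (Frog, (Mammal, (Lizard, Bird))));"
tree = Phylo.read(StringIO(cladogram), "newick")

# Visualization (do_show=False lets us add the title before the figure is displayed)
fig, ax = plt.subplots(figsize=(10, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("Cladogram: Vertebrate Evolution", fontsize=14, weight='bold')
plt.show()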

##### **Exercise 3**
##### **Using the following Newick format:**
`(((Human, Chimpanzee), Gorilla), Orangutan);`

Which species are most closely related in the cladogram?

In [None]:
from IPython.display import IFrame
IFrame("Exercises/answers3.html", width=800, height=350)

In [None]:
## Implement your solution here


#### 4. **Phylograms**
A phylogram is a type of phylogenetic tree that represents:

- **Branching Order:**  Which species are closely related.

- **Branch Lengths:** The amount of evolutionary change along each branch (e.g., the expected number of substitutions per site).

Unlike cladograms, phylograms use branch lengths to show the degree of genetic divergence between species.

**Example: Genetic Divergence in Fruit Flies**

A phylogram of fruit flies can illustrate how much genetic change separates different *Drosophila* species (e.g., *Drosophila melanogaster*, *Drosophila simulans*, *Drosophila yakuba*).

In [None]:
from Bio import Phylo
from io import StringIO
import matplotlib.pyplot as plt

# Phylogram of fruit fly species; the branch lengths are hypothetical
fruit_fly_phylogram = "(((Drosophila_melanogaster:0.1, Drosophila_simulans:0.1):0.1, Drosophila_yakuba:0.2):0.25, Drosophila_ananassae:0.5);"

# Read the tree from the Newick string
tree = Phylo.read(StringIO(fruit_fly_phylogram), "newick")

# Plotting the Phylogram
fig, ax = plt.subplots(figsize=(10, 6))
Phylo.draw(tree, axes=ax, do_show=False)  # Ensure we can modify the plot

# Add a title
ax.set_title("Phylogram: Genetic Divergence in Fruit Flies", fontsize=14, weight='bold')

# Display the plot
plt.show()


- **Why It Matters**: Useful for understanding evolutionary rates and genetic distances.
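
Because a phylogram stores branch lengths, genetic distances can be read off numerically as well as visually. The sketch below uses a small made-up tree and computes the path length between every pair of leaves, along with the total tree length. Taxon names and lengths are assumptions for illustration.

```python
from io import StringIO
from itertools import combinations

from Bio import Phylo

# Hypothetical phylogram with three taxa
toy_phylogram = Phylo.read(StringIO("((X:0.1, Y:0.2):0.3, Z:0.6);"), "newick")
taxa = [leaf.name for leaf in toy_phylogram.get_terminals()]

# Path length (sum of branch lengths) between each pair of leaves
pairwise = {
    (first, second): toy_phylogram.distance(first, second)
    for first, second in combinations(taxa, 2)
}
for (first, second), value in pairwise.items():
    print(f"{first} - {second}: {value:.1f}")

# Sum of all branch lengths in the tree
total_length = toy_phylogram.total_branch_length()
print(f"Total tree length: {total_length:.1f}")
```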



##### **Exercise 4**
##### Visualize the Phylogram: Use the provided Newick format to visualize the genetic divergence among fruit flies:
##### ((A:0.2,B:0.3):0.4,C:0.5);



In [None]:
from IPython.display import IFrame
IFrame("Exercises/answers4.html", width=800, height=350)

In [None]:
## Implement your solution here


#### 5. **Dendrograms**:
    
A dendrogram is a tree-like diagram used to illustrate the arrangement of clusters based on hierarchical relationships. Unlike phylogenetic trees that focus on evolutionary history, dendrograms are often used in hierarchical clustering, such as:

- Gene Expression Analysis (grouping genes with similar expression patterns).

- Linguistics (showing language relationships).

- Social Sciences (clustering similar behaviors or preferences).

**Example: Gene Expression Analysis**

In genomics, dendrograms help visualize how genes cluster based on expression levels across different samples.

In [None]:
import numpy as np
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

# Hypothetical gene expression data (rows: genes, columns: samples)
np.random.seed(42)
gene_expression_data = np.random.rand(10, 5)  # 10 genes, 5 samples

# Compute hierarchical clustering using Euclidean distance and Ward's linkage
linkage_matrix = sch.linkage(gene_expression_data, method='ward', metric='euclidean')

# Plot the dendrogram
plt.figure(figsize=(10, 6))
dendro = sch.dendrogram(linkage_matrix, labels=[f"Gene {i+1}" for i in range(10)], leaf_rotation=45)

# Add title and labels
plt.title("Dendrogram: Hierarchical Clustering of Gene Expression", fontsize=14, weight='bold')
plt.xlabel("Genes")
plt.ylabel("Cluster Distance")

# Display the plot
plt.show()
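
A dendrogram is often followed by cutting the hierarchy into a fixed number of groups. The sketch below repeats the clustering on randomly generated stand-in data and uses SciPy's `fcluster` to assign each gene to one of at most three clusters. The choice of three clusters is arbitrary and only for illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Random stand-in for gene expression data (10 genes x 5 samples), not real measurements
rng = np.random.default_rng(42)
expression = rng.random((10, 5))

# Same clustering choices as above: Euclidean distance with Ward's linkage
linkage_matrix = sch.linkage(expression, method="ward", metric="euclidean")

# Cut the hierarchy so that at most three flat clusters remain
cluster_labels = sch.fcluster(linkage_matrix, t=3, criterion="maxclust")

for gene_index, label in enumerate(cluster_labels, start=1):
    print(f"Gene {gene_index}: cluster {label}")
```

Genes that share a label fall under the same branch of the dendrogram when it is cut at that level.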


### **1.6 Why Choose a Specific Tree Type?**

1. **Rooted Trees**: Ideal for understanding evolutionary direction and common ancestry.

2. **Unrooted Trees**: Best for visualizing relationships when ancestral information is unavailable.

3. **Cladograms**: Focus on relationships without evolutionary time.

4. **Phylograms**: Combine evolutionary relationships with branch lengths for quantitative analysis.

5. **Dendrograms**: Use clustering for hierarchical relationships in data.

### **Summary**

Phylogenetic trees are powerful tools for visualizing evolutionary relationships and understanding genetic changes. In the context of COVID-19, they help trace variants, study mutations, and inform public health strategies.

By mastering the tools and concepts in this module, you will:

- Interpret and analyze biological data.

- Construct phylogenetic trees using Python.

- Solve real-world problems like tracking disease evolution.


In the next module, "Collect and Prepare Sequence Data", we will cover the preprocessing steps needed before constructing a phylogenetic tree.



### **Interactive Quiz**
The following quiz will help reinforce the understanding of phylogenetics:

In [None]:
from IPython.display import IFrame
IFrame("Quiz/QS1.html", width=800, height=1000)