# Submodule #3: Construct Phylogenetic Tree
Phylogenetic tree construction is the process of creating a diagram that shows the evolutionary relationships among species or genes based on similarities and differences in their genetic or physical traits. The tree visually represents how they evolved from common ancestors.

##### Primary Objective #####
Guide learners through the process of constructing phylogenetic trees from aligned sequence data. This module introduces tools and workflows for high-performance sequence alignment and tree construction.

### Overview
- **What You'll Learn:**
  - Perform sequence alignment using MAFFT and ClustalW.
  - Understand VCF format, why it is converted, and its relevance in USHER analysis.
  - Construct phylogenetic trees using USHER.


- **Tools and Libraries:**
  - **MAFFT**: Fast and scalable sequence alignment tool.
  - **ClustalW**: Alternative sequence alignment tool for comparison.
  - **USHER**: Tool for constructing phylogenetic trees.
  - **VCF format**: Essential data format for storing genetic variants.
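
Before running the workflow, it can help to confirm that the command-line tools are installed. The sketch below is a minimal check, assuming the executables are named as they appear in the commands of this notebook (`mafft`, `clustalw`, `faToVcf`, `fasttree`, `usher`); the helper name `missing_tools` is hypothetical.

```python
import shutil

def missing_tools(tool_names):
    """Return the names from tool_names that cannot be found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]

# Executable names as used in this notebook's commands (adjust if yours differ)
REQUIRED_TOOLS = ["mafft", "clustalw", "faToVcf", "fasttree", "usher"]

missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found.")
```

A tool reported as missing may still be installed under a different name or outside `PATH`, so treat the result as a first hint.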








## Learning Objectives ##
By the end of this submodule, learners will be able to:

1. Explain the importance of sequence alignment and phylogenetic tree construction.
2. Perform sequence alignment using MAFFT and compare it with ClustalW.
3. Understand the VCF format and its relevance in tree creation workflows.
4. Use USHER to construct large-scale phylogenetic trees.
5. Describe the role of cloud computing in reducing computational costs.

----------------------------------------------------------------------------------------------------------------
# Training Plan 


Submodule #1: Understanding the Basics of Phylogenetics

Submodule #2: Collect and Prepare Sequence Data and Analysis

<font color="green"> **Submodule #3: Construct Phylogenetic Tree** </font>
 
Submodule #4: Analyze Phylogenetic Tree

## 3.1: Sequence Alignment ##
**Sequence Alignment** is a critical first step in phylogenetic tree construction. It involves arranging DNA, RNA, or protein sequences to identify regions of similarity, which may indicate evolutionary, functional, or structural relationships. Proper alignment ensures that homologous (evolutionarily related) positions are compared across sequences, allowing evolutionary relationships to be accurately represented.
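
To make the idea of arranging sequences so that related positions line up more concrete, here is a minimal sketch of global pairwise alignment (the Needleman-Wunsch dynamic programming method). It is not what MAFFT or ClustalW run internally for multiple alignment; the function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs only gaps
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: best of diagonal (pair), up (gap in b), left (gap in a)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[len(a)][len(b)]


aligned_a, aligned_b, total = needleman_wunsch("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
print("score:", total)
```

The gap character `-` marks a position where one sequence has no counterpart in the other, which is the same convention used in the aligned FASTA files produced later in this submodule.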

***Why Use FastQC in Submodule 2 if Alignment Happens Later?***
FastQC is used in Submodule 2 to check the quality of sequence data before alignment, because the accuracy of sequence alignment and of the downstream phylogenetic analysis depends heavily on the quality of the input data. It reports signs of sequencing errors, adapter contamination, and low-quality reads that could distort the alignment, so that problem sequences can be cleaned and trimmed before they reach tools like MAFFT or ClustalW. Higher-quality input leads to more reliable evolutionary relationships in the resulting phylogenetic trees. For students, this step also highlights quality control as a critical part of bioinformatics workflows.


We study two tools that can be used for sequence alignment:
1. MAFFT
2. ClustalW

### Tool 1: MAFFT (Multiple Alignment using Fast Fourier Transform)

#### What is MAFFT? ####
MAFFT is a bioinformatics tool used for multiple sequence alignment (MSA) of DNA or protein sequences. It is widely recognized for its speed, accuracy, and ability to handle large datasets.

**Key Features of MAFFT:**
1. **Fast Algorithm**: MAFFT uses advanced algorithms like Fast Fourier Transform (FFT) to quickly identify sequence similarities, making it faster than many other alignment tools.
2. **Scalable**: It can align thousands of sequences efficiently, which is ideal for large-scale studies.
3. **Multiple Strategies**: MAFFT offers different alignment methods, such as progressive alignment (quick) and iterative refinement (more accurate).
4. **User-Friendly**: It provides both command-line and web-based interfaces, making it accessible for beginners and experts.

**Command Used for MAFFT:** `mafft input_sequences.fasta > aligned_sequences.fasta`
- `input_sequences.fasta`: Input file containing gene sequences.
- `aligned_sequences.fasta`: Output file with aligned sequences.

**Example:** **Steps to Perform Sequence Alignment with MAFFT**
1. Create the output directory, then align the sequences in your dataset using the following commands:



In [None]:
# Create the folder structure
import os

alignment_dir = './data/cov/alignment'

# Create the directory (and any missing parents); do nothing if it already exists
try:
    os.makedirs(alignment_dir, exist_ok=True)
    print(f"Directory ready: {alignment_dir}")
except OSError as e:
    print(f"An error occurred: {e}")

In [None]:
 !mafft --auto ./data/cov/sequence/sequences.fasta > ./data/cov/alignment/aligned_sequences.fasta


### Why MAFFT?
- MAFFT is fast and scalable.
- It works well for large datasets compared to ClustalW.
- It is computationally efficient.

### Tool 2: ClustalW
ClustalW is a bioinformatics tool used for multiple sequence alignment (MSA) of DNA or protein sequences. It aligns sequences step-by-step using a progressive alignment method.
### Key Features of ClustalW
1. **Progressive Alignment**: Compares sequences in pairs, builds a guide tree from the resulting distances, and then adds sequences to the alignment in the order given by that tree.
2. **Simple and Reliable**: Well-suited for aligning small to moderate datasets.
3. **Widely Used**: Popular for teaching, research, and constructing phylogenetic trees to study evolutionary relationships.
### Steps to Perform Sequence Alignment with ClustalW
1. **Define Paths for Files and Tools:**
 
   - fasta_file = "./data/cov/sequence/sequences.fasta"
   - clustalw_exe = "/opt/conda/bin/clustalw"
   - seq_algn_file = "./data/cov/alignment/sequences.aln"

### Process with ClustalW

In [None]:
import subprocess
import datetime

# Define the paths
fasta_file = "./data/cov/sequence/sequences.fasta"

clustalw_exe = "/opt/conda/bin/clustalw"
seq_algn_file = "./data/cov/alignment/sequences.aln"

start_time = datetime.datetime.now()
print(f"Process started at: {start_time}")

# Run ClustalW for multiple sequence alignment using subprocess
# -OUTPUT=FASTA writes the alignment in FASTA format to the -OUTFILE path
try:
    subprocess.run([clustalw_exe, "-INFILE=" + fasta_file, "-OUTFILE=" + seq_algn_file, "-OUTPUT=FASTA"], check=True)
except subprocess.CalledProcessError as e:
    print("Error running ClustalW:", e)
    raise  # stop this cell without shutting down the notebook kernel

end_time = datetime.datetime.now()
print(f"Process ended at: {end_time}")

# Calculate the duration
duration = end_time - start_time
print(f"Total time taken: {duration}")

### Comparison: MAFFT vs. ClustalW

| Feature    | MAFFT                          | ClustalW |
|----------  |----------                      |----------|
| Speed      | Very fast                      | Slower for large datasets |
| Scalability| Excellent for large inputs     | Moderate  |
| Usage      |Command-line, highly flexible   | Easier for beginners  |



### Performance Comparison: MAFFT vs. ClustalW
To help users understand the performance differences between MAFFT and ClustalW, we measured the time taken by each tool to align the same dataset (sequences.fasta, containing 3 months of data).
- **MAFFT**: Completed the alignment in approximately 20 minutes.
- **ClustalW**: Took over 1 hour to process the same dataset.
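
Timings like these depend on the dataset and the machine, so it is worth measuring them yourself. The following is a minimal sketch of a timing helper; `time_command` is a hypothetical name, and the optional `stdout_path` argument is an assumption added so that tools which write their result to standard output (as MAFFT does) can be timed without losing the output.

```python
import subprocess
import sys
import time

def time_command(args, stdout_path=None):
    """Run a command and return (elapsed_seconds, return_code).

    If stdout_path is given, the command's standard output is written there;
    otherwise it is discarded.
    """
    start = time.perf_counter()
    if stdout_path is None:
        result = subprocess.run(args, stdout=subprocess.DEVNULL)
    else:
        with open(stdout_path, "w") as out:
            result = subprocess.run(args, stdout=out)
    return time.perf_counter() - start, result.returncode

# Demonstration with a trivial command so the cell runs anywhere
elapsed, code = time_command([sys.executable, "-c", "print('hello')"])
print(f"exit code {code}, {elapsed:.3f} s")

# Example for this notebook (uncomment once MAFFT and the input file exist):
# time_command(["mafft", "--auto", "./data/cov/sequence/sequences.fasta"],
#              stdout_path="./data/cov/alignment/aligned_sequences.fasta")
```

Running both aligners through the same helper on the same input keeps the comparison fair.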


### Why is MAFFT Faster than ClustalW?
MAFFT's speed advantage comes mainly from how it estimates similarity between sequences and how it builds the alignment. Below are the key reasons:
1. **Fast Fourier Transform (FFT):**
MAFFT uses FFT to locate similar regions between sequences quickly, so the more expensive alignment computation can be restricted to those regions.

2. **Selectable Strategies:**
MAFFT offers fast progressive methods as well as iterative refinement. Iterative refinement improves accuracy but adds running time, so for large datasets the `--auto` option typically selects a fast progressive method.


3. **Optimized for Large Datasets:**
MAFFT is designed to handle thousands of sequences and avoids unnecessary computations, which makes it faster for big datasets.

4. **Parallel Processing:**
MAFFT can use multiple CPU cores through its `--thread` option, which speeds up large alignments significantly. ClustalW runs on a single core.

5. **Heuristic Methods:**
For its initial guide tree, MAFFT estimates distances between sequences from the short subsequences (k-mers) they share rather than aligning every pair in full, which avoids most of the pairwise alignment cost.
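
One approximation of this kind is to estimate how similar two sequences are from the short subsequences (k-mers) they share, without aligning them. The sketch below is a simplified illustration of that idea; the function names and the distance formula are assumptions chosen for clarity, not MAFFT's actual implementation.

```python
from collections import Counter

def kmer_counts(seq, k=6):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a, b, k=6):
    """Return 1 - (shared k-mers / k-mers in the shorter sequence).

    0.0 means every k-mer of the shorter sequence also occurs in the other;
    1.0 means the two sequences share no k-mers.
    """
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    smaller = min(sum(counts_a.values()), sum(counts_b.values()))
    if smaller == 0:
        return 1.0  # a sequence shorter than k has no k-mers to compare
    return 1.0 - shared / smaller

print(kmer_distance("ACGTACGTACGT", "ACGTACGTACGT", k=3))
print(kmer_distance("ACGTACGTACGT", "ACGTTCGTACGA", k=3))
```

Counting k-mers takes time proportional to sequence length, whereas a full pairwise alignment takes time proportional to the product of the two lengths, which is why this shortcut matters for thousands of sequences.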


## 3.2 Phylogenetic Tree Reconstruction Using USHER

#### Understanding VCF Format

###### **What is VCF?** 
The Variant Call Format (VCF) is a widely adopted text format for storing genetic variant information. Each record includes:

- The chromosome (or reference sequence) and position of the variant.
- The reference and alternate alleles, which describe variants such as SNPs (Single Nucleotide Polymorphisms), insertions, and deletions.
- Metadata: quality scores, depth, and other annotations.

Because VCF stores only the differences from a reference rather than full sequences, it is compact for genomic variant data, and it is compatible with tools like USHER.
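
To show how these fields are laid out, here is a minimal sketch that splits one tab-separated VCF data line into its standard columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then FORMAT and one column per sample). The example line and the function name are made up for illustration and do not come from this module's data.

```python
def parse_vcf_record(line):
    """Split one VCF data line (not a '#' header line) into named fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),            # 1-based position on the reference
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),      # several alternate alleles are comma-separated
        "qual": qual,
        "filter": flt,
        "info": info,
        "format": fields[8] if len(fields) > 8 else None,
        "genotypes": fields[9:],    # one entry per sample column
    }

# Illustrative record: a C-to-T change at position 241 seen in two of three samples
example = "NC_045512.2\t241\tC241T\tC\tT\t.\t.\tAC=2\tGT\t0\t1\t1"
record = parse_vcf_record(example)
print(record["pos"], record["ref"], record["alt"], record["genotypes"])
```

In the genotype columns, `0` means the sample carries the reference allele and `1` means it carries the first alternate allele.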
  
### **Why Convert to VCF?**
Tools like USHER require VCF files as input for constructing trees. Converting raw sequence alignments into VCF format ensures compatibility.

It is important to understand that faToVcf does more than change the file extension from FASTA to VCF. It compares each aligned sequence to a reference sequence (by default, the first sequence in the alignment) and records the positions where they differ. According to its usage message, faToVcf reports single-nucleotide differences and treats `N` and `-` as missing data, so insertions and deletions are not written as separate variant records. This comparison is a simple form of variant calling.

**What is Variant Calling?**
Variant calling is an analytical process where:
- The aligned sequences (from FASTA) are compared to a reference sequence.
- Any differences or changes in the sequences, such as mutations or gaps, are identified.
- These differences are described and saved in a structured format (VCF).
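
The three steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm of faToVcf or any other real tool: the function name is hypothetical, only single-nucleotide differences are reported, and `N` or `-` in a sample is treated as missing data.

```python
def call_snvs(reference, samples):
    """Compare aligned samples to an aligned reference.

    reference: aligned reference sequence (may contain '-').
    samples:   dict mapping sample name -> aligned sequence of the same length.
    Returns a sorted list of (position, ref_base, alt_base, [sample names]),
    with positions counted along the ungapped reference, starting at 1.
    """
    variants = {}
    for name, seq in samples.items():
        if len(seq) != len(reference):
            raise ValueError(f"{name} is not the same length as the reference")
        ref_pos = 0
        for ref_base, base in zip(reference.upper(), seq.upper()):
            if ref_base == "-":
                continue              # column absent from the reference: skipped
            ref_pos += 1
            if base in "N-" or base == ref_base:
                continue              # missing data or no difference
            variants.setdefault((ref_pos, ref_base, base), []).append(name)
    return sorted((pos, ref, alt, names) for (pos, ref, alt), names in variants.items())

reference = "ACGTAC"
samples = {"s1": "ACGTAC", "s2": "ATGTAC", "s3": "ATGT-C"}
for pos, ref, alt, names in call_snvs(reference, samples):
    print(pos, ref, alt, names)
```

Each tuple corresponds to one line of a VCF file: the position, the reference allele, the alternate allele, and which samples carry it.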

### USHER
For constructing a phylogenetic tree, we use USHER (officially styled UShER: Ultrafast Sample placement on Existing tRees). USHER is a bioinformatics tool specifically designed to place new genetic sequences onto an existing phylogenetic tree quickly and accurately. This allows us to study the evolutionary relationships of the sequences in the context of known data, which is particularly useful for analyzing large datasets or tracking genetic variation over time.

**Key Features of USHER**
- Speed: Processes large datasets quickly, even in real-time.
- Scalability: Handles complex phylogenies with thousands of sequences.
- Integration: Accepts input in common formats (e.g., VCF, Newick).

### Steps to Construct a Phylogenetic Tree with USHER
1. **Prepare Input Files:**
   - Aligned sequences in FASTA format.
   - VCF file containing sequence variants.
   - A reference tree in Newick format.


<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Always inspect the aligned sequences to ensure proper alignment before proceeding to phylogenetic tree construction.
</div>
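
One basic inspection is to confirm that every sequence in the aligned FASTA file has the same length, since a valid multiple alignment pads all sequences to a common length with gaps. The sketch below uses only the standard library; the function names are hypothetical, and the path is the output file of the MAFFT step in this notebook.

```python
import os

def read_fasta(path):
    """Read a FASTA file into a dict mapping sequence name -> sequence."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]   # keep the ID, drop the description
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def alignment_summary(records):
    """Report how many sequences there are and whether all lengths match."""
    lengths = {len(seq) for seq in records.values()}
    return {
        "n_sequences": len(records),
        "lengths": sorted(lengths),
        "is_aligned": len(lengths) == 1,
    }

# The check is skipped if the alignment has not been generated yet
aligned_path = "./data/cov/alignment/aligned_sequences.fasta"
if os.path.exists(aligned_path):
    print(alignment_summary(read_fasta(aligned_path)))
```

Equal lengths are necessary but not sufficient for a good alignment, so viewing the file in an alignment viewer is still worthwhile.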

In [None]:
# Create the folder structure
import os

tree_dir = './data/cov/phylogenetic_tree'

# Create the directory (and any missing parents); do nothing if it already exists
try:
    os.makedirs(tree_dir, exist_ok=True)
    print(f"Directory ready: {tree_dir}")
except OSError as e:
    print(f"An error occurred: {e}")

### 2. Convert Aligned Sequences to VCF:

In [None]:
 !faToVcf ./data/cov/alignment/aligned_sequences.fasta ./data/cov/phylogenetic_tree/phylogenetic_tree_aligned_sequences.vcf


### 3. Generate a Reference Tree in Newick Format:
This command uses FastTree to build a reference phylogenetic tree in Newick format from the aligned sequences (the `-nt` flag tells FastTree that the input is a nucleotide alignment). USHER needs this starting tree so that it can place the samples from the VCF file onto it.

In [None]:
 !fasttree -nt ./data/cov/alignment/aligned_sequences.fasta > ./data/cov/phylogenetic_tree/phylogenetic_tree_reference_aligned_sequences.nwk


### 4. Run USHER to Construct the Phylogenetic Tree:

In [None]:
!usher -t ./data/cov/phylogenetic_tree/phylogenetic_tree_reference_aligned_sequences.nwk -v ./data/cov/phylogenetic_tree/phylogenetic_tree_aligned_sequences.vcf -o ./data/cov/phylogenetic_tree/phylogenetic_tree_output_aligned_sequences.nwk


### 5. Output:

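- The command above writes its result to the path given with `-o` (here a file ending in `.nwk`).
- In the UShER documentation, `-o` saves the mutation-annotated tree, which is a protobuf file, while the Newick tree is written as `final-tree.nh` in the output directory set with `-d`. Check which of these files your run produced before opening one as Newick.
- You can visualize the Newick tree using compatible tools.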
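
Newick is a plain-text format in which nested parentheses describe the tree, so a quick check of a tree file needs no special software. The sketch below lists the leaf (sample) names in a Newick string with a regular expression; the function name is hypothetical, and the pattern assumes unquoted labels without spaces, which is an assumption about your sample names.

```python
import re

def newick_leaf_names(newick):
    """Return leaf names from a Newick string.

    A leaf label directly follows '(' or ','. Labels after ')' are internal
    node labels or support values, so they are not matched.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)

example_tree = "((A:0.1,B:0.2)0.95:0.05,C:0.3);"
leaves = newick_leaf_names(example_tree)
print(len(leaves), "leaves:", leaves)
```

Comparing the number of leaves with the number of sequences in your alignment is a simple way to confirm that every sample made it into the tree.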

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Tools like MAFFT and USHER integrate well with cloud computing platforms for high-performance analysis. For more information, refer to the README.
</div>

## Summary
This notebook shows how to construct phylogenetic trees by aligning sequences with MAFFT or ClustalW, converting the alignment to VCF, and building the tree with USHER. It explains why sequence alignment matters, compares the two alignment tools, and points to cloud computing for scalable analysis.
This submodule focuses on constructing the phylogenetic tree. Submodule 4 covers analyzing and visualizing the tree to extract meaningful insights, so that together the two submodules address both construction and interpretation.



## Interactive Quiz

Test your understanding of phylogenetic tree construction with this interactive quiz:

In [None]:
from IPython.display import IFrame
IFrame("Quiz/QS3.html", width=800, height=350)