# Submodule #3: Construct Phylogenetic Tree
Phylogenetic tree construction is the process of creating a diagram that shows the evolutionary relationships among species or genes, based on similarities and differences in their genetic or physical traits. The tree visually represents how they evolved from common ancestors.

##### Primary Objective #####
Guide learners through the process of constructing phylogenetic trees from aligned sequence data. This module introduces tools and workflows for high-performance sequence alignment and tree construction.

### Overview
- **What You'll Learn:**
  - Perform sequence alignment using MAFFT and ClustalW.
  - Understand the VCF format, why alignments are converted to it, and its relevance in USHER analysis.
  - Construct phylogenetic trees using USHER.
  - Use cloud platforms for scalable and cost-effective computations.

- **Tools and Libraries:**
  - **MAFFT**: Fast and scalable sequence alignment tool.
  - **ClustalW**: Alternative sequence alignment tool for comparison.
  - **USHER**: Tool for constructing phylogenetic trees.
  - **VCF format**: Essential data format for storing genetic variants.
  - **Cloud computing platforms**: For scalability and cost optimization.







## Learning Objectives ##
By the end of this submodule, learners will be able to:

1. Explain the importance of sequence alignment and phylogenetic tree construction.
2. Perform sequence alignment using MAFFT and compare it with ClustalW.
3. Understand the VCF format and its relevance in tree creation workflows.
4. Use USHER to construct large-scale phylogenetic trees.
5. Describe the role of cloud computing in reducing computational costs.

----------------------------------------------------------------------------------------------------------------
# Training Plan 


Submodule #1: Understanding the Basics of Phylogenetics

Submodule #2: Collect and Prepare Sequence Data and Analysis

<font color="green"> **Submodule #3: Construct Phylogenetic Tree** </font>
 
Submodule #4: Analyze Phylogenetic Tree

## 3.1: Sequence Alignment ##
**Sequence Alignment** is the first step in phylogenetic tree construction. Aligning gene sequences ensures that evolutionary relationships are accurately represented. It arranges sequences to highlight similarities and differences, providing the foundation for accurate tree construction.
We study two tools that can be used for sequence alignment:
1. MAFFT
2. ClustalW
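To make the idea of "arranging sequences to highlight similarities and differences" concrete, here is a minimal sketch of global pairwise alignment (the classic Needleman-Wunsch dynamic programming method). This is not how MAFFT or ClustalW work internally, since both align many sequences at once. The function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen only for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row/column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[len(a)][len(b)]


aligned_a, aligned_b, total = needleman_wunsch("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print("score:", total)
```

The gap character `-` in the output marks a position where one sequence has a base and the other does not. Multiple sequence aligners extend this idea to many sequences.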

### Tool 1: MAFFT (Multiple Alignment using Fast Fourier Transform)

#### What is MAFFT? ####
MAFFT is a bioinformatics tool used for multiple sequence alignment (MSA) of DNA or protein sequences. It is widely recognized for its speed, accuracy, and ability to handle large datasets.

**Key Features of MAFFT:**
1. **Fast Algorithm**: MAFFT uses the Fast Fourier Transform (FFT) to quickly identify similar regions between sequences, which makes it faster than many other alignment tools.
2. **Scalable**: It can align thousands of sequences efficiently, which is ideal for large-scale studies.
3. **Multiple Strategies**: MAFFT offers different alignment methods, such as progressive alignment (quick) and iterative refinement (more accurate).
4. **User-Friendly**: It provides both command-line and web-based interfaces, making it accessible for beginners and experts.

**Command Used for MAFFT:** `mafft input_sequences.fasta > aligned_sequences.fasta`
- `input_sequences.fasta`: Input file containing gene sequences.
- `aligned_sequences.fasta`: Output file with aligned sequences.

**Example: Steps to Perform Sequence Alignment with MAFFT**
1. Align the sequences in your dataset using the following command:



In [38]:
!mafft --auto ./data/cov/sequence/Coronavirus-Orf9-NCBI.fasta > ./data/cov/alignment/aligned_Coronavirus-Orf9-NCBI.fasta

outputhat23=16
treein = 0
compacttree = 0
stacksize: 10240 kb
generating a scoring matrix for nucleotide (dist=200) ... done
All-to-all alignment.
tbfast-pair (nuc) Version 7.526   3 / 18    4 / 18    5 / 18    6 / 18    7 / 18    8 / 18    9 / 18   10 / 18   11 / 18   12 / 18   13 / 18   15 / 18
alg=L, model=DNA200 (2), 2.00 (6.00), -0.10 (-0.30), noshift, amax=0.0
0 thread(s)

outputhat23=16
Loading 'hat3.seed' ... 
done.
Writing hat3 for iterative refinement
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ... 
   10 / 18
done.

Progressive alignment ... 
STEP    17 /17 
done.
tbfast (nuc) Version 7.526
alg=A, model=DNA200 (2), 1.53 (4.59), -0.00 (-0.00), noshift, amax=0.0
1 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 0
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
generating a s

### Why MAFFT?
- MAFFT is fast and scalable.
- It works well for large datasets compared to ClustalW.
- It is computationally efficient.

### Tool 2: ClustalW
ClustalW is a bioinformatics tool used for multiple sequence alignment (MSA) of DNA or protein sequences. It aligns sequences step-by-step using a progressive alignment method.
### Key Features of ClustalW
1. **Progressive Alignment**: Aligns sequences in pairs, then builds a guide tree to combine alignments progressively.
2. **Simple and Reliable**: Well-suited for aligning small to moderate datasets.
3. **Widely Used**: Popular for teaching, research, and constructing phylogenetic trees to study evolutionary relationships.
### Steps to Perform Sequence Alignment with ClustalW
1. **Define Paths for Files and Tools:**

   - `fasta_file = "./data/cov/sequence/sequences_subset.fasta"`
   - `clustalw_exe = "/opt/conda/bin/clustalw"` (adjust to the location of ClustalW on your system)
   - `seq_algn_file = "./data/cov/alignment/sequences_subset.aln"`

### Process with ClustalW

In [57]:
import subprocess
import datetime 
import matplotlib.pyplot as plt
import networkx as nx
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator

# Define the paths
fasta_file = "./data/cov/sequence/sequences_subset.fasta"
# clustalw_exe = "/home/ec2-user/anaconda3/envs/python3/bin/clustalw"
clustalw_exe = "/opt/conda/bin/clustalw"
seq_algn_file = "./data/cov/alignment/sequences_subset.aln"

start_time = datetime.datetime.now()
print(f"Process started at: {start_time}")

# Run ClustalW for multiple sequence alignment using subprocess
try:
    subprocess.run([clustalw_exe, "-INFILE=" + fasta_file, "-OUTFILE=" + seq_algn_file, "-OUTPUT=FASTA"], check=True)
except subprocess.CalledProcessError as e:
    # Re-raise so the cell fails visibly; calling exit() in a notebook can stop the kernel
    print("Error running ClustalW:", e)
    raise

end_time = datetime.datetime.now()
print(f"Process ended at: {end_time}")

# Calculate the duration
duration = end_time - start_time
print(f"Total time taken: {duration}")

Process started at: 2024-12-18 17:22:33.146277



 CLUSTAL 2.1 Multiple Sequence Alignments


Sequence format is Pearson
Sequence 1: England/QEUH-AD0825/2020                29903 bp
Sequence 2: AUS/VIC8093/2020                        29822 bp
Sequence 3: USA/OSPHL09179/2023                     29688 bp
Sequence 4: England/MILK-74F4E4/2020                29903 bp
Sequence 5: England/MILK-516E26/2020                29903 bp
Sequence 6: USA/UNKNOWN-UW-1403/2020                29882 bp
Sequence 7: USA/OSPHL09176/2023                     29688 bp
Sequence 8: USA/MA-CDCBI-CRSP_MLUB7OEWDYUWSKGQ/2021 29833 bp
Sequence 9: OY754681                                29847 bp
Sequence 10: USA/CA-LACPHL-AY03266/2023              29640 bp
Start of Pairwise alignments
Aligning...

Sequences (1:2) Aligned. Score:  99
Sequences (1:3) Aligned. Score:  97
Sequences (1:4) Aligned. Score:  97
Sequences (1:5) Aligned. Score:  98
Sequences (1:6) Aligned. Score:  99
Sequences (1:7) Aligned. Score:  97
Sequences 

### Comparison: MAFFT vs. ClustalW

| Feature     | MAFFT                         | ClustalW                  |
|-------------|-------------------------------|---------------------------|
| Speed       | Very fast                     | Slower for large datasets |
| Scalability | Excellent for large inputs    | Moderate                  |
| Usage       | Command-line, highly flexible | Easier for beginners      |


### Why is MAFFT Faster than ClustalW?
MAFFT is generally faster than ClustalW on large inputs because of its algorithmic design. The key reasons are:

1. **Fast Fourier Transform (FFT):**
   MAFFT uses FFT to rapidly locate similar regions between sequences, which narrows the search before the detailed alignment step and reduces computation time compared with ClustalW.

2. **Iterative Refinement:**
   MAFFT can refine an initial progressive alignment over several iterations, whereas ClustalW performs a single progressive pass. Refinement improves accuracy rather than speed, and MAFFT's fast initial alignment keeps the added cost manageable.

3. **Optimized for Large Datasets:**
   MAFFT is designed to align thousands of sequences and avoids unnecessary computations, which makes it faster on big datasets.

4. **Parallel Processing:**
   MAFFT supports multi-core processing, which speeds up tasks significantly. ClustalW lacks this capability.

5. **Heuristic Methods:**
   MAFFT uses approximations to skip redundant calculations.
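The FFT point can be illustrated with a small sketch. If each sequence is turned into four 0/1 indicator vectors (one per base), the number of matching bases at every possible offset is a cross-correlation, and an FFT computes all offsets at once instead of comparing them one by one. The code below is a simplified illustration under that assumption, written in pure Python. MAFFT's actual encoding and implementation are considerably more involved, and the function names here are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    The inverse transform is returned unnormalised (caller divides by n).
    """
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def match_counts_by_offset(a, b):
    """Return {offset k: number of positions i where a[i] == b[i + k]}."""
    size = 1
    while size < len(a) + len(b) - 1:
        size *= 2  # zero-pad so circular correlation does not wrap around
    total = [0.0] * size
    for base in "ACGT":
        u = [1.0 if ch == base else 0.0 for ch in a] + [0.0] * (size - len(a))
        v = [1.0 if ch == base else 0.0 for ch in b] + [0.0] * (size - len(b))
        fu, fv = fft(u), fft(v)
        # Correlation theorem: corr = IFFT(conj(FFT(u)) * FFT(v))
        corr = fft([x.conjugate() * y for x, y in zip(fu, fv)], inverse=True)
        for k in range(size):
            total[k] += corr[k].real / size
    # Negative offsets are stored at the end of the circular result.
    return {k: round(total[k % size]) for k in range(-(len(a) - 1), len(b))}


counts = match_counts_by_offset("ACGTACGT", "TTACGTAC")
best = max(counts, key=counts.get)
print("best offset:", best, "matches:", counts[best])
```

A high count at some offset suggests that the two sequences share a similar region when shifted by that amount, which is the kind of hint an aligner can use to focus its more expensive alignment step.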


## 3.2 Phylogenetic Tree Reconstruction Using USHER

#### Understanding VCF Format

###### **What is VCF?** 
The Variant Call Format (VCF) is a widely adopted format for storing genetic variant information. It records:

- Chromosome position of each variant.
- Type of variant: SNPs (Single Nucleotide Polymorphisms), insertions, deletions.
- Metadata: quality scores, depth, and other annotations.

VCF is efficient for handling genomic variant data and is compatible with tools like USHER.

### **Why convert to VCF?**
Tools like USHER require VCF files as input for constructing trees. Converting raw sequence alignments into VCF format ensures compatibility.
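To show what such a conversion involves, the following sketch turns a tiny alignment into VCF-style lines. It is a simplified illustration and not the implementation of any real converter. It assumes that the first sequence acts as the reference, that bases are upper-case, and that only single-base substitutions are reported. The function name, the `chrom` label, and the toy sequences are all hypothetical.

```python
def alignment_to_vcf_lines(aligned, chrom="ref"):
    """Build minimal VCF lines from {name: aligned_sequence}.

    Assumption: the first entry is the reference; gaps in the reference
    are skipped so POS counts reference bases only.
    """
    names = list(aligned)
    ref_seq = aligned[names[0]]
    samples = names[1:]
    header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    lines = ["##fileformat=VCFv4.2", "\t".join(header + samples)]
    pos = 0
    for col, ref_base in enumerate(ref_seq):
        if ref_base == "-":
            continue  # column is an insertion relative to the reference
        pos += 1
        alts, genotypes = [], []
        for name in samples:
            base = aligned[name][col]
            if base == ref_base:
                genotypes.append("0")          # same as reference
            elif base in ("A", "C", "G", "T"):
                if base not in alts:
                    alts.append(base)
                genotypes.append(str(alts.index(base) + 1))
            else:
                genotypes.append(".")          # gap or ambiguous base: missing
        if alts:  # only variable sites become VCF records
            lines.append("\t".join(
                [chrom, str(pos), ".", ref_base, ",".join(alts), ".", ".", ".", "GT"]
                + genotypes))
    return lines


toy = {"ref": "ACGTAC", "s1": "ACGTAC", "s2": "ATGTAC", "s3": "ACGTAG"}
for line in alignment_to_vcf_lines(toy):
    print(line)
```

Each record line lists one variable position, the reference base, the alternative base(s), and one genotype column per sample, which is the information a placement tool needs.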

### USHER:
For constructing a phylogenetic tree, we use USHER (Ultrafast Sample placement on Existing tRees). USHER is a bioinformatics tool designed to place new genetic sequences onto an existing phylogenetic tree quickly and accurately. This lets us study the evolutionary relationships of the sequences in the context of known data, which is particularly useful for analyzing large datasets or tracking genetic variation over time.

**Key Features of USHER**
- Speed: Processes large datasets quickly, even in real-time.
- Scalability: Handles complex phylogenies with thousands of sequences.
- Integration: Accepts input in common formats (e.g., VCF, Newick).

### Steps to Construct a Phylogenetic Tree with USHER
1. **Prepare Input Files:**
   - Aligned sequences in FASTA format.
   - VCF file containing sequence variants.
   - A reference tree in Newick format.


<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Always inspect the aligned sequences to ensure proper alignment before proceeding to phylogenetic tree construction.
</div>
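A quick programmatic check can complement a visual inspection. The sketch below parses FASTA text and reports whether all aligned sequences have the same length (a basic requirement of an alignment) and how much of each sequence consists of gaps. The function names are hypothetical and the example uses a small in-memory string. To check a real file, read its contents and pass them to `read_fasta`.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def alignment_report(records):
    """Summarise basic sanity checks for an aligned set of sequences."""
    lengths = {len(seq) for seq in records.values()}
    return {
        "n_sequences": len(records),
        "equal_length": len(lengths) == 1,
        "gap_fraction": {
            name: (seq.count("-") / len(seq) if seq else 0.0)
            for name, seq in records.items()
        },
    }


example = ">seqA\nAC-GT\n>seqB\nACGGT\n"
report = alignment_report(read_fasta(example))
print(report)
```

If `equal_length` is false, the file is not a valid alignment. A sequence with a very high gap fraction may be truncated or poorly aligned and is worth a closer look.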

### 2. Convert Aligned Sequences to VCF:

In [44]:
!faToVcf ./data/cov/alignment/aligned_Coronavirus-Orf9-NCBI.fasta ./data/cov/phylogenetic_tree/phylogenetic_tree_aligned_Coronavirus-Orf9-NCBI.vcf

### 3. Generate a Reference Tree in Newick Format:

In [45]:
!fasttree -nt ./data/cov/alignment/aligned_Coronavirus-Orf9-NCBI.fasta > ./data/cov/phylogenetic_tree/phylogenetic_tree_reference_aligned_Coronavirus-Orf9-NCBI.nwk

FastTree Version 2.1.11 Double precision (No SSE3)
Alignment: ./data/cov/alignment/aligned_Coronavirus-Orf9-NCBI.fasta
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jukes-Cantor, CAT approximation with 20 rate categories
Initial topology in 0.00 seconds
Refining topology: 8 rounds ME-NNIs, 2 rounds ME-SPRs, 4 rounds ML-NNIs
Total branch-length 0.004 after 0.00 sec
ML-NNI round 1: LogLk = -1090.899 NNIs 0 max delta 0.00 Time 0.00
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 0.624 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -1089.957 NNIs 0 max delta 0.00 Time 0.00
Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 3: LogLk = -1089.957 NNIs 0 

### 4. Run USHER to Construct the Phylogenetic Tree:

In [46]:
!usher -t ./data/cov/phylogenetic_tree/phylogenetic_tree_reference_aligned_Coronavirus-Orf9-NCBI.nwk -v ./data/cov/phylogenetic_tree/phylogenetic_tree_aligned_Coronavirus-Orf9-NCBI.vcf -o ./data/cov/phylogenetic_tree/phylogenetic_tree_output_aligned_Coronavirus-Orf9-NCBI.nwk

Initializing 2 worker threads.

Loading input tree.
Completed in 0 msec 

Loading VCF file.
Completed in 0 msec 

Computing parsimonious assignments for input variants.
At variant site 89
At variant site 592
At variant site 677
Completed in 1 msec 

Output newick files will have branch lengths equal to the number of mutations of that branch.

Found 0 missing samples.

Writing final tree to file /home/sagemaker-user/nosi-phylogenetic-tree/final-tree.nh 
The parsimony score for this tree is: 4 
Completed in 0 msec 

Saving mutation-annotated tree object to file (after condensing identical sequences) ./data/cov/phylogenetic_tree/phylogenetic_tree_output_aligned_Coronavirus-Orf9-NCBI.nwk
Completed in 0 msec 
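The log above reports a parsimony score for the tree. A parsimony score is the smallest number of mutations needed to explain the observed sequences on a given tree. The sketch below computes it for a toy example with Fitch's algorithm, a classic textbook method. It is meant only to explain the number in the log and is not USHER's implementation. The tree shape, sample names, and sequences are hypothetical.

```python
def fitch(tree, states):
    """Return (possible_states, mutation_count) for one alignment column.

    tree   : a leaf name (str) or a 2-tuple of subtrees (rooted, binary)
    states : dict mapping each leaf name to its base at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one base: no extra mutation needed.
        return shared, left_cost + right_cost
    # Children disagree: one mutation is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch counts over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


toy_tree = (("s1", "s2"), ("s3", "s4"))
toy_alignment = {"s1": "ACGT", "s2": "ACGA", "s3": "GCGT", "s4": "GCTT"}
print("parsimony score:", parsimony_score(toy_tree, toy_alignment))
```

A lower score means the tree explains the data with fewer mutations, which is why parsimony-based tools prefer placements that keep this number small.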



### 5. Output:

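- USHER writes the final tree in Newick format. In the run above, the log reports it as `final-tree.nh`.
- The file passed to `-o` stores the mutation-annotated tree object, as the log also shows, so despite its `.nwk` name in this example it is not the Newick tree itself.
- You can visualize the Newick tree using compatible tools.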
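Before opening a tree in a viewer, it can help to see what a Newick string contains. The following sketch is a small pure-Python parser that reads a Newick string into nested dictionaries, lists the leaf names, and prints an indented outline. It assumes unquoted labels and well-formed input. The function names and the example tree are hypothetical. For real analyses a dedicated library is a better choice.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length, children."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume '('
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                    continue
                pos += 1  # consume ')'
                break
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def leaf_names(node):
    """Collect leaf labels from left to right."""
    if not node["children"]:
        return [node["name"]]
    return [n for child in node["children"] for n in leaf_names(child)]


def print_tree(node, depth=0):
    """Print an indented outline of the tree."""
    print("  " * depth + (node["name"] or "(internal)"))
    for child in node["children"]:
        print_tree(child, depth + 1)


tree = parse_newick("((A:1,B:2):0.5,C:3);")
print(leaf_names(tree))
print_tree(tree)
```

Parentheses group samples that share a common ancestor, and the number after each colon is the branch length leading to that node.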

<div style="padding: 10px; border: 1px solid #ffccbc; border-radius: 5px; background-color: #ffebee;">
    <strong>Alert:</strong>⚠️ Before running USHER, change the Jupyter kernel to a dedicated USHER kernel to avoid dependency conflicts.
</div>

## 3.3 Manage Computational Intensity Through Cloud Computing

Cloud computing refers to the use of remote servers hosted on the internet to store, manage, and process data instead of relying on local hardware. It allows researchers to access virtually unlimited computational resources without the need to invest in expensive infrastructure.

### 3.3.1 Why Use Cloud Computing for Phylogenetics?
1. **Scalability:**
   - Cloud platforms let you scale resources up or down to match the size of the dataset.
   - For example, small datasets may require minimal resources, while large datasets (thousands of sequences) can leverage more virtual CPUs or GPUs as needed.

2. **Cost-Effectiveness:**
   - Pay-as-you-go models allow researchers to optimize costs by paying only for the resources they use.

3. **Accessibility:**
   - Cloud services enable researchers to access computational resources and data from anywhere, supporting collaboration across teams and geographies.
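The pay-as-you-go idea comes down to simple arithmetic: cost is roughly the time a resource is used multiplied by its rate. The sketch below compares two ways of running the same job. All rates, run times, and instance labels are made-up numbers for illustration only. Check your provider's current pricing before estimating real costs.

```python
def estimate_cost(hours, hourly_rate, storage_gb=0.0,
                  storage_rate_per_gb_month=0.0, months=1.0):
    """Estimate cost as compute time plus storage (all rates hypothetical)."""
    compute = hours * hourly_rate
    storage = storage_gb * storage_rate_per_gb_month * months
    return round(compute + storage, 2)


# Hypothetical scenario: a larger instance costs more per hour
# but finishes the same alignment job sooner.
small_instance = estimate_cost(hours=10, hourly_rate=0.10)
large_instance = estimate_cost(hours=2, hourly_rate=0.40)
print("small instance:", small_instance)
print("large instance:", large_instance)
```

In this made-up scenario the larger instance is cheaper overall because it is billed for fewer hours, which shows why instance choice matters for cost as well as speed.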

### 3.3.2 How Cloud Computing Works in Phylogenetics
1. **Input Data:**
   - Upload your sequence data (e.g., FASTA files) to the cloud storage provided by the platform.

2. **Set Up the Environment:**
   - Use pre-configured machine images (e.g., Ubuntu with MAFFT installed) or create your own environment.
   - Choose the instance type based on the task (e.g., high-memory machines for large alignments).

3. **Run the Workflow:**
   - Use tools like MAFFT for sequence alignment and USHER for tree construction.
   - Jobs can run on virtual CPUs or GPUs for faster performance.

4. **Retrieve Results:**
   - Download the aligned sequences, VCF files, or constructed trees for further analysis.
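On a cloud machine, the "run the workflow" step is often scripted so that it can run unattended. The sketch below chains the same four commands used earlier in this notebook (`mafft`, `faToVcf`, `fasttree`, `usher`). The function names and the output file names are hypothetical, and the tools must already be installed for `run_pipeline` to work. `build_pipeline` only prepares the commands, so it can be inspected without running anything.

```python
import subprocess
from pathlib import Path


def build_pipeline(fasta, workdir):
    """Return (argv, stdout_path) steps mirroring the notebook commands.

    Output file names below are hypothetical placeholders.
    """
    work = Path(workdir)
    aligned = work / "aligned.fasta"
    vcf = work / "variants.vcf"
    ref_tree = work / "reference.nwk"
    usher_out = work / "usher_output"
    return [
        # mafft and fasttree write to stdout, so their output is redirected to a file
        (["mafft", "--auto", str(fasta)], aligned),
        (["faToVcf", str(aligned), str(vcf)], None),
        (["fasttree", "-nt", str(aligned)], ref_tree),
        (["usher", "-t", str(ref_tree), "-v", str(vcf), "-o", str(usher_out)], None),
    ]


def run_pipeline(steps):
    """Run each step in order and stop at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)


steps = build_pipeline("input.fasta", "results")
for argv, stdout_path in steps:
    print(" ".join(argv), f"> {stdout_path}" if stdout_path else "")
```

Calling `run_pipeline(steps)` would execute the commands. Keeping the command list separate from execution makes it easy to review or log exactly what will run before paying for compute time.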


<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Note:</strong> 📝 Always configure your cloud environment to balance performance and cost, and ensure data security protocols are followed when working with sensitive datasets.
</div>

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Tools like MAFFT and USHER integrate well with cloud computing platforms for high-performance analysis.
</div>

## Installations

Ensure all necessary tools and libraries are installed before proceeding.

### Install MAFFT and FastTree

In [None]:
!conda install -c bioconda mafft fasttree -y

### Install USHER and Dependencies
#### Install USHER:

In [None]:
!conda install -c bioconda usher -y

#### Install Additional Dependencies

In [None]:
!conda install -c defaults -c bioconda -c conda-forge perl gzip -y

### Install ClustalW

In [None]:
!conda install -c bioconda clustalw -y

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Verify installations by checking tool versions:
</div>

!mafft --version
!usher --version
!clustalw2 --version
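As a complementary check that does not depend on each tool's own version flag, the sketch below uses Python's standard `shutil.which` to see whether each executable is on the `PATH`. The list of names is an assumption based on the commands used in this notebook. ClustalW may be installed as `clustalw` or `clustalw2` depending on the environment, so both are checked.

```python
import shutil


def check_tools(names):
    """Map each executable name to its full path, or None if not found."""
    return {name: shutil.which(name) for name in names}


status = check_tools(["mafft", "faToVcf", "fasttree", "usher", "clustalw", "clustalw2"])
for name, path in status.items():
    print(f"{name:10s} {'found at ' + path if path else 'NOT FOUND on PATH'}")
```

Any tool reported as not found should be installed, or its environment activated, before running the corresponding cells.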

## Summary
This notebook shows how to construct phylogenetic trees by aligning sequences with MAFFT or ClustalW, converting the alignment to VCF, and building the tree with USHER. It explains why sequence alignment matters, compares the two alignment tools, and describes how cloud computing supports scalable analysis.
In the next submodule, we will analyze the phylogenetic tree.

## Interactive Quiz

Test your understanding of phylogenetic tree construction with this interactive quiz:

In [None]:
from jupyterquiz import display_quiz
display_quiz('Quiz/QS3.json')