# Submodule 3: Construct Phylogenetic Tree

## Learning Objectives
In Submodule 3, we will construct a phylogenetic tree from a set of gene sequences. This involves the following steps:
- Perform sequence alignment using MAFFT.
- Reconstruct the phylogenetic tree using USHER, a tool optimized for large-scale phylogenetic analysis.

By the end of this submodule, learners will be able to:
1. Align gene sequences for phylogenetic analysis.
2. Use USHER to rapidly construct and analyze phylogenetic trees.
3. Explore ClustalW as an alternative sequence alignment method.

----------------------------------------------------------------------------------------------------------------
# Training Plan 


Submodule #1: Understanding the Basics of Phylogenetics

Submodule #2: Collect and Prepare Sequence Data and Analysis

<font color="green"> **Submodule #3: Construct Phylogenetic Tree** </font>
 
Submodule #4: Analyze Phylogenetic Tree

----------------------------------------------------------------------------------------------------------------

## 3.1 Perform Accurate Sequence Alignment with MAFFT

Sequence alignment is a critical step in phylogenetic analysis. It arranges sequences to highlight similarities and differences, providing the foundation for accurate tree construction.

### Why Use MAFFT for Sequence Alignment?
- MAFFT is a fast and reliable tool for multiple sequence alignment.
- It is highly scalable, making it suitable for large sequence datasets such as collections of SARS-CoV-2 genomes.

### Steps to Perform Sequence Alignment with MAFFT
1. Align the sequences in your dataset using the following command:

In [None]:
!mafft --auto data/cov/sequences_subset.fasta > data/cov/aligned_sequences_mafft_subset.fasta

## 3.2 Manage Computational Intensity Through Cloud Computing

Due to the large size of metagenomic datasets, sequence alignment and phylogenetic tree construction can be computationally intensive. Leveraging cloud computing resources can significantly enhance the efficiency and speed of these tasks.

### Benefits of Cloud Computing for Sequence Alignment
1. **Scalability:**
   - Cloud platforms, such as AWS, Google Cloud, and Azure, allow researchers to scale up resources dynamically based on the computational demands of the task.

2. **Cost-Effectiveness:**
   - Pay-as-you-go models allow researchers to optimize costs by paying only for the resources they use.

3. **Accessibility:**
   - Cloud services enable researchers to access computational resources and data from anywhere, supporting collaboration across teams and geographies.

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Note:</strong> 📝 Always configure your cloud environment to balance performance and cost, and ensure data security protocols are followed when working with sensitive datasets.
</div>

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Tools like MAFFT and USHER integrate well with cloud computing platforms for high-performance analysis.
</div>
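On a multi-core cloud instance, MAFFT can use more than one CPU core through its `--thread` option. The following is a minimal sketch, assuming the same input path used earlier in this submodule; the helper name `build_mafft_command` is illustrative and not part of MAFFT.

```python
import os


def build_mafft_command(input_fasta, threads=None):
    """Return the MAFFT argument list for an alignment run.

    If threads is None, use every CPU core reported by the operating system.
    """
    if threads is None:
        threads = os.cpu_count() or 1  # os.cpu_count() can return None
    return ["mafft", "--auto", "--thread", str(threads), input_fasta]


# Build (but do not run) the command for the subset used in this submodule.
cmd = build_mafft_command("data/cov/sequences_subset.fasta")
print(" ".join(cmd))
```

MAFFT writes the alignment to standard output, so when running the command (for example with `subprocess.run`), redirect `stdout` to the output FASTA file.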

## 3.3 Phylogenetic Tree Reconstruction Using USHER

USHER (**Ultrafast Sample Placement on Existing tRee**) is a high-performance tool that rapidly places new samples onto an existing phylogenetic tree instead of rebuilding the tree from scratch. It is particularly effective for large datasets, such as those generated during SARS-CoV-2 genomic studies, and for real-time epidemiology.

### Key Features of USHER
- **Speed:** Processes large datasets quickly, even in real-time.
- **Scalability:** Handles complex phylogenies with thousands of sequences.
- **Integration:** Accepts input in common formats (e.g., VCF, Newick).

### Steps to Construct a Phylogenetic Tree with USHER
1. **Prepare Input Files:**
   - Aligned sequences in FASTA format.
   - VCF file containing sequence variants.
   - A reference tree in Newick format.

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Always inspect the aligned sequences to ensure proper alignment before proceeding to phylogenetic tree construction.
</div>
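One simple way to inspect an alignment is to confirm that every aligned sequence has the same length and to look at how many gap characters each one contains. The sketch below uses only the Python standard library and a tiny in-memory example; the function names `read_fasta` and `alignment_summary` are illustrative.

```python
def read_fasta(lines):
    """Parse an iterable of FASTA lines into {sequence_id: sequence}."""
    chunks, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Fall back to a positional name if the header is empty.
            name = parts[0] if parts else f"seq{len(chunks) + 1}"
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(value) for key, value in chunks.items()}


def alignment_summary(records):
    """Report sequence count, whether lengths agree, and per-sequence gap fraction."""
    lengths = {len(seq) for seq in records.values()}
    gap_fraction = {
        name: (seq.count("-") / len(seq) if seq else 0.0)
        for name, seq in records.items()
    }
    return {
        "n_sequences": len(records),
        "equal_length": len(lengths) == 1,
        "gap_fraction": gap_fraction,
    }


# Tiny illustrative alignment (not real data).
example = """>seq1
ATG-CA
>seq2
ATGGCA
""".splitlines()

summary = alignment_summary(read_fasta(example))
print(summary)
```

To check the real alignment, pass an open file handle instead of the example lines, for instance `read_fasta(open("data/cov/aligned_sequences_mafft_subset.fasta"))`. If `equal_length` is `False`, the file is not a valid alignment and should not be passed on to the tree-building steps.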

### Convert Aligned Sequences to VCF:

In [None]:
!faToVcf data/cov/aligned_sequences_mafft_subset.fasta data/cov/seq_subset.vcf
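After the conversion, it can be useful to check how many samples and variant sites the VCF contains. The following sketch relies only on the standard VCF layout (meta lines starting with `##`, one `#CHROM` header line whose columns after the ninth are sample names, then one line per site). The function name `summarize_vcf` and the example records are illustrative, not real data.

```python
def summarize_vcf(lines):
    """Return (number_of_samples, number_of_variant_sites) for VCF lines."""
    samples, n_sites = [], 0
    for line in lines:
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            # Columns 1-9 are fixed fields; the remaining columns are samples.
            samples = line.rstrip("\n").split("\t")[9:]
        else:
            n_sites += 1
    return len(samples), n_sites


# Small illustrative VCF with two samples and two sites.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
    "ref\t241\t.\tC\tT\t.\t.\t.\tGT\t0\t1",
    "ref\t3037\t.\tC\tT\t.\t.\t.\tGT\t1\t1",
]

n_samples, n_sites = summarize_vcf(example_vcf)
print(f"{n_samples} samples, {n_sites} variant sites")
```

For the file produced above, the same function can be called on `open("data/cov/seq_subset.vcf")`.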

### Generate a Reference Tree in Newick Format:

In [None]:
!fasttree -nt data/cov/aligned_sequences_mafft_subset.fasta > data/cov/reference_sequences_subset.nwk
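Before running USHER, it can help to confirm that the input files from the previous steps exist and are not empty, since a failed earlier command can leave a zero-byte file behind. This is a minimal pre-flight sketch using the paths from this submodule; the helper name `missing_inputs` is illustrative.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the paths that are not regular files or that are empty."""
    problems = []
    for path in map(Path, paths):
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(str(path))
    return problems


required = [
    "data/cov/aligned_sequences_mafft_subset.fasta",
    "data/cov/seq_subset.vcf",
    "data/cov/reference_sequences_subset.nwk",
]

problems = missing_inputs(required)
if problems:
    print("Missing or empty input files:", ", ".join(problems))
else:
    print("All input files are present.")
```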

### Run USHER to Construct the Phylogenetic Tree:

In [None]:
!usher -t data/cov/reference_sequences_subset.nwk -v data/cov/seq_subset.vcf -o data/cov/seq_output_subset.nwk

### Output:

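- The command above asks USHER to write its result to `data/cov/seq_output_subset.nwk`. In USHER's documented usage, `-o` names the mutation-annotated tree (a protobuf file), while the final Newick tree is written as `final-tree.nh` in the output directory (set with `-d`); run `usher --help` to confirm the behavior of your installed version.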
- You can visualize the tree using compatible tools.
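As a quick sanity check before opening a tree in a viewer, you can list the leaf (sample) names stored in a Newick string. The sketch below is a simplified parser written for illustration: it assumes unquoted leaf names without spaces, parentheses, colons, commas, or semicolons, and the function name `newick_leaf_names` is hypothetical.

```python
import re


def newick_leaf_names(newick):
    """Return leaf names from a Newick string with simple, unquoted labels.

    A leaf name is any label that directly follows "(" or ",". Labels that
    follow ")" are internal-node labels or support values and are skipped.
    """
    return re.findall(r"[(,]\s*([^\s():,;]+)", newick)


# Small illustrative tree with one support value (0.95) on an internal node.
example_tree = "((A:0.1,B:0.2)0.95:0.05,C:0.3);"
leaves = newick_leaf_names(example_tree)
print(len(leaves), "leaves:", leaves)
```

For a Newick file on disk, read its contents with `open(path).read()` first. For full parsing and plotting, Biopython's `Bio.Phylo.read(path, "newick")` together with `Bio.Phylo.draw_ascii(tree)` is a more complete option.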

<div style="padding: 10px; border: 1px solid #ffccbc; border-radius: 5px; background-color: #ffebee;">
    <strong>Alert:</strong>⚠️ Before running USHER, change the Jupyter kernel to a dedicated USHER kernel to avoid dependency conflicts.
</div>

## Interactive Quiz

Test your understanding of phylogenetic tree construction with this interactive quiz:

In [None]:
from jupyterquiz import display_quiz
display_quiz('Quiz/QS3.json')

## 3.4 Alternative Sequence Alignment Using ClustalW

ClustalW is another commonly used tool for multiple sequence alignment. It arranges sequences to emphasize similarities and differences, providing an alternative to MAFFT.

### Steps to Perform Sequence Alignment with ClustalW
1. **Define Paths for Files and Tools:**
 
   - fasta_file = "data/cov/sequences_subset.fasta"
   - clustalw_exe = "/path/to/clustalw2"
   - seq_algn_file = "data/cov/sequences_subset.aln"

### Process with ClustalW

In [None]:
import datetime
import subprocess

# Define the paths (adjust clustalw_exe to the location of clustalw2 in your environment)
fasta_file = "data/cov/sequences_subset.fasta"
clustalw_exe = "/home/ec2-user/anaconda3/envs/python3/bin/clustalw2"
seq_algn_file = "data/cov/sequences_subset.aln"

start_time = datetime.datetime.now()
print(f"Process started at: {start_time}")

# Run ClustalW for multiple sequence alignment using subprocess.
# -OUTPUT=FASTA writes the alignment in FASTA format, even though the file ends in .aln.
try:
    subprocess.run(
        [clustalw_exe, "-INFILE=" + fasta_file, "-OUTFILE=" + seq_algn_file, "-OUTPUT=FASTA"],
        check=True,
    )
except subprocess.CalledProcessError as e:
    print("Error running ClustalW:", e)
    raise  # stop this cell without shutting down the Jupyter kernel

end_time = datetime.datetime.now()
print(f"Process ended at: {end_time}")

# Calculate the duration
duration = end_time - start_time
print(f"Total time taken: {duration}")

## Installations

Ensure all necessary tools and libraries are installed before proceeding.

### Install MAFFT and FastTree

In [None]:
!conda install -c bioconda mafft fasttree -y

### Install USHER and Dependencies
#### Install USHER:

In [None]:
!conda install -c bioconda usher -y

#### Install Additional Dependencies

In [None]:
!conda install -c defaults -c bioconda -c conda-forge perl gzip -y

### Install ClustalW

In [None]:
!conda install -c bioconda clustalw -y

<div style="padding: 10px; border: 1px solid #b3e5fc; border-radius: 5px; background-color: #e1f5fe;">
    <strong>Tip:</strong>💡 Verify installations by checking tool versions:
</div>

In [None]:
!mafft --version
!usher --version
!clustalw2 --version
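As a complement to the version checks, the sketch below reports where each command-line tool used in this submodule is found on the `PATH`, which can help diagnose a wrong kernel or environment. It uses only the standard library; the helper name `find_tools` is illustrative, and the tool names are the commands as they are invoked in this notebook.

```python
import shutil


def find_tools(names):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


tools = find_tools(["mafft", "usher", "faToVcf", "fasttree", "clustalw2"])
for name, path in tools.items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

A tool reported as `NOT FOUND` is either not installed or installed in a different conda environment than the one backing the active Jupyter kernel.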