# Submodule 3: Construct Phylogenetic Tree

# Learning Objectives:
In Submodule 3, we will construct a phylogenetic tree from gene sequences. This involves the following steps:
- Perform sequence alignment
- Perform phylogenetic tree reconstruction

----------------------------------------------------------------------------------------------------------------
# Training Plan 


Submodule #1: Understanding the Basics of Phylogenetics

Submodule #2: Collect and Prepare Sequence Data and Analysis

<font color="green"> **Submodule #3: Construct Phylogenetic Tree** </font>
 
Submodule #4: Analyze Phylogenetic Tree

----------------------------------------------------------------------------------------------------------------

## 3.1 Perform Accurate Sequence Alignment of Metagenomic Data using augur align (Nextstrain)
Sequence alignment is essential for phylogenetic analysis, as it arranges sequences to emphasize their similarities and differences, setting the foundation for accurate tree construction.

**Using augur align for Sequence Alignment**

In this notebook, we’ll use `augur align` from Nextstrain to align SARS-CoV-2 sequences in preparation for phylogenetic tree construction. Follow these steps to install the necessary packages and perform sequence alignment with `augur align`.

#### Step-by-Step Guide:
1. Install Necessary Packages: Ensure that the required packages are installed, including `nextstrain-cli` and `nextstrain-augur`, as well as `mafft` and `fasttree` from Bioconda.
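
As a minimal sketch of what the alignment step might look like once the packages are installed, the cell below checks that the tools are on the `PATH` and assembles an `augur align` command from Python. The helper name `build_augur_align_cmd`, the reference file path, and the output path are assumptions for illustration and are not part of the original notebook; adjust them to your data.

```python
import shutil
import subprocess

def build_augur_align_cmd(sequences, reference, output, nthreads=1):
    """Return the argument list for an `augur align` call (hypothetical helper)."""
    return [
        "augur", "align",
        "--sequences", sequences,            # unaligned input FASTA
        "--reference-sequence", reference,   # reference to align against
        "--output", output,                  # aligned FASTA to write
        "--nthreads", str(nthreads),
        "--fill-gaps",                       # replace gap characters with N
    ]

# Report any tools that are not installed in the current environment.
missing_tools = [t for t in ("augur", "mafft", "fasttree") if shutil.which(t) is None]
print("Missing tools:", missing_tools or "none")

cmd = build_augur_align_cmd(
    "data/cov/sequences_subset.fasta",    # input path used later in this notebook
    "data/cov/reference.fasta",           # assumed reference file
    "data/cov/aligned_sequences.fasta",   # assumed output path
)
print(" ".join(cmd))

# Set this to True to actually run the alignment once the input files exist.
RUN_ALIGNMENT = False
if RUN_ALIGNMENT:
    subprocess.run(cmd, check=True)
```

Keeping the command as a list of arguments avoids shell quoting problems when paths contain spaces.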

## 3.2 Manage Computational Intensity through Cloud Computing
Due to the large size of metagenomic datasets, sequence alignment can be computationally intensive. Utilizing cloud computing resources can significantly enhance the efficiency and speed of these tasks.

**Benefits of Cloud Computing for Sequence Alignment:**
- Scalability: Easily scale up resources based on the demand of the computation.
- Cost-Effectiveness: Pay-as-you-go models allow for cost savings by only using resources when needed.
- Accessibility: Access computational resources and data from anywhere, facilitating collaboration among researchers.
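
One practical consequence of scalability is that a larger cloud instance only helps if the alignment tool is told to use the extra cores. The sketch below, under the assumption that one core should stay free for the notebook itself, derives a thread count from the machine and passes it to `mafft` through its `--thread` option. The helper name `pick_thread_count` is hypothetical.

```python
import os

def pick_thread_count(reserve=1):
    """Use all available cores except `reserve`, but never fewer than one."""
    total = os.cpu_count() or 1
    return max(1, total - reserve)

threads = pick_thread_count()

# Same input file as the mafft step in this notebook, with threading added.
mafft_cmd = [
    "mafft", "--auto",
    "--thread", str(threads),
    "data/cov/sequences_subset.fasta",
]
print(" ".join(mafft_cmd))
```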

## 3.3 Phylogenetic Tree Reconstruction using USHER
USHER (Ultrafast Sample placement on Existing tRees) is a tool designed to rapidly place new samples onto an existing phylogenetic tree. It is well suited to large-scale phylogenetic analysis and real-time genomic epidemiology.

**Important Note:**

Before running USHER, change the Jupyter kernel to a dedicated USHER kernel. The dependencies required for USHER may conflict with other installed packages, so a separate kernel helps avoid installation issues.

### Steps to Use USHER for Phylogenetic Tree Reconstruction:

In [None]:
!conda install -y -c defaults -c bioconda -c conda-forge perl

In [None]:
!conda install -y -c defaults -c bioconda -c conda-forge gzip

4. Aligning Sequences:
- Use mafft to align your sequences and write the alignment to `data/cov/aligned_sequences_mafft_subset.fasta`:

In [None]:
!mafft --auto data/cov/sequences_subset.fasta > data/cov/aligned_sequences_mafft_subset.fasta
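
Before moving on, it can be worth confirming that the alignment is well formed: in a valid multiple sequence alignment, every record has the same length once gaps are included. The following is a minimal sketch of such a check with a small pure-Python FASTA parser. The function names are hypothetical, and the file check only runs if the output of the previous cell exists.

```python
import os

def parse_fasta(lines):
    """Parse FASTA text (an iterable of lines) into {name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # keep the ID, drop the description
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def alignment_is_rectangular(records):
    """True if all aligned sequences share one length (gaps included)."""
    return len({len(seq) for seq in records.values()}) <= 1

# Tiny in-memory example.
demo = parse_fasta([">s1", "ACG-T", ">s2", "ACGGT"])
print(alignment_is_rectangular(demo))

# Check the real alignment if the mafft step has been run.
aligned_path = "data/cov/aligned_sequences_mafft_subset.fasta"
if os.path.exists(aligned_path):
    with open(aligned_path) as handle:
        aligned = parse_fasta(handle)
    print(len(aligned), "sequences; rectangular:", alignment_is_rectangular(aligned))
```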

5. Generating VCF File:
- Convert the aligned sequences to a VCF file:

In [None]:
!faToVcf data/cov/aligned_sequences_mafft_subset.fasta data/cov/seq_subset.vcf
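
To make the conversion less of a black box, the sketch below illustrates the underlying idea: a VCF records the positions where a sample differs from a reference. This is a simplified illustration, not the implementation of `faToVcf`. It assumes the reference and sample are rows of the same alignment, reports alignment-column positions, and skips gaps and `N` characters. The function name `variant_sites` is hypothetical.

```python
def variant_sites(reference, sample):
    """Return (position, ref_base, alt_base) for each substitution.

    Positions are 1-based alignment columns. Columns where either sequence
    has a gap or an N are skipped, since they carry no substitution signal.
    """
    if len(reference) != len(sample):
        raise ValueError("Sequences must come from the same alignment")
    ignore = {"-", "N"}
    sites = []
    for pos, (ref, alt) in enumerate(zip(reference.upper(), sample.upper()), start=1):
        if ref != alt and ref not in ignore and alt not in ignore:
            sites.append((pos, ref, alt))
    return sites

print(variant_sites("ACGTACGT", "ACGAAC-T"))  # [(4, 'T', 'A')]
```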

6. Creating Newick Tree File:
- Use fasttree to generate a Newick tree file:

In [None]:
!fasttree -nt data/cov/aligned_sequences_mafft_subset.fasta > data/cov/reference_sequences_subset.nwk
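
A quick sanity check on the resulting tree file can catch an empty or truncated output before it is passed to the next step. The sketch below verifies that a Newick string has balanced parentheses and a terminating semicolon, then counts tips as the number of commas plus one. It assumes labels contain no quoted commas or parentheses, and the function name `newick_summary` is hypothetical.

```python
import os

def newick_summary(newick):
    """Validate basic Newick structure and count tips and internal nodes."""
    text = newick.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("Unbalanced parentheses")
    if depth != 0:
        raise ValueError("Unbalanced parentheses")
    # Each comma separates two sibling subtrees, so tips = commas + 1.
    return {"tips": text.count(",") + 1, "internal_nodes": text.count("(")}

print(newick_summary("((A:0.1,B:0.2)0.95:0.05,C:0.3);"))

# Summarize the real tree if the fasttree step has been run.
tree_path = "data/cov/reference_sequences_subset.nwk"
if os.path.exists(tree_path):
    with open(tree_path) as handle:
        print(newick_summary(handle.read()))
```

The tip count should match the number of sequences in the alignment.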

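7. Running USHER:
- With the VCF file and Newick tree file in hand, run USHER. Here `-t` is the input tree, `-v` is the VCF of samples, and `-o` is the output file. Note that in the UShER documentation `-o` names the saved mutation-annotated tree, so the output may not be plain Newick despite its `.nwk` extension: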

In [None]:
!usher -t data/cov/reference_sequences_subset.nwk -v data/cov/seq_subset.vcf -o data/cov/seq_output_subset.nwk

## Alternate Sequence Alignment of Metagenomic Data using ClustalW
Sequence alignment is a critical step in phylogenetic analysis, as it arranges the sequences in a manner that highlights their similarities and differences, allowing for accurate tree construction.

**Using ClustalW for Sequence Alignment**

### Process with ClustalW

In [None]:
import subprocess
import datetime 
import matplotlib.pyplot as plt
import networkx as nx
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator

# Define the paths
fasta_file = "sequences_subset.fasta"
clustalw_exe = "/home/ec2-user/anaconda3/envs/python3/bin/clustalw2"
seq_algn_file = "sequences_subset.aln"

start_time = datetime.datetime.now()
print(f"Process started at: {start_time}")

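# Run ClustalW for multiple sequence alignment using subprocess.
# -OUTPUT=FASTA writes the alignment in FASTA format to seq_algn_file.
try:
    subprocess.run([clustalw_exe, "-INFILE=" + fasta_file, "-OUTFILE=" + seq_algn_file, "-OUTPUT=FASTA"], check=True)
except (subprocess.CalledProcessError, FileNotFoundError) as e:
    # Re-raise instead of calling exit(), which would stop the Jupyter kernel
    print("Error running ClustalW:", e)
    raise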

end_time = datetime.datetime.now()
print(f"Process ended at: {end_time}")

# Calculate the duration
duration = end_time - start_time
print(f"Total time taken: {duration}")
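
The cell above imports distance-based tree construction classes but stops at the alignment. As a hedged illustration of what a distance-based method starts from, the sketch below computes pairwise p-distances (the fraction of differing sites) between aligned sequences in pure Python. It skips columns where either sequence has a gap, which may differ from how Biopython's `DistanceCalculator` treats gaps. The function names and the example sequences are made up for illustration.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Fraction of compared columns that differ, ignoring gapped columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("No ungapped columns to compare")
    return mismatches / compared

def distance_table(records):
    """Map each unordered pair of names to its p-distance."""
    return {
        (n1, n2): p_distance(records[n1], records[n2])
        for n1, n2 in combinations(sorted(records), 2)
    }

# Made-up aligned sequences for demonstration only.
example = {"s1": "ACGT-A", "s2": "ACGTTA", "s3": "TCGTTA"}
for pair, dist in distance_table(example).items():
    print(pair, round(dist, 3))
```

A neighbor-joining or UPGMA constructor would then build the tree from a matrix like this one.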

In [None]:
from jupyterquiz import display_quiz
display_quiz('Quiz/QS3.json')