# Submodule 3: Construct Phylogenetic Tree

# Learning Objectives:
In Submodule 3, we will construct a phylogenetic tree from gene sequences. This involves the following steps:
- Perform sequence alignment
- Perform phylogenetic tree reconstruction

----------------------------------------------------------------------------------------------------------------
# Training Plan 


Submodule #1: Understanding the Basics of Phylogenetics

Submodule #2: Collect and Prepare Sequence Data and Analysis

<font color="green"> **Submodule #3: Construct Phylogenetic Tree** </font>
 
Submodule #4: Analyze Phylogenetic Tree

----------------------------------------------------------------------------------------------------------------

## 3.1 Perform Accurate Sequence Alignment of Metagenomic Data using augur align (Nextstrain)
Sequence alignment is essential for phylogenetic analysis, as it arranges sequences to emphasize their similarities and differences, setting the foundation for accurate tree construction.

**Using augur align for Sequence Alignment**

In this notebook, we’ll use `augur align` from Nextstrain to align SARS-CoV-2 sequences in preparation for phylogenetic tree construction. Follow these steps to install the necessary packages and perform sequence alignment with `augur align`.

#### Step-by-Step Guide:
1. **Install Necessary Packages:** Install the required Python libraries (`matplotlib`, `networkx`, `biopython`), the Nextstrain tools (`nextstrain-cli`, `nextstrain-augur`), and the Bioconda tools `mafft` and `fasttree`.

In [None]:
!pip install matplotlib

In [None]:
!pip install networkx

In [None]:
!pip install biopython

In [None]:
!pip install nextstrain-cli

In [None]:
!pip install nextstrain-augur

In [None]:
!conda install -c bioconda mafft fasttree -y

In [None]:
%pwd

In [None]:
%cd data/cov/

In [None]:
%pwd

2. **Run Sequence Alignment with augur align:** Align the SARS-CoV-2 sequences to prepare them for phylogenetic tree construction.

In [None]:
!augur align --sequences sequences_subset.fasta --output aligned_subset_augur.fasta --fill-gaps

In [None]:
%cd ../..

In [None]:
%pwd

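The aligned sequences are written to `aligned_subset_augur.fasta` in `data/cov/`. The `--fill-gaps` option replaces gap characters with `N` after aligning, which treats gaps as missing data rather than true insertions or deletions.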
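
Before building a tree, it can help to confirm that the output really is an alignment, meaning every sequence has the same length. The following is a minimal sketch using only the Python standard library; the helper names `read_fasta` and `alignment_lengths` are illustrative and not part of `augur`, and the demo data is made up. Biopython's `AlignIO` could be used for the same check.

```python
from collections import Counter


def read_fasta(lines):
    """Parse FASTA-formatted lines into a {header: sequence} dict."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def alignment_lengths(records):
    """Count sequences per length; a valid alignment has exactly one length."""
    return Counter(len(seq) for seq in records.values())


# Tiny in-memory example (hypothetical data). For the real file, pass an open
# file handle instead, e.g. open("data/cov/aligned_subset_augur.fasta").
demo = [">s1", "ACGTNA", ">s2", "ACGTTA", ">s3", "ACNTTA"]
lengths = alignment_lengths(read_fasta(demo))
if len(lengths) == 1:
    print("All sequences have the same length:", next(iter(lengths)))
else:
    print("Sequences differ in length:", dict(lengths))
```

If more than one length is reported, the file is not a usable alignment and the alignment step should be rerun.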

## 3.2 Manage Computational Intensity through Cloud Computing
Due to the large size of metagenomic datasets, sequence alignment can be computationally intensive. Utilizing cloud computing resources can significantly enhance the efficiency and speed of these tasks.

**Benefits of Cloud Computing for Sequence Alignment:**
- Scalability: Easily scale up resources based on the demand of the computation.
- Cost-Effectiveness: Pay-as-you-go models allow for cost savings by only using resources when needed.
- Accessibility: Access computational resources and data from anywhere, facilitating collaboration among researchers.

## 3.3 Phylogenetic Tree Reconstruction using USHER
USHER (Ultrafast Sample Placement on Existing tRee) is a tool that rapidly places new samples onto an existing phylogenetic tree. Because it avoids rebuilding the tree from scratch, it is well suited to large-scale phylogenetic analysis and real-time epidemiology.

**Important Note:**

Before running USHER, change the Jupyter kernel to a dedicated USHER kernel. The dependencies required for USHER may conflict with other installed packages, so a separate kernel helps avoid installation issues.

### Steps to Use USHER for Phylogenetic Tree Reconstruction:
1. Clone USHER Repository:


In [None]:
# !git clone https://github.com/yatisht/usher.git

2. Installing Dependencies:
- Update the conda environment with the necessary dependencies:

In [None]:
# !conda env update -f usher/workflows/envs/usher.yaml

In [None]:
!conda install -c defaults -c bioconda -c conda-forge usher -y

In [None]:
!conda install -c defaults -c bioconda -c conda-forge wget -y

In [None]:
!conda install -c defaults -c bioconda -c conda-forge perl -y

In [None]:
!conda install -c defaults -c bioconda -c conda-forge gzip -y

In [None]:
!conda install usher -y

In [None]:
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge

3. Installing Additional Packages:
- Install the required packages `mafft` and `fasttree`:

In [None]:
!conda install -c bioconda mafft fasttree -y

4. Aligning Sequences:
- Use `mafft` to align your sequences and write the result to `aligned_sequences_mafft_subset.fasta`:

In [None]:
!mafft --auto data/cov/sequences_subset.fasta > data/cov/aligned_sequences_mafft_subset.fasta

5. Generating VCF File:
- Convert the aligned sequences to a VCF file with `faToVcf`:

In [None]:
!faToVcf data/cov/aligned_sequences_mafft_subset.fasta data/cov/seq_subset.vcf
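
To make the conversion less of a black box, here is a minimal sketch of the idea behind it: compare each aligned sequence to a reference and record the positions where the bases differ. It assumes the first sequence is the reference (which is how `faToVcf` behaves by default, as far as we are aware; check its usage message). The function `snv_sites` and the demo sequences are hypothetical, and the sketch does not reproduce the real VCF format or every rule `faToVcf` applies.

```python
def snv_sites(aligned):
    """List single-nucleotide differences versus the first (reference) sequence.

    `aligned` is a list of (name, sequence) pairs of equal length.
    Returns [(ref_position, ref_base, {sample_name: alt_base})], where
    ref_position is 1-based and counts only non-gap reference bases.
    Gaps ('-') and ambiguous 'N' bases are skipped as missing data.
    """
    (_, ref_seq), samples = aligned[0], aligned[1:]
    sites = []
    ref_pos = 0
    for column, ref_base in enumerate(ref_seq):
        ref_base = ref_base.upper()
        if ref_base == "-":
            continue  # column is an insertion relative to the reference
        ref_pos += 1
        if ref_base == "N":
            continue  # reference base unknown, nothing to compare against
        alts = {}
        for name, seq in samples:
            base = seq[column].upper()
            if base != ref_base and base not in "N-":
                alts[name] = base
        if alts:
            sites.append((ref_pos, ref_base, alts))
    return sites


# Hypothetical three-sequence alignment; the first entry acts as the reference.
demo = [("ref", "ACGTA"), ("s1", "ACCTA"), ("s2", "A-GTT")]
for pos, ref, alts in snv_sites(demo):
    print(pos, ref, alts)
```

Each reported site corresponds roughly to one row of a VCF file, with the per-sample bases playing the role of genotype columns.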

6. Creating Newick Tree File:
- Use `fasttree` to generate a Newick tree file:

In [None]:
!fasttree -nt data/cov/aligned_sequences_mafft_subset.fasta > data/cov/reference_sequences_subset.nwk
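
A quick sanity check on the resulting tree is to list its leaf names and compare them with the sequence names in the alignment. The sketch below is a simplified, standard-library approach that treats any label directly following `(` or `,` as a leaf; the function name `newick_leaf_names` and the example tree are illustrative. It assumes labels do not themselves contain parentheses, commas, colons, or semicolons; for anything more complex, a real parser such as Biopython's `Bio.Phylo` is the safer choice.

```python
import re


def newick_leaf_names(newick):
    """Return leaf labels from a Newick string.

    Leaves are the labels that directly follow '(' or ','. Labels that
    follow ')' are internal node labels (for example support values)
    and are ignored.
    """
    return [name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


# Hypothetical tree with branch lengths and one internal support value.
demo_tree = "((A:0.1,B:0.2)0.95:0.3,C:0.4);"
print(newick_leaf_names(demo_tree))

# For the real file, the same function could be applied to
# open("data/cov/reference_sequences_subset.nwk").read()
```

If the number of leaves does not match the number of sequences in the alignment, something went wrong in the tree-building step.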

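7. Running USHER:
- With the VCF file and the Newick tree file in place, run USHER. The `-t` option gives the starting tree, `-v` the VCF of samples to place, and `-o` the output file. Note that `-o` saves a mutation-annotated tree in USHER's protobuf format, not plain Newick, regardless of the `.nwk` extension used here: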

In [None]:
!usher -t data/cov/reference_sequences_subset.nwk -v data/cov/seq_subset.vcf -o data/cov/seq_output_subset.nwk

## Alternative Sequence Alignment of Metagenomic Data using ClustalW
Sequence alignment is a critical step in phylogenetic analysis: it arranges the sequences so that their similarities and differences line up column by column, which is what allows accurate tree construction.

**Using ClustalW for Sequence Alignment:**
1. Download ClustalW
    - Obtain the ClustalW tool from its official website: http://www.clustal.org/clustal2/#Download
    - Click the ![image.png](attachment:17eaadba-d4ac-409f-be74-9ad593702af2.png) link on the web page.
    - This takes you to another page, where you download the Windows installer ![image.png](attachment:e69eac74-340a-4a4c-b1c6-d884188a09e4.png)
2. Install ClustalW
    - Double-click the downloaded file and follow the installation instructions specific to your operating system. On Windows, for example, ClustalW is typically installed at:
        - C:\Program Files (x86)\ClustalW2\clustalw2.exe
3. Run ClustalW using Python and Biopython:

## Install and locate ClustalW for sequence alignment

In [None]:
!conda config --add channels conda-forge
!conda config --add channels bioconda
!conda install -y clustalw

In [None]:
!which clustalw2
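
The same lookup can be done from Python, which avoids hardcoding an environment-specific path. This is a minimal sketch using `shutil.which` from the standard library; the helper `find_executable` is illustrative, and the candidate names `clustalw2` and `clustalw` are assumptions about what the installed binary may be called on your system.

```python
import shutil


def find_executable(candidates):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


# Candidate binary names are assumptions; adjust them to match your install.
clustalw_path = find_executable(["clustalw2", "clustalw"])
print(clustalw_path or "ClustalW not found on PATH")
```

The returned path can then be used wherever the ClustalW executable location is needed.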

### Process with ClustalW

In [None]:
import subprocess
import datetime 
import matplotlib.pyplot as plt
import networkx as nx
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator

# Define the paths
fasta_file = "sequences_subset.fasta"
clustalw_exe = "/home/ec2-user/anaconda3/envs/python3/bin/clustalw2"
seq_algn_file = "sequences_subset.aln"

start_time = datetime.datetime.now()
print(f"Process started at: {start_time}")

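# Run ClustalW for multiple sequence alignment using subprocess.
# check=True raises CalledProcessError if ClustalW exits with a non-zero status.
try:
    subprocess.run([clustalw_exe, "-INFILE=" + fasta_file, "-OUTFILE=" + seq_algn_file, "-OUTPUT=FASTA"], check=True)
except subprocess.CalledProcessError as e:
    print("Error running ClustalW:", e)
    # Re-raise so the cell stops with a visible error; exit() is meant for scripts, not notebook cells
    raise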

end_time = datetime.datetime.now()
print(f"Process ended at: {end_time}")

# Calculate the duration
duration = end_time - start_time
print(f"Total time taken: {duration}")