# Submodule 3: Construct Phylogenetic Tree

## 3.1 Perform Accurate Sequence Alignment of Metagenomic Data using ClustalW
Sequence alignment is a critical step in phylogenetic analysis, as it arranges the sequences in a manner that highlights their similarities and differences, allowing for accurate tree construction.
Using ClustalW for Sequence Alignment:
1. Download Clustal
    - Obtain the ClustalW tool from its official website: http://www.clustal.org/clustal2/#Download
    - Click the ![image.png](attachment:17eaadba-d4ac-409f-be74-9ad593702af2.png) link on the web page.
    - This opens another page, where you select the Windows download: ![image.png](attachment:e69eac74-340a-4a4c-b1c6-d884188a09e4.png)
    - Once the download finishes, double-click the file to start the installer.
2. Install Clustal
    - Follow the installation instructions specific to your operating system. For example, on Windows, the executable is typically installed at:
        - `C:\Program Files (x86)\ClustalW2\clustalw2.exe`
3. Run ClustalW using Python and Biopython:

In [1]:
import subprocess

import matplotlib.pyplot as plt
import networkx as nx
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator

# Define the paths
fasta_file = "data/cov/sequences.fasta"
clustalw_exe = "C:\\Program Files (x86)\\ClustalW2\\clustalw2.exe"
seq_algn_file = "data/cov/sequences.aln"
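# Run ClustalW for multiple sequence alignment using subprocess.
# -OUTPUT=FASTA writes the alignment in FASTA format, even though the file is named .aln.
try:
    subprocess.run([clustalw_exe, "-INFILE=" + fasta_file, "-OUTFILE=" + seq_algn_file, "-OUTPUT=FASTA"], check=True)
except subprocess.CalledProcessError as e:
    print("Error running ClustalW:", e)
    # Re-raise so execution stops here; later steps need the alignment file
    raise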

This process aligns the SARS-CoV-2 sequences, preparing them for phylogenetic tree construction.
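
Before building a tree, it can be worth confirming that the output really is an alignment, meaning every row has the same length. The sketch below is a minimal, standard-library-only check that also computes a simple p-distance (the fraction of differing sites) for each pair. The function names and the toy sequences are illustrative assumptions, not part of ClustalW or Biopython; in practice the text would be read from `data/cov/sequences.aln`.

```python
from itertools import combinations


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {sequence_id: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def check_alignment(records):
    """Return the alignment length, raising if the rows differ in length."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"Sequences are not aligned; lengths found: {sorted(lengths)}")
    return lengths.pop()


def p_distance(seq_a, seq_b):
    """Fraction of differing sites, skipping columns with a gap in either row."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


# Toy alignment (hypothetical data) standing in for the real output file
example = """>seqA
ACGT-ACGT
>seqB
ACGTTACGA
>seqC
ACCT-ACGT
"""
records = read_fasta(example)
alignment_length = check_alignment(records)
distances = {
    (a, b): p_distance(records[a], records[b]) for a, b in combinations(records, 2)
}
print(alignment_length)
for pair, value in distances.items():
    print(pair, value)
```

Skipping gapped columns is one common convention for p-distance; other treatments of gaps give slightly different values.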

## 3.2 Manage Computational Intensity through Cloud Computing
Due to the large size of metagenomic datasets, sequence alignment can be computationally intensive. Utilizing cloud computing resources can significantly enhance the efficiency and speed of these tasks.
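
To give a rough sense of why alignment cost grows quickly, note that progressive aligners such as ClustalW begin by comparing every pair of sequences, so the number of pairwise alignments is n(n-1)/2. The short sketch below only illustrates that arithmetic; the function name is hypothetical, and actual runtime also depends on sequence length and the hardware used.

```python
def pairwise_alignment_count(n_sequences):
    """Number of unordered sequence pairs, n * (n - 1) / 2."""
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    return n_sequences * (n_sequences - 1) // 2


for n in (100, 1_000, 10_000):
    print(f"{n:>6} sequences -> {pairwise_alignment_count(n):>12,} pairwise alignments")
```

A tenfold increase in the number of sequences means roughly a hundredfold increase in pairwise comparisons, which is where scalable compute becomes attractive.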

**Benefits of Cloud Computing for Sequence Alignment:**
- Scalability: Easily scale up resources based on the demand of the computation.
- Cost-Effectiveness: Pay-as-you-go models allow for cost savings by only using resources when needed.
- Accessibility: Access computational resources and data from anywhere, facilitating collaboration among researchers.

## 3.3 Phylogenetic Tree Reconstruction using UShER
UShER (Ultrafast Sample placement on Existing tRees) is a tool designed to rapidly place new samples onto an existing phylogenetic tree. It is well suited to large-scale phylogenetic analysis and real-time epidemiology.

### Steps to Use USHER for Phylogenetic Tree Reconstruction:
1. Clone USHER Repository:


In [None]:
!git clone https://github.com/yatisht/usher.git

2. Install Dependencies:
- Update the conda environment with the necessary dependencies:

In [None]:
!conda env update -f usher/workflows/envs/usher.yaml

3. Install Additional Packages:
- Install the required packages `mafft` and `fasttree`:

In [None]:
!conda install -c bioconda mafft fasttree

4. Align Sequences:
- Use `mafft` to align your sequences and write the result to `aligned_sequences.fasta`:

In [None]:
!mafft --auto sequences.fasta > aligned_sequences.fasta

5. Generate VCF File:
- Convert the aligned sequences to a VCF file with `faToVcf`. By default, the first sequence in the alignment is treated as the reference:

In [None]:
!faToVcf aligned_sequences.fasta seq.vcf
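
As a conceptual illustration of what an alignment-to-VCF conversion records, the sketch below compares one aligned sample against an aligned reference and lists the sites where they differ. This is a simplified stand-in written for this tutorial, not `faToVcf`'s actual implementation; the function name and the decision to treat gaps and `N` as missing data are assumptions.

```python
def variant_sites(reference, sample):
    """List (position, ref_base, alt_base) where the sample differs from the reference.

    Positions are 1-based and count reference bases only, so alignment columns
    where the reference has a gap are skipped. Gaps and N in the sample are
    treated as missing data rather than variants (an assumption of this sketch).
    """
    if len(reference) != len(sample):
        raise ValueError("reference and sample must come from the same alignment")
    variants = []
    ref_pos = 0
    for ref_base, alt_base in zip(reference.upper(), sample.upper()):
        if ref_base == "-":
            continue  # insertion relative to the reference: no reference coordinate
        ref_pos += 1
        if alt_base in ("-", "N"):
            continue  # missing data in the sample
        if alt_base != ref_base:
            variants.append((ref_pos, ref_base, alt_base))
    return variants


# Hypothetical toy sequences
print(variant_sites("ACGTACGT", "ACCTAC-A"))
print(variant_sites("AC-GT", "ACTGA"))
```

Each tuple corresponds loosely to one VCF row for that sample: a position on the reference, the reference base, and the alternate base.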

6. Create Newick Tree File:
- Use `fasttree` to generate a Newick tree file:

In [None]:
!fasttree -nt aligned_sequences.fasta > reference.nwk
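
Since the tree and the VCF are used together in the next step, one quick sanity check is whether the leaf labels in the Newick file match the sequence IDs you expect. The sketch below pulls leaf names out of a Newick string with a regular expression. It is a minimal helper that assumes unquoted labels and a multi-leaf tree; the function names and the toy tree are hypothetical.

```python
import re

# A leaf label directly follows "(" or ","; labels that follow ")" are
# internal node names or support values and are not matched.
_LEAF_PATTERN = re.compile(r"(?<=[(,])\s*([^(),:;\s]+)")


def leaf_names(newick):
    """Return leaf labels from a Newick string (unquoted labels only)."""
    return _LEAF_PATTERN.findall(newick)


def samples_missing_from_tree(sample_ids, newick):
    """Return the sample IDs that do not appear as leaves in the tree."""
    leaves = set(leaf_names(newick))
    return [sample for sample in sample_ids if sample not in leaves]


# Hypothetical toy tree with a support value on the internal node
toy_tree = "((seqA:0.01,seqB:0.02)0.95:0.03,seqC:0.04);"
print(leaf_names(toy_tree))
print(samples_missing_from_tree(["seqA", "seqD"], toy_tree))
```

For real files, a full Newick parser such as the one in Biopython's `Bio.Phylo` module is more robust than a regular expression.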

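7. Run UShER:
- With the VCF file and the Newick tree file from the previous steps, run UShER. Note that the `-o` option saves UShER's mutation-annotated tree, which is written in protobuf format rather than plain Newick, despite the `.nwk` extension used here: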

In [None]:
!usher -t reference.nwk -v seq.vcf -o seq_output.nwk
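
The shell steps above can also be driven from Python, which makes it easier to stop at the first failure. The sketch below collects steps 4 to 7 as command lists and defaults to a dry run that only reports what would be executed. The helper names are hypothetical, the file names are the ones used in this section, and the tools are assumed to be on the `PATH` of the active conda environment.

```python
import shlex
import shutil
import subprocess


def build_pipeline(fasta="sequences.fasta"):
    """Return (command, stdout_path) pairs mirroring steps 4 to 7."""
    aligned = "aligned_sequences.fasta"
    return [
        (["mafft", "--auto", fasta], aligned),
        (["faToVcf", aligned, "seq.vcf"], None),
        (["fasttree", "-nt", aligned], "reference.nwk"),
        (["usher", "-t", "reference.nwk", "-v", "seq.vcf", "-o", "seq_output.nwk"], None),
    ]


def run_pipeline(steps, dry_run=True):
    """Run each step in order; with dry_run=True, only describe the commands."""
    planned = []
    for command, stdout_path in steps:
        shown = shlex.join(command)
        if stdout_path:
            shown += f" > {stdout_path}"
        planned.append(shown)
        if dry_run:
            continue
        if shutil.which(command[0]) is None:
            raise FileNotFoundError(f"{command[0]} was not found on PATH")
        if stdout_path:
            # Tools that print to stdout get their output redirected to a file
            with open(stdout_path, "w") as out:
                subprocess.run(command, stdout=out, check=True)
        else:
            subprocess.run(command, check=True)
    return planned


for line in run_pipeline(build_pipeline(), dry_run=True):
    print(line)
```

Calling `run_pipeline(build_pipeline(), dry_run=False)` would execute the tools for real; `check=True` makes a failing step raise `subprocess.CalledProcessError` instead of continuing silently.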