# Submodule 3: Construct Phylogenetic Tree

# Learning Objectives:
In Submodule 3, we construct a phylogenetic tree from gene sequences in the following steps:
- Perform sequence alignment
- Perform phylogenetic tree reconstruction

----------------------------------------------------------------------------------------------------------------
# Training Plan 


Submodule #1: Understanding the Basics of Phylogenetics

Submodule #2: Collect and Prepare Sequence Data and Analysis

<font color="green"> **Submodule #3: Construct Phylogenetic Tree** </font>
 
Submodule #4: Analyze Phylogenetic Tree

----------------------------------------------------------------------------------------------------------------

## 3.1 Perform Accurate Sequence Alignment of Metagenomic Data using ClustalW
Sequence alignment is a critical step in phylogenetic analysis, as it arranges the sequences in a manner that highlights their similarities and differences, allowing for accurate tree construction.
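
To make the idea concrete, the sketch below aligns two short sequences with the Needleman-Wunsch global alignment algorithm, inserting `-` gap characters so that matching bases line up in the same column. It is an illustration only: the function name `needleman_wunsch` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for this example, and ClustalW itself builds multiple alignments progressively with more elaborate scoring.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two bases
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("ACGT", "ACT")
print(aligned_a)  # ACGT
print(aligned_b)  # AC-T
print(total)      # 2
```

The gap in the second sequence marks the base that is missing relative to the first; a multiple sequence alignment extends the same idea to many sequences at once.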
Using ClustalW for Sequence Alignment:
1. Download ClustalW
    - Obtain the ClustalW tool from its official website: http://www.clustal.org/clustal2/#Download
    - Click the ![image.png](attachment:17eaadba-d4ac-409f-be74-9ad593702af2.png) link on the web page.
    - This takes you to another page, where you download the Windows version ![image.png](attachment:e69eac74-340a-4a4c-b1c6-d884188a09e4.png)
    - Once the download finishes, double-click the downloaded file and complete the installation process.
2. Install ClustalW
    - Follow the installation instructions specific to your operating system. For example, on Windows, it is typically installed at:
        - C:\Program Files (x86)\ClustalW2\clustalw2.exe
3. Run ClustalW using Python and Biopython:

In [2]:
!pip install matplotlib



In [3]:
!pip install networkx



In [4]:
%cd data/cov/

/mnt/d/Documents/Masters_Notes/bio_med/code/nosi-phylogenetic-tree/data/cov


In [5]:
%pwd

'/mnt/d/Documents/Masters_Notes/bio_med/code/nosi-phylogenetic-tree/data/cov'

### Process for Windows OS

In [1]:
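import subprocess

import matplotlib.pyplot as plt
import networkx as nx
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator

# Define the paths (relative paths are resolved against the current working directory)
fasta_file = "sequences.fasta"
clustalw_exe = r"C:\Program Files (x86)\ClustalW2\clustalw2.exe"
seq_algn_file = "data/cov/sequences.aln"

# Run ClustalW for multiple sequence alignment using subprocess.
# -OUTPUT=FASTA writes the alignment in FASTA format regardless of the file extension.
try:
    subprocess.run(
        [clustalw_exe, "-INFILE=" + fasta_file, "-OUTFILE=" + seq_algn_file, "-OUTPUT=FASTA"],
        check=True,
    )
except (subprocess.CalledProcessError, FileNotFoundError) as e:
    # FileNotFoundError is raised when no executable exists at clustalw_exe
    print("Error running ClustalW:", e)
    raise  # stop the cell without calling exit(), which would shut down the notebook kernel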

This process aligns the SARS-CoV-2 sequences, preparing them for phylogenetic tree construction.
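
The Windows path used in the cell above will not exist on other systems. As a sketch, the helper below looks for a ClustalW executable on the `PATH` first and then falls back to a list of candidate locations. The function name `find_clustalw`, the command names `clustalw2` and `clustalw`, and the fallback list are assumptions for illustration; adjust them to match your installation.

```python
import shutil
from pathlib import Path

# Assumed candidate locations; extend this tuple to match your own installation.
DEFAULT_CANDIDATES = (
    r"C:\Program Files (x86)\ClustalW2\clustalw2.exe",
)


def find_clustalw(names=("clustalw2", "clustalw"), candidates=DEFAULT_CANDIDATES):
    """Return a path to a ClustalW executable, or None if none is found."""
    # First try the commands that may be available on the PATH.
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    # Then try explicit file locations.
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


clustalw_exe = find_clustalw()
if clustalw_exe is None:
    print("ClustalW not found; install it or pass its location explicitly.")
else:
    print("Using ClustalW at:", clustalw_exe)
```

Checking for the executable before calling `subprocess.run` gives a clearer message than the error raised when the file is missing.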

## 3.2 Manage Computational Intensity through Cloud Computing
Because metagenomic datasets are large, sequence alignment can be computationally intensive. Running these tasks on cloud computing resources can shorten their run time considerably.

**Benefits of Cloud Computing for Sequence Alignment:**
- Scalability: Easily scale up resources based on the demand of the computation.
- Cost-Effectiveness: Pay-as-you-go models allow for cost savings by only using resources when needed.
- Accessibility: Access computational resources and data from anywhere, facilitating collaboration among researchers.

## 3.3 Phylogenetic Tree Reconstruction using UShER
UShER (Ultrafast Sample placement on Existing tRees) is a tool designed to rapidly place new samples on an existing phylogenetic tree. It is useful for large-scale phylogenetic analysis and real-time epidemiology.
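
UShER places samples using maximum parsimony, that is, it prefers the placement that requires the fewest mutations. The sketch below shows the underlying idea for a single alignment column with Fitch's algorithm, which counts the minimum number of state changes needed on a fixed tree. The nested-tuple tree encoding, the sample names, and the function name `fitch_score` are assumptions made for this toy example; UShER's own implementation works on a mutation-annotated tree and is far more optimized.

```python
def fitch_score(tree, leaf_states):
    """Return (possible_states, min_changes) for one site on a binary tree.

    tree is either a leaf name (str) or a (left, right) tuple of subtrees.
    leaf_states maps each leaf name to its observed base at this site.
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch_score(left, leaf_states)
    right_states, right_changes = fitch_score(right, leaf_states)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one base: no extra mutation is needed here.
        return shared, left_changes + right_changes
    # The children disagree: one mutation is required on a branch below this node.
    return left_states | right_states, left_changes + right_changes + 1


# Toy tree ((s1,s2),(s3,s4)) with the base observed in each sample at one site.
tree = (("s1", "s2"), ("s3", "s4"))
site = {"s1": "A", "s2": "A", "s3": "G", "s4": "A"}
states, changes = fitch_score(tree, site)
print(states, changes)  # {'A'} 1
```

Summing this count over all variant sites gives the parsimony score of a tree; a parsimony-based placement is one that increases that score the least.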

### Steps to Use UShER for Phylogenetic Tree Reconstruction:
1. Clone the UShER Repository:


In [None]:
!git clone https://github.com/yatisht/usher.git

2. Installing Dependencies:
- Update the conda environment with the necessary dependencies:

In [None]:
!conda env update -f usher/workflows/envs/usher.yaml

3. Installing Additional Packages:
- Install the required packages mafft and fasttree:

In [None]:
!conda install -y -c bioconda mafft fasttree

4. Aligning Sequences:
- Use mafft to align your sequences and output them to aligned_sequences.fasta:

In [11]:
!mafft --auto sequences.fasta > aligned_sequences.fasta

nthread = 0
nthreadpair = 0
nthreadtb = 0
ppenalty_ex = 0
stacksize: 8192 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00

Making a distance matrix ..

There are 115892 ambiguous characters.
  101 / 165
done.

Constructing a UPGMA tree (efffree=0) ... 
  160 / 165
done.

Progressive alignment 1/2... 
STEP    40 / 164 
Reallocating..done. *alloclen = 60816
STEP   111 / 164 
len1=30013, len2=29903, Switching to the memsave mode
STEP   164 / 164
[progress counters omitted]
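
Before converting the alignment, it can be worth checking that every aligned sequence has the same length and seeing how many ambiguous bases it contains, since MAFFT reported ambiguous characters above. The following is a minimal sketch using only the standard library; the helper names `read_fasta` and `summarize_alignment` are hypothetical, and the set of symbols treated as unambiguous (`A`, `C`, `G`, `T`, `U`, and the gap `-`) is an assumption, so the count need not match MAFFT's own figure.

```python
import os


def read_fasta(path):
    """Parse a FASTA file into {sequence_id: sequence}; later duplicate IDs overwrite earlier ones."""
    records = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                fields = line[1:].split()
                current = fields[0] if fields else ""
                records[current] = []
            elif current is not None:
                records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


UNAMBIGUOUS = set("ACGTU-")  # assumed definition of unambiguous symbols


def summarize_alignment(records):
    """Report sequence count, whether all lengths match, and the ambiguous-base total."""
    lengths = {len(seq) for seq in records.values()}
    ambiguous = sum(
        1 for seq in records.values() for base in seq if base.upper() not in UNAMBIGUOUS
    )
    return {
        "n_sequences": len(records),
        "is_aligned": len(lengths) == 1,
        "ambiguous": ambiguous,
    }


if os.path.exists("aligned_sequences.fasta"):
    print(summarize_alignment(read_fasta("aligned_sequences.fasta")))
```

If `is_aligned` is `False`, the file is not a valid alignment and the downstream steps should not be run on it.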

5. Generating VCF File:
- Convert the aligned sequences to a VCF file:

In [12]:
!faToVcf aligned_sequences.fasta seq.vcf
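
Conceptually, this step records the alignment columns where a sequence differs from a reference. The sketch below illustrates that idea on a toy alignment by treating the first sequence as the reference, which is an assumption made for this example, as are the function name `variant_sites` and the decision to skip gaps and `N`. It does not reproduce the VCF format or the exact rules that `faToVcf` applies.

```python
def variant_sites(alignment, skip=("-", "N")):
    """Return {1-based alignment column: {sample: base}} for columns differing from the reference.

    alignment is a list of (name, aligned_sequence) pairs of equal length;
    the first pair is treated as the reference.
    """
    ref_seq = alignment[0][1].upper()
    samples = alignment[1:]
    variants = {}
    for pos, ref_base in enumerate(ref_seq, start=1):
        if ref_base in skip:
            continue
        for name, seq in samples:
            base = seq[pos - 1].upper()
            if base != ref_base and base not in skip:
                variants.setdefault(pos, {})[name] = base
    return variants


toy = [("ref", "ACGTAC"), ("s1", "ACGTTC"), ("s2", "AAGTNC")]
print(variant_sites(toy))  # {2: {'s2': 'A'}, 5: {'s1': 'T'}}
```

Positions here are alignment columns; they match reference coordinates only when the reference row contains no gaps.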

6. Creating Newick Tree File:
- Use fasttree to generate a Newick tree file:

In [13]:
!fasttree -nt aligned_sequences.fasta > reference_sequences.nwk

FastTree Version 2.1.11 Double precision (No SSE3)
Alignment: aligned_sequences.fasta
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jukes-Cantor, CAT approximation with 20 rate categories
Ignored unknown character k (seen 10 times)
Ignored unknown character m (seen 4 times)
Ignored unknown character n (seen 115807 times)
Ignored unknown character r (seen 25 times)
Ignored unknown character s (seen 2 times)
Ignored unknown character w (seen 3 times)
Ignored unknown character y (seen 41 times)
Initial topology in 1.64 seconds
Refining topology: 29 rounds ME-NNIs, 2 rounds ME-SPRs, 15 rounds ML-NNIs
Total branch-length 0.067 after 10.73 sec

may not be appropriate for aligments of very closely-related sequences
like this one, as FastTree does not accou
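
A quick sanity check on the resulting tree is to confirm that it contains the expected number of tips. The sketch below pulls leaf names out of a Newick string with a regular expression; it assumes simple, unquoted labels, and the function name `newick_leaves` is hypothetical. For anything beyond a quick check, a full parser such as Biopython's `Bio.Phylo.read(path, "newick")` is the safer option.

```python
import os
import re


def newick_leaves(newick):
    """Return leaf names from a Newick string with simple, unquoted labels."""
    # A leaf label is a name that directly follows "(" or ","; internal node
    # labels and support values follow ")" and are therefore not matched.
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


print(newick_leaves("((A:0.1,B:0.2)0.9:0.05,C:0.3);"))  # ['A', 'B', 'C']

if os.path.exists("reference_sequences.nwk"):
    with open("reference_sequences.nwk") as handle:
        print(len(newick_leaves(handle.read())), "tips in the tree")
```

The tip count should equal the number of sequences in the alignment that FastTree read.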

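7. Running UShER:
- With the VCF file and the Newick tree file generated from the aligned sequences, run UShER. According to the UShER documentation, the `-o` option saves the mutation-annotated tree in a binary protobuf format, so `seq_output.nwk` is not a Newick text file despite its extension: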

In [14]:
!usher -t reference_sequences.nwk -v seq.vcf -o seq_output.nwk

Initializing 8 worker threads.

Loading input tree.
Completed in 3 msec 

Loading VCF file.
Completed in 1 msec 

Computing parsimonious assignments for input variants.
At variant site 8
At variant site 9
At variant site 10
At variant site 11
At variant site 12
At variant site 14
At variant site 15
At variant site 16
At variant site 17
At variant site 18
At variant site 20
At variant site 19
At variant site 21
At variant site 22
At variant site 23
At variant site 25
At variant site 24
At variant site 27
At variant site 26
At variant site 28
At variant site 29
At variant site 30
At variant site 31
At variant site 34
At variant site 35
At variant site 36
At variant site 37
At variant site 44
At variant site 66
At variant site 72
At variant site 105
At variant site 127
At variant site 140
At variant site 143
At variant site 174
At variant site 188
At variant site 193
At variant site 201
At variant site 203
At variant site 204
At variant site 210
At variant site 217
At variant site 218
At 