# Submodule 3: Construct Phylogenetic Tree

# Learning Objectives:
In Submodule 3, we will construct a phylogenetic tree from a gene sequence. This involves the following steps:
- Perform sequence alignment
- Perform phylogenetic tree reconstruction

----------------------------------------------------------------------------------------------------------------
# Training Plan 


Submodule #1: Understanding the Basics of Phylogenetics

Submodule #2: Collect and Prepare Sequence Data and Analysis

<font color="green"> **Submodule #3: Construct Phylogenetic Tree** </font>
 
Submodule #4: Analyze Phylogenetic Tree

----------------------------------------------------------------------------------------------------------------

## 3.1 Perform Accurate Sequence Alignment of Metagenomic Data using augur align (Nextstrain)
Sequence alignment is essential for phylogenetic analysis, as it arranges sequences to emphasize their similarities and differences, setting the foundation for accurate tree construction.
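To make the idea concrete, here is a minimal sketch of how an alignment exposes differences between sequences. The three toy sequences and the helper name `variable_columns` are illustrative assumptions, not part of this module's data or tooling.

```python
# Toy alignment: three sequences of equal length; '-' marks a gap.
alignment = {
    "seq1": "ATGCC-TA",
    "seq2": "ATGCCGTA",
    "seq3": "ATGTC-TA",
}

def variable_columns(aligned):
    """Return the 0-based column indices where the sequences are not all identical."""
    rows = list(aligned.values())
    # zip(*rows) walks the alignment column by column.
    return [i for i, column in enumerate(zip(*rows)) if len(set(column)) > 1]

print(variable_columns(alignment))  # [3, 5]
```

Columns that vary between sequences carry the signal a tree-building method uses; identical columns carry none.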

**Using augur align for Sequence Alignment**

In this notebook, we’ll use `augur align` from Nextstrain to align SARS-CoV-2 sequences in preparation for phylogenetic tree construction. Follow these steps to install the necessary packages and perform sequence alignment with `augur align`.

#### Step-by-Step Guide:
1. **Install Necessary Packages:** Ensure that the required packages are installed: the Python libraries `matplotlib`, `networkx`, and `biopython`; `nextstrain-cli` and `nextstrain-augur` from pip; and `mafft` and `fasttree` from the Bioconda channel.

In [5]:
!pip install matplotlib



In [6]:
!pip install networkx



In [7]:
!pip install biopython



In [8]:
!pip install nextstrain-cli

Collecting nextstrain-cli
  Downloading nextstrain_cli-8.5.4-py3-none-any.whl.metadata (3.8 kB)
Collecting fasteners (from nextstrain-cli)
  Downloading fasteners-0.19-py3-none-any.whl.metadata (4.9 kB)
Collecting wcmatch>=6.0 (from nextstrain-cli)
  Downloading wcmatch-10.0-py3-none-any.whl.metadata (5.0 kB)
Collecting s3fs!=2023.9.1,>=2021.04.0 (from s3fs[boto3]!=2023.9.1,>=2021.04.0->nextstrain-cli)
  Downloading s3fs-2024.10.0-py3-none-any.whl.metadata (1.7 kB)
Collecting aiobotocore<3.0.0,>=2.5.4 (from s3fs!=2023.9.1,>=2021.04.0->s3fs[boto3]!=2023.9.1,>=2021.04.0->nextstrain-cli)
  Downloading aiobotocore-2.15.2-py3-none-any.whl.metadata (23 kB)
Collecting fsspec!=2023.9.1 (from nextstrain-cli)
  Downloading fsspec-2024.10.0-py3-none-any.whl.metadata (11 kB)
Collecting bracex>=2.1.1 (from wcmatch>=6.0->nextstrain-cli)
  Downloading bracex-2.5.post1-py3-none-any.whl.metadata (3.5 kB)
Collecting botocore<1.35.37,>=1.35.16 (from aiobotocore<3.0.0,>=2.5.4->s3fs!=2023.9.1,>=2021.04.0->

In [9]:
!pip install nextstrain-augur

Collecting nextstrain-augur
  Downloading nextstrain_augur-26.1.0-py3-none-any.whl.metadata (6.1 kB)
Collecting bcbio-gff==0.7.*,>=0.7.1 (from nextstrain-augur)
  Downloading bcbio_gff-0.7.1-py3-none-any.whl.metadata (343 bytes)
Collecting cvxopt==1.*,>=1.1.9 (from nextstrain-augur)
  Downloading cvxopt-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Collecting isodate==0.6.* (from nextstrain-augur)
  Downloading isodate-0.6.1-py2.py3-none-any.whl.metadata (9.6 kB)
Collecting jsonschema==3.*,>=3.0.0 (from nextstrain-augur)
  Downloading jsonschema-3.2.0-py2.py3-none-any.whl.metadata (7.8 kB)
Collecting phylo-treetime<0.12,>=0.11.2 (from nextstrain-augur)
  Downloading phylo_treetime-0.11.4-py3-none-any.whl.metadata (13 kB)
Collecting pyfastx<3.0,>=1.0.0 (from nextstrain-augur)
  Downloading pyfastx-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (30 kB)
Collecting python-calamine>=0.2.0 (from nextstrain-augur)
  Downloading p

In [10]:
!conda install -c bioconda mafft fasttree -y

Channels:
 - bioconda
 - conda-forge
 - nvidia
 - pytorch
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 24.7.1
    latest version: 24.9.2

Please update conda by running

    $ conda update -n base -c conda-forge conda



# All requested packages already installed.
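After installation, it can help to confirm that the command-line tools are actually visible from the notebook's environment. The sketch below uses only the standard library; the helper name `find_tools` is an assumption, and the tool names are taken from the install cells above.

```python
import shutil

def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Tool names assumed from the install cells above.
for tool, path in find_tools(["augur", "mafft", "fasttree"]).items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```

If a tool is reported as not found, the install may have gone into a different environment than the one the active kernel uses.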



In [None]:
%pwd

In [None]:
%cd data/cov/

In [None]:
%pwd

2. **Run Sequence Alignment with augur align:** Align the SARS-CoV-2 sequences to prepare them for phylogenetic tree construction.

In [None]:
!augur align --sequences sequences_subset.fasta --output aligned_subset_augur.fasta --fill-gaps

In [None]:
!conda install augur

Channels:
 - bioconda
 - conda-forge
 - nvidia
 - pytorch
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 24.7.1
    latest version: 24.9.2

Please update conda by running

    $ conda update -n base -c conda-forge conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3/envs/python3

  added / updated specs:
    - augur


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    aioeasywebdav-2.4.0        |     pyha770c72_0          12 KB  conda-forge
    amply-0.1.6                |     pyhd8ed1ab_0          21 KB  conda-forge
    appdirs-1.4.4              |     pyh9f0ad1d_0          13 KB  conda-forge
    attmap-0.13.2              |     pyhd8ed1ab_0          13 KB  conda-forge
    augur-11.1.2               |             py_0         137 KB  bioconda
    bcbio-gff-0.7.1            |     pyh7e72e81_2    

In [None]:
%cd ../..

In [None]:
%pwd

This process aligns the SARS-CoV-2 sequences, preparing them for phylogenetic tree construction.
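A quick sanity check on any alignment output is that every sequence has the same length. The following is a minimal sketch with a hand-written FASTA parser (`read_fasta` and `alignment_lengths` are hypothetical helper names); it runs on inline toy data rather than the real output file.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of {record name: sequence}."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def alignment_lengths(records):
    """Return the set of sequence lengths; a valid alignment has exactly one."""
    return {len(seq) for seq in records.values()}

# Inline toy data so the sketch runs without the real file.
toy = """>sampleA desc
ATGNNCTA
>sampleB
ATGCC
CTA
""".splitlines()
records = read_fasta(toy)
print(alignment_lengths(records))  # {8}
```

To check the real output, pass an open file handle for `data/cov/aligned_subset_augur.fasta` to `read_fasta` instead of the toy lines.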

## 3.2 Manage Computational Intensity through Cloud Computing
Due to the large size of metagenomic datasets, sequence alignment can be computationally intensive. Utilizing cloud computing resources can significantly enhance the efficiency and speed of these tasks.

**Benefits of Cloud Computing for Sequence Alignment:**
- Scalability: Easily scale up resources based on the demand of the computation.
- Cost-Effectiveness: Pay-as-you-go models allow for cost savings by only using resources when needed.
- Accessibility: Access computational resources and data from anywhere, facilitating collaboration among researchers.

## 3.3 Phylogenetic Tree Reconstruction using USHER
USHER (Ultrafast Sample placement on Existing tRee, usually written UShER) is a tool that rapidly places new samples onto an existing phylogenetic tree. This makes it well suited to large-scale phylogenetic analysis and real-time epidemiology.
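The idea of placing a sample on an existing tree can be illustrated with a toy example. USHER's placement is based on maximum parsimony; the sketch below is a deliberate simplification of that idea and not USHER's actual algorithm. The tree, the mutation labels, and the function `place_sample` are all hypothetical.

```python
# Hypothetical toy tree: each node is described by the set of mutations
# (relative to a reference) accumulated along the path from the root.
tree_nodes = {
    "root": set(),
    "cladeA": {"C241T"},
    "cladeA1": {"C241T", "A23403G"},
    "cladeB": {"G11083T"},
}

def place_sample(sample_mutations, nodes):
    """Return (node, cost): the node whose mutation set differs least from the sample."""
    def cost(node_mutations):
        # Mutations the sample has that the node lacks, plus the reverse.
        return len(sample_mutations ^ node_mutations)
    # Ties are broken alphabetically by node name to keep the result deterministic.
    best = min(nodes, key=lambda name: (cost(nodes[name]), name))
    return best, cost(nodes[best])

print(place_sample({"C241T", "A23403G", "G28881A"}, tree_nodes))  # ('cladeA1', 1)
```

The new sample attaches where the fewest extra mutations are needed to explain it, which is why placement is much cheaper than rebuilding the whole tree.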

**Important Note:**

Before running USHER, change the Jupyter kernel to a dedicated USHER kernel. The dependencies required for USHER may conflict with other installed packages, so a separate kernel helps avoid installation issues.

### Steps to Use USHER for Phylogenetic Tree Reconstruction:
1. Clone USHER Repository:


In [None]:
# !git clone https://github.com/yatisht/usher.git

2. Installing Dependencies:
- Update the conda environment with the necessary dependencies:

In [None]:
# !conda env update -f usher/workflows/envs/usher.yaml

In [None]:
!conda install -c defaults -c bioconda -c conda-forge usher

In [None]:
!conda install -c defaults -c bioconda -c conda-forge wget

In [None]:
!conda install -c defaults -c bioconda -c conda-forge perl

In [None]:
!conda install -c defaults -c bioconda -c conda-forge gzip

In [3]:
!conda install usher -y

Channels:
 - bioconda
 - conda-forge
 - nvidia
 - pytorch
Platform: linux-64
Collecting package metadata (repodata.json): done
done


    current version: 24.7.1
    latest version: 24.9.2

Please update conda by running

    $ conda update -n base -c conda-forge conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3/envs/python3

  added / updated specs:
    - usher


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    aom-3.5.0                  |       h27087fc_0         2.7 MB  conda-forge
    arrow-cpp-11.0.0           |  ha770c72_12_cpu          30 KB  conda-forge
    aws-c-auth-0.6.26          |       h987a71b_2          93 KB  conda-forge
    aws-c-cal-0.5.21           |       h48707d8_2          43 KB  conda-forge
    aws-c-common-0.8.14        |       h0b41bf4_0         195 KB  conda-forge
    aws-c-compression-0.2.16   |       h03acc5a_5          18 KB  conda

In [None]:
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge

3. Installing Additional Packages:
- Install `fasttree`. `mafft` was already installed in Section 3.1; if the USHER kernel runs in a separate environment, install `mafft` there as well:

In [4]:
!conda install -c bioconda fasttree -y

Channels:
 - bioconda
 - conda-forge
 - nvidia
 - pytorch
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 24.7.1
    latest version: 24.9.2

Please update conda by running

    $ conda update -n base -c conda-forge conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3/envs/python3

  added / updated specs:
    - fasttree


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    fasttree-2.1.11            |       h031d066_4         261 KB  bioconda
    ------------------------------------------------------------
                                           Total:         261 KB

The following NEW packages will be INSTALLED:

  fasttree           bioconda/linux-64::fasttree-2.1.11-h031d066_4 



Downloading and Extracting Packages:
                                                                            

4. Aligning Sequences:
- Use `mafft` to align your sequences and write the result to `data/cov/aligned_sequences_mafft_subset.fasta`:

In [None]:
!mafft --auto data/cov/sequences_subset.fasta > data/cov/aligned_sequences_mafft_subset.fasta

5. Generating VCF File:
- Convert the aligned sequences to a VCF file with `faToVcf`:

In [None]:
!faToVcf data/cov/aligned_sequences_mafft_subset.fasta data/cov/seq_subset.vcf
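Conceptually, a VCF built from an alignment records the positions where each sample differs from a reference sequence. The sketch below illustrates that idea on toy data. It does not reproduce `faToVcf` or write real VCF; the function name `call_variants`, the choice of reference, and the handling of gaps and `N` are assumptions for illustration.

```python
def call_variants(aligned, reference_name):
    """List (position, ref_base, alt_base, sample) where a sample differs from the reference.

    Positions are 1-based alignment columns; gaps and N in either sequence are skipped.
    """
    ref = aligned[reference_name]
    skip = {"-", "N"}
    variants = []
    for sample, seq in aligned.items():
        if sample == reference_name:
            continue
        for pos, (r, a) in enumerate(zip(ref, seq), start=1):
            if r != a and r not in skip and a not in skip:
                variants.append((pos, r, a, sample))
    return sorted(variants)

toy = {"ref": "ATGCCGTA", "s1": "ATGTCGTA", "s2": "ATGCCNTG"}
print(call_variants(toy, "ref"))  # [(4, 'C', 'T', 's1'), (8, 'A', 'G', 's2')]
```

The `N` at position 6 of `s2` is treated as missing data rather than a variant, which is why only two differences are reported.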

6. Creating Newick Tree File:
- Use `fasttree` to generate a Newick tree file; the `-nt` flag tells it that the alignment contains nucleotide sequences:

In [None]:
!fasttree -nt data/cov/aligned_sequences_mafft_subset.fasta > data/cov/reference_sequences_subset.nwk
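Newick is a plain-text format that encodes a tree with nested parentheses, optional branch lengths after `:`, and a terminating `;`. As a rough sketch of how to inspect such a file, the snippet below extracts leaf names with a regular expression. It assumes simple unquoted labels; the toy tree and the helper name `newick_leaf_names` are hypothetical.

```python
import re

def newick_leaf_names(newick):
    """Return leaf labels from a Newick string with simple unquoted labels."""
    # A leaf label directly follows '(' or ','; internal-node labels
    # (such as support values) follow ')' and are therefore not matched.
    return re.findall(r"[(,]\s*([^\s():,;]+)", newick)

toy_tree = "((sampleA:0.01,sampleB:0.02)0.95:0.005,sampleC:0.03);"
print(newick_leaf_names(toy_tree))  # ['sampleA', 'sampleB', 'sampleC']
```

Comparing the number of leaves with the number of sequences in the alignment is a cheap check that the tree file is complete. For anything beyond inspection, a real parser such as the one in Biopython's `Bio.Phylo` is the safer choice.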

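7. Running USHER:
- With the VCF file (derived from the aligned sequences) and the Newick tree file, run USHER. The `-t` option takes the input tree, `-v` the VCF, and `-o` the output file. Note that `-o` saves USHER's mutation-annotated tree, which is a protobuf file rather than plain-text Newick, so the `.nwk` extension on the output is only a name: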

In [None]:
!usher -t data/cov/reference_sequences_subset.nwk -v data/cov/seq_subset.vcf -o data/cov/seq_output_subset.nwk
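Because the earlier steps write their output through shell redirection, a failed step can leave an empty or missing file behind. The sketch below is one way to check the inputs before running USHER; `missing_inputs` is a hypothetical helper, and the file names are copied from the commands above.

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the paths that are missing or empty."""
    return [p for p in paths if not Path(p).is_file() or Path(p).stat().st_size == 0]

# File names taken from the commands above.
inputs = [
    "data/cov/reference_sequences_subset.nwk",
    "data/cov/seq_subset.vcf",
]
print("Missing or empty inputs:", missing_inputs(inputs))
```

An empty list means both inputs exist and contain data; it does not validate their contents.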

## Alternate Sequence Alignment of Metagenomic Data using ClustalW
Sequence alignment is a critical step in phylogenetic analysis: it arranges the sequences so that their similarities and differences line up, which allows accurate tree construction.

**Using ClustalW for Sequence Alignment:**
1. Download ClustalW
    - Obtain the ClustalW tool from its official website: http://www.clustal.org/clustal2/#Download
    - Click on the ![image.png](attachment:17eaadba-d4ac-409f-be74-9ad593702af2.png) link on the webpage.
    - This takes you to another page, where you can download the Windows version: ![image.png](attachment:e69eac74-340a-4a4c-b1c6-d884188a09e4.png)
2. Install ClustalW
    - Once the download finishes, double-click the downloaded file and complete the installation process, following the instructions specific to your operating system. For example, on Windows, it is typically installed at:
        - C:\Program Files (x86)\ClustalW2\clustalw2.exe
3. Run ClustalW using Python and Biopython:

## Install and locate clustalw for sequence alignment

In [1]:
!conda config --add channels conda-forge
!conda config --add channels bioconda
!conda install -y clustalw

Retrieving notices: ...working... done
Channels:
 - bioconda
 - conda-forge
 - nvidia
 - pytorch
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 24.7.1
    latest version: 24.9.2

Please update conda by running

    $ conda update -n base -c conda-forge conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3/envs/python3

  added / updated specs:
    - clustalw


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    clustalw-2.1               |      h4ac6f70_10         339 KB  bioconda
    openssl-3.4.0              |       hb9d3cd8_0         2.8 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.1 MB

The following NEW packages will be INSTALLED:

  clustalw           bioconda/linux-64::clustalw-2.1-h4ac6f70_10 



In [2]:
!which clustalw2

/home/ec2-user/anaconda3/envs/python3/bin/clustalw2


### Process with ClustalW

In [None]:
import subprocess
import datetime
import matplotlib.pyplot as plt
import networkx as nx
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator

# Define the paths (relative to the current working directory)
fasta_file = "sequences_subset.fasta"
# Path reported by `which clustalw2` in the cell above
clustalw_exe = "/home/ec2-user/anaconda3/envs/python3/bin/clustalw2"
# The alignment is written in FASTA format despite the .aln extension
seq_algn_file = "sequences_subset.aln"

start_time = datetime.datetime.now()
print(f"Process started at: {start_time}")

# Run ClustalW for multiple sequence alignment using subprocess
try:
    subprocess.run(
        [clustalw_exe, "-INFILE=" + fasta_file, "-OUTFILE=" + seq_algn_file, "-OUTPUT=FASTA"],
        check=True,
    )
except subprocess.CalledProcessError as e:
    print("Error running ClustalW:", e)
    # Re-raise so the cell fails visibly; exit(1) is meant for scripts, not notebook cells
    raise

end_time = datetime.datetime.now()
print(f"Process ended at: {end_time}")

# Calculate the duration
duration = end_time - start_time
print(f"Total time taken: {duration}")
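The imports in the cell above include Biopython's distance-based tree tools, which work from pairwise distances between aligned sequences. As a rough illustration of what such a distance is, the sketch below computes the fraction of differing columns for each pair of toy sequences. It is not Biopython's implementation; the names `p_distance` and `distance_matrix` and the gap handling are assumptions.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Fraction of compared columns that differ, skipping columns with a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def distance_matrix(aligned):
    """Return {(name1, name2): distance} for every unordered pair of sequences."""
    return {
        (x, y): p_distance(aligned[x], aligned[y])
        for x, y in combinations(aligned, 2)
    }

toy = {"s1": "ATGCCGTA", "s2": "ATGTCGTA", "s3": "ATG-CGTG"}
for pair, dist in distance_matrix(toy).items():
    print(pair, round(dist, 3))
```

Here `s1` and `s2` differ at one of eight columns (0.125), while comparisons with `s3` skip its gapped column and use the remaining seven.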

In [2]:
from jupyterquiz import display_quiz
display_quiz('Quiz/QS3.json')
