# Final Project of Introduction to Bioinformatics

## Comparative Mitochondrial Genomics and Phylogenetic Analysis

### Please Note That the Project Is Located at https://colab.research.google.com/drive/1eENAo7y5OAmH4wtYYULIDm00sx3C4m9y?usp=sharing

This task focuses on the comparative analysis of mitochondrial genomes from different species, primarily birds, mammals, and insects. The aim is to understand the evolutionary relationships between these species by analyzing and comparing their mitochondrial DNA, which is about 16,000 base pairs in length. You will use advanced computational methods to construct phylogenetic trees and delve into the ecological and anthropological insights that can be gleaned from this data. This project is designed to provide a comprehensive understanding of mitochondrial genomics, its importance in evolutionary biology, and its applications in broader scientific contexts.

You will learn:

- Techniques for aligning and comparing mitochondrial DNA sequences.
- How to construct and interpret phylogenetic trees using advanced computational methods.
- The application of mitochondrial genomics in understanding ecological interactions and human evolutionary history.

#### Task Roadmap

1. **Mitochondrial Genome Comparison**:
   - Align mitochondrial DNA sequences from the provided dataset.
   - Analyze these sequences to identify similarities and differences across species.

2. **Phylogenetic Analysis Using Advanced Methods**:
   - Apply Maximum Likelihood (ML) and Bayesian Inference methods, utilizing tools like `ETE Toolkit`, `DendroPy`, `BEAST`, or `PyRate`.
   - Compare the trees generated by these methods to understand how different approaches can lead to different interpretations of the data.

3. **Cross-Disciplinary Applications (Bonus)**:
   - **Ecology**: Examine how mitochondrial DNA analysis can reveal information about species adaptation, migration, and conservation. This involves understanding how genetic variation within and between species can inform ecological strategies and conservation efforts.
   - **Anthropology**: Investigate the use of mitochondrial DNA in tracing human evolution and migration patterns. This includes studying the mitochondrial DNA of mammals in your dataset to draw parallels with human evolutionary studies.

### Data Sources

The mitochondrial DNA data for birds, mammals, and insects will be provided to you. This dataset has been curated to facilitate a comprehensive comparative analysis and is essential for the completion of your phylogenetic studies.

### Useful Resources and Material

- [Mitochondrial DNA - Wikipedia](https://en.wikipedia.org/wiki/Mitochondrial_DNA): A general introduction to the structure, function, origin, and diversity of mitochondrial DNA, as well as its applications in various fields such as medicine, forensics, and anthropology.
- [Mitochondrial DNA Analysis: Introduction, Methods, and Applications](https://bioinfo.cd-genomics.com/mitochondrial-dna-analysis-introduction-methods-and-applications.html): An explanation of the basics of mitochondrial DNA sequencing, bioinformatics analysis, heteroplasmy, and advantages of mitochondrial DNA analysis over nuclear DNA analysis.
- [Phylogenetic Tree- Definition, Types, Steps, Methods, Uses - Microbe Notes](https://microbenotes.com/phylogenetic-tree/): A coverage of the concepts and methods of phylogenetic tree construction, including the types of phylogenetic trees, the steps involved in phylogenetic analysis, the main methods of phylogenetic inference, and the applications of phylogenetic trees in various disciplines.
- [Phylogenetics - Wikipedia](https://en.wikipedia.org/wiki/Phylogenetics): An overview of the field of phylogenetics, which is the study of the evolutionary history and relationships among or within groups of organisms. It also discusses the data sources, models, algorithms, software, and challenges of phylogenetic analysis.
- [ETE Toolkit](http://etetoolkit.org/): A Python library for manipulating, analyzing, and visualizing phylogenetic trees. It supports various formats, methods, and tools for phylogenetic analysis, such as alignment, tree inference, tree comparison, tree annotation, and tree visualization.
- [DendroPy](https://dendropy.org/): Another Python library for phylogenetic computing. It provides a comprehensive API for working with phylogenetic data structures, such as trees, characters, and networks. It also offers a rich set of functions for simulation, manipulation, analysis, and annotation of phylogenetic data.

### Exploration and Reflection

As we proceed with our analysis of mitochondrial DNA for phylogenetic tree construction, it is valuable to contemplate a few questions. These inquiries aim to facilitate a more thorough understanding of the roles and characteristics of mitochondrial DNA in the context of evolutionary biology:

1. **Maternal Inheritance and Its Implications**: How does the maternal inheritance of mitochondrial DNA simplify our understanding of evolutionary lineage compared to nuclear DNA, which undergoes recombination? What unique insights can this aspect provide in tracing the evolutionary history of species?

2. **Mutation Rate and Evolutionary Insights**: Mitochondrial DNA mutates at a faster rate than nuclear DNA. How does this characteristic make mtDNA a more sensitive tool for detecting recent evolutionary events and relationships among closely related species? Can you think of any specific scenarios or studies where this property of mtDNA has been particularly instrumental?

Reflect on these questions as you work through the project, and consider how the properties of mitochondrial DNA enhance its value and applicability in evolutionary biology and beyond. Provide your answers either in this notebook or in your report (if you have one).

<blockquote style="font-family:Arial; color:red; font-size:16px; border-left:0px solid red; padding: 10px;">
    <strong>Don't forget to answer these questions!</strong>
</blockquote>

### Step 0: Installing Necessary Packages

In [None]:
import sys
import subprocess
from importlib import metadata

def install(package):
    """Install a package with pip, using the current Python interpreter."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

REQUIRED_PACKAGES = [
    'biopython',
    'pandas',
    'numpy'
]

for package in REQUIRED_PACKAGES:
    try:
        # importlib.metadata replaces the deprecated pkg_resources API
        version = metadata.version(package)
        print('{} ({}) is installed'.format(package, version))
    except metadata.PackageNotFoundError:
        print('{} is NOT installed'.format(package))
        install(package)
        print('{} was successfully installed.'.format(package))

biopython is NOT installed
biopython was successfully installed.
pandas (1.5.3) is installed
numpy (1.23.5) is installed


In [None]:
# Import necessary libraries.
import pandas as pd
import numpy as np

### Step 1: Dataset Expansion

Our first task is to augment our dataset with additional species. This involves engaging with the NCBI database to retrieve mitochondrial DNA sequences.

#### Instructions:

- **Species Selection**: Identify and choose 10 additional species to include in your dataset. Aim for a diverse selection to enrich your phylogenetic analysis.

- **Querying NCBI Database**: Use the NCBI database to locate mitochondrial DNA sequences for your chosen species. While you can manually search on the [NCBI website](https://www.ncbi.nlm.nih.gov/), consider automating this process through their API for a more efficient approach.
    - **Example Query**: As a starting point, you might use a query like `mitochondrion[Filter] AND (your_species_name[Organism])` to find specific mtDNA sequences. Adjust the query parameters according to your species selection.
    - **Documentation**: Familiarize yourself with the [NCBI API documentation](https://www.ncbi.nlm.nih.gov/books/NBK25497/) for detailed guidance on constructing queries.

- **Using NCBI Website**: You are welcome to use the NCBI website for this task. If you do so, document each step of your process clearly in your task report. This should include the species names, search terms used, and how you determined the relevant sequences to include.

- **Bonus Opportunity**: Implementing an automated, methodical approach that uses the NCBI API and relevant Python packages to add all 10 records to your dataset will earn you a 50% bonus for this section. Your method should be structured and replicable, demonstrating a systematic approach to data collection.

Remember, the goal is to methodically expand your dataset with relevant mtDNA sequences, paving the way for insightful phylogenetic analysis.
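
The automated route described above could start from something like the following sketch. It uses only the Python standard library and the E-utilities `esearch` endpoint. The helper names (`build_mito_query`, `build_esearch_url`, `search_mito_ids`) and the `retmax` default are illustrative assumptions, not part of the assignment, and the returned IDs would still need to be reviewed before a record is added to the dataset.

```python
import json
import urllib.parse
import urllib.request

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_mito_query(species_name):
    """Build the search term suggested above for one species (hypothetical helper)."""
    return f"mitochondrion[Filter] AND ({species_name}[Organism])"

def build_esearch_url(species_name, retmax=5):
    """Return the full esearch URL for the nucleotide database."""
    params = {
        "db": "nucleotide",
        "term": build_mito_query(species_name),
        "retmax": retmax,      # assumed default: only the first few hits
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)

def search_mito_ids(species_name, retmax=5):
    """Query NCBI and return the list of matching record IDs (needs network access)."""
    url = build_esearch_url(species_name, retmax)
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]

# Example (requires network access, so it is left commented out):
# print(search_mito_ids("Bos taurus"))
```

Keeping the query construction separate from the network call makes the search terms easy to log in the report, which helps with the replicability requirement.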

In [None]:
dataset = pd.read_csv('./dataset/species.csv')
dataset.head()

Unnamed: 0,taxo_id,specie,blast_name,genbank_common_name,accession_number,mtDNA
0,8945,Eudynamys scolopaceus,birds,Asian koel,NC_060520,https://www.ncbi.nlm.nih.gov/nucleotide/NC_060...
1,7460,Apis mellifera,Bees,honey bee,NC_051932,https://www.ncbi.nlm.nih.gov/nucleotide/NC_051...
2,36300,Pelecanus crispus,birds,Dalmatian pelican,OR620163,https://www.ncbi.nlm.nih.gov/nuccore/OR620163.1
3,10116,Rattus norvegicus,Rodents,Norway rat,NC_001665,https://www.ncbi.nlm.nih.gov/nuccore/NC_001665
4,9031,Gallus gallus,birds,Gallus gallus,NC_053523,https://www.ncbi.nlm.nih.gov/nuccore/NC_053523.1


In [None]:
import pandas as pd
import requests

# TODO: Add 10 more species to your dataset
# for species in additional_species:
    # Fetch mtDNA and add to the dataset

# TODO: Save the expanded dataset
# dataset.to_csv('./dataset/expanded_dataset.csv', index=False)
expanded_dataset = pd.read_csv('./dataset/expanded_dataset.csv')
expanded_dataset.sample(5)


Unnamed: 0,taxo_id,specie,blast_name,genbank_common_name,accession_number,mtDNA
2,36300,Pelecanus crispus,birds,Dalmatian pelican,OR620163,https://www.ncbi.nlm.nih.gov/nuccore/OR620163.1
21,59729,Taeniopygia guttata,birds,zebra finch,NC_007897,https://www.ncbi.nlm.nih.gov/nucleotide/NC_007...
7,8845,Anser cygnoides,birds,Swan goose,NC_023832,https://www.ncbi.nlm.nih.gov/nucleotide/NC_023...
19,193005,Aquila nipalensis,hawks & eagles,steppe eagle,NC_045042,https://www.ncbi.nlm.nih.gov/nuccore/NC_045042.1
8,9940,Ovis aries,sheep,even-toed ungulates,NC_001941,https://www.ncbi.nlm.nih.gov/nuccore/NC_001941


#### Checking Data Consistency

The code block is designed to check the accuracy of a biological dataset. It examines each entry in a CSV file, focusing on taxonomy IDs, species names, GenBank accession numbers, and mitochondrial DNA links. It uses the NCBI's Entrez system to ensure taxonomy IDs correspond to the correct species and confirms the mitochondrial DNA links are accurate. The script also checks GenBank accession numbers against the provided links. This method is useful for maintaining the accuracy of current data and **might help in adding new entries to the database.**

In [None]:
import pandas as pd
import requests
from Bio import Entrez
import os
from io import BytesIO
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Set your email and API key for NCBI
entrez_email = os.getenv('ENTREZ_EMAIL')
entrez_key = os.getenv('ENTREZ_API_KEY')
Entrez.email = entrez_email

def fetch_entrez_record(db, id, rettype, retmode):
    """Fetch record from NCBI Entrez with retries."""
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {"db": db, "id": id, "rettype": rettype, "retmode": retmode}
    # Send the email and API key only when they are set in the environment
    if entrez_email:
        params["email"] = entrez_email
    if entrez_key:
        params["api_key"] = entrez_key

    session = requests.Session()
    retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
    session.mount('https://', HTTPAdapter(max_retries=retries))

    try:
        response = session.get(url, params=params)
        response.raise_for_status()
        # GenBank flat files are returned as text; taxonomy XML is returned as bytes
        if db == "nucleotide":
            return response.text
        return response.content
    except requests.exceptions.HTTPError as err:
        print(f"HTTP error: {err}")
    except requests.exceptions.ConnectionError as err:
        print(f"Connection error: {err}")
    return None

def verify_taxonomy_id(taxo_id, species_name):
    """Verify taxonomy ID against species name."""
    xml_data = fetch_entrez_record("taxonomy", taxo_id, "xml", "xml")
    if xml_data:
        # Convert bytes data to a binary file-like object
        xml_data_io = BytesIO(xml_data)
        records = Entrez.read(xml_data_io)
        return records[0]['ScientificName'].lower() == species_name.lower()
    return False

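def verify_mitochondrial_dna(accession_number):
    """Check that the GenBank record for this accession mentions a mitochondrion."""
    gb_data = fetch_entrez_record("nucleotide", accession_number, "gb", "text")
    return "mitochondrion" in gb_data.lower() if gb_data else False

def extract_accession_from_link(link):
    """Return the accession from the end of an NCBI link, without its version suffix."""
    return link.split('/')[-1].split('.')[0]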

def check_dataset_consistency(file_path):
    species_df = pd.read_csv(file_path)

    for index, row in species_df.iterrows():
        taxonomy_id = str(row['taxo_id'])
        species_name = row['specie']
        accession_number = row['accession_number']
        mtDNA_link = row['mtDNA']
        extracted_accession = extract_accession_from_link(mtDNA_link)

        taxonomy_check = verify_taxonomy_id(taxonomy_id, species_name)
        accession_match = (accession_number == extracted_accession)
        mitochondrial_check = verify_mitochondrial_dna(accession_number)

        print(f"Row {index}: Taxonomy Check: {taxonomy_check}, Accession Match: {accession_match}, Mitochondrial Check: {mitochondrial_check}")


In [None]:
check_dataset_consistency('./dataset/expanded_dataset.csv')

Row 0: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 1: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 2: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 3: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 4: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 5: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 6: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 7: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 8: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 9: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 10: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 11: Taxonomy Check: True, Accession Match: True, Mitochondrial Check: True
Row 12: Taxonomy Check: True, Accession Match: True, Mitochond

### Step 3: Sequence Download and Preparation

The next step in our project involves downloading the mitochondrial DNA sequences for each species and preparing them for analysis.

#### Instructions:

- **Download mtDNA Sequences**: Write a script to download the mtDNA sequences from the links provided in your dataset. The sequences should be in FASTA format, which is the standard for nucleotide sequences.

- **Sequence Labeling**: Properly label each sequence within the FASTA file. This header, starting with '>', should include the species name and any other relevant information (e.g., `>Eudynamys_scolopaceus_NC_060520`). This is crucial for identifying the sequences in subsequent analysis.

- **Concatenate Sequences**:
    - Create a script to concatenate all downloaded sequences into a single `.fasta` or `.fna` file.
    - Ensure each sequence in the file is clearly separated by its header line, which is important for differentiating the sequences of various species.

#### Tips for Writing the Download and Concatenation Script:
- Use Python libraries such as `httpx` or `requests`, or any other tool you prefer, for downloading sequences. A wide range of tools can process FASTA files; one recommended option is the `Biopython` library.
- Use a loop to go through each link in the dataset, download the sequence, and append it to your concatenated file.
- Maintain the format integrity of the FASTA file, ensuring each sequence is correctly associated with its header.
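
As a minimal sketch of the labeling and concatenation tips above, the helpers below build a header label, wrap a sequence into fixed-width lines, and join several records into one FASTA string. The function names, the character clean-up rule, and the 70-character line width are assumptions chosen for illustration. The download step is deliberately left out, so the sequences are passed in as plain strings.

```python
import re

def make_fasta_label(species, accession):
    """Build a header label such as Eudynamys_scolopaceus_NC_060520."""
    # Replace every run of characters other than letters, digits, dots
    # and hyphens with a single underscore (assumed clean-up rule).
    clean = re.sub(r"[^A-Za-z0-9.\-]+", "_", species.strip()).strip("_")
    return f"{clean}_{accession}"

def format_fasta_record(label, sequence, width=70):
    """Return one FASTA record: a '>' header line plus wrapped sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{label}\n" + "\n".join(lines) + "\n"

def concatenate_records(records, width=70):
    """Join (label, sequence) pairs into one multi-record FASTA string."""
    return "".join(format_fasta_record(label, seq, width) for label, seq in records)

# Toy usage with made-up sequences:
example = concatenate_records(
    [
        (make_fasta_label("Eudynamys scolopaceus", "NC_060520"), "ACGTACGTAC"),
        (make_fasta_label("Apis mellifera", "NC_051932"), "TTGACA"),
    ],
    width=4,
)
print(example)
```

Because every record ends with a newline, each header always starts on its own line, which is what keeps sequences from different species separated in the combined file.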


In [None]:
# TODO: Write a function to download mtDNA sequences
# def download_mtDNA(url, label):
    # Code to download and label the sequence

# TODO: Loop through the dataset and download each mtDNA sequence
# for index, row in dataset.iterrows():
    # Call the download function for each species

In [None]:
from Bio import Entrez, SeqIO
import pandas as pd

def download_mtDNA(accession_number, label):
    try:
        handle = Entrez.efetch(db="nucleotide", id=accession_number, rettype="fasta", retmode="text")
        record = SeqIO.read(handle, "fasta")
        handle.close()

        header = f'>{label}\n'
        return header + str(record.seq)
    except Exception as e:
        print(f"Failed to download sequence from {accession_number}: {str(e)}")
        return None

def download_concatenate_sequences(dataset):
    concatenated_sequences = ""

    for index, row in dataset.iterrows():
        specie_label = row["specie"].replace(" ", "_")
        blast_name_label = row["blast_name"].replace(" ", "_")  # Replace spaces with underscores
        label = f'{specie_label}_{blast_name_label}_{row["accession_number"]}'
        mtDNA_sequence = download_mtDNA(row["accession_number"], label)

        if mtDNA_sequence:
            concatenated_sequences += mtDNA_sequence + '\n'

    return concatenated_sequences

def save_to_fasta(concatenated_sequences, output_file='/content/dataset/concatenated_seq.fasta'):
    with open(output_file, 'w') as fasta_file:
        fasta_file.write(concatenated_sequences)

Entrez.email = 'example@gmail.com'

dataset = pd.read_csv('/content/dataset/expanded_dataset.csv')

concatenated_sequences = download_concatenate_sequences(dataset)
save_to_fasta(concatenated_sequences)


### Step 4: Sequence Alignment

After downloading the mitochondrial DNA sequences, the next critical step is their alignment. This process allows us to compare the sequences and discern the evolutionary relationships among the species.

#### Instructions:

- **Select an Alignment Tool**: Choose one of the following alignment tools based on your project needs. Each tool has its strengths and is widely used in bioinformatics for multiple sequence alignment.

1. **MAFFT**:
    - **Brief Introduction**: MAFFT (Multiple Alignment using Fast Fourier Transform) is renowned for its speed and efficiency, particularly suitable for large datasets.
    - **Resources**:
        - [MAFFT Official Documentation](https://mafft.cbrc.jp/alignment/software/)
        - [Example Usage on GitHub](https://github.com/MountainMan12/SARS-Cov2-phylo)
        - [Relevant Notebook](https://colab.research.google.com/github/pb3lab/ibm3202/blob/master/tutorials/lab03_phylo.ipynb)

2. **Clustal Omega**:
    - **Brief Introduction**: Clustal Omega offers high-quality alignments and is user-friendly, ideal for those new to sequence alignment.
    - **Resources**:
        - [A Python wrapper around Clustal Omega](https://github.com/benchling/clustalo-python)
        - [Clustal Omega Official Website](http://www.clustal.org/omega/)

3. **MUSCLE**:
    - **Brief Introduction**: MUSCLE (Multiple Sequence Comparison by Log-Expectation) is known for its balance between speed and accuracy, making it a versatile choice for various datasets.
    - **Resources**:
        - [MUSCLE Documentation](https://drive5.com/muscle5/manual/)

- **Perform Sequence Alignment**: Utilize your chosen tool to align the downloaded mtDNA sequences. This alignment is foundational for the accurate construction of phylogenetic trees.

- **Save Aligned Sequences**: After alignment, save the output in an appropriate format for further analysis in the subsequent steps of this project.

In [None]:
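# TODO: Install and import the alignment tool
# !apt-get install -y mafft  # MAFFT is a command-line program; on Colab it can be installed with apt
# import subprocess          # call the `mafft` binary directly (Bio.Align.Applications is deprecated in recent Biopython)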

# TODO: Perform the sequence alignment
# def perform_alignment(input_file, output_file):
    # Code to align sequences using the chosen tool

# TODO: Align your downloaded sequences
# perform_alignment('path_to_downloaded_sequences.fasta', 'aligned_sequences.fasta')

In [None]:
!apt-get install -y clustalo

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libargtable2-0
The following NEW packages will be installed:
  clustalo libargtable2-0
0 upgraded, 2 newly installed, 0 to remove and 32 not upgraded.
Need to get 273 kB of archives.
After this operation, 694 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libargtable2-0 amd64 13-1.1 [14.1 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 clustalo amd64 1.2.4-7 [259 kB]
Fetched 273 kB in 0s (911 kB/s)
Selecting previously unselected package libargtable2-0.
(Reading database ... 121730 files and directories currently installed.)
Preparing to unpack .../libargtable2-0_13-1.1_amd64.deb ...
Unpacking libargtable2-0 (13-1.1) ...
Selecting previously unselected package clustalo.
Preparing to unpack .../clustalo_1.2.4-7_amd64.deb ...
Unpacking clustalo (1.2.4-7) ...
Setting up

In [None]:
import subprocess

# Define the input and output file paths
input_file_path = '/content/dataset/concatenated_seq.fasta'
output_file_path = '/content/dataset/aligned_sequences_clustalo.fasta'

# Run Clustal Omega using subprocess
clustalo_cmd = ['clustalo', '--infile', input_file_path, '--outfile', output_file_path, '--outfmt', 'fasta']

try:
    subprocess.run(clustalo_cmd, check=True)
    print(f"Clustal Omega alignment successful. Aligned sequences saved to {output_file_path}")
except subprocess.CalledProcessError as e:
    print(f"Error during Clustal Omega alignment: {e}")


Clustal Omega alignment successful. Aligned sequences saved to /content/dataset/aligned_sequences_clustalo.fasta
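
Later cells read a file named `aligned_sequences_clustalo_oneline.fasta`, which this notebook does not show being created. One plausible way to produce such a file, assuming it is simply the Clustal Omega output with each sequence unwrapped onto a single line, is sketched below. The function names are hypothetical, and the second helper doubles as a quick check that the alignment is valid (all sequences must have the same length).

```python
def unwrap_fasta(text):
    """Return FASTA text with each sequence joined onto a single line."""
    out = []
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous sequence, if any
            if chunks:
                out.append("".join(chunks))
                chunks = []
            out.append(line)
        else:
            chunks.append(line)
    if chunks:
        out.append("".join(chunks))
    return "\n".join(out) + "\n"

def aligned_lengths(one_line_text):
    """Return the set of sequence lengths; a valid alignment yields one value."""
    return {
        len(line)
        for line in one_line_text.splitlines()
        if line and not line.startswith(">")
    }

# Possible usage with the paths from this notebook (left commented out):
# with open('/content/dataset/aligned_sequences_clustalo.fasta') as src:
#     one_line = unwrap_fasta(src.read())
# assert len(aligned_lengths(one_line)) == 1, "sequences differ in length"
# with open('/content/dataset/aligned_sequences_clustalo_oneline.fasta', 'w') as dst:
#     dst.write(one_line)
```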


### Step 5: Phylogenetic Tree Construction

The next phase in our project involves constructing phylogenetic trees to visualize and analyze the evolutionary relationships among the species. We will use three distinct methods, each providing unique insights.

#### Phylogenetic Tree Construction Methods:

1. **Bayesian Inference Trees**:
    - **Overview**: This method uses Bayesian statistics to estimate the posterior probability of different evolutionary histories, given the data and a model of evolution. It's particularly useful because it estimates branch lengths together with posterior support values for each clade.
    - **Tools**: MrBayes, BEAST
        - MrBayes ([Official Website](https://nbisweden.github.io/MrBayes/manual.html/)) is widely recognized for its robustness in Bayesian inference.
        - BEAST2 ([BEAST Software](https://www.beast2.org/)) is another powerful tool, offering advanced features for complex evolutionary models.

2. **Maximum Likelihood Trees**:
    - **Overview**: Maximum Likelihood methods evaluate tree topologies based on the likelihood of observed data given a tree model. It's known for its statistical rigor and accuracy.
    - **Tools**: RAxML, PhyML
        - RAxML ([RAxML GitHub](https://github.com/stamatak/standard-RAxML)) is preferred for large datasets due to its efficiency.
        - PhyML ([PhyML Documentation](http://www.atgc-montpellier.fr/phyml/)) offers a balance of speed and accuracy, with a user-friendly interface.

3. **Neighbor-Joining Trees**:
    - **Overview**: The Neighbor-Joining method is a distance-based approach that constructs phylogenetic trees by evaluating the genetic distance between sequences. It is known for its speed and simplicity, making it well-suited for initial exploratory analyses.
    - **Tools**:
        - MEGA: A versatile tool specifically used here for constructing Neighbor-Joining trees. It's recognized for its ease of use and effectiveness in phylogenetic analysis. [MEGA Software](https://www.megasoftware.net/)
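
To make the "genetic distance" behind distance-based methods such as Neighbor-Joining concrete, here is a minimal sketch of the p-distance: the proportion of aligned sites at which two sequences differ. It assumes that columns containing a gap or missing-data character in either sequence are skipped, which is only one of several possible conventions. Biopython's `DistanceCalculator("identity")` applies its own gap handling, so its numbers may differ from this sketch.

```python
def p_distance(seq_a, seq_b, gap_chars="-?"):
    """Proportion of differing sites among columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue  # assumed convention: ignore gapped columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        return 1.0  # no comparable sites: treat as maximally distant
    return differing / compared

def pairwise_distances(records):
    """Return {(name_a, name_b): distance} for every ordered pair of sequences."""
    names = list(records)
    return {(a, b): p_distance(records[a], records[b]) for a in names for b in names}

# Toy alignment with made-up sequences:
toy = {"sp1": "ACGTAC", "sp2": "ACGTTC", "sp3": "AC-TTG"}
for pair, dist in pairwise_distances(toy).items():
    print(pair, round(dist, 3))
```

Neighbor-Joining then works only from this matrix of pairwise numbers, which is why it is fast but discards site-level information that Maximum Likelihood and Bayesian methods use.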


In [None]:
# TODO: Import necessary libraries for tree construction
# from Bio import Phylo
# from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# TODO: Construct a phylogenetic tree using the Neighbor Joining method
# def construct_tree_NJ(aligned_sequences):
    # Implement the tree construction using Neighbor Joining

# TODO: Repeat the process for Maximum Likelihood and Bayesian Inference methods
# def construct_tree_ML(aligned_sequences):
    # Implement the tree construction using Maximum Likelihood

# def construct_tree_BI(aligned_sequences):
    # Implement the construction of a tree using Bayesian inference

# TODO: Visualize and save the constructed trees
# Phylo.draw(tree, do_show=False)
# Phylo.write(tree, 'tree_output.xml', 'phyloxml')

In [None]:
!pip install biopython
# aligned_sequences_clustalo_oneline.fasta



In [None]:
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Read the aligned sequences from the FASTA file
fasta_file = "/content/dataset/aligned_sequences_clustalo_oneline.fasta"
alignment = AlignIO.read(fasta_file, "fasta")

# Compute pairwise distances with the identity model, then build the Neighbor-Joining tree
calculator = DistanceCalculator("identity")
distance_matrix = calculator.get_distance(alignment)
constructor = DistanceTreeConstructor(calculator)
nj_tree = constructor.nj(distance_matrix)

# Print the tree as ASCII art and save it in PhyloXML format
Phylo.draw_ascii(nj_tree)
Phylo.write(nj_tree, 'tree_output.xml', 'phyloxml')

  ______ Eudynamys_scolopaceus_birds_NC_060520
 |
 |       ___ Cygnus_cygnus_birds_NC_027095
 |   ___|
 |  |   |______ Aquila_nipalensis_hawks_&_eagles_NC_0...
 | _|
 || | _ Anas_platyrhynchos_birds_NC_009684
 || ||
 ||  |_ Anser_cygnoides_birds_NC_023832
 ||
 ||  _____ Oncorhynchus_keta_bony_fishes_NC_017838
 || |
 || |   ___ Callicebus_lugens_primates_NC_024630
 ||,| ,|
 |||| ||___ Ovis_aries_sheep_NC_001941
 |||| ||
 |||| ||___ Pteropus_vampyrus_bats_NC_026542
 ||||_||
 |||  ||____ Myotis_lucifugus_bats_NC_029849
 |||  |
 |||  |  _ Mus_musculus_rodents_NC_005089
 | |  |_|
 | |    |_ Rattus_norvegicus_Rodents_NC_001665
 | |
 | |   _________ Pelecanus_crispus_birds_OR620163
 | |  |
 | |  |             _________ Damon_diadema_whip_spiders_NC_011293
 | |__|            |
 |    |            |   _____ Malaza_empyreus_moths_NC_048454
 |    |            |  |
 |    |            | ,|___ Ruspolia_dubia__crickets_NC_009876
 |    |____________| ||
 |                 | ||  ___ Corydalus_cornutus_i

1

We used MEGA to import the aligned FASTA file (MEGA can also import raw FASTA files and align them itself, but here we imported the already-aligned sequences) and then constructed a Maximum Likelihood tree with all options left at their defaults.

<img src='https://drive.google.com/thumbnail?id=1vQOM-zXWq7RjlmE4dEtNw-CDAQ4xNRpE'>

The same steps were followed to construct the Neighbor-Joining tree, except that the test of phylogeny was set to the bootstrap method with 500 replications, which, according to previous tests on this subject, leads to a more accurate tree.


<img src='https://drive.google.com/thumbnail?id=1c5HsZrfs0CQ0XAbEOMi3LauQfFmuPpuM'>
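
The bootstrap test mentioned above rests on a simple idea: alignment columns are resampled with replacement to create pseudo-replicate alignments, a tree is built from each one, and the fraction of replicates that recover a given clade becomes its support value. The sketch below shows only the resampling step. It is an illustration of the general technique under assumed names and a fixed random seed, not a description of how MEGA implements it internally.

```python
import random

def bootstrap_alignment(records, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    names = list(records)
    length = len(records[names[0]])
    # Draw as many column indices as the alignment has columns
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(records[name][i] for i in columns) for name in names}

def bootstrap_replicates(records, n_replicates=500, seed=0):
    """Generate n_replicates resampled alignments from one aligned dataset."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [bootstrap_alignment(records, rng) for _ in range(n_replicates)]

# Toy alignment with made-up sequences:
toy = {"sp1": "ACGTAC", "sp2": "ACGTTC", "sp3": "ACCTTG"}
replicates = bootstrap_replicates(toy, n_replicates=3, seed=42)
for replicate in replicates:
    print(replicate)
```

Each replicate keeps the columns intact (the same index is used for every species), so the relationships within a site are preserved while the weighting of sites changes.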

To construct the Bayesian inference tree, we first converted the aligned sequences to a NEXUS file. Many tools can do this; here we used AliView to export the `.nex` file, which we then used to infer the Bayesian phylogenetic tree.

<img src='https://drive.google.com/thumbnail?id=1cgTMxY47U_nqF4JJQTvqoo1zoC-T0AnY'>

The tool used here is MrBayes (version 3.2.7). The following steps were performed to construct the BI tree.

---



1. MrBayes > exe aligned_sequences_clustalo_oneline.nex</br>
> Executing file "aligned_sequences_clustalo_oneline.nex"</br>
  DOS line termination</br>
  Longest line length = 46397</br>
  Parsing file</br>
  Expecting NEXUS formatted file</br>
  Reading data block</br>
  Allocated taxon set</br>
  Allocated matrix</br>
  Defining new matrix with 33 taxa and 46343 characters</br>
  Data is Dna</br>
  Gaps coded as -</br>
  Missing data coded as ?</br>
  Taxon  1 -> Eudynamys_scolopaceus_birds_NC_060520</br>
  Taxon  2 -> Apis_mellifera_Bees_NC_051932</br>
  Taxon  3 -> Pelecanus_crispus_birds_OR620163</br>
  Taxon  4 -> Rattus_norvegicus_Rodents_NC_001665</br>
  Taxon  5 -> Gallus_gallus_birds_NC_053523</br>
....


---
2. MrBayes > lset nst=6 rates=invgamma</br>
>  Setting Nst to 6</br>
   Setting Rates to Invgamma</br>
   Successfully set likelihood model parameters

---

3. MrBayes > mcmc ngen=20000 samplefreq=100 printfreq=100 diagnfreq=1000</br>
>  Setting number of generations to 20000</br>
   Setting sample frequency to 100</br>
   Setting print frequency to 100</br>
   Setting diagnosing frequency to 1000</br>
   Running Markov chain</br>
   MCMC stamp = 5988690643</br>
   Seed = 762457471</br>
   Swapseed = 1707117446</br>
....</br>
   19600 -- (-369057.537) (-369132.535) [-369032.121] (-369076.958) * [-369004.518] (-369008.116) (-369010.075) (-369016.818) -- 0:00:42 </br>
   19700 -- (-369055.871) (-369134.445) [-369030.308] (-369078.326) * (-369002.091) [-369007.878] (-369008.541) (-369021.728) -- 0:00:31</br>
   19800 -- (-369052.821) (-369128.822) [-369033.383] (-369076.015) * [-368999.426] (-369001.347) (-369009.764) (-369018.846) -- 0:00:21</br>
   19900 -- (-369051.006) (-369125.358) [-369038.452] (-369076.178) * [-368996.576] (-369004.653) (-368999.092) (-369019.324) -- 0:00:10</br>
   20000 -- (-369049.700) (-369125.901) [-369032.708] (-369074.678) * [-368993.184] (-368997.261) (-369004.145) (-369020.128) -- 0:00:00</br></br>

   Average standard deviation of split frequencies: 0.166069</br></br>

   Continue with analysis? (yes/no):</br>
  Additional number of generations: no</br>


---
4. MrBayes > sump
>   <p>Summarizing parameters in files aligned_sequences_clustalo_oneline.nex.run1.p and aligned_sequences_clustalo_oneline.nex.run2.p
   Writing summary statistics to file aligned_sequences_clustalo_oneline.nex.pstat
   Using relative burnin ('relburnin=yes'), discarding the first 25 % of samples. </br> You can use these
   graphs to determine what the burn in for your analysis should be.
   When the log probability starts to plateau you may be at stationarity.
   Sample trees and parameters after the log probability plateaus.
   Also examine the convergence diagnostics provided by
   the 'sump' and 'sumt' commands for all the parameters in your
   model. Remember that the burn in is the number of samples to discard.</p>
...

---
5. MrBayes > sumt
>   Summarizing trees in files "aligned_sequences_clustalo_oneline.nex.run1.t" and "aligned_sequences_clustalo_oneline.nex.run2.t". Writing statistics to files aligned_sequences_clustalo_oneline.nex.</br>

   Summary statistics for partitions with frequency >= 0.10 in at least one run:</br>
       Average standard deviation of split frequencies = 0.166069</br>
       Maximum standard deviation of split frequencies = 0.646230</br>
       Average PSRF for parameter values (excluding NA and >10.0) = 1.076</br>
       Maximum PSRF for parameter values = 1.725</br>
...




<img src='https://drive.google.com/thumbnail?id=1ur-6eCDgCMjB8Gq--RIT5YUr95oydOHF'>
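
The MrBayes documentation suggests, as a rough guide, treating an average standard deviation of split frequencies below about 0.01 as a good sign that the two runs have converged. The value of 0.166069 reported above is well over that, so more generations would likely be needed before the resulting tree can be trusted. The sketch below pulls this diagnostic out of saved log text and compares it with a threshold; the function names and the default threshold are assumptions made for illustration.

```python
import re

# Matches both "...split frequencies: 0.166069" and "...split frequencies = 0.166069"
ASDSF_PATTERN = re.compile(
    r"Average standard deviation of split frequencies\s*[:=]\s*(\d+\.\d+)"
)

def extract_asdsf(log_text):
    """Return every reported average standard deviation of split frequencies."""
    return [float(value) for value in ASDSF_PATTERN.findall(log_text)]

def looks_converged(asdsf, threshold=0.01):
    """Apply the rough rule of thumb: smaller than the threshold counts as converged."""
    return asdsf < threshold

# Lines copied from the session log above:
log_text = """
   Average standard deviation of split frequencies: 0.166069
       Average standard deviation of split frequencies = 0.166069
       Maximum standard deviation of split frequencies = 0.646230
"""
for value in extract_asdsf(log_text):
    print(value, "converged" if looks_converged(value) else "not converged")
```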

### Step 6: In-Depth Phylogenetic Tree Visualization

Having constructed phylogenetic trees using different methods, our next task is to visualize these trees effectively. This step is crucial for interpreting the results and communicating our findings.

#### Visualization Tools:

1. **FigTree**:
    - **Overview**: FigTree is designed for the graphical representation of phylogenetic trees. It's excellent for creating publication-ready visualizations.
    - **Resource**: [FigTree Tool](http://tree.bio.ed.ac.uk/software/figtree/)
    - **Usage**: Use FigTree to add detailed annotations, adjust branch colors, and format tree layouts for clear, interpretable visualizations.

2. **iTOL (Interactive Tree Of Life)**:
    - **Overview**: iTOL is a web-based tool for the display, annotation, and management of phylogenetic trees, offering extensive customization options.
    - **Resource**: [iTOL Website](https://itol.embl.de/)
    - **Usage**: Ideal for interactive tree visualizations. It allows users to explore different layers of data through their tree, such as adding charts or color-coding branches.

3. **Dendroscope**:
    - **Overview**: Dendroscope is a software program for viewing and editing phylogenetic trees, particularly useful for large datasets.
    - **Resource**: [Dendroscope Download](https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/dendroscope/)
    - **Usage**: Utilize Dendroscope when dealing with large and complex trees or when you need to compare multiple trees side-by-side.

#### Task:

- **Visualize Each Tree**: Use one or more of the above tools to visualize the phylogenetic trees you constructed using Bayesian inference, maximum likelihood, and neighbor-joining methods.
- **Highlight Differences**: Focus on highlighting the differences and similarities between the trees obtained from the different methods. Pay attention to tree topology, branch lengths, and any notable patterns.
- **Interpretation and Presentation**: Aim for visualizations that are not only accurate but also interpretable and visually appealing. This will enhance the clarity of your work.
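
Differences in topology can also be quantified before they are highlighted visually. The sketch below computes the Robinson-Foulds distance (the number of splits found in one tree but not the other) between two Newick strings using only the standard library. The taxon names `A` to `E` and the two example trees are hypothetical placeholders, and the small parser ignores branch lengths and support values. Libraries named in the roadmap, such as ETE Toolkit and DendroPy, provide their own tree-comparison routines and would be preferable for real data.

```python
def parse_newick(newick):
    """Parse a Newick string into nested tuples of leaf names.

    Branch lengths and internal node labels are discarded.
    """
    text = newick.strip().rstrip(";")
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        # Drop any ':branch_length' suffix.
        return text[start:pos].split(":")[0].strip()

    def read_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [read_node()]
            while text[pos] == ",":
                pos += 1
                children.append(read_node())
            pos += 1  # consume ')'
            read_label()  # skip the internal node's label or length
            return tuple(children)
        return read_label()

    return read_node()


def leaf_set(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def bipartitions(tree):
    """Collect the non-trivial splits of a tree, treating it as unrooted."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        other = all_leaves - clade
        if len(clade) > 1 and len(other) > 1:
            # Canonical form: keep the side without the reference leaf.
            splits.add(other if reference in clade else clade)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def robinson_foulds(newick_a, newick_b):
    """Count the splits present in exactly one of the two trees."""
    tree_a, tree_b = parse_newick(newick_a), parse_newick(newick_b)
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("Both trees must contain the same taxa")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical trees standing in for, e.g., the ML and NJ results.
ml_tree = "((A:0.1,B:0.2):0.3,(C,D),E);"
nj_tree = "((A,C),(B,D),E);"
print(robinson_foulds(ml_tree, nj_tree))
```

A distance of 0 means the two trees share every split. Larger values point to the clades worth highlighting in the side-by-side figures.
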

Maximum likelihood tree visualized in FigTree:

<img src='https://drive.google.com/thumbnail?id=1OBpyIly9GtAEVaKxwnmRYjJ-utzvz-xI'>

Neighbor-joining tree visualized in FigTree:

<img src='https://drive.google.com/thumbnail?id=1REJVSDY0JGn1zId73CzOX9gjQnpgkj8w'>

Bayesian inference tree visualized in FigTree:

<img src='https://drive.google.com/thumbnail?id=14NtmOAhDhdl8IjkzD-2tTecOMzwced5M'>

### Cross-Disciplinary Applications (Optional)

This optional part carries a bonus that scales with the depth of your analysis. Refer to the first part of this notebook. You have complete freedom in how you approach it, but a bare minimum of effort is required to earn any portion of the bonus score for this section.

### Conclusion and Reflective Insights

As we conclude our exploration of phylogenetic tree construction and analysis, let's reflect on the insights learned from this task and consider questions that emerge from our findings.

#### Interpretation of Results:

- Reflect on the phylogenetic trees produced by each method (Bayesian inference, maximum likelihood, and neighbor-joining). Consider how the differences in tree topology might offer varied perspectives on the evolutionary relationships among the species.

#### Questions to Ponder:

1. **Species Divergence**: Based on the trees, which species appear to have the most ancient divergence? How might this information contribute to our understanding of their evolutionary history?
   
2. **Common Ancestors**: Are there any unexpected pairings or groupings of species that suggest a closer evolutionary relationship than previously thought? How could this reshape our understanding of these species' evolutionary paths?

3. **Methodology Insights**: Considering the discrepancies between the trees generated by different methods, what might this tell us about the limitations and strengths of each phylogenetic analysis method?

4. **Conservation Implications**: Considering the evolutionary relationships revealed in your phylogenetic analysis, what insights can be gained for conservation strategies? Specifically, how could understanding the close evolutionary ties between species, which might be facing distinct environmental challenges, guide targeted conservation efforts?

<blockquote style="font-family:Arial; color:red; font-size:16px; border-left:0px solid red; padding: 10px;">
    <strong>Don't forget to answer these questions!</strong>
</blockquote>


**Species Divergence**
Considering the phylogenetic trees and the times corresponding to each species, "Pelecanus crispus" is the most ancient lineage among the species in our dataset.
Understanding the most ancient divergences among species provides useful insights into evolutionary history. By identifying basal positions on phylogenetic trees, scientists can root the tree of life, construct evolutionary timelines, and unravel biogeographic events.<br/>
**Common Ancestors**
The only grouping that looks unusual is that of sheep, primate, bat, and rodent with the bony fishes: the first group, which is entirely mammalian, appears to share a common ancestor with a fish, which is unexpected. My personal interpretation is that this result may point to interspecies connections that led to the creation of new species, although such a grouping can also be an artifact of tree inference rather than a true relationship.
It suggests that key innovations, such as terrestrial locomotion in mammals, may have deeper roots in vertebrate evolution. Understanding this shared ancestry provides insights into both the constraints and the opportunities that have shaped their evolutionary trajectories.<br/>
**Methodology Insights**
Neighbor joining is a clustering algorithm that builds trees quickly but is not the most reliable, especially for deep divergences. It is good for getting an overall idea of the data but is rarely acceptable on its own for publication. Neighbor joining is the most vulnerable of the three methods to missing data, and maximum likelihood is the least vulnerable. The reason is simple: the more data are missing, the less the distance matrix reflects the true tree.<br/>
Maximum likelihood and Bayesian methods apply an explicit model of sequence evolution and are well suited to building a phylogeny from sequence data. They are the two methods most often used in publications. Their main downside is that they are computationally expensive.
If the phylogeny is the main focus of your work, my suggestion is to build both maximum likelihood and Bayesian trees. Both methods require choosing a model of sequence evolution.<br/>
The lower the diversity in your data, the less discriminative maximum likelihood and Bayesian inference will be. You will not necessarily get wrong trees, but essentially random ones. Maximum likelihood will still choose one tree, but that tree is not significantly better than the alternatives and may include numerous branches with lower bootstrap (BS) support than conflicting alternatives.<br/>
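
The effect of missing data on distance-based methods can be illustrated with a small sketch. The function below computes a pairwise p-distance (the proportion of differing sites) over positions where both aligned sequences have data. The function name `p_distance`, the set of missing-data symbols, and the toy sequences are assumptions made for illustration and are not taken from the project dataset.

```python
def p_distance(seq_a, seq_b, missing="-N?"):
    """Return (p-distance, number of compared sites) for two aligned sequences.

    Positions where either sequence has a gap or ambiguity symbol are skipped.
    """
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in missing or b in missing:
            continue
        compared += 1
        differences += a != b
    if compared == 0:
        return float("nan"), 0
    return differences / compared, compared


# Toy aligned sequences (hypothetical): the second pair hides both
# differing sites behind gaps.
reference = "ACGTACGTAC"
complete = "ACGTTCGTAA"
gappy = "ACGT-CGTA-"

print(p_distance(reference, complete))
print(p_distance(reference, gappy))
```

In this toy case the two differing sites fall exactly where the gappy sequence has no data, so the estimated distance drops from 0.2 to 0.0. A distance matrix built from such values would no longer reflect the underlying differences, which is the weakness of neighbor joining described above.
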
**Conservation Implications**
Understanding the close evolutionary relationships between species helps us develop conservation strategies that benefit multiple related species. By identifying key species that represent evolutionary diversity, we can prioritize their protection to safeguard the broader evolutionary heritage. This approach, known as cross-species conservation, allows us to leverage shared ecological requirements and genetic traits. Additionally, recognizing these evolutionary ties informs adaptive management strategies, such as translocations, which enhance genetic diversity and resilience to environmental changes. Ultimately, integrating evolutionary perspectives into conservation planning improves the efficiency and effectiveness of conservation actions, ensuring the preservation of biodiversity and ecosystem integrity in the face of environmental threats.