Welcome to the Bio 1B notebook for Phylogeny Part B: Molecular Estimation of the Phylogeny of Primates Using the Epsilon Hemoglobin Gene! In this notebook we will estimate the phylogeny of primates from epsilon hemoglobin gene sequences, using data science tools to carry out the analysis.

## Learning Outcomes

- Understand Biopython

- Understand the process of parsimony analysis

- Select an outgroup and display it on a tree

- Understand the phylogeny of primates and its interdisciplinary connection with data science

## Reading in the Data

We need to import our libraries first.

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns

We also need to install Biopython, a set of Python tools for computational biology and bioinformatics. You can learn and explore more in the Biopython documentation.

In [6]:
pip install biopython

Collecting biopython
  Downloading biopython-1.81-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 122.0 MB/s eta 0:00:00
Installing collected packages: biopython
Successfully installed biopython-1.81
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [7]:
from Bio import Phylo

We will begin by importing SeqIO, which lets us parse FASTA files. FASTA is a text-based format for representing nucleotide sequences or amino acid (protein) sequences, and it is used extensively in biological data analysis. Run the following cell to see each species in the file paired with its DNA sequence and the length of that sequence in bases.

In [16]:
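from Bio import SeqIO

# Loop over every record in the FASTA file
for seq_record in SeqIO.parse("Primate_Epsilon.fas", "fasta"):
    print(seq_record.id)         # species name from the header line
    print(repr(seq_record.seq))  # abbreviated view of the DNA sequence
    print(len(seq_record))       # sequence length in bases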

Galago
Seq('CTTTGACCAATGACTTCTAACTACCACGGAGAACAAGGGGCTAGAACTTCAGCA...TCA')
1620
Lemur
Seq('CCTTGACCAATGACTTCTAACTACCACGGAGAGCAAGGGGCCAGAACATCAGCA...TCA')
1637
Goat
Seq('CCTTGACCAATGACTTCAAAGGACAAGGGGGAGCAAGGGGGCAGAAGTTCAGCA...TCA')
1608
Tarsier
Seq('ATCACTAGCAAGTTGCCAGACCTGACATCATGGTGCATCTTACTGCTGAAGAAA...TCA')
1510
Marmoset
Seq('CCTTGACCAATGACTTTTAAGTACCATGGAGAATAGGGAGCAGAACTTCAGCAG...TCA')
1507
Chimpanzee
Seq('CCTTGACCAATGACTTTTAAGTACCATGGAGAACAGGGGGCCAGAACTTCGGCA...TCA')
1660
Gorilla
Seq('CCTTGACCAATGACTTTTAAGTACCATGGAGAACAGGGGGCCAGAACTTCGGCA...TCA')
1662
Gibbon
Seq('CCTTGACCAATGACTTTTAAGTACCACGGAGAACAGGGGGCCAGAACTTCGGCA...TCA')
1672
Human
Seq('CCTTGACCAATGACTTTTAAGTACCATGGAGAACAGGGGGCCAGAACTTCGGCA...TCA')
1659
Orangutan
Seq('CCTTGACCAATGACTTTTAAATACCATGGAGAACAGGGGGCCAGAACTTCGGCA...TCA')
1667
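To make the FASTA format more concrete, here is a minimal pure-Python sketch of what a parser does with it. It is not how `SeqIO.parse` is implemented. `parse_fasta` and `toy_fasta` are hypothetical names invented for this illustration, and the toy text reuses only the first 20 bases of the Human and Gibbon sequences printed above.

```python
from io import StringIO

# Toy FASTA text (an assumption for illustration): each record starts with a
# ">" header line, followed by one or more lines of sequence.
toy_fasta = """>Human
CCTTGACCAA
TGACTTTTAA
>Gibbon
CCTTGACCAATGACTTTTAA
"""

def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from FASTA-formatted text.

    Hypothetical helper, not part of Biopython.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if name is not None:
        yield name, "".join(chunks)

records = list(parse_fasta(StringIO(toy_fasta)))
for name, seq in records:
    print(name, seq, len(seq))
```

In practice `SeqIO.parse` is the better choice, since it returns full `SeqRecord` objects and handles many more file formats.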


Run the next cell to list just the names of the species represented in the FASTA file.

In [24]:
from Bio import SeqIO

for record in SeqIO.parse("Primate_Epsilon.fas", "fasta"):
    print(record.id)

Galago
Lemur
Goat
Tarsier
Marmoset
Chimpanzee
Gorilla
Gibbon
Human
Orangutan


Next, we'll store each species' DNA sequence record in a Python dictionary. A Python dictionary maps each key to a value, which makes it easy to link two related pieces of information, like the name of a primate and that primate's sequence record! Run the cell and check it out.

In [28]:
record_dict = SeqIO.to_dict(SeqIO.parse("Primate_Epsilon.fas", "fasta"))
record_dict

{'Galago': SeqRecord(seq=Seq('CTTTGACCAATGACTTCTAACTACCACGGAGAACAAGGGGCTAGAACTTCAGCA...TCA'), id='Galago', name='Galago', description='Galago', dbxrefs=[]),
 'Lemur': SeqRecord(seq=Seq('CCTTGACCAATGACTTCTAACTACCACGGAGAGCAAGGGGCCAGAACATCAGCA...TCA'), id='Lemur', name='Lemur', description='Lemur', dbxrefs=[]),
 'Goat': SeqRecord(seq=Seq('CCTTGACCAATGACTTCAAAGGACAAGGGGGAGCAAGGGGGCAGAAGTTCAGCA...TCA'), id='Goat', name='Goat', description='Goat', dbxrefs=[]),
 'Tarsier': SeqRecord(seq=Seq('ATCACTAGCAAGTTGCCAGACCTGACATCATGGTGCATCTTACTGCTGAAGAAA...TCA'), id='Tarsier', name='Tarsier', description='Tarsier', dbxrefs=[]),
 'Marmoset': SeqRecord(seq=Seq('CCTTGACCAATGACTTTTAAGTACCATGGAGAATAGGGAGCAGAACTTCAGCAG...TCA'), id='Marmoset', name='Marmoset', description='Marmoset', dbxrefs=[]),
 'Chimpanzee': SeqRecord(seq=Seq('CCTTGACCAATGACTTTTAAGTACCATGGAGAACAGGGGGCCAGAACTTCGGCA...TCA'), id='Chimpanzee', name='Chimpanzee', description='Chimpanzee', dbxrefs=[]),
 'Gorilla': SeqRecord(seq=Seq('CCTTGACCAAT
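As a rough sketch of why the key-to-value mapping is handy, the example below uses a plain dictionary built from the sequence lengths printed earlier in this notebook. `seq_lengths` is a hypothetical stand-in for `record_dict`, chosen so the example runs without the FASTA file.

```python
# Sequence lengths copied from the printed output earlier in this notebook.
# seq_lengths is a hypothetical stand-in for record_dict.
seq_lengths = {
    "Galago": 1620, "Lemur": 1637, "Goat": 1608, "Tarsier": 1510,
    "Marmoset": 1507, "Chimpanzee": 1660, "Gorilla": 1662,
    "Gibbon": 1672, "Human": 1659, "Orangutan": 1667,
}

# Look up a value directly by its key
human_len = seq_lengths["Human"]

# Find the species with the longest and shortest sequences
longest = max(seq_lengths, key=seq_lengths.get)
shortest = min(seq_lengths, key=seq_lengths.get)

print("Human:", human_len)
print("Longest:", longest, seq_lengths[longest])
print("Shortest:", shortest, seq_lengths[shortest])
```

The same lookup pattern applies to `record_dict`: for example, `record_dict["Human"].seq` retrieves the sequence stored in the Human record.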

In your own words (1-2 sentences), why might it be important for biologists to have the Biopython library for programming? How does it help biologists in this scenario of working with DNA sequences?

Your answer here

## Aligning Data with ClustalQ

## Parsimony Analysis

## Tree Display
