# Project 2

### Scientific Question: How similar, in nucleotide sequence and protein structure, is the CD4 cell surface receptor of animal species that can contract immunodeficiency viruses to the CD4 of species with no known immunodeficiency virus infection?

CD4 is a transmembrane glycoprotein expressed on the surface of T cells. In humans, CD4 is the receptor that HIV's gp120 envelope glycoprotein binds, which initiates fusion between the viral envelope and the T cell membrane. Other species can contract their own types of immunodeficiency virus and have their own versions of the CD4 glycoprotein. Infection with an immunodeficiency virus compromises the functions of the immune system.

The sequence data used in this analysis are sourced from the NCBI gene library, a government-funded public database for biotechnology information. Information on the structure of the proteins of interest is sourced from the RCSB Protein Data Bank, a database of three-dimensional structural data for large biological molecules.

### Scientific Hypothesis: If the presence of CD4 cell surface receptors is the gateway to different variations of immunodeficiency virus infection and pathogenesis, then the CD4 of species that can contract these viruses (such as chimpanzees and cats) will show some noticeable sequence or structural difference from the CD4 of species that do not contract the disease.

Some species other than humans can contract a form of immunodeficiency retrovirus, while many others do not. Pan troglodytes, more commonly known as the chimpanzee, and Felis catus, the common house cat, are known to contract a form of immunodeficiency virus and will be used for the data analysis. Colobus guereza, an Old World monkey, and Canis lupus familiaris, the domestic dog, have no known immunodeficiency virus and will also be used for this comparison.

The amino acid sequences of the CD4 protein are downloaded as FASTA files, which Biopython can read and analyze easily.
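To make the FASTA format concrete, here is a minimal standard-library sketch of what a FASTA reader does: a header line starting with `>` followed by one or more lines of residues. The function name `read_fasta` and the header description text are made up for illustration; the notebook itself relies on Biopython's `SeqIO` for this step.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Illustrative record: the residues are the first 54 of the human CD4
# sequence used in this notebook, wrapped over two lines; the description
# after the accession is a placeholder.
example = """>QDC22486.1 example record
MNRGVPFRHLLLVLQLALLPAATQGKK
VVLGKKGDTVELTCTASQKKSIQFHWK
"""

records = list(read_fasta(example.splitlines()))
for header, seq in records:
    print(header.split()[0], len(seq))
```

Wrapped sequence lines are joined into one string, which is why the record length does not depend on how the file is line-wrapped.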

### Part 1: Load the Packages

This notebook uses pandas, NumPy, and Biopython.

In [1]:
# installing packages (run once, e.g. in a terminal): pip install biopython pandas numpy

In [2]:
# importing packages
import pandas as pd
import numpy as np
from Bio.Blast import NCBIWWW
from Bio import SeqIO
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

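pandas and NumPy provide tabular and numerical data handling. The remaining imports come from Biopython: `Bio.Blast.NCBIWWW` submits BLAST searches to NCBI's servers over the internet, `Bio.SeqIO` reads and writes sequence files such as FASTA, and `Bio.pairwise2` performs pairwise sequence alignment, with `format_alignment` printing each alignment in a readable form. Newer Biopython releases deprecate `pairwise2` in favor of `Bio.Align.PairwiseAligner`.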

### Part 2: Load in Data and Perform Bioinformatics Analysis

In [3]:
# assigning fasta file to a variable
human_cd4_aa = 'homo_sapien_CD4_aa.fasta'

# reading in human CD4 and checking fasta file
# the file holds a single record, so SeqIO.read returns it directly
human_record = SeqIO.read(human_cd4_aa, "fasta")
print(human_record.id)
print(human_record.name)
print(repr(human_record.seq))
print(len(human_record))

# human CD4 sequence variable
human_cd4 = human_record.seq
# checking sequence
print(human_cd4)

QDC22486.1
QDC22486.1
Seq('MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWK...SPI')
458
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI


In [4]:
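# submit the human CD4 sequence to NCBI's online BLAST service (blastp against the nr protein database)
# this call needs an internet connection and may take a while to return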
result_handle = NCBIWWW.qblast("blastp", "nr", human_cd4)
print(result_handle)

<_io.StringIO object at 0x00000231310D68B0>


In [5]:
# saving blast results to an xml file
with open('CD4_BLASTresults.xml', 'w') as save_file: 
    blast_results = result_handle.read() 
    save_file.write(blast_results)

In [6]:
from Bio.Blast import NCBIXML
# reading the saved BLAST results back in; the with block closes the file afterwards
with open("CD4_BLASTresults.xml") as handle:
    blast_record = NCBIXML.read(handle)

In [7]:
# viewing blast hits
for hit in blast_record.descriptions: 
    print(hit.title)
    print(hit.e)

gb|AAR13802.1| CD4-EGFP [Cloning vector pMACSiBac]
0.0
ref|NP_000607.1| T-cell surface glycoprotein CD4 isoform 1 precursor [Homo sapiens] >ref|NP_001369636.1| T-cell surface glycoprotein CD4 isoform 1 precursor [Homo sapiens] >sp|P01730.1| RecName: Full=T-cell surface glycoprotein CD4; AltName: Full=T-cell surface antigen T4/Leu-3; AltName: CD_antigen=CD4; Flags: Precursor [Homo sapiens] >gb|AAX41622.1| CD4 antigen [synthetic construct] >emb|SJX39576.1| unnamed protein product, partial [Human ORFeome Gateway entry vector] >gb|AAA35572.1| T4 surface glycoprotein precursor [Homo sapiens] >gb|AAB51309.1| surface antigen CD4 [Homo sapiens] >gb|AAH25782.1| CD4 molecule [Homo sapiens]
0.0
gb|AAA16069.1| T4 surface glycoprotein precursor [Homo sapiens]
0.0
ref|XP_008971926.1| T-cell surface glycoprotein CD4 [Pan paniscus]
0.0
gb|ABX56472.1| T-cell surface glycoprotein CD4 [Pan troglodytes] >gb|AKC01945.1| T-cell surface glycoprotein CD4 [Pan troglodytes verus] >gb|ABX56478.1| T-cell surface 
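Each hit title carries the source organism in square brackets, and a single hit can merge several database entries separated by `>`. As a rough sketch of how the hit list could be summarized by species, the helper below pulls out every bracketed name with a regular expression. The function name `species_in_title` is hypothetical, and the titles are copied from the output above; in the notebook the same function could be applied to `hit.title` for each entry in `blast_record.descriptions`.

```python
import re
from collections import Counter


def species_in_title(title):
    """Return the organism names found in square brackets in a BLAST hit title."""
    return re.findall(r"\[([^\]]+)\]", title)


# Hit titles copied from the BLAST output shown above
titles = [
    "gb|AAR13802.1| CD4-EGFP [Cloning vector pMACSiBac]",
    "gb|AAA16069.1| T4 surface glycoprotein precursor [Homo sapiens]",
    "ref|XP_008971926.1| T-cell surface glycoprotein CD4 [Pan paniscus]",
]

# Count how often each organism name appears across the titles
species_counts = Counter(name for t in titles for name in species_in_title(t))
for name, count in species_counts.most_common():
    print(name, count)
```

Note that bracketed names are not always species: the first hit above is a cloning vector, so a real analysis would probably want to filter such entries out.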

In [8]:
# opening fasta files of four different species
# SeqIO.parse returns an iterator over the records in each file; the records are read in the next cell
pan_troglodytes_CD4_aa = SeqIO.parse('pan_troglodytes_CD4.fasta', "fasta")
felis_catus_CD4_aa = SeqIO.parse('felis_catus_CD4.fasta', "fasta")
colobus_guereza_CD4_aa = SeqIO.parse('colobus_guereza_CD4.fasta', "fasta")
canis_lupus_familiaris_CD4_aa = SeqIO.parse('canis_lupus_familiaris_CD4.fasta', "fasta")

# list of the four record iterators
species_seq = [pan_troglodytes_CD4_aa, felis_catus_CD4_aa, colobus_guereza_CD4_aa, canis_lupus_familiaris_CD4_aa]

In [9]:
# creating amino acid sequence variables for each species
# a single loop works as long as each sequence is stored under its own key
# instead of being assigned to one variable that is reused on every pass
parsers = {
    "chimp": pan_troglodytes_CD4_aa,
    "feline": felis_catus_CD4_aa,
    "colobus": colobus_guereza_CD4_aa,
    "canine": canis_lupus_familiaris_CD4_aa,
}
cd4_seqs = {}
for name, parser in parsers.items():
    for seq_record in parser:
        print(seq_record.id)
        print(repr(seq_record.seq))
        print(len(seq_record))
    # keep the sequence of the last (here, only) record in the file
    cd4_seqs[name] = seq_record.seq

chimp_cd4 = cd4_seqs["chimp"]
feline_cd4 = cd4_seqs["feline"]
colobus_cd4 = cd4_seqs["colobus"]
canine_cd4 = cd4_seqs["canine"]

QCV56761.1
Seq('MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWK...SPI')
458
AAB24450.1
Seq('MNQGAAFRHLLLVLQLVMLEAAVPQGKEVVLGKAGGTAELPCQASQKKYMTFTW...NPI')
474
AHY35179.1
Seq('MNRGISFRHLLLVLQLALHPAVTQGKNVVLGKKGDTVELTCNAPSKKNIQFHWK...HRR')
426
CAB37664.1
Seq('LMLQLVMLPAVTPVREVVLGKAGDAVELPCQTSQKKNIHFNWRDSSMVQILGNQ...KRL')
432


In [10]:
# list of CD4 sequences
species_list = [human_cd4, chimp_cd4, feline_cd4, colobus_cd4, canine_cd4]
for species in species_list:
    print(species)
    print('|')

MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
|
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQTKILGNQGSFLTKGPSKLNDRVDSRRSLWDQGNFTLIIKNLKIEDSDTYICEVGDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAQRMSQIKRLLSEKKTCQCPHRFQKTCSPI
|
MNQGAAFRHLLLVLQLVMLEAAVPQGKEVVLGKAGGTAELPCQASQKKYMTFTWRLSSQVKILESQHSSLILTGSSKL
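The human and chimpanzee sequences printed above have the same length (458 residues) and differ at only a few positions, so a position-by-position comparison is a quick first look before any alignment. The sketch below is one possible way to list those substitutions; the function name `substitutions` is made up for illustration, and the two excerpts are short stretches copied from the printed sequences, so the reported position is relative to the excerpt rather than to the full protein.

```python
def substitutions(ref, other):
    """List (1-based position, ref residue, other residue) where two equal-length sequences differ."""
    if len(ref) != len(other):
        raise ValueError("sequences must have the same length; align them first")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(ref, other), start=1)
        if a != b
    ]


# Short excerpts copied from the human and chimpanzee sequences printed above
human_excerpt = "HWKNSNQIKILGNQGSF"
chimp_excerpt = "HWKNSNQTKILGNQGSF"

diffs = substitutions(human_excerpt, chimp_excerpt)
print(diffs)
```

With the notebook's variables, the call would be `substitutions(str(human_cd4), str(chimp_cd4))`. The other species have different lengths, so they need an alignment first.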

In [11]:
# pairwise sequence alignment
                    # this loop reassigns `alignments` on every pass, so once it finishes only the human vs. last-species alignments are left to print
# for seq in species_list:
#     alignments = pairwise2.align.globalxx(human_cd4, seq)
# for score in alignments:
#     print(format_alignment(*score))
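Two points about the `globalxx` alignment may be easier to see in a small standalone sketch. First, `globalxx` scores a match as 1 and charges nothing for mismatches or gaps, so its best score should equal the length of the longest common subsequence of the two sequences. Second, results survive a loop only if each one is stored under its own key. The function `match_count_score` below is a standard-library stand-in written for illustration, not Biopython's implementation, and the inputs are just the first 12 residues of the human, chimpanzee, and feline sequences shown in this notebook.

```python
def match_count_score(a, b):
    """Length of the longest common subsequence of a and b.

    With match = 1, mismatch = 0 and free gaps, this is the best
    achievable global alignment score.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)  # extend a match
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a residue in a or b
        prev = curr
    return prev[-1]


# First 12 residues of sequences shown in this notebook
reference = "MNRGVPFRHLLL"  # human
others = {"chimp": "MNRGVPFRHLLL", "feline": "MNQGAAFRHLLL"}

scores = {}
for name, seq in others.items():
    # one dictionary entry per species, so no result is overwritten
    scores[name] = match_count_score(reference, seq)
print(scores)
```

Because this scoring ignores mismatches and gaps, sequences of different lengths can still score high; a substitution matrix with gap penalties would give a stricter comparison.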

In [12]:
# alternative to commented out code above
alignment0 = pairwise2.align.globalxx(human_cd4, human_cd4)
alignment1 = pairwise2.align.globalxx(human_cd4, chimp_cd4)
alignment2 = pairwise2.align.globalxx(human_cd4, feline_cd4)
alignment3 = pairwise2.align.globalxx(human_cd4, colobus_cd4)
alignment4 = pairwise2.align.globalxx(human_cd4, canine_cd4)

# formatting alignments and printing them with their scores
# globalxx returns the alignments that tie for the highest score, so several variations print for each species
alignments = [alignment0, alignment1, alignment2, alignment3, alignment4]
for species_alignments in alignments:
    for aln in species_alignments:
        print(format_alignment(*aln))

MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS