# Part 2 - Accessing & Working with DNA, RNA & Protein Sequences

In this notebook we will start working with biological sequences by retrieving records and looking at their structure and the information associated with them. We will also start manipulating the sequences and performing some basic analysis to become more familiar with the sorts of operations and processes we can perform.

We have included web links, where appropriate, to additional information and web-based resources that can either replace or complement working in the Python environment. It is absolutely fine to use web-based tools for bioinformatics work, but those tools are often limited in functionality in ways that eventually become problematic in real-life analysis situations. This is why, if you would like to pursue further study or research in bioinformatics and related disciplines, it is a good plan to begin learning the two core programming languages in common use: [Python](https://www.learnpython.org) and the statistical programming language [R](https://cran.r-project.org).

In [2]:
# install and/or load BioPython
%pip install biopython

# replace this with your e-mail address
EMAIL = 's2614533@ed.ac.uk'

Note: you may need to restart the kernel to use updated packages.


First we load the Entrez module from BioPython.

You can read the description of this module [here](https://biopython.org/DIST/docs/api/Bio.Entrez-module.html)

In [None]:
from Bio import Entrez

Entrez.email = EMAIL

# note the egquery function provides Entrez database counts from a global search.
handle = Entrez.egquery(term="Cypripedioideae")
record = Entrez.read(handle)
handle.close()

print(type(record))

# Look at what is inside the record object
print(record.keys())

# The first contains the search term
print(record['Term'])

# The second contains a list of results from different Entrez Databases
for row in record['eGQueryResult']:
    print(row)

# we can iterate through the record and only return the 'nucleotide' result
for row in record["eGQueryResult"]:
    if row["DbName"]=="nuccore":
        print('***',row)
        # print just how many nucleotide entries there are
        print(row["Count"])

Note the number of nucleotide sequences returned and compare it to the result you get if you search for "Cypripedioideae" using the [Entrez Search Webpage](https://www.ncbi.nlm.nih.gov/search/). For interest, the Cypripedioideae are a subfamily of orchids (one member is the [Lady's Slipper Orchid](https://en.wikipedia.org/wiki/Cypripedium_calceolus)).
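The filtering step in the cell above can be tried out without a network connection. The sketch below uses a small hand-written stand-in for `record['eGQueryResult']`: the counts are placeholder values, not real NCBI results, and the helper name `counts_by_database` is our own. It assumes each row has the `DbName` and `Count` keys used above and that counts arrive as strings, some of which may not be numeric.

```python
def counts_by_database(rows):
    """Map each database name to its hit count as an int.

    Rows whose Count is not a plain number are skipped.
    """
    counts = {}
    for row in rows:
        count = str(row["Count"])
        if count.isdigit():
            counts[row["DbName"]] = int(count)
    return counts


# Hand-written stand-in for record['eGQueryResult'] (placeholder counts)
example_rows = [
    {"DbName": "pubmed", "Count": "12"},
    {"DbName": "nuccore", "Count": "345"},
    {"DbName": "protein", "Count": "Error"},
]

counts = counts_by_database(example_rows)
print(counts["nuccore"])
```

Building a dictionary once makes it easy to look up several databases without looping over the rows each time.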

Let's now select a particular sequence and download it for further analysis.

In [4]:
from Bio import Entrez

Entrez.email = EMAIL

# we're going to search for up to 1000 sequences and we're going to ask for the accession number for each

# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='nucleotide',term="Cypripedioideae",retmax=1000,idtype='acc')
record = Entrez.read(handle)
handle.close()

print(record.keys())
#look at the first 10 ids
print(record['IdList'][:10])

dict_keys(['Count', 'RetMax', 'RetStart', 'IdList', 'TranslationSet', 'TranslationStack', 'QueryTranslation'])
['OQ672773.1', 'OR266943.1', 'OQ555605.1', 'OQ555604.1', 'OQ981989.1', 'NC_063680.1', 'NC_064145.1', 'NC_063681.1', 'NC_066405.1', 'OP465225.1']
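A search can match more records than `retmax` allows in one response. The `Count`, `RetMax` and `RetStart` keys in the record above suggest one way to page through them: repeat the search with increasing `retstart` values. The helper below (`page_offsets` is a hypothetical name, and the numbers are illustrative) only computes the offsets, so it runs offline. Each offset would then be passed as `retstart=` to `Entrez.esearch`, and `int(record['Count'])` would supply the total.

```python
def page_offsets(total_count, page_size):
    """Return the retstart offsets needed to cover total_count results."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return list(range(0, total_count, page_size))


# e.g. 2500 matching records fetched 1000 at a time (illustrative numbers)
offsets = page_offsets(2500, 1000)
print(offsets)
```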


In [5]:
#lets fetch one
accession = record['IdList'][500]

handle = Entrez.efetch(db="nucleotide", id=accession, retmode="xml")
entry = Entrez.read(handle)
handle.close()

#print the whole entry (this is a GenBank record in XML format)
print(entry)

[{'GBSeq_locus': 'ICTD01000465', 'GBSeq_length': '330', 'GBSeq_strandedness': 'single', 'GBSeq_moltype': 'mRNA', 'GBSeq_topology': 'linear', 'GBSeq_division': 'TSA', 'GBSeq_update-date': '23-MAR-2023', 'GBSeq_create-date': '23-MAR-2023', 'GBSeq_definition': 'TSA: Cypripedium macranthos var. rebunense mRNA, evgLocus_103335.p2, mRNA sequence', 'GBSeq_primary-accession': 'ICTD01000465', 'GBSeq_accession-version': 'ICTD01000465.1', 'GBSeq_other-seqids': ['dbj|ICTD01000465.1|', 'gnl|TSA:ICTD01|evgLocus_103335.p2', 'gi|2472521496'], 'GBSeq_project': 'PRJDB15443', 'GBSeq_keywords': ['TSA', 'Transcriptome Shotgun Assembly'], 'GBSeq_source': 'Cypripedium macranthos var. rebunense', 'GBSeq_organism': 'Cypripedium macranthos var. rebunense', 'GBSeq_taxonomy': 'Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliopsida; Liliopsida; Asparagales; Orchidaceae; Cypripedioideae; Cypripedium', 'GBSeq_references': [{'GBReference_reference': '1', 'GBReference_position'

In [6]:
print(entry[0]['GBSeq_definition'])
print(entry[0]['GBSeq_organism'])

TSA: Cypripedium macranthos var. rebunense mRNA, evgLocus_103335.p2, mRNA sequence
Cypripedium macranthos var. rebunense


We can retrieve the record in a more user-friendly format, the GenBank flat-file text format.

In [7]:
handle = Entrez.efetch(db="nuccore", id=accession, rettype="gb", retmode="text")
print(handle.read())
handle.close()

LOCUS       ICTD01000465             330 bp    mRNA    linear   TSA 23-MAR-2023
DEFINITION  TSA: Cypripedium macranthos var. rebunense mRNA,
            evgLocus_103335.p2, mRNA sequence.
ACCESSION   ICTD01000465
VERSION     ICTD01000465.1
DBLINK      BioProject: PRJDB15443
            BioSample: SAMD00586444, SAMD00586445, SAMD00586446
            Sequence Read Archive: DRR451067, DRR451068, DRR451069
KEYWORDS    TSA; Transcriptome Shotgun Assembly.
SOURCE      Cypripedium macranthos var. rebunense
  ORGANISM  Cypripedium macranthos var. rebunense
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliopsida; Liliopsida; Asparagales; Orchidaceae;
            Cypripedioideae; Cypripedium.
REFERENCE   1
  AUTHORS   Kambara,K., Shimura,H. and Fujino,K.
  TITLE     Construction of a de novo assembly pipeline using multiple
            transcriptome data sets from Cypripedium macranthos (Orchidaceae)
  JOURNAL   Unpublished
REFERENC

We can use the Bio.SeqIO module, which parses sequence records, to read the fetched entry and obtain a Bio.Seq.Seq sequence object.

In [9]:
from Bio import SeqIO
handle = Entrez.efetch(db="nuccore", id=accession, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "gb")

for entry in records:
    sequence = entry.seq
    print(sequence)
    print(type(sequence))
    
print('complement',sequence.complement())
print('reverse_complement',sequence.reverse_complement())

TGCTTGTTCGCTCCAGGCTGGTCCTTAACCCACACTATTGTCACAAATCTGTCTATCAAGTTGTCCTGGACTACGGTCGCAAAAATTAACCACCAGCCTTGGGCGAGGGTCCATATCCTCCGTAATGAGCTTGTAATGGTGGCCGGCTTGCAGCCATTGGAGAATAATAACGACTCGTTTGCTGTCTGCGAACAAACAAACGACGCATACGCCATGGCCGCTGCCGCCAATGAGATAGCCAACAGACCAGTGCCACCCCCCGGTATTCTACCCTTTATTACCCCCAGACATCGACACAACCATGGCTGGCACTGGTGGGCATGGCCGCTA
<class 'Bio.Seq.Seq'>
complement ACGAACAAGCGAGGTCCGACCAGGAATTGGGTGTGATAACAGTGTTTAGACAGATAGTTCAACAGGACCTGATGCCAGCGTTTTTAATTGGTGGTCGGAACCCGCTCCCAGGTATAGGAGGCATTACTCGAACATTACCACCGGCCGAACGTCGGTAACCTCTTATTATTGCTGAGCAAACGACAGACGCTTGTTTGTTTGCTGCGTATGCGGTACCGGCGACGGCGGTTACTCTATCGGTTGTCTGGTCACGGTGGGGGGCCATAAGATGGGAAATAATGGGGGTCTGTAGCTGTGTTGGTACCGACCGTGACCACCCGTACCGGCGAT
reverse_complement TAGCGGCCATGCCCACCAGTGCCAGCCATGGTTGTGTCGATGTCTGGGGGTAATAAAGGGTAGAATACCGGGGGGTGGCACTGGTCTGTTGGCTATCTCATTGGCGGCAGCGGCCATGGCGTATGCGTCGTTTGTTTGTTCGCAGACAGCAAACGAGTCGTTATTATTCTCCAATGGCTGCAAGCCGGCCACCATTACAAGCTCATTACGGAGGATATGGACCCTCGCCCAAGGCTGGTGGTTAATTTTTGCGACCGTAGTCCAGGACAACTTGATAGACAGATTT
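To see what `complement()` and `reverse_complement()` are doing, here is a plain-Python sketch of the same two operations. It is not how Biopython implements them, and it only handles the four unambiguous DNA bases; the function names are our own.

```python
# Watson-Crick pairing for unambiguous DNA bases (upper and lower case)
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap each base for its pairing partner, keeping the order."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Complement the sequence and read it in the opposite direction."""
    return complement(dna)[::-1]


fragment = "TGCTTGTTCG"  # first 10 bases of the sequence printed above
print(complement(fragment))
print(reverse_complement(fragment))
```

The reverse complement is the sequence of the opposite strand read 5' to 3', which is why it is usually the more useful of the two.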

The real power of this system comes when you want to search and work with a lot of sequences.

Let's say we want to search for gene entries for Pax6.

In [14]:
# search the nucleotide database for records annotated with the Pax6 gene

from Bio import Entrez

Entrez.email = EMAIL

# we're going to limit this to 100 sequences; because idtype='acc' is not set here, the IDs returned are numeric GI numbers rather than accessions

# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='nucleotide',term="Pax6[Gene]",retmax=100)
record = Entrez.read(handle)
handle.close()

#look at the first 10 ids
print(record['IdList'][:10])

['2589542570', '2589542568', '568815587', '2589241291', '2589241289', '2589241287', '2589241285', '2582873531', '2194973393', '2310183265']


In [10]:
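# now let's fetch them all; first take the ID list returned by the most recent search

id_list = record['IdList']
print(id_list)

# then turn it into a comma-separated string

id_str = ",".join(id_list)

handle = Entrez.efetch(db="nucleotide", id=id_str, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "gb")

# use a separate loop variable so the search result held in `record` is not overwritten
for seq_record in records:
    # the organism name lives in the annotations; description holds the DEFINITION line
    organism = seq_record.annotations.get("organism", "unknown")
    print("%s, length %i, from organism %s" % (seq_record.name, len(seq_record), organism))
handle.close()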

['OQ672773.1', 'OR266943.1', 'OQ555605.1', 'OQ555604.1', 'OQ981989.1', 'NC_063680.1', 'NC_064145.1', 'NC_063681.1', 'NC_066405.1', 'OP465225.1', 'OP465224.1', 'OP465223.1', 'OP465222.1', 'OP465221.1', 'OP465220.1', 'OP465219.1', 'OP465218.1', 'OP465217.1', 'OP465216.1', 'OP465215.1', 'OP465214.1', 'OP465213.1', 'OP465212.1', 'OP465211.1', 'OP465210.1', 'OP465209.1', 'OP465208.1', 'OP465207.1', 'OP465206.1', 'OP465205.1', 'OP465204.1', 'OP465203.1', 'OP465202.1', 'OP465201.1', 'OP465200.1', 'ICTD00000000.1', 'ICTD01000001.1', 'ICTD01000002.1', 'ICTD01000003.1', 'ICTD01000004.1', 'ICTD01000005.1', 'ICTD01000006.1', 'ICTD01000007.1', 'ICTD01000008.1', 'ICTD01000009.1', 'ICTD01000010.1', 'ICTD01000011.1', 'ICTD01000012.1', 'ICTD01000013.1', 'ICTD01000014.1', 'ICTD01000015.1', 'ICTD01000016.1', 'ICTD01000017.1', 'ICTD01000018.1', 'ICTD01000019.1', 'ICTD01000020.1', 'ICTD01000021.1', 'ICTD01000022.1', 'ICTD01000023.1', 'ICTD01000024.1', 'ICTD01000025.1', 'ICTD01000026.1', 'ICTD01000027.1', '
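Joining every ID into one string works for modest lists, but very long lists are usually easier to handle in smaller requests. The sketch below splits an ID list into comma-separated batches; `id_batches` is a hypothetical helper and the default batch size of 200 is an arbitrary assumption, not a documented NCBI limit. Each yielded string could be passed as `id=` to `Entrez.efetch`.

```python
def id_batches(ids, batch_size=200):
    """Yield comma-separated ID strings, each holding at most batch_size IDs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])


# accessions taken from the search output shown earlier
example_ids = ["OQ672773.1", "OR266943.1", "OQ555605.1", "OQ555604.1", "OQ981989.1"]
batches = list(id_batches(example_ids, batch_size=2))
print(batches)
```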

Now we're going to pull a full gene entry for human Pax6 from GenBank and look at it. We can also do this online by clicking [here](https://www.ncbi.nlm.nih.gov/nuccore/208879460).

In [17]:
from Bio import Entrez

Entrez.email = EMAIL
handle = Entrez.efetch(db="nucleotide", id="208879460", rettype="gb", retmode="text")
gb_entry = handle.read()
handle.close()

#NB this is just a straight string at this point (as we just read() it straight into a string object)
print(gb_entry)

LOCUS       NG_008679              40170 bp    DNA     linear   PRI 14-AUG-2023
DEFINITION  Homo sapiens paired box 6 (PAX6), RefSeqGene (LRG_720) on
            chromosome 11.
ACCESSION   NG_008679
VERSION     NG_008679.1
KEYWORDS    RefSeq; RefSeqGene.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 40170)
  AUTHORS   Zhang Y, Yamada Y, Fan M, Bangaru SD, Lin B and Yang J.
  TITLE     The beta subunit of voltage-gated Ca2+ channels interacts with and
            regulates the activity of a novel isoform of Pax6
  JOURNAL   J Biol Chem 285 (4), 2527-2536 (2010)
   PUBMED   19917615
REFERENCE   2  (bases 1 to 40170)
  AUTHORS   Osumi N, Shinohara H, Numayama-Tsuruta K and Maekawa M.
  TITLE     Concise review: Pax6 transcription factor contributes to both
     

Now we're going to extract the coding sequence from this entry and translate it into protein

In [31]:
from Bio import SeqIO
from Bio import Entrez

Entrez.email = EMAIL
handle = Entrez.efetch(db="nucleotide", id="208879460", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

# confirm that SeqIO.read returned a single SeqRecord object
print(type(record))


if record.features:
    for feature in record.features:
        #this tag identifies the CoDingSequences from the record
        if feature.type == "CDS":
            print(feature.qualifiers["protein_id"])
            print(feature.location,'\n')
            current_sequence = feature.location.extract(record).seq
            print('Nucleotide Sequence')
            print(current_sequence,'\n')
            #translate the current sequence into protein
            print('Protein Sequence')
            print(current_sequence.translate(),'\n')


<class 'Bio.SeqRecord.SeqRecord'>
['NP_000271.1']
join{[16550:16560](+), [20127:20258](+), [21185:21401](+), [22105:22271](+), [28173:28332](+), [28847:28930](+), [29159:29310](+), [29408:29524](+), [32101:32252](+), [32942:33028](+)} 

Nucleotide Sequence
ATGCAGAACAGTCACAGCGGAGTGAATCAGCTCGGTGGTGTCTTTGTCAACGGGCGGCCACTGCCGGACTCCACCCGGCAGAAGATTGTAGAGCTAGCTCACAGCGGGGCCCGGCCGTGCGACATTTCCCGAATTCTGCAGGTGTCCAACGGATGTGTGAGTAAAATTCTGGGCAGGTATTACGAGACTGGCTCCATCAGACCCAGGGCAATCGGTGGTAGTAAACCGAGAGTAGCGACTCCAGAAGTTGTAAGCAAAATAGCCCAGTATAAGCGGGAGTGCCCGTCCATCTTTGCTTGGGAAATCCGAGACAGATTACTGTCCGAGGGGGTCTGTACCAACGATAACATACCAAGCGTGTCATCAATAAACAGAGTTCTTCGCAACCTGGCTAGCGAAAAGCAACAGATGGGCGCAGACGGCATGTATGATAAACTAAGGATGTTGAACGGGCAGACCGGAAGCTGGGGCACCCGCCCTGGTTGGTATCCGGGGACTTCGGTGCCAGGGCAACCTACGCAAGATGGCTGCCAGCAACAGGAAGGAGGGGGAGAGAATACCAACTCCATCAGTTCCAACGGAGAAGATTCAGATGAGGCTCAAATGCGACTTCAGCTGAAGCGGAAGCTGCAAAGAAATAGAACATCCTTTACCCAAGAGCAAATTGAGGCCCTGGAGAAAGAGTTTGAGAGAACCCATTATCCAGATGTGTTTGCCCGAGAAAGACTAGCAGCCAAAATAGA
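The `join{...}` location printed above lists the exon intervals that make up the coding sequence, using 0-based, end-exclusive coordinates. The plain-Python sketch below illustrates the two steps, stitching the intervals together and translating codons, on a made-up toy sequence. It is a simplified stand-in for `feature.location.extract()` and `Seq.translate()`: it only handles forward-strand, upper-case DNA and the standard genetic code, and the function names are our own.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def extract_join(sequence, parts):
    """Concatenate the 0-based, end-exclusive (start, end) slices in order."""
    return "".join(sequence[start:end] for start, end in parts)


def translate(cds):
    """Translate complete codons; '*' marks a stop codon, 'X' an unknown codon."""
    usable_length = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable_length, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Toy "gene": two exons separated by a GGGG intron (made-up sequence)
toy_genome = "CCATGAAAGGGGTTTTAACC"
toy_cds = extract_join(toy_genome, [(2, 8), (12, 18)])
print(toy_cds)
print(translate(toy_cds))

# First 21 bases of the PAX6 coding sequence printed above
print(translate("ATGCAGAACAGTCACAGCGGA"))
```

Removing the introns before translating matters: translating the genomic sequence directly would read intron bases as codons and shift the reading frame.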



In [20]:
from Bio import Entrez

Entrez.email = EMAIL

# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='gene',term="Nrg1[Gene] AND human",retmax=100)
record = Entrez.read(handle)
handle.close()

#look at the first 10 ids
print(record['IdList'][:10])

# let's retrieve the first hit in XML format and use the Entrez parser to read it
handle = Entrez.efetch(db="gene", id=record['IdList'][:1], retmode="xml")
# this returns a list of records, each of which behaves like a Python dict
records = Entrez.read(handle)
handle.close()

# look at the first record by iterating through the keys of the dict
# NB there's a lot of information in here
for feature in list(records[0]):
    print(feature,':',records[0][feature])

['132512549', '132481126', '3084', '211323', '112400', '796461', '373906', '281361', '696275', '106040455']
Entrezgene_track-info : {'Gene-track': {'Gene-track_geneid': '132512549', 'Gene-track_status': StringElement('0', attributes={'value': 'live'}), 'Gene-track_create-date': {'Date': {'Date_std': {'Date-std': {'Date-std_year': '2023', 'Date-std_month': '10', 'Date-std_day': '9'}}}}, 'Gene-track_update-date': {'Date': {'Date_std': {'Date-std': {'Date-std_year': '2023', 'Date-std_month': '10', 'Date-std_day': '10'}}}}}}
Entrezgene_type : 6
Entrezgene_source : {'BioSource': {'BioSource_genome': StringElement('1', attributes={'value': 'genomic'}), 'BioSource_origin': StringElement('1', attributes={'value': 'natural'}), 'BioSource_org': {'Org-ref': {'Org-ref_taxname': 'Lagenorhynchus albirostris', 'Org-ref_common': 'white-beaked dolphin', 'Org-ref_db': [{'Dbtag_db': 'taxon', 'Dbtag_tag': {'Object-id': {'Object-id_id': '27610'}}}], 'Org-ref_orgname': {'OrgName': {'OrgName_name': {'OrgName
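The first record returned above is from a white-beaked dolphin, which suggests that the bare word `human` does not restrict the search to human genes. One possible refinement is to tag each part of the query with a search field. The sketch below assumes the `[Gene Name]` and `[Organism]` field tags of the Gene database, plus the `alive[prop]` filter used in Challenge 1 below; `build_gene_term` is a hypothetical helper that only assembles the string, which could then be passed as `term=` to `Entrez.esearch`.

```python
def build_gene_term(gene_symbol, organism, live_only=True):
    """Assemble an Entrez Gene query restricted by gene name and organism."""
    parts = [f"{gene_symbol}[Gene Name]", f'"{organism}"[Organism]']
    if live_only:
        # assumption: alive[prop] keeps only current (non-discontinued) gene records
        parts.append("alive[prop]")
    return " AND ".join(parts)


term = build_gene_term("Nrg1", "Homo sapiens")
print(term)
```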

### Challenge 1 - Finding Genes with NCBI-Entrez
Using either the Entrez website and/or what you've learned about BioPython's abilities to query NCBI services, retrieve entries for a gene called Nrg1.
- How many different gene entries are there for this gene in NCBI databases?
- What is the full name of this gene?
- What kind of protein does this gene encode?

In [33]:
handle = Entrez.egquery(term="Nrg1 AND alive[prop]")
record = Entrez.read(handle)
handle.close()

for row in record['eGQueryResult']:
    if row['DbName'] == 'gene':
        print(f"There are {row['Count']} entries for the gene Nrg1 in NCBI gene database.")


# once we know the accession ID of this gene, we could fetch its record like this (template left disabled inside a string)
'''
accession_id = ''

handle = Entrez.efetch(db="nucleotide", id=accession_id, rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")

print("Name: ",record.description)

if record.features:
    for feature in record.features:
        if feature.type == "CDS":
            print("Protein this gene encodes: ",feature.qualifiers["protein_id"])
'''




There are 1211 entries for the gene Nrg1 in NCBI gene database.


'\naccession_id = \'\'\n\nhandle = Entrez.efetch(db="nucleotide", id=accession_id, rettype="gb", retmode="text")\nrecord = SeqIO.read(handle, "genbank")\n\nprint("Name: ",record.description)\n\nif record.features:\n    for feature in record.features:\n        if feature.type == "CDS":\n            print("Protein this gene encodes: ",feature.qualifiers["protein_id"])\n'

### Challenge 2 - Human and Mouse Nrg1 Genes
Using either the Entrez website and/or what you've learned about BioPython's abilities to query NCBI services, retrieve full-length human and mouse (RefSeq) gene entries for Nrg1.
- What are the accession numbers / IDs of the GenBank records?
- How long are the human NRG1 and mouse Nrg1 proteins?
- How many nucleotide sequence differences are there between their longest CDSs?
- How many protein sequence differences are there between their longest proteins?
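
For the last two questions, one simple starting point is a position-by-position comparison. The sketch below is a rough helper under stated assumptions, not a full answer: `count_differences` is our own name, the example sequences are toy strings, and no alignment is performed. If the two sequences differ by insertions or deletions, they would need to be aligned first (for example with a pairwise aligner) before counting mismatches this way.

```python
def count_differences(seq_a, seq_b):
    """Count mismatched positions, plus any unmatched tail on the longer sequence."""
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches + abs(len(seq_a) - len(seq_b))


# toy sequences that differ at a single position
print(count_differences("ATGCAGAAC", "ATGCAAAAC"))
```

The same function works on protein sequences, since it only compares characters.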