# Introduction to Biopython:
## Searching for and Retrieving Sequence Records from Entrez

The following code searches the Entrez databases and retrieves individual sequence records. If that were all I wanted to do, I would just search the [NCBI website](https://www.ncbi.nlm.nih.gov/), but for more advanced applications, retrieving records with Python may be more efficient.

The code below is based heavily on the [Biopython Tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html). Please see that document for further information.

### Import packages

I will be using the SeqIO and Entrez modules from Biopython. When accessing Entrez, you need to provide an email address.

In [14]:
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "bmarieg@gmail.com"
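
Hard-coding the address works in a notebook, but for a script that gets shared it may be tidier to read it from the environment. The sketch below assumes an environment variable called `ENTREZ_EMAIL`; that name and the helper function are made up for this example and are not defined by Biopython or NCBI.

```python
import os


def entrez_email_from_env(env=None, var_name="ENTREZ_EMAIL"):
    """Return the contact address stored in an environment variable.

    ENTREZ_EMAIL is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    email = env.get(var_name, "").strip()
    # A very loose sanity check, not real email validation.
    if "@" not in email:
        raise ValueError(f"Set {var_name} to an email address before querying Entrez")
    return email


# Usage (commented out so the block runs without Biopython or the variable set):
# from Bio import Entrez
# Entrez.email = entrez_email_from_env()
```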

### Get info on all databases

You can use Entrez.einfo() to get information on all of the databases in Entrez.

Note: In my variable names, HTTPRO stands for "HTTP response object".

In [18]:
# Get a list of all the databases in Entrez

Entrez_info_HTTPRO = Entrez.einfo()
Entrez_info = Entrez.read(Entrez_info_HTTPRO)
Entrez_info_HTTPRO.close()
print(Entrez_info.keys())

dict_keys(['DbList'])


As you can see, the variable Entrez_info is a dictionary with a single key: DbList.

The value associated with this key is a list of the Entrez databases.

In [19]:
print(Entrez_info["DbList"])

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']
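
Since DbList behaves like an ordinary list of strings, it can be used to catch a mistyped database name before sending a query. This is a minimal sketch; `require_database` is a hypothetical helper, and the sample list is only a few names copied from the output above.

```python
def require_database(name, db_list):
    """Return name unchanged, or raise ValueError if it is not a known database."""
    if name not in db_list:
        raise ValueError(f"Unknown Entrez database: {name!r}")
    return name


# A few names copied from the DbList output; the real list is longer.
sample_db_list = ["pubmed", "protein", "nuccore", "nucleotide", "gene"]
print(require_database("protein", sample_db_list))
```

With a live result, you would pass `Entrez_info["DbList"]` instead of the sample list.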


### Get info on a specific database

You can get information on any of the databases above by calling Entrez.einfo() and specifying the database name:

In [21]:
# Get information on a specific database

protdb_info_HTTPRO = Entrez.einfo(db="protein")
protdb_info = Entrez.read(protdb_info_HTTPRO)
protdb_info_HTTPRO.close()
print(protdb_info.keys())

dict_keys(['DbInfo'])


I chose to access the protein database. As you can see, the variable protdb_info is a dictionary with a single key: "DbInfo". The value associated with this key is another dictionary with keys and values that describe the protein database.

In [22]:
print(protdb_info["DbInfo"].keys())

dict_keys(['DbName', 'MenuName', 'Description', 'DbBuild', 'Count', 'LastUpdate', 'FieldList', 'LinkList'])


You can access the values associated with any of these keys:

In [23]:
print(protdb_info["DbInfo"]["LastUpdate"])

2022/10/08 01:28
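
The open, read, close sequence appears in every cell that talks to Entrez, so it may be worth wrapping. The sketch below is one possible way to do that; `read_and_close` is a hypothetical helper, and it is demonstrated on an in-memory text handle so that it runs without network access.

```python
import io


def read_and_close(handle, parse):
    """Parse an open handle and close it, even if parsing raises an error."""
    try:
        return parse(handle)
    finally:
        handle.close()


# Stand-in for an HTTP response object, so the example runs offline.
demo_handle = io.StringIO("DbList")
demo_result = read_and_close(demo_handle, lambda h: h.read())
print(demo_result, demo_handle.closed)

# With Biopython, the same helper would be used like this:
# Entrez_info = read_and_close(Entrez.einfo(), Entrez.read)
```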


The FieldList key is associated with a great big list full of dictionaries; each dictionary describes a search field. Instead of printing the list, which is overwhelmingly large and hard to read, you can print a subset of the keys in each dictionary:

In [24]:
for field in protdb_info["DbInfo"]["FieldList"]:
    print("%(Name)s, %(FullName)s, %(Description)s" % field)

ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to each sequence
FILT, Filter, Limits the records
WORD, Text Word, Free text associated with record
TITL, Title, Words in definition line
KYWD, Keyword, Nonstandardized terms provided by submitter
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
VOL, Volume, Volume number of publication
ISS, Issue, Issue number of publication
PAGE, Page Number, Page number(s) of publication
ORGN, Organism, Scientific and common names of organism, and all higher levels of taxonomy
ACCN, Accession, Accession number of sequence
PACC, Primary Accession, Does not include retired secondary accessions
GENE, Gene Name, Name of gene associated with sequence
PROT, Protein Name, Name of protein associated with sequence
ECNO, EC/RN Number, EC number for enzyme or CAS registry number
PDAT, Publication Date, Date sequence added to GenBank
MDAT, Modification Date, Date of last update
SUBS, S
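
Because each entry in FieldList is a dictionary, you can also filter the list instead of reading all of it. The sketch below looks for a keyword in each field's FullName and Description; `find_fields` is a hypothetical helper, and the sample entries are copied from the output above.

```python
def find_fields(field_list, keyword):
    """Return the short Name of every field whose FullName or Description mentions keyword."""
    keyword = keyword.lower()
    return [
        field["Name"]
        for field in field_list
        if keyword in field["FullName"].lower()
        or keyword in field["Description"].lower()
    ]


# Four entries copied from the FieldList output; the real list is much longer.
sample_fields = [
    {"Name": "ORGN", "FullName": "Organism",
     "Description": "Scientific and common names of organism, and all higher levels of taxonomy"},
    {"Name": "ACCN", "FullName": "Accession",
     "Description": "Accession number of sequence"},
    {"Name": "GENE", "FullName": "Gene Name",
     "Description": "Name of gene associated with sequence"},
    {"Name": "PROT", "FullName": "Protein Name",
     "Description": "Name of protein associated with sequence"},
]
print(find_fields(sample_fields, "protein"))
```

With a live result, you would pass `protdb_info["DbInfo"]["FieldList"]` instead of the sample list.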

### Search a specific Entrez database

I am going to search the protein database for human protein kinase R (PKR), which has the gene symbol EIF2AK2. (Protein kinase R is involved in the cell's response to viral infection, among other functions.) You can use the names of the search fields listed above to limit your search. You can also limit the number of results that are returned by assigning a value to retmax.

If you print the whole search result, you get a long-ish dictionary:

In [29]:
# Search a database
# Note: PKR = EIF2AK2

PKR_search_HTTPRO = Entrez.esearch(db="protein", term="human[Orgn] EIF2AK2[Gene]", retmax=20)
PKR_search = Entrez.read(PKR_search_HTTPRO)
PKR_search_HTTPRO.close()
print(PKR_search)

{'Count': '18', 'RetMax': '18', 'RetStart': '0', 'IdList': ['208431827', '125527', '208431829', '4506103', '2217329612', '2217329610', '767914900', '119620814', '119620813', '119620812', '119620811', '219520292', '116283356', '75517635', '62739920', '34784744', '33871554', '27371006'], 'TranslationSet': [{'From': 'human[Orgn]', 'To': '"Homo sapiens"[Organism]'}], 'TranslationStack': [{'Term': '"Homo sapiens"[Organism]', 'Field': 'Organism', 'Count': '1771931', 'Explode': 'Y'}, {'Term': 'EIF2AK2[Gene]', 'Field': 'Gene', 'Count': '1118', 'Explode': 'N'}, 'AND'], 'QueryTranslation': '"Homo sapiens"[Organism] AND EIF2AK2[Gene]'}
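
The QueryTranslation entry shows that Entrez combined the two space-separated clauses with AND. If you build search terms in code, it may be clearer to write the AND explicitly. This is a minimal sketch; `build_term` is a hypothetical helper, not part of Biopython.

```python
def build_term(**field_values):
    """Build an Entrez search term such as 'human[Orgn] AND EIF2AK2[Gene]'."""
    clauses = []
    for field, value in field_values.items():
        # Quote values that contain spaces so they are searched as a phrase.
        if " " in value:
            value = f'"{value}"'
        clauses.append(f"{value}[{field}]")
    return " AND ".join(clauses)


print(build_term(Orgn="human", Gene="EIF2AK2"))
# prints: human[Orgn] AND EIF2AK2[Gene]
```

The resulting string can be passed as the `term` argument of `Entrez.esearch()`.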


Alternatively, you can print a list of the search keys ...

In [31]:
print(PKR_search.keys())

dict_keys(['Count', 'RetMax', 'RetStart', 'IdList', 'TranslationSet', 'TranslationStack', 'QueryTranslation'])


... and then access the results by key.

In [33]:
print("Count: %s" %PKR_search["Count"])
print("IdList: %s" %PKR_search["IdList"])

Count: 18
IdList: ['208431827', '125527', '208431829', '4506103', '2217329612', '2217329610', '767914900', '119620814', '119620813', '119620812', '119620811', '219520292', '116283356', '75517635', '62739920', '34784744', '33871554', '27371006']
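
Count is the total number of matches, while IdList holds only the IDs that were actually returned, so comparing the two tells you whether retmax cut the results short. Note that Count comes back as a string. In this sketch, `is_truncated` is a hypothetical helper, the full ID list is copied from the output above, and the shortened copy is made up to show the other case.

```python
def is_truncated(search_result):
    """Return True when the search matched more records than IdList contains."""
    return int(search_result["Count"]) > len(search_result["IdList"])


# Values copied from the search result above.
full_result = {
    "Count": "18",
    "IdList": ["208431827", "125527", "208431829", "4506103", "2217329612",
               "2217329610", "767914900", "119620814", "119620813", "119620812",
               "119620811", "219520292", "116283356", "75517635", "62739920",
               "34784744", "33871554", "27371006"],
}
# Made-up case: the same search as it would look with only five IDs returned.
first_page = {"Count": "18", "IdList": full_result["IdList"][:5]}

print(is_truncated(full_result), is_truncated(first_page))
```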


### Retrieve information on search results

I need more information before I can determine whether the IDs above match the protein I am looking for. You can retrieve a summary of each individual search result:

In [34]:
# Retrieve a summary of a search result

PKR_prot_summ_HTTPRO = Entrez.esummary(db="protein", id="208431827")
PKR_prot_summ = Entrez.read(PKR_prot_summ_HTTPRO)
PKR_prot_summ_HTTPRO.close()
print(PKR_prot_summ[0]["Title"])
print(PKR_prot_summ[0]["Status"])

interferon-induced, double-stranded RNA-activated protein kinase isoform a [Homo sapiens]
live


Alternatively, you can retrieve information on every search result. This may help you select the canonical isoform of a particular protein, for example.

In [36]:
# Retrieve a summary of all of the search results

for ID in PKR_search["IdList"]:
    thissummary_HTTPRO = Entrez.esummary(db="protein", id=ID)    
    thissummary = Entrez.read(thissummary_HTTPRO)
    thissummary_HTTPRO.close()    
    print("ID: %s" %ID)
    print("Caption: %s" %thissummary[0]["Caption"])
    print("Summary: %s" %thissummary[0]["Title"])
    print("Length: %s" %int(thissummary[0]["Length"]))
    print("Status: %s \n" %thissummary[0]["Status"])

ID: 208431827
Caption: NP_001129123
Summary: interferon-induced, double-stranded RNA-activated protein kinase isoform a [Homo sapiens]
Length: 551
Status: live 

ID: 125527
Caption: P19525
Summary: RecName: Full=Interferon-induced, double-stranded RNA-activated protein kinase; AltName: Full=Eukaryotic translation initiation factor 2-alpha kinase 2; Short=eIF-2A protein kinase 2; AltName: Full=Interferon-inducible RNA-dependent protein kinase; AltName: Full=P1/eIF-2A protein kinase; AltName: Full=Protein kinase RNA-activated; Short=PKR; Short=Protein kinase R; AltName: Full=Tyrosine-protein kinase EIF2AK2; AltName: Full=p68 kinase
Length: 551
Status: live 

ID: 208431829
Caption: NP_001129124
Summary: interferon-induced, double-stranded RNA-activated protein kinase isoform b [Homo sapiens]
Length: 510
Status: live 

ID: 4506103
Caption: NP_002750
Summary: interferon-induced, double-stranded RNA-activated protein kinase isoform a [Homo sapiens]
Length: 551
Status: live 

ID: 2217329612
C
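
Once the summaries are collected, a first-pass filter can narrow down the candidates. The sketch below keeps entries whose Caption starts with `NP_` (the RefSeq protein accession prefix) and lists the longest first. This is only one possible heuristic, not a rule for choosing the canonical isoform; `refseq_candidates` is a hypothetical helper, and the sample data is copied from the output above.

```python
def refseq_candidates(summaries):
    """Keep summaries with an NP_ accession in Caption, sorted longest first."""
    hits = [s for s in summaries if str(s["Caption"]).startswith("NP_")]
    # sorted() is stable, so entries of equal length keep their original order.
    return sorted(hits, key=lambda s: int(s["Length"]), reverse=True)


# Caption and Length values copied from the first four summaries above.
sample_summaries = [
    {"Id": "208431827", "Caption": "NP_001129123", "Length": 551},
    {"Id": "125527", "Caption": "P19525", "Length": 551},
    {"Id": "208431829", "Caption": "NP_001129124", "Length": 510},
    {"Id": "4506103", "Caption": "NP_002750", "Length": 551},
]
for summary in refseq_candidates(sample_summaries):
    print(summary["Caption"], summary["Length"])
```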

### Retrieve a protein sequence

Once you have identified which protein sequence you need, you can retrieve it in FASTA format:

In [38]:
# Fetch a fasta or genbank record from Entrez and turn it into a SeqRecord
# SeqIO.read() is for files containing a single sequence

PKR_prot_fa_HTTPRO = Entrez.efetch(db="protein", id="208431827", rettype="fasta", retmode="text")
PKR_prot_fa = SeqIO.read(PKR_prot_fa_HTTPRO, "fasta")
PKR_prot_fa_HTTPRO.close()
print(PKR_prot_fa)

ID: NP_001129123.1
Name: NP_001129123.1
Description: NP_001129123.1 interferon-induced, double-stranded RNA-activated protein kinase isoform a [Homo sapiens]
Number of features: 0
Seq('MAGDLSAGFFMEELNTYRQKQGVVLKYQELPNSGPPHDRRFTFQVIIDGREFPE...HTC')


Or in GenBank format:

In [39]:
PKR_prot_gb_HTTPRO = Entrez.efetch(db="protein", id="208431827", rettype="gb", retmode="text")
PKR_prot_gb = SeqIO.read(PKR_prot_gb_HTTPRO, "genbank")
PKR_prot_gb_HTTPRO.close()
print(PKR_prot_gb)

ID: NP_001129123.1
Name: NP_001129123
Description: interferon-induced, double-stranded RNA-activated protein kinase isoform a [Homo sapiens]
Number of features: 32
/topology=linear
/data_file_division=PRI
/date=16-SEP-2022
/accessions=['NP_001129123']
/sequence_version=1
/db_source=REFSEQ: accession NM_001135651.3
/keywords=['RefSeq', 'MANE Select']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='YTHDF3 Facilitates eIF2AK2 and eIF3A Recruitment on mRNAs to Regulate Translational Processes in Oxaliplatin-Resistant Colorectal Cancer', ...), Reference(title='Protein kinase R is an innate immune sensor of proteotoxic stress via accumulation of cytoplasmic IL-24', ...), Reference(title='Protein kinase RNA-activated controls mitotic progression and determines paclita

The FASTA and GenBank records for this protein are now stored as SeqRecord objects. If you don't want to print a whole record, you can access its elements individually:

In [41]:
print("Fasta sequence: %s \n" %PKR_prot_fa.seq)
print("Genbank sequence: %s" %PKR_prot_gb.seq)

Fasta sequence: MAGDLSAGFFMEELNTYRQKQGVVLKYQELPNSGPPHDRRFTFQVIIDGREFPEGEGRSKKEAKNAAAKLAVEILNKEKKAVSPLLLTTTNSSEGLSMGNYIGLINRIAQKKRLTVNYEQCASGVHGPEGFHYKCKMGQKEYSIGTGSTKQEAKQLAAKLAYLQILSEETSVKSDYLSSGSFATTCESQSNSLVTSTLASESSSEGDFSADTSEINSNSDSLNSSSLLMNGLRNNQRKAKRSLAPRFDLPDMKETKYTVDKRFGMDFKEIELIGSGGFGQVFKAKHRIDGKTYVIKRVKYNNEKAEREVKALAKLDHVNIVHYNGCWDGFDYDPETSDDSLESSDYDPENSKNSSRSKTKCLFIQMEFCDKGTLEQWIEKRRGEKLDKVLALELFEQITKGVDYIHSKKLIHRDLKPSNIFLVDTKQVKIGDFGLVTSLKNDGKRTRSKGTLRYMSPEQISSQDYGKEVDLYALGLILAELLHVCDTAFETSKFFTDLRDGIISDIFDKKEKTLLQKLLSKKPEDRPNTSEILRTLTVWKKSPEKNERHTC 

Genbank sequence: MAGDLSAGFFMEELNTYRQKQGVVLKYQELPNSGPPHDRRFTFQVIIDGREFPEGEGRSKKEAKNAAAKLAVEILNKEKKAVSPLLLTTTNSSEGLSMGNYIGLINRIAQKKRLTVNYEQCASGVHGPEGFHYKCKMGQKEYSIGTGSTKQEAKQLAAKLAYLQILSEETSVKSDYLSSGSFATTCESQSNSLVTSTLASESSSEGDFSADTSEINSNSDSLNSSSLLMNGLRNNQRKAKRSLAPRFDLPDMKETKYTVDKRFGMDFKEIELIGSGGFGQVFKAKHRIDGKTYVIKRVKYNNEKAEREVKALAKLDHVNIVHYNGCWDGFDYDPETSDDSLESSDYDPENSKNSSRSKTKCLFIQMEFCDKGTLEQWIEKRRGEKLDKVLALELFEQITKGVDYIHSKKLIH

As you might expect, both records contain the same sequence:

In [44]:
PKR_prot_fa.seq == PKR_prot_gb.seq

True
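
As the comment in the FASTA cell notes, SeqIO.read() expects exactly one record. To fetch several IDs in one request, you can pass them to Entrez.efetch() as a comma-separated string and use SeqIO.parse() instead. The sketch below assumes network access and that Entrez.email has already been set; `join_ids` and `fetch_fasta_records` are hypothetical helper names.

```python
def join_ids(ids):
    """Join several IDs into the comma-separated string that Entrez accepts."""
    return ",".join(str(i) for i in ids)


def fetch_fasta_records(ids, db="protein"):
    """Fetch several FASTA records in one request and return a list of SeqRecord objects.

    Needs network access; Entrez.email should already be set.
    """
    # Imported here so that join_ids() can be used without Biopython installed.
    from Bio import Entrez, SeqIO

    handle = Entrez.efetch(db=db, id=join_ids(ids), rettype="fasta", retmode="text")
    try:
        # SeqIO.parse() yields one SeqRecord per sequence in the response.
        return list(SeqIO.parse(handle, "fasta"))
    finally:
        handle.close()


# Usage (commented out because it contacts NCBI):
# records = fetch_fasta_records(["208431827", "4506103"])
# for record in records:
#     print(record.id, len(record.seq))
```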

### All the code at once