The following notebook retrieves FASTA-format protein sequences from NCBI for a given set of GI numbers.

In [1]:
# import the Entrez module from Biopython
from Bio import Entrez
# always give NCBI your email address
Entrez.email = "bleseapringle@gradcenter.cuny.edu"
# Goal: get the FASTA sequences for the homologs.
# The homolog list is already in hand; the GI numbers serve as the search term
# for the protein database, per the Entrez E-utilities documentation.
handle = Entrez.esearch(db="protein", term="37039612 124486791 664720500 134085369 847087507 45387633 50417271 17533301 330803495")

In [2]:
record = Entrez.read(handle)
handle.close()
# record behaves like a dictionary, so we can look at its keys
record.keys()

dict_keys(['Count', 'RetMax', 'RetStart', 'IdList', 'TranslationSet', 'QueryTranslation'])

The Entrez.read parser breaks the retrieved XML data down into individual parts
and transforms them into Python objects that can be accessed individually. Let’s see how
many sequences are available in the protein database for our search term, and access the
record IDs (note that NCBI returns only 20 IDs by default to keep traffic on its server low;
if a search matches more than that, call Entrez.esearch again and set retmax to the value
of Count; here Count is 8, so the default is enough):

In [3]:
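# number of hits (a bare expression is displayed only when it is the last line of a cell)
record["Count"]
print(record)
# retrieve the list of GI numbers (GenBank identifiers)
id_list = record["IdList"]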
print(id_list)

{'Count': '8', 'RetMax': '8', 'RetStart': '0', 'IdList': ['37039612', '124486791', '664720500', '847087507', '45387633', '50417271', '17533301', '330803495'], 'TranslationSet': [], 'QueryTranslation': ''}
['37039612', '124486791', '664720500', '847087507', '45387633', '50417271', '17533301', '330803495']
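
The search term contained nine GI numbers, but Count reports eight hits. A minimal sketch for finding which identifier is absent, using values copied from the cells above. The names `requested`, `returned` and `missing` are introduced here for illustration only; in a live session `returned` would be `record["IdList"]` and the count would be `int(record["Count"])`.

```python
# GI numbers submitted as the esearch term (copied from cell 1)
requested = ("37039612 124486791 664720500 134085369 847087507 "
             "45387633 50417271 17533301 330803495").split()

# IdList returned by the search (copied from the output above)
returned = ['37039612', '124486791', '664720500', '847087507',
            '45387633', '50417271', '17533301', '330803495']

# Count is reported as a string, so convert it before comparing
count = int('8')
print(count == len(returned))

# Keep the original order of the request while testing membership in a set
returned_set = set(returned)
missing = [gi for gi in requested if gi not in returned_set]
print(missing)
```

The output alone does not say why an identifier was not matched, so the sketch only reports it.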


In [4]:
# Download the sequence data using Entrez.efetch. We request all nine GI numbers
# in FASTA format and save the result to a file:
Entrez.email = "bleseapringle@gradcenter.cuny.edu"
handle = Entrez.efetch(db="protein", rettype="fasta", retmode="text", id="37039612 124486791 664720500 134085369 847087507 45387633 50417271 17533301 330803495")
# write the downloaded sequence data to an output file;
# the with block closes the file even if an error occurs
with open("c9orf72_homologs.fasta", "w") as out_handle:
    for line in handle:
        out_handle.write(line)
handle.close()
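
The NCBI E-utilities documentation describes the id parameter as a comma-delimited list of UIDs. A minimal sketch of building such strings from a list of GI numbers, split into batches. The helper `batch_ids` and the batch size of 4 are hypothetical choices for illustration, not part of Biopython.

```python
def batch_ids(ids, batch_size):
    """Yield comma-separated strings, each holding at most batch_size IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])

# GI numbers as returned in IdList by the earlier search
id_list = ['37039612', '124486791', '664720500', '847087507',
           '45387633', '50417271', '17533301', '330803495']

batches = list(batch_ids(id_list, 4))
for batch in batches:
    print(batch)
```

Each string could then be passed as the id argument of Entrez.efetch. With only a handful of IDs, as here, a single request is enough; batching matters mainly for long ID lists.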
        

4.3.2 Input and output of sequence data using SeqIO
Next, we use the SeqIO module to manipulate our sequences and obtain more information
about the homolog sequences we just downloaded:

In [5]:
from Bio import SeqIO
# print the description line and the sequence length of each record
with open("c9orf72_homologs.fasta", "r") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.description)
        print(len(record))

NP_060795.1 guanine nucleotide exchange factor C9orf72 isoform a [Homo sapiens]
481
NP_001074812.1 guanine nucleotide exchange factor C9orf72 homolog isoform 1 [Mus musculus]
481
XP_008517285.1 PREDICTED: protein C9orf72 homolog isoform X1 [Equus przewalskii]
481
XP_012817590.1 guanine nucleotide exchange C9orf72 homolog isoform X1 [Xenopus tropicalis]
461
NP_991166.1 guanine nucleotide exchange C9orf72 homolog [Danio rerio]
462
AAH77130.1 Zgc:100846 protein [Danio rerio]
326
NP_495604.1 ALS/FTD Associated gene homolog [Caenorhabditis elegans]
731
XP_003289741.1 uncharacterized protein DICPUDRAFT_80518, partial [Dictyostelium purpureum]
295
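
The description lines above follow a common NCBI pattern: an accession, a free-text protein name, and the organism in trailing square brackets. A minimal sketch that splits a description into accession and organism, using three lines copied from the output. The helper `split_description` is hypothetical, and the trailing-bracket layout is an assumption based on the records shown here; it may not hold for every record.

```python
import re

# Description lines copied from the output above
descriptions = [
    "NP_060795.1 guanine nucleotide exchange factor C9orf72 isoform a [Homo sapiens]",
    "NP_991166.1 guanine nucleotide exchange C9orf72 homolog [Danio rerio]",
    "XP_003289741.1 uncharacterized protein DICPUDRAFT_80518, partial [Dictyostelium purpureum]",
]

def split_description(description):
    """Return (accession, organism); organism is None without a trailing [...]."""
    # The accession is the first whitespace-delimited token
    accession = description.split(maxsplit=1)[0]
    # The organism is assumed to be the last bracketed group on the line
    match = re.search(r"\[([^\[\]]+)\]\s*$", description)
    organism = match.group(1) if match else None
    return accession, organism

parsed = [split_description(d) for d in descriptions]
for accession, organism in parsed:
    print(accession, organism, sep="\t")
```

In the notebook, the same helper could be applied to `record.description` inside the SeqIO.parse loop.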
