### Working with GenBank data
There is a very useful module that makes it relatively easy to work with common bioinformatics sequence formats and related file formats. Even where it does not look simple, it has already done a lot of the work you would otherwise have to do if you implemented this from the ground up.

This is the **Biopython** module.

You use it through the package name **Bio**, which you import into your Python session or file.

NCBI changed its access requirements so that connections over the Internet must use https instead of plain http. I believe this was a change that all federal government web sites were encouraged to make.

Because the NCBI web addresses are embedded in the Biopython code, you need ***version 1.68*** or later. The latest version, released in July 2017, is ***1.70***, and it handles MAF and NEXUS formats better than earlier versions.

#### Installing the Biopython module onto your computer
You can use **pip** to install this module via the bash terminal command:

**pip install biopython**

After it is installed, check its version.

In [6]:
import Bio
print "biopython version number is:", Bio.__version__

biopython version number is: 1.70


If it is version 1.68 or greater, we should be good to go.
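
Comparing version strings by eye works for 1.70 versus 1.68, but a plain string or float comparison would misorder a version such as 1.100. A minimal sketch of a numeric check follows. The helper names `version_tuple` and `is_new_enough` are made up for this illustration and are not part of Biopython; in a real session you would pass in `Bio.__version__`.

```python
import re

def version_tuple(version_string):
    """Turn '1.70' into (1, 70); only the leading digits of each part are used."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

def is_new_enough(version_string, minimum="1.68"):
    """True when version_string is at least the minimum version."""
    return version_tuple(version_string) >= version_tuple(minimum)

# In the notebook you would call is_new_enough(Bio.__version__)
print(is_new_enough("1.70"))   # True
print(is_new_enough("1.67"))   # False
print(is_new_enough("1.100"))  # True, although the float 1.100 is smaller than 1.68
```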

The first thing we are going to do is define a Python function that prints summary info based on a query.

In [7]:
#!/usr/bin/env python

# usage: esummary.py "quoted_query_string"
#
# return the summary records that correspond to the query
# number of matching records written to stderr
# first line is column names of Accn TaxID Length and Title
# rest of lines are summaries with these fields for the records
# fields separated by | since that is not a likely character in Title

import sys
from Bio import Entrez
Entrez.email = "ccg@calacademy.org"

def inJupyter(): # see if we are running in an iPython or Jupyter notebook
    try:
        get_ipython
    except NameError: # get_ipython is only defined inside IPython/Jupyter
        return False
    else:
        return True

def eSummary(qryStr):
    global rec # only so we can inspect value (eg keys) later in a Jupyter notebook
    handle = Entrez.esearch(db="nucleotide", term=qryStr, retmax=10000)
    record = Entrez.read(handle)
    numIDs = int(record["Count"])
    sys.stderr.write(str(numIDs) + " results for \""+qryStr+"\"\n")
    if numIDs < 1:
        return # nothing to summarize; returning (rather than sys.exit) keeps a notebook session alive

    if inNoteBook and numIDs > maxToShow: # while testing limit to maxToShow
        numIDs = maxToShow

    id_list = record["IdList"]
    id_list.sort()
    numIDs = min(numIDs, len(id_list)) # esearch returns at most retmax IDs, even when Count is larger

    print "Accn | TaxonID | Length | Title"
    startIx = 0; sliceSize = min(numIDs,200) # can't send too large a request at a time, so cut it up into this size slices
    while numIDs > 0:
        ids = ",".join( id_list[startIx : startIx+sliceSize] )

        handle = Entrez.esummary(db="nucleotide", id=ids)
        records = Entrez.read(handle)

        for rec in records:
            print rec['Caption'], '|', rec['TaxId'], '|', rec['Length'], '|', rec['Title']

        numIDs  -= sliceSize
        startIx += sliceSize
        
def usage():
    sys.stderr.write("Usage: esummary.py \"query_str\"\n   Ex: esummary.py \"Strix[Title] AND mitochondrial[Title]\"\n");
    sys.exit(0)

# set vars that allow us to run in Jupyter notebook or from command line file
inNoteBook = inJupyter()
if inNoteBook:
    maxToShow = 25
    qryStr = "Strix[Title] AND mitochondrion[Title]"
    #qryStr = "1246431282 1246431296 523582230"
    #qryStr = "KC953095 MF431746 MF431745 KU365899 MF001440"
else:
    if len(sys.argv) < 2:
        usage()
    qryStr = sys.argv[1] # ex: "Strix[Title] AND mitochondrion[Title]"

# get our summaries with the defined query string
eSummary(qryStr)


3 results for "Strix[Title] AND mitochondrion[Title]"


Accn | TaxonID | Length | Title
MF431745 | 1053996 | 18975 | Strix varia varia voucher CNHM<USA-OH>:ORNITH:B41533 mitochondrion, complete genome
MF431746 | 311401 | 19889 | Strix occidentalis caurina voucher CAS:ORN:98821 mitochondrion, complete genome
KC953095 | 126835 | 16307 | Strix leptogrammica mitochondrion, complete genome
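
The summary request above is sent in slices because, as the code comment notes, too large a request cannot be sent at one time. The slicing step on its own can be sketched as follows. `batches` is a hypothetical helper, the IDs are made-up placeholders, and the actual Entrez call is left as a comment so the sketch runs without network access.

```python
def batches(items, size=200):
    """Yield consecutive slices of items, each holding at most size entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 450 placeholder IDs standing in for record["IdList"]
id_list = [str(n) for n in range(1, 451)]

for chunk in batches(id_list):
    ids = ",".join(chunk)
    # here the notebook would call Entrez.esummary(db="nucleotide", id=ids)
    print("%d ids, from %s to %s" % (len(chunk), chunk[0], chunk[-1]))
```

Stepping through the list with range means the loop ends when the list runs out, so there is no separate counter to keep in step with the list length.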


In [7]:
rec

{'Status': 'live', 'Comment': '  ', 'Caption': 'KC953095', 'AccessionVersion': 'KC953095.1', 'Title': 'Strix leptogrammica mitochondrion, complete genome', 'CreateDate': '2013/07/20', 'Extra': 'gi|523582230|gb|KC953095.1|[523582230]', 'TaxId': 126835, 'ReplacedBy': '', u'Item': [], 'Length': 16307, 'Flags': 256, 'UpdateDate': '2014/09/11', u'Id': '523582230', 'Gi': 523582230}
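
Each summary record behaves like a dictionary, so the printed row is just four of its values joined together. The sketch below uses a hand-copied subset of the record shown above rather than a live Entrez result. `format_row` and `parse_row` are names invented here, and splitting on the separator only works under the assumption that no field contains " | ".

```python
SEPARATOR = " | "
FIELDS = ["Caption", "TaxId", "Length", "Title"]

# Subset of the summary record printed above, copied by hand
sample = {
    "Caption": "KC953095",
    "TaxId": 126835,
    "Length": 16307,
    "Title": "Strix leptogrammica mitochondrion, complete genome",
}

def format_row(summary):
    """Join the chosen fields into one pipe-separated line."""
    return SEPARATOR.join(str(summary[name]) for name in FIELDS)

def parse_row(line):
    """Split a line produced by format_row back into a dict of strings."""
    return dict(zip(FIELDS, line.split(SEPARATOR)))

row = format_row(sample)
print(row)
print(parse_row(row)["Length"])
```

Note that after the round trip every value is a string, so numeric fields such as Length need converting with int() before any arithmetic.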

In [35]:
eSummary("KC953095 KX773380 MF431746")

3 results for "KC953095 KX773380 MF431746"


Accn | TaxonID | Length | Title
KX773380 | 109736 | 575 | Centropyge vrolikii isolate cvr45 cytochrome b (cytb) gene, partial cds; mitochondrial
MF431746 | 311401 | 19889 | Strix occidentalis caurina voucher CAS:ORN:98821 mitochondrion, complete genome
KC953095 | 126835 | 16307 | Strix leptogrammica mitochondrion, complete genome


In [38]:
eSummary("Centropyge flavissimi and rRNA")

191 results for "Centropyge flavissimi and rRNA"


Accn | TaxonID | Length | Title
KU365899 | 1474813 | 16432 | Centropyge interrupta mitochondrion, complete genome
KY129718 | 586785 | 356 | Centropyge bispinosa isolate 1 small subunit ribosomal RNA gene, partial sequence
KU244249 | 586790 | 6964 | Centropyge multicolor voucher 278 18S ribosomal RNA gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
KU244251 | 1489734 | 6803 | Centropyge woodheadi voucher 333 18S ribosomal RNA gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
KX499480 | 1445868 | 16843 | Centropyge deborae voucher 373 mitochondrion, complete genome
KU356780 | 466111 | 6810 | Centropyge aurantia voucher 352 18S ribosomal RNA gene, internal transcribed spacer 1, 5.8S ribosomal RNA gene, internal transcribed spacer 2, a

In [9]:
#!/usr/bin/env python

# usage: efetch.py IDs [ [-gb] | [-f <feature_name>|all [-q <qualifier_name>|all]] [-s max_seqlen] ]
#
# fetch the GenBank records that correspond to the Accession IDs
# default output: first line is column names of Accn Length UpdateDate Title and Taxonomy
# rest of lines are summaries with these fields for the records
# fields separated by | since that is not a likely character in Title
# -gb prints the full GenBank text; -f and -q print feature qualifiers instead

import sys
from Bio import Entrez
Entrez.email = "ccg@calacademy.org"

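try: # only needed for the optional set_trace() call below; IPython may be absent on the command line
    from IPython.core.debugger import set_trace
except ImportError:
    pass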

def inJupyter(): # see if we are running in an iPython or Jupyter notebook
    try:
        get_ipython
    except NameError: # get_ipython is only defined inside IPython/Jupyter
        return False
    else:
        return True

def get_feature_qualifier_list(rec, feature_name, qualifier_name):
    result = []
    if not 'GBSeq_feature-table' in rec or feature_name == "":
        return result
    
    # build list of all qualifier values for feature_name[qualifier_name]
    # special case of feature_name of "all" for ALL feature_names
    # also special case for empty qualifier_name puts name of feature and name of qualifier in the list
    #   and if qualifier_name is "all" then also include value with feature and qualifier name
    special_case_qual = (qualifier_name == "" or qualifier_name.lower() == "all")
    
    for feat in rec['GBSeq_feature-table']:
        if (feature_name.lower() == "all" or feat['GBFeature_key'] == feature_name) and 'GBFeature_quals' in feat:
            for qual in feat['GBFeature_quals']:
                qual_value = qual.get('GBQualifier_value', "") # some qualifiers are flags that carry no value
                if qual['GBQualifier_name'] == qualifier_name:
                    result.append( qual_value )
                elif special_case_qual: # special case: put feature name and qualifier name and maybe val in list
                    inf = feat['GBFeature_key'] + " \t" + qual['GBQualifier_name']
                    if qualifier_name.lower() == "all":
                        inf += " \t" + qual_value
                    result.append(inf)
    return result
    
def efetch(IDs, mode="xml", seq_end=-1, feat_name="", qual_name=""): #IDs is a string with comma separated values, each an ID
    global rec # only so we can inspect value (eg keys) later in a Jupyter notebook
    
    showing_feature = (feat_name != "")
    if mode == "xml" and feat_name == "":
        print "Accn | Length | UpdateDate | Title | Taxonomy"

    # a short seq_end speeds up retrieval quite a bit, but the reported length is then that of the retrieved region, not of the full record
    if seq_end <= 0:
        handle = Entrez.efetch(db="nucleotide", id=IDs, rettype="gb", retmode=mode)
    else:
        handle = Entrez.efetch(db="nucleotide", id=IDs, rettype="gb", retmode=mode, strand=1,seq_start=1,seq_stop=seq_end)

    if mode == "text":
        print(handle.read())
    else:
        records = Entrez.read(handle)
        for rec in records:
            # set_trace() #enter ipdb debugger if we want to look at records interactively in Jupyter notebook
            if not showing_feature:
                print rec['GBSeq_primary-accession'], '|', rec['GBSeq_length'], '|', rec['GBSeq_update-date'], '|', \
                      rec['GBSeq_definition'],        '|', rec['GBSeq_taxonomy']
            else:
                features = get_feature_qualifier_list(rec, feat_name, qual_name)
                for qual in features:
                    print rec['GBSeq_primary-accession'] + " \t" + qual
                    
def usage():
    sys.stderr.write('''
    Usage: efetch.py <ID1> [<ID2> ...] [ [-gb] | [-f <feature_name>|all [-q <qualifier_name>|all]] [-s max_seqlen] ]
    
    Examples:
        efetch.py KX773372 KX773303     # show Accn Length UpdateDate Title Taxonomy for each ID
        efetch.py KX773372 -gb          # show full textual GenBank output format
        efetch.py KX773372 -f source    # show name of a specific feature and each of its qualifier names
        efetch.py KX773372 -f source -q organelle # show specific feature and a specific qualifier value
        efetch.py MF431746 -f source -q all       # show specific feature and each of its qualifier names and value
        efetch.py MF431746 -f all                 # show names of each feature and each of its qualifier names
        efetch.py KX773372 -s 200       # limit the size of the sequence retrieved. this also limits which features are retrieved.

''');
    sys.exit(0)

def get_options(arg_list, min_args=2):
    class options:
        argv           = arg_list
        id_list        = []
        mode           = "xml"
        sequence_max   = -1
        feature_name   = ""
        qualifier_name = ""
        @classmethod
        def number_of_ids(cls): # a class method takes the class itself (cls) as its first argument
            return len(cls.id_list)

    num_args = len(options.argv)
    if num_args < min_args:
        usage()

    lst = options.id_list
    ixArg = 0; ixLast = num_args-1;
    while ixArg < ixLast:
        ixArg += 1
        idStr = arg_list[ixArg]
        if idStr[0] != "-":
            lst.append(idStr)
        elif idStr == "-gb" or idStr == "-text":
            options.mode = "text"
        else: # get options that have a value after the option flag
            optionType = idStr[1].lower()
            if optionType in "sfq" and ixArg < ixLast: # -s, -f or -q has a value after
                ixArg += 1
                optVal = arg_list[ixArg]
                if optionType == "f":
                    options.feature_name = optVal
                elif optionType == "q":
                    options.qualifier_name = optVal
                elif optionType == "s":
                    options.sequence_max = int(optVal) # efetch compares this value with 0, so store it as a number
    return options

# main code entrypoint
if inJupyter():
    efetch("KC953095,KX773380,MF431746")
else:
    ops = get_options(sys.argv)
    ids = ",".join( ops.id_list )
    
    # call efetch with comma delimited set of IDs and optionally seq max, feature and qualifier name
    efetch(ids, ops.mode, seq_end=ops.sequence_max, feat_name=ops.feature_name, qual_name=ops.qualifier_name)

Accn | Length | UpdateDate | Title | Taxonomy
KC953095 | 16307 | 11-SEP-2014 | Strix leptogrammica mitochondrion, complete genome | Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Archelosauria; Archosauria; Dinosauria; Saurischia; Theropoda; Coelurosauria; Aves; Neognathae; Strigiformes; Strigidae; Strix
KX773380 | 575 | 04-OCT-2016 | Centropyge vrolikii isolate cvr45 cytochrome b (cytb) gene, partial cds; mitochondrial | Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Actinopterygii; Neopterygii; Teleostei; Neoteleostei; Acanthomorphata; Eupercaria; Pomacanthidae; Centropyge
MF431746 | 19889 | 01-OCT-2017 | Strix occidentalis caurina voucher CAS:ORN:98821 mitochondrion, complete genome | Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Archelosauria; Archosauria; Dinosauria; Saurischia; Theropoda; Coelurosauria; Aves; Neognathae; Strigiformes; Strigidae; Strix


In [68]:
efetch("KC953095,KX773380,MF431746", feat_name="source", qual_name="country")

KC953095 	China: Anhui, South Anhui National Wild Animal Rescue Centre
KX773380 	Palau
MF431746 	USA: 260 Piedmont Road, Larkspur, Marin County, California
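
It helps to see the nesting that **get_feature_qualifier_list** walks through: a record holds a feature table, each feature holds a list of qualifiers, and each qualifier has a name and usually a value. The record below is a tiny hand-built stand-in using the same key names, not real Entrez output, and `qualifier_values` is a simplified illustration rather than a replacement for the function above.

```python
# Hand-built stand-in with the same key names as a parsed GenBank XML record
mock_rec = {
    "GBSeq_primary-accession": "KX773380",
    "GBSeq_feature-table": [
        {
            "GBFeature_key": "source",
            "GBFeature_quals": [
                {"GBQualifier_name": "organism", "GBQualifier_value": "Centropyge vrolikii"},
                {"GBQualifier_name": "country", "GBQualifier_value": "Palau"},
            ],
        },
        {"GBFeature_key": "gene"},  # a feature with no qualifiers at all
    ],
}

def qualifier_values(rec, feature_name, qualifier_name):
    """Collect the values of one qualifier across all features of one type."""
    values = []
    for feat in rec.get("GBSeq_feature-table", []):
        if feat["GBFeature_key"] != feature_name:
            continue
        for qual in feat.get("GBFeature_quals", []):
            if qual["GBQualifier_name"] == qualifier_name:
                values.append(qual.get("GBQualifier_value", ""))
    return values

print(qualifier_values(mock_rec, "source", "country"))   # ['Palau']
print(qualifier_values(mock_rec, "gene", "country"))     # []
```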


In [69]:
efetch("KC953095,KX773380,MF431746", feat_name="source", qual_name="all")

KC953095 	source 	organism 	Strix leptogrammica
KC953095 	source 	organelle 	mitochondrion
KC953095 	source 	mol_type 	genomic DNA
KC953095 	source 	db_xref 	taxon:126835
KC953095 	source 	tissue_type 	feather
KC953095 	source 	country 	China: Anhui, South Anhui National Wild Animal Rescue Centre
KC953095 	source 	collection_date 	Jun-2012
KX773380 	source 	organism 	Centropyge vrolikii
KX773380 	source 	organelle 	mitochondrion
KX773380 	source 	mol_type 	genomic DNA
KX773380 	source 	isolate 	cvr45
KX773380 	source 	db_xref 	taxon:109736
KX773380 	source 	country 	Palau
MF431746 	source 	organism 	Strix occidentalis caurina
MF431746 	source 	organelle 	mitochondrion
MF431746 	source 	mol_type 	genomic DNA
MF431746 	source 	sub_species 	caurina
MF431746 	source 	specimen_voucher 	CAS:ORN:98821
MF431746 	source 	db_xref 	taxon:311401
MF431746 	source 	sex 	female
MF431746 	source 	tissue_type 	blood
MF431746 	source 	dev_stage 	adult
MF431746 	source 	country 	USA: 260 Piedmont Road, L

In [10]:
efetch("KC953095", feat_name="tRNA", qual_name="product")

KC953095 	tRNA-Phe
KC953095 	tRNA-Val
KC953095 	tRNA-Leu
KC953095 	tRNA-Ile
KC953095 	tRNA-Gln
KC953095 	tRNA-Met
KC953095 	tRNA-Trp
KC953095 	tRNA-Ala
KC953095 	tRNA-Asn
KC953095 	tRNA-Cys
KC953095 	tRNA-Tyr
KC953095 	tRNA-Ser
KC953095 	tRNA-Asp
KC953095 	tRNA-Lys
KC953095 	tRNA-Gly
KC953095 	tRNA-Arg
KC953095 	tRNA-His
KC953095 	tRNA-Ser
KC953095 	tRNA-Leu
KC953095 	tRNA-Thr
KC953095 	tRNA-Pro
KC953095 	tRNA-Glu
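
Because each output line is the accession and the value separated by a tab, the listing is easy to post-process. As a small sketch, the snippet below counts how often each product occurs. The `lines` list is rebuilt by hand from the output above, and the tab-splitting is an assumption about how you captured that output.

```python
from collections import Counter

products = ["Phe", "Val", "Leu", "Ile", "Gln", "Met", "Trp", "Ala", "Asn", "Cys", "Tyr",
            "Ser", "Asp", "Lys", "Gly", "Arg", "His", "Ser", "Leu", "Thr", "Pro", "Glu"]
# Rebuild lines in the same "accession <space><tab> value" layout that efetch prints
lines = ["KC953095 \ttRNA-" + name for name in products]

# Take the part after the tab as the product name and tally it
counts = Counter(line.split("\t")[1].strip() for line in lines)
repeated = sorted(name for name, n in counts.items() if n > 1)

print("%d tRNA features, %d distinct products" % (len(lines), len(counts)))
print("listed more than once: " + ", ".join(repeated))
```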


In [11]:
efetch("KC953095", feat_name="tRNA", qual_name="all")

KC953095 	tRNA 	product 	tRNA-Phe
KC953095 	tRNA 	transcription 	GTCCCTGTAGCTTACAAATAAAGCATGGCACTGAAGATGCCAAGATGGTGGGCCAATCCCAGTACA
KC953095 	tRNA 	product 	tRNA-Val
KC953095 	tRNA 	transcription 	CAGGACGTAGCTACGATACCAAAGCATTCAGCTTACACCTGAAAGATATCTGTACCTATCAGATCGCCCTGA
KC953095 	tRNA 	product 	tRNA-Leu
KC953095 	tRNA 	codon_recognized 	UUR
KC953095 	tRNA 	transcription 	GCTAGCGTGGCAGAGCCCGGCAAGTGCAAAAGGCTTAAGCCCTTTACCCCAGAGGTTCAAATCCTCTCCCTAGCT
KC953095 	tRNA 	product 	tRNA-Ile
KC953095 	tRNA 	transcription 	GGAAATGTGCCTGAATGCAAAGGGTCACTATGATAAAGTGAACATGGAGGTACACCAACCCTCTCATTTCCT
KC953095 	tRNA 	product 	tRNA-Gln
KC953095 	tRNA 	transcription 	TAGGAAATAATATAAAGGAAGTATGAAGGGTTTTGGTCTCTTCTGTGTAGGTTCGACTCCTGCTTTTCTAA
KC953095 	tRNA 	product 	tRNA-Met
KC953095 	tRNA 	transcription 	AGTAGGGTCAGCTAACAAAGCTATCGGGCCCATACCCCGAAAATGACGGTTTAACCCCATCCCCCACTA
KC953095 	tRNA 	product 	tRNA-Trp
KC953095 	tRNA 	transcription 	AGAAACTTAGGATAACCTACCTTAAACCGAAGGCCTTCAAAGCCTTAAACAAGAGTTAAACCCTCTTAGTTTCTG


In [80]:
feature_name = 'source'; qualifier_name = 'country'
#feature_name = 'rRNA'; qualifier_name = 'product'
print get_feature_qualifier_list(rec, feature_name, qualifier_name)

['China: Anhui, South Anhui National Wild Animal Rescue Centre']


In [79]:
efetch("KC953095")

Accn | Length | UpdateDate | Title | Taxonomy
KC953095 | 16307 | 11-SEP-2014 | Strix leptogrammica mitochondrion, complete genome | Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Archelosauria; Archosauria; Dinosauria; Saurischia; Theropoda; Coelurosauria; Aves; Neognathae; Strigiformes; Strigidae; Strix


In [12]:
efetch("KX773372", "text")

LOCUS       KX773372                 575 bp    DNA     linear   VRT 04-OCT-2016
DEFINITION  Centropyge vrolikii isolate cvr37 cytochrome b (cytb) gene, partial
            cds; mitochondrial.
ACCESSION   KX773372
VERSION     KX773372.1
KEYWORDS    .
SOURCE      mitochondrion Centropyge vrolikii (pearlscale angelfish)
  ORGANISM  Centropyge vrolikii
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Actinopterygii; Neopterygii; Teleostei; Neoteleostei;
            Acanthomorphata; Eupercaria; Pomacanthidae; Centropyge.
REFERENCE   1  (bases 1 to 575)
  AUTHORS   DiBattista,J.D., Gaither,M.R., Hobbs,J.-P.A., Rocha,L.A. and
            Bowen,B.W.
  TITLE     Angelfishes, paper tigers and the devilish taxonomy of the
            Centropyge flavissima complex
  JOURNAL   J. Hered. (2016) In press
   PUBMED   27651391
  REMARK    Publication Status: Available-Online prior to print
REFERENCE   2  (bases 1 to 575)
  AUTHORS   DiBattista,J.D., Gaither,M.R.

In [73]:
efetch("KC953095", "text", 180)

LOCUS       KC953095                 180 bp    DNA     linear   VRT 11-SEP-2014
DEFINITION  Strix leptogrammica mitochondrion, complete genome.
ACCESSION   KC953095 REGION: 1..180
VERSION     KC953095.1
KEYWORDS    .
SOURCE      mitochondrion Strix leptogrammica
  ORGANISM  Strix leptogrammica
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archelosauria; Archosauria; Dinosauria; Saurischia; Theropoda;
            Coelurosauria; Aves; Neognathae; Strigiformes; Strigidae; Strix.
REFERENCE   1  (bases 1 to 180)
  AUTHORS   Liu,G., Zhou,L. and Gu,C.
  TITLE     The complete mitochondrial genome of Brown wood owl Strix
            leptogrammica (Strigiformes: Strigidae)
  JOURNAL   Mitochondrial DNA 25 (5), 370-371 (2014)
   PUBMED   25204537
REFERENCE   2  (bases 1 to 180)
  AUTHORS   Liu,G. and Zhou,L.
  TITLE     Direct Submission
  JOURNAL   Submitted (19-APR-2013) School of Resource and Environmental
            Engineering, Anhui University, 
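
The first line of the text format already carries several of the fields shown in the summary table. A rough sketch of pulling them apart is below. It assumes the whitespace-separated layout of the LOCUS lines printed above, and `parse_locus` is a made-up helper. For real work, Biopython's own GenBank parser (**Bio.SeqIO** with the "genbank" format) is the safer route.

```python
def parse_locus(line):
    """Split a LOCUS line laid out like the ones printed above into named fields."""
    parts = line.split()
    if len(parts) != 8 or parts[0] != "LOCUS":
        raise ValueError("unexpected LOCUS layout: " + line)
    return {
        "name": parts[1],
        "length": int(parts[2]),
        "units": parts[3],
        "moltype": parts[4],
        "topology": parts[5],
        "division": parts[6],
        "date": parts[7],
    }

locus = parse_locus("LOCUS       KC953095                 180 bp    DNA     linear   VRT 11-SEP-2014")
print("%s: %d %s, %s, updated %s" % (locus["name"], locus["length"], locus["units"],
                                      locus["topology"], locus["date"]))
```

The length reported here is 180 because only bases 1 to 180 were requested; the full record is 16307 bases long, as the earlier table shows.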

In [74]:
rec.keys()

[u'GBSeq_moltype',
 u'GBSeq_source',
 u'GBSeq_sequence',
 u'GBSeq_primary-accession',
 u'GBSeq_definition',
 u'GBSeq_accession-version',
 u'GBSeq_topology',
 u'GBSeq_length',
 u'GBSeq_feature-table',
 u'GBSeq_create-date',
 u'GBSeq_other-seqids',
 u'GBSeq_division',
 u'GBSeq_taxonomy',
 u'GBSeq_references',
 u'GBSeq_update-date',
 u'GBSeq_organism',
 u'GBSeq_locus',
 u'GBSeq_strandedness']

In [25]:
efetch("KX773372", feat_name="source", qual_name="all")

KX773372 	source 	organism 	Centropyge vrolikii
KX773372 	source 	organelle 	mitochondrion
KX773372 	source 	mol_type 	genomic DNA
KX773372 	source 	isolate 	cvr37
KX773372 	source 	db_xref 	taxon:109736
KX773372 	source 	country 	Palau


In [17]:
def print_object_attrs(obj):
    attrs = [a for a in dir(obj) if not a.startswith('__') and not callable(getattr(obj,a))]
    for attr in attrs:
        print attr + ":", getattr(obj, attr)

def print_object_methods(obj):
    methods = [a for a in dir(obj) if callable(getattr(obj,a)) and not a.startswith('__')]
    for method in methods:
        print method

ops = get_options(["pgm", "KC953095","-f", "source", "-q", "all"])
ids = ",".join( ops.id_list )
print "ops object attributes:"; print_object_attrs(ops); print #debug stmt, comment out this line to remove debug
 
# call efetch with comma delimited set of IDs and optionally seq max, feature and qualifier name
efetch(ids, ops.mode, seq_end=ops.sequence_max, feat_name=ops.feature_name, qual_name=ops.qualifier_name)

ops object attributes:
argv: ['pgm', 'KC953095', '-f', 'source', '-q', 'all']
feature_name: source
id_list: ['KC953095']
mode: xml
qualifier_name: all
sequence_max: -1

KC953095 	source 	organism 	Strix leptogrammica
KC953095 	source 	organelle 	mitochondrion
KC953095 	source 	mol_type 	genomic DNA
KC953095 	source 	db_xref 	taxon:126835
KC953095 	source 	tissue_type 	feather
KC953095 	source 	country 	China: Anhui, South Anhui National Wild Animal Rescue Centre
KC953095 	source 	collection_date 	Jun-2012
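
The hand-written option loop in **get_options** can also be expressed with the standard library's **argparse** module. The following is a sketch of an alternative, not the notebook's method. The attribute names mirror the ones used above, `build_parser` is a made-up name, and it assumes the IDs are given before any options, as in the usage examples.

```python
import argparse

def build_parser():
    """Describe the same command-line options that get_options handles by hand."""
    parser = argparse.ArgumentParser(prog="efetch.py")
    parser.add_argument("ids", nargs="+", help="one or more accession IDs")
    parser.add_argument("-gb", "-text", dest="mode", action="store_const",
                        const="text", default="xml", help="show full GenBank text")
    parser.add_argument("-f", dest="feature_name", default="", help="feature name or all")
    parser.add_argument("-q", dest="qualifier_name", default="", help="qualifier name or all")
    parser.add_argument("-s", dest="sequence_max", type=int, default=-1,
                        help="limit the size of the sequence retrieved")
    return parser

# Same arguments as the get_options test above, minus the program name
ops = build_parser().parse_args(["KC953095", "-f", "source", "-q", "all"])
print(",".join(ops.ids))
print("%s %s %s %d" % (ops.mode, ops.feature_name, ops.qualifier_name, ops.sequence_max))
```

With type=int the parser converts the -s value itself and reports an error for a non-numeric value, so no conversion is needed before the comparison inside efetch.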
