### Bio.Entrez package

[Entrez](https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html) is a data retrieval system for NCBI databases. In this section, the list of available NCBI databases is obtained with Biopython's Entrez.einfo function. For more details, see chapter 9.2 of the [Biopython Tutorial and Cookbook](https://biopython.org/DIST/docs/tutorial/Tutorial.html).


In [1]:
from Bio import Entrez
Entrez.email = "" # Use the optional email parameter so the NCBI can contact you if there is a problem

In [2]:
handle = Entrez.einfo()
record = Entrez.read(handle)

In [3]:
record["DbList"] # all of the databases that can be reached via Entrez

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']
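
EInfo can also describe a single database, including the search fields it supports. The following is a minimal sketch, assuming that the parsed record for one database holds a DbInfo entry whose FieldList items carry Name and FullName keys, as described in the Biopython tutorial. The helper name field_names is hypothetical, and the sample record is a hand-made stand-in rather than real NCBI output.

```python
def field_names(einfo_record):
    """Map each search field's short name to its full name.

    Assumes the structure returned by Entrez.read(Entrez.einfo(db=...)).
    """
    fields = einfo_record["DbInfo"]["FieldList"]
    return {field["Name"]: field["FullName"] for field in fields}


# Hand-made stand-in for a parsed EInfo record (not real NCBI output)
sample_record = {
    "DbInfo": {
        "DbName": "nucleotide",
        "FieldList": [
            {"Name": "ORGN", "FullName": "Organism"},
            {"Name": "GENE", "FullName": "Gene Name"},
        ],
    }
}
print(field_names(sample_record))

# With network access, a real record could be obtained like this:
# from Bio import Entrez
# Entrez.email = ""  # your email address
# handle = Entrez.einfo(db="nucleotide")
# print(field_names(Entrez.read(handle)))
```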

### ESearch: Searching via Entrez
In this section, Entrez.esearch is used to search NCBI's Nucleotide database for a specific gene in an organism. In the search term, [Orgn] stands for Organism, [Gene] for Gene, and [prop] for Property; complete[prop] restricts the search to complete genomes. idtype is set to "acc" so that accession IDs are returned. For more details, see chapter 9.3 of the [Biopython Tutorial and Cookbook](https://biopython.org/DIST/docs/tutorial/Tutorial.html).
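
Because a search term is just values tagged with field names and joined by AND, it can be assembled from a dictionary. This is a small sketch of that idea; build_term is a hypothetical helper, not part of Biopython, and it only covers the simple AND case.

```python
def build_term(filters):
    """Join {field: value} pairs into an Entrez search term with AND."""
    return " AND ".join(f"{value}[{field}]" for field, value in filters.items())


term = build_term({"Orgn": "Cypripedioideae", "Gene": "matK", "prop": "complete"})
print(term)
```

The resulting string could be passed as the term argument of Entrez.esearch.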

In [4]:
from Bio import Entrez
Entrez.email = "" # Use the optional email parameter so the NCBI can contact you if there is a problem
handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene] AND complete[prop]", idtype="acc")
record = Entrez.read(handle)

In [5]:
print(record["Count"])

261


In [6]:
print(record["IdList"])

['NC_084420.1', 'NC_084419.1', 'NC_084418.1', 'OR726575.1', 'OR726574.1', 'OR726573.1', 'OQ981989.1', 'NC_063680.1', 'NC_063681.1', 'NC_064145.1', 'NC_066405.1', 'OP465215.1', 'NC_071758.1', 'NC_069974.1', 'NC_069973.1', 'NC_069972.1', 'NC_069971.1', 'NC_069970.1', 'NC_069969.1', 'NC_069968.1']
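
The count above (261) is larger than the 20 IDs in IdList because ESearch returns only the first 20 hits unless asked for more. A sketch of paging through all hits with the retstart and retmax parameters of ESearch follows. collect_ids and entrez_page are hypothetical helper names, and the page fetcher is passed in as a function so the loop can be tried without network access.

```python
def collect_ids(fetch_page, total, page_size=100):
    """Gather IDs page by page; fetch_page(start, size) returns a list of IDs."""
    ids = []
    for start in range(0, total, page_size):
        ids.extend(fetch_page(start, page_size))
    return ids


def entrez_page(term, db="nucleotide"):
    """Build a page fetcher backed by Entrez.esearch (needs network access)."""
    from Bio import Entrez  # imported here so the demo below runs offline

    def fetch_page(start, size):
        handle = Entrez.esearch(
            db=db, term=term, idtype="acc", retstart=start, retmax=size
        )
        page = Entrez.read(handle)
        handle.close()
        return list(page["IdList"])

    return fetch_page


# Offline demo with a fake fetcher standing in for NCBI
fake_hits = [f"ID{i}" for i in range(261)]
all_ids = collect_ids(lambda start, size: fake_hits[start:start + size], total=261)
print(len(all_ids))
```

Against NCBI, the call would look like collect_ids(entrez_page(term), int(record["Count"])), where term is the search string used earlier.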


### EFetch: Downloading full records from Entrez
In this section, the first result of the search in the previous section is downloaded to a local file with Entrez.efetch, and the downloaded file is then parsed with Bio.SeqIO. For more details, see chapter 9.6 of the [Biopython Tutorial and Cookbook](https://biopython.org/DIST/docs/tutorial/Tutorial.html).

In [7]:
# The first record from the search query will be downloaded
access_id = record["IdList"][0]

In [10]:
import os
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "" # Use the optional email parameter so the NCBI can contact you if there is a problem

# Create a filename to save the data as a GenBank file
filename = f"{access_id}.gbk"
if not os.path.isfile(filename):  # Skip the download if the file already exists
    # Downloading...
    net_handle = Entrez.efetch(
        db="nucleotide", id=access_id, rettype="gb", retmode="text"
    )
    # The with block closes the output file even if writing fails
    with open(filename, "w") as out_handle:
        out_handle.write(net_handle.read())
    net_handle.close()
    print("Saved")

Saved
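
To download more than the first hit, several accessions can be passed to EFetch in one request as a comma-separated string. The sketch below only prepares those id strings; the helper name id_batches and the batch size are assumptions made for illustration.

```python
def id_batches(ids, batch_size=5):
    """Yield comma-separated id strings holding at most batch_size ids each."""
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])


some_ids = ["NC_084420.1", "NC_084419.1", "NC_084418.1"]
for id_string in id_batches(some_ids, batch_size=2):
    # Each string could be passed as id=id_string to Entrez.efetch
    print(id_string)
```

A file holding several records would need SeqIO.parse rather than SeqIO.read, since SeqIO.read expects exactly one record.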


In [11]:
print("Parsing...")
record = SeqIO.read(filename, "genbank")
print(record)

Parsing...
ID: NC_084420.1
Name: NC_084420
Description: Cypripedium sichuanense chloroplast, complete genome
Database cross-references: BioProject:PRJNA927338
Number of features: 264
/molecule_type=DNA
/topology=circular
/data_file_division=PLN
/date=07-DEC-2023
/accessions=['NC_084420']
/sequence_version=1
/keywords=['RefSeq']
/source=chloroplast Cypripedium sichuanense
/organism=Cypripedium sichuanense
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliopsida', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Cypripedium']
/references=[Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)]
/comment=PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence is identical to OR726575.
COMPLETENESS: full length.
/structured_comment=defaultdict(<class 'dict'>, {'Assembly-Data': {'Assembly Method': 'canu v. v1.3; fastp v. 0.20.0; minimap2 v.

In [46]:
print(f"ID: {record.id}")
print(f"DESCRIPTION: {record.description}")

ID: OR726575.1
DESCRIPTION: Cypripedium sichuanense chloroplast, complete genome
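
The 264 features reported for the parsed record are available as record.features. As a sketch, the helper below (find_gene, a hypothetical name) picks out the features of type gene whose gene qualifier matches a given name. It assumes the usual SeqFeature layout, with a type string and a qualifiers dictionary whose values are lists. Stand-in objects are used so the example runs without the downloaded file.

```python
from types import SimpleNamespace


def find_gene(features, gene_name):
    """Return features of type 'gene' whose /gene qualifier equals gene_name."""
    return [
        feature
        for feature in features
        if feature.type == "gene"
        and gene_name in feature.qualifiers.get("gene", [])
    ]


# Stand-ins for Bio.SeqFeature.SeqFeature objects (not data from the record)
demo_features = [
    SimpleNamespace(type="source", qualifiers={}),
    SimpleNamespace(type="gene", qualifiers={"gene": ["matK"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["matK"]}),
]
print(len(find_gene(demo_features, "matK")))
```

With the parsed record, the call would be find_gene(record.features, "matK").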


In [12]:
# The record also holds the sequence itself as a Bio.Seq.Seq object
print(type(record.seq),"\n", record.seq)

<class 'Bio.Seq.Seq'> 
 GAACCCCCATATCTTGTATCTTGTAAGATATGGGGGGATTGCTACCTTCAAAAATTCATATCATATACATATAAATTACATATAAATTTCTACATTTTATACATTAAAGTATTATCCATTTGTAGATGGAGCTTCTACAGAAGCTAGATCTAGAGGGAAGTTGTGAGCATTACGTTCATGCATTACTTCCATACCAAGATTCGCGCGATTTATGATATCAGCCCAAGTGTTAATAACACGACCTTGACTATCAACTACGGATTGGTTAAAATTGAAACCATTCAGGTTGAACGCCATAGTGCTAATACCCAAAGCAGTGAACCAGATACCCACTACAGGCCAAGCAGCCAGGAAGAAATGTAAGGAACGAGAATTGTTGAAACTAGCATATTGGAAGATCAATCGGCCAAAATAACCATGAGCAGCTACGATATTATAGGTTTCTTCCTCTTGACCGAATCTGTAACCTTCATTAGCAGACTCGTTTTCAGTGGTTTCCCTGATCAAACTAGAAGTTACCAAAGAACCATGCATAGCACTGAATAGGGAGCCGCCGAATACACCAGCTACGCCTAACATGTGAAATGGATGCATAAGAATGTTGTGCTCTGCCTGGAATACAATCATGAAGTTGAAAGTACCAGATATTCCTAAAGGCATACCATCAGAGAAACTTCCTTGACCAATAGGGTAGATCAAGAAAACAGCTGCAGCAGCCGCAACAGGAGCTGAATATGCAACAGCAATCCAAGGGCGCATACCCAGACGGAAACTAAGTTCCCACTCACGACCCATGTAACAAGCTACACCAAGTAAAAAGTGTAGAACAATAAGTTCATAAGGACCGCCGTTGTATAACCACTCATCAACAGATGCCGCTTCCCATATTGGGTAAAAATGCAAACCTATAGCTGCGGAAGTAGGAATAATGGCACCGGAGATAATATTGTTTCCATAAAGTAGAGACCCAGAAACA