# Introduction to Biopython 3

### Key concepts

 - **SeqRecord object:** _Biopython_ uses the `SeqRecord` object to hold a sequence (as a `Seq` object) together with its identifiers (`id` and `name`), a `description` and, optionally, `annotations` and other sub-features. 
  <details>
    <summary>
        <span style="color: purple">Click here to display more information</span>
    </summary>
    <p>The <code>SeqRecord</code> object is used in <em>Biopython</em> to hold a sequence, as a <a href="#Seq"><code>Seq</code></a> object, together with the identifiers <code>id</code> and <code>name</code>, a description and, optionally, annotations and sub-features. The following table lists the <code>SeqRecord</code> attributes and the information they hold.</p>

    <blockquote>
            <table>
                <tr>
                   <td>.seq</td>
                   <td>Seq object containing a sequence</td>
                </tr>
                <tr>
                   <td>.id</td>
                   <td>Primary ID used to identify the sequence in a string format</td>
                </tr>
                <tr>
                   <td>.name</td>
                   <td>Sequence name in a string format. In some cases this will be the same as the accession number, but it could also be a clone name</td>
                </tr>
                <tr>
                   <td>.description</td>
                   <td>Brief description or expressive name for the sequence</td>
                </tr>
                <tr>
                   <td>.letter_annotations</td>
                   <td>
                   Dictionary of additional information about the letters in the sequence.
                   The keys are the name of the information (e.g. "phred_quality") and the value (as a list, tuple, string,...) has the same length as the
                   sequence itself (e.g. [40, 40, 38, 30, ...]).
                   </td>
                </tr>
                <tr>
                   <td>.annotations</td>
                   <td>
                    A dictionary of additional information about the sequence. The keys
                    are the name of the information, and the information is contained in
                    the value.
                   </td>
                </tr>
                 <tr>
                   <td>.features</td>
                   <td>
                   A list of SeqFeature objects with more structured information about the
                   features on a sequence (e.g. position of genes on a genome, or domains
                   on a protein sequence). For more details, see section 4.3 of the [documentation][docu].
                   </td>
                </tr>
                <tr>
                   <td>.dbxrefs</td>
                   <td>A list of database cross-references as strings (e.g. ['Project:58037']).</td>
                </tr>
             </table></blockquote>
    
    We will mainly be using the first 4 attributes. For example, the `example1.fa` file contains only two lines:
                
    >example1 this is a simple example<br>
    GATTACA-A
    
</details>



 - **Seq object:** The `Seq` object stored in the `seq` attribute of a `SeqRecord` is the minimum information needed to create an instance of that class. It consists of a sequence that behaves much like a `Python` string: it offers many of the same methods along with additional, sequence-specific ones. 
<details>
    <summary><span style="color: purple">Click here to display more information</span></summary>
    <p>The <code>Seq</code> attribute is the minimum information needed to create an instance of a <code>SeqRecord</code>. Like <code>SeqRecord</code>, the <code>Seq</code> object has its own set of attributes and its own module <code>Bio.Seq</code> which can be imported using <code>from Bio.Seq import Seq</code>.</p>
    <p>Keep in mind that, like <code>Python</code> strings, <code>Seq</code> objects do not support item assignment. To modify one, we need to convert it into a <code>MutableSeq</code> object: import the class using <code>from Bio.Seq import MutableSeq</code> and convert the <code>Seq</code> sequence through reassignment (e.g. <code>sequence = MutableSeq(sequence)</code>).</p>

</details>


 
Each of these objects is available in its own module and can be imported using `from Bio.SeqRecord import SeqRecord` and `from Bio.Seq import Seq`.
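
As a minimal sketch of the two objects described above, the following builds a `SeqRecord` from the two-line FASTA content of `example1.fa` and then edits its sequence through a `MutableSeq`. The FASTA text is read from an in-memory string rather than from disk, an assumption made so that the snippet runs on its own, and all variable names are illustrative.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import MutableSeq, Seq
from Bio.SeqRecord import SeqRecord

# Same two lines as example1.fa, held in memory instead of on disk.
fasta_text = ">example1 this is a simple example\nGATTACA-A\n"
record = SeqIO.read(StringIO(fasta_text), "fasta")

# The first four SeqRecord attributes.
print(record.id)           # first word of the header line
print(record.name)         # for FASTA input, the same as the id
print(record.description)  # the whole header line
print(record.seq)          # a Seq object

# Seq behaves much like a string and adds sequence-specific methods.
dna = Seq("GATTACA")
print(len(dna), dna.count("A"), dna.reverse_complement())

# Seq does not support item assignment; MutableSeq does.
mutable = MutableSeq(record.seq)
mutable[0] = "C"
edited = Seq(str(mutable))  # back to an immutable Seq

# A SeqRecord can also be built by hand; only the Seq is mandatory.
manual = SeqRecord(edited, id="example1_edited", description="edited copy")
print(manual.id, manual.seq)
```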


_____

## ENTREZ


We have seen how we can parse sequence data from a local file. Now we will see
how we can use a network connection to download and parse sequences from the
internet into a `SeqRecord` object. [_Entrez_](https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html) is a data retrieval system that allows access to [_NCBI's_](https://www.ncbi.nlm.nih.gov/home/about/) databases. Currently it includes 38 databases
covering a variety of biomedical data.

_Biopython’s_ `Bio.Entrez` module allows you to search _PubMed_ or download
_GenBank_ records from within a Python script. The module uses the _Entrez
Programming Utilities_ (EUtils), which consist of eight tools, each of which
corresponds to a Python function. For example, three of the eight are:


 - **`Entrez.esearch()`** returns a list of IDs of the records that contain a
 specific term in a specific database.
 - **`Entrez.esummary()`** retrieves document summaries from a list of primary
 IDs.
 - **`Entrez.efetch()`** retrieves a full record from _Entrez_ in a requested
 file format.

<details>
    <summary>
        <span style="color: purple">Click here to display more information</span>
    </summary>
    There are another five <em>EUtils</em> functions described in sections 9.1 to 9.9 of <a href="http://biopython.org/DIST/docs/tutorial/Tutorial.pdf">Biopython's documentation</a>, and you can find more information on <a href="https://www.ncbi.nlm.nih.gov/books/NBK25501/">Entrez Programming Utilities on NCBI's page</a>. For now, just know that <em>EUtils</em> tools ensure that the correct URL is used for the queries, and that NCBI's guidelines for responsible data access are being followed.
</details>


A common pipeline for importing sequence files in this way is the
following.

1. Query `Entrez.egquery()` to get an overview of the number of records that
match a given term in each database.
2. Use the function `Entrez.esearch()`, giving a selected database and the term,
to retrieve the list of IDs of the records that match the search term.
3. Use `Entrez.efetch()` to download the records with the IDs we just
obtained, requesting a specific file format.
4. Finally, use `Bio.SeqIO.parse()` to parse the data we downloaded into
`SeqRecord` objects.
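
Steps 2 to 4 above can be sketched as a single function. This is an illustrative outline, not the tutorial's own code: `fetch_seqrecords` and its parameters are hypothetical names, and `retmax` is used only to keep the download small. Calling it needs a network connection, and `Entrez.email` should be set first.

```python
from Bio import Entrez, SeqIO


def fetch_seqrecords(term, db="nucleotide", retmax=3):
    """Search `db` for `term` and return the first `retmax` hits as SeqRecords.

    Hypothetical helper: it sends requests to NCBI when called.
    """
    # Step 2: get the IDs of the records that match the term.
    with Entrez.esearch(db=db, term=term, retmax=retmax) as search_handle:
        ids = Entrez.read(search_handle)["IdList"]
    if not ids:
        return []

    # Step 3: download those records in GenBank flat-file format.
    with Entrez.efetch(db=db, id=ids, rettype="gb", retmode="text") as fetch_handle:
        # Step 4: parse the downloaded text into SeqRecord objects.
        return list(SeqIO.parse(fetch_handle, "genbank"))
```

For example, `fetch_seqrecords("5-hydroxytryptamine receptor")` would return a short list of records whose `id`, `description` and `seq` attributes can be inspected as usual.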

As part of [_NCBI’s Entrez User Requirements_](https://www.ncbi.nlm.nih.gov/books/NBK25497/), `Bio.Entrez` expects you to provide an email address with each call to _Entrez_. This address is used only to contact developers if _NCBI_ observes requests that violate its policies, before access is blocked. 


In [1]:
from Bio import Entrez
#Entrez.email = 'yourname.surname@alum.esci.upf.edu'
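
One possible way to set the address without hard-coding it in the notebook is sketched below. The environment variable names `NCBI_EMAIL` and `NCBI_API_KEY` are assumptions made for this example, and the fallback address is only a placeholder that should be replaced with your own.

```python
import os

from Bio import Entrez

# NCBI_EMAIL and NCBI_API_KEY are hypothetical environment variable names.
# The fallback is a placeholder address; use your real one before sending requests.
Entrez.email = os.environ.get("NCBI_EMAIL", "A.N.Other@example.com")

# Optional: name the script that makes the requests (the default is "biopython").
Entrez.tool = "intro-to-biopython-notebook"

# Optional: an NCBI API key, if you have one.
api_key = os.environ.get("NCBI_API_KEY")
if api_key:
    Entrez.api_key = api_key

print(Entrez.email, Entrez.tool)
```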

Use `Entrez.esearch()` to retrieve the list of IDs of the records in a selected database that match a given search term.

In [2]:
with Entrez.esearch(db="nucleotide", term="5-hydroxytryptamine receptor") as search_handle:
    search_results = Entrez.read(search_handle)
[print(e, search_results[e]) for  e in search_results]

Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.


Count 38710
RetMax 20
RetStart 0
IdList ['192447382', '113682342', '2569476866', '2569476864', '2569470074', '2569467569', '2569465640', '2569443331', '2569443328', '2566565164', '2566564934', '2566564888', '2566564851', '2566564791', '2568250319', '2568203924', '2568203921', '2568178572', '2568162887', '2568155347']
TranslationSet []
TranslationStack [{'Term': '5-hydroxytryptamine[All Fields]', 'Field': 'All Fields', 'Count': '39167', 'Explode': 'N'}, {'Term': 'receptor[All Fields]', 'Field': 'All Fields', 'Count': '5345978', 'Explode': 'N'}, 'AND', 'GROUP']
QueryTranslation 5-hydroxytryptamine[All Fields] AND receptor[All Fields]


[None, None, None, None, None, None, None]

In [3]:
list_of_ids = search_results["IdList"] #list of IDs
list_of_ids

['192447382', '113682342', '2569476866', '2569476864', '2569470074', '2569467569', '2569465640', '2569443331', '2569443328', '2566565164', '2566564934', '2566564888', '2566564851', '2566564791', '2568250319', '2568203924', '2568203921', '2568178572', '2568162887', '2568155347']
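
The search reports a `Count` of 38710, yet `IdList` holds only 20 IDs, because `RetMax` defaults to 20. A minimal sketch of how more IDs could be collected in batches with the E-utilities `retstart` and `retmax` parameters follows. `batch_offsets`, `search_ids` and the `limit` cap are hypothetical helpers written for this example, not part of Biopython.

```python
from Bio import Entrez


def batch_offsets(total, batch_size):
    """Return the retstart value of each batch needed to cover `total` hits."""
    return list(range(0, total, batch_size))


def search_ids(term, db="nucleotide", batch_size=500, limit=2000):
    """Collect up to `limit` IDs for `term`, `batch_size` IDs per request.

    Hypothetical helper: it sends requests to NCBI when called.
    """
    # A first request with retmax=0 asks for the hit count without any IDs.
    with Entrez.esearch(db=db, term=term, retmax=0) as handle:
        total = int(Entrez.read(handle)["Count"])

    ids = []
    for start in batch_offsets(min(total, limit), batch_size):
        with Entrez.esearch(
            db=db, term=term, retstart=start, retmax=batch_size
        ) as handle:
            ids.extend(Entrez.read(handle)["IdList"])
    return ids[:limit]


# The offsets for 45 hits fetched 20 at a time.
print(batch_offsets(45, 20))
```

Capping the total with `limit` keeps the number of requests small, in line with NCBI's guidelines for responsible data access.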

Use `Entrez.efetch()` to download the records with the IDs we just obtained, requesting a specific file format.

In [4]:
with Entrez.efetch(db="nucleotide", id=list_of_ids, rettype="gb", retmode="text") as fetch_handle:
    data = fetch_handle.read()
data

'LOCUS       NM_001128709            1325 bp    mRNA    linear   VRT 27-AUG-2023\nDEFINITION  Danio rerio 5-hydroxytryptamine (serotonin) receptor 1B (htr1b),\n            mRNA.\nACCESSION   NM_001128709 XM_685061\nVERSION     NM_001128709.1\nKEYWORDS    RefSeq.\nSOURCE      Danio rerio (zebrafish)\n  ORGANISM  Danio rerio\n            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;\n            Actinopterygii; Neopterygii; Teleostei; Ostariophysi;\n            Cypriniformes; Danionidae; Danioninae; Danio.\nREFERENCE   1  (bases 1 to 1325)\n  AUTHORS   Huang CX, Zhao Y, Mao J, Wang Z, Xu L, Cheng J, Guan NN and Song J.\n  TITLE     An injury-induced serotonergic neuron subpopulation contributes to\n            axon regrowth and function restoration after spinal cord injury in\n            zebrafish\n  JOURNAL   Nat Commun 12 (1), 7093 (2021)\n   PUBMED   34876587\n  REMARK    Publication Status: Online-Only\nREFERENCE   2  (bases 1 to 1325)\n  AUTHORS   Yang H, Liang 

Save as a new file: <span style="color: green">'5htr.gb'</span>

In [5]:
with open('5htr.gb', 'w') as file_out:
    file_out.write(data)
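
The last step of the pipeline, parsing the saved file into `SeqRecord` objects, could look like the sketch below. `summarize_records` is a hypothetical helper that accepts either a file name such as `'5htr.gb'` or an open handle. The in-memory FASTA text is only there so that the snippet runs without the downloaded file.

```python
from io import StringIO

from Bio import SeqIO


def summarize_records(source, fmt="genbank"):
    """Return (id, sequence length, description) for every record in `source`."""
    return [
        (rec.id, len(rec.seq), rec.description)
        for rec in SeqIO.parse(source, fmt)
    ]


# Offline demonstration with an in-memory FASTA record.
demo_handle = StringIO(">example1 this is a simple example\nGATTACA-A\n")
demo = summarize_records(demo_handle, "fasta")
print(demo)

# With the GenBank file written above, the call would be:
# summarize_records("5htr.gb")
```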