# Introduction to Biopython  1

*Biopython* simplifies working with bioinformatics files and operations: a few function calls replace what would otherwise take many more lines of code. This document introduces *Biopython*'s basic classes and some of its other resources, along with a set of activities to help you get familiar with these tools.
 

### Key concepts

 - **SeqRecord object:** The `SeqRecord` object is used in _Biopython_ to hold a sequence, as a `Seq` object, together with the identifiers `id` and `name`, a `description` and, optionally, `annotations` and other sub-features. 
  <details>
    <summary>
        <span style="color: purple">Click here to display more information</span>
    </summary>
    <p>The <code>SeqRecord</code> object is used in <em>Biopython</em> to hold a sequence, as a <a href="#Seq"><code>Seq</code></a> object, together with the identifiers <code>id</code> and <code>name</code>, a description and, optionally, annotations and sub-features. The following table lists the <code>SeqRecord</code> attributes and the information they hold.</p>

    <blockquote>
            <table>
                <tr>
                   <td>.seq</td>
                   <td>Seq object containing a sequence</td>
                </tr>
                <tr>
                   <td>.id</td>
                   <td>Primary ID used to identify the sequence in a string format</td>
                </tr>
                <tr>
                   <td>.name</td>
                   <td>In some cases this will be the same as the accession number, but it could also be a clone name, in string format</td>
                </tr>
                <tr>
                   <td>.description</td>
                   <td>Brief description or expressive name for the sequence</td>
                </tr>
                <tr>
                   <td>.letter_annotations</td>
                   <td>
                   Dictionary of additional information about the letters in the sequence.
                   The keys are the names of the information (e.g. "phred_quality") and each value (a list, tuple, string, ...) has the same length as the
                   sequence itself (e.g. [40, 40, 38, 30, ...]).
                   </td>
                </tr>
                <tr>
                   <td>.annotations</td>
                   <td>
                    A dictionary of additional information about the sequence. The keys
                    are the name of the information, and the information is contained in
                    the value.
                   </td>
                </tr>
                 <tr>
                   <td>.features</td>
                   <td>
                   A list of SeqFeature objects with more structured information about the
                   features on a sequence (e.g. position of genes on a genome, or domains
                   on a protein sequence). See more on section 4.3 of the [documentation][docu].
                   </td>
                </tr>
                <tr>
                   <td>.dbxrefs</td>
                   <td>A list of database cross-references as strings (e.g. ['Project:58037']).</td>
                </tr>
             </table></blockquote>
    
    We will mainly be using the first 4 attributes. For example, the `example1.fa` file contains only two lines:
                
    >example1 this is a simple example<br>
    GATTACA-A
    
</details>



 - **Seq object:** A `Seq` object is the minimum information needed to create a `SeqRecord` instance, which stores it in its `seq` attribute. It consists of a sequence that behaves like a `Python` `string`, offering many of the same methods along with additional ones. 
<details>
    <summary><span style="color: purple">Click here to display more information</span></summary>
    <p>The <code>Seq</code> attribute is the minimum information needed to create an instance of a <code>SeqRecord</code>. Like <code>SeqRecord</code>, the <code>Seq</code> object has its own set of attributes and its own module <code>Bio.Seq</code> which can be imported using <code>from Bio.Seq import Seq</code>.</p>
    <p>Keep in mind that, like <code>Python</code> strings, <code>Seq</code> objects do not support item assignment. To modify them, we need to transform them into a <code>MutableSeq</code> object: import the class using <code>from Bio.Seq import MutableSeq</code> and convert the <code>Seq</code> sequence through reassignment (e.g. <code>sequence = MutableSeq(sequence)</code>).</p>

</details>
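
The conversion to `MutableSeq` described above can be sketched as follows. This is a minimal illustration that assumes a recent Biopython release in which `MutableSeq` accepts a `Seq` directly; the variable name `mutable_sequence` is chosen only for this example.

```python
from Bio.Seq import Seq, MutableSeq

sequence = Seq("GATTACA")

# Seq objects are immutable, so item assignment raises a TypeError
try:
    sequence[0] = "C"
except TypeError as error:
    print("Cannot modify a Seq in place:", error)

# Convert to a MutableSeq to edit individual positions
mutable_sequence = MutableSeq(sequence)
mutable_sequence[0] = "C"
print(mutable_sequence)  # CATTACA

# Convert back to an immutable Seq once the edits are done
sequence = Seq(mutable_sequence)
print(sequence)  # CATTACA
```
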


 
Both of these objects are available in their own module and can be imported using `from Bio.SeqRecord import SeqRecord` and `from Bio.Seq import Seq`.
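
As a rough sketch of how the two classes fit together, the snippet below builds by hand a `SeqRecord` holding the `example1.fa` entry shown earlier. The values given to `id`, `name` and `description` follow the usual FASTA convention (first word of the header as identifier, whole header as description); treat them as an assumption made for this example rather than the output of a parser.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Build the record described by the two-line FASTA entry
# ">example1 this is a simple example" / "GATTACA-A"
record = SeqRecord(
    Seq("GATTACA-A"),
    id="example1",
    name="example1",
    description="example1 this is a simple example",
)

# The four attributes used most often in this document
print(record.id)
print(record.name)
print(record.description)
print(record.seq)

# The length of a record is the length of its sequence
print(len(record))  # 9
```
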


------

### 1. Working with `Bio.Seq` objects

As previously stated, a `Seq` object consists of a sequence that behaves like a `Python` `string`: it supports many of the same operations (e.g. `len()` and `count()`) along with additional methods such as `translate()`, `complement()` and `reverse_complement()`. The following cell creates a `Seq` instance and lists all of its available attributes and methods. You can find more information on the `Seq` methods in sections 3.1 to 3.8 of the [Biopython Documentation](http://biopython.org/DIST/docs/tutorial/Tutorial.pdf).

In [1]:
from Bio.Seq import Seq

seq1 = Seq("GATTACA")

# Print the name of every attribute and method available on the Seq object
for att in dir(seq1):
    print(att)

__add__
__class__
__contains__
__delattr__
__dict__
__dir__
__doc__
__eq__
__format__
__ge__
__getattribute__
__getitem__
__gt__
__hash__
__imul__
__init__
__init_subclass__
__le__
__len__
__lt__
__module__
__mul__
__ne__
__new__
__radd__
__reduce__
__reduce_ex__
__repr__
__rmul__
__setattr__
__sizeof__
__str__
__subclasshook__
__weakref__
_data
back_transcribe
complement
complement_rna
count
count_overlap
encode
endswith
find
index
join
lower
lstrip
reverse_complement
reverse_complement_rna
rfind
rindex
rsplit
rstrip
split
startswith
strip
tomutable
transcribe
translate
ungap
upper

Additionally, you can display the definition of a method, along with its arguments and examples, by using `help()`:

In [2]:
# Uncomment either line to display the documentation of that method
# help(Seq.translate)
# help(Seq.find)

The cell below shows some of these methods in use. When you run it, you will get a **warning** that the **sequence length is not a multiple of three**, suggesting that you either trim the sequence or add trailing N bases before calling `translate()`. Otherwise the **residual bases are not accounted for**: in this case the amino acids obtained are DY, because the final A does not form a complete codon.

In [3]:
print(seq1)
print("There are",seq1.count("A"), "Adenines")
print("The sequence is ",len(seq1), "bp long")
print(seq1.reverse_complement())
print(seq1.complement())
print(seq1.transcribe())
print(seq1.translate())

GATTACA
There are 3 Adenines
The sequence is  7 bp long
TGTAATC
CTAATGT
GAUUACA
DY
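
One way to avoid the warning is to trim the sequence to a whole number of codons before translating. The following is a minimal sketch of that idea; the names `remainder` and `trimmed` are introduced only for this illustration.

```python
from Bio.Seq import Seq

seq1 = Seq("GATTACA")

# Number of trailing bases that do not fit into a complete codon
remainder = len(seq1) % 3

# Keep only whole codons; when remainder is 0 the sequence is left untouched
trimmed = seq1[: len(seq1) - remainder]

print(trimmed)              # GATTAC
print(trimmed.translate())  # DY
```
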




### <span style='color: blue'>Activity 1. PSEUDOCODE EXERCISES</span>
    

Earlier this week we introduced you to the following exercises. Our goal now is for you to use the methods available on `Seq` objects and see first hand the advantages of **using _Biopython's Seq_** methods whenever you can. You can use the following `sequence` variables to test your code; where it matters, the best input is specified. 

In [4]:
seq2 = Seq("ATG-GATTACA-TGATTTT")
seq3 = Seq("ATGGCCGGGATTTGCTAG")
seq4 = Seq("STRESSED")

------

<span style='color: blue'> **Problem 1: DNA to RNA**: </span>
    
Write code that, given a DNA sequence in the form of a `Seq` object, outputs its RNA version. **Ignore start and stop codons**; simply replace _thymines_ (T) with _uracils_ (U).

<blockquote>
Example: 

- Input: _GATTACA_
- Output: _GAUUACA_
</blockquote>

In [15]:
print(seq1. ...)
print(seq2. ...)

GAUUACA
AUG-GAUUACA-UGAUUUU


---------

<span style='color: blue'> **Problem 2: Find coding region**</span>

The reading frame starts whenever _ATG_ is encountered and stops whenever _TAA, TAG, or TGA_ is encountered. Write code that, given a DNA strand, finds the coding region, not including the start and stop codons. Note that if there is no start codon or no stop codon, the code should return nothing (an empty string). <span style='color: green'>Use seq2 and seq3</span>

<blockquote>
Example: 

-  Input: ATG-GATTACA-TGATTT
-  Output: -GATTACA-
</blockquote>
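
Two `Seq` behaviours are useful for this problem: `find()` and slicing. The sketch below shows them on an unrelated sequence (the variable `example` is not one of the exercise inputs), so it illustrates the building blocks without solving the exercise.

```python
from Bio.Seq import Seq

example = Seq("GATTACA")  # illustrative sequence, not an exercise input

# find() returns the index of the first match, as str.find() does
print(example.find("TAC"))  # 3

# find() returns -1 when the pattern is absent
print(example.find("ATG"))  # -1

# Slicing a Seq yields another Seq, so its methods remain available
codon = example[3:6]
print(codon)                 # TAC
print(type(codon).__name__)  # Seq
```
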

In [6]:
seq = seq2

start = ... # use find to get start pos 
end = ... # if no stop codon out will be empty

for # for the rest of the sequence in steps of 3
    
    if  # find a stop
        end = ...
        break

out = seq[...] # resulting string excluding start and stop codons

print(out)

-GATTACA-


------

<span style='color: blue'> **Problem 3: GC content**</span>

The GC content is the percentage of _guanine_ (G) and _cytosine_ (C) nucleotides in a DNA or RNA molecule. Write code that, given a DNA or RNA strand, outputs its GC content **as a percentage rounded to one decimal place**.

<blockquote>
Example:

- Input: ATGGATTACATGATTT
- Output: _25.0%_
</blockquote>



In [8]:
seq = seq2

GC = ...

print(round(GC, 1), "%", sep ="")

21.1%



<span style='color: green'> NOTE: You were supposed to solve this problem using `count()`. However, the _GC content_ can also be computed by importing the `GC` function from `Bio.SeqUtils`, like this:</span> 

```python
from Bio.SeqUtils import GC

# Use a separate name for the result so the GC function is not overwritten
gc_content = GC(seq2)
print(gc_content)
```
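
Newer Biopython releases deprecate `GC` in favour of `gc_fraction`, which returns a fraction between 0 and 1 instead of a percentage. The sketch below assumes Biopython 1.80 or later and uses the gap-free example input from the problem statement, since the two functions may treat gap characters differently.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction  # assumes Biopython 1.80 or later

seq = Seq("ATGGATTACATGATTT")  # gap-free example input from Problem 3

# gc_fraction returns a value between 0 and 1, so scale it to a percentage
gc_percent = 100 * gc_fraction(seq)

print(round(gc_percent, 1), "%", sep="")  # 25.0%
```
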

-----

<span style='color: blue'> **Problem 4: Flip a string**</span>

We read words left to right. Write code that, given a word, flips it: it should output the word read from right to left.  <span style='color: green'> Use seq4 as a recommendation</span>

<blockquote>
    Example:

- Input: _STRESSED_
- Output: _DESSERTS_
</blockquote>

In [9]:
seq = seq4

revcompl = ...

comp =  ...

print(comp)

DESSERTS


-----

### Translation tables

`Seq` objects provide the `translate()` method, which performs translation using internal codon table objects derived from NCBI. We will only import the Standard table, but there are many more; see the [NCBI site for more information](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi).
We can import codon tables using `from Bio.Data import CodonTable`. 

In [10]:
from Bio.Data import CodonTable

codons = CodonTable.unambiguous_dna_by_id[1]
print(codons, end = "\n\n")
print(codons.forward_table["ATG"])
print(codons.start_codons) #check that the standard code currently allows initiation
#                           from TTG and CTG in addition to ATG.
print(codons.stop_codons)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
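
To illustrate why the choice of table matters, the sketch below compares the Standard table (NCBI id 1) with the Vertebrate Mitochondrial table (NCBI id 2) on a short made-up sequence. The variable names `standard` and `mitochondrial` are introduced only for this example.

```python
from Bio.Seq import Seq
from Bio.Data import CodonTable

seq = Seq("ATGTGA")  # made-up sequence: a start codon followed by TGA

standard = CodonTable.unambiguous_dna_by_id[1]
mitochondrial = CodonTable.unambiguous_dna_by_id[2]

# TGA is a stop codon in the standard table ...
print("TGA" in standard.stop_codons)       # True
# ... but encodes tryptophan (W) in the vertebrate mitochondrial table
print(mitochondrial.forward_table["TGA"])  # W

# translate() selects the table through its NCBI id
print(seq.translate(table=1))  # M*
print(seq.translate(table=2))  # MW
```
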

#### Extra Activity

Use the `codons` variable to write code that translates the sequence _"ATGGCCGGGATTTGCTAG"_. Check whether it is correct by running: `Seq.translate(Seq("ATGGCCGGGATTTGCTAG"))`

In [11]:
seq = Seq("ATGGCCGGGATTTGCTAG")
translated_seq = []
inside = False # flag: becomes True once a start codon has been seen

for pos  in range(0, len(seq), 3): # not exploring different reading frames
    
    codon = str(seq[...]) 
    
    if  ... : # check for start
        inside = ... # flag
    
    if  # inside coding region
        
        if .... : #check for stop
            break 
        translated_seq.append(...) #add current codon's aa
        
        
print("".join(translated_seq))
print(Seq.translate(seq))

MAGIC
MAGIC*


Additionally, we can use `from Bio.Data import IUPACData` to import dictionaries that convert amino acid symbols from one-letter to three-letter codes (e.g. "G" to "Gly").
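
A minimal sketch of that conversion, using the `protein_letters_1to3` dictionary from `IUPACData`; the peptide string and the hyphen separator are choices made for this illustration.

```python
from Bio.Data import IUPACData

# Dictionary mapping one-letter amino acid codes to three-letter codes
one_to_three = IUPACData.protein_letters_1to3

peptide = "MAGIC"  # the translation obtained in the extra activity, without the stop symbol

# Look up each residue and join the three-letter codes with hyphens
three_letter = "-".join(one_to_three[aa] for aa in peptide)
print(three_letter)  # Met-Ala-Gly-Ile-Cys
```

Note that the stop symbol `*` is not an amino acid code, so it should be removed before the lookup.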