# Biopython: Introduction

## Part Zero: A review of object-oriented programming in Python

<br>The core of Biopython is built around a few key **_objects_**, so let's take a few minutes to review what that means. 

<br>In Python, when you load in a piece of data, you _label_ it as a particular object type. You can then use the set of commands specific to that type, but not commands that are only available to other types.


<br>For example, strings and lists are two of the most common objects in Python. They have some things in common, but other commands and attributes are specific to only lists or only strings.

#### Let's say we have a piece of data that looks like this: Welcome <br>We can label this as a list or a string, making two different objects.

In [1]:
welcome_string = "Welcome"
welcome_list = ["W", "e", "l", "c", "o", "m", "e"]

<br>In some ways, these objects **_behave_** the same:

In [2]:
print(welcome_string[0])
print(welcome_list[0])

W
W


<br>In other ways, they **_behave_** differently:

In [3]:
excited_string = welcome_string + "!"
print(excited_string)

Welcome!


In [6]:
excited_list = welcome_list + "!"
print(excited_list)

TypeError: can only concatenate list (not "str") to list
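<br>The error arises because `+` on a list expects another list on the other side. A minimal sketch of two ways around it, reusing the variable names from the cells above:

```python
welcome_list = ["W", "e", "l", "c", "o", "m", "e"]

# Wrap the string in a list so that both operands of + are lists.
excited_list = welcome_list + ["!"]
print(excited_list)

# Going the other way: join the list elements into one string first.
excited_string = "".join(welcome_list) + "!"
print(excited_string)
```

Neither line changes `welcome_list` itself; each one builds a new object.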

<br><br>**_Attributes_** are both _properties_ of our object and _functions_ that can be applied to our object. Some are shared between object types and others are specific to one type. _They are added on at the end of our object, after a dot._

<br>You'll notice that many attributes are functions, so they are followed by parentheses. Some of those functions require you to enter **_parameters_** inside the parentheses:

In [7]:
welcome_list.append("!")
print(welcome_list)

['W', 'e', 'l', 'c', 'o', 'm', 'e', '!']


<br>Other functions do not require you to include parameters:

In [8]:
upper_string = excited_string.upper()
print(upper_string)

WELCOME!


<br>Some functions change the object:

In [9]:
welcome_list.reverse()
print(welcome_list)

['!', 'e', 'm', 'o', 'c', 'l', 'e', 'W']


<br>Others only **_return_** a changed copy, leaving the original object unchanged:

In [10]:
print(upper_string.replace("!", "?"))
print(upper_string)

WELCOME?
WELCOME!
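<br>One practical consequence, sketched below with throwaway lists named `letters` and `numbers` (not part of the notebook): functions that change an object in place usually return `None`, so assigning their result to a variable is a common mistake.

```python
letters = ["c", "a", "b"]

# list.sort() changes the list in place and returns None.
result = letters.sort()
print(result)
print(letters)

# sorted() leaves its argument alone and returns a new list.
numbers = [3, 1, 2]
ordered = sorted(numbers)
print(numbers)
print(ordered)
```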


<br>Some return only True or False:

In [None]:
upper_string.isupper()

<br>You can view all the attributes of an object by using the "dir" command:

In [11]:
dir(welcome_list)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

<br>You can see that some of the attributes start and end with two underscores. These are sometimes called **_dunder_** (double underscore) functions. Python calls them behind the scenes, for example when you use an operator such as + or a built-in function such as len(). Developers define them when writing new Python packages. You can call them directly, but they are not meant for everyday use.

<br>Not all object attributes are functions. Some are _properties_ of the object. These do not use parentheses. Let's use one of the dunder attributes for the list object to illustrate an example of this type of attribute.
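<br>To make the link between dunder functions and everyday syntax concrete, here is a small sketch using a throwaway list called `letters`. Each line asks the same question twice, once with ordinary syntax and once by calling the dunder function directly.

```python
letters = ["W", "e", "l"]

# len() calls __len__ behind the scenes.
print(len(letters), letters.__len__())

# The "in" operator calls __contains__.
print("W" in letters, letters.__contains__("W"))

# The + operator calls __add__.
print(letters + ["c"] == letters.__add__(["c"]))
```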

<br>list.\_\_doc__ gives us the documentation text for the list object.

In [12]:
welcome_list.__doc__

'Built-in mutable sequence.\n\nIf no argument is given, the constructor creates a new empty list.\nThe argument must be an iterable if specified.'
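<br>If you want to check which attributes are functions and which are plain properties, the built-in `callable()` function can help. The sketch below uses its own small list, `example_list`, so it runs independently of the cells above; the names `methods` and `properties` are only illustrative.

```python
example_list = ["W", "e", "l", "c", "o", "m", "e"]

methods = []
properties = []
for attribute_name in dir(example_list):
    # getattr looks up an attribute from its name given as a string.
    attribute = getattr(example_list, attribute_name)
    if callable(attribute):
        methods.append(attribute_name)
    else:
        properties.append(attribute_name)

print("append" in methods)
print("__doc__" in properties)
```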

___


## <br><br>Part One: The Seq object

<br>The first object we are going to work with is the **_Seq_** object. It represents a biological sequence (DNA, RNA, protein, etc.).



#### <br>Let's import the code only for the Seq object from the Biopython module:

In [13]:
from Bio.Seq import Seq

<br><br>We can enter a sequence from scratch and assign it to a variable. To do this, we call Seq, which we just imported, to let the computer know that this is a Seq object and not an ordinary string.

In [14]:
my_seq = Seq("AGTACACTGGT")
my_seq

Seq('AGTACACTGGT')

In [15]:
print(my_seq)

AGTACACTGGT


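<br><br>In the Biopython version used for this notebook, a Seq object has two data attributes: the _sequence_, and an _alphabet_ attribute that identifies how the sequence should be read (as DNA, as RNA, as protein, etc.). Note that the Bio.Alphabet module was removed in Biopython 1.78, so the alphabet cells below only run on earlier releases.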

In [16]:
my_seq.alphabet

Alphabet()

<br><br>Our example sequence is actually a DNA sequence. If it is important, we can tell the computer which alphabet to use. First, we need to import the alphabet for DNA. For more about IUPAC, check out [this link.](https://iupac.org/who-we-are/)

In [17]:
from Bio.Alphabet import IUPAC

In [18]:
my_seq.alphabet = IUPAC.unambiguous_dna

In [19]:
my_seq

Seq('AGTACACTGGT', IUPACUnambiguousDNA())

In [20]:
my_seq.alphabet

IUPACUnambiguousDNA()

You may notice that the alphabet attribute of the Seq object does not require parentheses. The alphabet is a characteristic of the object: it tells us something about what the object is. A function, by contrast, does something to the object.

<br>Let's view the different attributes of my_seq:

In [21]:
dir(my_seq)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_data',
 '_get_seq_str_and_check_alphabet',
 'alphabet',
 'back_transcribe',
 'complement',
 'count',
 'count_overlap',
 'endswith',
 'find',
 'join',
 'lower',
 'lstrip',
 'reverse_complement',
 'rfind',
 'rsplit',
 'rstrip',
 'split',
 'startswith',
 'strip',
 'tomutable',
 'transcribe',
 'translate',
 'ungap',
 'upper']

### <br>Exercise 1.

#### Using the my_seq object we created, try out a few of the attributes listed above. Try to identify 1 that does not use (), 1 that uses () with no parameters given, and 1 that uses () with a given parameter.

In [None]:
#Example of an attribute that uses () with a given parameter:
#my_seq.__contains__("A")

### <br>Exercise 2.

#### Create a new object called my_protein with a sequence of MALWMRLLPL. Tell the computer to use the IUPAC alphabet called "protein".

In [None]:
my_protein = Seq("MALWMRLLPL")
my_protein.alphabet = IUPAC.protein
#Or:
#my_protein = Seq("MALWMRLLPL", alphabet=IUPAC.protein)

#### <br>The computer knows this is a protein. Will that change which attributes can be applied to this sequence compared to my_seq? Try out a function that you think probably won't work with a protein sequence and see what happens.

In [None]:
my_protein.transcribe()

___

## <br><br>Part Two: Reading sequence files

So far we've worked with a sequence that we entered by hand. More often, however, we'll be working with sequences from files.


<br>We can use **_SeqIO_** (that's the letter O) from the Bio module to load many different types of sequence files.

[Click here to view a list of file types that work with Biopython](https://biopython.org/wiki/SeqIO)

In [22]:
from Bio import SeqIO

<br>**_SeqIO.parse()_** is the function for reading the file. It takes two parameters: the path to the file, and the file format as a lowercase string. <br>
Let's load in a GenBank sequence of the human insulin gene:

In [23]:
insulin_gene = SeqIO.parse("sequence.gb", "genbank")

In [24]:
print(insulin_gene)

<generator object parse at 0x111302e58>


<br>When you use SeqIO.parse, the computer creates a **_generator_** object. A generator reads records from the file one at a time as you loop over it, instead of storing them all in memory at once. This is great if you want to filter or loop through a very large file that contains many sequences or very long sequences.
<br><br>To access the data, we can loop through the generator object and read each sequence that is contained inside.

In [25]:
for i in insulin_gene:
    print(i)

ID: NC_000011.10
Name: NC_000011
Description: Homo sapiens chromosome 11, GRCh38.p13 Primary Assembly
Database cross-references: BioProject:PRJNA168, Assembly:GCF_000001405.39
Number of features: 32
/molecule_type=DNA
/topology=linear
/data_file_division=CON
/date=09-SEP-2019
/accessions=['NC_000011', 'REGION:', '2129117..2161209']
/sequence_version=10
/keywords=['RefSeq']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='Human chromosome 11 DNA sequence and analysis including novel gene identification', ...), Reference(title='Finishing the euchromatic sequence of the human genome', ...), Reference(title='Initial sequencing and analysis of the human genome', ...)]
/comment=REFSEQ INFORMATION: The reference sequence is identical to
CM000673.2.
On Feb 3, 2014 this 

<br>Our file only contained one sequence.
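<br>One consequence of using a generator is that it can only be looped through once. The sketch below illustrates this with a plain Python generator; `read_records` is a hypothetical stand-in for `SeqIO.parse`, not part of Biopython.

```python
def read_records(names):
    """Yield one record name at a time instead of building a full list."""
    for name in names:
        yield name

records = read_records(["seq1", "seq2", "seq3"])

first_pass = [name for name in records]
# The generator is now used up, so a second loop finds nothing.
second_pass = [name for name in records]

print(first_pass)
print(second_pass)
```

If you need to loop over a file again, call `SeqIO.parse` again to create a fresh generator, or store the records in a list or dictionary.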

<br> We can also load the whole file into memory as a **_dictionary_**. This might make the data easier to work with, especially if you are familiar with dictionaries. To do this, we pass the result of **_SeqIO.parse()_** to the **_SeqIO.to_dict()_** function:

In [26]:
insulin_dict = SeqIO.to_dict(SeqIO.parse("sequence.gb", "genbank"))

In [27]:
insulin_dict

{'NC_000011.10': SeqRecord(seq=Seq('TCATGACTTTTAATGCTTTATTGGGATTGCAAGCGTTACAAGGTTAAAGACAAA...GCT', IUPACAmbiguousDNA()), id='NC_000011.10', name='NC_000011', description='Homo sapiens chromosome 11, GRCh38.p13 Primary Assembly', dbxrefs=['BioProject:PRJNA168', 'Assembly:GCF_000001405.39'])}

<br>Note: The sequence id is used as the key in the dictionary.
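<br>Conceptually, building the dictionary amounts to looping over the records and storing each one under its id. The sketch below uses a hypothetical `Record` named tuple and a hypothetical helper `records_to_dict` in place of real SeqRecord objects and `SeqIO.to_dict`; it illustrates the idea and is not Biopython's code. Like `SeqIO.to_dict`, it raises an error on a duplicate id rather than silently overwriting an earlier record.

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord with just two attributes.
Record = namedtuple("Record", ["id", "seq"])

def records_to_dict(records):
    """Store each record under its id, refusing duplicate ids."""
    result = {}
    for record in records:
        if record.id in result:
            raise ValueError(f"Duplicate key '{record.id}'")
        result[record.id] = record
    return result

# Made-up example records.
example_records = [Record("seq1", "ACGT"), Record("seq2", "GGCC")]
example_dict = records_to_dict(example_records)
print(list(example_dict))
print(example_dict["seq2"].seq)
```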

### <br>Exercise 3.

#### Load in the FASTA file, sequence.fasta, as a dictionary called fasta_dict. You will use "fasta" as the file type.

In [79]:
fasta_dict = SeqIO.to_dict(SeqIO.parse("sequence.fasta", "fasta"))

#### This FASTA file was downloaded from NCBI. It contains human insulin sequences on chromosome 11 that are between 100 and 10,000 bp long.
#### <br>How many sequences are in fasta_dict?

In [80]:
len(fasta_dict)

52

#### Print the sequence with id "NG_052838.1".

In [81]:
fasta_dict["NG_052838.1"]

SeqRecord(seq=Seq('GGGCAAATGTCTCCAGGAGAGCAAAGCCCTCACCTGGGCCACTTTCCACATTAG...CCT', SingleLetterAlphabet()), id='NG_052838.1', name='NG_052838.1', description='NG_052838.1 Homo sapiens insulin repeat instability region (LOC109623489) on chromosome 11', dbxrefs=[])
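<br>Notice that the description above starts with the id. In a FASTA file each record begins with a header line starting with `>`; the first word of the header becomes the id and the whole header becomes the description. The sketch below is a simplified, hypothetical parser (`parse_fasta_text`) written only to illustrate that mapping on made-up data. For real files use `SeqIO.parse`, which handles many more cases.

```python
def first_word(header):
    """Return the first whitespace-separated word, or '' if there is none."""
    words = header.split()
    return words[0] if words else ""

def parse_fasta_text(text):
    """Return a list of (id, description, sequence) tuples from FASTA text."""
    records = []
    header = None
    sequence_lines = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((first_word(header), header, "".join(sequence_lines)))
            header = line[1:]
            sequence_lines = []
        else:
            sequence_lines.append(line)
    # Close the final record once the text runs out.
    if header is not None:
        records.append((first_word(header), header, "".join(sequence_lines)))
    return records

# Made-up example data, not taken from sequence.fasta.
example_text = """>seq1 first example sequence
ACGT
ACGT
>seq2 second example sequence
GGCC
"""

for seq_id, description, sequence in parse_fasta_text(example_text):
    print(seq_id, "|", description, "|", sequence)
```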

___

## <br><br>Part Three: The SeqRecord object

<br> There are several different sequence file types (too many!). Most contain more information than just the sequence. 


<br>A **_SeqRecord_** is another object that has its own attributes and functions. The **_Seq_** object that we learned about above is one attribute of the SeqRecord. The Seq object retains all of its own attributes and functions, so you can think of it as a sub-object inside the SeqRecord object. 
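<br>The idea of one object living inside another can be sketched with two tiny classes. `MiniSeq` and `MiniRecord` below are hypothetical and much simpler than Biopython's Seq and SeqRecord; they only illustrate how the inner object's functions are reached through the outer object.

```python
from dataclasses import dataclass

@dataclass
class MiniSeq:
    """Hypothetical, stripped-down sequence object."""
    letters: str

    def count(self, letter):
        return self.letters.count(letter)

@dataclass
class MiniRecord:
    """Hypothetical record that holds a MiniSeq plus some metadata."""
    seq: MiniSeq
    id: str
    description: str

record = MiniRecord(seq=MiniSeq("AGTACACTGGT"), id="example1", description="made-up record")

# An attribute of the record itself.
print(record.id)
# A function of the inner sequence object, reached through record.seq.
print(record.seq.count("A"))
```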


<br>When we loaded our insulin sequence above, both as a generator object and as a dictionary, SeqIO labelled each record as a SeqRecord object. Let's look at one SeqRecord:

In [31]:
insulin = insulin_dict['NC_000011.10']

In [32]:
insulin

SeqRecord(seq=Seq('TCATGACTTTTAATGCTTTATTGGGATTGCAAGCGTTACAAGGTTAAAGACAAA...GCT', IUPACAmbiguousDNA()), id='NC_000011.10', name='NC_000011', description='Homo sapiens chromosome 11, GRCh38.p13 Primary Assembly', dbxrefs=['BioProject:PRJNA168', 'Assembly:GCF_000001405.39'])

Let's see what attributes this object has:

In [33]:
dir(insulin)

['__add__',
 '__bool__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__le___',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_per_letter_annotations',
 '_seq',
 '_set_per_letter_annotations',
 '_set_seq',
 'annotations',
 'dbxrefs',
 'description',
 'features',
 'format',
 'id',
 'letter_annotations',
 'lower',
 'name',
 'reverse_complement',
 'seq',
 'translate',
 'upper']

<br>We can print attributes from our record:

In [34]:
insulin.description

'Homo sapiens chromosome 11, GRCh38.p13 Primary Assembly'

In [35]:
insulin.seq

Seq('TCATGACTTTTAATGCTTTATTGGGATTGCAAGCGTTACAAGGTTAAAGACAAA...GCT', IUPACAmbiguousDNA())

In [39]:
insulin.seq.alphabet

IUPACAmbiguousDNA()

___

## <br><br>Part Four: Parsing records

<br> Now we can filter and transform our sequences in bulk. Let's write a loop to make a new dictionary that holds the reverse complements of all the sequences in fasta_dict.

In [86]:
#In case you changed it in the previous exercise, we will reload the dictionary first.
fasta_dict = SeqIO.to_dict(SeqIO.parse("sequence.fasta", "fasta"))

In [87]:
reverse_dict = {}

In [88]:
#The name seq_id avoids hiding Python's built-in id() function.
for seq_id, record in fasta_dict.items():
    reverse_dict[seq_id] = record.reverse_complement()

    

<br> We can test the output by looking at the same sequence in the original dictionary and in the reversed dictionary:

In [89]:
fasta_dict["NG_052838.1"]

SeqRecord(seq=Seq('GGGCAAATGTCTCCAGGAGAGCAAAGCCCTCACCTGGGCCACTTTCCACATTAG...CCT', SingleLetterAlphabet()), id='NG_052838.1', name='NG_052838.1', description='NG_052838.1 Homo sapiens insulin repeat instability region (LOC109623489) on chromosome 11', dbxrefs=[])

In [90]:
reverse_dict["NG_052838.1"]

SeqRecord(seq=Seq('AGGCAGCCAGCAGGGAGGGGACCCCTCCCTCACTCCCACTCTCCCACCCCCACC...CCC', SingleLetterAlphabet()), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[])
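<br>As a sanity check on what a reverse complement is, the sketch below computes one in plain Python for two short DNA strings. The helper name `reverse_complement_text` is hypothetical, and it only handles the four unambiguous uppercase bases; Biopython's own function handles more. The original record above ends in CCT and the new record begins with AGG, which is what the sketch gives for CCT.

```python
# Map each base to its complementary base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement_text(dna):
    """Complement every base, then reverse the order."""
    return dna.translate(COMPLEMENT)[::-1]

# The last three bases shown for the original record.
end_of_original = reverse_complement_text("CCT")
print(end_of_original)

# The first nine bases shown for the original record.
start_of_original = reverse_complement_text("GGGCAAATG")
print(start_of_original)
```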

<br>Other than the change in the sequence, what differences do you see between the two records?

### <br>Exercise 4

#### Our fasta_dict contains some minisatellite sequences, which we do not want to include. (Minisatellites are parts of the genome with a high number of repeated DNA segments and are useful for studying population level genetic diversity.) Minisatellites are short, but we cannot filter by length because there are some short exon sequences that we do want to keep. <br>
#### <br> Create a new dictionary called reduced_dict. Use a loop to include only sequences that do not contain the word "minisatellite" in the SeqRecord description.

In [91]:
reduced_dict = {}

In [92]:
for i_d, record in fasta_dict.items():
    if "minisatellite" not in record.description:
        reduced_dict[i_d] = record
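
<br>The same filter can also be written as a dictionary comprehension. The sketch below uses made-up descriptions stored as plain strings, rather than SeqRecord objects, so it runs on its own; with real records you would test `record.description` as in the loop above.

```python
# Made-up ids and descriptions standing in for fasta_dict.
example_descriptions = {
    "id1": "Homo sapiens insulin (INS), exon 1",
    "id2": "Homo sapiens minisatellite near insulin gene",
    "id3": "Homo sapiens insulin (INS), exon 2",
}

# Keep only the entries whose description lacks the word "minisatellite".
kept = {
    seq_id: description
    for seq_id, description in example_descriptions.items()
    if "minisatellite" not in description
}
print(list(kept))
```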

#### How many sequences are in the reduced_dict?

In [101]:
len(reduced_dict)

19

In [99]:
for record in reduced_dict.values():
    print(len(record))

3943
7496
1393
4992
598
8416
186
192
4965
9660
5120
272
250
4156
632
703
520
190
490


## Part Five: Writing files