# BTM2018 - Advanced Python

This session aims to cover some more advanced areas of Python that will allow you to write more complicated scripts and programs. This worksheet contains information about each topic, example code and some exercises. We will cover:
* Classes
* List comprehensions
* _Error Handling_
* _Debugging and Logging_
* REST web access (for instance to Ensembl)
* Example Biological libraries

The largest part of the session will be devoted to classes and basic Object Oriented Programming in Python, which is one of the most common methods for implementing modern software (in Python and beyond). We will start by briefly covering a few other important topics, which will be useful in the classes section, and will also touch on some other topics at the end that might be useful.

In [52]:
import math
import sys
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## List Comprehensions

List comprehensions are a simple and efficient way to build up containers without requiring a manual loop. For example, if we wanted a list of square numbers a simple way to do it would be:

In [11]:
squares = []
for x in range(10):
    squares.append(x**2)

squares

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

The same thing can be done very simply in a single line using a list comprehension:

In [12]:
[x**2 for x in range(10)]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

The general format is __[_expression_ for _var_ in _iterator_]__, where the expression depends on _var_ and _iterator_ is any object that can be looped over. The expression can be any valid Python expression, including a function call.
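
As a small illustration of the general format, the expression can be a method or function call on each item. The gene names below are made up for the example:

```python
# Hypothetical gene names, used only for illustration
genes = ['brca1', 'tp53', 'egfr']

# The expression is a method call on each item
upper_genes = [gene.upper() for gene in genes]

# The expression can also call a function such as len()
name_lengths = [len(gene) for gene in genes]

print(upper_genes)
print(name_lengths)
```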

__Exercise__: Write a list comprehension to generate a list of square roots:

List comprehensions can also be extended with a conditional expression that filters which values are used to build the list. For example, if you wanted a list of cubes of odd numbers:

In [13]:
[x**3 for x in range(10) if x % 2 == 1]

[1, 27, 125, 343, 729]

This is roughly equivalent to writing:

In [14]:
cubes = []
for x in range(10):
    if x % 2 == 1:
        cubes.append(x**3)

cubes

[1, 27, 125, 343, 729]
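
A filtering `if` at the end of a comprehension drops items, whereas a conditional expression (`a if condition else b`) in the expression itself keeps every item but chooses what to store. The following sketch contrasts the two; the variable names are only illustrative:

```python
numbers = range(6)

# Filter: only odd numbers are kept
odd_cubes = [x**3 for x in numbers if x % 2 == 1]

# Conditional expression: every number is kept, even ones are replaced by 0
cubes_or_zero = [x**3 if x % 2 == 1 else 0 for x in numbers]

print(odd_cubes)
print(cubes_or_zero)
```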

__Exercise__: Write a list comprehension to determine all multiples of 7 under 100

Similar syntax can be used to create sets and dictionaries. Like a list, a set is an object that contains other objects, but it is unordered and contains only one copy of each unique element. Sets also have additional methods to perform various set operations. We won't cover sets in detail; see https://docs.python.org/3/tutorial/datastructures.html for more details and further information about all data structures.


In [24]:
set([1,2,3,1,2,4,6])

{1, 2, 3, 4, 6}
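
The set operations mentioned above are available as operators. A brief sketch, using made-up gene sets from two hypothetical experiments:

```python
# Hypothetical gene sets from two experiments
experiment_a = {'brca1', 'tp53', 'egfr'}
experiment_b = {'tp53', 'myc', 'egfr'}

shared = experiment_a & experiment_b       # intersection
combined = experiment_a | experiment_b     # union
only_in_a = experiment_a - experiment_b    # difference

# Sets are unordered, so sort them for a predictable display
print(sorted(shared))
print(sorted(combined))
print(sorted(only_in_a))
```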


Set comprehensions are similar to list comprehensions but use {...}:

In [26]:
{(-1)**x for x in range(10)}

{-1, 1}

Similarly dictionary comprehensions also use curly brackets but require key:value pairs:

In [27]:
{x: x**2 for x in range(10)}

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

However, you often require a key that is not simply related to the value, in which case the comprehension becomes more complicated. The simplest way to do this is to iterate over a list containing other containers, which lets you assign multiple iterating variables, just as when you iterate over a dictionary's `items()`:

In [29]:
{key: value for key, value in [('foo', 1), ('bar', 2), ('baz', 3)]}

{'foo': 1, 'bar': 2, 'baz': 3}

This sort of list can be constructed using the _zip_ function, which groups the 1st, 2nd, 3rd, ... items in the lists you provide:

In [30]:
keys = ['foo', 'bar', 'baz']
values = [1, 2, 3]

{k: v for k,v in zip(keys, values)}

{'foo': 1, 'bar': 2, 'baz': 3}
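
To see what `zip` actually produces, it can help to look at the pairs directly. This sketch reuses the same `keys` and `values`; the `short` example is an extra illustration of how `zip` treats inputs of unequal length:

```python
keys = ['foo', 'bar', 'baz']
values = [1, 2, 3]

# zip is lazy, so wrap it in list() to see the pairs it produces
pairs = list(zip(keys, values))
print(pairs)

# When no transformation is needed, dict() accepts the pairs directly
lookup = dict(zip(keys, values))
print(lookup)

# zip stops at the shortest input, so extra items are silently dropped
short = list(zip(keys, [10, 20]))
print(short)
```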

The same sort of construct can also be used to have multiple variables in a list comprehension:

In [33]:
height = [4,7,8,4]
width = [7,6,2,8]
depth = [12,6,3,4]

volume = [h * w * d for h,w,d in zip(height, width, depth)]
volume

[336, 252, 48, 128]

__Exercise__: Create a list of sequence error rates and then turn it into a dictionary with the names as keys. The expected values are:

{'read1': 0.16666666666666666,
 'read2': 0.12962962962962962,
 'read3': 0.0625,
 'read4': 0.5416666666666666}

In [35]:
seq_names = ['read1', 'read2', 'read3', 'read4']
seqs = ['acgtacgtacgatcgatcgattggctatgc',
        'cagctagctagctattcttcggataacacacatttacactatcggactacgatc',
        'acattcttctcttttttttttgcgagctatcgatcgatatcgagtcga',
        'gatcgatcgatcgtgacgtctagc']
seq_error_count = [5, 7, 3, 13]


Finally, you might intuitively expect that using round brackets would create a tuple comprehension. However, when we try it we find it doesn't:

In [39]:
(math.factorial(x) for x in range(10000))

<generator object <genexpr> at 0x10897a138>

What has actually been created is a generator, a special type of iterator that only computes the next value when it is required, which makes it much more memory efficient because the full sequence is never stored at once. Generators are another very useful feature that we don't have time to cover; you can find out more at https://docs.python.org/3/howto/functional.html#generators (indeed the whole page is good). For a simple example of how powerful they can be, look carefully at the computation you just did and think how long it took. Doing the same in a list comprehension is unlikely to be a good idea!
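
As a rough sketch of this lazy behaviour, the example below compares a list comprehension with a generator expression. The sizes reported by `sys.getsizeof` are implementation details of CPython, so only the comparison is shown rather than exact numbers:

```python
import sys

# A list comprehension computes and stores every value immediately
squares_list = [x**2 for x in range(100000)]

# A generator expression stores only the recipe for producing values
squares_gen = (x**2 for x in range(100000))

# The generator object stays small no matter how long the range is
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))

# Values are produced one at a time, on request
first = next(squares_gen)
second = next(squares_gen)
print(first, second)

# Functions such as sum() consume whatever values remain
remaining_total = sum(squares_gen)
```

Note that a generator can only be consumed once: after the `sum` call above, `squares_gen` is exhausted.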

It is possible to create a tuple using a comprehension, but it requires calling the _tuple_ constructor on a list comprehension or a generator expression:

In [42]:
tuple([x**2 for x in range(10)])
tuple(x**3 for x in range(10))

(0, 1, 4, 9, 16, 25, 36, 49, 64, 81)

(0, 1, 8, 27, 64, 125, 216, 343, 512, 729)

## Classes

Object oriented programming is a style of programming based on the manipulation of objects, each of which bundles some data with the methods that act on it. It aims to encourage programs that relate conceptually to real life and are easy to read. Classes allow you to define your own types of object, of which you can then create instances to work on in your program. The best way to learn is through examples, so we will build up some simple classes to represent biological sequences.

We will start off with an RNA sequence. It might be useful for you to have the complete code of these classes available as we go, so copy this starter section into a fresh python script or jupyter notebook and add the various sections we cover to it. This will let you get a complete view of things.

The basic syntax to define a class is very simple:

In [46]:
class RNA:
    """An RNA sequence"""
    pass

We have now created a class called _RNA_, with a docstring giving a brief description of it (this is what appears when you call the help function on an object). However, the class doesn't really do anything yet, so let's add the sequence to it. We can do this by adding a method to the class, which has the same syntax as declaring a function. \_\_Methods\_\_ with two leading and two trailing underscores are special, with specific predefined roles (for more information on the use of underscores in names see [here](https://hackernoon.com/understanding-the-underscore-of-python-309d1a029edc)). The \_\_init\_\_ method is called when an instance of the class is created and defines the arguments used to initialise it:

In [49]:
class RNA:
    """An RNA sequence"""
    def __init__(self, seq):
        # Make sequence lower case
        seq = str(seq).lower()


        # Set the sequence
        self.seq = seq

RNA('actgcthg')

<__main__.RNA at 0x1089b2940>

We can now create an RNA instance using the syntax ClassName(...), which calls the \_\_init\_\_ method, in our case assigning the given sequence to the instance's _seq_ variable.
You will have noticed that the _self_ variable defined in the \_\_init\_\_ method is not given when we call RNA(_seq_). This is because the first argument of a method, named _self_ by convention, is implicitly set to the instance of the class, allowing you to modify instance variables and call methods on the instance. In our \_\_init\_\_ method we used this to assign the sequence to the instance variable _seq_: `self.seq = seq`. Classes can also have [class methods and static methods](https://julien.danjou.info/guide-python-static-class-abstract-methods/), which receive the class itself (rather than an instance) or no implicit first argument respectively, and are defined using [decorators](https://hackernoon.com/decorators-in-python-8fd0dce93c08). Both are more advanced topics than we will cover here.
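
For reference only, here is a minimal sketch of what class methods and static methods look like. The `Sequence` class and its `from_codons` and `is_purine` methods are hypothetical and are not part of the classes built in this worksheet:

```python
class Sequence:
    """Hypothetical minimal sequence class, used only to show the decorators"""
    alphabet = ('a', 'c', 'g', 'u')

    def __init__(self, seq):
        self.seq = str(seq).lower()

    @classmethod
    def from_codons(cls, codons):
        """Alternative constructor: cls is the class itself, not an instance"""
        return cls(''.join(codons))

    @staticmethod
    def is_purine(base):
        """Receives no implicit first argument, so it acts like a plain function"""
        return base.lower() in ('a', 'g')

s = Sequence.from_codons(['aug', 'gcc', 'uaa'])
print(s.seq)
print(Sequence.is_purine('G'))
```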

Classes have two kinds of variables: class variables and instance variables. Both types can be accessed using `object.var` syntax. We have just encountered an instance variable, _seq_, which is a bit of data associated with a specific instance of the class:

In [51]:
rna1 = RNA('aaa')
rna2 = RNA('ccc')

rna1.seq
rna2.seq

'aaa'

'ccc'

Class variables, on the other hand, are shared by all members of the class. Let's add an `alphabet` variable to store the allowed bases in the class and use it to check the input, along with an appropriate error:

In [63]:
class NucleotideError(Exception):
    """Error for a sequence containing a character outside the allowed alphabet"""
    pass

class RNA:
    """An RNA sequence"""
    alphabet = ('a', 'c', 'g', 'u')
    
    def __init__(self, seq):
        # Make sequence lower case
        seq = str(seq).lower()

        for base in seq:
            if base not in self.alphabet:
                raise NucleotideError('{} is not in the alphabet {}'.format(base, self.alphabet))
                
        # Set the sequence
        self.seq = seq

Class variables are defined outside class methods, without using the `self.x` syntax, and are shared across all class instances. Now if we try to create an RNA molecule with a bad sequence an error is raised:

In [65]:
RNA('acugacuHBHacugac')

NucleotideError: h is not in the alphabet ('a', 'c', 'g', 'u')
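
The sketch below repeats the class definition so that it runs on its own, then checks that `alphabet` really is a single shared object and shows one way the custom error could be caught with `try`/`except`. The `message` variable is only there for illustration:

```python
class NucleotideError(Exception):
    """Error for a sequence containing a character outside the allowed alphabet"""
    pass

class RNA:
    """An RNA sequence (same definition as above, repeated so this cell runs alone)"""
    alphabet = ('a', 'c', 'g', 'u')

    def __init__(self, seq):
        seq = str(seq).lower()
        for base in seq:
            if base not in self.alphabet:
                raise NucleotideError('{} is not in the alphabet {}'.format(base, self.alphabet))
        self.seq = seq

rna1 = RNA('aaa')
rna2 = RNA('ccc')

# The class variable is one object shared by the class and all its instances
print(rna1.alphabet is rna2.alphabet is RNA.alphabet)

# The custom error can be caught like any other exception
try:
    RNA('acugacuHBHacugac')
    message = None
except NucleotideError as err:
    message = str(err)
    print('Invalid sequence:', message)
```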

Other useful special methods include:
* \_\_str\_\_ - called when you convert an object to a string (including when printing it)
* Similar methods \_\_int\_\_, \_\_float\_\_ etc.
* \_\_repr\_\_ - the 'canonical' string representation, which should ideally recreate the object if evaluated
* \_\_len\_\_ - used by the builtin `len()` function to return an object's length

Alongside [many more](http://www.diveintopython3.net/special-method-names.html). We will implement \_\_str\_\_, \_\_repr\_\_ and \_\_len\_\_:

In [67]:
class RNA:
    """An RNA sequence"""
    alphabet = ('a', 'c', 'g', 'u')

    def __init__(self, seq):
        # Make sequence lower case
        seq = str(seq).lower()


        # Check all bases are part of the sequence alphabet
        for base in seq:
            if base not in self.alphabet:
                raise NucleotideError('{} is not in the alphabet {}'.format(base, self.alphabet))

        # Set the sequence
        self.seq = seq

    def __str__(self):
        return "5'-{}-3'".format(self.seq)

    def __repr__(self):
        return '{}({!r})'.format(type(self).__name__, self.seq)

    def __len__(self):
        return len(self.seq)

In [70]:
rna = RNA('aucgaucguac')
rna
print(rna, len(rna), sep='\n')

RNA('aucgaucguac')

5'-aucgaucguac-3'
11


We now have a simple container for RNA sequences, but it doesn't really do anything yet beyond slightly fancy printing. For that we can add a normal method. In this case we'll add a method to calculate the molecular weight of the molecule, also adding another class variable. This completes the RNA class:

#### RNA Class

In [None]:
class RNA:
    """An RNA sequence"""
    alphabet = ('a', 'c', 'g', 'u')

    # Molecular weights from thermofisher
    base_weights = {'a':329.2, 'c':305.2, 'g':345.2, 'u':306.2}

    def __init__(self, seq):
        # Make sequence lower case
        seq = str(seq).lower()


        # Check all bases are part of the sequence alphabet
        for base in seq:
            if base not in self.alphabet:
                raise NucleotideError('{} is not in the alphabet {}'.format(base, self.alphabet))

        # Set the sequence
        self.seq = seq

    def __str__(self):
        return "5'-{}-3'".format(self.seq)

    def __repr__(self):
        return '{}({!r})'.format(type(self).__name__, self.seq)

    def __len__(self):
        return len(self.seq)

    def molecular_weight(self):
        """Calculate the (approximate) molecular weight of the molecule, excluding potential end phosphate"""
        weight = 0
        for base in self.seq:
            weight += self.base_weights[base]

        return weight
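
The accumulation loop in `molecular_weight` is another place where a comprehension-style expression fits. As a sketch, written here as a standalone function (rather than a method) so that it runs on its own, the same calculation could use `sum` with a generator expression:

```python
# Weights copied from the RNA class above
base_weights = {'a': 329.2, 'c': 305.2, 'g': 345.2, 'u': 306.2}

def molecular_weight(seq):
    """Sum the weight of every base in seq using a generator expression"""
    return sum(base_weights[base] for base in seq)

weight = molecular_weight('acgu')
print(round(weight, 1))
```

Whether this is clearer than the explicit loop is a matter of taste; both give the same result.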

__Potential Exercises__:
* Add another method (GC content? simple alignment?)
* Add an \_\_iter\_\_ method to allow looping through the sequence

Since these may take some time and research, don't spend too long on them. You can always come back later if you find them interesting.

### Inheritance
Inheritance is a very useful feature of classes that allows you to easily define new variations on an existing class. We've already seen a basic example in the error that we defined, which inherited from the `Exception` class. Indeed that gives the basic syntax to use:

```
class child_class(parent_class):
    pass
```
A class inherits all the methods and class variables of its parent class unless you explicitly override them.

Let's use our `RNA` class to define a ssDNA class. The only real changes we'll need to make are to the alphabet and molecular weights.

__Bonus Exercise__: Change the `RNA` class such that we don't even need to change the alphabet (hint: we list the possible bases somewhere else)

#### ssDNA Class

In [72]:
class ssDNA(RNA):
    """A ssDNA sequence"""
    alphabet = ('a', 'c', 'g', 't')

    # Molecular weights from thermofisher
    base_weights = {'a':313.2, 'c':289.2, 'g':329.2, 't':304.2}

    def __init__(self, seq):
        super().__init__(seq)

This also demonstrates the `super()` function, which fetches a class's parent class, allowing you to access its methods. We use it to call the `RNA.__init__()` method, but note that it still uses the DNA alphabet, not the RNA one, because we used `self.alphabet` in the RNA method, which points to the DNA alphabet in our new class.
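
The lookup behaviour described above can be seen in isolation with a stripped-down sketch. The `Parent` and `Child` classes and their `label` variable are hypothetical and exist only to illustrate the point:

```python
class Parent:
    """Hypothetical parent class, used only to illustrate attribute lookup"""
    label = 'parent'

    def __init__(self, name):
        # self.label is looked up on the instance's own class first
        self.description = '{} ({})'.format(name, self.label)

class Child(Parent):
    """Overrides the class variable but reuses the parent's __init__"""
    label = 'child'

    def __init__(self, name):
        super().__init__(name)

p = Parent('x')
c = Child('y')
print(p.description)
print(c.description)

# An instance of the child class also counts as an instance of the parent
print(isinstance(c, Parent))
```

Even though the formatting code lives in `Parent.__init__`, the `Child` instance picks up its own `label`, just as `ssDNA` picks up its own `alphabet`.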

ssDNA is very similar to RNA, and was consequently fairly trivial to implement. Let's see what changes we need to make for dsDNA. Initially this seems simple: the alphabet and weights are the same as ssDNA so we can trivially inherit from that:

In [74]:
class dsDNA(ssDNA):
    """A dsDNA sequence"""
    def __init__(self, seq):
        super().__init__(seq)

print(dsDNA('actcagcgat'))

5'-actcagcgat-3'


That doesn't really seem a good representation of dsDNA though; it might be nice to see the reverse strand too. This can be done by overriding the `__str__` method. Similarly, the molecular weight will only account for one strand, so that will need to be changed too.
__Exercise__: Modify the template below to do this

In [None]:
class dsDNA(ssDNA):
    """A dsDNA sequence"""
    def __init__(self, seq):
        super().__init__(seq)

Here's one possible solution, using another class variable to store Watson-Crick pairing (which is constant between molecules) and calculating the complement as we initialise instances, because it is useful in both methods:

#### dsDNA Class

In [75]:
class dsDNA(ssDNA):
    """A dsDNA sequence"""
    base_complements = {'a':'t', 'c':'g', 'g':'c', 't':'a'}
    def __init__(self, seq):
        super().__init__(seq)
        self.complement = ''.join([self.base_complements[base] for base in self.seq])

    def __str__(self):
        forward = super().__str__()
        reverse = "3'-{}-5'".format(self.complement)
        return '{}\n   {}   \n{}'.format(forward, '|'*len(self), reverse)

    def molecular_weight(self):
        forward_strand = super().molecular_weight()
        reverse_strand = 0
        for base in self.complement:
            reverse_strand += self.base_weights[base]

        return forward_strand + reverse_strand
    
print(dsDNA('actcagcgat'))

5'-actcagcgat-3'
   ||||||||||   
3'-tgagtcgcta-5'


This is possibly an overly elaborate way of printing dsDNA, depending on what your program is for, but it serves to show the power of overriding parent methods. Both `__str__` and `molecular_weight` again make use of `super()` to call the equivalent `ssDNA` method as a starting point to build on, which is often a useful way to avoid writing the same function twice.



## Error Handling

When you start to write longer scripts and programs it is important to 

## Database Access

## Biological Libraries