## Advanced Python Course 
## MoBi - University Heidelberg 2019
### by Christian Fufezan 

christian@fufezan.net

https://fufezan.net


<img src="https://octodex.github.com/images/Professortocat_v2.png" width="100" height="100" style="float: right;"/>

In [None]:
# %load topics.py
import pandas as pd
import psutil

pd.set_option("display.max_colwidth" , 300)

df_high_level = pd.DataFrame(
    data=[
        {'day': 'Monday', 'Topic': 'Check-In, recaps and functions'},
        {'day': 'Tuesday', 'Topic': 'Coding philosophy, data flow and some more useful std modules'},
        {'day': 'Wednesday', 'Topic': 'Test driven development, python module, sphinx'},
        {'day': 'Thursday', 'Topic': 'OOP - Object oriented programming'},
        {'day': 'Friday', 'Topic': 'Q&A and code clean up'},
        {'day': '', 'Topic': ''},
        {'day': 'Monday', 'Topic': ''},
        {'day': 'Tuesday', 'Topic': ''},
        {'day': 'Wednesday', 'Topic': ''},
        {'day': 'Thursday', 'Topic': ''},
        {'day': 'Friday', 'Topic': 'Q&A and Tutorium'},


    ]
)

df_details = pd.DataFrame(
    data=[
        {'day': 1, 'Topic': 'Check-in'},
        {'day': 1, 'Topic': 'Procedural stuff'},
        {'day': 1, 'Topic': "python basic in 5'"},
        {'day': 1, 'Topic': 'lists and generators'},
        {'day': 1, 'Topic': 'bisect module'},
        # ----------------------------
        {'day': 2, 'Topic': 'Functions'},
        {'day': 2, 'Topic': 'Zen of Python and general coding philosophy'},
        {'day': 2, 'Topic': 'csv module'},
        {'day': 2, 'Topic': 'Collections module'},
        {'day': 2, 'Topic': 'Exercises 1 & 2'},
        # ----------------------------
        {'day': 3, 'Topic': 'Basic plotting with plotly'},
        {'day': 3, 'Topic': "String format"},
        {'day': 3, 'Topic': 'dicts'},
        {'day': 3, 'Topic': 'itertools'},
        {'day': 3, 'Topic': 'data flow'},
        {'day': 3, 'Topic': 'Exercises 3 & 4'},
        # -----------------------------
        {'day': 3, 'Topic': "Basic Python package"},
        {'day': 3, 'Topic': "Test Driven development"},
        {'day': 3, 'Topic': "Auto documentation with Sphinx"},
        # -----------------------------
        {'day': 4, 'Topic': "OOP"},
    ]
)


def display_topics(day=1, df=None):
    if df is None:
        df = df_details
    return df[df['day'] == day][['day', 'Topic']].head(20)


# Day 2
## Overview

In [None]:
display_topics(day=2)

## Functions
Functions are encapsulated code blocks. They are useful because:
* their code is reusable (it can be used in different parts of the code or even imported from other scripts)
* they can be documented
* they can be tested

In [2]:
import hashlib
def calculate_md5(string):
    """Calculate the md5 for a given string
    
    Args:
        string (str or bytes): input for which the md5 hex digest is calculated.
            A str is encoded as UTF-8 before hashing.
        
    Returns:
        str: md5 hex digest
    """
    m = hashlib.md5()
    if isinstance(string, str):
        m.update(string.encode("utf-8"))
    elif isinstance(string, bytes):
        m.update(string)
    else:
        raise TypeError("This function supports only str or bytes input")
    return m.hexdigest()
    

In [3]:
calculate_md5("The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men.")

'671f4888cb99472cea6bc35c10388537'

In [None]:
calculate_md5(b"The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men.")

Side note: Personally, I find Google's docstring format the most readable. We will use this format on day 3. Examples of Google-style Python docstrings can be found [here](https://www.sphinx-doc.org/en/1.5/ext/example_google.html). If you wonder why we test for byte strings and use encode, please read [this](https://realpython.com/python-encodings-guide/) well-written blog post about it.
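The following minimal sketch illustrates why the encoding step matters. It calls `hashlib` directly and uses a made-up example text; the variable names are chosen for illustration only.

```python
import hashlib

text = "Heidelberg"  # hypothetical example input

# hashlib works on bytes, so a str has to be encoded first
digest_from_str = hashlib.md5(text.encode("utf-8")).hexdigest()
digest_from_bytes = hashlib.md5(b"Heidelberg").hexdigest()
same_digest = digest_from_str == digest_from_bytes

# A different encoding produces different bytes and thus a different digest
digest_utf16 = hashlib.md5(text.encode("utf-16")).hexdigest()
encoding_matters = digest_utf16 != digest_from_str

# Passing a str without encoding it is rejected by hashlib
try:
    hashlib.md5(text)
    str_rejected = False
except TypeError:
    str_rejected = True

print(same_digest, encoding_matters, str_rejected)
```

The digest therefore depends on the bytes, not on the characters, which is why the function above fixes the encoding to UTF-8.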

### Dangerous mistakes using functions
What are the outcomes of these lines?

In [6]:
def extend_list_with_three_none(input_list=[]):
    """Extend input_list with 3 * None"""
    input_list += [None, None, None]
    return input_list

In [7]:
extend_list_with_three_none(input_list=['3', 2 , 1])

['3', 2, 1, None, None, None]

In [8]:
extend_list_with_three_none()

[None, None, None]

In [9]:
extend_list_with_three_none()

[None, None, None, None, None, None]

In [10]:
extend_list_with_three_none()

[None, None, None, None, None, None, None, None, None]

## Setting up functions properly
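**Never** set default kwargs in functions to mutable objects. The default object is created only once, when the function is defined, and is shared by every call that does not pass the argument, so changes accumulate across calls and the function behaves strangely.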
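One way to see that the default object is shared is to inspect the function's `__defaults__` attribute. The sketch below uses a hypothetical function `append_marker`, which is not part of the course code.

```python
def append_marker(target=[]):
    """Hypothetical function with a mutable default argument."""
    target.append("x")
    return target


# The default list lives on the function object itself
size_before = len(append_marker.__defaults__[0])

first = append_marker()
second = append_marker()

# Both calls returned the very same list object, which is also the stored default
same_object = first is second
size_after = len(append_marker.__defaults__[0])

print(size_before, size_after, same_object)
```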

In [None]:
def extend_list_with_three_none_without_bug(input_list = None):
    """Extend input_list with 3 None"""
    if input_list is None:
        input_list = []
    input_list += [None, None, None]
    return input_list

In [None]:
extend_list_with_three_none_without_bug(input_list=['3', 2 , 1])

In [None]:
extend_list_with_three_none_without_bug()

In [None]:
extend_list_with_three_none_without_bug()

In [None]:
extend_list_with_three_none_without_bug()

# The csv module

There are several ways to interact with files that contain data in a "comma separated value" format.

We cover the [basic csv module](https://docs.python.org/3/library/csv.html), as it reads a file row by row, which makes it easy to retain only a fraction of the information in a csv and to avoid running out of memory.

In [None]:
import csv

with open("../data/amino_acid_properties.csv", newline="") as aap:
    aap_reader = csv.DictReader(aap, delimiter=",")
    for line_dict in aap_reader:
        print(line_dict)
        break
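Retaining only a fraction of the information could look like the sketch below. It reads from an in-memory string instead of a file, and the column names and rows are assumptions made for illustration; the real file may be laid out differently.

```python
import csv
import io

# Hypothetical stand-in for a much larger csv file
csv_text = (
    "Name,3-letter code,1-letter code\n"
    "Alanine,Ala,A\n"
    "Cysteine,Cys,C\n"
)

one_letter_to_name = {}
with io.StringIO(csv_text) as handle:
    for row in csv.DictReader(handle):
        # Keep only the two columns of interest; the rest of the row is dropped
        one_letter_to_name[row["1-letter code"]] = row["Name"]

print(one_letter_to_name)
```

Because the reader yields one row at a time, only the lookup dictionary stays in memory, not the whole file.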

We can also use the csv module to write csv files, or tab-separated value files if we change the delimiter to "\t".

In [None]:
with open("../data/test.csv", "w", newline="") as output:
    aap_writer = csv.DictWriter(output, fieldnames=["Name", "3-letter code"])
    aap_writer.writeheader()
    aap_writer.writerow({"Name": "Alanine", "3-letter code": "Ala", "1-letter code": "A"})

# What do you expect to happen?

In [None]:
!cat ../data/test.csv

In [None]:
# fix it
with open("../data/test.csv", "w", newline="") as output:
    aap_writer = csv.DictWriter(output, fieldnames=["Name", "3-letter code"], extrasaction='ignore')
    aap_writer.writeheader()
    aap_writer.writerow({"Name": "Alanine", "3-letter code": "Ala", "1-letter code": "A"})
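The tab-separated variant mentioned earlier can be sketched in the same way. The example writes into an in-memory buffer (an assumption made to keep it self-contained) and changes only the `delimiter`.

```python
import csv
import io

buffer = io.StringIO()  # stands in for a file opened with newline=""
tsv_writer = csv.DictWriter(
    buffer,
    fieldnames=["Name", "3-letter code"],
    delimiter="\t",
    extrasaction="ignore",  # silently drop keys that are not in fieldnames
)
tsv_writer.writeheader()
tsv_writer.writerow({"Name": "Alanine", "3-letter code": "Ala", "1-letter code": "A"})

tsv_text = buffer.getvalue()
print(repr(tsv_text))
```

Note that the csv module terminates lines with `\r\n` by default, which is why files should be opened with `newline=""`.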

# Collections - high performance containers ... sorta

## [collections.Counter](https://docs.python.org/3.7/library/collections.html#counter-objects)
A counter tool is provided to support convenient and rapid tallies. For example:

In [27]:
from collections import Counter
s = """
MQRLMMLLATSGACLGLLAVAAVAAAGANPAQRDTHSLLPTHRRQKRDWIWNQMHIDEEK
NTSLPHHVGKIKSSVSRKNAKYLLKGEYVGKVFRVDAETGDVFAIERLDRENISEYHLTA
VIVDKDTGENLETPSSFTIKVHDVNDNWPVFTHRLFNASVPESSAVGTSVISVTAVDADD
PTVGDHASVMYQILKGKEYFAIDNSGRIITITKSLDREKQARYEIVVEARDAQGLRGDSG
TATVLVTLQDINDNFPFFTQTKYTFVVPEDTRVGTSVGSLFVEDPDEPQNRMTKYSILRG
DYQDAFTIETNPAHNEGIIKPMKPLDYEYIQQYSFIVEATDPTIDLRYMSPPAGNRAQVI
"""
Counter(s)

Counter({'\n': 7,
         'M': 8,
         'Q': 14,
         'R': 20,
         'L': 24,
         'A': 29,
         'T': 28,
         'S': 23,
         'G': 20,
         'C': 1,
         'V': 31,
         'N': 16,
         'P': 17,
         'D': 28,
         'H': 10,
         'K': 18,
         'W': 3,
         'I': 23,
         'E': 21,
         'Y': 13,
         'F': 13})

In [28]:
# Counter objects can be added together
Counter("AABB") + Counter("BBCC")

Counter({'A': 2, 'B': 4, 'C': 2})

In [29]:
# Works with any type of object that is hashable
Counter([(1, 1), (1, 2), (2, 1), (1, 1)])

Counter({(1, 1): 2, (1, 2): 1, (2, 1): 1})
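Two further Counter features are handy for tallies: `most_common` and the fact that missing keys count as zero. The short peptide below is just an illustrative input.

```python
from collections import Counter

peptide = "MQRLMMLLATSGACLGLL"  # illustrative input
counts = Counter(peptide)

# The n most frequent elements, sorted by descending count
top_two = counts.most_common(2)

# A key that was never seen returns 0 instead of raising a KeyError
tryptophan_count = counts["W"]

print(top_two, tryptophan_count)
```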

## [collections.deque](https://docs.python.org/3.7/library/collections.html#deque-objects)
A deque \[deck\], or double-ended queue, can be used for many tasks, e.g. building a sliding window.

In [30]:
from collections import deque
s = """MQRLMMLLATSGACLGLLAVAAVAAAGANPAQRDTHSLLPTHRRQKRDWIWNQMHIDEEKNTSLPHHVGKIKSSVSRKNAKYLLKGEYVGKVFRVDAETGDVFAIERLDRENISEYHLTA"""
window = deque([], maxlen=5)

In [31]:
for pos, aa in enumerate(s):
    window.append(aa)
    print(window)
    if pos > 7:
        break

deque(['M'], maxlen=5)
deque(['M', 'Q'], maxlen=5)
deque(['M', 'Q', 'R'], maxlen=5)
deque(['M', 'Q', 'R', 'L'], maxlen=5)
deque(['M', 'Q', 'R', 'L', 'M'], maxlen=5)
deque(['Q', 'R', 'L', 'M', 'M'], maxlen=5)
deque(['R', 'L', 'M', 'M', 'L'], maxlen=5)
deque(['L', 'M', 'M', 'L', 'L'], maxlen=5)
deque(['M', 'M', 'L', 'L', 'A'], maxlen=5)


In [None]:
Counter(window)
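The sliding window idea can be wrapped in a small generator. This is a sketch; the function name `sliding_windows` and the choice to yield only full windows are assumptions, not part of the course code.

```python
from collections import deque


def sliding_windows(sequence, size=5):
    """Yield every full window of `size` consecutive elements as a string."""
    window = deque(maxlen=size)
    for element in sequence:
        # Once the deque is full, appending drops the oldest element on the left
        window.append(element)
        if len(window) == size:
            yield "".join(window)


windows = list(sliding_windows("MQRLMML", size=5))
print(windows)
```

Yielding only when the deque is full skips the partially filled windows at the start of the sequence.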

## [collections.defaultdict](https://docs.python.org/3.7/library/collections.html#defaultdict-objects)
Defaultdicts are like dicts, yet they do not raise an error for missing keys. Instead, a default value is created, so testing whether a key exists is not necessary, which makes life easier :) Of course, one needs to define the default factory that produces the value used when a key does not exist.

I use it a lot for counting 
```python
counter["error"] += 1
```
or collecting elements in lists
```python
sorter["typeA"].append({"name": "John"})
```

In [32]:
from collections import defaultdict

ddict_int = defaultdict(int)
#                        ^---- default factory
ddict_list = defaultdict(list)

In [33]:
ddict_int[10] += 10
ddict_int

defaultdict(int, {10: 10})

In [34]:
ddict_int[0]

0

In [35]:
def default_factory_with_prefilled_dictionary():
    return {"__name": "our custom dict", "errors": 0}
ddict_custom = defaultdict(default_factory_with_prefilled_dictionary)


In [37]:
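# Raises a TypeError: the default value is a dict and cannot be incremented by an int.
# The key 10 is nevertheless created with the default value.
ddict_custom[10] += 10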

In [None]:
ddict_custom[10]['errors'] += 10

In [38]:
ddict_custom

defaultdict(<function __main__.default_factory_with_prefilled_dictionary()>,
            {10: {'__name': 'our custom dict', 'errors': 10}})
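The two usage patterns mentioned above, counting and collecting elements in lists, can be put together in a runnable sketch. The input list is made up for illustration.

```python
from collections import defaultdict

residues = ["Ala", "Gly", "Ala", "Trp"]  # hypothetical input

counter = defaultdict(int)
sorter = defaultdict(list)
for position, residue in enumerate(residues):
    counter[residue] += 1             # a missing key starts at int() == 0
    sorter[residue].append(position)  # a missing key starts at list() == []

# Careful: merely reading a missing key inserts it with the default value
present_before = "Cys" in counter
value = counter["Cys"]
present_after = "Cys" in counter

print(dict(counter), dict(sorter), present_before, present_after)
```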

# Exercise No. 1

## Count amino acid propensity in the human proteome

Go to UniProt and download the latest reviewed human proteome [here](https://www.uniprot.org/uniprot/?query=*&fil=organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22+AND+reviewed%3Ayes#)

The file contains protein sequences in FASTA format, i.e.
```txt
> (( Identifier line ))
(( Amino acid Sequence, can stretch over multiple lines ))
> (( next identifier line ))
```

Write a standalone script that parses the UniProt FASTA file and counts the amino acid frequency.
The script should be callable like
``` bash
$ ./count_aas.py "uniprot-filtered-proteome%3AUP000005640+AND+reviewed%3Ayes+AND+organism%3A%22Hom--.fasta"
```
and produce an output like this
``` bash
A: xxxx
C: yyyy
D: zzzz
...
```

# Exercise No. 2

## Find those sequences that have the most extreme values for ..

The data folder in this repo holds a file called **amino_acid_properties.csv**.
Copy your script from exercise no. 1 and extend it to

* parse the amino_acid_properties.csv as well using a *csv.DictReader*
* find those sequences that have the largest and smallest mass
* find those sequences that have the most extreme pIs
* find those sequences that have on average the highest / lowest mass
* find those sequences that have the most extreme (average) hydropathy values

## Exercise No. 2b*

Plot the outputs of the scripts from exercises 1 & 2 using your favourite plotting library.