# Exercise 1

## Exercise 1A

Write a function called middle that takes a list and returns a new list that contains all but the first and last elements. For example, middle([1,2,3,4]) should return [2,3].

In [1]:
def middle_naive(l):
    # Build a new list from every element except the first and the last.
    result = []
    for i in range(1, len(l) - 1):
        result.append(l[i])
    return result

def middle_slice(l):
    # Slicing always returns a new list, even for lists shorter than three elements.
    return l[1:-1]

for middle in (middle_naive, middle_slice):
    print(middle)
    assert [] == middle([])
    assert [] == middle([1])
    assert [] == middle([1, 2])
    assert [2, 3] == middle([1,2,3,4])

<function middle_naive at 0x106c33158>
<function middle_slice at 0x106c331e0>


## Exercise 1B

Write a function called chop that takes a list, modifies it by removing the first and last elements, and returns None.

In [2]:
def chop_inplace(l):
    l.pop(0)
    l.pop()
    return None

def chop_wrong1(l):
    t = []
    for i in range(1, len(l) - 1):
        t.append(l[i])
    l = t
    return None

def chop_wrong2(l):
    l = l[1:-1]
    return None

for chop in (chop_inplace, ):
    l = [1,2,3,4]
    chop(l)
    assert l == [2,3]
    
for chop in (chop_wrong1, chop_wrong2):
    l = [1,2,3,4]
    chop(l)
    assert l == [1,2,3,4]
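
The two chop_wrong variants leave the caller's list unchanged because assigning to the parameter l only rebinds the local name. As a minimal sketch, slice assignment is one alternative that does modify the list in place (the name chop_slice_assign is hypothetical and not part of the solution above):

```python
def chop_slice_assign(l):
    """Remove the first and last elements in place via slice assignment."""
    # l[:] = ... replaces the contents of the existing list object,
    # so the change is visible through every reference to that list.
    l[:] = l[1:-1]
    return None

data = [1, 2, 3, 4]
alias = data                 # a second name for the same list object
chop_slice_assign(data)
print(data, alias is data)
```

Unlike l.pop(0) followed by l.pop(), this variant does not raise an exception on an empty list.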

# Exercise 2

## Exercise 2A

Write a function that takes a list of numbers and returns the cumulative sum; that is, a new list where the ith element is the sum of the first i + 1 elements from the original list. For example, the cumulative sum of [1, 2, 3] is [1, 3, 6].

In [3]:
def cumsum_naive(l):
    result = []
    for i in range(len(l)):
        el = 0
        for j in range(0, i + 1):
            el += l[j]
        result.append(el)
    return result

def cumsum_naive2(l):
    result = []
    for i in range(len(l)):
        result.append(sum(l[:i + 1]))
    return result

import numpy as np

def cumsum_smart(l):
    return list(np.cumsum(l))

for cumsum in (cumsum_naive, cumsum_naive2, cumsum_smart):
    print(cumsum)
    assert [1,3,6] == cumsum([1,2,3])

<function cumsum_naive at 0x106c1a378>
<function cumsum_naive2 at 0x106c1af28>
<function cumsum_smart at 0x106c1aea0>
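
Both naive versions re-add the whole prefix for every position, so their running time grows quadratically with the length of the list. A possible single-pass sketch without NumPy, using either a running total or itertools.accumulate from the standard library (the function names are illustrative):

```python
from itertools import accumulate

def cumsum_running(l):
    """Cumulative sum that keeps a running total instead of re-summing."""
    result = []
    total = 0
    for x in l:
        total += x
        result.append(total)
    return result

def cumsum_accumulate(l):
    """Same result using itertools.accumulate, which defaults to addition."""
    return list(accumulate(l))

for cumsum in (cumsum_running, cumsum_accumulate):
    assert [1, 3, 6] == cumsum([1, 2, 3])
    assert [] == cumsum([])
```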


## Exercise 2B

Write a function called nested_sum that takes a nested list of integers and adds up the elements from all of the nested lists.

In [4]:
def nested_sum_naive1(l):
    s = 0
    for i in range(len(l)):
        for j in range(len(l[i])):
            s += l[i][j]
    return s

def nested_sum_naive2(l):
    s = 0
    for subl in l:
        for el in subl:
            s += el
    return s
    
def nested_sum_smart(l):
    return sum(sum(element) for element in l)

for nested_sum in (nested_sum_naive1, nested_sum_naive2, nested_sum_smart):
    print(nested_sum)
    assert 10 == nested_sum([[1,2], [7]])
    assert 0 == nested_sum([[]])
    assert 10 == nested_sum([[1,2,7]])
    assert 10 == nested_sum([[1,2], [7], []])
    assert 10 == nested_sum([[1,2], [7], [0]])

<function nested_sum_naive1 at 0x1073140d0>
<function nested_sum_naive2 at 0x107314268>
<function nested_sum_smart at 0x1073142f0>


## Exercise 2C

What about a list like [1,[2,3,[4,5]],[[6,7],8,[9,[10]]]] that contains numbers nested in lists of arbitrary depth? Write a function depth_sum that takes a list containing integers nested in lists of arbitrary depth and returns the sum of all the numbers. Hint: Use recursion.

In [5]:
def depth_sum(l):
    if isinstance(l, int):
        return l
    elif isinstance(l, list):
        return sum(depth_sum(el) for el in l)
    else:
        raise TypeError("expected an int or a list, got {}".format(type(l).__name__))

assert 6 == depth_sum(6)
assert 0 == depth_sum([0])
assert 0 == depth_sum([])
assert 0 == depth_sum([[]])
assert 0 == depth_sum([[0]])
assert 0 == depth_sum([[], 0])
assert 55 == depth_sum([1,[2,3,[4,5]],[[6,7],8,[9,[10]]]])
assert 55 == depth_sum([1,[2,3,[4,5]],[[6,7],8,[9,[10, []]]]])
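
The recursive solution is limited by Python's recursion depth for very deeply nested input. As a sketch of an alternative, the same traversal can be written with an explicit stack (depth_sum_iterative is a hypothetical name, not part of the exercise):

```python
def depth_sum_iterative(l):
    """Sum integers nested to arbitrary depth using an explicit stack."""
    total = 0
    stack = [l]
    while stack:
        item = stack.pop()
        if isinstance(item, int):
            total += item
        elif isinstance(item, list):
            # Push the children; they are processed on later iterations.
            stack.extend(item)
        else:
            raise TypeError("expected an int or a list, got {}".format(type(item).__name__))
    return total

assert 55 == depth_sum_iterative([1, [2, 3, [4, 5]], [[6, 7], 8, [9, [10]]]])
```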

# Exercise 3

Reverse Pairs

## Exercise 3A

Write a function that reads the file words.txt and builds a list with one element per word. Write two versions of this function, one using the append method and the other using the idiom t = t + [x]. Which one takes longer to run? Why?

In [6]:
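def read_words_cat(path):
    words = []
    with open(path) as f:
        for line in f:
            # t = t + [x] builds a brand-new list on every iteration
            words = words + [line.strip()]
    return words

def read_words_append(path):
    words = []
    with open(path) as f:
        for line in f:
            words.append(line.strip())
    return words

def read_words_iter(path):
    with open(path) as f:
        # strip the trailing newline so each element is just the word
        return [line.strip() for line in f]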

### Explanation

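The concatenation version is slower: t = t + [x] allocates a new list and copies every existing element on each iteration, so reading n words takes time proportional to n^2. The append method modifies the list in place in amortized constant time, so the whole loop is linear. Iterating over the file object directly is also linear and is the most concise of the three.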
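
The difference can be measured with the timeit module. The following is a rough sketch on synthetic data rather than words.txt; the helper names and the list size are arbitrary choices, and the absolute numbers will vary between machines:

```python
import timeit

def build_concat(items):
    t = []
    for x in items:
        t = t + [x]      # allocates a new list and copies all of t each time
    return t

def build_append(items):
    t = []
    for x in items:
        t.append(x)      # amortized constant time per element
    return t

items = list(range(5000))
concat_time = timeit.timeit(lambda: build_concat(items), number=3)
append_time = timeit.timeit(lambda: build_append(items), number=3)
print("concat: {:.4f}s  append: {:.4f}s".format(concat_time, append_time))
```

Doubling the size of items should roughly quadruple the concatenation time while only doubling the append time.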

## Exercise 3B

Two words are a “reverse pair” if each is the reverse of the other. Write a program that finds all the reverse pairs in the word list words.txt. To this end, first read all the words of words.txt into a (sorted) list. You can test whether a word is already in the list using the in operator. However, you can also use the bisect method of ex. 10.10 of Think Python or use the bisect module, which is part of the Python standard library.

In [7]:
def is_reverse_pair_naive(a, b):
    if len(a) != len(b):
        return False
    for i in range(len(a)):
        if a[i] != b[-(i + 1)]:
            return False
    return True

def is_reverse_pair_easy(a, b):
    return a == b[::-1]

for is_reverse_pair in (is_reverse_pair_naive, is_reverse_pair_easy):
    print(is_reverse_pair)
    assert is_reverse_pair("", "")
    assert not is_reverse_pair("A", "")
    assert not is_reverse_pair("", "A")
    assert is_reverse_pair("ABA", "ABA")
    assert is_reverse_pair("ABC", "CBA")
    
def get_reverse_pairs_naive(words):
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if is_reverse_pair_easy(words[i], words[j]):
                pairs.append((words[i], words[j]))
    return pairs            

from itertools import combinations

def get_reverse_pairs_easy(words):
    return [(a, b) for a, b in combinations(words, 2) if is_reverse_pair_easy(a, b)]

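def get_reverse_pairs_sets(words):
    word_set = set(words)  # set membership is a constant-time lookup on average
    result = []
    for word in words:
        rev = word[::-1]
        # word < rev skips palindromes and reports each pair only once
        if word < rev and rev in word_set:
            result.append((word, rev))
    return result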

test_words = ["", "AB", "BA", "ABC", "EDF"]
for get_reverse_pairs in (get_reverse_pairs_naive, get_reverse_pairs_easy, get_reverse_pairs_sets):
    assert {("AB", "BA")} == set(get_reverse_pairs(test_words))

<function is_reverse_pair_naive at 0x107314ae8>
<function is_reverse_pair_easy at 0x10730f598>
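
The exercise also suggests the bisect module. A minimal sketch of that approach, assuming the word list fits in memory and can be sorted (in_sorted and get_reverse_pairs_bisect are illustrative names):

```python
from bisect import bisect_left

def in_sorted(sorted_words, word):
    """Binary-search membership test; sorted_words must already be sorted."""
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

def get_reverse_pairs_bisect(words):
    sorted_words = sorted(words)
    pairs = []
    for word in sorted_words:
        rev = word[::-1]
        # word < rev reports each pair once and skips palindromes
        if word < rev and in_sorted(sorted_words, rev):
            pairs.append((word, rev))
    return pairs

print(get_reverse_pairs_bisect(["", "AB", "BA", "ABC", "EDF"]))
```

Each lookup takes logarithmic time, compared with the linear scan performed by the in operator on a list.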


## Exercise 3C

Modify the code of the previous exercise and read in all the words as keys of a dictionary (or a set) with an arbitrary value (e.g. set the value to None) in each case. In this case use the in operator, because bisect only works on lists, and only works correctly if the lists are sorted.

In [8]:
def read_words_dict(path):
    with open(path) as f:
        return {line.strip(): None for line in f}
    
test_words_dict = {word: None for word in test_words}

## Same implementation as above, just better results because hashmap has fast lookup time
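
As a sketch of what the dictionary-based search could look like, the following uses dict.fromkeys to build the lookup table and the in operator for membership tests (get_reverse_pairs_lookup is a hypothetical name):

```python
def get_reverse_pairs_lookup(words):
    """Find reverse pairs using hash-based membership tests."""
    lookup = dict.fromkeys(words)    # every word becomes a key with value None
    pairs = []
    for word in lookup:
        rev = word[::-1]
        # word < rev reports each pair once and skips palindromes
        if word < rev and rev in lookup:
            pairs.append((word, rev))
    return pairs

print(get_reverse_pairs_lookup(["", "AB", "BA", "ABC", "EDF"]))
```

Each membership test on a dictionary or set takes constant time on average, so the whole search is linear in the number of words.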

# Exercise 4

Anagrams

## Exercise 4A

Write a program that reads a word list from the file words.txt and prints all the sets of words that are anagrams.
Here is an example of what the output might look like:
          
          ['deltas', 'desalt', 'lasted', 'salted', 'slated', 'staled']
          ['retainers', 'ternaries']
          ['generating', 'greatening']
          ['resmelts', 'smelters', 'termless']

Hint: you might want to build a dictionary that maps from a set of letters to a list of words that can be spelled with those letters. The question is, how can you represent the set of letters in a way that can be used as a key?

In [9]:
## use disjoint union structure

from collections import Counter

def word_to_hist(words):
    return {word: Counter(word) for word in words}

test_words = ['deltas', 'desalt', 'lasted', 
               'salted', 'slated', 'staled', 
              'retainers', 'ternaries', 'generating', 
              'greatening', 'resmelts', 'smelters', 'termless', 
             'XXYZYZZZ', 'XYYZZXZZ', 'XYXYZZZZ', 'XYYZZZXZ', 'XZXYYZZZ', 'ZXXYYZZZ']

hist = word_to_hist(test_words)
hist

{'XXYZYZZZ': Counter({'X': 2, 'Y': 2, 'Z': 4}),
 'XYXYZZZZ': Counter({'X': 2, 'Y': 2, 'Z': 4}),
 'XYYZZXZZ': Counter({'X': 2, 'Y': 2, 'Z': 4}),
 'XYYZZZXZ': Counter({'X': 2, 'Y': 2, 'Z': 4}),
 'XZXYYZZZ': Counter({'X': 2, 'Y': 2, 'Z': 4}),
 'ZXXYYZZZ': Counter({'X': 2, 'Y': 2, 'Z': 4}),
 'deltas': Counter({'a': 1, 'd': 1, 'e': 1, 'l': 1, 's': 1, 't': 1}),
 'desalt': Counter({'a': 1, 'd': 1, 'e': 1, 'l': 1, 's': 1, 't': 1}),
 'generating': Counter({'a': 1,
          'e': 2,
          'g': 2,
          'i': 1,
          'n': 2,
          'r': 1,
          't': 1}),
 'greatening': Counter({'a': 1,
          'e': 2,
          'g': 2,
          'i': 1,
          'n': 2,
          'r': 1,
          't': 1}),
 'lasted': Counter({'a': 1, 'd': 1, 'e': 1, 'l': 1, 's': 1, 't': 1}),
 'resmelts': Counter({'e': 2, 'l': 1, 'm': 1, 'r': 1, 's': 2, 't': 1}),
 'retainers': Counter({'a': 1,
          'e': 2,
          'i': 1,
          'n': 1,
          'r': 2,
          's': 1,
          't': 1}),
 'sal

In [10]:
def anagram_lists(words):
    hist = word_to_hist(words)
    dju = [[word] for word in words]
    j = 0 
    while j < len(dju):
        moved = set()
        for i in range(j + 1, len(dju)):
            if hist[dju[i][0]] == hist[dju[j][0]]:
                dju[j].extend(dju[i])
                moved.add(i)   
        dju = [el for i, el in enumerate(dju) if i not in moved]
        j += 1

    return dju

als = anagram_lists(test_words)
als

[['deltas', 'desalt', 'lasted', 'salted', 'slated', 'staled'],
 ['retainers', 'ternaries'],
 ['generating', 'greatening'],
 ['resmelts', 'smelters', 'termless'],
 ['XXYZYZZZ', 'XYYZZXZZ', 'XYXYZZZZ', 'XYYZZZXZ', 'XZXYYZZZ', 'ZXXYYZZZ']]
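
The hint in the exercise points to a dictionary-based solution. One common way to turn a set of letters into a usable key is to sort the letters and join them back into a string. The following is a sketch of that idea, not the approach used above (anagram_lists_by_key is an illustrative name):

```python
from collections import defaultdict

def anagram_lists_by_key(words):
    """Group words that share the same sorted-letter signature."""
    groups = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))   # e.g. 'deltas' -> 'adelst'
        groups[key].append(word)
    # Keep only signatures shared by at least two words.
    return [group for group in groups.values() if len(group) > 1]

sample = ['deltas', 'desalt', 'lasted', 'retainers', 'ternaries', 'python']
for group in anagram_lists_by_key(sample):
    print(group)
```

This needs one pass over the word list, whereas the pairwise histogram comparison above compares every group against every later group.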

## Exercise 4B

Modify the previous program so that it prints the largest set of anagrams first, followed by the second largest set, and so on.

In [11]:
def sort_by_size(lol):
    """takes a list of lists and sorts it so the largest list comes first"""
    return sorted(lol, key=len, reverse=True)

#sort_by_size([[1], [2,2,2], [7,7,7,7,7,7], [4,4,4,4]])


sort_by_size(als)

[['deltas', 'desalt', 'lasted', 'salted', 'slated', 'staled'],
 ['XXYZYZZZ', 'XYYZZXZZ', 'XYXYZZZZ', 'XYYZZZXZ', 'XZXYYZZZ', 'ZXXYYZZZ'],
 ['resmelts', 'smelters', 'termless'],
 ['retainers', 'ternaries'],
 ['generating', 'greatening']]

## Exercise 4C

In Scrabble a “bingo” is when you play all seven tiles in your rack, along with a letter on the board, to form an eight-letter word. What set of 8 letters forms the most possible bingos? Hint: there are seven.

In [12]:
def find_bingos(words):
    als = sort_by_size(anagram_lists(words))
    als = [al for al in als if len(al[0]) == 8]
    return als

find_bingos(test_words)

[['XXYZYZZZ', 'XYYZZXZZ', 'XYXYZZZZ', 'XYYZZZXZ', 'XZXYYZZZ', 'ZXXYYZZZ'],
 ['resmelts', 'smelters', 'termless']]

# Exercise 5

Histograms

In [15]:
def histogram1(s):
    d = dict()
    for c in s:
        if c not in d:
            d[c] = 1 
        else:
            d[c] += 1
    return d
        
def histogram2(s):
    d = dict()
    for c in s:
        d[c] = d.get(c, 0) + 1
    return d

from collections import defaultdict

def histogram3(s):
    d = defaultdict(int)
    for c in s:
        d[c] += 1
    return d

def histogram4(s):
    return Counter(s)

for histogram in (histogram1, histogram2, histogram3, histogram4):
    print(histogram)
    assert histogram('') == {}
    assert histogram('a') == {'a': 1}
    assert histogram('abb') == {'a': 1, 'b': 2}

<function histogram1 at 0x107314b70>
<function histogram2 at 0x10733ebf8>
<function histogram3 at 0x10733e268>
<function histogram4 at 0x107314950>


## Exercise 5A

Take a look at the file ecoli-genome.fna found in \\bitsmb\groups\workshops\proglab2\ containing the DNA of a strain of Escherichia coli. Write a function fasta_frequency(filename) that takes a filename as a parameter and returns a histogram containing the frequency of each nucleotide (or amino acid).

In [None]:
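def parse_fasta_iterator(it):
    """Yield every sequence character from an iterable of FASTA lines."""
    for line in it:
        line = line.strip()
        # Skip blank lines and header lines, which start with '>'
        if not line or line.startswith(">"):
            continue
        yield from line

def fasta_frequency(path):
    with open(path) as f:
        # histogram is the implementation left bound by the loop above
        return histogram(parse_fasta_iterator(f))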

## Exercise 5B

Write a function print_hist(h) that takes a histogram as parameter and prints a table of keys and values in alphabetical order of the keys.

In [16]:
import sys
def print_hist(h, file=sys.stdout):
    for key in sorted(h):
        print("{}: {}".format(key, h[key]), file=file)
        
print_hist(histogram("asfkhadlkghjkdsgla"))

a: 3
d: 2
f: 1
g: 2
h: 2
j: 1
k: 3
l: 2
s: 2


## Exercise 5C

Combine the functions into a script where the user is asked to input a filename of a file in fasta format that calculates the frequencies and prints a table of nucleotides/amino acid symbols and their respective frequency.

In [None]:
import sys, argparse, shlex
def simulate_script(argstring):
    parser = argparse.ArgumentParser()
    
    parser.add_argument("--input", "-i", default=sys.stdin, type=argparse.FileType('r'))
    parser.add_argument("--output", "-o", default=sys.stdout, type=argparse.FileType('w'))
    
    # parse_args expects a list of tokens, so split the argument string first
    args = parser.parse_args(shlex.split(argstring))
    
    print_hist(histogram(parse_fasta_iterator(args.input)), file=args.output)

# Exercise 6

## Exercise 6A

Make sure that the function fasta_frequency(filename) from the previous exercise also works with multiple fasta sequences in one file. A single histogram should be returned combining the frequencies of all sequences. Calculate the histogram for the amino acids for all proteins combined.
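
A minimal sketch of a combined histogram over several records, assuming the usual FASTA layout in which each record starts with a header line beginning with '>' followed by one or more sequence lines. The function name and the example records are made up for illustration:

```python
import io
from collections import Counter

def fasta_frequency_from_lines(lines):
    """Count sequence symbols across all records in an iterable of FASTA lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Header lines start a new record; they carry no sequence symbols.
        if not line or line.startswith(">"):
            continue
        counts.update(line)
    return counts

# Two hypothetical records held in memory instead of a file on disk
example = io.StringIO(">seq1 first record\nMKV\nLA\n>seq2 second record\nMAA\n")
print(fasta_frequency_from_lines(example))
```

Because headers are simply skipped, the same loop handles one record or many without any special case.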

## Exercise 6B

Now write a function that sorts the amino-acids by their relative frequency and print a table of amino acids and their relative frequency in decreasing order of frequency. Hint: The entries of a dictionary can be converted to a list of key–value tuples using the method items(). See section 12.6 of Think Python.
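
One way to approach this, sketched with hypothetical function names: divide each count by the total, then sort the key-value tuples from items() by value.

```python
def relative_frequencies(h):
    """Return (symbol, fraction) tuples sorted by decreasing fraction."""
    total = sum(h.values())
    if total == 0:
        return []
    pairs = [(key, count / total) for key, count in h.items()]
    # Sort by the second tuple element (the fraction), largest first.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def print_relative_frequencies(h):
    for key, fraction in relative_frequencies(h):
        print("{}: {:.3f}".format(key, fraction))

print_relative_frequencies({"A": 3, "M": 1})
```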

## Exercise 6C

In order to make this more useful, combine the functions in a single script fasta_frequency.py that asks the user for a filename and as a result prints the occurring amino acids/nucleotides in decreasing order of frequency.

## Exercise 6D

Frequently, scripts work in a non-interactive way. That is, they do not expect the user to enter anything while the program is running. Instead, all the information the program needs is passed to the program before the start using command line parameters. The program can then be called from the command line:
         W:\> python fasta_frequency.py sequence.fasta
or from the notebook
         In [ ]: %run fasta_frequency.py sequence.fasta
Here, sequence.fasta is the filename containing the data you want to investigate. The parameters entered after the script name are available to the Python script itself. They are stored in a list of strings that can be found in the module sys. E.g., consider a script named args.py:
         import sys
         print(sys.argv) # argv is short for argument vector
Then running the script:
         In [ ]: %run args Was it a car or a cat I saw
will print
         ['args.py','Was','it','a','car','or','a','cat','I','saw']
Modify the script so that it reads the filename from the command line (to be found in sys.argv[1]).
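
A possible shape for such a script is sketched below. It is self-contained for illustration, so it repeats a small FASTA reader instead of importing the functions from the earlier exercises, and the usage message is an assumption:

```python
import sys
from collections import Counter

def fasta_frequency(path):
    """Count sequence symbols in a FASTA file, skipping '>' header lines."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith(">"):
                counts.update(line)
    return counts

def main(argv):
    # argv[0] is the script name, argv[1] the filename given by the user
    if len(argv) != 2:
        print("usage: python fasta_frequency.py FILENAME", file=sys.stderr)
        return
    for symbol, count in fasta_frequency(argv[1]).most_common():
        print("{}: {}".format(symbol, count))

if __name__ == "__main__":
    main(sys.argv)
```

Keeping the logic in main(argv) makes it possible to call the script's behavior from a notebook cell with an explicit argument list.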


## Exercise 6E

Apply your script to ecoli-proteome.faa and to one or more of the chromosomal proteomes of Drosophila melanogaster: drosophila-proteome-chrX.faa, drosophila-proteome-chr2.faa, drosophila-proteome-chr3.faa.
Are there noticeable differences in the amino acid composition between Escherichia coli and Drosophila melanogaster?
How well do the frequencies correspond to the published results found in the publication amino-acid-composition.pdf (Fig. 1, p. 601):
Bogatyreva, N.; Finkelstein, A.V.; Galzitskaya, O.V. Trend of amino acid composition of proteins of different taxa. Journal of Bioinformatics and Computational Biology 4 (2006), 597–608.

# Exercise 7

A nice word puzzle and an even better programming exercise (Ex. 12.4 of Think Python). Here's another Car Talk Puzzler (http://www.cartalk.com/content/puzzlers):
What is the longest English word, that remains a valid English word, as you remove its letters one at a time?
Now, letters can be removed from either end, or the middle, but you can't rearrange any of the letters. Every time you drop a letter, you wind up with another English word. If you do that, you're eventually going to wind up with one letter and that too is going to be an English word—one that's found in the dictionary. I want to know what's the longest word and how many letters does it have?
I'm going to give you a little modest example: Sprite. Ok? You start off with sprite, you take a letter off, one from the interior of the word, take the r away, and we're left with the word spite, then we take the e off the end, we're left with spit, we take the s off, we're left with pit, it, and I.
Write a program to find all words that can be reduced in this way, and then find the longest one.
This exercise is a little more challenging than most, and a few hints for solving this problem are given in Think Python. When trying this exercise please note that the provided list of words words.txt does not contain the one-letter words 'a' and 'i'.
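
One possible approach, sketched here on a tiny made-up word list rather than words.txt: a word is reducible if deleting some single letter yields another reducible word, and the results are cached in a dictionary so each word is examined only once. The function names are illustrative, and the one-letter words 'a' and 'i' are added by hand as noted above.

```python
def children(word, word_set):
    """All valid words obtained by deleting exactly one letter from word."""
    result = set()
    for i in range(len(word)):
        child = word[:i] + word[i + 1:]
        if child in word_set:
            result.add(child)
    return result

def is_reducible(word, word_set, memo):
    """True if word can be reduced to a one-letter word through valid words."""
    if word not in memo:
        if len(word) == 1:
            memo[word] = word in word_set
        else:
            memo[word] = any(
                is_reducible(child, word_set, memo)
                for child in children(word, word_set)
            )
    return memo[word]

def longest_reducible(words):
    # The one-letter words are missing from words.txt, so add them here.
    word_set = set(words) | {"a", "i"}
    memo = {}
    reducible = [w for w in word_set if is_reducible(w, word_set, memo)]
    return max(reducible, key=len)

sample = ["sprite", "spite", "spit", "pit", "it", "string", "sting"]
print(longest_reducible(sample))
```

In this sample, string is not reducible because sting has no valid child in the list, which shows that every step of the chain has to be checked.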