<div align=right>
<img src="img/logosmall.png" width="100px" align=right>
</div>

# Tuples and dictionaries

<div class="alert alert-warning">
Parts of this section have been adapted from copyrighted material in *Jones, M: Python for Biologists: A complete programming course for beginners (2013)*.

**Please do not distribute it!**

</div>

The core Python language contains a number of *data structures* (or *compound types*).  We've already encountered one:  The *list*.  The time has come to learn some further built-in data structures:  The *tuple* and the *dictionary*.

Dictionaries are extremely useful *mapping* data structures that are used to map a set of keys to corresponding values.  Tuples can be harder to explain to the beginner, so let's start with them:

## Tuples

Tuples are used to group together a small number of closely related values (possibly of different types), so that one could treat them as a single unit.

You can think of a tuple as a "record" of data.  Let's say we're writing an address book application.  For each entry we record:

* a numerical ID
* the person's name
* an email address
* a physical address
* postal code

For example:

In [1]:
record = (5,
          "Johann Visagie",
          "visagie@eva.mpg.de",
          "Deutscher Platz 6\nLeipzig",
          '04317')
print(record)

(5, 'Johann Visagie', 'visagie@eva.mpg.de', 'Deutscher Platz 6\nLeipzig', '04317')


Another obvious use for tuples would be complex numbers.  As we know from high school, a complex number has a real and an imaginary part, e.g. 5+4*i*.  We could create a simple tuple that pairs two numbers, and regard the first one as the real part and the second one as the imaginary part.  So this could be a list of complex numbers:

In [2]:
complex_numbers = [(5, 4), (3, 1), (2, 8), (7, -3)]
print(complex_numbers)

[(5, 4), (3, 1), (2, 8), (7, -3)]


How many elements are there in that list?  Can you guess?

In [3]:
len(complex_numbers)

4

We can subscript tuples like lists or strings, and the indices are zero-based:

In [4]:
for c in complex_numbers:
    print("Real part:", c[0], "\tImaginary part:", c[1])

Real part: 5 	Imaginary part: 4
Real part: 3 	Imaginary part: 1
Real part: 2 	Imaginary part: 8
Real part: 7 	Imaginary part: -3


You can use `len` to find the length of a tuple too:

In [5]:
len(record)

5

Let's try to change the numerical ID of the address book record we created earlier:

In [6]:
record[0] = 8

TypeError: 'tuple' object does not support item assignment

Nope!  Tuples are *immutable*.
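
If we really need a record with a different ID, we have to build a *new* tuple rather than modify the existing one.  A minimal sketch, reusing the example record from above (the name `new_record` is just for illustration):

```python
record = (5,
          "Johann Visagie",
          "visagie@eva.mpg.de",
          "Deutscher Platz 6\nLeipzig",
          '04317')

# Slicing a tuple gives a new tuple, and "+" concatenates two tuples.
# Note the trailing comma in (8,): that is how a one-element tuple is written.
new_record = (8,) + record[1:]

print(new_record[0])   # the new ID
print(record[0])       # the original tuple is unchanged
```

The original `record` still exists untouched; we have only created a second tuple that shares its other fields.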

So far we've always surrounded tuples by parentheses “`()`”, but these are only required when the syntax would otherwise be ambiguous.  If there's no ambiguity, you can leave out the parentheses:

In [None]:
my_tuple = 5, 6, "Fred"
print(my_tuple)

Note that the textual representation of a tuple produced by the `print` function includes the parentheses again.

### Tuple unpacking

Python allows us to *unpack* a tuple when we do an assignment, for example:

In [None]:
a, b = 3, 5
print(a)
print(b)

This is an extremely common (and useful) idiom!

In [None]:
for c in complex_numbers:
    r, i = c
    print("Real part:", r, "\tImaginary part:", i)

You can even do the unpacking right in the `for` statement:

In [None]:
for r, i in complex_numbers:
    print("Real part:", r, "\tImaginary part:", i)

Note that unpacking only works when the number of variables to the left of the assignment *equals* the number of elements in the tuple being assigned.  If the numbers differ, Python raises a `ValueError`:

In [None]:
a, b = 3, 5, "Fred"
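
The mismatched assignment above fails with a `ValueError`.  As a small sketch, we can catch the exception to inspect its message without stopping the program (the variable name `message` is only illustrative):

```python
try:
    a, b = 3, 5, "Fred"      # three values, but only two names
except ValueError as error:
    message = str(error)
    print("Unpacking failed:", message)
```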

### Iterating over a tuple

Iterating over a tuple is actually quite rare, due to the way in which one uses tuples.  However, it works as you'd expect.  (And note that the order of items in a tuple is fixed, as is the case with a list.)

In [None]:
for field in record:
    print(field)

### Tuples vs. lists

Some introductory texts describe tuples as "immutable lists".  This isn't really fair or accurate.  Here are some of the differences between a list and a tuple:

| Lists | Tuples |
|---|---|
| delimited by `[...]` | delimited by `(...)` (if required) |
| mutable | immutable |
| use for unbounded number of items of same type | use for (generally) short collection of (potentially) multiple types |
| each element of a list is a data point | each entire tuple represents a data point |

### Zipping lists

It's fairly common to have two or more related lists which we want to convert to a sequence of tuples, so that the first tuple contains all the first elements of the original lists, the second tuple all the second elements, and so forth…

In fact, this is so common that Python has a built-in function called `zip()` to do it:

In [7]:
list1 = [1, 2, 3, 4, 5, 6, 7, 8]
list2 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

zipped = zip(list1, list2)

for t in zipped:
    print(t)

(1, 'a')
(2, 'b')
(3, 'c')
(4, 'd')
(5, 'e')
(6, 'f')
(7, 'g')
(8, 'h')


Note that `zip` doesn't actually return a list of tuples;  rather, it returns a special object that produces those tuples one at a time when iterated over:

In [8]:
zipped

<zip at 0x10812a788>

>This is an example of a *lazy* calculation.  The `zip` function doesn't have to calculate the entire list of tuples upfront.  Rather, it returns an *iterator* that calculates each tuple in turn only when necessary.  We'll see later how to write our own lazy iterators (generators).
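
One practical consequence of this laziness, shown in the sketch below, is that a `zip` object can only be iterated over once.  After the first pass it is exhausted (the short lists here are made up for the illustration):

```python
numbers = [1, 2, 3]
letters = ['a', 'b', 'c']

zipped = zip(numbers, letters)

first_pass = list(zipped)    # consumes the zip object
second_pass = list(zipped)   # nothing is left to produce

print(first_pass)    # [(1, 'a'), (2, 'b'), (3, 'c')]
print(second_pass)   # []
```

If you need the tuples more than once, store them in a list first.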

As we've seen before, we can use the built-in function `list` to iterate over something implicitly, and return a list of the results:

In [9]:
list(zip(list1, list2))

[(1, 'a'),
 (2, 'b'),
 (3, 'c'),
 (4, 'd'),
 (5, 'e'),
 (6, 'f'),
 (7, 'g'),
 (8, 'h')]

>Note:  In Python 2, `zip()` returned a simple list of tuples.

## Dictionaries

### Storing paired data

Suppose we want to count the number of `A`s in a DNA sequence. Carrying out the calculation is quite straightforward; in fact it’s one of the first things we did in this course:

In [18]:
dna = "ATCCCCCGGGATCGAAAATCGTAAACGAACTAGAA"
a_count = dna.count("A")

How will our code change if we want to generate a complete list of base counts for the sequence? We’ll add a new variable for each base:

In [19]:
dna = "ATCCCCCGGGATCGAAAATCGTAAACGAACTAGAA"
a_count = dna.count("A")
t_count = dna.count("T")
g_count = dna.count("G")
c_count = dna.count("C")

…and now our code is starting to look rather repetitive. It’s not too bad for the four individual bases, but what if we want to generate counts for the 16 dinucleotides?

```python
dna = "ATCCCCCGGGATCGAAAATCGTAAACGAACTAGAA"
aa_count = dna.count("AA")
at_count = dna.count("AT")
ag_count = dna.count("AG")
# ...etc.
```

Clearly, the simple solution doesn't scale.  Let's see if we can come up with something that works better, using only the data structures we know already.  Let's make two lists:

* a list of nucleotides (the *keys*)
* and a list of counts of each of these nucleotides in our sequence (the *values*)

In [25]:
keys = []
values = []

# Iterating over a string yields characters
for nucleotide in "ATGC":
    keys.append(nucleotide)
    values.append(dna.count(nucleotide))
    
print(keys)
print(values)

['A', 'T', 'G', 'C']
[14, 5, 7, 9]


These lists will have to remain synchronised for this to work, i.e. the key at position _n_ in the `keys` list must correspond to the value at position _n_ in the `values` list.

We can use the `zip()` function we've recently learned to convert these two lists into a single list of tuples, thereby implicitly ensuring synchronisation:

In [26]:
nuc_counts = list(zip(keys, values))
print(nuc_counts)

[('A', 14), ('T', 5), ('G', 7), ('C', 9)]


Presumably we created this lookup table because we want to use the values in some computation.  That means we need a way of looking up the value associated with a specific key.  It's fairly trivial to do manually with a short lookup table like the one we have here, but what if our table contained hundreds or thousands of entries?

In that case, the best way to proceed would probably be:

* iterate over the list of tuples
* when we reach the tuple where the first element is the key we're looking for…
* …get the second element of that tuple

Let's write a little function to implement that:

In [27]:
def lookup_in_tuplelist(L, search_key):
    """Given a list of key/value tuples, L,
    return the value associated with search_key"""
    for key, value in L:
        if key == search_key:
            return value

print(lookup_in_tuplelist(nuc_counts, 'C'))

9


There are two things to note about the function `lookup_in_tuplelist`:

* I've given the function a *docstring*.  A docstring is a string that occurs, all on its own, as the first statement inside a function.  The docstring — if defined — becomes the help text for this function.

     (Note that there's a difference between a docstring and a comment that starts with “`#`”!)  

In [28]:
?lookup_in_tuplelist

* Whenever a `return` statement is encountered, execution of the function stops *immediately* and a value is returned.  I.e. in a call to `lookup_in_tuplelist()`, if the key we're searching for occurs halfway through the list of tuples, Python will break off the iteration there (break out of the `for` loop) and return the found value immediately.

We now have a way of looking up a key in a list of key/value tuples.  It works, but would be inefficient for very long key/value lists, because our function searches through the list *linearly*:  it starts at the beginning and evaluates each element of the list in turn until it finds the one it's looking for.  Hence, the time taken to search the list of key/value pairs will scale *linearly* with the size of the key/value list.  (It will on average take 1000 times longer to scan a list of 1000000 items than a list of 1000.)
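
To make the linear scaling concrete, the sketch below counts how many tuples a search has to inspect before it finds its key.  The function `count_steps` and the generated lookup tables are hypothetical, made up purely for this illustration:

```python
def count_steps(L, search_key):
    """Return the number of tuples inspected before search_key is found
    (or the length of L if the key is absent)."""
    steps = 0
    for key, value in L:
        steps = steps + 1
        if key == search_key:
            break
    return steps

# Build two lookup tables of different sizes: [(0, 0), (1, 1), (2, 4), ...]
small = [(n, n * n) for n in range(1000)]
large = [(n, n * n) for n in range(1000000)]

# Worst case: the key we want is the very last one in the table
print(count_steps(small, 999))       # 1000
print(count_steps(large, 999999))    # 1000000
```

A table that is 1000 times larger needs 1000 times as many comparisons in the worst case.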

This solution won't scale to the large lookup tables we meet in real-world code.

The need to store key/value pairs and look up a value *efficiently* by its key is incredibly common in programming.  Think, for instance, of…

* protein sequence names and their sequences
* DNA restriction enzyme names and their motifs
* codons and their associated amino acid residues
* colleagues’ names and their email addresses
* sample names and their co-ordinates
* words and their definitions

The last example in this list – words and their definitions – is an interesting one because we have a tool in the physical world for storing this type of data:  a *dictionary*. Because this problem is so very common, Python provides a built-in data structure to solve it, and this built-in data structure is also called a *dictionary*.

### Creating a dictionary

The syntax for creating a dictionary is similar to that for creating a list, but we use curly braces rather than square ones. Each pair of data, consisting of a key and a value, is called an item. When storing items in a dictionary, we separate them with commas. Within an individual item, we separate the key and the value with a colon. Here’s a bit of code that creates a dictionary of restriction enzymes (using data from the previous section) with three items:

In [29]:
enzymes = {'EcoRI': r'GAATTC', 'AvaII': r'GG(A|T)CC', 'BisI': r'GC[ATGC]GC'}

In this case, the keys and values are both strings. The keys are the names of the enzymes, and the values are regular expressions that can be used to match their recognition motifs.  Splitting the dictionary definition over several lines makes it easier to read:

In [30]:
enzymes = {
    'EcoRI': r'GAATTC',
    'AvaII': r'GG(A|T)CC',
    'BisI': r'GC[ATGC]GC'
}

…but doesn’t affect the code at all.  (As we've said, you can split a Python statement at any whitespace inside `()`, `[]`, or `{}`.)

Let's check the type of our dictionary:

In [31]:
type(enzymes)

dict

In [32]:
enzymes

{'AvaII': 'GG(A|T)CC', 'BisI': 'GC[ATGC]GC', 'EcoRI': 'GAATTC'}

### Retrieving an item from a dictionary

To retrieve a bit of data from the dictionary – i.e. to look up the motif for a particular enzyme – we write the name of the dictionary, followed by the key in square brackets:

In [35]:
enzymes['BisI']

'GC[ATGC]GC'

This looks very similar to the index notation we used with strings and lists, but instead of giving the numerical index of the element we want, we’re giving the *key* and retrieving the associated *value*.

What happens when we try to reference a key that doesn't exist in our dictionary?

In [36]:
enzymes['EcoRV']

KeyError: 'EcoRV'

…we get a `KeyError`, logically enough.
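
If a missing key is a real possibility, one option is to catch the exception and fall back to something sensible.  A minimal sketch (the fallback value `None` is an arbitrary choice for this illustration):

```python
enzymes = {'EcoRI': r'GAATTC', 'AvaII': r'GG(A|T)CC', 'BisI': r'GC[ATGC]GC'}

try:
    motif = enzymes['EcoRV']
except KeyError:
    # The key is absent, so fall back to a placeholder value
    motif = None
    print("No motif stored for EcoRV")

print(motif)
```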

### Adding and removing items

It’s relatively rare that we’ll want to create a dictionary all in one go like in the example above. More often, we’ll want to create an empty dictionary, then add key/value pairs to it.

To create an empty dictionary we simply write a pair of curly braces on their own, and to add elements, we use the square-brackets subscript notation on the left-hand side of an assignment:

In [38]:
enzymes = {}
enzymes['EcoRI'] = r'GAATTC'
enzymes['AvaII'] = r'GG(A|T)CC'
enzymes['BisI'] = r'GC[ATGC]GC'

print(enzymes)

{'EcoRI': 'GAATTC', 'AvaII': 'GG(A|T)CC', 'BisI': 'GC[ATGC]GC'}


Note that this means dictionaries are *mutable* (like lists, but unlike strings).

There are multiple ways to remove an item from a dictionary.  One way is to use a new Python keyword we haven't yet encountered:  `del`

In [39]:
del enzymes['AvaII']
print(enzymes)

{'EcoRI': 'GAATTC', 'BisI': 'GC[ATGC]GC'}


If we use `del`, the key/value pair is simply removed and discarded.  More often when we delete a key from a dictionary, we actually want to retrieve the associated value.  We can do this using the `pop()` method of the dictionary type.  `pop()` returns the value and deletes the key at the same time:

In [33]:
enzymes = {
    'EcoRI' : r'GAATTC',
    'AvaII' : r'GG(A|T)CC',
    'BisI'  : r'GC[ATGC]GC'
}

# remove the EcoRI enzyme from the dict
EcoRI_motif = enzymes.pop('EcoRI')

print("dict:", enzymes)
print("motif:", EcoRI_motif)

dict: {'AvaII': 'GG(A|T)CC', 'BisI': 'GC[ATGC]GC'}
motif: GAATTC
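
Like subscripting, `pop()` raises a `KeyError` if the key is absent.  As a brief sketch, `pop()` also accepts an optional second argument, which is returned instead when the key is missing:

```python
enzymes = {'AvaII': r'GG(A|T)CC', 'BisI': r'GC[ATGC]GC'}

# 'EcoRV' is not in the dict, so the default (None) is returned
# and the dictionary is left unchanged
missing = enzymes.pop('EcoRV', None)

print(missing)
print(len(enzymes))
```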


We can now repeat our previous example — where we built a list of tuples — using a dictionary.  We can build up the dictionary in a loop, as follows:

In [40]:
dna = "AATCCCCCGGGATCGAAAATCGTAAACGAACTAGAATCGATCGATCGTACGCTGA"

# Create an empty dictionary:
nuc_counts = {}

for nucleotide in "ATGC":
    nuc_counts[nucleotide] = dna.count(nucleotide)

print(nuc_counts)

{'A': 19, 'T': 10, 'G': 12, 'C': 14}


And now we can easily look up the count for a specific nucleotide:

In [41]:
nuc_counts['G']

12

Why is this more desirable than our list of tuples?  The reason has to do with how dictionaries are implemented behind the scenes in Python.

The key-based lookup is built on a data structure known as a *hash table*, which is well beyond the scope of this course.  However, you should be aware that this implementation is *incredibly* efficient.

In the average case, Python can look up a key in a dictionary in *constant time* (as opposed to linear time), meaning you can make very large dictionary structures and still use them realistically in your code.
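
Although the details are out of scope, the basic idea can be hinted at with a toy sketch.  The code below is an illustration only and is *not* how Python actually implements dictionaries;  the names `put`, `get_value`, `buckets` and `N_BUCKETS` are invented for this example.  The built-in `hash()` function turns a key into a number, and that number decides which small "bucket" the key/value pair lives in, so a lookup only has to search one bucket rather than the whole table:

```python
N_BUCKETS = 8
buckets = [[] for _ in range(N_BUCKETS)]   # eight empty lists

def put(key, value):
    """Store a key/value pair in the bucket chosen by the key's hash."""
    bucket = buckets[hash(key) % N_BUCKETS]
    for i, (k, v) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, value)   # key already present: overwrite
            return
    bucket.append((key, value))

def get_value(key):
    """Search only the one bucket in which the key could be stored."""
    for k, v in buckets[hash(key) % N_BUCKETS]:
        if k == key:
            return v
    raise KeyError(key)

for nucleotide, count in [('A', 14), ('T', 5), ('G', 7), ('C', 9)]:
    put(nucleotide, count)

print(get_value('G'))   # 7
```

However many pairs we store, a lookup goes straight to one bucket;  a real hash table also grows its number of buckets so that each one stays short.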

Python's dictionaries are **incredibly** useful, but they come with some restrictions:

* Not all types of data are allowed to be used as keys:  Immutable types like strings, numbers and tuples (of strings or numbers) are fine, but we can't use mutable types such as lists or other dictionaries.  (Values can be whatever type of data we like, however.)


* Keys must be unique – we can’t store multiple values for the same key.  (Which value will be returned when we look up a duplicate key?)

  Usually this is exactly what we want, but there are cases when you might want to create dictionary-like objects that allow duplicate keys.  This is possible but beyond the scope of this introductory course!
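
Both restrictions are easy to see in action.  In the sketch below (the dictionary `codon_counts` and its contents are made up for the illustration), a tuple works as a key, a list is rejected with a `TypeError`, and assigning to an existing key replaces the old value:

```python
codon_counts = {}

# A tuple of strings is immutable, so it can be used as a key
codon_counts[('A', 'T', 'G')] = 3

# A list is mutable, so Python refuses to use it as a key
try:
    codon_counts[['A', 'T', 'G']] = 3
except TypeError as error:
    print("Not allowed:", error)

# Assigning to an existing key overwrites its value;
# it does not add a second entry
codon_counts[('A', 'T', 'G')] = 4

print(codon_counts)
print(len(codon_counts))
```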

### Creating a dictionary with `dict()`

We've seen that the built-in function `list()` can iterate over a container implicitly, and return a list of the contents of that container.  Python also has a built-in `dict()` function that can create a dictionary, though it's rather picky about what it creates a dictionary *out of*.

One thing `dict` *can* create a dictionary from is a **list of tuples**, such as the lookup list of nucleotide counts we defined earlier:

In [42]:
print("Keys:\t", keys)
print("Values:\t", values)

nuc_counts = dict(zip(keys, values))
print("Dict:\t", nuc_counts)

Keys:	 ['A', 'T', 'G', 'C']
Values:	 [14, 5, 7, 9]
Dict:	 {'A': 14, 'T': 5, 'G': 7, 'C': 9}


### Retrieving values using `get()`

We've seen before that the simplest way to retrieve the value associated with a key in a dictionary is to use the subscript (square brackets) notation:

In [None]:
enzymes = {'EcoRI': r'GAATTC', 'AvaII': r'GG(A|T)CC', 'BisI': r'GC[ATGC]GC'}

print(enzymes['AvaII'])

An alternative is to use the `get()` method of the dictionary type:

In [None]:
print(enzymes.get('AvaII'))

These two methods behave rather differently if you're trying to retrieve a value for a key that doesn't exist in the dictionary:

In [None]:
enzymes['EcoRV']

In [None]:
print(enzymes.get('EcoRV'))

Using the subscript notation raises an exception, whereas `get()` just quietly returns `None`.  In fact, `get()` has another trick up its sleeve:  It can return a default value if a key is not found.  Let's look at an example where this can be useful.

Here we have a sequence `seq` in the IUPAC ambiguous nucleotide alphabet.  We write a few lines of code to create a dictionary `nuc_counts` which maps the letters that occur in the sequence (the keys) to counts of the number of times each occurs (the values):

In [None]:
seq = "NNHYDGSCARSDYSVTWATSBSARYVBNTBHCTDARTT"

nuc_counts = {}
for nucleotide in seq:
    if nucleotide in nuc_counts:
        nuc_counts[nucleotide] += 1
    else:
        nuc_counts[nucleotide] = 1

print(nuc_counts)

We use two new things in the code block above:

Firstly, we use the membership test keyword `in` with a dictionary.  We've done membership tests using `in` on lists and strings before.  When we use `in` with a dictionary, it tests whether an object is present among the dictionary's *keys*.

Secondly, we use the *increment operator*, “`+=`” (more formally, an *augmented assignment* operator).  This line of code…

```python
nuc_counts[nucleotide] += 1
```

…is exactly equivalent to this…

```python
nuc_counts[nucleotide] = nuc_counts[nucleotide] + 1
```

…but much more compact and easier to read.  Try it:

In [None]:
a = 5
a += 3
a

In [None]:
a -= 2
a

We can now print out how many times each of the letters in the IUPAC ambiguous alphabet occurs in `seq`:

In [None]:
# IUPAC ambiguous nucleotide codes
ambig_nucs = "ABCDGHKMNRSTVWY"

for nuc in ambig_nucs:
    print(nuc + ":", nuc_counts[nuc])

…but we get an error!  It turns out that not all the letters in the alphabet were represented in `seq`, and therefore the "missing" letters aren't present as keys in our dictionary!

We can get around this problem by using the `get()` method, and letting it supply a default count of zero for non-present letters:

In [None]:
for nuc in ambig_nucs:
    print(nuc + ":", nuc_counts.get(nuc, 0))

We could also have used `get()` to simplify the construction of our `nuc_counts` dictionary in the first place by entirely eliminating the `if ... else` statement.  Instead of this…

In [None]:
nuc_counts = {}
for nucleotide in seq:
    if nucleotide in nuc_counts:
        nuc_counts[nucleotide] += 1
    else:
        nuc_counts[nucleotide] = 1

…we can do this:

In [None]:
nuc_counts = {}
for nucleotide in seq:
    nuc_counts[nucleotide] = nuc_counts.get(nucleotide, 0) + 1
    
print(nuc_counts)

Can you explain how that works?

### Iterating over a dictionary

What if, instead of looking up a single item from a dictionary, we want to do something for all items?

For example, imagine that we wanted to take our `nuc_counts` dictionary variable from the code above and print out all the key/value pairs.

It turns out there's more than one way to do it.

### Iterating over keys

The `keys()` method of a dictionary returns an object that generates all the keys of the dictionary when iterated over.  We can use `list()` to iterate over it implicitly and create a list:

>In Python 2, `keys()` returned a simple list of keys.

In [43]:
list(nuc_counts.keys())

['A', 'T', 'G', 'C']

So one way to iterate over a dictionary would be to iterate over the result returned by its `keys()` method:

In [44]:
for key in nuc_counts.keys():
    print(key + ':\t', nuc_counts[key])

A:	 14
T:	 5
G:	 7
C:	 9


Now, you might wonder what happens if we simply iterate over the dictionary *itself*. Let's try it:

In [45]:
for something in nuc_counts:
    print(something)

A
T
G
C


Wait! Those are the keys of the `nuc_counts` dictionary again! So, **iterating over a dictionary is the same as iterating over its keys**. This means we can rewrite the previous example again in a slightly simpler fashion:

In [None]:
for nuc in nuc_counts:
    print(nuc + ':\t', nuc_counts[nuc])

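Notice the order of the output.  Since Python 3.7, iterating over a dictionary (or its `keys()`) returns the keys in the order in which they were inserted.  In older versions of Python, dictionaries were *unordered*, and iteration was not guaranteed to return keys in any specific order.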

If we want to return the keys in an (alphanumerically) sorted order, we can use the built-in function `sorted()`:

In [46]:
for key in sorted(nuc_counts.keys()):
    print(key + ':\t', nuc_counts[key])

A:	 14
C:	 9
G:	 7
T:	 5
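
Sorting by *value* instead of by key is also possible.  One way, sketched here, is to pass the dictionary's own `get` method as the `key` argument of `sorted()`, so that each key is ranked by its associated count:

```python
nuc_counts = {'A': 14, 'T': 5, 'G': 7, 'C': 9}

# sorted() calls nuc_counts.get(key) for every key and orders the keys
# by the result; reverse=True puts the largest count first
by_count = sorted(nuc_counts, key=nuc_counts.get, reverse=True)

for key in by_count:
    print(key + ':\t', nuc_counts[key])
```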


### Iterating over items

In the example code above, the first thing we need to do inside the loop is to look up the value for the current key. This is a very common pattern when iterating over dictionaries – so common, in fact, that Python has a special shorthand for it. Instead of doing this:

```python
for key in my_dict.keys():
    value = my_dict.get(key)
    # do something with key and value
```

…we can use the `items()` method to iterate over pairs of data, rather than just keys:

```python
for key, value in my_dict.items():
    # do something with key and value
```

The `items()` method of dictionary objects returns an object which, when iterated over, yields a sequence of (key, value) *tuples*. That’s why we have to give two variable names at the start of the loop:  we unpack each tuple as we receive it.

>In Python 2, `items()` returned a simple list of tuples.

Let's inspect the result of calling the `items()` method of our `nuc_counts` dictionary.  We'll again use `list()` to iterate over it implicitly:

In [47]:
list(nuc_counts.items())

[('A', 14), ('T', 5), ('G', 7), ('C', 9)]

We could therefore rewrite our example using `items()`:

In [48]:
for nuc, count in nuc_counts.items():
    print(nuc + ':\t', count)

A:	 14
T:	 5
G:	 7
C:	 9


This method is generally preferred for iterating over items in a dictionary, as it makes the intention of the code very clear.

---

## Exercises

### Counting words

In the `files` subdirectory is a file called `words.txt`, containing a single long line of text, broken up into words.

Assuming that all words are demarcated by spaces, build a dictionary that documents how many times each word occurs in the text.

Test yourself:  In the following sentence…

    We tried list and we tried dicts also we tried Zen

…the word counts are as follows:

| word | count |
|------|:-----:|
| `and`   | 1 |
| `We`    | 1 |
| `tried` | 3 |
| `dicts` | 1 |
| `list`  | 1 |
| `we`    | 2 |
| `also`  | 1 |
| `Zen`   | 1 |

In [50]:
%cd files

/Users/sabineurban/EVOP2017/files


In [55]:
# Exercise 1


# open the file, read its single line (readline), strip trailing whitespace
# (including the newline) and split the line into a list of words
with open("words.txt", 'r') as infile:
    words = infile.readline().rstrip().split(" ")

# create a dictionary wordcount
wordcount = {}

for word in words:
    wordcount[word] = wordcount.get(word, 0) + 1
    
print(wordcount)

# print an empty line
print()

# reverse sorted by word frequency:
from operator import itemgetter
print(list(reversed(sorted(wordcount.items(), key=itemgetter(1)))))

{'When': 1, 'I': 2, 'find': 1, 'myself': 1, 'in': 4, 'times': 1, 'of': 11, 'trouble': 1, 'Mother': 2, 'Mary': 2, 'comes': 2, 'to': 3, 'me': 4, 'Speaking': 3, 'words': 7, 'wisdom': 7, 'let': 30, 'it': 36, 'be': 41, 'And': 3, 'my': 1, 'hour': 1, 'darkness': 1, 'she': 1, 'is': 4, 'standing': 1, 'right': 1, 'front': 1, 'Let': 6, 'Whisper': 4, 'when': 2, 'the': 4, 'broken': 1, 'hearted': 1, 'people': 1, 'living': 1, 'world': 1, 'agree': 1, 'There': 4, 'will': 5, 'an': 4, 'answer': 4, 'For': 1, 'though': 1, 'they': 2, 'may': 1, 'parted': 1, 'there': 2, 'still': 2, 'a': 2, 'chance': 1, 'that': 2, 'see': 1, 'night': 1, 'cloudy': 1, 'light': 1, 'shines': 1, 'on': 1, 'Shine': 1, 'until': 1, 'tomorrow': 1, 'wake': 1, 'up': 1, 'sound': 1, 'music': 1, 'yeah': 2}

[('be', 41), ('it', 36), ('let', 30), ('of', 11), ('wisdom', 7), ('words', 7), ('Let', 6), ('will', 5), ('answer', 4), ('an', 4), ('There', 4), ('the', 4), ('Whisper', 4), ('is', 4), ('me', 4), ('in', 4), ('And', 3), ('Speaking', 3), ('to'

### Inverting a table

The file `iupac_codes.txt` in the `files` subdirectory contains a text representation of a table listing the IUPAC ambiguity codes and their meanings:

In [None]:
%load iupac_codes.txt

Each row starts with a code letter, followed by a tab character, followed by a comma-separated list of nucleotides which are represented by that letter. Additionally, the table has a header line.

What we want to do is construct a file that is the “inverse” of this one. In other words, one that looks like this:

```
Nucl.	Codes
A	A,M,R,W,V,H,D,N
C	C,M,S,Y,V,H,B,N
T	T,W,Y,K,H,D,B,N
G	G,R,S,K,V,D,B,N
```

Write some code that reads in `iupac_codes.txt`, generates the inverse table and writes it out to a file again.

In [None]:
# Exercise 2

reverse_dict = {}

with open("iupac_codes.txt", 'r') as infile:
    
    header = infile.readline()
    
    for line in infile:
        line = line.rstrip()
        code, meanings = line.split('\t')
        for meaning in meanings.split(','):
            reverse_dict[meaning] = reverse_dict.get(meaning, [])
            reverse_dict[meaning].append(code)
            
with open("iupac_codes_out.txt", 'w') as outfile:
    
    print("Nucl.", "Codes", sep='\t', file=outfile)
    for nuc in sorted(reverse_dict):
        print(nuc, ','.join(sorted(reverse_dict[nuc])), sep='\t', file=outfile)

### DNA translation

>Note:  This is a more complex exercise and we'll skip it if we're short of time.

Write a program that will translate a DNA sequence into protein.

* When a codon doesn’t code for anything (e.g. a stop codon), you may use “`*`” in the output.


* Choose just one reading frame — begin at the beginning of the sequence.  Ignore the extra bases at the end if the sequence length is not an exact multiple of 3.


* Decide how you want to handle ambiguous codes. (The easiest way is to say that you won't, and only handle input containing `A`, `C`, `G` and `T`. But if you want a challenge, try handling ambiguous codes too.)

Some tables that may come in handy:

In [None]:
codon_table = {
    'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L', 'TCT': 'S',
    'TCC': 'S', 'TCA': 'S', 'TCG': 'S', 'TAT': 'Y', 'TAC': 'Y',
    'TGT': 'C', 'TGC': 'C', 'TGG': 'W', 'CTT': 'L', 'CTC': 'L',
    'CTA': 'L', 'CTG': 'L', 'CCT': 'P', 'CCC': 'P', 'CCA': 'P',
    'CCG': 'P', 'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
    'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'ATT': 'I',
    'ATC': 'I', 'ATA': 'I', 'ATG': 'M', 'ACT': 'T', 'ACC': 'T',
    'ACA': 'T', 'ACG': 'T', 'AAT': 'N', 'AAC': 'N', 'AAA': 'K',
    'AAG': 'K', 'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
    'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V', 'GCT': 'A',
    'GCC': 'A', 'GCA': 'A', 'GCG': 'A', 'GAT': 'D', 'GAC': 'D',
    'GAA': 'E', 'GAG': 'E', 'GGT': 'G', 'GGC': 'G', 'GGA': 'G',
    'GGG': 'G'
}

# Extra data in case you want it
stop_codons = ['TAA', 'TAG', 'TGA']
start_codons = ['TTG', 'CTG', 'ATG']
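
As a starting point, here is one possible sketch of the simplest variant:  a single reading frame, unambiguous bases only, and “`*`” for any codon that isn't in the table.  The function name `translate` and the test sequence are made up for this illustration, and only a small excerpt of `codon_table` is repeated so that the snippet runs on its own:

```python
# A small excerpt of the codon table above, just enough for the demo
codon_table = {'ATG': 'M', 'GCC': 'A', 'AAA': 'K', 'TGG': 'W'}

def translate(dna, table):
    """Translate dna in the first reading frame, using '*' for any
    codon that is missing from table (e.g. a stop codon)."""
    protein = ""
    # Step through the sequence three bases at a time; stopping at
    # len(dna) - 2 ignores any incomplete codon at the end
    for start in range(0, len(dna) - 2, 3):
        codon = dna[start:start + 3]
        protein = protein + table.get(codon, '*')
    return protein

print(translate("ATGGCCAAATGGTAAGG", codon_table))   # MAKW*
```

Here `get()` with a default value does the work of mapping unknown codons to “`*`”;  handling ambiguous codes would need more than this.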