<h1 id="toctitle">Dictionaries exercise solutions</h1>
<ul id="toc"/>

### Common and scientific names

The first step in this exercise is to read the file line by line, split where we see a comma, and assign variable names to the two bits of each line. The code for doing this is quite similar to that from the exon exercise:

In [2]:
for line in open("names.txt"):
    fields = line.rstrip("\n").split(",")
    scientific_name = fields[0]
    common_name = fields[1]
    print(scientific_name, common_name)

('Anthus novaeseelandiae', 'Australasian pipit')
('Canna flaccida', 'golden canna')
('Elopidae', 'ladyfishes')
('Platymantis corrugatus', 'rough-backed forest frog')
('Sylvisorex johnstoni', "Johnston's shrew")
('Ephedra', 'ma huang')
('Choristoneura fumiferana multiple nucleopolyhedrovirus', 'spruce budworm nuclear polyhedrosis virus')
('Lepidocolaptes affinis', 'spot-crowned woodcreeper')
('Foeniculum vulgare var. dulce', 'sweet fennel')
('Tropaeolaceae', 'nasturtium family')
('Parulidae', 'wood-warblers')
('Alyogyne pinoniana', 'sand-hibiscus')
('Melanargia russiae', "Esper's marbled white butterfly")
('Huperzia serrata', 'toothed club-moss')
('Lycaena helloides', 'purplish copper')
('Victoria cruziana', 'irupe')
('Pilosella caespitosa', 'yellow fox-and-cubs')
('Cirsium rhothophilum', 'surf thistle')
('Fulica atra', 'common coot')
('Ilex verticillata', 'black-alder')
('Ceratopetalum apetalum', 'tarwood')
('Hypsiboas boans', 'giant gladiator treefrog')
('Meandrusa payeni', 'sickle sw

At this point, you might notice that we don't actually need a dict to find the common name for a given species - we can just check each line until we find the one that we want:

In [4]:
for line in open("names.txt"):
    fields = line.rstrip("\n").split(",")
    scientific_name = fields[0]
    common_name = fields[1]
    if scientific_name == "Homo sapiens":
        print(common_name)

man


but this is slow and inefficient as we have to re-read the whole file whenever we want to find a name. Let's store the names in a dict instead:

In [5]:
names = {}
for line in open("names.txt"):
    fields = line.rstrip("\n").split(",")
    scientific_name = fields[0]
    common_name = fields[1]
    
    names[scientific_name] = common_name

Now we can look up any name using a single call to the dict's `get()` method, which also lets us supply a default value to use when the name is missing:

In [6]:
for scientific_name in ['Homo sapiens', 'Milax gagates', 'Ovis aries']:
    common_name = names.get(scientific_name, 'none')
    print("common name for " + scientific_name + " is " + common_name)

common name for Homo sapiens is man
common name for Milax gagates is greenhouse slug
common name for Ovis aries is wild sheep
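
To see why `get()` is convenient here, it helps to compare it with square-bracket lookup. The sketch below is only an illustration: it uses a tiny hand-made dict instead of the real file, and `'Not a species'` is a made-up key.

```python
# a small stand-in for the real names dict (entries copied from the output above)
names = {
    "Homo sapiens": "man",
    "Milax gagates": "greenhouse slug",
}

# get() returns the stored value, or the default when the key is missing
found = names.get("Homo sapiens", "none")
missing = names.get("Not a species", "none")
print(found, missing)

# square brackets raise a KeyError for a missing key instead
try:
    names["Not a species"]
    raised = False
except KeyError:
    raised = True
print(raised)
```

Which one to use depends on whether a missing name should be treated as an error or quietly replaced with a placeholder.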


To count the number of frogs, we are interested in the values of the dict. We can iterate over the values and check whether each one contains the word 'frog':

In [7]:
counter = 0
for common_name in names.values():
    if 'frog' in common_name:
        counter = counter + 1
print("counted " + str(counter) + ' frogs')

counted 624 frogs


To get their scientific names, we need both halves of each pair, i.e. the scientific name (the key) and the common name (the value), so we iterate over the dict's items:

In [8]:
for scientific_name, common_name in names.items():
    if 'frog' in common_name:
        print(scientific_name)

Platymantis corrugatus
Chiromantis rufescens
Xenorhina
Litoria pallida
Nyctibatrachus sanctipalustris
Cyclorana maculosa
Philautus tectus
Ptychadena chrysogaster
Raorchestes signatus
Litoria peronii
Theloderma horridum
Hylarana erythraea
Calyptocephallela gayi
Ptychadena porosissima
Hyla avivoca
Andinobates minutus
Micrixalus fuscus
Sylvirana spinulosa
Nanorana liebigii
Ramanella obscura
Odorrana versabilis
Amolops granulosus
Hyloscirtus alytolylax
Hylarana chalconota
Hydrophylax gracilis
Craugastor crassidigitus
Odorrana grahami
Odorrana swinhoana
Microhyla borneensis
Rugosa rugosa
Hyperolius benguellensis
Leptolalax gracilis
Pyxicephalus edulis
Ameerega bilinguis
Rana catesbeiana
Barbourula kalimantanensis
Liurana xizangensis
Andinobates opisthomelas
Limnonectes palavanensis
Discoglossus sardus
Limnonectes macrocephalus
Dermatonotus muelleri
Ameerega petersi
Leptodactylus knudseni
Arthroleptis adolfifriederici
Fejervarya keralensis
Limnonectes khasianus
Myobatrachidae
Leptopelis conc
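
The counting and listing steps can also be combined. The following is a minimal sketch on a hand-made sample dict, not the real data; the `'Imaginary species'` entry is invented to show that a substring test also matches names that merely contain the letters "frog".

```python
# hypothetical sample; the real dict is built from names.txt
names = {
    "Platymantis corrugatus": "rough-backed forest frog",
    "Hypsiboas boans": "giant gladiator treefrog",
    "Fulica atra": "common coot",
    "Imaginary species": "frogfish",  # invented entry: contains 'frog' but is not a frog
}

# substring test, as in the solution above
frog_species = [sci for sci, common in names.items() if "frog" in common]
frog_count = len(frog_species)

# stricter test: 'frog' must be a whole word in the common name
strict_species = [sci for sci, common in names.items() if "frog" in common.split()]

print(frog_count, sorted(frog_species))
print(len(strict_species), strict_species)
```

Neither test is perfect: the substring test lets "frogfish" through, while the whole-word test misses "treefrog". Which is better depends on how the names in the file are written.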

Now we can think about processing the second file which has the sequence counts in it. For each line we need to split it into a scientific name and a count, then look up the corresponding common name using our original dict:

In [11]:
for line in open("seq_counts.csv"):
    scientific_name = line.split(",")[0]
    sequence_count = line.split(",")[1].rstrip("\n")
    
    common_name = names[scientific_name]
    print(common_name, sequence_count)

('Australasian pipit', '75')
('golden canna', '10')
('ladyfishes', '1')
('Ethiopian fruit fly', '43')
('rough-backed forest frog', '165')
("Johnston's shrew", '525')
('scentless mayweed', '16')
('spruce budworm nuclear polyhedrosis virus', '26')
('spot-crowned woodcreeper', '88')
('sweet fennel', '1')
("Quoy's coral snail", '17')
('wood-warblers', '2490')
('sand-hibiscus', '968')
("Esper's marbled white butterfly", '1526')
('toothed club-moss', '121')
('purplish copper', '256')
('irupe', '64')
('surf thistle', '9')
('common coot', '8')
('black-alder', '180')
('tarwood', '426')
('mizutengu', '41')
('winged tobacco', '906')
('sickle swallowtail', '14')
('black horehound', '48')
('chrysolite-lily', '758')
('spotted flagtail', '100')
('Leyland-cypress', '1648')
('brush whitewood', '5')
('white foxtail', '1650')
('yarrow', '244')
('western foam-nest tree frog', '1')
('heath spotted-orchid', '9')
('moose maple', '148')
('burnet ragwort', '77')
('sheep sorrel', '10')
('gray-veined white butte

Once we have the lookup working, we can switch from printing the information to putting it in a new file:

In [12]:
output = open("common_name_seq_counts.csv", "w")

for line in open("seq_counts.csv"):
    scientific_name = line.split(",")[0]
    sequence_count = line.split(",")[1].rstrip("\n")
    
    common_name = names[scientific_name]
    output.write(common_name + "," + sequence_count + "\n")
output.close()
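
As a variation, the same step can be written with `with` blocks, which close the files for us, and with `get()` so that a species without a common name doesn't stop the loop with a `KeyError`. This is a sketch under assumptions: the sample data and the `'Unknown species'` line are made up, and the files are written to a temporary folder so that nothing real is overwritten.

```python
import os
import tempfile

# hypothetical sample data standing in for names.txt and seq_counts.csv
names = {"Fulica atra": "common coot", "Victoria cruziana": "irupe"}
sample_counts = "Fulica atra,8\nVictoria cruziana,64\nUnknown species,3\n"

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "seq_counts.csv")
out_path = os.path.join(workdir, "common_name_seq_counts.csv")

with open(in_path, "w") as handle:
    handle.write(sample_counts)

# 'with' closes both files automatically, even if an error occurs
with open(in_path) as counts, open(out_path, "w") as output:
    for line in counts:
        scientific_name, sequence_count = line.rstrip("\n").split(",")
        # fall back to the scientific name when there is no common name
        common_name = names.get(scientific_name, scientific_name)
        output.write(common_name + "," + sequence_count + "\n")

with open(out_path) as handle:
    result = handle.read()
print(result)
```

Falling back to the scientific name is just one possible choice; skipping the line or writing a placeholder would work equally well.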

### Bonus exercise: DNA translation 

First, we need to think about splitting up the DNA into codons. We can do the first few manually:

In [1]:
dna = "ATGTTCGGT"
codon1 = dna[0:3]
codon2 = dna[3:6]
codon3 = dna[6:9]
print(codon1, codon2, codon3)

('ATG', 'TTC', 'GGT')


until we see the pattern. The start position goes up by three each time, and the stop position is always three greater than the start. So with a range:

In [3]:
dna = "ATGTTCGGT" 
for start in range(0,7,3): 
    codon = dna[start:start+3] 
    print("one codon is " + codon) 

one codon is ATG
one codon is TTC
one codon is GGT


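This works for this particular sequence, but we need a more general solution. We always start at zero, and always go up by three, but the middle argument to `range()` is tricky. The final codon has to start three bases before the end of the sequence, and because `range()` stops just before its middle argument, that argument needs to be two less than the length of the sequence. Any one or two leftover bases at the end are ignored: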

In [6]:
dna = "ATGTTCGTGACGAGGGT" 

# calculate where range() should stop: one past the start position of the final codon
last_codon_start = len(dna) - 2 

# process the dna sequence in three base chunks
for start in range(0,last_codon_start,3): 
    codon = dna[start:start+3] 
    print("one codon is " + codon) 

one codon is ATG
one codon is TTC
one codon is GTG
one codon is ACG
one codon is AGG


This version will work for any length DNA sequence. 
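
To check the claim on a few lengths, here is a small sketch that wraps the same loop in a helper. The name `split_codons` is made up for this illustration.

```python
def split_codons(dna):
    # range() stops before len(dna) - 2, so every slice is exactly three bases long
    return [dna[start:start + 3] for start in range(0, len(dna) - 2, 3)]

print(split_codons("ATGTTCGGT"))          # length is a multiple of three
print(split_codons("ATGTTCGTGACGAGGGT"))  # two leftover bases are dropped
print(split_codons("AT"))                 # too short for even one codon
```

Sequences that are shorter than three bases simply give an empty list rather than an error.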

Getting the amino acid for a given codon is straightforward; we just look it up in a dict:

In [4]:
gencode = {
'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}


dna = "ATGTTCGTGACGAGGGT" 

last_codon_start = len(dna) - 2 

for start in range(0,last_codon_start,3): 
    codon = dna[start:start+3] 
    aa = gencode[codon]
    print(codon, aa) 

('ATG', 'M')
('TTC', 'F')
('GTG', 'V')
('ACG', 'T')
('AGG', 'R')


Now all we need to do is build up the protein sequence one amino acid at a time:

In [8]:
dna = "ATGTTCGTGACGAGGGT" 

last_codon_start = len(dna) - 2 

protein = ""
for start in range(0,last_codon_start,3): 
    codon = dna[start:start+3] 
    aa = gencode[codon]
    protein = protein + aa
    print(codon, aa, protein) 

('ATG', 'M', 'M')
('TTC', 'F', 'MF')
('GTG', 'V', 'MFV')
('ACG', 'T', 'MFVT')
('AGG', 'R', 'MFVTR')


We can see how one amino acid gets added to the `protein` string each time round the loop. At the end of the loop we have the complete protein:

In [9]:
dna = "ATGTTCGTGACGAGGGT" 

last_codon_start = len(dna) - 2 

protein = ""
for start in range(0,last_codon_start,3): 
    codon = dna[start:start+3] 
    aa = gencode[codon]
    protein = protein + aa
    
print(protein) 

MFVTR


This is a good candidate for a function, so let's turn it into one:

In [10]:
def translate_dna(dna): 
    last_codon_start = len(dna) - 2 
    protein = "" 
    for start in range(0,last_codon_start,3): 
        codon = dna[start:start+3] 
        aa = gencode.get(codon) 
        protein = protein + aa 
    return protein 

In [11]:
translate_dna("ATGTTCGTGACGAGGGT")

'MFVTR'

And we'll try a few more DNA sequences to see what happens:

In [12]:
print(translate_dna("ATGTTCGGT")) 
print(translate_dna("ATCGATCGATCGTTGCTTATCGATCAG")) 
print(translate_dna("actgatcgtagctagctgacgtatcgtat")) 
print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))

MFG
IDRSLLIDQ


TypeError: cannot concatenate 'str' and 'NoneType' objects

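We get an error on the third DNA sequence because it's in lower case: the lower-case codons aren't keys in `gencode`, so `get()` returns `None`, and adding `None` to a string raises the `TypeError`. Let's change each codon to upper case before looking it up: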

In [13]:
def translate_dna(dna): 
    last_codon_start = len(dna) - 2 
    protein = "" 
    for start in range(0,last_codon_start,3): 
        codon = dna[start:start+3] 
        aa = gencode.get(codon.upper()) 
        protein = protein + aa 
    return protein 

In [14]:
print(translate_dna("ATGTTCGGT")) 
print(translate_dna("ATCGATCGATCGTTGCTTATCGATCAG")) 
print(translate_dna("actgatcgtagctagctgacgtatcgtat")) 
print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))

MFG
IDRSLLIDQ
TDRS_LTYR


TypeError: cannot concatenate 'str' and 'NoneType' objects

Now the third DNA sequence works but we get an error on the fourth one. What is wrong? Let's print out the codon at each step to figure it out:

In [17]:
def translate_dna(dna): 
    last_codon_start = len(dna) - 2 
    protein = "" 
    for start in range(0,last_codon_start,3): 
        codon = dna[start:start+3] 
        print(codon)
        aa = gencode.get(codon.upper()) 
        protein = protein + aa 
    return protein 
print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))

ACG
ATC
GAT
CGT
NAC


TypeError: cannot concatenate 'str' and 'NoneType' objects

The error comes from the codon `NAC`, i.e. there is an N in the sequence. The best way to fix it is to say that the amino acid for any codon that isn't in the dict is `X`, meaning unknown. To do this we just switch from

```python
aa = gencode.get(codon.upper())
```

to 

```python
aa = gencode.get(codon.upper(), 'X')
```

like this:

In [37]:
def translate_dna(dna): 
    last_codon_start = len(dna) - 2 
    protein = "" 
    for start in range(0,last_codon_start,3): 
        codon = dna[start:start+3] 
        print(codon)
        aa = gencode.get(codon.upper(), 'X') 
        protein = protein + aa 
    return protein 
print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))

ACG
ATC
GAT
CGT
NAC
GTA
CGA
TCG
TAC
TCG
TIDRXVRSYS


Now we get a complete protein translation. Notice that it has an X in the middle.
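
For comparison, here is a self-contained sketch of the same idea without the debugging `print()`. Two things in it are not from the solution above and should be read as assumptions: the table is built programmatically from the standard genetic code rather than typed out, and the function name `translate_dna_join` is made up. It collects amino acids in a list and joins them at the end.

```python
from itertools import product

# standard genetic code, with codons ordered by the bases T, C, A, G;
# '*' marks a stop codon in this string
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
# use '_' for stop codons to match the dict used in this exercise
gencode = dict(zip(codons, AMINO_ACIDS.replace("*", "_")))

def translate_dna_join(dna):
    dna = dna.upper()  # convert once, instead of once per codon
    amino_acids = []
    for start in range(0, len(dna) - 2, 3):
        # any codon that isn't in the table becomes 'X'
        amino_acids.append(gencode.get(dna[start:start + 3], "X"))
    return "".join(amino_acids)

print(translate_dna_join("ATGTTCGGT"))
print(translate_dna_join("actgatcgtagctagctgacgtatcgtat"))
print(translate_dna_join("ACGATCGATCGTNACGTACGATCGTACTCG"))
```

The results should match the ones above. Joining a list is a common habit for building long strings, but for sequences of this size repeated `+` works just as well.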


### Bonus exercise: NCBI taxonomy

The first thing we need to do is turn the file of child->parent relationships into a dict. We read each line, split it on a comma, and store the pair:

In [10]:
child2parent = {}

for line in open("child2parent.txt"):
    fields = line.rstrip("\n").split(",")
    
    # the ids are integers, so remember to turn them from strings to numbers before we store the pair
    child = int(fields[0])
    parent = int(fields[1])
    child2parent[child] = parent
    
    

Now we should be able to look up the parent ID for any given child:

In [13]:
print(child2parent[6669])
print(child2parent[1234])

6668
189779


We can begin to think now about how to find the last common ancestor for a pair of nodes. The simplest way to do it is to get a complete list of ancestors for each node (i.e. all the way back to the root of the tree), then find the first node that occurs in both lists. We'll start with a function that will give us a list of all the ancestors for a given node. The trick is to write a loop so that the parent of the current node becomes the current node for the next iteration:

In [19]:
def get_ancestors(starting_node):
    result = []
    for _ in range(100):
        result.append(starting_node)
        starting_node = child2parent[starting_node]
    return result
        

The tricky thing is that we don't know how many ancestral nodes there will be (i.e. the depth of the starting node), so we don't know how many times to keep going up. In the above function, we use `range()` with a generously large number to go up 100 times, but the problem is that the root node 1 (which is its own parent) appears many times in the output list:

In [20]:
get_ancestors(6669)

[6669,
 6668,
 77658,
 116561,
 6665,
 84337,
 116557,
 6658,
 6657,
 197562,
 197563,
 6656,
 88770,
 1206794,
 33317,
 33213,
 6072,
 33208,
 33154,
 2759,
 131567,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1]

The easiest way to solve this is to simply return as soon as we reach the root node:

In [22]:
def get_ancestors(starting_node):
    result = []
    result.append(starting_node)
    for _ in range(100):
        starting_node = child2parent[starting_node]
        result.append(starting_node)
        
        if starting_node == 1:
            return result
    return result

Now as soon as we reach the root node, we stop adding ancestors:

In [23]:
get_ancestors(6669)

[6669,
 6668,
 77658,
 116561,
 6665,
 84337,
 116557,
 6658,
 6657,
 197562,
 197563,
 6656,
 88770,
 1206794,
 33317,
 33213,
 6072,
 33208,
 33154,
 2759,
 131567,
 1]
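
Another way to avoid guessing a big enough number is a `while` loop that climbs until it reaches the root. The sketch below is an alternative, not the solution used here: it runs on a made-up toy tree, the name `get_ancestors_while` is invented, and it assumes (as described above for node 1) that the root is the only node that is its own parent.

```python
# hypothetical toy tree: node 1 is the root and is its own parent
child2parent = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}

def get_ancestors_while(starting_node):
    result = [starting_node]
    node = starting_node
    # keep climbing until we reach a node that is its own parent (the root)
    while child2parent[node] != node:
        node = child2parent[node]
        result.append(node)
    return result

print(get_ancestors_while(3))
print(get_ancestors_while(1))
```

The trade-off is that a file containing a cycle would make this loop run forever, whereas the `range()` version always stops.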

We can now write the next function, which will take two nodes and look up the list of ancestors for each before finding the first one that appears in both:

In [24]:
def get_lca(node_one, node_two):
    ancestors_one = get_ancestors(node_one)
    ancestors_two = get_ancestors(node_two)
    for ancestor in ancestors_one:
        if ancestor in ancestors_two:
            return ancestor

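Let's try it with humans (9606) and taxon 9568. Note that 9568 is not the chimp id (chimps are 9598, as a lookup further down confirms), so the result is the LCA of humans and taxon 9568 rather than of humans and chimps: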

In [25]:
get_lca(9606, 9568)

9526

It's hard to know if the answer we get back is correct, since we don't know what name that corresponds to. Let's use the second file and make some dicts that will map between names and taxonomy ids:

In [30]:
name2id = {}
id2name = {}
for line in open("id2name.txt"):
    fields = line.rstrip("\n").split(",")
    
    id = int(fields[0])
    name = fields[1]
    
    id2name[id] = name
    name2id[name] = id

If you look at the file you'll see that there are sometimes multiple names for the same id, which means that the values in the `id2name` dict will be overwritten and we'll be left with the final one. Now we can look up the ids for names:

In [35]:
print(name2id['Homo sapiens'])
print(name2id['Pan troglodytes'])

9606
9598


and names for ids:

In [36]:
id2name[9526]

'Catarrhini'
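
On the point about multiple names for the same id: the sketch below shows the overwriting in action, and one possible way to keep every name by storing a list per id. The three input lines are made up for illustration (they only imitate the layout of `id2name.txt`), and `id2names` is a hypothetical extra dict.

```python
# hypothetical lines in the same id,name layout as id2name.txt
lines = ["9606,Homo sapiens", "9606,human", "9598,Pan troglodytes"]

id2name = {}   # one name per id: later lines overwrite earlier ones
id2names = {}  # every name per id, stored in a list

for line in lines:
    fields = line.split(",")
    taxid = int(fields[0])
    name = fields[1]

    id2name[taxid] = name
    # setdefault() creates an empty list the first time we see an id
    id2names.setdefault(taxid, []).append(name)

print(id2name[9606])
print(id2names[9606])
```

For this exercise a single name per id is enough, so the simpler dict is fine.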

Let's now write a function that takes two names and returns the name of the LCA:

In [38]:
def get_named_lca(name_one, name_two):
    id_one = name2id[name_one]
    id_two = name2id[name_two]
    lca = get_lca(id_one, id_two)
    return id2name[lca]

Note that we don't have to rewrite any of the LCA logic, we just reuse our earlier function. Let's try getting the common ancestor of humans and some model organisms:

In [47]:
print(get_named_lca('Homo sapiens', 'Pan troglodytes'))
print(get_named_lca('Homo sapiens', 'Xenopus laevis'))
print(get_named_lca('Homo sapiens', 'Danio rerio'))
print(get_named_lca('Homo sapiens', 'Ciona intestinalis'))
print(get_named_lca('Homo sapiens', 'Drosophila melanogaster'))
print(get_named_lca('Homo sapiens', 'Arabidopsis thaliana'))
print(get_named_lca('Homo sapiens', 'Escherichia coli'))


Homo/Pan/Gorilla group
tetrapods
bony vertebrates
chordates
Bilateria
eukaryotes
cellular organisms
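
A side note on the LCA function itself: checking `ancestor in ancestors_two` scans a list each time, which is fine for short lineages. If speed ever mattered, one option is to turn the second list into a set. The sketch below demonstrates this on a made-up toy tree; `get_lca_with_set` is an invented name and the helper is a simplified stand-in for `get_ancestors()`.

```python
# hypothetical toy tree: node 1 is the root and is its own parent
child2parent = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}

def get_ancestors(node):
    # simplified stand-in: climb until we reach the root
    result = [node]
    while child2parent[node] != node:
        node = child2parent[node]
        result.append(node)
    return result

def get_lca_with_set(node_one, node_two):
    # a set gives fast membership tests; order doesn't matter for this side
    ancestors_two = set(get_ancestors(node_two))
    # the first list must stay ordered so that we find the *lowest* shared ancestor
    for ancestor in get_ancestors(node_one):
        if ancestor in ancestors_two:
            return ancestor

print(get_lca_with_set(3, 4))
print(get_lca_with_set(3, 5))
```

Only one of the two lists can become a set, because the order of the other one is what tells us which shared ancestor comes first.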


To extend this idea to work for any number of input taxa, we have to recognize the following pattern: if the LCA of A and B is X, then the LCA of A, B and C is the LCA of C and X. Draw some trees to convince yourself that this is true!

So, to get the LCA of a list of ids we get the LCA of the first two, then find the LCA of that and the third, then find the LCA of that and the fourth, etc. In a similar way to before, we have to make sure that the current LCA at this iteration becomes one of the two starting nodes for the next iteration. This requires two new bits of code. The `pop()` method removes the last element from a list and returns it. A `while` loop keeps repeating as long as a condition is true. 

In [48]:
def get_lca_list(taxa): 
    
    # start with one taxon
    taxon1 = taxa.pop() 
    
    # keep looking as long as there are still some taxa remaining
    while len(taxa) > 0: 
        
        # remove the last taxon and find the lca of it and the current taxon
        taxon2 = taxa.pop() 
        lca = get_named_lca(taxon1, taxon2) 
        print('LCA of ' + taxon1 + ' and ' + taxon2 + ' is ' + lca) 
        
        # the lca becomes the first taxon for the next iteration
        taxon1 = lca 
        
    # once the loop exits, we are done so return the result
    return taxon1 

In [49]:
print(get_lca_list(['Homo sapiens', 'Pan troglodytes', 'Danio rerio']))

LCA of Danio rerio and Pan troglodytes is bony vertebrates
LCA of bony vertebrates and Homo sapiens is bony vertebrates
bony vertebrates


In [50]:
print(get_lca_list(['Caenorhabditis elegans', 'Drosophila melanogaster', 'Daphnia pulex']))

LCA of Daphnia pulex and Drosophila melanogaster is Pancrustacea
LCA of Pancrustacea and Caenorhabditis elegans is Ecdysozoa
Ecdysozoa
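
One thing to be aware of is that `pop()` changes the list it is called on, so the list passed to `get_lca_list()` is empty afterwards. The sketch below shows an alternative that leaves the input alone by using `functools.reduce`, which applies a two-argument function along a list in exactly the "LCA of the result so far and the next item" pattern. It works on ids in a made-up toy tree, and the function names are invented for this illustration.

```python
from functools import reduce

# hypothetical toy tree: node 1 is the root and is its own parent
child2parent = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1, 6: 3, 7: 3}

def get_ancestors(node):
    result = [node]
    while child2parent[node] != node:
        node = child2parent[node]
        result.append(node)
    return result

def get_lca(node_one, node_two):
    ancestors_two = get_ancestors(node_two)
    for ancestor in get_ancestors(node_one):
        if ancestor in ancestors_two:
            return ancestor

def get_lca_of_all(taxa):
    # reduce() computes get_lca(get_lca(taxa[0], taxa[1]), taxa[2]) and so on
    return reduce(get_lca, taxa)

taxa = [6, 7, 4]
print(get_lca_of_all(taxa))
print(taxa)  # the input list is unchanged
```

Like the `while` version, this needs at least one taxon; `reduce()` raises a `TypeError` for an empty list.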

