<h1 id="toctitle">Dictionaries</h1>
<ul id="toc"/>

## Introducing paired data

Say we want to count all the A's in a DNA sequence:

In [1]:
dna = "ATCGATCGATCGTACGCTGA"
a_count = dna.count("A")
a_count

5

That was pretty straightforward. How about all four bases:

In [2]:
dna = "ATCGTATCGATGTACGCTGA"
a_count = dna.count("A")
t_count = dna.count("T")
g_count = dna.count("G")
c_count = dna.count("C")
print(a_count, t_count, g_count, c_count)

5 6 5 4


Getting repetitive. How about dinucleotides (16 variables):

```python
aa_count = dna.count("AA")
at_count = dna.count("AT")
ag_count = dna.count("AG")
...
```

or trinucleotides (64 variables):

```python
aaa_count = dna.count("AAA")
aat_count = dna.count("AAT")
aag_count = dna.count("AAG")
```

We could use a list to store these counts:

In [3]:
dna = "ATCGTATCGATGTACGCTGA"
dinucleotides = ['AA','AT','AG','AC',
                 'TA','TT','TG','TC',
                 'GA','GT','GG','GC',
                 'CA','CT','CG','CC']
all_counts = []
for dinucleotide in dinucleotides:
    count = dna.count(dinucleotide)
    all_counts.append(count)
print(dinucleotides)
print(all_counts)

['AA', 'AT', 'AG', 'AC', 'TA', 'TT', 'TG', 'TC', 'GA', 'GT', 'GG', 'GC', 'CA', 'CT', 'CG', 'CC']
[0, 3, 0, 1, 2, 0, 2, 2, 2, 2, 0, 1, 0, 1, 3, 0]


But you can see the problem: once they're stored in the list, there's no easy way to look up the count for a given dinucleotide. There's no longer any connection between the dinucleotides and the counts.
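
To see how awkward that lookup is, here is a minimal sketch that finds the count for a single dinucleotide using the two lists. The names `position` and `ta_count` are made up for this illustration.

```python
dna = "ATCGTATCGATGTACGCTGA"
dinucleotides = ['AA','AT','AG','AC',
                 'TA','TT','TG','TC',
                 'GA','GT','GG','GC',
                 'CA','CT','CG','CC']
all_counts = []
for dinucleotide in dinucleotides:
    all_counts.append(dna.count(dinucleotide))

# find where 'TA' sits in the first list...
position = dinucleotides.index('TA')
# ...then use the same position in the second list
ta_count = all_counts[position]
print(ta_count)
```

This only works as long as the two lists stay in exactly the same order, which is easy to break by accident.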

This is an example of paired data, also called key/value data:

| keys | values |
|------|--------|
|dinucleotide|count|
|name|protein sequence|
|codon|amino acid residue|
|sample|coordinates|
|word|definition|

Python's data structure for storing this type of data is a __dict__ (short for dictionary).

## Creating dicts

### Literal dicts

To make a dict:

- start and end with curly brackets
- separate keys and values with colons
- separate each pair (item) with a comma

In [4]:
enzymes = { 
'EcoRI' : 'GAATTC',
'AvaII' : 'GGACC',
'BisI' : 'GCNGC' 
}

# keys have to be unique, but values don't; here the enzyme name is the key,
# so we can't use the sequence to look up the enzyme

We often write dicts on multiple lines. Getting a single value is similar to a list, but instead of giving the numeric index, we give the key for the value we want:

In [5]:
enzymes['BisI']

'GCNGC'

### Building up a dict

We can create an empty dict, and add items to it one at a time:

In [6]:
# create an empty dict
enzymes = {}

# add one key/value pair at a time
enzymes['EcoRI'] = 'GAATTC'
enzymes['AvaII'] = 'GGACC'
enzymes['BisI'] = 'GCNGC'

enzymes

{'EcoRI': 'GAATTC', 'AvaII': 'GGACC', 'BisI': 'GCNGC'}

The thing that goes inside the square brackets is always the key, whether we are setting a value or retrieving a value. 
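
Because keys are unique, assigning to a key that already exists replaces the old value rather than adding a second item. A small sketch (the replacement sequence is only an illustrative value):

```python
enzymes = {}
enzymes['EcoRI'] = 'GAATTC'
enzymes['AvaII'] = 'GGACC'

# assigning to an existing key overwrites its value
enzymes['AvaII'] = 'GGWCC'

print(len(enzymes))       # the dict still holds two items
print(enzymes['AvaII'])
```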

How does this help us with our dinucleotides problem?

## Counting dinucleotides with a dict

Here's how we store the counts in a dict. We start with an empty dict, and add one key/value pair for each dinucleotide:

In [7]:
dna = "AATGATGAACGAC" 
dinucleotides = ['AA','AT','AG','AC', 
                 'TA','TT','TG','TC', 
                 'GA','GT','GG','GC', 
                 'CA','CT','CG','CC'] 


all_counts = {}  # create an empty dict
for dinucleotide in dinucleotides: 
    count = dna.count(dinucleotide) 
    all_counts[dinucleotide] = count  # store the count under this dinucleotide's key
        
print(all_counts) 

{'AA': 2, 'AT': 2, 'AG': 0, 'AC': 2, 'TA': 0, 'TT': 0, 'TG': 2, 'TC': 0, 'GA': 3, 'GT': 0, 'GG': 0, 'GC': 0, 'CA': 0, 'CT': 0, 'CG': 1, 'CC': 0}


Notice how, although it's bigger than our previous examples, the `all_counts` dict has the same key/value structure. 

We can now look up the count (value) for a particular dinucleotide (key) very easily:

In [8]:
all_counts['GA']

3

### Removing zero counts

Problem: many of the counts are zero (and for 3-mers, 4-mers, etc. nearly all the counts will be zero). Solution: just store the counts that are greater than zero:

In [9]:
dna = "AATGATCGATCGTACGCTGA"
all_counts = {}

dinucleotides = ['AA','AT','AG','AC', 
                 'TA','TT','TG','TC', 
                 'GA','GT','GG','GC', 
                 'CA','CT','CG','CC'] 

for dinucleotide in dinucleotides: 
    count = dna.count(dinucleotide)
    if count > 0:
        all_counts[dinucleotide] = count
print(all_counts)

{'AA': 1, 'AT': 3, 'AC': 1, 'TA': 1, 'TG': 2, 'TC': 2, 'GA': 3, 'GT': 1, 'GC': 1, 'CT': 1, 'CG': 3}


Now we are just storing the positive counts. This can lead to trouble when looking up counts for a dinucleotide that doesn't occur in the sequence:

In [10]:
all_counts['AA']

1

In [11]:
all_counts['AG']

KeyError: 'AG'

The `get()` method lets us specify a default for when the key isn't found:


In [12]:
all_counts.get('AG', 0)  # returns the value for 'AG', or the default 0 if the key is missing
# keys that are present work too: their stored value is returned

0
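
Here is a sketch of how `get()` behaves for present and missing keys, using a small hand-written dict rather than the full `all_counts`. The second half is an assumption about how you might use `get()`, not something taken from the cells above: it tallies dinucleotides by reading the current count with a default of zero. Note that this sliding-window tally counts overlapping occurrences, so it can give different numbers from `dna.count()`, which skips overlaps.

```python
all_counts = {'AA': 1, 'AT': 3, 'AC': 1}

print(all_counts.get('AT', 0))   # key present: the default is ignored
print(all_counts.get('AG', 0))   # key missing: the default is returned
print(all_counts.get('AG'))      # no default given: returns None

# tally dinucleotides by sliding along the sequence one base at a time
dna = "AATGATGAACGAC"
tally = {}
for start in range(len(dna) - 1):
    dinucleotide = dna[start:start + 2]
    tally[dinucleotide] = tally.get(dinucleotide, 0) + 1
print(tally)
```

Calling `get()` never adds the missing key to the dict; it only supplies a value for that one lookup.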

## Looping with dicts

The `keys()` method returns a view of all the keys in a dict, which you can loop over or convert to a list:

In [13]:
all_counts.keys()

dict_keys(['AA', 'AT', 'AC', 'TA', 'TG', 'TC', 'GA', 'GT', 'GC', 'CT', 'CG'])

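There are also `values()` and `items()` methods. A notebook cell only displays the result of its last expression, so the output below comes from `items()`: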

In [17]:
all_counts.values()
all_counts.items()

dict_items([('AA', 1), ('AT', 3), ('AC', 1), ('TA', 1), ('TG', 2), ('TC', 2), ('GA', 3), ('GT', 1), ('GC', 1), ('CT', 1), ('CG', 3)])
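
The objects returned by `keys()`, `values()` and `items()` are views rather than real lists, which is why the output starts with `dict_keys` or `dict_items`. A short sketch, using a cut-down dict, of turning them into lists when you need one:

```python
all_counts = {'AA': 1, 'AT': 3, 'AC': 1}

key_list = list(all_counts.keys())
value_list = list(all_counts.values())
print(key_list)
print(value_list)

# sorted() also accepts a view and returns a new list
sorted_keys = sorted(all_counts.keys())
print(sorted_keys)
```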

To loop over all the key/value pairs in a dict, use the `items()` method. Note that we have to pick **two** variable names for the loop:

In [16]:
# which dinucleotides occur exactly twice in the sequence?

for dinucleotide, count in all_counts.items():  # items() gives (key, value) pairs
    if count == 2:
        print(dinucleotide)

TG
TC


### Lookup vs. iteration

Remember, we don't need to write a loop if we just want to get a single value. If we are looking for the count for 'AT' then we __don't__ need to do this:

In [18]:
for dinucleotide, count in all_counts.items():
    if dinucleotide == 'AT':
        print(count)

3


We can just ask for the value directly:

In [19]:
print(all_counts.get('AT'))

3


## In summary

- If you find yourself with two `list`s containing corresponding items, you probably wanted a `dict`
- Declare your dict with the `{key: value, ...}` syntax or...
- ...add values in a loop using the `dict[key] = value` syntax
- Retrieve values with `dict[key]` syntax or `.get()` to have a default value
- Loop over the whole thing using `.items()` method
- But don't use a loop just to get at one value


## A note on dict ordering

In Python 3.5 and older, `dict`s did not remember the order in which items were added, so looping over a dict could give you the keys in an apparently arbitrary order. Python 3.6 (released December 2016) introduced a new dict implementation that is fast *and* memory efficient *and* remembers insertion order, and from Python 3.7 onwards this ordering is a guaranteed part of the language. But if you ever have to run code on a system with a really old Python version then this is a gotcha. 
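
A small sketch of what this means in practice, using a made-up `counts` dict. On current Python versions, looping over a dict visits the keys in the order they were added; if your code needs a predictable order regardless of Python version, sorting the keys is one option.

```python
counts = {}
counts['TG'] = 2
counts['AA'] = 1
counts['CG'] = 3

# on Python 3.7+ this is guaranteed to be insertion order
insertion_order = list(counts)

# sorting gives the same order on any Python version
sorted_order = sorted(counts)

print(insertion_order)
print(sorted_order)
```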

## Exercises

### Scientific and common names

Look at the file called _names.txt_. Each line contains the scientific name and common name for one species, separated by a comma:

```
Channa gachua,red-tailed snakehead
Jacquinia keyensis,joewood
Homo sapiens,man
Stomias affinis,Guenther's boatfish
Podarcis tauricus,Balkan wall lizard
Tylenchulus semipenetrans,citrus nematode
```

Write a program that will read this file and turn it into a dict where the keys are scientific names and the values are common names. You'll have to read the file line-by-line, split each line into two parts, and add one key to the dict for each pair of names. Test your program by looking up the common names of some of your favourite species. How many common names contain the word 'frog'? What are their scientific names?

In [30]:
names = open("names.txt")

names_dict = {}  # create an empty dict

for n in names:
    # looping over an open file gives one line at a time:
    # strip the trailing newline, then split on the comma
    parts = n.rstrip("\n").split(",")
    # parts[0] is the scientific name (key), parts[1] is the common name (value)
    names_dict[parts[0]] = parts[1]
    
#print(names_dict)
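### OR CAN DO: unpack the two names straight into variables

spp_names = {}
for line in open("names.txt"):
    line = line.rstrip("\n")
    scientific_name, common_name = line.split(",")
    spp_names[scientific_name] = common_name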

There's another file called *seq_counts.csv* which has the same format but stores, for each scientific name, the number of sequences available in a made-up sequence database:

```
Harpadon microchir,41
Nicotiana alata,906
Meandrusa payeni,14
Ballota nigra,48
Hymenocallis latifolia,758
```

Write a program that will use your dict to make a similar file, but with common names rather than scientific names.

### DNA translation 

Here's a variable that stores the genetic code (https://en.wikipedia.org/wiki/DNA_codon_table) using a dict:

In [None]:
gencode = {
'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

Write a program (or function) that will take a DNA sequence and translate it into protein using the translation table.

What happens if the DNA sequence contains undetermined bases (e.g. N)?

Can you generate a translation in all three forward frames? All three reverse frames?

### Bonus exercise

__Large File Warning!__

The NCBI taxonomy stores the relationships between all species (and orders, phyla, etc.) in GenBank. Each node in the tree has a unique ID. There are two files which store the taxonomic information. The _child2parent.txt_ file stores one child/parent relationship per line: for example, the line

`12,34`

means that node 12 is the child of node 34. (This works because each node has exactly one parent; it wouldn't work the other way round because each node can have more than one child).
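
To illustrate why the child-to-parent direction fits a dict, here is a sketch with a made-up tree. The IDs are invented, not real NCBI taxonomy IDs, and the sketch assumes the root (node 1 here) never appears as a child; the real file may mark the root differently, so check before relying on this. Each child appears exactly once as a key, so finding its parent is a single lookup, and repeated lookups walk up towards the root.

```python
# hypothetical tree: 12 -> 34 -> 1 and 56 -> 34 -> 1
child2parent = {12: 34, 56: 34, 34: 1}

# walk from node 12 up to the root, collecting every node visited
# (assumes the root is not itself a key, otherwise this would never stop)
node = 12
lineage = [node]
while node in child2parent:
    node = child2parent[node]
    lineage.append(node)

print(lineage)
```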

The _id2name.txt_ file stores the scientific or common name for each node: for example the line:

`9606,Homo sapiens`

means that node 9606 represents the human species.

Write a function that will take two species IDs as arguments and return the ID of the last common ancestor of the two. We just need the _child2parent.txt_ file for this, not the names one. You can find the ID for your favourite species (or genus, order, etc.) by either browsing the NCBI taxonomy website or by searching in the names file. 

### Super bonus exercise

Extend your function from above so that it can find the last common ancestor of any number of input species. Modify it so that you can give the species names as arguments rather than the species IDs.
