# Dictionaries

Imagine you were tasked with writing a program to translate DNA codons into amino acid residues. There are 64 unique codons, which between them code for 20 amino acids or indicate translational start and stop sites. If we want to store which codons code for which amino acids, none of the data types we have encountered so far would let us do that very easily.

For this type of scenario, we use data structures known as dictionaries, which are composed of a series of key and value pairs. Keys must be unique but the values do not need to be. Unlike lists, where you can only access the values stored by referring to a position in the list, with dictionaries you can directly recall the value associated with any dictionary key by specifying the key.

|Keys|Values|
|:-----:|:-----:|
|Must be unique|Don't need to be unique; repeated values are permitted|
|Can be any immutable data type (e.g. strings, numbers or tuples)|Can be of any data type (e.g. strings, numbers, tuples, lists, dictionaries)|
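
The rules in the table can be checked with a short sketch. The dictionary contents here are invented for illustration (the tuple key stands for a hypothetical chromosome name and position). Trying to use a list, which is mutable, as a key raises a `TypeError`.

```python
# Keys of several immutable types can live in the same dictionary
example = {
    'ATG': 'M',                      # string key
    42: 'a number key',              # integer key
    ('chr1', 1500): 'a tuple key',   # tuple key (hypothetical chromosome, position)
}
print(example[('chr1', 1500)])

# A list is mutable, so Python refuses to use it as a key
list_key_allowed = True
try:
    example[['chr1', 1500]] = 'a list key'
except TypeError as error:
    list_key_allowed = False
    print("TypeError:", error)
```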

Dictionaries are written as key:value pairs, with pairs separated by commas and all contained within curly brackets. Below are two example dictionaries: the first stores a translation table of DNA codons (as keys) and amino acids (as values); the second stores protein accession numbers as keys and amino acid sequences as values.

See the example sheet that follows this workbook for a full example of a Python script that translates DNA into a protein sequence.

In [1]:
#This is a dictionary of codons (keys) and the amino acids they code for (values)
translation_dict = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
    'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W'}
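
As a rough idea of how such a table might be used, the sketch below translates a short, made-up DNA sequence by looking up one codon at a time. It uses only a small subset of the table (named `codon_subset` here for illustration) and assumes the sequence length is a multiple of three and contains only codons present in the subset. It is not the full script from the example sheet.

```python
# Small subset of the codon table, for illustration only
codon_subset = {'ATG': 'M', 'GCC': 'A', 'TGG': 'W', 'AAA': 'K', 'TAA': '*'}

dna = 'ATGGCCTGGAAATAA'  # made-up sequence

protein = ''
# step through the sequence three bases at a time
for i in range(0, len(dna) - 2, 3):
    codon = dna[i:i + 3]
    protein += codon_subset[codon]  # look up the amino acid for this codon

print(protein)  # prints MAWK*
```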

In [2]:
#This dictionary stores protein accession references (keys) and sequences (values)
protein_sequences = {
'AE049599': 'PACGWTYQPPR',
'AR182930': 'ECPWALWQMVL',
'AF094030': 'FYTPWEMRTVE',
'DS302903': 'CADGYTPMWQ',
'CL123214': 'TYRPWTNGGW',
'FE492015': 'IWTASFMACISH'
}

### Accessing dictionaries

Dictionary entries can't be accessed by position the way list items can. Instead, a value is recalled by giving its key in square brackets:

In [3]:
protein_sequences['AF094030'] #returns 'FYTPWEMRTVE'

'FYTPWEMRTVE'
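
Asking for a key that is not in the dictionary raises a `KeyError`. The sketch below shows two common ways to guard against this: testing membership with `in`, and using the `.get()` method, which returns `None` (or a default you supply) when the key is missing. The dictionary name `example_sequences` and the accession `'ZZ000000'` are made up for illustration.

```python
example_sequences = {
    'AE049599': 'PACGWTYQPPR',
    'AF094030': 'FYTPWEMRTVE'
}

# check whether a key exists before using it
print('AF094030' in example_sequences)   # True
print('ZZ000000' in example_sequences)   # False

# .get() returns None, or the supplied default, instead of raising an error
missing = example_sequences.get('ZZ000000')
fallback = example_sequences.get('ZZ000000', 'not found')
print(missing, fallback)

# square brackets raise KeyError for a missing key
try:
    example_sequences['ZZ000000']
except KeyError:
    print("KeyError raised")
```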

### Initializing a dictionary

In the first example on this page, we initialized a dictionary by entering key and value pairs surrounded by curly brackets {}. Once a dictionary exists, we can continue to add new entries to it. If the dictionary does not exist yet, we need to create it before we can add individual entries to it. An empty dictionary can be created in two ways: `protein_sequences = {}` or `protein_sequences = dict()`.
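
A quick sketch of the two forms, using made-up variable names. Both produce the same empty dictionary. `dict()` can additionally build a dictionary from a list of (key, value) pairs, which is shown here only as an aside.

```python
empty_a = {}
empty_b = dict()
print(empty_a == empty_b)  # True: both are empty dictionaries
print(len(empty_a))        # 0

# dict() can also build a dictionary from (key, value) pairs
pairs = [('AE049599', 'PACGWTYQPPR'), ('AF094030', 'FYTPWEMRTVE')]
from_pairs = dict(pairs)
print(from_pairs['AF094030'])  # FYTPWEMRTVE
```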

### Adding and deleting entries

In [1]:
# create a new, empty dictionary
protein_sequences = dict() 

# add the entries to the dictionary.
protein_sequences['XX046245'] = 'PYTWGASCAR' 
protein_sequences['AE049599'] = 'PACGWTYQPPR'
protein_sequences['AF094030'] = 'FYTPWEMRTVE'

# delete the 'AE049599': 'PACGWTYQPPR' key:value pair from the dictionary
del protein_sequences['AE049599'] 

# delete 'AF094030': 'FYTPWEMRTVE' key:value pair from the dictionary and return the value
sequence = protein_sequences.pop('AF094030')
print(sequence) #prints FYTPWEMRTVE

FYTPWEMRTVE


If you try to set a value for a key that already exists in the dictionary, the value will simply be overwritten with the new one. For example, in the code below, we first set protein_sequences['XX046245'] to 'PYTWGASCAR', then change the value to 'RACSAGAGWTYP', overwriting the existing entry. We demonstrate this by printing the value associated with the key 'XX046245'.


In [4]:
protein_sequences['XX046245'] = 'PYTWGASCAR' # adds the entry to the dictionary.

protein_sequences['XX046245'] = 'RACSAGAGWTYP' # changes value to RACSAGAGWTYP.

print(protein_sequences['XX046245'])

RACSAGAGWTYP


A dictionary can also represent a single database entry, as in the example below, where each key names a field of the entry.

In [5]:
dna_entry = {
'Genbank_accession': 'AE005672',
'EntryType': 'Genome',
'Organism': 'Streptococcus Pneumoniae TIGR4',
'BasePairs' : 2160842
}

More complicated data structures can be created by combining lists and dictionaries. For example, a genome database might be a list of dictionaries like the entry above. The list below contains three items, each of which is a dictionary. To access a value, we first select the correct item in the list by specifying its index, then specify the key we want.

In [6]:
genome_db = [
{'Genbank_accession': 'AE005672',
'EntryType': 'Genome',
'Organism': 'Streptococcus Pneumoniae TIGR4',
'BasePairs' : 2160842},
{'Genbank_accession': 'AE004871',
'EntryType': 'Genome',
'Organism': 'Streptococcus Pneumoniae D39',
'BasePairs' : 2060423},
{'Genbank_accession': 'AE019201',
'EntryType': 'Genome',
'Organism': 'Streptococcus Pneumoniae NP5',
'BasePairs' : 2151728}
]
TIGR4_basepairs = genome_db[0]['BasePairs']
print("Size of TIGR4 genome is {} bp".format(TIGR4_basepairs))
NP5_accession = genome_db[2]['Genbank_accession']
print("Accession number for NP5 is {}".format(NP5_accession))


Size of TIGR4 genome is 2160842 bp
Accession number for NP5 is AE019201
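
Rather than picking items by index, we can also loop over a list of dictionaries. The sketch below uses a cut-down copy of the database (only two fields per entry) and finds the entry with the largest genome. The variable names `small_db` and `largest` are illustrative.

```python
# cut-down copy of the genome database, with two fields per entry
small_db = [
    {'Genbank_accession': 'AE005672', 'BasePairs': 2160842},
    {'Genbank_accession': 'AE004871', 'BasePairs': 2060423},
    {'Genbank_accession': 'AE019201', 'BasePairs': 2151728}
]

# each item in the list is a dictionary, so we can use keys inside the loop
for entry in small_db:
    print("{}: {} bp".format(entry['Genbank_accession'], entry['BasePairs']))

# keep track of the entry with the most base pairs seen so far
largest = small_db[0]
for entry in small_db:
    if entry['BasePairs'] > largest['BasePairs']:
        largest = entry

print("Largest genome: {}".format(largest['Genbank_accession']))
```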


### Looping through a dictionary

For loops can also be used to iterate through dictionaries. To loop through the keys of a dictionary, you can use the following:

In [3]:
protein_sequences = {
'AE049599': 'PACGWTYQPPR',
'AR182930': 'ECPWALWQMVL',
'AF094030': 'FYTPWEMRTVE',
'DS302903': 'CADGYTPMWQ'
}
for key in protein_sequences:
    print("Key: {0} Value: {1}".format(key, protein_sequences[key])) #will print value in dict corresponding to key

Key: AE049599 Value: PACGWTYQPPR
Key: AR182930 Value: ECPWALWQMVL
Key: AF094030 Value: FYTPWEMRTVE
Key: DS302903 Value: CADGYTPMWQ


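Note: In Python 3.7 and later, dictionaries remember the order in which entries were added, and loops iterate through the keys in that order. In older versions of Python the order was not guaranteed, so the order of the output you see may differ from the order shown in this workbook.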

*Additional Information* If the dictionary keys are sortable, it is possible to loop through them in sorted order using the built-in `sorted()` function.

In [4]:
for key in sorted(protein_sequences):
    print("Key: {0} Value: {1}".format(key, protein_sequences[key])) #will print value in dict corresponding to key

Key: AE049599 Value: PACGWTYQPPR
Key: AF094030 Value: FYTPWEMRTVE
Key: AR182930 Value: ECPWALWQMVL
Key: DS302903 Value: CADGYTPMWQ


It is also possible to iterate through both key and value at the same time using the `.items()` method. 

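*Additional information* The `.items()` method returns the dictionary's entries as (key, value) tuples (a tuple is an immutable sequence), so as we loop through them we can simultaneously assign the key and value to two variables. In Python 3 the result is a view object rather than a list; it can be converted with `list()` if a list is needed.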

In [2]:
protein_sequences = {
'AE049599': 'PACGWTYQPPR',
'AR182930': 'ECPWALWQMVL',
'AF094030': 'FYTPWEMRTVE',
'DS302903': 'CADGYTPMWQ'
}
for acc_number, sequence in protein_sequences.items():
    print(acc_number, sequence) 

DS302903 CADGYTPMWQ
AE049599 PACGWTYQPPR
AF094030 FYTPWEMRTVE
AR182930 ECPWALWQMVL
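
To see what `.items()` hands to the loop, the sketch below converts its result to a list and unpacks the first pair by hand. The names `example_sequences` and `pairs` are made up for illustration, and `sorted()` is used so that the result does not depend on the Python version.

```python
example_sequences = {
    'AE049599': 'PACGWTYQPPR',
    'AR182930': 'ECPWALWQMVL'
}

# convert the (key, value) pairs to a sorted list so we can inspect them
pairs = sorted(example_sequences.items())
print(pairs)

# each pair is a tuple, which can be unpacked into two variables
first_key, first_value = pairs[0]
print(first_key, first_value)
```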


### Example: Using dictionaries to keep counts

We could also use a `for` loop to count how many times each item appears in a list. For example, if we have a list of species sighted in a survey, we could use a dictionary to summarise the list.

In [1]:
species_sighted = ["Gorilla gorilla", "Pongo pygmaeus", "Mus musculus", "Fringilla coelebs",
                   "Chloris chloris", "Carduelis carduelis", "Serinus canaria", "Serinus canaria",
                   "Eresus sandaliatus", "Fringilla coelebs", "Pongo pygmaeus", "Fringilla coelebs",
                   "Carduelis carduelis", "Serinus canaria"]

species_dict = {} #make an empty dictionary to store species sightings
for species in species_sighted:
    
    #if species not already in the dictionary, add with value = 1
    if species not in species_dict:
        species_dict[species] = 1
        
    #if already in the dictionary, add 1 to the value
    else:
        species_dict[species] += 1

#now loop through dictionary and print out each key and value
for key in species_dict:
    print("{} sightings of {}".format(species_dict[key],key))

3 sightings of Serinus canaria
3 sightings of Fringilla coelebs
2 sightings of Pongo pygmaeus
1 sightings of Chloris chloris
1 sightings of Eresus sandaliatus
1 sightings of Gorilla gorilla
1 sightings of Mus musculus
2 sightings of Carduelis carduelis
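
The same counting pattern can be written more compactly. The sketch below shows two alternatives on a shortened, made-up sightings list: `.get()` with a default of 0, which removes the need for the `if`/`else`, and `Counter` from the standard library's `collections` module, which does the counting for us. Neither is required; the explicit loop above works just as well.

```python
from collections import Counter

sightings = ["Serinus canaria", "Pongo pygmaeus", "Serinus canaria",
             "Mus musculus", "Pongo pygmaeus", "Serinus canaria"]

# .get(species, 0) returns 0 the first time a species is seen
counts = {}
for species in sightings:
    counts[species] = counts.get(species, 0) + 1

# Counter builds the same tally in one step
counter = Counter(sightings)

print(counts["Serinus canaria"])   # 3
print(counts == dict(counter))     # True
```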


# Exercises

* Make a dictionary containing the data in the following table.

|Scientific Name|Common Name|
|:-------|:-------|
|*Fistulina hepatica*|Beefsteak|
|*Laetiporus sulphureus*|Chicken-of-the-Woods|
|*Flammulina velutipes*|Enoki|
|*Hericium erinaceus*|Lion’s Mane|
|*Pholiota nameko*|Nameko|
|*Pleurotus ostreatus*|Pearl oyster|
|*Ganoderma lucidum*|Reishi|
|*Lentinus edodes*|Shiitake|

  * Add the following to your dictionary: <br>
*Pleurotus pulmonarius*: Summer oyster <br>
*Pleurotus euosmos*: Tarragon <br>
  * Print the common names for *Pholiota nameko* and *Fistulina hepatica*
  * Replace the common name for *Laetiporus sulphureus* with Sulphur Shelf (this is another common name for this species which forms striking golden-yellow shelf-like fungal structures on tree trunks and branches.)
  * Write a loop that loops through the dictionary and prints out all entries (Scientific Name and Common Name)