# An introduction to solving biological problems with Python

## Session 1.4: Collections (sets and dictionaries)

- [Sets](#Sets) | [Exercise 1.4.1](#Exercise-1.4.1)
- [Dictionaries](#Dictionaries) | [Exercises 1.4.2](#Exercises-1.4.2)

## Sets

- Sets contain unique elements, i.e. no repeats are allowed
- The elements in a set do not have an order
- Sets cannot contain elements which can be internally modified (e.g. lists and dictionaries)

In [1]:
l = [1, 2, 3, 2, 3] # list of 5 values
s = set(l) # set of 3 unique values
print(s)
e = set() # empty set
print(e)

{1, 2, 3}
set()
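The restriction on modifiable elements listed above can be checked directly. The following is a minimal sketch. The names `mixed` and `error_message` are only illustrative, and it uses `try`/`except` to catch the error so that the cell keeps running.

```python
# Immutable values such as numbers, strings and tuples can be set elements.
mixed = {1, "ATG", (2, 3)}
print(len(mixed))

# A list can be modified internally, so adding one raises a TypeError.
try:
    mixed.add([4, 5])
    error_message = None
except TypeError as err:
    error_message = str(err)
print("Could not add a list:", error_message)
```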


Sets are similar to lists and tuples, and you can use many of the same operators and functions with them. However, they are **inherently unordered**, so they have no index, and they can only contain _unique_ values, so adding a value that is already in the set has no effect.

In [3]:
s = set([1, 2, 3, 2, 3])
print(s)
print("number in set:", len(s))
s.add(4)
print(s)
s.add(3)
print(s) # no visible change: 3 is already in the set

{1, 2, 3}
number in set: 3
{1, 2, 3, 4}
{1, 2, 3, 4}


You can remove specific elements from the set.

In [4]:
s = set([1, 2, 3, 2, 3])
print(s)
s.remove(3)
print(s)

{1, 2, 3}
{1, 2}
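Note that `remove` raises a `KeyError` if the value is not in the set. The sketch below contrasts it with the `discard` method, which silently does nothing in that case. The variable `removed` is only there to record what happened.

```python
s = set([1, 2, 3])

# discard() removes the value if present and does nothing otherwise.
s.discard(3)
s.discard(99)  # 99 is not in the set: no error
print(s)

# remove() raises a KeyError when the value is absent.
try:
    s.remove(99)
    removed = True
except KeyError:
    removed = False
print("remove(99) succeeded:", removed)
```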


You can do all the expected logical operations on sets, such as taking the union or intersection of two sets with the <tt>|</tt> (_or_) and <tt>&</tt> (_and_) operators.

In [5]:
s1 = set([2, 4, 6, 8, 10])
s2 = set([4, 5, 6, 7])

print("Union:", s1 | s2)
print("Intersection:", s1 & s2)

Union: {2, 4, 5, 6, 7, 8, 10}
Intersection: {4, 6}
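Two further set operators are often handy: `-` gives the difference (elements in the first set but not the second) and `^` gives the symmetric difference (elements in exactly one of the two sets). A short sketch using the same two sets, with illustrative variable names:

```python
s1 = set([2, 4, 6, 8, 10])
s2 = set([4, 5, 6, 7])

only_in_s1 = s1 - s2   # difference: in s1 but not in s2
in_one_only = s1 ^ s2  # symmetric difference: in s1 or s2, but not both
print("Difference:", only_in_s1)
print("Symmetric difference:", in_one_only)
```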


## Exercise 1.4.1

1. Given the protein sequence "MPISEPTFFEIF", split the sequence into its component amino acid codes and use a set to establish the unique amino acids in the protein and print out the result.

In [14]:
prot="MPISEPTFFEIF"
aa=list(prot) # splits the string into a list of single-character strings
aa

['M', 'P', 'I', 'S', 'E', 'P', 'T', 'F', 'F', 'E', 'I', 'F']

In [16]:
unique_aa=set(aa)
unique_aa

{'E', 'F', 'I', 'M', 'P', 'S', 'T'}
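As a possible shortcut, the intermediate list is not strictly needed, because `set()` accepts any sequence, including a string. The sketch below also uses `sorted()` to obtain the unique residues as an alphabetically ordered list. The name `sorted_aa` is just an illustration.

```python
prot = "MPISEPTFFEIF"
unique_aa = set(prot)          # iterating over a string yields its characters
sorted_aa = sorted(unique_aa)  # sorted() returns a new list in alphabetical order
print(sorted_aa)
print("Number of unique amino acids:", len(unique_aa))
```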

## Dictionaries

Lists are useful in many contexts, but often we have some data that has no inherent order and that we want to access by some useful name rather than an index. For example, as a result of some experiment we may have a set of genes and corresponding expression values. We could put the expression values in a list, but then we'd have to remember which index in the list corresponded to which gene and this would quickly get complicated.

For these situations a _dictionary_ is a very useful data structure.

Dictionaries:

- Contain a mapping of keys to values (like a word and its corresponding definition in a dictionary)
- The keys of a dictionary are unique, i.e. they cannot repeat
- The values of a dictionary can be of any data type
- The keys of a dictionary cannot be an internally modifiable type (e.g. lists, but you can use tuples)
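- Dictionaries are accessed by key rather than by position (since Python 3.7 they keep entries in the order they were inserted, but you should not rely on a numeric index)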

In [17]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
print(dna)

{'A': 'Adenine', 'C': 'Cytosine', 'G': 'Guanine', 'T': 'Thymine'}
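To illustrate the rule about key types, here is a small sketch with made-up data: the `variants` dictionary and its (chromosome, position) keys are hypothetical and only serve as an example of tuple keys.

```python
# Hypothetical data: (chromosome, position) tuples mapped to an observed base.
variants = {("chr1", 1042): "A", ("chr2", 77): "G"}
print(variants[("chr1", 1042)])

# A list cannot be used as a key because it can be modified internally.
try:
    variants[["chr3", 5]] = "T"
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
print("List accepted as key:", list_key_allowed)
```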


You can access values in a dictionary using the key inside square brackets

In [18]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
print("A represents", dna["A"])
print("G represents", dna["G"])

A represents Adenine
G represents Guanine


An error is triggered if a key is absent from the dictionary:

In [19]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
print("What about N?", dna["N"])

KeyError: 'N'

You can access values safely with the <tt>get</tt> method, which gives back <tt>None</tt> if the key is absent. You can also supply a default value to be returned instead.

In [20]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
print("What about N?", dna.get("N"))
print("With a default value:", dna.get("N", "unknown"))

What about N? None
With a default value: unknown


You can check if a key is in a dictionary with the <tt>in</tt> operator, and you can negate this with <tt>not</tt>.

In [21]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
"T" in dna

True

In [22]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
"Y" not in dna

True

In [23]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
"Adenine" in dna

False

Note that here Python doesn't find "Adenine" because <tt>in</tt> checks the keys of the dictionary, not its values.
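If you do want to test whether something is stored as a value, one option is to apply `in` to the result of the `.values()` method. A minimal sketch, with illustrative variable names:

```python
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}

name_is_key = "Adenine" in dna             # looks at the keys only
name_is_value = "Adenine" in dna.values()  # looks at the values
print("Is a key:", name_is_key)
print("Is a value:", name_is_value)
```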

The <tt>len()</tt> function gives back the number of (key, value) *pairs* in the dictionary:

In [24]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
print(len(dna))

4


You can introduce new entries in the dictionary by assigning a value with a new key:

In [25]:
dna = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
dna['Y'] = 'Pyrimidine'
print(dna)

{'A': 'Adenine', 'C': 'Cytosine', 'G': 'Guanine', 'T': 'Thymine', 'Y': 'Pyrimidine'}


You can change the value for an existing key by reassigning it:

In [26]:
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}
dna['Y'] = 'Cytosine or Thymine'
print(dna)

{'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Cytosine or Thymine'}


You can delete entries from the dictionary:

In [27]:
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}
del dna['Y']
print(dna)

{'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine'}
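Like looking up a missing key, deleting a missing key with `del` raises a `KeyError`. As a sketch of an alternative, the `pop` method removes an entry and gives back its value, and accepts a default to return when the key is absent. The variable names below are illustrative.

```python
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}

removed_value = dna.pop('Y')        # removes 'Y' and returns its value
missing_value = dna.pop('N', None)  # 'N' is absent: the default is returned
print(removed_value, missing_value)

# del on an absent key raises a KeyError.
try:
    del dna['N']
    deleted = True
except KeyError:
    deleted = False
print("del dna['N'] succeeded:", deleted)
```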


You can get a list of all the keys (in insertion order since Python 3.7) using the built-in <tt>.keys()</tt> method:

In [28]:
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}
print(list(dna.keys()))

['A', 'C', 'T', 'G', 'Y']


Similarly, you can get a list of the values:

In [29]:
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}
print(list(dna.values()))

['Adenine', 'Cytosine', 'Thymine', 'Guanine', 'Pyrimidine']


And a list of tuples containing (key, value) pairs:

In [31]:
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}
print(list(dna.items()))

pairs=list(dna.items())
pairs[0]

[('A', 'Adenine'), ('C', 'Cytosine'), ('T', 'Thymine'), ('G', 'Guanine'), ('Y', 'Pyrimidine')]


('A', 'Adenine')
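The (key, value) pairs are most useful when you want to go through a whole dictionary. The sketch below assumes a `for` loop, which unpacks each tuple into two names. `symbol`, `name` and `lines` are illustrative names only.

```python
dna = {'A': 'Adenine', 'C': 'Cytosine', 'T': 'Thymine', 'G': 'Guanine', 'Y': 'Pyrimidine'}

lines = []
for symbol, name in dna.items():  # each item is a (key, value) tuple
    lines.append(symbol + " = " + name)
    print(symbol, "=", name)
```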

## Exercises 1.4.2

1. Print out the names of the amino acids that would be produced by the DNA sequence "GTT GCA CCA CAA CCG" ([See the DNA codon table](https://en.wikipedia.org/wiki/DNA_codon_table)). Split this string into the individual codons and then use a dictionary to map between codon sequences and the amino acids they encode.
2. Print each codon and its corresponding amino acid.
3. Why couldn't we build a dictionary where the keys are names of amino acids and the values are the DNA codons?

### Advanced exercise 1.4.3

- Starting with an empty dictionary, count the abundance of different residue types present in the 1-letter lysozyme protein sequence (http://www.uniprot.org/uniprot/B2R4C5.fasta) and print the results to the screen in alphabetical key order.

In [51]:
#1
codon_string = "GTT GCA CCA CAA CCG"
# codon -> three-letter amino acid code
genetic_code = {"GTT": "Val", "GCA": "Ala", "CCA": "Pro", "CAA": "Gln", "CCG": "Pro"}

codon_list = codon_string.split(" ")
print(codon_list)

['GTT', 'GCA', 'CCA', 'CAA', 'CCG']


In [53]:
#2
print(codon_list[0], "codes for", genetic_code[codon_list[0]])
print(codon_list[1], "codes for", genetic_code[codon_list[1]])
print(codon_list[2], "codes for", genetic_code[codon_list[2]])
print(codon_list[3], "codes for", genetic_code[codon_list[3]])
print(codon_list[4], "codes for", genetic_code[codon_list[4]])

GTT codes for Val
GCA codes for Ala
CCA codes for Pro
CAA codes for Gln
CCG codes for Pro
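For question 3, the difficulty is that dictionary keys must be unique, while several codons can encode the same amino acid (here both CCA and CCG encode proline). The sketch below shows what happens, and one possible workaround in which each value is a list of codons. The dictionary names are hypothetical.

```python
# Keys must be unique, so a second codon for the same amino acid
# replaces the first one.
aa_to_codon = {}
aa_to_codon["Pro"] = "CCA"
aa_to_codon["Pro"] = "CCG"
print(aa_to_codon)

# One possible workaround: map each amino acid to a list of codons.
aa_to_codons = {"Pro": ["CCA", "CCG"], "Val": ["GTT"]}
print(aa_to_codons["Pro"])
```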


In [55]:
#Advanced Exercise:
seq="MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANWMCLAKWESGYNTRATNYNAGDRSTDYGIFQINSRYWCNDGKTPGAVNACHLSCSALLQDNIADAVACAKRVVRDPQGIRAWVAWRNRCQNRDVRQYVQGCGV"

aa_counts={}

aa_counts["A"]=seq.count("A")
aa_counts["C"]=seq.count("C")
#etc...
print('A has', aa_counts['A'],'occurrence(s)')
print('C has', aa_counts['C'],'occurrence(s)')

A has 15 occurrence(s)
C has 8 occurrence(s)
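One way to complete the count for every residue without writing a line per amino acid is sketched below. It assumes a `for` loop over the sequence, uses `get` with a default of 0 to handle residues not yet seen, and uses `sorted()` to print in alphabetical key order. It is only one of several possible solutions.

```python
seq = "MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANWMCLAKWESGYNTRATNYNAGDRSTDYGIFQINSRYWCNDGKTPGAVNACHLSCSALLQDNIADAVACAKRVVRDPQGIRAWVAWRNRCQNRDVRQYVQGCGV"

aa_counts = {}  # start with an empty dictionary
for residue in seq:
    # get() returns 0 the first time a residue is seen
    aa_counts[residue] = aa_counts.get(residue, 0) + 1

# sorted() on a dictionary gives its keys in alphabetical order
for residue in sorted(aa_counts):
    print(residue, "has", aa_counts[residue], "occurrence(s)")
```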


## Congratulations! You reached the end of day 1!

Go to our next notebook: [python_basic_2_intro](python_basic_2_intro.ipynb)