# Collections

In the previous notebook we had a look at variables. Often, however, we need a convenient way to represent a collection of data (think of a series of proteins, all the codons that code for an amino acid, etc.). Python offers several very handy built-in types to deal with this. In the following we will learn about *lists*, *tuples* and *dictionaries*.

In [None]:
# run this cell before you proceed, to enable video playback in the notebook
import sys
if "pyodide" in sys.modules:
    import micropip
    await micropip.install("ipywidgets")

# Lists

Lists are the workhorse data type in Python, so it's important to understand them well. A list is essentially an ordered collection of items:

In [None]:
firstlist=[1,2,3,4,5]
print(firstlist)

Since each value in Python "knows" its type, there's no danger in putting together different types of values (however, in general this is better avoided):

In [None]:
another=["Me","You","Them"]
mixandmatch=["One",2,"Three",4.0]
emptylist=[]
print(another, mixandmatch, emptylist)

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import YouTubeVideo; from ipywidgets import interactive, IntSlider
def _play(resize): display(YouTubeVideo('IJXz53jT4iY',width=resize, height=520*resize//1100, rel=0, loop=1))
interactive(_play, resize=IntSlider(min=300, max=1100, step=50, value=600, continuous_update=False, readout=False))

**NOTE ON EXAMPLES:** Toy examples on strings and lists get boring quickly. For this reason, I am sourcing some of the examples from Bioinformatics, aka the Empire of Strings. These examples reflect meaningful real-world processing and data, however basic. You don't actually need to know any Biology, and if the terminology bothers you, replace any words you don't like with *foo*, *bar*, *spam* and *ham* - for our purposes, they are just strings.

### Indexing

You can access individual elements of the list by *indexing*. This is done with the ```[]``` operator:

In [None]:
# the Alanine molecule has 4 possible representations in the genome - all of them 3 letter strings
ala=['GCU', 'GCC', 'GCA', 'GCG'] # codons for Alanine
print("Length ", len(ala))
print(ala[0]) # the first element
print(ala[3]) # the last element

Incidentally, the [] operator also works for strings:

In [None]:
'GCU'[1]

so that this is valid code:

In [None]:
ala=['GCU', 'GCC', 'GCA', 'GCG']
print(ala[0][2]) # third character of the first element of the list

A few other handy tricks:

In [None]:
mylist=['a','b','c','d']
print(mylist[-1])  # last element
print(mylist[-2])  # second last
print(mylist[1:3]) # from element 1 included to 3 excluded. This is called "slicing"

You can, of course, use indexing to modify lists:

In [None]:
mylist=['a','b','c','d','e','f']
mylist[-1]='z'
mylist[0]=mylist[1]
mylist[2:4]=['Wah','Wah']

print(mylist)

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import YouTubeVideo; from ipywidgets import interactive, IntSlider
def _play(resize): display(YouTubeVideo('1Y9r-8wBbxM',width=resize, height=450*resize//1100, rel=0, loop=1))
interactive(_play, resize=IntSlider(min=300, max=1100, step=50, value=600, continuous_update=False, readout=False))

For more examples of indexing and slicing (including skipping elements), see this [tutorial](https://railsware.com/blog/python-for-machine-learning-indexing-and-slicing-for-lists-tuples-strings-and-other-sequential-types/) online.
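
As a quick taste of the "skipping" syntax mentioned above, here is a minimal sketch: a slice can take a third number, the step. The list name `codons` and its contents are just an illustration, chosen to match the earlier examples.

```python
# illustrative list: four Alanine codons followed by two Leucine codons
codons = ['GCU', 'GCC', 'GCA', 'GCG', 'UUA', 'UUG']

print(codons[::2])    # every second element, starting from index 0
print(codons[1::2])   # every second element, starting from index 1
print(codons[:4])     # an omitted start means "from the beginning"
print(codons[4:])     # an omitted stop means "up to the end"
print(codons[::-1])   # a negative step walks the list backwards
```

Slicing never modifies the original list: each of the expressions above builds a new list.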

### Operations on lists

The ```dir``` command gives you a handy way to list operations defined for a type (really methods defined for an object). Disregard the entries beginning and ending with ```__``` that have to do with the internal representation of the object. Let's try this on a list:

In [None]:
dir([])

```append``` attaches an element to the "tail" of the list, and ```pop``` removes the last element and returns it. Together they can be used to implement a LIFO (last in first out) queue, also known as a stack.

In [None]:
queue=['Last', 'In', 'First']
queue.append('Out')
print(queue)
print(queue.pop())
print(queue)

To concatenate two lists, use ```extend``` (this modifies the first list in place):

In [None]:
queue.extend(['Second', 'Third'])
print(queue)
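
It is easy to mix up ```append``` and ```extend```, so here is a small sketch of the difference. The variable names (`nested`, `flat`, `combined`) are made up for this example.

```python
nested = ['Last', 'In']
nested.append(['First', 'Out'])   # append adds ONE element: here, a whole list
print(nested)                     # the second list ends up nested inside the first

flat = ['Last', 'In']
flat.extend(['First', 'Out'])     # extend adds the elements one by one
print(flat)

combined = ['Last', 'In'] + ['First', 'Out']   # + builds a new list, leaving both operands alone
print(combined)
```

As a rule of thumb: ```append``` for a single item, ```extend``` to grow an existing list, ```+``` when you want a new list.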

In [None]:
queue.remove('In') # just guess
print(queue)

In [None]:
queue.reverse()
print(queue)

In [None]:
queue.sort()
print(queue)

In [None]:
queue.clear() # remove all items
print(queue)

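Note that ```sort``` arranges strings in alphabetical order, with all uppercase letters coming before lowercase ones (characters are compared by their character code). More details on sorting are available [here](https://wiki.python.org/moin/HowTo/Sorting).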
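
The following sketch shows how uppercase and lowercase letters affect the order, and how the built-in ```sorted``` function differs from the ```sort``` method. The list `words` is just an illustration.

```python
words = ['banana', 'Cherry', 'apple', 'Date']

by_code = sorted(words)                  # new list; uppercase letters come first
ignoring_case = sorted(words, key=str.lower)   # compare lowercase versions instead
backwards = sorted(words, reverse=True)  # same ordering rule, reversed

print(by_code)
print(ignoring_case)
print(backwards)
print(words)            # sorted() leaves the original list untouched

result = words.sort()   # sort() works in place and returns None
print(result, words)
```

A common slip is writing `words = words.sort()`, which replaces the list with `None`.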

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import YouTubeVideo; from ipywidgets import interactive, IntSlider
def _play(resize): display(YouTubeVideo('BShjj8SBHrc',width=resize, height=420*resize//1100, rel=0, loop=1))
interactive(_play, resize=IntSlider(min=300, max=1100, step=50, value=600, continuous_update=False, readout=False))

### List comprehensions

There is a handy way of defining lists in Python starting from other lists. This is similar to what is done with sets in mathematics. Consider the following set: $A=\{1,2,3,4,5\}$. You can define $B=\{3x | x\in A\}$ (read 3 times x for x in A), which explicitly means  $B=\{3,6,9,12,15\}$. The same is possible with Python lists:

In [None]:
A=range(1,6)
print(list(A))
B=[3*x for x in A]
print(B)

We can also use conditionals in comprehensions to pick elements that satisfy a particular property (we will cover conditionals in detail later on):

In [None]:
even=[x for x in A if x%2==0]
odd=[x for x in A if x not in even] #  "not in" is a Python operator in its own right
# also: odd=[x for x in A if not (x in even)]
print(even)
print(odd)

This can be used to operate on all elements of a list:

In [None]:
# the Leucine molecule has 6 possible representations in RNA - and the same in DNA, if we exchange the Us with Ts
leu=['UUA','UUG','CUU','CUC','CUA','CUG']
leu_DNA=[x.replace('U','T') for x in leu]
print(leu_DNA)
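
The two ideas above (transforming elements and filtering them with a condition) can be combined in a single comprehension. This is only a sketch; the names `leucine`, `c_codons_dna` and `third_letters` are made up for the example.

```python
leucine = ['UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG']

# keep only the codons that start with C, and convert those to DNA
c_codons_dna = [codon.replace('U', 'T') for codon in leucine if codon[0] == 'C']
print(c_codons_dna)

# the expression on the left can be anything: here, the third letter of each codon
third_letters = [codon[2] for codon in leucine]
print(third_letters)
```

The condition after ```if``` decides which elements are kept; the expression before ```for``` decides what is stored for each of them.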

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import YouTubeVideo; from ipywidgets import interactive, IntSlider
def _play(resize): display(YouTubeVideo('8cDSOL7-Nrw',width=resize, height=510*resize//1100, rel=0, loop=1))
interactive(_play, resize=IntSlider(min=300, max=1100, step=50, value=600, continuous_update=False, readout=False))

### A common pitfall

Be warned that the name of a list is just a reference to the memory area where the list elements are stored. Thus assigning a list to another name does not copy it: it only creates another way of accessing the same list, containing the exact same objects, not a different list or different objects. This makes assignment very efficient, but it can lead to some surprises:

In [None]:
a=['We', 'Are', 'All', 'Unique']
b=a
print(b)
b.insert(2,'Not')
print(b)
print(a) # Oops! a has been modified too

If you do want a separate copy, you can use slicing to select all the elements of list *a* and assign the resulting list to *b*:

In [None]:
a=['This', 'Is', 'All']
b=a[:] # this makes all the difference
b[1]="Isn't" # have to use double quotes since string contains a single quote
print(a)
print(b)
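
Here is a sketch of how to check whether two names refer to the same list, and of a further catch: a slice copies the list, but not the objects inside it. All names are invented for the example; ```copy.deepcopy``` comes from the standard library.

```python
import copy

original = ['This', 'Is', 'All']
alias = original          # same list, second name
sliced = original[:]      # new list with the same elements
copied = original.copy()  # equivalent to the slice above

print(alias is original)   # "is" tests whether two names refer to the same object
print(sliced is original)
print(sliced == original)  # "==" compares the contents

# the copy is shallow: inner lists are still shared
table = [['GCU', 'GCC'], ['UUA', 'UUG']]
shallow = table[:]
shallow[0].append('GCA')   # this also shows up in table[0]
print(table)

deep = copy.deepcopy(table)  # copies the inner lists as well
deep[0].append('GCG')        # table is not affected this time
print(table)
```

For flat lists of strings or numbers, the slice is all you need; the shallow/deep distinction only matters when the list contains other mutable objects.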

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import YouTubeVideo; from ipywidgets import interactive, IntSlider
def _play(resize): display(YouTubeVideo('5IH1Bw0ohqE',width=resize, height=490*resize//1100, rel=0, loop=1))
interactive(_play, resize=IntSlider(min=300, max=1100, step=50, value=600, continuous_update=False, readout=False))

# Tuples

Tuples are essentially immutable lists. They are somewhat more efficient than lists and, unlike lists, they can be used as keys for dictionaries (see below); with the obvious modifications, they can be used like lists.

In [None]:
# DNA chains are made up of 4 molecular "links" known as A, T, C, G
nucleotides=('A','T','C','G') # we are unlikely to change this
print(nucleotides[2])
print(nucleotides.index('G'))
nucleotides[0]='Spam' # hmmm... this raises a TypeError: tuples cannot be modified

Notice that the methods that change the elements are missing:

In [None]:
dir(()) # () obviously is... the empty tuple!

This is slightly tricky:

In [None]:
a=(1,2,3)
b=a # b is the same tuple as a (same as with lists)
a=a+(4,5) # creates a new tuple, does not change the original one
print(a) # a is an entirely new tuple!
print(b) # b is still the same

In this example we are not changing the tuple: we are actually creating the new tuple (1,2,3,4,5) and assigning it to *a* instead of the old one.
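
To make the difference concrete, the sketch below uses the ```is``` operator (which tests whether two names refer to the same object) and contrasts the tuple case with a list modified in place. The names `t1`, `t2`, `x` and `y` are made up for the example.

```python
t1 = (1, 2, 3)
t2 = t1
print(t1 is t2)     # both names refer to the same tuple

t1 = t1 + (4, 5)    # builds a new tuple and rebinds the name t1
print(t1 is t2)     # the two names now refer to different tuples
print(t2)           # the old tuple is untouched

# a list, by contrast, can be changed in place
x = [1, 2, 3]
y = x
x.extend([4, 5])    # modifies the one list that both names refer to
print(x is y)
print(y)
```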

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import YouTubeVideo; from ipywidgets import interactive, IntSlider
def _play(resize): display(YouTubeVideo('DoflPsdLWD8',width=resize, height=490*resize//1100, rel=0, loop=1))
interactive(_play, resize=IntSlider(min=300, max=1100, step=50, value=600, continuous_update=False, readout=False))

# Dictionaries

You can think of lists and tuples as a series of variables indexed by an integer. Dictionaries are collections of variables indexed by a *key*, which can be almost any immutable object, typically (but by no means always) a string:

In [None]:
# Amino acids, the building blocks of proteins, are represented by three-letter symbols.
# This dictionary implements an (incomplete) look-up table for the full name
aminoas={'Ala':'Alanine', 'Cys': 'Cysteine', 'Pro': 'Proline', 'Leu': 'Leucine'}
print(aminoas['Leu'])
aminoas['His']='Histidine'
print(aminoas)
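
Keys do not have to be strings. As mentioned in the section on tuples, a tuple can serve as a key, while a list cannot. The sketch below uses a made-up table keyed by (three-letter code, one-letter code) pairs; the ```try```/```except``` construct, not covered yet, is only there to catch the error so that the cell runs to the end.

```python
# hypothetical look-up table: (three-letter code, one-letter code) -> full name
names = {('Ala', 'A'): 'Alanine', ('Cys', 'C'): 'Cysteine'}
print(names[('Ala', 'A')])
print(names['Cys', 'C'])    # the parentheses are optional inside [ ]

# a list is mutable, so Python refuses to use it as a key
list_key_rejected = False
try:
    bad = {['Ala', 'A']: 'Alanine'}
except TypeError as err:
    list_key_rejected = True
    print('TypeError:', err)
```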

You can add items one by one to an empty dictionary {}:

In [None]:
emptydic={}
print(len(emptydic))
emptydic['Foo']='Bar'
print(len(emptydic))

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import YouTubeVideo; from ipywidgets import interactive, IntSlider
def _play(resize): display(YouTubeVideo('218ne2ab3JA',width=resize, height=425*resize//1100, rel=0, loop=1))
interactive(_play, resize=IntSlider(min=300, max=1100, step=50, value=600, continuous_update=False, readout=False))

**UPDATE:** The language keeps evolving, and as of Python 3.6 a new implementation of ```dict``` was introduced that is more efficient, and also stores the keys in *insertion order*. Since version 3.7, this has become part of the standard, so printing a dictionary or extracting its keys (see below) will return the keys in the order they were inserted.
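
A minimal sketch of what insertion order means in practice, assuming Python 3.7 or later. The dictionary `order` and the other names are made up for the example.

```python
order = {}
order['Leu'] = 'Leucine'
order['Ala'] = 'Alanine'
order['Cys'] = 'Cysteine'
first = list(order)            # list() of a dictionary gives its keys
print(first)                   # insertion order, not alphabetical order

order['Leu'] = 'L-Leucine'     # updating an existing key keeps its position
after_update = list(order)
print(after_update)

del order['Leu']               # deleting and re-inserting moves the key to the end
order['Leu'] = 'Leucine'
after_reinsert = list(order)
print(after_reinsert)
```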

### Keys and values

You can display the keys and values contained in a dictionary using the corresponding methods:

In [None]:
print(aminoas.keys())
print(aminoas.values())

Or you can have them all together:

In [None]:
aminoas.items()
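
One detail worth knowing: ```keys```, ```values``` and ```items``` return *views* that follow later changes to the dictionary, not independent lists. The sketch below uses a separate small dictionary (`table`, a made-up name) so as not to disturb `aminoas`.

```python
table = {'Ala': 'Alanine', 'Cys': 'Cysteine'}

live_keys = table.keys()         # a view: it tracks the dictionary
snapshot = list(table.keys())    # a list: frozen at this point in time

table['Pro'] = 'Proline'
print(live_keys)                 # includes 'Pro'
print(snapshot)                  # does not

# items() yields (key, value) pairs, which can be unpacked in a comprehension
labels = [code + ' = ' + name for code, name in table.items()]
print(labels)
```

Wrapping a view in ```list()``` is the usual way to get something you can index or keep around unchanged.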

Trying to look up a nonexistent key leads to an error (a ```KeyError```):

In [None]:
aminoas['Ser']

So it may be better to check (this is a conditional expression, more on this later):

In [None]:
'Ala' in aminoas

Or you can play it safe:

In [None]:
print(aminoas.get('Ala','Oops...'))
print(aminoas.get('Ser','Oops...'))

You can use ```del``` to delete items (or you can ```pop``` an item from the dictionary, which removes it and returns its value):

In [None]:
del aminoas['Ala']
print(aminoas.pop('Pro'))
aminoas

We can ```update``` a dictionary with values from another:

In [None]:
more=dict(Ala='Alanine',Pro='Proline') # another way to construct a dictionary
print(more)
aminoas.update(more)
print(aminoas)
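
Keep in mind that ```update``` modifies the dictionary in place, and that values for keys that already exist are overwritten. A small sketch with made-up dictionaries `base` and `patch`:

```python
base = {'Ala': 'Alanine', 'Cys': 'cysteine'}    # note the lowercase c
patch = {'Cys': 'Cysteine', 'Pro': 'Proline'}

merged = dict(base)      # dict() of a dictionary makes a (shallow) copy
merged.update(patch)     # 'Cys' is overwritten, 'Pro' is added
print(merged)
print(base)              # the original is unchanged, since we updated the copy

base.update(patch)       # this time base itself is modified
print(base)
```

Copying first is a simple way to keep the original dictionary around when you need both versions.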

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import YouTubeVideo; from ipywidgets import interactive, IntSlider
def _play(resize): display(YouTubeVideo('WMUkwDgZa24',width=resize, height=425*resize//1100, rel=0, loop=1))
interactive(_play, resize=IntSlider(min=300, max=1100, step=50, value=600, continuous_update=False, readout=False))

And finally, let's turn this around - let's try to use the names as keys, and the three-letter codes as values:

In [None]:
k=aminoas.keys()
v=aminoas.values()
# zip is a useful function - it pairs up the elements of two sequences
l=list(zip(v,k))
print(v)
print(k)
print(l) # the output of zip
symbol=dict(l)
print(symbol)
print(symbol['Cysteine'])

Note that turning a dictionary on its head in this way is not normally possible, as two or more keys may have the same value (meaning you cannot in general use the values as keys) - nor is it necessarily useful, but it's a good exercise nevertheless!
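
To see what goes wrong when values are repeated, here is a sketch with a made-up codon table in which two keys share the same value:

```python
# hypothetical table: codon -> amino acid (two codons code for Ala)
codon_to_amino = {'GCU': 'Ala', 'GCC': 'Ala', 'UUA': 'Leu'}

inverted = dict(zip(codon_to_amino.values(), codon_to_amino.keys()))
print(inverted)                            # only one codon survives for 'Ala'
print(len(codon_to_amino), len(inverted))  # an entry has been silently lost
```

No error is raised: each repeated key simply overwrites the previous entry, so the last pair wins.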

In [None]:
# run this cell to show a video, use slider to resize it, type Esc-o to hide it
from IPython.display import YouTubeVideo; from ipywidgets import interactive, IntSlider
def _play(resize): display(YouTubeVideo('uC7-soGdAR0',width=resize, height=425*resize//1100, rel=0, loop=1))
interactive(_play, resize=IntSlider(min=300, max=1100, step=50, value=600, continuous_update=False, readout=False))

### Dict comprehensions

Removing printouts and intermediate variables, the above solution isn't that verbose:

In [None]:
symbol=dict(zip(aminoas.values(), aminoas.keys()))
print(symbol)

However, there is a more elegant way of doing it, by using a **dictionary comprehension**. From the [documentation](https://www.python.org/dev/peps/pep-0274/), "Dict comprehensions are just like list comprehensions, except that you group the expression using curly braces instead of square braces. Also, the left part before the ```for``` keyword expresses both a key and a value, separated by a colon. The notation is specifically designed to remind you of list comprehensions as applied to dictionaries." In our case, this is what the required comprehension would look like:

In [None]:
# note the { } instead of [ ] and the : sign
symbol={ v:k for k,v in aminoas.items() }
print(symbol)

The analogy with list comprehensions is evident. You can find a collection of increasingly intricate examples [here](https://towardsdatascience.com/10-examples-to-master-python-dictionary-comprehensions-7aaa536f5960) - you may want to try running and editing them in order to make sense of them.
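
A few more small sketches in the same spirit, building dictionaries from a list and filtering with a condition just as in list comprehensions. All names here (`alanine`, `codon_table`, `full_names`, `long_names`, `lengths`) are made up for the example.

```python
alanine = ['GCU', 'GCC', 'GCA', 'GCG']
full_names = {'Ala': 'Alanine', 'Cys': 'Cysteine', 'Pro': 'Proline', 'Leu': 'Leucine'}

# build a dictionary from a list: every Alanine codon maps to 'Ala'
codon_table = {codon: 'Ala' for codon in alanine}
print(codon_table)

# keep only the entries whose full name has more than 7 letters
long_names = {code: name for code, name in full_names.items() if len(name) > 7}
print(long_names)

# keys and values can both be computed: full name -> number of letters
lengths = {name: len(name) for name in full_names.values()}
print(lengths)
```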

**(C) 2014,2025 Fabrizio Smeraldi** ([info@patchypython.com](mailto:info@patchypython.com)), all rights reserved.