# An introduction to solving biological problems with Python

## Session 1.3: Collections, Lists and Strings

- [Tuples](#Tuples), [Lists](#Lists) and [Manipulating tuples and lists](#Manipulating-tuples-and-lists) | [Exercise 1.3.1](#Exercise-1.3.1)
- [String manipulations](#String-manipulations) | [Exercise 1.3.2](#Exercise-1.3.2)

As well as the basic data types introduced above, you will very often want to store and operate on collections of values, and Python has several _data structures_ that you can use to do this. The general idea is that you place several items into a single collection and then refer to that collection as a whole. Which one you use depends on the problem you are trying to solve.

## Tuples

- Can contain any number of items
- Can contain different types of items
- __Cannot__ be altered once created (they are immutable)
- Items have a defined order

A tuple is created by using round brackets around the items it contains, with commas separating the individual elements.

In [1]:
a = (123, 54, 92) # tuple of 3 integers
b = () # empty tuple
c = ("Ala",) # tuple of a single string (note the trailing ",")
d = (2, 3, False, "Arg", None) # a tuple of mixed types

print(a)
print(b)
print(c)
print(d)

(123, 54, 92)
()
('Ala',)
(2, 3, False, 'Arg', None)


You can of course use variables in tuples and other data structures:

In [2]:
x = 1.2
y = -0.3
z = 0.9
t = (x, y, z)

print(t)

(1.2, -0.3, 0.9)


Tuples can be _packed_ and _unpacked_ with a convenient syntax. The number of variables used to unpack the tuple must match the number of elements in the tuple.

In [3]:
t = 2, 3, 4 # tuple packing
print('t is', t)
x, y, z = t # tuple unpacking
print('x is', x)
print('y is', y)
print('z is', z)

t is (2, 3, 4)
x is 2
y is 3
z is 4


To unpack a tuple, the number of variables on the left-hand side (here `x`, `y` and `z`) must equal the number of items in the tuple. In the example above, writing `x, y = t` would raise an error, because `t` contains three items but only two variables are supplied.
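
As a minimal sketch of what that error looks like, the snippet below tries to unpack three values into two names and catches the resulting `ValueError`. The `try`/`except` construct has not been introduced at this point in the course; it is used here only so that the example runs to completion, and the name `message` is purely illustrative.

```python
t = 2, 3, 4  # a tuple of three items

try:
    x, y = t  # only two names for three values
except ValueError as error:
    message = str(error)
    print("Unpacking failed:", message)

# With the right number of names, unpacking works
x, y, z = t
print(x, y, z)
```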

## Lists

- Can contain any number of items
- Can contain different types of items
- __Can__ be altered once created (they are _mutable_)
- Items have a particular order

Lists are created with square brackets around their items:

In [4]:
a = [1, 3, 9]
b = ["ATG"]
c = []

print(a)
print(b)
print(c)

[1, 3, 9]
['ATG']
[]


Lists and tuples can contain other lists and tuples, or any other type of collection:

In [5]:
matrix = [[1, 0], [0, 2]]
print(matrix)

[[1, 0], [0, 2]]
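
Items inside a nested collection can be reached by chaining square brackets, one pair per level. The short sketch below reuses `matrix` from the cell above; the `mixed` tuple is made-up data added only for illustration.

```python
matrix = [[1, 0], [0, 2]]

first_row = matrix[0]   # the inner list [1, 0]
corner = matrix[1][1]   # second inner list, second item

print(first_row)
print(corner)

# A tuple can hold lists, and a list can hold tuples
mixed = ([1, 2], ("Ala", "Arg"))
print(mixed[1][0])
```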


You can convert between tuples and lists with the <tt>tuple</tt> and <tt>list</tt> functions. Note that these create a new collection with the same items, and leave the original unaffected.

In [7]:
a = (1, 4, 9, 16)     # A tuple of numbers
b = ['G','C','A','T'] # A list of characters

print(a)
print(b)

l = list(a)   # Make a list based on a tuple 
print(l)

t = tuple(b)  # Make a tuple based on a list
print(t)

(1, 4, 9, 16)
['G', 'C', 'A', 'T']
[1, 4, 9, 16]
('G', 'C', 'A', 'T')
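
To see that the conversion really does produce an independent collection, the following sketch changes the new list and then prints both objects. It uses `append()`, which is described later in this session.

```python
a = (1, 4, 9, 16)
l = list(a)    # new list with the same items
l.append(25)   # change the list only

print(a)  # the tuple still has its original four items
print(l)
```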


## Manipulating tuples and lists

Once your data is in a list or tuple, Python supports a number of ways you can access elements of the list and manipulate the list in useful ways, such as sorting the data.

Tuples and lists can generally be used in very similar ways.

### Index access

You can access individual elements of the collection using their *index*. Note that the first element is at index 0. Negative indices count backwards from the end, so index -1 is the last element.

In [6]:
t = (123, 54, 92, 87, 33)
x = [123, 54, 92, 87, 33]

print('t is', t)
print('t[0] is', t[0])
print('t[2] is', t[2])

print('x is', x)
print('x[-1] is', x[-1])

t is (123, 54, 92, 87, 33)
t[0] is 123
t[2] is 92
x is [123, 54, 92, 87, 33]
x[-1] is 33
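
The sketch below relates negative indices to the length of the collection and shows what happens when an index is out of range. The `try`/`except` block is only there so that the example keeps running after the error; the variable names are illustrative.

```python
x = [123, 54, 92, 87, 33]

last_by_negative = x[-1]
last_by_length = x[len(x) - 1]  # same element, counted from the front
print(last_by_negative, last_by_length)

try:
    x[5]  # valid indices for five items are 0 to 4
except IndexError as error:
    print("IndexError:", error)
```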


### Slices

You can also access a range of items, known as *slices*, from inside lists and tuples using a colon `:` to indicate the beginning and end of the slice inside the square brackets. **Note that the slice notation `[a:b]` includes positions from `a` up to *but not including* `b`**.

In [8]:
t = (123, 54, 92, 87, 33)
x = [123, 54, 92, 87, 33]
print('t[1:3] is', t[1:3])
print('x[2:] is', x[2:])
print('x[:-1] is', x[:-1])

t[1:3] is (54, 92)
x[2:] is [92, 87, 33]
x[:-1] is [123, 54, 92, 87]


Note that `x[:-1]` stops just before the last item, so it returns every item except the last one, still in the original order.
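
A small sketch of the "up to but not including" rule: the number of items in `x[a:b]` is `b - a`, and two slices that share a boundary rebuild the list without repeating an item. The `+` used here joins two lists and is covered later in this session.

```python
x = [123, 54, 92, 87, 33]

middle = x[1:4]  # items at positions 1, 2 and 3
print(middle, len(middle))

# Because the end position is excluded, adjacent slices do not overlap
rebuilt = x[:2] + x[2:]
print(rebuilt == x)

# A slice that runs past the end is simply cut short
tail = x[3:100]
print(tail)
```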

### `in` operator
You can check if a value is in a tuple or list with the <tt>in</tt> operator, and you can negate this with <tt>not</tt>

In [9]:
t = (123, 54, 92, 87, 33)
x = [123, 54, 92, 87, 33]
print('123 in', x, 123 in x)
print('234 in', t, 234 in t)
print('999 not in', x, 999 not in x)

123 in [123, 54, 92, 87, 33] True
234 in (123, 54, 92, 87, 33) False
999 not in [123, 54, 92, 87, 33] True


### `len()` and `count()` functions
You can get the length of a list or tuple with the built-in <tt>len()</tt> function, and you can count how many times a particular value occurs in a list or tuple with the <tt>.count()</tt> method.

In [10]:
t = (123, 54, 92, 87, 33)
x = [123, 54, 92, 87, 33]
print("length of t is", len(t))
print("number of 33s in x is", x.count(33))

length of t is 5
number of 33s in x is 1
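
As a further sketch, `.count()` also works on tuples and returns 0 when the value is absent. The codon tuple below is made-up data for illustration.

```python
codons = ("ATG", "TAT", "AGT", "TAT", "TAA")  # illustrative data

print(len(codons))
n_tat = codons.count("TAT")
n_ggg = codons.count("GGG")  # not present, so the count is zero
print(n_tat, n_ggg)
```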


### Modifying lists
You can alter lists in place, but not tuples.

In [11]:
x = [123, 54, 92, 87, 33]
print(x)
x[2] = 33 #modifies the list in variable x, in the position 2 (remember to start indexing at 0)
print(x)

[123, 54, 92, 87, 33]
[123, 54, 33, 87, 33]


Tuples _cannot_ be altered once they have been created. If you try to do so, you'll get an error.

In [12]:
t = (123, 54, 92, 87, 33)
print(t)
t[1] = 4

(123, 54, 92, 87, 33)


TypeError: 'tuple' object does not support item assignment
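
If you need a tuple with one value changed, you have to build a new tuple. Two possible approaches are sketched below, using only operations from this session; the variable names are illustrative.

```python
t = (123, 54, 92, 87, 33)

# Option 1: convert to a list, change it, convert back
as_list = list(t)
as_list[1] = 4
t_from_list = tuple(as_list)

# Option 2: join slices of the old tuple around the new value
t_from_slices = t[:1] + (4,) + t[2:]

print(t)  # the original tuple is unchanged
print(t_from_list)
print(t_from_slices)
```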

You can add elements to the end of a list with <tt>append()</tt>

In [13]:
x = [123, 54, 92, 87, 33]
x.append(101)
print(x)

[123, 54, 92, 87, 33, 101]


Appending will add a new element to the end of the list only.

Or you can insert values at a certain position with <tt>insert()</tt>, by supplying the desired position as well as the new value.

In [14]:
x = [123, 54, 92, 87, 33]
x.insert(3, 1111)
print(x)

[123, 54, 92, 1111, 87, 33]


You can remove a value with <tt>remove()</tt>, which deletes the first matching item

In [15]:
x = [123, 54, 92, 87, 33]
x.remove(123)
print(x)

[54, 92, 87, 33]


and delete values by index with <tt>del</tt>

In [16]:
x = [123, 54, 92, 87, 33]
print(x)
del x[0]
print(x)

[123, 54, 92, 87, 33]
[54, 92, 87, 33]
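
The difference between the two is that `remove()` looks for a value while `del` works on a position. The sketch below uses a made-up list containing a repeated value to show that `remove()` only deletes the first match, and that it raises a `ValueError` when the value is missing (caught here with `try`/`except` so the example keeps running).

```python
x = [33, 54, 33, 87]

x.remove(33)  # removes the first 33 only
print(x)

del x[0]  # removes whatever is at index 0
print(x)

try:
    x.remove(999)  # value is not in the list
except ValueError as error:
    print("ValueError:", error)
```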


It's often useful to be able to combine lists, which can be done with <tt>extend()</tt> (<tt>append()</tt> would instead add the whole second list as a single element).

In [18]:
a = [1,2,3]
b = [4,5,6]
a.extend(b)
print(a)
a.append(b) #append here will add the list as a new element, so index 6 will be the 'b' list
print(a)

[1, 2, 3, 4, 5, 6]
[1, 2, 3, 4, 5, 6, [4, 5, 6]]


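The plus symbol <tt>+</tt> also concatenates lists. Unlike <tt>extend()</tt>, which changes the list in place, <tt>+</tt> creates a new list, so the result has to be assigned to a variable: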

In [19]:
a = [1, 2, 3]
b = [4, 5, 6]
a = a + b
print(a)

[1, 2, 3, 4, 5, 6]
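
The sketch below makes the distinction between `extend()` and `+` visible by keeping a second name for each list. The names `alias` and `other` are illustrative.

```python
a = [1, 2, 3]
b = [4, 5, 6]
alias = a  # a second name for the same list object

a.extend(b)   # changes the existing list in place
print(alias)  # the change is visible through alias too

c = [1, 2, 3]
other = c
c = c + b     # builds a new list and re-binds the name c
print(other)  # the original list is untouched
print(c)
```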


Slice syntax can be used on the left hand side of an assignment operation to assign subregions of a list

In [22]:
a = [1, 2, 3, 4, 5, 6]
a[1:3] = [9, 9, 9, 9]
print(a)

[1, 9, 9, 9, 9, 4, 5, 6]


You can change the order of elements in a list

In [23]:
a = [1, 3, 5, 4, 2]
a.reverse()
print(a)
a.sort()
print(a)

[2, 4, 5, 3, 1]
[1, 2, 3, 4, 5]


Note that both of these methods change the list in place, so when you access the variable `a` again it holds the reordered list and the original order is lost.

If you want a sorted copy of the list while leaving the original untouched, use <tt>sorted()</tt>.

In [24]:
a = [2, 5, 7, 1]
b = sorted(a)
print(a)
print(b)

[2, 5, 7, 1]
[1, 2, 5, 7]
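
A common slip is to write `a = a.sort()`. As the sketch below shows, `.sort()` returns `None`, whereas `sorted()` returns a new list. The `reverse=True` argument, which sorts in descending order, is an extra not covered in the text above.

```python
a = [2, 5, 7, 1]

result = a.sort()  # sorts in place and returns None
print(result)
print(a)

t = (2, 5, 7, 1)
descending = sorted(t, reverse=True)  # accepts a tuple, returns a list
print(descending)
print(t)
```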


### Getting help from the official Python documentation

The most useful reference information is available online at https://www.python.org/.

- [Python 3 documentation](https://docs.python.org/3/) is the starting page with links to tutorials and libraries' documentation for Python 3
    - [The Python Tutorial](https://docs.python.org/3/tutorial/index.html)
    - [The Python Standard Library Reference](https://docs.python.org/3/library/index.html) is the documentation of all libraries included within Python as well as built-in functions and data types like:
        - [Text Sequence Type — `str`](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)
        - [Numeric Types — `int`, `float`](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex)
        - [Sequence Types — `list`, `tuple`](https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range)
        - [Set Types — `set`](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset)
        - [Mapping Types — `dict`](https://docs.python.org/3/library/stdtypes.html#mapping-types-dict)
        
### Getting help directly from within Python using `help()`

In [25]:
help(len)

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



In [26]:
help(list)

Help on class list in module builtins:

class list(object)
 |  list() -> new empty list
 |  list(iterable) -> new list initialized from iterable's items
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __l

In [27]:
help(list.insert)

Help on method_descriptor:

insert(...)
    L.insert(index, object) -- insert object before index



In [28]:
help(list.count)

Help on method_descriptor:

count(...)
    L.count(value) -> integer -- return number of occurrences of value



## Exercise 1.3.1

1. Create a *list* of DNA codons for the protein sequence CLYSY based on the codon variables you defined previously.
2. Print the DNA sequence of the protein to the screen.
3. Print the DNA codon of the last amino acid in the protein sequence.
4. Create two more variables containing the DNA sequence of a stop codon and a start codon, and replace the first element of the DNA sequence with the start codon and append the stop codon to the end of the DNA sequence. Print out the resulting DNA sequence.

In [76]:
#1 and #2
C="TGT"
L="CTT"
Y="TAT"
S="AGT"

DNAcodons=[C,L,Y,S,Y]
print(DNAcodons)

['TGT', 'CTT', 'TAT', 'AGT', 'TAT']


In [77]:
#3
DNAcodons[4]

'TAT'

In [78]:
#3 (alternative using a negative index)
DNAcodons[-1]

'TAT'

In [79]:
#3 (alternative using the length of the list)
DNAcodons[len(DNAcodons)-1]

'TAT'

In [81]:
#4
stop="TAA"
start="ATG"

DNAcodons[0]=start      # replace the first codon with the start codon
DNAcodons.append(stop)  # add the stop codon to the end
DNAcodons

['ATG', 'CTT', 'TAT', 'AGT', 'TAT', 'TAA']

## String manipulations

Strings are a lot like tuples of characters: individual characters and substrings can be accessed and manipulated using operations similar to those introduced above.


In [83]:
text = "ATGTCATTTGT"
print(text[0])
print(text[-2])
print(text[0:6])
print("ATG" in text)
print("TGA" in text)
print(len(text))

A
G
ATGTCA
True
False
11


Just as with tuples, trying to assign a value to an element or slice of a string results in an error.

In [84]:
text = "ATGTCATTTGT"
text[0:2] = "CCC" 

TypeError: 'str' object does not support item assignment

Python provides a number of useful methods that let you manipulate strings.

The `in` operator lets you check if a substring is contained within a larger string, but it does not tell you where the substring is located. This is often useful to know, and Python provides the <tt>.find()</tt> method, which returns the index of the first occurrence of the search string, and the <tt>.rfind()</tt> method, which searches from the end of the string and so returns the index of the last occurrence.

If the search string is not found in the string both these methods return -1.

In [85]:
dna = "ATGTCACCGTTT"
index = dna.find("TCA")
print("TCA is at position:", index)
index = dna.rfind('C')
print("The last Cytosine is at position:", index)
print("Position of a stop codon:", dna.find("TGA"))

TCA is at position: 3
The last Cytosine is at position: 7
Position of a stop codon: -1
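
Because -1 is also a valid index (the last character), using the result of a failed `.find()` directly gives a misleading answer rather than an error. The sketch below shows the pitfall and one way to guard against it; the `if` statement is used here ahead of its formal introduction in the course.

```python
dna = "ATGTCACCGTTT"

index = dna.find("TGA")
print(index)
print(dna[index])  # -1 is a valid index: this is the LAST base, not a match

# Safer: test for the substring first (or compare the result with -1)
if "TCA" in dna:
    start = dna.find("TCA")
    codon = dna[start:start + 3]
    print("Found", codon, "at", start)
```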


When we are reading text from files (which we will see later on), there is often unwanted whitespace at the start or end of the string. We can remove *leading* whitespace with the <tt>.lstrip()</tt> method, *trailing* whitespace with <tt>.rstrip()</tt>, and whitespace from both ends with <tt>.strip()</tt>.

All of these methods return a copy of the changed string, so if you want to replace the original you can assign the result of the method call to the original variable.

In [87]:
s = "    Chromosome Start End                     "
print(len(s), s)
#removes leading whitespace
s = s.lstrip()
print(len(s), s)
#removes trailing whitespace
s = s.rstrip()
print(len(s), s)
s = "    Chromosome Start End                     "
#removes whitespace from both ends (whitespace inside the string is kept)
s = s.strip()
print(len(s), s)

45     Chromosome Start End                     
41 Chromosome Start End                     
20 Chromosome Start End
20 Chromosome Start End
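
Whitespace here includes tabs and newline characters as well as spaces. The sketch below uses a made-up tab-separated line, as it might come from a file, and `repr()` to make the invisible characters visible. It also shows that the strip methods accept an optional argument naming the characters to remove instead of whitespace.

```python
line = "chr1\t100\t200\n"  # made-up line, as it might come from a file

clean = line.rstrip()  # drops the trailing newline, keeps the inner tabs
print(repr(line))
print(repr(clean))

# With an argument, strip() removes those characters instead of whitespace
padded = "NNNNATGCNN"
trimmed = padded.strip("N")
print(trimmed)
```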


You can split a string into a list of substrings using the <tt>.split()</tt> method, supplying the delimiter as an argument to the method. If you don't supply any delimiter, the method will split the string on whitespace by default (which is very often what you want!).

To split a string into its component characters you can simply _convert_ the string to a list with <tt>list()</tt>.

In [88]:
seq = "ATG TCA CCG GGC"
codons = seq.split(" ")
print(codons)

bases = list(seq) # a string converted into a list of characters (spaces included)
print(bases)

['ATG', 'TCA', 'CCG', 'GGC']
['A', 'T', 'G', ' ', 'T', 'C', 'A', ' ', 'C', 'C', 'G', ' ', 'G', 'G', 'C']
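
Splitting on an explicit `" "` and splitting with no argument behave differently when the text contains runs of spaces. The sketch below uses a made-up sequence with a double space to show this.

```python
seq = "ATG  TCA CCG"  # note the two spaces after ATG

by_space = seq.split(" ")    # splits at every single space
by_whitespace = seq.split()  # treats any run of whitespace as one separator

print(by_space)
print(by_whitespace)
```

With an explicit delimiter, the two adjacent spaces produce an empty string in the result, which is usually not what you want when parsing loosely formatted text.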


<tt>.join()</tt> is the counterpart to <tt>.split()</tt>: it joins the elements of a list into a single string, using the string it is called on as the separator. It only works if all the elements are strings:

In [91]:
seq = "ATG TCA CCG GGC"
codons = seq.split(" ")
print(codons)
print("|".join(codons))
print("".join(codons)) #no special divider

['ATG', 'TCA', 'CCG', 'GGC']
ATG|TCA|CCG|GGC
ATGTCACCGGGC
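
The sketch below shows what happens when the list contains non-string items, and one way round it. The list comprehension `[str(p) for p in positions]` is a construct not yet covered at this point in the course; it simply applies `str()` to every item. The data is made up.

```python
positions = [100, 250, 975]  # illustrative integers

try:
    ",".join(positions)
except TypeError as error:
    print("TypeError:", error)

# Convert each item with str() first
as_text = [str(p) for p in positions]
joined = ",".join(as_text)
print(joined)
```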


We also saw earlier that the <tt>+</tt> operator lets you concatenate strings together into a larger string.

Note that <tt>+</tt> can only concatenate a string with another string. If you want to concatenate a string with an integer (or some other type), you first have to convert the integer to a string with the <tt>str()</tt> function.

In [92]:
s = "chr"
chrom_number = 2
print(s + str(chrom_number))

chr2


To find out more about the `split()` and `join()` methods, consult the online Python documentation, starting from [www.python.org](http://www.python.org), or use the built-in `help()` function.

In [93]:
help(str.split)
help(str.join)

Help on method_descriptor:

split(...)
    S.split(sep=None, maxsplit=-1) -> list of strings
    
    Return a list of the words in S, using sep as the
    delimiter string.  If maxsplit is given, at most maxsplit
    splits are done. If sep is not specified or is None, any
    whitespace string is a separator and empty strings are
    removed from the result.

Help on method_descriptor:

join(...)
    S.join(iterable) -> str
    
    Return a string which is the concatenation of the strings in the
    iterable.  The separator between elements is S.



## Exercise 1.3.2

1. Create a string variable with your full name in it, with your first and last name (and any middle names) separated by a space. Split the string into a list, and print out your surname.
2. Check if your surname contains the letter "E", and print out the position of this letter in the string. Try a few other letters.

In [98]:
#1
name="Amanda Lopes"
broken=name.split(" ")
broken

['Amanda', 'Lopes']

In [99]:
broken[1]

'Lopes'

In [105]:
#curiosity, not in the exercise
NewName=" ".join(broken)
NewName

'Amanda Lopes'

In [104]:
NewName[1]

'm'

`NewName` is now a single string containing two words, so asking for index `[1]` gives me the second letter of the first name.

In [107]:
#2
broken[1].find("e")

3

In [108]:
broken[1].find("E")

-1

Strings are case sensitive, as seen above. Remember that -1 here indicates that the search was unsuccessful.

In [110]:
broken[1].find("g")

-1

## Next session

Go to our next notebook: [python_basic_1_4](python_basic_1_4.ipynb)