<div align=right>
<img src="img/logosmall.png" width="100px" align=right>
</div>

# Lists and loops

<div class="alert alert-warning">
Parts of this section have been adapted from copyrighted material in *Jones, M:
Python for Biologists: A complete programming course for beginners (2013)*.

**Please do not distribute it!**
</div>

So far in this course, we've always been dealing with one piece of data at a
time.  If we wanted to process multiple items of data in the same way, we had to
repeat the code once for each item.

If this was all Python could do, it would obviously not be very helpful in
solving real-world problems.

Recall the example from the previous Notebook where we read in three DNA
sequences from a file, and then performed the same manipulation on each of them.
At the time we asked ourselves, wouldn't it be nice if we could say _for each
line in the file, do “something”_?  We'd save a lot of repetition and avoid many
possible errors if we could do that.

That is what this Notebook is all about.

## Lists

### Creating lists

When we have a lot of related data items that we want to treat as a unit, we
need a _list_.  A list is what we call a _data structure_ or _compound data
type_.  It expresses the concept of a *collection* of other data items.

To make a new list, we put several items (like strings or numbers) inside square
brackets, separated by commas:

In [1]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
conserved_sites = [24, 56, 132, 289, 427, 558]

>Note how we put a space after every comma in the list definition, and how we
*don't* put a space after the opening bracket or before the closing bracket.

>Python does not, in fact, care whether we put any whitespace in these locations
or not, but it's considered *good style* to use whitespace as we have in the
above definitions.  Following generally accepted style guides means your code
looks more or less like everyone else's, which is a *good thing* since it aids
readability.

As you can see, we assigned a variable to reference each of the two lists we
created. Let's evaluate our list variables:

In [2]:
apes

['Homo sapiens', 'Pan troglodytes', 'Gorilla gorilla']

In [3]:
conserved_sites

[24, 56, 132, 289, 427, 558]

Just like a simple data type (like a string or number), a list evaluates to
itself.

### Retrieving list elements

Each individual item in a list is called an element. To get a single element
from the list, write the variable name followed by the index of the element you
want in square brackets:

In [1]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
conserved_sites = [24, 56, 132]
print(apes[0])
third_site = conserved_sites[2]
print(third_site)

Homo sapiens
132


Remember that Python is *zero-indexed*, so we start counting from zero rather
than one.  The first element of a list is always at index zero, the second
element at index 1, and so on.

If we give a negative number, Python starts counting from the end of the list
rather than the beginning – so it’s easy to get the last element from a list:

In [5]:
conserved_sites[-1]

132

If we want to go in the other direction – i.e. we know which element we want but
we don’t know the index – we can use the `index()` method of lists:

In [None]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
chimp_index = apes.index("Pan troglodytes")
print(chimp_index)

Negative indices count back from the end of the list.  So, the last element of
the list has index `-1`:

In [7]:
last_ape = apes[-1]
print(last_ape)

Gorilla gorilla


If we want to count how many times a particular element occurs in a list, we can
use the `count()` method of the list type:

In [6]:
digits = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0]
ones = digits.count(1)
print(ones)

6


It's probably worth mentioning at this point that a list can indeed *have*
repeated elements.  It's just a list of items, not a (mathematical) set, which
is usually defined as having unique elements.

What if we want to get more than one element from a list? We can use slice
notation to extract a range of elements:

In [8]:
ranks = ["kingdom", "phylum", "class", "order", "family"]
lower_ranks = ranks[2:5]
# lower ranks are class, order and family
print(lower_ranks)

['class', 'order', 'family']


Let's play a bit more with slices.  Let's build up a simple list of numbers
first:

In [10]:
ints = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
        11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

>Note how you can split the definition of a list over multiple lines. In fact,
you can break a Python statement across multiple lines at any whitespace within
a pair of parentheses, brackets or braces. Avoiding ridiculously long lines is
also a sign of good coding style.

Evaluate these statements which all use slice notation, and see if they all make
sense to you:

In [11]:
# from the start up to (but not including) index 5, i.e. the first five elements:
ints[:5]

[1, 2, 3, 4, 5]

In [12]:
# from the 5th last element to the end...:
ints[-5:]

[16, 17, 18, 19, 20]

In [13]:
# from index 5 to just before index 15, taking every 5th element:
ints[5:15:5]

[6, 11]
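
As a further sketch of the same notation (the particular slices and the variable names on the left are just illustrations), remember that the start index of a slice is included, the stop index is excluded, and either one can be left out:

```python
ints = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
        11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

first_three = ints[:3]       # indices 0, 1 and 2
from_index_17 = ints[17:]    # index 17 up to the end of the list
every_fifth = ints[::5]      # indices 0, 5, 10 and 15
whole_copy = ints[:]         # a new list holding all the elements

print(first_three)
print(from_index_17)
print(every_fifth)
print(len(whole_copy))
```

Note that a slice always gives us a *new* list; the original `ints` is left untouched.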

### Calculating the length of a list

To get the length of a list, we can use Python's built-in `len()` function:

In [14]:
print(len(apes))
number_of_conserved_sites = len(conserved_sites)
print(number_of_conserved_sites)

3
3


### Concatenating lists

We can concatenate two lists by using the mathematical plus (`+`) symbol, which
has been overloaded to work on lists:

In [15]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
monkeys = ["Papio ursinus", "Macaca mulatta"]
primates = apes + monkeys
 
print(len(apes), "apes")
print(len(monkeys), "monkeys")
print(len(primates), "primates")

3 apes
2 monkeys
5 primates


As we can see from the output, this doesn’t change either of the two original
lists – it makes a brand new list which contains elements from both.

### Repeating lists

You can repeat a list by using the multiplication operator (`*`):

In [16]:
short_list = [1, 2, 3]
print(short_list * 3)

[1, 2, 3, 1, 2, 3, 1, 2, 3]


In [None]:
zero_list = [0]   # a list with just one element
zeroes = zero_list * 100
print(zeroes)

Note that we can print a list by simply passing it as a parameter to the `print`
function.  As usual, `print` prints out a textual representation of the list,
which includes the square brackets and commas.

### Lists and strings

Did any of the above look familiar?  It should have, because it turns out
*lists* have an awful lot in common with *strings*:

* You can extract an individual element of a list (or character of a string) by
using the indexing notation `[]`.

* You can extract a sublist of a list, or substring of a string, using
identical slice notation.

* You can concatenate lists and strings, both with the plus operator (`+`).

* You can repeat either lists or strings with the multiplication operator (`*`).

* You can use the built-in function `len()` to calculate the length of a list or
a string.

* Both strings and lists have the methods `index()` and `count()`.

In a way, this makes sense, because a string is almost like a list of
characters, and we now know that we can treat it as such.

The fact that lists and strings have so much in common hints at a deeper
relationship:  We're treating lists and strings as if they're both the same sort
of meta-thing for which we don't yet have a name.

This idea – that we can treat two different things that have similar properties
in the same way;  that two similar things can be seen as special cases of a more
general meta-thing – is a powerful one in Python and we’ll come back to it later
in this section… and beyond.
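
To make the parallels listed above concrete, here is a small sketch that applies each operation to a string and to a list in turn. The names `dna_string` and `dna_list` are made up for this illustration:

```python
dna_string = "ACGTAC"
dna_list = ["A", "C", "G", "T", "A", "C"]

# the same operation is applied to the string and to the list in turn
print(dna_string[0], dna_list[0])                  # indexing
print(dna_string[1:3], dna_list[1:3])              # slicing
print(len(dna_string), len(dna_list))              # length
print(dna_string.count("A"), dna_list.count("A"))  # counting
print(dna_string.index("G"), dna_list.index("G"))  # searching
print(dna_string + "GG", dna_list + ["G", "G"])    # concatenation
print(dna_string * 2, dna_list * 2)                # repetition
```

In every case the result has the same type as what we started with: slicing a string gives a string, and slicing a list gives a list.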

## Loops

Imagine we wanted to take our list of apes:

In [17]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]

…and print out each element on a separate line, like this:

    Homo sapiens is an ape
    Pan troglodytes is an ape
    Gorilla gorilla is an ape

One way to do it would be to just print each element separately:

In [18]:
print(apes[0], "is an ape")
print(apes[1], "is an ape")
print(apes[2], "is an ape")

Homo sapiens is an ape
Pan troglodytes is an ape
Gorilla gorilla is an ape


…but this is very repetitive and relies on us knowing the number of elements in
the list. What we need is a way to say something along the lines of *“for each
element in the list of apes, print out the element, followed by the words ‘is an
ape’”*. Python’s loop syntax allows us to express those instructions like this:

In [None]:
for ape in apes:
    print(ape, "is an ape")

Let’s take a moment to look at the different parts of this loop.  The loop
consists of two parts:

* The `for` statement, which ends with a colon (“`:`”)
* The *body* of the loop, which is indented

This is the first time we're seeing Python code that doesn't start immediately
on the left margin.  We'll be coming back to that in a second.

The `for` line is of the format…

```python
for <variable> in <list>:
```

…where `<list>` is obviously the list we want to process.  What is `<variable>`?
It's the name of the variable we want to assign to each element of the list in
turn.

`<variable>` — `ape` in our case — is just a normal variable name (so it follows
all the rules that we’ve already learned about variable names). But it behaves
slightly differently to all the other variables we’ve seen so far. In all
previous examples, we create a variable and store something in it, and then the
value of that variable doesn’t change unless we change it ourselves. In
contrast, when we create a variable to be used in a loop, we don’t set its value
– the value of the variable will be automatically set to each element of the
list in turn, and it will be different each time round the loop.

In [19]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape, "is an ape -- its name starts with", first_letter)
    print("Its name has", name_length, "letters")

Homo sapiens is an ape -- its name starts with H
Its name has 12 letters
Pan troglodytes is an ape -- its name starts with P
Its name has 15 letters
Gorilla gorilla is an ape -- its name starts with G
Its name has 15 letters


Again we see that the `for` line ends with a colon, and the *body* of the loop
is indented.  Unlike the previous example, the body is now made up of more than
one statement (four, to be exact).  Two of those statements are calls to the
`print` function, so each time round the loop we get two lines of output.  With
three elements in the `apes` list, that makes six lines in all, which is exactly
what we see in the output.

Indented lines can start with any number of tab or space characters, but they
must all be indented in the same way. (There is a strong recommendation in the
Python community to use *four spaces*.  And using tabs is strongly discouraged.)

A group of indented lines is often called a code *block*.

In this case, we refer to the indented block as the body of the loop, and the
lines inside it will be executed once for each element in the list. To refer to
the current element, we use the variable name that we wrote in the first line.
The body of the loop can contain as many lines as we like, and can include all
the functions and methods that we’ve learned about.

Why is the above approach better than printing out these six lines in six
separate statements? Well, for one thing, there’s much less redundancy – here we
only needed to write two print statements. This also means that if we need to
make a change to the code, we only have to make it once rather than three
separate times. Another benefit of using a loop here is that if we want to add
some elements to the list, we don’t have to touch the loop code at all.
Consequently, it doesn’t matter how many elements are in the list, and it’s not
a problem if we don’t know how many are going to be in it at the time when we
write the code.

#### Indentation errors

Unfortunately, introducing tools like loops that require an indented block of
code also introduces the possibility of a new type of error – an
`IndentationError`. Notice what happens when the indentation of one of the lines
in the block does not match the others:

In [20]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
    name_length = len(ape)
   first_letter = ape[0]
    print(ape, "is an ape. Its name starts with", first_letter)
    print("Its name has", name_length, "letters")

IndentationError: unindent does not match any outer indentation level


When you encounter an `IndentationError`, go back to your code and double-check
that all the lines in the block match up.

### Iterating over a string

We’ve already seen how strings can act like lists in many instances.  Can we
also use loop notation to process a string as though it were a list?

In [21]:
name = "martin"
for character in name:
    print("one character is", character)

one character is m
one character is a
one character is r
one character is t
one character is i
one character is n


Yes, it works!

If we write a `for` loop to *iterate* over a string, Python treats each
character in the string as a separate element. This allows us to process a
string one character at a time quite easily.

The process of repeating a set of instructions for each element of a list (or
character in a string) is called *iteration*, and we often talk about *iterating
over* a list or string using a `for` loop.
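
As a small sketch of character-by-character processing, the loop below numbers each base of a short, made-up DNA string. The counter variable `position` is our own addition; we update it by hand each time round the loop:

```python
dna = "ACGT"
position = 1
for base in dna:
    print("base", position, "is", base)
    position = position + 1   # move on to the next position
```

The body runs four times, once for each character of `dna`.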

### Reading a file into a list of strings

Remember our example from the previous notebook?

```python
my_file = open("dna.txt", 'r')

my_line1 = my_file.readline()
my_dna1 = my_line1.rstrip()
print("sequence is", my_dna1, "and its length is", len(my_dna1), "bases")

# ...etc., for the other lines

my_file.close()
```

At the time we said we wished we had a way to say, _for each line in the file, do “something”_.

We're now a step closer to that:  We have a way of saying, _for each element in this list, do “something”_.

As it turns out, file objects also have a `readlines()` method for reading a file into a list of strings, with each line in the file becoming an element in the list.

In [22]:
%cd files

my_file = open("dna.txt", 'r')
list_of_seqs = my_file.readlines()
my_file.close()

list_of_seqs

/Users/sabineurban/EVOP2017/files


['ACTGTACGTGCACTGATC\n', 'CTGGCATAGTCTTATTTT\n', 'CAGGGCGGCGGATCTCTT']

We can now *iterate over* that list using a `for` loop:

In [23]:
for sequence in list_of_seqs:
    sequence = sequence.rstrip()
    print("sequence is", sequence, "and its length is", len(sequence), "bases")

sequence is ACTGTACGTGCACTGATC and its length is 18 bases
sequence is CTGGCATAGTCTTATTTT and its length is 18 bases
sequence is CAGGGCGGCGGATCTCTT and its length is 18 bases


What did we gain?

* Our code is more succinct and readable.

* Our code is less error-prone, since we didn't have to repeat a part of it over and over.

* Our code is more *scalable*, since we can now apply it to a file with an arbitrary number of lines.

But we can still do one better…

### Iterating over a file

We've seen before that, in some circumstances, we can treat strings and lists as the same sort of thing.  We've also seen that we can *iterate over* both lists and strings.

Strings and lists are clearly both "sequences" in some sense, and it makes sense that you can iterate over the elements of a sequence.  In fact, it turns out that you can iterate over many
things in Python that are in some sense "containers".  If you come across some new sort of object that in some sense contains other things, try to iterate over it using a `for` loop and you might well be surprised.  Or rather, *un*-surprised, since it'll likely do exactly what you'd expect.

>Python prides itself on *consistency*.  Consistency allows you to intuit new features based on what you already know.

Files aren't obviously containers, though.  So what happens when we open a text file and try to iterate over the resulting file object?

Let's go back to our example:

In [24]:
my_file = open("dna.txt", 'r')
for element in my_file:
    print("The element is:", element)
my_file.close()

The element is: ACTGTACGTGCACTGATC

The element is: CTGGCATAGTCTTATTTT

The element is: CAGGGCGGCGGATCTCTT


Python did the logical thing:  `dna.txt` is a text file containing several lines of text, and iterating over it yields one line of the file every time we go around the loop.  Clearly, this is a *very* useful idiom for people in our line of work!

Notice, though, that there's a blank line between every two lines printed by the
`print` function.  Can you guess the reason why?

Yes, each line we read from `dna.txt` still ends with its newline character `\n` (only the very last line of a file may lack one), and the `print()` function appends *another* newline, which means we end up with a blank line after every step.  It's a good idea to strip off that annoying `\n`, and fortunately
we already know just the tool to do it:

In [26]:
seq_file = open("10_sequences.txt", 'r')
for line in seq_file:
    line = line.rstrip()
    print("Line:", line)
seq_file.close()

Line: CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA
Line: ATTTTTAACTTTTCTCTGTCGTCGCACAATCGACTTTCTCTGTTTTCTTGGGTTTACCGGAA
Line: TTGTTTCTGCTGCGATGAGGTATTGCTCGTCAGCCTGAGGCTGAAAATAAAATCCGTGGT
Line: CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT
Line: TCTTCTCCAAGACGCATCCACGTGAACCGTTGTAACTATGTTCTGTGC
Line: CCACACCAAAAAAACTTTCCACGTGAACCGAAAACGAAAGTCTTTGGTTTTAATCAATAA
Line: GTGCTCTCTTCTCGGAGAGAGAAGGTGGGCTGCTTGTCTGCCGATGTACTTTATTAAATCCAATAA
Line: CCACACCAAAAAAACTTTCCACGTGTGAACTATACTCCAAAAACGAAGTATTGGTTTATCATAA
Line: TCTGAAAAGTGCAAAGAACGATGATGATGATGATAGAGGAACCTGAGCAGCCATGTCTGAACCTATAGC
Line: GTATTGGTCGTCGTGCGACTAAATTAGGTAAAAAAGTAGTTCTAAGAGATTTTGATGATTCAATGCAAAGTTCTATTAATCGTTCAATTG


Note what happens in line 3:

```python
line = line.rstrip()
```

Here we have the string variable `line`;  we call its `rstrip()` method which produces a *copy of the string* with the whitespace stripped on the right-hand side, and then we assign the variable name `line` again to this copy.

Since `line` has now been assigned to the copy, the original value of `line` is *unreferenced* and will be garbage collected (deleted from memory).

And yes, it's quite possible to change the assignment of `line` ourselves even though it gets assigned a new value in each iteration of the `for` loop.  It'll preserve the value we've assigned it for the current iteration, and then simply get assigned a new value by the `for` loop machinery the next time the loop starts.
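
One consequence is worth seeing in a small sketch: reassigning the loop variable only rebinds the *name*; it does not alter the thing we are iterating over. The list `lines` below is a made-up stand-in for the lines of a file:

```python
lines = ["ACTG\n", "GGCC\n"]
for line in lines:
    line = line.rstrip()   # rebinds the name `line` to a stripped copy
    print(line)

print(lines)   # the list still holds the original, unstripped strings
```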

Of course, just printing out the lines of the file isn't all that useful.  Why don't we try to do some actual processing using the sequences in a file?  Let's look at a slightly longer file called `10_sequences.txt`:

In [None]:
# %load 10_sequences.txt
CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA
ATTTTTAACTTTTCTCTGTCGTCGCACAATCGACTTTCTCTGTTTTCTTGGGTTTACCGGAA
TTGTTTCTGCTGCGATGAGGTATTGCTCGTCAGCCTGAGGCTGAAAATAAAATCCGTGGT
CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT
TCTTCTCCAAGACGCATCCACGTGAACCGTTGTAACTATGTTCTGTGC
CCACACCAAAAAAACTTTCCACGTGAACCGAAAACGAAAGTCTTTGGTTTTAATCAATAA
GTGCTCTCTTCTCGGAGAGAGAAGGTGGGCTGCTTGTCTGCCGATGTACTTTATTAAATCCAATAA
CCACACCAAAAAAACTTTCCACGTGTGAACTATACTCCAAAAACGAAGTATTGGTTTATCATAA
TCTGAAAAGTGCAAAGAACGATGATGATGATGATAGAGGAACCTGAGCAGCCATGTCTGAACCTATAGC
GTATTGGTCGTCGTGCGACTAAATTAGGTAAAAAAGTAGTTCTAAGAGATTTTGATGATTCAATGCAAAGTTCTATTAATCGTTCAATTG


As you can see, `10_sequences.txt` contains 10 lines of variable-length sequence data.  Let's calculate the GC content for each sequence in turn:

In [25]:
for line in open("10_sequences.txt", 'r'):
    line = line.rstrip()
    line = line.upper()
    num_c = line.count('C')
    num_g = line.count('G')
    gc_content = (num_c + num_g) / len(line) * 100
    print("GC content:", gc_content)

GC content: 36.0655737704918
GC content: 38.70967741935484
GC content: 46.666666666666664
GC content: 41.935483870967744
GC content: 47.91666666666667
GC content: 35.0
GC content: 45.45454545454545
GC content: 34.375
GC content: 43.47826086956522
GC content: 32.22222222222222


Note that it was necessary to `rstrip()` the line in line 2 of the code cell,
since otherwise we would've miscalculated the line's length with `len()` in line
6, leading to an incorrect value for the GC content.

Why convert to uppercase using `upper` in line 3?  Well, *just in case!*
`10_sequences.txt` is small enough that we can scan it by eye and ascertain that
it doesn't contain any lowercase data, but that might not be true of every file
we wish to process!
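
To see why both clean-up steps matter, here is a sketch using a single made-up line (`raw_line` is hypothetical and not taken from `10_sequences.txt`) that has mixed case and a trailing newline:

```python
raw_line = "acgtGC\n"   # hypothetical line: mixed case, trailing newline

print(len(raw_line))          # 7, because the newline is counted too
print(raw_line.count('G'))    # 1, because the lowercase 'g' is missed

clean_line = raw_line.rstrip().upper()
num_gc = clean_line.count('G') + clean_line.count('C')
gc_content = num_gc / len(clean_line) * 100
print(clean_line, gc_content)
```

Without `rstrip()` the denominator would be too large, and without `upper()` the numerator would be too small.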

Also note that this time around, we used the `open()` function directly in the
`for` statement.  In other words, we didn't specifically assign a variable name
to the file object, hence the file object is *unreferenced* and will be garbage
collected when no longer needed for the iteration.  And as we've said, Python is
smart enough to close the underlying file when a file object is destroyed, so we don't have to call its `close()` method explicitly in this case.

>When iterating over a file using a `for` loop, it's **not a good idea** also to
read data from the file using the `read()` method of the file object.  Python
keeps track of "where we are in a file" using a so-called *file pointer*.  If
you mix the methods of reading data from a file, you'll probably get unexpected
results.
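
The sketch below illustrates the file pointer at work. It writes a small throwaway file first, so it does not depend on any of the course files; the `tempfile` and `os` modules, the file name `demo.txt` and its contents are assumptions made purely for this demonstration:

```python
import os
import tempfile

# write a small throwaway file (made-up contents) to experiment with
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
demo_file = open(path, 'w')
demo_file.write("AAA\nCCC\nGGG\n")
demo_file.close()

demo_file = open(path, 'r')
lines_seen = 0
for line in demo_file:
    lines_seen = lines_seen + 1

# the loop has moved the file pointer to the end of the file,
# so there is nothing left for read() to return
leftover = demo_file.read()
demo_file.close()

print(lines_seen)
print(repr(leftover))
```

After the loop has consumed all three lines, `read()` returns an empty string rather than the contents of the file.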

A final and very important point which I wish to underline:

>**Iterating over a text file is a smart thing to do, because only one line is
read into memory at a time!**

Certainly, you can read the entire contents of a text file into memory using the
`read()` or `readlines()` method of the file object (as a single string or a list of strings, respectively).  But that means *reading the entire file into memory*.  Which is OK for a small file like `10_sequences.txt`, but not for a large file that may be many gigabytes in
size.

However, *iterating* over that file means that only one line of the file is read
into memory (and processed) at any one time.

**Iterating over a file like this is an extremely common idiom**, so let's show it once more:

```python
for line in open(<filename>, 'r'):
    line = line.rstrip()
    # do something with 'line'
```

## More on lists and loops

There's a lot more to be said about both lists and loops, so here's a grab-bag of interesting things:

### Adding elements to a list

Having pointed out the similarity of lists and strings, let's now look at a
couple of methods of lists that are *not* shared with strings.

To add another element onto the end of an existing list, we can use the `append()`
method:

In [28]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
apes.append("Pan paniscus")
print(apes)

['Homo sapiens', 'Pan troglodytes', 'Gorilla gorilla', 'Pan paniscus']


`append()` is an interesting method because *it actually changes the variable on
which it’s used* – in the above example, the `apes` list goes from having three
elements to having four. We say the list has been changed *in place*.

In [29]:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
print("There are", len(apes), "apes")
apes.append("Pan paniscus")
print("Now there are", len(apes), "apes")

There are 3 apes
Now there are 4 apes


The output shows that the number of elements in `apes` really has changed.

If we want to add elements from a list onto the end of another (existing) list, we can use the `extend()` method. `extend()` behaves like `append()` but takes a list as its argument rather than a single element.

In [30]:
conserved_sites = [24, 56, 132, 289, 427, 558]
conserved_sites.extend([638, 915])
print(conserved_sites)

[24, 56, 132, 289, 427, 558, 638, 915]


Note that — just like `append()` — `extend()` changes a list in place.

### Nested lists?

What would have happened in the last example if we had accidentally used
`append()` instead of `extend()` in the previous code block?  Can you predict it?

In [31]:
conserved_sites = [24, 56, 132, 289, 427, 558]
conserved_sites.append([638, 915])
print(conserved_sites)

[24, 56, 132, 289, 427, 558, [638, 915]]


What happened there?

Well, the `append()` method of a list object takes one argument, and appends it to
the list.  In this case, the single argument it received was *another* list,
which it duly added to itself.  So clearly, lists can store other lists!
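
A side-by-side sketch makes the difference easy to see. The two short lists `sites_a` and `sites_b` are made up for this comparison:

```python
sites_a = [24, 56]
sites_b = [24, 56]

sites_a.extend([638, 915])   # adds two elements
sites_b.append([638, 915])   # adds one element, which is itself a list

print(len(sites_a), sites_a)
print(len(sites_b), sites_b)
print(sites_b[-1])           # the last element of sites_b is the nested list
```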

One use for nested lists is representing a 2-dimensional grid of values (a matrix), like this:

In [32]:
vector = [[2, 8, 4],
          [3, 0, 1],
          [6, 9, 0]]
print(vector)

[[2, 8, 4], [3, 0, 1], [6, 9, 0]]


To extract the second element of our list (vector), we use normal subscript
notation.  (Remember, zero-based!)

In [33]:
vector[1]

[3, 0, 1]

Can you imagine how we'd extract the 3rd element of that 2nd element?  I think
you could…

In [34]:
vector[1][2]

1
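
Since the outer list is just a list, we can also iterate over it with a `for` loop. In this sketch (the loop variable name `row` is our own choice), each element we get is itself a list that can be indexed in the usual way:

```python
vector = [[2, 8, 4],
          [3, 0, 1],
          [6, 9, 0]]

for row in vector:
    # each `row` is itself a list, so it can be indexed like any other list
    print("row:", row, "first:", row[0], "last:", row[-1])
```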

### Changing the order of a list

Here are two more list methods that change the list they’re called on in place: `reverse()`
and `sort()`. Both of these work by changing the order of the elements in the list:

In [35]:
ranks = ["kingdom", "phylum", "class", "order", "family"]
print("at the start:   ", ranks)
ranks.reverse()
print("after reversing:", ranks)
ranks.sort()
print("after sorting:  ", ranks)

at the start:    ['kingdom', 'phylum', 'class', 'order', 'family']
after reversing: ['family', 'order', 'class', 'phylum', 'kingdom']
after sorting:   ['class', 'family', 'kingdom', 'order', 'phylum']


If we take a look at the output, we can see how the order of the elements in the
list is changed by these two methods.

By default, Python sorts numbers in ascending numerical order and strings in
alphabetical order (strictly speaking, by character code, so uppercase letters
sort before lowercase ones).
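
Here is a short sketch of the default ordering for numbers and for strings. The list contents are made up; note where the capitalised names end up:

```python
sites = [558, 24, 132, 56]
sites.sort()
print(sites)

names = ["pan", "Homo", "gorilla", "Pongo"]
names.sort()
print(names)   # names starting with a capital letter come first
```

If a list mixes upper- and lowercase strings, the sorted order may therefore not be what a dictionary would give you.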

### Changing elements of a list

We've seen that we can extract an element from a list using subscript notation:

In [36]:
ranks = ["kingdom", "phylum", "class", "orrder", "family"]
print(ranks[3])

orrder


Oops.  We misspelled `order`.  Fortunately we can also *assign to an indexed
location*:

In [37]:
ranks[3] = "order"
print(ranks)

['kingdom', 'phylum', 'class', 'order', 'family']


Note that assigning to an element again changes the list *in place*.

If we can assign to a single index, what are the chances we can also assign to a
range of locations using slice notation?  Let's try…

In [38]:
nucs = ['a', 'g', 't', 'a', 'a', 'c', 'c', 't']
nucs[3:6] = ['n', 'n', 'n']
print(nucs)

['a', 'g', 't', 'n', 'n', 'n', 'c', 't']


Well, that worked!  Note that the list you assign to the sliced location need
not be the same length as the slice:

In [39]:
nucs = ['a', 'g', 't', 'a', 'a', 'c', 'c', 't']

# create a replacement list consisting of 8 'n's
replacement = ['n'] * 8
print(replacement)

# do the replacement by assigning to a slice
nucs[3:6] = replacement
print(nucs)

['n', 'n', 'n', 'n', 'n', 'n', 'n', 'n']
['a', 'g', 't', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'c', 't']


Hmm, so if we can change a list in place by assigning to elements or ranges, can
we do the same to strings?

In [None]:
my_dna = "ACTTACG"
my_dna[2] = 'C'

Nope!  Python raises a `TypeError`, and the error message is quite explicit:
`'str' object does not support item assignment`.

In Python, we say that:

* lists are *mutable* (can be changed *in place*), but
* strings are *immutable* (cannot be changed *in place*)

From now on, when we talk about new types (both simple and compound), we'll often mention whether they're mutable or immutable, as this is a basic characteristic.
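
So how do we "change" a string? One common approach, sketched here, is to build a *new* string from slices of the old one and assign it back to the same variable name:

```python
my_dna = "ACTTACG"

# build a new string from slices instead of assigning to my_dna[2]
my_dna = my_dna[:2] + 'C' + my_dna[3:]
print(my_dna)
```

The original string object is never modified; the name `my_dna` simply refers to a different string afterwards.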

### Splitting strings and joining lists

There are plenty of functions and methods that produce lists as their output.  For instance, we've seen how we can produce a list of strings from a text file by using the `readlines()` method of a file object.

Another such method that is particularly interesting to biologists is the `split()` method of string objects. `split()` takes a single argument, called the *delimiter*,  splits the original string wherever it sees the delimiter, and returns a list. Here’s an example:

In [40]:
names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")
print(species)

['melanogaster', 'simulans', 'yakuba', 'ananassae']


We can see from the output that the string has been split wherever there was a
comma leaving us with a list of strings.

What if we want to do the opposite thing, and join a list of strings into a
single string, using a delimiter to separate them?

The syntax for this seems strange to newcomers:  The string type has a `join()` method,
which takes as argument a list of strings.  It returns a single string which is that list of strings joined up, using the original string as delimiter.  It's easier to demonstrate by example:

In [41]:
species = ['melanogaster', 'simulans', 'yakuba', 'ananassae']
names = ','.join(species)
print(names)

melanogaster,simulans,yakuba,ananassae


In [42]:
nucs = ['a', 'g', 't', 't', 'c', 't']
print("---".join(nucs))

a---g---t---t---c---t


In [43]:
words = ["This", "is", "a", "sentence."]
print(' '.join(words))
print(''.join(words))

This is a sentence.
Thisisasentence.


Note that in the last case, the strings in the list were joined into a new
string, delimited by … the empty string.
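
A typical use of `split()` and `join()` together is taking apart one line of a delimited text file and putting it back together with a different delimiter. The record below is made up for illustration:

```python
record = "seq1\tchrA\t1200"   # hypothetical tab-separated line
fields = record.split("\t")
print(fields)

# split() always produces strings, so numbers need converting
length = int(fields[2])
print(length + 1)

# re-join the same fields with a comma instead of a tab
print(",".join(fields))
```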

### Looping with `range()`

Sometimes we want to loop over a list of numbers. Imagine we have a protein
sequence:

In [45]:
protein = "vlspadktnv"

…and we want to print out the first three residues, then the first four
residues, etc.:

    vls
    vlsp
    vlspa
    vlspad
    …etc…

One way to tackle the problem would be to use a loop – we could extract a
substring from the protein sequence and print it in the body of the loop, and
the only thing that would need to change is the stop position in the substring.
But what are we going to iterate over? We can’t just iterate over the protein
string, because that will give us individual residues, which is not what we
want. We can manually assemble a list of stop positions, and loop over that:

In [46]:
stop_positions = [3, 4, 5, 6, 7, 8, 9, 10]
for stop in stop_positions:
    substring = protein[:stop]
    print(substring)

vls
vlsp
vlspa
vlspad
vlspadk
vlspadkt
vlspadktn
vlspadktnv


…but this seems cumbersome, and only works if we know the length of the protein
sequence in advance.

A better solution is to use the `range()` function. `range()` is a built-in Python function that returns a special `range` object, which generates numbers when we iterate over it.  If you evaluate a call to `range()`, you'll see a textual representation of the `range` object:

In [47]:
my_range = range(5)
my_range

range(0, 5)

>In Python 2, `range()` produced an actual list of numbers.

In [48]:
for n in my_range:
    print(n)

0
1
2
3
4


As you can see, the `range` object produced by `range(5)` yields the numbers from
0 to *just before* 5, when iterated over.

The behaviour of the `range()` function depends on how many arguments we give it.
Let's look at a few examples:

With a single argument, range will count up from zero to that number, excluding
the number itself:

In [49]:
for number in range(6):
    print(number)

0
1
2
3
4
5


With two numbers, `range` will count up from the first number (inclusive) to the
second (exclusive):

In [50]:
for number in range(3, 8):
    print(number)

3
4
5
6
7


With three numbers, `range` will count up from the first to the second with the
step size given by the third:

In [51]:
for number in range(2, 14, 4):
    print(number)

2
6
10
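
Coming back to the protein example from the start of this section, a two-argument `range()` lets us generate the stop positions instead of typing them out. This is one possible sketch; it works for a protein of any length because the upper limit is computed with `len()`:

```python
protein = "vlspadktnv"

# stop positions run from 3 up to and including the length of the protein
for stop in range(3, len(protein) + 1):
    substring = protein[:stop]
    print(substring)
```

We need `len(protein) + 1` as the second argument because `range()` stops just *before* that number.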


### Building a list from a loop

It's quite common to want to iterate over an object, and build up a list from
the results of the iteration.  Let's say we want a list of numbers from 1
through 10.  We could generate the elements of that list with a call to `range()`,
and then loop over the `range` object, building up the list as we go along by
using the `append()` method of the list object:

In [52]:
# create an empty list
numbers = []

for num in range(1, 11):
    numbers.append(num)
print(numbers)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


However, there's a quicker way to do this.  Python has a built-in function
`list()` that can take as argument any object you can iterate over, and return the
result of the iteration as a list.  (The `list` function *implicitly* iterates
over the object.)

In [53]:
numbers = list(range(1, 11))
print(numbers)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


This works for any object you can iterate over.  Like a string:

In [54]:
nucs = list("ACTG")
print(nucs)

['A', 'C', 'T', 'G']


…or even a file:

In [56]:
lines = list(open("dna.txt", 'r'))
print(lines)

['ACTGTACGTGCACTGATC\n', 'CTGGCATAGTCTTATTTT\n', 'CAGGGCGGCGGATCTCTT']


Using `list()` with a file as argument has much the same effect as calling the `readlines()` method of that same file.  And the same caveat holds:  This effectively reads the entire file into the computer's memory, no matter how large it may be!

It remains smarter to *iterate* over a file object using a simple `for` loop, as stated earlier!
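
The explicit loop with `append()` is still the tool to reach for when each element needs some processing before it goes into the new list. As a sketch, this collects the lengths of a few sequences; the list `sequences` is a made-up stand-in for the lines of a file:

```python
sequences = ["ACTGTACGTGCACTGATC\n", "CTGGCATAGT\n", "CAGGG"]   # made-up lines

lengths = []
for sequence in sequences:
    sequence = sequence.rstrip()      # drop the trailing newline, if any
    lengths.append(len(sequence))

print(lengths)
```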

---

## Exercises

>Note: all the files mentioned in these exercises can be found in the `files` subdirectory.  We have already changed into that subdirectory using the Jupyter magic command `%cd` earlier in this Notebook.

### 1. Processing DNA in a file

The file `input.txt` contains a number of DNA sequences, one per line. Each
sequence starts with the same 14 base pair fragment – a sequencing adapter that
should have been removed. Write a program that will:

1. trim this adapter and write the cleaned sequences to a new file and
2. print the length of each sequence to the screen.

In [None]:
input_file = open("input.txt", 'r')
trimmed_file = open("trim.txt", 'w')
for line in input_file:
    # drop the trailing newline, then skip the 14-base adapter (indices 0-13)
    trimmed = line.rstrip()[14:]
    print(trimmed, file=trimmed_file)   # part 1: write the cleaned sequence
    print(len(trimmed))                 # part 2: print its length to the screen
input_file.close()
trimmed_file.close()


In [None]:
outfile = open("stripped_input.txt", 'w')

for line in open("input.txt", 'r'):
    line = line.rstrip()
    # the adapter occupies indices 0-13, so the sequence proper starts at index 14
    print(line[14:], file=outfile)

outfile.close()

In [None]:
!cat stripped_input.txt

### 2. Multiple exons from genomic DNA

Use the file of cleaned sequences from Question 1, part (1) as input.  It now contains a genomic DNA sequence per line, with the adapters removed.

The file `exons.txt` contains a list of start/stop positions of exons. Each exon
is on a separate line and the start and stop positions are separated by a comma.

Write a program that will extract the exon segments from the genomic DNA,
concatenate them, and write the concatenated sequence to a new file.

In [None]:
datafile = open("stripped_input.txt", 'r')

exons = []

for line in open("exons.txt", 'r'):
    line = line.rstrip()
    elements = line.split(',')
    start = int(elements[0])
    end = int(elements[1])
    sequence = datafile.readline().rstrip()
    exons.append(sequence[start:end+1])

datafile.close()

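# open the output file under a name so that we can close it explicitly
exons_file = open("exons_out.txt", 'w')
print(''.join(exons), file=exons_file)
exons_file.close()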