<h1 id="toctitle">Lists and loops</h1>
<ul id="toc"/>

Storing multiple values is currently awkward, as we saw in the exercise from the previous session:

In [None]:
# three string variables
header1 = "ABC123"
header2 = "DEF456"
header3 = "HIJ789"

## Lists
### Defining a list and getting elements
Python's solution is to have lists of data, which can store multiple values. Lists start and end with square brackets and the __elements__ are separated by commas:

In [1]:
headers = ["ABC123","DEF456","HIJ789"]  # one list of items instead of three separate variables (above)
headers

['ABC123', 'DEF456', 'HIJ789']

Once we have stored some values in a list, we can get a single element by giving its __index__:

In [2]:
headers[1]
# headers[1:2] picks out the same element, but returns it as a one-item list (['DEF456']) instead of a single value

'DEF456'

Index ranges work just like they did for strings:

In [3]:
headers[1:3]  # the second and third headers: counting starts from 0, so the first one is not included

['DEF456', 'HIJ789']

Start counting from zero, inclusive at the start, exclusive at the end. 
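
The counting rule above can be sketched with a few lookups on the same headers. This is only an illustration: the variable names are made up for the example, and negative indices are standard Python behaviour not shown elsewhere in this section.

```python
# Example list reusing the header values from above
headers = ["ABC123", "DEF456", "HIJ789"]

first = headers[0]            # indexing starts at zero
last = headers[-1]            # negative indices count back from the end
first_two = headers[0:2]      # elements 0 and 1; the stop index 2 is excluded
one_item_list = headers[1:2]  # a slice is always a list, even with one element

print(first)          # ABC123
print(last)           # HIJ789
print(first_two)      # ['ABC123', 'DEF456']
print(one_item_list)  # ['DEF456']
```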

### Building up a list

Often we want to start with an empty list and build it up:

In [None]:
headers = [] #create a list

The `append()` method adds a value on to the end of the list:

In [5]:
headers = []
print(headers)
headers.append("ABC123")  # adds this value to the empty list
print(headers)
headers.append("XYZ123")  # adds one more value to the end of the list
print(headers)

# calling append() on a list that isn't stored in a variable, e.g. ['WWWe','OOOhwee'].append('XXX'),
# runs without error but is pointless: the changed list is not saved anywhere

[]
['ABC123']
['ABC123', 'XYZ123']


We can also concatenate two lists:

In [6]:
combined = headers + ["DEF456","HIJ789"]  # adds two lists together; this also works with two list variables
combined

['ABC123', 'XYZ123', 'DEF456', 'HIJ789']

In [None]:
# .pop() removes the last item from the end of the list and returns it

Note that concatenation doesn't change the original list like `append()` does:

In [10]:
headers

['ABC123', 'XYZ123']

If we try to concatenate a string value and list, we will get an error:

In [8]:
headers + "DEF456"

TypeError: can only concatenate list (not "str") to list

But we could always put the string value into a single-item list:

In [9]:
headers + ["DEF456"]  # combine the list with a single-item list
# append() changes the original list, whereas concatenation creates a new one

['ABC123', 'XYZ123', 'DEF456']
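
To put the three operations side by side, here is a minimal sketch of how concatenation, `append()` and `pop()` treat the original list. The names `longer`, `result` and `removed` are made up for the example.

```python
headers = ["ABC123", "XYZ123"]

# Concatenation builds a new list and leaves the original alone
longer = headers + ["DEF456"]
print(headers)  # ['ABC123', 'XYZ123']
print(longer)   # ['ABC123', 'XYZ123', 'DEF456']

# append() changes the list in place and returns None,
# so its result should not be assigned back to the variable
result = headers.append("DEF456")
print(result)   # None
print(headers)  # ['ABC123', 'XYZ123', 'DEF456']

# pop() removes the last element in place and returns it
removed = headers.pop()
print(removed)  # DEF456
print(headers)  # ['ABC123', 'XYZ123']
```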

### Creating a list by splitting a string

The `split()` method of strings returns a list:

In [11]:
sentence = "one two three four"
print(sentence.split())  # with no argument, split() splits on whitespace and returns a list

commastring = "Kingdom,Phylum,Class,Order,Family,Genus,Species"
print(commastring.split(","))  # turn the string into a list, splitting at each comma

['one', 'two', 'three', 'four']
['Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']


This doesn't change the original string:

In [12]:
sentence

'one two three four'

Another way to create a list is by using the `range()` function, which generates sequences of numbers. In Python 3, `range()` has a special behaviour that we call __lazy__: it only generates the numbers as we use them. This means that if we want to see the numbers printed, we have to call `list()` on the result of `range()` (as below). When we want to use a range in a loop (coming up soon!), we don't have to do this though.

With one argument, `range()` counts from zero to just before the number:

In [13]:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [20]:
range(0,10)  # without list() we just get a range object; the numbers are only generated when we actually use them

range(0, 10)

With two arguments, it counts from the first number to just before the second:

In [14]:
list(range(5,12))

[5, 6, 7, 8, 9, 10, 11]

With three arguments, it uses the third argument as the step size:

In [19]:
list(range(3, 24, 4))  # from 3 up to (but not including) 24, in steps of 4

[3, 7, 11, 15, 19, 23]
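
As a small sketch of the laziness described above: a range object can report its length and individual elements without being converted to a list first. The names `steps` and `countdown` are made up for the example, and the negative step is standard `range()` behaviour not shown elsewhere in this section.

```python
steps = range(3, 24, 4)

print(steps)        # range(3, 24, 4)
print(len(steps))   # 6
print(steps[1])     # 7
print(list(steps))  # [3, 7, 11, 15, 19, 23]

# A negative step counts downwards; the stop value is still excluded
countdown = list(range(10, 0, -2))
print(countdown)    # [10, 8, 6, 4, 2]
```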

## Tools for manipulating lists

### Sorting and reversing a list

Starting with our headers:

In [21]:
headers = ["DEF456","ABC123","HIJ789"]

Calling the `sort()` method changes the order to alphabetical:

In [22]:
headers.sort()
headers

['ABC123', 'DEF456', 'HIJ789']

or numerical, if the items are numbers:

In [23]:
numbers = [3,18,4,16,9,103,15,72,4,12,1,18,5,13]
numbers.sort()
numbers

[1, 3, 4, 4, 5, 9, 12, 13, 15, 16, 18, 18, 72, 103]

But be careful if your list contains strings that look like numbers. Because the elements are strings, the sort order will still be alphabetical:

In [24]:
not_really_numbers = '3 18 4 16 9 103 15 72 4 12 1 18 5 13'.split()
not_really_numbers.sort()
not_really_numbers

['1', '103', '12', '13', '15', '16', '18', '18', '3', '4', '4', '5', '72', '9']
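
One possible way around this, sketched here with a shortened version of the list above, is the optional `key` argument of `sort()`. Passing `key=int` asks Python to compare the elements by their integer value while leaving them stored as strings.

```python
not_really_numbers = "3 18 4 16 9 103".split()

# key=int sorts by numeric value; the elements themselves stay strings
not_really_numbers.sort(key=int)
print(not_really_numbers)  # ['3', '4', '9', '16', '18', '103']
```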

We can reverse the order of a list as well:

In [25]:
headers = ["DEF456","ABC123","HIJ789"]
headers.reverse()
headers

['HIJ789', 'ABC123', 'DEF456']

Note that, like `append()`, these methods both change the list they're run on.
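
If the original order needs to be kept, the built-in functions `sorted()` and `reversed()` are one alternative. This sketch uses made-up variable names and shows that they produce new results without touching the original list.

```python
headers = ["DEF456", "ABC123", "HIJ789"]

# sorted() returns a new sorted list
in_order = sorted(headers)

# reversed() returns a lazy object, so list() is needed to see the values
backwards = list(reversed(headers))

print(in_order)   # ['ABC123', 'DEF456', 'HIJ789']
print(backwards)  # ['HIJ789', 'ABC123', 'DEF456']
print(headers)    # ['DEF456', 'ABC123', 'HIJ789']  (unchanged)
```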

## Loops

We know how to do something for one element of a list:

In [26]:
headers = ["DEF456","ABC123","HIJ789"]
one_header = headers[1]
print(one_header)

ABC123


How do we do something for every element of a list?

In [27]:
for header in headers:
    print("one header is " + header)  # header is the loop variable, like i in R's for (i in ...)

one header is DEF456
one header is ABC123
one header is HIJ789


 - `header` is a new variable that you can read inside the loop
 - The `for` line ends with a colon
 - The body of the loop is indented (in Jupyter you may use Tab or Space to do this)
 - The value of `header` is set to each element in turn

We can do all the stuff that we already know how to do inside a loop.

In [28]:
seqs = ['actgtacgact', 'catgc', 'atcgatatagctag']
for seq in seqs:
    print("one sequence is " + seq)
    print("it starts with " + seq[0])
    print("its length is " + str(len(seq)))
    print("it contains " + str(seq.count('g')) + ' g bases\n')  # every indented line belongs to the body of the loop

one sequence is actgtacgact
it starts with a
its length is 11
it contains 2 g bases

one sequence is catgc
it starts with c
its length is 5
it contains 1 g bases

one sequence is atcgatatagctag
it starts with a
its length is 14
it contains 3 g bases



### Loops and strings

We can write a loop that uses a string as if it were a list - each character will behave like an individual element:

In [29]:
dna = "agcacgacgtagtgatcggcta"

# strings can pretend to be lists
# each element is a character
for base in dna:
    print(base)

a
g
c
a
c
g
a
c
g
t
a
g
t
g
a
t
c
g
g
c
t
a


It's easy to do this by accident so if you see individual characters, check you're looping over the right value. 
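
A minimal sketch of how this accident typically happens, using made-up variable names: the loop is written over a single string where the whole list was intended.

```python
headers = ["ABC123", "DEF456"]
one_header = headers[0]

# Intended: loop over the list, one whole header per iteration
items_from_list = []
for item in headers:
    items_from_list.append(item)

# Accident: looping over a single string gives one character per iteration
items_from_string = []
for item in one_header:
    items_from_string.append(item)

print(items_from_list)    # ['ABC123', 'DEF456']
print(items_from_string)  # ['A', 'B', 'C', '1', '2', '3']
```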

### Loops and files

We can treat a file object as if it were a list, where each line is a list element:

In [30]:
my_file = open("test.txt")

# files can pretend to be lists
# each element is a line
for line in my_file:
    # remember the lines still end in "\n"
    line_length = len(line.rstrip("\n"))
    print(line_length)

5
6
6


Remember that files are **exhaustible**, so we can't both read the contents of a file and then loop over it:

In [31]:
my_file = open("test.txt")
my_file_contents = my_file.read()
# now my_file is exhausted

# this loop never runs because we are already at the end of the file
for line in my_file:
    line_length = len(line.rstrip("\n"))
    print(line_length)
    
# an alternative is to loop over the contents we have already read:
# my_file_contents = my_file.read()
# for line in my_file_contents.splitlines():

and we can't loop over a file twice:

In [32]:
my_file = open("test.txt")

for line in my_file:
    print(line)

# this loop never runs because we are already at the end of the file
for line in my_file:
    line_length = len(line.rstrip("\n"))
    print(line_length)
    # this loop doesn't run; the file would need to be re-opened first
    


apple

banana

orange
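
One way to work around this, sketched below, is to read all the lines into a list once with `readlines()`, since a list can be looped over as many times as needed. To keep the example self-contained it writes its own throwaway file; the temporary path and the fruit contents are assumptions chosen to match the outputs shown above.

```python
import os
import tempfile

# Create a small throwaway file so the example runs on its own
path = os.path.join(tempfile.mkdtemp(), "test.txt")
out_file = open(path, "w")
out_file.write("apple\nbanana\norange\n")
out_file.close()

my_file = open(path)
lines = my_file.readlines()  # read every line once, into a list
my_file.close()

# First pass over the list
for line in lines:
    print(line.rstrip("\n"))

# Second pass works too, because a list is not exhausted by looping
lengths = []
for line in lines:
    lengths.append(len(line.rstrip("\n")))
print(lengths)  # [5, 6, 6]
```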



### The enumerate function

Sometimes, when looping through a list, you really need both the item *and* the index in the list. From what we already know, there's an obvious way to do this:

In [33]:
seqs = ['actgtacgact', 'catgc', 'atcgatatagctag']
for idx in range(len(seqs)):  # loops over the indices 0, 1 and 2, then uses each index to get the item
  seq = seqs[idx]
  print("Sequence " + str(idx) + " is " + seq)

Sequence 0 is actgtacgact
Sequence 1 is catgc
Sequence 2 is atcgatatagctag


There's nothing wrong with this code, but if `seqs` were a file object this wouldn't work (you can't call `len()` on a file object). Python has an alternative, which is more succinct and also works when looping over files.

In [34]:
seqs = ['actgtacgact', 'catgc', 'atcgatatagctag']
for idx, seq in enumerate(seqs):  # enumerate() gives (index, item) pairs, so we need two variable names; idx is the index
  print("Sequence " + str(idx) + " is " + seq)

Sequence 0 is actgtacgact
Sequence 1 is catgc
Sequence 2 is atcgatatagctag


Note that you now need two variable names in the `for` line, because the `enumerate()` function returns pairs of results.
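
As a sketch of the file case mentioned above, the example below uses `io.StringIO` as a stand-in for an open file, so that it runs without a file on disk. It also uses the optional `start` argument of `enumerate()` to number from 1 instead of 0. The variable names are made up for the example.

```python
import io

# io.StringIO behaves like an open text file for the purposes of looping
fake_file = io.StringIO("actgtacgact\ncatgc\natcgatatagctag\n")

labels = []
for line_number, line in enumerate(fake_file, start=1):
    seq = line.rstrip("\n")
    labels.append("Sequence " + str(line_number) + " is " + seq)

for label in labels:
    print(label)
# Sequence 1 is actgtacgact
# Sequence 2 is catgc
# Sequence 3 is atcgatatagctag
```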

## Exercises

### Processing DNA in a file

The file _input.txt_ contains a number of DNA sequences, one per line. Each sequence starts with the same 14 base pair fragment – a sequencing adapter that needs to be removed. 

Write a program that will (a) trim this adapter and write the cleaned sequences to a new file and (b) print the length of each sequence to the screen.

__Side note: if you open this file in Windows Notepad, it will look like it's all one long line. That's because Notepad is a very old program that can't properly read files created on other operating systems, e.g. Linux which I used to prepare the exercise files. Any other text editor, or Jupyter, or Python itself will be able to read the file correctly.__

In [56]:
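dna_seq = open("input.txt")  # open the input for reading: don't use 'w' here, it would erase the file!
output = open('output.txt', 'w')

for line in dna_seq:
    trimmed = line.rstrip('\n')[14:]  # drop the newline, then the 14-base adapter
    print(trimmed, file=output)  # file= sends the cleaned sequence to the output file
    print("The length is: " + str(len(trimmed)))  # no file=, so the trimmed length goes to the screen

dna_seq.close()
output.close()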

### Multiple exons from genomic DNA

The file *genomic_dna2.txt* contains a section of genomic DNA, and the file _exons.txt_ contains a list of start/stop positions of exons. Each exon is on a separate line and the start and stop positions are separated by a comma. The start and stop positions follow Python conventions: i.e. they start counting from zero and are inclusive at the start and exclusive at the end. 

Hint: open these two files up in a text editor before you start coding so you can make sure you understand their format.

Write a program that will process the exons.txt file and print the length of each exon. You'll need to:

 - open the file
 - process it one line at a time using a loop
 - split the line when you see a comma
 - take the first and second parts and assign them to variables
 - make sure they are converted to numbers
 - subtract the start position from the stop position to give the length
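
The steps above can be sketched for a single line. The line `"5,58\n"` is a made-up example in the format described, not a value taken from _exons.txt_.

```python
# A made-up line in the described format; real lines come from exons.txt
line = "5,58\n"

positions = line.rstrip("\n").split(",")  # ['5', '58']
start = int(positions[0])   # split() gives strings, so convert to numbers
stop = int(positions[1])
exon_length = stop - start  # stop is exclusive, so no +1 is needed

print(exon_length)  # 53
```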
 
Next, write a program that will use the start/stop positions to extract the exon segments, concatenate them, and write them to a new file. You'll need to open the genomic DNA file and extract one exon each time round the loop.

Hint: for each file, think about how you want to get the data - do you want to read the whole contents in one go, or do you want to process one line at a time?

In [98]:
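genome_dna_file = open("genomic_dna2.txt")
genomic_dna = genome_dna_file.read().rstrip('\n')  # a file object can't be sliced, so read it into a string
exons_file = open("exons.txt")
outfile = open("outfile_03.txt", 'w')

all_exons = ""
for line in exons_file:
    positions = line.rstrip('\n').split(",")
    start = int(positions[0])
    end = int(positions[1])
    length = end - start
    print("Exon is " + str(length) + "bp long")
    all_exons = all_exons + genomic_dna[start:end]  # extract one exon each time round the loop

outfile.write(all_exons)
genome_dna_file.close()
exons_file.close()
outfile.close()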



Exon is 53bp long
Exon is 61bp long
Exon is 86bp long
Exon is 58bp long


In [93]:

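# genome_dna_file is a file object, not a string, so it can't be sliced;
# its contents need to be read into a string with read() first
all_exons=all_exons+genome_dna_file[start:end]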

TypeError: '_io.TextIOWrapper' object is not subscriptable

### Bonus exercise: sliding windows

Write a program that will print a list of overlapping short segments from a long string (i.e. a sliding window approach). 

E.g. with input `abcdefg` and window size 4:

`abcd, bcde, cdef, defg`

Modify your program to print the AT content of each sliding window rather than the sequence. You can expand your list of segments to include the partial ones at the end.
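
One possible approach to the first part, sketched with the example input from above, is to loop over the start positions with `range()` and slice out one window each time. The variable names are made up, and `", ".join()` is used only to print the windows on one line.

```python
sequence = "abcdefg"
window_size = 4

windows = []
# The last full window starts at len(sequence) - window_size
for start in range(len(sequence) - window_size + 1):
    windows.append(sequence[start:start + window_size])

print(", ".join(windows))  # abcd, bcde, cdef, defg
```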

In [None]:
# notes:
# - break stops a loop early
# - to stop 3 characters before the end, loop over seq_test[:-3], which gives
#   every letter up to, but not including, the third letter from the end:
# for idx, letter in enumerate(seq_test[:-3]):

### Bonus exercise: alignment columns

Given a list of aligned DNA sequences:

```
['atgctcgatcgctag',
 'aag-tcgctcgct--',
 'atcctc--tcgcggg']
```
write a program that will print out each column of the alignment one at a time:

```
column 1: aaa
column 2: tat
column 3: ggc
column 4: c-c
...
```
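
A minimal sketch of one way to do this, assuming all the aligned sequences have the same length: loop over the positions with `range()`, and for each position collect one character from every sequence. The variable names are made up for the example.

```python
aligned = ['atgctcgatcgctag',
           'aag-tcgctcgct--',
           'atcctc--tcgcggg']

columns = []
# All aligned sequences have the same length, so the first one sets the range
for position in range(len(aligned[0])):
    column = ""
    for seq in aligned:
        column = column + seq[position]
    columns.append(column)

# Number the columns from 1 when printing
for number, column in enumerate(columns, start=1):
    print("column " + str(number) + ": " + column)
```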
 
   
   
### Bonus exercise: restriction fragments

Modify your program from yesterday to calculate the lengths of restriction fragments when the DNA sequence has multiple restriction sites. Make a small test DNA sequence for working on, then try on the sequence in the file *ce1.txt* which contains the sequence for chromosome one of *C. elegans*. Hint: there's an optional second argument to `find()` which tells Python where in the string to start looking. 
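
The second argument to `find()` mentioned in the hint can be sketched as follows. The test sequence and the motif `gaattc` are made up for illustration and are not necessarily the ones used in the earlier exercise.

```python
# A small made-up test sequence containing the motif twice
dna = "aaagaattcccccgaattctt"
site = "gaattc"

first_site = dna.find(site)                   # search from the start of the string
second_site = dna.find(site, first_site + 1)  # resume just after the first hit
third_site = dna.find(site, second_site + 1)  # find() returns -1 when nothing is found

print(first_site)   # 3
print(second_site)  # 13
print(third_site)   # -1
```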

__Side note: to solve this exercise in a satisfying way, you need some way of telling Python when to stop looking for additional restriction sites. We will cover the tools required to do this in the next session: for now, just print the first x positions (where x can be a large number) and don't worry if the final ones get printed multiple times.__


