# An introduction to solving biological problems with Python

## Session 2.2: Loops

- [The <tt>for</tt> loop](#The-for-loop)
- [Exercises 2.2.1](#Exercises-2.2.1)
- [The <tt>while</tt> loop](#The-while-loop)
- [Exercises 2.2.2](#Exercises-2.2.2)
- [Skipping and breaking loops](#Skipping-and-breaking-loops)
- [More looping using `range()` and `enumerate()`](#More-looping)
- [Filtering in loops](#Filtering-in-loops)
- [Exercises 2.2.3](#Exercises-2.2.3)

## Loops

When an operation needs to be repeated multiple times, for example on all of the items in a list, we avoid having to type (or copy and paste) repetitive code by creating a loop. There are two ways of creating loops in Python: the <tt>for</tt> loop and the <tt>while</tt> loop.

## The <tt>for</tt> loop

The <tt>for</tt> loop in Python iterates over each item in a sequence (such as a list or tuple) in the order that they appear in the sequence. What this means is that a variable (<tt>code</tt> in the example below) is set to each item from the sequence in turn, and each time this happens the indented block of code is executed again.

In [1]:
codeList = ['NA06984', 'NA06985', 'NA06986', 'NA06989', 'NA06991']

for code in codeList:
    print(code)

NA06984
NA06985
NA06986
NA06989
NA06991


A <tt>for</tt> loop can iterate over the individual characters in a string:

In [2]:
dnaSequence = 'ATGGTGTTGCC'

for base in dnaSequence:
    print(base)

A
T
G
G
T
G
T
T
G
C
C


And also over the keys of a dictionary: 

In [3]:
rnaMassDict = {"G":345.21, "C":305.18, "A":329.21, "U":302.16}

for x in rnaMassDict:
    print(x, rnaMassDict[x])

G 345.21
C 305.18
A 329.21
U 302.16


Any variables that are defined before the loop can be accessed from inside the loop. For example, to calculate the sum of the items in a list of values, we can set the total to zero initially and add each value to the total inside the loop:

In [4]:
total = 0
values = [1, 2, 4, 8, 16]

for v in values:
    total = total + v
    # equivalent shorthand: total += v
    print(total)

print(total)

1
3
7
15
31
31


Naturally we can combine a <tt>for</tt> loop with an <tt>if</tt> statement, noting that we need two indentation levels, one for the outer loop and another for the conditional blocks:

In [5]:
geneExpression = {
    'Beta-Catenin': 2.5, 
    'Beta-Actin': 1.7, 
    'Pax6': 0, 
    'HoxA2': -3.2
}

for gene in geneExpression: # gene takes each key of the dictionary in turn
    if geneExpression[gene] < 0: # look up the value stored under that key
        print(gene, "is downregulated") # print the key
        
    elif geneExpression[gene] > 0:
        print(gene, "is upregulated")
        
    else:
        print("No change in expression of ", gene)

Beta-Catenin is upregulated
Beta-Actin is upregulated
No change in expression of  Pax6
HoxA2 is downregulated


## Exercises 2.2.1

1. Create a sequence where each element is an individual base of DNA. Make the sequence 15 bases long.
2. Print the length of the sequence.
3. Create a `for` loop to output every base of the sequence on a new line.

In [8]:
dna='ATGCCTGAAGTCCAT'
print(len(dna))

for base in dna:
    print(base)

15
A
T
G
C
C
T
G
A
A
G
T
C
C
A
T


## The <tt>while</tt> loop

In addition to the <tt>for</tt> loop that operates on a collection of items, there is a <tt>while</tt> loop that simply repeats while some statement evaluates to True and stops when it is False. Note that if the tested expression never evaluates to False then you have an “infinite loop”: the program never finishes on its own and has to be interrupted.

In this example we generate a series of numbers by doubling a value after each iteration, until a limit is reached: 

In [9]:
value = 0.25
while value < 8:
    value = value * 2
    print(value)

print("final value:", value)

0.5
1.0
2.0
4.0
8.0
final value: 8.0


One way to think about a <tt>while</tt> loop is that each iteration should move closer to the point where the condition stops being true, so that the loop eventually ends.

What's going on here is that the value is doubled in each iteration, and once it gets to 8 the <tt>while</tt> test fails (8 is not less than 8) and that last value is preserved. Note that if the test were instead `value <= 8` then we would get one more doubling and the value would reach 16.
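As a quick check of that last point, here is the same loop with the test changed to `<=`. It is only a sketch to illustrate the difference; nothing else is altered.

```python
value = 0.25
while value <= 8:   # note <= instead of <
    value = value * 2
    print(value)

print("final value:", value)   # the loop now runs once more, ending at 16.0
```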

## Exercises 2.2.2

1. Reuse the 15-base sequence created in the previous exercise, where each element is an individual base of DNA.
2. Create a <tt>while</tt> loop similar to the one above that starts at the third base in the sequence and outputs every third base up to and including the 12th.

In [34]:
dna='ATGCCTGAAGTCCAT'

pos = 2  # index 2 is the third base (indices start at 0)
while pos <= 11:  # index 11 is the 12th base
    print(pos + 1, dna[pos])  # print the 1-based position and the base
    pos += 3  # move on three bases

3 G
6 T
9 A
12 C


## Skipping and breaking loops

Python has two ways of affecting the flow of the <tt>for</tt> or <tt>while</tt> loop inside the block. The <tt>continue</tt> statement means that the rest of the code in the block is skipped for this particular item in the collection, i.e. jump to the next iteration. In this example negative numbers are left out of a summation:

In [12]:
values = [10, -5, 3, -1, 7]

total = 0
for v in values:
    if v < 0:
        continue # Skip this iteration   
    total += v

print(total) # this will add 10+3+7

20


The other way of affecting a loop is with the <tt>break</tt> statement. In contrast to the <tt>continue</tt> statement, this immediately ends the loop that contains it (only the innermost loop, if loops are nested), and execution is resumed at the next statement _after_ the loop.

In [11]:
geneticCode = {'TAT': 'Tyrosine',  'TAC': 'Tyrosine',
               'CAA': 'Glutamine', 'CAG': 'Glutamine',
               'TAG': 'STOP'}

sequence = ['CAG','TAC','CAA','TAG','TAC','CAG','CAA']

for codon in sequence:
    if geneticCode[codon] == 'STOP':
        break            # Quit looping at this point
    else:
        print(geneticCode[codon])

Glutamine
Tyrosine
Glutamine


## Looping gotchas

When a <tt>for</tt> loop runs over a list, an internal counter keeps track of which item is used next, and this is incremented on each iteration. When the counter reaches the length of the list, the loop terminates. This means that if you delete the current item from the list, the next item will be skipped (since it takes the index of the current item, which has already been processed). Likewise, if you insert an item before the current item, the current item will be processed again the next time through the loop. This can lead to nasty bugs that can be avoided by looping over a temporary copy made with a slice of the whole sequence (`[:]`).

<div class="alert-warning">
**When looping, never modify the collection!** Always create a copy of it first.
</div>
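A minimal sketch of this gotcha, using made-up numbers: we try to remove every negative value from a list, first while looping over the list itself and then while looping over a slice copy. The variable names `risky` and `safe` are only for illustration.

```python
# Risky: removing items from the list we are looping over
risky = [-5, -1, 3, -2, 7]
for v in risky:
    if v < 0:
        risky.remove(v)
print(risky)   # -1 was skipped by the loop and is still in the list

# Safer: loop over a slice copy and modify the original
safe = [-5, -1, 3, -2, 7]
for v in safe[:]:
    if v < 0:
        safe.remove(v)
print(safe)    # only the non-negative values remain
```

In the first loop, removing `-5` shifts `-1` into the position that has already been visited, so it is never tested.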

## More looping

### Using `range()`

If you would like to iterate over a numeric sequence then this is possible by combining the `range()` function and a `for` loop. A range is defined by a start, an end and a step: only the end is required, the start defaults to 0 and the step defaults to 1. The end value itself is never included.

In [14]:
print(list(range(10)))

print(list(range(5, 10)))

print(list(range(0, 10, 3)))

print(list(range(7, 2, -2))) # counting down from 7 towards 2 needs a negative step

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[5, 6, 7, 8, 9]
[0, 3, 6, 9]
[7, 5, 3]


In [16]:
print(list(range(7,2)))

[]


In the example above the result is empty because the start (7) is larger than the end (2) and the default step is +1, so there is nothing to count through. To count down from 7 towards 2 you need to give a negative step, as in `range(7, 2, -2)`.

Looping through ranges:

In [17]:
for x in range(8):
    print(x*x)

0
1
4
9
16
25
36
49


In [18]:
squares = []
for x in range(8):
    s = x*x
    squares.append(s)
    
print(squares)

[0, 1, 4, 9, 16, 25, 36, 49]


### Using `enumerate()`

Given a sequence, `enumerate()` allows you to iterate over the sequence generating a tuple containing each value along with a corresponding index.

In [21]:
letters = ['A','C','G','T']
for index, letter in enumerate(letters):
    print(index, letter)

0 A
1 C
2 G
3 T


In [20]:
a=['ab','bc','cd']
for i, elem in enumerate(a):
    print(i,elem) #i would refer to the index, and elem would refer to the value in that index position

0 ab
1 bc
2 cd


In [22]:
numbered_letters = list(enumerate(letters))
print(numbered_letters)

[(0, 'A'), (1, 'C'), (2, 'G'), (3, 'T')]
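`enumerate()` also accepts an optional `start` argument, which can be handy when you want to report 1-based positions, as is usual for sequence coordinates. A short sketch, using a made-up six-base sequence:

```python
dna = 'ATGCCT'
numbered_bases = []
for position, base in enumerate(dna, start=1):   # count from 1 instead of 0
    print(position, base)
    numbered_bases.append((position, base))
```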


In a similar way, for a dictionary `d`, the method `d.items()` gives you the dictionary's keys and values together as (key, value) tuples. You can loop over these pairs directly, or turn them into a list of tuples with `list(d.items())`.
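As a minimal sketch of `items()` in a loop, reusing the `rnaMassDict` dictionary from earlier in this session (the names `base`, `mass` and `pairs` are chosen here just for illustration):

```python
rnaMassDict = {"G": 345.21, "C": 305.18, "A": 329.21, "U": 302.16}

# Unpack each (key, value) pair into two loop variables
for base, mass in rnaMassDict.items():
    print(base, mass)

# Convert the pairs into a list of tuples
pairs = list(rnaMassDict.items())
print(pairs)
```

This avoids looking up `rnaMassDict[base]` inside the loop, since the value is already available as `mass`.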

## Filtering in loops

In [23]:
city_pops = {
    'London': 8200000,
    'Cambridge': 130000,
    'Edinburgh': 420000,
    'Glasgow': 1200000
}

big_cities = []
for city in city_pops: #city will be the keys in the dictionary
    if city_pops[city] >= 1000000: #want to access the value that refers to that key in the dictionary
         big_cities.append(city) #add only the keys of the dictionary

print(big_cities)

['London', 'Glasgow']


In [24]:
total = 0
for city in city_pops:
    total += city_pops[city] #add the value that refers to that key
print("total population:", total)

total population: 9950000


In [26]:
pops = list(city_pops.values())
print("total population:", sum(pops)) #sums all values in the list

total population: 9950000


## Formatting strings

Constructing more complex strings from a mix of variables of different types can be cumbersome, and sometimes you want more control over how values are interpolated into a string. Python provides a powerful mechanism for formatting strings: the built-in string method `.format()`, which fills in "replacement fields" surrounded by curly braces `{}`. Each field starts with an optional field name, followed by a colon `:` and a format specification.

There are lots of these specifiers, but here are 3 useful ones:

    d: decimal integer
    f: floating point number
    s: string

You can specify the number of decimal places to use in a floating point number with, e.g. `.2f` to use 2 decimal places, or `+.2f` to use 2 decimal places and always show the sign.
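To see the three specifiers side by side, here is a small sketch that combines them in one format string. The values and variable names are invented for the example.

```python
name = 'Pax6'     # a string, formatted with s
count = 42        # an integer, formatted with d
ratio = 3.14159   # a float, formatted with +.2f

line = '{:s} {:d} {:+.2f}'.format(name, count, ratio)
print(line)
```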

In [29]:
print('{:.2f}'.format(0.4567))

0.46


In [27]:
"My age is {}".format(20) #adds your number into the place holder

'My age is 20'

In [28]:
'My age is {age}'.format(age=20) #you can name your place holder or not

'My age is 20'

In [30]:
geneExpression = {
    'Beta-Catenin': 2.5, 
    'Beta-Actin': 1.7, 
    'Pax6': 0, 
    'HoxA2': -3.2
}

for gene in geneExpression:
    print('{:s}\t{:+.2f}'.format(gene, geneExpression[gene])) # s is optional and \t inserts a tab space
    # could also be written using variable names
    #print('{gene:s}\t{exp:+.2f}'.format(gene=gene, exp=geneExpression[gene]))

Beta-Catenin	+2.50
Beta-Actin	+1.70
Pax6	+0.00
HoxA2	-3.20


## Exercises 2.2.3

1. Let's calculate the GC content of a DNA sequence. Use the 15-base sequence you created for the exercises above. Create a variable, `gc`, which we will use to count the number of Gs or Cs in our sequence.
2. Output every base of the sequence alongside its index on a new line.
3. Create a loop to iterate over the bases in your sequence. If the base is a G or the base is a C, add one to your `gc` variable.
4. When the loop is done, divide the number of GC bases by the length of the sequence and multiply by 100 to get the GC percentage. Format the result to only display 2 decimal places.

In [49]:
#1
dna='ATGCCTGAAGTCCAT'
gc=0

#2
for index, base in enumerate(dna):
    print(index,base)
    
#3
for base in dna:
    if base == 'C' or base == 'G':
        gc = gc + 1
print(gc)   

#4
percent=(gc/len(dna))*100

print('{:.2f}'.format(percent)) # the format string needs quotation marks around {:.2f}

0 A
1 T
2 G
3 C
4 C
5 T
6 G
7 A
8 A
9 G
10 T
11 C
12 C
13 A
14 T
7
46.67


## Next session

Go to our next notebook: [python_basic_2_3](python_basic_2_3.ipynb)