http://hplgit.github.io/bioinf-py/doc/pub/bioinf-py.html

# Basic Bioinformatics Examples in Python
The leading Python software for bioinformatics applications is BioPython. For real-world problem solving, prefer BioPython over home-made solutions.

## Counting Letters in DNA Strings

### List Iteration

In [1]:
list('ATCG')

['A', 'T', 'C', 'G']

In [3]:
def count_v1(dna, base):
    i = 0
    for k in list(dna):
        if k == base:
            i += 1
    return i

dna = 'ATGCGGACCTAT'
base = 'C'
print(count_v1(dna, base))

3


### String Iteration

In [4]:
for c in 'ATCG':
    print(c)

A
T
C
G


In [5]:
def count_v2(dna, base):
    i = 0
    for k in dna:
        if k == base:
            i += 1
    return i
dna = 'ATGCGGACCTAT'
base = 'C'
n = count_v2(dna, base)

# printf-style formatting
print("%s appears %d times in %s"%(base, n, dna))

# or (new) format string syntax
print("{base} appears {n} times in {dna}".format(dna=dna, base=base, n=n))

C appears 3 times in ATGCGGACCTAT
C appears 3 times in ATGCGGACCTAT
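
On Python 3.6 and later, the same message can also be written with an f-string. The sketch below is an addition to the examples above; it uses dna.count only as a stand-in for count_v2 so that the snippet runs on its own.

```python
dna = 'ATGCGGACCTAT'
base = 'C'
n = dna.count(base)  # stand-in for count_v2(dna, base)

# f-string: the expressions inside {} are evaluated in place
message = f"{base} appears {n} times in {dna}"
print(message)
```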


### Program Flow
Three tools help in following how a program executes, statement by statement:
- printing variables and messages,
- using a debugger,
- using the Online Python Tutor.

In [6]:
def count_v2_demo(dna, base):
    print('dna:', dna)
    print('base:', base)
    i = 0  # counter
    for c in dna:
        print('c:', c)
        if c == base:
            print('True if test')
            i += 1
    return i

n = count_v2_demo('ATGCGGACCTAT', 'C')

dna: ATGCGGACCTAT
base: C
c: A
c: T
c: G
c: C
True if test
c: G
c: G
c: A
c: C
True if test
c: C
True if test
c: T
c: A
c: T


### Index Iteration

In [7]:
def count_v3(dna, base):
    i = 0 # counter
    for j in range(len(dna)):
        if dna[j] == base:
            i += 1
    return i
dna = 'ATGCGGACCTAT'
base = 'C'
print(count_v3(dna, base))

3


### While Loops

In [8]:
def count_v4(dna, base):
    i = 0  # index into dna
    j = 0  # counter
    while i < len(dna):
        if dna[i] == base:
            j += 1
        i += 1
    return j

dna = 'ATGCGGACCTAT'
base = 'C'
print(count_v4(dna, base))

3


### Summing a Boolean List

In [9]:
def count_v5(dna, base):
    bool_list = []
    for i in dna:
        if i==base:
            bool_list.append(True)
        else:
            bool_list.append(False)
    return sum(bool_list)

dna = 'ATGCGGACCTAT'
base = 'C'
print(count_v5(dna, base))

3


### Inline If Test

In [10]:
def count_v6(dna, base):
    bool_list = []
    for i in dna:
        bool_list.append(True if i==base else False)
    return sum(bool_list)

dna = 'ATGCGGACCTAT'
base = 'C'
print(count_v6(dna, base))

3


### Using Boolean Values Directly

In [11]:
def count_v7(dna, base):
    bool_list = []
    for i in dna:
        bool_list.append(i==base)
    return sum(bool_list)

dna = 'ATGCGGACCTAT'
base = 'C'
print(count_v7(dna, base))

3


### List Comprehensions

In [12]:
def count_v8(dna, base):
    bool_list = [i==base for i in dna]
    return sum(bool_list)

dna = 'ATGCGGACCTAT'
base = 'C'
print(count_v8(dna, base))

3


In [13]:
def count_v9(dna, base):
    return sum([i==base for i in dna])

dna = 'ATGCGGACCTAT'
base = 'C'
print(count_v9(dna, base))

3


### Using a Sum Iterator
Summing without actually storing an extra list is desirable.

In [14]:
def count_v10(dna, base):
    return sum(i==base for i in dna)

dna = 'ATGCGGACCTAT'
base = 'C'
print(count_v10(dna, base))

3
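
To see why avoiding the extra list matters, one can compare the size of a list of booleans with the size of the equivalent generator. This is a rough sketch: sys.getsizeof reports only the size of the container object itself, and the test string is an arbitrary choice made for illustration.

```python
import sys

dna = 'ATGCGGACCTAT' * 1000

bool_list = [c == 'C' for c in dna]   # stores one entry per letter
bool_gen = (c == 'C' for c in dna)    # produces values one at a time

list_size = sys.getsizeof(bool_list)  # grows with len(dna)
gen_size = sys.getsizeof(bool_gen)    # small and independent of len(dna)
print(list_size > gen_size)

# Both forms give the same count
print(sum(bool_list) == sum(bool_gen))
```

Note that a generator can be consumed only once; after the call to sum, bool_gen is exhausted.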


### Extracting Indices
Instead of making a boolean list with elements expressing whether a letter matches the given base or not, we may collect all the indices of the matches. This can be done by adding an if test to the list comprehension:

In [15]:
def count_v11(dna, base):
    index_list = [i for i in range(len(dna)) if dna[i]==base]
    return index_list

dna = 'ATGCGGACCTAT'
base = 'C'
print(count_v11(dna, base))

[3, 7, 8]
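
Note that the function above returns the matching positions, not their number. A minimal sketch of how the count can be recovered from the indices follows; the names find_indices and count_from_indices are hypothetical and used only for illustration.

```python
def find_indices(dna, base):
    """Return the positions at which base occurs in dna."""
    return [i for i, c in enumerate(dna) if c == base]

def count_from_indices(dna, base):
    """Count occurrences as the number of matching positions."""
    return len(find_indices(dna, base))

dna = 'ATGCGGACCTAT'
indices = find_indices(dna, 'C')
print(indices)                        # [3, 7, 8]
print(count_from_indices(dna, 'C'))   # 3
```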


### Using Python's Library
Python strings have a built-in count method, so the whole task reduces to the call dna.count(base).

In [16]:
def count_v12(dna, base):
    return dna.count(base)

dna = 'ATGCGGACCTAT'
base = 'C'
print(count_v12(dna, base))

3


## Efficiency Assessment
Now we have 12 count_v* functions, 11 of which count the occurrences of a letter in a string (count_v11 returns the matching indices instead). Which of these implementations is the fastest? To answer the question we need some test data: a huge string dna.

### Generating Random DNA Strings

In [17]:
N = 1000000
dna = 'A'*N

In [18]:
# Generate dna by drawing each letter at random from all four nucleotides
import random

alphabet = list('ACGT')
dna = [random.choice(alphabet) for i in range(N)]
dna = ''.join(dna)

In [19]:
import random
def generate_string(N, alphabet='ACGT'):
    # random.choice accepts a string directly, so no list conversion is needed
    return ''.join([random.choice(alphabet) for i in range(N)])
dna = generate_string(600000)
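
When timings or tests need to be repeatable, it helps to make the random string reproducible. The sketch below is one possible variant, not part of the original examples: the function name generate_string_seeded and its seed parameter are assumptions, and it relies on random.choices, which is available from Python 3.6.

```python
import random

def generate_string_seeded(N, alphabet='ACGT', seed=None):
    """Return a random string of length N; the same seed gives the same string."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    return ''.join(rng.choices(alphabet, k=N))

a = generate_string_seeded(20, seed=42)
b = generate_string_seeded(20, seed=42)
print(a == b)   # True
```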

### Measuring CPU Time
Our next goal is to see how much time the various count_v* functions spend on counting letters in a huge string, generated as shown above. The time module provides two suitable clocks:

perf_counter() measures elapsed wall-clock time, as if you used a stopwatch. It includes time during which the process is sleeping or waiting.

process_time() measures the CPU time (system plus user) consumed by the current process. It excludes time spent sleeping, and because the operating system shares the CPU among many processes, it is usually smaller than the elapsed time.

In [20]:
import time

t0 = time.perf_counter()
# do stuff
t1 = time.perf_counter()
elapsed = t1 - t0  # wall-clock time, not CPU time
elapsed

2.3399999996343013e-05

In [21]:
import time

t0 = time.process_time()
# do stuff
t1 = time.process_time()
cpu_time = t1 - t0
cpu_time

0.0
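
The difference between the two clocks shows up as soon as the program waits for something. The following sketch, with an arbitrary sleep length and workload chosen for illustration, measures the same stretch of code with both clocks.

```python
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.2)                             # waiting: uses almost no CPU
total = sum(i * i for i in range(200000))   # computing: uses CPU

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print("wall-clock: {:.3f} s".format(wall_elapsed))
print("CPU:        {:.3f} s".format(cpu_elapsed))
```

The wall-clock figure includes the sleep, while the CPU figure should reflect little more than the summation.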

In [22]:
import time

functions = [count_v1, count_v2, count_v3, count_v4, count_v5, count_v6, count_v7, count_v8, count_v9,
            count_v10, count_v11, count_v12]
timings = [] # timings[i] holds cpu time for functions[i]

for function in functions:
    t0 = time.process_time()
    function(dna, 'A')
    t1 = time.process_time()
    cpu_time = t1-t0
    timings.append(cpu_time)

In [23]:
for cpu_time, function in zip(timings, functions):
    print("{f:<9s} : {c:.2f}".format(f=function.__name__, c=cpu_time))

count_v1  : 0.05
count_v2  : 0.05
count_v3  : 0.06
count_v4  : 0.16
count_v5  : 0.11
count_v6  : 0.09
count_v7  : 0.09
count_v8  : 0.06
count_v9  : 0.08
count_v10 : 0.08
count_v11 : 0.08
count_v12 : 0.00
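
A single measurement per function is sensitive to whatever else the computer is doing at that moment. A common alternative is the standard timeit module, which runs each call several times. The sketch below is an assumption about how one might apply it here: count_loop and count_builtin are hypothetical stand-ins for the count_v* functions, and the string length and repeat counts are arbitrary.

```python
import timeit

dna = 'ATGCGGACCTAT' * 10000

def count_loop(dna, base):
    """Stand-in for the explicit loop versions."""
    i = 0
    for c in dna:
        if c == base:
            i += 1
    return i

def count_builtin(dna, base):
    """Stand-in for the str.count version."""
    return dna.count(base)

number = 5   # calls per measurement
best_times = {}
for f in (count_loop, count_builtin):
    # three measurements of `number` calls each; keep the fastest one
    runs = timeit.repeat(lambda: f(dna, 'A'), number=number, repeat=3)
    best_times[f.__name__] = min(runs) / number

for name, t in best_times.items():
    print("{n:<14s}: {t:.6f} s per call".format(n=name, t=t))
```

Taking the minimum over the repeats is the usual choice, since larger values mostly reflect interference from other processes rather than the code itself.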


## Verifying the Implementations

In [24]:
def test_count_all():
    dna = 'ATTTGCGGTCCAAA'
    exact = dna.count('A')
    for f in functions:
        if f(dna, 'A') != exact:
            print(f.__name__, 'failed')
            

In [25]:
test_count_all()

count_v11 failed

count_v11 fails because it returns the list of matching indices rather than their number, and a list never compares equal to an integer count.


In [26]:
def test_count_all():
    dna = 'ATTTGCGGTCCAAA'
    expected = dna.count('A')

    functions = [count_v1, count_v2, count_v3, count_v4,
                 count_v5, count_v6, count_v7, count_v8,
                 count_v9, count_v10, count_v11, count_v12]
    for f in functions:
        success = f(dna, 'A') == expected
        msg = '%s failed'%f.__name__
        assert success, msg

In [27]:
test_count_all()

AssertionError: count_v11 failed
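
A single test string can miss edge cases such as an empty string or a base that does not occur at all. The following sketch extends the idea to several inputs; count_loop and count_sum are hypothetical stand-ins for the count_v* functions so that the snippet runs on its own.

```python
def count_loop(dna, base):
    """Stand-in for the explicit loop versions."""
    i = 0
    for c in dna:
        if c == base:
            i += 1
    return i

def count_sum(dna, base):
    """Stand-in for the generator-based version."""
    return sum(c == base for c in dna)

def test_count_cases():
    cases = [
        ('ATTTGCGGTCCAAA', 'A'),  # typical input
        ('', 'A'),                # empty string
        ('GGGG', 'A'),            # base does not occur
        ('AAAA', 'A'),            # every letter matches
    ]
    for f in (count_loop, count_sum):
        for dna, base in cases:
            expected = dna.count(base)
            msg = '%s failed on %r' % (f.__name__, dna)
            assert f(dna, base) == expected, msg

test_count_cases()  # silent when every case passes
```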

## Computing Frequencies