In [None]:
import numpy as np
# we set a random seed to make things reproducible
np.random.seed(1)

We simulate random sequences as a matrix in which each row is one sequence and each base is encoded as an integer: 0 (A), 1 (C), 2 (T) and 3 (G):

In [None]:
n_sequences = 10
sequence_length = 30
sequences = np.random.randint(0, 4, (n_sequences, sequence_length))
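
To make the integer encoding concrete, here is a minimal sketch that decodes a small matrix back into letters. The names `alphabet`, `toy_sequences`, `decoded` and `as_strings` are introduced here for illustration only and are not part of the notebook; the sketch assumes the same mapping as above (0=A, 1=C, 2=T, 3=G).

```python
import numpy as np

# Lookup table whose index matches the encoding used above: 0=A, 1=C, 2=T, 3=G
alphabet = np.array(["A", "C", "T", "G"])

# A small hand-written matrix (illustrative, not the random data above)
toy_sequences = np.array([[0, 1, 1, 3],
                          [2, 2, 0, 1]])

# Indexing the lookup table with the integer matrix decodes every base at once
decoded = alphabet[toy_sequences]

# Join each row of letters into one string per sequence
as_strings = ["".join(row) for row in decoded]
print(as_strings)  # ['ACCG', 'TTAC']
```

The same indexing should work on the full `sequences` matrix, since it only contains the values 0 to 3.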

**Example:** How many C's are there in the sequences? We first try using plain Python only:

In [None]:
def count_c(sequences):
    number_of_cs = 0
    for sequence in sequences:
        for base in sequence:
            if base == 1:
                number_of_cs += 1
    return number_of_cs

In [None]:
%time print("Result: ", count_c(sequences))

Try changing `n_sequences` to a larger number (e.g. 1000000), re-run the cell that creates `sequences`, and measure how long the code above takes to run.

Why do you think the above code is so slow?

What happens inside Python when the above code is run?

We try NumPy instead:

In [None]:
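def count_c_using_numpy(sequences):
    # elementwise comparison: boolean matrix that is True where the base is C
    is_c = sequences == 1
    # True counts as 1 when summed, so the sum is the number of C's
    return np.sum(is_c)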

In [None]:
%time print("Result: ", count_c_using_numpy(sequences))
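
To see what the comparison and the sum do, here is a small sketch on a hand-written matrix. The names `toy_sequences` and `total_cs` are illustrative and not part of the notebook.

```python
import numpy as np

# A small hand-written matrix (illustrative, not the random data above)
toy_sequences = np.array([[0, 1, 1, 3],
                          [2, 2, 0, 1]])

# Elementwise comparison gives a boolean matrix of the same shape
is_c = toy_sequences == 1
print(is_c)

# True counts as 1 and False as 0 when summed
total_cs = int(np.sum(is_c))
print(total_cs)  # 3
```

Both steps operate on the whole array at once, so no Python-level loop over individual bases is needed.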

## Exercise
Assume we have the following base qualities (an integer from 0 to 59 for each base in each sequence; the upper bound of `np.random.randint` is exclusive):

In [None]:
base_qualities = np.random.randint(0, 60, (n_sequences, sequence_length))


Given the above base qualities, write a function that returns the number of bases with quality above 30.

Bonus tasks if you found the task easy:

* What is the mean base quality?
* How many reads have mean base quality above 35?
* What is the standard deviation of the base qualities? (Hint: search for "numpy standard deviation".)
* What is the mean base quality of all the bases except the first base of each read?
* What is the mean base quality of all the bases with quality above 30, not counting the first base of each read?
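
As a hint for the bonus tasks, the sketch below shows three NumPy building blocks on a tiny hand-written matrix: reducing along an axis, slicing away a column, and indexing with a boolean mask. The array `toy_qualities` and the other names are illustrative only; applying the ideas to `base_qualities` is left as the exercise.

```python
import numpy as np

# A small hand-written quality matrix: 2 reads, 3 bases each (illustrative)
toy_qualities = np.array([[10, 40, 50],
                          [20, 30, 31]])

# axis=1 reduces across the columns, giving one mean per read (row)
per_read_mean = toy_qualities.mean(axis=1)

# A column slice drops the first base of every read
without_first = toy_qualities[:, 1:]

# Indexing with a boolean mask keeps only the values where the mask is True
high = without_first[without_first > 30]

print(per_read_mean)
print(without_first.shape)  # (2, 2)
print(high)                 # [40 50 31]
```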
