# Getting started with BioNumPy

This task will get you started with BioNumPy. We will cover two important concepts:

1) Using numpy on BioNumPy datasets
2) Reading datasets and performing operations on dataset chunks


## Install and import

BioNumPy can easily be installed through pip:

In [None]:
!pip install bionumpy

Test that the installation worked by importing BioNumPy and encoding a sequence:

In [15]:
import numpy as np
import bionumpy as bnp
sequence = bnp.as_encoded_array("ACTG")
print(sequence)

ACTG


## Part 1: Working with BioNumPy datasets


BioNumPy datasets are usually created by reading files (e.g. FASTQ or VCF files), but we can also create small datasets using
`bnp.as_encoded_array()`:

In [16]:
sequences = bnp.as_encoded_array([
    "ACTGACG",
    "ACA",
    "ACACGGAAC"
])
print(sequences)

ACTGACG
ACA
ACACGGAAC


The `sequences` object is encoded and stored in an efficient NumPy-like data structure. We don't need to know the internal details: we can use it much like a NumPy matrix, with the additional benefit of thinking of the data as DNA rather than numbers. For instance, finding the positions of all Gs is as simple as:

In [17]:
is_g = sequences == "G"
print(is_g)

[False False False  True False False  True]
[False False False]
[False False False False  True  True False False False]


We can then e.g. take the sum of this mask to count the number of Gs:

In [18]:
np.sum(is_g)

4

What do you think happens if you specify `axis=1` in `np.sum()`? What does the output tell you? Try running the code below and see if you can make sense of the output:

In [19]:
np.sum(is_g, axis=1)

array([2, 0, 2])
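
The same axis behaviour can be seen on an ordinary NumPy array. The sketch below uses a small, made-up boolean array (`toy_mask` is not a BioNumPy object, and unlike BioNumPy sequences all of its rows must have the same length):

```python
import numpy as np

# A plain rectangular boolean mask: 2 rows ("sequences"), 4 columns ("positions").
toy_mask = np.array([
    [False, True, False, True],
    [False, False, True, False],
])

total = np.sum(toy_mask)            # counts True values over the whole array
per_row = np.sum(toy_mask, axis=1)  # counts True values within each row

print(total)
print(per_row)
```

Without `axis`, the whole mask is reduced to a single number; with `axis=1`, we get one count per row, i.e. one count per sequence.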

Before you continue, check that you have understood how NumPy can be used on BioNumPy data:

1) Make a mask with the positions of C
2) Count how many bases are either C or G
3) Compute the GC content (hint: you can make use of `np.mean` here)
4) Make a new set of sequences where the first base of each sequence is removed

Solution (only look after you have tried yourself):

In [None]:
is_c = sequences == "C"
is_c_or_g = is_c | is_g
# A bare expression in the middle of a cell is not displayed, so print the count
print("Number of C or G:")
print(np.sum(is_c_or_g))

print("GC content:")
gc_content = np.mean(is_c_or_g)
print(gc_content)

print("Stripped sequences:")
stripped_sequences = sequences[:, 1:]
print(stripped_sequences)


## Part 2: Working with files

In the previous task, we worked with a very small dataset of only three sequences. When working with large datasets, we want to avoid reading the whole dataset into memory. Instead, BioNumPy reads chunks of data, and we typically analyse each chunk separately and combine the results at the end.
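
The chunk-and-combine idea itself does not depend on BioNumPy. As a rough sketch in plain Python (the helper `read_in_chunks` and the toy data are made up for illustration and are not part of BioNumPy), we can process a stream a few entries at a time and merge the partial results:

```python
import io
from itertools import islice

def read_in_chunks(items, chunk_size):
    """Yield lists of at most `chunk_size` items from an iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# A tiny in-memory stand-in for a large file with one sequence per line.
fake_file = io.StringIO("ACTG\nGGCA\nTTGC\nACGT\nCCGG\n")
fake_sequences = (line.strip() for line in fake_file)

# Count G bases chunk by chunk, then combine the per-chunk counts.
g_counts = [
    sum(seq.count("G") for seq in chunk)
    for chunk in read_in_chunks(fake_sequences, chunk_size=2)
]
total_g = sum(g_counts)
print(g_counts, total_g)
```

Only one chunk is held in memory at a time. BioNumPy follows the same pattern, but each chunk is a NumPy-like object, so the work inside a chunk is vectorized.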

In this and the coming exercises, we will work with ChIP-seq data. We start by downloading FASTQ reads for a CTCF ChIP-seq experiment from the ENCODE project:

In [None]:
!wget https://www.encodeproject.org/files/ENCFF000RWH/@@download/ENCFF000RWH.fastq.gz

After running the command above, you will have a file named `ENCFF000RWH.fastq.gz`. You can open the file and read a chunk from it with BioNumPy like this:

In [None]:
f = bnp.open("ENCFF000RWH.fastq.gz")
chunk = f.read_chunk()
print(chunk)

`chunk` is now an object containing a part of the file. We can access the sequences, names and qualities of the entries:

In [None]:
print(chunk.sequence)
print(chunk.name)
print(chunk.quality)

These objects work similarly to the `sequences` object in the previous task. Note that we can index the chunk. For instance, getting the first three reads can be done with:

In [None]:
print(chunk[0:3])

**TASK**: Try to compute the average base quality of all the reads in the chunk:

In [None]:
average_base_quality = ...  # replace the ... with your own code
print(average_base_quality)

**Tasks:**

* Find the average base quality for each read in this chunk (hint: axis=1)
* How many reads have average base quality lower than 20?
* Subset the chunk so that you are left with reads with average base qualities >= 20.
* How many reads are there in your new filtered chunk?
* Put your code for filtering in a function. The function should take a chunk as an argument and return a new "filtered" chunk.


Solution below (don't look before you've tried yourself):

In [None]:
def filter_reads(chunk):
    mask = np.mean(chunk.quality, axis=1) >= 20
    return chunk[mask]

f = bnp.open("ENCFF000RWH.fastq.gz")
chunk = f.read_chunk()
print(chunk)

filtered_chunk = filter_reads(chunk)
print(filtered_chunk)
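
The filtering step relies on ordinary NumPy boolean-mask indexing. Here is a minimal sketch on a plain NumPy array with made-up quality values (`toy_qualities` is hypothetical, and all toy reads have the same length here, which real reads need not have):

```python
import numpy as np

# Three toy "reads" with three base qualities each.
toy_qualities = np.array([
    [30, 32, 28],
    [10, 12, 8],
    [25, 15, 20],
])

mean_per_read = toy_qualities.mean(axis=1)  # one average per read (row)
keep = mean_per_read >= 20                  # boolean mask with one entry per read
filtered = toy_qualities[keep]              # keeps only rows where the mask is True

print(mean_per_read)
print(keep)
print(filtered)
```

Indexing with a boolean mask keeps exactly the rows where the mask is `True`, which is what `chunk[mask]` does with the reads in `filter_reads`.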

## Part 3: Working with chunks from files
Above, we wrote code for filtering a chunk of sequences. We can now read chunks iteratively from a file and run our function on each chunk. This keeps memory usage low, while each chunk is still large enough to get a significant speedup from NumPy (compared to working on single sequence entries, which is common in vanilla Python programs).

Below, we read chunks iteratively using the `file.read_chunks()` method, filter each chunk and write the resulting chunks to a new file:

In [None]:
f = bnp.open("ENCFF000RWH.fastq.gz")
out_file = bnp.open("ENCFF000RWH_filtered.fastq", "w")

for chunk in f.read_chunks():
    print(chunk)
    filtered_chunk = filter_reads(chunk)
    print(filtered_chunk)
    out_file.write(filtered_chunk)

# Close the output file so that all filtered reads are written to disk
out_file.close()

## Part 4: Combining analysis results from chunks

A common pattern when working with big datasets is to perform an analysis on parts of the dataset (chunks) and combine the results.

For instance, assume we want to compute the average base quality for the whole data set, but we don't want to read the whole data set into memory.
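
To see why combining needs some care, consider the sketch below. It uses plain Python lists with made-up quality values (`quality_chunks` is hypothetical) and keeps only a running sum and a running count in memory:

```python
# Hypothetical per-chunk quality scores; the chunks differ in size on purpose.
quality_chunks = [
    [30, 30, 30, 30],  # 4 bases
    [10, 10],          # 2 bases
]

# Keep only two running numbers in memory: the sum and the count.
total = 0
count = 0
for chunk in quality_chunks:
    total += sum(chunk)
    count += len(chunk)
overall_mean = total / count

# Averaging the per-chunk means gives a different (wrong) answer
# whenever the chunks differ in size.
mean_of_means = sum(sum(c) / len(c) for c in quality_chunks) / len(quality_chunks)

print(overall_mean)
print(mean_of_means)
```

The overall mean weights every base equally, while the mean of the per-chunk means gives the small chunk too much influence. A correct merge therefore has to carry both the sum and the count across chunks.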

BioNumPy lets you do computation on single chunks, and provides utility functions for merging the results. This is done by adding the `bnp.streamable()` decorator.

For instance, here we have defined a function that computes the number of matches of the subsequence CCCTC in **a single chunk**:

In [None]:
from bionumpy.sequence.string_matcher import match_string
def count_reads_with_matches(chunk):
    matches = match_string(chunk.sequence, "CCCTC")
    return np.sum(matches)

If we then want the number of matches for our whole file, we could call the function on each chunk like this:

In [None]:
f = bnp.open("ENCFF000RWH_filtered.fastq")
results = []
for chunk in f.read_chunks():
    results.append(count_reads_with_matches(chunk))

print(sum(results))

To avoid writing the for-loop above and get cleaner code, BioNumPy provides the decorator `@bnp.streamable`,
which can be added above a function to make it handle multiple chunks. BioNumPy will automatically
run the function on each chunk and combine the results using the function passed to the decorator, in this case `sum`:

In [None]:
@bnp.streamable(sum)
def count_reads_with_matches(chunk):
    matches = match_string(chunk.sequence, "CCCTC")
    return np.sum(matches)

chunks = bnp.open("ENCFF000RWH_filtered.fastq").read_chunks()
print(count_reads_with_matches(chunks))
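
To build some intuition for what such a decorator does, here is a simplified pure-Python sketch. It is an assumption about the general idea, not BioNumPy's actual implementation: `streamable_sketch` and `count_g` are hypothetical names, and this version only accepts an iterable of chunks.

```python
from functools import wraps

def streamable_sketch(reduction=None):
    """Hypothetical stand-in for a decorator like `bnp.streamable`."""
    def decorator(func):
        @wraps(func)
        def wrapper(chunks, *args, **kwargs):
            # Lazily apply the wrapped function to every chunk.
            results = (func(chunk, *args, **kwargs) for chunk in chunks)
            # Without a reduction, return the per-chunk results as a generator;
            # otherwise combine them with the reduction function.
            return results if reduction is None else reduction(results)
        return wrapper
    return decorator

@streamable_sketch(sum)
def count_g(chunk):
    # Works on ONE chunk: here a chunk is simply a list of strings.
    return sum(seq.count("G") for seq in chunk)

toy_chunks = iter([["ACTG", "GG"], ["TTT"], ["GAG"]])
total_g = count_g(toy_chunks)
print(total_g)
```

The decorated function is still written as if it only ever sees one chunk; the wrapper takes care of the loop and of combining the per-chunk results.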

## Part 5: Using built-in BioNumPy functions on chunks

BioNumPy also provides some useful utility functions for combining results from multiple chunks. One such function is
`bnp.mean`, which can take a generator of chunks and work on the generator as if it had received one large chunk.

For instance, assume we write a function that finds all matches within a chunk:

In [None]:
@bnp.streamable()
def get_matches(chunk, sequence):
    return match_string(chunk.sequence, sequence)

Calling this function on chunks gives us a generator containing the matches for each chunk:

In [None]:
f = bnp.open("ENCFF000RWH_filtered.fastq")
chunks = f.read_chunks()
matches = get_matches(chunks, "CCCTC")

If we call `bnp.mean` on `matches`, it will compute the mean over all the match masks as if it had received one single mask for the whole dataset: