# Getting started with BioNumPy

This task will get you started with BioNumPy. We will cover three important concepts:

1) Using NumPy on BioNumPy datasets
2) Reading datasets and performing operations on dataset chunks
3) Combining analysis results from chunks of data


## Install and import

BioNumPy can easily be installed through pip:

In [None]:
!pip install bionumpy

Test that the installation worked by importing BioNumPy and encoding a sequence:

In [None]:
import numpy as np
import bionumpy as bnp
sequence = bnp.as_encoded_array("ACTG")
print(sequence)

## Part 1: Introduction to BioNumPy datasets

BioNumPy datasets can consist of things like DNA sequences, sequence names, base qualities, protein sequences, etc.
They are usually created by reading files (e.g. FASTQ or VCF files), but we can also create small datasets using the
`bnp.as_encoded_array()` function:

In [None]:
sequences = bnp.as_encoded_array([
    "ACTGACG",
    "ACA",
    "ACACGGAAC"
])
print(sequences)

The `sequences` object is encoded and represented using an efficient NumPy-like data structure. However, BioNumPy doesn't require you to know about any of the internals, and we can simply treat the data as a NumPy matrix consisting of DNA bases instead of numbers. For instance, getting a boolean mask with the positions of Gs is as simple as:

In [None]:
is_g = sequences == "G"
print(is_g)

We can then take the `np.sum` of this mask to count the number of Gs:

In [None]:
np.sum(is_g)

**Task**

What do you think happens if you specify `axis=1` in `np.sum()`? What does the output tell you? Try running the code below and see if you can make sense of the output:

In [None]:
np.sum(is_g, axis=1)

**Task**

Before you continue, check that you have understood how NumPy can be used on BioNumPy data:

1) Make a mask with the positions of C
2) Count how many bases are either C or G
3) Compute the GC-content (PS: `np.mean` might be useful)
4) Make a new set of sequences where the first base of each sequence is removed

Solution (only look after you have tried yourself):

In [None]:
is_c = sequences == "C"
is_c_or_g = is_c | is_g  # the | operator is "or"
# number of bases that are C or G:
print(np.sum(is_c_or_g))

print("GC content:")
gc_content = np.mean(is_c_or_g)  # alternatively sum and divide by number of bases
print(gc_content)

print("Stripped sequences:")
stripped_sequences = sequences[:, 1:]  # we index on last dimension (columns)
print(stripped_sequences)


## Part 2: Working with files

In the previous task, we worked with a very small dataset consisting of only three sequences. When working with larger datasets, we want to avoid reading the whole dataset into memory. Instead, BioNumPy reads chunks of data, and we typically analyse each chunk separately and combine the results at the end if necessary.
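
To make the chunking idea concrete, here is a minimal sketch in plain Python. It does not show how BioNumPy reads files internally; the helper `read_in_chunks` and the tiny in-memory FASTQ text are made up for illustration. It only shows the pattern: read a few reads at a time, compute something per chunk, and combine at the end.

```python
import io
from itertools import islice

# A tiny in-memory stand-in for a FASTQ file (4 lines per read).
FASTQ_TEXT = (
    "@read1\nACTG\n+\nIIII\n"
    "@read2\nGGCA\n+\nIIII\n"
    "@read3\nTTGC\n+\nIIII\n"
)


def read_in_chunks(handle, reads_per_chunk):
    """Hypothetical helper: yield lists of (name, sequence, quality) tuples."""
    while True:
        # Pull at most `reads_per_chunk` reads (4 lines each) from the file.
        lines = [line.rstrip("\n") for line in islice(handle, 4 * reads_per_chunk)]
        if not lines:
            return
        yield [
            (lines[i][1:], lines[i + 1], lines[i + 3])
            for i in range(0, len(lines), 4)
        ]


# Analyse each chunk separately, then combine the per-chunk results.
g_counts = [
    sum(seq.count("G") for _, seq, _ in chunk)
    for chunk in read_in_chunks(io.StringIO(FASTQ_TEXT), reads_per_chunk=2)
]
total_g = sum(g_counts)
print(g_counts, total_g)
```

Only one chunk is held in memory at a time, which is the property we rely on in the rest of this task.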

In this and the coming exercises, we will work with ChIP-seq data. We start by downloading FASTQ reads for a CTCF ChIP-seq experiment from the Encode Project:

In [None]:
!wget https://www.encodeproject.org/files/ENCFF000RWH/@@download/ENCFF000RWH.fastq.gz

After running the command above, you will get a file `ENCFF000RWH.fastq.gz`. You can open and read a chunk from the file with BioNumPy like this:

In [None]:
f = bnp.open("ENCFF000RWH.fastq.gz")
chunk = f.read_chunk()
print(chunk)

`chunk` is now an object containing a part of the file. BioNumPy automatically detects that this is a FASTQ file and chooses a suitable data structure. We can access the sequences, names and qualities from this data structure:

In [None]:
print(chunk.sequence)
print(chunk.name)
print(chunk.quality)

These objects work similarly to the `sequences` object in the previous task, meaning that NumPy functions are compatible with them. We can also index the chunk like a NumPy array. For instance, getting the first three reads can be done with:

In [None]:
print(chunk[0:3])

... and similarly, getting the first three bases of each read is as simple as:

In [None]:
print(chunk[:, 0:3])

**Task**:

* Compute the average base quality value of all the reads in the chunk
* Compute the average base quality without considering the first 5 and last 5 bases of each read

In [None]:
# you can write your code here

Remember that NumPy functions such as `np.mean` and `np.sum` can take an `axis` argument. `axis=0` performs the operation along the first axis, i.e. down the rows (e.g. computes one number for every base position if each row is a read), while `axis=1` performs the operation along the second axis, i.e. across the columns of each row (e.g. computes one number for every read).
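
As a small illustration of the two axes, here is a sketch with a plain NumPy array of made-up quality scores. Note the assumption: a plain NumPy array needs all reads to have the same length, whereas real reads in a chunk may differ in length.

```python
import numpy as np

# Made-up quality scores: 3 reads (rows) x 4 base positions (columns).
qualities = np.array([
    [30, 32, 28, 10],
    [35, 36, 34, 30],
    [12, 15, 20, 9],
])

per_position = np.mean(qualities, axis=0)  # one value per base position
per_read = np.mean(qualities, axis=1)      # one value per read

print(per_position.shape)  # (4,)
print(per_read.shape)      # (3,)
print(per_read)            # means 25.0, 33.75 and 14.0
```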

**Task**

* Find the average base quality for **each read** in this chunk (hint: axis=1)
* How many reads have average base quality lower than 20?
* Subset the chunk so that you are left with reads with average base qualities >= 20.
* How many reads are there in your new filtered chunk?
* Put your code for filtering the reads in a function. The function should take a chunk as an argument and return a new "filtered" chunk.


Solution below (don't look before you've tried yourself):

In [None]:
def filter_reads(chunk):
    mask = np.mean(chunk.quality, axis=1) >= 20
    return chunk[mask]

f = bnp.open("ENCFF000RWH.fastq.gz")
chunk = f.read_chunk()
print(chunk)

filtered_chunk = filter_reads(chunk)
print(filtered_chunk)
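
If the boolean indexing in `filter_reads` feels unfamiliar, the following plain NumPy sketch shows the same idea on made-up data. The arrays `names` and `qualities` are hypothetical stand-ins for the fields of a chunk. One mask, computed from the qualities, selects the matching rows in every array, so names and qualities stay aligned.

```python
import numpy as np

names = np.array(["read1", "read2", "read3"])
qualities = np.array([
    [30, 32, 28, 10],  # mean 25.0  -> kept
    [35, 36, 34, 30],  # mean 33.75 -> kept
    [12, 15, 20, 9],   # mean 14.0  -> dropped
])

# True for reads whose average quality is at least 20.
mask = np.mean(qualities, axis=1) >= 20

kept_names = names[mask]
kept_qualities = qualities[mask]
n_dropped = int(np.sum(~mask))  # ~ inverts the mask

print(kept_names, n_dropped)
```

Summing a mask counts its `True` values, which is a quick way to answer "how many reads pass the filter?".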

## Part 3: Working with chunks from files
Above, we've written code for filtering a chunk of sequences. We can now read chunks iteratively from a file and run our function on each chunk. Working on chunks like this lets us keep memory usage low while still getting a significant speedup from NumPy (compared to working on single reads, which is common when writing vanilla Python programs).

Below, we read chunks iteratively using the `file.read_chunks()` method, filter each chunk and write the resulting filtered chunks to a new file:

In [None]:
f = bnp.open("ENCFF000RWH.fastq.gz")
out_file = bnp.open("ENCFF000RWH_filtered.fastq", "w")

for chunk in f.read_chunks():
    print(chunk)
    filtered_chunk = filter_reads(chunk)
    print(filtered_chunk)
    out_file.write(filtered_chunk)

# close the output file so that everything is written to disk
out_file.close()

Make sure that the code above runs before continuing. You may need to change the call to `filter_reads` to match the name of the function you created for filtering.

## Part 4: Combining analysis results from chunks

A common pattern when working with big datasets is to perform an analysis on parts of the dataset (chunks) and combine the results.

For instance, assume we want to compute the average base quality for the whole data set, but we don't want to read the whole data set into memory.
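
Combining results correctly takes a little care. The sketch below uses made-up numbers and plain Python lists to show why simply averaging the per-chunk averages gives the wrong answer when chunks differ in size. Instead, we keep a running sum and a running count.

```python
# Two made-up "chunks" of quality values with different sizes.
chunks = [
    [10, 10, 10, 10, 10, 10],
    [40, 40],
]

# Naive approach: average the per-chunk averages (wrong for unequal sizes).
chunk_means = [sum(c) / len(c) for c in chunks]
mean_of_means = sum(chunk_means) / len(chunk_means)

# Correct approach: accumulate the total sum and the total count.
total = 0
count = 0
for chunk in chunks:
    total += sum(chunk)
    count += len(chunk)
overall_mean = total / count

print(mean_of_means)  # 25.0
print(overall_mean)   # 17.5
```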

BioNumPy lets you do computation on single chunks, and provides utility functions for merging the results. This is done by adding the `bnp.streamable()` decorator.

For instance, here we have defined a function that computes the number of matches of the subsequence CCCTC in **a single chunk**:

In [None]:
from bionumpy.sequence.string_matcher import match_string
def count_reads_with_matches(chunk):
    # Makes a boolean mask with True where we have a match and False elsewhere
    matches = match_string(chunk.sequence, "CCCTC")
    return np.sum(matches)

If we then want the number of matches for our whole read dataset, we could call the function per chunk like this:

In [None]:
f = bnp.open("ENCFF000RWH_filtered.fastq")
results = []
for chunk in f.read_chunks():
    results.append(count_reads_with_matches(chunk))

print(sum(results))

To avoid writing the for-loop above, BioNumPy provides a decorator, `@bnp.streamable()`,
that can be added above a function to make it able to handle multiple chunks. BioNumPy will automatically
run the function on each chunk and combine the results using the function passed to the decorator, in this case `sum`:

In [None]:
@bnp.streamable(sum)
def count_reads_with_matches(chunk):
    matches = match_string(chunk.sequence, "CCCTC")
    return np.sum(matches)

chunks = bnp.open("ENCFF000RWH_filtered.fastq").read_chunks()
print(count_reads_with_matches(chunks))
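
To build some intuition for what such a decorator does, here is a simplified sketch in plain Python. It is not BioNumPy's implementation: `streamable_sketch` is a hypothetical name, each "chunk" is just a list of strings, and the sketch always expects an iterable of chunks as its first argument.

```python
from functools import wraps


def streamable_sketch(reduction=None):
    """Hypothetical stand-in for a chunk-aware decorator (illustration only)."""
    def decorator(func):
        @wraps(func)
        def wrapper(chunks, *args, **kwargs):
            # Lazily apply the function to one chunk at a time.
            results = (func(chunk, *args, **kwargs) for chunk in chunks)
            # Without a reduction, hand back the lazy stream of results.
            return results if reduction is None else reduction(results)
        return wrapper
    return decorator


@streamable_sketch(sum)
def count_matches(chunk, pattern):
    # str.count counts non-overlapping occurrences, which is enough here.
    return sum(seq.count(pattern) for seq in chunk)


chunks = iter([["CCCTCA", "ACGT"], ["CCCTCCCCTC"]])
total_matches = count_matches(chunks, "CCCTC")
print(total_matches)  # 3
```

The decorated function is still written as if it received a single chunk. The looping and the combining step live in the decorator.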

## Part 5: Using builtin BioNumPy-functions on chunks

BioNumPy also provides some useful utility-functions for combining results from multiple chunks. One such function is
`bnp.mean` which can take a generator of results computed on chunks and work on the generator as if it only got one big dataset.

For instance, assume we write a function that finds all matches within a chunk:

In [None]:
@bnp.streamable()
def get_matches(chunk, sequence):
    return match_string(chunk.sequence, sequence)

Calling this function on multiple chunks gives us a generator containing the matches for each chunk:

In [None]:
f = bnp.open("ENCFF000RWH_filtered.fastq")
chunks = f.read_chunks()
matches = get_matches(chunks, "CCCTC")
print(type(matches))

`matches` is a stream (generator) that will yield the mask for each chunk if we iterate over it.
If we call `bnp.mean` on `matches`, it will compute the mean over all the masks in the stream, combine them correctly,
and return one single number as if it had received **one single mask** for the whole data set:

In [None]:
average_matches_per_base = bnp.mean(matches)
print(average_matches_per_base)

Using this pattern may seem a bit cumbersome until you get used to it, but it enables analysis of datasets that are too large to fit into memory.
A common use case for this pattern is computing an average or another statistic over a big data set.

**Task**: Try making a function that simply returns the base qualities for a chunk, and use `bnp.mean` to get the average base quality for the whole data set.

Compute the average base qualities of the filtered and unfiltered fastq reads (`ENCFF000RWH_filtered.fastq` and `ENCFF000RWH.fastq.gz`).

In [None]:
# write your code here



Is the average base quality higher in the filtered reads?

Solution (don't look before you've tried solving it):

In [None]:
@bnp.streamable()
def get_base_qualities(chunk):
    return chunk.quality

f = bnp.open("ENCFF000RWH_filtered.fastq")  # try with filtered and unfiltered
chunks = f.read_chunks()
qualities = get_base_qualities(chunks)
print(bnp.mean(qualities))


## Final notes

If you've successfully done all the exercises in this document, you are ready to use BioNumPy on a wide range of data sets. A typical BioNumPy workflow looks like this:

1) Read one or more data sets with `bnp.open`
2) Use `np`-functions to slice, index or to analyse the data
3) Either write functions that you decorate with `@bnp.streamable()` or use builtin functions to combine results from multiple chunks

In the coming tasks, you will also see that BioNumPy has many builtin utility functions for typical analyses (such as motif matching, kmers, etc).