# Performance Analysis -  Numba
> Number of effective sequences implemented in Numba
- toc: true
- branch: master
- badges: true
- author: Donatas Repečka
- categories: [performance]

## Introduction

In [the previous post](https://donatasrep.github.io/donatas.repecka/performance/2021/04/27/Performance-comparison.html) I compared various languages and libraries in terms of speed. This notebook contains the code used in that comparison, along with some details about the choices made to improve the performance of the Numba implementation.

From the [Numba website](http://numba.pydata.org/): "Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN."

## Setup

In [1]:
# !wget https://github.com/donatasrep/donatas.repecka/raw/master/data/picked_msa.fasta

In [2]:
# ! pip install numpy
# ! pip install pandas
# ! pip install numba

## Getting data

In [3]:
import pandas as pd

In [4]:
def get_data(path):
    fasta_df = pd.read_csv(path, sep="\n", lineterminator=">", index_col=False, names=['id', 'seq'])
    return fasta_df.seq.to_numpy(dtype=str)

In [5]:
seqs = get_data('../data/picked_msa.fasta')

As a reminder, the pseudocode looks like this:

```
meff = 0
for seq1 in seqs:
  weight = 0
  for seq2 in seqs:
    if count_matches(seq1, seq2) > threshold:
      weight += 1
  meff += 1/weight

meff = meff/(len(seq1)**0.5)
```
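The pseudocode can also be written as plain, runnable Python. The version below is only a minimal reference sketch: the names `fraction_identical`, `effective_sequences` and `toy_msa` are made up for illustration, and the threshold is assumed to be a fraction of identical positions (as in the Numba version further down) rather than a raw match count.

```python
def fraction_identical(seq1, seq2):
    """Fraction of aligned positions at which the two sequences agree."""
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return matches / len(seq1)


def effective_sequences(seqs, threshold=0.8):
    """Sum of inverse cluster sizes, normalised by sqrt(sequence length)."""
    meff = 0.0
    for seq1 in seqs:
        weight = 0  # number of sequences similar to seq1, including itself
        for seq2 in seqs:
            if fraction_identical(seq1, seq2) > threshold:
                weight += 1
        meff += 1.0 / weight
    return meff / len(seqs[0]) ** 0.5


# Hypothetical toy alignment: two identical sequences and one unrelated one
toy_msa = ["AAAA", "AAAA", "CCCC"]
print(effective_sequences(toy_msa))  # 1.0
```

The two identical sequences each get weight 2 and the third gets weight 1, so the sum is 0.5 + 0.5 + 1 = 2, which is then divided by the square root of the sequence length (2).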

As with the NumPy and Python versions, we use the same input data. The code is closer to the pure Python version because wrapping optimised NumPy code turned out to be slower. It seems that you are better off leaving all the optimisation to Numba.


In [6]:
import numpy as np
from numba import jit, prange

In [7]:
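def get_nf_numba(seqs, threshold=0.8):
    # Reinterpret the fixed-width unicode array as one uint32 code point per residue
    seqs = seqs.view(np.uint32).reshape(seqs.shape[0], -1)
    n_seqs, seq_len = seqs.shape
    # Every sequence is in the same cluster as itself, hence the identity matrix
    is_same_cluster = np.eye(n_seqs)
    for i in prange(n_seqs):
        # Compare each pair once and fill both halves of the symmetric matrix
        for j in prange(i+1, n_seqs):
            identity = np.equal(seqs[i], seqs[j]).mean()
            is_more = np.greater(identity, threshold)
            is_same_cluster[i,j] = is_more
            is_same_cluster[j,i] = is_more
    # Weight each sequence by the inverse of its cluster size
    meff = 1.0/is_same_cluster.sum(1)
    return meff.sum()/(seq_len**0.5)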
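
The first line of `get_nf_numba` relies on how NumPy stores fixed-width unicode strings: each character takes four bytes, so the array can be viewed as `uint32` code points without copying. The sketch below shows this on a made-up three-sequence array (the data and the variable names `toy_seqs` and `codes` are illustrative only). It assumes all sequences have the same length, as they do in an alignment; shorter strings would be padded with zero code points.

```python
import numpy as np

# Hypothetical toy data; dtype is a fixed-width unicode type ('<U4' on most machines)
toy_seqs = np.array(["ACD-", "ACE-", "GHIK"])

# View the same memory as uint32 and give each sequence its own row
codes = toy_seqs.view(np.uint32).reshape(toy_seqs.shape[0], -1)

print(codes.shape)  # (3, 4)
print(codes[0])     # [65 67 68 45], the code points of 'A', 'C', 'D', '-'

# Fraction of identical positions between the first two sequences
identity = np.equal(codes[0], codes[1]).mean()
print(identity)     # 0.75
```

With the sequences in this form, comparing two of them becomes an element-wise integer comparison.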

There are a couple of things that need to be done in order to utilise Numba fully. Firstly, Numba uses JIT ([just-in-time compilation](https://en.wikipedia.org/wiki/Just-in-time_compilation)). Hence you need to wrap your functions with either the `@jit` decorator or the `jit` function. Note that the first run of the wrapped function will be slower, as Numba needs to compile the code. Secondly, there is the `nopython` option, which compiles the function so that it runs without the Python interpreter. It has its downsides, since only the subset of Python and NumPy that Numba supports can be used, but it allows the code to run faster.
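
As a minimal sketch of the two ways of applying `jit`, the snippet below compiles a small made-up function (`half_sum_py` and the other names are hypothetical) both through a function call and through the decorator, and times the first and second calls. The `try`/`except` fallback is an assumption added only so that the snippet still runs where Numba is not installed; in that case nothing is compiled and the timings are not meaningful.

```python
import time

try:
    from numba import jit
except ImportError:
    # Fallback (assumption for portability): a do-nothing stand-in for jit
    def jit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda func: func


def half_sum_py(n):
    """Plain Python loop: sum of i * 0.5 for i in range(n)."""
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total


# Form 1: call jit as a function on an existing Python function
half_sum_wrapped = jit(half_sum_py, nopython=True)


# Form 2: use jit as a decorator
@jit(nopython=True)
def half_sum_decorated(n):
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total


# The first call includes compilation; the second reuses the compiled code
for label in ("first call", "second call"):
    start = time.perf_counter()
    result = half_sum_wrapped(1000)
    elapsed = time.perf_counter() - start
    print(f"{label}: result={result}, {elapsed * 1e3:.3f} ms")
```

Both forms produce the same compiled function; the function-call form is the one used in this notebook because it makes it easy to compile the same Python function with different options.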

In [8]:
fn = jit(get_nf_numba, nopython=True, parallel=False)
fn(seqs[:100])

0.18006706787628668

In [9]:
%%timeit -n 3 -r 3
fn(seqs[:100])

11 ms ± 642 µs per loop (mean ± std. dev. of 3 runs, 3 loops each)


Another really nice feature of Numba is that it lets you parallelise code with a single option: with `parallel=True`, Numba runs the `prange` loop across multiple threads, as you can see below.

In [10]:
fn = jit(get_nf_numba, nopython=True, parallel=True)
fn(seqs[:100])

0.18006706787628668

In [11]:
%%timeit -n 3 -r 3
fn(seqs[:100])

6.11 ms ± 3.13 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
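
For `prange` to be safe, the iterations of the parallel loop must not depend on each other. The sketch below is not the function used above but a simplified, hypothetical variant (`count_similar` and the toy data are made up): each outer iteration writes only to its own slot of the output array, so the iterations can run in any order. The import fallback is an assumption so that the snippet also runs, sequentially, without Numba.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (assumption for portability): run as ordinary sequential Python
    prange = range

    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda func: func


@njit(parallel=True)
def count_similar(codes, threshold):
    """For each sequence, count sequences with identity above the threshold."""
    n_seqs, seq_len = codes.shape
    counts = np.zeros(n_seqs, dtype=np.int64)
    for i in prange(n_seqs):          # iterations are independent of each other
        c = 0
        for j in range(n_seqs):
            matches = 0
            for k in range(seq_len):
                if codes[i, k] == codes[j, k]:
                    matches += 1
            if matches / seq_len > threshold:
                c += 1
        counts[i] = c                 # each iteration writes only its own slot
    return counts


# Hypothetical toy alignment, converted to uint32 code points
toy = np.array(["AAAA", "AAAA", "CCCC"])
toy_codes = toy.view(np.uint32).reshape(toy.shape[0], -1)

counts = count_similar(toy_codes, 0.8)
print(counts)  # [2 2 1]
```

This variant compares every pair twice instead of exploiting symmetry, which trades some redundant work for a loop body with no shared writes.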


Finally, if precision is less important and can be sacrificed for extra speed, there is the `fastmath` option. From the [Numba documentation](https://numba.readthedocs.io/en/stable/user/performance-tips.html?highlight=fastmath#fastmath):

“In certain classes of applications strict IEEE 754 compliance is less important. As a result it is possible to relax some numerical rigour with view of gaining additional performance. The way to achieve this behaviour in Numba is through the use of the fastmath keyword argument”
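
One thing that relaxed floating-point rules generally permit is regrouping arithmetic operations, which strict IEEE 754 evaluation does not treat as equivalent. The plain Python sketch below involves no Numba at all. It only illustrates why a change in grouping can alter the last digits of a result, and it makes no claim about which relaxation applies to any particular compiled function.

```python
# The same three numbers, added with two different groupings
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)

print(left_to_right)               # 0.6000000000000001
print(regrouped)                   # 0.6
print(left_to_right == regrouped)  # False
```

Differences of this size are usually harmless for a statistic such as the number of effective sequences, but they mean that results with and without `fastmath` are not guaranteed to be bit-for-bit identical.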


In [12]:
fn = jit(get_nf_numba, nopython=True, parallel=True, fastmath=True)
fn(seqs[:100])

0.18006706787628665

In [13]:
%%timeit -n 3 -r 3
fn(seqs[:100])

The slowest run took 6.90 times longer than the fastest. This could mean that an intermediate result is being cached.
4.31 ms ± 3.63 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)


Numba seemed to be the fastest library that I tried on CPU, and it was relatively easy to get started with. Of course, there will be cases where Numba will not work, but in general it deserves to be considered seriously when looking for ways to improve the performance of your code.