# Performance Analysis - Cython
> Number of effective sequences implemented in Cython
- toc: true
- branch: master
- badges: true
- author: Donatas Repečka
- categories: [performance]

## Introduction

In [the previous post](https://donatasrep.github.io/donatas.repecka/performance/2021/04/27/Performance-comparison.html) I compared various languages and libraries in terms of speed. This notebook contains the code used in that comparison, along with some details about the choices made to improve the performance of the Cython implementation.

## Setup

In [1]:
# !wget https://github.com/donatasrep/donatas.repecka/blob/master/data/picked_msa.fasta

In [2]:
# ! pip install numpy
# ! pip install pandas
# ! pip install Cython

## Getting data

In [3]:
import pandas as pd

In [4]:
def get_data(path):
    fasta_df = pd.read_csv(path, sep="\n", lineterminator=">", index_col=False, names=['id', 'seq'])
    return fasta_df.seq.to_numpy(dtype=str)
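
The `read_csv` call above treats `>` as the record terminator and the newline as the column separator, so it relies on every record being one header line followed by one sequence line. Some pandas versions may reject this combination of `sep` and `lineterminator`. As a fallback, here is a minimal plain-Python sketch of the same idea. The function name `read_fasta_sequences` is hypothetical and not part of the original notebook. Unlike the reader above, it joins sequences that are wrapped over several lines.

```python
import numpy as np


def read_fasta_sequences(path):
    """Return the sequences of a FASTA file as a NumPy array of str.

    Assumption: every record starts with a '>' header line; the lines
    that follow, up to the next header, belong to that record's sequence.
    """
    seqs = []
    current = []  # sequence lines of the record being read
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if current:
                    seqs.append("".join(current))
                current = []
            else:
                current.append(line)
    if current:
        seqs.append("".join(current))  # close the last record
    return np.array(seqs, dtype=str)
```

For an aligned FASTA file all sequences have the same length, so the result is a fixed-width unicode array, which is what the Cython function further down expects.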

In [5]:
seqs = get_data('../data/picked_msa.fasta')

As a reminder, the pseudocode looks like this:

```
meff = 0
for seq1 in seqs:
  weight = 0
  for seq2 in seqs:
    if count_matches(seq1, seq2) > threshold:
      weight += 1
  meff += 1/weight

meff = meff/(len(seq1)**0.5)
```
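
For reference, the pseudocode can be turned into a small, slow, pure-Python function. This is only a sketch for checking results on tiny inputs, not the code used in the comparison. It assumes that the threshold is a fraction of identical positions (as in the Cython version below) and that it is below 1, so every sequence matches at least itself. The names `count_matches` and `get_nf_reference` are illustrative.

```python
def count_matches(seq1, seq2):
    """Number of positions at which two equal-length sequences agree."""
    return sum(a == b for a, b in zip(seq1, seq2))


def get_nf_reference(seqs, threshold=0.8):
    """Direct O(n^2) translation of the pseudocode (assumes threshold < 1)."""
    seq_len = len(seqs[0])
    meff = 0.0
    for seq1 in seqs:
        weight = 0  # sequences similar to seq1, including seq1 itself
        for seq2 in seqs:
            if count_matches(seq1, seq2) / seq_len > threshold:
                weight += 1
        meff += 1.0 / weight
    return meff / seq_len ** 0.5


# Two identical sequences share one cluster, the third is on its own:
# meff = 1/2 + 1/2 + 1 = 2, divided by sqrt(4) = 2.
print(get_nf_reference(["AAAA", "AAAA", "CCCC"]))  # 1.0
```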

The Cython implementation is quite simple: it exploits the symmetry of the pairwise comparison, so each pair of sequences is compared only once, and it needs no other tricks to reach decent performance. Overall, its speed falls somewhere between the NumPy and Numba implementations.
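
To illustrate the symmetry argument, the sketch below builds the "same cluster" matrix twice in plain NumPy: once from all n×n pairs and once from the upper triangle only, mirroring each result into the lower triangle. The function names and the toy data are hypothetical, and the threshold is assumed to be below 1 so that the diagonal is always `True`.

```python
import numpy as np


def cluster_matrix_full(mat, threshold=0.8):
    """Compare every sequence with every other one (n * n comparisons)."""
    identity = (mat[:, None, :] == mat[None, :, :]).mean(axis=2)
    return identity > threshold


def cluster_matrix_symmetric(mat, threshold=0.8):
    """Compare each pair once and mirror the result (n * (n - 1) / 2 pairs)."""
    n = mat.shape[0]
    same = np.eye(n, dtype=bool)  # each sequence is in its own cluster
    for i in range(n - 1):
        is_more = (mat[i + 1:] == mat[i]).mean(axis=1) > threshold
        same[i, i + 1:] = is_more  # row i, right of the diagonal
        same[i + 1:, i] = is_more  # column i, below the diagonal
    return same


# Toy alignment: 6 sequences of length 20 over a 4-letter alphabet.
rng = np.random.default_rng(0)
mat = rng.integers(0, 4, size=(6, 20))
mat[1] = mat[0]  # force one pair of identical sequences

full = cluster_matrix_full(mat)
sym = cluster_matrix_symmetric(mat)
print(np.array_equal(full, sym))  # True
```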

In [6]:
%load_ext Cython

In [7]:
%%cython --annotate

import numpy as np
cimport numpy as cnp

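def get_nf_cython(seqs, threshold=0.8):

    cdef cnp.ndarray is_same_cluster

    # Reinterpret the fixed-width unicode strings as one uint32 code point per
    # character, giving an (n_seqs, seq_len) integer matrix.
    seqs = seqs.view(np.uint32).reshape(seqs.shape[0], -1)
    n_seqs, seq_len = seqs.shape
    # Start from all True: the diagonal stays True, so each sequence counts itself.
    is_same_cluster = np.ones([n_seqs, n_seqs],np.bool_)

    for i in range(n_seqs):
        current = seqs[i:]
        # Fraction of identical positions between sequence i and every later sequence.
        pairwise_id = np.equal(current[1:], current[0]).mean(1)
        is_more = pairwise_id > threshold
        # The relation is symmetric, so fill row i and column i together.
        is_same_cluster[i, i+1:] = is_more
        is_same_cluster[i+1:, i] = is_more
    # Weight each sequence by the inverse of its cluster size.
    meff = 1.0/is_same_cluster.sum(1)
    return meff.sum()/(seq_len**0.5)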
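
The `view(np.uint32)` step in the function above may look opaque. NumPy stores fixed-width unicode strings as four bytes per character, so the same memory can be read as a matrix of code points without copying. A minimal sketch on made-up sequences, assuming all strings have the same length (shorter strings would be padded with zeros):

```python
import numpy as np

# Hypothetical toy alignment: three sequences of length four.
seqs = np.array(["AC-G", "ACTG", "TTTT"])

# Read the same buffer as uint32 values, one per character.
codes = seqs.view(np.uint32).reshape(seqs.shape[0], -1)
print(codes.tolist())
# [[65, 67, 45, 71], [65, 67, 84, 71], [84, 84, 84, 84]]

# Fraction of identical positions between the first sequence and the others.
pairwise_id = (codes[1:] == codes[0]).mean(1)
print(pairwise_id.tolist())  # [0.75, 0.0]
```

Comparing integers lets NumPy vectorise the per-position comparison, which is harder to do on Python strings directly.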

In [8]:
%%timeit -n 3 -r 3
    get_nf_cython(seqs[:100])

16.6 ms ± 1.57 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)


In [9]:
get_nf_cython(seqs[:100])

0.18006706787628665