
Create a cython version of biweight_midcorrelation to massively speed up computation #9483

Open
jolespin opened this issue Oct 26, 2019 · 4 comments


@jolespin

Description

Create a Cython version of biweight_midcorrelation, because the current implementation is very slow. For example, pandas has nancorr and nancorr_spearman, which are extremely fast compared to the plain pairwise iteration that could be done using scipy.

Would it be possible to add this functionality to astropy? Currently, I cannot use this robust metric on my actual datasets, which are quite large; I can only use it on test datasets because it takes so long.

Additional context

Here's an example of the nancorr function implemented in Cython: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos.pyx

# Excerpt from pandas/_libs/algos.pyx; it relies on that module's cimports
# (cython, numpy type definitions, sqrt from libc.math) and its NaN constant.
# Pairwise correlation/covariance


@cython.boundscheck(False)
@cython.wraparound(False)
def nancorr(const float64_t[:, :] mat, bint cov=0, minp=None):
    cdef:
        Py_ssize_t i, j, xi, yi, N, K
        bint minpv
        ndarray[float64_t, ndim=2] result
        ndarray[uint8_t, ndim=2] mask
        int64_t nobs = 0
        float64_t vx, vy, sumx, sumy, sumxx, sumyy, meanx, meany, divisor

    N, K = (<object>mat).shape

    if minp is None:
        minpv = 1
    else:
        minpv = <int>minp

    result = np.empty((K, K), dtype=np.float64)
    mask = np.isfinite(mat).view(np.uint8)

    with nogil:
        for xi in range(K):
            for yi in range(xi + 1):
                nobs = sumxx = sumyy = sumx = sumy = 0
                for i in range(N):
                    if mask[i, xi] and mask[i, yi]:
                        vx = mat[i, xi]
                        vy = mat[i, yi]
                        nobs += 1
                        sumx += vx
                        sumy += vy

                if nobs < minpv:
                    result[xi, yi] = result[yi, xi] = NaN
                else:
                    meanx = sumx / nobs
                    meany = sumy / nobs

                    # now the cov numerator
                    sumx = 0

                    for i in range(N):
                        if mask[i, xi] and mask[i, yi]:
                            vx = mat[i, xi] - meanx
                            vy = mat[i, yi] - meany

                            sumx += vx * vy
                            sumxx += vx * vx
                            sumyy += vy * vy

                    divisor = (nobs - 1.0) if cov else sqrt(sumxx * sumyy)

                    if divisor != 0:
                        result[xi, yi] = result[yi, xi] = sumx / divisor
                    else:
                        result[xi, yi] = result[yi, xi] = NaN

    return result
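For context, this routine is what backs `DataFrame.corr` in pandas (for `method="pearson"`), so its speed is easy to observe from plain Python. A minimal sketch with random data; exact timings will vary by machine:

```python
import numpy as np
import pandas as pd

# Random 1000 x 50 matrix with a sprinkling of NaNs as a stand-in dataset
rng = np.random.default_rng(0)
arr = rng.normal(size=(1000, 50))
arr[rng.random(size=arr.shape) < 0.01] = np.nan
df = pd.DataFrame(arr)

# DataFrame.corr dispatches to the Cython nancorr shown above,
# handling the NaNs pairwise and enforcing a minimum observation count
corr = df.corr(method="pearson", min_periods=10)
print(corr.shape)  # (50, 50)
```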
@pllim (Member) commented Oct 26, 2019

Hi. Can you post an example usage that causes slow performance? Did you try to profile it? If the bottleneck is upstream somewhere else, there is nothing much we can do.

@jolespin (Author)

An example would be computing a pairwise correlation matrix. I haven't profiled it yet to see where the bottleneck is, but I'm basing my comparison on the fast C implementation available in R's WGCNA package (the function is called bicor).

@jolespin (Author)

Here is an example of the WGCNA package wrapping C code in its bicor function:

https://github.com/cran/WGCNA/blob/678c9da2b71c5b4a43cb55005224c243f411abc8/R/corFunctions.R#L3

Look at the difference in performance:

# ==============================================================================
from collections import defaultdict
import numpy as np
import pandas as pd
from astropy.stats import biweight_midcorrelation

from rpy2 import robjects, rinterface
from rpy2.robjects.packages import importr
try:
    from rpy2.rinterface import RRuntimeError
except ImportError:
    from rpy2.rinterface_lib.embedded import RRuntimeError
from rpy2.robjects import pandas2ri
pandas2ri.activate()
NULL = robjects.rinterface.NULL
rinterface.set_writeconsole_regular(None)

# Load R Package
wgcna = importr("WGCNA")

def bicor_wgcna(X):
    labels = X.columns
    rDF_sim = wgcna.bicor(pandas2ri.py2ri(X))
    return pd.DataFrame(pandas2ri.ri2py(rDF_sim), index=labels, columns=labels)

def bicor_astropy(X):
    A = X.values
    labels = X.index
    N = len(labels)
    
    data = defaultdict(dict)
    for i in range(N):
        node_i = labels[i]
        x = A[i]
        for j in range(i, N):
            node_j = labels[j]
            y = A[j]
            data[node_i][node_j] = data[node_j][node_i] = biweight_midcorrelation(x,y)
    df_bicor = pd.DataFrame(data)
    return df_bicor

%timeit df_bicor__wgcna = bicor_wgcna(X.T)
%timeit df_bicor__astropy = bicor_astropy(X)

np.allclose(df_bicor__wgcna, df_bicor__astropy)

# 36.1 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 3.88 s ± 168 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# True

That's about 108x faster...

@mhvk (Contributor) commented Nov 4, 2019

Having had what is admittedly only a quick look at the current astropy code, I think it could be sped up comparably just by allowing multi-dimensional input: let the work-horse biweight_midcovariance deal with multi-D input by properly tracking the axis to be worked on, and do the work on all other axes simultaneously using broadcasting. That way, all the hard work is pushed into the C code used by numpy, and it should be about as fast as anything one could write in Cython.
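The broadcasting idea could be sketched roughly like this, using the usual biweight formula. This is a hypothetical illustration, not astropy's API: `bicor_matrix` is an invented name, NaNs are not handled, and rows are treated as variables:

```python
import numpy as np

def bicor_matrix(A):
    """Pairwise biweight midcorrelation of the rows of A, vectorized.

    Sketch of the broadcasting approach: A has shape
    (n_variables, n_samples) and must contain no NaNs.
    """
    med = np.median(A, axis=1, keepdims=True)
    mad = np.median(np.abs(A - med), axis=1, keepdims=True)
    u = (A - med) / (9.0 * mad)       # scaled deviations, all rows at once
    w = (1.0 - u**2) ** 2             # biweight, zeroed outside |u| < 1
    w[np.abs(u) >= 1.0] = 0.0
    d = (A - med) * w                 # weighted deviations
    norm = np.sqrt(np.sum(d**2, axis=1))
    # All pairwise numerators in one matrix product, then normalize
    return (d @ d.T) / np.outer(norm, norm)
```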
