
Create a cython version of biweight_midcorrelation to massively speed up computation #9483

Open
jolespin opened this issue Oct 26, 2019 · 4 comments


@jolespin

Description

Create a Cython version of biweight_midcorrelation, because the current implementation is very slow. For example, pandas has nancorr and nancorr_spearman, which are extremely fast compared to the plain pairwise iteration that could be done using scipy.

Would it be possible to add this functionality to astropy? Currently, I cannot use this robust metric on my actual datasets, which are quite large; I can only use it on test datasets because it takes so long.

Additional context

Here's an example of the nancorr function implemented in Cython: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos.pyx

# Excerpt from pandas/_libs/algos.pyx; it relies on that module's cimports
# (cython, numpy type definitions, sqrt from libc.math) and its NaN constant.
# Pairwise correlation/covariance


@cython.boundscheck(False)
@cython.wraparound(False)
def nancorr(const float64_t[:, :] mat, bint cov=0, minp=None):
    cdef:
        Py_ssize_t i, j, xi, yi, N, K
        bint minpv
        ndarray[float64_t, ndim=2] result
        ndarray[uint8_t, ndim=2] mask
        int64_t nobs = 0
        float64_t vx, vy, sumx, sumy, sumxx, sumyy, meanx, meany, divisor

    N, K = (<object>mat).shape

    if minp is None:
        minpv = 1
    else:
        minpv = <int>minp

    result = np.empty((K, K), dtype=np.float64)
    mask = np.isfinite(mat).view(np.uint8)

    with nogil:
        for xi in range(K):
            for yi in range(xi + 1):
                nobs = sumxx = sumyy = sumx = sumy = 0
                for i in range(N):
                    if mask[i, xi] and mask[i, yi]:
                        vx = mat[i, xi]
                        vy = mat[i, yi]
                        nobs += 1
                        sumx += vx
                        sumy += vy

                if nobs < minpv:
                    result[xi, yi] = result[yi, xi] = NaN
                else:
                    meanx = sumx / nobs
                    meany = sumy / nobs

                    # now the cov numerator
                    sumx = 0

                    for i in range(N):
                        if mask[i, xi] and mask[i, yi]:
                            vx = mat[i, xi] - meanx
                            vy = mat[i, yi] - meany

                            sumx += vx * vy
                            sumxx += vx * vx
                            sumyy += vy * vy

                    divisor = (nobs - 1.0) if cov else sqrt(sumxx * sumyy)

                    if divisor != 0:
                        result[xi, yi] = result[yi, xi] = sumx / divisor
                    else:
                        result[xi, yi] = result[yi, xi] = NaN

    return result
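For context, this routine is what backs `DataFrame.corr` in pandas (for `method="pearson"`), so its speed is easy to observe from plain Python. A minimal sketch with random data; exact timings will vary by machine:

```python
import numpy as np
import pandas as pd

# Random 1000 x 50 matrix with a sprinkling of NaNs as a stand-in dataset
rng = np.random.default_rng(0)
arr = rng.normal(size=(1000, 50))
arr[rng.random(size=arr.shape) < 0.01] = np.nan
df = pd.DataFrame(arr)

# DataFrame.corr dispatches to the Cython nancorr shown above,
# handling the NaNs pairwise and enforcing a minimum observation count
corr = df.corr(method="pearson", min_periods=10)
print(corr.shape)  # (50, 50)
```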
@pllim (Member) commented Oct 26, 2019

Hi. Can you post an example usage that causes slow performance? Did you try to profile it? If the bottleneck is upstream somewhere else, there is nothing much we can do.

@jolespin (Author)

An example would be computing a pairwise correlation matrix. I haven't profiled it yet to see where the bottleneck is, but I'm basing my comparison on the fast C implementation available in R's WGCNA package (the function is called bicor).

@jolespin (Author)

Here is an example of the WGCNA package wrapping C code in its bicor function:

https://github.com/cran/WGCNA/blob/678c9da2b71c5b4a43cb55005224c243f411abc8/R/corFunctions.R#L3

Look at the difference in performance:

# ==============================================================================
from collections import defaultdict
import numpy as np
import pandas as pd
from astropy.stats import biweight_midcorrelation

from rpy2 import robjects, rinterface
from rpy2.robjects.packages import importr
try:
    from rpy2.rinterface import RRuntimeError
except ImportError:
    from rpy2.rinterface_lib.embedded import RRuntimeError
from rpy2.robjects import pandas2ri
pandas2ri.activate()
NULL = robjects.rinterface.NULL
rinterface.set_writeconsole_regular(None)

# Load R Package
wgcna = importr("WGCNA")

def bicor_wgcna(X):
    labels = X.columns
    rDF_sim = wgcna.bicor(pandas2ri.py2ri(X))
    return pd.DataFrame(pandas2ri.ri2py(rDF_sim), index=labels, columns=labels)

def bicor_astropy(X):
    A = X.values
    labels = X.index
    N = len(labels)
    
    data = defaultdict(dict)
    for i in range(N):
        node_i = labels[i]
        x = A[i]
        for j in range(i, N):
            node_j = labels[j]
            y = A[j]
            data[node_i][node_j] = data[node_j][node_i] = biweight_midcorrelation(x,y)
    df_bicor = pd.DataFrame(data)
    return df_bicor

%timeit df_bicor__wgcna = bicor_wgcna(X.T)
%timeit df_bicor__astropy = bicor_astropy(X)

np.allclose(df_bicor__wgcna, df_bicor__astropy)

# 36.1 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 3.88 s ± 168 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# True

That's about 108x faster...

@mhvk (Contributor) commented Nov 4, 2019

Having had what is admittedly only a quick look at the current astropy code, I think it could be sped up comparably just by allowing multi-dimensional input: let the work-horse biweight_midcovariance deal with multi-D input by properly tracking the axis to be worked on, and do the work on all other axes simultaneously using broadcasting. That way, all the hard work is pushed into the C code used by numpy, and it should be about as fast as anything one could write in Cython.
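The broadcasting idea could be sketched roughly like this, using the usual biweight formula. This is a hypothetical illustration, not astropy's API: `bicor_matrix` is an invented name, NaNs are not handled, and rows are treated as variables:

```python
import numpy as np

def bicor_matrix(A):
    """Pairwise biweight midcorrelation of the rows of A, vectorized.

    Sketch of the broadcasting approach: A has shape
    (n_variables, n_samples) and must contain no NaNs.
    """
    med = np.median(A, axis=1, keepdims=True)
    mad = np.median(np.abs(A - med), axis=1, keepdims=True)
    u = (A - med) / (9.0 * mad)       # scaled deviations, all rows at once
    w = (1.0 - u**2) ** 2             # biweight, zeroed outside |u| < 1
    w[np.abs(u) >= 1.0] = 0.0
    d = (A - med) * w                 # weighted deviations
    norm = np.sqrt(np.sum(d**2, axis=1))
    # All pairwise numerators in one matrix product, then normalize
    return (d @ d.T) / np.outer(norm, norm)
```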
