Commit

update tutorial to include spearman correlation.
dominiktraxl committed Sep 23, 2018
1 parent 731f565 commit 9818652
Showing 3 changed files with 116 additions and 88 deletions.
83 changes: 48 additions & 35 deletions doc/source/tutorials/pairwise_correlations.ipynb
@@ -33,9 +33,9 @@
"source": [
"In this short tutorial, we'll demonstrate how DeepGraph can be used to efficiently compute very large correlation matrices in parallel, with full control over RAM usage.\n",
"\n",
"Assume you have a set of ``n_samples`` samples, each comprised of ``n_features`` features and you want to compute the `Pearson correlation coefficients <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_ between all pairs of samples. If your data is small enough, you may use `scipy.stats.pearsonr <https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html#scipy.stats.pearsonr>`_ or `numpy.corrcoef <https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html>`_, but for large data, neither of these methods is feasible. Scipy's pearsonr would be very slow, since you'd have to compute pair-wise correlations in a double loop, and numpy's corrcoef would most likely blow your RAM.\n",
"Assume you have a set of ``n_samples`` samples, each comprised of ``n_features`` features and you want to compute the `Pearson correlation coefficients <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_ between all pairs of features (for the `Spearman's rank correlation coefficients <https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient>`_, see the *Note*-box below). If your data is small enough, you may use `scipy.stats.pearsonr <https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html#scipy.stats.pearsonr>`_ or `numpy.corrcoef <https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html>`_, but for large data, neither of these methods is feasible. Scipy's pearsonr would be very slow, since you'd have to compute pair-wise correlations in a double loop, and numpy's corrcoef would most likely blow your RAM.\n",
"\n",
"Using DeepGraph's :py:meth:`create_edges <.create_edges>` method, you can compute all pair-wise correlations efficiently. In this tutorial, the samples are stored on disc and only the relevant subset of samples for each iteration will be loaded into memory by the computing nodes. Parallelization is achieved by using python's standard library `multiprocessing <https://docs.python.org/3.6/library/multiprocessing.html>`_, but it should be straight-forward to modify the code to accommodate other parallelization libraries. It should also be straight-forward to modify the code in order to compute other correlation/distance/similarity-measures between a set of samples. \n",
"Using DeepGraph's :py:meth:`create_edges <.create_edges>` method, you can compute all pair-wise correlations efficiently. In this tutorial, the data is stored on disc and only the relevant subset of features for each iteration will be loaded into memory by the computing nodes. Parallelization is achieved by using python's standard library `multiprocessing <https://docs.python.org/3.6/library/multiprocessing.html>`_, but it should be straight-forward to modify the code to accommodate other parallelization libraries. It should also be straight-forward to modify the code in order to compute other correlation/distance/similarity-measures between a set of features. \n",
"\n",
"First of all, we need to import some packages"
]
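For a sense of why ``numpy.corrcoef`` is ruled out for large inputs, a back-of-envelope sketch (using a hypothetical feature count, larger than the one used below): the dense output alone grows quadratically with the number of features.

# illustrative only: RAM needed just for the dense corrcoef output
n_features_hypothetical = 50000
matrix_bytes = n_features_hypothetical ** 2 * 8    # float64 entries
print(matrix_bytes / 1e9)                           # -> 20.0 GB, before any intermediates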
@@ -63,7 +63,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a set variables and store it as a 2d-matrix ``X`` (``shape=(n_samples, n_features)``) on disc. To speed up the computation of the Pearson correlation coefficients later on, we whiten each variable."
"Let's create a set of variables and store it as a 2d-matrix ``X`` (``shape=(n_features, n_samples)``) on disc. To speed up the computation of the correlation coefficients later on, we whiten each variable."
]
},
{
@@ -75,9 +75,12 @@
"# create observations\n",
"from numpy.random import RandomState\n",
"prng = RandomState(0)\n",
"n_samples = int(5e3)\n",
"n_features = int(1e2)\n",
"X = prng.randint(100, size=(n_samples, n_features)).astype(np.float64)\n",
"n_features = int(5e3)\n",
"n_samples = int(1e2)\n",
"X = prng.randint(100, size=(n_features, n_samples)).astype(np.float64)\n",
"\n",
"# uncomment the next line to compute ranked variables for Spearman's correlation coefficients\n",
"# X = X.argsort(axis=1).argsort(axis=1)\n",
"\n",
"# whiten variables for fast parallel computation later on\n",
"X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)\n",
@@ -86,6 +89,16 @@
"np.save('samples', X)\n"
]
},
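Why whitening helps (a minimal, self-contained sketch, not taken from the tutorial itself): once a row has zero mean and unit standard deviation, Pearson's r between two rows is just their dot product divided by the number of samples, which is what the connector function below exploits.

# sketch: Pearson's r of whitened rows reduces to a dot product / n
import numpy as np

rng = np.random.RandomState(42)
a, b = rng.rand(100), rng.rand(100)
aw = (a - a.mean()) / a.std()
bw = (b - b.mean()) / b.std()

r_dot = (aw * bw).sum() / aw.size        # dot product of the whitened rows
r_ref = np.corrcoef(a, b)[0, 1]          # reference Pearson coefficient
assert np.allclose(r_dot, r_ref)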
{
"cell_type": "raw",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
".. note:: \n",
" On the computation of the `Spearman's rank correlation coefficients <https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient>`_: Since the Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables, it suffices to uncomment the indicated line in the above code-block in order to compute the Spearman's rank correlation coefficients in the following. "
]
},
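A quick self-contained check of that claim (illustrative only, and restricted to tie-free data, since ``argsort().argsort()`` assigns distinct ranks to tied values rather than the average ranks scipy uses):

# sketch: for tie-free data, Pearson on double-argsort ranks equals Spearman's rho
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.RandomState(1)
x, y = rng.rand(100), rng.rand(100)      # continuous draws, hence no ties

rx = x.argsort().argsort()               # 0-based ranks
ry = y.argsort().argsort()

assert np.allclose(pearsonr(rx, ry)[0], spearmanr(x, y)[0])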
{
"cell_type": "raw",
"metadata": {
@@ -113,13 +126,13 @@
"\n",
"# connector function to compute pairwise pearson correlations\n",
"def corr(index_s, index_t):\n",
" samples_s = X[index_s]\n",
" samples_t = X[index_t]\n",
" corr = np.einsum('ij,ij->i', samples_s, samples_t) / n_features\n",
" features_s = X[index_s]\n",
" features_t = X[index_t]\n",
" corr = np.einsum('ij,ij->i', features_s, features_t) / n_samples\n",
" return corr\n",
"\n",
"# index array for parallelization\n",
"pos_array = np.array(np.linspace(0, n_samples*(n_samples-1)//2, n_processes), dtype=int)\n",
"pos_array = np.array(np.linspace(0, n_features*(n_features-1)//2, n_processes), dtype=int)\n",
"\n",
"# parallel computation\n",
"def create_ei(i):\n",
@@ -294,40 +307,40 @@
"name": "stdout",
"output_type": "stream",
"text": [
" 252632 function calls (247856 primitive calls) in 6.961 seconds\n",
" 244867 function calls (239629 primitive calls) in 14.193 seconds\n",
"\n",
" Ordered by: internal time\n",
" List reduced from 528 to 20 due to restriction <20>\n",
" List reduced from 541 to 20 due to restriction <20>\n",
"\n",
" ncalls tottime percall cumtime percall filename:lineno(function)\n",
" 250 3.990 0.016 3.997 0.016 memmap.py:334(__getitem__)\n",
" 125 1.216 0.010 1.216 0.010 {built-in method numpy.core.multiarray.c_einsum}\n",
" 125 0.328 0.003 5.582 0.045 deepgraph.py:4553(map)\n",
" 2 0.302 0.151 0.302 0.151 {method 'get_labels' of 'pandas._libs.hashtable.Int64HashTable' objects}\n",
" 250 0.114 0.000 0.124 0.000 internals.py:4473(_stack_arrays)\n",
" 125 0.086 0.001 6.017 0.048 deepgraph.py:5289(_select_and_return)\n",
" 4 0.084 0.021 0.084 0.021 {built-in method numpy.core.multiarray.concatenate}\n",
" 129 0.083 0.001 0.083 0.001 {method 'take' of 'numpy.ndarray' objects}\n",
" 2 0.066 0.033 0.180 0.090 algorithms.py:429(safe_sort)\n",
" 126 0.047 0.000 0.047 0.000 api.py:93(_sanitize_and_check)\n",
" 52 0.044 0.001 0.044 0.001 _weakrefset.py:36(__init__)\n",
" 125 0.042 0.000 0.042 0.000 {deepgraph._triu_indices._reduce_triu_indices}\n",
" 125 0.042 0.000 0.042 0.000 {built-in method deepgraph._triu_indices._triu_indices}\n",
" 2 0.039 0.020 0.039 0.020 function_base.py:4684(delete)\n",
" 2 0.032 0.016 0.032 0.016 {built-in method numpy.core.multiarray.putmask}\n",
" 4 0.026 0.006 0.026 0.006 {built-in method pandas._libs.algos.ensure_int16}\n",
" 125 0.021 0.000 5.234 0.042 <ipython-input-3-ddd5575c35f5>:12(corr)\n",
"49804/49196 0.015 0.000 0.074 0.000 {built-in method builtins.isinstance}\n",
" 1 0.015 0.015 6.893 6.893 deepgraph.py:4783(_matrix_iterator)\n",
" 1 0.013 0.013 6.961 6.961 deepgraph.py:178(create_edges)\n",
" 250 9.355 0.037 9.361 0.037 memmap.py:334(__getitem__)\n",
" 125 1.584 0.013 1.584 0.013 {built-in method numpy.core.multiarray.c_einsum}\n",
" 125 1.012 0.008 12.013 0.096 deepgraph.py:4558(map)\n",
" 2 0.581 0.290 0.581 0.290 {method 'get_labels' of 'pandas._libs.hashtable.Int64HashTable' objects}\n",
" 1 0.301 0.301 0.414 0.414 multi.py:795(_engine)\n",
" 5 0.157 0.031 0.157 0.031 {built-in method numpy.core.multiarray.concatenate}\n",
" 250 0.157 0.001 0.170 0.001 internals.py:5017(_stack_arrays)\n",
" 2 0.105 0.053 0.105 0.053 {pandas._libs.algos.take_1d_int64_int64}\n",
" 889 0.094 0.000 0.094 0.000 {method 'reduce' of 'numpy.ufunc' objects}\n",
" 125 0.089 0.001 12.489 0.100 deepgraph.py:5294(_select_and_return)\n",
" 125 0.074 0.001 0.074 0.001 {deepgraph._triu_indices._reduce_triu_indices}\n",
" 125 0.066 0.001 0.066 0.001 {built-in method deepgraph._triu_indices._triu_indices}\n",
" 4 0.038 0.009 0.038 0.009 {built-in method pandas._libs.algos.ensure_int16}\n",
" 125 0.033 0.000 10.979 0.088 <ipython-input-3-26c4f59cd911>:12(corr)\n",
" 2 0.028 0.014 0.028 0.014 function_base.py:4703(delete)\n",
" 1 0.027 0.027 14.163 14.163 deepgraph.py:4788(_matrix_iterator)\n",
" 1 0.027 0.027 0.113 0.113 multi.py:56(_codes_to_ints)\n",
"45771/45222 0.020 0.000 0.043 0.000 {built-in method builtins.isinstance}\n",
" 1 0.019 0.019 14.193 14.193 deepgraph.py:183(create_edges)\n",
" 2 0.012 0.006 0.700 0.350 algorithms.py:576(factorize)\n",
"\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<pstats.Stats at 0x7f5a9678f4e0>"
"<pstats.Stats at 0x7f814fb2bd68>"
]
},
"execution_count": 7,
@@ -343,7 +356,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, most of the time is spent by getting the requested samples in the corr-function, followed by computing the correlation values themselves. "
"As you can see, most of the time is spent by getting the requested features in the corr-function, followed by computing the correlation values themselves. "
]
}
],
@@ -364,7 +377,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
"version": "3.6.5"
}
},
"nbformat": 4,
30 changes: 20 additions & 10 deletions doc/source/tutorials/pairwise_correlations.py
@@ -5,6 +5,7 @@

# In[1]:


# data i/o
import os

@@ -18,16 +19,20 @@
import deepgraph as dg


# Let's create a set variables and store it as a 2d-matrix ``X`` (``shape=(n_samples, n_features)``) on disc. To speed up the computation of the Pearson correlation coefficients later on, we whiten each variable.
# Let's create a set of variables and store it as a 2d-matrix ``X`` (``shape=(n_features, n_samples)``) on disc. To speed up the computation of the correlation coefficients later on, we whiten each variable.

# In[2]:


# create observations
from numpy.random import RandomState
prng = RandomState(0)
n_samples = int(5e3)
n_features = int(1e2)
X = prng.randint(100, size=(n_samples, n_features)).astype(np.float64)
n_features = int(5e3)
n_samples = int(1e2)
X = prng.randint(100, size=(n_features, n_samples)).astype(np.float64)

# uncomment the next line to compute ranked variables for Spearman's correlation coefficients
# X = X.argsort(axis=1).argsort(axis=1)

# whiten variables for fast parallel computation later on
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
@@ -38,6 +43,7 @@

# In[3]:


# parameters (change these to control RAM usage)
step_size = 1e5
n_processes = 100
@@ -50,13 +56,13 @@

# connector function to compute pairwise pearson correlations
def corr(index_s, index_t):
samples_s = X[index_s]
samples_t = X[index_t]
corr = np.einsum('ij,ij->i', samples_s, samples_t) / n_features
features_s = X[index_s]
features_t = X[index_t]
corr = np.einsum('ij,ij->i', features_s, features_t) / n_samples
return corr

# index array for parallelization
pos_array = np.array(np.linspace(0, n_samples*(n_samples-1)//2, n_processes), dtype=int)
pos_array = np.array(np.linspace(0, n_features*(n_features-1)//2, n_processes), dtype=int)
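
# A note on the two statements above (an illustrative sketch, not part of this
# commit): np.einsum('ij,ij->i', A, B) computes one dot product per row, so the
# connector evaluates every requested (source, target) feature pair in a single
# vectorized call, and pos_array simply cuts the n_features*(n_features-1)/2
# upper-triangle pairs into n_processes consecutive chunks.
demo_A = np.arange(6.).reshape(2, 3)     # two hypothetical "source" rows
demo_B = np.ones((2, 3))                 # two hypothetical "target" rows
assert np.allclose(np.einsum('ij,ij->i', demo_A, demo_B), (demo_A * demo_B).sum(axis=1))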

# parallel computation
def create_ei(i):
@@ -88,6 +94,7 @@ def create_ei(i):

# In[4]:


# store correlation values
files = os.listdir('tmp/correlations/')
files.sort()
@@ -102,6 +109,7 @@ def create_ei(i):

# In[5]:


# load correlation table
e = pd.read_hdf('e.h5')
print(e)
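
# Optional, illustrative only: if n_features is small enough that a dense
# n_features x n_features matrix fits in RAM, the edge list can be pivoted into
# one. This assumes (as the connector above suggests) a two-level (s, t) index
# and a single 'corr' column; cells below the diagonal stay NaN because only
# upper-triangle pairs are stored.
# corr_matrix = e['corr'].unstack()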
@@ -111,13 +119,15 @@ def create_ei(i):

# In[6]:


g = dg.DeepGraph(v)
p = get_ipython().magic('prun -r g.create_edges(connectors=corr, step_size=step_size)')
p = get_ipython().run_line_magic('prun', '-r g.create_edges(connectors=corr, step_size=step_size)')


# In[7]:


p.print_stats(20)


# As you can see, most of the time is spent by getting the requested samples in the corr-function, followed by computing the correlation values themselves.
# As you can see, most of the time is spent by getting the requested features in the corr-function, followed by computing the correlation values themselves.
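
# A possible tuning note, based on the memmap.__getitem__ entries in the profile
# rather than on anything tested in this commit: the samples file is evidently
# read through a numpy memory map. If the whitened array fits comfortably in RAM
# on every computing node, loading it eagerly avoids that per-chunk read cost:
# X = np.load('samples.npy')             # eager load instead of a memory map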
