Commit

update tutorial to include spearman correlation.
dominiktraxl committed Sep 23, 2018
1 parent 731f565 commit 9818652
Showing 3 changed files with 116 additions and 88 deletions.
83 changes: 48 additions & 35 deletions doc/source/tutorials/pairwise_correlations.ipynb
@@ -33,9 +33,9 @@
"source": [
"In this short tutorial, we'll demonstrate how DeepGraph can be used to efficiently compute very large correlation matrices in parallel, with full control over RAM usage.\n",
"\n",
"Assume you have a set of ``n_samples`` samples, each comprised of ``n_features`` features and you want to compute the `Pearson correlation coefficients <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_ between all pairs of samples. If your data is small enough, you may use `scipy.stats.pearsonr <https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html#scipy.stats.pearsonr>`_ or `numpy.corrcoef <https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html>`_, but for large data, neither of these methods is feasible. Scipy's pearsonr would be very slow, since you'd have to compute pair-wise correlations in a double loop, and numpy's corrcoef would most likely blow your RAM.\n",
"Assume you have a set of ``n_samples`` samples, each comprised of ``n_features`` features and you want to compute the `Pearson correlation coefficients <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_ between all pairs of features (for the `Spearman's rank correlation coefficients <https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient>`_, see the *Note*-box below). If your data is small enough, you may use `scipy.stats.pearsonr <https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html#scipy.stats.pearsonr>`_ or `numpy.corrcoef <https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html>`_, but for large data, neither of these methods is feasible. Scipy's pearsonr would be very slow, since you'd have to compute pair-wise correlations in a double loop, and numpy's corrcoef would most likely blow your RAM.\n",
"\n",
"Using DeepGraph's :py:meth:`create_edges <.create_edges>` method, you can compute all pair-wise correlations efficiently. In this tutorial, the samples are stored on disc and only the relevant subset of samples for each iteration will be loaded into memory by the computing nodes. Parallelization is achieved by using python's standard library `multiprocessing <https://docs.python.org/3.6/library/multiprocessing.html>`_, but it should be straight-forward to modify the code to accommodate other parallelization libraries. It should also be straight-forward to modify the code in order to compute other correlation/distance/similarity-measures between a set of samples. \n",
"Using DeepGraph's :py:meth:`create_edges <.create_edges>` method, you can compute all pair-wise correlations efficiently. In this tutorial, the data is stored on disc and only the relevant subset of features for each iteration will be loaded into memory by the computing nodes. Parallelization is achieved by using python's standard library `multiprocessing <https://docs.python.org/3.6/library/multiprocessing.html>`_, but it should be straight-forward to modify the code to accommodate other parallelization libraries. It should also be straight-forward to modify the code in order to compute other correlation/distance/similarity-measures between a set of features. \n",
"\n",
"First of all, we need to import some packages"
]
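For a sense of why ``numpy.corrcoef`` is ruled out for large inputs, a back-of-envelope sketch (using a hypothetical feature count, larger than the one used below): the dense output alone grows quadratically with the number of features.

# illustrative only: RAM needed just for the dense corrcoef output
n_features_hypothetical = 50000
matrix_bytes = n_features_hypothetical ** 2 * 8    # float64 entries
print(matrix_bytes / 1e9)                           # -> 20.0 GB, before any intermediates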
@@ -63,7 +63,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a set variables and store it as a 2d-matrix ``X`` (``shape=(n_samples, n_features)``) on disc. To speed up the computation of the Pearson correlation coefficients later on, we whiten each variable."
"Let's create a set of variables and store it as a 2d-matrix ``X`` (``shape=(n_features, n_samples)``) on disc. To speed up the computation of the correlation coefficients later on, we whiten each variable."
]
},
{
@@ -75,9 +75,12 @@
"# create observations\n",
"from numpy.random import RandomState\n",
"prng = RandomState(0)\n",
"n_samples = int(5e3)\n",
"n_features = int(1e2)\n",
"X = prng.randint(100, size=(n_samples, n_features)).astype(np.float64)\n",
"n_features = int(5e3)\n",
"n_samples = int(1e2)\n",
"X = prng.randint(100, size=(n_features, n_samples)).astype(np.float64)\n",
"\n",
"# uncomment the next line to compute ranked variables for Spearman's correlation coefficients\n",
"# X = X.argsort(axis=1).argsort(axis=1)\n",
"\n",
"# whiten variables for fast parallel computation later on\n",
"X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)\n",
@@ -86,6 +89,16 @@
"np.save('samples', X)\n"
]
},
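Why whitening helps (a minimal, self-contained sketch, not taken from the tutorial itself): once a row has zero mean and unit standard deviation, Pearson's r between two rows is just their dot product divided by the number of samples, which is what the connector function below exploits.

# sketch: Pearson's r of whitened rows reduces to a dot product / n
import numpy as np

rng = np.random.RandomState(42)
a, b = rng.rand(100), rng.rand(100)
aw = (a - a.mean()) / a.std()
bw = (b - b.mean()) / b.std()

r_dot = (aw * bw).sum() / aw.size        # dot product of the whitened rows
r_ref = np.corrcoef(a, b)[0, 1]          # reference Pearson coefficient
assert np.allclose(r_dot, r_ref)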
{
"cell_type": "raw",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
".. note:: \n",
" On the computation of the `Spearman's rank correlation coefficients <https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient>`_: Since the Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables, it suffices to uncomment the indicated line in the above code-block in order to compute the Spearman's rank correlation coefficients in the following. "
]
},
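A quick self-contained check of that claim (illustrative only, and restricted to tie-free data, since ``argsort().argsort()`` assigns distinct ranks to tied values rather than the average ranks scipy uses):

# sketch: for tie-free data, Pearson on double-argsort ranks equals Spearman's rho
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.RandomState(1)
x, y = rng.rand(100), rng.rand(100)      # continuous draws, hence no ties

rx = x.argsort().argsort()               # 0-based ranks
ry = y.argsort().argsort()

assert np.allclose(pearsonr(rx, ry)[0], spearmanr(x, y)[0])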
{
"cell_type": "raw",
"metadata": {
@@ -113,13 +126,13 @@
"\n",
"# connector function to compute pairwise pearson correlations\n",
"def corr(index_s, index_t):\n",
" samples_s = X[index_s]\n",
" samples_t = X[index_t]\n",
" corr = np.einsum('ij,ij->i', samples_s, samples_t) / n_features\n",
" features_s = X[index_s]\n",
" features_t = X[index_t]\n",
" corr = np.einsum('ij,ij->i', features_s, features_t) / n_samples\n",
" return corr\n",
"\n",
"# index array for parallelization\n",
"pos_array = np.array(np.linspace(0, n_samples*(n_samples-1)//2, n_processes), dtype=int)\n",
"pos_array = np.array(np.linspace(0, n_features*(n_features-1)//2, n_processes), dtype=int)\n",
"\n",
"# parallel computation\n",
"def create_ei(i):\n",
@@ -294,40 +307,40 @@
"name": "stdout",
"output_type": "stream",
"text": [
" 252632 function calls (247856 primitive calls) in 6.961 seconds\n",
" 244867 function calls (239629 primitive calls) in 14.193 seconds\n",
"\n",
" Ordered by: internal time\n",
" List reduced from 528 to 20 due to restriction <20>\n",
" List reduced from 541 to 20 due to restriction <20>\n",
"\n",
" ncalls tottime percall cumtime percall filename:lineno(function)\n",
" 250 3.990 0.016 3.997 0.016 memmap.py:334(__getitem__)\n",
" 125 1.216 0.010 1.216 0.010 {built-in method numpy.core.multiarray.c_einsum}\n",
" 125 0.328 0.003 5.582 0.045 deepgraph.py:4553(map)\n",
" 2 0.302 0.151 0.302 0.151 {method 'get_labels' of 'pandas._libs.hashtable.Int64HashTable' objects}\n",
" 250 0.114 0.000 0.124 0.000 internals.py:4473(_stack_arrays)\n",
" 125 0.086 0.001 6.017 0.048 deepgraph.py:5289(_select_and_return)\n",
" 4 0.084 0.021 0.084 0.021 {built-in method numpy.core.multiarray.concatenate}\n",
" 129 0.083 0.001 0.083 0.001 {method 'take' of 'numpy.ndarray' objects}\n",
" 2 0.066 0.033 0.180 0.090 algorithms.py:429(safe_sort)\n",
" 126 0.047 0.000 0.047 0.000 api.py:93(_sanitize_and_check)\n",
" 52 0.044 0.001 0.044 0.001 _weakrefset.py:36(__init__)\n",
" 125 0.042 0.000 0.042 0.000 {deepgraph._triu_indices._reduce_triu_indices}\n",
" 125 0.042 0.000 0.042 0.000 {built-in method deepgraph._triu_indices._triu_indices}\n",
" 2 0.039 0.020 0.039 0.020 function_base.py:4684(delete)\n",
" 2 0.032 0.016 0.032 0.016 {built-in method numpy.core.multiarray.putmask}\n",
" 4 0.026 0.006 0.026 0.006 {built-in method pandas._libs.algos.ensure_int16}\n",
" 125 0.021 0.000 5.234 0.042 <ipython-input-3-ddd5575c35f5>:12(corr)\n",
"49804/49196 0.015 0.000 0.074 0.000 {built-in method builtins.isinstance}\n",
" 1 0.015 0.015 6.893 6.893 deepgraph.py:4783(_matrix_iterator)\n",
" 1 0.013 0.013 6.961 6.961 deepgraph.py:178(create_edges)\n",
" 250 9.355 0.037 9.361 0.037 memmap.py:334(__getitem__)\n",
" 125 1.584 0.013 1.584 0.013 {built-in method numpy.core.multiarray.c_einsum}\n",
" 125 1.012 0.008 12.013 0.096 deepgraph.py:4558(map)\n",
" 2 0.581 0.290 0.581 0.290 {method 'get_labels' of 'pandas._libs.hashtable.Int64HashTable' objects}\n",
" 1 0.301 0.301 0.414 0.414 multi.py:795(_engine)\n",
" 5 0.157 0.031 0.157 0.031 {built-in method numpy.core.multiarray.concatenate}\n",
" 250 0.157 0.001 0.170 0.001 internals.py:5017(_stack_arrays)\n",
" 2 0.105 0.053 0.105 0.053 {pandas._libs.algos.take_1d_int64_int64}\n",
" 889 0.094 0.000 0.094 0.000 {method 'reduce' of 'numpy.ufunc' objects}\n",
" 125 0.089 0.001 12.489 0.100 deepgraph.py:5294(_select_and_return)\n",
" 125 0.074 0.001 0.074 0.001 {deepgraph._triu_indices._reduce_triu_indices}\n",
" 125 0.066 0.001 0.066 0.001 {built-in method deepgraph._triu_indices._triu_indices}\n",
" 4 0.038 0.009 0.038 0.009 {built-in method pandas._libs.algos.ensure_int16}\n",
" 125 0.033 0.000 10.979 0.088 <ipython-input-3-26c4f59cd911>:12(corr)\n",
" 2 0.028 0.014 0.028 0.014 function_base.py:4703(delete)\n",
" 1 0.027 0.027 14.163 14.163 deepgraph.py:4788(_matrix_iterator)\n",
" 1 0.027 0.027 0.113 0.113 multi.py:56(_codes_to_ints)\n",
"45771/45222 0.020 0.000 0.043 0.000 {built-in method builtins.isinstance}\n",
" 1 0.019 0.019 14.193 14.193 deepgraph.py:183(create_edges)\n",
" 2 0.012 0.006 0.700 0.350 algorithms.py:576(factorize)\n",
"\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<pstats.Stats at 0x7f5a9678f4e0>"
"<pstats.Stats at 0x7f814fb2bd68>"
]
},
"execution_count": 7,
@@ -343,7 +356,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, most of the time is spent by getting the requested samples in the corr-function, followed by computing the correlation values themselves. "
"As you can see, most of the time is spent by getting the requested features in the corr-function, followed by computing the correlation values themselves. "
]
}
],
@@ -364,7 +377,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
"version": "3.6.5"
}
},
"nbformat": 4,
30 changes: 20 additions & 10 deletions doc/source/tutorials/pairwise_correlations.py
@@ -5,6 +5,7 @@

# In[1]:


# data i/o
import os

@@ -18,16 +19,20 @@
import deepgraph as dg


# Let's create a set variables and store it as a 2d-matrix ``X`` (``shape=(n_samples, n_features)``) on disc. To speed up the computation of the Pearson correlation coefficients later on, we whiten each variable.
# Let's create a set of variables and store it as a 2d-matrix ``X`` (``shape=(n_features, n_samples)``) on disc. To speed up the computation of the correlation coefficients later on, we whiten each variable.

# In[2]:


# create observations
from numpy.random import RandomState
prng = RandomState(0)
n_samples = int(5e3)
n_features = int(1e2)
X = prng.randint(100, size=(n_samples, n_features)).astype(np.float64)
n_features = int(5e3)
n_samples = int(1e2)
X = prng.randint(100, size=(n_features, n_samples)).astype(np.float64)

# uncomment the next line to compute ranked variables for Spearman's correlation coefficients
# X = X.argsort(axis=1).argsort(axis=1)

# whiten variables for fast parallel computation later on
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
@@ -38,6 +43,7 @@

# In[3]:


# parameters (change these to control RAM usage)
step_size = 1e5
n_processes = 100
@@ -50,13 +56,13 @@

# connector function to compute pairwise pearson correlations
def corr(index_s, index_t):
samples_s = X[index_s]
samples_t = X[index_t]
corr = np.einsum('ij,ij->i', samples_s, samples_t) / n_features
features_s = X[index_s]
features_t = X[index_t]
corr = np.einsum('ij,ij->i', features_s, features_t) / n_samples
return corr

# index array for parallelization
pos_array = np.array(np.linspace(0, n_samples*(n_samples-1)//2, n_processes), dtype=int)
pos_array = np.array(np.linspace(0, n_features*(n_features-1)//2, n_processes), dtype=int)
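
# A note on the two statements above (an illustrative sketch, not part of this
# commit): np.einsum('ij,ij->i', A, B) computes one dot product per row, so the
# connector evaluates every requested (source, target) feature pair in a single
# vectorized call, and pos_array simply cuts the n_features*(n_features-1)/2
# upper-triangle pairs into n_processes consecutive chunks.
demo_A = np.arange(6.).reshape(2, 3)     # two hypothetical "source" rows
demo_B = np.ones((2, 3))                 # two hypothetical "target" rows
assert np.allclose(np.einsum('ij,ij->i', demo_A, demo_B), (demo_A * demo_B).sum(axis=1))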

# parallel computation
def create_ei(i):
@@ -88,6 +94,7 @@ def create_ei(i):

# In[4]:


# store correlation values
files = os.listdir('tmp/correlations/')
files.sort()
@@ -102,6 +109,7 @@ def create_ei(i):

# In[5]:


# load correlation table
e = pd.read_hdf('e.h5')
print(e)
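
# Optional, illustrative only: if n_features is small enough that a dense
# n_features x n_features matrix fits in RAM, the edge list can be pivoted into
# one. This assumes (as the connector above suggests) a two-level (s, t) index
# and a single 'corr' column; cells below the diagonal stay NaN because only
# upper-triangle pairs are stored.
# corr_matrix = e['corr'].unstack()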
@@ -111,13 +119,15 @@ def create_ei(i):

# In[6]:


g = dg.DeepGraph(v)
p = get_ipython().magic('prun -r g.create_edges(connectors=corr, step_size=step_size)')
p = get_ipython().run_line_magic('prun', '-r g.create_edges(connectors=corr, step_size=step_size)')


# In[7]:


p.print_stats(20)


# As you can see, most of the time is spent by getting the requested samples in the corr-function, followed by computing the correlation values themselves.
# As you can see, most of the time is spent by getting the requested features in the corr-function, followed by computing the correlation values themselves.
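
# A possible tuning note, based on the memmap.__getitem__ entries in the profile
# rather than on anything tested in this commit: the samples file is evidently
# read through a numpy memory map. If the whitened array fits comfortably in RAM
# on every computing node, loading it eagerly avoids that per-chunk read cost:
# X = np.load('samples.npy')             # eager load instead of a memory map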
