<h1><b>NumPy, SciPy, and Pandas: Correlation With Python</b></h1>
<h2><b>1. Introduction</b></h2>
<p>This is the second of five lessons in Real Python's learning path on math for data science. I'll be covering the tools available for measuring correlation with Python. If you are interested, follow this <a href='https://realpython.com/numpy-scipy-pandas-correlation-python/'>link</a> for the full content. Thanks once again to the whole Real Python team, and especially to <a href='https://realpython.com/team/mstojiljkovic/'>Mirko Stojiljković</a>, for this course!</p>
<p>This notebook will cover the following <b>objectives</b>:</p>
<ul>
    <li>What Pearson, Spearman, and Kendall correlation coefficients are</li>
    <li>How to use SciPy, NumPy, and Pandas correlation functions</li>
    <li>How to visualize data, regression lines, and correlation matrices with Matplotlib</li>
</ul>
<hr>
<h2><b>2. Correlation</b></h2>
<p>This notebook will deal with three different statistics that are applied to quantify correlation. They are:</p>
<ul>
<li>Pearson's <code>r</code></li>
<li>Spearman's <code>rho</code></li>
<li>Kendall's <code>tau</code></li>
</ul>
<p>While Pearson's coefficient measures <strong>linear correlation</strong>, the other two compare the <strong>ranks</strong> of the data.</p>
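<p>To make the difference between linear and rank correlation concrete, here is a minimal sketch. The data (<code>x_demo</code>, <code>y_demo</code>) are made up for illustration and are not part of the lesson; the functions come from <code>scipy.stats</code>, which section 4 introduces. Because <code>y_demo</code> always increases with <code>x_demo</code> but not along a straight line, the rank-based coefficients should reach 1 while Pearson's stays below it.</p>

```python
import numpy as np
import scipy.stats

# Illustrative data (not from the lesson): y grows monotonically,
# but not linearly, with x.
x_demo = np.arange(1, 11)
y_demo = x_demo ** 3

# Each function returns a (coefficient, p-value) pair; keep only the coefficient.
r_demo, _ = scipy.stats.pearsonr(x_demo, y_demo)
rho_demo, _ = scipy.stats.spearmanr(x_demo, y_demo)
tau_demo, _ = scipy.stats.kendalltau(x_demo, y_demo)

print(f"Pearson r:    {r_demo:.3f}")    # below 1: the points do not lie on a line
print(f"Spearman rho: {rho_demo:.3f}")  # 1: the ranks of x and y agree exactly
print(f"Kendall tau:  {tau_demo:.3f}")  # 1: every pair of points is concordant
```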
<h2><b>3. Example: NumPy Correlation Calculation</b></h2>
<p>After importing NumPy and defining two arrays, we can use <code>np.corrcoef()</code> to get a matrix of <strong>Pearson correlation coefficients</strong>.</p>



In [1]:
import numpy as np

In [2]:
x = np.arange(10, 20)
x

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [3]:
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
y

array([ 2,  1,  4,  5,  8, 12, 18, 25, 96, 48])

<p><code>np.arange()</code> creates the array <code>x</code> of integers from 10 (inclusive) to 20 (<strong>exclusive</strong>). We can now apply <code>np.corrcoef()</code> using these two arrays as arguments:</p>

In [22]:
r = np.corrcoef(x, y)
r = np.around(r, 2)

In [23]:
print(r[0, 1])

0.76


In [25]:
print(r[1, 0])

0.76


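<p>As we can see, the Pearson correlation coefficient is about <strong>0.76</strong>, which indicates a fairly strong positive linear correlation between the two variables. Both off-diagonal entries, <code>r[0, 1]</code> and <code>r[1, 0]</code>, hold the same value because the correlation matrix is symmetric.</p>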
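<p>As a sanity check, the same coefficient can be reproduced by hand. The sketch below assumes the textbook definition of Pearson's <code>r</code> (the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations); the names <code>dx</code>, <code>dy</code>, <code>r_manual</code>, and <code>corr_matrix</code> are introduced here only for illustration.</p>

```python
import numpy as np

# Same arrays as in the cells above.
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Deviations of each value from its array's mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r from the definition.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Compare against the off-diagonal entry of NumPy's correlation matrix.
corr_matrix = np.corrcoef(x, y)
print(round(r_manual, 2))
print(np.isclose(r_manual, corr_matrix[0, 1]))
```

<p>The diagonal of <code>corr_matrix</code> holds the correlation of each array with itself, which is why only the off-diagonal entries are of interest here.</p>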
<hr>
<h2><b>4. Example: SciPy Correlation Calculation</b></h2>
<p>When using <strong>SciPy</strong>, <code>scipy.stats</code> provides three functions, one for each of the coefficients we saw earlier:</p>
<ul>
<li><code>pearsonr()</code></li>
<li><code>spearmanr()</code></li>
<li><code>kendalltau()</code></li>
</ul>
<p>Let's take a look at these functions:</p>


In [26]:
import scipy.stats


In [27]:
scipy.stats.pearsonr(x, y)

PearsonRResult(statistic=0.758640289091187, pvalue=0.010964341301680813)

In [28]:
scipy.stats.spearmanr(x, y)

SpearmanrResult(correlation=0.9757575757575757, pvalue=1.4675461874042197e-06)

In [29]:
scipy.stats.kendalltau(x, y)

KendalltauResult(correlation=0.911111111111111, pvalue=2.9761904761904762e-05)
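<p>A minimal sketch of how these results can be pulled apart, assuming plain tuple unpacking. Unpacking is used here instead of attribute access because the outputs above show that the coefficient's field name differs between result types (<code>statistic</code> versus <code>correlation</code>). The variable names are illustrative.</p>

```python
import numpy as np
import scipy.stats

# Same arrays as in the cells above.
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Each result unpacks into (coefficient, p-value).
r, r_pvalue = scipy.stats.pearsonr(x, y)
rho, rho_pvalue = scipy.stats.spearmanr(x, y)
tau, tau_pvalue = scipy.stats.kendalltau(x, y)

for name, coef, pvalue in [
    ("Pearson r", r, r_pvalue),
    ("Spearman rho", rho, rho_pvalue),
    ("Kendall tau", tau, tau_pvalue),
]:
    print(f"{name}: coefficient={coef:.4f}, p-value={pvalue:.3g}")
```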

<p>Note that each of these functions returns a result object holding <strong>two values</strong>: the correlation coefficient <i>and</i> the <strong>p-value</strong>.