# Synopsis


This notebook considers multivariate stochastic processes.  It reviews joint and marginal distributions and describes how to estimate correlations between pairs of variables.

In particular, it describes Pearson's $r$, Spearman's $\rho$, and Kendall's $\tau$.

# Load libraries

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from colorama import Back, Fore, Style
from copy import copy, deepcopy
from pathlib import Path


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns

from operator import itemgetter
from matplotlib import gridspec

from module_libraries.my_stats import half_frame, playing_with_dice

my_fontsize = 15

# Multivariate processes

Many processes generate multiple random variables. For example, the weather "creates" sets of variables such as temperature, pressure, wind speed, wind direction, humidity, ozone levels, particulate levels, and so on. In such a case, we talk of joint probability density (or mass) functions.

## Joint distributions

Let's focus on the case of continuous variables, as the situation is similar for discrete or mixed cases.  The **joint probability density function** satisfies the following properties:

> 1. $f_{XY}(x, y) \ge 0$ for all $x, y$
>
>
> 2. $\int_{S_X} ~ \int_{S_Y} ~ f_{XY}(x, y) ~ dx~dy = 1$
>
>
> 3. For any region $S$ contained in $S_X \times S_Y$, 
>
> $~~~~~~~~~~~~~~~~~~~~~~ P \left((x,y) \in S  \right) = \int \int_S ~ f_{XY}(x, y) ~ dx~dy$


where $X$ is a random variable, $x$ a possible value for $X$, and I use the notation $f_X(x)$ to mean $f(X = x)$.

### Examples

Consider an instrument with two identical independent parts set in parallel so as to provide redundancy.  The failure of each part is described by an exponential process with an average time to failure of $\tau = 1$ year.  

1. What is the joint probability density function?

2. What is the probability that the instrument will fail before 1 year?

3. What if the parts were placed in series?
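
The questions above can be checked numerically. The sketch below assumes that "in parallel" means the instrument fails only once both parts have failed, and that "in series" means it fails as soon as either part fails. Since the parts are independent, the joint density is the product $f_{XY}(x, y) = e^{-x}~e^{-y}$ for $x, y \ge 0$ (with $\tau = 1$). The names `rng`, `n_trials`, `t1`, and `t2` are illustrative and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed, for reproducibility only
n_trials = 200_000                    # illustrative number of simulated instruments
tau = 1.0                             # mean time to failure, in years

# Independent exponential failure times for the two parts
t1 = rng.exponential(scale=tau, size=n_trials)
t2 = rng.exponential(scale=tau, size=n_trials)

# Parallel (assumed): the instrument fails when the LAST part fails
p_parallel_sim = np.mean(np.maximum(t1, t2) < 1.0)
p_parallel_exact = (1 - np.exp(-1 / tau)) ** 2

# Series (assumed): the instrument fails when the FIRST part fails
p_series_sim = np.mean(np.minimum(t1, t2) < 1.0)
p_series_exact = 1 - np.exp(-2 / tau)

print(f"Parallel: simulated {p_parallel_sim:.3f}, exact {p_parallel_exact:.3f}")
print(f"Series:   simulated {p_series_sim:.3f}, exact {p_series_exact:.3f}")
```

Under these assumptions, redundancy lowers the probability of failure within the first year, while the series arrangement raises it.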




Consider two dice, $X$ and $Y$.  

1. What is the joint probability mass function?

2. What is the probability that both dice return 2 or less?

3. What is the probability that one of the dice returns 2 or less?
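
For two fair, independent dice these questions can be answered exactly. The sketch below builds the joint probability mass function as the outer product of the two marginals. Question 3 can be read as "at least one" or "exactly one", so both are computed. The names `pmf_die`, `joint_pmf`, and `low` are illustrative.

```python
import numpy as np

n = 6
faces = np.arange(1, n + 1)
pmf_die = np.full(n, 1 / n)             # fair die: each face has probability 1/6

# Independence: the joint pmf is the outer product of the marginals,
# so joint_pmf[i, j] = P(X = i + 1, Y = j + 1) = 1/36
joint_pmf = np.outer(pmf_die, pmf_die)

low = faces <= 2                        # boolean mask selecting faces 1 and 2

p_both = joint_pmf[np.ix_(low, low)].sum()
p_at_least_one = 1 - joint_pmf[np.ix_(~low, ~low)].sum()
p_exactly_one = p_at_least_one - p_both

print(f"P(both <= 2)         = {p_both:.4f}")
print(f"P(at least one <= 2) = {p_at_least_one:.4f}")
print(f"P(exactly one <= 2)  = {p_exactly_one:.4f}")
```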

In [None]:
n = 6
L = 1000
x = np.arange(1, n+1)

die1_throws = stats.randint.rvs(0, n, size = L)
die2_throws = stats.randint.rvs(0, n, size = L)

# Calculate histogram
#
h = [0]*n
hist = np.array([h]*n)

for i, j in zip(die1_throws, die2_throws):
    hist[i, j] += 1   
hist = hist / L

# Plot data
#
fig = plt.figure( figsize = (6, 4.5) )
fig.text(0, 1, f"Expectation is {1/36:.2f}", fontsize = my_fontsize*1.5)
ax = fig.add_subplot(1,1,1)
my_font_size = 15
half_frame(ax, '$die_1$', '$die_2$', font_size = my_font_size)

temp = ax.imshow(hist, cmap = plt.cm.cividis, vmin = 0, vmax = 0.05)

ax.plot(die1_throws, die2_throws, 'o', color = 'r', alpha = 0.5)

cbar = ax.figure.colorbar( temp, ax = ax, fraction = .08, shrink = 0.9,
                           ticks = [0, 0.01, 0.02, 0.030, 0.04, 0.05], )
cbar.ax.set_ylabel( 'Probability', rotation = -90, va = "bottom", 
                    fontsize = my_font_size )

ax.set_xticks(x - 1)
ax.set_xticklabels(x)

ax.set_yticks(x - 1)
ax.set_yticklabels(x)

plt.show()



# Marginal distributions

Especially when considering probability density functions of many random variables, one may want to visualize what is going on with the probability for **each** of those variables. In that case, one wants to compute the marginal probability density function.  

Taking the formulation above, the marginal probability density for $X$ is

> $~~~~~~~~~~f_X (x) = \int_{S_Y} ~ f_{XY}(x, y) ~ dy$ 
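
For discrete variables the integral becomes a sum, $p_X(x) = \sum_y p_{XY}(x, y)$. A minimal sketch, assuming an empirical joint table for two dice with faces coded 0 to 5. The names `joint`, `marginal_1`, and `marginal_2` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=1)     # arbitrary seed
n, n_throws = 6, 10_000
die1 = rng.integers(0, n, size=n_throws)   # faces coded 0..5
die2 = rng.integers(0, n, size=n_throws)

# Empirical joint pmf: count every (die1, die2) pair, then normalize
joint = np.zeros((n, n))
np.add.at(joint, (die1, die2), 1)
joint /= n_throws

# Marginalize by summing over the other variable
marginal_1 = joint.sum(axis=1)   # sum over die2 values
marginal_2 = joint.sum(axis=0)   # sum over die1 values

print(marginal_1)
print(marginal_2)
```

Summing the joint table along one axis gives the same result as building a histogram of that die alone.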



In [None]:
# Calculate marginal histograms
#
hist1 = np.array([0]*n)
hist2 = np.array([0]*n)
for i, j in zip(die1_throws, die2_throws):
    hist1[i] += 1
    hist2[j] += 1
    
hist1 = hist1 / L
hist2 = hist2 / L

# Plot data
#
ax = []
my_font_size = 15
fig = plt.figure( figsize = (6, 8) )

ax.append( fig.add_subplot(2,1,1) )
half_frame(ax[0], '$die_1$', 'Probability', font_size = my_font_size)
ax[0].bar(x, hist1, color = 'b')
ax[0].hlines(1/n, 0.5, n+0.5, color = 'k', linewidth = 4)
ax[0].set_xticks(x)
ax[0].set_xticklabels(x)

ax.append( fig.add_subplot(2,1,2) )
half_frame(ax[1], '$die_2$', 'Probability', font_size = my_font_size)
ax[1].bar(x, hist2, color = 'r')
ax[1].hlines(1/n, 0.5, n+0.5, color = 'k', linewidth = 4)
ax[1].set_xticks(x)
ax[1].set_xticklabels(x)

plt.tight_layout()
plt.show()

# Conditional distributions

In the example of the instrument failure, the failure of each of the parts is independent of what is happening with the other part.  This means that the joint probability distribution is simply the product of the individual probability distributions. 

However, in many cases of interest, the two outcomes are **not** independent.  In that case, it becomes important to know the conditional probability distribution.  From the definition of conditional probability, we can write

> $~~~~~~~~~~ f_{Y|x}(y) = f (Y = y ~ | ~ X = x) = \frac{ f_{XY}(x, y)}{f_X(x)}~~~~~~$ for $~~~~~~~f_X(x) > 0$
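
As a small illustration of this formula in the discrete case, the sketch below takes $X$ to be the first die and $S = X + Y$ the sum of two fair dice, and computes the conditional distribution of $S$ given $X = 6$. The arrays `joint_xs`, `f_x`, and `f_s_given_x` are illustrative names, indexed directly by face value and by sum.

```python
import numpy as np

n = 6

# Exact joint pmf of X (first die) and S = X + Y (sum of two fair dice).
# Row index is the face of X, column index is the value of S.
joint_xs = np.zeros((n + 1, 2 * n + 1))
for x in range(1, n + 1):
    for y in range(1, n + 1):
        joint_xs[x, x + y] += 1 / n**2

f_x = joint_xs.sum(axis=1)       # marginal of X (entry 0 is unused and equals 0)
f_s = joint_xs.sum(axis=0)       # marginal of S

x_given = 6
f_s_given_x = joint_xs[x_given] / f_x[x_given]   # valid because f_x[6] > 0

print(f"P(S = 12)         = {f_s[12]:.4f}")
print(f"P(S = 12 | X = 6) = {f_s_given_x[12]:.4f}")
```

Knowing that the first die shows a 6 raises the probability of a total of 12 from 1/36 to 1/6, which is what it means for $S$ and $X$ not to be independent.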


Estimating the correlation between random variables is extremely important, as is figuring out the mechanisms that give rise to it.


In [None]:
def my_function0(die1_throws, die2_throws):
    
    points = die1_throws + die2_throws
    y_max = 6 + 6
    y_min = 2
    y = np.arange(0, y_max + 1)
    
    return points, y, y_max, y_min
    
    
L = 200
points0 = playing_with_dice( L, 10, die1_throws, die2_throws, 
                             my_function0, 10, my_font_size )

# Correlation measures

## Covariance and Pearson's $r$ (see [Wikipedia article](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient))

The covariance between two random variables $X$ and $Y$ is defined as

> $~~~~~~~~~~cov(X, Y) = \sigma_{XY} = E[(X-\mu_X)(Y-\mu_Y)] = E[XY] - \mu_X \mu_Y$

where

> $~~~~~~~~~~E[g(X,Y)] = \int_{S_X} \int_{S_Y} ~ g(x,y)~f_{XY}(x, y) ~ dx ~ dy$ 

The function `numpy.cov`, called with two one-dimensional samples, returns a `numpy.ndarray` object with shape (2, 2) holding an estimate of the covariance matrix. 

The Pearson correlation coefficient, also called Pearson's $r$, is just the covariance normalized by the standard deviations of the two random variables

> corr$(X, Y) = r_{XY} = \frac{cov(X, Y)}{\sigma_X~\sigma_Y}$

The Pearson correlation coefficient is symmetric: $r_{XY} = r_{YX}$. A key mathematical property of the Pearson correlation coefficient is that it is invariant under separate changes in location and scale in the two variables. That is, we may transform $X \to a_0 + a_1X$ and transform $Y \to b_0 + b_1Y$, where $a_0, a_1, b_0, b_1$ are constants with $a_1 b_1 > 0$, without changing the correlation coefficient (if $a_1 b_1 < 0$ then the correlation coefficient changes sign).
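
These properties are easy to check numerically. The sketch below uses synthetic data (the linear relationship and the constants are arbitrary choices for illustration) and also recovers $r$ from the covariance matrix returned by `numpy.cov`.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=2)     # arbitrary seed
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(size=500)      # illustrative linear dependence plus noise

r_xy = stats.pearsonr(x, y)[0]
r_yx = stats.pearsonr(y, x)[0]          # symmetry: same value

# Location and scale changes with a1 * b1 > 0 leave r unchanged
r_same_sign = stats.pearsonr(3.0 + 2.0 * x, -1.0 + 0.1 * y)[0]

# With a1 * b1 < 0 the coefficient changes sign
r_flipped = stats.pearsonr(3.0 - 2.0 * x, -1.0 + 0.1 * y)[0]

# r is the covariance normalized by the two standard deviations
c = np.cov(x, y)
r_from_cov = c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])

print(f"{r_xy:.4f} {r_yx:.4f} {r_same_sign:.4f} {r_flipped:.4f} {r_from_cov:.4f}")
```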

The method `scipy.stats.pearsonr` returns estimates of both the correlation coefficient and the p-value. **The calculation of the p-value relies on the assumption that each dataset is normally distributed**. The values of the Pearson correlation coefficients are in $[-1,~1]$. Correlations equal to 1 or $-1$ correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation).  

<img src = "Images/correlation_examples.png" width = 600>

**Note that covariance is a measure of linear association**.  Non-linear monotonic associations will result in non-zero covariances, but the covariance will not correctly measure the degree of association between the two variables.

## Spearman's $\rho$ (see [Wikipedia article](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient))

Spearman's rank correlation coefficient or Spearman's $\rho$ is a nonparametric measure of **rank correlation** (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a **monotonic function**. Spearman's coefficient is appropriate for both continuous and **discrete ordinal variables**, such  as you would get from survey answers using a Likert scale.

>  $\rho_{XY} = \frac{cov(R(X), R(Y))}{\sigma_{R(X)}~\sigma_{R(Y)}}$ ,

where $R(X)$ is the rank of $X$. 

The function `scipy.stats.spearmanr` returns estimates of both the correlation coefficient and the p-value. These values are obtained by accessing the attributes `.statistic` (in recent SciPy releases; `.correlation` in older ones, where it is kept as an alias) and `.pvalue` of the result. **The p-value can be obtained from a permutation test of the ranks or from the Fisher transform of $r$**

> $F(r) = \frac{1}{2}~\ln \frac{1+r}{1 - r}$ ,

which, under the null hypothesis of no association, is approximately normally distributed with zero mean and standard deviation

> $\sqrt{\frac{1.06}{n - 3}}$.

If there are no repeated data values, a perfect Spearman correlation of +1 or $-1$ occurs when each of the variables is a perfect monotone function of the other. Moreover, 

> $\sigma_{R(X)}~\sigma_{R(Y)} = Var[R(X)] = Var[R(Y)] = \frac{n^2 - 1}{12}$ ,

where $n$ is the sample size.
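
A minimal sketch of these two statements, assuming continuous synthetic data (so there are no ties): Spearman's $\rho$ equals Pearson's $r$ computed on the ranks, and the variance of the ranks is $(n^2 - 1)/12$. The variable `n_obs` plays the role of $n$, and the data-generating choice is arbitrary.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=3)     # arbitrary seed
n_obs = 50
x = rng.normal(size=n_obs)
y = np.exp(x) + 0.1 * rng.normal(size=n_obs)   # monotonic trend plus noise

# Replace each value by its rank (1 = smallest)
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)

rho_from_ranks = stats.pearsonr(rank_x, rank_y)[0]
rho_scipy = stats.spearmanr(x, y)[0]

# Population variance (ddof = 0) of the ranks 1, ..., n
var_rank = np.var(rank_x)
var_expected = (n_obs**2 - 1) / 12

print(f"rho from ranks = {rho_from_ranks:.4f}, scipy = {rho_scipy:.4f}")
print(f"Var[R(X)] = {var_rank:.2f}, (n^2 - 1)/12 = {var_expected:.2f}")
```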

**Non-linear, or even piecewise linear,  non-monotonic associations can frequently yield no covariance between variables. In this case, considering the covariance of the rank of the random variables is not going to solve the problem at all.**
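
The point can be illustrated with a deterministic but non-monotonic relationship. In the sketch below (synthetic data, with $y = x^2$ on a symmetric interval chosen for illustration), $y$ is completely determined by $x$, yet all three coefficients come out close to zero.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=4)     # arbitrary seed
x = rng.uniform(-1, 1, size=2000)
y = x**2                                # fully determined by x, but not monotonic

r = stats.pearsonr(x, y)[0]
rho = stats.spearmanr(x, y)[0]
tau = stats.kendalltau(x, y)[0]

print(f"Pearson's r    = {r:.3f}")
print(f"Spearman's rho = {rho:.3f}")
print(f"Kendall's tau  = {tau:.3f}")
```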

## Kendall's $\tau$ (see [Wikipedia article](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient))

The Kendall rank correlation coefficient or Kendall's $\tau$ is an alternative to Spearman's rank correlation coefficient. It is also a nonparametric measure of **rank correlation**.  It is defined as

> $\tau_{XY} = \frac{2}{n(n - 1)}~\sum_{i < j}{\rm sgn}(x_i - x_j)~{\rm sgn}(y_i - y_j)$.

The Kendall rank coefficient is often used as a test statistic in a statistical hypothesis test to establish whether two variables may be regarded as statistically dependent. This test is non-parametric, as it does not rely on any assumptions on the distributions of $X$ or $Y$ or the distribution of $(X,Y)$. 
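
The definition can be evaluated directly by looping over all pairs $i < j$. The sketch below does this with NumPy on synthetic continuous data (so there are no ties) and compares the result with `scipy.stats.kendalltau`, whose default variant coincides with this formula when no ties are present. The names `n_obs`, `s`, and `tau_direct` are illustrative.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=5)     # arbitrary seed
n_obs = 40
x = rng.normal(size=n_obs)
y = x + rng.normal(size=n_obs)          # illustrative positive association

# Index arrays for all pairs with i < j
i, j = np.triu_indices(n_obs, k=1)

# Sum of sgn(x_i - x_j) * sgn(y_i - y_j): +1 for concordant pairs, -1 for discordant
s = np.sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j]))
tau_direct = 2 * s / (n_obs * (n_obs - 1))

tau_scipy = stats.kendalltau(x, y)[0]

print(f"tau from the definition = {tau_direct:.4f}, scipy = {tau_scipy:.4f}")
```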

The function `scipy.stats.kendalltau` returns estimates of both the correlation coefficient and the p-value. These values are obtained by accessing the attributes `.statistic` (in recent SciPy releases; `.correlation` in older ones, where it is kept as an alias) and `.pvalue` of the result. **The p-value can be obtained from a permutation test of the ranks or, for large $n$, by approximating the distribution of $\tau$ as a normal with zero mean and standard deviation**

> $\sqrt{\frac{2~(2n + 5)}{9n~(n-1)}}$.



In [None]:
# Throw two dice L times
L = 200
n = 6
die1_throws = stats.randint.rvs(1, n+1, size = L)
die2_throws = stats.randint.rvs(1, n+1, size = L)

points, y, y_max, y_min = my_function0(die1_throws, die2_throws)


**The throws of the two dice are uncorrelated.**  

We thus expect a very small value of the correlation, and a non-significant p-value.

In [None]:
result = stats.pearsonr(die2_throws, die1_throws)
print(f"Indeed, we find Pearson's r is {result[0]:.3f} with an estimated "
      f"significance level of {result[1]:.6f}\n")


**The sum of the points of the two dice is correlated with the points of each die.**

We thus expect a large correlation, and a significant p-value.

In [None]:
result = np.cov(points, die1_throws)
print(f"The covariance between the two random variables is {result[0,1]:.3f}\n")

result = stats.spearmanr(points, die1_throws)
print(f"Spearman's rho is {result.correlation:.3f} with an estimated "
      f"significance level of {result.pvalue:.6f}\n")

result = stats.kendalltau(points, die1_throws)
print(f"Kendall's tau is {result.correlation:.3f} with an estimated "
      f"significance level of {result.pvalue:.6f}\n")


In [None]:
def my_function1(die1_throws, die2_throws):
    
    points = die1_throws*die1_throws + die2_throws
    y_max = 6*6 + 6
    y_min = 2
    y = np.arange(0, y_max + 1)
    
    return points, y, y_max, y_min
    
# Throw two dice L times
n = 6
x = np.arange(1, n+1)
L = 200

die1_throws = stats.randint.rvs(1, n+1, size = L)
die2_throws = stats.randint.rvs(1, n+1, size = L)

points1 = playing_with_dice( L, n, die1_throws, die2_throws, my_function1, 
                             20, my_fontsize )


In [None]:
def my_function2(die1_throws, die2_throws):
    
    points = (3.5 - die1_throws)*(3.5 - die1_throws) + die2_throws
    y_max = 13
    y_min = 0
    y = np.arange(0, y_max + 1)
    
    return points, y, y_max, y_min
    
    

# Throw two dice L times
n = 6
x = np.arange(1, n+1)
L = 200

die1_throws = stats.randint.rvs(1, n+1, size = L)
die2_throws = stats.randint.rvs(1, n+1, size = L)

points2 = playing_with_dice( L, n, die1_throws, die2_throws, my_function2, 
                             12, my_fontsize )



In [None]:
2.5**2 + 6

# Correlation is not causation

Since at least the time of the Kabbalah, there has been a tradition of finding meaning and associations between unrelated things. Examples range from those who ["demonstrated" that the dimensions of the three Giza pyramids are in perfect scale to some planetary distances in the solar system](https://www.ancient-origins.net/forum/great-pyramid-decoded-solar-system-changes-history-002631) to astrology.

There is even a [site](https://www.tylervigen.com/spurious-correlations) dedicated to spurious correlations.


In [None]:
# Generate N random data sets of size L drawn from whatever distribution you like
#
L = 20
N = 200
random_samples = []

for i in range(N):
    random_samples.append( stats.expon.rvs(0, 1, size = L) )

print(f"I have generated {len(random_samples)} samples of length "
      f"{len(random_samples[0])}.")

# Calculate the Pearson correlation between every pair of samples
#
pairwise_correlations = []
for i in range(N-1):
    sample1 = random_samples[i]
    for j in range(i+1,N):
        sample2 = random_samples[j]
        correlation, p_value = stats.pearsonr(sample1, sample2)
        # Store (|r|, sign of r, p-value, index i, index j)
        my_tuple = (abs(correlation), np.sign(correlation), 
                                      p_value, i, j)
        
        pairwise_correlations.append( my_tuple )
        
# What are the largest and smallest absolute correlation coefficients you observe?

my_max = max(pairwise_correlations, key = itemgetter(0))
print(f"The maximum absolute correlation, {my_max[0]:.3f}, occurs for samples "
      f"{my_max[3]} and {my_max[4]}.")

my_min = min(pairwise_correlations, key = itemgetter(0))
print(f"The minimum absolute correlation, {my_min[0]:.3f}, occurs for samples "
      f"{my_min[3]} and {my_min[4]}.")

ax = []
fig = plt.figure( figsize = (12, 8) )
gs = gridspec.GridSpec(4, 1)

ax.append( fig.add_subplot(gs[0]) )
half_frame(ax[0], '', '$X_i, X_j$', font_size = my_font_size)
ax[0].plot( np.arange(0,L), random_samples[my_max[3]], 'o-', color = 'orange', 
           alpha = 0.8, label = f"i = {my_max[3]}" )
ax[0].plot( np.arange(0,L), random_samples[my_max[4]], 'o-', color = 'b', 
           alpha = 0.8, label = f"j = {my_max[4]}" )
ax[0].set_xlim(-0.5,L)
ax[0].set_xticks(range(0, L, 2))
ax[0].legend(loc = 'best', fontsize = my_fontsize)

ax.append( fig.add_subplot(gs[1]) )
half_frame(ax[1], '', '$|X_i - X_j|$', font_size = my_font_size)
ax[1].bar( np.arange(0,L), abs(random_samples[my_max[3]] - random_samples[my_max[4]]), 
           color = 'r', alpha = 0.8, width = 0.3 )
ax[1].set_ylim(0,4)
ax[1].set_xlim(-0.5,L)
ax[1].set_xticks(range(0, L, 2))
ax[1].hlines([2, 4], -0.5, L+0.5, color = '0.4', lw = 2)

ax.append( fig.add_subplot(gs[2]) )
half_frame(ax[2], '', '$X_i, X_j$', font_size = my_font_size)
ax[2].plot( np.arange(0,L), random_samples[my_min[3]], 'o-', color = 'orange', 
           alpha = 0.8, label = f"i = {my_min[3]}" )
ax[2].plot( np.arange(0,L), random_samples[my_min[4]], 'o-', color = 'b', 
           alpha = 0.8, label = f"j = {my_min[4]}" )
ax[2].set_xlim(-0.5,L)
ax[2].set_xticks(range(0, L, 2))
ax[2].legend(loc = 'best', fontsize = my_fontsize)

ax.append( fig.add_subplot(gs[3]) )
half_frame(ax[3], 'Event', '$|X_i - X_j|$', font_size = my_font_size)
ax[3].bar( np.arange(0,L), abs(random_samples[my_min[3]] - random_samples[my_min[4]]), 
           color = 'r', alpha = 0.8, width = 0.3 )
ax[3].set_ylim(0,4)
ax[3].set_xlim(-0.5,L)
ax[3].set_xticks(range(0, L, 2))
ax[3].hlines([2, 4], -0.5, L+0.5, color = '0.4', lw = 2)


plt.tight_layout()


<br>

<br>

**The pithy statement $-$ often used as a cudgel $-$ is that correlation is not the same as causation.**  

However, this mantra is frequently used to dismiss potentially important hypotheses and studies.  While experimentation or randomization of subjects to conditions is a desirable way to try to exclude the impact of unconsidered factors, each is only as robust as its sample size and its actual control of unconsidered factors. 


# Exercise

Investigate how the sample distribution and sample size affect the number of random samples needed in order to reliably find a pair of highly-correlated samples.

Specifically:

> 1. Fix the distribution of the samples (keep it to exponential, for example)
> 
> > 2. Select a set of sample sizes for simulation, for example [10, 20, 40, 80, 160]
> >
> > > 3. Select a number of samples to generate, for example 10 
> > > 
> > > > 4. Generate the necessary samples and calculate the maximum value of the correlation coefficient between all those samples.
> > > >
> > > > 5. Repeat 4 a few times in order to see whether the maximum value is relatively constant
> > > > 
> > > > 6. Until the maximum correlation coefficient is large, go to 3 and increase the number of samples (for example, by a factor of 2)
> > >
> > > Repeat with a different sample size
> > >
> > > Plot number of samples at which you get large correlation coefficients for a given sample size.
> >
> > Repeat for a different distribution.


# Let's keep going...

[Next lesson](nb_12_Estimation.ipynb)