# Try this exercise

In [1]:
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt

## Problem 1

It is said that the non-parametric Spearman rank test is "efficient", meaning that when it is run on Gaussian-distributed data it performs nearly as well as Pearson's correlation coefficient. More specifically, if you run the Spearman rank correlation and Pearson's test on the same set of data, the Spearman test will yield the same level of significance as Pearson's test some large fraction $X$ of the time.

Your job in this problem is to design a numerical experiment to test this statement and to evaluate $X$.

### Answer:

To compare these statistics, I run 1,000 trials of the following procedure. I generate a bivariate normal distribution with a random covariance matrix and draw a sample of 3,000 points from it. From this sample, I calculate Pearson's $r$ and the Spearman coefficient and take their absolute difference relative to Pearson's $r$. Finally, I calculate $X$ as the number of trials in which this relative difference is under 5% (an arbitrary threshold) divided by the total number of trials, 1,000.

This is indeed a large fraction: $X \approx 0.98$ on this run of the code. In other words, in about 98% of trials the Spearman coefficient is within 5% of the Pearson coefficient.

In [2]:
def gen_sample(N=3000, dim=2):
    '''Generate a sample from a random bivariate normal (around zero).'''
    mu = np.zeros(dim)
    
    # Make a random positive semi-definite covariance matrix
    M = np.random.rand(dim, dim)
    cov = np.dot(M, M.T)

    # Draw a sample of N points from the population
    X = np.random.multivariate_normal(mu, cov, size=N)
    
    return X
    
def pear_spear(X):
    '''Return the Pearson and Spearman coefficients of a two-column sample.'''
    r, p = ss.pearsonr(X[:, 0], X[:, 1])
    rs, ps = ss.spearmanr(X[:, 0], X[:, 1])

    return r, rs

n_tests = 1000
coefficients = np.array([pear_spear(gen_sample()) for i in range(n_tests)])
# Absolute relative difference, measured against |r| so the ratio is never negative
p_diff = np.abs(coefficients[:, 0] - coefficients[:, 1]) / np.abs(coefficients[:, 0])
mask = p_diff < 0.05
X = mask.sum() / n_tests
print(X)

0.982
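
The problem statement is phrased in terms of significance, while the experiment above compares coefficient values. As a complementary sketch, one could instead count how often the two tests reach the same reject/do-not-reject decision at a fixed significance level. The helper name `agreement_fraction`, the sample size `n = 30`, the population correlation `rho = 0.4`, and the threshold `alpha = 0.05` are all illustrative assumptions, not part of the original experiment.

```python
import numpy as np
import scipy.stats as ss

def agreement_fraction(n=30, rho=0.4, alpha=0.05, n_tests=1000, seed=0):
    '''Fraction of trials in which Pearson and Spearman agree at level alpha.

    All default values are illustrative assumptions.
    '''
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    agree = 0
    for _ in range(n_tests):
        sample = rng.multivariate_normal(np.zeros(2), cov, size=n)
        _, p_pearson = ss.pearsonr(sample[:, 0], sample[:, 1])
        _, p_spearman = ss.spearmanr(sample[:, 0], sample[:, 1])
        # Count the trial if both tests reject, or both fail to reject
        agree += int((p_pearson < alpha) == (p_spearman < alpha))
    return agree / n_tests

frac_agree = agreement_fraction()
print(frac_agree)
```

A small sample is used here on purpose: with a sample as large as 3,000 points, both p-values would likely be tiny for any appreciable correlation, so the two decisions would agree almost trivially.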


## Problem 2

Do the same experiment for Kendall's tau.

**Hint:** I'd be inclined to use the built-in `scipy.stats` methods instead of computing the results myself.

### Answer:

I tested Kendall's tau against the Pearson coefficient in the same manner. This time, however, the fraction is much smaller, $X \approx 0.12$. This is likely because Kendall's tau measures association on a different scale from Pearson's $r$ and does not in general track its value, even with a large ($N = 3000$) sample (as stated in the lecture notebook). Its value is therefore not a good stand-in for Pearson's $r$.

In [3]:
def pear_tau(X):
    '''Return the Pearson coefficient and Kendall's tau of a two-column sample.'''
    r, p = ss.pearsonr(X[:, 0], X[:, 1])
    rt, pt = ss.kendalltau(X[:, 0], X[:, 1])

    return r, rt
    
n_tests = 1000
coefficients = np.array([pear_tau(gen_sample()) for i in range(n_tests)])
# Absolute relative difference, measured against |r| so the ratio is never negative
p_diff = np.abs(coefficients[:, 0] - coefficients[:, 1]) / np.abs(coefficients[:, 0])
mask = p_diff < 0.05
X = mask.sum() / n_tests
print(X)

0.124
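
Both fractions are consistent with the known population relations for a bivariate normal with correlation $\rho$: $\rho_s = \frac{6}{\pi}\arcsin(\rho/2)$ for Spearman and $\tau = \frac{2}{\pi}\arcsin(\rho)$ for Kendall. The sketch below evaluates how far each lies from $\rho$ in relative terms. The function names and the grid of $\rho$ values are illustrative choices, and the formulas describe population values, so coefficients from a finite sample will scatter around them.

```python
import numpy as np

def spearman_from_rho(rho):
    '''Population Spearman coefficient of a bivariate normal with correlation rho.'''
    return 6.0 / np.pi * np.arcsin(rho / 2.0)

def kendall_from_rho(rho):
    '''Population Kendall's tau of a bivariate normal with correlation rho.'''
    return 2.0 / np.pi * np.arcsin(rho)

# Positive correlations only, matching the covariance construction in gen_sample
rho = np.linspace(0.01, 0.999, 990)
rel_spearman = np.abs(rho - spearman_from_rho(rho)) / rho
rel_kendall = np.abs(rho - kendall_from_rho(rho)) / rho

# Largest relative gap between rho and the Spearman coefficient on the grid
print(rel_spearman.max())

# Smallest rho on the grid for which Kendall's tau is within 5% of rho
kendall_threshold = rho[rel_kendall < 0.05].min()
print(kendall_threshold)
```

On this reading, the Spearman coefficient stays within 5% of $\rho$ across the whole range (the population-level gap approaches $1 - 3/\pi \approx 4.5\%$ as $\rho \to 0$), whereas Kendall's tau only comes within 5% when $\rho$ is very close to 1, so only the trials whose random covariance matrix happens to be almost perfectly correlated pass the threshold.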
