# 3.5.4

## Pearson Correlation

### Sample Data

In [1]:
import numpy as np

For the correlation examples, generate a rough synthetic dataset of male height, weight, BMI, and minutes exercised per day.

In [18]:
np.random.seed(42196)

# height of people in inches
heights = np.random.normal(69, 3, 100)

# weight in pounds, should be positively correlated with height
weights = heights * 3 + np.random.normal(0, 30, 100)

# BMI stand-in: a rough weight/height ratio (real BMI is weight / height**2 in metric units),
# should have roughly no correlation with height
bmi = weights / heights

# minutes of exercise per day, should be negatively correlated with BMI
exercise = 1 / bmi * 100 + np.random.normal(0, 15, 100)

### Compute

Formula (for a sample, where the coefficient is often written $r$ rather than $\rho$):

$$\rho = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $$

Where $\bar{x}$ is the mean of variable $X$ and $\bar{y}$ is the mean of variable $Y$.

In [19]:
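def compute_pearson(x, y):
    # deviations of each observation from its variable's mean
    x_dev = x - np.mean(x)
    y_dev = y - np.mean(y)
    # sum of cross-products over the square root of the product of the sums of squares
    return np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev ** 2) * np.sum(y_dev ** 2))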
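
As a sanity check, the formula can be compared against NumPy's built-in `np.corrcoef`. The sketch below is self-contained: `pearson_by_formula` is a hypothetical name that restates the formula, and the seed and sample are arbitrary choices made only for this check, not the dataset generated earlier.

```python
import numpy as np

def pearson_by_formula(x, y):
    # same sum-of-products formula, restated so this cell runs on its own
    dx = x - np.mean(x)
    dy = y - np.mean(y)
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for this check
x = rng.normal(69, 3, 100)
y = x * 3 + rng.normal(0, 30, 100)

r_formula = pearson_by_formula(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(np.isclose(r_formula, r_numpy))
```

The two values should agree up to floating-point rounding, since `np.corrcoef` computes the same normalized covariance.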

Now check for our different variable combinations.

In [20]:
print(f"Height-Weight correlation: {compute_pearson(heights, weights):.3f}")
print(f"Height-BMI correlation: {compute_pearson(heights, bmi):.3f}")
print(f"BMI-Exercise correlation: {compute_pearson(bmi, exercise):.3f}")

Height-Weight correlation: 0.235
Height-BMI correlation: -0.040
BMI-Exercise correlation: -0.312


Our height-weight and BMI-exercise correlations are not particularly strong, but we can strengthen them by reducing the noise added to each variable (e.g., switch to `np.random.normal(0, 10, 100)` for weights).
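
A minimal sketch of that effect, assuming a fresh generator with an arbitrary seed rather than the dataset above. The names `h`, `unit_noise`, `w_sd30`, and `w_sd10` are illustrative. The same noise draws are rescaled to two standard deviations so that only the noise level differs between the two weight columns.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, not the one used above
h = rng.normal(69, 3, 100)
unit_noise = rng.normal(0, 1, 100)  # one set of draws, rescaled below

# same height signal, two noise levels
w_sd30 = h * 3 + 30 * unit_noise
w_sd10 = h * 3 + 10 * unit_noise

r_sd30 = np.corrcoef(h, w_sd30)[0, 1]
r_sd10 = np.corrcoef(h, w_sd10)[0, 1]
print(f"noise sd 30: {r_sd30:.3f}")
print(f"noise sd 10: {r_sd10:.3f}")
```

With less noise, the linear height term accounts for a larger share of the variance in weight, so the second coefficient should come out noticeably larger.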

## Spearman Rank Correlation

### Sample Data

In [22]:
# using the fictitious Wikipedia example of hours of TV watched vs. IQ, which has only a weak negative Spearman rank correlation
iq = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]
hours_tv = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]
# adding my own column for something that should also have little rank correlation, such as height
height = [71, 65, 68, 61, 73, 70, 64, 66, 63, 69]

### Compute

Formula (when all ranks are distinct integers, i.e. there are no ties):

$$ \rho = 1 - \frac {6 \sum d_i^2} {n (n^2 - 1)} $$

Where $d_i$ is the difference between the two ranks of observation $i$ and $n$ is the number of observations.

In [23]:
def compute_rank(data):
    # map each distinct value to the positions where it occurs
    val_mapping = {}
    for i, val in enumerate(data):
        val_mapping.setdefault(val, []).append(i)

    # need to look up what happens for ties, giving ties all highest rank for now
    rank = [0] * len(data)
    counter = 0
    # visit each distinct value once, in ascending order, so a tied group is counted only once
    for val in sorted(val_mapping):
        counter += len(val_mapping[val])
        for idx in val_mapping[val]:
            rank[idx] = counter

    return rank
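
On ties: the usual convention is to give tied values the average of the ranks they would otherwise occupy. The sketch below shows one way to do that. `average_ranks` is a hypothetical helper, not part of the code above.

```python
def average_ranks(data):
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    # indices of the data in ascending order of value
    order = sorted(range(len(data)), key=lambda i: data[i])
    ranks = [0.0] * len(data)
    pos = 0
    while pos < len(order):
        end = pos
        # extend `end` across the run of values equal to the one at `pos`
        while end + 1 < len(order) and data[order[end + 1]] == data[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```

Once ties are present, the shortcut formula based on $\sum d_i^2$ is no longer exact. The general approach is to compute the Pearson correlation of the two rank vectors.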

In [24]:
def compute_spearman(x, y):
    # compute ranks
    rank_x = compute_rank(x)
    rank_y = compute_rank(y)
    
    # compute differences in ranks
    diff_ranks = [rank_x[i] - rank_y[i] for i in range(len(rank_x))]
    
    # square each difference
    squared_diffs = [val ** 2 for val in diff_ranks]
    
    # sum squared differences
    val = sum(squared_diffs)
    
    # compute coefficient with formula: rho = 1 - (6 * sum(d_i^2)) / (n * (n^2 - 1))
    rho = 1 - (6 * val) / (len(rank_x) * (len(rank_x) ** 2 - 1))
    
    return rho
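
Spearman's coefficient is the Pearson correlation of the ranks, which gives an independent way to check the shortcut formula. The sketch below assumes distinct values and uses a hypothetical helper, `ranks_no_ties`, built on NumPy's `argsort`. The two lists repeat the sample data so the cell runs on its own.

```python
import numpy as np

iq = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]
hours_tv = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]

def ranks_no_ties(values):
    # argsort of argsort gives each element's 0-based position in sorted order;
    # this is a valid ranking only when all values are distinct
    return np.argsort(np.argsort(values)) + 1

# Pearson correlation of the two rank vectors
rho_via_pearson = np.corrcoef(ranks_no_ties(iq), ranks_no_ties(hours_tv))[0, 1]
print(f"{rho_via_pearson:.4f}")
```

For data without ties, this should match the value from the $\sum d_i^2$ formula up to floating-point rounding.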

Compute for our sample data.

In [27]:
print(compute_spearman(iq, hours_tv))
print(compute_spearman(hours_tv, height))

-0.17575757575757578
-0.030303030303030276
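
The first value can be checked by hand. Ranking `iq` and `hours_tv` in ascending order, the squared rank differences for the ten pairs are 16, 9, 0, 25, 25, 9, 16, 36, 49, and 9, which sum to 194. Substituting into the formula with $n = 10$:

$$ \rho = 1 - \frac{6 \cdot 194}{10 (10^2 - 1)} = 1 - \frac{1164}{990} = -\frac{29}{165} \approx -0.176 $$

The same steps for `hours_tv` and `height` give $\sum d_i^2 = 170$, so $\rho = 1 - \frac{1020}{990} \approx -0.030$.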
