# 3.5.4

## Pearson Correlation

### Sample Data

In [5]:
import numpy as np

For correlation examples, generate a rough male height/weight/BMI/minutes exercised per day dataset.

In [26]:
np.random.seed(42196)

# height of people in inches
heights = np.random.normal(69, 3, 100)

# weight in pounds, should be positively correlated with height
weights = heights * 3 + np.random.normal(0, 30, 100)

# BMI: temporary rough calculation weight/height, should have no correlation with height
bmi = weights / heights

# minutes exercise per day, should be negatively correlated with BMI
exercise = 1 / bmi * 100 + np.random.normal(0, 15, 100)

### Compute

Formula: 

$$\rho = \sum[(X - X̄)(Y - Ȳ)] / [(n-1) * S_x * S_y]$$

Where $X̄$ is the mean of variable $X$, $Y$ is the mean of variable $Y$, and $S_x$ and $S_y$ are the respective variable standard deviations.

In [27]:
def compute_pearson(var_1, var_2):
    return np.sum((var_1 - np.mean(var_1)) * (var_2 - np.mean(var_2))) / ((var_1.shape[0] - 1) * np.std(var_1) * np.std(var_2))

Now check for our different variable combinations.

In [28]:
print (f"Height-Weight correlation: {compute_pearson(heights, weights):.3f}")
print (f"Height-BMI correlation: {compute_pearson(heights, bmi):.3f}")
print (f"BMI-Exercise correlation: {compute_pearson(bmi, exercise):.3f}")

Height-Weight correlation: 0.238
Height-BMI correlation: -0.040
BMI-Exercise correlation: -0.315


## Spearman Rank Correlation

### Data

In [29]:
X = [10, 20, 30, 40, 50]
Y = [5, 10, 20, 30, 40]

### Compute

In [53]:
def compute_rank(data):
    val_mapping = {}
    for i, val in enumerate(data):
        if val not in val_mapping:
            val_mapping[val] = [i]
        else:
            val_mapping[val].append(i)

    data_copy = data[:]
    data_copy.sort()
    
    # need to look up what happens for ties, giving ties all highest rank for now
    rank = []
    for val in data:
        rank.append(0)
    counter = 0
    for val in data_copy:
        counter += len(val_mapping[val])
        for idx in val_mapping[val]:
            rank[idx] = counter
            
    return rank

In [57]:
def compute_spearman(x, y):
    # compute ranks
    rank_x = compute_rank(x)
    rank_y = compute_rank(y)
    
    # compute differences in ranks
    diff_ranks = [rank_x[i] - rank_y[i] for i in range(len(rank_x))]
    
    # square each difference
    squared_diffs = [val ** 2 for val in diff_ranks]
    
    # sum squared differences
    val = sum(squared_diffs)
    
    # compute coefficient with formua: rho = 1 - (6 * sum(d_i^2)) / (n * (n^2 - 1))
    rho = 1 - (6 * val) / (len(rank_x) * (len(rank_x) ** 2 - 1))
    
    return rho

Compute for our sample data.

In [60]:
print (compute_spearman(X, Y))

1.0


Next let's do this with some real life data.