# Correlation & Covariance


In [1]:
!uv pip install -q \
    pandas==2.3.3 \
    numpy==2.3.3 \
    scipy==1.16.2

In [None]:
import math

import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

Covariance and correlation are statistical measures of the relationship between two variables. Both describe how changes in one variable are associated with changes in the other.

- [Covariance](#covariance)
- [Correlation](#correlation)

## Covariance

Covariance is a measure of how much two random variables change together. If the variables tend to increase and decrease together, the covariance is positive. If one tends to increase when the other decreases, the covariance is negative.

Covariance of $(x, y)$

$s_{xy} = \text{Cov}(X, Y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x}) (y_i - \bar{y})$

Covariance of $(x, x)$

$s^2_x = \text{Cov}(X, X) = \text{Var}(X) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$
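The identity $\text{Cov}(X, X) = \text{Var}(X)$ can be checked numerically. The following is a minimal sketch; the helper `sample_covariance` and the values in `X` are illustrative assumptions, and `statistics.variance` comes from the Python standard library, which also divides by $N - 1$.

```python
from statistics import variance


def sample_covariance(xs, ys):
    """Sample covariance using the N - 1 denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # illustrative values

cov_xx = sample_covariance(X, X)  # covariance of X with itself
var_x = variance(X)               # sample variance from the standard library

print(f"Cov(X, X) = {cov_xx:.4f}")
print(f"Var(X)    = {var_x:.4f}")
```

Both values should agree up to floating-point rounding.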

### Advantages

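- Quantifies the direction of the linear relationship between $X$ and $Y$: positive when they move together, negative when they move in opposite directions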

### Disadvantages

- Covariance has no fixed range: its magnitude depends on the units of $X$ and $Y$, so two covariances cannot be compared directly to decide which relationship is stronger.
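To illustrate why the magnitude of a covariance is hard to interpret, the sketch below rescales one variable from metres to centimetres. The data and the helper names (`sample_covariance`, `pearson_r`) are hypothetical and only serve the illustration. The covariance grows by a factor of 100, while the correlation stays the same.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance using the N - 1 denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return sample_covariance(xs, ys) / math.sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )


# Hypothetical measurements, invented for this example
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52.0, 58.0, 69.0, 75.0, 88.0]
heights_cm = [h * 100 for h in heights_m]  # same data, different unit

cov_m = sample_covariance(heights_m, weights_kg)
cov_cm = sample_covariance(heights_cm, weights_kg)
r_m = pearson_r(heights_m, weights_kg)
r_cm = pearson_r(heights_cm, weights_kg)

print(f"Covariance (m):  {cov_m:.4f}")
print(f"Covariance (cm): {cov_cm:.4f}")
print(f"Correlation (m):  {r_m:.4f}")
print(f"Correlation (cm): {r_cm:.4f}")
```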


In [None]:
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
N = len(X)
N

10

Calculate the means


In [None]:
x_bar = sum(X) / N
y_bar = sum(Y) / N

print(x_bar)
print(y_bar)

5.5
19.0


Calculate the sum of the products of the deviations


In [None]:
sum_of_products = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(X, Y))
sum_of_products

165.0

Calculate the sample covariance


In [None]:
sample_covariance = sum_of_products / (N - 1)
sample_covariance

18.333333333333332
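The division by $N - 1$ gives the sample covariance. Dividing by $N$ instead gives the population covariance, which is appropriate only when the data cover the whole population. A minimal sketch of the difference, reusing the same `X` and `Y` (the variable names `sample_cov` and `population_cov` are chosen here for illustration):

```python
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
N = len(X)

x_bar = sum(X) / N
y_bar = sum(Y) / N
sum_of_products = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(X, Y))

sample_cov = sum_of_products / (N - 1)  # treats the data as a sample
population_cov = sum_of_products / N    # treats the data as the full population

print(f"Sample covariance:     {sample_cov:.4f}")
print(f"Population covariance: {population_cov:.4f}")
```

NumPy's `np.cov` exposes the same choice through its `ddof` argument.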

### Using NumPy


In [None]:
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

data = np.array([X, Y])
covariance_matrix = np.cov(data)

print(f"Cov(X, Y) check: {covariance_matrix[0, 1]:.4f}")

Cov(X, Y) check: 18.3333


## Correlation

- [Pearson Correlation Coefficient](#pearson-correlation-coefficient)
- [Spearman Rank Correlation](#spearman-rank-correlation)

### Pearson Correlation Coefficient

- Its values are bounded between $-1$ and $+1$
- **Use with linear relationships** (points that follow a straight line)

$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$

- $N$ is the number of observations in the sample
- $x_i$ and $y_i$ are individual data points
- $\bar{x}$ and $\bar{y}$ are the **sample means** of $X$ and $Y$
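Dividing the numerator and the denominator of the formula by $N - 1$ shows that $r$ is the covariance normalized by the two standard deviations, $r = s_{xy} / (s_x s_y)$. The sketch below checks this equivalence on a perfectly increasing and a perfectly decreasing line. The function names `pearson_direct` and `pearson_normalized` are assumptions made for this example.

```python
import math

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
Y_reversed = Y[::-1]  # same values in decreasing order


def sample_covariance(xs, ys):
    """Sample covariance using the N - 1 denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def pearson_direct(xs, ys):
    """Pearson r from the sums of deviations, as in the formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    return numerator / (math.sqrt(sum_sq_x) * math.sqrt(sum_sq_y))


def pearson_normalized(xs, ys):
    """Pearson r as covariance over the product of standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


r_pos = pearson_direct(X, Y)
r_neg = pearson_direct(X, Y_reversed)

print(f"Increasing line: {r_pos:.4f} vs {pearson_normalized(X, Y):.4f}")
print(f"Decreasing line: {r_neg:.4f} vs {pearson_normalized(X, Y_reversed):.4f}")
```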


In [None]:
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
N = len(X)
N

10

Calculate the sample means


In [None]:
x_bar = sum(X) / N
y_bar = sum(Y) / N

print(x_bar)
print(y_bar)

5.5
19.0


Calculate the numerator: Sum of products of deviations


In [None]:
numerator = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(X, Y))
print(f"Numerator (Covariance numerator): {numerator}")

Numerator (Covariance numerator): 165.0


Calculate the components of the denominator


Sum of squared deviations for $X$: $\sum_{i=1}^{N} (x_i - \bar{x})^2$


In [None]:
sum_sq_dev_x = sum((x_i - x_bar) ** 2 for x_i in X)
print(f"Sum of squared deviations for X: {sum_sq_dev_x}")

Sum of squared deviations for X: 82.5


Sum of squared deviations for $Y$: $\sum_{i=1}^{N} (y_i - \bar{y})^2$


In [None]:
sum_sq_dev_y = sum((y_i - y_bar) ** 2 for y_i in Y)
print(f"Sum of squared deviations for Y: {sum_sq_dev_y}")

Sum of squared deviations for Y: 330.0


Calculate the denominator: $\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}$


In [None]:
denominator = math.sqrt(sum_sq_dev_x) * math.sqrt(sum_sq_dev_y)
print(f"Denominator: {denominator}")

Denominator: 165.0


Calculate the Pearson correlation coefficient (r)


In [None]:
pearson_correlation = numerator / denominator
print(f"Pearson Correlation Coefficient (r): {pearson_correlation}")

Pearson Correlation Coefficient (r): 1.0


#### Using SciPy


In [None]:
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

correlation, p_value = pearsonr(X, Y)

print(f"Pearson Correlation (r): {correlation:.4f}")
print(f"P-value: {p_value:.2e}")

Pearson Correlation (r): 1.0000
P-value: 1.70e-61


### Spearman Rank Correlation

- It's a non-parametric measure of the **strength and direction** of the monotonic association between two variables, computed on their ranks.
- **Use with monotonic relationships** (curves that consistently rise or fall)

$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$

- $\rho$ (rho) is the Spearman rank correlation coefficient
- $d_i$ is the **difference** between the ranks of the $i$-th observation of $x_i$ and $y_i$
- $n$ is the **number of observations** (or data points) in the sample
- The formula is exact only when there are no tied ranks
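To see how Spearman differs from Pearson, the sketch below uses a curved but strictly increasing relationship ($y = x^3$, chosen here purely for illustration). Spearman's $\rho$ is computed two ways: with the formula, and as the Pearson correlation of the ranks. The helpers `pearson_r` and `ranks` are assumptions of this example, and `ranks` assumes all values are distinct.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from the sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    return numerator / (math.sqrt(sum_sq_x) * math.sqrt(sum_sq_y))


def ranks(values):
    """Rank from 1 (smallest) upward; assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


x = list(range(1, 11))
y = [v ** 3 for v in x]  # monotonic, but not a straight line

r = pearson_r(x, y)

rank_x, rank_y = ranks(x), ranks(y)
n = len(x)
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
rho_from_formula = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
rho_from_ranks = pearson_r(rank_x, rank_y)

print(f"Pearson r:               {r:.4f}")
print(f"Spearman rho (formula):  {rho_from_formula:.4f}")
print(f"Spearman rho (on ranks): {rho_from_ranks:.4f}")
```

Pearson stays below 1 because the points do not lie on a straight line, while Spearman reaches 1 because the ordering of `x` and `y` is identical.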


In [None]:
X = [10, 2, 8, 1, 5]
Y = [90, 50, 85, 40, 70]
n = len(X)
n

5

Create a list of (value, index) pairs for X, then sort by value


In [None]:
sorted_x_indexed = sorted([(val, i) for i, val in enumerate(X)])
sorted_x_indexed

[(1, 3), (2, 1), (5, 4), (8, 2), (10, 0)]

Create a list of (value, index) pairs for Y, then sort by value


In [None]:
sorted_y_indexed = sorted([(val, i) for i, val in enumerate(Y)])
sorted_y_indexed

[(40, 3), (50, 1), (70, 4), (85, 2), (90, 0)]

Rank_X: The rank for each element in the original X list


In [None]:
sorted_x = sorted(X)
rank_x = [sorted_x.index(x) + 1 for x in X]
rank_x

[5, 2, 4, 1, 3]

Rank_Y: The rank for each element in the original Y list


In [None]:
sorted_y = sorted(Y)
rank_y = [sorted_y.index(y) + 1 for y in Y]
rank_y

[5, 2, 4, 1, 3]

In [None]:
print(f"Original X: {X}")
print(f"Ranked X:   {rank_x}")
print(f"Original Y: {Y}")
print(f"Ranked Y:   {rank_y}")

Original X: [10, 2, 8, 1, 5]
Ranked X:   [5, 2, 4, 1, 3]
Original Y: [90, 50, 85, 40, 70]
Ranked Y:   [5, 2, 4, 1, 3]
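The `sorted(...).index(...)` approach works here because every value is distinct. With tied values it assigns each tie the lowest rank of its group, whereas the usual convention (and the default in SciPy's ranking) is to give ties the average of the ranks they occupy. The following is a sketch of average ranking; the function `average_ranks` and the `scores` list are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


scores = [10, 20, 20, 30]  # hypothetical data with one tie

sorted_scores = sorted(scores)
index_ranks = [sorted_scores.index(s) + 1 for s in scores]
avg_ranks = average_ranks(scores)

print(index_ranks)  # [1, 2, 2, 4]
print(avg_ranks)    # [1.0, 2.5, 2.5, 4.0]
```

When ties are present, computing the Pearson correlation on average ranks is the safer route, since the $d_i^2$ shortcut is no longer exact.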


Calculate the sum of squared rank differences $\sum_{i=1}^{n} d_i^2$


In [None]:
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
sum_d2

0

Calculate Spearman's Rho (rho)


In [None]:
numerator = 6 * sum_d2
denominator = n * (n**2 - 1)
rho = 1 - (numerator / denominator)

print(f"Numerator (6 * sum_d2): {numerator}")
print(f"Denominator (n * (n^2 - 1)): {denominator}")
print(f"Spearman's Rho (ρ): {rho:.4f}")

Numerator (6 * sum_d2): 0
Denominator (n * (n^2 - 1)): 120
Spearman's Rho (ρ): 1.0000


#### Using SciPy


In [None]:
X = [10, 2, 8, 1, 5]
Y = [90, 50, 85, 40, 70]

correlation, p_value = spearmanr(X, Y)

print(f"Spearman Correlation (rho): {correlation:.4f}")
print(f"P-value: {p_value:.4f}")

Spearman Correlation (rho): 1.0000
P-value: 0.0000


### Using pandas


In [None]:
import pandas as pd

df = pd.DataFrame(
    {
        "X": [1, 2, 3, 4, 5],
        "Y": [90, 50, 85, 40, 70],
        "Z": [1, 5, 3, 2, 4],
    }
)

pearson_matrix = df.corr(method="pearson")

spearman_matrix = df.corr(method="spearman")

print("Pearson Correlation Matrix:\n", pearson_matrix)
print("\nSpearman Correlation Matrix:\n", spearman_matrix)

Pearson Correlation Matrix:
           X         Y         Z
X  1.000000 -0.364662  0.300000
Y -0.364662  1.000000 -0.364662
Z  0.300000 -0.364662  1.000000

Spearman Correlation Matrix:
      X    Y    Z
X  1.0 -0.5  0.3
Y -0.5  1.0 -0.4
Z  0.3 -0.4  1.0


## Usage

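Correlation can be used in the feature selection step of machine learning. The closer a feature's correlation with the target is to 0, the weaker its linear (Pearson) or monotonic (Spearman) association with the target, and the less useful it is likely to be as a predictor on its own.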
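As a rough sketch of that idea, the example below ranks features by the absolute value of their Pearson correlation with a target. The dataset, the feature names (`rooms`, `age_years`, `door_number`) and the target `price` are all invented for illustration. The absolute value is used because a correlation near $-1$ is as informative as one near $+1$.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from the sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    return numerator / (math.sqrt(sum_sq_x) * math.sqrt(sum_sq_y))


# Hypothetical toy dataset, invented for this example
features = {
    "rooms": [2, 3, 3, 4, 5, 6],
    "age_years": [30, 5, 22, 14, 9, 1],
    "door_number": [14, 3, 27, 8, 21, 10],
}
price = [150, 210, 200, 260, 310, 380]

# Correlation of each feature with the target
scores = {name: pearson_r(values, price) for name, values in features.items()}

# Strongest association first, regardless of sign
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)

for name in ranked:
    print(f"{name:12s} r = {scores[name]:+.3f}")
```

A low score only indicates a weak linear association. A feature can still matter through a non-linear relationship or in combination with other features, so this kind of ranking is a first filter, not a final decision.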
