# Understanding Correlation

Correlation is a statistical measure that quantifies the strength and direction of the relationship between two variables. It is essential for understanding how one variable behaves in relation to another.

## Basic Statistical Concepts

### Mean

The mean (average) of a dataset is the sum of all the values divided by the number of values. For a set of values $x_1, x_2, \ldots, x_n$, the mean $\bar{x}$ is given by:

$$
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
$$
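
As a minimal sketch, the formula above can be written in plain Python. The function name `mean` and the sample values are illustrative only.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


print(mean([2, 4, 9]))  # 5.0
```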

### Standard Deviation
Standard deviation is a measure of the dispersion or spread of a set of data points around the mean. It is calculated as the square root of the variance. Treating the $n$ values as a full population, the standard deviation $\sigma$ is:
$$
\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}
$$
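
The formula can be sketched directly in plain Python. The name `population_std` and the sample values are assumptions made for illustration. The divisor is $n$, matching the formula above.

```python
import math


def population_std(values):
    """Standard deviation with divisor n (population form)."""
    n = len(values)
    if n == 0:
        raise ValueError("population_std requires at least one value")
    mean = sum(values) / n
    # Variance: the average squared deviation from the mean
    variance = sum((x - mean) ** 2 for x in values) / n
    return math.sqrt(variance)


# Mean is 5, squared deviations sum to 32, variance is 32 / 8 = 4
print(population_std([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```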

## Covariance
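Covariance measures the joint variability of two random variables. It is positive when the variables tend to move in the same direction and negative when they tend to move in opposite directions. For $n$ paired observations $(x_i, y_i)$ of two variables $X$ and $Y$ with means $\bar{X}$ and $\bar{Y}$, the covariance is calculated as: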
$$
\text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n}
$$
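
A small sketch of this formula in plain Python, under the assumption that both inputs are equal-length lists of numbers. The name `population_covariance` and the toy inputs are illustrative.

```python
def population_covariance(xs, ys):
    """Covariance with divisor n (population form)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("at least one pair of values is required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average product of paired deviations from the two means
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


increasing = population_covariance([1, 2, 3], [2, 4, 6])  # y rises with x
decreasing = population_covariance([1, 2, 3], [6, 4, 2])  # y falls as x rises
print(increasing > 0, decreasing < 0)  # True True
```

The sign shows the direction of the relationship, but the magnitude depends on the units of both variables, which is why the value is hard to interpret on its own.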

## Correlation Coefficients

### Pearson Correlation Coefficient
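The Pearson correlation coefficient normalizes the covariance by the product of the standard deviations of the two variables, giving a dimensionless measure of their linear relationship. It ranges from $-1$ (perfect negative linear relationship) to $+1$ (perfect positive linear relationship), with $0$ indicating no linear relationship. It is given by: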
$$
r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}
$$
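
The definition can be sketched in plain Python as follows. The function name `pearson_r` and the toy data are assumptions for illustration, and the sketch uses the divisor $n$ throughout, as in the formulas above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (std_x * std_y)


xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
r = pearson_r(xs, ys)                                   # approximately 0.8
r_rescaled = pearson_r([100 * x for x in xs], ys)       # same data, x in different units
print(round(r, 4), round(r_rescaled, 4))
```

Rescaling one variable changes the covariance but leaves $r$ unchanged, which is what "dimensionless" means in practice.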

## Practical Example

### Example Dataset
Consider the following dataset of heights (cm) and weights (kg) of individuals:

| Person | Height (cm) | Weight (kg) |
|--------|-------------|-------------|
| 1      | 170         | 65          |
| 2      | 180         | 80          |
| 3      | 160         | 60          |
| 4      | 175         | 75          |
| 5      | 165         | 68          |

## Calculation

For the given dataset, the mean, standard deviation, and covariance of heights and weights are calculated as follows:

### Mean of Heights: 
Mean height $\bar{X}_{\text{height}}$ is calculated as:

$$
\bar{X}_{\text{height}} = \frac{170 + 180 + 160 + 175 + 165}{5} = 170\,\text{cm}
$$

### Mean of Weights:
Mean weight $\bar{Y}_{\text{weight}}$ is calculated as:

$$
\bar{Y}_{\text{weight}} = \frac{65 + 80 + 60 + 75 + 68}{5} = 69.6\,\text{kg}
$$

### Standard Deviation of Heights:
Standard deviation of heights $\sigma_{X_{\text{height}}}$ is calculated as:
$$
\sigma_{X_{\text{height}}} = \sqrt{\frac{(170-170)^2 + (180-170)^2 + (160-170)^2 + (175-170)^2 + (165-170)^2}{5}} \approx 7.07\,\text{cm}
$$

### Standard Deviation of Weights:
Standard deviation of weights $\sigma_{Y_{\text{weight}}}$ is calculated as:
$$
\sigma_{Y_{\text{weight}}} = \sqrt{\frac{(65-69.6)^2 + (80-69.6)^2 + (60-69.6)^2 + (75-69.6)^2 + (68-69.6)^2}{5}} \approx 7.12\,\text{kg}
$$

### Covariance between Heights and Weights:
The covariance between heights and weights $\text{Cov}(X, Y)$ is calculated as:
$$
\begin{aligned}
\text{Cov}(X, Y) &= \frac{1}{5} \big[(170-170)(65-69.6) + (180-170)(80-69.6) \\
&\quad + (160-170)(60-69.6) + (175-170)(75-69.6) + (165-170)(68-69.6)\big] = 47\,\text{cm}\cdot\text{kg}
\end{aligned}
$$
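
To make the individual terms of this sum visible, here is a short plain-Python sketch that computes one product of deviations per person. The variable name `products` is illustrative. The data is the example dataset above.

```python
heights = [170, 180, 160, 175, 165]  # cm
weights = [65, 80, 60, 75, 68]       # kg

n = len(heights)
mean_height = sum(heights) / n
mean_weight = sum(weights) / n

# One product of deviations per person
products = [
    (h - mean_height) * (w - mean_weight)
    for h, w in zip(heights, weights)
]

for person, p in enumerate(products, start=1):
    # Adding 0.0 turns a negative zero into 0.0 for display
    print(f"Person {person}: {p + 0.0:6.1f}")

covariance = sum(products) / n
print(f"Covariance: {covariance:.1f}")
```

Up to floating-point rounding, the products are 0, 104, 96, 27, and 8. They sum to 235, and dividing by 5 gives 47.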


### Pearson Correlation Coefficient:
Finally, the Pearson correlation coefficient $r$ is calculated as:
$$
r = \frac{\text{Cov}(X, Y)}{\sigma_{X_{\text{height}}} \times \sigma_{Y_{\text{weight}}}} \approx \frac{47}{7.07 \times 7.12} \approx 0.93
$$

This value of 0.93 indicates a strong positive linear relationship between height and weight in this dataset.



In [1]:
import numpy as np

# Data for heights (in cm) and weights (in kg)
heights = np.array([170, 180, 160, 175, 165])
weights = np.array([65, 80, 60, 75, 68])


In [2]:
# Calculating means
mean_height = np.mean(heights)
mean_weight = np.mean(weights)

# Calculating standard deviations
std_dev_height = np.std(heights, ddof=0)
std_dev_weight = np.std(weights, ddof=0)

The `ddof` parameter of `np.std` stands for "Delta Degrees of Freedom." The divisor used in the calculation of the standard deviation is `n - ddof`, where `n` is the number of observations.

Both calls in the code pass `ddof=0`:

- `std_dev_height = np.std(heights, ddof=0)`
- `std_dev_weight = np.std(weights, ddof=0)`

The choice of `ddof` determines which standard deviation is computed:

- When `ddof=0` (the default for `np.std`), the standard deviation is calculated by dividing by `n`, the number of observations. This approach is typically used when the data is the full population.

- When `ddof=1`, the divisor becomes `n-1`. This adjustment (Bessel's correction) is used when the data is a sample from a larger population. Dividing by `n-1` makes the sample variance an unbiased estimator of the population variance. Its square root still slightly underestimates the population standard deviation, but less so than with a divisor of `n`.

In the context of this code, setting `ddof=0` implies that the data represents the entire population, and the standard deviation is calculated with this assumption in mind. If the data were a sample from a larger population, and the goal was to estimate the population standard deviation from this sample, you would use `ddof=1`.
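
The effect of `ddof` can be seen by computing both versions for the heights. This is a small sketch, and the variable names `population_std` and `sample_std` are illustrative.

```python
import numpy as np

heights = np.array([170, 180, 160, 175, 165])

population_std = np.std(heights, ddof=0)  # divisor n = 5
sample_std = np.std(heights, ddof=1)      # divisor n - 1 = 4

print(f"ddof=0: {population_std:.4f}")  # 7.0711
print(f"ddof=1: {sample_std:.4f}")      # 7.9057
```

With only five observations the difference is noticeable. It shrinks as `n` grows, because `n` and `n-1` become nearly equal.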


In [3]:
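# Calculating the population covariance (ddof=0, divisor n), matching the
# ddof=0 used for the standard deviations; np.cov divides by n-1 by default
covariance = np.cov(heights, weights, ddof=0)[0, 1]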

# Calculating Pearson correlation coefficient
correlation_coefficient = covariance / (std_dev_height * std_dev_weight)

(mean_height, mean_weight, std_dev_height, std_dev_weight, covariance, correlation_coefficient)

(170.0, 69.6, 7.0710678118654755, 7.116178749862878, 47.0, 0.9340411443826679)
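
As a cross-check, the sketch below compares the result with `np.corrcoef`, which returns the correlation matrix directly. It also shows what happens when the covariance and the standard deviations use different divisors. The variable names `r_direct`, `r_sample`, and `r_mixed` are illustrative.

```python
import numpy as np

heights = np.array([170, 180, 160, 175, 165])
weights = np.array([65, 80, 60, 75, 68])

# Pearson r taken from NumPy's correlation matrix
r_direct = np.corrcoef(heights, weights)[0, 1]

# Same ddof in numerator and denominator: the divisor cancels out
r_sample = np.cov(heights, weights, ddof=1)[0, 1] / (
    np.std(heights, ddof=1) * np.std(weights, ddof=1)
)

# Mixed conventions: np.cov defaults to n-1, np.std defaults to n
r_mixed = np.cov(heights, weights)[0, 1] / (np.std(heights) * np.std(weights))

print(f"direct: {r_direct:.4f}, sample: {r_sample:.4f}, mixed: {r_mixed:.4f}")
```

The first two values agree with the 0.934 computed above, since $r$ does not depend on the choice of `ddof` as long as it is applied consistently. The mixed version comes out at about 1.17, which is outside the valid range of $-1$ to $1$ and signals the inconsistency.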