<a href="https://colab.research.google.com/github/cagBRT/Statistics-with-Python/blob/main/Correlation_between_Pairs_of_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/Statistics-with-Python.git cloned-repo
%cd cloned-repo

In [None]:
from IPython.display import Image

# Measures of Correlation Between Pairs of Data
You’ll often need to examine the relationship between the corresponding elements of two variables in a dataset.

The relationship between pairs of data usually falls into one of three categories:<br>

- Positive correlation exists when larger values of 𝑥 correspond to larger values of 𝑦 and vice versa.<br>
- Negative correlation exists when larger values of 𝑥 correspond to smaller values of 𝑦 and vice versa.<br>
- Weak or no correlation exists if there is no such apparent relationship.<br>
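
As a rough illustration of the three cases above, the sketch below builds synthetic data for each one. The data, the seed, and the `_demo` names are hypothetical and not part of the original notebook. It uses `np.corrcoef`, which this notebook covers later, only to report the sign and strength of each relationship.

```python
import numpy as np

# Hypothetical synthetic data, seeded so that the run is repeatable.
rng = np.random.default_rng(0)
x_demo = np.arange(500)
noise = rng.normal(size=x_demo.size)

y_pos = 2 * x_demo + 50 * noise    # tends to rise as x rises
y_neg = -2 * x_demo + 50 * noise   # tends to fall as x rises
y_none = noise                     # generated independently of x

for label, y_demo in [("positive", y_pos), ("negative", y_neg), ("none", y_none)]:
    r_demo = np.corrcoef(x_demo, y_demo)[0, 1]
    print(f"{label:>8}: r = {r_demo:+.3f}")
```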

In [None]:
Image("Correlation.png", width=900)

## **Correlation is not a measure or indicator of causation**

**Create some data**

In [None]:
import numpy as np
import pandas as pd

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]
x_, y_ = np.array(x), np.array(y)
x__, y__ = pd.Series(x_), pd.Series(y_)

The two statistics that measure the correlation between datasets are <br>
- covariance <br>
- the correlation coefficient

**Covariance**<br>
The sample covariance measures how two variables vary together. <br>
First, find the means of x and y. <br>
Then, sum the products of the paired deviations from those means and divide by n − 1.
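
Written out, the sample covariance takes the form below. The symbol $s_{xy}$ is notation assumed here for illustration; $\bar{x}$ and $\bar{y}$ are the means of x and y, and $n$ is the number of pairs.

$$
s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)
$$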

In [None]:
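n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sum the products of the paired deviations from the means.
total = 0
for k in range(n):
    total += (x[k] - mean_x) * (y[k] - mean_y)

# Divide by n - 1 to get the sample covariance.
cov_xy = total / (n - 1)
print(cov_xy)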

You can also use the pandas **`Series.cov()` method**

In [None]:
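# Covariance is symmetric, so the order of the two Series does not matter.
# print() is used because a cell only displays its last bare expression.
cov_xy = x__.cov(y__)
print(cov_xy)

cov_xy = y__.cov(x__)
print(cov_xy)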


**Covariance Matrix**<br>
`np.cov()` returns the 2 × 2 covariance matrix of x and y.

In [None]:
cov_matrix = np.cov(x_, y_)
cov_matrix

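**Delta degrees of freedom (ddof)**<br>
Passing `ddof=1` to `.var()` divides by n − 1 instead of n, which gives the sample variance.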

In [None]:
x_.var(ddof=1)

In [None]:
y_.var(ddof=1)
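
To make the effect of `ddof` concrete, the sketch below computes the variance of x by hand with both divisors and compares each result with `.var()`. It rebuilds `x_` so that it runs on its own; `sum_sq` is a helper name introduced here.

```python
import numpy as np

x_ = np.array(range(-10, 11))
n = x_.size

# Sum of squared deviations from the mean.
sum_sq = ((x_ - x_.mean()) ** 2).sum()

# ddof=0 (NumPy's default) divides by n: the population variance.
print(sum_sq / n, x_.var())

# ddof=1 divides by n - 1: the sample variance.
print(sum_sq / (n - 1), x_.var(ddof=1))
```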

As you can see, the variances of x and y are equal to `cov_matrix[0, 0]` and `cov_matrix[1, 1]`, respectively.

The other two elements of the covariance matrix, `cov_matrix[0, 1]` and `cov_matrix[1, 0]`, are equal and represent the actual covariance between x and y.
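
These claims about the covariance matrix can be checked directly. The sketch below repeats the notebook's data so that it runs on its own; `cov_manual` is a name introduced here for a hand-computed sample covariance.

```python
import numpy as np

x_ = np.array(range(-10, 11))
y_ = np.array([0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14])
cov_matrix = np.cov(x_, y_)

# Diagonal elements: the sample variances of x and y.
print(np.isclose(cov_matrix[0, 0], x_.var(ddof=1)))
print(np.isclose(cov_matrix[1, 1], y_.var(ddof=1)))

# Off-diagonal elements: equal to each other and to the sample covariance.
cov_manual = ((x_ - x_.mean()) * (y_ - y_.mean())).sum() / (x_.size - 1)
print(np.isclose(cov_matrix[0, 1], cov_matrix[1, 0]))
print(np.isclose(cov_matrix[0, 1], cov_manual))
```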

**If the correlation is positive**:<br>
> then the covariance is positive, as well. <br>
For data on a fixed scale, a stronger relationship corresponds to a higher value of the covariance.<br>

**If the correlation is negative**:<br>
> then the covariance is negative, as well.<br>
For data on a fixed scale, a stronger relationship corresponds to a lower (or higher absolute) value of the covariance.<br>

**If the correlation is weak**:<br>
> then the covariance is close to zero relative to the spread of x and y.<br>
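
One caveat when reading covariance values: the magnitude depends on the units of x and y, so it is only comparable between datasets measured on the same scale. The sketch below rescales x by a factor of 10 (a hypothetical change of units, not something the original notebook does) and compares the covariance with the correlation coefficient.

```python
import numpy as np

x_ = np.array(range(-10, 11))
y_ = np.array([0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14])

# The same data expressed in a unit ten times smaller (hypothetical rescaling).
x_scaled = 10 * x_

cov_original = np.cov(x_, y_)[0, 1]
cov_scaled = np.cov(x_scaled, y_)[0, 1]
print(cov_original, cov_scaled)   # the covariance grows tenfold

r_original = np.corrcoef(x_, y_)[0, 1]
r_scaled = np.corrcoef(x_scaled, y_)[0, 1]
print(r_original, r_scaled)       # the correlation coefficient is unchanged
```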

## Correlation Coefficient
The correlation coefficient, or Pearson product-moment correlation coefficient, is denoted by the symbol 𝑟. The coefficient is another measure of the correlation between data. <br>
You can think of it as a standardized covariance. <br>
Here are some important facts about it:<br>

- The value 𝑟 > 0 indicates positive correlation.<br>
- The value 𝑟 < 0 indicates negative correlation.<br>
- The value 𝑟 = 1 is the maximum possible value of 𝑟. It corresponds to a perfect positive linear relationship between variables.<br>
- The value 𝑟 = −1 is the minimum possible value of 𝑟. It corresponds to a perfect negative linear relationship between variables.<br>
- The value 𝑟 ≈ 0 means that the linear correlation between variables is weak.<br>
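
To see what "standardized covariance" means in practice, the sketch below divides the sample covariance by the product of the two sample standard deviations. It repeats the notebook's data so that it runs on its own; `r_manual` is a name introduced here.

```python
import numpy as np

x_ = np.array(range(-10, 11))
y_ = np.array([0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14])

cov_xy = np.cov(x_, y_)[0, 1]   # sample covariance
std_x = x_.std(ddof=1)          # sample standard deviation of x
std_y = y_.std(ddof=1)          # sample standard deviation of y

# Dividing by both standard deviations removes the units of x and y.
r_manual = cov_xy / (std_x * std_y)
print(r_manual)
```

Because the units cancel, the result always lies between −1 and 1, whatever the scale of the data.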

In [None]:
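import scipy.stats

# pearsonr returns the correlation coefficient r and the p-value for
# the null hypothesis that the two variables are uncorrelated.
r, p = scipy.stats.pearsonr(x_, y_)
print(r)
print(p)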

`np.corrcoef()` returns the **correlation coefficient** matrix. The diagonal elements are 1, and the two off-diagonal elements are both equal to 𝑟.

In [None]:
corr_matrix = np.corrcoef(x_, y_)
corr_matrix

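**Linear Regression**: `scipy.stats.linregress()` returns the slope and y-intercept of the regression line, along with the correlation coefficient (`rvalue`), the p-value, and the standard error of the slope.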

In [None]:
result = scipy.stats.linregress(x_, y_)
print("linear regression: ", result)
r = result.rvalue
print("r= ",r)
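
The regression result connects back to the correlation coefficient: for a simple linear regression, the slope equals 𝑟 multiplied by the ratio of the standard deviations of y and x. The sketch below checks this and uses the fitted line to compute predicted values. It repeats the notebook's data so that it runs on its own; `y_fit` and `slope_from_r` are names introduced here.

```python
import numpy as np
import scipy.stats

x_ = np.array(range(-10, 11))
y_ = np.array([0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14])

result = scipy.stats.linregress(x_, y_)

# Predicted y values on the fitted regression line.
y_fit = result.intercept + result.slope * x_

# The slope recovered from r and the two sample standard deviations.
slope_from_r = result.rvalue * y_.std(ddof=1) / x_.std(ddof=1)
print(result.slope, slope_from_r)
```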

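You can also get the correlation coefficient with the pandas `Series.corr()` method. Like the covariance, it is symmetric, so the order of the two Series does not matter.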

In [None]:
r = x__.corr(y__)
print("r corr(y)", r)

r = y__.corr(x__)
print("r corr(x)", r)