# Correlation


**IMPORTANT INSTRUCTIONS:** This activity is designed for you to experiment with Python code about correlation. Feel free to change any numerical value throughout the code in the activity to visualize different outcomes and results.

## What is Correlation?

In statistics, it is important to know the relationships between two or more variables in a dataset. 

As you know from Module 3, in a dataset, each data point is an observation, and the features are the properties or attributes of those observations.

Consider the following table, which records the height of basketball players and their shooting accuracy.


| Name      | Height in cm | Shooting Accuracy (%) |
|-----------|--------------|-----------------------|
| John M.   | 180          | 72                    |
| Alex B.   | 188          | 84                    |
| Briand C. | 193          | 87                    |

A quick look at the data above suggests that the columns `Height in cm` and `Shooting Accuracy (%)` are related: in this small sample, the taller players also shoot more accurately.

## Linear Correlation

Linear *correlation* measures the strength and direction of the linear relationship between two variables in a dataset. The coefficient that quantifies it is called the Pearson correlation coefficient.

Consider a dataset with two features, $\mathbf{x}$ and $\mathbf{y}$, each of which has $n$ values.

Suppose that the first value $x_1$ from $\mathbf{x}$ corresponds to the first value $y_1$ from $\mathbf{y}$, the second value $x_2$ from $\mathbf{x}$ corresponds to the second value $y_2$ from $\mathbf{y}$, and so on. 

Then, there are $n$ pairs of corresponding values: $(x_1, y_1)$, $(x_2, y_2)$, and so on. Each of these $x$-$y$ pairs represents a single observation.

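The Pearson (product-moment) *correlation* coefficient is a measure of the linear relationship between two features. As you know, it's defined by the formula:

$$r = \frac{1}{n}\sum_{i=1}^{n}\frac{x_i - \bar{x}}{\sigma_x}\,\frac{y_i - \bar{y}}{\sigma_y},$$

where $\bar{x}$ and $\bar{y}$ are the means of $\mathbf{x}$ and $\mathbf{y}$, and $\sigma_x$ and $\sigma_y$ are their (population) standard deviations.

The Pearson correlation coefficient can take on any real value in the range $-1 \leq r \leq 1$.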
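To make the definition concrete, the coefficient can be computed by hand and compared with NumPy's built-in routine. The following is only a minimal sketch: the helper name `pearson_r` is our own, and it assumes the usual convention in which each value is centered on its mean and scaled by the population standard deviation (NumPy's default for `std()`). The data are the three rows of the basketball table above.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation from standardized values (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize each feature: subtract the mean, divide by the population std.
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    # The coefficient is the average product of the standardized values.
    return float(np.mean(zx * zy))

# Values taken from the table above.
heights = [180, 188, 193]
accuracy = [72, 84, 87]

r_manual = pearson_r(heights, accuracy)
r_numpy = np.corrcoef(heights, accuracy)[0, 1]
print(r_manual, r_numpy)
```

Both numbers should agree up to floating-point rounding, which is a quick sanity check that the formula and the library routine describe the same quantity.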

The table below summarizes how to interpret the coefficient $r$.



| $r$ value      | Correlation between $\mathbf{x}$ and $\mathbf{y}$ |
|----------------|---------------------------------------------------|
| 1              | perfect positive linear relationship              |
| greater than 0 | positive correlation                              |
| 0              | no linear correlation                             |
| less than 0    | negative correlation                              |
| -1             | perfect negative linear relationship              |
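The extreme rows of the table can be reproduced with a few hand-picked arrays. This is an illustrative sketch with made-up values (they are not data from this activity): a rising line, a falling line, and a symmetric parabola, which shows that $r = 0$ rules out a *linear* relationship but not every relationship.

```python
import numpy as np

# Made-up, symmetric x values chosen purely for illustration.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Increasing straight line: perfect positive linear relationship.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]
# Decreasing straight line: perfect negative linear relationship.
r_neg = np.corrcoef(x, -3 * x + 5)[0, 1]
# Symmetric parabola: y clearly depends on x, yet the linear correlation vanishes.
r_zero = np.corrcoef(x, x ** 2)[0, 1]

print(r_pos, r_neg, r_zero)
```

Try replacing the parabola with another curve, or adding noise to the straight lines, to see how $r$ moves between these extremes.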


## Correlation in Python: NumPy

NumPy provides a statistics routine, [`np.corrcoef()`](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html), that returns a matrix of Pearson correlation coefficients.

Let's see how this works with an example.

In the code cell below, we have defined an *array*, `x`, containing the heights (in cm) of basketball players. The array `y` contains the shooting accuracy for those players. Run the code cell below.


In [None]:
import numpy as np
x = np.array([178, 180, 182, 185, 187, 190, 192, 197])
y = np.array([78, 76, 79, 76, 81, 83, 85, 85])

Now, you can call the *function* `np.corrcoef()` with both arrays as arguments to compute the correlation matrix.

In [None]:
r = np.corrcoef(x, y)
r

For your convenience, we have written the matrix above in table form:

|              | $\mathbf{x}$ | $\mathbf{y}$ |
|--------------|--------------|--------------|
| $\mathbf{x}$ | 1            | 0.86903122   |
| $\mathbf{y}$ | 0.86903122   | 1            |

In any correlation matrix, the values on the main diagonal (upper left and lower right) are equal to 1. The upper left value is the correlation coefficient of $\mathbf{x}$ with itself, while the lower right value is the correlation coefficient of $\mathbf{y}$ with itself.


However, what you usually need are the lower left and upper right values of the correlation matrix. These values are equal, and both represent the Pearson correlation coefficient for $\mathbf{x}$ and $\mathbf{y}$. In this case, it's approximately 0.87, meaning that in this sample there is a strong positive linear correlation between the height of a basketball player and their shooting accuracy.
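If you only need the single coefficient rather than the whole matrix, you can index into the result. The sketch below repeats the arrays from the cells above so that it runs on its own; the variable names `corr_matrix` and `r_xy` are our own choices.

```python
import numpy as np

x = np.array([178, 180, 182, 185, 187, 190, 192, 197])
y = np.array([78, 76, 79, 76, 81, 83, 85, 85])

corr_matrix = np.corrcoef(x, y)
# Row 0, column 1 holds the coefficient between x and y
# (row 1, column 0 holds the same value, because the matrix is symmetric).
r_xy = corr_matrix[0, 1]
print(round(r_xy, 2))  # 0.87
```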

Additionally, you can visualize the relationship between $\mathbf{x}$ and $\mathbf{y}$ with a scatter plot using Matplotlib.

In [None]:
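import matplotlib.pyplot as plt

plt.scatter(x, y)
plt.xlabel("Height in cm")  # label the axes so the plot is self-explanatory
plt.ylabel("Shooting Accuracy (%)")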

How would you need to change the values of the arrays `x` and `y` in order to display a negative correlation between them?

## Correlation in Python: Pandas

When working with *dataframes*, it's more convenient to use the pandas *library* to compute the correlation between variables.

Let's see how this works with an example.

Consider the *dataframe* below:

In [None]:
import pandas as pd


df = pd.read_csv('weight-height.csv')

df.head()

The *dataframe* above contains the height and weight of some individuals.

pandas provides the *method* [`corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html), which computes the pairwise correlation between the **numerical** variables in a *dataframe*. In pandas 2.0 and later, a *dataframe* that also contains non-numerical columns needs `corr(numeric_only=True)`.

In the cell below, fill in the ellipsis to compute the correlation matrix for the *dataframe* above.

In [None]:
....corr()

Notice that the *method* `corr()` returns a *dataframe*.
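Because the result is itself a dataframe, individual coefficients can be selected by row and column label. The sketch below does not use `weight-height.csv`; instead it builds a small dataframe from the basketball arrays used earlier, and the column names `height_cm` and `accuracy_pct` are our own assumptions.

```python
import pandas as pd

# Small dataframe built from the basketball arrays used earlier in this activity.
players = pd.DataFrame({
    "height_cm": [178, 180, 182, 185, 187, 190, 192, 197],
    "accuracy_pct": [78, 76, 79, 76, 81, 83, 85, 85],
})

corr = players.corr()  # Pearson correlation is the default method
# Select one coefficient by its row and column labels.
r_label = corr.loc["height_cm", "accuracy_pct"]
# The same coefficient, computed directly from two columns.
r_series = players["height_cm"].corr(players["accuracy_pct"])
print(corr)
```

The labelled lookup is often easier to read than positional indexing, especially when the dataframe has many columns.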

We can also plot the correlation matrix using the library Seaborn.

Observe the code below.

In [None]:
import seaborn as sns


sns.heatmap(df.corr())

What happens if you add the argument `annot=True` inside the *function* `heatmap()`?