# DATA 5600: Introduction to Regression and Machine Learning for Analytics

## __Koop Chapter 03: Correlation__ <br>

Author:  Tyler J. Brough <br>
Updated: October 31, 2021 <br>

---

<br>

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8]

In [2]:
np.random.seed(7)

---

<br>

## __Introduction__

<br>


These notes are taken from the book _Analysis of Economic Data 2nd Edition_ by Gary Koop.

<br>

A solid understanding of the concept of correlation is essential for doing regression.


<br>

## __Understanding Correlation__

<br>

* Let $X$ and $Y$ be two variables

* Let's suppose we have data on $i = 1, 2, \ldots, N$ different units/observations

* The ___correlation___ between $X$ and $Y$ is denoted by Greek letter $\rho$ ("rho")



### __Sample Correlation__

---

The __sample correlation__ between $X$ and $Y$ is referred to by the letter $r$ and is calculated as: 

<br>

$$
r = \frac{\sum\limits_{i=1}^{N} (Y_{i} - \bar{Y}) (X_{i} - \bar{X})}{\sqrt{\sum\limits_{i=1}^{N} (Y_{i} - \bar{Y})^{2} \sum\limits_{i=1}^{N} (X_{i} - \bar{X})^{2}}}
$$

<br>

__NB:__ the population correlation is denoted by Greek letter $\rho$

---
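
The formula above can be checked numerically. The following is a minimal sketch, assuming synthetic data rather than either of the Koop data sets; the helper `sample_correlation` and the arrays `x_demo` and `y_demo` are illustrative names that do not appear elsewhere in these notes.

```python
import numpy as np

def sample_correlation(x, y):
    """Sample correlation r, computed term by term from the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                      # (X_i - X-bar)
    dy = y - y.mean()                      # (Y_i - Y-bar)
    numerator = np.sum(dy * dx)
    denominator = np.sqrt(np.sum(dy ** 2) * np.sum(dx ** 2))
    return numerator / denominator

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(7)
x_demo = rng.normal(size=200)
y_demo = 0.5 * x_demo + rng.normal(size=200)

r_manual = sample_correlation(x_demo, y_demo)
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]   # library value for comparison
print(r_manual, r_numpy)
```

The two printed values should agree up to floating-point rounding, since `np.corrcoef` computes the same Pearson correlation.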

<br>
<br>

### __Properties of Correlation__

<br>

1. $r$ always lies between -1 and 1, which may be written as $-1 \le r \le 1$

2. Positive values of $r$ indicate a positive correlation between $X$ and $Y$. Negative values indicate a negative correlation. $r = 0$ indicates that $X$ and $Y$ are uncorrelated. 

3. Values of $r$ closer to 1 indicate stronger positive correlation. $r = 1$ indicates perfect positive correlation. Values of $r$ closer to -1 indicate stronger negative correlation. $r = -1$ indicates perfect negative correlation.

4. The correlation between $X$ and $Y$ is the same as the correlation between $Y$ and $X$.

5. The correlation between any variable and itself (e.g. the correlation between $X$ and $X$) is 1.
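
These properties can be verified on simulated data. This is a sketch only: the arrays and the small `corr` helper are assumptions introduced for illustration, not part of Koop's text.

```python
import numpy as np

rng = np.random.default_rng(7)
x_demo = rng.normal(size=100)
y_demo = rng.normal(size=100)

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

r_xy = corr(x_demo, y_demo)
r_yx = corr(y_demo, x_demo)                          # property 4: symmetry
r_xx = corr(x_demo, x_demo)                          # property 5: equals 1
r_perfect_pos = corr(x_demo, 2.0 * x_demo + 1.0)     # exact increasing line
r_perfect_neg = corr(x_demo, -3.0 * x_demo + 5.0)    # exact decreasing line

print(r_xy, r_yx, r_xx, r_perfect_pos, r_perfect_neg)
```

An exact linear relationship with positive slope gives $r = 1$ and one with negative slope gives $r = -1$, up to floating-point rounding; anything noisier falls strictly between the two.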

<br>

## __Understanding Correlation Through Verbal Reasoning__

<br>

* Data scientists (statisticians, econometricians, etc.) use the word correlation in much the same way as the layperson

* We will use a deforestation/population density example to illustrate this verbally

<br>

<u><b>Example: The Correlation Between Deforestation and Population Density</b></u> 

Let's look at the file `FOREST.XLS`

<br>

In [3]:
df = pd.read_excel('FOREST.XLS')

In [5]:
df.head(10)

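|  | Forest loss | Pop dens | Crop ch | Pasture ch |
|---|---|---|---|---|
| 0 | 0.7 | 357.0 | 27.9 | 0.0 |
| 1 | 0.7 | 48.0 | 1.7 | 0.0 |
| 2 | 0.8 | 932.0 | 14.5 | 0.0 |
| 3 | 0.7 | 366.0 | 17.9 | 0.0 |
| 4 | 0.8 | 83.0 | 2.2 | 0.0 |
| 5 | 0.0 | 22.0 | 5.1 | 0.0 |
| 6 | 0.0 | 67.0 | 4.0 | -6.6 |
| 7 | 0.6 | 413.0 | 0.0 | 0.0 |
| 8 | 0.3 | 496.0 | 0.4 | -1.1 |
| 9 | 0.5 | 458.0 | 6.5 | 0.0 |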


In [6]:
df.tail()

|  | Forest loss | Pop dens | Crop ch | Pasture ch |
|---|---|---|---|---|
| 65 | 0.6 | 327.0 | 4.1 | 5.8 |
| 66 | 1.7 | 409.0 | 9.4 | 29.2 |
| 67 | 2.4 | 117.0 | 26.7 | 33.5 |
| 68 | 0.4 | 179.0 | 6.1 | 0.0 |
| 69 | 1.2 | 234.0 | 4.3 | 2.9 |


In [7]:
df.describe()

|  | Forest loss | Pop dens | Crop ch | Pasture ch |
|---|---|---|---|---|
| count | 70.0 | 70.0 | 70.0 | 70.0 |
| mean | 1.138571 | 639.427 | 6.937143 | 3.002857 |
| std | 0.928189 | 726.339977 | 8.305022 | 8.444808 |
| min | 0.0 | 0.89 | -2.9 | -20.1 |
| 25% | 0.6 | 139.25 | 1.55 | 0.0 |
| 50% | 0.9 | 354.0 | 3.9 | 0.0 |
| 75% | 1.375 | 874.5 | 8.75 | 2.225 |
| max | 5.3 | 2769.0 | 39.7 | 33.5 |


In [8]:
x = df['Forest loss']
y = df['Pop dens']

(x.corr(y), y.corr(x))

(0.6591501274300241, 0.6591501274300242)

In [9]:
np.corrcoef(df['Forest loss'], df['Pop dens'])

array([[1.        , 0.65915013],
       [0.65915013, 1.        ]])

In [10]:
stats.pearsonr(x, y)

(0.659150127430024, 5.503310411082788e-10)

<br>

We find that the correlation between deforestation and population density is $0.66$.

Because this is a positive number, we can say the following:

1. There is a positive relationship (or positive association) between deforestation and population density

2. Countries with high population densities tend to have high deforestation. Countries with low population densities tend to have low levels of deforestation
    - NB: note the word _"tend"_ here
    - This is not a causal relationship, but rather a "general tendency"
    - It outlines a broad pattern that may not hold in particular cases

3. Deforestation rates vary across countries as do population densities (thus the name "variables")
    - Some countries have high deforestation, others have low rates
    - This high/low cross-country variation in deforestation rates tends to "match up" with the high/low variation in population density
    
These statements are based on the positive value of $r$. If it were negative, the opposite statements would be true.


4. The degree to which deforestation rates vary across countries can be measured numerically using the formula for the standard deviation
    - The fact that deforestation and population density are positively correlated means that their patterns of cross-country variability tend to match up
    - The correlation squared $r^{2}$ measures the proportion of the cross-country variability in deforestation that matches up with, or is explained by, the variance in population density
    - Correlation is a numerical measure of the degree to which patterns in $X$ and $Y$ correspond
    - In our example $r^{2} \approx 0.43$ (squaring the rounded value gives $0.66^{2} \approx 0.44$), so we can say that roughly $43\%$ of the cross-country variance in deforestation can be explained by the cross-country variance in population density
    
<br>

In [11]:
x.corr(y) ** 2.0

0.434478890491017
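
The "proportion explained" reading of $r^{2}$ can be made concrete by fitting a straight line and comparing variances. The sketch below assumes synthetic data and uses `scipy.stats.linregress` for the line fit; the variable names are illustrative, and simple regression itself is covered later in the course rather than in this chapter.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(7)
x_demo = rng.normal(size=500)
y_demo = 1.0 + 0.8 * x_demo + rng.normal(size=500)

r = np.corrcoef(x_demo, y_demo)[0, 1]

# Fit y = intercept + slope * x by least squares
fit = stats.linregress(x_demo, y_demo)
fitted = fit.intercept + fit.slope * x_demo

# Share of the variability in y that the fitted line accounts for
ss_total = np.sum((y_demo - y_demo.mean()) ** 2)
ss_resid = np.sum((y_demo - fitted) ** 2)
explained_share = 1.0 - ss_resid / ss_total

print(r ** 2, explained_share)
```

For a least-squares line with an intercept, the two printed numbers coincide up to rounding, which is why $r^{2}$ can be described as the share of variability in one variable that matches up with the other.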

<br>


<br>

<br>

<u><b>Example 2: House Prices in Windsor, Canada</b></u>

<br>

In [12]:
house_prices = pd.read_excel("HPRICE.XLS")

In [13]:
house_prices.head()

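|  | sale price | lot size | #bedroom | #bath | #stories | driveway | rec room | basement | gas | air cond | #garage | desire loc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42000 | 5850 | 3 | 1 | 2 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 38500 | 4000 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 49500 | 3060 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 60500 | 6650 | 3 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 61000 | 6360 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |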


In [14]:
house_prices.describe()

|  | sale price | lot size | #bedroom | #bath | #stories | driveway | rec room | basement | gas | air cond | #garage | desire loc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 546.0 | 546.0 | 546.0 | 546.0 | 546.0 | 546.0 | 546.0 | 546.0 | 546.0 | 546.0 | 546.0 | 546.0 |
| mean | 68121.59707 | 5150.265568 | 2.965201 | 1.285714 | 1.807692 | 0.858974 | 0.177656 | 0.349817 | 0.045788 | 0.31685 | 0.692308 | 0.234432 |
| std | 26702.670926 | 2168.158725 | 0.737388 | 0.502158 | 0.868203 | 0.348367 | 0.382573 | 0.477349 | 0.209216 | 0.465675 | 0.861307 | 0.424032 |
| min | 25000.0 | 1650.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 49125.0 | 3600.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 62000.0 | 4600.0 | 3.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 82000.0 | 6360.0 | 3.0 | 2.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| max | 190000.0 | 16200.0 | 6.0 | 4.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 |


<br>

* Let $Y =$ the sales price of the house

* Let $X =$ the size of the lot in square feet

* $r_{X,Y} = 0.54$

<br>

In [15]:
y = house_prices['sale price']
x = house_prices['lot size']
y.corr(x)

0.5357956724321841

<br>

The following statements can be made:

1. Houses with large lots tend to be worth more than those with small lots

2. There is a positive relationship between lot size and sales price

3. The variance in lot size accounts for $29\%$ (i.e. $0.54^{2} = 0.29$) of the variability in house prices

<br>

In [16]:
y.corr(x) ** 2.0

0.28707700259705626

<br>

* Now let's add a 3rd variable $Z =$ the number of bedrooms

* Calculating the correlation between sales price and the number of bedrooms, we obtain $r_{Y,Z} = 0.37$

<br>

In [17]:
z = house_prices['#bedroom']
y.corr(z)

0.3664473586641068

In [18]:
x.corr(z)

0.1518514921213393
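
With three or more variables, the pairwise correlations are often collected into a correlation matrix. The sketch below uses `DataFrame.corr()` on a synthetic stand-in for the housing data; the column names mirror `HPRICE.XLS`, but every value and coefficient is made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (all values are assumptions)
rng = np.random.default_rng(7)
n = 300
lot = rng.uniform(2000, 12000, size=n)
beds = rng.integers(1, 6, size=n)          # integers from 1 to 5
price = 20000 + 5.0 * lot + 8000 * beds + rng.normal(0, 15000, size=n)

demo = pd.DataFrame({'sale price': price, 'lot size': lot, '#bedroom': beds})

# Each entry (i, j) is the correlation between column i and column j
corr_matrix = demo.corr()
print(corr_matrix.round(2))
```

The matrix is symmetric with ones on the diagonal, reflecting properties 4 and 5 above. On the real data, `house_prices[['sale price', 'lot size', '#bedroom']].corr()` should reproduce the three pairwise values computed in the cells above.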

<br>

### __Causality__

* We are often interested in finding out whether or not one variable "causes" another

* We will not give a complete definition of causality here

* We can use the positive correlation between lot size and sales price to get at this concept

* Lot size is a variable that directly influences (roughly causes) sales price

* House prices do not influence (cause) lot size

* In other words, the direction of causality flows from lot size to house prices, but not the other way around

* __Q:__ what would happen if a homeowner were to purchase some adjacent land and thereby increase the lot size?
    - This action would tend to increase the value of the house 
    - The reverse question, "will increasing the price of the house cause lot size to increase?", has an obvious answer: no.
    
* We can make similar statements about number of bedrooms

<br>

* It is important to know how to interpret results

* This house price example illustrates this principle

* It is not enough to simply report that $r_{Y,X} = 0.54$

* Interpretation requires a good intuitive knowledge of what a correlation is in conjunction with common sense about the phenomenon under study

<br>

### __Exercise__

* __(A)__ Using the data in `HPRICE.XLS`, calculate and interpret the mean, standard deviation, minimum and maximum of $Y = $ sales price, $X =$ lot size and $Z =$ number of bedrooms

* __(B)__ Verify that the correlation between $X$ and $Y$ is the same as given in the example. Repeat for $X$ and $Z$ then for $Y$ and $Z$

* __(C)__ Now add a new variable, $W =$ number of bathrooms. Calculate the mean of $W$

* __(D)__ Calculate and interpret the correlation between $W$ and $Y$. Discuss to what extent it can be said that $W$ causes $Y$.

* __(E)__ Repeat part (D) for $W$ and $X$ and then for $W$ and $Z$.

<br>

<br>

## __Understanding Why Variables Are Correlated__

<br>

* In the deforestation/population density example there was a positive correlation


* But what exact form does this relationship take?

* We like to think in terms of causality

* It may be the case that the positive correlation means that population density causes deforestation

* Similarly, in a wage regression, a positive correlation between education levels and wages might be interpreted as additional education causing higher wages

* But we must be cautious because, while correlation can be high between variables, it need not mean that one causes the other

<br>

### __Correlation Does Not Necessarily Imply Causality__

<br>

* See here: https://www.tylervigen.com/spurious-correlations
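
One way two variables can be strongly correlated without either causing the other is a shared driver. The simulation below is a sketch under that assumption: `common`, `a`, and `b` are hypothetical series, and neither `a` nor `b` enters the construction of the other.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# A hidden common factor drives both series (hypothetical setup)
common = rng.normal(size=n)
a = common + rng.normal(scale=0.5, size=n)
b = common + rng.normal(scale=0.5, size=n)

# a and b are strongly correlated even though neither causes the other
r_ab = np.corrcoef(a, b)[0, 1]

# After removing the common factor, what is left is essentially uncorrelated
r_resid = np.corrcoef(a - common, b - common)[0, 1]

print(r_ab, r_resid)
```

The first correlation is large and the second is close to zero, so the association between `a` and `b` comes entirely from the shared factor. A high $r$ alone cannot distinguish this situation from a genuinely causal one.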

<br>