# [ESPM-163ac] Lecture - Introduction to Prediction!

*Estimated Time: 50 minutes*

Now that you have had an introduction to programming, we are going to start using these tools to explore our dataset!

In [None]:
# This is import code we will use later. Just run the cell.
import numpy as np
import matplotlib.pyplot as plt
from datascience import *
import statsmodels.formula.api as sm
%matplotlib inline 
plt.style.use("fivethirtyeight")

In [None]:
# Here is our dataset for reference.
Table.read_table("../data/ces_data.csv").show(5)

## 1. Correlation

Correlation is used to test relationships between quantitative variables or categorical variables. In other words, it’s a measure of how things are related. The study of how variables are correlated is called correlation analysis.

Some examples of data that have a high correlation:

- Your caloric intake vs. your weight.
- Your eye color vs. your relatives’ eye colors.
- The amount of time you study vs. your GPA.
- Alcohol consumed vs. your blood alcohol content.

Some examples of data that have a low correlation (or none at all):

- Your sexual preference vs. the type of cereal you eat.
- A dog’s name vs. the type of dog biscuit they prefer.
- The cost of a car wash vs. how long it takes to buy a soda inside the station.

Correlations are useful because once you know how variables are related, you can make predictions about future behavior. Being able to anticipate outcomes is very important in fields such as government and healthcare.

You make decisions based on the relationship between two events all the time: if it's 2pm on a Thursday of Deadweek, you would predict that the number of seats available on Moffitt Floor 5 is close to none and think twice about trying your luck there. As simple as this is, this is correlation and prediction at work: time of semester vs. the number of seats available on Moffitt Floor 5. This is exactly what we are doing in this lab -- **the correlation coefficient simply assigns a number to the *type* and *strength* of a relationship between two events**.

The **correlation coefficient** (r) puts a value to the relationship and shows how strong it is. The value is between -1 and 1, where 0 means no linear relationship, -1 is a perfect negative relationship, and 1 is a perfect positive relationship. Correlation is also necessary for regression (which we will get to later).

![image](./images/correlation-examples.svg)
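
To see where values like these come from, here is a minimal sketch on made-up numbers (the lists `hours`, `score_up`, `score_down`, and `score_mixed` are hypothetical and are not from our dataset). It converts each list to standard units and averages the products, which is the same recipe the `correlation` function in this notebook uses, written here with Python's standard library instead of `numpy`.

```python
from statistics import fmean, pstdev

def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    mean, sd = fmean(values), pstdev(values)
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Average of the products of the two lists in standard units."""
    products = [x * y for x, y in zip(standard_units(xs), standard_units(ys))]
    return fmean(products)

# Hypothetical data, made up for illustration only.
hours = [1, 2, 3, 4, 5]
score_up = [52, 54, 56, 58, 60]      # rises in a perfect straight line
score_down = [60, 58, 56, 54, 52]    # falls in a perfect straight line
score_mixed = [56, 52, 60, 52, 56]   # no overall upward or downward trend

r_up = correlation(hours, score_up)
r_down = correlation(hours, score_down)
r_mixed = correlation(hours, score_mixed)
print(round(r_up, 3), round(r_down, 3), round(r_mixed, 3))
```

Up to floating-point rounding, the three results are 1, -1, and 0: a perfect positive relationship, a perfect negative relationship, and no linear relationship.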

If we wanted to look at the relationship between two of the variables in our dataset, we could calculate the correlation. For example, we could ask how race is related to a particular health factor, such as asthma.

In [None]:
data = Table.read_table("../data/ces_data.csv")
clean_data = Table.read_table("../data/cleaned_data_new.csv")
clean_data.show(5)

In [None]:
def standard_units(xyz):
    return (xyz - np.mean(xyz))/np.std(xyz) 

def correlation(t, label_x, label_y):
    return np.mean(standard_units(t.column(label_x))*standard_units(t.column(label_y)))

In [None]:
clean_data.scatter("ces_pollution_score", "asthma", alpha = .18, s = 10)

#### Based on this scatter plot, what do you think the r-value is?
In other words, about how closely are pollution and asthma related? Compare this graph with the charts above to help you identify the **type** (Positive? Negative?) and **strength** (value) of the relationship.

*Your Guess Here*

In [None]:
#Run me to find the actual correlation coefficient!
correlation(clean_data, 'ces_pollution_score', 'asthma')

It's certainly not perfect -- if you are given a pollution score, you can't say that the number of reported asthma attacks will be \_\_. However, you can definitely see that there is a positive relationship between a census tract's pollution score and the number of reported asthma attacks.

## 2. Simple Linear Regression

Linear regression is really just a term for **making predictions using lines**. That's right -- with two variables, linear regression is just a plain old line:

$$Y = mX+b$$

![image](http://onlinestatbook.com/2/regression/graphics/gpa.jpg)

In the example above:
- `Y` is what you are predicting (e.g. University GPA) and
- `X` is what you are basing the prediction off of (e.g. High School GPA)

So, all we need to make predictions are two values:
- the slope (`m`) and
- the intercept (`b`) of the line!


Amazingly, this simple line is the **best** linear predictor of the data: it is **the best line that "fits" the data**, in the sense that it makes the squared prediction errors as small as possible on average. In a moment, we will define functions to evaluate the slope and intercept of the line for some given data. Before that, let's build some intuition on what these two values signify.

If you recall from your algebra classes:

- the **y-intercept**  is just what the Y-value is expected to be when X = 0 (where the line crosses the y-axis), and 
- the **slope** tells you how much the Y-value changes when the X-value increases by one unit.
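
A small sketch can make these two definitions concrete. The slope `m = 0.5` and intercept `b = 1.5` below are made-up values chosen for illustration, not estimates from any dataset, and `predict` is a hypothetical helper.

```python
def predict(x, m, b):
    """Return the Y-value on the line Y = m*X + b at the given X."""
    return m * x + b

# Hypothetical slope and intercept, for illustration only.
m, b = 0.5, 1.5

at_zero = predict(0, m, b)                      # the intercept: Y when X = 0
per_unit = predict(3, m, b) - predict(2, m, b)  # change in Y when X goes up by one
print(at_zero, per_unit)
```

The first value equals the intercept and the second equals the slope, whichever two neighboring X-values you pick.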

In other words, the **slope** is highly dependent on the **relationship** between X and Y -- it is dependent on **the correlation coefficient**.

**We need the correlation coefficient in order to find the equation for the regression line.**
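
For reference, the `slope` and `intercept` functions defined in the next code cell amount to the two formulas below. The notation is introduced here for illustration: $r$ is the correlation coefficient, $SD_X$ and $SD_Y$ are the standard deviations of X and Y, and $\bar{X}$ and $\bar{Y}$ are their means.

$$m = r \cdot \frac{SD_Y}{SD_X}, \qquad b = \bar{Y} - m\,\bar{X}$$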

You don't need to know how any of the following functions are defined: we've given you everything you need to calculate the slope and intercept of the regression line -- all you need to do is to **run the cell below**, and we will show you how to use the functions.

In [None]:
def standard_units(xyz): #ignore this function!
    return (xyz - np.mean(xyz))/np.std(xyz) 

def correlation(table, label_x, label_y):
    return np.mean(standard_units(table.column(label_x))*standard_units(table.column(label_y)))

# We use these functions to construct the regression line. As you can see, correlation is used to 
# evaluate the slope of the regression line below.

def slope(table, label_x, label_y):
    r = correlation(table, label_x, label_y) # correlation function used in slope!
    return r*np.std(table.column(label_y))/np.std(table.column(label_x))

def intercept(table, label_x, label_y):
    return np.mean(table.column(label_y)) - slope(table, label_x, label_y)*np.mean(table.column(label_x))

## Back to our data set

Given a census table, our goal is to discover something about the data that will enhance our understanding of a population. Correlation allows us to determine a preliminary relationship between two variables (such as **asthma** and **race**) and allows us to continue with regression, which explores the relationship further and gives us a best-fit line. In the next lab, we will be putting these skills to use to calculate the correlation coefficients and regression lines with factors from our dataset. We will do this using functions that calculate them for us! They will look something like this:

In [None]:
# This calculates the slope of our regression line.
slope_of_reg_line = slope(clean_data, "ces_pollution_score", "asthma")
slope_of_reg_line

In [None]:
# This will calculate our intercept.
intercept_of_reg_line = intercept(clean_data, "ces_pollution_score", "asthma")
intercept_of_reg_line

Now that you have a slope and an intercept, you can plug them into the equation $y = mx + b$ to get the regression line.
We could plot the regression line on our data set manually. However, as shown below, we can use a scatter plot that draws it for us!
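
For a sense of what drawing the line manually involves, here is a minimal sketch on made-up numbers (the lists `x` and `y` are hypothetical, and the code uses only Python's standard library rather than the notebook's `Table` functions). It computes the slope and intercept the same way the `slope` and `intercept` functions do, then finds the two endpoints you would connect to draw the line.

```python
from statistics import fmean, pstdev

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def correlation(xs, ys):
    """Mean of the products of xs and ys after converting to standard units."""
    mx, my, sx, sy = fmean(xs), fmean(ys), pstdev(xs), pstdev(ys)
    return fmean([((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(xs, ys)])

r = correlation(x, y)
m = r * pstdev(y) / pstdev(x)   # slope
b = fmean(y) - m * fmean(x)     # intercept

# The regression line runs between these two points.
left = (min(x), m * min(x) + b)
right = (max(x), m * max(x) + b)
print("slope:", round(m, 3), "intercept:", round(b, 3))
print("endpoints:", left, right)
```

With `matplotlib`, passing the X-values and Y-values of those two endpoints to `plt.plot` would draw the line on top of a scatter plot.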

In [None]:
print('r: ', correlation(clean_data, 'ces_pollution_score', 'asthma'))
clean_data.scatter("ces_pollution_score", "asthma", fit_line=True, alpha = .18, s = 10)

We have the r-value above from the correlation function.

`r:  0.5452274394377603`

### Coefficient of Determination: How Good is our Predictive Model?

We know how to assess the relationship between two variables. This relationship is used to derive the linear regression equation. But what if we want to assess how **effective** our linear regression model is?

That is where the **Coefficient of Determination**, also called **r-squared**, comes in. It helps us assess the effectiveness of our predictive model and, more importantly, allows us to **compare** the effectiveness amongst various predictive models.

Here is all you need to know about **r-squared**:
<div class="alert alert-info">
It's a number ranging from <b>0 to 1</b> that tells you how well the model predicts the outcome: 1 is a perfect prediction (if you know the X-value, you definitely know the Y-Value) while 0 is a terrible prediction (you might as well guess!)
</div>


*Side Note:*
It's called **r-squared** because in simple linear regression, the **Coefficient of Determination** is just the **square of the Correlation Coefficient r**. However, when we get to Multiple Regression (where we use *two or more* X variables to predict a Y variable), we can't just rely on the relationship between **two** variables to evaluate the effectiveness of the model because the model uses **three or more** variables. So, using r-squared allows us to compare model performance across any type of regression model!
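
The claim that r-squared equals the square of r in simple linear regression can be checked on made-up numbers. The sketch below (hypothetical lists `x` and `y`, standard library only) computes r-squared the general way, as one minus the ratio of the line's squared prediction errors to the squared errors from always guessing the mean of Y, and compares it with the square of r.

```python
from statistics import fmean, pstdev

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x, mean_y = fmean(x), fmean(y)
sd_x, sd_y = pstdev(x), pstdev(y)

# Correlation coefficient: mean of the products in standard units.
r = fmean([((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y) for xi, yi in zip(x, y)])

# Regression line and its predictions.
m = r * sd_y / sd_x
b = mean_y - m * mean_x
predicted = [m * xi + b for xi in x]

# r-squared = 1 - (squared error of the line) / (squared error of guessing the mean).
ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
ss_total = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_residual / ss_total

print(round(r ** 2, 4), round(r_squared, 4))
```

Both printed numbers agree. The second calculation is the one that still works when the predictions come from a model with more than one X variable.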

Run the cell below to find the R-squared value:

In [None]:
model = sm.ols(formula='asthma ~ ces_pollution_score', data = clean_data)
fit = model.fit()
fit.summary()

`r:  0.5452274394377603`
`r-squared: 0.297`
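
One way to read this number: r-squared is the share of the variation in Y that the regression line accounts for. As a quick check, squaring the r-value reported above reproduces the r-squared from the model summary.

```python
# Value copied from the correlation output shown above.
r = 0.5452274394377603
r_squared = r ** 2

print(round(r_squared, 3))
print(f"About {r_squared:.0%} of the variation in asthma is accounted for by the line.")
```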

How good do you think our predictive model is, given that our r-squared value is 0.297?

## 3. Your Turn

In the previous example, we explored the relationship between an environmental factor and a health issue. Now let's look at how this health issue relates to a certain demographic.

In [None]:
# This will find the correlation coefficient between African Americans and Asthma.
print('r: ', correlation(clean_data, 'african_american', 'asthma'))
clean_data.scatter("african_american", "asthma", alpha = .18, s = 10)

`r:  0.4986847676603604`

In [None]:
# Now fill this in to find the slope of our regression line.
slope_of_reg_line = ...
slope_of_reg_line

In [None]:
# Now fill this in to find the intercept of our regression line.
intercept_of_reg_line = ...
intercept_of_reg_line

In [None]:
print('r: ', correlation(clean_data, 'african_american', 'asthma'))
clean_data.scatter("african_american", "asthma", fit_line=True, alpha = .18, s = 10)

In [None]:
model = sm.ols(formula='asthma ~ african_american', data = clean_data)
fit = model.fit()
fit.summary()

`r-squared: 0.249`

Since our r-squared value is low (far from 1), a single variable is not sufficient to predict asthma well, which suggests that we should try a multiple regression. Usually, there are multiple factors that affect an outcome, so it makes sense that we need to do more than a simple analysis.

Predictions are a powerful tool that we will explore more in the next lab using these techniques! If you would like to learn more about the theory behind standard units or explore these equations further, statistics classes or Data 8 are great places to start!

**CONGRATULATIONS!!!** You've made it through an introduction to correlation and prediction! In lab next week, we will revisit these concepts and delve further into analysis. See you then!


## Peer Consulting Office Hours
If you had trouble with any content in this notebook, Data Peer Consultants are here to help! You can check the availability of Peer Consultants on the **first floor of Moffitt Library** with this detailed [Office Hours schedule](https://data.berkeley.edu/education/peer-consulting).


---

**Bibliography:**

- [DS Modules](https://github.com/ds-modules)

*Notebook developed by: Aarish Irfan and Alleanna Clark*