# Module 6 - Correlation & Linear Regression

Bring in our usual libraries. . . . 

In [1]:
import pandas as pd
import numpy as np

# WHY ARE WE EVEN DOING THIS?

The overall purpose of conducting regression analyses is to discover the RELATIONSHIP between variables of interest. As an added bonus, regression can show us the relationships between variables while simultaneously controlling for the effects of other variables - for example:

I want to figure out the relationship between age and hair loss. I have a feeling that these two variables have a POSITIVE LINEAR relationship (as AGE increases, HAIR LOSS increases). However, I know it can't be this simple. . . .  What if the relationship between age and hair loss is influenced by gender, geographic location, medical history, family history, hat preference, and hair color... the list is endless! Therefore, to figure out the relationship between age and hair loss - we have to CONTROL for the potential influence that all these other factors might have on that relationship. We are able to do this with regression analyses . . .  at the end of the analyses we will be able to more confidently say: 

the relationship between age and hair loss is this, and I know this because I thought about all the other factors that might influence that relationship, and I controlled for them - so the results I'm seeing are more accurate. 

The entire purpose of regression is to tease out the relationships between variables, but before we get to interpret our results, there are a few things we need to do.


## Correlation

In [2]:
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)

df.head()

|   | fname | lname | gender | age | exercise | hours | grade | address |
|---|---|---|---|---|---|---|---|---|
| 0 | Marcia | Pugh | female | 17 | 3 | 10 | 82.4 | 9253 Richardson Road, Matawan, NJ 07747 |
| 1 | Kadeem | Morrison | male | 18 | 4 | 4 | 78.2 | 33 Spring Dr., Taunton, MA 02780 |
| 2 | Nash | Powell | male | 18 | 5 | 9 | 79.3 | 41 Hill Avenue, Mentor, OH 44060 |
| 3 | Noelani | Wagner | female | 14 | 2 | 7 | 83.2 | 8839 Marshall St., Miami, FL 33125 |
| 4 | Noelani | Cherry | female | 18 | 4 | 15 | 87.4 | 8304 Charles Rd., Lewis Center, OH 43035 |


To help us determine which variables are best to include in our regression, we want to figure out which variables have at least some kind of relationship to each other. This is because regressions work best when we don't overrun them with meaningless variables - we only want to include the variables that are meaningfully related to our dependent variable (refresher below). To do this, we run a CORRELATION. This analysis shows us the strength and direction of the linear relationship between each pair of variables. Remember, only numeric variables have meaningful output for a correlation, so text columns such as names and addresses are left out.

Variable Refresher:

<b>INDEPENDENT VARIABLE:</b> The independent variable is the variable whose change isn’t affected by any other variable in the experiment. Either the scientist has to change the independent variable herself or it changes on its own; nothing else in the experiment affects or changes it. Two examples of common independent variables are age and time. There’s nothing you or anything else can do to speed up or slow down time or increase or decrease age. They’re independent of everything else.

<b>DEPENDENT VARIABLE:</b> The dependent variable is what is being studied and measured in the experiment. It’s what changes as a result of the changes to the independent variable. An example of a dependent variable is how tall you are at different ages. The dependent variable (height) depends on the independent variable (age).


In [3]:
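# keep only the numeric columns; recent pandas versions raise an error if text columns are passed to corr()
df.select_dtypes(include='number').corr()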

|   | age | exercise | hours | grade |
|---|---|---|---|---|
| age | 1.0 | -0.003643 | -0.017467 | -0.00758 |
| exercise | -0.003643 | 1.0 | 0.021105 | 0.161286 |
| hours | -0.017467 | 0.021105 | 1.0 | 0.801955 |
| grade | -0.00758 | 0.161286 | 0.801955 | 1.0 |


The results of a correlation give us a numeric estimate of the strength and direction of the linear relationship between each pair of variables. Each value falls between -1 and 1. The sign (+/-) doesn't matter when judging the strength of the relationship; the absolute value is what matters. The further the number is from 0 (regardless of whether it is negative), the stronger the relationship. If a number is negative, it means that the relationship is negative (or inverse) - when one value goes up, the other goes down. When the number is positive, it means that the relationship is positive - the values move in the same direction - when one goes up the other goes up, when one goes down the other goes down.
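
To make the "sign versus strength" idea concrete, here is a minimal sketch that computes a Pearson correlation by hand. The helper `pearson_r` and the two tiny data sets are made up for illustration (they are not taken from gradedata.csv); the point is only that a perfectly rising and a perfectly falling relationship have the same strength and opposite signs.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-movement of x and y, scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up values for illustration only
hours = [1, 2, 3, 4, 5]
grade_up = [60, 65, 70, 75, 80]    # rises as hours rise
grade_down = [80, 75, 70, 65, 60]  # falls as hours rise

r_up = pearson_r(hours, grade_up)
r_down = pearson_r(hours, grade_down)

print("rising relationship: ", r_up)
print("falling relationship:", r_down)
print("same strength?", np.isclose(abs(r_up), abs(r_down)))
```

`np.corrcoef` and the pandas `corr()` method compute this same quantity by default.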

Our output shows us:

* Age doesn't have a strong relationship with any variables (every correlation is within 0.02 of zero).

* Exercise has a weak, positive relationship with grade (0.16) and almost no relationship with hours (0.02).

* Hours has a strong, positive relationship with grade (0.80).


Looking at this, it would seem that Age doesn't have much to do with the other variables. But since this is an exercise, we're still going to include it in our regression. However, if you were in the position of having to determine which variables to eliminate, you would start with the ones that have the weakest relationship with the dependent variable.
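
If you ever do need to decide which variables to drop, one simple approach is to rank them by the absolute value of their correlation with the dependent variable. The sketch below rebuilds the correlation matrix from the output above and uses a hypothetical helper, `rank_predictors`, that is not part of pandas.

```python
import pandas as pd

# Correlation matrix copied from the corr() output above
cols = ["age", "exercise", "hours", "grade"]
corr = pd.DataFrame(
    [
        [1.0, -0.003643, -0.017467, -0.00758],
        [-0.003643, 1.0, 0.021105, 0.161286],
        [-0.017467, 0.021105, 1.0, 0.801955],
        [-0.00758, 0.161286, 0.801955, 1.0],
    ],
    index=cols,
    columns=cols,
)

def rank_predictors(corr_matrix, target):
    """Rank the other variables by strength (absolute value) of correlation with target."""
    strength = corr_matrix[target].drop(target).abs()  # ignore sign and the target itself
    return strength.sort_values(ascending=False)

ranked = rank_predictors(corr, "grade")
print(ranked)
```

The variable at the bottom of the ranking is the first candidate for removal. Correlation only looks at one pair of variables at a time, so treat this as a screening step and not a final decision.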


## Linear Regression

Now that we know which variables we want to include in our regression, we have to set up the regression analysis and write our code. 

First we bring in the library needed to run the regression. . . 

In [5]:
import statsmodels.formula.api as smf

Next we write out the regression equation. This is where we specify which variable we are considering to be dependent/independent. 

In the below equation GRADE is the dependent variable. AGE, EXERCISE, HOURS are the independent variables. 

In [6]:
result = smf.ols('grade ~ age + exercise + hours', data=df).fit()
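
To see what the formula `'grade ~ age + exercise + hours'` is asking for, here is a rough sketch of the same kind of least-squares fit done with plain NumPy. The data are made up and noise-free, built from coefficients chosen for this illustration, so the fit should recover them; this is not how statsmodels is implemented internally, just the underlying idea.

```python
import numpy as np

# Made-up, noise-free data built from known coefficients (illustration only)
age = np.array([14.0, 15.0, 16.0, 17.0, 18.0, 19.0])
exercise = np.array([2.0, 5.0, 3.0, 4.0, 1.0, 3.0])
hours = np.array([7.0, 4.0, 10.0, 9.0, 15.0, 6.0])
grade = 50.0 + 0.5 * age + 1.0 * exercise + 2.0 * hours

# 'grade ~ age + exercise + hours' means:
# a column of ones for the intercept, plus one column per independent variable
X = np.column_stack([np.ones_like(age), age, exercise, hours])

# Ordinary least squares: find the coefficients that minimize the squared errors
coefs, *_ = np.linalg.lstsq(X, grade, rcond=None)
intercept, b_age, b_exercise, b_hours = coefs

print("intercept:", round(float(intercept), 4))
print("age:", round(float(b_age), 4))
print("exercise:", round(float(b_exercise), 4))
print("hours:", round(float(b_hours), 4))
```

The variable on the left of the `~` is the dependent variable; everything on the right becomes a column in `X`.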

To see the results of the analysis we use the code below to get the summary. 

In [7]:
result.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.664
Model:,OLS,Adj. R-squared:,0.664
Method:,Least Squares,F-statistic:,1315.0
Date:,"Tue, 05 Mar 2019",Prob (F-statistic):,0.0
Time:,20:42:40,Log-Likelihood:,-6300.7
No. Observations:,2000,AIC:,12610.0
Df Residuals:,1996,BIC:,12630.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,57.8704,1.321,43.804,0.000,55.279,60.461
age,0.0397,0.075,0.532,0.595,-0.107,0.186
exercise,0.9893,0.089,11.131,0.000,0.815,1.164
hours,1.9165,0.031,61.564,0.000,1.855,1.978

0,1,2,3
Omnibus:,321.187,Durbin-Watson:,2.047
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2196.187
Skew:,-0.567,Prob(JB):,0.0
Kurtosis:,8.007,Cond. No.,213.0


Okay, now we are in business!

There is A LOT of information in these results. Let's break it down with just the most important aspects:

<b>Adj. R-squared</b>
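* This number shows the proportion of the variation in the dependent variable that is explained by our regression equation, adjusted for the number of independent variables in the model. The higher this number, the better the model (regression equation/analysis) fits. When you are running multiple models (and swapping out variables), you typically want to go with the model that has the highest Adj. R-squared, as long as the models share the same dependent variable and all include an intercept.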

<b>Intercept</b>
* This is showing what the value of the dependent variable would be if all the independent variables (IV) are 0. Sometimes this makes sense, sometimes this doesn't. In this example, the dependent variable (DV) is Grade - so when Age (that would be weird), Exercise, and Hours of Studying are all 0, your starting grade would be around 57.87. 

<b>Coef</b>
* The regression coefficients represent the mean change in the DV for a one-unit change in the IV, while holding the other IVs constant. You can think of the coefficients as slopes. For example - for every one-year increase in AGE, GRADE increases by 0.0397, with EXERCISE and HOURS held constant.

<b>p-value</b>
* This is the crème de la crème of the regression output. The p-value - shown in the output as "P>|t|" - lets you know whether a relationship shown in the regression is statistically significant. The p-value is the probability of seeing a relationship at least as strong as the one in your data if there were actually no true relationship. Therefore, we want a *small* p-value - and typically a p-value <= 0.05 is considered a statistically significant result.
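
To connect the "t" and "P>|t|" columns, here is a small sketch that turns a t value into a two-sided p-value. It uses a normal approximation to the t distribution, which is an assumption that is reasonable here only because the model has about 2000 observations; statsmodels itself uses the exact t distribution. The helper name `two_sided_p_normal` is made up for this example.

```python
import math

def two_sided_p_normal(t_stat):
    """Two-sided p-value from a t statistic, using a normal approximation (large samples only)."""
    return math.erfc(abs(t_stat) / math.sqrt(2))

# t values copied from the first regression summary above
p_age = two_sided_p_normal(0.532)
p_exercise = two_sided_p_normal(11.131)

alpha = 0.05  # conventional significance threshold

print("age p-value:", round(p_age, 3), "significant?", p_age <= alpha)
print("exercise significant?", p_exercise <= alpha)
```

A t value close to 0 (like AGE) gives a large p-value, and a t value far from 0 (like EXERCISE) gives a p-value that is practically 0.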

From our results, we can see that the Adj R-Squared is 0.664 (not bad), and that the intercept, exercise, and hours terms are statistically significant (p-values shown as 0.000), while age is not (p = 0.595). This makes sense - since we saw that Age didn't have a strong correlation with grade.
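
One way to get a feel for the coefficients is to plug them into the regression equation by hand. The sketch below copies the coefficients from the summary above and predicts the grade for the first student shown by `df.head()`; the function name `predict_grade` is made up for this example.

```python
# Coefficients copied from the first regression summary above
coef = {"Intercept": 57.8704, "age": 0.0397, "exercise": 0.9893, "hours": 1.9165}

def predict_grade(age, exercise, hours):
    """Regression equation: intercept plus each coefficient times its variable."""
    return (
        coef["Intercept"]
        + coef["age"] * age
        + coef["exercise"] * exercise
        + coef["hours"] * hours
    )

# First row of df.head(): age 17, exercise 3, hours 10, actual grade 82.4
predicted = predict_grade(17, 3, 10)
residual = 82.4 - predicted  # how far off the model is for this student

# One more hour of studying, everything else unchanged
delta = predict_grade(17, 3, 11) - predicted

print("predicted grade:", round(predicted, 2))
print("actual minus predicted:", round(residual, 2))
print("change for one extra hour:", round(delta, 4))
```

The change for one extra hour equals the HOURS coefficient, which is exactly what "holding the other variables constant" means.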

At this point, we want to work with our model to see if we can increase the fit... there are a few things to try at this point. 

First, we want to try the model without the non-significant variable(s). In this case, that's AGE. Let's remove age from our equation and see how the model fit changes.

Recall that in the model above, with age, exercise, and hours all at 0, the predicted starting grade was around 57.87. Keep an eye on how the intercept shifts once age is removed.

In [8]:
# remove age from the regression: it was not statistically significant (p = 0.595)
result = smf.ols('grade ~ exercise + hours', data=df).fit()
result.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.664
Model:,OLS,Adj. R-squared:,0.664
Method:,Least Squares,F-statistic:,1973.0
Date:,"Tue, 05 Mar 2019",Prob (F-statistic):,0.0
Time:,21:03:01,Log-Likelihood:,-6300.8
No. Observations:,2000,AIC:,12610.0
Df Residuals:,1997,BIC:,12620.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,58.5316,0.447,130.828,0.000,57.654,59.409
exercise,0.9892,0.089,11.131,0.000,0.815,1.163
hours,1.9162,0.031,61.575,0.000,1.855,1.977

0,1,2,3
Omnibus:,318.721,Durbin-Watson:,2.048
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2158.0
Skew:,-0.564,Prob(JB):,0.0
Kurtosis:,7.962,Cond. No.,43.2


So we took AGE out of our regression - and the Adj R-squared didn't change, and the significance of our IV's also didn't change. This isn't a bad thing, but it means that AGE may not be influencing the model much. 
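
The reason we look at the adjusted R-squared when adding or removing variables is that it includes a penalty for each extra independent variable. Here is a sketch of the standard formula for models that include an intercept. The inputs are the rounded values from the summaries above, so the last digit may differ slightly from what statsmodels prints.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Same (rounded) R-squared and 2000 observations; 3 predictors versus 2
with_age = adjusted_r2(0.664, 2000, 3)
without_age = adjusted_r2(0.664, 2000, 2)

print("with age:   ", round(with_age, 4))
print("without age:", round(without_age, 4))
```

With 2000 observations the penalty for one extra variable is tiny, which is why the two summaries show the same Adj. R-squared after rounding. On a small dataset the penalty is much more noticeable.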

Even if it makes sense overall - we want to also try a model without the "Intercept" variable. To do this, we simply add a "-1" to the end of our regression equation in our code. 

In [9]:
# remove the intercept (force it to 0) by adding "- 1" to the formula
result = smf.ols(formula='grade ~ age + exercise + hours - 1', data=df).fit()
result.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.991
Model:,OLS,Adj. R-squared:,0.991
Method:,Least Squares,F-statistic:,72840.0
Date:,"Tue, 05 Mar 2019",Prob (F-statistic):,0.0
Time:,21:05:00,Log-Likelihood:,-6974.3
No. Observations:,2000,AIC:,13950.0
Df Residuals:,1997,BIC:,13970.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
age,3.1129,0.035,88.030,0.000,3.044,3.182
exercise,1.7659,0.122,14.482,0.000,1.527,2.005
hours,2.2860,0.042,54.486,0.000,2.204,2.368

0,1,2,3
Omnibus:,131.221,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,403.367
Skew:,-0.301,Prob(JB):,2.5700000000000003e-88
Kurtosis:,5.116,Cond. No.,14.2


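HUGE change in that Adj R-squared! And AGE suddenly decided to show up for the party!

Be careful, though: the jump from 0.664 to 0.991 does not mean that this model fits our data better. When a model has no intercept, statsmodels reports an uncentered R-squared (it measures variation around 0 instead of variation around the mean grade), so it cannot be compared with the R-squared of the earlier models. The measures that can be compared point the other way: the Log-Likelihood dropped from -6300.7 to -6974.3 and the AIC rose from 12610 to 13950, and both changes indicate a worse fit. AGE most likely looks significant here because it is doing the job the intercept used to do. With that caveat in mind, here is how we would read the coefficients of this model: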

<b> AGE </b>

For every one-year increase in AGE, GRADE increases by 3.11.

<b> EXERCISE </b>

For every one hour increase in EXERCISE time, GRADE increases by 1.77. 

<b> HOURS </b>

For every additional HOUR spent studying, GRADE increases by 2.29. 


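*Taken at face value, this model says that older, more studious, and more health-conscious students get better grades. Treat the AGE part with caution: AGE was not significant (p = 0.595) in the model that included an intercept.*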

<b>*side note*</b> if any of our coefficients were negative, this would indicate that the relationship with the DV is inverse, i.e. if the coef for AGE was "-3.11" -- we could interpret this as: For every one year increase of AGE, GRADE decreases by 3.11. 
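
A caution about the no-intercept model: its R-squared is not measured the same way as the R-squared of a model with an intercept, so the two numbers should not be compared directly. The sketch below uses made-up data in which grade is generated independently of age, so age truly explains nothing. It then computes the usual (centered) R-squared for a fit with an intercept and the uncentered R-squared for a fit through the origin. All names and numbers here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(14, 19, n)      # made-up ages
grade = 80 + rng.normal(0, 5, n)  # made-up grades, generated independently of age

def fit_ssr(X, y):
    """Least-squares fit; returns the sum of squared residuals."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    return float(resid @ resid)

# Model with an intercept: column of ones plus age
ssr_with = fit_ssr(np.column_stack([np.ones(n), age]), grade)
# Model without an intercept: age only, line forced through the origin
ssr_without = fit_ssr(age.reshape(-1, 1), grade)

# Centered R-squared compares against variation around the mean grade
r2_centered = 1 - ssr_with / float(((grade - grade.mean()) ** 2).sum())
# Uncentered R-squared compares against variation around zero
r2_uncentered = 1 - ssr_without / float((grade ** 2).sum())

print("with intercept (centered):     ", round(r2_centered, 3))
print("without intercept (uncentered):", round(r2_uncentered, 3))
print("residual error is larger without the intercept:", ssr_without >= ssr_with)
```

The uncentered value comes out very high even though age has no real relationship with grade, because grades sit far away from zero. When comparing models with and without an intercept, measures such as AIC or the Log-Likelihood from the summary are a safer guide.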

NOW GO FORTH AND TRY THE EXERCISES!

### Your Turn

Run a correlation and regression on the dataset below. What can you tell from the data?

In [None]:
Location = "datasets/datasets/tamiami.csv"

df = pd.read_csv(Location)
df.head()

In [None]:
columns = ['location', 'sales', 'employees', 'restaurants', 'foodcarts', 'price']

#change column names for readability
df.columns = columns
df.head()