# Module 6 - Correlation & Models

In [1]:
import pandas as pd
import numpy as np

## Correlation

Correlation measures the relationship between one feature (column) and another. Features can be positively correlated, meaning that they move in the same direction (if one increases, so does the other, and if one decreases, so does the other), or negatively correlated, meaning that they move in opposite directions (if one increases, the other decreases). Correlation values range from -1 to 1. Positively correlated features have values closer to 1, negatively correlated features have values closer to -1, and values near 0 indicate little or no linear relationship. Every feature is perfectly positively correlated (exact value of 1) with itself.
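As a minimal sketch of the three cases described above, the snippet below computes correlations with NumPy's `np.corrcoef`. The arrays `hours`, `grade_up`, and `grade_down` are made-up values chosen only for illustration. Up to floating-point rounding, the three results should be 1, -1, and 1.

```python
import numpy as np

# Hypothetical values, invented for this illustration
hours = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
grade_up = np.array([60.0, 65.0, 70.0, 75.0, 80.0])  # rises as hours rise
grade_down = grade_up[::-1]                           # falls as hours rise

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r_pos = np.corrcoef(hours, grade_up)[0, 1]     # same direction
r_neg = np.corrcoef(hours, grade_down)[0, 1]   # opposite directions
r_self = np.corrcoef(hours, hours)[0, 1]       # a feature with itself

print(round(r_pos, 3), round(r_neg, 3), round(r_self, 3))
```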

In [2]:
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)

df.head()

Unnamed: 0,fname,lname,gender,age,exercise,hours,grade,address
0,Marcia,Pugh,female,17,3,10,82.4,"9253 Richardson Road, Matawan, NJ 07747"
1,Kadeem,Morrison,male,18,4,4,78.2,"33 Spring Dr., Taunton, MA 02780"
2,Nash,Powell,male,18,5,9,79.3,"41 Hill Avenue, Mentor, OH 44060"
3,Noelani,Wagner,female,14,2,7,83.2,"8839 Marshall St., Miami, FL 33125"
4,Noelani,Cherry,female,18,4,15,87.4,"8304 Charles Rd., Lewis Center, OH 43035"


In [3]:
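#creates a table of pairwise correlation values for the numeric columns
#numeric_only=True skips text columns such as fname and address (pandas 2.0+ raises an error on them otherwise)
df.corr(numeric_only=True)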

Unnamed: 0,age,exercise,hours,grade
age,1.0,-0.003643,-0.017467,-0.00758
exercise,-0.003643,1.0,0.021105,0.161286
hours,-0.017467,0.021105,1.0,0.801955
grade,-0.00758,0.161286,0.801955,1.0
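In the table above, `hours` has by far the strongest correlation with `grade` (about 0.80), `exercise` a weak one (about 0.16), and `age` almost none. One way to pull out that ranking programmatically might look like the sketch below. It rebuilds only the five rows shown by `df.head()` in a DataFrame named `toy` (a name introduced here), so the numbers it prints will not match the full-dataset table. Five rows are far too few for a meaningful correlation, and the snippet only shows the mechanics.

```python
import pandas as pd

# The five rows displayed by df.head() above, retyped by hand
toy = pd.DataFrame({
    "fname": ["Marcia", "Kadeem", "Nash", "Noelani", "Noelani"],
    "age": [17, 18, 18, 14, 18],
    "exercise": [3, 4, 5, 2, 4],
    "hours": [10, 4, 9, 7, 15],
    "grade": [82.4, 78.2, 79.3, 83.2, 87.4],
})

# Keep only numeric columns so text columns do not break corr()
numeric = toy.select_dtypes(include="number")

# Correlation of every feature with the target, excluding grade itself
corr_with_grade = numeric.corr()["grade"].drop("grade")

# Order by strength (absolute value) while keeping the sign
order = corr_with_grade.abs().sort_values(ascending=False).index
ranked = corr_with_grade.reindex(order)
print(ranked)
```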


## Linear Regression

Linear regression is used to predict the numerical value(s) for a target variable (the column that is being predicted). With one column as a predictor, a linear regression model is mathematically represented by the formula:
### \begin{align}  y = mx + b \end{align}
Where *y* is the target variable, *x* is the predictor, *m* is the slope (the weight of *x*), and *b* is the y-intercept, which is the value of *y* when *x* = 0. Below is a linear regression line graphed to predict student grades based on the number of hours studied for an exam.

<center><img src='https://s3.amazonaws.com/stackabuse/media/linear-regression-python-scikit-learn-1.png'></center>

Source: [Stack Abuse](https://s3.amazonaws.com/stackabuse/media/linear-regression-python-scikit-learn-1.png)
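To make the formula concrete, here is a small sketch that recovers *m* and *b* from data using NumPy's `np.polyfit` with degree 1. The `hours` and `scores` arrays are invented and exactly linear, so the fit should return a slope of about 2 and an intercept of about 1. Real data would be noisy and the fit only approximate.

```python
import numpy as np

# Made-up, perfectly linear data: scores = 2 * hours + 1
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = 2.0 * hours + 1.0

# A degree-1 polynomial fit returns [slope, intercept], i.e. m and b
m, b = np.polyfit(hours, scores, deg=1)

# Apply y = mx + b to a new x value
predicted = m * 6.0 + b
print(m, b, predicted)
```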

In [4]:
#statsmodels' formula API fits regression models from formula strings and reports statistical tests for them
import statsmodels.formula.api as smf

In [5]:
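#OLS is Ordinary Least Squares, the most common type of linear regression
#the formula 'grade ~ age + exercise + hours' predicts grade from the three features on the right
#fit() estimates the intercept and slopes that minimize the squared error between predicted and actual grades
result = smf.ols('grade ~ age + exercise + hours', data=df).fit()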

In [6]:
#the summary shows the calculated values (slopes and y-intercept) for the linear regression formula
#the closer the R-squared value is to 1, the better the fit of the linear regression line
#the p-value (P>|t|) shows how statistically significant each predictive feature is in the model

result.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.664
Model:,OLS,Adj. R-squared:,0.664
Method:,Least Squares,F-statistic:,1315.0
Date:,"Thu, 30 May 2019",Prob (F-statistic):,0.0
Time:,13:34:16,Log-Likelihood:,-6300.7
No. Observations:,2000,AIC:,12610.0
Df Residuals:,1996,BIC:,12630.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,57.8704,1.321,43.804,0.000,55.279,60.461
age,0.0397,0.075,0.532,0.595,-0.107,0.186
exercise,0.9893,0.089,11.131,0.000,0.815,1.164
hours,1.9165,0.031,61.564,0.000,1.855,1.978

0,1,2,3
Omnibus:,321.187,Durbin-Watson:,2.047
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2196.187
Skew:,-0.567,Prob(JB):,0.0
Kurtosis:,8.007,Cond. No.,213.0


#### With age, exercise, and hours all equal to 0, the model predicts a starting grade of about 57.87 (the intercept)
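The coefficients in the first summary table plug directly into the regression formula, with one slope per predictor. The sketch below copies the four `coef` values from that table and evaluates the formula by hand. The helper name `predict_grade` is introduced here for illustration. The example student uses the values from the first row of `df.head()` (age 17, exercise 3, hours 10), whose actual grade was 82.4.

```python
# Coefficients copied from the coef column of the first summary table
intercept = 57.8704
coef = {"age": 0.0397, "exercise": 0.9893, "hours": 1.9165}

def predict_grade(age, exercise, hours):
    """Evaluate grade = b + m_age*age + m_exercise*exercise + m_hours*hours."""
    return (intercept
            + coef["age"] * age
            + coef["exercise"] * exercise
            + coef["hours"] * hours)

baseline = predict_grade(0, 0, 0)                      # just the intercept
student = predict_grade(age=17, exercise=3, hours=10)  # first row of df.head()

print(round(baseline, 2), round(student, 2))  # 57.87 80.68
```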

In [7]:
#remove age from the regression: its p-value (0.595) is not significant and its correlation with grade is near 0
result = smf.ols('grade ~ exercise + hours', data=df).fit()
result.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.664
Model:,OLS,Adj. R-squared:,0.664
Method:,Least Squares,F-statistic:,1973.0
Date:,"Thu, 30 May 2019",Prob (F-statistic):,0.0
Time:,13:37:22,Log-Likelihood:,-6300.8
No. Observations:,2000,AIC:,12610.0
Df Residuals:,1997,BIC:,12620.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,58.5316,0.447,130.828,0.000,57.654,59.409
exercise,0.9892,0.089,11.131,0.000,0.815,1.163
hours,1.9162,0.031,61.575,0.000,1.855,1.977

0,1,2,3
Omnibus:,318.721,Durbin-Watson:,2.048
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2158.0
Skew:,-0.564,Prob(JB):,0.0
Kurtosis:,7.962,Cond. No.,43.2
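The numbers in a summary table can also be read from the fitted result object instead of off the printout. The following is a sketch on synthetic data, not on `gradedata.csv`. The DataFrame `toy` and the coefficients used to generate it (58, 1.0, 1.9) are assumptions made up for this example. It uses the `params`, `pvalues`, `rsquared`, and `predict` members of a statsmodels OLS result.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; every value is generated, none comes from the real file
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "exercise": rng.integers(0, 6, size=n),
    "hours": rng.uniform(0, 20, size=n),
})
toy["grade"] = (58 + 1.0 * toy["exercise"] + 1.9 * toy["hours"]
                + rng.normal(0, 2, size=n))

fit = smf.ols("grade ~ exercise + hours", data=toy).fit()

print(fit.params)    # Series indexed by Intercept, exercise, hours
print(fit.pvalues)   # one p-value per coefficient
print(fit.rsquared)  # the value the summary labels R-squared

# Predict for new rows by passing a DataFrame with the predictor columns
new_students = pd.DataFrame({"exercise": [3], "hours": [10]})
predicted = fit.predict(new_students)
print(predicted)
```

Because the data were generated from known coefficients, the fitted `hours` slope should land close to 1.9, which is a quick sanity check that the fit behaves as expected.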


In [8]:
#add age back into the formula
#remove the y-intercept (force it to 0) by appending "- 1" to the formula
result = smf.ols(formula='grade ~ age + exercise + hours - 1', data=df).fit()
result.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.991
Model:,OLS,Adj. R-squared:,0.991
Method:,Least Squares,F-statistic:,72840.0
Date:,"Thu, 30 May 2019",Prob (F-statistic):,0.0
Time:,13:38:27,Log-Likelihood:,-6974.3
No. Observations:,2000,AIC:,13950.0
Df Residuals:,1997,BIC:,13970.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
age,3.1129,0.035,88.030,0.000,3.044,3.182
exercise,1.7659,0.122,14.482,0.000,1.527,2.005
hours,2.2860,0.042,54.486,0.000,2.204,2.368

0,1,2,3
Omnibus:,131.221,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,403.367
Skew:,-0.301,Prob(JB):,2.57e-88
Kurtosis:,5.116,Cond. No.,14.2
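The R-squared of 0.991 in the no-intercept summary is not directly comparable to the 0.664 reported earlier. When the constant is omitted, statsmodels computes R-squared against the uncentered total sum of squares (the sum of *y*²) rather than the centered one (the sum of (*y* - mean)²). The log-likelihood is lower and the AIC higher than in the models with an intercept, which suggests the fit is actually worse. The sketch below illustrates the two definitions with NumPy on made-up data. The variable names and the generating values (60, 0.5) are assumptions for illustration only.

```python
import numpy as np

# Made-up data with a large intercept and a weak slope
rng = np.random.default_rng(0)
x = rng.uniform(0, 20, size=200)
y = 60.0 + 0.5 * x + rng.normal(0, 5, size=200)

# Least-squares fit of y = m*x with the intercept forced to 0
m = np.sum(x * y) / np.sum(x * x)
ssr = np.sum((y - m * x) ** 2)  # sum of squared residuals

# Uncentered R-squared: compares the model to predicting 0 for everyone
r2_uncentered = 1 - ssr / np.sum(y ** 2)

# Centered R-squared: compares the model to predicting the mean of y
r2_centered = 1 - ssr / np.sum((y - y.mean()) ** 2)

print(r2_uncentered, r2_centered)
```

The uncentered value looks flattering because predicting 0 for every student is a very weak baseline when grades cluster far from 0.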
