# Simple Linear Regression (y = β0 + β1x1 + ε)

## Importing

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
sns.set()

### **Load Data**

In [None]:
data=pd.read_csv('Data/Simple linear regression.csv')
data.head()

**Why predict GPA from SAT?**
* The SAT is considered one of the best estimators of intellectual capacity and capability
* Almost all colleges (e.g. in the USA) use the SAT as a proxy for admission
* The SAT has stood the test of time

**SAT** = Critical reading + Mathematics + Writing <br>
**GPA** = Grade Point Average

We will create a linear regression that predicts GPA from the SAT score.

In [None]:
data.describe()

## **Create first regression**

### **Define dependent and the independent variables**

In [None]:
y=data['GPA']
x=data['SAT']

### Explore The Data

In [None]:
plt.scatter(x,y)
plt.xlabel('SAT')
plt.ylabel('GPA')

## Regression itself

In [None]:
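# Model: y = b0 + b1*x1

# add_constant() adds a column of ones so the model can estimate the intercept b0
x1=sm.add_constant(x)

# Fit the model by Ordinary Least Squares (OLS)
results=sm.OLS(y,x1).fit()
results.summary()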

### **How to read the OLS Regression Above**

The statsmodels summary has 3 main tables:
* A model summary
* A coefficients table
* Some additional tests

**Coefficient Table** <br>
0.2750 is b0 (the `const` row, i.e. the intercept) <br>
0.0017 is b1 (the `SAT` row, i.e. the slope)

yHat = b0 + b1*x1 <br>
yHat = 0.2750 + 0.0017*x1 <br>
GPA = 0.2750 + 0.0017*SAT <br>
**std err** shows how precisely each coefficient is estimated (lower is better) <br>
SAT score is a significant variable when predicting GPA because the SAT row shows 0.000 in **P>|t|**, which is below the usual 0.05 threshold

**A Model Summary**
* Dep. Variable is the variable we want to predict, which is **GPA**
* Model: OLS (Ordinary Least Squares). OLS is the most common method for estimating a linear regression; **it finds the line that minimises the Sum of the Squared Errors** (lower error = better explanatory power)
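To make the "minimises the Sum of the Squared Errors" idea concrete, here is a minimal sketch using NumPy only. The data are synthetic and made up for illustration (they are not the SAT/GPA file), and the helper `sse` is a hypothetical name. It computes the closed-form OLS slope and intercept for one predictor and checks that nudging the line in any direction increases the squared error.

```python
import numpy as np

# Synthetic data for illustration only (NOT the SAT/GPA file used in this notebook)
rng = np.random.default_rng(0)
x = rng.uniform(1600, 2100, size=50)                 # made-up "SAT-like" scores
y = 0.3 + 0.0017 * x + rng.normal(0, 0.2, size=50)   # made-up "GPA-like" values

# Closed-form OLS estimates for a single predictor
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

def sse(intercept, slope):
    """Sum of squared errors of the line y = intercept + slope * x."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))

best = sse(b0, b1)

# Slightly different lines: each one should have a larger squared error
nudged = [
    sse(b0 + 0.05, b1),
    sse(b0 - 0.05, b1),
    sse(b0, b1 * 1.01),
    sse(b0, b1 * 0.99),
]
print("SSE of OLS line:", best)
print("smallest SSE among nudged lines:", min(nudged))
```

`sm.OLS(...).fit()` solves the same minimisation problem, and for more than one predictor it does so in matrix form.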

Other estimation methods include:
* Generalized least squares
* Maximum likelihood estimation
* Bayesian Regression
* Gaussian process regression

This tutorial uses OLS because it is simple and powerful enough.

**R-Squared** measures how much of the variability the regression explains
* R-Squared = Variability explained by the regression / Total variability of the dataset
* R-Squared ranges from 0 to 1: 0 means the regression explains NONE of the variability, and 1 means it explains the ENTIRE variability
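The ratio above can be computed by hand. The sketch below again uses synthetic data (an assumption, not the notebook's file) and `np.polyfit` to fit the line. It then splits the total variability into an explained and an unexplained part.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(1600, 2100, size=60)
y = 0.3 + 0.0017 * x + rng.normal(0, 0.25, size=60)

# Fit a straight line; polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)       # total variability of the dataset
ssr = np.sum((y_hat - y.mean()) ** 2)   # variability explained by the regression
sse = np.sum((y - y_hat) ** 2)          # variability left unexplained

r_squared = ssr / sst
print("R-squared:", r_squared)
print("1 - SSE/SST:", 1 - sse / sst)    # same value, alternative form
```

For a regression with an intercept, the two printed values agree because SST = SSR + SSE.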

Example: our R-Squared is 0.406; in other words, SAT scores explain about 41% of the variability of college grades. Since that is far from 90%, we may conclude that we are missing some important information. Other determinants should be considered, such as gender, income, or maybe marital status.

**Conclusion:** R-Squared measures goodness of fit. The more variables you include in the regression, the higher the R-Squared (it never decreases when a variable is added, even a useless one).
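Because R-Squared never drops when a variable is added, the summary also reports Adj. R-Squared, which penalises extra predictors. A minimal sketch of the standard formula follows. The function name is hypothetical, and the sample size `n_obs = 84` is an assumed value used only for illustration (read the real one from "No. Observations" in the summary).

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors variables."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

n_obs = 84  # ASSUMPTION: hypothetical sample size, not taken from the notebook

# Same R-squared, different number of predictors: the penalty grows
one_predictor = adjusted_r_squared(0.406, n_obs, 1)
two_predictors = adjusted_r_squared(0.406, n_obs, 2)
print(one_predictor, two_predictors)
```

An extra variable is worth keeping only if it raises R-Squared by more than the penalty takes away.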

**F-statistic:** tests the overall significance of the model (higher is better); **the lower the F-statistic, the closer the model is to being non-significant**. You can use it to compare models.
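For an OLS model with an intercept, the F-statistic can be written in terms of R-Squared, which shows why a weak fit pushes it toward zero. This is a sketch under assumptions: the function name is hypothetical, and `n_obs = 84` is an assumed sample size for illustration only.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-statistic of a regression with an intercept."""
    explained_per_predictor = r_squared / n_predictors
    unexplained_per_dof = (1 - r_squared) / (n_obs - n_predictors - 1)
    return explained_per_predictor / unexplained_per_dof

n_obs = 84  # ASSUMPTION: hypothetical sample size, not taken from the notebook

strong_fit = f_statistic(0.406, n_obs, 1)  # R-squared quoted for the simple regression
weak_fit = f_statistic(0.05, n_obs, 1)     # a much weaker, made-up fit
print(strong_fit, weak_fit)
```

The summary also reports Prob (F-statistic), the p-value for this test, which is usually easier to interpret than the raw value.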









**Plotting the Simple Regression Line**

In [None]:
plt.scatter(x,y)
# Regression line: yHat = 0.275 + 0.0017*SAT
# Use x (the SAT column), not x1, which also contains the constant column
yhat=0.2750+0.0017*x
fig=plt.plot(x,yhat, lw=4,c='orange',label='Regression Line')
plt.xlabel('SAT',fontsize=20)
plt.ylabel('GPA',fontsize=20)
plt.legend()
plt.show()

# **Multiple Linear Regression (y = β0 + β1x1 + β2x2 + ... + βkxk + ε)**

GPA cannot be predicted solely from a student's SAT score; high school GPA, income, gender, etc. also matter. **If we want a good model, we need Multiple Regression to address the higher complexity of the problem.**

I will add the Attendance variable for the multiple regression <br>
Note: Attendance is 1 if attendance is above 75%, and 0 if it is below 75%


## **Load Data**

In [None]:
raw_data=pd.read_csv('Data/Multi linear regression.csv')
raw_data.head()

In [None]:
data=raw_data.copy()

In [None]:
# Map Yes to 1 and No to 0
data['Attendance']=data['Attendance'].map({'Yes':1,'No':0})
data.head()

In [None]:
data.describe()

## **Multiple Regression**

In [None]:
y=data['GPA']
x=data[['SAT','Attendance']]

In [None]:
x1=sm.add_constant(x)
results=sm.OLS(y,x1).fit()
results.summary()

GPA = 0.6439+0.0014*SAT + 0.2226*Attendance <br>

If did not attend (Attendance = 0) <br>
GPA=0.6439+0.0014*SAT+0.2226*0<br>
GPA=0.6439+0.0014*SAT

If Attends (Attendance = 1) <br>
GPA=0.6439+0.0014*SAT+0.2226*1<br>
GPA=0.8665+0.0014*SAT



#### **Compare the R-Squared and Adj. R-Squared with those of the Simple Linear Regression: the improvement suggests that Attendance is a powerful variable in the Multiple Regression**

### Plotting The Data

In [None]:
plt.scatter(data['SAT'],y,c=data['Attendance'],cmap='RdYlGn_r')
yHat_no=0.6439+0.0014*data['SAT']
yHat_yes=0.8665+0.0014*data['SAT']
fig=plt.plot(data['SAT'],yHat_no,lw=2,c='#006837')
fig=plt.plot(data['SAT'],yHat_yes,lw=2,c='#a50026')

plt.xlabel('SAT',fontsize=20)
plt.ylabel('GPA',fontsize=20)
plt.show()

**The red points and red line are the students who attended; the green ones are the students who did not**

## **Make Predictions based on the regression we already create**

In [None]:
x1
# The const column was added by add_constant() before fitting the model; it plays the role of x0 = 1

We create new data to predict the GPA of 2 students:
* Bob, who got 1700 on the SAT and did **NOT attend**
* Alice, who got 1670 on the SAT and **attended**

In [None]:
new_data=pd.DataFrame({'const':1,'SAT':[1700,1670],'Attendance':[0,1]})
new_data=new_data[['const','SAT','Attendance']]
new_data

In [None]:
#Rename the index 0:Bob and 1:Alice
new_data.rename(index={0:'Bob',1:'Alice'})

In [None]:
# The predict() method of the fitted regression returns the predicted values
# Our fitted regression is the variable results, where results = sm.OLS(y,x1).fit()
predictions = results.predict(new_data)
predictions

In [None]:
# Transform the predictions into a data frame and join it with new_data
predictionsDataFrame=pd.DataFrame({'Predictions':predictions})
joined=new_data.join(predictionsDataFrame)
joined.rename(index={0:'Bob',1:'Alice'})

We can see that the predicted GPA at graduation is 3.02 for Bob and 3.20 for Alice.
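For a linear model, a prediction is just each row of the input multiplied by the coefficient vector, which is also why the `const` column of ones is needed. The sketch below reproduces the arithmetic with NumPy, using the rounded coefficients quoted in the equation above. This is an assumption for illustration: `results.predict` uses the unrounded estimates, so its output may differ slightly in the later decimals.

```python
import numpy as np

# Rounded coefficients from the equation above, in the order const, SAT, Attendance
coefficients = np.array([0.6439, 0.0014, 0.2226])

# One row per student, columns in the same order: const, SAT, Attendance
students = np.array([
    [1, 1700, 0],  # Bob: did not attend
    [1, 1670, 1],  # Alice: attended
])

# Each prediction is const*b0 + SAT*b1 + Attendance*b2
predicted = students @ coefficients
for name, gpa in zip(["Bob", "Alice"], predicted):
    print(name, round(float(gpa), 2))
```

The column order of the new data must match the order used when fitting, which is why `new_data` is reordered to `['const','SAT','Attendance']` before calling `predict`.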