< [Notebook 12](PartIV1.ipynb) | [PyFinLab Index](ALWAYS-START-HERE.ipynb) | [Notebook 14](PartIV3.ipynb) >

<a id = "ref00"></a>

<a><img src="figures/UUBS.png" width="180" height="180" border="10" /></a>

<hr>

### Notebook 13: Simple (meaning univariate) linear regression

In this notebook, we discuss the univariate linear regression model, also known as simple linear regression because it involves only one predictor variable.

In [None]:
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
housing = pd.read_csv('data/housing.csv',index_col=0)
housing.head()

### Simple linear regression

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

<div align="left">
<a href="#Gag62" class="btn btn-default" data-toggle="collapse"><b>What's the plan?</b></a>

</div>
<div id="Gag62" class="collapse">
<br>

We are going to use Python to build a univariate linear regression model based on the association between RM (the predictor) and MEDV (the response). This amounts to finding estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, for the linear coefficients $\beta_0$ and $\beta_1$ (the intercept and slope, respectively).


In [None]:
# to get things started let's have a pure guess at the values 
# of the intercept and slope. Let's name our guesses b0, b1...
b0 = -2.5
b1 = 2.8

# let's use the values b0 and b1 to define a
# straight line that might describe our data
housing['GuessResponse'] = b0 + b1*housing['RM']

# we also want to quantify the error in our guesswork
# to see just how far it is from the true response
housing['observederror'] = housing['MEDV'] - housing['GuessResponse']


# here we plot our estimated line together with the actual data points
plt.figure(figsize=(12, 6))
plt.title('Sum of squared error is {}'.format(((housing['observederror'])**2).sum()))
plt.scatter(housing['RM'],housing['MEDV'],color='g',label='Observed')
plt.plot(housing['RM'],housing['GuessResponse'],color='red',label='GuessResponse')
plt.legend()
plt.xlim(housing['RM'].min()-2,housing['RM'].max()+2)
plt.ylim(housing['MEDV'].min()-2,housing['MEDV'].max()+2)
plt.show()

<div align="left">
<a href="#62" class="btn btn-default" data-toggle="collapse"><b>Have a go!</b></a>

</div>
<div id="62" class="collapse">
<br>

Try a range of different guess values for b0 and b1 to see if you can get close to what you consider a visually good best-fit line. Hopefully, after several attempts and no little frustration, you will agree that guessing is not really the way to go.
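Trying guesses by hand can be automated as a brute-force search. The sketch below is only an illustration: it uses a tiny made-up dataset (`rooms` and `value` are hypothetical stand-ins for RM and MEDV, not the housing data), scores every guess on a coarse grid by its sum of squared errors, and keeps the best one.

```python
# Made-up toy data standing in for RM (x) and MEDV (y); not the housing data.
rooms = [4, 5, 6, 7, 8]
value = [12, 17, 23, 30, 38]

def sse(b0, b1, x, y):
    """Sum of squared errors of the line b0 + b1*x against the observed y."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Candidate guesses: intercepts -20..0 in steps of 1, slopes 0..10 in steps of 0.5.
intercepts = [float(i) for i in range(-20, 1)]
slopes = [0.5 * k for k in range(0, 21)]

# Score every (b0, b1) pair and keep the one with the smallest SSE.
best_sse, best_b0, best_b1 = min(
    (sse(b0, b1, rooms, value), b0, b1)
    for b0 in intercepts
    for b1 in slopes
)

print(sse(-2.5, 2.8, rooms, value))  # SSE of the starting guess on the toy data
print(best_b0, best_b1, best_sse)    # best guess found on the grid
```

A grid search only finds the best point on the grid, and its cost grows quickly as the grid is refined, which is one reason an exact method is preferable.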


### Least squares estimates
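Least squares replaces guesswork with a precise criterion: choose the intercept and slope that make the sum of squared errors as small as possible. For the model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, the standard textbook result is stated here without derivation. The symbols $b_0$, $b_1$ denote candidate values, and $\bar{x}$, $\bar{y}$ denote sample means; the formulas assume the $x_i$ are not all identical.

$$\mathrm{SSE}(b_0, b_1) = \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$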

In [None]:
# the formula 'MEDV ~ RM' regresses the response MEDV on the
# predictor RM (refer to PyFinlab video, PartIV.Video2)
model = smf.ols(formula='MEDV ~ RM',data=housing).fit()

# here we determine estimated values for intercept and
# slope using least squares estimation. The attribute
# 'params' returns a pandas Series of estimated model
# parameters, indexed by term name
b0 = model.params['Intercept']
b1 = model.params['RM']

# here is the resulting least squares straight-line fit to the data
housing['BestResponse'] = b0 + b1*housing['RM']

# again we would like to know the error involved
housing['error'] = housing['MEDV'] - housing['BestResponse']


# we plot the estimated lines together with the data points to see
# how much the error has dropped after using least squares method
plt.figure(figsize=(10, 10))
plt.title('Sum of squared error is {}'.format((((housing['error'])**2)).sum()))
plt.scatter(housing['RM'],housing['MEDV'],color='g',label='Observed')
plt.plot(housing['RM'],housing['GuessResponse'],
         color='red',label='GuessResponse')
plt.plot(housing['RM'],housing['BestResponse'],
         color='yellow',label='BestResponse')
plt.legend()
plt.xlim(housing['RM'].min()-2,housing['RM'].max()+2)
plt.ylim(housing['MEDV'].min()-2,housing['MEDV'].max()+2)
plt.show()
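For a single predictor, the least squares estimates can also be computed directly: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept follows from the sample means. The sketch below is a plain-Python illustration on a made-up toy dataset (`rooms`, `value`, and `least_squares_line` are hypothetical names, not part of the notebook or of statsmodels).

```python
# Made-up toy data standing in for RM (x) and MEDV (y); not the housing data.
rooms = [4, 5, 6, 7, 8]
value = [12, 17, 23, 30, 38]

def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx          # requires the x values not to be all equal
    intercept = y_bar - slope * x_bar
    return intercept, slope

b0, b1 = least_squares_line(rooms, value)
print(b0, b1)  # -15.0 6.5
```

On real data, these values should agree with the fitted model's estimated parameters up to floating-point rounding.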

### Summary table

In [None]:
# refer to the p-value of RM, its confidence interval,
# and R-squared to evaluate the model's performance.

model.summary()
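The R-squared figure in the summary table is the share of the response's variation that the fitted line explains: one minus the ratio of the sum of squared errors to the total sum of squares. The sketch below works this out by hand on a made-up toy dataset (`rooms` and `value` are hypothetical stand-ins for RM and MEDV), using the least squares line for that toy data.

```python
# Made-up toy data standing in for RM (x) and MEDV (y); not the housing data.
rooms = [4, 5, 6, 7, 8]
value = [12, 17, 23, 30, 38]

# Least squares estimates for this toy data (worked out by hand).
b0, b1 = -15.0, 6.5

fitted = [b0 + b1 * x for x in rooms]
mean_value = sum(value) / len(value)

# SSE: variation left unexplained by the fitted line.
sse = sum((y - f) ** 2 for y, f in zip(value, fitted))
# SST: total variation of the response around its mean.
sst = sum((y - mean_value) ** 2 for y in value)

r_squared = 1 - sse / sst
print(round(r_squared, 4))  # 0.9918
```

With statsmodels, the individual quantities shown in the table can also be read from the fitted results object, for example `model.rsquared`, `model.pvalues`, and `model.conf_int()`.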


<div align="right"><a href="#ref00">back to top</a></div>