# $\mathbf{\text{Linear Regression: A Tutorial}}$<br>

Author: K. Voudouris, 2021

Please note: This tutorial has borrowed heavily from [here](https://mubaris.com/posts/linear-regression/#:~:text=Linear%20Regression%20from%20Scratch%20in%20Python%201%20Simple,me%20know%20if%20you%20found%20any%20errors.%20).

First, we import the libraries we need.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)

Next we load the Iris dataset, here from a local CSV file rather than from sklearn. The Iris dataset is a classic dataset in Data Science. It contains five columns about Iris flowers: Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type.

In [None]:
iris_data = pd.read_csv("./iris_csv.csv")
iris_data.head() #show the first 5 rows of the dataset

In [None]:
iris_data.sample(10) #show a random sample of 10 of the dataset

## Task

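Explore the dataset using the following pandas attributes and indexers:

columns         The column labels (an attribute, so no parentheses)

shape           The number of rows and columns, as a tuple (an attribute, so no parentheses)

iloc[]          Select rows and columns by integer position

loc[]           Select rows and columns by label (e.g. column name)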
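As a rough illustration of these four tools, here is a minimal sketch on a small made-up table. The DataFrame `toy` and its column names are hypothetical and are not the Iris column names, so you will still need to adapt the calls to `iris_data`.

```python
import pandas as pd

# Toy stand-in table; the column names here are hypothetical, not the Iris ones.
toy = pd.DataFrame(
    {
        "length": [1.4, 4.7, 6.0],
        "width": [0.2, 1.4, 2.5],
        "kind": ["a", "b", "c"],
    }
)

column_names = list(toy.columns)     # attribute: the column labels
n_rows, n_cols = toy.shape           # attribute: (number of rows, number of columns)
first_row = toy.iloc[0]              # row at integer position 0
first_cell = toy.iloc[0, 1]          # row position 0, column position 1
width_col = toy.loc[:, "width"]      # all rows, column selected by its label
cell_by_label = toy.loc[2, "kind"]   # row label 2, column label "kind"

print(column_names, (n_rows, n_cols), first_cell, cell_by_label)
```

Note that `iloc` counts positions from 0, while `loc` uses whatever labels the index and columns carry (here the default row labels happen to be 0, 1, 2).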

In [None]:
#############################################################################
##Write Code here:


#############################################################################

## $\mathbf{\text{Introducing Linear Regression}}$<br>

We will be focusing here on simple bivariate linear regression. This means we try to model how a change in one variable (the independent/explanatory/predictor variable) is related to a change in another variable (the dependent/predicted variable). Usually, the independent variable is denoted by $\mathbf{x}$ and the dependent by $\mathbf{y}$.

Linear regression, in its simplest bivariate form, can model a linear relationship between a vector of values $\mathbf{x}$ and a vector of values $\mathbf{y}$. It essentially fits a line of best fit. This line has the familiar mathematical characterisation from high-school maths classes of $\mathbf{y = mx + c}$. $\mathbf{m}$ is the (gradient/slope/scale) coefficient: it tells you how much *y* changes when you change *x* by 1 unit. $\mathbf{c}$ is the intercept (or sometimes the bias coefficient). It tells you what we expect *y* to be when *x* is 0 (and therefore $\mathbf{mx}$ is 0).

Usually, we rearrange this equation and use different letters when we talk in DataScience/ML contexts. This is because the new notation generalises better when we think about multivariate cases and more general modelling techniques. However, they are essentially the same. The following is used to characterise the bivariate linear model:

$$\mathbf{y} = \beta_{0} + \beta_{1}\mathbf{x}$$

Our job is to find the best values of $\beta_{0}$ and $\beta_{1}$ for our dataset of paired values $\{\mathbf{x}, \mathbf{y}\}$.
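To make the roles of the two coefficients concrete, the following minimal sketch evaluates the model for a few inputs. The coefficient values and the array `x_toy` are made up for illustration and have nothing to do with the Iris data.

```python
import numpy as np

# Hypothetical coefficients and inputs, chosen only for illustration.
beta_0 = 1.0   # intercept: the predicted y when x is 0
beta_1 = 0.5   # slope: change in predicted y per 1-unit change in x
x_toy = np.array([0.0, 1.0, 2.0, 4.0])

# numpy applies the arithmetic element-wise, giving one prediction per x value.
y_hat = beta_0 + beta_1 * x_toy
print(y_hat)
```

The first prediction equals the intercept (because *x* is 0 there), and each extra unit of *x* adds `beta_1` to the prediction.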

Let's plot some data with Iris. Let's assume that we are interested in how petal width changes based on petal length.

Find the columns of interest and assign them to the x and y variables.


In [None]:
############Fill in the parts of the code necessary

x = iris_data[''].values
y = iris_data[''].values

In [None]:
###########Plot the data using matplotlib

plt.scatter(x, y, label = 'Scatter Plot of Iris Dataset')
plt.xlabel('')
plt.ylabel('')
plt.show()

Now let's think about lines of best fit. Play around with the following code to fit lines of best fit to your graph

In [None]:
#######We need a set of x-values first, between 0 and 7

min_x = 0
max_x = 7

xvalues = np.linspace(min_x, max_x, 500) #creates 500 values between 0 and 7

b0 = 2 #vary the intercept
b1 = 0 #vary the gradient coefficient

yvalues = b0 + b1 * xvalues #create a vector of yvalues

plt.plot(xvalues, yvalues, label = 'Line of Best Fit')
plt.scatter(x, y, label = 'Data points')
plt.xlabel('')
plt.ylabel('')
plt.legend()
plt.show()

We could do this by eye and get an okay looking plot, as I am sure you have. But ideally we would at least have a metric to determine how well this line fits (i.e., a measure of model-fit). One such metric is the *Root Mean Squared Error (RMSE)*. 

Our line of best fit (our *model*) gives an expected *y* value, $\hat{y}$, for a given value of *x*. For the x-values we know about, how different is $\hat{y}$ from *y*? We can simply calculate that difference for each of our datapoints. To give an overall metric for the *error* of our model, we could sum all those values. The problem is that some of these values will be negative and some will be positive, and we don't want those to cancel each other out (i.e., we would report 0 error when we actually have errors of -2 and +2). So the solution is to square all these errors, which gives only non-negative values. Next, we want the error for an average data point, so we divide the sum of all the squared errors by the number of squared errors (i.e. the number of x-values we have). The final issue is that this average error is in units of $y^2$. So let's take the square root of the whole thing to give us a metric of error across our dataset in units of *y*. The equation is as follows:

$$RMSE = \sqrt{\frac{1}{m}\sum\limits _{j = 1} ^{m}(\hat{y}_{j}-y_{j})^2}$$
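The steps in this recipe can be sketched on a tiny made-up example. The arrays `y_true` and `y_pred` are hypothetical, chosen so that the raw errors include a -2 and a +2 and therefore cancel.

```python
import numpy as np

# Made-up observed values and model predictions (not Iris data).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 5.0, 2.0])

errors = y_pred - y_true            # [0, 0, 2, -2]
mean_raw_error = errors.mean()      # 0.0: positive and negative errors cancel

squared_errors = errors ** 2        # [0, 0, 4, 4]: squaring removes the sign
mean_squared_error = squared_errors.mean()   # 2.0, in units of y squared
rmse = np.sqrt(mean_squared_error)  # back in units of y

print(mean_raw_error, mean_squared_error, rmse)
```

The raw errors average to zero even though two of the four predictions are off by 2, which is exactly why the errors are squared before averaging.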

In [None]:
b0 = 2 #vary the intercept
b1 = 0 #vary the gradient coefficient

y_expected = b0 + b1 * x #create a vector of expected values

#####################Calculate the RMSE for your chosen values of b0 and b1
##Complete code here

differences = y_expected - y
sq_differences = differences*differences
length_m = len(y)
#...
#note sqrt is math.sqrt()

In [None]:
#####################BONUS: make this into a function#######################

def RMSE_calc(b0, b1):
    ## enter code here
    return rmse

print(RMSE_calc(b0, b1)) #pass in the coefficients you want to evaluate

There is another metric for determining fit here, $R^{2}$. RMSE gave us the average error in units of *y*. $R^{2}$ instead gives us a measure of the proportion of the total variance in the data (strictly speaking, of a value *proportional* to the total variance) that is accounted for by the model you have fitted. $R^{2}$ is built from sums of squares: it compares the Sum-of-Squares of the residuals (the errors) of the model with the total Sum-of-Squares of the data relative to the mean.

The following is the total sum of the squares of the *residuals*, i.e. the differences between the actual *y* values in the dataset and the predicted *y* values from our model:

$$SS_r = \sum\limits _{j = 1} ^{m}(y_{j}-\hat{y}_{j})^2$$

Then we have the total sum of squares of the data, which is the total sum of the squares of the differences between actual *y* values and the mean of *y*, $\bar{y}$. This is kind of like the variance of the data, although that is usually scaled by $\frac{1}{N}$ or $\frac{1}{N-1}$. $SS_t$ is the total of how much the datapoints vary around the mean of *y*.

$$SS_t = \sum\limits _{j = 1} ^{m}(y_{j}-\bar{y})^2$$

We want to be able to come up with an explanation for $SS_t$, and introducing a line of best fit helps to do that by introducing the effect of a predictor variable, *x*. We want the error around our fitted model to be small, relative to the amount of variability inherent in the data set. To calculate this relative effect, we simply divide $SS_r$ by $SS_t$.

$$\frac{SS_r}{SS_t}$$

For a given $SS_t$, this fraction gets smaller as the error around the fitted model gets smaller, which is what we want. To put this onto a scale where larger values mean better fit, we subtract this fraction from one, to give us $R^{2}$.

$$R^{2} = 1- \frac{SS_r}{SS_t}$$

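This is nice because it gives us a standardised measure of how well the model accounts for the variance in the data. For the least-squares line of best fit, $R^{2}$ lies between 0 and 1, and the more variance explained, the better. For a line chosen by hand, $R^{2}$ can be negative: that happens when the line fits worse than simply predicting $\bar{y}$ for every point, so that $SS_r > SS_t$.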
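As a minimal sketch of the formula, the helper below computes $R^{2}$ from $SS_r$ and $SS_t$ on made-up data. The function name `r_squared` and the arrays are hypothetical, and it takes predictions directly rather than `b0` and `b1`, so it is not a drop-in answer to the exercise.

```python
import numpy as np

# Made-up data lying exactly on the line y = 1 + 2x (not Iris data).
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

def r_squared(y_true, y_pred):
    """Return 1 - SS_r / SS_t for observed and predicted values."""
    ss_r = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_t = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_r / ss_t

r2_perfect = r_squared(y_toy, 1.0 + 2.0 * x_toy)                  # line through every point
r2_mean = r_squared(y_toy, np.full_like(y_toy, y_toy.mean()))     # always predict the mean
r2_bad = r_squared(y_toy, 10.0 - 2.0 * x_toy)                     # line sloping the wrong way

print(r2_perfect, r2_mean, r2_bad)
```

The three cases give 1, 0, and a value below 0: a perfect line, a flat line at $\bar{y}$, and a hand-picked line that does worse than the mean.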

In [None]:
b0 = 2 #vary the intercept
b1 = 1 #vary the gradient coefficient

y_expected = b0 + b1 * x #create a vector of expected values

#####################Calculate the R^2 for your chosen values of b0 and b1
##Enter code here




In [None]:
#####################BONUS: make this into a function#######################

def R2_calc(b0, b1):
    ## enter code here
    return r2

print(R2_calc(b0, b1)) #pass in the coefficients you want to evaluate

## The next step

We have seen two metrics for evaluating the fit of a model. Adjusting b0 and b1 gives us different values for $R^{2}$ and RMSE. These metrics characterise a **cost function**. There is some pair of values for b0 and b1 where the cost (for example, the RMSE) is at its lowest. We could find this by hand, or we could use optimisation to find it automatically. There are two general approaches when it comes to linear regression. The first is to use the *normal equations*, making use of the properties of vectors and matrices in linear algebra, to analytically find the optimal values of b0 and b1. The other option is to use an iterative algorithm, which repeatedly tries to find values for b0 and b1 that result in a lower cost than the previous values of b0 and b1. This is the basic intuition behind **gradient descent**. Many ML algorithms (e.g. logistic regression) do not have closed-form solutions in the way linear regression does, and so we have to rely on iterative algorithms. Gradient descent is also often preferable for large datasets, because the normal equations can take a lot of compute to solve when large matrices are involved (because of calculating matrix inverses).
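To give a feel for the iterative approach, here is a rough sketch of gradient descent on made-up data. It is only one possible version: it assumes the mean squared error as the cost (which has the same minimiser as RMSE), and the learning rate, step count and toy arrays are arbitrary choices for illustration.

```python
import numpy as np

# Made-up data lying exactly on y = 1 + 2x, so the best fit is b0 = 1, b1 = 2.
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

b0, b1 = 0.0, 0.0       # arbitrary starting guess
learning_rate = 0.05    # assumed step size; too large a value would diverge
costs = []

for _ in range(2000):
    residuals = (b0 + b1 * x_toy) - y_toy
    costs.append(np.mean(residuals ** 2))        # mean squared error at this step
    grad_b0 = 2.0 * np.mean(residuals)           # d(cost)/d(b0)
    grad_b1 = 2.0 * np.mean(residuals * x_toy)   # d(cost)/d(b1)
    b0 -= learning_rate * grad_b0                # step downhill in b0
    b1 -= learning_rate * grad_b1                # step downhill in b1

print(b0, b1, costs[0], costs[-1])
```

Each pass nudges both coefficients in the direction that lowers the cost, so the recorded cost shrinks and the coefficients approach the values that generated the data.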

For today, we will use the normal equations. See Chapter 2 and Chapter 4 of the deeplearning book for more insight into how these work.

First, we estimate the gradient coefficient using the following equation (where *n* is the sample size, written *m* in the formulas above):

$$\hat{\beta}_{1} = \frac{n \sum x_{j} y_{j} - \sum x_{j} \sum y_{j}}{n \sum x_{j} ^{2} -(\sum x_{j}) ^{2}} $$

Then we estimate the intercept using $\hat{\beta}_{1}$ and $\bar{y}$ and $\bar{x}$ (i.e. the means of **x** and **y**):

$$\hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{x}$$
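The two equations can be sketched on a small made-up dataset as follows. The arrays `x_toy` and `y_toy` and the variable names are hypothetical, and `np.polyfit` is used only as an independent cross-check, not as part of the method described here.

```python
import numpy as np

# Made-up data scattered around a line (not Iris data).
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x_toy)   # sample size

# Gradient coefficient: numerator and denominator follow the equation term by term.
b1_hat = (n * np.sum(x_toy * y_toy) - np.sum(x_toy) * np.sum(y_toy)) / (
    n * np.sum(x_toy ** 2) - np.sum(x_toy) ** 2
)

# Intercept: the fitted line passes through the point (mean of x, mean of y).
b0_hat = y_toy.mean() - b1_hat * x_toy.mean()

# Cross-check with numpy's least-squares fit of a degree-1 polynomial,
# which returns the coefficients highest power first: [slope, intercept].
slope_check, intercept_check = np.polyfit(x_toy, y_toy, deg=1)

print(b1_hat, b0_hat)
```

For these numbers the estimates come out at roughly 1.99 for the gradient coefficient and 0.05 for the intercept, matching the `np.polyfit` result.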

In [None]:
###################Finish the code to calculate b0 and b1##########################
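b1_normeq = None #replace None with your estimate of the gradient coefficient
b0_normeq = None #replace None with your estimate of the intercept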

In [None]:
##################Plot the graph with your values fitted

min_x = 0
max_x = 7

xvalues = np.linspace(min_x, max_x, 500) #creates 500 values between 0 and 7


yvalues = b0_normeq + b1_normeq * xvalues #create a vector of yvalues

plt.plot(xvalues, yvalues, label = 'Line of Best Fit')
plt.scatter(x, y, label = 'Data points')
plt.xlabel('')
plt.ylabel('')
plt.legend()
plt.show()

If you have written functions to calculate RMSE and $R^{2}$, then calculate those values for your fitted beta values.

In [None]:
RMSE_fitted = RMSE_calc(b0_normeq, b1_normeq)

R2_fitted = R2_calc(b0_normeq, b1_normeq)

print(RMSE_fitted)
print(R2_fitted)