# Simple Linear Regression

This is a simple tutorial intended to help beginners understand and implement Simple Linear Regression from scratch.



**Simple Linear Regression** is a great first machine learning algorithm to implement as it requires you to estimate properties from your training dataset, but is simple enough for beginners to understand. Linear regression is a prediction method that is more than 200 years old. In this tutorial, you will discover how to implement the simple linear regression algorithm from scratch in Python.

After completing this tutorial you will know:<br>
&#9632; How to estimate statistical quantities from training data.<br>
&#9632; How to estimate linear regression coefficients from data.<br>
&#9632; How to make predictions using linear regression for new data.<br>


Linear regression assumes a **linear or straight line relationship between the input variables (X) and the single output variable (y).** More specifically, that output (y) can be calculated from a linear combination of the input variables (X). When there is a single input variable, the method is referred to as a simple linear regression.

In simple linear regression we can use statistics on the training data to estimate the coefficients required by the model to make predictions on new data.

The line for a simple linear regression model can be written as:

$$ y = θ_0 + θ_1 \times x $$
where $θ_0$ and $θ_1$ are the coefficients we must estimate from the training data. Once the coefficients are known, we can use this equation to estimate 
output values for $y$ given new input examples of $x$. It requires that you calculate statistical properties from the data such as **mean, variance** and **covariance.**

To demonstrate simple linear regression, we will use a dataset that relates the number of hours a student studies to the percentage of marks that student scores in an exam. The task is **predicting the percentage score obtained by a student ($y$) given the total number of hours s/he has studied ($x$).**

**[Dataset can be found here.](https://drive.google.com/file/d/1HChTis2Kwhk-1EZC6yvJHx6vHBO_94Aq/view?usp=sharing)**

Let's load the basic Python libraries that we will need over the course of this tutorial.

In [None]:
# library for manipulating the csv data
import pandas as pd

# library for scientific calculations on numbers + linear algebra
import numpy as np
import math

# library for regular plot visualizations
import matplotlib.pyplot as plt

#library for responsive visualizations
import plotly.express as px
import plotly.graph_objects as go

## Dataset

**Downloading the Dataset**<br>
Full Dataset Link: [https://drive.google.com/file/d/1HChTis2Kwhk-1EZC6yvJHx6vHBO_94Aq/view?usp=sharing](https://drive.google.com/file/d/1HChTis2Kwhk-1EZC6yvJHx6vHBO_94Aq/view?usp=sharing)

In [None]:
!gdown --id 1HChTis2Kwhk-1EZC6yvJHx6vHBO_94Aq

Reading the data

In [None]:
data = pd.read_csv('/content/student_scores.csv')

print("---Data Info---")
data.info()
print("---Data Head---")
data.head()

Dataset Description

In [None]:
data.describe()

Let's have a look at the data itself. You can use either `matplotlib.pyplot` or `plotly` for visualization. The latter produces interactive visualizations. Try hovering over the points on the graph to see the actual values.

In [None]:
fig = px.scatter(x = data['Hours'], y=data['Scores'])
fig.update_layout(title = 'Obtained Score Percentage vs Hours-Studied', title_x=0.5, 
                  xaxis_title= "Number of Hours Studied", yaxis_title="Obtained Percentage Score", 
                  height = 500, width = 700)
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

**This tutorial is broken down into five parts:<br>**
1. Calculate Mean and Variance.
2. Calculate Covariance (X,Y).
3. Estimate Coefficients.
4. Make Predictions.
5. Visual Comparison for Correctness.<br>

These steps will give you the foundation you need to implement and train simple linear regression models for your own prediction problems.

## 1. Calculate Mean and Variance.
As mentioned earlier, simple linear regression uses the mean and variance of the given data. We will use NumPy's built-in functions to calculate them.

In [None]:
print(data["Hours"])
print(data["Scores"])

In [None]:
mean_x = np.mean(data["Hours"])
mean_y = np.mean(data["Scores"])

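# ddof=1 gives the sample variance (divides by n-1), matching the covariance formula used below
var_x = np.var(data["Hours"], ddof=1)
var_y = np.var(data["Scores"], ddof=1)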

print('x stats: mean= %.3f   variance= %.3f' % (mean_x, var_x))
print('y stats: mean= %.3f   variance= %.3f' % (mean_y, var_y))
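
Since the goal is to build things from scratch, the same two statistics can also be computed by hand. The sketch below uses a small made-up list of values (not the student dataset) and hypothetical helper names `mean` and `variance`. The `ddof` argument is an assumption borrowed from NumPy's convention: `ddof=1` divides by $n-1$ and `ddof=0` divides by $n$.

```python
import numpy as np

def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

def variance(values, ddof=1):
    """Variance; ddof=1 divides by n-1 (sample), ddof=0 divides by n (population)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - ddof)

# Made-up values, for illustration only
hours = [1.0, 2.0, 4.0, 3.0, 5.0]

# Hand-written versions next to the NumPy equivalents
print(mean(hours), np.mean(hours))
print(variance(hours, ddof=1), np.var(hours, ddof=1))
print(variance(hours, ddof=0), np.var(hours, ddof=0))
```

Note that `np.var` defaults to `ddof=0`, so the divisor has to be chosen explicitly when the variance is combined with a covariance that divides by $n-1$.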

## 2. Calculate Covariance.
The covariance of two groups of numbers describes how those numbers change together. It is closely related to correlation: correlation is the covariance rescaled by the standard deviations of both groups, so it always lies between -1 and 1, whereas covariance keeps the units of the data. The sample covariance is calculated by the following formula:
$$ Cov(X,Y) = \frac{\sum_{i=1}^{n}{(X_i - \overline{X})}{(Y_i - \overline{Y})}}{n-1} $$

You can implement it yourself or use the built-in function `numpy.cov()`, which returns the full 2×2 covariance matrix; its off-diagonal entries are $Cov(X,Y)$.

In [None]:
# Calculate covariance between x and y
def covariance(x, y):
    mean_x = np.mean(x)
    mean_y = np.mean(y)
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar/(len(x)-1)

cov_xy = covariance(data['Hours'], data['Scores'])

print(f'Cov(X,Y): {cov_xy}')
print(f'Covariance Using Built-In Function:\n{np.cov(data["Hours"], data["Scores"])}')
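
The loop-based covariance can also be written without an explicit loop. The following is a minimal sketch on made-up numbers; the function name `covariance_vectorized` and the sample arrays are assumptions for illustration, not part of the dataset.

```python
import numpy as np

def covariance_vectorized(x, y):
    """Sample covariance (divides by n-1) computed with array operations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Made-up values, for illustration only
x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

cov_manual = covariance_vectorized(x, y)
cov_numpy = np.cov(x, y)[0, 1]   # off-diagonal entry of the 2x2 matrix

print(cov_manual, cov_numpy)
```

Converting the inputs with `np.asarray` also means the function works the same way for lists, NumPy arrays and pandas Series, regardless of how a Series is indexed.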

## 3. Estimate Coefficients
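We must estimate the values of the two coefficients $θ_0$ and $θ_1$ in simple linear regression. The slope is the covariance of $x$ and $y$ divided by the variance of $x$, and the intercept then follows from the two means:
$$ θ_1 = \frac{Cov(X,Y)}{Var(X)}, \qquad θ_0 = \overline{Y} - θ_1 \times \overline{X} $$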

In [None]:
θ_1 = cov_xy / var_x
θ_0 = mean_y - θ_1 * mean_x

print(f'Coefficients:\n θ_0: {θ_0}  θ_1: {θ_1} ')
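
As a quick worked check of the coefficient estimates, the sketch below applies the same two steps to five made-up points and compares the result with `np.polyfit`, which fits a degree-1 polynomial by least squares. The variable names and data are assumptions for illustration.

```python
import numpy as np

# Made-up values, for illustration only
x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

# Slope = Cov(x, y) / Var(x), both using the n-1 divisor
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# Intercept follows from the means
intercept = y.mean() - slope * x.mean()

# Reference fit: polyfit returns the highest-degree coefficient first
ref_slope, ref_intercept = np.polyfit(x, y, deg=1)

print(f'closed form: intercept={intercept:.3f} slope={slope:.3f}')
print(f'np.polyfit : intercept={ref_intercept:.3f} slope={ref_slope:.3f}')
```

For these points the covariance is 2.0 and the variance of `x` is 2.5, so the slope is 0.8 and the intercept is 0.4.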

## 4. Make Predictions
The simple linear regression model is a line defined by coefficients estimated from training data. Once the coefficients are estimated, we can use them to make predictions. The equation to make predictions with a simple linear regression model is as follows:
$$ \hat{y} = θ_0 + θ_1 \times x $$

In [None]:
# Taking the values from the dataframe
x = data['Hours'].values.copy()

print(f'x: {x}')

# Predicting the scores from the calculated coefficients.
y_pred = θ_0 + θ_1 * x
print(f'\n\ny_pred: {y_pred}')

y = data['Scores'].values.copy()
print(f'\n\ny: {y}')

## 5. Visual Comparison for Correctness 


In [None]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=data["Hours"], y=data["Scores"], name='train', mode='markers', marker_color='rgba(152, 0, 0, .8)'))
fig.add_trace(go.Scatter(x=data["Hours"], y=y_pred, name='prediction', mode='lines+markers', marker_color='rgba(0, 152, 0, .8)'))

fig.update_layout(title = f'Obtained Score Percentage vs Hours-Studied\n (visual comparison for correctness)',title_x=0.5, xaxis_title= "Number of Hours Studied", yaxis_title="Obtained Percentage Score")
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

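**<font color='red'>Wait, this is not the right way to do it!</font>**<br>
So far we estimated the coefficients and judged the predictions on the same data, which tells us nothing about how the model behaves on data it has not seen. The usual approach is to split the data into a training set and a test set, fit on the training set only, and evaluate on the test set.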


Preparing the data

In [None]:
X = data.iloc[:,0].values
Y = data.iloc[:,1].values

Splitting Training and Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, shuffle=True)
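
To see what a shuffled split does under the hood, here is a rough from-scratch sketch using only NumPy. The function name `split_train_test`, the `seed` parameter and the demo data are assumptions for illustration; it is not how scikit-learn implements `train_test_split` internally.

```python
import numpy as np

def split_train_test(x, y, test_size=0.2, seed=0):
    """Shuffle the sample indices and hold out a fraction for testing."""
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)          # seeded so the split is repeatable
    indices = rng.permutation(len(x))          # shuffled 0..n-1
    n_test = int(np.ceil(test_size * len(x)))  # size of the held-out set
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Made-up data: y is simply 10 * x, so matching pairs are easy to verify
x_demo = np.arange(10, dtype=float)
y_demo = 10 * x_demo

x_tr, x_te, y_tr, y_te = split_train_test(x_demo, y_demo)
print(len(x_tr), len(x_te))
```

With scikit-learn, passing `random_state` to `train_test_split` serves the same purpose of making the split repeatable between runs.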

Calculating the coefficients again, but this time using only the training data.

In [None]:
mean_x = np.mean(X_train)
mean_y = np.mean(Y_train)

# ddof=1 divides by n-1, the same divisor that covariance() uses
var_x = np.var(X_train, ddof=1)
var_y = np.var(Y_train, ddof=1)

print('x stats: mean= %.3f   variance= %.3f' % (mean_x, var_x))
print('y stats: mean= %.3f   variance= %.3f' % (mean_y, var_y))

cov_xy = covariance(X_train, Y_train)
θ_1 = cov_xy / var_x
θ_0 = mean_y - θ_1 * mean_x

print(f'Coefficients:\n θ_0: {θ_0}  θ_1: {θ_1} ')

print(f'\n\nx: {X}')

# Predicting the scores for all samples from the calculated coefficients.
y_pred = θ_0 + θ_1 * X
print(f'\n\ny_pred: {y_pred}')

print(f'\n\ny: {Y}')

fig = go.Figure()

fig.add_trace(go.Scatter(x=X, y=Y, name='train', mode='markers', marker_color='rgba(152, 0, 0, .8)'))
fig.add_trace(go.Scatter(x=X, y=y_pred, name='prediction', mode='lines+markers', marker_color='rgba(0, 152, 0, .8)'))

fig.update_layout(title = f'Obtained Score Percentage vs Hours-Studied\n (visual comparison for correctness after Train-Test Split)',title_x=0.5, xaxis_title= "Number of Hours Studied", yaxis_title="Obtained Percentage Score")
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

Predicting only the test samples

In [None]:
y_pred = θ_0 + θ_1 * X_test
print(f'\n\nY_Test: {Y_test}')
print(f'\n\nY_Pred: {y_pred}')

# Task 01

## Calculating MSE

In [None]:
def calculate_mse(predicted, actual):
  # Mean of the squared differences between predictions and true values
  differences = pow(predicted - actual, 2)
  mse = np.sum(differences) / actual.size
  return mse
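
As a sanity check of the MSE computation, the sketch below works through five made-up points whose errors are easy to verify by hand. The helper name `mse` and the numbers are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Made-up values, for illustration only
x_demo = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y_true = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

# Predictions from the line y = 0.4 + 0.8 * x
y_hat = 0.4 + 0.8 * x_demo

# Errors are -0.2, 1.0, -0.6, -0.8, 0.6; their squares sum to 2.4
error = mse(y_true, y_hat)
print(error)
```

Dividing the sum of squares 2.4 by the five samples gives an MSE of 0.48, up to floating-point rounding.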

## Training Error

In [None]:
training_output = θ_0 + θ_1 * X_train
calculate_mse(training_output, Y_train)

# Gradient Descent Algorithm

Now you are going to implement the Gradient Descent algorithm from scratch. To understand how it works you will need some basic math and logical thinking. Gradient Descent can be used in different machine learning algorithms, including neural networks. For this lab, you are going to build it for a linear regression problem, because it's easy to understand and visualize.

### Linear Regression

To fit the regression line, we tune two parameters: the slope ($θ_1$) and the intercept ($θ_0$). Once the optimal parameters are found, we usually evaluate the result with the mean squared error (MSE). The smaller the MSE, the better the fit, so we are trying to minimize it.


### Gradient Descent
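Minimizing a function is exactly the task of the Gradient Descent algorithm. It takes the parameters and tunes them until a local minimum is reached. For linear regression the MSE is a convex function of the parameters, so this local minimum is also the global one.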

Let's break the process down into steps and explain what is actually going on under the hood:
1. First, we take the function we would like to minimize; very frequently this is the Mean Squared Error function.
2. We identify the parameters, such as $θ_1$ and $θ_0$ in the regression function, and take the partial derivatives of the MSE with respect to these parameters. This is the most crucial and hardest part. Each derivative tells us which way we should tune its parameter and by how much.
3. We update the parameters by iterating through our derivatives and gradually minimizing the MSE. In this process, we use an additional parameter, the **learning rate**, which defines the size of the step we take when updating the parameters in each iteration. By setting a smaller learning rate we make sure our model does not jump over the minimum point of the MSE and converges nicely.


The formula for the Mean Squared Error (MSE) is as follows:
$$ MSE = \frac{1}{n}\sum\limits_{i=1}^n(y_{i} - \hat{y}_i)^2 $$
where
$$ \hat{y}_i = θ_0 + θ_1 x_i $$
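
The partial derivatives mentioned in step 2 can be worked out directly from this formula. As a sketch of the derivation, writing the prediction for sample $i$ as $\hat{y}_i = θ_0 + θ_1 x_i$ and applying the chain rule gives:

$$ \frac{\partial MSE}{\partial θ_0} = -\frac{2}{n}\sum\limits_{i=1}^n (y_i - \hat{y}_i), \qquad \frac{\partial MSE}{\partial θ_1} = -\frac{2}{n}\sum\limits_{i=1}^n (y_i - \hat{y}_i)\, x_i $$

Each iteration then moves both parameters against their gradient. Here $α$ denotes the learning rate, a symbol introduced only for this sketch:

$$ θ_0 \leftarrow θ_0 - α \frac{\partial MSE}{\partial θ_0}, \qquad θ_1 \leftarrow θ_1 - α \frac{\partial MSE}{\partial θ_1} $$

A common variant minimizes $\frac{1}{2}MSE$ instead, which drops the factor 2 from both derivatives and is equivalent to halving the learning rate.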

## Test Error

In [None]:
calculate_mse(y_pred, Y_test)

# Task 2

In [None]:
from sklearn.metrics import mean_squared_error

def gradient_descent(X, y, lr=0.0001, epoch=12):
  θ_1, θ_0 = 0.0, 0.0 # parameters
  log, mse = [], [] # lists to store learning process
  # GRADIENT DESCENT ALGORITHM
  for i in range(epoch):
    sumyhat = 0
    sumxyhat = 0
    # CALCULATE SUM PORTIONS; COULD HAVE VECTORISED HERE
    for j in range(len(X)):
      sumyhat += θ_0 + θ_1*X[j] - y[j]
      sumxyhat += (θ_0 + θ_1*X[j] - y[j])*(X[j])
    # CALCULATE AND UPDATE θ_1 AND θ_0
    θ_1 -= lr*(1/len(X))*sumxyhat
    θ_0 -= lr*(1/len(X))*sumyhat

    # COULD HAVE ADDED THE CONDITION HERE
    # BUT MATHEMATICALLY IT SEEMED THAT THE THRESHOLD VALUE WOULD BE DIFFICULT TO ACCURATELY PUT IN
    # AND THE NUMBER OF EPOCHS CHOSEN HERE (20) OR EVEN 10 TIMES HIGHER IS MORE 

    # UPDATE LOGS AND MSES
    log.append((θ_1, θ_0))
    mse.append(mean_squared_error(y, (θ_1*X + θ_0)))

    # TASK 2 BEGINS HERE:
    # Add a condition to break the loop once the change in the error falls below 0.01

    if (i != 0 and abs(mse[i] - mse[i - 1]) <= 0.01):
      break
    
  return θ_1, θ_0, log, mse
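
The inner loop over samples can be replaced by array operations, as the comment in the function hints. The following is a minimal sketch of such a vectorized version, run on made-up data; the function name `gradient_descent_vectorized`, the ASCII parameter names, and the default `lr`, `epochs` and `tol` values are all assumptions chosen so that this small example converges, not tuned settings for the student dataset.

```python
import numpy as np

def gradient_descent_vectorized(X, y, lr=0.05, epochs=5000, tol=1e-9):
    """Fit y ~ theta_0 + theta_1 * X by gradient descent on the (half) MSE."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    theta_0, theta_1 = 0.0, 0.0
    mse_history = []
    for _ in range(epochs):
        # Residuals for all samples at once, computed before either update
        residual = theta_0 + theta_1 * X - y
        theta_0 -= lr * residual.mean()
        theta_1 -= lr * (residual * X).mean()
        mse_history.append(np.mean((theta_0 + theta_1 * X - y) ** 2))
        # Stop once the MSE no longer changes noticeably
        if len(mse_history) > 1 and abs(mse_history[-1] - mse_history[-2]) <= tol:
            break
    return theta_0, theta_1, mse_history

# Made-up values, for illustration only (least-squares line: y = 0.4 + 0.8 x)
x_demo = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y_demo = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

theta_0, theta_1, history = gradient_descent_vectorized(x_demo, y_demo)
print(theta_0, theta_1, len(history))
```

Both updates use the residuals computed at the start of the iteration, so the two parameters are updated simultaneously, just as in the loop-based version.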

In [None]:
θ_1, θ_0, log, mse = gradient_descent(X_train, Y_train , epoch = 2000)

In [None]:
log

In [None]:
y_pred = θ_0 + θ_1 * X
fig = go.Figure()

fig.add_trace(go.Scatter(x=X, y=Y, name='train', mode='markers', marker_color='rgba(152, 0, 0, .8)'))
fig.add_trace(go.Scatter(x=X, y=y_pred, name='prediction', mode='lines+markers', marker_color='rgba(0, 152, 0, .8)'))

fig.update_layout(title = f'Obtained Score Percentage vs Hours-Studied\n (visual comparison for correctness after Train-Test Split)',title_x=0.5, xaxis_title= "Number of Hours Studied", yaxis_title="Obtained Percentage Score")
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

y_pred = θ_0 + θ_1 * X_test
print(f'\n\nY_Test: {Y_test}')
print(f'\n\nY_Pred: {y_pred}')

# Task 3

In [None]:
def plot_mse(mse):
  error_fig = go.Figure()
  error_fig.add_trace(go.Scatter(x=list(range(len(mse))), y=mse, name='mse', mode='lines+markers', marker_color='rgba(0, 152, 0, .8)'))
  
  error_fig.update_layout(title = f'MSE vs Iterations',title_x=0.5, xaxis_title= "Iterations", yaxis_title="MSE")
  error_fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
  error_fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)

  error_fig.show()

In [None]:
plot_mse(mse)
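
The shape of the MSE curve depends strongly on the learning rate. The sketch below illustrates this on made-up data by running the same number of gradient descent steps with a small and a large learning rate; the helper `final_mse`, the data and the two learning-rate values are assumptions picked for this toy example only.

```python
import numpy as np

# Made-up values, for illustration only
x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

def final_mse(lr, steps=50):
    """Run plain gradient descent from (0, 0) and return the last MSE."""
    theta_0, theta_1 = 0.0, 0.0
    for _ in range(steps):
        residual = theta_0 + theta_1 * x - y
        theta_0 -= lr * residual.mean()
        theta_1 -= lr * (residual * x).mean()
    return np.mean((theta_0 + theta_1 * x - y) ** 2)

start_mse = np.mean(y ** 2)        # MSE at theta_0 = theta_1 = 0
small_step = final_mse(lr=0.05)    # moves steadily towards the minimum
large_step = final_mse(lr=0.2)     # overshoots on this data and blows up

print(f'start: {start_mse:.3f}  lr=0.05: {small_step:.3f}  lr=0.2: {large_step:.3e}')
```

On this data the smaller learning rate ends well below the starting error, while the larger one ends far above it because every step jumps over the minimum.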

# Linear Regression with Scikit-Learn

Loading Libraries

In [None]:
from sklearn.linear_model import LinearRegression

Reshaping the training and test sets

In [None]:
X_train = X_train.reshape(-1,1)
X_test = X_test.reshape(-1,1)
Y_train = Y_train.reshape(-1,1)
Y_test = Y_test.reshape(-1,1)

Training The Model

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

Printing the intercept and slope

In [None]:
print(regressor.intercept_, regressor.coef_)
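
The shapes of `intercept_` and `coef_` depend on the shape of the target passed to `fit`. The sketch below shows this on made-up data and compares scikit-learn's result with the covariance-based estimate; the variable names and sample values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values, for illustration only
x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

# Closed-form estimate: slope = Cov / Var, intercept from the means
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features)
X_2d = x.reshape(-1, 1)

model_1d = LinearRegression().fit(X_2d, y)                 # 1-D target
model_2d = LinearRegression().fit(X_2d, y.reshape(-1, 1))  # 2-D target

print(model_1d.intercept_, model_1d.coef_)   # scalar intercept, coef_ of shape (1,)
print(model_2d.intercept_, model_2d.coef_)   # intercept of shape (1,), coef_ of shape (1, 1)
```

With a reshaped two-dimensional target, as in this notebook, the slope is therefore read as `coef_[0, 0]` and the intercept as `intercept_[0]`.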

Making Predictions and visual comparison for correctness

In [None]:
y_pred = regressor.predict(X.reshape(-1,1))

fig = go.Figure()

fig.add_trace(go.Scatter(x=X, y=Y, name='train', mode='markers', marker_color='rgba(152, 0, 0, .8)'))
fig.add_trace(go.Scatter(x=X, y=y_pred.flatten(), name='prediction', mode='lines+markers', marker_color='rgba(0, 152, 0, .8)'))

fig.update_layout(title = f'Obtained Score Percentage vs Hours-Studied\n (visual comparison for correctness after Train-Test Split)',title_x=0.5, xaxis_title= "Number of Hours Studied", yaxis_title="Obtained Percentage Score")
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

y_pred = regressor.predict(X_test)
print(f'\n\nY_Test: {Y_test.flatten()}')
print(f'\n\nY_Pred: {y_pred.flatten()}')

# Multiple Linear Regression

In the previous section we performed linear regression involving two variables. Almost all real-world problems that you are going to encounter will have more than two variables. Linear regression involving multiple input variables is called "multiple linear regression". The steps to perform multiple linear regression are very similar to those for simple linear regression.
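
With several inputs the model becomes $\hat{y} = θ_0 + θ_1 x_1 + \dots + θ_p x_p$, and all coefficients can be estimated in one least-squares solve. The sketch below does this with `np.linalg.lstsq` on synthetic, noise-free data; the generated features, the "true" coefficients and the variable names are assumptions for illustration and have nothing to do with the petrol dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 samples, 3 features, noise-free linear target
X = rng.normal(size=(50, 3))
true_intercept = 4.0
true_coef = np.array([2.0, -1.0, 0.5])
y = true_intercept + X @ true_coef

# Prepend a column of ones so the intercept is estimated as theta[0]
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ theta = y
theta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(theta)   # intercept first, then one coefficient per feature
```

Because the target here contains no noise, the recovered values match the coefficients used to generate the data up to floating-point error.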

## Dataset
We will use multiple linear regression to predict gas consumption (in millions of gallons) in 48 US states based on gas taxes (in cents), per capita income (in dollars), paved highways (in miles) and the proportion of the population that has a driver's license.

[Dataset can be downloaded from here.](https://drive.google.com/file/d/1lHJC6sb3ZQoWrrFQJq5zUVxFbQbLlqHH/view?usp=sharing)

Loading the data

In [None]:
!gdown --id 1lHJC6sb3ZQoWrrFQJq5zUVxFbQbLlqHH
data = pd.read_csv('/content/petrol_consumption.csv')

In [None]:
X = data[['Petrol_tax', 'Average_income', 'Paved_Highways',
       'Population_Driver_licence(%)']]
Y = data['Petrol_Consumption']

data.head()

Splitting the dataset

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, shuffle=True)

## Training

Training the multiple linear regression model

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

Printing the coefficients

In [None]:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df

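For example, in a run where the fitted coefficient of "Petrol_tax" is -24.19, a unit increase in petrol tax corresponds to a decrease of 24.19 million gallons in gas consumption, with the other features held fixed. Similarly, a coefficient of 1324 for "Population_Driver_licence(%)" means that a unit increase in the proportion of the population with a driver's license corresponds to an increase of 1.324 billion gallons. Because `train_test_split` shuffles the data randomly, your exact values will differ from run to run. "Average_income" and "Paved_Highways" have very small coefficients and therefore little effect per unit on gas consumption.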
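
The "unit increase" reading of a coefficient can be checked numerically. The sketch below uses a hypothetical two-feature model with made-up intercept and coefficients (not the values fitted above) and shows that raising one feature by one unit, while keeping the other fixed, changes the prediction by exactly that feature's coefficient.

```python
import numpy as np

# Hypothetical model, for illustration only
intercept = 10.0
coef = np.array([-3.0, 0.5])

def predict(features):
    """Linear prediction: intercept plus the weighted sum of the features."""
    return intercept + np.dot(features, coef)

base = np.array([7.0, 40.0])
bumped = base + np.array([1.0, 0.0])   # first feature raised by one unit

change = predict(bumped) - predict(base)
print(predict(base), predict(bumped), change)
```

Here the prediction moves from 9.0 to 6.0, a change of -3.0, which is the first coefficient.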

## Predictions

In [None]:
Y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': Y_test, 'Predicted': Y_pred})
df