# Beginner's Guide to Linear Regression

Supervised learning is called regression if the dependent variable (the target) is continuous, and classification if the dependent variable is discrete. In other words, a regression model outputs a numerical value (a real number), whereas a classification model outputs a class (one of two or more classes).

In this practice session, we will discuss linear regression and its implementation in Python. Regression analysis is termed linear regression when the dependent variable (target) is modeled as a linear function of the independent variables (features).
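As a reference for the sections that follow, the linear relationship can be written out explicitly. The symbols here ($\beta_0$ for the intercept, $\beta_j$ for the weight of feature $x_j$, $\varepsilon$ for the error term, $p$ for the number of features) are notation assumed for this sketch.

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
$$

With $p = 1$ this is simple linear regression; with $p > 1$ it is multiple linear regression.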

To understand the math behind it, please refer to [this article](https://analyticsindiamag.com/beginners-guide-to-linear-regression-in-python/).

# Code Implementation

## Load a Regression Dataset

Install and import the necessary libraries and modules.

In [None]:
!python -m pip install pip --upgrade --user -q
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn --user -q

In [None]:
# Restart the kernel so the freshly installed packages are picked up
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_diabetes

Load a regression dataset from scikit-learn's built-in datasets. The features are already mean-centered and scaled, so the data is ready to use.

In [None]:
data = load_diabetes()
data.keys()

Generate the features and the target. Display the first five rows of the data.

In [None]:
features = pd.DataFrame(data['data'], columns=data['feature_names'])
target = pd.Series(data['target'], name='target')
features.head()
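Beyond `head()`, a quick look at the shape, missing values, and column means can confirm that the data is in the expected state. The following is a minimal sketch that reloads the dataset so it can run on its own; the checks chosen here are illustrative, not an exhaustive data audit.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Reload the dataset so this cell is self-contained
data = load_diabetes()
features = pd.DataFrame(data['data'], columns=data['feature_names'])
target = pd.Series(data['target'], name='target')

print(features.shape)               # (number of rows, number of features)
print(features.isna().sum().sum())  # total count of missing values
print(features.mean().round(6))     # column means, close to zero after centering
print(target.describe())            # summary statistics of the target
```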

## Simple Linear Regression

Simple linear regression is performed with one dependent variable and one independent variable. In our data, we declare the feature ‘bmi’ to be the independent variable.

Prepare X and y.

In [None]:
X = features['bmi'].values.reshape(-1,1)
y = target.values.reshape(-1,1)

Fit the model to the data.

In [None]:
simple = LinearRegression()
simple.fit(X,y)

The training is completed. We can explore the weight (coefficient) and bias (intercept) of the trained model.

In [None]:
simple.intercept_

In [None]:
simple.coef_
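For a single feature, the intercept and coefficient can also be reproduced by hand from the closed-form least-squares formulas (slope = covariance of x and y divided by the variance of x; intercept = mean of y minus slope times mean of x). The sketch below assumes the same 'bmi' feature and reloads the data so it runs independently; the variable names `slope`, `intercept`, and `reference` are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
bmi_index = list(data['feature_names']).index('bmi')
x = data['data'][:, bmi_index]   # single feature, shape (n,)
y = data['target']               # target, shape (n,)

# Closed-form ordinary least squares for one feature
x_centered = x - x.mean()
y_centered = y - y.mean()
slope = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
intercept = y.mean() - slope * x.mean()

# Reference fit with scikit-learn for comparison
reference = LinearRegression().fit(x.reshape(-1, 1), y)
print(slope, reference.coef_[0])
print(intercept, reference.intercept_)
```

Both printed pairs should agree up to floating-point rounding.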

Calculate the predictions following the formula, y = intercept + X*coefficient.

Predictions can also be calculated using the trained model.

In [None]:
calc_pred = simple.intercept_ + (X*simple.coef_)

pred = simple.predict(X)

In [None]:
# Compare with a tolerance; exact equality is fragile for floating-point values
abs(calc_pred - pred).max() < 1e-8

Plot the actual values and predicted values to get a better understanding.

In [None]:
plt.scatter(X,y, label='Actual')
plt.plot(X,pred, '-r', label='Prediction')
plt.xlabel('Feature X')
plt.ylabel('Target y')
plt.title('Simple Linear Regression', color='orange', size=14)
plt.legend()
plt.show()

scikit-learn's LinearRegression uses ordinary least squares, so the red line above is the straight line that minimizes the sum of squared errors on this data.

We can calculate the mean squared error value for the above regression using the following code.

Mean Squared Error

In [None]:
mean_squared_error(y, pred)
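To make the metric concrete, the mean squared error is simply the average of the squared differences between actual and predicted values. The sketch below uses a small made-up example (the arrays `y_true` and `y_pred` are illustrative values, not taken from the diabetes data) and checks the manual result against `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values chosen for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# Average of squared residuals: (0.25 + 0.25 + 0 + 1) / 4
manual_mse = np.mean((y_true - y_pred) ** 2)

print(manual_mse)                          # 0.375
print(mean_squared_error(y_true, y_pred))  # same value from scikit-learn
```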

R-Squared value

The coefficient of determination (CoD) is the ratio of the regression sum of squares to the total sum of squares. The total sum of squares (SST) is the sum of squared deviations of each y value from the mean value of y. The regression sum of squares (SSR) is the difference between the total sum of squares and the sum of squared errors (SSE), so CoD = SSR/SST = 1 - SSE/SST. When there is no error (SSE = 0), CoD becomes unity. When the sum of squared errors equals the total sum of squares (SSE = SST), CoD becomes zero.

CoD = 1 refers to a perfect prediction

CoD = 0 refers to a prediction that is no better than always predicting the mean of y

For a least-squares fit with an intercept, evaluated on the data it was trained on, CoD lies in [0, 1], which makes models comparable. On held-out data it can become negative when the model does worse than predicting the mean. CoD is also called the R-squared value. It can be calculated using the following code.

In [None]:
simple.score(X,y)
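The same quantity can be computed by hand as 1 - SSE/SST. The sketch below again uses small made-up arrays (illustrative values only) and compares the manual result with scikit-learn's `r2_score`, which implements the same metric that `score` reports for regressors.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values chosen for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

sse = np.sum((y_true - y_pred) ** 2)         # sum of squared errors: 1.5
sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20.0
r2 = 1 - sse / sst

print(r2)                        # approximately 0.925
print(r2_score(y_true, y_pred))  # same value from scikit-learn

# Always predicting the mean gives SSE = SST, so R-squared is zero
y_mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, y_mean_pred)
print(r2_mean)
```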

## Multiple Linear Regression 

Multiple linear regression is performed with more than one independent variable. We choose the following columns as our features.

In [None]:
columns = ['age', 'bmi', 'bp', 's3', 's5']
columns

Visualize the data.

In [None]:
for i in columns:
  plt.scatter(features[i], y)
  plt.xlabel(str(i))
  plt.show() 

The plots show that each feature, taken individually, is widely scattered against the target. However, the variation in target values at a given value of one feature may be explained by other features. In other words, a linear regression model with a single feature may fit the target poorly, while a model with multiple features may fit better because it can capture more of the underlying pattern in the data.
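One way to put a number on what the scatter plots suggest is the Pearson correlation of each chosen feature with the target. This is a minimal sketch that reloads the data so it runs on its own; using `corrwith` is one convenient option here, and the name `correlations` is chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes()
features = pd.DataFrame(data['data'], columns=data['feature_names'])
target = pd.Series(data['target'], name='target')

columns = ['age', 'bmi', 'bp', 's3', 's5']

# Pearson correlation of each selected feature with the target
correlations = features[columns].corrwith(target)
print(correlations.sort_values(ascending=False))
```

No single feature is perfectly correlated with the target, which is consistent with the scatter seen in the plots.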

In the simple linear regression implementation, we used all our data to fit the model. But how can we test our model? How well will it perform on unseen data? This is where the train-test split comes into play. We split our dataset into two sets: a training set and a validation set. We train our model with the training data only and evaluate it with the validation set.

Perform Train-Validation split

In [None]:
from sklearn.model_selection import train_test_split

X = features[columns]
# 80% training data, 20% validation data (test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=6)
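The fraction of rows held out is governed by `test_size`, and it is worth confirming the resulting sizes before training. The following sketch repeats the split in a self-contained form, with the same columns, `test_size=0.2`, and `random_state=6` as in the call above, and prints the shapes.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

data = load_diabetes()
features = pd.DataFrame(data['data'], columns=data['feature_names'])
columns = ['age', 'bmi', 'bp', 's3', 's5']

X = features[columns]
y = data['target'].reshape(-1, 1)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=6
)

print(X_train.shape, X_val.shape)  # (353, 5) (89, 5)
print(len(X_val) / len(X))         # fraction of rows held out for validation
```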

Build a linear regression model and fit the data.

In [None]:
multi = LinearRegression()
multi.fit(X_train, y_train)

What are the weights (coefficients) of our model? There should be five coefficients, one for each feature.

In [None]:
multi.coef_

In [None]:
multi.intercept_
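The raw coefficient array is easier to read when each weight is paired with its feature name. The sketch below refits the same model in a self-contained cell and wraps the coefficients in a pandas Series; the name `coef_table` is chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = load_diabetes()
features = pd.DataFrame(data['data'], columns=data['feature_names'])
columns = ['age', 'bmi', 'bp', 's3', 's5']

X = features[columns]
y = data['target'].reshape(-1, 1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=6
)

multi = LinearRegression().fit(X_train, y_train)

# coef_ has shape (1, 5) because y is a column vector; flatten it before labeling
coef_table = pd.Series(multi.coef_.ravel(), index=columns, name='coefficient')
print(coef_table)
```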

Predictions, error and R-squared value

In [None]:
pred = multi.predict(X_val)


In [None]:
mean_squared_error(y_val, pred)

In [None]:
multi.score(X_train, y_train), multi.score(X_val, y_val)

## Using statsmodels Library 

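We have used the scikit-learn library so far to perform linear regression. However, the statsmodels library can perform the same task. Fit the training data with the OLS (Ordinary Least Squares) model available in statsmodels. Unlike scikit-learn's LinearRegression, statsmodels does not add an intercept automatically, so a constant column is added to the features first.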

In [None]:
import statsmodels.api as sm
X_train = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train).fit()
model.summary()

It can be observed in the summary that the model weights, the intercept, and the R-squared value on the training data all match those obtained with scikit-learn's LinearRegression.

Prediction, error calculation

The model can be used to make predictions on the validation data too.

In [None]:
# Constant (intercept) must be added manually
X_val = sm.add_constant(X_val)
preds = model.predict(X_val)


In [None]:
mean_squared_error(y_val, preds)

Both methods perform identically: the validation error matches the one obtained with scikit-learn.