In [3]:
import pandas as pd
import numpy as np

import matplotlib as mpl
from matplotlib import pyplot as plt

In [4]:
data_df = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col = 0)
data_df

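| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |
| ... | ... | ... | ... | ... |
| 196 | 38.2 | 3.7 | 13.8 | 7.6 |
| 197 | 94.2 | 4.9 | 8.1 | 9.7 |
| 198 | 177.0 | 9.3 | 6.4 | 12.8 |
| 199 | 283.6 | 42.0 | 66.2 | 25.5 |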


In [5]:
X = data_df.TV.to_numpy().reshape(-1,1)
Y = data_df.sales.to_numpy().reshape(-1,1)

In [6]:
X.shape

(200, 1)

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [8]:
X_train, X_test, y_train, y_test  = train_test_split(X, Y, test_size=0.2, random_state=909)

In [9]:
X_test.shape

(40, 1)

In [10]:
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


In [11]:
linear_reg.coef_

array([[0.04710336]])

In [12]:
linear_reg.intercept_

array([7.04206318])
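
The coefficient and intercept above define the fitted line, roughly sales ≈ 7.042 + 0.0471 × TV. As a minimal sketch, the prediction for a single TV value can be reproduced by hand. The budget of 100 is a made-up input for illustration, not a row taken from the data.

```python
# Values copied from the linear_reg.coef_ and linear_reg.intercept_ outputs above.
slope = 0.04710336
intercept = 7.04206318

# Hypothetical TV value, chosen only for illustration.
tv_budget = 100.0

# A simple linear regression prediction is intercept + slope * x.
predicted_sales = intercept + slope * tv_budget
print(predicted_sales)
```

Calling `linear_reg.predict` on the same input should give the same number up to rounding, since the copied coefficients are truncated to eight decimals.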

In [13]:
y_pred = linear_reg.predict(X_test)

In [14]:
SSE = np.sum(np.power((y_test - y_pred), 2))
SSE

np.float64(350.3553922682761)

### Performance Metrics

* Mean Squared Error (MSE): $\frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{N}$, where $y_i$ is the actual value and $\hat{y}_i$ the predicted value for observation $i$.

* Mean Absolute Error (MAE): $\frac{\sum_{i=1}^{N}|y_i - \hat{y}_i|}{N}$
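
To make the two formulas concrete, here is a minimal sketch that evaluates them term by term in plain Python. The four actual and predicted values are made up for illustration and do not come from the Advertising data.

```python
# Toy values, invented for this example only.
y_actual = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.5, 5.5, 6.0, 10.0]

n = len(y_actual)

# Residual for each observation: actual minus predicted.
residuals = [a - p for a, p in zip(y_actual, y_predicted)]

# MSE averages the squared residuals; MAE averages their absolute values.
mse = sum(r ** 2 for r in residuals) / n
mae = sum(abs(r) for r in residuals) / n

print(mse, mae)  # 0.625 0.75
```

Because the residuals are squared, MSE penalizes the two errors of size 1 more heavily than the two errors of size 0.5, while MAE weights every unit of error equally.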


In [15]:
from sklearn.metrics import max_error, mean_absolute_error, mean_squared_error, r2_score

In [16]:
max_error(y_test, y_pred)

7.289078438466312

In [17]:
mean_absolute_error(y_test, y_pred)

2.338255454577244

In [18]:
mean_squared_error(y_test, y_pred)


8.758884806706902
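
The MSE is the SSE from cell In [14] divided by the number of test observations. A quick check using the numbers printed in this notebook (no new data is involved):

```python
# Values copied from earlier outputs in this notebook.
sse = 350.3553922682761  # output of In [14]
n_test = 40              # rows in X_test, from In [9]

# MSE is the sum of squared errors averaged over the test observations.
mse_from_sse = sse / n_test
print(mse_from_sse)
```

The result agrees with the value returned by `mean_squared_error` above.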

### R2 Coefficient

R-squared ($R^2$) is a goodness-of-fit measure for linear regression models. It is the proportion of the variance in the dependent variable that the independent variables collectively explain, and it is often quoted as a percentage. A value of 1 (100%) indicates a perfect fit. It is calculated as:

$R^2 = 1 - \frac{RSS}{TSS}$

Where RSS is the residual sum of squares and TSS is the total sum of squares.

The RSS (Residual Sum of Squares) is the sum of squared differences between the observed values of the dependent variable ($y_i$) and the values predicted by the regression model ($\hat{y}_i$). It is the same quantity as the SSE computed in cell In [14]:

$RSS = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2$

The TSS (Total Sum of Squares) is the total variation of the dependent variable around its mean ($\bar{y}$), that is, the sum of squared differences between each observed value and that mean:

$TSS = \sum_{i=1}^{N}(y_i - \bar{y})^2$

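For a least-squares model with an intercept, evaluated on the data it was fitted to, R-squared is between 0% and 100%:

* 0% represents a model that explains none of the variation in the response variable around its mean. Predicting the mean of the dependent variable for every observation does just as well as the regression model.
* 100% represents a model that explains all the variation in the response variable around its mean.

On a test set, as in this notebook, the value can fall below 0 when the model predicts worse than the mean of the test responses.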
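
The formula $R^2 = 1 - \frac{RSS}{TSS}$ can be sketched directly in plain Python. The helper name `r2_by_hand` and the toy values are assumptions made for this illustration; the notebook itself relies on `r2_score` from scikit-learn.

```python
def r2_by_hand(y_true, y_hat):
    """Compute R^2 = 1 - RSS / TSS for two equal-length lists (illustrative helper)."""
    mean_y = sum(y_true) / len(y_true)
    # RSS: squared differences between observed and predicted values.
    rss = sum((y - yh) ** 2 for y, yh in zip(y_true, y_hat))
    # TSS: squared differences between observed values and their mean.
    tss = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - rss / tss

# Toy observations, invented for this example only.
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_by_hand(y_true, [1.0, 2.0, 3.0, 4.0])    # predictions match exactly
mean_only = r2_by_hand(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
reversed_fit = r2_by_hand(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(perfect, mean_only, reversed_fit)  # 1.0 0.0 -3.0
```

The third case shows that predictions worse than the mean make RSS exceed TSS, which pushes the score below zero.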

In [19]:
r2_score(y_test, y_pred)

0.6956649384057587