<img src="data/images/div/lecture-notebook-header.png" />

# Classification & Regression: Linear Regression

Linear Regression is a statistical modeling technique used to understand and predict the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables, meaning that the dependent variable can be expressed as a linear combination of the independent variables.

In linear regression, the goal is to find the best-fitting straight line (or hyperplane in higher dimensions) that minimizes the difference between the predicted values and the actual observed values of the dependent variable. This line is determined by estimating the coefficients (or weights) associated with each independent variable. The line's equation in simple linear regression (with one independent variable) can be represented as:

$$y = \theta_0 + \theta_1*x$$

where:

* $y$ is the dependent variable (the variable we want to predict),
* $x$ is the independent variable (the variable used to predict y),
* $\theta_0$ is the y-intercept (the value of y when x is 0),
* $\theta_1$ is the coefficient (the slope of the line, representing the change in y for a unit change in x).

Linear regression is considered a linear model because the relationship between the dependent and independent variables is assumed to be linear. It assumes that the change in the dependent variable is a constant multiple of the change in the independent variable(s). Although the actual relationship between the variables in the real world may not be strictly linear, linear regression provides a simple and interpretable approximation that works well in many cases.

Linear Regression is one of the most fundamental and popular techniques for solving regression tasks. Its simplicity has several advantages:

* The best parameter values can be found analytically (in most cases).
* There are no "fundamental" hyperparameters that need tuning. Adding polynomial terms for Polynomial Linear Regression is more a matter of data preprocessing, while the different methods for regularization are more inherent to Linear Regression itself.
* Linear Regression models are typically easy to interpret by looking at the coefficients of the model.

Whether Linear Regression performs well compared to more sophisticated models typically depends on the nature of the data.

## Setting up the Notebook

### Specify How Plots Get Rendered

In [None]:
%matplotlib inline

### Import Required Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split

---

## Working with Toy Data (CSI Example)

The following toy data is the CSI example used in the lecture, where the goal is to estimate a person's height based on the size of a shoe print. As the training dataset, 20 (shoe print size, height)-pairs have been collected. Note that the code below directly adds the constant term $x_0=1$ to the data matrix `X`.

You can do a `print(X)` to see how the data matrix looks in the end.

In [None]:
data = np.array([
    (31.3, 180.3), (29.7, 175.3), (31.3, 184.8), (31.8, 177.8),
    (31.4, 182.3), (31.9, 185.4), (31.8, 180.3), (31.0, 175.5),
    (29.7, 177.8), (31.4, 185.4), (32.4, 190.5), (33.6, 195.0),
    (30.2, 175.3), (30.4, 180.3), (27.6, 172.7), (31.8, 182.9),
    (31.3, 189.2), (34.5, 193.7), (28.9, 170.3), (28.2, 173.8)
])

# Convert input and outputs to numpy arrays; makes some calculations easier
X = np.ones((data.shape[0], 2))
X[:,1] = data[:,0]
y = data[:,1]

Since we only have 1 input feature, it's always good to plot the data first.

In [None]:
plt.figure()
plt.ylim([165, 200])
plt.tick_params(labelsize=14)
plt.scatter(X[:,1], y)
plt.xlabel('shoe print length (cm)', fontsize=16)
plt.ylabel('body height (cm)', fontsize=16)
plt.tight_layout()
plt.show()

The plot shows that there is a reasonably good linear relationship between the shoe print size and the height of a person. Hence one can justifiably use Linear Regression to find a good fit of the data.

Let's extend this plot to also show 3 possible regression lines; again, matching the example from the lecture slides. You can of course change the values of `theta_i` to see the effects.

In [None]:
theta_1 = np.array([92, 3.1])
theta_2 = np.array([70, 3.6])
theta_3 = np.array([56, 3.95])


x_0 = [27.0, 35.0]
y_1 = [ (theta_1[1] * x + theta_1[0]) for x in x_0 ]
y_2 = [ (theta_2[1] * x + theta_2[0]) for x in x_0 ]
y_3 = [ (theta_3[1] * x + theta_3[0]) for x in x_0 ]

plt.figure()
plt.ylim([165, 200])
plt.tick_params(labelsize=14)
plt.scatter(X[:,1], y, s=50)
plt.plot(x_0, y_1)
plt.plot(x_0, y_2)
plt.plot(x_0, y_3)

plt.xlabel('shoe print length (cm)', fontsize=16)
plt.ylabel('body height (cm)', fontsize=16)
plt.tight_layout()
plt.show()

We made the argument that the orange line fits the data best, as it minimizes the average error between the predictions given by the regression line and the true values for the body height.

### Calculate Loss

We formalized this notion of the error (or loss, cost) using the Mean Squared Error (MSE). It sums up the squared differences between the predictions and the true values and normalizes the sum by the number of data samples (i.e., averaging).

$$
\begin{align*}
L & = \frac{1}{n}\sum_{i=1}^n (\hat{y_i} - y_i)^2 \\
  & = \frac{1}{n}\sum_{i=1}^n (\theta^Tx_i - y_i)^2 \\
  & = \frac{1}{n}\lVert X\theta - y\rVert^2
\end{align*}
$$

Note how the sum over all data samples can be rewritten using matrix/vector representations. In practice, this makes both the math and the implementation much more convenient (and even faster since we can use fast matrix/vector operations provided by numpy).

The following method `calc_loss()` simply implements the formula above.

In [None]:
def calc_loss(X, y, theta):
    
    # Calculate predicted value
    h = X.dot(theta)
    
    # Calculate square error for each ground truth / prediction pair
    e_squared = np.square(h - y)
    
    # Calculate loss as normalized sum of all squared errors
    loss = (1 / X.shape[0]) * np.sum(e_squared)
    
    # Of course, we could just use the method to calculate the MSE from scikit-learn
    #loss = mean_squared_error(y, h)
    
    # Return the loss
    return loss


Using this method, we can calculate the losses for the `theta_i` values as defined above.

In [None]:
print("MSE loss for blue line:   {:.3f}".format(calc_loss(X, y, theta_1)))
print("MSE loss for orange line: {:.3f}".format(calc_loss(X, y, theta_2)))
print("MSE loss for green line:  {:.3f}".format(calc_loss(X, y, theta_3)))
print("MSE loss for random line: {:.3f}".format(calc_loss(X, y, np.array([100, -5]))))

In line with our initial intuition, the orange line indeed yields the lowest loss. However, one can argue that the blue and green lines are not really that bad either, since their losses are still comparable, in contrast to the loss for a much more random setting of `theta`.

### Find the Best $\theta$ Using Random Search

For this really simple dataset with just one input feature, which in turn requires fitting only two parameters $\theta_0$ and $\theta_1$, we can in fact try a random search to find the best parameter values. In practice, this is of course not a viable approach.

Note that the parameter search is not truly random, as we considerably limit the range of possible values for $\theta_0$ and $\theta_1$. Identifying such meaningful ranges can be done by eyeballing the data or basic EDA. But again, random search is not a practical method anyway.

In [None]:
num_iterations = 1000

# Keep track of all data points for a plot
xs, ys, zs = [], [], []

# Initialize parameters
best_loss, best_theta0, best_theta1 = float("inf"), None, None

for i in range(num_iterations):
    
    # Select a random value for theta_0 and theta_1
    theta = np.array([np.random.uniform(0.0,100.0), np.random.uniform(0.0, 5.0)])
    
    # Calculate loss for the selected theta_0 and theta_1
    loss = calc_loss(X, y, theta)
    
    # Remember current parameter values and loss for plotting
    xs.append(theta[0])
    ys.append(theta[1])
    zs.append(loss)
    
    # If the loss is lower than the currently best loss, remember all parameters
    if loss < best_loss:
        best_loss = loss
        best_theta0 = theta[0]
        best_theta1 = theta[1]
        
print("The best values are: theta0={:.3f}, theta1={:.3f} (loss={:.3f})".format(best_theta0, best_theta1, best_loss))

As soon as `num_iterations` is large enough, we are very likely to get a decent estimate for $\theta_0$ and $\theta_1$, which shows when we plot the corresponding regression line. Keep in mind that we are cheating a bit in the sense that we randomly pick the values for $\theta$ from a rather narrow range.

In [None]:
x_line = [np.min(X[:,1]), np.max(X[:,1])]
y_line = [ (best_theta1 * x + best_theta0) for x in x_line ]

plt.figure()
plt.ylim([165, 200])
plt.tick_params(labelsize=14)
plt.scatter(X[:,1], y)
plt.plot(x_line, y_line, c='red')
plt.xlabel('shoe print length (cm)', fontsize=16)
plt.ylabel('body height (cm)', fontsize=16)
plt.tight_layout()
plt.show()

Since we kept track of the loss for all combinations of $\theta_0$ and $\theta_1$, we can also plot the loss function as a 3d scatter plot.

In [None]:
xs = np.array(xs)
ys = np.array(ys)
zs = np.array(zs)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# xs holds the theta_0 values and ys the theta_1 values (see the search loop above)
ax.set_xlabel(r'$\theta_0$', fontsize=16)
ax.set_ylabel(r'$\theta_1$', fontsize=16)
ax.set_zlabel('L', fontsize=16)
surf = ax.scatter(xs, ys, zs)
plt.tight_layout()
plt.show()

The plot shows -- maybe not fully convincingly -- that the loss function is convex, i.e., there is a unique minimum of the loss.
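One way to back up the visual impression is to look at the Hessian of the loss. For $L = \frac{1}{n}\lVert X\theta - y\rVert^2$, the Hessian is $\frac{2}{n}X^TX$, which does not depend on $\theta$. If all of its eigenvalues are positive, the loss is strictly convex and has a unique minimum. The following is a minimal sketch of this check for the shoe print data; the names `shoe` and `X_csi` are introduced here only so that the block is self-contained.

```python
import numpy as np

# Shoe print lengths (cm) from the CSI toy dataset above
shoe = np.array([31.3, 29.7, 31.3, 31.8, 31.4, 31.9, 31.8, 31.0, 29.7, 31.4,
                 32.4, 33.6, 30.2, 30.4, 27.6, 31.8, 31.3, 34.5, 28.9, 28.2])

# Data matrix with the constant term x_0 = 1 in the first column
X_csi = np.column_stack([np.ones_like(shoe), shoe])

# Hessian of the MSE loss w.r.t. theta: (2/n) * X^T X
hessian = (2 / X_csi.shape[0]) * X_csi.T @ X_csi

# eigvalsh is meant for symmetric matrices; eigenvalues come back in ascending order
eigenvalues = np.linalg.eigvalsh(hessian)
print("Eigenvalues of the Hessian:", eigenvalues)
print("Strictly convex:", bool(np.all(eigenvalues > 0)))
```

The two eigenvalues differ by several orders of magnitude, which matches the elongated, valley-like shape of the scatter plot.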

### Performing Linear Regression

In practice, of course, Linear Regression is performed in much smarter ways. As the nature of the loss function for Linear Regression allows it, the best values for $\theta$ can be found analytically. In the lecture, we have seen how to take the loss function L

$$
L = \frac{1}{n}\lVert X\theta - y\rVert^2
$$

calculate the derivative of L w.r.t. $\theta$, set the derivative to 0 and solve for $\theta$ to arrive at the **Normal Equation**

$$
\theta = (X^TX)^{-1} X^Ty
$$

that allows us to calculate the best values of $\theta$.
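As a minimal sketch, the Normal Equation can be evaluated directly with numpy for the shoe print data. The variable names (`csi`, `X_ne`, `y_ne`, `theta_ne`) are chosen here for illustration only. Instead of computing the inverse explicitly, the sketch solves the equivalent linear system $(X^TX)\theta = X^Ty$, which is usually the numerically safer choice, and cross-checks the result against numpy's least-squares solver.

```python
import numpy as np

# (shoe print length, body height) pairs from the CSI toy dataset above
csi = np.array([
    (31.3, 180.3), (29.7, 175.3), (31.3, 184.8), (31.8, 177.8),
    (31.4, 182.3), (31.9, 185.4), (31.8, 180.3), (31.0, 175.5),
    (29.7, 177.8), (31.4, 185.4), (32.4, 190.5), (33.6, 195.0),
    (30.2, 175.3), (30.4, 180.3), (27.6, 172.7), (31.8, 182.9),
    (31.3, 189.2), (34.5, 193.7), (28.9, 170.3), (28.2, 173.8)
])

# Data matrix with the constant term x_0 = 1 in the first column
X_ne = np.column_stack([np.ones(len(csi)), csi[:, 0]])
y_ne = csi[:, 1]

# Solve (X^T X) theta = X^T y without forming the inverse explicitly
theta_ne = np.linalg.solve(X_ne.T @ X_ne, X_ne.T @ y_ne)

# Cross-check against numpy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_ne, y_ne, rcond=None)

print("Normal Equation: theta0={:.3f}, theta1={:.3f}".format(*theta_ne))
print("Matches lstsq:", np.allclose(theta_ne, theta_lstsq))
```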

As Linear Regression is one of the most fundamental but also popular methods, scikit-learn of course provides an implementation: [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). An important parameter of this implementation is `fit_intercept`, which specifies whether to calculate the intercept $\theta_0$ or not. If set to `False`, no intercept will be used in calculations (i.e., the data is expected to be centered). This is because the scikit-learn implementation treats $\theta_0$ (`intercept_`) and the other $\theta_{i\neq 0}$ (`coef_`) separately.

As we already added the constant term to our data matrix $X$, the $\theta_0$ is already part of `coef_` and we can set `fit_intercept=False`.

In [None]:
linreg = LinearRegression(fit_intercept=False).fit(X, y)

print('Intercept: {}, Coefficients: {}'.format(linreg.intercept_, linreg.coef_))

theta0, theta1 = linreg.coef_[0], linreg.coef_[1]

In practice, we do not usually add the constant term to the data ourselves. In this case, we typically set `fit_intercept=True` and find $\theta_0$ in `intercept_`.

In [None]:
X_raw = data[:,0].reshape(-1,1)
linreg = LinearRegression(fit_intercept=True).fit(X_raw, y)

print('Intercept: {}, Coefficients: {}'.format(linreg.intercept_, linreg.coef_))

theta0, theta1 = linreg.intercept_, linreg.coef_[0]

Both results are of course the same, but it's a detail one has to keep in mind.

In [None]:
print("The best values are: theta0={:.3f}, theta1={:.3f}".format(theta0, theta1))

Now that we have found the best values for $\theta$, we can plot the respective regression line. Of course, it looks very similar to the orange regression line above, and also to the one found through random search if we search long enough.

In [None]:
x_line = [np.min(X_raw), np.max(X_raw)]
y_line = [ (theta1 * x + theta0) for x in x_line ]

plt.figure()
plt.ylim([165, 200])
plt.tick_params(labelsize=14)
plt.scatter(X_raw, y)
plt.plot(x_line, y_line, c='red')
plt.xlabel('shoe print length (cm)', fontsize=16)
plt.ylabel('body height (cm)', fontsize=16)
plt.tight_layout()
plt.show()

### Predict Height of Suspect

Now that we know how to fit a Linear Regression model on our example dataset, we can finally predict the height of the suspect based on the shoe print we found. Recall that the size of the shoe print was 32.2 cm.

In [None]:
y_pred = linreg.predict([[32.2]])

print('The estimated height of the suspect is: {:.1f} cm'.format(y_pred.squeeze()))

---

## Polynomial Linear Regression

A common misconception is that Linear Regression always yields a straight regression *line* (or a flat plane or hyperplane in higher dimensions). By transforming the data to include polynomial terms based on the input features, more complex curves are possible.

As an example, we use a toy dataset we've seen in earlier lectures to illustrate regression tasks. In this case, we do not explicitly add the constant term to the data matrix $X$.

In [None]:
data = np.array([
    [2.0, 11.0], [18.0, 9.0], [10.0, 4.0], [2.5, 9], [4, 9], [4.5, 8.5],
    [9.5, 4.5], [8.5, 5], [5.5, 5.5], [4.5, 6.5], [3.8, 6], [7.5, 6.5], [7.7, 7.3],
    [11.5, 6], [12.5, 4.5], [13.5, 4.5], [13, 3.5], [14, 6.2], [14.7, 3.7],
    [14.7, 3.7], [15.2, 6], [16.5, 7]
])

X_raw = data[:,0].reshape(-1, 1)
y = data[:,1]

print(X_raw.shape, y.shape)

Again, we have only one input feature, so let's plot the data.

In [None]:
t1 = np.arange(0., 20., 0.01)
plt.figure()
plt.xlim([0.0, 20.0])
plt.ylim([0.0, 14.0])
plt.scatter(X_raw, y, marker='x', s=75)
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)
plt.tight_layout()
plt.show() 

One can already see that fitting a straight line probably won't yield a good fit. Hence the idea of Polynomial Linear Regression: transform the data by adding polynomial terms up to a degree of $p$:

$$
\hat{y_i} = \theta_0 + \theta_1x_{i} + \theta_2x^2_{i} + ... + \theta_px^p_{i}
$$

Conveniently, scikit-learn provides [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) to automate this transformation. The main parameter is the maximum degree $p$ for the polynomial terms.

In the following, try different values for the maximum degree and see the effects on the final results.

In [None]:
#poly = PolynomialFeatures(1)
poly = PolynomialFeatures(2)
#poly = PolynomialFeatures(3)
#poly = PolynomialFeatures(4)
#poly = PolynomialFeatures(5)
#poly = PolynomialFeatures(8)

X_poly = poly.fit_transform(X_raw)
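To make the transformation concrete, here is a minimal sketch on two made-up input values (`X_demo` is not part of the dataset). For a single input feature and a maximum degree of 2, each input $x$ becomes the row $[1, x, x^2]$, so the first column is the constant term.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up input values, shaped as a column (n_samples, n_features)
X_demo = np.array([[2.0], [3.0]])

# Degree 2 with one feature: columns are 1, x, x^2
X_demo_poly = PolynomialFeatures(2).fit_transform(X_demo)
print(X_demo_poly)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```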

When fitting a Linear Regression model, note that we set `fit_intercept=False`, as [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) will explicitly add the constant term, even for $p=1$.

In [None]:
poly_reg = LinearRegression(fit_intercept=False).fit(X_poly, y)

np.set_printoptions(suppress=True)

print(np.around(poly_reg.coef_, 2))

To plot the regression line, we simply predict the output values for a series of input values in the range of the dataset and plot the result as a line.

In [None]:
x_test = np.arange(0., 20., 0.1)
# Reuse the already fitted transformer; the new inputs only need to be transformed
X_test_poly = poly.transform(x_test.reshape(-1,1))

plt.figure()
plt.xlim([0.0, 20.0])
plt.ylim([0.0, 14.0])
plt.scatter(X_raw, y, marker='x', s=50)
plt.plot(x_test, poly_reg.predict(X_test_poly), c='red', lw=2, linestyle='--')
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)
plt.tight_layout()
plt.show() 

For increasing values of the maximum polynomial degree, the regression line becomes more and more "wiggly" as the model has more and more capacity to fit the training data. Although we do not explicitly quantify it, this will very likely lead to severe overfitting, as the model fails to generalize beyond the training data.

Intuitively, setting $p=2$ yields the best fit. But again, in practice, this needs to be properly evaluated, as we will illustrate using a real-world dataset below.

---

## Linear Regression using Vessel Details Dataset

We now want to apply Linear Regression to our Vessel Details dataset. More specifically, our goal is to predict the `Efficiency` of a vessel based on the other features of a vessel.

### Prepare Training & Test Data

#### Load Dataset from File

As usual, we use `pandas` to load the `csv` file with the details about all vessels.

In [None]:
df = pd.read_csv('data/datasets/vessels/vessel-details.csv')

# Shuffle dataset (often a good practice)
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Show the first 5 rows
df.head()

#### Data Selection

To skip any more sophisticated data preprocessing steps, we consider only the convenient features, that is, only the numerical features, for our model. In particular, this means that we do not have to consider any encoding strategies for categorical features. To keep it even simpler, we also remove all rows containing any missing value.

In [None]:
# Keep only the numerical attributes to keep it simple here
df = df[['Build Year', 'Length', 'Width', 'Gross Tonnage', 'Deadweight Tonnage', 'Efficiency']]

# Remove all rows with any NaN values; again, just to keep it simple
df = df.dropna()

df.head()

#### Data Cleaning

Let's first look at some statistics of our dataset:

In [None]:
df.describe()

When looking at the `Efficiency` values, we can see that we have some arguably "wrong" values. For example, no ship can have an `Efficiency` value of 0. Also, since we assume that `Efficiency` has a range of 0 to 100%, values larger than 100 seem incorrect as well. In the following, we perform a very simple step of outlier removal by deleting all rows with an `Efficiency` value below the 10% quantile or above the 90% quantile. In other words, we keep only about 80% of our dataset.

**Important:** In practice, more thoughts should go into the preprocessing!

In [None]:
# Compute 10% and 90% quantiles for Efficiency
q10 = df['Efficiency'].quantile(0.1)
q90 = df['Efficiency'].quantile(0.9)

df = df.drop(df[(df['Efficiency'] < q10) | (df['Efficiency'] > q90)].index)

#### Generate Training & Test Data

Ignoring the `Efficiency`, there are 5 features: `Build Year`, `Length`, `Width`, `Gross Tonnage`, and `Deadweight Tonnage`. In the following, we aim to predict the `Efficiency` of a vessel given all 5 of these features. Feel free to change this to any other combination, e.g., using only a subset of the features as input. Lastly, we use a 50/50 split (`test_size=0.5`) to create the training and test data.

In [None]:
# Convert data to numpy arrays
X = df[['Build Year', 'Length', 'Width', 'Gross Tonnage', 'Deadweight Tonnage']].to_numpy()
y = df[['Efficiency']].to_numpy().squeeze()

# Split dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

print("Size of training set: {}".format(len(X_train)))
print("Size of test: {}".format(len(X_test)))

#### Normalize Data via Standardization

Since we want to consider different polynomial degrees, it is strongly recommended, and almost required, to normalize/standardize the data. As the [`Ridge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) implementation we use later also applies regularization, which is sensitive to the scale of the features, we normalize the data via standardization.

In [None]:
# We fit the scaler based on the training data only
scaler = StandardScaler().fit(X_train)

# Of course, we need to convert both training and test data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

The data is now ready to perform Linear Regression. In the following, to keep things simple, we do not use K-Fold Cross-Validation or similar methods but directly compare different hyperparameters using the test set.
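For reference, a hedged sketch of what K-Fold Cross-Validation could look like for choosing the polynomial degree. It uses synthetic stand-in data (a noisy quadratic in one feature, not the vessel dataset), and the names `X_cv`, `y_cv` and `cv_mse` are introduced only for this illustration. Wrapping all steps in a pipeline ensures that the scaler and the polynomial transformation are refitted on the training part of every fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): a noisy quadratic in one feature
rng = np.random.default_rng(0)
X_cv = rng.uniform(-3, 3, size=(200, 1))
y_cv = 1.0 + 2.0 * X_cv[:, 0] - 0.5 * X_cv[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

cv_mse = {}
for p in range(1, 6):
    # include_bias=False because LinearRegression fits the intercept itself here
    model = make_pipeline(StandardScaler(),
                          PolynomialFeatures(p, include_bias=False),
                          LinearRegression())
    # scikit-learn reports the negated MSE so that "larger is better" holds for all scorers
    scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring="neg_mean_squared_error")
    cv_mse[p] = -scores.mean()
    print("p={}: mean CV MSE={:.3f}".format(p, cv_mse[p]))
```

With this setup, the degree with the lowest mean cross-validated MSE would be selected, and the test set would only be used once for the final evaluation.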

### Linear Regression

We first apply basic [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) which does not perform any regularization. However, we want to transform our data to include polynomial and interaction terms up to a degree of $p$ to see which degree yields the best results.

Recall from the lecture that the number of terms given a polynomial degree of $p$ and a number of input features $d$ is

$$
\#terms = \binom{p+d}{p}
$$

Since our dataset has 5 input features, this equation simplifies to

$$
\#terms = \binom{p+5}{5}
$$
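The formula can be checked with a few lines of code. The sketch below computes the binomial coefficient for $p = 1, \dots, 6$ and compares it with the number of output columns that `PolynomialFeatures` reports when fitted on a dummy row with 5 features; the list `counts` is just a helper for this illustration.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

d = 5  # number of input features

counts = []
for p in range(1, 7):
    n_formula = comb(p + d, d)
    # Fit on a dummy row just to let PolynomialFeatures count its output columns
    n_sklearn = PolynomialFeatures(p).fit(np.zeros((1, d))).n_output_features_
    counts.append((p, n_formula, n_sklearn))
    print("p={}: formula={}, PolynomialFeatures={}".format(p, n_formula, n_sklearn))
```

The number of terms grows from 6 for $p=1$ to 462 for $p=6$, which illustrates how quickly the model capacity increases with the degree.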

Below we consider $p$ as our hyperparameter, i.e., we transform the dataset using different polynomial degrees, apply Linear Regression, and check the Mean Squared Error (MSE) for each setup.

In [None]:
for p in range(1, 7):
    
    # Transform data w.r.t. the polynomial degree p
    poly = PolynomialFeatures(p)
    X_train_poly = poly.fit_transform(X_train)
    # The test data is only transformed using the transformer fitted on the training data
    X_test_poly = poly.transform(X_test)
    
    # Train Linear Regressor on transformed data
    # fit_intercept=False since the transformation already adds the constant term (even for p=1)
    poly_reg = LinearRegression(fit_intercept=False).fit(X_train_poly, y_train)

    # Predict values for training and test set
    y_train_pred = poly_reg.predict(X_train_poly)
    y_test_pred = poly_reg.predict(X_test_poly)
    
    # Calculate MSE 
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)
    
    
    print('Degree of polynomial: {} => MSE (train/test): {:.2f}/{:.2f} (#terms: {})'.format(p, mse_train, mse_test, X_train_poly.shape[1]))

According to these results, we achieve the lowest MSE for the test data with $p=3$. We can also see that the MSE explodes once $p$ gets too large due to overfitting. While the training error goes down since higher polynomials allow for a better fit of the training data, the fit fails to generalize and thus yields poor performance on the test data.

You can increase the value of $p$ for the loop even further to see how bad it quickly gets.

### Ridge Regression

We introduced the concept of regularization to address the problem of overfitting. In a nutshell, regularization extends the loss function by adding a term that penalizes large values for $\theta$ in order to smooth the regression line, i.e., to generalize better.

In the lecture, we use the squared L2 norm of $\theta$ for regularization. Linear Regression using the squared L2 norm is also called Ridge Regression. (There are other ways to define the regularization but that's not so important here.)

$$
L = \frac{1}{n}\lVert X\theta - y\rVert^2 + \lambda\frac{1}{n}\lVert \theta \rVert^2_2
$$

There are two common approaches to implement regularization. In one case, $\theta_0$ (i.e., the intercept/bias) is not considered to be subject to regularization. There are theoretical arguments for this, which are again beyond the scope of this notebook. In this case, the Normal Equation for Ridge Regression is

$$
\theta = (X^TX + \lambda \begin{bmatrix}
0 & &  &   \\
 & 1 &  &  \\
 &  & \ddots & \\
 &  &  & 1 \\
\end{bmatrix} )^{-1} X^Ty
$$

In the other approach, all $\theta_i$ are considered for regularization. This simplifies the Normal Equation to

$$
\theta = (X^TX + \lambda I )^{-1} X^Ty
$$

where $I$ is the identity matrix with 1s in diagonal and 0s everywhere else.
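Both variants can be sketched in a few lines of numpy. The example below uses synthetic data generated only for this illustration (not the vessel dataset), and the names `X_syn`, `y_syn` and `lam` are assumptions of the sketch. It also compares the results with scikit-learn's `Ridge`: with `fit_intercept=False` and an explicit constant column, all $\theta_i$ are regularized, whereas with `fit_intercept=True` the intercept is left unregularized.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data purely for illustration (not the vessel dataset)
rng = np.random.default_rng(0)
X_syn = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y_syn = X_syn @ np.array([2.0, 1.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=50)

lam = 1.0
I = np.eye(X_syn.shape[1])

# All theta_i regularized: (X^T X + lambda * I) theta = X^T y
theta_all = np.linalg.solve(X_syn.T @ X_syn + lam * I, X_syn.T @ y_syn)

# theta_0 not regularized: zero out the first diagonal entry
I_no_bias = I.copy()
I_no_bias[0, 0] = 0.0
theta_no_bias = np.linalg.solve(X_syn.T @ X_syn + lam * I_no_bias, X_syn.T @ y_syn)

# Ridge on the data with constant column: the constant is treated like any other feature
ridge_all = Ridge(alpha=lam, fit_intercept=False).fit(X_syn, y_syn)

# Ridge on the data without constant column: the intercept is fitted separately
ridge_no_bias = Ridge(alpha=lam, fit_intercept=True).fit(X_syn[:, 1:], y_syn)
theta_sklearn = np.r_[ridge_no_bias.intercept_, ridge_no_bias.coef_]

print("All regularized:       ", np.round(theta_all, 3))
print("theta_0 not regularized:", np.round(theta_no_bias, 3))
print("Matches Ridge (fit_intercept=False):", np.allclose(ridge_all.coef_, theta_all, atol=1e-6))
print("Matches Ridge (fit_intercept=True): ", np.allclose(theta_sklearn, theta_no_bias, atol=1e-6))
```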

We can now perform the same evaluation as above, but using [`Ridge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) Regression. Again, we test different values for the polynomial degree $p$. The regularization parameter $\lambda$ in the formula above is represented by parameter `alpha`. `alpha=1.0` is the default value, but feel free to change it and compare the results.

In [None]:
for p in range(1, 7):
    
    # Transform data w.r.t. the polynomial degree p
    poly = PolynomialFeatures(p)
    X_train_poly = poly.fit_transform(X_train)
    # The test data is only transformed using the transformer fitted on the training data
    X_test_poly = poly.transform(X_test)
    
    # Train Ridge Regressor on transformed data
    # fit_intercept=False since the transformation already adds the constant term (even for p=1)
    poly_reg = Ridge(alpha=1.0, fit_intercept=False).fit(X_train_poly, y_train)

    # Predict values for training and test set
    y_train_pred = poly_reg.predict(X_train_poly)
    y_test_pred = poly_reg.predict(X_test_poly)
    
    # Calculate MSE 
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)
    
    print('Degree of polynomial: {} => MSE (train/test): {:.2f}/{:.2f} (#terms: {})'.format(p, mse_train, mse_test, X_train_poly.shape[1]))

While the lowest MSE for the test data is still for $p=3$, we see that the results for larger values of `p` start to differ significantly:

* The training loss does not decrease as much, since the regularization term penalizes the large coefficients needed to fit the training data very closely.

* The test loss explodes far less with regularization than without.

---

## Summary

Linear regression is a widely used machine learning algorithm for predicting a continuous target variable based on one or more independent variables. It assumes a linear relationship between the independent variables and the target variable. The goal of linear regression is to find the best-fitting line or hyperplane that minimizes the difference between the predicted values and the actual observed values.

One of the main advantages of linear regression is its simplicity and interpretability. The algorithm provides coefficients for each independent variable, allowing us to understand the magnitude and direction of their impact on the target variable. This makes it easier to explain and interpret the results of the model. Additionally, linear regression performs well when the underlying relationship between the variables is approximately linear, and it is computationally efficient, making it suitable for large datasets.

However, linear regression has some limitations. It assumes a linear relationship between the variables, which may not hold in all cases. If the relationship is nonlinear, linear regression may not capture it accurately. Additionally, linear regression is sensitive to outliers, as they can heavily influence the line of best fit. Another limitation is that linear regression assumes that the independent variables are not strongly correlated with each other (i.e., no multicollinearity), as this can lead to unstable and unreliable coefficient estimates.

In summary, linear regression is a simple and interpretable algorithm for predicting continuous target variables. Its pros include simplicity, interpretability, and efficiency. However, it has limitations in capturing nonlinear relationships, is sensitive to outliers, and assumes no multicollinearity. Therefore, it is important to assess the assumptions and limitations of linear regression before using it in a particular context.