# Linear Regression

Linear regression is a statistical method used to model the relationship between a **dependent variable** (the outcome you're trying to predict) and one or more **independent variables** (the predictors or factors that influence the outcome). The goal of linear regression is to find a mathematical function that maps the independent variables to the dependent variable as accurately as possible.

### Variables in Linear Regression

- The dependent variable is often denoted as $y$, which represents the output or response.
- The independent variables are represented as $x_1, x_2, \dots, x_r$, where $r$ is the number of independent variables or predictors.

If we have multiple independent variables, we can group them into a vector $\mathbf{x} = (x_1, x_2, \dots, x_r)$. 

### The Linear Regression Model

Linear regression assumes a **linear relationship** between the dependent variable $y$ and the independent variables $\mathbf{x}$. This relationship can be expressed by the following equation, known as the **regression equation**:

$$
y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + \varepsilon
$$

- $\beta_0$ is the **intercept**, representing the value of $y$ when all $x$ values are zero.
- $\beta_1, \beta_2, \dots, \beta_r$ are the **regression coefficients**; each $\beta_i$ indicates how much $y$ changes with a one-unit change in $x_i$ when the other independent variables are held fixed.
- $\varepsilon$ is the **error term**, accounting for any variability in $y$ that cannot be explained by the linear relationship with $x_1, \dots, x_r$. It is unobserved; the residuals computed from a fitted model are estimates of it.

### Estimating the Regression Function

To estimate the coefficients $\beta_0, \beta_1, \dots, \beta_r$ from the data, we calculate their **estimators**, which are typically denoted as $b_0, b_1, \dots, b_r$. Using these estimators, we can write the **estimated regression function** as:

$$
f(\mathbf{x}) = b_0 + b_1 x_1 + \cdots + b_r x_r
$$

This function predicts the value of $y$ based on the input values of $x_1, \dots, x_r$. The goal of linear regression is to find the values of $b_0, b_1, \dots, b_r$ that result in the best possible predictions.
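
As a minimal sketch, the estimated regression function is an intercept plus a dot product. The helper name `predict` and the coefficient values below are hypothetical and chosen only for illustration.

```python
import numpy as np

def predict(x, b0, b):
    """Evaluate f(x) = b0 + b1*x1 + ... + br*xr for one or many observations."""
    x = np.asarray(x, dtype=float)
    b = np.asarray(b, dtype=float)
    # For a 2-D x (one row per observation) this returns one prediction per row
    return b0 + x @ b

# Hypothetical estimates for r = 2 predictors (illustrative values only)
b0, b = 1.0, [2.0, -0.5]

print(predict([3.0, 4.0], b0, b))                 # 1 + 2*3 - 0.5*4 = 5.0
print(predict([[3.0, 4.0], [0.0, 0.0]], b0, b))   # [5. 1.]
```

When every $x_i$ is zero, the prediction reduces to the intercept $b_0$, as the second row shows.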

### Residuals and Error Minimization

For each observation $i$, we can compute the **predicted response** $f(\mathbf{x}_i)$, and compare it to the actual response $y_i$. The difference between them is called the **residual**:

$$
\text{Residual} = y_i - f(\mathbf{x}_i)
$$

The smaller the residuals, the better the model fits the data. Therefore, linear regression aims to find the coefficients $b_0, b_1, \dots, b_r$ that minimize the overall error, often measured as the **sum of squared residuals (SSR)**:

$$
SSR = \sum_{i=1}^{n} \left( y_i - f(\mathbf{x}_i) \right)^2
$$

This method of minimizing the SSR to determine the best-fitting line is known as the **ordinary least squares (OLS)** method.
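
The sketch below illustrates the idea with NumPy on a small sample data set with one predictor. It computes the SSR for a candidate line and uses `np.linalg.lstsq` to find the minimizing coefficients. The helper name `ssr` is an assumption made for this example.

```python
import numpy as np

# Small illustrative data set with a single predictor
x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

def ssr(b0, b1):
    """Sum of squared residuals for the line f(x) = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return np.sum(residuals ** 2)

# Design matrix: a column of ones for the intercept, then the predictor
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution; lstsq also returns diagnostics that are ignored here
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, SSR = {ssr(b0, b1):.4f}")

# Nudging either coefficient away from the OLS solution increases the SSR
print(ssr(b0 + 0.5, b1) > ssr(b0, b1))
print(ssr(b0, b1 + 0.05) > ssr(b0, b1))
```

For this data the solution is an intercept of about 5.63 and a slope of 0.54, the same values scikit-learn reports for this data later in this section.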

## Multiple Linear Regression

**Multiple linear regression** extends simple linear regression (the case of a single independent variable, $r = 1$) to two or more independent variables. The relationship is still linear, but several predictors now influence the dependent variable.

For example, if there are two independent variables $x_1$ and $x_2$, the estimated regression function is:

$$
f(x_1, x_2) = b_0 + b_1 x_1 + b_2 x_2
$$

This equation represents a **regression plane** in three-dimensional space. The goal is to find the values of $b_0, b_1$, and $b_2$ that make the plane fit the data points as closely as possible.

In general, for $r$ independent variables, the estimated regression function becomes:

$$
f(x_1, \dots, x_r) = b_0 + b_1 x_1 + \cdots + b_r x_r
$$

There are $r + 1$ unknown coefficients (including the intercept $b_0$), which need to be determined by minimizing the SSR.
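
One standard way to obtain these $r + 1$ coefficients is to solve the normal equations $X^\top X \, \mathbf{b} = X^\top \mathbf{y}$, where $X$ is the data matrix with a leading column of ones. The following is a sketch for $r = 2$ on a small sample data set, not a recommendation for production code, where a dedicated least-squares solver is usually the more numerically robust choice.

```python
import numpy as np

# Sample data: 8 observations, r = 2 predictors
x = np.array([
    [0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]
], dtype=float)
y = np.array([4, 5, 20, 14, 32, 22, 38, 43], dtype=float)

# Prepend a column of ones so that b[0] plays the role of the intercept b0
X = np.column_stack([np.ones(len(x)), x])

# Solve the normal equations (X^T X) b = X^T y for the r + 1 coefficients
b = np.linalg.solve(X.T @ X, X.T @ y)

print(f"number of coefficients: {b.size}")
print(f"b0 = {b[0]:.4f}, b1 = {b[1]:.4f}, b2 = {b[2]:.4f}")
```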

## Polynomial Regression

**Polynomial regression** is a form of linear regression where the relationship between the dependent and independent variables is modeled as a polynomial. It extends linear regression by allowing for non-linear relationships between the variables, while still keeping the model linear in the coefficients.

For example, in polynomial regression, the regression function may include terms like:

$$
f(x_1) = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_1^3
$$

Or even interaction terms such as:

$$
f(x_1, x_2) = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_1 x_2
$$

In this case, the model can capture more complex relationships by incorporating higher-degree terms or interactions between variables. While the model includes non-linear terms (such as $x_1^2$ or $x_1 x_2$), it is still considered a form of **linear regression** because the coefficients ($b_0, b_1, b_2, \dots$) appear linearly.
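
Because the model is linear in the coefficients, a polynomial fit can be sketched as ordinary linear regression on expanded features. The example below uses scikit-learn's `PolynomialFeatures` to add an $x^2$ column. The data values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data with a clearly non-linear trend (made-up values)
x = np.array([5, 15, 25, 35, 45, 55]).reshape(-1, 1)
y = np.array([15, 11, 2, 8, 25, 32])

# Expand x into the columns [x, x^2]; the intercept is left to LinearRegression
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

# Fitting the expanded features is still ordinary least squares
quadratic = LinearRegression().fit(x_poly, y)
linear = LinearRegression().fit(x, y)

print(f"coefficients b1, b2: {quadratic.coef_}")
print(f"R^2 (degree 2): {quadratic.score(x_poly, y)}")
print(f"R^2 (degree 1): {linear.score(x, y)}")
```

On the training data, the degree-2 fit can never score below the straight-line fit, because the straight line is the special case $b_2 = 0$.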


### Underfitting and Overfitting

When implementing polynomial regression, a critical question arises: how do you choose the optimal degree for the polynomial regression function?

Unfortunately, there is no universal rule for selecting the best degree; it largely depends on the specific problem and dataset. However, it's important to understand two common issues that can arise from selecting the wrong degree: **underfitting** and **overfitting**.

### Underfitting

**Underfitting** occurs when the model is too simple to capture the underlying patterns in the data. As a result, it fails to accurately represent the relationships between the variables. This often leads to a low $R^2$ score, indicating that the model explains very little of the variance in the data. Additionally, an underfit model will perform poorly when applied to both the training data and new, unseen data.

Underfitting typically happens when the polynomial degree is too low, and the model lacks the complexity needed to capture the true behavior of the data.

### Overfitting

**Overfitting**, on the other hand, happens when the model becomes too complex and starts to learn not only the underlying data relationships but also random noise and fluctuations in the training data. This can result in a model that fits the training data almost perfectly, often yielding a very high $R^2$ score on the known data.

However, the problem with overfitting is that the model doesn't generalize well to new, unseen data. While it performs excellently on the training data, its performance drops significantly when applied to new datasets, often resulting in a much lower $R^2$ score. Overfitting is common when the polynomial degree is too high, or when the model includes too many features or terms.
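
The effect can be sketched by comparing $R^2$ on training data with $R^2$ on held-out data for several polynomial degrees. Everything about the data here is an assumption made for illustration: the quadratic ground truth, the noise level, the sample sizes, and the helper name `make_data`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data: a quadratic trend plus Gaussian noise (assumed setup)."""
    x = rng.uniform(-1, 1, size=(n, 1))
    y = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 0] ** 2 + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # unseen data from the same process

scores = {}
for degree in (1, 2, 10):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression()
    )
    model.fit(x_train, y_train)
    scores[degree] = (model.score(x_train, y_train), model.score(x_test, y_test))
    print(f"degree {degree:2d}: train R^2 = {scores[degree][0]:.3f}, "
          f"test R^2 = {scores[degree][1]:.3f}")
```

The training score can only rise as the degree grows. The test score is the one to watch: a gap between the two is the typical signature of overfitting, while low scores on both suggest underfitting.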

### Finding the Right Balance

To avoid both underfitting and overfitting, it's essential to find the right balance between model simplicity and complexity. Techniques like **cross-validation** and **regularization** can help in determining the optimal degree of the polynomial, ensuring that the model generalizes well to new data while accurately capturing the underlying relationships in the training data.
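
As one possible approach, the degree can be chosen by k-fold cross-validation with scikit-learn's `cross_val_score`. The synthetic data (a quadratic trend plus noise), the candidate degrees 1 to 6, and the choice of 5 folds are all assumptions for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumed setup): quadratic trend plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(60, 1))
y = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 0] ** 2 + rng.normal(scale=0.3, size=60)

cv_scores = {}
for degree in range(1, 7):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression()
    )
    # Mean R^2 over 5 folds; each fold is scored on data the model did not see
    cv_scores[degree] = cross_val_score(model, x, y, cv=5, scoring="r2").mean()
    print(f"degree {degree}: mean CV R^2 = {cv_scores[degree]:.3f}")

best_degree = max(cv_scores, key=cv_scores.get)
print(f"selected degree: {best_degree}")
```

Regularization is a complementary option: replacing `LinearRegression` with a penalized estimator such as `Ridge` in the same pipeline shrinks the coefficients of high-degree terms instead of removing them.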



## Regression with Scikit-Learn

Let’s begin with the simplest scenario: **simple linear regression**. There are five key steps involved in implementing a linear regression model:

1. **Import necessary packages and libraries**: Start by loading the tools and libraries you’ll need for the task.
2. **Prepare and process your data**: Input the dataset you’ll be working with and apply any necessary transformations or preprocessing steps.
3. **Build and train the regression model**: Create your linear regression model and train it using the available data.
4. **Evaluate the model’s performance**: Check the model’s results to assess if it fits the data well and meets your expectations.
5. **Make predictions with the model**: Once satisfied with the model, use it to make predictions on new data.

These steps generally apply to most regression models, regardless of the specific approach or technique. As you proceed through this guide, you’ll see how to follow these steps for various types of regression scenarios.


```python
# Step 1: import packages
import numpy as np
from sklearn.linear_model import LinearRegression

# Step 2: provide data; x must be two-dimensional (one column per feature)
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

# Step 3: create the model and fit it
model = LinearRegression()
model.fit(x, y)

# Step 4: inspect the results
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
print(f"intercept: {model.intercept_}")
print(f"slope: {model.coef_}")
```

```
coefficient of determination: 0.7158756137479542
intercept: 5.633333333333329
slope: [0.54]
```


### STEP 5. Prediction

```python
y_pred = model.predict(x)
y_pred
```

```
array([ 8.33333333, 13.73333333, 19.13333333, 24.53333333, 29.93333333,
       35.33333333])
```
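
The same call works for inputs the model has not seen. As a short sketch, the block below refits the model so that it runs on its own, predicts for new values, and checks that `.predict()` matches the regression function evaluated by hand. The names `x_new`, `y_new`, and `manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)

# New inputs must have the same two-dimensional shape as the training x
x_new = np.arange(5).reshape((-1, 1))    # 0, 1, 2, 3, 4
y_new = model.predict(x_new)

# Equivalent manual computation: b0 + b1 * x
manual = model.intercept_ + model.coef_[0] * x_new[:, 0]

print(y_new)                      # [5.63333333 6.17333333 6.71333333 7.25333333 7.79333333]
print(np.allclose(y_new, manual)) # True
```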

### Multiple Linear Regression With scikit-learn

The steps are the same as for simple linear regression. The main difference is that the `x` array now has two or more columns, one per independent variable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is one observation; each column is one independent variable
x = [
  [0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]
]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x, y = np.array(x), np.array(y)

# Create and fit the model in a single statement
model = LinearRegression().fit(x, y)

r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
print(f"intercept: {model.intercept_}")
print(f"coefficients: {model.coef_}")
```

```
coefficient of determination: 0.8615939258756776
intercept: 5.52257927519819
coefficients: [0.44706965 0.25502548]
```
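
Prediction works as in the single-variable case. The sketch below refits the model so that it runs on its own, then confirms that `.predict()` agrees with evaluating $b_0 + b_1 x_1 + b_2 x_2$ directly. The new observation `x_new` is a made-up input for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([
    [0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]
])
y = np.array([4, 5, 20, 14, 32, 22, 38, 43])
model = LinearRegression().fit(x, y)

# Predictions for the training inputs
y_pred = model.predict(x)

# The same values computed by hand: intercept plus a weighted sum of the columns
manual = model.intercept_ + x @ model.coef_
print(np.allclose(y_pred, manual))   # True

# A new observation needs one value per column, in the same order (made-up input)
x_new = np.array([[30, 10]])
print(model.predict(x_new))
```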
