# Linear Regression: A Comprehensive Guide

Linear Regression is one of the most fundamental algorithms in machine learning, used for predicting continuous values based on input features. It assumes a linear relationship between the dependent variable (target) and one or more independent variables (features).

## 1. Simple Linear Regression (Single Feature)

In **simple linear regression**, we predict a continuous output variable \( y \) based on a single feature \( x \).

### Equation for Simple Linear Regression:
$$
y = mx + b
$$
Where:
- \( y \) is the predicted value (output).
- \( x \) is the input feature (independent variable).
- \( m \) is the **slope** (coefficient), representing the change in \( y \) for a one-unit change in \( x \).
- \( b \) is the **intercept**, the value of \( y \) when \( x = 0 \).

### Goal:
The goal is to find the values of \( m \) (slope) and \( b \) (intercept) that minimize the sum of squared differences between the predicted and actual values of \( y \). This criterion is known as ordinary least squares.

#### Steps to Fit the Line:
1. **Collect Data**: Dataset consisting of pairs \( (x, y) \).
2. **Calculate the coefficients \( m \) and \( b \)**:
   - **Slope (m)**:
   $$
   m = \frac{n \sum{xy} - \sum{x} \sum{y}}{n \sum{x^2} - (\sum{x})^2}
   $$
   - **Intercept (b)**:
   $$
   b = \frac{\sum{y} - m \sum{x}}{n}
   $$
   Where \( n \) is the number of data points.
3. **Make Predictions**: Use the calculated coefficients to predict \( y \) for any given \( x \).
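
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production routine: the function name `fit_simple_linear_regression` and the toy data are assumptions made for the example, and the data were chosen to lie exactly on \( y = 2x + 1 \) so the result is easy to check by hand.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (m, b) computed from the closed-form least-squares sums."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sum_x, sum_y = x.sum(), y.sum()
    sum_xy = (x * y).sum()
    sum_x2 = (x ** 2).sum()

    # Denominator of the slope formula; zero when every x is identical
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n
    return m, b

# Toy data (hypothetical) lying exactly on y = 2x + 1
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

m, b = fit_simple_linear_regression(x, y)
print(f"slope m = {m:.3f}, intercept b = {b:.3f}")  # slope m = 2.000, intercept b = 1.000

# Step 3: predict y for a new x
prediction = m * 6 + b
```

For real data the points will not fall exactly on a line, and the same formulas return the slope and intercept of the best-fitting line in the least-squares sense.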

---

## 2. Multiple Linear Regression (Multiple Features)

**Multiple linear regression** extends simple linear regression to handle multiple input features.

### Equation for Multiple Linear Regression:
$$
y = b + m_1x_1 + m_2x_2 + \dots + m_nx_n
$$
Where:
- \( y \) is the predicted value.
- \( x_1, x_2, \dots, x_n \) are the independent features.
- \( m_1, m_2, \dots, m_n \) are the coefficients (slopes) for each feature.
- \( b \) is the intercept.
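
To make the equation concrete, the sketch below evaluates it for a single observation with three features. The coefficient and feature values are made up purely for illustration; the point is that the sum \( m_1x_1 + \dots + m_nx_n \) is a dot product.

```python
import numpy as np

# Hypothetical coefficients for a model with three features
b = 1.0                          # intercept
m = np.array([2.0, -0.5, 3.0])   # slopes m_1, m_2, m_3

# One observation with feature values x_1, x_2, x_3
x = np.array([4.0, 2.0, 1.0])

# y = b + m_1*x_1 + m_2*x_2 + m_3*x_3
y_hat = b + np.dot(m, x)
print(y_hat)  # 11.0
```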

### Goal:
The goal is to find the coefficients \( m_1, m_2, \dots, m_n \) and the intercept \( b \) that minimize the error between predicted and actual \( y \) values.

#### Steps to Fit the Model:
1. **Matrix Formulation**:
   $$
   Y = X \beta + \epsilon
   $$
   Where:
   - \( Y \) is the vector of observed values.
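   - \( X \) is the matrix of input features, with a leading column of ones so that the intercept is estimated along with the slopes.
   - \( \beta \) is the vector of coefficients \( (b, m_1, \dots, m_n) \).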
   - \( \epsilon \) is the error term.

2. **Normal Equation**:
   The optimal coefficients \( \beta \) can be calculated using the normal equation:
   $$
   \beta = (X^T X)^{-1} X^T Y
   $$
   Where \( X^T \) is the transpose of the feature matrix and \( (X^T X)^{-1} \) is the inverse of \( X^T X \), which exists only when the columns of \( X \) are linearly independent.

3. **Prediction**: 
   After calculating \( \beta \), we make predictions using:
   $$
   \hat{Y} = X \beta
   $$
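
The matrix steps above might look like this in NumPy. It is a sketch under stated assumptions: the synthetic data, the `true_beta` values, and the noise level are invented for the example, and a column of ones is prepended to \( X \) so that the first entry of \( \beta \) plays the role of the intercept. The code calls `np.linalg.solve` on the system \( (X^T X)\beta = X^T Y \) instead of forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

# Synthetic data (assumed for illustration): two features, known coefficients
rng = np.random.default_rng(0)
n_samples = 200
features = rng.normal(size=(n_samples, 2))
true_beta = np.array([4.0, 2.0, -3.0])  # [intercept, m_1, m_2]

# Step 1: build X with a leading column of ones for the intercept
X = np.column_stack([np.ones(n_samples), features])
Y = X @ true_beta + rng.normal(scale=0.1, size=n_samples)

# Step 2: normal equation, solved as a linear system (X^T X) beta = X^T Y
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# Step 3: predictions
Y_hat = X @ beta

print("estimated beta:", np.round(beta, 2))
```

When the columns of \( X \) are nearly collinear, `np.linalg.lstsq(X, Y, rcond=None)` is a more robust alternative that returns the same least-squares solution.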

---

## 3. Cost Function (Mean Squared Error)

The **cost function** measures how well the model fits the data. The most common cost function is **Mean Squared Error (MSE)**, which computes the average squared difference between actual and predicted values.

### Formula for MSE:
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
Where:
- \( n \) is the number of data points.
- \( y_i \) is the actual value.
- \( \hat{y}_i \) is the predicted value.

### Goal:
Minimize the MSE to improve the model’s performance.
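
The MSE formula translates directly into code. The helper name `compute_mse` and the four sample values below are assumptions chosen so the arithmetic can be followed by hand: the errors are 0.5, 0, -1, and -0.5, their squares sum to 1.5, and dividing by \( n = 4 \) gives 0.375.

```python
import numpy as np

def compute_mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

mse = compute_mse(y_true, y_pred)
print(mse)  # 0.375
```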

---

## 4. Model Evaluation

After fitting the linear regression model, we use several metrics to evaluate its performance:

### Key Evaluation Metrics:
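- **R-squared (R²)**: Measures the proportion of variance in the dependent variable that is predictable from the independent variables. On the training data of a least-squares fit with an intercept, \( R^2 \) lies between 0 and 1, with 1 indicating a perfect fit. On held-out data it can be negative when the model predicts worse than always using the mean \( \bar{y} \).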
  
  **Formula for \( R^2 \)**:
  $$
  R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}
  $$
  Where \( \bar{y} \) is the mean of the actual values.

- **Root Mean Squared Error (RMSE)**: The square root of the MSE. It expresses the typical size of the prediction error in the same units as the target variable.
  
  **Formula for RMSE**:
  $$
  RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
  $$

### Goal:
- High **R²** values and low **MSE** or **RMSE** indicate a good model fit.
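
Both metrics can be computed by hand from their formulas, as in the sketch below. The function names `r_squared` and `rmse` and the sample values are illustrative assumptions; in practice `sklearn.metrics.r2_score` and the square root of `sklearn.metrics.mean_squared_error` give the same numbers.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

r2_value = r_squared(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(f"R^2  = {r2_value:.3f}")    # R^2  = 0.925
print(f"RMSE = {rmse_value:.3f}")  # RMSE = 0.612
```

Here the residual sum of squares is 1.5 and the total sum of squares is 20, so \( R^2 = 1 - 1.5/20 = 0.925 \).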

---

## 5. Assumptions of Linear Regression

Linear regression makes several assumptions about the data:
1. **Linearity**: There is a linear relationship between the dependent and independent variables.
2. **Independence**: Observations are independent of each other.
3. **Homoscedasticity**: The variance of errors is constant across all levels of the independent variable(s).
4. **Normality**: The residuals (errors) are normally distributed.

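Violations of these assumptions affect the model in different ways. A non-linear relationship can bias predictions. Dependent observations or non-constant error variance make the least-squares estimates less efficient and their standard errors unreliable. Non-normal residuals mainly affect confidence intervals and hypothesis tests in small samples.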
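
A quick, informal way to probe some of these assumptions is to inspect the residuals after fitting. The sketch below is only a rough diagnostic on synthetic data, not a formal statistical test: the data-generating values are assumptions, and the "spread ratio" is an ad hoc check invented for this example. A ratio far from 1 would hint that the error variance changes with \( x \).

```python
import numpy as np

# Synthetic data (assumed): linear signal with constant-variance noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=300)

# Fit a straight line by least squares; polyfit returns [slope, intercept]
m, b = np.polyfit(x, y, deg=1)
residuals = y - (m * x + b)

# With an intercept in the model, least-squares residuals average to about zero
mean_residual = residuals.mean()

# Crude homoscedasticity check: compare residual spread for small vs. large x
order = np.argsort(x)
low_half = residuals[order[:150]]
high_half = residuals[order[150:]]
spread_ratio = high_half.std() / low_half.std()

print(f"mean residual: {mean_residual:.2e}")
print(f"spread ratio (high x / low x): {spread_ratio:.2f}")
```

Plotting the residuals against \( x \) or against the predictions, and looking for curvature or a funnel shape, is a common visual complement to these numbers.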

---

## 6. Example Code: Simple Linear Regression in Python (Using Scikit-learn)

Here's an example of how to implement **Simple Linear Regression** using **Scikit-learn** in Python:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10  # Random input feature between 0 and 10
y = 2.5 * X + np.random.randn(100, 1) * 2  # Linear relation with some noise

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Plot the results
plt.scatter(X_test, y_test, color='blue', label='Actual data')
plt.plot(X_test, y_pred, color='red', label='Fitted line')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
```
