<img src="../images/cover.jpg" alt="" width="1920"/>

# Linear Regression

Linear Regression is a supervised learning algorithm that models the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation to the observed data.

<img src="../images/linear_regression_graph.png" width="1000"/>

## Linear Regression implementation using Gradient Descent 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)

In [None]:
dataset = pd.read_csv("data/salary_data.csv")
dataset.head()

In [None]:
X = dataset.iloc[:, :-1].values  # Features: every column except the last, as a 2D array
y = dataset.iloc[:, 1].values  # Target: the second column, as a 1D array

In [None]:
X.shape

In [None]:
plt.scatter(X, y)

`learning_rate = 0.01`
This is the **learning rate $ \alpha $**, which determines the size of the steps taken in the gradient descent algorithm. With a learning rate of 0.01, each update changes the parameters by 0.01 times the gradient, which keeps individual steps small and the updates controlled.

`n_features = 1`
This specifies the number of **features** (input variables) in the dataset.

In [None]:
learning_rate = 0.01
n_features = 1

`weights = np.zeros(n_features)`
This initializes the **weights** (coefficients) for the features to zeros. Since `n_features = 1`, this creates a 1D array with a single element equal to 0, i.e. `array([0.])`.

`bias = 0`
This initializes the **bias term** $ b $ to 0. The bias is the intercept of the regression line: the predicted value when all feature values are zero.

In [None]:
weights = np.zeros(n_features)
bias = 0

weights, bias

`np.dot(X, weights)`
  - Computes the **dot product** between the feature matrix $ X $ and the weights $ w $.
  - If $ X $ has dimensions $ (m, n) $ (where $ m $ is the number of samples and $ n $ is the number of features) and `weights` has dimensions $ (n,) $, the result is a vector of size $ (m,) $ holding one predicted value per sample.

`+ bias`
  - Adds the **bias** term $ b $ to each predicted value. Since `bias` is a scalar, it is broadcast across all predictions.

This corresponds to the linear regression equation:
$$
\hat{y}_i = \sum_{j=1}^n w_j x_{ij} + b
$$
for each sample $ i $ in the dataset, where:
- $ \hat{y}_i $: Predicted value for the $ i $-th sample
- $ x_{ij} $: Value of the $ j $-th feature for the $ i $-th sample
- $ w_j $: Weight for the $ j $-th feature
- $ b $: Bias term

In vectorized form for all $ m $ samples:
$$
\hat{y} = X \cdot w + b
$$
where:
- $ X $: Feature matrix of size $ (m, n) $
- $ w $: Weight vector of size $ (n,) $
- $ b $: Scalar bias (added element-wise to the result of $ X \cdot w $)

This line computes the predicted outputs $ \hat{y} $, which will later be compared with the actual values $ y $ to calculate the error and gradients during training.
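
To make the shapes concrete, here is a minimal sketch on made-up toy data (the names `X_toy`, `w_toy`, and `b_toy` are illustrative and not part of the salary dataset). It checks that the vectorized expression matches the per-sample sum from the equation above.

```python
import numpy as np

# Hypothetical toy data: m = 3 samples, n = 2 features
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
w_toy = np.array([0.5, -1.0])  # weight vector, shape (n,)
b_toy = 2.0                    # scalar bias

# Vectorized prediction: (m, n) dot (n,) -> (m,), then the scalar bias is broadcast
y_hat_vectorized = np.dot(X_toy, w_toy) + b_toy

# The same computation written as the explicit sum over features for each sample
y_hat_loop = np.array([
    sum(w_toy[j] * X_toy[i, j] for j in range(X_toy.shape[1])) + b_toy
    for i in range(X_toy.shape[0])
])

print(y_hat_vectorized.shape)  # (3,)
print(y_hat_vectorized)        # values 0.5, -0.5, -1.5
print(np.allclose(y_hat_vectorized, y_hat_loop))
```

Both forms give the same numbers; the vectorized one avoids Python-level loops, which matters as $ m $ grows.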

In [None]:
y_pred = np.dot(X, weights) + bias
y_pred

In [None]:
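# Plot the data and the initial predictions (all zeros before training)
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", label="Actual data")
plt.plot(X, y_pred, color="red", label="Predictions")
plt.legend()  # Show the labels defined above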

`m = len(y)`
- $ m $: Number of samples in the dataset.
- This is used to normalize the gradient by dividing it by the total number of samples, ensuring that the updates to the weights and bias are appropriately scaled.

In [None]:
m = len(y)
m

`dw = -(2 / m) * np.dot(X.T, (y - y_pred))`
- Computes the gradient of the cost function with respect to the weights ($ w $).
  
**Mathematical Derivation**
1. **Cost Function** (Mean Squared Error):
   $$
   J(w, b) = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i)^2
   $$
   where $ \hat{y}_i = \sum_{j=1}^n w_j x_{ij} + b $.

2. **Gradient of $J(w, b)$ with respect to $ w_j $**:
   $$
   \frac{\partial J}{\partial w_j} = \frac{2}{m} \sum_{i=1}^m (\hat{y}_i - y_i) x_{ij}
   $$

3. **Vectorized Form**:
   The gradient for all weights ($ w $) can be written as:
   $$
   dw = \frac{2}{m} X^T (\hat{y} - y)
   $$
   - $ X^T $: Transpose of the feature matrix $ X $, size $ (n, m) $.
   - $ (\hat{y} - y) $: Error vector, size $ (m,) $.
   - $ dw $: Gradient vector for weights, size $ (n,) $.

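4. **Negative Sign**:
   The code writes the error as $ (y - \hat{y}) $ rather than $ (\hat{y} - y) $. Swapping the order flips the sign, so a leading minus sign is needed to keep the same gradient:
   $$
   dw = -\frac{2}{m} X^T (y - \hat{y}) = \frac{2}{m} X^T (\hat{y} - y)
   $$
   The minimization itself happens in the update step, where this gradient is subtracted from the weights.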

**In Code**:
- `X.T`: Transpose of the feature matrix $ X $.
- `y - y_pred`: Difference (error) between actual values and predictions.

$ dw $: Measures how much the cost function changes with respect to each weight. It considers the error ($ y - y_{pred} $) scaled by the corresponding feature values ($ X $).
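
As a quick check of the derivation, the sketch below computes the weight gradient twice on made-up toy data (the `_toy` names are illustrative assumptions): once with the per-weight sum from step 2 and once with the vectorized expression used in the code.

```python
import numpy as np

# Hypothetical toy data (not the salary dataset): m = 4 samples, n = 2 features
X_toy = np.array([[1.0, 0.5],
                  [2.0, 1.5],
                  [3.0, 2.5],
                  [4.0, 3.5]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])
w_toy = np.zeros(2)
b_toy = 0.0

m = len(y_toy)
y_hat = np.dot(X_toy, w_toy) + b_toy

# Per-weight sum from step 2: (2/m) * sum_i (y_hat_i - y_i) * x_ij
dw_loop = np.array([
    (2 / m) * sum((y_hat[i] - y_toy[i]) * X_toy[i, j] for i in range(m))
    for j in range(X_toy.shape[1])
])

# Vectorized expression with the error written as (y - y_hat) and a leading minus
dw_vectorized = -(2 / m) * np.dot(X_toy.T, (y_toy - y_hat))

print(dw_loop)        # values -30, -25
print(dw_vectorized)  # same values, shape (2,)
```

The two results agree, which shows that the leading minus sign only compensates for the swapped order of the subtraction.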

In [None]:
dw = -(2 / m) * np.dot(X.T, (y - y_pred))
dw

`db = -(2 / m) * np.sum(y - y_pred)`
- Computes the gradient of the cost function with respect to the bias ($ b $).

**Mathematical Derivation**
1. **Gradient of $ J(w, b) $ with respect to $ b $**:
   $$
   \frac{\partial J}{\partial b} = \frac{2}{m} \sum_{i=1}^m (\hat{y}_i - y_i)
   $$
   Written with the error as $ (y_i - \hat{y}_i) $, as in the code:
   $$
   db = -\frac{2}{m} \sum_{i=1}^m (y_i - \hat{y}_i)
   $$

**In Code**:
- `np.sum(y - y_pred)`: Computes the sum of errors across all samples.


$ db $: Measures how much the cost function changes with respect to the bias. It considers the total error ($ y - y_{pred} $) across all samples.
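
One way to gain confidence in both gradient formulas is a finite-difference check. The sketch below is an illustration on synthetic data, not part of the original workflow: the helper `mse`, the step `eps`, and the `_toy` values are assumptions chosen for the example. It compares the analytic `dw` and `db` with numerical slopes of the cost function.

```python
import numpy as np

def mse(X, y, w, b):
    """Mean squared error J(w, b) of a linear model."""
    return np.mean((np.dot(X, w) + b - y) ** 2)

# Synthetic data for illustration: one feature, roughly y = 3x + 1 plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 1))
y_toy = 3.0 * X_toy[:, 0] + 1.0 + rng.normal(scale=0.1, size=20)
w_toy = np.array([0.5])
b_toy = -0.2

# Analytic gradients, written exactly as in the notebook
m = len(y_toy)
y_hat = np.dot(X_toy, w_toy) + b_toy
dw = -(2 / m) * np.dot(X_toy.T, (y_toy - y_hat))
db = -(2 / m) * np.sum(y_toy - y_hat)

# Central finite differences: nudge one parameter and measure the change in cost
eps = 1e-6
dw_numeric = (mse(X_toy, y_toy, w_toy + eps, b_toy)
              - mse(X_toy, y_toy, w_toy - eps, b_toy)) / (2 * eps)
db_numeric = (mse(X_toy, y_toy, w_toy, b_toy + eps)
              - mse(X_toy, y_toy, w_toy, b_toy - eps)) / (2 * eps)

print(f"dw analytic: {dw[0]:.6f}, numeric: {dw_numeric:.6f}")
print(f"db analytic: {db:.6f}, numeric: {db_numeric:.6f}")
```

If the analytic and numeric values disagree beyond rounding error, the gradient code most likely has a sign or scaling mistake.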

In [None]:
db = -(2 / m) * np.sum(y - y_pred)
db

`weights = weights - learning_rate * dw`
- Updates the weights ($ w $) by subtracting the product of the learning rate ($ \alpha $) and the gradient ($ dw $).
  
**Mathematical Context**:
1. **Gradient Descent Formula for Weights**:
   $$
   w_j := w_j - \alpha \frac{\partial J}{\partial w_j}
   $$
   - $ w_j $: Current weight for feature $ j $.
   - $ \frac{\partial J}{\partial w_j} $: Gradient of the cost function with respect to $ w_j $ ($ dw $).
   - $ \alpha $: Learning rate, which determines the step size.

2. **Vectorized Form**:
   For all weights:
   $$
   w := w - \alpha dw
   $$
   where $ dw $ is the gradient vector for all features.

3. **Interpretation**:
   - The weights are adjusted in the opposite direction of the gradient.
   - With a suitably small learning rate, each update lowers the cost function $ J(w, b) $, so successive iterations move toward its minimum.


In [None]:
weights = weights - learning_rate * dw
weights

`bias = bias - learning_rate * db`
- Updates the bias ($ b $) similarly to the weights.

**Mathematical Context**:
1. **Gradient Descent Formula for Bias**:
   $$
   b := b - \alpha \frac{\partial J}{\partial b}
   $$
   - $ b $: Current bias.
   - $ \frac{\partial J}{\partial b} $: Gradient of the cost function with respect to $ b $ ($ db $).
   - $ \alpha $: Learning rate.

2. **Interpretation**:
   - The bias shifts the whole regression line up or down. If predictions are too low on average ($ db < 0 $), the update raises $ b $; if they are too high, it lowers $ b $.


In [None]:
bias = bias - learning_rate * db
bias

### **How It Works in Gradient Descent**
1. The gradients $ dw $ and $ db $ point in the direction of steepest ascent of the cost function.
2. Subtracting $ \alpha dw $ and $ \alpha db $ moves the parameters in the opposite direction, toward the minimum of the cost function.
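
The two points above can be illustrated with a single step on made-up toy data (the `_toy` arrays, the helper `mse`, and `alpha = 0.01` are assumptions for this sketch): stepping against the gradient lowers the cost, while stepping along it raises the cost.

```python
import numpy as np

def mse(X, y, w, b):
    """Mean squared error J(w, b) of a linear model."""
    return np.mean((np.dot(X, w) + b - y) ** 2)

# Hypothetical toy data generated from y = 2x + 1
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 2.0 * X_toy[:, 0] + 1.0

w, b = np.zeros(1), 0.0
alpha = 0.01
m = len(y_toy)

# Gradients at the starting point
y_hat = np.dot(X_toy, w) + b
dw = -(2 / m) * np.dot(X_toy.T, (y_toy - y_hat))
db = -(2 / m) * np.sum(y_toy - y_hat)

loss_before = mse(X_toy, y_toy, w, b)
loss_descent = mse(X_toy, y_toy, w - alpha * dw, b - alpha * db)  # step against the gradient
loss_ascent = mse(X_toy, y_toy, w + alpha * dw, b + alpha * db)   # step along the gradient

print(f"before: {loss_before:.3f}, descent step: {loss_descent:.3f}, ascent step: {loss_ascent:.3f}")
```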

### **Effect of the Learning Rate**
- $ \alpha $: Controls the size of each update.
  - If $ \alpha $ is too small, convergence is slow.
  - If $ \alpha $ is too large, the updates might overshoot, and the algorithm may fail to converge.
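
The effect can be seen in a small experiment. This is a sketch on made-up toy data, with a hypothetical helper `run_gradient_descent` and learning rates picked for this data only; the thresholds will differ for other datasets.

```python
import numpy as np

def run_gradient_descent(X, y, learning_rate, n_steps):
    """Run n_steps of batch gradient descent from zeros and return the final MSE."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_steps):
        y_hat = np.dot(X, w) + b
        dw = -(2 / m) * np.dot(X.T, (y - y_hat))
        db = -(2 / m) * np.sum(y - y_hat)
        w = w - learning_rate * dw
        b = b - learning_rate * db
    return np.mean((np.dot(X, w) + b - y) ** 2)

# Hypothetical toy data generated from y = 2x + 1
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 2.0 * X_toy[:, 0] + 1.0

initial_loss = np.mean(y_toy ** 2)  # MSE when weights and bias are all zero
losses = {
    lr: run_gradient_descent(X_toy, y_toy, lr, n_steps=20)
    for lr in (0.001, 0.05, 0.5)
}

print(f"initial MSE = {initial_loss:.4g}")
for lr, loss in losses.items():
    print(f"learning_rate={lr}: MSE after 20 steps = {loss:.4g}")
```

On this data the smallest rate barely reduces the loss in 20 steps, the middle rate gets close to zero, and the largest rate ends with a loss far above the starting value because every step overshoots.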

### **Iterative Updates**
These updates are repeated over many iterations (epochs). With a suitable learning rate, the weights and bias converge toward the values that minimize the cost function, yielding the linear model that best fits the data in the least-squares sense.
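
A minimal sketch of the full loop, assuming made-up toy data and a learning rate chosen for it, shows the loss shrinking over the iterations. As a sanity check, the result is compared with the least-squares solution from `np.linalg.lstsq`, which is used here only as a reference and is not part of the gradient descent method.

```python
import numpy as np

# Hypothetical toy data generated from y = 2x + 1 (not the salary dataset)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 2.0 * X_toy[:, 0] + 1.0

m, n = X_toy.shape
w, b = np.zeros(n), 0.0
learning_rate = 0.05  # chosen for this toy data; other data may need a different value
loss_history = []

for _ in range(1000):
    y_hat = np.dot(X_toy, w) + b
    loss_history.append(np.mean((y_hat - y_toy) ** 2))  # loss before this update
    dw = -(2 / m) * np.dot(X_toy.T, (y_toy - y_hat))
    db = -(2 / m) * np.sum(y_toy - y_hat)
    w = w - learning_rate * dw
    b = b - learning_rate * db

# Reference: least squares on [X, 1], so the last coefficient is the intercept
A = np.hstack([X_toy, np.ones((m, 1))])
solution, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

print(f"first loss: {loss_history[0]:.4f}, last loss: {loss_history[-1]:.3e}")
print(f"gradient descent: w={w}, b={b:.6f}")
print(f"least squares:    w={solution[:-1]}, b={solution[-1]:.6f}")
```

Recording `loss_history` is a cheap way to confirm that training is making progress; a loss that grows from one iteration to the next usually points to a learning rate that is too large.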


In [None]:
dataset = pd.read_csv("data/salary_data.csv")

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

for x_, y_ in zip(X[:5], y[:5]):
    print(f"{x_} -> {y_}")

In [None]:
learning_rate = 0.01
n_features = 1

weights = np.zeros(n_features)
bias = 0

weights, bias

In [None]:
y_pred = np.dot(X, weights) + bias
y_pred

In [None]:
# Plot the data, the prediction line, and the individual predicted points
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", label="Actual data")
plt.plot(X, y_pred, color="red", label="Prediction line")
plt.scatter(X, y_pred, color="red", label="Predicted points")
plt.legend()

In [None]:
m = len(y)  # Number of samples, recomputed so this cell does not rely on earlier state
dw = -(2 / m) * np.dot(X.T, (y - y_pred))
db = -(2 / m) * np.sum(y - y_pred)

dw, db

In [None]:
weights = weights - learning_rate * dw
bias = bias - learning_rate * db

weights, bias

[Linear Regression Animation](https://youtu.be/1hGsKphwC-A?si=uJqmTN6K7LadFmKh)

# End-to-End LinearRegression from scratch

In [None]:
class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate  # Step size for gradient descent
        self.n_iterations = n_iterations  # Number of training iterations
        self.weights = None  # Model weights (initialized later)
        self.bias = None  # Model bias (initialized later)

    def _initialize_parameters(self, n_features):
        """Initialize weights and bias to zeros"""
        self.weights = np.zeros(n_features)  # Create array of zeros for weights
        self.bias = 0  # Initialize bias to zero

    def _compute_predictions(self, X):
        """Compute predictions using current weights and bias"""
        return np.dot(X, self.weights) + self.bias

    def _the_print(self, X, y, y_pred):
        """Debugging helper: print the inputs, targets, and predictions"""
        print(f"X: {X}")
        print(f"y: {y}")
        print(f"y_pred: {y_pred}")

    def _compute_gradients(self, X, y_true, y_pred):
        """Compute gradients for weights and bias"""
        m = len(y_true)  # Number of samples
        dw = -(2 / m) * np.dot(X.T, (y_true - y_pred))  # Gradient for weights
        db = -(2 / m) * np.sum(y_true - y_pred)  # Gradient for bias
        return dw, db

    def fit(self, X, y):
        """Train the model using gradient descent"""
        n_features = X.shape[1]  # Get number of features
        self._initialize_parameters(n_features)  # Initialize parameters

        for _ in range(self.n_iterations):
            y_pred = self._compute_predictions(X)  # Forward pass: compute predictions
            dw, db = self._compute_gradients(X, y, y_pred)  # Compute gradients

            # Update parameters using gradient descent
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        """Make predictions on new data"""
        return self._compute_predictions(X)

In [None]:
dataset = pd.read_csv("data/salary_data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

In [None]:
# Create and train the model
model = LinearRegression(learning_rate=0.01, n_iterations=1000)
model.fit(X, y)

In [None]:
# Make predictions
y_pred = model.predict(X)

In [None]:
# Plot results
# plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", label="Actual data")
plt.plot(X, y_pred, color="red", label="Predictions")
plt.title("Linear Regression Example")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
loss = np.mean((y - model.predict(X)) ** 2)
print(f"Loss: {loss} \nRoot Mean Squared Error: {np.sqrt(loss)}")

In [None]:
year_experience = 4.5
np.round(model.predict([[year_experience]]))