# 📘 Complete Linear Regression Notes

---

### 🔹 1. What is Linear Regression?

Linear Regression is a supervised learning algorithm that predicts a continuous output from one or more input variables.  
It fits a straight line (a linear relationship) by minimizing the squared differences between actual and predicted values, a method known as ordinary least squares.

---

### 🔹 2. Linear Regression Equation:

For Simple Linear Regression:

\[
y = a + bX + \varepsilon
\]

Where:  
- **y** is the dependent variable (output),  
- **a** is the intercept,  
- **b** is the slope coefficient (weight),  
- **X** is the independent variable (input),  
- **ε (epsilon)** is the error term (noise).
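
For a single input variable, the least-squares values of **a** and **b** have a closed form, which can be handy as a sanity check. The sketch below is only an illustration: the helper name `fit_simple_ols` and the toy data are assumptions, not part of these notes.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least-squares estimates (a, b) for y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    a = y_mean - b * x_mean
    return a, b


# Toy data generated from y = 1 + 2x with no noise
a, b = fit_simple_ols([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(a, b)  # 1.0 2.0
```

With noise-free data the estimates recover the generating line exactly; with noisy data they give the line with the smallest sum of squared residuals.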

---

### 🔹 3. Outcomes of Linear Regression:

- ✅ **Predictions (ŷ):**  
  The model gives predicted values based on the input features.

- ✅ **Coefficients (b):**  
  Represent how much the output changes per unit change in input.

- ✅ **Intercept (a):**  
  The value of output when all inputs are 0.

- ✅ **Residuals:**  
  The difference between actual and predicted values:  
  \[
  \text{Residual} = y - \hat{y}
  \]
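
As a small illustration of these outcomes, the sketch below computes predictions and residuals for a line whose intercept and slope are assumed rather than fitted. All numbers are made up for the example.

```python
a, b = 1.0, 2.0                  # hypothetical intercept and slope
xs = [1, 2, 3, 4]
ys = [3.5, 4.5, 7.0, 9.5]        # made-up observations

y_hat = [a + b * x for x in xs]                  # predictions
residuals = [y - p for y, p in zip(ys, y_hat)]   # actual minus predicted

print(y_hat)      # [3.0, 5.0, 7.0, 9.0]
print(residuals)  # [0.5, -0.5, 0.0, 0.5]
```

A positive residual means the model under-predicted that point; a negative one means it over-predicted.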

---

### 🔹 4. Evaluation Metrics:

These metrics tell you how good your model is.

- ✅ **(a) MAE – Mean Absolute Error**  
$$
\text{MAE} = \frac{1}{n} \sum |y_i - \hat{y}_i|
$$

  Measures the average magnitude of errors.  
  Less sensitive to outliers than MSE or RMSE.  
  Units are same as the target variable.

- ✅ **(b) MSE – Mean Squared Error**  
$$
MSE = \frac{1}{n} \sum (y_i - \hat{y}_i)^2
$$
  Penalizes larger errors more (because of squaring).  
  Not interpretable in real units (units are squared).

- ✅ **(c) RMSE – Root Mean Squared Error**  
$$
RMSE = \sqrt{\frac{1}{n} \sum (y_i - \hat{y}_i)^2}
$$

  Same units as the target variable.  
  Useful when large errors need to be penalized.

- ✅ **(d) R² – Coefficient of Determination**  
$$
R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}
$$

  Where:  
  - Numerator = Sum of Squared Errors (SSE)  
  - Denominator = Total Sum of Squares (SST)

📌 **Interpretation:**  
- \(R^2 = 1\) → Perfect model  
- \(R^2 = 0\) → No better than always predicting the mean  
- \(R^2 < 0\) → Worse than predicting the mean
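
The three cases can be checked numerically. This is a minimal sketch; the helper name `r2` and the prediction lists are assumptions chosen to hit each case.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    y_mean = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - sse / sst


y_true = [3, 5, 7, 9, 11]                 # mean is 7

perfect = r2(y_true, [3, 5, 7, 9, 11])    # predictions equal the targets
baseline = r2(y_true, [7, 7, 7, 7, 7])    # always predict the mean
worse = r2(y_true, [0, 0, 0, 0, 0])       # always predict 0

print(perfect, baseline, worse)  # 1.0 0.0 -6.125
```

The negative value arises because predicting 0 everywhere produces a larger squared error than the mean baseline does.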

---

### 🔹 5. MAE vs MSE vs RMSE – When to Use What?

| Metric | Use When | Sensitive to Outliers? | Units |
|--------|-----------|-----------------------|-------|
| MAE    | Simpler interpretation needed | ❌ Less so (errors count linearly) | Same as target |
| MSE    | You want to penalize large errors | ✅ Yes | Squared units |
| RMSE   | Want to penalize big errors but keep original units | ✅ Yes | Same as target |
| R²     | Want to know the fraction of variance explained | ✅ Yes (built from squared errors) | No units (ratio) |
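
To see the outlier behaviour in the table, the sketch below compares MAE and RMSE on two made-up prediction sets that differ in a single value. The helper names and data are assumptions for illustration only.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [10, 10, 10, 10, 10]
clean = [11, 9, 11, 9, 11]       # every prediction is off by 1
outlier = [11, 9, 11, 9, 20]     # the last prediction is off by 10

print(mae(y_true, clean), rmse(y_true, clean))                 # 1.0 1.0
print(mae(y_true, outlier), round(rmse(y_true, outlier), 2))   # 2.8 4.56
```

One large error raises MAE from 1.0 to 2.8 but raises RMSE from 1.0 to about 4.56, because squaring gives that single error far more weight.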

---




# **🧮 Linear Regression in Pure Python (No Libraries)**

This script implements Linear Regression using Gradient Descent, built using only core Python. It also calculates important metrics like MSE, MAE, and R² Score to evaluate the model.



### 📊 Data

```python
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
```


**x**: Input features (e.g., hours studied)  
**y**: Target labels (e.g., scores)

The data follows a linear trend:

**y = 2x + 1**


### ⚙️ Initialization

```python
w = 0.0  # weight (slope)
b = 0.0  # bias (intercept)
```


- Start with 0 values for both parameters  
- These will be learned through training


```python
learning_rate = 0.01
max_iterations = 10000
tolerance = 1e-6
n = len(x)
```


- **learning_rate**: How big each parameter update step is  
- **max_iterations**: Maximum number of training loops  
- **tolerance**: Training stops once the MSE changes by less than this amount between iterations  
- **n**: Number of training samples


### 🧮 Error Metrics

### 📌 MSE (Mean Squared Error)

```python
def mse(y_true, y_pred):
    # Use the length of the inputs instead of relying on the global n
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```


- Measures average squared difference between actual and predicted values  
- Penalizes large errors more


### 📌 MAE (Mean Absolute Error)

```python
def mae(y_true, y_pred):
    # Use the length of the inputs instead of relying on the global n
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```


- Measures average absolute difference

- Easier to interpret, less sensitive to outliers than MSE



### 📌 R² Score (Coefficient of Determination)

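```python
def r2_score(y_true, y_pred):
    y_mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)              # total sum of squares
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    # R² is undefined when y is constant (ss_tot == 0); return 0 in that case
    return 1 - (ss_res / ss_tot) if ss_tot != 0 else 0
```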


- Shows how much variation in **y** is explained by the model  
- Closer to **1** means better prediction


### 🔁 Gradient Descent

```python
prev_mse = float('inf')
iteration = 0
```


- We track `prev_mse` to check for convergence  
- Begin training with `iteration = 0`


### 🚀 Training Loop

```python
while iteration < max_iterations:
    # Step 1: Predict using current w and b
    y_pred = [w * x[i] + b for i in range(n)]

    # Step 2: Calculate gradients (dw and db)
    dw = 0
    db = 0
    for i in range(n):
        error = y[i] - y_pred[i]          # Difference between actual and predicted
        dw += -2 * x[i] * error           # Gradient of MSE w.r.t. w
        db += -2 * error                  # Gradient of MSE w.r.t. b
    dw /= n                               # Average gradient for weight
    db /= n                               # Average gradient for bias

    # Step 3: Update parameters
    w = w - learning_rate * dw            # Move weight against the gradient
    b = b - learning_rate * db            # Move bias against the gradient

    # Step 4: Calculate performance metrics
    # Note: y_pred was computed before the update, so these metrics describe
    # the previous w and b, not the values just assigned in Step 3.
    current_mse = mse(y, y_pred)          # Mean Squared Error
    current_mae = mae(y, y_pred)          # Mean Absolute Error
    current_r2 = r2_score(y, y_pred)      # R² Score

    # Step 5: Log progress every 500 iterations
    if iteration % 500 == 0:
        print(f"Iter {iteration}: w={w:.4f}, b={b:.4f}, MSE={current_mse:.6f}, MAE={current_mae:.6f}, R2={current_r2:.6f}")

    # Step 6: Check for convergence
    if abs(prev_mse - current_mse) < tolerance:
        print(f"Converged at iteration {iteration}")
        break

    prev_mse = current_mse                # Store current MSE to compare in next loop
    iteration += 1                        # Move to next iteration
```


```
Iter 0: w=0.5000, b=0.1400, MSE=57.000000, MAE=7.000000, R2=-6.125000
Iter 500: w=2.0210, b=0.9241, MSE=0.001056, MAE=0.027897, R2=0.999868
Converged at iteration 792
```


## 🧠 Explanation (Line by Line)

| Section                                      | What It Does                                                                 |
|---------------------------------------------|------------------------------------------------------------------------------|
| `while iteration < max_iterations:`         | Loop runs until either max iterations reached or early stopping is triggered. |
| `y_pred = [w * x[i] + b for i in range(n)]`  | Predicts output using current `w` and `b`.                                  |
| `dw, db`                                     | Initialize gradients to 0 before accumulation.                              |
| `error = y[i] - y_pred[i]`                   | Measures how far prediction is from actual value.                           |
| `dw += -2 * x[i] * error`                    | Gradient of MSE with respect to weight `w`.                                 |
| `db += -2 * error`                           | Gradient of MSE with respect to bias `b`.                                   |
| `dw /= n, db /= n`                           | Average gradient across all samples.                                        |
| `w = w - learning_rate * dw`                 | Update the weight by moving opposite the gradient.                          |
| `b = b - learning_rate * db`                 | Update the bias similarly.                                                  |
| `current_mse = mse(...)`                     | Calculate Mean Squared Error for tracking.                                  |
| `current_mae = mae(...)`                     | Calculate Mean Absolute Error.                                              |
| `current_r2 = r2_score(...)`                 | R² measures how well the model explains the variance.                       |
| `if iteration % 500 == 0:`                   | Print progress every 500 iterations.                                        |
| `if abs(prev_mse - current_mse) < tolerance:`| Stop if error improvement is too small.                                     |
| `prev_mse = current_mse`                     | Prepare for next loop comparison.                                           |
| `iteration += 1`                             | Count the iteration.                                                        |

---

## 🏁 Summary

**Goal:** Minimize MSE using Gradient Descent

**Update rule:**

$$
w := w - \alpha \frac{\partial \text{Loss}}{\partial w}
$$

$$
b := b - \alpha \frac{\partial \text{Loss}}{\partial b}
$$
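
For reference, the partial derivatives in this rule can be written out for the MSE loss. This is a standard derivation, using $\hat{y}_i = w x_i + b$ and the learning rate $\alpha$ (`learning_rate` in the code):

$$
\text{Loss}(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w x_i + b)\right)^2
$$

$$
\frac{\partial \text{Loss}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i \,(y_i - \hat{y}_i)
\qquad
\frac{\partial \text{Loss}}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)
$$

These match the training loop term by term: each sample adds `-2 * x[i] * error` to `dw` and `-2 * error` to `db`, and both sums are then divided by `n`.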




- **Early Stopping** ensures we don’t keep training when improvement is negligible.
- **Metrics** (`MSE`, `MAE`, `R²`) help monitor the learning progress.


### ✅ Final Model Output

```python
print("\nFinal Model:")
print(f"y = {w:.4f} * x + {b:.4f}")
print(f"Final MSE: {current_mse:.6f}")
print(f"Final MAE: {current_mae:.6f}")
print(f"Final R2 Score: {current_r2:.6f}")
```



```
Final Model:
y = 2.0078 * x + 0.9718
Final MSE: 0.000146
Final MAE: 0.010377
Final R2 Score: 0.999982
```


## 📄 Explanation:

| Line of Code                                    | Description                                                                                               |
|------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| `print("\nFinal Model:")`                        | Adds a line break and labels the final output section.                                                    |
| `print(f"y = {w:.4f} * x + {b:.4f}")`           | Prints the final learned linear regression equation with weight (`w`) and bias (`b`) rounded to 4 decimals. |
| `print(f"Final MSE: {current_mse:.6f}")`         | Displays the final Mean Squared Error — a measure of how far predictions are from actual values (lower is better). |
| `print(f"Final MAE: {current_mae:.6f}")`         | Displays the final Mean Absolute Error — shows the average magnitude of errors (no direction).             |
| `print(f"Final R2 Score: {current_r2:.6f}")`     | Shows the final R² Score — closer to 1 means a better fit.                                                |


# 📘 Complete Logistic Regression Notes

---

### 🔹 1. What is Logistic Regression?

Logistic Regression is a supervised learning algorithm used for **binary classification** tasks, where the goal is to predict one of two possible discrete outcomes (e.g., yes/no, pass/fail).  
Instead of predicting continuous values, it estimates the **probability** that a given input belongs to a certain class.

---

### 🔹 2. Logistic Regression Equation:

The model calculates a weighted sum of the inputs and passes it through a **sigmoid function** to map the result into a probability between 0 and 1.

$$
z = wX + b
$$

$$
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Where:

* **X** is the input feature vector,  
* **w** is the weight vector (coefficients),  
* **b** is the bias (intercept),  
* **z** is the linear combination of inputs and weights,  
* **σ (sigmoid)** squashes any real number to a value between 0 and 1, representing probability,  
* **$\hat{y}$** is the predicted probability of the positive class.

---

### 🔹 3. How Predictions Work:

* If $\hat{y} \geq 0.5$, predict class 1 (positive class).  
* If $\hat{y} < 0.5$, predict class 0 (negative class).

---

### 🔹 4. Loss Function – Binary Cross-Entropy (Log Loss):

The model is trained by minimizing the **binary cross-entropy loss**, which measures how well the predicted probabilities match the true labels:

$$
L = -\frac{1}{n} \sum_{i=1}^n \left[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right]
$$

Where:

* $y_i$ is the actual label (0 or 1),  
* $\hat{y}_i$ is the predicted probability for sample $i$,  
* $n$ is the number of samples.

This loss penalizes confident but wrong predictions heavily.
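
A quick numerical check makes the penalty concrete. The sketch below evaluates the per-sample loss for a true label of 1 at several predicted probabilities. The helper name `bce` and the probabilities are assumptions chosen for illustration.

```python
import math


def bce(y, y_hat):
    """Binary cross-entropy for a single sample (y_hat strictly between 0 and 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


# True label is 1; the prediction moves from confident-right to confident-wrong
for y_hat in (0.9, 0.6, 0.1, 0.01):
    print(f"y=1, y_hat={y_hat}: loss={bce(1, y_hat):.4f}")
# y=1, y_hat=0.9: loss=0.1054
# y=1, y_hat=0.6: loss=0.5108
# y=1, y_hat=0.1: loss=2.3026
# y=1, y_hat=0.01: loss=4.6052
```

The loss grows without bound as the predicted probability of the true class approaches 0, which is why confident mistakes dominate the average.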

---

### 🔹 5. Training via Gradient Descent:

* The model updates weights $w$ and bias $b$ by calculating gradients of the loss with respect to these parameters.  
* Gradients are averaged over all training samples.  
* Parameters are updated in the direction that reduces the loss:

$$
w := w - \alpha \frac{\partial L}{\partial w}
$$

$$
b := b - \alpha \frac{\partial L}{\partial b}
$$

Where $\alpha$ is the learning rate.
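
For the binary cross-entropy loss with a sigmoid output, these gradients take a compact form. The derivation is standard and uses $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$; the notation $X_i$ for the input of sample $i$ is introduced here for convenience.

$$
\frac{\partial L}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)\, X_i
\qquad
\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)
$$

Each sample contributes its prediction error $\hat{y}_i - y_i$, scaled by the input for the weight gradient, so well-predicted samples barely move the parameters.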

---

### 🔹 6. Outcomes of Logistic Regression:

* ✅ **Predicted probabilities ($\hat{y}$)** indicating the likelihood of belonging to class 1.  
* ✅ **Class labels (0 or 1)** after applying a threshold (commonly 0.5).  
* ✅ **Weights (coefficients)** showing the influence of each feature on the prediction.  
* ✅ **Bias (intercept)** shifts the decision boundary.

---

### 🔹 7. Evaluation Metrics for Classification:

Common metrics to evaluate logistic regression:

* ✅ **Accuracy:** Percentage of correct predictions.  
* ✅ **Precision:** Proportion of positive identifications that were actually correct.  
* ✅ **Recall (Sensitivity):** Proportion of actual positives identified correctly.  
* ✅ **F1 Score:** Harmonic mean of precision and recall.  
* ✅ **ROC-AUC:** Area under the Receiver Operating Characteristic curve, showing tradeoff between true positive rate and false positive rate.
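
The first four metrics follow directly from counts of true/false positives and negatives. The sketch below is a minimal pure-Python illustration: the function name `classification_metrics` and the label lists are assumptions, and ROC-AUC is left out because it needs predicted probabilities rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


# Made-up labels: 2 true positives, 2 true negatives, 1 false positive, 1 false negative
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(f"{accuracy:.3f} {precision:.3f} {recall:.3f} {f1:.3f}")  # 0.667 0.667 0.667 0.667
```

Returning 0.0 for an undefined precision or recall is one common convention, not the only one.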

---

### 🔹 8. Advantages of Logistic Regression:

* Simple and interpretable model.  
* Outputs probabilities, useful for uncertainty estimation.  
* Efficient and fast to train.  
* Works well for linearly separable classes.

---

### 🔹 9. Limitations:

* Assumes linear decision boundary in feature space.  
* Can underperform on complex, non-linear problems.  
* Sensitive to irrelevant features and multicollinearity.  
* Not suitable for multi-class problems without extension (like one-vs-rest).

---

### 🔹 10. When to Use Logistic Regression?

* Binary classification problems with interpretable output.  
* When probabilities are needed, not just hard labels.  
* When dataset is relatively small or features are few.

---

### 🔹 11. Summary Table – Linear vs Logistic Regression

| Aspect          | Linear Regression               | Logistic Regression               |
| --------------- | ------------------------------ | -------------------------------- |
| Problem Type    | Regression (continuous output) | Classification (binary output)   |
| Output          | Continuous values              | Probability (0 to 1)             |
| Model Equation  | $y = a + bX + \varepsilon$     | $\hat{y} = \sigma(wX + b)$       |
| Loss Function   | Mean Squared Error (MSE)       | Binary Cross-Entropy Loss        |
| Prediction Rule | Direct output                  | Threshold at 0.5 for class label |
| Use Cases       | Predicting prices, temperature | Spam detection, medical diagnosis|





# Logistic Regression from Scratch - Code Explanation

---

## 1. Importing Required Library

```python
import math
```


We use the **math** library for mathematical functions like exponentiation (`exp`) and logarithms (`log`).


## 2. Sample Training Data

```python
x_data = [1, 2, 3, 4, 5]
y_data = [0, 0, 0, 1, 1]
```


- x_data represents the feature values — in this case, hours studied.

- y_data is the label or target — 1 means pass, 0 means fail.

- This is a binary classification problem.



## 3. Initializing Parameters

```python
w = 0.0
b = 0.0
```


- **w** is the weight (coefficient) for the feature.  
- **b** is the bias (intercept).  
- Both start from zero.


## 4. Sigmoid Function

```python
def sigmoid(z):
    return 1 / (1 + math.exp(-z))
```


- The sigmoid function squashes any real number into a value between 0 and 1.
- Used to interpret the output as a probability of belonging to class 1.
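
One caveat: `math.exp(-z)` raises `OverflowError` when `z` is very negative (roughly below -709). That does not occur with the small inputs in this walkthrough, but a variant that branches on the sign of `z` avoids it. The name `stable_sigmoid` is an assumption for this sketch and is not used elsewhere in these notes.

```python
import math


def stable_sigmoid(z):
    """Sigmoid that never calls exp() on a large positive argument."""
    if z >= 0:
        # exp(-z) is at most 1 here, so it cannot overflow
        return 1.0 / (1.0 + math.exp(-z))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


print(stable_sigmoid(0))      # 0.5
print(stable_sigmoid(-1000))  # 0.0
print(stable_sigmoid(1000))   # 1.0
```

Both branches compute the same function; only the arithmetic path differs.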

## 5. Training Parameters

```python
learning_rate = 0.1
epochs = 1000
n = len(x_data)
```


- **learning_rate** controls the step size in the gradient descent update.  
- **epochs** is how many times the training loop runs.  
- **n** is the number of training samples.


## 6. Training Loop

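```python
for epoch in range(epochs):
    dw = 0      # accumulated gradient w.r.t. w
    db = 0      # accumulated gradient w.r.t. b
    loss = 0    # accumulated binary cross-entropy

    for i in range(n):
        x = x_data[i]
        y = y_data[i]

        z = w * x + b
        y_hat = sigmoid(z)          # predicted probability of class 1

        error = y_hat - y

        dw += error * x
        db += error

        # The 1e-8 offset keeps log() away from log(0)
        loss += - (y * math.log(y_hat + 1e-8) + (1 - y) * math.log(1 - y_hat + 1e-8))

    # Average over all samples
    dw /= n
    db /= n
    loss /= n

    # Gradient descent update
    w -= learning_rate * dw
    b -= learning_rate * db

    # The logged loss was computed with the parameters from before this update
    if epoch % 100 == 0:
        print(f"Epoch {epoch} | Loss: {loss:.4f} | w: {w:.4f} | b: {b:.4f}")
```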


```
Epoch 0 | Loss: 0.6931 | w: 0.0300 | b: -0.0100
Epoch 100 | Loss: 0.4680 | w: 0.4825 | b: -1.3860
Epoch 200 | Loss: 0.3649 | w: 0.7567 | b: -2.3566
Epoch 300 | Loss: 0.3062 | w: 0.9639 | b: -3.0914
Epoch 400 | Loss: 0.2680 | w: 1.1307 | b: -3.6840
Epoch 500 | Loss: 0.2410 | w: 1.2711 | b: -4.1833
Epoch 600 | Loss: 0.2207 | w: 1.3929 | b: -4.6168
Epoch 700 | Loss: 0.2047 | w: 1.5009 | b: -5.0016
Epoch 800 | Loss: 0.1916 | w: 1.5984 | b: -5.3489
Epoch 900 | Loss: 0.1807 | w: 1.6875 | b: -5.6662
```


For each epoch:

1. Initialize gradients `dw` (weight gradient) and `db` (bias gradient), and loss accumulator.

2. Loop over each training sample:

   - Calculate the linear combination:
     $$
     z = w \times x + b
     $$

   - Pass \( z \) through the sigmoid to get the predicted probability \( \hat{y} \).

   - Compute the error:
     $$
     \text{error} = \hat{y} - y
     $$

   - Accumulate the gradients:
     $$
     dw \leftarrow dw + \text{error} \times x \quad \text{(gradient of loss w.r.t. weight)}
     $$
     $$
     db \leftarrow db + \text{error} \quad \text{(gradient of loss w.r.t. bias)}
     $$

   - Calculate the binary cross-entropy loss for the sample and add it to the total loss:
     $$
     \text{loss} = - \left( y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right)
     $$
     *(The code adds a small constant \(10^{-8}\) inside each logarithm to prevent \(\log(0)\).)*

3. Average the gradients and the loss by dividing by \( n \).

4. Update the weight and bias using gradient descent:
   $$
   w = w - \text{learning\_rate} \times dw
   $$
   $$
   b = b - \text{learning\_rate} \times db
   $$

5. Print the loss and parameters every 100 epochs to monitor training.


## 7. Prediction Function

```python
def predict(x):
    y_prob = sigmoid(w * x + b)
    return 1 if y_prob >= 0.5 else 0
```


Given an input `x`, calculate the predicted probability using current `w` and `b`.

If the probability is greater than or equal to 0.5, classify as **1** (pass), else **0** (fail).
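
Because the sigmoid equals 0.5 exactly when its input is 0, the 0.5 threshold corresponds to a cut-off on `x` at `-b / w` (for positive `w`). The sketch below uses the `w` and `b` values copied from the epoch-900 log line above as stand-ins; the parameters after the last epoch will differ slightly, and the names `w_logged`, `b_logged`, and `boundary` are assumptions for illustration.

```python
# Values taken from the "Epoch 900" log line, not the final trained parameters
w_logged, b_logged = 1.6875, -5.6662

# sigmoid(w*x + b) >= 0.5  <=>  w*x + b >= 0  <=>  x >= -b / w   (when w > 0)
boundary = -b_logged / w_logged
print(f"{boundary:.2f}")  # 3.36
```

A cut-off between 3 and 4 hours is consistent with the training labels, where 3 hours is a fail and 4 hours is a pass.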


## 8. Testing Predictions

```python
print("\nTesting predictions:")
for x in [1, 2, 3, 4, 5, 6]:
    print(f"Input: {x} => Prediction: {predict(x)}")
```



```
Testing predictions:
Input: 1 => Prediction: 0
Input: 2 => Prediction: 0
Input: 3 => Prediction: 0
Input: 4 => Prediction: 1
Input: 5 => Prediction: 1
Input: 6 => Prediction: 1
```


Testing the trained model on inputs 1 through 6.

Prints predicted class for each input.

## Summary

- This code implements logistic regression using only Python's standard library (`math`).
- Uses gradient descent to optimize weights and bias.
- Uses sigmoid activation to output probabilities.
- Uses binary cross-entropy as loss function.
- Updates parameters iteratively to minimize loss.
- Finally, it predicts class labels based on thresholding predicted probabilities.
