# Linear Regression with Multiple Features & Practical Considerations (Part 2)

---

## Multiple Linear Regression

* **Purpose:** Predict $y$ (output) using **multiple features** ($x_1, x_2, \dots, x_n$) instead of just one.
* **Notation:**
    * $x_j$: The $j$-th feature.
    * $n$: Total number of features.
    * $x^{(i)}$: The $i$-th training example (a list/vector of features).
    * $x_j^{(i)}$: The value of the $j$-th feature for the $i$-th training example.
* **Model Equation:**
    $$ f_{W,b}(X) = w_1x_1 + w_2x_2 + \dots + w_nx_n + b $$
    * **Interpretation of Parameters:**
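        * $b$: Base price (the prediction when all features are zero).
        * $w_j$: Change in the predicted output per unit increase in the $j$-th feature, with the other features held fixed (e.g., if price is measured in thousands of dollars and $x_1$ is the size in square feet, $w_1=0.1$ means a $0.1 \times \$1000 = \$100$ increase per additional square foot).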
* **Vectorized Notation:**
    * Collect parameters $w_1, \dots, w_n$ into a **vector** $W$.
    * Collect features $x_1, \dots, x_n$ into a **vector** $X$.
    * Model can be written as:
        $$ f_{W,b}(X) = W \cdot X + b $$
        * **Dot Product:** $W \cdot X = w_1x_1 + w_2x_2 + \dots + w_nx_n$.
* **Name:** This model is called **Multiple Linear Regression**.
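
The notation above maps onto a NumPy array fairly directly. The sketch below is illustrative only: the training matrix `X_train`, the feature meanings, and the parameter values are hypothetical and are not fitted to anything.

```python
import numpy as np

# Hypothetical training set: m = 3 houses, n = 4 features
# (size in sq ft, bedrooms, floors, age in years).
X_train = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [852,  2, 1, 35],
])
m, n = X_train.shape          # m examples, n features

# The math is 1-indexed, NumPy is 0-indexed:
# x^(2) (second example) is row 1, and x_3^(2) is row 1, column 2.
x_2 = X_train[1]              # array([1416, 3, 2, 40])
x_3_of_2 = X_train[1, 2]      # 2

# Illustrative parameters (not fitted values).
w = np.array([0.1, 4.0, 10.0, -2.0])
b = 80.0

# Prediction for one example: f(x) = w . x + b
f_single = np.dot(w, x_2) + b

# Predictions for all m examples at once: X_train @ w has shape (m,)
f_all = X_train @ w + b
```

Storing one example per row means a single matrix-vector product yields every prediction, which is the form the gradient computations later in these notes rely on.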

---

## Vectorization

* **Purpose:** Makes code shorter, easier to read, and significantly faster.
* **Mechanism:** Leverages optimized numerical linear algebra libraries (like **NumPy**) and parallel processing hardware (CPU, GPU) by operating on entire arrays/vectors/matrices at once.
* **NumPy Indexing:** Starts from `0` (e.g., `w[0]`, `x[0]`).
* **Implementation Example:**
    * **Non-vectorized (for-loop):**
        ```python
        f = 0
        for j in range(n):
            f = f + w[j] * x[j]
        f = f + b
        ```
    * **Vectorized (NumPy dot product):**
        ```python
        f = np.dot(w, x) + b
        ```

---

```python
# --- Demo 1: Vectorization Speed Comparison ---
import numpy as np
import time

print("--- Demo 1: Vectorization Speed Comparison ---")

# Example parameters and features for a small case (n=3)
w_small = np.array([0.1, 4.0, 10.0])
b_small = 80
x_small = np.array([1000, 3, 2])

# Non-vectorized prediction for small case
f_non_vec_small = 0
for j in range(len(w_small)):
    f_non_vec_small += w_small[j] * x_small[j]
f_non_vec_small += b_small
print(f"Non-vectorized (small n) prediction: {f_non_vec_small:.2f}")

# Vectorized prediction for small case
f_vec_small = np.dot(w_small, x_small) + b_small
print(f"Vectorized (small n) prediction: {f_vec_small:.2f}\n")


# Simulate a large number of features for speed test (e.g., n = 100,000)
n_features_large = 100000
w_large = np.random.rand(n_features_large) # Random weights
x_large = np.random.rand(n_features_large) # Random features
b_large = 50

# Time non-vectorized computation
start_time_non_vec = time.time()
f_large_non_vec = 0
for j in range(n_features_large):
    f_large_non_vec += w_large[j] * x_large[j]
f_large_non_vec += b_large
end_time_non_vec = time.time()
time_non_vec = end_time_non_vec - start_time_non_vec
print(f"Time for non-vectorized (n={n_features_large}): {time_non_vec:.6f} seconds")

# Time vectorized computation
start_time_vec = time.time()
f_large_vec = np.dot(w_large, x_large) + b_large
end_time_vec = time.time()
time_vec = end_time_vec - start_time_vec
print(f"Time for vectorized (n={n_features_large}): {time_vec:.6f} seconds")

# Compare speeds
if time_vec > 0: # Avoid division by zero if time_vec is extremely small
    print(f"\nVectorized code is approximately {time_non_vec / time_vec:.1f}x faster!")
else:
    print("\nVectorized code was extremely fast, couldn't accurately measure speedup.")
```

Output:

```text
--- Demo 1: Vectorization Speed Comparison ---
Non-vectorized (small n) prediction: 212.00
Vectorized (small n) prediction: 212.00

Time for non-vectorized (n=100000): 0.017076 seconds
Time for vectorized (n=100000): 0.000029 seconds

Vectorized code is approximately 587.1x faster!
```


## Gradient Descent for Multiple Linear Regression

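* **Objective:** Minimize the squared-error cost $J(W, b) = \frac{1}{2m} \sum_{i=1}^{m} (f_{W,b}(x^{(i)}) - y^{(i)})^2$ over the $m$ training examples to find the optimal parameters $W$ and $b$.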
* **Update Rule for each $w_j$ ($j=1, \dots, n$):**
    $$ w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(W, b) $$
    * Where the partial derivative for $w_j$ is:
        $$ \frac{\partial}{\partial w_j} J(W, b) = \frac{1}{m} \sum_{i=1}^{m} (f_{W,b}(x^{(i)}) - y^{(i)})x_j^{(i)} $$
* **Update Rule for $b$:**
    $$ b := b - \alpha \frac{\partial}{\partial b} J(W, b) $$
    * Where the partial derivative for $b$ is:
        $$ \frac{\partial}{\partial b} J(W, b) = \frac{1}{m} \sum_{i=1}^{m} (f_{W,b}(x^{(i)}) - y^{(i)}) $$
* **Key Point:** All parameters ($w_1, \dots, w_n, b$) must be **updated simultaneously** in each step.
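
The update rules above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function names `compute_gradient` and `gradient_descent`, the synthetic data, and the choices `alpha=0.1` and `num_iters=5000` are all assumptions made for the example.

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Return dJ/dw (shape (n,)) and dJ/db for the squared-error cost."""
    m = X.shape[0]
    err = X @ w + b - y            # f(x^(i)) - y^(i) for every i, shape (m,)
    dj_dw = X.T @ err / m          # sums err * x_j^(i) over i for each j
    dj_db = err.sum() / m
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, num_iters):
    """Run batch gradient descent with simultaneous updates."""
    for _ in range(num_iters):
        # Both gradients are computed from the *old* w and b ...
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        # ... and only then are the parameters overwritten.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Tiny synthetic check: data generated from y = 2*x1 - 3*x2 + 1 (no noise).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = X @ np.array([2.0, -3.0]) + 1.0

w_fit, b_fit = gradient_descent(X, y, np.zeros(2), 0.0, alpha=0.1, num_iters=5000)
print(w_fit, b_fit)   # should land close to [2, -3] and 1
```

Computing `dj_dw` and `dj_db` before touching `w` or `b` is what makes the update simultaneous; updating `w` first and then recomputing the error for `b` would be a different algorithm.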

### Normal Equation (Alternative for Linear Regression)

* **Method:** A non-iterative approach using linear algebra to directly solve for $W$ and $b$.
* **Applies To:** **Only linear regression models**.
* **Disadvantages:**
    * Does **not generalize** to other learning algorithms (e.g., logistic regression, neural networks).
    * Can be **slow** if the number of features ($n$) is very large.
* **Use Case:** Some mature machine learning libraries might use this internally for linear regression.
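
As a rough illustration of the non-iterative route, the sketch below solves the same least-squares problem with `np.linalg.lstsq`. The notes do not say which solver any particular library uses, so treat this as one possible approach; the data and the augmented matrix `X_aug` are hypothetical. The textbook closed form is $\theta = (X^TX)^{-1}X^Ty$, but a least-squares solver avoids forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(20, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0   # noise-free synthetic targets

# Append a column of ones so the intercept b is solved for together with w.
X_aug = np.column_stack([X, np.ones(X.shape[0])])

# Least-squares solve of X_aug @ theta ~= y; this is numerically safer than
# inverting X_aug.T @ X_aug by hand.
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = theta[:-1], theta[-1]
print(w, b)   # close to [1.5, -2.0, 0.5] and 4.0
```

No learning rate or iteration count is needed here, and feature scaling is not required for this solve, but the approach is specific to linear regression.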

---

## Feature Scaling

* **Problem:** If features have vastly different ranges (e.g., house size: 300-2000 sq ft; bedrooms: 0-5), the cost function's contour plot will be **elongated/skinny**. This makes Gradient Descent **slow** and prone to "bouncing" back and forth.
* **Solution:** Transform features so they have **comparable ranges of values**.
* **Benefit:** Greatly **speeds up** Gradient Descent's convergence.
* **Rule of Thumb:** Aim for feature ranges roughly between **-1 and +1** (or similar small, symmetric ranges like -3 to +3, -0.3 to +0.3). Features with very large (e.g., -100 to +100) or very small (e.g., -0.001 to +0.001) ranges should likely be scaled. When in doubt, scale.

### Common Feature Scaling Methods

1.  **Scaling by Max:**
    $$ x_j^{\text{scaled}} = \frac{x_j}{\text{max}(x_j)} $$
2.  **Mean Normalization:** Centers features around zero.
    $$ x_j^{\text{normalized}} = \frac{x_j - \mu_j}{\text{max}(x_j) - \text{min}(x_j)} $$
    * $\mu_j$: Mean (average) of feature $j$.
3.  **Z-score Normalization:** Centers features around zero and scales by standard deviation.
    $$ x_j^{\text{normalized}} = \frac{x_j - \mu_j}{\sigma_j} $$
    * $\sigma_j$: Standard deviation of feature $j$.

---

```python
# --- Demo 2: Feature Scaling Examples ---
import numpy as np

print("--- Demo 2: Feature Scaling Examples ---")

# Example data for house size (x1) and number of bedrooms (x2)
# Units: x1 in sq ft (e.g., 300 to 2000), x2 is count (e.g., 0 to 5)
x1_original = np.array([300, 800, 1200, 1500, 2000])
x2_original = np.array([0, 1, 2, 3, 5])

print(f"Original x1: {x1_original}")
print(f"Original x2: {x2_original}\n")

# 1. Scaling by Max
x1_scaled_max = x1_original / np.max(x1_original)
x2_scaled_max = x2_original / np.max(x2_original)
print(f"Scaled by Max x1: {x1_scaled_max.round(3)}")
print(f"Scaled by Max x2: {x2_scaled_max.round(3)}\n")

# 2. Mean Normalization
mu1 = np.mean(x1_original)
range1 = np.max(x1_original) - np.min(x1_original)
x1_mean_norm = (x1_original - mu1) / range1

mu2 = np.mean(x2_original)
range2 = np.max(x2_original) - np.min(x2_original)
x2_mean_norm = (x2_original - mu2) / range2

print(f"Mean Normalized x1 (mean={mu1:.1f}, range={range1:.1f}): {x1_mean_norm.round(3)}")
print(f"Mean Normalized x2 (mean={mu2:.1f}, range={range2:.1f}): {x2_mean_norm.round(3)}\n")

# 3. Z-score Normalization
std1 = np.std(x1_original)
x1_zscore = (x1_original - mu1) / std1

std2 = np.std(x2_original)
x2_zscore = (x2_original - mu2) / std2

print(f"Z-score Normalized x1 (mean={mu1:.1f}, std={std1:.1f}): {x1_zscore.round(3)}")
print(f"Z-score Normalized x2 (mean={mu2:.1f}, std={std2:.1f}): {x2_zscore.round(3)}\n")
```

Output:

```text
--- Demo 2: Feature Scaling Examples ---
Original x1: [ 300  800 1200 1500 2000]
Original x2: [0 1 2 3 5]

Scaled by Max x1: [0.15 0.4  0.6  0.75 1.  ]
Scaled by Max x2: [0.  0.2 0.4 0.6 1. ]

Mean Normalized x1 (mean=1160.0, range=1700.0): [-0.506 -0.212  0.024  0.2    0.494]
Mean Normalized x2 (mean=2.2, range=5.0): [-0.44 -0.24 -0.04  0.16  0.56]

Z-score Normalized x1 (mean=1160.0, std=581.7): [-1.478 -0.619  0.069  0.584  1.444]
Z-score Normalized x2 (mean=2.2, std=1.7): [-1.279 -0.697 -0.116  0.465  1.627]
```
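
Demo 2 scales one feature at a time. In practice the features usually sit in a single `(m, n)` matrix, and any new input has to be scaled with the statistics computed on the training set. The sketch below shows one way to do that; the helper names `zscore_fit` and `zscore_apply` and the new example `x_new` are hypothetical.

```python
import numpy as np

def zscore_fit(X):
    """Return per-feature mean and standard deviation (column-wise)."""
    return X.mean(axis=0), X.std(axis=0)

def zscore_apply(X, mu, sigma):
    """Scale X with previously computed statistics."""
    return (X - mu) / sigma

# Same values as Demo 2, arranged as an (m, n) matrix: [size, bedrooms].
X_train = np.array([[300, 0], [800, 1], [1200, 2], [1500, 3], [2000, 5]], dtype=float)
mu, sigma = zscore_fit(X_train)
X_train_scaled = zscore_apply(X_train, mu, sigma)

# A new house must be scaled with the *training* mu and sigma,
# otherwise the learned w and b no longer match the inputs.
x_new = np.array([1000.0, 2.0])
x_new_scaled = zscore_apply(x_new, mu, sigma)
print(x_new_scaled.round(3))
```

Note that a feature with zero standard deviation (a constant column) would cause a division by zero here; this sketch assumes no such column exists.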



## Debugging Gradient Descent: Convergence

* **Objective:** Verify Gradient Descent is correctly minimizing the cost function.
* **Method:** Plot the **cost function $J$ vs. the number of iterations**. This plot is called a **learning curve**.
    * Horizontal axis: Number of iterations.
    * Vertical axis: Cost $J$.

* **Expected Behavior (Proper Convergence):**
    * With a well-chosen learning rate, cost $J$ should **decrease after every single iteration**.
    * The curve should eventually **flatten out**, indicating convergence to a minimum.
    * The number of iterations needed to converge varies widely by application (from a few dozen to hundreds of thousands).

* **Signs of Problems:**
    * **Cost $J$ increases** on any iteration, or **fluctuates (goes up and down):**
        * Likely $\alpha$ (learning rate) is **too large**.
        * Could indicate a **bug in the code** (e.g., using `+` instead of `-` in the update rule).
    * **Cost $J$ decreases very slowly:**
        * $\alpha$ is likely **too small**.
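
A learning curve can be produced by recording $J$ after each update. The following is a minimal sketch on synthetic data; the helper names, the data, and the settings `alpha=0.1` and `num_iters=200` are assumptions chosen for illustration.

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) = (1 / 2m) * sum of squared errors."""
    err = X @ w + b - y
    return (err @ err) / (2 * X.shape[0])

def run_with_history(X, y, alpha, num_iters):
    """Batch gradient descent from zeros; returns J after every iteration."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = []
    for _ in range(num_iters):
        err = X @ w + b - y
        # Tuple assignment evaluates both right-hand sides before updating.
        w, b = w - alpha * (X.T @ err) / m, b - alpha * err.sum() / m
        history.append(compute_cost(X, y, w, b))
    return np.array(history)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = X @ np.array([3.0, -1.0]) + 0.5 + rng.normal(0, 0.1, size=100)

J_hist = run_with_history(X, y, alpha=0.1, num_iters=200)

# A healthy curve never goes up from one iteration to the next.
print("never increases:", bool(np.all(np.diff(J_hist) <= 0)))

# Optional plot, assuming matplotlib is installed:
# import matplotlib.pyplot as plt
# plt.plot(J_hist); plt.xlabel("iteration"); plt.ylabel("cost J"); plt.show()
```

Checking `np.diff(J_hist)` gives a quick programmatic version of what the plot shows: any positive entry marks an iteration where the cost went up.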

---

## Choosing the Learning Rate ($\alpha$)

* **Importance:** Crucial for efficient and correct convergence.
* **Debugging Tip:** If gradient descent isn't working, set $\alpha$ to a **very small number** (e.g., 0.0001). If $J$ still increases, there's likely a bug in your code.
* **Finding Optimal $\alpha$:**
    1.  **Try a range of values:** Start small (e.g., 0.001).
    2.  **Increment by factors of 3 (or 10):** Example sequence: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3...
    3.  **Plot learning curves** for each $\alpha$ (run for a handful of iterations).
    4.  **Choose:** The largest $\alpha$ that results in a **rapid but consistently decreasing** cost function.
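
The search procedure above can be sketched as a small sweep. Everything here is illustrative: the synthetic data, the helper `cost_history`, and the choice of 30 iterations per run are assumptions, and which $\alpha$ comes out on top depends on the data.

```python
import numpy as np

def cost_history(X, y, alpha, num_iters):
    """Cost J after each batch gradient descent step, starting from zeros."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = []
    for _ in range(num_iters):
        err = X @ w + b - y
        w, b = w - alpha * (X.T @ err) / m, b - alpha * err.sum() / m
        err_new = X @ w + b - y
        history.append((err_new @ err_new) / (2 * m))
    return np.array(history)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))      # already in a scaled range
y = X @ np.array([3.0, -1.0]) + 0.5

results = {}
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0]:
    J = cost_history(X, y, alpha, num_iters=30)
    decreasing = bool(np.all(np.diff(J) <= 0))
    results[alpha] = (J[-1], decreasing)
    print(f"alpha={alpha:<6} final J={J[-1]:.4g}  always decreasing={decreasing}")

# Keep only the rates whose cost never went up, then take the largest.
good = [a for a, (_, ok) in results.items() if ok]
best_alpha = max(good)
print("largest alpha with a consistently decreasing cost:", best_alpha)
```

In a real project it is still worth looking at the curves themselves, since a rate sitting right at the edge of stability can pass this check and yet be fragile on slightly different data.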

---

## Feature Engineering and Polynomial Regression

* **Feature Engineering:** Designing new features by transforming or combining existing ones, using your knowledge about the problem.
    * **Example:** Create `area = lot_width * lot_depth`. Use `area` as a new feature.
    * **Benefit:** Can significantly improve model accuracy by providing more relevant information.

* **Polynomial Regression:** A type of feature engineering for linear regression to fit **non-linear curves**.
    * **Idea:** Create new features by raising existing features to different powers (e.g., $x, x^2, x^3$).
    * **Example:** Model house price based on `size` ($x$), `size^2` ($x^2$), `size^3` ($x^3$).
        * Model: $f_{W,b}(x) = w_1x + w_2x^2 + w_3x^3 + b$.
        * This is still linear in its parameters but fits a non-linear curve with respect to the original feature $x$.
    * **Critical Note:** When using polynomial features, **feature scaling is even more important** due to vastly different ranges (e.g., $x$ from 1-1000, $x^2$ from 1-1,000,000).
    * **Other options:** Use functions like $\sqrt{x}$.
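
The sketch below builds the polynomial features $x, x^2, x^3$ by hand and shows how far apart their ranges are before and after z-score scaling. The input `x`, the target `y`, and the use of a least-squares solve for the final fit are assumptions made to keep the example short.

```python
import numpy as np

# Hypothetical 1-D input and a target that follows a curve, not a line.
x = np.arange(1, 21, dtype=float)          # 1, 2, ..., 20
y = 1.0 + x**2

# Engineered features: x, x^2, x^3 stacked as columns -> shape (20, 3)
X_poly = np.column_stack([x, x**2, x**3])
raw_ranges = X_poly.max(axis=0) - X_poly.min(axis=0)
print("raw ranges   :", raw_ranges)        # 19, 399, 7999

# Z-score each column so gradient descent sees comparable ranges.
mu, sigma = X_poly.mean(axis=0), X_poly.std(axis=0)
X_scaled = (X_poly - mu) / sigma
scaled_ranges = X_scaled.max(axis=0) - X_scaled.min(axis=0)
print("scaled ranges:", scaled_ranges.round(2))

# Fit the linear model on the scaled polynomial features. A least-squares
# solve is used here only to keep the sketch short; gradient descent on
# X_scaled would fit the same model.
A = np.column_stack([X_scaled, np.ones(len(x))])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ theta
print("max abs error:", np.abs(y_hat - y).max())
```

The model is still linear in `theta`; only the columns of the design matrix are non-linear functions of the original `x`.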

---

### Scikit-learn (Mentioned)

* A widely used open-source Python ML library.
* Provides ready-to-use implementations for algorithms like linear regression.
* Understanding manual implementation is still important for solid comprehension and debugging, even when using libraries.

---

## Overfitting and Underfitting

* **Underfitting (High Bias):**
    * **Definition:** Model is too simple; it **doesn't fit the training data well** and misses clear patterns.
    * **Characteristics:** Poor performance on both training and new data.
    * **Example:** Fitting a straight line to data that clearly needs a curve.
* **Overfitting (High Variance):**
    * **Definition:** Model is too complex; it **fits the training data "too well"** (even noise), but performs poorly on new, unseen data.
    * **Characteristics:** Excellent performance on training data, but poor **generalization**.
    * **Example:** Fitting a very high-order polynomial that creates a "wiggly" curve through all training points.
* **"Just Right" Model:** Balances fitting training data well (low bias) and generalizing well (low variance). This is the goal.
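
One way to see both failure modes is to fit polynomials of different degrees to the same small, noisy dataset and compare training error with error on held-out points. This is a sketch under stated assumptions: the quadratic ground truth, the noise level, the sample sizes, and the degrees 1, 2, and 9 are all hypothetical, and the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(num_points):
    """Noisy samples of a quadratic; x is kept in [-1, 1]."""
    x = rng.uniform(-1, 1, size=num_points)
    y = 1.0 + 2.0 * x**2 + rng.normal(0, 0.2, size=num_points)
    return x, y

x_train, y_train = make_data(12)     # small training set
x_test, y_test = make_data(200)      # unseen data from the same process

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, test_err = {}, {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_err[degree] = mse(coeffs, x_train, y_train)
    test_err[degree] = mse(coeffs, x_test, y_test)
    print(f"degree {degree}: train MSE={train_err[degree]:.4f}  "
          f"test MSE={test_err[degree]:.4f}")
```

Training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. The line (degree 1) has high error on both sets, which is the underfitting pattern; a gap between a very low training error and a noticeably higher test error for the degree-9 fit would be the overfitting pattern.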

---

## Addressing Overfitting

1.  **Collect More Training Data:**
    * **Effect:** Helps complex models learn more general patterns.
    * **Feasibility:** Not always possible.

2.  **Use Fewer Features (Feature Selection):**
    * **Method:** Manually select a subset of the most relevant features.
    * **Disadvantage:** May discard useful information.

3.  **Regularization:**
    * **Core Idea:** Encourage the learning algorithm to keep **parameter values ($w_j$) small** (but not necessarily zero).
    * **Effect:** A model with smaller parameters tends to be "simpler" and "smoother," reducing overfitting.
    * **Convention:** Typically regularize $w_1, \dots, w_n$, but not the bias parameter $b$.

---

### Regularization Cost Function

* **Modified Cost Function:** Adds a **regularization term** to the original cost function.
* **For Linear Regression (Regularized Mean Squared Error):**
    $$ J(W, b) = \frac{1}{2m} \sum_{i=1}^{m} (f_{W,b}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2 $$
* **For Logistic Regression (Regularized Log Loss):**
    $$ J(W, b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(f_{W,b}(x^{(i)})) + (1-y^{(i)})\log(1-f_{W,b}(x^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2 $$
* **Components:**
    1.  **Original Cost Term:** Measures how well the model fits the training data.
    2.  **Regularization Term:** Penalizes large parameter values ($w_j^2$).
* **$\lambda$ (Lambda): Regularization Parameter:**
    * Controls the **trade-off** between fitting the training data and keeping parameters small.
    * **$\lambda = 0$:** No regularization, prone to overfitting.
    * **$\lambda$ is very large:** Strong regularization, forces $w_j$ close to 0, prone to underfitting.
    * **"Just right" $\lambda$:** Balances the two goals for optimal performance.
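
For the linear regression case, the regularized cost can be computed as in the sketch below. The function name `compute_cost_reg` and the toy values are hypothetical; `lambda_` carries a trailing underscore only because `lambda` is a reserved word in Python.

```python
import numpy as np

def compute_cost_reg(X, y, w, b, lambda_):
    """Regularized squared-error cost for linear regression."""
    m = X.shape[0]
    err = X @ w + b - y
    fit_term = (err @ err) / (2 * m)               # original cost
    reg_term = (lambda_ / (2 * m)) * (w @ w)       # penalty on w only, not b
    return fit_term + reg_term

# Hypothetical toy values.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
b = 1.0

print(compute_cost_reg(X, y, w, b, lambda_=0.0))   # 0.625
print(compute_cost_reg(X, y, w, b, lambda_=1.0))   # 0.75
```

Here the errors are $-0.5$ and $-1.5$, so the fit term is $(0.25 + 2.25)/4 = 0.625$, and with $\lambda = 1$ the penalty adds $\frac{1}{4}(0.25 + 0.25) = 0.125$.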

---

### Gradient Descent with Regularization

* **Update Rule for each $w_j$ ($j=1, \dots, n$):**
    $$ w_j := w_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^{m} (f_{W,b}(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m} w_j \right] $$
    * **Intuition:** Each $w_j$ is slightly shrunk by multiplying it with $(1 - \alpha \frac{\lambda}{m})$ before applying the usual update.
* **Update Rule for $b$ (No Regularization):**
    $$ b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{W,b}(x^{(i)}) - y^{(i)}) $$
* **Important:**
    * The definition of $f_{W,b}(x)$ depends on the model (linear vs. logistic regression).
    * Always perform **simultaneous updates** for all parameters.
    * **Feature scaling** remains crucial for faster convergence.
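
The sketch below implements one regularized update for linear regression and checks numerically that it matches the "shrink, then step" reading of the rule. The function name `compute_gradient_reg`, the random data, and the values of `alpha` and `lambda_` are assumptions for illustration; for logistic regression only the definition of the prediction would change.

```python
import numpy as np

def compute_gradient_reg(X, y, w, b, lambda_):
    """Gradients of the regularized squared-error cost (b is not penalized)."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = X.T @ err / m + (lambda_ / m) * w
    dj_db = err.sum() / m
    return dj_dw, dj_db

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)
w, b = rng.normal(size=3), 0.2
alpha, lambda_ = 0.05, 2.0
m = X.shape[0]

# Form 1: the update rule exactly as written above.
dj_dw, dj_db = compute_gradient_reg(X, y, w, b, lambda_)
w_new = w - alpha * dj_dw
b_new = b - alpha * dj_db

# Form 2: shrink w by (1 - alpha * lambda / m), then apply the usual step.
err = X @ w + b - y
w_shrunk = w * (1 - alpha * lambda_ / m) - alpha * (X.T @ err) / m

print(np.allclose(w_new, w_shrunk))   # True
```

The two forms agree because $w_j - \alpha \frac{\lambda}{m} w_j = w_j \left(1 - \alpha \frac{\lambda}{m}\right)$; the rest of the update is the unregularized gradient step.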

---