# Machine Learning Specialization - Supervised Learning: Regression and Classification
## Week 2: Multiple Linear Regression and Feature Engineering

### Learning Objectives:
- Extend linear regression to multiple features
- Understand feature engineering and polynomial regression
- Learn about learning curves and debugging algorithms
- Apply normalization and feature scaling

### Key Concepts:
- **Multiple Linear Regression**: Predicting with multiple input features
- **Feature Engineering**: Creating new features to improve model performance
- **Polynomial Regression**: Using polynomial features for non-linear relationships
- **Feature Scaling**: Normalizing features to improve gradient descent convergence
- **Learning Curves**: Diagnosing bias and variance in models

Building on last week's linear regression with one variable, we'll now work with multiple features and learn techniques to improve our models.

### 1. Import Required Libraries

Let's import the necessary libraries for our exercises.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")

### 2. Generate Multiple Feature Dataset

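We'll create a synthetic dataset with three features, which we label size, number of bedrooms, and age so the house-price example is easier to follow. The values come from `make_regression`, so they are random draws rather than realistic house measurements.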

In [None]:
# Generate synthetic data for house prices with multiple features
X, y = make_regression(n_samples=200, n_features=3, noise=15, random_state=42)

# Add intercept term (x0 = 1) to X
X = np.column_stack([np.ones(X.shape[0]), X])

# Reshape y to be a column vector
y = y.reshape(-1, 1)

# Feature names for interpretability
feature_names = ['Intercept', 'Size', 'Bedrooms', 'Age']

print(f"Dataset shape: X = {X.shape}, y = {y.shape}")
print(f"Features: {feature_names}")
print("First 5 samples:")
for i in range(5):
    print(f"Size: {X[i, 1]:.2f}, Bedrooms: {X[i, 2]:.2f}, Age: {X[i, 3]:.2f}, Price: {y[i, 0]:.2f}")

### 3. Feature Scaling (Normalization)

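Feature scaling speeds up gradient descent when features have very different ranges. Without it, the cost surface is stretched along some parameters, so a single learning rate is too large for some directions and too small for others. We'll implement z-score normalization: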

$$x_j^{(i)} := \frac{x_j^{(i)} - \mu_j}{ \sigma_j }$$

Where:
- $\mu_j$ is the mean of feature j
- $\sigma_j$ is the standard deviation of feature j

In [None]:
def feature_normalize(X):
    """
    Normalize features using z-score normalization.
    
    Args:
        X: Feature matrix (m x n), excluding intercept term
    
    Returns:
        X_norm: Normalized feature matrix
        mu: Mean of each feature
        sigma: Standard deviation of each feature
    """
    # YOUR CODE HERE - Implement feature normalization
    # Hint: X already excludes the intercept column, so normalize every column (axis=0)
    mu = None  # Replace with your implementation
    sigma = None  # Replace with your implementation
    X_norm = None  # Replace with your implementation
    
    return X_norm, mu, sigma

# Normalize features (excluding intercept)
X_norm, mu, sigma = feature_normalize(X[:, 1:])

# Add intercept back
X_norm = np.column_stack([np.ones(X_norm.shape[0]), X_norm])

print(f"Original feature range: [{X[:, 1:].min():.2f}, {X[:, 1:].max():.2f}]")
print(f"Normalized feature range: [{X_norm[:, 1:].min():.2f}, {X_norm[:, 1:].max():.2f}]")
print(f"Feature means: {mu}")
print(f"Feature std devs: {sigma}")
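
As a reference point for the exercise above, here is a minimal sketch of z-score normalization on a toy matrix. The function name `zscore_normalize_sketch` and the toy values are assumptions for illustration, not the notebook's official solution. It uses the population standard deviation (NumPy's default) and guards against constant columns.

```python
import numpy as np

def zscore_normalize_sketch(X):
    """Z-score normalize each column of X (intercept column already removed)."""
    mu = X.mean(axis=0)        # per-feature mean, shape (n,)
    sigma = X.std(axis=0)      # per-feature population std (ddof=0), shape (n,)
    # Replace zero std with 1 so constant columns do not cause division by zero
    sigma_safe = np.where(sigma == 0, 1.0, sigma)
    X_norm = (X - mu) / sigma_safe
    return X_norm, mu, sigma_safe

# Toy data (hypothetical): two features on very different scales
X_demo = np.array([[2104.0, 3.0],
                   [1600.0, 3.0],
                   [2400.0, 4.0],
                   [1416.0, 2.0]])
X_demo_norm, mu_demo, sigma_demo = zscore_normalize_sketch(X_demo)

# A new example must be scaled with the statistics learned from the training data
x_new = np.array([[1800.0, 3.0]])
x_new_norm = (x_new - mu_demo) / sigma_demo
```

Returning `mu` and `sigma` matters because any data seen later (validation or new examples) should be transformed with the same statistics rather than its own.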

### 4. Multiple Linear Regression Implementation

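Now we'll implement linear regression for multiple features. The hypothesis generalizes the one-variable case to $n$ features:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

With the convention $x_0 = 1$, this is the dot product $h_\theta(x) = \theta^T x$, so predictions for all examples can be computed at once as $X\theta$.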
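
To make the hypothesis concrete, the sketch below evaluates it two ways on made-up numbers: as one matrix product and as an explicit term-by-term sum. The parameter values and feature values are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical parameters theta_0..theta_3 and two examples with three features each
theta_demo = np.array([[10.0], [2.0], [-1.0], [0.5]])    # shape (4, 1)
features = np.array([[1.0, 2.0, 4.0],
                     [3.0, 0.0, 2.0]])                     # shape (2, 3)

# Prepend x_0 = 1 so that theta_0 acts as the intercept
X_demo = np.column_stack([np.ones(features.shape[0]), features])  # shape (2, 4)

# Vectorized form: one matrix product yields every prediction
h_vectorized = X_demo @ theta_demo                          # shape (2, 1)

# Explicit sum for the first example, term by term
h_first = 10.0 + 2.0 * 1.0 + (-1.0) * 2.0 + 0.5 * 4.0
```

Both forms give the same value for the first example; the vectorized form is what `np.dot(X, theta)` computes for the whole dataset.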

In [None]:
def predict(X, theta):
    """
    Make predictions using linear regression model.
    
    Args:
        X: Feature matrix (m x n)
        theta: Parameter vector (n x 1)
    
    Returns:
        predictions: Predicted values (m x 1)
    """
    return np.dot(X, theta)

def compute_cost(X, y, theta):
    """
    Compute the cost function for linear regression.
    
    Args:
        X: Feature matrix (m x n)
        y: Target values (m x 1)
        theta: Parameter vector (n x 1)
    
    Returns:
        cost: The cost value (scalar)
    """
    m = len(y)
    predictions = predict(X, theta)
    errors = predictions - y
    cost = (1 / (2 * m)) * np.sum(errors ** 2)
    return cost

def gradient_descent(X, y, theta, alpha, num_iterations):
    """
    Perform gradient descent to learn theta.
    
    Args:
        X: Feature matrix (m x n)
        y: Target values (m x 1)
        theta: Initial parameter vector (n x 1)
        alpha: Learning rate
        num_iterations: Number of iterations
    
    Returns:
        theta: Optimized parameter vector
        cost_history: List of cost values over iterations
    """
    m = len(y)
    cost_history = []
    
    for iteration in range(num_iterations):
        predictions = predict(X, theta)
        errors = predictions - y
        gradient = (1 / m) * np.dot(X.T, errors)
        theta = theta - alpha * gradient
        
        cost = compute_cost(X, y, theta)
        cost_history.append(cost)
        
        if iteration % 100 == 0:
            print(f"Iteration {iteration}: Cost = {cost:.4f}")
    
    return theta, cost_history
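
A quick way to gain confidence in a gradient descent implementation is to run it on noise-free data where the true parameters are known. The sketch below is a compact, self-contained variant of the update $\theta := \theta - \alpha \frac{1}{m} X^T (X\theta - y)$; the function name, the data, and `theta_true` are assumptions made for this check only.

```python
import numpy as np

def gradient_descent_sketch(X, y, alpha, num_iterations):
    """Batch gradient descent for linear regression, starting from theta = 0."""
    m, n = X.shape
    theta = np.zeros((n, 1))
    for _ in range(num_iterations):
        gradient = (X.T @ (X @ theta - y)) / m   # (1/m) * X^T (X theta - y)
        theta = theta - alpha * gradient
    return theta

# Hypothetical noise-free data generated from known parameters
rng = np.random.default_rng(0)
features = rng.standard_normal((100, 2))
X_demo = np.column_stack([np.ones(100), features])   # intercept column first
theta_true = np.array([[1.0], [2.0], [-3.0]])
y_demo = X_demo @ theta_true

theta_hat = gradient_descent_sketch(X_demo, y_demo, alpha=0.1, num_iterations=2000)
print(theta_hat.ravel())
```

If the recovered parameters do not approach `theta_true` on data like this, the likely culprits are the gradient formula, array shapes, or a learning rate that is too large.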

### 5. Training the Model

Let's train our multiple linear regression model using the normalized features.

In [None]:
# Initialize theta with zeros
theta_initial = np.zeros((X_norm.shape[1], 1))

# Set hyperparameters
alpha = 0.1
num_iterations = 1000

# Run gradient descent
theta_optimized, cost_history = gradient_descent(X_norm, y, theta_initial, alpha, num_iterations)

print(f"\nOptimized theta: {theta_optimized.flatten()}")
print(f"Final cost: {cost_history[-1]:.4f}")

# Make predictions
y_pred = predict(X_norm, theta_optimized)
# Note: MSE is twice the final cost, because the cost function includes a factor of 1/2
mse = mean_squared_error(y, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

### 6. Feature Engineering - Polynomial Regression

Sometimes a straight-line relationship isn't sufficient. We can create polynomial features to capture non-linear relationships. For example, from a feature $x$ we can create $x^2$, $x^3$, and so on.

The model stays linear in the parameters $\theta$, so the same gradient descent code can now fit curves instead of straight lines.

In [None]:
# Create polynomial features
def create_polynomial_features(X, degree):
    """
    Create polynomial features up to the specified degree.
    
    Args:
        X: Feature matrix (excluding intercept)
        degree: Maximum polynomial degree
    
    Returns:
        X_poly: Polynomial feature matrix
    """
    # YOUR CODE HERE - Implement polynomial feature creation
    # Hint: Use loops or vectorized operations to create x^2, x^3, etc.
    X_poly = None  # Replace with your implementation
    
    return X_poly

# Create polynomial features (degree 2)
X_poly = create_polynomial_features(X[:, 1:], 2)  # Exclude intercept
X_poly = np.column_stack([np.ones(X_poly.shape[0]), X_poly])

print(f"Original features: {X.shape[1]}")
print(f"Polynomial features (degree 2): {X_poly.shape[1]}")

# Normalize polynomial features
X_poly_norm, _, _ = feature_normalize(X_poly[:, 1:])
X_poly_norm = np.column_stack([np.ones(X_poly_norm.shape[0]), X_poly_norm])
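
One possible reading of the exercise above is to raise every feature to each power up to `degree`, without cross terms. The sketch below does that on a tiny matrix; the function name `polynomial_powers_sketch` is hypothetical and this is only one way to fill in the stub.

```python
import numpy as np

def polynomial_powers_sketch(X, degree):
    """Stack X, X**2, ..., X**degree column-wise (element-wise powers, no cross terms)."""
    return np.hstack([X ** d for d in range(1, degree + 1)])

# Toy input: two examples, three features
X_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
X_demo_poly = polynomial_powers_sketch(X_demo, 2)
# Column order: x1, x2, x3, x1^2, x2^2, x3^2
```

scikit-learn's `PolynomialFeatures`, imported at the top of this notebook, also generates interaction terms such as $x_1 x_2$, so for three features at degree 2 it produces more columns than this powers-only version.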

### 7. Training Polynomial Regression

Let's train a polynomial regression model and compare it with linear regression.

In [None]:
# Train polynomial regression
theta_poly = np.zeros((X_poly_norm.shape[1], 1))
theta_poly_opt, cost_poly_history = gradient_descent(X_poly_norm, y, theta_poly, alpha, num_iterations)

print(f"Polynomial theta: {theta_poly_opt.flatten()}")
print(f"Polynomial final cost: {cost_poly_history[-1]:.4f}")

# Compare predictions
y_pred_linear = predict(X_norm, theta_optimized)
y_pred_poly = predict(X_poly_norm, theta_poly_opt)

mse_linear = mean_squared_error(y, y_pred_linear)
mse_poly = mean_squared_error(y, y_pred_poly)

print(f"\nLinear Regression MSE: {mse_linear:.4f}")
print(f"Polynomial Regression MSE: {mse_poly:.4f}")
print(f"Improvement: {((mse_linear - mse_poly) / mse_linear * 100):.2f}%")

### 8. Learning Curves

Learning curves help us diagnose bias and variance problems:

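- **High bias (underfitting)**: Both training and validation error are high and close to each other; adding more data does not help much
- **High variance (overfitting)**: Training error is low, but validation error is much higher, leaving a large gap between the two curves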

Let's implement learning curves by training on different dataset sizes.

In [None]:
def learning_curve(X, y, X_val, y_val, alpha, num_iterations):
    """
    Generate learning curve data.
    
    Args:
        X: Training feature matrix
        y: Training target values
        X_val: Validation feature matrix
        y_val: Validation target values
        alpha: Learning rate
        num_iterations: Number of iterations
    
    Returns:
        train_errors: Training errors for different dataset sizes
        val_errors: Validation errors for different dataset sizes
        m_values: Dataset sizes used
    """
    m_values = np.arange(1, len(X) + 1, 10)  # Training set sizes: 1, 11, 21, ...
    train_errors = []
    val_errors = []
    
    # YOUR CODE HERE - Implement learning curve generation
    # Hint: Train on progressively larger subsets and compute errors
    
    return train_errors, val_errors, m_values

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_norm, y, test_size=0.3, random_state=42)

# Generate learning curves for linear regression
train_errors, val_errors, m_values = learning_curve(X_train, y_train, X_val, y_val, alpha, num_iterations)

# Plot learning curves
plt.figure(figsize=(10, 6))
plt.plot(m_values, train_errors, 'b-', label='Training Error')
plt.plot(m_values, val_errors, 'r-', label='Validation Error')
plt.xlabel('Training Set Size')
plt.ylabel('Mean Squared Error')
plt.title('Learning Curves - Linear Regression')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
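
The loop behind a learning curve can be sketched as follows. To keep the example short, it fits each subset with NumPy's least-squares solver instead of the notebook's gradient descent, and it uses its own synthetic data; the function names and data here are assumptions for illustration.

```python
import numpy as np

def mse(X, y, theta):
    """Mean squared error of the linear model X @ theta."""
    return float(np.mean((X @ theta - y) ** 2))

def learning_curve_sketch(X_train, y_train, X_val, y_val, m_values):
    """Fit on the first m training rows for each m; record train and validation MSE."""
    train_errors, val_errors = [], []
    for m in m_values:
        X_sub, y_sub = X_train[:m], y_train[:m]
        # Least-squares fit used here for brevity in place of gradient descent
        theta, *_ = np.linalg.lstsq(X_sub, y_sub, rcond=None)
        train_errors.append(mse(X_sub, y_sub, theta))   # error on the subset only
        val_errors.append(mse(X_val, y_val, theta))     # error on the full validation set
    return train_errors, val_errors

# Synthetic stand-in data (assumption): intercept plus 3 features, Gaussian noise
rng = np.random.default_rng(42)
features = rng.standard_normal((200, 3))
X_all = np.column_stack([np.ones(200), features])
y_all = X_all @ np.array([[5.0], [3.0], [-2.0], [1.0]]) + rng.normal(0, 1.0, (200, 1))

X_tr, y_tr = X_all[:140], y_all[:140]
X_va, y_va = X_all[140:], y_all[140:]
m_values = np.arange(4, 141, 10)   # start at 4 so there is one row per parameter
train_err, val_err = learning_curve_sketch(X_tr, y_tr, X_va, y_va, m_values)
```

Two details matter: the training error is measured only on the subset used for fitting, while the validation error always uses the full validation set. With four parameters and four training rows the fit is essentially exact, so the first training error is close to zero.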

### 9. Model Comparison and Analysis

Let's compare our linear and polynomial models and analyze their performance.

In [None]:
# Plot cost histories
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(cost_history, label='Linear Regression')
plt.plot(cost_poly_history, label='Polynomial Regression')
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('Cost vs Iterations')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot predictions vs actual
plt.subplot(1, 2, 2)
plt.scatter(y, y_pred_linear, alpha=0.7, label='Linear', color='blue')
plt.scatter(y, y_pred_poly, alpha=0.7, label='Polynomial', color='red')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', linewidth=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predictions vs Actual')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print model coefficients interpretation
print("Linear Regression Coefficients:")
for i, name in enumerate(feature_names):
    print(f"{name}: {theta_optimized[i, 0]:.4f}")

print("\nPolynomial Regression has", X_poly_norm.shape[1], "coefficients")

### 10. Experimentation and Questions

Now let's explore some important concepts:

1. **Feature Scaling Impact**: Compare gradient descent convergence with and without feature scaling.

2. **Polynomial Degree**: Try different polynomial degrees (1, 2, 3, 4) and observe the effects on training and validation error.

3. **Learning Rate Tuning**: Experiment with different learning rates and see how they affect convergence.

4. **Bias-Variance Tradeoff**: Based on your learning curves, does your model suffer from high bias or high variance?

5. **Feature Selection**: Which features seem most important? How could you determine this systematically?
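
For the first experiment, a comparison along the following lines may be a useful starting point. It uses hypothetical raw features on very different scales (square footage and bedroom count) with noise-free targets, and runs the same number of gradient descent iterations with and without z-score scaling. The data, learning rates, and the helper `run_gd` are assumptions chosen for illustration.

```python
import numpy as np

def run_gd(X, y, alpha, num_iterations):
    """Batch gradient descent from theta = 0; returns the final cost J(theta)."""
    m, n = X.shape
    theta = np.zeros((n, 1))
    for _ in range(num_iterations):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
    return float(np.sum((X @ theta - y) ** 2) / (2 * m))

# Hypothetical raw features on very different scales (noise-free targets)
rng = np.random.default_rng(0)
m = 200
size = rng.uniform(500, 3500, m)                  # square feet
bedrooms = rng.integers(1, 6, m).astype(float)    # 1 to 5
y_demo = (50 + 0.1 * size + 10 * bedrooms).reshape(-1, 1)

raw = np.column_stack([size, bedrooms])
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)   # z-score normalization

X_raw = np.column_stack([np.ones(m), raw])
X_scaled = np.column_stack([np.ones(m), scaled])

# Raw features need a tiny learning rate here to avoid divergence
cost_raw = run_gd(X_raw, y_demo, alpha=1e-7, num_iterations=1000)
cost_scaled = run_gd(X_scaled, y_demo, alpha=0.1, num_iterations=1000)
print(f"Cost after 1000 iterations, raw: {cost_raw:.4f}, scaled: {cost_scaled:.6f}")
```

On the raw features the large-scale column dominates the gradient, which forces a very small learning rate and leaves the other parameters almost unchanged after the same number of iterations.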

**Challenge**: Implement k-fold cross-validation to get a more robust estimate of model performance.
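
As a hint for the challenge, k-fold cross-validation can be sketched with NumPy alone. This version shuffles the row indices once, splits them into `k` groups, and fits a least-squares linear model on each training split; the least-squares solver stands in for gradient descent to keep the example short. The function name and the synthetic data are assumptions.

```python
import numpy as np

def k_fold_mse_sketch(X, y, k=5, seed=0):
    """Return the validation MSE of a least-squares linear fit on each of k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))     # shuffle row indices once
    folds = np.array_split(indices, k)    # k nearly equal index groups
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Least-squares fit used for brevity; gradient descent could be swapped in
        theta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residuals = X[val_idx] @ theta - y[val_idx]
        scores.append(float(np.mean(residuals ** 2)))
    return scores

# Synthetic stand-in data (assumption), with the intercept column already included
rng = np.random.default_rng(1)
features = rng.standard_normal((100, 3))
X_demo = np.column_stack([np.ones(100), features])
y_demo = X_demo @ np.array([[4.0], [1.0], [2.0], [-1.0]]) + rng.normal(0, 0.5, (100, 1))

fold_scores = k_fold_mse_sketch(X_demo, y_demo, k=5)
print(f"Mean CV MSE: {np.mean(fold_scores):.4f} (std {np.std(fold_scores):.4f})")
```

If feature normalization is part of the pipeline, the mean and standard deviation should be computed from each training split and then applied to the matching validation fold.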

In [None]:
# Experiment: Different polynomial degrees
degrees = [1, 2, 3, 4]
mse_train_list = []
mse_val_list = []

for degree in degrees:
    # Create polynomial features
    X_poly_exp = create_polynomial_features(X_train[:, 1:], degree)
    X_poly_exp = np.column_stack([np.ones(X_poly_exp.shape[0]), X_poly_exp])
    
    # Normalize, keeping the training-set mean and standard deviation
    X_poly_exp_norm, mu_exp, sigma_exp = feature_normalize(X_poly_exp[:, 1:])
    X_poly_exp_norm = np.column_stack([np.ones(X_poly_exp_norm.shape[0]), X_poly_exp_norm])
    
    # Train model
    theta_exp = np.zeros((X_poly_exp_norm.shape[1], 1))
    theta_exp_opt, _ = gradient_descent(X_poly_exp_norm, y_train, theta_exp, alpha, num_iterations)
    
    # Evaluate
    y_train_pred = predict(X_poly_exp_norm, theta_exp_opt)
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_train_list.append(mse_train)
    
    # For validation, create the same polynomial features and scale them with the
    # training statistics (recomputing them on validation data would mismatch theta)
    X_val_poly = create_polynomial_features(X_val[:, 1:], degree)
    X_val_poly_norm = (X_val_poly - mu_exp) / sigma_exp
    X_val_poly_norm = np.column_stack([np.ones(X_val_poly_norm.shape[0]), X_val_poly_norm])
    
    y_val_pred = predict(X_val_poly_norm, theta_exp_opt)
    mse_val = mean_squared_error(y_val, y_val_pred)
    mse_val_list.append(mse_val)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(degrees, mse_train_list, 'b-o', label='Training Error')
plt.plot(degrees, mse_val_list, 'r-o', label='Validation Error')
plt.xlabel('Polynomial Degree')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Tradeoff: Polynomial Degree vs Error')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Training MSE by degree:", [f"{mse:.2f}" for mse in mse_train_list])
print("Validation MSE by degree:", [f"{mse:.2f}" for mse in mse_val_list])

### Key Takeaways

1. **Multiple Linear Regression** extends single-variable regression to handle multiple features simultaneously.

2. **Feature Scaling** is crucial for gradient descent convergence when features have different scales.

3. **Feature Engineering** (like polynomial features) allows us to capture non-linear relationships.

4. **Learning Curves** help diagnose whether a model has high bias (underfitting) or high variance (overfitting).

5. **Bias-Variance Tradeoff**: More complex models (higher polynomial degrees) can reduce bias but increase variance.

### Next Steps

In the next notebook, we'll move from regression to classification problems and learn about logistic regression for binary classification tasks.