# Machine Learning Specialization - Supervised Learning: Regression and Classification
## Week 3: Logistic Regression and Regularization

### Learning Objectives:
- Understand logistic regression for binary classification
- Implement the sigmoid function and logistic cost function
- Apply regularization techniques to prevent overfitting
- Compare regularized vs non-regularized models

### Key Concepts:
- **Logistic Regression**: A classification algorithm that predicts probabilities
- **Sigmoid Function**: Maps any real value to a probability between 0 and 1
- **Logistic Cost Function**: Cross-entropy loss that measures how far the predicted probabilities are from the true labels
- **Regularization**: Techniques to prevent overfitting (L1, L2)
- **Overfitting Prevention**: Methods to improve model generalization

This week we transition from regression to classification, learning how to predict discrete categories instead of continuous values.

### 1. Import Required Libraries

Let's import the necessary libraries for our classification exercises.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")

### 2. Generate Binary Classification Dataset

We'll create a synthetic dataset for binary classification and frame it as a medical diagnosis scenario: the two features stand in for test results, and the label indicates whether a patient has a disease.

In [None]:
# Generate synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=2, n_informative=2, n_redundant=0, 
                          n_clusters_per_class=1, random_state=42)

# Add intercept term (x0 = 1) to X
X = np.column_stack([np.ones(X.shape[0]), X])

# Ensure y is a column vector
y = y.reshape(-1, 1)

print(f"Dataset shape: X = {X.shape}, y = {y.shape}")
print(f"Classes: {np.unique(y.flatten())}")
print(f"Class distribution: Class 0 = {np.sum(y == 0)}, Class 1 = {np.sum(y == 1)}")
print("First 5 samples:")
for i in range(5):
    print(f"Features: [{X[i, 1]:.2f}, {X[i, 2]:.2f}], Label: {y[i, 0]}")

### 3. Visualize the Classification Data

Let's plot our data to understand the classification problem.

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(X[y.flatten() == 0, 1], X[y.flatten() == 0, 2], color='blue', alpha=0.7, label='Class 0')
plt.scatter(X[y.flatten() == 1, 1], X[y.flatten() == 1, 2], color='red', alpha=0.7, label='Class 1')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Binary Classification Dataset')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### 4. The Sigmoid Function

Logistic regression uses the sigmoid function to map any real-valued number to a probability between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Here $z$ is the linear combination of the features: $z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

The hypothesis becomes: $h_\theta(x) = \sigma(z)$

In [None]:
def sigmoid(z):
    """
    Compute sigmoid function.
    
    Args:
        z: Input value(s) (scalar, vector, or matrix)
    
    Returns:
        sigmoid_value: Sigmoid of input (same shape as z)
    """
    # YOUR CODE HERE - Implement the sigmoid function
    # Hint: Use np.exp() and be careful with numerical stability
    sigmoid_value = None  # Replace with your implementation
    
    return sigmoid_value

# Test the sigmoid function
z_test = np.array([-5, -1, 0, 1, 5])
sigmoid_test = sigmoid(z_test)
print(f"Sigmoid values for z = {z_test}: {sigmoid_test}")

# Plot sigmoid function
z_range = np.linspace(-10, 10, 100)
sigmoid_range = sigmoid(z_range)

plt.figure(figsize=(10, 6))
plt.plot(z_range, sigmoid_range, 'b-', linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.7, label='Decision Boundary (0.5)')
plt.axvline(x=0, color='k', linestyle='--', alpha=0.7)
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
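
The hint in the `sigmoid` stub mentions numerical stability. The following is a minimal sketch of one way to handle it, not the only valid solution. The helper name `sigmoid_stable` is hypothetical and not part of the exercise. It only ever calls `np.exp` on non-positive arguments, so very large or very small `z` cannot overflow.

```python
import numpy as np

def sigmoid_stable(z):
    """Sigmoid that avoids overflow (hypothetical helper, one possible solution)."""
    z = np.asarray(z, dtype=float)
    # exp(-|z|) always lies in (0, 1], so it cannot overflow
    e = np.exp(-np.abs(z))
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

values = sigmoid_stable(np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0]))
print(values)
```

Both branches are algebraically the same function; the split only decides which form is evaluated so that the exponent stays non-positive.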

### 5. Logistic Regression Hypothesis

Now we'll implement the logistic regression hypothesis function.

In [None]:
def predict_proba(X, theta):
    """
    Compute probability predictions using logistic regression.
    
    Args:
        X: Feature matrix (m x n)
        theta: Parameter vector (n x 1)
    
    Returns:
        probabilities: Probability predictions (m x 1)
    """
    z = np.dot(X, theta)
    return sigmoid(z)

def predict(X, theta, threshold=0.5):
    """
    Make binary predictions using logistic regression.
    
    Args:
        X: Feature matrix (m x n)
        theta: Parameter vector (n x 1)
        threshold: Classification threshold
    
    Returns:
        predictions: Binary predictions (m x 1)
    """
    probabilities = predict_proba(X, theta)
    return (probabilities >= threshold).astype(int)

# Test the prediction functions
theta_test = np.array([[0.5], [1.0], [-0.5]])
proba_test = predict_proba(X[:5], theta_test)
pred_test = predict(X[:5], theta_test)

print(f"Test probabilities: {proba_test.flatten()}")
print(f"Test predictions: {pred_test.flatten()}")
print(f"Actual labels: {y[:5].flatten()}")

### 6. Logistic Regression Cost Function

For logistic regression, we use the logistic loss (cross-entropy loss):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m [y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))]$$

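This cost function penalizes confident wrong predictions heavily: as the predicted probability of the true class approaches 0, the loss for that example grows without bound, while a confident correct prediction contributes a loss close to 0.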

In [None]:
def compute_cost(X, y, theta):
    """
    Compute the logistic regression cost function.
    
    Args:
        X: Feature matrix (m x n)
        y: Target values (m x 1)
        theta: Parameter vector (n x 1)
    
    Returns:
        cost: The cost value (scalar)
    """
    m = len(y)
    
    # YOUR CODE HERE - Implement the logistic cost function
    # Hint: Use predict_proba and be careful with log(0) issues
    probabilities = None  # Replace with your implementation
    cost = None  # Replace with your implementation
    
    return cost

# Test the cost function
theta_test = np.array([[0.0], [0.0], [0.0]])
cost_test = compute_cost(X, y, theta_test)
print(f"Cost with zero theta: {cost_test:.4f}")

# Test with random theta
theta_random = np.random.randn(3, 1) * 0.1
cost_random = compute_cost(X, y, theta_random)
print(f"Cost with random theta: {cost_random:.4f}")
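
The `compute_cost` stub warns about `log(0)`. As a hedged sketch of one possible approach, the predicted probabilities can be clipped to a small interval before taking logarithms. The function name `cross_entropy_cost`, the constant `eps`, and the tiny dataset below are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    # Overflow-safe sigmoid: the exponent passed to np.exp is never positive
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def cross_entropy_cost(X, y, theta, eps=1e-15):
    """Mean logistic loss; eps is an assumed clipping constant that keeps log() finite."""
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1 - eps)
    return float(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))

# Hand-made data (illustrative values only), intercept column first
X_demo = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]])
y_demo = np.array([[1], [0], [1], [0]])
cost_zero = cross_entropy_cost(X_demo, y_demo, np.zeros((2, 1)))
print(f"Cost at theta = 0: {cost_zero:.4f}")
```

With all parameters at zero every prediction is 0.5, so the cost equals $\ln 2 \approx 0.6931$ regardless of the data. This is a handy sanity check for any implementation.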

### 7. Gradient Descent for Logistic Regression

The gradient for logistic regression is:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$

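This has the same form as the gradient for linear regression. The only difference is that $h_\theta(x)$ is now the sigmoid of the linear combination rather than the linear combination itself.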

In [None]:
def gradient_descent(X, y, theta, alpha, num_iterations):
    """
    Perform gradient descent for logistic regression.
    
    Args:
        X: Feature matrix (m x n)
        y: Target values (m x 1)
        theta: Initial parameter vector (n x 1)
        alpha: Learning rate
        num_iterations: Number of iterations
    
    Returns:
        theta: Optimized parameter vector
        cost_history: List of cost values over iterations
    """
    m = len(y)
    cost_history = []
    
    for iteration in range(num_iterations):
        # YOUR CODE HERE - Implement gradient descent for logistic regression
        # Hint: The gradient computation is similar to linear regression
        probabilities = None  # Replace with your implementation
        errors = None  # Replace with your implementation
        gradient = None  # Replace with your implementation
        theta = None  # Replace with your implementation
        
        cost = compute_cost(X, y, theta)
        cost_history.append(cost)
        
        if iteration % 100 == 0:
            print(f"Iteration {iteration}: Cost = {cost:.4f}")
    
    return theta, cost_history

# Initialize theta with zeros
theta_initial = np.zeros((X.shape[1], 1))

# Train the model
alpha = 0.1
num_iterations = 1000
theta_optimized, cost_history = gradient_descent(X, y, theta_initial, alpha, num_iterations)

print(f"\nOptimized theta: {theta_optimized.flatten()}")
print(f"Final cost: {cost_history[-1]:.4f}")
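
One way the update loop could be filled in is sketched below in vectorized form. It is an illustration under stated assumptions rather than the reference solution: the name `gradient_descent_sketch` and the small separable dataset are hypothetical, and the sketch omits the cost history for brevity.

```python
import numpy as np

def sigmoid(z):
    # Overflow-safe sigmoid: the exponent passed to np.exp is never positive
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def gradient_descent_sketch(X, y, theta, alpha, num_iterations):
    """Batch gradient descent on the unregularized logistic cost."""
    m = len(y)
    for _ in range(num_iterations):
        errors = sigmoid(X @ theta) - y      # h_theta(x) - y, shape (m x 1)
        gradient = (X.T @ errors) / m        # shape (n x 1)
        theta = theta - alpha * gradient     # simultaneous update of all theta_j
    return theta

# Illustrative data: one feature, label is 1 exactly when the feature is positive
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X_demo = np.column_stack([np.ones(200), x])
y_demo = (x > 0).astype(float)

theta_fit = gradient_descent_sketch(X_demo, y_demo, np.zeros((2, 1)), alpha=0.1, num_iterations=1000)
accuracy = np.mean((sigmoid(X_demo @ theta_fit) >= 0.5) == y_demo)
print(f"theta = {theta_fit.ravel()}, training accuracy = {accuracy:.3f}")
```

The matrix product `X.T @ errors` computes the sum over all examples for every $\theta_j$ at once, which matches the gradient formula above term by term.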

### 8. Model Evaluation

Let's evaluate our logistic regression model using various metrics.

In [None]:
# Make predictions
y_pred_proba = predict_proba(X, theta_optimized)
y_pred = predict(X, theta_optimized)

# Calculate accuracy
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Classification report
print("\nClassification Report:")
print(classification_report(y, y_pred))

# Confusion matrix
cm = confusion_matrix(y, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix')
plt.show()
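
To make the numbers in the report and the confusion matrix less opaque, the sketch below derives accuracy, precision, and recall by hand from the four confusion-matrix counts. The label arrays are made-up values for illustration, not output from the model above.

```python
import numpy as np

# Illustrative labels and predictions (made up for this example)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}")
```

In the heatmap above, rows are actual classes and columns are predicted classes, so the four counts correspond to its four cells.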

### 9. Decision Boundary Visualization

Let's visualize the decision boundary learned by our logistic regression model.

In [None]:
def plot_decision_boundary(X, y, theta):
    """
    Plot the decision boundary for logistic regression.
    
    Args:
        X: Feature matrix
        y: Target values
        theta: Parameter vector
    """
    # Plot data points
    plt.scatter(X[y.flatten() == 0, 1], X[y.flatten() == 0, 2], color='blue', alpha=0.7, label='Class 0')
    plt.scatter(X[y.flatten() == 1, 1], X[y.flatten() == 1, 2], color='red', alpha=0.7, label='Class 1')
    
    # Create grid for decision boundary
    x1_min, x1_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    x2_min, x2_max = X[:, 2].min() - 1, X[:, 2].max() + 1
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 100),
                          np.linspace(x2_min, x2_max, 100))
    
    # Create feature matrix for grid points
    grid_points = np.column_stack([np.ones(xx1.ravel().shape[0]), xx1.ravel(), xx2.ravel()])
    
    # Predict probabilities for grid points
    Z = predict_proba(grid_points, theta)
    Z = Z.reshape(xx1.shape)
    
    # Plot filled probability contours first so the boundary line is drawn on top
    contour_fill = plt.contourf(xx1, xx2, Z, alpha=0.3, cmap='RdYlBu_r')
    
    # Plot decision boundary (where probability = 0.5)
    plt.contour(xx1, xx2, Z, levels=[0.5], colors='black', linewidths=2)
    
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Logistic Regression Decision Boundary')
    plt.legend()
    # Pass the filled contour explicitly so the colorbar reflects the probabilities
    plt.colorbar(contour_fill, label='Probability of Class 1')
    plt.grid(True, alpha=0.3)

plt.figure(figsize=(10, 8))
plot_decision_boundary(X, y, theta_optimized)
plt.show()
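
With two features, the 0.5 contour is a straight line that can also be written in closed form: the boundary is where $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$. The sketch below solves this for $x_2$. The helper `boundary_x2` is hypothetical, and it assumes $\theta_2 \neq 0$. The parameter values reuse the illustrative `theta_test` from the hypothesis section rather than a trained model.

```python
import numpy as np

def boundary_x2(x1, theta):
    """Solve theta_0 + theta_1*x1 + theta_2*x2 = 0 for x2 (assumes theta_2 != 0)."""
    theta = np.asarray(theta, dtype=float).ravel()
    x1 = np.asarray(x1, dtype=float)
    return -(theta[0] + theta[1] * x1) / theta[2]

theta_example = np.array([[0.5], [1.0], [-0.5]])  # illustrative values only
x1_values = np.array([-1.0, 0.0, 1.0])
x2_values = boundary_x2(x1_values, theta_example)
print(x2_values)  # [-1.  1.  3.]

# Every point on the line has z = 0, i.e. a predicted probability of exactly 0.5
z_on_line = theta_example[0, 0] + theta_example[1, 0] * x1_values + theta_example[2, 0] * x2_values
print(z_on_line)
```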

### 10. Regularization

Regularization helps prevent overfitting by adding a penalty term to the cost function:

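- **L2 Regularization (Ridge)**: $J(\theta) = J_{\text{original}}(\theta) + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$
- **L1 Regularization (Lasso)**: $J(\theta) = J_{\text{original}}(\theta) + \frac{\lambda}{2m} \sum_{j=1}^n |\theta_j|$

Here λ ≥ 0 is the regularization parameter that controls the strength of the penalty, and $J_{\text{original}}$ is the unregularized cross-entropy cost. Both sums start at $j = 1$, so the intercept $\theta_0$ is not penalized. The code in this section implements the L2 variant.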

In [None]:
def compute_cost_regularized(X, y, theta, lambda_reg):
    """
    Compute regularized logistic regression cost function.
    
    Args:
        X: Feature matrix (m x n)
        y: Target values (m x 1)
        theta: Parameter vector (n x 1)
        lambda_reg: Regularization parameter
    
    Returns:
        cost: Regularized cost value
    """
    m = len(y)
    
    # Original cost
    probabilities = predict_proba(X, theta)
    original_cost = -np.mean(y * np.log(probabilities + 1e-15) + (1 - y) * np.log(1 - probabilities + 1e-15))
    
    # Regularization term (L2)
    reg_term = (lambda_reg / (2 * m)) * np.sum(theta[1:] ** 2)  # Exclude theta_0
    
    return original_cost + reg_term

def gradient_descent_regularized(X, y, theta, alpha, num_iterations, lambda_reg):
    """
    Perform regularized gradient descent for logistic regression.
    
    Args:
        X: Feature matrix (m x n)
        y: Target values (m x 1)
        theta: Initial parameter vector (n x 1)
        alpha: Learning rate
        num_iterations: Number of iterations
        lambda_reg: Regularization parameter
    
    Returns:
        theta: Optimized parameter vector
        cost_history: List of cost values over iterations
    """
    m = len(y)
    cost_history = []
    
    for iteration in range(num_iterations):
        probabilities = predict_proba(X, theta)
        errors = probabilities - y
        
        # Regularized gradient
        gradient = (1 / m) * np.dot(X.T, errors)
        gradient[1:] += (lambda_reg / m) * theta[1:]  # Regularization for theta_1 to theta_n
        
        theta = theta - alpha * gradient
        
        cost = compute_cost_regularized(X, y, theta, lambda_reg)
        cost_history.append(cost)
        
        if iteration % 100 == 0:
            print(f"Iteration {iteration}: Cost = {cost:.4f}")
    
    return theta, cost_history

# Compare regularized vs non-regularized models
lambda_values = [0, 0.01, 0.1, 1.0]
models = []

for lambda_reg in lambda_values:
    theta_reg = np.zeros((X.shape[1], 1))
    if lambda_reg == 0:
        theta_reg_opt, cost_reg_history = gradient_descent(X, y, theta_reg, alpha, num_iterations)
    else:
        theta_reg_opt, cost_reg_history = gradient_descent_regularized(X, y, theta_reg, alpha, num_iterations, lambda_reg)
    
    models.append({
        'lambda': lambda_reg,
        'theta': theta_reg_opt,
        'cost_history': cost_reg_history,
        'final_cost': cost_reg_history[-1]
    })
    
    print(f"Lambda = {lambda_reg}: Final theta = {theta_reg_opt.flatten()}, Final cost = {cost_reg_history[-1]:.4f}")
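
The functions in this section cover only the L2 penalty. For comparison, here is a rough sketch of how the L1 term from the formula above and a matching gradient contribution could look. It uses the same $\frac{\lambda}{2m}$ scaling as that formula. The names `l1_penalty` and `l1_subgradient` are hypothetical. Because $|\theta_j|$ is not differentiable at 0, the sketch uses `np.sign`, which returns 0 there and therefore yields a valid subgradient.

```python
import numpy as np

def l1_penalty(theta, lambda_reg, m):
    """L1 penalty term, skipping the intercept theta_0."""
    return (lambda_reg / (2 * m)) * np.sum(np.abs(theta[1:]))

def l1_subgradient(theta, lambda_reg, m):
    """Subgradient of the L1 penalty; the entry for theta_0 stays zero."""
    grad = np.zeros_like(theta, dtype=float)
    grad[1:] = (lambda_reg / (2 * m)) * np.sign(theta[1:])
    return grad

theta_demo = np.array([[2.0], [-3.0], [0.0], [1.5]])  # illustrative values only
penalty = l1_penalty(theta_demo, lambda_reg=1.0, m=10)
subgrad = l1_subgradient(theta_demo, lambda_reg=1.0, m=10)
print(penalty)
print(subgrad.ravel())
```

Unlike the L2 term, whose gradient shrinks as a parameter approaches zero, the L1 subgradient has constant magnitude. This is the usual explanation for why L1 tends to drive some parameters exactly to zero.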

### 11. Regularization Comparison

Let's compare the performance and decision boundaries of regularized vs non-regularized models.

In [None]:
# Plot cost histories for different regularization strengths
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for model in models:
    plt.plot(model['cost_history'], label=f'λ = {model["lambda"]}')
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('Cost vs Iterations (Different Regularization)')
plt.legend()
plt.grid(True, alpha=0.3)

# Right panel: decision boundary of the non-regularized model (λ = 0) only,
# to avoid overcrowding; all four models are plotted separately in the next figure
plt.subplot(1, 2, 2)
plot_decision_boundary(X, y, models[0]['theta'])
plt.title(f'Decision Boundary (λ = {models[0]["lambda"]})')

plt.tight_layout()
plt.show()

# Plot separate decision boundaries
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

for i, model in enumerate(models):
    plt.sca(axes[i])
    plot_decision_boundary(X, y, model['theta'])
    plt.title(f'Decision Boundary (λ = {model["lambda"]})')

plt.tight_layout()
plt.show()

### 12. Experimentation and Questions

Now let's explore some important concepts:

1. **Regularization Effects**: How does increasing λ affect the decision boundary and model parameters?

2. **Overfitting Prevention**: Create a more complex dataset (with polynomial features) and see how regularization helps prevent overfitting.

3. **Threshold Tuning**: Experiment with different classification thresholds (not just 0.5) and see how it affects precision and recall.

4. **Feature Scaling**: How does feature scaling affect logistic regression training?

5. **Multi-class Classification**: How would you extend logistic regression for multi-class problems?

**Challenge**: Implement polynomial feature expansion and compare regularized vs non-regularized performance.
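
As a possible starting point for the challenge, the sketch below expands two features into all monomials up to a given degree. The function name `polynomial_features` and the column ordering are assumptions for illustration; any consistent ordering works. The first output column is all ones, so it can replace the intercept column added earlier.

```python
import numpy as np

def polynomial_features(x1, x2, degree):
    """Return columns x1^i * x2^j for all 0 <= i + j <= degree (first column is all ones)."""
    x1 = np.asarray(x1, dtype=float).ravel()
    x2 = np.asarray(x2, dtype=float).ravel()
    columns = []
    for total in range(degree + 1):
        for j in range(total + 1):
            # Powers of x1 and x2 in this term always sum to `total`
            columns.append((x1 ** (total - j)) * (x2 ** j))
    return np.column_stack(columns)

# Two illustrative samples: (x1, x2) = (1, 3) and (2, 4)
X_poly = polynomial_features([1.0, 2.0], [3.0, 4.0], degree=2)
print(X_poly.shape)  # (2, 6): 1, x1, x2, x1^2, x1*x2, x2^2
print(X_poly)
```

The expanded matrix could then be passed to `gradient_descent` and `gradient_descent_regularized` with a `theta` of matching length to compare the two decision boundaries.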

In [None]:
# Experiment: Different classification thresholds
from sklearn.metrics import precision_score, recall_score

thresholds = [0.3, 0.5, 0.7]
accuracies = []
precisions = []
recalls = []

for threshold in thresholds:
    y_pred_thresh = predict(X, theta_optimized, threshold)
    
    acc = accuracy_score(y, y_pred_thresh)
    prec = precision_score(y, y_pred_thresh)
    rec = recall_score(y, y_pred_thresh)
    
    accuracies.append(acc)
    precisions.append(prec)
    recalls.append(rec)
    
    print(f"Threshold {threshold}: Accuracy={acc:.3f}, Precision={prec:.3f}, Recall={rec:.3f}")

# Plot threshold effects
plt.figure(figsize=(10, 6))
plt.plot(thresholds, accuracies, 'b-o', label='Accuracy')
plt.plot(thresholds, precisions, 'r-o', label='Precision')
plt.plot(thresholds, recalls, 'g-o', label='Recall')
plt.xlabel('Classification Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Tradeoff')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
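
For question 4 on feature scaling, a minimal standardization sketch is shown below. It assumes the scaling is applied to the raw feature columns only (for example `X[:, 1:]`), not to the intercept column of ones. The helper `standardize` and the sample values are hypothetical.

```python
import numpy as np

def standardize(features):
    """Scale each column to zero mean and unit standard deviation."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled instead of dividing by zero
    return (features - mean) / std, mean, std

# Illustrative features on very different scales
raw = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
scaled, mean, std = standardize(raw)
print(scaled)
print(mean, std)
```

Returning `mean` and `std` lets the same transformation be reused on new data. When features have comparable scales, gradient descent typically tolerates a larger learning rate and converges in fewer iterations.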

### Key Takeaways

1. **Logistic Regression** is used for binary classification and outputs probabilities between 0 and 1.

2. **Sigmoid Function** maps linear combinations to probabilities, enabling probabilistic predictions.

3. **Logistic Cost Function** (cross-entropy) is convex in $\theta$ for logistic regression, so gradient descent is not trapped in local minima.

4. **Regularization** prevents overfitting by penalizing large parameter values.

5. **Decision Boundaries** are linear in the input features; adding polynomial or other engineered features makes them non-linear in the original feature space.

### Next Steps

In the Advanced Learning Algorithms section, we'll learn about neural networks, which can learn complex non-linear decision boundaries automatically without manual feature engineering.