## **Lecture 5. Linear Regression**

### **Summary of Linear Classifiers and the Perceptron Algorithm**

## **1. Introduction to Linear Classifiers**
- **Goal**: Learn a function that maps input feature vectors $ x $ to binary labels $ y \in \{-1, +1\} $.
- **Definition**:
   - A **linear classifier** divides the feature space into two regions using a hyperplane:
     $
     h(x) = \text{sign}(\theta^\top x + \theta_0)
     $
     - $ \theta $: Weight vector (determines the direction of separation).
     - $ \theta_0 $: Offset (shifts the hyperplane).
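
The decision rule above can be sketched in a few lines of plain Python. The function name `predict` and the convention of labeling points exactly on the hyperplane as $ -1 $ are assumptions made for this illustration, not part of the lecture.

```python
def predict(theta, theta_0, x):
    """Return +1 or -1 for feature vector x under h(x) = sign(theta . x + theta_0)."""
    score = sum(t * xi for t, xi in zip(theta, x)) + theta_0
    # Assumed tie-breaking convention: a score of exactly 0 is labeled -1.
    return 1 if score > 0 else -1


theta, theta_0 = [1.0, -1.0], 0.5
print(predict(theta, theta_0, [2.0, 1.0]))  # score = 1.5, so the label is +1
print(predict(theta, theta_0, [0.0, 2.0]))  # score = -1.5, so the label is -1
```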

---

## **2. Linear Classifier Properties**
- **Linear Separability**:
   - A dataset is linearly separable if a hyperplane exists that separates positive and negative labels.
- **Training and Test Error**:
   - **Training Error**: Fraction of misclassified training samples.
   - **Test Error**: Fraction of misclassified unseen samples.
   - **Goal**: Minimize test error by choosing a classifier that generalizes well.
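
As a minimal sketch of the error definitions above, the following computes the fraction of misclassified examples. The name `classification_error` and the toy data are hypothetical; the same function gives the training error when applied to training data and an estimate of the test error when applied to held-out data.

```python
def classification_error(theta, theta_0, examples):
    """Fraction of (x, y) pairs with y * (theta . x + theta_0) <= 0."""
    mistakes = 0
    for x, y in examples:
        score = sum(t * xi for t, xi in zip(theta, x)) + theta_0
        # A point on the decision boundary (score == 0) is counted as a mistake.
        if y * score <= 0:
            mistakes += 1
    return mistakes / len(examples)


# Hypothetical toy data: three examples are classified correctly, one is not.
data = [([2.0, 1.0], 1), ([1.0, 3.0], -1), ([3.0, 0.5], 1), ([0.5, 2.0], 1)]
print(classification_error([1.0, -1.0], 0.0, data))  # 0.25
```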

---

## **3. Perceptron Algorithm**
- **Objective**: Find a linear classifier with zero training error, given a linearly separable dataset.
- **Algorithm Steps**:
  1. **Initialization**:
     - Set $ \theta = 0 $, $ \theta_0 = 0 $ (optional offset).
  2. **Iterate Through Training Examples**:
     - For each example $ (x_i, y_i) $ that is misclassified, i.e. $ y_i (\theta^\top x_i + \theta_0) \leq 0 $, update:
       $
       \theta \leftarrow \theta + y_i x_i
       $
       $
       \theta_0 \leftarrow \theta_0 + y_i
       $
  3. **Repeat Until Convergence**:
     - Iterate through the dataset multiple times until no misclassifications occur.

- **Key Insight**:
   - Each update adjusts the hyperplane toward correctly classifying the current example.
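
The steps above can be sketched as follows. This is an illustrative implementation under stated assumptions: the name `perceptron`, the `max_epochs` safety cap, and the toy dataset are not from the lecture, and a point with a score of exactly 0 is treated as misclassified so that the all-zero initialization triggers a first update.

```python
def perceptron(examples, max_epochs=100):
    """Run the perceptron with offset on a list of (x, y) pairs, y in {-1, +1}."""
    dim = len(examples[0][0])
    theta = [0.0] * dim
    theta_0 = 0.0
    for _ in range(max_epochs):  # assumed cap, in case the data is not separable
        mistakes = 0
        for x, y in examples:
            score = sum(t * xi for t, xi in zip(theta, x)) + theta_0
            if y * score <= 0:  # misclassified (or exactly on the boundary)
                theta = [t + y * xi for t, xi in zip(theta, x)]
                theta_0 += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no updates: converged
            break
    return theta, theta_0


# Hypothetical linearly separable toy data.
data = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
theta, theta_0 = perceptron(data)
print(theta, theta_0)
```

On non-separable data the inner loop never completes a pass without updates, so the cap on epochs is what ends the run.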

---

## **4. Perceptron Algorithm Analysis**
- **Correctness**:
   - If the data is linearly separable, the perceptron algorithm converges to a solution in a finite number of updates.
- **Limitations**:
   - The algorithm does not converge on datasets that are not linearly separable; it keeps making updates indefinitely.
   - There are infinitely many valid solutions, and the algorithm does not optimize for margins (distance between points and the hyperplane).

---

## **5. Loss Functions for Linear Classifiers**
- **Training Error**:
   - Measures whether predictions match labels, but it is piecewise constant and so provides no useful gradient for optimization.
- **Perceptron Loss**:
   - Penalizes only misclassified points, in proportion to how far they lie on the wrong side of the hyperplane:
     $
     \text{Loss} = \max(0, -y (\theta^\top x + \theta_0))
     $
- **Hinge Loss** (used in SVMs):
   - Encourages a margin of at least 1 for correctly classified points:
     $
     \text{Loss} = \max(0, 1 - y (\theta^\top x + \theta_0))
     $
   - Differs from perceptron loss by also penalizing correctly classified points that fall inside the margin, i.e. those with $ 0 < y (\theta^\top x + \theta_0) < 1 $.
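
To make the comparison concrete, here is a small sketch that evaluates three losses as functions of the agreement $ z = y (\theta^\top x + \theta_0) $. The function names are hypothetical, and the perceptron loss is assumed to take its standard form $ \max(0, -z) $.

```python
def zero_one_loss(z):
    """Training-error indicator: 1 for a mistake, 0 otherwise."""
    return 1.0 if z <= 0 else 0.0


def perceptron_loss(z):
    """Assumed standard form: nonzero only for misclassified points."""
    return max(0.0, -z)


def hinge_loss(z):
    """Nonzero whenever the agreement falls below the margin value 1."""
    return max(0.0, 1.0 - z)


# z = -1: misclassified; z = 0.5: correct but inside the margin; z = 2: well classified.
for z in (-1.0, 0.5, 2.0):
    print(z, zero_one_loss(z), perceptron_loss(z), hinge_loss(z))
```

At $ z = 0.5 $ only the hinge loss is nonzero, which is exactly the margin requirement that separates it from the other two.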

---

## **6. Advantages and Limitations of Linear Classifiers**
### **Advantages**:
1. **Simplicity**: Computationally efficient and easy to implement.
2. **Interpretability**: Decisions are based on a linear combination of features.
3. **Scalability**: Works well with high-dimensional data.

### **Limitations**:
1. **Limited Flexibility**:
   - Linear classifiers cannot reach zero training error on datasets that are not linearly separable.
2. **Sensitivity to Outliers**:
   - Outliers can significantly affect the learned hyperplane.

---

## **7. Applications of Linear Classifiers**
- Widely used for binary classification tasks such as:
  - Spam detection.
  - Sentiment analysis.
  - Disease diagnosis.

---

## **8. Key Takeaways**
- Linear classifiers divide the feature space using a hyperplane and are suitable for linearly separable data.
- The perceptron algorithm is a simple method for finding a linear classifier but cannot handle non-linear problems or optimize for margins.
- Advanced methods like SVMs (support vector machines) extend linear classifiers with margin optimization and kernel techniques for non-linear problems.

## **Lecture 3. Hinge Loss, Margin Boundaries and Regularization**

### **Summary of Large Margin Classification and Hinge Loss**

## **1. Introduction to Large Margin Classification**
- **Goal**: Find a linear classifier that maximizes the **margin** (distance between decision boundary and closest points).
- Large margin classifiers are more **robust to noise** in the training data compared to classifiers that tightly fit the data.

---

## **2. Margin Boundaries**
- **Decision Boundary**: Defined as the set of points satisfying:
   $
   \theta^\top x + \theta_0 = 0
   $
- **Margin Boundaries**: Hyperplanes parallel to the decision boundary and equidistant from it, one on each side:
   - Positive margin boundary: $ \theta^\top x + \theta_0 = 1 $
   - Negative margin boundary: $ \theta^\top x + \theta_0 = -1 $
- Distance between margin boundaries is inversely proportional to the norm of $ \theta $:
   $
   \text{Margin} = \frac{2}{\|\theta\|}
   $
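
A quick numerical check of this formula, as a sketch (the name `margin_width` and the example vectors are chosen for illustration):

```python
import math


def margin_width(theta):
    """Distance between the two margin boundaries: 2 / ||theta||."""
    return 2.0 / math.sqrt(sum(t * t for t in theta))


print(margin_width([3.0, 4.0]))  # ||theta|| = 5, so the width is 0.4
print(margin_width([6.0, 8.0]))  # doubling theta halves the width to 0.2
```

Scaling $ \theta $ up shrinks the margin, which is why penalizing $ \|\theta\| $ favors wide margins.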

---

## **3. Objective for Large Margin Classification**
- The goal is to maximize the margin while ensuring correct classification of training data:
   - Regularization term: Encourages a large margin by minimizing $ \|\theta\|^2 $.
   - Loss term: Penalizes misclassified or boundary-violating points.

### **Hinge Loss**:
- Hinge loss penalizes points inside or on the wrong side of the margin boundary:
   $
   \text{Hinge Loss} = \max(0, 1 - y (\theta^\top x + \theta_0))
   $
- **Interpretation**:
   - Loss is $ 0 $ when $ y (\theta^\top x + \theta_0) \geq 1 $ (correctly classified with margin).
   - Increases linearly when points violate the margin.

---

## **4. Regularization and Trade-off**
- **Objective Function**: Balances loss and margin regularization:
   $
   J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^n \max(0, 1 - y_i (\theta^\top x_i + \theta_0)) + \frac{\lambda}{2} \|\theta\|^2
   $
   - First term: Average hinge loss over all training examples.
   - Second term: Regularization penalty (controls margin size).

- **Regularization Parameter ($ \lambda $)**:
   - Large $ \lambda $: Favors large margins but allows some training loss.
   - Small $ \lambda $: Prioritizes minimizing training loss, possibly at the cost of a small margin.
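
The objective $ J(\theta, \theta_0) $ can be evaluated directly. The sketch below assumes examples are stored as `(x, y)` pairs; the name `objective`, the parameter name `lam`, and the toy data are hypothetical.

```python
def objective(theta, theta_0, examples, lam):
    """Average hinge loss plus (lam / 2) * ||theta||^2."""
    hinge_total = 0.0
    for x, y in examples:
        z = y * (sum(t * xi for t, xi in zip(theta, x)) + theta_0)
        hinge_total += max(0.0, 1.0 - z)
    regularizer = 0.5 * lam * sum(t * t for t in theta)
    return hinge_total / len(examples) + regularizer


# Hypothetical toy data: only the second point (z = 0.5) violates the margin.
data = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-1.0, 0.0], -1)]
print(objective([1.0, 0.0], 0.0, data, lam=0.1))  # 0.5 / 3 + 0.05
```

Note that the offset $ \theta_0 $ does not appear in the regularization term, matching the formula above.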

---

## **5. Optimization for Large Margin Classifiers**
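- **Gradient-Based Updates**:
   - Minimize the objective function iteratively using (stochastic) gradient descent.
   - Update rule for $ \theta $ on a training example $ (x, y) $:
     $
     \theta \leftarrow (1 - \eta \lambda) \theta + \eta \cdot y \cdot x \quad \text{if } y (\theta^\top x + \theta_0) < 1
     $
     $
     \theta \leftarrow (1 - \eta \lambda) \theta \quad \text{otherwise}
     $
   - $ \eta $: Learning rate that controls step size.

- **Steps**:
   1. Initialize $ \theta $ and $ \theta_0 $.
   2. For each point that violates the margin ($ y (\theta^\top x + \theta_0) < 1 $), update $ \theta $ to reduce hinge loss.
   3. Regularize by shrinking $ \theta $ by the factor $ (1 - \eta \lambda) $ at every step, which keeps the margin large.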
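
As a hedged sketch of where such an update comes from, consider the regularized objective restricted to one sampled example $ (x, y) $ (the symbol $ J_i $ is introduced here for illustration only):
$
J_i(\theta, \theta_0) = \max(0, 1 - y (\theta^\top x + \theta_0)) + \frac{\lambda}{2} \|\theta\|^2
$
Its (sub)gradient with respect to $ \theta $ is $ \lambda \theta - y x $ when $ y (\theta^\top x + \theta_0) < 1 $ and $ \lambda \theta $ otherwise, so a gradient step with learning rate $ \eta $ gives:
$
\theta \leftarrow (1 - \eta \lambda) \theta + \eta \, y \, x \quad \text{if } y (\theta^\top x + \theta_0) < 1, \qquad \theta \leftarrow (1 - \eta \lambda) \theta \quad \text{otherwise}
$
Setting $ \lambda = 0 $ leaves only the hinge-loss step $ \theta \leftarrow \theta + \eta \, y \, x $. The factor $ (1 - \eta \lambda) $ is the scaling of $ \theta $ referred to in step 3.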

---

## **6. Geometric Interpretation**
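- **Signed Distance**:
   - Measures how far a point lies from the decision boundary: $ y (\theta^\top x + \theta_0) / \|\theta\| $.
   - Positive distance: The point is on the correct side of the decision boundary.
   - Negative distance: The point is misclassified.
   - Distance below $ 1 / \|\theta\| $: The point violates the margin.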

- **Robustness**:
   - Large margins ensure better generalization to test data and robustness to small perturbations.
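
The signed distance discussed in this section can be computed as $ y (\theta^\top x + \theta_0) / \|\theta\| $. The sketch below assumes that definition; the function name and the example points are hypothetical.

```python
import math


def signed_distance(theta, theta_0, x, y):
    """Label-adjusted distance of x from the hyperplane theta . x + theta_0 = 0."""
    norm = math.sqrt(sum(t * t for t in theta))
    return y * (sum(t * xi for t, xi in zip(theta, x)) + theta_0) / norm


theta, theta_0 = [3.0, 4.0], -5.0  # ||theta|| = 5, margin boundaries at distance 0.2
print(signed_distance(theta, theta_0, [3.0, 4.0], 1))  # 4.0: far on the correct side
print(signed_distance(theta, theta_0, [1.0, 1.0], 1))  # 0.4: correct, outside the margin
print(signed_distance(theta, theta_0, [0.0, 0.0], 1))  # -1.0: wrong side of the boundary
```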

---

## **7. Key Takeaways**
- Large margin classification improves robustness by maximizing the distance between the decision boundary and the closest training points.
- **Hinge loss** quantifies how much a point violates the margin boundary, while regularization controls the size of the margin.
- The optimization objective balances hinge loss and regularization to find the best linear classifier.
- Gradient-based methods iteratively minimize the objective; because it is convex, they approach a global minimum when the learning rate is chosen suitably.


## **Lecture 4. Linear Classification and Generalization**

### **Summary of Regularization, Large Margins, and Stochastic Optimization**

## **1. Introduction to Regularization and Large Margins**
- **Objective**: Balance goodness-of-fit to data and intrinsic plausibility of a model using regularization.
- **Key Idea**:
   - Regularization discourages overreliance on features and favors simpler models by penalizing large weights (e.g., L2 norm: $ \|\theta\|^2 $).
   - In classification, maximizing the margin between classes improves generalization.

---

## **2. Regularization in Large Margin Classification**
- **Hinge Loss**:
   - Penalizes examples within the margin or misclassified:
     $
     \text{Loss} = \max(0, 1 - y (\theta^\top x + \theta_0))
     $
   - Zero loss for correctly classified points outside the margin.

- **Objective Function**:
   - Combines hinge loss with regularization:
     $
     J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^n \max(0, 1 - y_i (\theta^\top x_i + \theta_0)) + \frac{\lambda}{2} \|\theta\|^2
     $
   - $ \lambda $: Regularization parameter controlling the trade-off between margin size and fit to training data.

---

## **3. Effects of Regularization Parameter ($ \lambda $)**
- **High $ \lambda $**:
   - Emphasizes larger margins.
   - Allows for some training loss to prioritize generalization.
- **Low $ \lambda $**:
   - Reduces training loss but risks overfitting by shrinking the margin.

- **Training vs. Test Loss**:
   - **U-shaped Test Loss Curve**:
     - **Underfitting**: Too much regularization (high $ \lambda $) leads to poor fit.
     - **Overfitting**: Too little regularization (low $ \lambda $) fits training data too closely.
     - Optimal $ \lambda $ balances these effects to minimize test loss.

---

## **4. Stochastic Gradient Descent (SGD)**
- **Why Use SGD**:
   - Efficient for large datasets because each update uses a single training example or a small batch rather than the full dataset.

- **Update Rule**:
   $
   \theta \leftarrow \theta - \eta (\nabla \text{Loss} + \lambda \theta)
   $
   - $ \eta $: Learning rate.
   - Includes regularization gradient ($ \lambda \theta $) to shrink weights.

- **Steps**:
   1. Sample a training example $ (x_i, y_i) $.
   2. Compute the gradient of the hinge loss with respect to $ \theta $:
      $
      \nabla_\theta \text{Loss} = 
      \begin{cases} 
      0 & \text{if } y_i (\theta^\top x_i + \theta_0) \geq 1 \\
      -y_i x_i & \text{if } y_i (\theta^\top x_i + \theta_0) < 1 
      \end{cases}
      $
   3. Update $ \theta $.
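
The three steps above can be put together in a short loop. This is a minimal sketch, not the lecture's implementation: the name `sgd_hinge`, the constant learning rate, the fixed number of steps, the unregularized offset update, and the toy data are all assumptions (in practice a decreasing learning rate is common).

```python
import random


def sgd_hinge(examples, lam=0.01, eta=0.1, steps=2000, seed=0):
    """SGD on hinge loss plus (lam / 2) * ||theta||^2, one example per step."""
    rng = random.Random(seed)
    dim = len(examples[0][0])
    theta = [0.0] * dim
    theta_0 = 0.0
    for _ in range(steps):
        x, y = rng.choice(examples)  # step 1: sample an example
        z = y * (sum(t * xi for t, xi in zip(theta, x)) + theta_0)
        if z < 1:  # step 2: hinge gradient is -y * x inside the margin
            grad = [-y * xi for xi in x]
            grad_0 = -y
        else:  # zero gradient outside the margin
            grad = [0.0] * dim
            grad_0 = 0.0
        # Step 3: theta <- theta - eta * (grad + lam * theta); offset not regularized.
        theta = [t - eta * (g + lam * t) for t, g in zip(theta, grad)]
        theta_0 -= eta * grad_0
    return theta, theta_0


# Hypothetical linearly separable toy data.
data = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
theta, theta_0 = sgd_hinge(data)
print(theta, theta_0)
```

Even when no example violates the margin, the $ \lambda \theta $ term keeps shrinking $ \theta $, so updates never stop entirely with a constant learning rate.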

---

## **5. Quadratic Programming for Support Vector Machines (SVMs)**
- **Formulation**:
   - Exact solutions for maximum margin problems can be obtained using quadratic programming.
   - In the separable case, the constraints require every point to lie on or outside its margin boundary: $ y_i (\theta^\top x_i + \theta_0) \geq 1 $ for all $ i $.
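
For reference, the separable case is commonly written as the following quadratic program (the standard hard-margin formulation, stated here as an assumption about what the lecture intends):
$
\min_{\theta, \theta_0} \ \frac{1}{2} \|\theta\|^2 \quad \text{subject to} \quad y_i (\theta^\top x_i + \theta_0) \geq 1, \quad i = 1, \dots, n
$
Minimizing $ \|\theta\|^2 $ under these constraints is equivalent to maximizing the margin $ 2 / \|\theta\| $ while keeping every training point on or outside its margin boundary.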

---

## **6. Key Takeaways**
- **Regularization**:
   - Reduces overfitting by controlling model complexity (e.g., penalizing large weights).
   - The regularization parameter ($ \lambda $) balances margin size and training loss.
- **Large Margins**:
   - Improve robustness to noise and enhance generalization.
   - Margins are controlled by the regularization term $ \|\theta\|^2 $.
- **Optimization**:
   - Stochastic gradient descent efficiently optimizes the objective on large datasets by updating weights incrementally.
   - Quadratic programming offers exact solutions to the maximum margin problem but scales less well to large datasets than SGD.

