## **Lecture 5. Linear Regression**

### **Summary of Linear Regression and Regularization**

## **1. Introduction to Linear Regression**
- **Goal**: Learn a **linear mapping** from input features $ x $ to continuous values $ y $.
- **Setup**:
   - Input: Feature vectors $ x \in \mathbb{R}^d $.
   - Output: Continuous label $ y \in \mathbb{R} $.
   - **Linear Form**:  
     $
     f(x) = \theta^\top x + \theta_0
     $
     where $ \theta $ is the parameter vector and $ \theta_0 $ is the offset.

### **Why Linear Regression?**
- Though simple, linear regression works well when:
   - Features are carefully designed.
   - Problems are appropriately transformed into a **feature space** where linear functions suffice.

---

## **2. Defining the Objective**
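- The objective of linear regression is to minimize the **Empirical Risk**:
   - Measures the average deviation between predicted and true values on the training set (the offset $ \theta_0 $ is dropped here for simplicity; it can be absorbed by appending a constant feature $ 1 $ to $ x $):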
     $
     R_n(\theta) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left( y_i - \theta^\top x_i \right)^2
     $
- **Squared Error Loss**:
   - Penalizes larger deviations more heavily, ensuring sensitivity to large prediction errors.
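
As a minimal sketch of the empirical risk above (no offset term), the function below averages the half squared errors over a training set. The toy data and the name `empirical_risk` are made up for illustration.

```python
def empirical_risk(theta, xs, ys):
    """Average of half squared errors, R_n(theta), without an offset term."""
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(t * xi for t, xi in zip(theta, x))
        total += 0.5 * (y - prediction) ** 2
    return total / len(xs)

# Toy data (made up): y = 2 * x exactly, with one feature per example.
xs = [[1.0], [2.0], [3.0]]
ys = [2.0, 4.0, 6.0]

risk_perfect = empirical_risk([2.0], xs, ys)  # theta matches the data
risk_off = empirical_risk([1.0], xs, ys)      # residuals are 1, 2, 3
```

Because the loss is squared, the residual of 3 contributes nine times as much to `risk_off` as the residual of 1.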

### **Two Types of Mistakes**:
1. **Structural Errors**: Linear regression cannot model nonlinear relationships.
2. **Estimation Errors**: Limited or noisy data leads to poorly estimated parameters.

---

## **3. Solving Linear Regression**
### **Gradient-Based Approach**
- Iteratively update parameters in the direction of the negative gradient:
   $
   \theta \leftarrow \theta - \eta \nabla R_n(\theta)
   $
   where $ \eta $ is the learning rate.
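
The update rule can be sketched as follows, assuming the squared-error risk from Section 2 without an offset. The gradient is $ -\frac{1}{n} \sum_i (y_i - \theta^\top x_i) x_i $; the learning rate, step count, and toy data are illustrative choices, not values from the lecture.

```python
def risk_gradient(theta, xs, ys):
    """Gradient of the empirical risk: -(1/n) * sum of residual_i * x_i."""
    n = len(xs)
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        residual = y - sum(t * xi for t, xi in zip(theta, x))
        for j, xj in enumerate(x):
            grad[j] -= residual * xj / n
    return grad

def gradient_descent(xs, ys, eta=0.5, steps=200):
    """Repeat theta <- theta - eta * gradient, starting from zero."""
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = risk_gradient(theta, xs, ys)
        theta = [t - eta * g for t, g in zip(theta, grad)]
    return theta

# Toy data (made up) generated by theta = [1, 2] with no noise.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, 2.0, 3.0]
theta_gd = gradient_descent(xs, ys)
```

If $ \eta $ is too large the iterates diverge; if it is too small convergence is slow, so in practice the learning rate needs tuning.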

### **Closed-Form Solution**
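- The empirical risk is **convex** in $ \theta $, so setting its gradient to zero gives the minimizer analytically:
   $
   \theta = A^{-1} b
   $
   - $ A = \frac{1}{n} \sum_{i=1}^n x_i x_i^\top $: (Uncentered) second-moment matrix of the features, of size $ d \times d $.
   - $ b = \frac{1}{n} \sum_{i=1}^n y_i x_i $: Feature-label correlation vector (written $ b $ here to avoid a clash with the dimensionality $ d $).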

- **Conditions**:
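   - $ A $ must be invertible. This requires at least as many training examples as dimensions ($ n \geq d $) and, more precisely, that the training points $ x_i $ span $ \mathbb{R}^d $; $ n \geq d $ alone is not sufficient.
   - Computational cost: $ O(d^3) $ to invert the $ d \times d $ matrix $ A $, plus $ O(nd^2) $ to build it.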
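
A minimal sketch of the closed-form fit, assuming the feature matrix is the average of $ x_i x_i^\top $ and the feature-label vector is the average of $ y_i x_i $ (called `b` in the code to avoid clashing with the dimension $ d $). Rather than forming an explicit inverse, it solves the linear system with a small hand-written Gaussian elimination; the helper names are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A is singular (or nearly so); no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for c in range(col, size + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, size))
        z[r] = (M[r][size] - tail) / M[r][r]
    return z

def closed_form_fit(xs, ys):
    """Build A and b from the data, then solve A theta = b."""
    n, dim = len(xs), len(xs[0])
    A = [[sum(x[j] * x[k] for x in xs) / n for k in range(dim)] for j in range(dim)]
    b = [sum(y * x[j] for x, y in zip(xs, ys)) / n for j in range(dim)]
    return solve_linear_system(A, b)

# Toy data (made up) generated by theta = [1, 2] with no noise.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, 2.0, 3.0]
theta_hat = closed_form_fit(xs, ys)
```

With a single training point in two dimensions, for example `closed_form_fit([[1.0, 1.0]], [1.0])`, the matrix is singular and the sketch raises `ValueError`, which illustrates the invertibility condition.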

---

## **4. Regularization in Linear Regression**
- **Goal**: Improve generalization by preventing overfitting to noisy training data.

### **Ridge Regression (L2 Regularization)**
- Adds a **penalty** for large parameter values to the loss function:
   $
   R_n^\lambda(\theta) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left( y_i - \theta^\top x_i \right)^2 + \frac{\lambda}{2} \|\theta\|^2
   $
   - **Effect of $ \lambda $**:
     - Large $ \lambda $: Prioritizes small $ \theta $, ignoring small variations in the data.
     - Small $ \lambda $: Focuses on fitting training data more closely.

### **Regularized Gradient Update**:
- Adjust gradient updates to include regularization:
   $
   \theta \leftarrow \theta - \eta \left( \nabla R_n(\theta) + \lambda \theta \right)
   $
   - Pushes parameters $ \theta $ towards zero while optimizing the loss.
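
The regularized update can be sketched as below, assuming the ridge objective above without an offset. The learning rate, number of steps, and the two values of $ \lambda $ are illustrative choices only.

```python
def ridge_gradient_step(theta, xs, ys, eta, lam):
    """One update: theta <- theta - eta * (grad R_n(theta) + lam * theta)."""
    n = len(xs)
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        residual = y - sum(t * xi for t, xi in zip(theta, x))
        for j, xj in enumerate(x):
            grad[j] -= residual * xj / n
    return [t - eta * (g + lam * t) for t, g in zip(theta, grad)]

def fit_ridge(xs, ys, lam, eta=0.1, steps=2000):
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        theta = ridge_gradient_step(theta, xs, ys, eta, lam)
    return theta

# Toy data (made up): y = 2 * x exactly, so the unregularized fit is theta = 2.
xs = [[1.0], [2.0], [3.0]]
ys = [2.0, 4.0, 6.0]
theta_small_lam = fit_ridge(xs, ys, lam=0.01)  # stays close to 2
theta_large_lam = fit_ridge(xs, ys, lam=10.0)  # shrunk strongly toward 0
```

In this one-dimensional case the minimizer is $ b / (A + \lambda) $, so increasing $ \lambda $ visibly shrinks the fitted parameter.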

---

## **5. Benefits of Regularization**
- Reduces **estimation errors** by discouraging overfitting to noise.
- Balances fitting the training data and keeping parameters small.
- There is typically a "sweet spot" value of $ \lambda $ at which test error is minimized, even though training error is slightly higher than without regularization.

---

## **6. Key Takeaways**
- Linear regression maps input features to continuous outputs using a linear function.
- The objective minimizes the average **squared error loss**; the remaining mistakes are either structural errors (the model is too simple) or estimation errors (too little or too noisy data).
- **Gradient-based methods** and **closed-form solutions** are used for parameter optimization.
- **Regularization** (e.g., ridge regression) improves generalization by penalizing large parameters, reducing overfitting.


## **Lecture 6. Nonlinear Classification**

### **Summary of Non-Linear Classification and Kernel Methods**

## **1. Introduction to Non-Linear Classification**
- **Goal**: Extend linear classifiers to solve **non-linear problems** using feature transformations.
- **Key Idea**:
   - Map input $ x $ into a higher-dimensional **feature space** $ \phi(x) $.
   - Perform **linear classification** in the new space, which corresponds to a **non-linear decision boundary** in the original input space.

---

## **2. Feature Transformation for Non-Linearity**
- **Feature Maps**:
   - Transform $ x $ into new features, e.g., $ \phi(x) = [x, x^2, x^3, \dots] $.
   - Higher-order features (e.g., cross-terms) allow for more expressive decision boundaries.

### **Example**:
1. **1D Input**: $ x $ mapped to $ \phi(x) = [x, x^2] $.
   - Linear classification in the transformed space results in a **quadratic decision boundary**.
2. **2D Input**: $ x = [x_1, x_2] $ with features $ \phi(x) = [x_1, x_2, x_1^2, x_2^2, x_1x_2] $.
   - The feature space becomes 5-dimensional.
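
A short sketch of the 2D example: a classifier that is linear in $ \phi(x) $ but draws a circle in the original input space. The parameter values are hypothetical, chosen so that the decision boundary is $ x_1^2 + x_2^2 = 1 $.

```python
def phi(x):
    """Second-order feature map for a 2D input: [x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return [x1, x2, x1 ** 2, x2 ** 2, x1 * x2]

def classify(theta, theta_0, x):
    """Linear classifier in feature space: sign(theta . phi(x) + theta_0)."""
    score = sum(t * f for t, f in zip(theta, phi(x))) + theta_0
    return 1 if score > 0 else -1

# Hypothetical parameters: score = x1^2 + x2^2 - 1 (positive outside the unit circle).
theta = [0.0, 0.0, 1.0, 1.0, 0.0]
theta_0 = -1.0

inside = classify(theta, theta_0, (0.5, 0.5))   # -1: inside the circle
outside = classify(theta, theta_0, (1.0, 1.0))  # +1: outside the circle
```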

---

## **3. The Curse of Dimensionality**
- Adding higher-order terms increases the dimensionality rapidly:
   - $ d $-dimensional input with up to $ p $-order features results in $ O(d^p) $ features.
- **Challenge**: Explicitly constructing high-dimensional feature vectors is computationally expensive.
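
To make the growth concrete, the sketch below counts the distinct monomials of degree 1 through $ p $ in $ d $ variables, which is $ \binom{d+p}{p} - 1 $ (the constant term is excluded). This exact count is a standard combinatorial fact rather than something stated in the notes; it agrees with the 5 features of the 2D second-order example.

```python
from math import comb

def num_polynomial_features(d, p):
    """Number of monomials of degree 1..p in d variables (constant excluded)."""
    return comb(d + p, p) - 1

counts = {
    (d, p): num_polynomial_features(d, p)
    for d in (2, 10, 100)
    for p in (2, 3)
}
# counts[(2, 2)] is 5, while counts[(100, 3)] is already 176850.
```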

---

## **4. Kernel Methods**
- **Solution**: Avoid explicit feature construction by operating directly on **inner products** in the feature space using **kernel functions**.

### **Kernel Function**:
- A kernel $ K(x, x') $ computes the inner product of $ \phi(x) $ and $ \phi(x') $ implicitly:
   $
   K(x, x') = \langle \phi(x), \phi(x') \rangle
   $
- Examples:
   1. **Linear Kernel**: $ K(x, x') = x^\top x' $
   2. **Polynomial Kernel**: $ K(x, x') = (1 + x^\top x')^p $
   3. **Radial Basis Function (RBF) Kernel**:  
      $
      K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)
      $
      - Provides infinite-dimensional feature representations.
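
The three kernels can be sketched as below. To illustrate that a kernel really is an inner product of feature vectors, the code also builds one explicit feature map for the degree-2 polynomial kernel on 2D inputs and compares the two computations. The map `phi_quadratic` (with its $ \sqrt{2} $ scalings) is one valid choice introduced here for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, p=2):
    return (1.0 + dot(x, z)) ** p

def rbf_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def phi_quadratic(x):
    """Explicit features whose inner product equals (1 + x.z)^2 for 2D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2]

x, z = (1.0, 2.0), (3.0, -1.0)
implicit = polynomial_kernel(x, z, p=2)              # one dot product in 2D
explicit = dot(phi_quadratic(x), phi_quadratic(z))   # dot product in 6D
```

Both routes give the same value, but the kernel route never constructs the 6-dimensional vectors.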

### **Benefits**:
- Kernels allow non-linear transformations **without explicitly computing features**.
- The cost of one kernel evaluation depends on the dimension of the original input $ x $, not on the (possibly much larger) dimension of the feature space.

---

## **5. Kernel Perceptron Algorithm**
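- The perceptron's parameters are a weighted sum of the training feature vectors, so its predictions can be written using kernel evaluations only:
   $
   \theta = \sum_{j=1}^n \alpha_j y_j \phi(x_j)
   \quad \Rightarrow \quad
   \theta^\top \phi(x) = \sum_{j=1}^n \alpha_j y_j K(x_j, x)
   $
   where $ \alpha_j $ counts the mistakes made on the $ j $-th training sample.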

### **Steps**:
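1. Initialize $ \alpha_j = 0 $ for all $ j $.
2. For each training sample $ (x_i, y_i) $, check if a mistake is made, i.e., whether $ y_i \sum_{j=1}^n \alpha_j y_j K(x_j, x_i) \leq 0 $.
3. If a mistake occurs, increment $ \alpha_i $ by $ 1 $.
4. Repeat passes over the training set until no mistakes occur (or a pass limit is reached).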

### **Interpretation**:
- Kernel $ K(x, x_j) $ measures the similarity between $ x $ and the $ j $-th training sample.
- Predictions are made using a weighted sum of kernel evaluations between the test point and training samples.
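
A minimal sketch of the kernel perceptron, assuming an RBF kernel with $ \sigma = 1 $ and labels in $ \{-1, +1\} $. It is run on the XOR pattern, which no linear classifier in the original 2D space can separate. Function names and the pass limit are illustrative.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def score(alpha, xs, ys, x, kernel):
    """Weighted sum of kernel evaluations: sum_j alpha_j * y_j * K(x_j, x)."""
    return sum(a * y * kernel(xj, x) for a, y, xj in zip(alpha, ys, xs) if a)

def train_kernel_perceptron(xs, ys, kernel, max_epochs=100):
    alpha = [0] * len(xs)  # alpha[j] counts mistakes made on sample j
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * score(alpha, xs, ys, x, kernel) <= 0:  # mistake on sample i
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: stop
            break
    return alpha

def predict(alpha, xs, ys, x, kernel):
    return 1 if score(alpha, xs, ys, x, kernel) > 0 else -1

# XOR pattern: not linearly separable in the original input space.
xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
ys = [-1, 1, 1, -1]
alpha = train_kernel_perceptron(xs, ys, rbf_kernel)
predictions = [predict(alpha, xs, ys, x, rbf_kernel) for x in xs]
```

On this data every point is misclassified once in the first pass, after which the second pass makes no mistakes and training stops.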

---

## **6. Radial Basis Function (RBF) Kernel**
- The RBF kernel creates a **smooth, nonlinear decision boundary** in the input space.
- **Key Property**:
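   - Although the implicit feature space is infinite-dimensional, evaluating $ K(x, x') $ only requires the squared distance between the two inputs.
- **Application**:
   - With an RBF kernel, the kernel perceptron can perfectly separate any training set of distinct points, even one that is not linearly separable in the input space, given sufficient training iterations.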

---

## **7. Decision Trees and Random Forests**
- **Decision Trees**:
   - Create **axis-aligned splits** in the input space to form non-linear decision boundaries.
- **Random Forests**:
   - Combine multiple decision trees with randomness:
     - Randomly select features for splits.
     - Use **bootstrap samples** of the training data.
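
The ideas above can be illustrated with a small sketch: a hand-written depth-2 tree whose axis-aligned tests reproduce the XOR pattern, plus the two sources of randomness used by random forests. This is not a tree-learning algorithm; thresholds and helper names are hypothetical.

```python
import random

def tiny_tree(x):
    """Hand-written depth-2 tree: each test compares one coordinate to a threshold."""
    x1, x2 = x
    if x1 <= 0.5:
        return 1 if x2 > 0.5 else -1
    return 1 if x2 <= 0.5 else -1

def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def random_feature_subset(num_features, subset_size, rng):
    """Pick the features a split is allowed to consider."""
    return sorted(rng.sample(range(num_features), subset_size))

xor_points = [((0.0, 0.0), -1), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), -1)]
tree_predictions = [tiny_tree(x) for x, _ in xor_points]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = bootstrap_sample(xor_points, rng)
features = random_feature_subset(num_features=5, subset_size=2, rng=rng)
```

In a forest, each tree would be trained on its own bootstrap sample with its own feature subsets, and the trees' predictions would be combined, for instance by majority vote.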

---

## **8. Key Takeaways**
- **Feature transformations** enable non-linear classification by mapping inputs to a higher-dimensional space.
- **Kernel methods** allow implicit computation in high-dimensional spaces using kernel functions, avoiding explicit feature construction.
- The **Kernel Perceptron** algorithm uses kernel functions to efficiently solve non-linear classification problems.
- **RBF kernels** are especially powerful, offering infinite-dimensional representations with computational efficiency.
- **Alternative Non-Linear Methods**:
   - Decision trees (which are easy to interpret) and random forests (ensembles of randomized trees) provide non-linear classifiers built from axis-aligned splits.


## **Lecture 7. Recommender Systems**

### **Summary of Recommender Systems**

## **1. Introduction to Recommender Systems**
- **Goal**: Predict user preferences for items (e.g., movies, products) based on prior behavior.
- **Applications**: Widely used in e-commerce and media streaming (e.g., Amazon, Netflix).
- The problem involves a sparse matrix $ Y $ with user preferences for items, where most entries are missing.

---

## **2. Problem Definition**
- $ Y $: $ n \times m $ matrix (users $ n $, items $ m $).
- Objective: Predict the missing entries of $ Y $, producing a complete matrix $ X $.
- Challenges:
  - Sparsity: Few ratings compared to the size of the matrix.
  - Feature extraction: Not always clear which features determine preferences.

---

## **3. K-Nearest Neighbors (KNN) for Recommendations**
- **Basic Idea**:
  - Identify the $ K $ nearest users to a target user based on similarity (e.g., cosine similarity).
  - Predict ratings as the weighted average of neighbors' ratings.
- **Limitations**:
  - Fails to capture hidden patterns (e.g., a user liking both gardening and machine learning books).
  - Struggles with complex relationships between users and items.
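
A minimal sketch of the KNN idea, under two assumptions that the notes do not fix: cosine similarity is computed only over items both users have rated, and only neighbors who rated the target item are considered. The user names and ratings are made up.

```python
import math

def cosine_similarity(ra, rb):
    """Cosine similarity over the items both users have rated."""
    common = set(ra) & set(rb)
    if not common:
        return 0.0
    num = sum(ra[i] * rb[i] for i in common)
    den = math.sqrt(sum(ra[i] ** 2 for i in common)) * math.sqrt(
        sum(rb[i] ** 2 for i in common)
    )
    return num / den if den else 0.0

def predict_rating(ratings, user, item, k=2):
    """Similarity-weighted average of the item's ratings by the k nearest users."""
    candidates = [
        (cosine_similarity(ratings[user], r), r[item])
        for other, r in ratings.items()
        if other != user and item in r
    ]
    neighbors = sorted(candidates, reverse=True)[:k]
    total_weight = sum(sim for sim, _ in neighbors)
    if total_weight == 0:
        return None  # nobody comparable has rated this item
    return sum(sim * rating for sim, rating in neighbors) / total_weight

ratings = {
    "ann": {"A": 5, "B": 1},
    "bob": {"A": 5, "B": 1, "C": 4},
    "cat": {"A": 1, "B": 5, "C": 2},
    "dan": {"A": 4, "B": 2, "C": 5},
}
predicted = predict_rating(ratings, "ann", "C", k=2)
```

Here the two users most similar to `ann` are `bob` and `dan`, so the prediction for item `C` falls between their ratings of 4 and 5.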

---

## **4. Collaborative Filtering with Matrix Factorization**
- **Key Assumption**: The user-item matrix $ X $ is low-rank, reflecting latent factors (e.g., user preferences and item attributes).

### **Low-Rank Decomposition**:
1. Factorize $ X $ into two matrices:
   - $ U $: User features ($ n \times k $).
   - $ V $: Item features ($ m \times k $).
   - $ X = U \cdot V^\top $, where $ k $ is the rank.
2. Significantly reduces the number of parameters from $ n \times m $ to $ n \times k + m \times k $.

### **Interpretation**:
- $ U $: Represents user preferences across $ k $ latent factors.
- $ V $: Represents item characteristics across $ k $ latent factors.
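
The decomposition and the parameter savings can be sketched as follows. The small matrices and the sizes `n=1000`, `m=500`, `k=10` are made-up values for illustration; the savings only appear when $ k $ is much smaller than $ n $ and $ m $.

```python
def reconstruct(U, V):
    """X = U V^T for U (n x k) and V (m x k), stored as nested lists."""
    return [
        [sum(uf * vf for uf, vf in zip(u_row, v_row)) for v_row in V]
        for u_row in U
    ]

def parameter_counts(n, m, k):
    """Return (entries in the full matrix, entries in the two factors)."""
    return n * m, n * k + m * k

# Rank-1 toy example: 3 users and 2 items, one latent factor each.
U = [[1.0], [2.0], [3.0]]
V = [[4.0], [5.0]]
X = reconstruct(U, V)  # every row of X is a multiple of [4, 5]

full, factored = parameter_counts(n=1000, m=500, k=10)
```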

---

## **5. Optimization Problem**
- **Objective**:
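  - Minimize the reconstruction error between the known entries of $ Y $ and the predicted matrix $ X = U V^\top $:
    $
    J(U, V) = \sum_{(a, i) \in D} (Y_{ai} - U_a \cdot V_i^\top)^2 + \lambda (\|U\|^2 + \|V\|^2)
    $
    where $ D $ is the set of observed (user, item) pairs, and $ U_a $ and $ V_i $ are the rows of $ U $ and $ V $ for user $ a $ and item $ i $.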
  - Regularization term ($ \lambda $): Prevents overfitting by penalizing large values in $ U $ and $ V $.

- **Algorithm**:
  - Alternating minimization:
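    1. Fix $ V $, optimize $ U $ (a regularized least-squares problem for each user).
    2. Fix $ U $, optimize $ V $ (a regularized least-squares problem for each item).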
  - Repeat until convergence.
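
As a worked case, consider rank $ k = 1 $, so that each user has a scalar $ u_a $ and each item a scalar $ v_i $ (this lowercase notation is introduced here for illustration). With $ V $ fixed, the objective separates across users, and setting the derivative with respect to $ u_a $ to zero gives a closed-form update:
    $
    \frac{\partial J}{\partial u_a} = -2 \sum_{i : (a, i) \in D} (Y_{ai} - u_a v_i) v_i + 2 \lambda u_a = 0
    \quad \Rightarrow \quad
    u_a = \frac{\sum_{i : (a, i) \in D} Y_{ai} v_i}{\lambda + \sum_{i : (a, i) \in D} v_i^2}
    $
The update for $ v_i $ with $ U $ fixed is symmetric, with the roles of users and items exchanged.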

---

## **6. Example of Matrix Factorization**
- **Rank-1 Example**:
  - Initialize $ V $ randomly.
  - Update $ U $ to minimize loss for observed entries in $ Y $.
  - Iteratively update $ U $ and $ V $ until convergence.
- Results:
  - $ U $ and $ V $ encode latent factors, allowing $ X $ to be reconstructed.
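
The rank-1 procedure can be sketched as below, assuming the objective from Section 5 with scalar factors per user and per item. Each update is the exact minimizer of that objective for one factor with the other held fixed. For reproducibility the sketch starts from $ V $ set to all ones instead of a random initialization; the data and function names are made up.

```python
def als_rank1(observed, n_users, n_items, lam=0.01, iters=100):
    """Alternating minimization for rank 1; observed maps (user, item) -> rating."""
    u = [0.0] * n_users
    v = [1.0] * n_items  # fixed start (assumption); the notes initialize V randomly
    for _ in range(iters):
        # Fix v, update each user's factor in closed form.
        for a in range(n_users):
            num = sum(y * v[i] for (b, i), y in observed.items() if b == a)
            den = lam + sum(v[i] ** 2 for (b, i) in observed if b == a)
            u[a] = num / den
        # Fix u, update each item's factor in closed form.
        for i in range(n_items):
            num = sum(y * u[a] for (a, j), y in observed.items() if j == i)
            den = lam + sum(u[a] ** 2 for (a, j) in observed if j == i)
            v[i] = num / den
    return u, v

def objective(observed, u, v, lam):
    """Squared error on observed entries plus the regularization penalty."""
    fit = sum((y - u[a] * v[i]) ** 2 for (a, i), y in observed.items())
    return fit + lam * (sum(x * x for x in u) + sum(x * x for x in v))

# Toy ratings from the rank-1 matrix [[2, 1], [4, 2], [6, 3]]; entry (2, 1) is hidden.
observed = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 4.0, (1, 1): 2.0, (2, 0): 6.0}
u, v = als_rank1(observed, n_users=3, n_items=2)
predicted_missing = u[2] * v[1]  # should land close to the hidden value 3
```

Because each half-step minimizes the objective exactly over one factor, the objective never increases from one iteration to the next.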

---

## **7. Advantages of Matrix Factorization**
1. **Captures Latent Patterns**:
   - Discovers hidden relationships among users and items.
   - Handles multi-dimensional preferences (e.g., user preferences across different genres).
2. **Scalable**:
   - Operates efficiently even for large matrices by reducing the number of parameters.
3. **Generalizable**:
   - Extends to higher ranks $ k $ for richer representations.

---

## **8. Key Takeaways**
- Recommender systems predict user-item interactions by leveraging past data.
- Simple methods like KNN provide basic recommendations but lack the ability to uncover hidden patterns.
- Matrix factorization, a form of collaborative filtering, models complex user-item relationships by factorizing the preference matrix into latent user and item factors.
- Regularization helps avoid overfitting, and alternating minimization makes the jointly non-convex factorization problem tractable by solving a sequence of simple least-squares problems.