# Classification

## Introduction

**Given:**  
Training Set
- A dataset $D$ with $N$ labeled instances $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$
- $y^{(i)} \in \{1, 2, ..., K\}$

**Goal:**  
Given an input $x$, assign it to one of $K$ classes.

**Examples:**
- Email Spam Detection
- Medical Diagnosis
- Handwritten digit recognition

**Decision Boundary:**

**Definition:** A dividing surface that separates different classes in the feature space, also known as a **Decision Surface**.

In a $d$-dimensional feature space, the decision boundary for a linear classifier is a hyperplane of dimension $d-1$.

<div style="text-align:center">
  <img src="images/decisionBoundary.png" alt="Decision Boundary Examples">
</div>

**Regression vs Classification:**

**Linear Regression:**
- Target: Continuous values (real numbers)

**Linear Classifier:**
- Target: Binary or multi-category labels

## Discriminant Functions

### **Introduction**

**Definition:** A discriminant function $f_i(x)$ assigns a score to an input vector $x$ for each class $C_i$ $(i=1, ..., K)$.

**How it works:**
* **Binary Classification:**
    - Two functions $f_1(x)$ and $f_2(x)$ for classes $C_1$ and $C_2$. The class is predicted by comparing these two functions:
    $$
    \hat{y} = 
    \begin{cases} 
    C_1 & \text{if } f_1(x) \gt f_2(x) \\
    C_2 & \text{otherwise}
    \end{cases}
    $$
    - For binary classification it is enough to find a single function $f: R^d \rightarrow R$ where:
        - $f_1(x) = f(x)$
        - $f_2(x) = -f(x)$
    - Decision boundary: $f(x) = 0$
<br><br>
* **General Case:**
    - For $K$-class problems, we compute $f_i(x)$ for every class $i$ and assign $x$ to the class with the highest score:
    $$\hat{y} = \argmax_i f_i(x)$$

### **Linear Classifier**

#### **Introduction**

##### **Core Concepts (definitions, limitations, decision surface)**

**Definition:** Decision boundaries are linear in $x$, or linear in some given set of functions of $x$ ($\phi(x)$).

**Linearly separable data:** data points that can be exactly classified by a linear decision surface.

**Why Linear Classifier:** *(even though they are not always optimal)*
- Simplicity
- Easy to compute
- Efficiency
- Attractive candidates for initial, trial classifiers
- Effectiveness

**Two Category Classification**
$$f(x;w) = w^Tx + w_0 = w_0 + w_1x_1 + ... + w_dx_d$$
Where:
- $x = [x_1, x_2, ..., x_d]$
- $w = [w_1, w_2, ..., w_d]$
- $w_0$: Bias

$$
\hat{y} = 
\begin{cases} 
C_1 & \text{if } w^Tx + w_0 \ge 0 \\
C_2 & \text{otherwise}
\end{cases}
$$

**Decision Boundary(Surface):** 
$$w^Tx + w_0 = 0$$
The decision boundary is a $(d-1)$-dimensional hyperplane in $d$-dimensional space.

##### **$w$ is orthogonal to every vector lying within the decision surface**

Decision Surface:
$$w^Tx + w_0 = 0$$
For any two points $x_1$ and $x_2$ on the decision surface:
$$w^Tx_1 + w_0 = 0 \quad \text{and} \quad w^Tx_2 + w_0 = 0$$
Subtracting these equations:
$$w^T(x_1 - x_2)  = 0$$
Here $x_1 - x_2$ is a vector parallel to the decision surface.

**Conclusion:**
- $w$ is *orthogonal* to every vector lying within the decision surface
- $w$ acts as the *normal vector* to the decision surface

##### **Signed measure of the perpendicular distance $r$ of the point $x$ from the decision surface:**

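- The decision surface $H$ is determined by the normal vector $w = [w_1, w_2, ..., w_d]$: $H = \{x : w^Tx + w_0 = 0\}$
- $w_0$ determines the location of the surface
- The signed normal distance of the origin from the decision surface is $\frac{w_0}{||w||}$ (set $x = 0$ in the formula for $r$ derived below)

Decompose any point $x$ into its orthogonal projection $x_\perp$ onto $H$ plus a step along the unit normal:
$$x = x_\perp + r\frac{w}{||w||}$$
where $r$ is the signed scalar distance.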

Evaluate $H$ at $x$:
$$w^Tx + w_0 = w^T(x_\perp + r\frac{w}{||w||}) + w_0$$
$$ = (w^Tx_\perp + w_0) + r\frac{w^Tw}{||w||}$$

Since $x_\perp \in H$ then $w^Tx_\perp + w_0 = 0$. So:
$$ = r\frac{||w||^2}{||w||} = r||w||$$

As a result:
$$w^Tx + w_0 = r||w||$$

Solve for r:
$$r = \frac{w^Tx + w_0}{||w||}$$
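
The formula for $r$ can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `signed_distance` and the example line $3x_1 + 4x_2 - 5 = 0$ are illustrative assumptions, not part of the notes.

```python
import math

def signed_distance(w, w0, x):
    """Signed perpendicular distance r = (w^T x + w0) / ||w||."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + w0
    norm = math.sqrt(sum(wi * wi for wi in w))
    return score / norm

# Hypothetical example: the line 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w, w0 = [3.0, 4.0], -5.0

r_origin = signed_distance(w, w0, [0.0, 0.0])  # equals w0 / ||w|| = -1.0
r_point = signed_distance(w, w0, [3.0, 4.0])   # (9 + 16 - 5) / 5 = 4.0
```

The sign of $r$ tells which side of the surface the point lies on: here the origin is on the negative side and $(3, 4)$ on the positive side.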

<div style="text-align:center">
  <img src="images/LinearClassifierDistance.png" alt="Linear Classifier Distance">
</div>

Linear boundary geometry POV summary:
<div style="text-align:center">
  <img src="images/LinearClassifierSummary.png" alt="Linear boundary geometry POV summary">
</div>

##### **Non-Linear Decision Boundary**

- **Feature Transformation ($\phi(x)$):** Non-linearity is introduced by transforming features into a higher dimensional space.

- **Linear in Transformed Space:** The decision boundary becomes linear in the new space, but non-linear in the original space.

Same as what we did in *Polynomial Regression* problem.

- Example:
<div style="text-align:center">
  <img src="images/non-linearDecisionBoundary.png" alt="Non-Linear Decision Boundary Example">
</div>

$$x_1^2 + x_2^2 = 1$$
$$\phi(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1x_2]$$
$$w = [w_0, w_1, ..., w_5] = [-1, 0, 0, 1, 1, 0]$$
$$
y = 
\begin{cases} 
1 & \text{if } w^T\phi(x) \ge 0 \\
-1 & \text{otherwise}
\end{cases}
$$
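
As a small sketch of this example, the code below applies the feature map $\phi(x)$ and the weight vector given above to a few points. The helper names `phi` and `predict` are assumptions made for illustration.

```python
def phi(x1, x2):
    """Quadratic feature map [1, x1, x2, x1^2, x2^2, x1*x2]."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2]

# Weights from the example: w^T phi(x) = x1^2 + x2^2 - 1.
w = [-1.0, 0.0, 0.0, 1.0, 1.0, 0.0]

def predict(x1, x2):
    """Linear rule in the transformed space: +1 if w^T phi(x) >= 0, else -1."""
    score = sum(wi * fi for wi, fi in zip(w, phi(x1, x2)))
    return 1 if score >= 0 else -1

inside = predict(0.5, 0.5)    # inside the unit circle
outside = predict(1.0, 1.0)   # outside the unit circle
```

The rule is linear in $\phi(x)$ yet draws a circle in the original $(x_1, x_2)$ plane.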

#### **Cost function**

##### **Introduction**

Finding a linear classifier can be formulated as an *optimization problem*:

**Given:**
- Training set $D = \{x^{(i)},y^{(i)}\}$
- Cost function $J(w)$

**Find:**
- Optimal $\hat{f}(x) = f(x;\hat{w})$ where:
$$\hat{w} = \argmin_w J(w)$$

Unlike in the regression problem, we will investigate several cost functions for the classification problem.

##### **Sum of Squared Error (SSE)**

**Formula:**
$$J(w) = \sum_{i=1}^N(w^Tx^{(i)} - y^{(i)})^2$$

**Limitations:**
- If the model output is close to the true label but not exactly equal to it, SSE still reports a positive error, even for correct predictions
- SSE penalizes predictions that are "too correct" (points that lie a long way on the correct side of the decision boundary)
- Lack of robustness to noise: small variations in the data can cause significant changes in the cost

<div style="text-align:center">
  <img src="images/SSECostFuncForClassification.png" alt="SSE cost function example">
</div>

##### **Alternative for SSE**

**Signed Activation Function**

**Definition:** Measures how many samples are misclassified by the model, penalizing each error by 4 units

**Formula:**
$$J(w) = \sum_{i=1}^N(\text{sign}(w^Tx^{(i)}) - y^{(i)})^2$$

Sign Function:
$$
\text{sign}(z) = 
\begin{cases} 
1 & z \ge 0 \\
-1 & z \lt 0
\end{cases}
$$

| **True Label (y)** | **Prediction** | **Calculation**               | **Cost (J)**|  
|--------------------|----------------|-------------------------------|-------------|  
|     $+1$           |     $+1$       |   $(+1 - (+1))^2 = 0$         | 0           |  
|     $-1$           |     $-1$       |   $(-1 - (-1))^2 = 0$         | 0           |  
|     $-1$           |     $+1$       |   $(-1 - (+1))^2 = 4$         | 4           |  
|     $+1$           |     $-1$       |   $(+1 - (-1))^2 = 4$         | 4           |  


**Limitations:**
- **Non-Differentiable:**
    - The $\text{sign}(z)$ function is discontinuous at $z=0$, so the cost is not differentiable there and gradient-based optimization cannot be applied directly.
- **Flat Gradients:**
    - The cost landscape has large flat regions where the gradient is zero.
- **Coarse Error Sensitivity:**
    - All misclassifications are penalized equally (4 units), regardless of how *close* the prediction was to the boundary.
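
The sign-based cost can be sketched as below. This assumes the usual convention $\text{sign}(z) = +1$ for $z \ge 0$ and $-1$ otherwise, and inputs augmented with a constant 1 so the bias is part of $w$. The data and weights are made up for illustration.

```python
def sign(z):
    """Assumed convention: +1 for z >= 0, -1 otherwise."""
    return 1 if z >= 0 else -1

def sign_cost(w, X, y):
    """Sum of (sign(w^T x) - y)^2: each misclassified sample adds 4."""
    total = 0
    for x_i, y_i in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, x_i))
        total += (sign(z) - y_i) ** 2
    return total

# Hypothetical data; the first component of each x is the constant 1.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -3.0]]
y = [1, -1, -1, 1]
w = [0.0, 1.0]

cost = sign_cost(w, X, y)   # two misclassified samples, so 2 * 4 = 8
```

Moving a misclassified point closer to or farther from the boundary does not change `cost`, which is the coarse sensitivity described above.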

**Sigmoid Activation Function**

- **Far from decision boundary**: Output approaches **0** or **1** (high confidence).
- **Close to decision boundary**: Output ≈ **0.5** (uncertain classification).
- Solves the **non-differentiable** problem of the sign function by providing smooth gradients.

**Formula:**
$$
\sigma(z) = \frac{1}{1 + e^{-z}} \quad \text{(Output range: 0 to 1)}
$$

**Limitations:**
- **Non-Convex Cost Function:**
    - The squared error applied to the sigmoid output is not convex in $w$, so optimization may converge to a local minimum.

##### **Perceptron**

- **Core concept:**  
The perceptron criterion focuses on misclassified points.

- **Formula:**
    $$J_p(w) = -\sum_{i \in M}y^{(i)}w^Tx^{(i)}$$
    Where:
    - $y^{(i)} \in \{-1,+1\}$
    - $M$: Subset of training data that are misclassified
    - $w^T x^{(i)}$: Proportional to the signed distance from the decision boundary (the distance scaled by $||w||$)

- **Decompose $y^{(i)}w^Tx^{(i)}$ term:**
    - $y^{(i)}$: Sets the sign, so the product is positive for correctly classified points and negative for misclassified ones.
    - $w^T x^{(i)}$: Proportional to the signed distance from the decision boundary. It causes misclassified points farther from the boundary to be penalized more.

- **Classification Cases:**
    - **Correct Prediction:**
        - *Ignored in the sum (since $i \notin M$)*
    - **Incorrect Prediction** ($\text{sign}(w^T x^{(i)}) \neq y^{(i)}$)
        - $y^{(i)} w^T x^{(i)} < 0$ → **Positive penalty** in $J_p(w)$  
        - Penalty scales with distance from boundary (farther = worse)
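
A minimal sketch of evaluating the perceptron criterion on a small made-up dataset follows. The function name and data are assumptions; inputs carry a leading constant 1 so that the bias is folded into $w$.

```python
def perceptron_criterion(w, X, y):
    """J_p(w) = -sum over misclassified i of y_i * w^T x_i."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(wj * xj for wj, xj in zip(w, x_i))
        if margin <= 0:          # misclassified (or exactly on the boundary)
            total -= margin      # adds a non-negative penalty
    return total

# Hypothetical data with labels in {-1, +1}.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -3.0]]
y = [1, -1, -1, 1]
w = [0.0, 1.0]

J = perceptron_criterion(w, X, y)   # 0.5 + 3.0 = 3.5
```

The two misclassified points contribute 0.5 and 3.0, so the point farther on the wrong side dominates the cost.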

#### **Perceptron**

##### **Introduction**

**Perceptron Unit:**
- **Basic Building Block:**
  - A perceptron is the simplest type of artificial neuron used in machine learning.
- **Linear Classifier:**
  - It maps input features to an output by applying a linear combination and a threshold.
- **Binary Decision:**
  - Outputs 1 if the weighted sum of inputs exceeds the threshold.
- **Components:**
  - Inputs, weights, bias & an activation function (step or sigmoid function)

<div style="text-align:center">
  <img src="images/perceptronUnit.png" alt="Perceptron Unit">
</div>

**Inspired By Neurons:** Perceptron mimics the basic function of biological neurons in the brain

<div style="text-align:center">
  <img src="images/biologicalNeuron.png" alt="Biological Motivation Behind Perceptron">
</div>

**Single Neuron as a Linear Decision Boundary**

The output of a single neuron is:
$$y = f(w^Tx+w_0)$$
Where:
- $x$: Input vector
- $w$: weight vector
- $w_0$: Bias term
- $f$: Activation function (e.g: step, sigmoid)

**Linear Separation:** A neuron defines a linear decision boundary. Thresholding the pre-activation at 0 for a step function, or the output at 0.5 for a sigmoid (since $\sigma(z) = 0.5$ exactly when $z = 0$), gives the same hyperplane:
$$w^Tx+w_0 = 0$$

**Decision Rule:**
$$
y = 
\begin{cases} 
C_1 & \text{if } w^Tx+w_0 \ge 0 \\
C_2 & \text{otherwise}
\end{cases}
$$

**Limitations of a Single Perceptron**
- **Performs Linear Separation:**
    - A single perceptron can handle linearly separable problems such as:
        - AND operation
        - OR operation

- **Fails on Non-Linear Problems:**
    - A single perceptron fails to solve non-linear problems like XOR
    - Non-linear problem: data points cannot be separated by a straight line.
    - Handled by using a *Multi-Layer Perceptron* (MLP)

<div style="text-align:center">
  <img src="images/XORProblem.png" alt="XOR Problem">
</div>
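
The AND/OR versus XOR contrast can be illustrated with a step-activation unit. The weights below are one hand-picked choice (an assumption, many others work), and the grid search is only an illustration over a coarse set of candidate weights, not a proof.

```python
def perceptron_unit(w, w0, x):
    """Step activation: output 1 if w^T x + w0 >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + w0 >= 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked weights (illustrative) that realize AND and OR.
and_out = [perceptron_unit([1.0, 1.0], -1.5, x) for x in inputs]
or_out = [perceptron_unit([1.0, 1.0], -0.5, x) for x in inputs]

# Coarse grid search over (w1, w2, w0) in [-4, 4]: no setting reproduces XOR.
grid = [k * 0.5 for k in range(-8, 9)]
xor_target = [0, 1, 1, 0]
xor_solvable = any(
    [perceptron_unit([a, b], c, x) for x in inputs] == xor_target
    for a in grid for b in grid for c in grid
)
```

`and_out` and `or_out` match the AND and OR truth tables, while `xor_solvable` stays `False`, consistent with XOR not being linearly separable.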

**Multi-Layer Perceptron**

- **Adding Layers for More Complexity:**
    - MLP consists of multiple layers of neurons that allow us to model more complex functions
    - Each layer adds new decision boundaries, making it possible to separate non-linear data

- **Two-Layer Example**
    - Input Layer $\rightarrow$ Hidden Layer $\rightarrow$ Output Layer
    - Hidden layers introduce non-linear transformations through activation functions, enabling the network to model complex decision boundaries.

<div style="text-align:center">
  <img src="images/MLP.png" alt="2-Layer Perceptron">
</div>    

##### **Perceptron Algorithm**

- Binary Classification:
    $$y \in \{-1, 1\}$$
- Goal:
    $$\forall_i, x^{(i)} \in C_1 \rightarrow w^Tx^{(i)} \gt 0$$
    $$\forall_i, x^{(i)} \in C_2 \rightarrow w^Tx^{(i)} \lt 0$$
- Activation function:
    $$f(x;w) = \text{sign}(w^Tx)$$

**Perceptron criterion**
$$J_p(w) = - \sum_{i \in M} w^Tx^{(i)}y^{(i)}$$
Where:
- $M$: Subset of training data that are misclassified

*This is the same criterion discussed in the linear classifier cost function section.*

**Goal:** Minimize the loss by correctly classifying all points.

##### **Batch Perceptron**

**Definition:** Updates the weight vector using all misclassified points in each iteration.

**Gradient Descent:** Adjusting weights in the direction that reduces the loss:
    $$w^{t+1} = w^t - \eta \nabla_w J_p(w^t)$$
    $$\nabla_w J_p(w) = - \sum_{i \in M} x^{(i)}y^{(i)}$$

The batch perceptron converges in a finite number of steps for linearly separable data.

> Initialize $w$  
> Repeat:  
> &nbsp;&nbsp;&nbsp;&nbsp;$w = w + \eta \sum_{i \in M} x^{(i)}y^{(i)}$  
> Until:  
> &nbsp;&nbsp;&nbsp;&nbsp;$\left\|\eta \sum_{i \in M} x^{(i)}y^{(i)}\right\| < \theta$
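
The batch update can be sketched in plain Python as below. As a simplifying assumption, this sketch stops when no point is misclassified (with a cap of `max_iters` iterations) instead of using the threshold $\theta$, and it folds the bias into $w$ by prepending a constant 1 to every input. The dataset is made up.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def batch_perceptron(X, y, eta=1.0, max_iters=1000):
    """Each iteration adds eta * sum of y_i * x_i over all misclassified points."""
    w = [0.0] * len(X[0])
    for _ in range(max_iters):
        mis = [(x_i, y_i) for x_i, y_i in zip(X, y) if y_i * dot(w, x_i) <= 0]
        if not mis:                      # every point is on the correct side
            break
        for j in range(len(w)):
            w[j] += eta * sum(y_i * x_i[j] for x_i, y_i in mis)
    return w

# Hypothetical linearly separable data; the first component is the constant 1.
X = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]

w = batch_perceptron(X, y)
margins = [y_i * dot(w, x_i) for x_i, y_i in zip(X, y)]
```

On this data every entry of `margins` ends up positive, meaning all points are correctly classified.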

##### **Single-Sample Perceptron**

**Definition:** Updates the weight vector after each individual point.

**Stochastic Gradient Descent (SGD):**
$$J_p(w) = \sum_{i=1}^N J_p^{(i)}(w)$$
- Using only one misclassified sample at a time:
    $$w^{t+1} = w^t + \eta x^{(i)}y^{(i)}$$
- Lower computational cost per iteration, and possibly faster convergence.
- If we predicted wrong:
$$
w^{t+1} = 
\begin{cases} 
w^t + \eta x^{(i)} & \text{if } y^{(i)} \gt 0 \\
w^t - \eta x^{(i)} & \text{if } y^{(i)} \lt 0
\end{cases}
$$


If training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps.

> Initialize $w$, $t \leftarrow 0$  
> Repeat:  
> &nbsp;&nbsp;&nbsp;&nbsp;$t \leftarrow t + 1$  
> &nbsp;&nbsp;&nbsp;&nbsp;$i \leftarrow t \bmod N$  
> &nbsp;&nbsp;&nbsp;&nbsp;if $x^{(i)}$ is misclassified then:  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$w = w + \eta x^{(i)} y^{(i)}$  
> Until all patterns are properly classified
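
A minimal sketch of the single-sample loop follows. The epoch cap `max_epochs` is an added safeguard (an assumption, not in the pseudocode) so the loop also terminates on non-separable data, and inputs carry a leading constant 1 for the bias.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def single_sample_perceptron(X, y, eta=1.0, max_epochs=100):
    """Cycle through the samples and update w on each misclassified one."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        updated = False
        for x_i, y_i in zip(X, y):
            if y_i * dot(w, x_i) <= 0:   # misclassified (or on the boundary)
                w = [wj + eta * y_i * xj for wj, xj in zip(w, x_i)]
                updated = True
        if not updated:                  # a full pass with no mistakes
            break
    return w

# Hypothetical linearly separable data; the first component is the constant 1.
X = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]

w = single_sample_perceptron(X, y)
margins = [y_i * dot(w, x_i) for x_i, y_i in zip(X, y)]
```

Here a single update on the first sample is already enough to classify all four points correctly.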

##### **Pocket Algorithm**

**Limitations of the Perceptron:**
- The Perceptron stops learning as soon as all training points are correctly classified, even if the decision boundary is suboptimal.
- When no linear decision boundary can perfectly separate the classes, the Perceptron fails to converge.

For data that are not linearly separable due to noise, the pocket algorithm keeps "in its pocket" the best $w$ encountered so far.

> Initialize $w$  
> for $t = 1, ..., T$  
> &nbsp;&nbsp;&nbsp;&nbsp;$i \leftarrow t \bmod N$  
> &nbsp;&nbsp;&nbsp;&nbsp;if $x^{(i)}$ is misclassified then  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$w^{new} = w + x^{(i)}y^{(i)}$  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if $E_{train}(w^{new}) \lt E_{train}(w)$ then  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$w = w^{new}$  
> end  
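
The sketch below implements a common variant of the pocket idea, which differs slightly from the pseudocode above: the perceptron weights keep being updated on every mistake, and a separate "pocket" copy is replaced only when the training error count drops. The data (made up, with one noisy label) and the choice $T = 50$ are assumptions for illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_error(w, X, y):
    """Number of training points that w misclassifies."""
    return sum(1 for x_i, y_i in zip(X, y) if y_i * dot(w, x_i) <= 0)

def pocket(X, y, T=50):
    w = [0.0] * len(X[0])
    best_w, best_err = list(w), train_error(w, X, y)
    for t in range(T):
        i = t % len(X)                       # 0-based cyclic index
        x_i, y_i = X[i], y[i]
        if y_i * dot(w, x_i) <= 0:           # perceptron update on a mistake
            w = [wj + y_i * xj for wj, xj in zip(w, x_i)]
            err = train_error(w, X, y)
            if err < best_err:               # keep the best weights seen so far
                best_w, best_err = list(w), err
    return best_w, best_err

# Hypothetical 1-D data (with a leading constant 1); the last label is noise,
# so no linear boundary classifies all five points.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [-1, -1, 1, 1, -1]

best_w, best_err = pocket(X, y)
```

Because the data are not separable, the plain perceptron would keep cycling, while the pocket retains the weights with a single training error.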

### **Linear Discriminant Analysis (LDA)**

#### **Introduction**

Fisher's Linear Discriminant Analysis 

LDA, like the Perceptron algorithm, seeks a line (or hyperplane) to separate data, but its approach is different.

**How it works?**
- Predicts the class of an observation $x$ by first **projecting it** to the space of new discriminant variables and then classifying it in this space

**Goal:**  After projection, the mapped data should be as separable as possible (unlike the Perceptron, which only focuses on correct classification).

**Dimensionality Reduction:** LDA can also reduce dimensions by creating a new feature (linear combination of original features) that best preserves class discrimination.

<div style="text-align:center">
  <img src="images/LDA.png" alt="LDA example">
</div>

**LDA Problem Definition**

**Problem:**
- $C = 2$ classes
- $\{(x^{(i)}, y^{(i)})\}_{i=1}^N$ training samples with $N_1$ samples from the first class ($C_1$)
and $N_2$ samples from the second class ($C_2$)

**Goal:**
- Find the best direction $w$, one that we hope will enable accurate classification

**Tip:**
- The projection of sample $x$ onto a line in direction $w$ is $w^Tx$

#### **Measure of Separation in the Projected Direction**

##### **large separation between the projected class means**

**Goal:**
$$\hat{w} = \argmax_w J(w), \quad J(w) = (\mu_1' - \mu_2')^2$$
$$\text{s.t. } ||w|| = 1$$

Where:
$$\mu_1 = \frac{1}{N_1}\sum_{x^{(i)} \in C_1} x^{(i)}, \quad \mu_1' = w^T\mu_1$$
$$\mu_2 = \frac{1}{N_2}\sum_{x^{(i)} \in C_2} x^{(i)}, \quad \mu_2' = w^T\mu_2$$

**Problem:** It does not consider the variances of the classes in the projected direction

##### **LDA Criteria**

**Fisher Idea:**
Find a projection direction $w$ that maximizes class separability by:
- **Maximizing the distance** between the projected means of the two classes *(between-class separation)*.
- **Minimizing the variance** (scatter) within each class after projection *(within-class compactness)*.

**Formula:**
$$J(w) = \frac{|\mu'_1 - \mu'_2|^2}{{s'_1}^2 + {s'_2}^2}$$

Where:
- $\mu'_1, \mu'_2$: Projected means of classes 1 and 2 onto direction w.
- ${s'_1}^2, {s'_2}^2$: Scatter (unnormalized variance) of the projected data within each class. These are scalars obtained from the class scatter matrices defined below.

Comparison between large separation between means *(left pic)* and LDA criteria *(right pic)*:
<div style="text-align:center">
  <img src="images/LDACriteria.png" alt="Comparison image">
</div>

**Scatter Matrix**

Measures how tightly data points are clustered around their class mean.  
Scatter is preferred over variance here because it is not normalized by the class size, so it reflects the number of samples per class.

The scatters of the original data:
$$s_1^2 = \sum_{x^{(i)} \in C_1} ||x^{(i)} - \mu_1||^2$$
$$s_2^2 = \sum_{x^{(i)} \in C_2} ||x^{(i)} - \mu_2||^2$$

The scatters of projected data:
$${s'_1}^2 = \sum_{x^{(i)} \in C_1} (w^Tx^{(i)} - w^T\mu_1)^2$$
$${s'_2}^2 = \sum_{x^{(i)} \in C_2} (w^Tx^{(i)} - w^T\mu_2)^2$$

- **Objective Function (Fisher's Criterion):**  
    Maximize the ratio of between-class separation to within-class scatter:
    $$J(w) = \frac{|\mu'_1 - \mu'_2|^2}{{s'_1}^2 + {s'_2}^2}$$
    Where:
    - $\mu_k'$: projected mean of class $k$
    - $s_k'^2$: scatter of projected class $k$

- **Key Components**  
    - Between-Class Separation:
        $$|\mu_1' - \mu_2'|^2 = |w^T\mu_1 - w^T\mu_2|^2$$
        $$ = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw$$
        
        - Between-class scatter matrix:
            $$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$$

    - Within-Class Scatter
        $${s'_1}^2 = \sum_{x^{(i)} \in C_1}(w^Tx^{(i)} - w^T\mu_1)^2 = w^T\left(\sum_{x^{(i)} \in C_1} \left(x^{(i)} - \mu_1\right) \left(x^{(i)} - \mu_1 \right)^T \right)w$$

        $${s'_2}^2 = \sum_{x^{(i)} \in C_2}(w^Tx^{(i)} - w^T\mu_2)^2 = w^T\left(\sum_{x^{(i)} \in C_2} \left(x^{(i)} - \mu_2\right) \left(x^{(i)} - \mu_2 \right)^T \right)w$$

        $$S_1 = \sum_{x^{(i)} \in C_1} {(x^{(i)} - \mu_1)(x^{(i)} - \mu_1)^T}$$
        $$S_2 = \sum_{x^{(i)} \in C_2} {(x^{(i)} - \mu_2)(x^{(i)} - \mu_2)^T}$$

        - Since ${s'_1}^2 = w^T S_1 w$ and ${s'_2}^2 = w^T S_2 w$ then:
            $${s'_1}^2 + {s'_2}^2 = w^T(S_1 + S_2)w$$
            $$S_W = S_1 + S_2$$

- **Generalized Objective**  
    Rewrite $J(w)$ using $S_B$ and $S_W$:
    $$J(w) = \frac{w^TS_Bw}{w^TS_Ww}$$

- **Optimization**  
    Since, for a symmetric matrix $A$:  
    $$\frac{\partial}{\partial w} w^TAw = 2Aw$$
    Take the derivative and set it to zero:  
    $$\frac{\partial J(w)}{\partial w} = \frac{\frac{\partial w^TS_Bw}{\partial w}w^TS_Ww - \frac{\partial w^TS_Ww}{\partial w}w^TS_Bw}{(w^TS_Ww)^2} = \frac{(2S_Bw)w^TS_Ww - (2S_Ww)w^TS_Bw}{(w^TS_Ww)^2} = 0$$
    So:  
    $$(2S_Bw)w^TS_Ww - (2S_Ww)w^TS_Bw = 0$$
    $$(S_Bw)w^TS_Ww = (S_Ww)w^TS_Bw$$
    Let $w^TS_Ww = \beta$ and $w^TS_Bw = \alpha$ (scalars):  
    $$S_Bw = \frac{\alpha}{\beta}S_Ww$$
    Define $\frac{\alpha}{\beta} = \lambda$:  
    $$S_Bw = \lambda S_Ww$$

- **Eigenvalue Problem:**  
    *The core relationship is $Av=\lambda v$, where A is the matrix, $v$ is the eigenvector, and $\lambda$ is the (scalar) eigenvalue.*  
    If $S_W$ is invertible (full-rank):
    $$S_W^{-1}S_Bw = \lambda w$$

    $S_Bw \propto (\mu_1 - \mu_2)$, since $S_Bw = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw$ and $(\mu_1 - \mu_2)^Tw$ is a scalar.  
    Thus, the optimal $w$ is:
    $$w \propto S_W^{-1}(\mu_1 - \mu_2)$$

##### **LDA Algorithm Summary**

- Find $\mu_1$ and $\mu_2$ as the mean of class 1 and 2
- Find $S_1$ and $S_2$ as scatter matrix of class 1 and 2
- $S_W = S_1 + S_2$
- $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$
- **Feature Extraction:** $w = S_W^{-1}(\mu_1 - \mu_2)$
- **Classification:** Using a threshold on $w^Tx$, we can classify $x$
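
The steps above can be sketched for 2-D data with a hand-written $2 \times 2$ inverse, so that only the standard library is needed. The two small classes and the midpoint threshold are assumptions for illustration; the summary itself only says that some threshold on $w^Tx$ is used.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, mu):
    """Class scatter matrix: sum of (x - mu)(x - mu)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def fisher_direction(c1, c2):
    mu1, mu2 = mean(c1), mean(c2)
    s1, s2 = scatter(c1, mu1), scatter(c2, mu2)
    sw = [[s1[a][b] + s2[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]   # assumes S_W is invertible
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]   # w = S_W^{-1}(mu1 - mu2)
    return w, mu1, mu2

def project(w, x):
    return w[0] * x[0] + w[1] * x[1]

# Hypothetical 2-D training data.
c1 = [[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]]
c2 = [[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]]

w, mu1, mu2 = fisher_direction(c1, c2)
# Assumed threshold: midpoint of the two projected class means.
threshold = 0.5 * (project(w, mu1) + project(w, mu2))
```

With this threshold, every point of `c1` projects above it and every point of `c2` below it.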

### **Multi-Category Classification**

#### **Introduction**

**What is it?**
- Solutions to multi-category problems

**How to solve:**
- Extend the learning algorithm to support multi-class:
    - A function $f_i(x)$ is found for each class $C_i$
    - $x$ is assigned to $C_i$ if $f_i(x) \gt f_j(x) \quad \forall j \neq i$
    $$\hat{y} = \argmax_{i=1, 2, ..., C} f_i(x)$$

- Converting the problem to a set of two-class problems

#### **Converting the problem to a set of two-class problems**

##### **One-vs-Rest** or **One Against All**

For each class $C_i$, a linear discriminant function that separates samples of $C_i$ from all the other samples is found.
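- This assumes the classes are *totally linearly separable*: each class can be separated from all the others by a single hyperplane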

**Decision Making:**  
Decision process for a new input works as follows:
- If the new data point clearly lies on the positive side of **only one classifier**, it is **unambiguously** assigned to that class.
- In ambiguous cases, the data point might lie on the positive side of multiple classifiers.  
In such situations, the algorithm compares the distance of the point from each decision boundary.  
For example, using the signed distance in the perceptron model:  
$$f_i(x) = w_i^Tx$$

**Limitations:**
- Even if the classes are linearly separable from each other pairwise, there may be no linear boundary that separates one specific class from all the rest. (The next algorithm addresses this issue.)
- Ambiguity

##### **One-vs-One**

$\binom{C}{2} = \frac{C(C-1)}{2}$ linear discriminant functions are used, one to separate samples of a pair of classes.
- Pairwise linearly separable

**Decision Function:**
For each classifier that separates class $i$ from class $j$, we define a function:
$$f_{ij}(x) = w_{ij}^Tx$$
This function returns the signed distance of the input $x$ from the decision boundary between classes $i$ and $j$.

**Decision Making:**
- First method:
    - Compute all $f_{ij}(x)$ for every class pair.
    - For each pair $(i, j)$, if $f_{ij}(x) \gt 0$ we count a vote for class $i$, otherwise for class $j$.
    - The class with the highest number of votes is chosen as the predicted class for $x$.

    **Limitation:** Two or more classes can receive the same number of votes, which leaves the prediction ambiguous.

- Second method:
    - We resolve the previous ambiguity using the following approach, with the convention $f_{ji}(x) = -f_{ij}(x)$:
    $$\hat{y} = \argmax_i \sum_{j \neq i} f_{ij}(x)$$
    - This selects the class with the largest total score.
    - **Limitation:** This method compares independently trained classifiers, which might not be consistent with each other.
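
The voting scheme (first method) can be sketched as follows. The three pairwise weight vectors are hand-picked for classes roughly centered at $(-2, 0)$, $(2, 0)$ and $(0, 3)$; they, the 0-based class indices, and the function name are assumptions for illustration. Inputs carry a leading constant 1 for the bias.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def ovo_predict(pair_weights, x, num_classes):
    """Each pairwise classifier (i, j) votes for i if f_ij(x) > 0, else for j."""
    votes = [0] * num_classes
    for (i, j), w_ij in pair_weights.items():
        winner = i if dot(w_ij, x) > 0 else j
        votes[winner] += 1
    # Ties are broken in favor of the lowest class index (an arbitrary choice).
    best = max(range(num_classes), key=lambda k: votes[k])
    return best, votes

# Hypothetical pairwise classifiers for three classes (0, 1, 2).
pair_weights = {
    (0, 1): [0.0, -1.0, 0.0],
    (0, 2): [2.5, -2.0, -3.0],
    (1, 2): [2.5, 2.0, -3.0],
}

pred_a, votes_a = ovo_predict(pair_weights, [1.0, -2.0, 0.0], 3)
pred_b, votes_b = ovo_predict(pair_weights, [1.0, 2.0, 0.0], 3)
pred_c, votes_c = ovo_predict(pair_weights, [1.0, 0.0, 3.0], 3)
```

With $C = 3$ classes there are $\binom{3}{2} = 3$ classifiers, so each prediction is decided by three votes.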


##### **Ambiguity**

Converting the multi-class problem to a set of two-class problems can lead to **regions in which the classification is undefined**:

<div style="text-align:center">
  <img src="images/LDAAmbiguity.png" alt="Ambiguity">
</div>

#### **Linear Machine**

##### **Introduction**

**Definition:**
- Alternative to *One-vs-Rest* and *One-vs-One* methods; each class $C_i$ $(i=1, 2, ..., K)$ is represented by its own discriminant function $f_i(x) = w_i^Tx + w_{0i}$

**Decision Rule:**  
$x$ is assigned to class $C_i$ if:
$$f_i(x) \gt f_j(x) \quad \forall j \neq i$$
Or:
$$\hat{y} = \argmax_{i=1, 2, ..., K} f_i(x)$$

**Decision Surfaces (Boundaries):**  
The boundary between regions $i$ and $j$ is the set of points $x$ where:
$$f_i(x) = f_j(x)$$
$$(w_i - w_j)^Tx + (w_{0i} - w_{0j}) = 0$$

<div style="text-align:center">
  <img src="images/LinearMachine.png" alt="Linear Machine example">
</div>

##### **Perceptron Multi-Class**

Maintain a weight matrix $W \in R^{m \times K}$, where $m$ is the *number of features* and $K$ is the *number of classes*.  
Each column $w_k$ of the matrix corresponds to the weight vector for class $k$.
$$\hat{y} = \argmax_{i=1, 2, ..., K} w_i^Tx$$
$$J_p(W) = -\sum_{i \in M} (w_{y^{(i)}} - w_{\hat{y}^{(i)}})^Tx^{(i)}$$
Where:
- $M$: Subset of training data that are misclassified.

> Initialize $W = [w_1, ..., w_K]$  
> Repeat  
> &nbsp;&nbsp;&nbsp;&nbsp;$i \leftarrow (i+1) \bmod N$  
> &nbsp;&nbsp;&nbsp;&nbsp;if $x^{(i)}$ is misclassified then  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$w_{\hat{y}^{(i)}} \leftarrow w_{\hat{y}^{(i)}} - \eta x^{(i)}$  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$w_{y^{(i)}} \leftarrow w_{y^{(i)}} + \eta x^{(i)}$  
> Until all patterns are properly classified.  
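
A minimal sketch of the multi-class perceptron follows. It stores one weight vector per class as a list of rows rather than as matrix columns, uses 0-based class labels, and caps the number of passes with `max_epochs`; these are implementation choices made for illustration, as is the tiny dataset.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def predict(W, x):
    """Linear machine: pick the class whose weight vector gives the top score."""
    scores = [dot(w_k, x) for w_k in W]
    return max(range(len(W)), key=lambda k: scores[k])

def multiclass_perceptron(X, y, num_classes, eta=1.0, max_epochs=100):
    W = [[0.0] * len(X[0]) for _ in range(num_classes)]
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = predict(W, x_i)
            if y_hat != y_i:
                # Push the wrongly predicted class down and the true class up.
                W[y_hat] = [w - eta * x for w, x in zip(W[y_hat], x_i)]
                W[y_i] = [w + eta * x for w, x in zip(W[y_i], x_i)]
                mistakes += 1
        if mistakes == 0:
            break
    return W

# Hypothetical data: one point per class, with a leading constant 1.
X = [[1.0, -2.0, 0.0], [1.0, 2.0, 0.0], [1.0, 0.0, 3.0]]
y = [0, 1, 2]

W = multiclass_perceptron(X, y, num_classes=3)
predictions = [predict(W, x_i) for x_i in X]
```

After training, `predictions` equals `y` on this toy dataset.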

## Probabilistic Classification

### **Introduction**

- **Class Prior:**  
    $p(y=C_k) = \int p(x, y=C_k)\, dx$  
    Fraction of samples belonging to class $C_k$
    
- **Likelihood:**  
    $p(x|C_k)$  
    Probability density function (PDF) of feature $x$ for class $C_k$

- **Prior Probability:**  
    $p(C_k)$  
    Probability that a randomly selected sample belongs to $C_k$

- **Posterior Probability:**  
    $p(C_k|x) = \frac{p(x|C_k)p(C_k)}{p(x)}$  
    Probability of class $C_k$ given the observed feature vector $x$.

- **Evidence:**  
    $p(x) = \sum_{k=1}^K p(x|C_k)p(C_k)$  
    PDF of feature vector $x$  

#### **Bayesian Decision Rule**

**Optimal Classifier:** Choose the class with the **maximum posterior probability**
$$
y = 
\begin{cases} 
C_1 &  \text{if } p(C_1 | x) > p(C_2 | x) \\
C_2 & \text{otherwise}
\end{cases}
$$

**Classification Error:**
$$
p(error | x) = 
\begin{cases} 
p(C_2 | x) & \text{if we decide } C_1 \\
p(C_1 | x) & \text{if we decide } C_2
\end{cases}
$$

*Minimizing error at each point $x$:*
$$p(error | x) = \min\{p(C_1 | x), p(C_2 | x)\}$$

The Bayesian decision rule can be simplified:

Posterior Comparison:
$$
y = 
\begin{cases} 
C_1 &  \text{if } p(C_1 | x) \gt p(C_2 | x) \\
C_2 & \text{otherwise}
\end{cases}
$$

Bayes' theorem applied to the posterior comparison:
$$
y = 
\begin{cases} 
C_1 &  \text{if } \frac{p(x|C_1)p(C_1)}{p(x)} \gt \frac{p(x|C_2)p(C_2)}{p(x)} \\
C_2 & \text{otherwise}
\end{cases}
$$

Likelihood-Prior Product
$$
y = 
\begin{cases} 
C_1 &  \text{if } p(x|C_1)p(C_1) \gt p(x|C_2)p(C_2) \\
C_2 & \text{otherwise}
\end{cases}
$$
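
The likelihood-prior comparison can be sketched for one-dimensional Gaussian class-conditionals. The class parameters, priors, and 0-based class indices below are made-up assumptions; the point is only to show how the prior shifts the decision.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_decide(x, params, priors):
    """Return the index of the class with the largest posterior, and all posteriors."""
    joint = [gaussian_pdf(x, mu, s) * p for (mu, s), p in zip(params, priors)]
    evidence = sum(joint)                      # p(x) = sum_k p(x|C_k) p(C_k)
    posteriors = [j / evidence for j in joint]
    return posteriors.index(max(posteriors)), posteriors

# Hypothetical classes: C_1 ~ N(0, 1) and C_2 ~ N(2, 1), indexed 0 and 1.
params = [(0.0, 1.0), (2.0, 1.0)]

label_equal, post_equal = bayes_decide(1.5, params, [0.5, 0.5])
label_skewed, post_skewed = bayes_decide(1.5, params, [0.9, 0.1])
```

With equal priors the point $x = 1.5$ goes to the second class, but a strong prior for the first class flips the decision, even though the likelihoods are unchanged.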

#### **Minimizing Misclassification Rate**

Assume:
- **Decision Function:**  
    $\alpha(x)$: Outputs class label $k$ for each $x$

- **Decision Regions:**  
    $R_k = \{x|\alpha(x) = k\}$  
    All $x$ in $R_k$ are assigned to class $C_k$

**Total Probability of Error:**
$$p(error) = E_{x,y}[I(\alpha(x) \neq y)]$$
$$ = p(x \in R_1, C_2) + p(x \in R_2, C_1)$$
$$ = \int_{R_1} p(x, C_2)dx + \int_{R_2} p(x, C_1)dx$$
$$ = \int_{R_1}p(C_2|x)p(x)dx + \int_{R_2}p(C_1|x)p(x)dx$$

**Optimal Decision Regions:**  
To minimize $p(error)$, assign $x$ to the class with the highest posterior:
$$\alpha(x) = \argmax_k p(C_k|x)$$

**Bayes Minimum Error Classifier:**

Objective:
$$\min_{\alpha}\{E_{x,y}[I(\alpha(x) \neq y)]\}$$

Solution (if true probabilities are known):
$$\alpha(x) = \argmax_y p(y|x)$$

*In practice we can estimate $p(y|x)$ based on the set of training samples $D$*

### **Generative Approach**

Assume:
- A Gaussian distribution for $p(x|C_1)$ and $p(x|C_2)$
- The priors are already known: $p(C_1) = \pi$ and $p(C_2) = 1 - \pi$ *(only for binary classification)*

Recall that for samples $D = \{x^{(1)}, x^{(2)}, ..., x^{(N)}\}$ with Gaussian distribution, MLE estimates will be:
$$\mu = \frac{1}{N} \sum_{i=1}^N x^{(i)}$$
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^N (x^{(i)} - \mu)^2$$

So we know both the prior and the likelihood. Their product is proportional to the posterior probability; dividing by the evidence $p(x)$ normalizes it.

**Covariance Matrix**

**Definition**  
The **covariance matrix** ($\Sigma$) is a square matrix that captures the pairwise covariances between features in a dataset. For a random vector $x = [x_1, x_2, ..., x_n]^T$, it's defined as:

$$
\Sigma = \begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn}
\end{bmatrix}
$$

where:
- Diagonal elements $\sigma_{ii}$ = variance of feature $x_i$
- Off-diagonal elements $\sigma_{ij}$ = covariance between $x_i$ and $x_j$

**Captures relationships**:
   - $\sigma_{ij} > 0$: Features increase together
   - $\sigma_{ij} < 0$: One increases when other decreases
   - $\sigma_{ij} = 0$: No linear relationship

**Example (2D Case)**  
For features $x$ and $y$:
$$
\Sigma = \begin{bmatrix}
\text{Var}(x) & \text{Cov}(x,y) \\
\text{Cov}(y,x) & \text{Var}(y)
\end{bmatrix}
$$

**Multivariate Gaussian Distribution:**
$$p(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} e^{(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu))}$$
Where:
- $x$: is a vector in $R^d$ ($d$-dimensional space) representing the random variables.
- $\mu$: is the mean vector.
- $\Sigma$: is the covariance matrix *(defined above)*
- $|\Sigma|$ is the determinant of $\Sigma$.

So for $p(x|C_k) = p(x|y=k)$ likelihood is:
$$p(x|y=k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} e^{(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k))}$$

Prior Distribution ($p(C_k)$)  
$$p(y=1) = \pi, \quad p(y=0) = 1 - \pi$$

**MLE for Multivariate Gaussian**

Assume:
- Dataset: $\{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ where $x^{(i)} \in R^d$
- Samples drawn from a multivariate Gaussian distribution

For each class MLE estimates:
- **Mean Vector ($\mu$)**
    $$\mu = \frac{\sum_{i=1}^N x^{(i)}}{N}$$

- **Covariance Matrix ($\Sigma$)**
    $$\Sigma = \frac{1}{N} \sum_{i=1}^N (x^{(i)} - \mu)(x^{(i)} - \mu)^T$$
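
The two estimates can be sketched in plain Python as below. The function name and the three sample points are illustrative assumptions; note the division by $N$ (not $N - 1$), matching the MLE formulas above.

```python
def mle_gaussian(samples):
    """MLE mean vector and covariance matrix (normalized by N)."""
    n = len(samples)
    d = len(samples[0])
    mu = [sum(x[j] for x in samples) / n for j in range(d)]
    sigma = [
        [sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in samples) / n
         for b in range(d)]
        for a in range(d)
    ]
    return mu, sigma

# Hypothetical samples of one class in 2-D.
samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
mu, sigma = mle_gaussian(samples)
# mu = [3, 5]; sigma = [[8/3, 14/3], [14/3, 26/3]]
```

In a classifier, this function would be applied separately to the samples of each class.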

**Decision Boundary for Gaussian Bayes Classifier**

$$p(C_1 | x) = p(C_2 | x)$$
$$\ln p(C_1 | x) = \ln p(C_2 | x)$$
$$\ln p(x|C_1) + \ln p(C_1) - \ln p(x) = \ln p(x|C_2) + \ln p(C_2) - \ln p(x)$$
$$\ln p(x|C_1) + \ln p(C_1) = \ln p(x|C_2) + \ln p(C_2)$$
Where:
$$\ln p(x|C_k) = -\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)$$

Key Result:
- **Quadratic Boundary**: Curved surface when $\Sigma_1 \neq \Sigma_2$
- The posterior $p(C_k \mid x)$ follows a **sigmoidal (logistic) curve**

<div style="text-align:center">
  <img src="images/decisionSurfaceGaussianBayes.png" alt="Decision Boundary Example">
</div>

**Shared Covariance Matrix**

When all classes share a single covariance matrix $\Sigma = \Sigma_1 = \Sigma_2 = ... = \Sigma_K$:
$$p(x|y=k) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} e^{(-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k))}$$
$$p(C_1) = \pi \quad p(C_2) = 1 - \pi$$
Where (with labels $y^{(n)} \in \{0, 1\}$ and $y^{(n)} = 1$ for class $C_1$):
$$\pi = \frac{N_1}{N}$$
$$\mu_1 = \frac{\sum_{n=1}^N y^{(n)}x^{(n)}}{N_1}$$
$$\mu_2 = \frac{\sum_{n=1}^N (1 - y^{(n)})x^{(n)}}{N_2}$$
$$\Sigma = \frac{1}{N}\left(\sum_{n \in C_1}(x^{(n)} - \mu_1)(x^{(n)} - \mu_1)^T + \sum_{n \in C_2}(x^{(n)} - \mu_2)(x^{(n)} - \mu_2)^T\right)$$

**Decision Boundary for Shared Covariance Gaussian**

$$p(C_1 | x) = p(C_2 | x)$$
$$\ln p(C_1 | x) = \ln p(C_2 | x)$$
$$\ln p(x|C_1) + \ln p(C_1) - \ln p(x) = \ln p(x|C_2) + \ln p(C_2) - \ln p(x)$$
$$\ln p(x|C_1) + \ln p(C_1) = \ln p(x|C_2) + \ln p(C_2)$$
Where:
$$\ln p(x|C_k) = \underbrace{-\frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma|}_{\text{Constant}} - \frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)$$

The constant term is the same for both classes and cancels, leaving only the terms that depend on $x$ and on the priors:
$$-\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \ln \pi = -\frac{1}{2} (x - \mu_2)^T \Sigma^{-1} (x - \mu_2) + \ln (1 - \pi)$$

Expand $(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)$ term:
$$(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) = x^T \Sigma^{-1} x - 2 \mu_k^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} \mu_k$$

Substitute back:
$$-\frac{1}{2} \left( x^T \Sigma^{-1} x - 2 \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 \right) + \ln \pi = \\
-\frac{1}{2} \left( x^T \Sigma^{-1} x - 2 \mu_2^T \Sigma^{-1} x + \mu_2^T \Sigma^{-1} \mu_2 \right) + \ln (1 - \pi)$$

Since $\Sigma$ is shared, $x^T \Sigma^{-1} x$ cancels out:
$$\mu_1^T \Sigma^{-1} x - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \ln \pi = \mu_2^T \Sigma^{-1} x - \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln (1 - \pi)$$

So the decision boundary reduces to a linear equation in $x$:
$$\underbrace{(\mu_1 - \mu_2)^T \Sigma^{-1}}_{w^T} x + \underbrace{\ln \frac{\pi}{1 - \pi} - \frac{1}{2} (\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2)}_{w_0} = 0$$
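
The resulting $w$ and $w_0$ can be computed directly. The sketch below handles the 2-D case with a hand-written $2 \times 2$ inverse; the function name and the example means, covariance, and prior are assumptions for illustration.

```python
import math

def shared_cov_boundary(mu1, mu2, sigma, prior1):
    """Return (w, w0) with w = Sigma^{-1}(mu1 - mu2) and
    w0 = ln(pi / (1 - pi)) - 0.5 * (mu1^T Sigma^{-1} mu1 - mu2^T Sigma^{-1} mu2)."""
    a, b = sigma[0]
    c, d = sigma[1]
    det = a * d - b * c                       # assumes Sigma is invertible
    inv = [[d / det, -b / det], [-c / det, a / det]]

    def mat_vec(m, v):
        return [m[0][0] * v[0] + m[0][1] * v[1],
                m[1][0] * v[0] + m[1][1] * v[1]]

    def quad(v):                              # v^T Sigma^{-1} v
        mv = mat_vec(inv, v)
        return v[0] * mv[0] + v[1] * mv[1]

    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    w = mat_vec(inv, diff)
    w0 = math.log(prior1 / (1.0 - prior1)) - 0.5 * (quad(mu1) - quad(mu2))
    return w, w0

# Hypothetical parameters: equal priors and a diagonal shared covariance.
w, w0 = shared_cov_boundary([2.0, 0.0], [0.0, 0.0], [[2.0, 0.0], [0.0, 1.0]], 0.5)
# Boundary: 1.0 * x1 + 0.0 * x2 - 1.0 = 0, i.e. the vertical line x1 = 1.
```

With equal priors the boundary passes through the midpoint of the two means; a larger prior for $C_1$ increases $w_0$ and moves the boundary toward $\mu_2$.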

**Naive Bayes Classifier**

**Generative Method Issue**
- **High number of parameters:**
    - Mean vectors:  
    $\mu_1 \in \mathbb{R}^d$ (first class), $\mu_2 \in \mathbb{R}^d$ (second class)
    Total: $2d$ parameters

    - Covariance matrix:  
    Symmetric $\Sigma \in \mathbb{R}^{d \times d}$ with $\frac{d(d+1)}{2}$ unique parameters

**Assumption:**
- Conditional independence of features given the class:
    $$p(x|C_k) = p(x_1|C_k)p(x_2|C_k)...p(x_d|C_k)$$

**Geometric Interpretation (Gaussian class-conditionals):**
- Forces the covariance matrix $\Sigma$ to be diagonal
- Ignores correlations between features
- Contours of each class-conditional density become axis-aligned ellipses

**Naive Bayes Decision Rule:**
$$y = \argmax_{k=1,2,...,K} p(C_k | x) = \argmax_{k=1,2,...,K} p(C_k)\prod_{i=1}^d p(x_i | C_k)$$

Parameter count for $d$ binary features:
- Full joint distribution $p(x|C_k)$: $2^d - 1$ parameters per class
- Naive Bayes: only $d$ parameters per class
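A minimal sketch of the naive Bayes decision rule for binary features, assuming each $p(x_j | C_k)$ is Bernoulli. The toy data, the function names, the 0-based class labels, and the Laplace smoothing term `alpha` are assumptions added for illustration; they are not part of the notes above.

```python
import numpy as np

def fit_bernoulli_nb(X, y, num_classes, alpha=1.0):
    """Estimate p(C_k) and p(x_j = 1 | C_k); alpha is Laplace smoothing (an assumption)."""
    n, d = X.shape
    priors = np.zeros(num_classes)
    theta = np.zeros((num_classes, d))  # d parameters per class
    for k in range(num_classes):
        X_k = X[y == k]
        priors[k] = len(X_k) / n
        theta[k] = (X_k.sum(axis=0) + alpha) / (len(X_k) + 2 * alpha)
    return priors, theta

def predict_bernoulli_nb(X, priors, theta):
    """argmax_k [ ln p(C_k) + sum_j ln p(x_j | C_k) ], computed in log space."""
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    return np.argmax(np.log(priors) + log_lik, axis=1)

# Toy data: 6 samples, d = 3 binary features, classes labeled 0 and 1
X = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1],
              [0, 0, 1], [0, 1, 1], [0, 0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

priors, theta = fit_bernoulli_nb(X, y, num_classes=2)
preds = predict_bernoulli_nb(np.array([[1, 1, 0], [0, 0, 1]]), priors, theta)

d = X.shape[1]
full_joint_params = 2 ** d - 1   # per class, without the independence assumption
naive_bayes_params = d           # per class, with it
print(preds, full_joint_params, naive_bayes_params)
```

Working in log space turns the product over features into a sum, which avoids underflow when $d$ is large.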

### **Discriminative Classifiers**

#### **Introduction**

**Generative Approach:** *(left graph)*
- **Inference Stage:**
    - **Goal:** Model the joint distribution $p(x,y)$
    - **Steps:**
        - **Likelihood:** Estimate $p(x | C_k)$ using: 
            - Gaussian (univariate)
            - Multivariate Gaussian (for multi-dimensional features)
        - **Prior:** Estimate $p(C_k)$ using:
            - Bernoulli (binary classes)
            - Multinomial (multi-class)
        - **Posterior:** Apply Bayes' theorem:
            $$p(C_k | x) = \frac{p(x|C_k)p(C_k)}{p(x)}$$
- **Decision Stage:**
    - After learning the model, the optimal class for a new input is:
    $$\argmax_k p(C_k|x)$$

**Discriminative Approach:** *(right graph)*
- Directly estimate $p(C_k | x)$ for each class $C_k$

**Two-Class Problem:**

Assume:
$$\alpha(x) = w^Tx + w_0$$

<div style="text-align:center">
  <img src="images/discriminativeVsGenerative.png" alt="Comparison Generative and Discriminative Approaches">
</div>

$p(C_1 | x)$ can be written as a *sigmoid (logistic) function*:
$$p(C_1 | x) = \frac{1}{1+e^{-\alpha(x)}} = \sigma(w^Tx+w_0)$$
$$p(C_2 | x) = 1 - p(C_1 | x)$$
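To see why the posterior takes a sigmoid form, the sketch below compares the posterior from Bayes' theorem with $\sigma(wx + w_0)$ for two one-dimensional Gaussians that share a variance. All numeric values are made up, and the closed forms for `w` and `w0` are the one-dimensional special case of a shared-covariance Gaussian model, used here as an assumption for illustration.

```python
import math

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Made-up generative model: two classes, shared variance
mu1, mu2, var, prior1 = 1.0, -1.0, 0.5, 0.4

# alpha(x) = w * x + w0 is the log-odds ln[p(x|C_1)p(C_1)] - ln[p(x|C_2)p(C_2)]
w = (mu1 - mu2) / var
w0 = -(mu1 ** 2 - mu2 ** 2) / (2 * var) + math.log(prior1 / (1 - prior1))

x = 0.3
joint1 = gaussian_pdf(x, mu1, var) * prior1
joint2 = gaussian_pdf(x, mu2, var) * (1 - prior1)

posterior_bayes = joint1 / (joint1 + joint2)   # generative route
posterior_sigmoid = sigmoid(w * x + w0)        # sigmoid of a linear function
print(posterior_bayes, posterior_sigmoid)      # the two values agree
```

The discriminative approach skips the densities and fits `w` and `w0` directly.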

**Multi-Class Problem:**

Assume:
$$\alpha_k(x) = w_k^Tx + w_{k0}, \quad k = 1, ..., K$$

<div style="text-align:center">
  <img src="images/MultiDiscVsGen.png" alt="Comparison Multi-Class Generative and Discriminative Approaches">
</div>

$p(C_k | x)$ can be written as a *softmax function*:
$$p(C_k | x) = \frac{e^{\alpha_k(x)}}{\sum_{j=1}^K e^{\alpha_j(x)}}$$

#### **Logistic Regression**

##### **Introduction**

**Sigmoid (Logistic) Function:**

Activation Function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

<div style="text-align:center">
  <img src="images/sigmoid.png" alt="sigmoid function">
</div>

**Key Points:**
- It is a good candidate for an activation function
- It smoothly maps any real number to a value between 0 and 1
- It is differentiable everywhere

**Result:**

$$p(y = 1 | x, w) = f(x;w)$$
$$p(y = 0 | x, w) = 1 - f(x;w)$$

Where:
$$f(x;w) = \sigma(w^Tx)$$
$$0 \lt f(x;w) \lt 1$$
$$\sigma(w^Tx) = \frac{1}{1+e^{(-w^Tx)}}$$

##### **Decision Surface (Boundary)**

**Recall:**
- Definition: A dividing hyperplane that separates different classes in a feature space.
- In a $d$-dimensional feature space, the decision boundary for a linear classifier is a hyperplane of dimension $d-1$.

- Decision surface $f(x;w) = \text{constant}$
    $$f(x;w) = \sigma(w^Tx) = \frac{1}{1 + e ^{-(w^Tx)}} = 0.5$$

- Since $\sigma(z) = 0.5$ exactly when $z = 0$, the decision surface is the hyperplane $w^Tx = 0$, which is **linear** in $x$. The resulting decision rule is:
$$
\hat{y} = 
\begin{cases} 
1 & \text{if } f(x;w) \ge 0.5 \\
0 & \text{otherwise}
\end{cases}
$$
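Because $\sigma(z) \ge 0.5$ exactly when $z \ge 0$, thresholding the probability at $0.5$ is the same as checking the sign of $w^Tx$. A small sketch with made-up weights (the first component of each input is a constant 1 that plays the role of the bias feature, which is an assumption of this example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, threshold=0.5):
    """Label 1 where f(x; w) = sigmoid(w^T x) >= threshold, else 0."""
    return (sigmoid(X @ w) >= threshold).astype(int)

# Made-up weights; each row of X starts with the constant feature 1
w = np.array([-1.0, 2.0, 1.0])
X = np.array([
    [1.0, 0.0, 0.0],   # w^T x = -1
    [1.0, 1.0, 0.0],   # w^T x =  1
    [1.0, 0.5, 0.0],   # w^T x =  0 (exactly on the boundary)
    [1.0, 0.0, 3.0],   # w^T x =  2
])

by_probability = predict(X, w)
by_sign = (X @ w >= 0).astype(int)   # same rule, no sigmoid needed
print(by_probability, by_sign)
```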

##### **Maximum Likelihood Estimation (MLE)**

**Maximum Log Likelihood:**
$$\hat{w} = \argmax_w \log \left(\prod_{i=1}^N p(y^{(i)} | w, x^{(i)})\right)$$

**Bernoulli Model**  
- For binary classification:
$$
p(y^{(i)} | w, x^{(i)}) = 
\begin{cases} 
f(x^{(i)};w) & \text{if } y^{(i)} = 1 \\
1 - f(x^{(i)};w) & \text{if } y^{(i)} = 0
\end{cases}
$$
- Compact form:
$$p(y^{(i)} | w, x^{(i)}) = f(x^{(i)};w)^{y^{(i)}} (1 - f(x^{(i)};w))^{(1 - y^{(i)})}$$

**Substitute into the MLE formula:**
$$\log \left(\prod_{i=1}^N p(y^{(i)} | w, x^{(i)})\right) = \sum_{i=1}^N \left[y^{(i)}\log \left(f(x^{(i)};w)\right) + (1 - y^{(i)}) \log\left(1 - f(x^{(i)};w)\right) \right]$$

##### **Cost Function**

**Cost Function: Negative Log-Likelihood**

To convert maximization to minimization:
$$J(w) = -\sum_{i=1}^N \log \left(p(y^{(i)} | w, x^{(i)}) \right)$$
$$ = -\sum_{i=1}^N \left[ y^{(i)}\log \left(f(x^{(i)};w)\right) + (1 - y^{(i)}) \log \left(1 - f(x^{(i)};w)\right) \right]$$

So:
$$\hat{w} = \argmin_w J(w)$$

**Key Properties:**
- No closed-form solution for $\nabla_wJ(w) = 0$  
- However, $J(w)$ is **convex**, so any local minimum is a global minimum.
- Solution method: use iterative optimization (e.g., gradient descent).
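The cost $J(w) = -\sum_i \left[y^{(i)}\log f(x^{(i)};w) + (1-y^{(i)})\log(1 - f(x^{(i)};w))\right]$ can be evaluated directly, but $\log$ of a value very close to 0 is numerically fragile. The sketch below shows the direct form next to an equivalent form based on `np.logaddexp`, using the identities $-\log\sigma(z) = \log(1+e^{-z})$ and $-\log(1-\sigma(z)) = \log(1+e^{z})$. The data and function names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_direct(w, X, y):
    """Negative log-likelihood written exactly as in the formula."""
    f = sigmoid(X @ w)
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

def cost_stable(w, X, y):
    """Same quantity, with log(1 + e^z) computed as logaddexp(0, z)."""
    z = X @ w
    return np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))

# Toy data: first column is the constant feature 1
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])

w_zero = np.zeros(2)
w_some = np.array([-0.5, 1.5])

cost_at_zero = cost_direct(w_zero, X, y)   # every f = 0.5, so J = N * ln 2
print(cost_at_zero, len(y) * np.log(2))
print(cost_direct(w_some, X, y), cost_stable(w_some, X, y))
```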

##### **Gradient Descent**

Recall:
$$w^{t+1} = w^t - \eta \nabla_wJ(w^t)$$

Where:
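$$\nabla_wJ(w^t) = \sum_{i=1}^N \left(f(x^{(i)};w^t) - y^{(i)}\right)x^{(i)}$$

Compare with the gradient of the SSE cost used in linear regression, which has the same form (up to a constant factor) with $w^Tx^{(i)}$ in place of $f(x^{(i)};w)$:
$$\nabla_wJ(w^t) = \sum_{i=1}^N \left(w^Tx^{(i)} - y^{(i)}\right)x^{(i)}$$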

**Proof**

Cost Function:
$$J(w) = -\sum_{i=1}^N \left[ y^{(i)}\log \left(f(x^{(i)};w)\right) + (1 - y^{(i)}) \log \left(1 - f(x^{(i)};w)\right) \right]$$

Where:
- $f(x^{(i)};w) = \sigma(w^Tx^{(i)})$ is the sigmoid function.

Derivative of a single term $(x^{(i)}, y^{(i)})$:
$$\frac{\partial}{\partial w}\left[y^{(i)}\log \left(\sigma(w^Tx^{(i)})\right) + (1 - y^{(i)})\log \left(1 - \sigma(w^Tx^{(i)})\right) \right]$$

Using Chain Rule:
$$\frac{\partial}{\partial w}\sigma(w^Tx^{(i)}) = \sigma(w^Tx^{(i)})(1 - \sigma(w^Tx^{(i)}))x^{(i)}$$

Differentiate the two log terms separately:
- Term multiplied by $y^{(i)}$ (active when $y^{(i)} = 1$):
$$\frac{\partial}{\partial w} \log(\sigma(w^Tx^{(i)})) = \frac{1}{\sigma(w^Tx^{(i)})}\sigma(w^Tx^{(i)})(1 - \sigma(w^Tx^{(i)}))x^{(i)} = (1 - \sigma(w^Tx^{(i)}))x^{(i)}$$
- Term multiplied by $1 - y^{(i)}$ (active when $y^{(i)} = 0$):
$$\frac{\partial}{\partial w} \log(1 - \sigma(w^Tx^{(i)})) = \frac{-1}{1 - \sigma(w^Tx^{(i)})}\sigma(w^Tx^{(i)})(1 - \sigma(w^Tx^{(i)}))x^{(i)} = -\sigma(w^Tx^{(i)})x^{(i)}$$

Combine the cases: weighting the two derivatives by $y^{(i)}$ and $1 - y^{(i)}$ gives $\left(y^{(i)} - \sigma(w^Tx^{(i)})\right)x^{(i)}$, and the leading minus sign in $J(w)$ flips the sign, so the gradient for one sample simplifies to:
$$\left(\sigma(w^Tx^{(i)}) - y^{(i)}\right)x^{(i)} = \left(f(x^{(i)};w) - y^{(i)}\right)x^{(i)}$$

Sum over all $N$ samples (full gradient):
$$\nabla_wJ(w^t) = \sum_{i=1}^N \left(f(x^{(i)};w^t) - y^{(i)}\right)x^{(i)}$$
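The logistic regression gradient $\nabla_wJ(w) = \sum_i \left(\sigma(w^Tx^{(i)}) - y^{(i)}\right)x^{(i)}$ can be sanity-checked against finite differences and then plugged into the update $w^{t+1} = w^t - \eta\nabla_wJ(w^t)$. This is a minimal sketch: the toy data, the step size `eta`, and the number of steps are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    """Negative log-likelihood, using log(1 + e^z) = logaddexp(0, z)."""
    z = X @ w
    return np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))

def gradient(w, X, y):
    """sum_i (sigmoid(w^T x_i) - y_i) x_i, written as a matrix product."""
    return X.T @ (sigmoid(X @ w) - y)

def numerical_gradient(w, X, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (cost(w + step, X, y) - cost(w - step, X, y)) / (2 * eps)
    return g

def gradient_descent(X, y, eta=0.1, num_steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        w = w - eta * gradient(w, X, y)
    return w

# Toy, non-separable data: first column is the constant feature 1
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

w_test = np.array([0.3, -0.7])
print(gradient(w_test, X, y), numerical_gradient(w_test, X, y))

w_hat = gradient_descent(X, y)
print(w_hat, cost(np.zeros(2), X, y), cost(w_hat, X, y))
```

The data are deliberately not linearly separable, so the minimizer is finite and the gradient at `w_hat` is close to zero.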

**Loss Function**

**Definition:** The penalty incurred by the prediction on a single sample. Summing it over the entire dataset gives the overall cost $J(w)$.

**Formula:**
$$Loss(y,f(x;w)) = -y \log \left(f(x;w)\right) - (1 - y) \log \left(1 - f(x;w) \right)$$
Since in binary classification either $y = 1$ or $y = 0$:
$$
Loss(y,f(x;w)) =
\begin{cases} 
-\log \left(f(x;w)\right) & \text{if } y = 1 \\
-\log \left(1 - f(x;w)\right) & \text{if } y = 0
\end{cases}
$$

**Key Properties:**
- **Penalty for Overconfidence:**
    - It heavily penalizes confident but incorrect predictions (e.g., predicting a value close to 1 when the true label is 0). This encourages the model to be cautious unless it’s highly certain.
- **Encourages Confidence Near Decision Boundary:**
    - Even correct predictions incur a loss if they are close to the decision threshold (e.g., predicting 0.51 when the true label is 1), pushing the model to be more confident and create better class separation.
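A few numbers make these properties concrete. The sketch below evaluates $Loss(y, f) = -y\log f - (1-y)\log(1-f)$ for three illustrative predictions; the probabilities are chosen only as examples.

```python
import math

def log_loss(y, f):
    """Per-sample loss for a label y in {0, 1} and predicted probability f."""
    return -y * math.log(f) - (1 - y) * math.log(1 - f)

confident_correct = log_loss(1, 0.99)   # about 0.010
barely_correct = log_loss(1, 0.51)      # about 0.673
confident_wrong = log_loss(0, 0.99)     # about 4.605

print(confident_correct, barely_correct, confident_wrong)
```

The barely correct prediction still pays a noticeable loss, and the confident mistake costs several hundred times more than the confident correct prediction.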

##### **Multi-Class Logistic Regression**

- **Definition:** A problem where we have K classes and every sample only belongs to one class *(for simplicity)*

- For each class $k$, $f_k(x;W)$ predicts the probability of $y=k$:
    $$p(y=k|x,W)$$
    Where:
    - $W$ denotes a matrix of the $w_i$'s, where each $w_i$ is the weight vector dedicated to class label $i$.
    - $f_k(x;W) = \sigma_k(x;W)$, the $k$-th output of the softmax function defined below.

- On a new input $x$, pick the class that maximizes $f_k(x;W)$:
$$\hat{y} = \argmax_{k=1,2,...,K}f_k(x;W)$$

**Problem Setup**

**Assumptions:**
- $K \gt 2$
- $y \in \{1, 2, ..., K\}$

**Normalized exponential (Softmax)**
$$f_k(x;W) = p(y=k | x) = \frac{e^{w_k^Tx}}{\sum_{j = 1}^K e^{w_j^Tx}}$$

**Decision Behavior:**  
$$\text{If } {w_k^Tx} \gg {w_j^Tx} \quad \forall j \neq k \Rightarrow
\begin{cases}
p(C_k|x) \approx 1 \\
p(C_j|x) \approx 0
\end{cases}$$

Recall *Bayes' theorem* formulation:
$$p(C_k|x) = \frac{p(x|C_k)p(C_k)}{\sum_{j=1}^K p(x|C_j)p(C_j)}$$

**Softmax Function**

The softmax function is a good candidate because:
- It smoothly highlights the maximum score
- It is differentiable
- It handles negative scores, since the exponential maps every score to a positive value
- Normalization:
$$\sum_{k=1}^K \frac{e^{w_k^Tx}}{\sum_{j=1}^K e^{w_j^Tx}} = 1$$
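These properties are easy to check numerically. The sketch below uses a common implementation trick that is not part of the notes above: subtracting the largest score before exponentiating, which leaves the result unchanged (the factor cancels in the ratio) but avoids overflow. The scores are made-up values standing in for $w_k^Tx$.

```python
import numpy as np

def softmax(scores):
    """Softmax of a 1-D score vector, shifted by its maximum for stability."""
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Negative scores are fine: every output is positive and the outputs sum to 1
probs = softmax(np.array([-3.0, 0.5, 2.0]))
print(probs, probs.sum())

# One score far above the rest: its probability is close to 1
dominant = softmax(np.array([1000.0, 0.0, -1000.0]))
print(dominant)
```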

**Recall Multinomial Distribution:**

Parameter Definition:
$$\theta = [\theta_1, \theta_2, ..., \theta_K]$$
Where:
$$\theta_k \in [0, 1] \quad \text{and} \quad \sum_{k=1}^{K} \theta_k = 1$$
$$\theta_k = p(x_k = 1)$$

Likelihood:
$$P(x|\theta) = \prod_{k=1}^K \theta_k^{x_k} = \theta_j \quad \text{(when $x_j = 1$)}$$

Set **cost function** as **negative of log likelihood**.

We need $\hat{W} = \argmin_W J(W)$

$$J(W) = -\log\prod_{i=1}^N p(y^{(i)} | x^{(i)}, W)$$
$$ = -\log \prod_{i=1}^N \prod_{k=1}^K f_k(x^{(i)};W)^{y_k^{(i)}}$$
$$ = -\sum_{i=1}^N \sum_{k=1}^K y_k^{(i)}\log \left(f_k(x^{(i)};W)\right)$$

Here $y_k^{(i)} = 1$ if sample $i$ belongs to class $k$ and $0$ otherwise (one-hot encoding).

There is no closed-form solution for $\hat{W}$.  
Use iterative optimization instead.

**Gradient Descent**
$$w_j^{t+1} = w_j^{t} - \eta \nabla_{w_j}J(W^t)$$
Where:
$$\nabla_{w_j}J(W) = \sum_{i=1}^N \left(f_j(x^{(i)};W) - y_j^{(i)}\right)x^{(i)}$$
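A minimal sketch of softmax regression trained with this update, where the gradient for class $j$ is $\sum_i \left(f_j(x^{(i)};W) - y_j^{(i)}\right)x^{(i)}$ and the cost is the negative log-likelihood over one-hot labels. Several things here are assumptions made for illustration: the toy data, the step size, the number of steps, storing $w_j$ as column $j$ of `W`, and labeling the classes from 0 instead of 1.

```python
import numpy as np

def log_softmax_rows(scores):
    """Row-wise log-softmax, shifted by the row maximum for stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cost(W, X, Y):
    """J(W) = -sum_i sum_k y_k^(i) log f_k(x^(i); W)."""
    return -np.sum(Y * log_softmax_rows(X @ W))

def gradient(W, X, Y):
    """Column j holds sum_i (f_j(x^(i); W) - y_j^(i)) x^(i)."""
    probs = np.exp(log_softmax_rows(X @ W))
    return X.T @ (probs - Y)

def fit(X, Y, eta=0.1, num_steps=1000):
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(num_steps):
        W = W - eta * gradient(W, X, Y)
    return W

# Toy data: constant feature 1 plus one real feature, K = 3 classes
X = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, -0.2],
              [1.0, 0.2], [1.0, 1.5], [1.0, 2.0]])
labels = np.array([0, 0, 1, 1, 2, 2])
Y = np.eye(3)[labels]            # one-hot encoding of the labels

W_init = np.zeros((2, 3))
W_hat = fit(X, Y)
predictions = np.argmax(X @ W_hat, axis=1)   # argmax_k f_k(x; W)

print(cost(W_init, X, Y), cost(W_hat, X, Y))
print(predictions)
```

With all weights at zero every class gets probability $1/K$, so the initial cost is $N \ln K$; it decreases as training proceeds.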

### **Summary**

**Logistic Regression (LR)**

- **Binary Classification:**
    - **Type:** Linear classifier with probabilistic outputs
    - **Assumption:** Bernoulli-distributed $P(y|x)$ with mean $\sigma(w^Tx) = \frac{1}{1+e^{-w^Tx}}$
    - **Optimization:**
        - MLE-derived cost: $J(w) = -\sum [y_i\log\sigma(w^Tx_i) + (1-y_i)\log(1-\sigma(w^Tx_i))]$
        - No closed-form solution for its optimization problem
        - Convex: Global optimum via gradient descent

- **Multi-Class Extension**
    - **Assumption:** Multinomial Distribution for $K$ classes with:
        - Probability vector: $\theta = [\theta_1,...,\theta_K]$ where $\sum_k \theta_k=1$
        - One-hot labels: $y \in \{0,1\}^K$ (single 1 per sample)
    - **Softmax Regression** (Generalizes logistic regression):
    $$p(y=k | x) = \frac{e^{w_k^Tx}}{\sum_{j = 1}^K e^{w_j^Tx}}$$
    - **Optimization:**
        - MLE-derived cost
        - No closed-form solution for its optimization problem
        - Convex: Global optimum via gradient descent

**Discriminative vs. Generative: Number of Parameters**

For a $d$-dimensional feature space and two classes:

- **Logistic Regression:** $d+1$ parameters
    - $w = (w_0, w_1, ..., w_d)$

- **Generative Approach:** Gaussian class-conditional with shared covariance matrix:
    - $2d$ parameters for means
    - $\frac{d(d+1)}{2}$ parameters for shared covariance matrix
    - One parameter for the class prior $p(C_1)$ (since $p(C_2) = 1 - p(C_1)$)
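The gap between the two counts grows quadratically with $d$. A short sketch that tabulates the two-class counts listed above (the function names are hypothetical):

```python
def logistic_regression_params(d):
    """Weights w_0, w_1, ..., w_d."""
    return d + 1

def shared_cov_gaussian_params(d):
    """Two mean vectors, one symmetric covariance matrix, one class prior."""
    return 2 * d + d * (d + 1) // 2 + 1

for d in (2, 10, 100):
    print(d, logistic_regression_params(d), shared_cov_gaussian_params(d))
# prints:
# 2 3 8
# 10 11 76
# 100 101 5251
```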

LR is **more robust** and **less sensitive** to incorrect modeling assumptions.