# Part I: Overview of Multiclass Classification with Binary Logistic Regression

Multiclass logistic regression is a technique for categorizing samples into one of three or more classes. While logistic regression is inherently designed for binary classification, it can be extended to handle multiclass problems using techniques such as the **One vs. All** and **All-Pairs** approaches. Both methods rely on binary logistic regression classifiers to make multiclass predictions, but they use them in fundamentally different ways. The **One vs. All** approach trains one classifier per class, separating that class from all others, while the **All-Pairs** approach trains a binary classifier for every pair of classes and combines their outputs. For binary logistic regression, the sigmoid function maps the model's linear output to a probability, and the decision boundary sits at a probability of 0.5. The log loss function measures the difference between predicted probabilities and actual labels, and it is minimized with stochastic gradient descent (SGD). Training continues until either a maximum of 1000 epochs is reached or the convergence threshold of $1 \times 10^{-4}$ is met.


## Binary Logistic Regression Math 
Logistic Regression uses the sigmoid function, which is defined as follows:
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
where $z = \langle w, x \rangle$.
This function takes an input $x \in X = \mathbb{R}^d$ (through $z$) and outputs a continuous value in $(0, 1)$. That value is interpreted as a probability and used to classify the point with a label in $Y = \{1, -1\}$.

The decision boundary based on this classifier is still $\langle w,x \rangle = 0$ and corresponds to a probability of 50%.
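
As a quick numerical check of the two statements above, here is a minimal NumPy sketch. The weight vector `w` and the three sample points are made up purely for illustration; they are not taken from any dataset used in this project.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and points, chosen only for illustration.
w = np.array([2.0, -1.0])
X = np.array([
    [1.0, 2.0],   # <w, x> =  0 -> exactly on the decision boundary
    [3.0, 1.0],   # <w, x> =  5 -> positive side
    [0.0, 4.0],   # <w, x> = -4 -> negative side
])

z = X @ w                          # inner products <w, x> for every row
probs = sigmoid(z)                 # probability of the positive class
labels = np.where(z >= 0, 1, -1)   # threshold at <w, x> = 0, i.e. probability 0.5

print(probs)
print(labels)
```

The first point has $\langle w,x \rangle = 0$, so its probability is exactly 0.5; the other two fall on either side of the boundary.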

Moving on to the loss for logistic regression: in the binary case, the log loss is as follows:
$$
\ell(h_{\mathbf{w}}, (\mathbf{x}, y)) = \log(1 + \exp(-y \langle \mathbf{w}, \mathbf{x} \rangle))
$$

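This loss function penalizes the degree of wrongness: for a misclassified point, the loss grows with $|\langle \mathbf{w}, \mathbf{x} \rangle|$, so confident mistakes cost more than mistakes near the decision boundary.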
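
To make the behaviour of the loss concrete, the sketch below evaluates it for one point under both labels. The weights and the point are hypothetical, and `np.logaddexp` is used only as a numerically stable way to compute $\log(1 + e^{t})$.

```python
import numpy as np

def log_loss(w, x, y):
    """Binary log loss for a label y in {-1, +1}: log(1 + exp(-y * <w, x>))."""
    margin = y * np.dot(w, x)
    # logaddexp(0, -margin) equals log(exp(0) + exp(-margin)) without overflow.
    return np.logaddexp(0.0, -margin)

w = np.array([1.0, 1.0])   # hypothetical weights
x = np.array([2.0, 1.0])   # <w, x> = 3, so the model leans strongly positive

loss_correct = log_loss(w, x, +1)             # true label agrees with the model
loss_wrong = log_loss(w, x, -1)               # true label disagrees with the model
loss_boundary = log_loss(w, np.zeros(2), +1)  # point on the boundary: log(2)

print(loss_correct, loss_boundary, loss_wrong)
```

The three values are ordered from smallest to largest: a confident correct prediction costs almost nothing, a point on the boundary costs $\log 2$, and a confident wrong prediction costs the most.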

Log loss is also convex, which brings us to the optimization of the loss function. The optimization follows empirical risk minimization (ERM), which selects the hypothesis within the hypothesis class that produces the lowest average loss on the training dataset. Since log loss is convex, any local minimum is also a global minimum, so gradient-based optimization cannot get trapped in a suboptimal local minimum.
To find this minimum, the gradient of the average loss $L(w)$ is computed with respect to each weight $w_j$. Gradient descent then iteratively minimizes the loss function by adjusting the model parameters in the direction that reduces the loss. Specifically, the weights are updated as follows:

$$
w_j = w_j - \alpha \frac{\partial L}{\partial w_j}
$$

$\alpha$ in this equation is the learning rate, which controls the size of the steps taken during gradient descent to update model parameters. It is important to select this parameter carefully because an overly large $\alpha$ can cause the model to overshoot the optimal values or diverge, while an overly small $\alpha$ can result in slow convergence, so training may hit the epoch limit before reaching the minimum.
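
The update rule can be sketched in a few lines of NumPy for labels in $\{1, -1\}$. Several things here are assumptions made for illustration: the toy data, the value `alpha=0.1`, the use of full-batch gradient descent instead of SGD to keep the example short, and the stopping rule (stop when the average loss changes by less than `tol` between epochs), which is only one possible reading of a convergence threshold.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def average_log_loss(w, X, y):
    """Mean of log(1 + exp(-y <w, x>)) over the dataset, y in {-1, +1}."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def gradient(w, X, y):
    # d/dw log(1 + exp(-y <w, x>)) = -y * sigmoid(-y <w, x>) * x
    coeff = -y * sigmoid(-y * (X @ w))
    return X.T @ coeff / len(y)

def gradient_descent(X, y, alpha=0.1, max_epochs=1000, tol=1e-4):
    w = np.zeros(X.shape[1])
    prev_loss = average_log_loss(w, X, y)
    for _ in range(max_epochs):
        w = w - alpha * gradient(w, X, y)      # w_j = w_j - alpha * dL/dw_j
        loss = average_log_loss(w, X, y)
        if abs(prev_loss - loss) < tol:        # assumed convergence check
            break
        prev_loss = loss
    return w

# Toy, linearly separable data (illustrative only).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = gradient_descent(X, y)
print(w, average_log_loss(w, X, y))
```

Starting from $w = 0$ the average loss is $\log 2$, and each step moves the weights in the direction that lowers it.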

## Binary Logistic Regression Pseudocode

1. **Initialize Parameters**:
   - Initialize weights `W` as a vector of small random values or zeros.
   - Initialize bias `b` as a small random value or zero.

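2. **Sigmoid, Training Loop, and Prediction** (labels are encoded as 0/1 here rather than $\pm 1$, so the log loss takes its cross-entropy form):
   ```python
   import numpy as np

   def sigmoid(z):
       return 1 / (1 + np.exp(-z))

   def train(X, y, learning_rate, epochs=1000):
       m, d = X.shape                     # m samples, d features
       W, b = np.zeros(d), 0.0            # step 1: initialize parameters
       for _ in range(epochs):
           y_hat = sigmoid(X @ W + b)     # predicted probabilities

           # Compute loss (useful for monitoring convergence)
           loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

           # Compute gradients
           dW = X.T @ (y_hat - y) / m     # gradient of loss with respect to W
           db = np.mean(y_hat - y)        # gradient of loss with respect to b

           # Update parameters
           W = W - learning_rate * dW
           b = b - learning_rate * db
       return W, b

   def predict(X_new, W, b):
       # Label 1 when the predicted probability is at least 0.5, else 0
       return (sigmoid(X_new @ W + b) >= 0.5).astype(int)
   ```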



## All-Pairs

1. **Training**

   In the all-pairs approach for multi-class classification, one binary logistic regression model is trained for each pair of classes. Here’s the math involved:

   The total number of unique pairs of classes for $K$ classes is:

   $$\text{Number of pairs} = \binom{K}{2} = \frac{K(K - 1)}{2}$$

   For each pair of classes $(C_i, C_j)$, we train a binary logistic regression model to distinguish between data points in $C_i$ and $C_j$.

2. **Probability Estimation**
   The probability that a data point $x$ belongs to class $C_i$ rather than $C_j$ is given by:

   $$P(y = 1 | x; \theta^{(i, j)}) = \frac{1}{1 + e^{-\theta^{(i, j) \top} x}}$$

   where $\theta^{(i, j)}$ is the parameter vector specific to the classifier for classes $C_i$ and $C_j$.


3. **Pseudocode**

   #### Train method:
   **Input**:  
      - Training data  $X$ (features) and $Y$ (labels)  
      - Binary logistic regression model

   **Steps**:
      1. **Validate input data** to ensure that $X$ and $Y$ are correctly formatted and consistent.
      2. **Create all possible class pairs** $(C_i, C_j)$ where $C_i < C_j$. This results in the set of class pairs for multi-class classification.
      3. **For each pair of classes** $(C_i, C_j)$:
         - Create a **mask** that selects the samples where $Y$ is either $C_i$ or $C_j$.
         - **Filter** the training data $X$ and labels $Y$ using the mask to get the sub-dataset $S_X$ and corresponding labels $S_Y$.
         - **Convert** $S_Y$ to binary values, where data points from $C_i$ are labeled 1 and those from $C_j$ are labeled 0.
         - **Initialize and train** a binary logistic regression classifier using $S_X$ and the binary $S_Y$ labels.
         - **Store** the trained classifier together with its pair $(C_i, C_j)$ for later use in the prediction phase.

   #### Predict method:
   **Input**:  
      - Data $X$ (features) to classify  
      - Trained binary classifiers, one per class pair

   **Steps**:
      1. **Validate input data** to ensure that $X$ is correctly formatted and has the same number of features as the training data. 
      2. **Initialize a vote array** with zeros to store votes for each class for each sample in $X$. 
      3. **For each pair of classes** $(C_i, C_j)$ and their respective classifiers: 
         - Use the classifier to **predict binary labels** (either 1 or 0).
         - If the predicted label is 1, **add a vote to $C_i$**. 
         - If the predicted label is 0, **add a vote to $C_j$**.
      4. **For each sample in $X$**, assign the **class label** corresponding to the class with the highest vote count. 
      5. **Return** the predicted class label for each sample.
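
The train and predict steps above can be sketched as follows. This is an illustrative version under stated assumptions, not the project's actual model code: `BinaryLogReg` is a hypothetical stand-in for the binary logistic regression model (with an assumed learning rate of 0.1 and plain batch gradient descent), and the three-cluster dataset is synthetic.

```python
import numpy as np
from itertools import combinations

class BinaryLogReg:
    """Hypothetical stand-in for the binary model; expects 0/1 labels."""

    def __init__(self, learning_rate=0.1, epochs=1000):
        self.learning_rate = learning_rate
        self.epochs = epochs

    def fit(self, X, y):
        self.W, self.b = np.zeros(X.shape[1]), 0.0
        for _ in range(self.epochs):
            y_hat = 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))
            error = y_hat - y
            self.W -= self.learning_rate * (X.T @ error) / len(y)
            self.b -= self.learning_rate * error.mean()
        return self

    def predict(self, X):
        # A probability of at least 0.5 is the same as a non-negative score.
        return (X @ self.W + self.b >= 0).astype(int)

class AllPairs:
    def fit(self, X, Y):
        self.classes = np.unique(Y)
        self.models = {}
        for ci, cj in combinations(self.classes, 2):    # all pairs with ci < cj
            mask = (Y == ci) | (Y == cj)                # keep only these two classes
            S_X = X[mask]
            S_Y = (Y[mask] == ci).astype(float)         # C_i -> 1, C_j -> 0
            self.models[(ci, cj)] = BinaryLogReg().fit(S_X, S_Y)
        return self

    def predict(self, X):
        column = {c: k for k, c in enumerate(self.classes)}
        votes = np.zeros((X.shape[0], len(self.classes)), dtype=int)
        for (ci, cj), model in self.models.items():
            pred = model.predict(X)
            votes[pred == 1, column[ci]] += 1           # label 1 is a vote for C_i
            votes[pred == 0, column[cj]] += 1           # label 0 is a vote for C_j
        return self.classes[np.argmax(votes, axis=1)]   # class with the most votes

# Synthetic data: three well-separated clusters (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 5.0], [5.0, -5.0], [-5.0, -5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
Y = np.repeat([0, 1, 2], 20)

clf = AllPairs().fit(X, Y)
accuracy = np.mean(clf.predict(X) == Y)
print(len(clf.models), accuracy)
```

With three classes, three pairwise classifiers are trained. Note that `np.argmax` breaks ties in favour of the lowest class index; how ties should be handled is a design choice the steps above leave open.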

## One-vs-all Algorithm


One-vs-all is an approach to multiclass classification that converts a multiclass problem into multiple binary classification problems. The process involves first creating a separate binary classifier for each class in the dataset. Each classifier treats the class as the "positive" class and all the other classes as the "negative" class. For a given data point, we run each of these binary classification algorithms and output the class that corresponds to the highest predicted probability. 


1. **Training**

   In the one-vs-all approach for multi-class classification, multiple binary logistic regression models are trained, one for each class. Here’s the math involved:

   For $K$ classes, we train $K$ binary classifiers. Each classifier $i$ is trained to distinguish between the data points in class $C_i$ and all other classes.

   The binary labels for the classifier corresponding to class $C_i$ are:
    $$y = 
            \begin{cases}
                1 & \text{if the data point } x \text{ belongs to class } C_{i} \\
                0 & \text{otherwise}
            \end{cases}$$
   
2. **Probability Estimation**

   The probability that a data point $x$ belongs to class $C_i$ is given by:

   $$P(y = 1 | x; \theta^{(i)}) = \frac{1}{1 + e^{-\theta^{(i) \top} x}}$$

   where $\theta^{(i)}$ is the parameter vector specific to the classifier for class $C_i$.


3. **Pseudocode**

   #### Train method:
   **Input**:  
      - Training data  $X$ (features) and $Y$ (labels)  
      - Binary logistic regression model

   **Steps**:

      1. **Initialize an empty list**, *models*, to store each class's logistic regression model.
      2. **For each class** $i$ from $1$ to $K$:
         - Create a new **binary label vector** $y_i$ where $y_i[j] = 1$ if $Y[j] = i$ (current class) and $y_i[j] = 0$ otherwise (all other classes).
         - **Initialize and train** a logistic regression model $model_i$ using $X$ and $y_i$.
         - **Store** $model_i$ in the list *models*.

     **Output**:
     A list of $K$ trained binary classifiers.

   #### Predict method:
   **Input**:  
      - Test data $X$ (features)  
      - Trained binary classifiers

   **Steps**:
   1. **Validate input data** to ensure that $X$ is correctly formatted and has the same number of features as the training data. 
   2. **Initialize a probability** array with shape $(N, K)$, where $N$ is the number of test samples and $K$ is the number of classes.
   3. **For each class** $C_i$ and its respective classifier:
         - Use the classifier to **predict probabilities** for all samples in $X$
         - Store the probabilities in the $i$-th column of the probability array.
   4. **For each sample in $X$**, assign the **class label** corresponding to the class with the highest probability (for example, with `np.argmax` over each row of the probability array).
   5. **Return** the predicted class label for each sample.
      
      **Output**:
      An array of predicted class labels for each sample.
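
A minimal sketch of the one-vs-all train and predict steps is shown below. As before, `BinaryLogReg` is a hypothetical stand-in for the binary logistic regression model (assumed learning rate 0.1, batch gradient descent), and the dataset is synthetic; none of this is taken from the project's model code.

```python
import numpy as np

class BinaryLogReg:
    """Hypothetical stand-in for the binary model; expects 0/1 labels."""

    def __init__(self, learning_rate=0.1, epochs=1000):
        self.learning_rate = learning_rate
        self.epochs = epochs

    def predict_proba(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        self.W, self.b = np.zeros(X.shape[1]), 0.0
        for _ in range(self.epochs):
            error = self.predict_proba(X) - y           # y_hat - y
            self.W -= self.learning_rate * (X.T @ error) / len(y)
            self.b -= self.learning_rate * error.mean()
        return self

class OneVsAll:
    def fit(self, X, Y):
        self.classes = np.unique(Y)
        self.models = []
        for c in self.classes:
            y_c = (Y == c).astype(float)                # current class -> 1, rest -> 0
            self.models.append(BinaryLogReg().fit(X, y_c))
        return self

    def predict(self, X):
        # Probability array of shape (N, K): one column per class.
        probs = np.column_stack([m.predict_proba(X) for m in self.models])
        return self.classes[np.argmax(probs, axis=1)]   # most probable class per row

# Synthetic data: three well-separated clusters (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 5.0], [5.0, -5.0], [-5.0, -5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
Y = np.repeat([0, 1, 2], 20)

clf = OneVsAll().fit(X, Y)
accuracy = np.mean(clf.predict(X) == Y)
print(len(clf.models), accuracy)
```

Each of the $K$ models is trained on the full dataset, which is the main structural difference from All-Pairs, where every model sees only two classes.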

## Advantages and Disadvantages

### One-vs-All

**Advantages**:
1. **Simplicity**: The One-vs-All method is conceptually straightforward and easy to implement. It decomposes the multiclass problem into multiple independent binary classification tasks, which can be handled by standard binary logistic regression classifiers.
2. **Efficiency**: For a dataset with $K$ classes, One-vs-All requires training only $K$ classifiers, so the number of models grows linearly with the number of classes.

**Disadvantages**:
1. **Class Imbalance**: Each classifier treats one class as positive and all remaining classes as negative, so the negative class usually dominates even when the original classes are balanced. This can bias classifiers toward the dominant class and lead to suboptimal performance.
2. **Overlapping Classes**: Each classifier is trained without reference to the others. When classes have significant overlap, this can cause poor performance, as the decision boundaries learned by each classifier may not capture the relationships between classes.
3. **Suboptimal Decision Boundaries**: Since the classifiers are trained independently, they may not effectively handle interactions between classes, potentially leading to decision boundaries that are not optimal for multiclass tasks.

### All-Pairs

**Advantages**:
1. **Higher Accuracy**: The All-Pairs method can perform better than One-vs-All in some settings, as it explicitly models pairwise relationships between classes. This lets it capture more complex decision boundaries, which can lead to improved prediction accuracy.
2. **Captures Class Interactions**: Since All-Pairs trains on class pairs, it can capture inter-class relationships more effectively, which is useful when classes have overlapping features.
3. **Improved Generalization**: Because each classifier only has to separate two classes, the method can generalize better in situations where a class is not easily separable from all other classes at once.

**Disadvantages**:
1. **Computational Complexity**: All-Pairs requires training $\binom{K}{2} = \frac{K(K - 1)}{2}$ classifiers for $K$ classes, which grows quadratically with the number of classes. This can be impractical for problems with a large number of classes.
2. **Prediction Complexity**: During prediction, the outputs of all $\binom{K}{2}$ classifiers must be combined by voting, which can lead to slower prediction times compared to One-vs-All.
3. **Scalability Issues**: While All-Pairs can offer better performance, its computational cost increases as the number of classes increases, making it less scalable for problems with many classes.
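
To put the difference in classifier counts into numbers, the short sketch below compares the two approaches for a few class counts. The helper name `num_classifiers` and the chosen class counts are illustrative only.

```python
from math import comb

def num_classifiers(k):
    """Return (One-vs-All count, All-Pairs count) for k classes."""
    return k, comb(k, 2)

for k in (3, 10, 100):
    one_vs_all, all_pairs = num_classifiers(k)
    print(f"{k:>3} classes: One-vs-All trains {one_vs_all}, All-Pairs trains {all_pairs}")
```

For 100 classes this is 100 classifiers versus 4950. The count alone does not settle training cost, though: each All-Pairs classifier is fit on only the samples from its two classes, while each One-vs-All classifier is fit on the full dataset.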

## Citations and References ##
- Pawara, P., Okafor, E., Groefsema, M., He, S., Schomaker, L.R.B. and Wiering, M.A. (2020). One-vs-One classification for deep neural networks. Pattern Recognition, 108, p.107528. doi:https://doi.org/10.1016/j.patcog.2020.107528.
- Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, [online] 5, pp.101–141. Available at: https://www.jmlr.org/papers/volume5/rifkin04a/rifkin04a.pdf.


# Part II: Model