## Multiclass Logistic Regression

#### Why We Need Multiclass Models

Logistic regression naturally predicts two classes (binary classification).
But real-world problems often have more than two classes — for example:
- Handwritten digits (0–9)
- Sentiment categories (positive/neutral/negative)
- Flower species (setosa, versicolor, virginica)

So, to handle multiple classes, logistic regression must be extended. This video explains three ways to do that.

#### 1. One-vs-Rest (OvR) – “One Against All”

This is a simple way to turn a binary classifier into a multiclass classifier:

✔ You train K models if there are K classes.   
✔ Each model learns to distinguish one class from all others.   
✔ At prediction time, you run all K models on the same input and choose the class that returns the highest probability.   

For example, with 4 classes (K = 4), you train:
- Class 1 vs (2,3,4)
- Class 2 vs (1,3,4)
- Class 3 vs (1,2,4)
- Class 4 vs (1,2,3)

Each classifier predicts a probability that the input belongs to its class. The model picks the class with the largest probability.
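
The procedure above can be sketched by hand. This is a minimal illustration, not a reference implementation: the Iris dataset, the `max_iter=1000` setting, and the variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)  # K = 3 classes in this dataset

# Train one binary model per class: "this class" (1) vs "everything else" (0).
binary_models = []
for k in classes:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, (y == k).astype(int))
    binary_models.append(clf)

# Column k holds P(class k | x) according to the k-th binary model.
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in binary_models])

# Predict the class whose binary model is most confident.
ovr_predictions = classes[np.argmax(scores, axis=1)]
print("training accuracy:", np.mean(ovr_predictions == y))
```

Note that the K columns of `scores` come from separately trained models, so a row does not have to sum to 1.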

**Why One-vs-Rest Works Well**
- Easy to train and understand.
- The number of models grows linearly with K.
- Works with most binary classifiers (as long as they produce probabilities).

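**Limitations**
- If the binary classifier outputs only a hard label and no probability or score (like some versions of the perceptron), there is no basis for picking the “most likely class”.
- Each model is trained independently, so their probabilities are not directly comparable, and the boundaries can look unintuitive where classes overlap.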

**How Boundaries Look**

Each OvR classifier is a binary linear model, so the combined decision regions are bounded by straight lines (hyperplanes in higher dimensions), just like in binary logistic regression. The boundary between two regions lies where the two corresponding classifiers output the same probability.

#### 2. One-vs-One (OvO) – Pairwise Classifiers

Instead of training K models, OvO trains one model for every pair of classes:

✔ For K classes, number of binary models = K × (K−1) / 2.   
✔ Example: For K = 4, you train 6 classifiers:    

- 1 vs 2, 1 vs 3, 1 vs 4
- 2 vs 3, 2 vs 4
- 3 vs 4

Each classifier decides between two specific classes.
At prediction time, you run all pairwise models and use a majority vote — whichever class gets the most votes wins. 
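
The pairwise training and voting described above might look like the following sketch. The synthetic four-class data from `make_blobs`, the names `pair_models` and `votes`, and the tie-breaking behavior of `np.argmax` are assumptions made for illustration only.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic 2-D data with K = 4 classes (an assumption for illustration).
X, y = make_blobs(n_samples=200, centers=4, random_state=0)
classes = np.unique(y)
K = len(classes)

# One binary model per unordered pair of classes: K * (K - 1) / 2 in total.
pair_models = {}
for a, b in combinations(classes, 2):
    mask = (y == a) | (y == b)  # keep only samples from the two classes
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[mask], y[mask])
    pair_models[(a, b)] = clf

# Every pairwise model casts one vote per sample.
votes = np.zeros((len(X), K), dtype=int)
for clf in pair_models.values():
    winners = clf.predict(X)  # labels are 0..K-1, so they index columns
    votes[np.arange(len(X)), winners] += 1

# Majority vote; np.argmax breaks ties in favour of the lowest class index.
ovo_predictions = classes[np.argmax(votes, axis=1)]
print(len(pair_models), "pairwise models")
```

scikit-learn also ships `sklearn.multiclass.OneVsOneClassifier`, which wraps any binary estimator in this scheme.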

**Benefits**
- Can be used with any binary classifier (even if it doesn’t give probabilities).
- Because each model sees only two classes, training is often faster per model.

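**Drawbacks**
- The number of models grows quadratically with K, which becomes costly when there are many classes.
- Votes can tie, so a tie-breaking rule is needed (for example, comparing the classifiers' confidence scores); without one, the method “can’t decide”.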

#### 3. Multinomial Logistic Regression (“Softmax”)

Instead of breaking the problem into many binary tasks, we can model all classes at once.

This uses a function called softmax that generalizes the logistic (sigmoid) function to multiple outputs. The model learns weights for each class so that:

✔ For a given input, it outputs K probabilities (one per class).   
✔ These probabilities add up to 1 (because softmax enforces this).   
✔ The predicted class is the one with the highest probability.  

This approach is mathematically more elegant because training minimizes a single cross-entropy loss, which directly maximizes the probability assigned to the correct class relative to all other classes at once.
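
To make the softmax step concrete, here is a small NumPy sketch. The function name `softmax` and the example scores are hypothetical; subtracting the maximum is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Map a vector of K real-valued scores to K probabilities."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical linear scores (w_k . x + b_k) for K = 3 classes.
class_scores = np.array([2.0, 1.0, -1.0])
probabilities = softmax(class_scores)

print(probabilities)             # three positive numbers
print(probabilities.sum())       # 1.0 up to floating-point rounding
print(np.argmax(probabilities))  # index of the predicted class
```

With only two classes and scores `[z, 0]`, the first output equals the sigmoid of `z`, which is the sense in which softmax generalizes the logistic function.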

#### Intuition: OvR vs Multinomial

| Method      | Treats Classes Independently? | Optimizes All Classes Together? |
| ----------- | ----------------------------- | ------------------------------- |
| One-vs-Rest | ✔                             | ✘                               |
| One-vs-One  | ✘                             | ✘                               |
| Multinomial | ✘                             | ✔                               |

Multinomial logistic regression often produces **better-calibrated probabilities** because it considers classes together in one optimization problem, while OvR trains them independently.

#### Python Implementations


1. **One-vs-Rest Multiclass**

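scikit-learn's `LogisticRegression` uses OvR when you pass `multi_class='ovr'`, or when you choose the `liblinear` solver on a multiclass problem. Note that the `multi_class` parameter is deprecated since scikit-learn 1.5; in newer releases, `OneVsRestClassifier(LogisticRegression())` is the recommended way to get OvR.

```python
from sklearn.linear_model import LogisticRegression

# Assumes X_train, y_train and X_test already exist.
model_ovr = LogisticRegression(multi_class='ovr', solver='liblinear')
model_ovr.fit(X_train, y_train)
predictions_ovr = model_ovr.predict(X_test)
```

* The `liblinear` solver supports only OvR for multiclass problems; it cannot fit a multinomial model.
* It is a good choice for small datasets.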
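
Because the `multi_class` argument is deprecated in recent scikit-learn releases (since version 1.5, as far as the official documentation indicates), a version-independent way to get OvR is the `OneVsRestClassifier` wrapper. The sketch below assumes the Iris dataset and a 75/25 split purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Wrap a binary-capable estimator; the wrapper fits one copy per class.
ovr_wrapper = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_wrapper.fit(X_train, y_train)

print(len(ovr_wrapper.estimators_))       # one fitted model per class
print(ovr_wrapper.score(X_test, y_test))  # accuracy on the held-out split
```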

2. **Multinomial Logistic Regression**

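For true multiclass softmax logic:

```python
from sklearn.linear_model import LogisticRegression
# Assumes X_train, y_train and X_test already exist.
model_multi = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model_multi.fit(X_train, y_train)
predictions_multi = model_multi.predict(X_test)
```

* `multi_class='multinomial'` tells scikit-learn to fit a single softmax model. The parameter is deprecated since scikit-learn 1.5; in recent releases, leaving it out gives the same multinomial behavior for three or more classes with the solvers below.
* The `lbfgs`, `newton-cg`, `sag`, and `saga` solvers support the multinomial loss; `liblinear` does not.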
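
One way to check that a fitted model really is a softmax model is to recompute its probabilities from the linear scores. This sketch assumes a recent scikit-learn release, where `LogisticRegression` with the `lbfgs` solver fits a multinomial model by default for three or more classes; the Iris dataset and variable names are illustrative.

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# No multi_class argument: assumed to default to multinomial here.
softmax_model = LogisticRegression(solver="lbfgs", max_iter=1000)
softmax_model.fit(X, y)

# decision_function returns one linear score per class: X @ W.T + b.
linear_scores = softmax_model.decision_function(X)
manual_probabilities = softmax(linear_scores, axis=1)

print(softmax_model.coef_.shape)  # (n_classes, n_features)
print(np.allclose(manual_probabilities, softmax_model.predict_proba(X)))
```

The single weight matrix `coef_` holds one row per class, in contrast to OvR, where each class has its own separately trained model.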

3. **Prediction and Probabilities**

For both strategies, you can also ask the model for probabilities:

```python
# Works the same way for model_ovr and model_multi.
probabilities = model_multi.predict_proba(X_test)
```

* This returns an array with one row per sample and one column per class.
* With OvR: each binary model produces its own probability, and the values are then normalized so that each row sums to 1.
* With Multinomial: softmax directly gives class probabilities that add to 1.
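
The OvR transformation can be reproduced by hand. The sketch below assumes that `OneVsRestClassifier` normalizes the per-class probabilities by dividing each row by its sum, which matches the behavior documented for single-label problems; the dataset and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Raw output of each binary model: P(class k vs rest), one column per class.
raw = np.column_stack([est.predict_proba(X)[:, 1] for est in ovr.estimators_])
print(raw.sum(axis=1)[:3])  # rows generally do NOT sum to 1

# Dividing each row by its sum turns the scores into a distribution.
normalized = raw / raw.sum(axis=1, keepdims=True)
print(np.allclose(normalized, ovr.predict_proba(X)))
```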

#### Examples of Decision Regions

Visual examples show that OvR and multinomial logistic regression produce different decision boundaries (even if both are linear), because:

* OvR trains each class independently.
* Multinomial considers all classes in one shot. 
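
Without plotting, the difference between the two sets of decision regions can be measured by classifying a grid of points with both models. This is only a sketch: the synthetic `make_blobs` data, the grid resolution, and the one-unit margin are assumptions, and the size of the disagreement depends on the data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 2-D data with three classes (an illustrative assumption).
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
multinomial = LogisticRegression(max_iter=1000).fit(X, y)

# Build a regular grid covering the data with a one-unit margin.
xs = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200)
ys = np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200)
gx, gy = np.meshgrid(xs, ys)
grid = np.column_stack([gx.ravel(), gy.ravel()])

region_ovr = ovr.predict(grid)
region_multinomial = multinomial.predict(grid)

# Fraction of grid points that land in different decision regions.
disagreement = np.mean(region_ovr != region_multinomial)
print(f"grid points assigned differently: {disagreement:.1%}")
```

Reshaping `region_ovr` and `region_multinomial` to `gx.shape` gives arrays that can be passed to a contour plot to draw the regions.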

#### Summary of Pros and Cons

**One-vs-Rest (OvR)**

✔ Easy to implement  
✔ Works with many binary classifiers  
✔ Fewer models (linear growth)  
✘ Each classifier faces class imbalance, since one class is set against all the others combined  
✘ Does not model the relationships between classes jointly  
 
**One-vs-One (OvO)**

✔ Works with any binary classifier  
✔ Smaller training sets per classifier  
✘ Many binary models (quadratic growth)  
✘ Voting ties possible  


**Multinomial**

✔ One unified model  
✔ Class probabilities optimized jointly  
✔ Good calibration of probabilities  
✘ Needs solvers that support multinomial (lbfgs, saga, etc.)  

**Recap**

To handle more than two classes in logistic regression:

1. **One-vs-Rest (OvR)**: train K binary models; the simplest option.
2. **One-vs-One (OvO)**: train one model per pair of classes; more models, each on a smaller dataset.
3. **Multinomial**: train a single softmax model over all classes; the most integrated approach.
