## Softmax Regression

#### Introduction to Regression

Regression is the go-to approach for *quantitative* predictions such as prices, win counts, or hospital stay durations. Even within regression, however, the right model depends on the nature of the target. For example:

- House prices are always positive, and changes are often *relative* to a base price, which suggests regressing on the logarithm of the price.
- Hospital stay durations are *discrete nonnegative* values, so least squares is not the best choice. Modeling such time-to-event quantities is the subject of a specialized area known as *survival modeling*.
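
To make the point about relative changes concrete, here is a minimal sketch with made-up prices (the numbers and the helper name `log_error` are illustrative assumptions, not from the text). It shows that an error measured in log space depends only on the ratio between prediction and truth.

```python
import math

def log_error(predicted: float, actual: float) -> float:
    """Difference of log prices; depends only on the ratio predicted / actual."""
    return math.log(predicted) - math.log(actual)

# A 10% overestimate on a cheap house and on an expensive house (hypothetical prices).
cheap = log_error(110_000, 100_000)
expensive = log_error(1_100_000, 1_000_000)

# The absolute errors differ by a factor of ten, the log errors do not.
abs_cheap = 110_000 - 100_000
abs_expensive = 1_100_000 - 1_000_000
```

Both log errors equal $\log 1.1$, whereas a squared error on raw prices would penalize the expensive house a hundred times more.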

#### Key Takeaway
- Estimation is more than just minimizing squared errors.
- Supervised learning is broader than just regression.

#### Transition to Classification
We now shift from *how much* questions to *which category* ones, focusing on **classification** problems. Examples include:

- Categorizing emails as spam or not.
- Predicting customer subscription sign-ups.
- Identifying animals in images.
- Recommending movies.
- Predicting a reader's interest in book sections.

#### Classification Types
Classification in machine learning can mean:
1. **Hard assignments**: Directly categorizing examples into classes.
2. **Soft assignments**: Assessing the probability that each category applies.
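
The two notions are closely related: a hard assignment can be read off a soft one by picking the most probable class. The following sketch assumes a hypothetical spam classifier that has already produced probabilities; the labels, numbers, and the helper `hard_assignment` are illustrative only.

```python
# Hypothetical soft assignment for a single email.
labels = ["spam", "not spam"]
soft = [0.8, 0.2]  # probabilities, one per class, summing to 1

def hard_assignment(probs, class_labels):
    """Return the label with the highest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_labels[best]

prediction = hard_assignment(soft, labels)
```

The soft assignment carries strictly more information: it also says how confident the model is in its choice.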

#### Classification Problem Basics
- **Context**: Simple image classification with $2\times2$ grayscale images.
- **Features**: Each image has four pixel values, represented as $x_1, x_2, x_3, x_4$.
- **Categories**: Images are categorized into "cat", "chicken", and "dog".

#### Label Representation
- **Natural Impulse**: Use integers $y \in \{1, 2, 3\}$ for $\{\textrm{dog}, \textrm{cat}, \textrm{chicken}\}$.
- **Ordinal Regression**: Useful if categories have a natural ordering, like ages or stages. See [Ordinal Regression](https://en.wikipedia.org/wiki/Ordinal_regression).
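- **One-Hot Encoding**: For classes without a natural order. The label is a vector with one component per category, set to 1 for the true category and 0 elsewhere. Example: "cat" $(1, 0, 0)$, "chicken" $(0, 1, 0)$, "dog" $(0, 0, 1)$.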

One-hot encoding is a common technique for categorical data in classification problems without natural orderings.
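
A minimal sketch of this encoding for the three classes above. The function name `one_hot` and the class order in `CLASSES` are choices made for this example.

```python
CLASSES = ["cat", "chicken", "dog"]  # fixed class order, matching the example above

def one_hot(label, classes=CLASSES):
    """Return a list with 1 at the position of `label` and 0 elsewhere."""
    vec = [0] * len(classes)
    vec[classes.index(label)] = 1  # raises ValueError for an unknown label
    return vec

cat_vec = one_hot("cat")
chicken_vec = one_hot("chicken")
dog_vec = one_hot("dog")
```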


#### Multi-Output Linear Models for Classification
- **Objective**: Estimate conditional probabilities for each class.
- **Model Design**: Use multiple affine functions, one for each class output.
- **Parameters**: 
  - Weights: Represented as $w$, 12 scalars for 4 features and 3 outputs.
  - Biases: Represented as $b$, 3 scalars.
- **Formulation**:
  $$o_1 = x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,$$
  $$o_2 = x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,$$
  $$o_3 = x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.$$

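- **Neural Network Diagram**: Softmax regression can be drawn as a single-layer neural network. Because each output $o_1, o_2, o_3$ depends on every input $x_1, x_2, x_3, x_4$, the output layer is a fully connected layer.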
- **Vector Notation**: $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$, with a $3 \times 4$ weight matrix and a bias vector in $\mathbb{R}^3$.
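
The vector form can be sketched in NumPy as follows. The random initialization, the seed, and the pixel values in `x` are assumptions made for illustration; only the shapes come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
num_features, num_classes = 4, 3

# W has one row per class and one column per feature (3 x 4); b has one entry per class.
W = rng.normal(scale=0.01, size=(num_classes, num_features))
b = np.zeros(num_classes)

x = np.array([0.1, 0.5, 0.3, 0.9])  # hypothetical pixel values of one 2x2 image
o = W @ x + b                       # three outputs, one per class
```

Row $i$ of `W` holds the weights $w_{i1}, \ldots, w_{i4}$ of the scalar equation for $o_i$.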


#### Softmax Operation
- **Problem**: Treating the outputs $\mathbf{o}$ as probabilities and fitting them directly to the labels $\mathbf{y}$ is not ideal: nothing guarantees that the $o_i$ are nonnegative or that they sum to 1.
- **Solution**: Use the softmax function to "squish" outputs into a probabilistic format:
  $$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \textrm{where}\quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}.$$
- **History**: The idea of softmax dates back to Gibbs (1902), who adapted ideas from physics. Even earlier, Boltzmann used the same form in statistical physics to model a distribution over energy states.
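
A minimal NumPy sketch of the softmax formula. Subtracting the maximum before exponentiating is an addition not discussed above: it leaves the result unchanged, since softmax is invariant to adding a constant to all outputs, and it avoids overflow for large $o_i$.

```python
import numpy as np

def softmax(o):
    """Map a vector of outputs to nonnegative values that sum to 1."""
    o = np.asarray(o, dtype=float)
    shifted = o - o.max()   # stability trick (an addition to the formula above)
    exp = np.exp(shifted)
    return exp / exp.sum()

y_hat = softmax([1.0, 2.0, 3.0])
```

For these inputs the result is roughly $(0.09, 0.24, 0.67)$: the ordering of the outputs is preserved, so the largest $o_i$ still identifies the most likely class.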

#### Vectorization for Efficiency
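- **Minibatch Processing**: Process a minibatch of $n$ examples at once so that the computation becomes a small number of matrix operations.
- **Mathematical Formulation**: With features $\mathbf{X} \in \mathbb{R}^{n \times d}$, weights $\mathbf{W} \in \mathbb{R}^{d \times q}$, and bias $\mathbf{b} \in \mathbb{R}^{1 \times q}$ for $d$ inputs and $q$ classes,
  $$ \mathbf{O} = \mathbf{X} \mathbf{W} + \mathbf{b}, \quad \hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{O}). $$
  Each example is a row of $\mathbf{X}$, so $\mathbf{W}$ here is the transpose of the $3 \times 4$ layout used for a single example. The bias is broadcast across rows, and the softmax is applied to each row of $\mathbf{O}$ separately.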
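
The minibatch computation might look like this in NumPy. The batch size, the random data, and the helper name `softmax_rows` are assumptions for illustration; here `W` is stored with shape (features, classes) so that each row of `X` is one example.

```python
import numpy as np

def softmax_rows(O):
    """Apply softmax independently to each row of a 2-D array."""
    shifted = O - O.max(axis=1, keepdims=True)  # per-row shift for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

n, d, q = 5, 4, 3                  # batch size, features, classes
rng = np.random.default_rng(0)     # arbitrary seed
X = rng.random((n, d))             # hypothetical minibatch, one image per row
W = rng.normal(scale=0.01, size=(d, q))
b = np.zeros((1, q))               # broadcast across the n rows

O = X @ W + b                      # shape (n, q)
Y_hat = softmax_rows(O)            # each row is a probability distribution
```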

#### Loss Function: Cross-Entropy
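- **Objective**: Find parameters for which the predicted probabilities $\hat{\mathbf{y}}$ assign high probability to the observed labels given the features $\mathbf{x}$.
- **Method**: Maximum likelihood estimation. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which therefore serves as the loss function.
- **Cross-Entropy Loss**: For a label $\mathbf{y}$ over $q$ classes,
  $$ l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j. $$
  When $\mathbf{y}$ is one-hot, only the term for the true class is nonzero, so the loss is the negative log of the probability predicted for that class.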
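
A small sketch of this loss for a single example. The label and prediction are made-up values; skipping the terms with $y_j = 0$ is an implementation choice that avoids evaluating $0 \cdot \log 0$.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Cross-entropy between a label vector y and predicted probabilities y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    mask = y > 0  # terms with y_j = 0 contribute nothing to the sum
    return float(-np.sum(y[mask] * np.log(y_hat[mask])))

y = [0, 1, 0]              # one-hot label: the true class is the second one
y_hat = [0.1, 0.7, 0.2]    # hypothetical softmax output
loss = cross_entropy(y, y_hat)
```

Here the loss equals $-\log 0.7 \approx 0.357$, and it would shrink toward 0 as the probability assigned to the true class approaches 1.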

#### Information Theory Context
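- **Entropy**: Quantifies the information content of data drawn from a distribution $P$, $H[P] = \sum_j - P(j) \log P(j)$.
- **Surprisal**: Measures how unexpected an event $j$ is, $-\log P(j)$. The less probable the event, the higher its surprisal.
- **Cross-Entropy**: The expected surprisal of an observer with subjective probabilities $Q$ upon seeing data actually generated according to $P$.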
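
These three quantities can be sketched in a few lines of Python. The distributions `p` (actual) and `q` (subjective) are made-up examples, and logarithms are natural, so values are in nats.

```python
import math

def surprisal(prob):
    """Unexpectedness of an event that has probability `prob`."""
    return -math.log(prob)

def entropy(p):
    """Expected surprisal under p when the observer also believes p."""
    return sum(pj * surprisal(pj) for pj in p if pj > 0)

def cross_entropy(p, q):
    """Expected surprisal of an observer believing q when data follow p."""
    return sum(pj * surprisal(qj) for pj, qj in zip(p, q) if pj > 0)

p = [0.5, 0.25, 0.25]    # hypothetical true distribution
q = [1 / 3, 1 / 3, 1 / 3]  # hypothetical subjective distribution

h_p = entropy(p)
h_pq = cross_entropy(p, q)
```

The cross-entropy is never smaller than the entropy, and the two coincide when the subjective probabilities match the true ones.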
