---
---

## Definition and Purpose
**Front:** What is the primary goal of logistic regression? <br/>
**Back:** To model the probability that a given input vector $\mathbf{x}$ belongs to a particular class (e.g., Class 1). It's used for binary classification.

## Model Output
**Front:** What is the direct output of the logistic regression model for an input $\mathbf{x}$? <br/>
**Back:** A probability value between 0 and 1: $p(C_1|\mathbf{x})$. The probability for the other class is $p(C_2|\mathbf{x}) = 1 - p(C_1|\mathbf{x})$.

## Linear Predictor
**Front:** What is the "linear predictor" or "activation" $a$ in logistic regression? <br/>
**Back:** A linear combination of the inputs (and a bias): $a = \mathbf{w}^T \mathbf{x} + w_0$, where $\mathbf{w}$ are the weights and $w_0$ is the bias term.

## Logistic Sigmoid Function
**Front:** What function maps the linear predictor $a$ to a valid probability, and what is its formula? <br/>
**Back:** The logistic sigmoid function: $\sigma(a) = \frac{1}{1 + \exp(-a)}$.

## Sigmoid Properties
**Front:** What are two key mathematical properties of the logistic sigmoid function? <br/>
**Back:** 1. It is bounded: $\sigma(a) \in (0, 1)$. 2. Its derivative is simple: $\frac{d\sigma}{da} = \sigma(a)(1 - \sigma(a))$.
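
Both properties can be checked numerically. The sketch below assumes NumPy and uses illustrative helper names (`sigmoid`, `sigmoid_grad`) that are not part of the card; it compares the analytic derivative against a central finite difference.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    """Analytic derivative: sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

a = np.linspace(-5.0, 5.0, 11)

# Property 1: every output lies strictly between 0 and 1.
in_unit_interval = np.all((sigmoid(a) > 0.0) & (sigmoid(a) < 1.0))

# Property 2: the analytic derivative matches a central finite difference.
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
max_gap = np.max(np.abs(numeric - sigmoid_grad(a)))
```

`max_gap` should be tiny (limited only by floating-point error), and the derivative peaks at $a = 0$, where it equals $0.25$.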

## Discriminative Model
**Front:** What type of model is logistic regression: generative or discriminative? What does it model directly? <br/>
**Back:** It is a discriminative model. It directly models the posterior class probability $p(C_k|\mathbf{x})$, not the class-conditional densities $p(\mathbf{x}|C_k)$.

## Bernoulli Likelihood
**Front:** For a single data point with target $t \in \{0,1\}$, what is the probability (likelihood) under the model? <br/>
**Back:** $p(t | \mathbf{x}, \mathbf{w}) = y^t (1-y)^{1-t}$, where $y = \sigma(\mathbf{w}^T\mathbf{x})$ is the model's predicted probability for class 1. Here the bias $w_0$ is absorbed into $\mathbf{w}$ by appending a constant 1 to $\mathbf{x}$.

## Cross-Entropy Error Derivation
**Front:** How is the error function $E(\mathbf{w})$ derived from the model likelihood for a full dataset? <br/>
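**Back:** By taking the negative logarithm of the likelihood. For $N$ independent points the likelihood is $p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^N y_n^{t_n}(1-y_n)^{1-t_n}$, so $E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^N [t_n \ln y_n + (1-t_n)\ln(1-y_n)]$.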
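
As a small worked check, the sketch below (assuming NumPy, with made-up predictions `y` and targets `t`) computes the Bernoulli likelihood as a product and confirms that its negative log equals the cross-entropy sum.

```python
import numpy as np

def cross_entropy(y, t):
    """Binary cross-entropy summed over all data points."""
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Illustrative predicted probabilities and binary targets (not from real data).
y = np.array([0.9, 0.2, 0.7])
t = np.array([1, 0, 1])

# Likelihood of the dataset: product of per-point Bernoulli probabilities.
likelihood = np.prod(y**t * (1 - y) ** (1 - t))   # 0.9 * 0.8 * 0.7

E = cross_entropy(y, t)
neg_log_likelihood = -np.log(likelihood)          # equals E up to rounding
```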

## Error Function Name
**Front:** What is the common name for the error function $E(\mathbf{w})$ used in logistic regression? <br/>
**Back:** The binary cross-entropy error function.

## Gradient of the Error
**Front:** What is the gradient of the cross-entropy error $E(\mathbf{w})$ with respect to the weights $\mathbf{w}$? <br/>
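**Back:** $\nabla E(\mathbf{w}) = \sum_{n=1}^N (y_n - t_n) \mathbf{x}_n$. Each term is the prediction error $(y_n - t_n)$ times the input $\mathbf{x}_n$, the same form as the gradient of the sum-of-squares error in linear regression.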
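
The gradient formula can be verified against finite differences. This is a minimal sketch assuming NumPy, random synthetic data, and a bias handled by a constant column of ones; the function names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def error(w, X, t):
    """Cross-entropy error E(w)."""
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, X, t):
    """Analytic gradient: sum_n (y_n - t_n) x_n, written as X^T (y - t)."""
    return X.T @ (sigmoid(X @ w) - t)

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # first column = bias input
t = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3)

# Central finite difference along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (error(w + eps * e, X, t) - error(w - eps * e, X, t)) / (2 * eps)
    for e in np.eye(3)
])
grad_gap = np.max(np.abs(numeric - gradient(w, X, t)))
```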

## No Closed-Form Solution
**Front:** Can we find a closed-form solution for the optimal weights $\mathbf{w}$ by setting $\nabla E(\mathbf{w}) = 0$? Why or why not? <br/>
**Back:** No. Because $y_n = \sigma(\mathbf{w}^T\mathbf{x}_n)$ is a nonlinear function of $\mathbf{w}$, the gradient equation is nonlinear and has no closed-form solution.

## Optimization Method
**Front:** What general class of algorithms is required to find the optimal weights in logistic regression? <br/>
**Back:** Iterative numerical optimization algorithms, such as gradient descent or Newton's method.
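
A minimal sketch of the simplest such algorithm, batch gradient descent, is shown below. It assumes NumPy, synthetic two-class Gaussian data, and arbitrarily chosen values for the learning rate and iteration count; none of these come from the card.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def error(w, X, t):
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def fit_gd(X, t, lr=0.1, n_iter=2000):
    """Batch gradient descent on the cross-entropy error."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - t) / len(t)  # gradient averaged over points
        w -= lr * grad
    return w

# Synthetic data: two overlapping Gaussian blobs in 2D.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=-1.0, size=(50, 2))
x1 = rng.normal(loc=+1.0, size=(50, 2))
X = np.hstack([np.ones((100, 1)), np.vstack([x0, x1])])  # bias column first
t = np.concatenate([np.zeros(50), np.ones(50)])

w_gd = fit_gd(X, t)
error_start = error(np.zeros(3), X, t)   # N * ln 2 at w = 0
error_end = error(w_gd, X, t)
accuracy = np.mean((sigmoid(X @ w_gd) > 0.5) == t)
```

Dividing the gradient by $N$ only rescales the step size; it does not change the minimizer.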

## Newton's Method / IRLS
**Front:** What is a second-order optimization method used for logistic regression, and what is its common name in this context? <br/>
**Back:** The Newton-Raphson method, which uses the Hessian (second derivatives) of $E(\mathbf{w})$ as well as the gradient. When applied to logistic regression, it's called Iterative Re-weighted Least Squares (IRLS).

## IRLS Intuition
**Front:** Briefly, why is the Newton-Raphson update for logistic regression called "Iterative Re-weighted Least Squares"? <br/>
**Back:** Each update step solves a weighted least squares problem. The weight on data point $n$ is $y_n(1 - y_n)$, which depends on the current probability estimates, so these weights are re-computed each iteration.
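
One IRLS step can be sketched as a weighted least squares solve. The code below assumes NumPy and a tiny hand-made, non-separable 1D dataset; `irls_step` and the "working targets" `z` are illustrative names for the standard Newton-Raphson update written in least squares form.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_step(w, X, t):
    """One Newton-Raphson update expressed as weighted least squares."""
    y = sigmoid(X @ w)
    r = y * (1.0 - y)              # per-point weights y_n (1 - y_n)
    z = X @ w - (y - t) / r        # working targets for this iteration
    A = X.T @ (r[:, None] * X)     # X^T R X, the Hessian of E(w)
    b = X.T @ (r * z)              # X^T R z
    return np.linalg.solve(A, b)   # minimizer of the weighted least squares problem

# Tiny non-separable dataset: bias column plus one feature.
X = np.array([[1.0, x] for x in (-2, -1, 0, 1, 2, 3)])
t = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
for _ in range(10):
    w = irls_step(w, X, t)         # r and z are recomputed inside every step

grad_norm = np.linalg.norm(X.T @ (sigmoid(X @ w) - t))
```

On this dataset the gradient norm falls to roughly machine precision within a handful of steps, which illustrates why a second-order method is attractive when the Hessian is cheap to form.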

## Multiclass Extension
**Front:** What is the standard extension of logistic regression to $K > 2$ classes called? <br/>
**Back:** Multinomial logistic regression, or softmax regression.

## Softmax Function
**Front:** What is the softmax function that generalizes the sigmoid for $K$ classes? <br/>
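**Back:** For a vector of linear activations $\mathbf{a} = (a_1, \dots, a_K)$, the softmax output for class $k$ is: $p(C_k|\mathbf{x}) = \text{softmax}(\mathbf{a})_k = \frac{\exp(a_k)}{\sum_{j=1}^K \exp(a_j)}$. For $K = 2$ it reduces to the sigmoid $\sigma(a_1 - a_2)$.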
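
A short sketch of the softmax, assuming NumPy and arbitrary example activations. Subtracting the maximum activation before exponentiating is a common numerical safeguard that leaves the result unchanged; the last lines check the two-class case against the sigmoid.

```python
import numpy as np

def softmax(a):
    """Softmax over a 1D vector of activations."""
    shifted = a - np.max(a)        # shift for numerical stability; result is unchanged
    e = np.exp(shifted)
    return e / np.sum(e)

p = softmax(np.array([2.0, 1.0, 0.1]))   # probabilities over K = 3 classes

# With K = 2, softmax reduces to the sigmoid of the activation difference.
a1, a2 = 1.5, -0.3
two_class = softmax(np.array([a1, a2]))[0]
sig = 1.0 / (1.0 + np.exp(-(a1 - a2)))
```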

## Linear Decision Boundary
**Front:** What is the shape of the decision boundary (where $p(C_1|\mathbf{x}) = p(C_2|\mathbf{x})$) in basic logistic regression? <br/>
**Back:** It is linear, defined by the equation $\mathbf{w}^T\mathbf{x} + w_0 = 0$.

## Feature Transforms
**Front:** How can logistic regression model nonlinear decision boundaries? <br/>
**Back:** By using fixed, nonlinear basis functions $\phi(\mathbf{x})$ to transform the input. The model becomes $p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^T\phi(\mathbf{x}))$.
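
As an illustration, the sketch below uses a hypothetical basis $\phi(\mathbf{x}) = (1, x_1^2, x_2^2)$ with hand-picked weights (both are assumptions chosen for the example, not fitted values). The boundary $\mathbf{w}^T\phi(\mathbf{x}) = 0$ is then the unit circle, which is nonlinear in $\mathbf{x}$ but still linear in $\phi(\mathbf{x})$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def phi(x):
    """Hypothetical basis: constant term plus squared coordinates."""
    x = np.asarray(x, dtype=float)
    return np.array([1.0, x[0] ** 2, x[1] ** 2])

# Hand-picked weights: a = x1^2 + x2^2 - 1, zero exactly on the unit circle.
w = np.array([-1.0, 1.0, 1.0])

def predict(x):
    """p(C1 | x) = sigma(w^T phi(x))."""
    return sigmoid(w @ phi(x))

p_inside = predict([0.0, 0.0])     # a = -1, so below 0.5
p_on_circle = predict([1.0, 0.0])  # a = 0, so exactly 0.5
p_outside = predict([2.0, 0.0])    # a = 3, so above 0.5
```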

## Pitfall: Separable Data
**Front:** What pathological problem occurs if the training data is perfectly linearly separable? <br/>
**Back:** Maximum likelihood estimation will drive the magnitude of $\mathbf{w}$ to infinity. The sigmoid becomes infinitely steep, outputting overconfident probabilities of 0 or 1.
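
The effect is easy to reproduce. The sketch below assumes NumPy, a tiny separable 1D dataset, a single weight with no bias, and an arbitrary learning rate; it runs unregularized gradient descent for increasing numbers of iterations and records the weight each time.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Perfectly separable: negative inputs are class 0, positive inputs are class 1.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])

def weight_after(n_iter, lr=0.5):
    """Run unregularized gradient descent and return the single weight."""
    w = np.zeros(1)
    for _ in range(n_iter):
        w -= lr * X.T @ (sigmoid(X @ w) - t)
    return w[0]

# The weight keeps growing with more iterations instead of converging.
weights = [weight_after(n) for n in (10, 100, 1000, 10000)]
```

On this dataset every gradient step increases the weight, so the sequence never settles; the growth is slow, which is why the problem can go unnoticed when training is stopped early.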

## Pitfall: Overconfidence Consequence
**Front:** Why is the "infinite weights" solution for separable data undesirable? <br/>
**Back:** It represents severe overfitting. The model loses its probabilistic interpretation (becomes a step function) and will generalize poorly, especially near the decision boundary.

## Special Consideration: Regularization
**Front:** What is the standard technique to prevent the infinite weight problem and control model complexity? <br/>
**Back:** Add a regularization term to the error function, e.g., $E_{\text{reg}}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2$. This is equivalent to placing a zero-mean isotropic Gaussian prior on the weights and finding the MAP estimate.
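
A minimal sketch of the regularized fit on separable data, assuming NumPy, a single weight with no bias, and arbitrary choices of $\lambda = 0.1$ and learning rate. The penalty adds $\lambda \mathbf{w}$ to the gradient, and the weight now converges to a finite value.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Perfectly separable 1D data.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])

def fit(lam, n_iter, lr=0.1):
    """Gradient descent on E(w) + (lam / 2) * ||w||^2."""
    w = np.zeros(1)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - t) + lam * w  # penalty contributes lam * w
        w -= lr * grad
    return w[0]

w_1k = fit(lam=0.1, n_iter=1000)
w_10k = fit(lam=0.1, n_iter=10000)   # essentially identical to w_1k: training has converged
```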

## Regularization Effect
**Front:** How does regularization (e.g., $\frac{\lambda}{2}\|\mathbf{w}\|^2$) affect the learned weights and probabilities? <br/>
**Back:** It shrinks the weights toward zero, preventing extreme values. This makes the predicted probability curves smoother and less overconfident.
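
To see the shrinkage, the sketch below (assuming NumPy, a tiny separable 1D dataset, one weight with no bias, and arbitrarily chosen $\lambda$ values) fits the model for several regularization strengths and records the weight and the predicted probability at $x = 1$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])

def fit(lam, n_iter=20000, lr=0.1):
    """Gradient descent on the L2-regularized cross-entropy error."""
    w = np.zeros(1)
    for _ in range(n_iter):
        w -= lr * (X.T @ (sigmoid(X @ w) - t) + lam * w)
    return w[0]

lams = [0.01, 0.1, 1.0]
ws = [fit(lam) for lam in lams]          # weight shrinks as lam grows
probs = [sigmoid(w * 1.0) for w in ws]   # p(C1 | x = 1) moves toward 0.5
```

Larger $\lambda$ gives a smaller weight and a flatter sigmoid, so the prediction at $x = 1$ stays on the correct side of 0.5 but is less extreme.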