## Sum of Squared Errors (SSE) Problem
**Front:** In linear regression, the cost function is Sum of Squared Errors (SSE). Why would SSE be a poor choice for logistic regression? <br/>
**Back:** The squared-error cost $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$ applied to $h_\theta(x)=\sigma(\theta^T x)$ is **non-convex** in $\theta$. It can have local minima and flat plateaus, so gradient descent may stall before reaching the global optimum.
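
The non-convexity can be checked numerically. The sketch below assumes a made-up one-parameter case (a single sample with $x=1$, $y=1$) and estimates the curvature of the squared-error cost by finite differences; a convex function would never show negative curvature.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sse_single(theta, x=1.0, y=1.0):
    """Squared-error cost 0.5 * (sigmoid(theta * x) - y)^2 for one sample."""
    return 0.5 * (sigmoid(theta * x) - y) ** 2

def second_difference(f, theta, eps=1e-3):
    """Central finite-difference estimate of f''(theta)."""
    return (f(theta + eps) - 2.0 * f(theta) + f(theta - eps)) / eps**2

curv_left = second_difference(sse_single, -2.0)   # sigmoid(theta) < 1/3 here
curv_right = second_difference(sse_single, 1.0)   # sigmoid(theta) > 1/3 here
print(curv_left, curv_right)
```

For this toy case the curvature is negative at $\theta=-2$ and positive at $\theta=1$, so the cost is neither convex nor concave.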

## Convexity Requirement
**Front:** What key property should a cost function have for reliable optimization via gradient descent? <br/>
**Back:** **Convexity**. Every local minimum of a convex function is also a global minimum, so gradient descent with a suitable step size converges to an optimal solution regardless of initialization.

## Negative Log-Likelihood (NLL) Convexity
**Front:** Why is the Negative Log-Likelihood (NLL) $J(\theta) = -\frac{1}{m} l(\theta)$ a convex cost function for logistic regression? <br/>
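**Back:** For logistic regression, the Hessian (matrix of second derivatives) of $J(\theta)$ is $H = \frac{1}{m}\sum_i h_\theta(x^{(i)})(1-h_\theta(x^{(i)}))\,x^{(i)}x^{(i)T}$. Every weight $h(1-h)$ is non-negative, so $H$ is **positive semi-definite**, which guarantees convexity.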
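
A quick numerical sanity check of this property, as a sketch on synthetic data (the seed, shapes, and the helper name `nll_hessian` are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_hessian(theta, X):
    """Hessian of the mean NLL: (1/m) * X^T diag(h * (1 - h)) X."""
    h = sigmoid(X @ theta)
    weights = h * (1.0 - h)              # each weight is non-negative
    return (X.T * weights) @ X / X.shape[0]

rng = np.random.default_rng(0)           # arbitrary seed, synthetic data
X = rng.normal(size=(50, 3))
theta = rng.normal(size=3)

eigenvalues = np.linalg.eigvalsh(nll_hessian(theta, X))
print(eigenvalues)                       # all should be >= 0 up to round-off
```

Note that the labels $y$ do not appear in the Hessian at all, so the check holds for any labeling of the data.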

## Probabilistic Foundation
**Front:** What is the statistical justification for using Negative Log-Likelihood as the cost function? <br/>
**Back:** Maximizing the likelihood $L(\theta)$ finds the parameters that make the observed data most probable. Because the logarithm is monotonic, minimizing the NLL has the same solution, so it is exactly **Maximum Likelihood Estimation (MLE)**.

## Bernoulli Distribution Link
**Front:** How does the form of the NLL cost function naturally arise from the model's probabilistic assumptions? <br/>
**Back:** Assuming $y|x \sim \text{Bernoulli}(h_\theta(x))$ and writing $h = h_\theta(x)$, the log-likelihood is $\sum [y\log h + (1-y)\log(1-h)]$. Negating and averaging over the $m$ samples gives the **cross-entropy** between true labels and predictions.
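
Spelled out step by step, and assuming the $m$ samples are independent (an assumption the card leaves implicit), the derivation runs roughly as follows:

$$
\begin{aligned}
P(y \mid x;\theta) &= h_\theta(x)^{y}\,\bigl(1-h_\theta(x)\bigr)^{1-y}, \\
L(\theta) &= \prod_{i=1}^{m} P\bigl(y^{(i)} \mid x^{(i)};\theta\bigr), \\
l(\theta) &= \log L(\theta) = \sum_{i=1}^{m}\Bigl[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr], \\
J(\theta) &= -\frac{1}{m}\,l(\theta).
\end{aligned}
$$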

## Gradient Behavior
**Front:** How does the gradient of the NLL cost function behave with respect to errors, compared to SSE? <br/>
**Back:** NLL gradient $\nabla_\theta J = \frac{1}{m}\sum (h_\theta(x^{(i)}) - y^{(i)})x^{(i)}$ is **linear in the error**. SSE's gradient would have an extra factor $\sigma'(\theta^Tx) = h_\theta(x)(1-h_\theta(x))$, which shrinks toward 0 and causes **vanishing gradients** when $h_\theta(x)$ is near 0 or 1.
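
To make the difference concrete, the sketch below compares the two per-sample gradients with respect to $z=\theta^Tx$ for a confidently wrong prediction. The values $z=-8$, $y=1$ are made up for illustration; multiplying either result by $x$ gives the gradient with respect to $\theta$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_nll_wrt_z(z, y):
    """d/dz of the per-sample NLL: just the error h - y."""
    return sigmoid(z) - y

def grad_sse_wrt_z(z, y):
    """d/dz of 0.5 * (h - y)^2: the error times sigma'(z) = h * (1 - h)."""
    h = sigmoid(z)
    return (h - y) * h * (1.0 - h)

z, y = -8.0, 1.0                 # model is confident in class 0, truth is 1
g_nll = grad_nll_wrt_z(z, y)
g_sse = grad_sse_wrt_z(z, y)
print(g_nll, g_sse)
```

The NLL gradient stays close to $-1$, while the SSE gradient is several orders of magnitude smaller, even though the prediction is badly wrong.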

## Penalty for Wrong Confidence
**Front:** How does NLL penalize confidently wrong predictions differently from SSE? <br/>
**Back:** NLL imposes an **unbounded cost** on confidently wrong predictions: with true $y=0$, the loss $-\log(1-h_\theta(x))$ grows to infinity as $h_\theta(x)\to1$. SSE's quadratic penalty $(h_\theta(x)-y)^2$ is at most 1 per sample, however confident the wrong prediction is.
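
A small numeric illustration, using made-up predicted probabilities for a sample whose true label is 0:

```python
import numpy as np

# True label is y = 0; h is the predicted probability of class 1.
h = np.array([0.6, 0.9, 0.99, 0.999999])
y = 0.0

nll = -(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))  # unbounded as h -> 1
sse = (h - y) ** 2                                     # never exceeds 1

for h_i, nll_i, sse_i in zip(h, nll, sse):
    print(f"h={h_i:.6f}  NLL={nll_i:7.3f}  SSE={sse_i:.3f}")
```

The NLL column keeps growing as the prediction becomes more confident, while the SSE column saturates just below 1.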

## Information Theory Interpretation
**Front:** What is the information-theoretic interpretation of the NLL cost function? <br/>
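**Back:** It is the **cross-entropy** between the true distribution (all probability mass on the observed label $y$) and the predicted distribution ($h_\theta(x)$). Because the label distribution has zero entropy, this equals the KL divergence: the extra "surprise" incurred by encoding the labels with the model's probabilities instead of the true ones.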

## SSE vs NLL Visual Comparison
**Front:** If you plot SSE vs NLL as a function of $h_\theta(x)$ for a single sample with y=1, what key difference appears? <br/>
**Back:** SSE: $(h-1)^2$ is a parabola that stays bounded, reaching only 1 at $h=0$. NLL: $-\log(h)$ grows without bound as $h\to0$ (confidently wrong) and falls to 0 as $h\to1$.

## Practical Optimization
**Front:** What practical advantage does the convexity of NLL provide during training? <br/>
**Back:** Standard optimizers (gradient descent, Newton's method) cannot get trapped in a non-global local minimum. With suitable step sizes they converge to the global minimum from any starting point, which makes training more reliable and reproducible.
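
As a rough illustration of that reproducibility, the sketch below runs plain batch gradient descent on the mean NLL from two very different starting points. The toy dataset, learning rate, step count, and the helper name `fit` are all assumptions chosen for the example; the classes overlap so that a finite minimizer exists.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, theta0, lr=0.5, steps=5000):
    """Plain batch gradient descent on the mean NLL."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        theta -= lr * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

# Toy data (made up): bias column plus one feature, classes overlap.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta_a = fit(X, y, theta0=[5.0, -5.0])
theta_b = fit(X, y, theta0=[-5.0, 5.0])
print(theta_a, theta_b)
```

Both runs should end at essentially the same parameters, which is what convexity predicts for this setup.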