
## Linear Combination Definition
**Front:** What is $\theta^T X$ in the context of logistic regression? <br/>
**Back:** It is the **linear combination** (or dot product) between the parameter vector $\theta$ and the input feature vector $X$. Mathematically: $\theta^T X = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n$, where $\theta_0$ is the bias/intercept term (often with $x_0 = 1$ included in $X$).
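
A minimal sketch of this computation in plain Python, assuming the convention $x_0 = 1$ so that $\theta_0$ acts as the intercept. The function name `linear_combination` and the sample numbers are illustrative only.

```python
def linear_combination(theta, x):
    """Return theta^T x, prepending x_0 = 1 so that theta[0] is the bias."""
    x_with_bias = [1.0] + list(x)
    if len(theta) != len(x_with_bias):
        raise ValueError("theta must have exactly one more entry than x")
    return sum(t * xi for t, xi in zip(theta, x_with_bias))


# theta_0 = -1, theta_1 = 2, theta_2 = 0.5 applied to x = (3, 4)
z = linear_combination([-1.0, 2.0, 0.5], [3.0, 4.0])
print(z)  # -1 + 2*3 + 0.5*4 = 7.0
```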


## Geometric Interpretation
**Front:** Geometrically, what does $\theta^T X = 0$ represent? <br/>
**Back:** It defines a **hyperplane** (e.g., a line in 2D) in the feature space. This hyperplane is the **decision boundary**: on it, the model assigns probability exactly 0.5 to each class.


## Hypothesis Function
**Front:** In binary logistic regression, what function replaces the linear hypothesis $h_\theta(x) = \theta^T x$? <br/>
**Back:** The linear term is passed through an activation function $g$: $h_\theta(x) = g(\theta^T x)$.

* For logistic regression, $g$ is the sigmoid: $h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$.

$0 < h_\theta(x) < 1$
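
The sigmoid can be sketched in a few lines of standard-library Python. The two-branch form below is one common way to avoid overflow in `math.exp` for large negative inputs; it is an implementation choice made here, not something the card prescribes.

```python
import math


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^{-z}), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the algebraically equal form e^z / (1 + e^z),
    # so the exponent passed to math.exp is never a large positive number.
    ez = math.exp(z)
    return ez / (1.0 + ez)


print(sigmoid(0.0))     # 0.5
print(sigmoid(-800.0))  # 0.0 in floating point, with no OverflowError
```

Mathematically the output stays strictly between 0 and 1; the exact 0.0 above is floating-point rounding.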

## Role in Hypothesis Function
**Front:** What is the role of $\theta^T X$ in the logistic regression hypothesis function $h_\theta(x)$? <br/>
**Back:** $\theta^T X$ serves as the input to the sigmoid function: $h_\theta(x) = \sigma(\theta^T X) = \frac{1}{1 + e^{-\theta^T X}}$. It is the **linear predictor** or **logit**.

## Probabilistic Interpretation (Class 1)
**Front:** How is the hypothesis $h_\theta(x)$ interpreted probabilistically for binary classification? <br/>
**Back:** $h_\theta(x)$ represents the estimated probability that $y=1$ given $x$: 

$p(y=1|x;\theta) = h_\theta(x)$.

## Probabilistic Interpretation (Class 0)
**Front:** What is the probability that $y=0$ given $x$ in binary logistic regression? <br/>
**Back:** 

$p(y=0|x;\theta) = 1 - p(y=1|x;\theta) = 1 - h_\theta(x)$.

## Compact Probability Expression
**Front:** How can we write the probability $p(y|x;\theta)$ for both classes ($y=0$ or $y=1$) in a single formula? <br/>
**Back:** 

$p(y|x;\theta) = (h_\theta(x))^{y} \cdot (1 - h_\theta(x))^{1-y}$. 
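
A quick check that the single formula reduces to $h_\theta(x)$ for $y=1$ and to $1 - h_\theta(x)$ for $y=0$. The name `bernoulli_prob` and the value 0.75 are illustrative assumptions.

```python
def bernoulli_prob(h, y):
    """p(y | x; theta) = h^y * (1 - h)^(1 - y) for y in {0, 1}."""
    return (h ** y) * ((1.0 - h) ** (1 - y))


h = 0.75  # assumed model output h_theta(x) for some input x
print(bernoulli_prob(h, 1))  # 0.75, the factor (1 - h)^0 equals 1
print(bernoulli_prob(h, 0))  # 0.25, the factor h^0 equals 1
```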

## Decision Boundary Definition
**Front:** What is the decision boundary in logistic regression, and where is it located? <br/>
**Back:** It is the set of points $x$ where $p(y=1|x;\theta) = p(y=0|x;\theta) = 0.5$. 

This occurs when $\theta^T x = 0$.

## Region of Positive Prediction
**Front:** For what values of $\theta^T x$ will the model predict $y=1$? <br/>
**Back:** When $\theta^T x > 0$, which means $h_\theta(x) = \sigma(\theta^T x) > 0.5$. The model predicts class 1 in this region.

## Region of Negative Prediction
**Front:** For what values of $\theta^T x$ will the model predict $y=0$? <br/>
**Back:** When $\theta^T x < 0$, which means $h_\theta(x) = \sigma(\theta^T x) < 0.5$. The model predicts class 0 in this region.
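
The two prediction regions can be sketched as a thresholding rule on the sign of $\theta^T x$, with no need to evaluate the sigmoid. Assigning the tie $\theta^T x = 0$ to class 1 is an assumption made here; the example parameters are hypothetical.

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0, else 0 (x excludes the constant x_0 = 1)."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return 1 if z >= 0 else 0


theta = [-3.0, 1.0, 1.0]  # decision boundary: x_1 + x_2 = 3
print(predict(theta, [2.0, 2.0]))  # 1, since -3 + 2 + 2 > 0
print(predict(theta, [1.0, 1.0]))  # 0, since -3 + 1 + 1 < 0
```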

## Likelihood Function
**Front:** For a training set of $m$ i.i.d. examples, what is the likelihood function $L(\theta)$? <br/>
**Back:** The product of individual probabilities: $L(\theta) = \prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta) = \prod_{i=1}^m (h_\theta(x^{(i)}))^{y^{(i)}} (1-h_\theta(x^{(i)}))^{1-y^{(i)}}$.

* This is a maximization problem: we seek the $\theta$ that maximizes $L(\theta)$.

## Log-Likelihood
**Front:** Why do we take the log of the likelihood, and what is $l(\theta) = \log L(\theta)$? <br/>
**Back:** Log transforms products into sums, simplifying calculus. $l(\theta) = \sum_{i=1}^m [y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)}))]$.

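* This is a maximization problem: because $\log$ is strictly increasing, the $\theta$ that maximizes $l(\theta)$ also maximizes $L(\theta)$.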
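
Besides simplifying the calculus, the log also helps numerically. The sketch below assumes a list `probs` holding the per-example probabilities $p(y^{(i)}|x^{(i)};\theta)$ (here simply 0.5 for each of 2000 hypothetical examples) and compares the raw product with the sum of logs.

```python
import math


def likelihood(probs):
    """Product of per-example probabilities."""
    out = 1.0
    for p in probs:
        out *= p
    return out


def log_likelihood(probs):
    """Sum of per-example log-probabilities."""
    return sum(math.log(p) for p in probs)


probs = [0.5] * 2000
print(likelihood(probs))      # 0.0: the product underflows in double precision
print(log_likelihood(probs))  # finite, equal to 2000 * log(0.5) up to rounding
```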

## Cost Function (Negative Log-Likelihood)
**Front:** What is the standard cost function $J(\theta)$ for logistic regression, and how is it derived? <br/>
**Back:** It is the negative log-likelihood, usually averaged over the $m$ training examples. <br/>
Sum form: <br/>
$J(\theta) = -l(\theta) = - \sum_{i=1}^m [y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)}))]$

Average form: <br/>
$J(\theta) = -\frac{1}{m} l(\theta)$

* Minimizing $J(\theta)$ is equivalent to maximizing the likelihood $L(\theta)$.
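
A minimal sketch of the average form of $J(\theta)$, assuming each row of `X` already contains the constant $x_0 = 1$. The function names and the two-example dataset are illustrative.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def cost(theta, X, y):
    """Average negative log-likelihood J(theta) = -(1/m) * l(theta)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        # math.log raises ValueError if h rounds to exactly 0 or 1.
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return -total / len(y)


X = [[1.0, 0.0], [1.0, 1.0]]  # first column is x_0 = 1
y = [0, 1]
print(cost([0.0, 0.0], X, y))   # log(2), about 0.693: every h equals 0.5
print(cost([-1.0, 2.0], X, y))  # smaller, since these parameters fit the labels better
```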

## Sum of Squared Errors (SSE) Problem
**Front:** In linear regression, the cost function is Sum of Squared Errors (SSE). Why would SSE be a poor choice for logistic regression? <br/>
**Back:** SSE $J(\theta) = \frac{1}{2}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$ applied to $h_\theta(x)=\sigma(\theta^T x)$ yields a **non-convex** function that can have multiple local minima, so gradient descent is not guaranteed to find the global optimum. The negative log-likelihood, by contrast, is convex in $\theta$.

## Gradient Descent Update Rule
**Front:** What is the parameter update rule for gradient descent using the cost $J(\theta) = - l(\theta)$? <br/>
**Back:** 
General form: $ \begin{cases} \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}\\ \frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \end{cases}$

$\Rightarrow$
$ \begin{cases} \Delta \theta_0 = - \alpha[\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})] & j=0\\ \Delta \theta_j = - \alpha[\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}] & j \geq 1\end{cases}$
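
The update rule can be sketched as batch gradient descent in plain Python. This assumes every row of `X` includes $x_0 = 1$, so the $j=0$ case needs no special handling. The learning rate, step count, and toy dataset are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def gradient(theta, X, y):
    """dJ/dtheta_j = sum_i (h(x_i) - y_i) * x_ij for J = -l(theta)."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
        for j, v in enumerate(xi):
            grad[j] += err * v
    return grad


def gradient_descent(X, y, alpha=0.1, steps=1000):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j, starting from zeros."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(theta, X, y)
        theta = [t - alpha * gj for t, gj in zip(theta, g)]
    return theta


# Toy 1D data that is not linearly separable (first column is x_0 = 1).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 1, 0, 1]
theta = gradient_descent(X, y)
print(theta)                  # learned [theta_0, theta_1]
print(gradient(theta, X, y))  # both entries close to zero at convergence
```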

## Special Consideration: Regularization in Cost
**Front:** How is L2 regularization typically added to the logistic regression cost function $J(\theta)$? <br/>
**Back:** $J_{reg}(\theta) = -l(\theta) + \frac{\lambda}{2} \sum_{j=1}^n \theta_j^2$. 

The bias term $\theta_0$ is often excluded from regularization.

## Gradient Descent Update Rule with L2 regularization
**Front:** What is the parameter update rule for gradient descent using L2 regularization? <br/>
**Back:** 
General form: 
$ \begin{cases} J(\theta) = - l(\theta) + \frac{\lambda}{2} \sum_{j=1}^n \theta_j^2 
\\ \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} 
\\ \frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \lambda \theta_j \quad (j \geq 1)\end{cases}$

$\Rightarrow$
$ \begin{cases} \Delta \theta_0 = - \alpha[\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})] & j=0\\ \Delta \theta_j = - \alpha[\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \lambda \theta_j] & j \geq 1\end{cases}$
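
A small sketch of the regularized gradient, assuming the residuals $h_\theta(x^{(i)}) - y^{(i)}$ have already been computed and passed in as `errors`. The function name and the numbers are hypothetical. Note that the penalty term is skipped for $j = 0$.

```python
def regularized_gradient(errors, X, theta, lam):
    """Gradient of -l(theta) + (lam / 2) * sum_{j>=1} theta_j^2.

    errors[i] = h(x_i) - y_i; every row of X includes x_0 = 1.
    """
    grad = [sum(e * xi[j] for e, xi in zip(errors, X)) for j in range(len(theta))]
    for j in range(1, len(theta)):  # j = 0 is skipped: the bias is not penalized
        grad[j] += lam * theta[j]
    return grad


X = [[1.0, 2.0], [1.0, -1.0]]
errors = [0.5, -0.25]
print(regularized_gradient(errors, X, [3.0, 4.0], lam=0.5))  # [0.25, 3.25]
```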

## Derivative of the Sigmoid
**Front:** What is the derivative of the sigmoid function $\sigma(z)$, and why is this useful? <br/>
**Back:** $\frac{d}{dz}\sigma(z) = \sigma(z)(1 - \sigma(z))$. This simplifies the gradient calculation for the log-likelihood.
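
The identity can be checked numerically with a central finite difference. The test point 0.7 and step size `eps` are arbitrary assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Closed-form derivative sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(abs(numeric - sigmoid_grad(z)) < 1e-8)  # True
print(sigmoid_grad(0.0))                      # 0.25, the maximum slope
```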

## Multi-class: One-vs-Rest (OvR)
**Front:** How does the One-vs-Rest (OvR) strategy work for multi-class classification with logistic regression? <br/>
**Back:** For $K$ classes, train $K$ separate binary classifiers. Classifier $k$ learns to distinguish class $k$ (positive) from all other classes (negative). Prediction is the class with the highest $h_\theta^{(k)}(x)$.
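
A sketch of the OvR prediction step, assuming the $K$ parameter vectors are already trained and stored in a list `thetas` (hypothetical values below). Because the sigmoid is increasing, comparing the raw scores $\theta^{(k)T} x$ picks the same class as comparing $h_\theta^{(k)}(x)$.

```python
def ovr_predict(thetas, x):
    """Return the index k whose classifier gives the highest score.

    thetas[k] is the parameter vector of classifier k; x includes x_0 = 1.
    """
    scores = [sum(t * v for t, v in zip(theta_k, x)) for theta_k in thetas]
    # sigmoid is monotonic, so argmax over scores equals argmax over h.
    return max(range(len(scores)), key=lambda k: scores[k])


thetas = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, -1.0, -1.0]]
x = [1.0, 2.0, 0.5]
print(ovr_predict(thetas, x))  # 0, since the scores are 2.0, 0.5, -1.5
```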

## Multi-class: One-vs-One (OvO)
**Front:** How does the One-vs-One (OvO) strategy work? <br/>
**Back:** Train a binary classifier for every pair of classes. This requires $\binom{K}{2} = \frac{K(K-1)}{2}$ classifiers. Prediction is made by majority vote from all classifiers.
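
The voting step might look like the sketch below. The dictionary `pairwise` of already-trained pair classifiers is a hypothetical stand-in (simple threshold rules on a 1D input), and ties are broken by whichever class was counted first, which is an assumption rather than part of the strategy.

```python
from collections import Counter
from itertools import combinations


def ovo_predict(pairwise, x, num_classes):
    """Majority vote; pairwise[(a, b)] maps x to the winner, a or b, for a < b."""
    votes = Counter(
        pairwise[(a, b)](x) for a, b in combinations(range(num_classes), 2)
    )
    return votes.most_common(1)[0][0]


# Hypothetical pair classifiers for K = 3 classes on a scalar input.
pairwise = {
    (0, 1): lambda x: 0 if x < 1 else 1,
    (0, 2): lambda x: 0 if x < 2 else 2,
    (1, 2): lambda x: 1 if x < 3 else 2,
}
print(len(pairwise))                  # 3 = K(K-1)/2 classifiers
print(ovo_predict(pairwise, 2.5, 3))  # 1, from the votes 1, 2, 1
```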

## Multi-class: Softmax Regression
**Front:** What is the direct multi-class generalization of logistic regression called, and what is its hypothesis function? <br/>
**Back:** Softmax Regression. For class $k$ with parameters $\theta^{(k)}$, $p(y=k|x;\Theta) = \frac{e^{\theta^{(k)T} x}}{\sum_{j=1}^K e^{\theta^{(j)T} x}}$, where $\Theta$ is the matrix of all parameters.

## Softmax Decision Boundaries
**Front:** In softmax regression, what is the nature of the decision boundary between any two classes $k$ and $l$? <br/>
**Back:** The boundary where $p(y=k|x) = p(y=l|x)$ is linear, defined by $(\theta^{(k)} - \theta^{(l)})^T x = 0$.
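
To see why the boundary is linear, set the two softmax probabilities equal. The shared denominator cancels, and taking logs leaves a linear equation in $x$:

$\frac{e^{\theta^{(k)T} x}}{\sum_{j=1}^K e^{\theta^{(j)T} x}} = \frac{e^{\theta^{(l)T} x}}{\sum_{j=1}^K e^{\theta^{(j)T} x}} \;\Rightarrow\; e^{\theta^{(k)T} x} = e^{\theta^{(l)T} x} \;\Rightarrow\; \theta^{(k)T} x = \theta^{(l)T} x \;\Rightarrow\; (\theta^{(k)} - \theta^{(l)})^T x = 0$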

## Softmax Cost Function
**Front:** What is the cost function for softmax regression, and what is it called? <br/>
**Back:** The categorical cross-entropy: $J(\Theta) = -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K 1\{y^{(i)}=k\} \log \frac{e^{\theta^{(k)T} x^{(i)}}}{\sum_{j=1}^K e^{\theta^{(j)T} x^{(i)}}}$.
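
Because the indicator $1\{y^{(i)}=k\}$ keeps only the true-class term, the cost is the average of $-\log$ of the probability assigned to the correct class. A sketch, assuming the softmax probabilities are already computed and that classes are indexed from 0 (the card indexes them from 1):

```python
import math


def softmax_cost(probs, labels):
    """Categorical cross-entropy.

    probs[i][k] = p(y = k | x_i); labels[i] is the true class index of x_i.
    """
    return -sum(math.log(p[k]) for p, k in zip(probs, labels)) / len(labels)


probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]  # hypothetical model outputs
print(softmax_cost(probs, [0, 1]))  # log(2), about 0.693
```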

## Pitfall: Interpreting Raw Scores
**Front:** Before applying the sigmoid or softmax, what do the raw scores $\theta^T x$ or $\theta^{(k)T} x$ represent? <br/>
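**Back:** They are **logits**. In the binary case, $\theta^T x$ is the log-odds $\log \frac{p(y=1|x)}{p(y=0|x)}$; in softmax, the $\theta^{(k)T} x$ are unnormalized log-probabilities. They range over the whole real line and are not probabilities. The activation function converts them to probabilities.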

## Pitfall: Numerical Stability (Softmax)
**Front:** What numerical issue can occur when computing $e^{\theta^{(k)T} x}$ for softmax, and how is it fixed? <br/>
**Back:** Large exponents can cause overflow. The fix is to subtract the maximum logit: $p(y=k|x) = \frac{e^{\theta^{(k)T} x - C}}{\sum_j e^{\theta^{(j)T} x - C}}$, where $C = \max_j(\theta^{(j)T} x)$.
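
A sketch of the max-subtraction fix in standard-library Python. The logits below are hypothetical and chosen so that the naive `math.exp(1000.0)` would raise `OverflowError`.

```python
import math


def softmax(logits):
    """Softmax with the maximum logit subtracted before exponentiating."""
    c = max(logits)
    exps = [math.exp(z - c) for z in logits]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]


p = softmax([1000.0, 1001.0, 1002.0])
print(p)  # approximately [0.090, 0.245, 0.665]
```

Subtracting $C$ multiplies numerator and denominator by the same factor $e^{-C}$, so the probabilities are unchanged: the result above matches `softmax([0.0, 1.0, 2.0])`.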


---

## Definition and Purpose
**Front:** What is the primary goal of logistic regression? <br/>
**Back:** To model the probability that a given input vector $\mathbf{x}$ belongs to a particular class (e.g., Class 1). It's used for binary classification.

## Model Output
**Front:** What is the direct output of the logistic regression model for an input $\mathbf{x}$? <br/>
**Back:** A probability value between 0 and 1: $p(C_1|\mathbf{x})$. The probability for the other class is $p(C_2|\mathbf{x}) = 1 - p(C_1|\mathbf{x})$.

## Linear Predictor
**Front:** What is the "linear predictor" or "activation" $a$ in logistic regression? <br/>
**Back:** A linear combination of the inputs (and a bias): $a = \mathbf{w}^T \mathbf{x} + w_0$, where $\mathbf{w}$ are the weights and $w_0$ is the bias term.

## Logistic Sigmoid Function
**Front:** What function maps the linear predictor $a$ to a valid probability, and what is its formula? <br/>
**Back:** The logistic sigmoid function: $\sigma(a) = \frac{1}{1 + \exp(-a)}$.

## Sigmoid Properties
**Front:** What are two key mathematical properties of the logistic sigmoid function? <br/>
**Back:** 1. It is bounded: $\sigma(a) \in (0, 1)$. 2. Its derivative is simple: $\frac{d\sigma}{da} = \sigma(a)(1 - \sigma(a))$.

## Discriminative Model
**Front:** What type of model is logistic regression: generative or discriminative? What does it model directly? <br/>
**Back:** It is a discriminative model. It directly models the posterior class probability $p(C_k|\mathbf{x})$, not the class-conditional densities $p(\mathbf{x}|C_k)$.

## Bernoulli Likelihood
**Front:** For a single data point with target $t \in \{0,1\}$, what is the probability (likelihood) under the model? <br/>
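**Back:** $p(t | \mathbf{x}, \mathbf{w}) = y^t (1-y)^{1-t}$, where $y = \sigma(\mathbf{w}^T\mathbf{x})$ is the model's predicted probability for class 1. Here the bias $w_0$ is absorbed into $\mathbf{w}$ by appending a constant feature of 1 to $\mathbf{x}$.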

## Cross-Entropy Error Derivation
**Front:** How is the error function $E(\mathbf{w})$ derived from the model likelihood for a full dataset? <br/>
**Back:** By taking the negative logarithm of the likelihood function. For N independent points, $E(\mathbf{w}) = -\sum_{n=1}^N [t_n \ln y_n + (1-t_n)\ln(1-y_n)]$.

## Error Function Name
**Front:** What is the common name for the error function $E(\mathbf{w})$ used in logistic regression? <br/>
**Back:** The binary cross-entropy error function.

## Gradient of the Error
**Front:** What is the gradient of the cross-entropy error $E(\mathbf{w})$ with respect to the weights $\mathbf{w}$? <br/>
**Back:** $\nabla E(\mathbf{w}) = \sum_{n=1}^N (y_n - t_n) \mathbf{x}_n$. It has a simple form resembling the error times the input.
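
The simple form follows from the chain rule together with the sigmoid derivative $\frac{d\sigma}{da} = \sigma(a)(1-\sigma(a))$. A sketch for a single data point, writing $a_n = \mathbf{w}^T\mathbf{x}_n$ and $E_n$ for the $n$-th term of $E(\mathbf{w})$ (both are shorthand introduced here):

$\frac{\partial E_n}{\partial y_n} = -\frac{t_n}{y_n} + \frac{1-t_n}{1-y_n} = \frac{y_n - t_n}{y_n(1-y_n)}$

$\frac{\partial y_n}{\partial a_n} = y_n(1-y_n), \qquad \nabla_{\mathbf{w}}\, a_n = \mathbf{x}_n$

$\nabla E_n(\mathbf{w}) = \frac{y_n - t_n}{y_n(1-y_n)} \cdot y_n(1-y_n) \cdot \mathbf{x}_n = (y_n - t_n)\,\mathbf{x}_n$

The factor $y_n(1-y_n)$ cancels, and summing over $n$ gives the stated gradient.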

## No Closed-Form Solution
**Front:** Can we find a closed-form solution for the optimal weights $\mathbf{w}$ by setting $\nabla E(\mathbf{w}) = 0$? Why or why not? <br/>
**Back:** No. Because $y_n = \sigma(\mathbf{w}^T\mathbf{x}_n)$ is a nonlinear function of $\mathbf{w}$, the gradient equation is nonlinear and has no closed-form solution.

## Optimization Method
**Front:** What general class of algorithms is required to find the optimal weights in logistic regression? <br/>
**Back:** Iterative numerical optimization algorithms, such as gradient descent or Newton's method.

## Newton's Method / IRLS
**Front:** What is a second-order optimization method used for logistic regression, and what is its common name in this context? <br/>
**Back:** Newton-Raphson method. When applied to logistic regression, it's called Iterative Re-weighted Least Squares (IRLS).

## IRLS Intuition
**Front:** Briefly, why is the Newton-Raphson update for logistic regression called "Iterative Re-weighted Least Squares"? <br/>
**Back:** Each update step solves a weighted least squares problem, where the weights on the data points depend on the current probability estimates, and these weights are re-computed each iteration.
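
In symbols, using notation assumed here rather than defined on the card: $\mathbf{X}$ is the design matrix with rows $\mathbf{x}_n^T$, $\mathbf{y}$ and $\mathbf{t}$ are the vectors of predictions and targets, and $\mathbf{R}$ is the diagonal matrix with entries $R_{nn} = y_n(1-y_n)$.

$\nabla E(\mathbf{w}) = \mathbf{X}^T(\mathbf{y} - \mathbf{t}), \qquad \mathbf{H} = \mathbf{X}^T \mathbf{R} \mathbf{X}$

$\mathbf{w}^{new} = \mathbf{w}^{old} - \mathbf{H}^{-1}\nabla E(\mathbf{w}^{old}) = (\mathbf{X}^T \mathbf{R} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{R}\,\mathbf{z}, \qquad \mathbf{z} = \mathbf{X}\mathbf{w}^{old} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t})$

The last expression has the form of the normal equations for a weighted least squares problem with weights $\mathbf{R}$ and working targets $\mathbf{z}$. Both depend on $\mathbf{w}^{old}$, so they must be recomputed at every iteration.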

## Multiclass Extension
**Front:** What is the standard extension of logistic regression to $K > 2$ classes called? <br/>
**Back:** Multinomial logistic regression, or softmax regression.

## Softmax Function
**Front:** What is the softmax function that generalizes the sigmoid for $K$ classes? <br/>
**Back:** For a vector of linear activations $\mathbf{a}$, the softmax for class $k$ is: $\text{softmax}(a_k) = \frac{\exp(a_k)}{\sum_{j=1}^K \exp(a_j)}$.

## Linear Decision Boundary
**Front:** What is the shape of the decision boundary (where $p(C_1|\mathbf{x}) = p(C_2|\mathbf{x})$) in basic logistic regression? <br/>
**Back:** It is linear, defined by the equation $\mathbf{w}^T\mathbf{x} + w_0 = 0$.

## Feature Transforms
**Front:** How can logistic regression model nonlinear decision boundaries? <br/>
**Back:** By using fixed, nonlinear basis functions $\phi(\mathbf{x})$ to transform the input. The model becomes $p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^T\phi(\mathbf{x}))$.
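
As an illustration, the hypothetical basis $\phi(\mathbf{x}) = (1,\; x_1^2 + x_2^2)$ below turns the linear boundary in feature space into a circle in input space. The weights are chosen by hand, not learned.

```python
import math


def phi(x):
    """Hypothetical fixed basis: a constant plus the squared radius."""
    x1, x2 = x
    return [1.0, x1 * x1 + x2 * x2]


def prob_class1(w, x):
    """p(C_1 | x) = sigmoid(w^T phi(x))."""
    a = sum(wi * fi for wi, fi in zip(w, phi(x)))
    return 1.0 / (1.0 + math.exp(-a))


w = [4.0, -1.0]  # boundary w^T phi(x) = 0 is the circle x1^2 + x2^2 = 4
print(prob_class1(w, [0.0, 0.0]) > 0.5)  # True: inside the circle
print(prob_class1(w, [3.0, 0.0]) > 0.5)  # False: outside the circle
print(prob_class1(w, [2.0, 0.0]))        # 0.5: exactly on the boundary
```

The model is still linear in $\mathbf{w}$, so the cost function and gradient keep the same form with $\phi(\mathbf{x})$ in place of $\mathbf{x}$.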

## Pitfall: Separable Data
**Front:** What pathological problem occurs if the training data is perfectly linearly separable? <br/>
**Back:** Maximum likelihood estimation will drive the magnitude of $\mathbf{w}$ to infinity. The sigmoid becomes infinitely steep, outputting overconfident probabilities of 0 or 1.
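
The effect can be seen on a tiny separable dataset. The sketch assumes a 1D model without bias, $y = \sigma(wx)$, and hypothetical data where the sign of $x$ determines the class. Scaling $w$ up keeps increasing the log-likelihood toward its supremum of 0, so no finite maximizer exists.

```python
import math


def log_likelihood(w, xs, ts):
    """Log-likelihood of the 1D model y = sigmoid(w * x), no bias term."""
    total = 0.0
    for x, t in zip(xs, ts):
        a = w * x
        # log sigmoid(a) = -log(1 + e^{-a});  log(1 - sigmoid(a)) = -log(1 + e^{a})
        total += -math.log1p(math.exp(-a)) if t == 1 else -math.log1p(math.exp(a))
    return total


xs = [-2.0, -1.0, 1.0, 2.0]
ts = [0, 0, 1, 1]  # perfectly separated by the sign of x
values = [log_likelihood(w, xs, ts) for w in (1.0, 10.0, 100.0)]
print(values)  # strictly increasing, each value negative and closer to 0
```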

## Pitfall: Overconfidence Consequence
**Front:** Why is the "infinite weights" solution for separable data undesirable? <br/>
**Back:** It represents severe overfitting. The model loses its probabilistic interpretation (becomes a step function) and will generalize poorly, especially near the decision boundary.

## Special Consideration: Regularization
**Front:** What is the standard technique to prevent the infinite weight problem and control model complexity? <br/>
**Back:** Add a regularization term to the error function, e.g., $E_{reg}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2$. This is equivalent to placing a Gaussian prior on the weights (MAP estimation).
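
The MAP connection can be sketched as follows, assuming a zero-mean isotropic Gaussian prior with precision $\lambda$, $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \lambda^{-1}\mathbf{I})$, and writing $\mathbf{t}$ for the vector of training targets (notation introduced here):

$-\ln p(\mathbf{w} \mid \mathbf{t}) = -\ln p(\mathbf{t} \mid \mathbf{w}) - \ln p(\mathbf{w}) + \text{const} = E(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2 + \text{const}$

Minimizing $E_{reg}(\mathbf{w})$ therefore yields the same $\mathbf{w}$ as maximizing the posterior.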

## Regularization Effect
**Front:** How does regularization (e.g., $\frac{\lambda}{2}\|\mathbf{w}\|^2$) affect the learned weights and probabilities? <br/>
**Back:** It shrinks the weights toward zero, preventing extreme values. This makes the predicted probability curves smoother and less overconfident.