# Lab 2: Logistic Regression (Solutions)
<sub><sup>*written by riya karumanchi & isabel sieh, cs124 staff team, winter '25/'26*</sup></sub>

## Part 1: Sigmoid Intuition

First, form a group of 3 students to work together! Introduce yourselves to one another.

In logistic regression, we compute a real-valued score
$$
z = w\cdot x + b
$$
and then turn it into a probability using the sigmoid (logistic) function
$$
\hat{y} = s(z)=\frac{1}{1+\exp(-z)}.
$$
**$z$** can be any real number, but **$s(z)$** is always between 0 and 1, and it "squashes" large-magnitude values toward 0 or 1.
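As a quick sanity check on the squashing behavior, here is a minimal Python sketch of the sigmoid. The function name `sigmoid` and the sample scores are illustrative choices, not part of the lab. Note that this naive form can raise an overflow error for very negative $z$ (for example $z=-1000$), which a more careful implementation would guard against.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Sample scores (illustrative): large |z| is pushed toward 0 or 1.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f}  ->  s(z) = {sigmoid(z):.5f}")
```

Every printed value lies strictly between 0 and 1, and $s(0)$ is exactly 0.5.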

### A. "Same sign, different confidence"

This part is meant to be straightforward—it's a sanity check to make sure you can interpret the sigmoid curve correctly. Look at the sigmoid function plotted below:

![Sigmoid function](sigmoid.png)

Below are three different scores ($z$). Using the figure above, rank the predicted probabilities ($\hat{y}=s(z)$) from largest to smallest, and briefly justify using only the *shape* of the sigmoid:

1. $z = 0.2$
2. $z = 2$
3. $z = 6$

**Question 1.** Which one has the highest probability of $y=1$? Which one has the lowest? Why?

Rank the probabilities:
$$
s(6) > s(2) > s(0.2).
$$
- Highest probability of $y=1$: $z=6$
- Lowest probability of $y=1$: $z=0.2$

Justification: the sigmoid is strictly increasing, so a larger $z$ always corresponds to a larger $s(z)$.

**Question 2.** Which pair is *closer together* as probabilities: $s(0.2)$ vs $s(2)$, or $s(2)$ vs $s(6)$? Why?

$$
s(2)\ \text{vs}\ s(6)\ \text{are closer}.
$$
Justification: the sigmoid curve is **flatter for large positive $z$** (saturation), so increasing $z$ from 2 to 6 changes probability less than increasing $z$ from 0.2 to 2.
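The two answers above can also be checked numerically instead of reading them off the figure. The sketch below is only an illustration; the variable names (`probs`, `gap_low`, `gap_high`) are assumptions made for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


scores = [0.2, 2.0, 6.0]
probs = {z: sigmoid(z) for z in scores}

# Rank the scores by predicted probability, largest first.
ranked = sorted(scores, key=lambda z: probs[z], reverse=True)
print("ranking by probability:", ranked)

# Compare the two gaps from Question 2.
gap_low = probs[2.0] - probs[0.2]   # s(2) - s(0.2)
gap_high = probs[6.0] - probs[2.0]  # s(6) - s(2)
print(f"s(2) - s(0.2) = {gap_low:.4f}")
print(f"s(6) - s(2)   = {gap_high:.4f}")
```

The second gap is the smaller one even though the change in $z$ is larger (4 versus 1.8), which is the saturation effect described in the justification.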

### B. "How the sigmoid moves points"

Given reference values:
* $s(0)=0.5$
* $s(2)\approx 0.88$
* $s(-2)\approx 0.12$
* $s(4)\approx 0.98$
* $s(-4)\approx 0.02$

**Question 3.** Suppose two different feature vectors produce scores $z=2$ and $z=4$. In terms of probability, how much did the prediction change? What does this illustrate about "squashing"? (One sentence.)

Compare $z=2$ and $z=4$.

From the reference values:
- $s(2)\approx 0.88$
- $s(4)\approx 0.98$

So:
$$
\Delta z = +2,\quad \Delta \hat{y} \approx 0.98 - 0.88 = +0.10.
$$

**Key point:** Even though the score $z$ increased by $+2$, the probability increased by only about $+0.10$. The sigmoid "squashes" large logits, so probabilities show diminishing returns near 0 and 1 (saturation).
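To see the diminishing returns more broadly, the following sketch applies the same step $\Delta z = +2$ from a few starting scores. The starting points 0, 2, and 4 are chosen for illustration; only the step from 2 to 4 comes from the question.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


starts = [0.0, 2.0, 4.0]  # illustrative starting scores
step = 2.0                # same change in z every time

# Change in probability produced by the same change in score.
deltas = [sigmoid(z + step) - sigmoid(z) for z in starts]

for z, d in zip(starts, deltas):
    print(f"z: {z:.0f} -> {z + step:.0f}   change in probability = {d:.4f}")
```

Each identical step in $z$ buys a smaller change in probability than the one before it.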

**Question 4.** Now compare $z=-2$ and $z=-4$. What symmetry do you notice?

From the reference values:
- $s(-2)\approx 0.12$
- $s(-4)\approx 0.02$

Symmetry property:
$$
s(-z)=1-s(z).
$$
Interpretation: logits of equal magnitude but opposite sign produce probabilities that are "mirrors" around 0.5 (e.g. 0.88 ↔ 0.12, 0.98 ↔ 0.02).
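The symmetry property can be verified numerically. This is a small sketch, assuming a few arbitrary test values of $z$; it checks that $s(-z)$ and $1-s(z)$ agree up to floating-point rounding.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


zs = [0.5, 2.0, 4.0]  # arbitrary test values

for z in zs:
    print(f"z = {z}: s(-z) = {sigmoid(-z):.4f}, 1 - s(z) = {1.0 - sigmoid(z):.4f}")

# Largest disagreement between the two sides of s(-z) = 1 - s(z).
max_gap = max(abs(sigmoid(-z) - (1.0 - sigmoid(z))) for z in zs)
print("largest difference:", max_gap)
```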

We will now go back to the whole class and discuss group answers for Part 1 in a plenary session.

---

## Part 2: One step of gradient descent (how logistic regression learns)

For the following problem, please choose a group facilitator/representative who will also take notes on your discussion.

Each document (sentence/comment) is converted into a feature vector $x$. The model computes:

* **score (logit):** $z = w\cdot x + b$

  * $x$ is the feature vector for the document
  * $w$ is the weight vector (one weight per feature)
  * $b$ is a bias term (a constant offset)

* **predicted probability:** $\hat{y} = s(z)$, where
  $$
  s(z)=\frac{1}{1+\exp(-z)}
  $$
  $\hat{y}$ is the model's predicted probability that $y=1$ for this document.

* **true label:** $y \in \{0,1\}$ is the correct label for the document

  * $y=1$: positive (or "class 1")
  * $y=0$: negative (or "class 0")

The loss we use in lecture is the cross-entropy loss $L_{CE}$. For a single training example $(x,y)$, the derivative of the loss with respect to weight $w_j$ is:

$$
\frac{\partial L_{CE}}{\partial w_j} = [s(w\cdot x + b)-y]x_j
$$

Here is what each term means, in plain language:

* $L_{CE}$: the loss (how "wrong" the model is on this example)
* $w_j$: the weight for feature $j$
* $x_j$: the value of feature $j$ for this document
* $w\cdot x + b$: the score $z$ (total evidence before applying the sigmoid)
* $s(w\cdot x + b)$: the predicted probability ($\hat{y}$)
* $\hat{y}-y$: the "error term" (positive if we predicted too high; negative if we predicted too low)
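The gradient formula translates almost directly into code. Below is a minimal sketch; the function name `ce_weight_gradient` and the example weights and features are hypothetical and chosen so that the score works out to exactly 0.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def ce_weight_gradient(w: List[float], b: float, x: List[float], y: int) -> List[float]:
    """Return [dL_CE/dw_j for each j] = (y_hat - y) * x_j for one example."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b  # score (logit)
    y_hat = sigmoid(z)                            # predicted probability
    error = y_hat - y                             # error term
    return [error * xj for xj in x]


# Hypothetical example: z = 0.5*1 + (-0.25)*2 + 0 = 0, so y_hat = 0.5.
grad = ce_weight_gradient(w=[0.5, -0.25], b=0.0, x=[1.0, 2.0], y=1)
print(grad)
```

With $y=1$ the error term is $0.5-1=-0.5$, so each gradient component is $-0.5$ times the corresponding feature value.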

Finally, gradient descent updates parameters by moving **against** the gradient:

$$
q \leftarrow q - h g
$$

where:

* $q$ is the parameter vector (it contains the weights and bias)
* $h$ is the learning rate
* $g$ is the gradient vector
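As a sketch of the update rule, the function below applies $q \leftarrow q - hg$ to plain Python lists. The name `gd_step` and the numbers in the example are illustrative assumptions, not values from the lab.

```python
from typing import List


def gd_step(q: List[float], g: List[float], h: float) -> List[float]:
    """One gradient descent step: move each parameter against its gradient."""
    return [qi - h * gi for qi, gi in zip(q, g)]


# Illustrative values: three parameters, all starting at 0.
q = [0.0, 0.0, 0.0]
g = [2.0, -1.0, 0.5]
q_new = gd_step(q, g, h=0.1)
print(q_new)
```

Parameters with a positive gradient component go down, and parameters with a negative gradient component go up.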

### Setup

We will classify a **movie review comment** using two word-count features:

* $x_1$ = number of **positive** words in the comment (from a small positive lexicon)
* $x_2$ = number of **negative** words in the comment (from a small negative lexicon)

Consider the comment:

> "The acting was great and the soundtrack was incredible, and the cinematography was amazing — but the plot was boring and the ending was bad."

Assume our lexicons contain:

* positive words: {great, incredible, amazing}
* negative words: {boring, bad}

So the feature vector is:

* $x_1 = 3$ (great, incredible, amazing)
* $x_2 = 2$ (boring, bad)
* $x = [3,2]$
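The counting step can be sketched in Python as follows. The tokenization (lowercasing and keeping runs of letters) and the names `extract_features`, `POSITIVE`, and `NEGATIVE` are assumptions made for this illustration; the lab does not specify how words are split.

```python
import re
from typing import List

POSITIVE = {"great", "incredible", "amazing"}
NEGATIVE = {"boring", "bad"}


def extract_features(comment: str) -> List[int]:
    """Return [x1, x2] = [number of positive words, number of negative words]."""
    # Assumed tokenization: lowercase, then keep runs of the letters a-z.
    tokens = re.findall(r"[a-z]+", comment.lower())
    x1 = sum(token in POSITIVE for token in tokens)
    x2 = sum(token in NEGATIVE for token in tokens)
    return [x1, x2]


comment = (
    "The acting was great and the soundtrack was incredible, "
    "and the cinematography was amazing \u2014 but the plot was boring "
    "and the ending was bad."
)
x = extract_features(comment)
print(x)
```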

We will start with:

* $w_1 = 0$, $w_2 = 0$, $b = 0$

So initially:
$$
z = w\cdot x + b = 0 \quad\Rightarrow\quad \hat{y}=s(0)=0.5
$$

Let the learning rate be $h = 0.1$.

---

### Case 1: The comment is labeled positive ($y=1$)

1. Compute $\hat{y}-y$ at initialization. Is it positive or negative?

Compute $\hat{y}-y$:
$$
\hat{y}-y = 0.5 - 1 = -0.5
$$
Negative.

2. Using
   $$
   \frac{\partial L_{CE}}{\partial w_j} = (\hat{y}-y)x_j,
   $$
   determine the sign (positive or negative) of:

   * $\frac{\partial L_{CE}}{\partial w_1}$
   * $\frac{\partial L_{CE}}{\partial w_2}$

Signs of gradients:
$$
\frac{\partial L_{CE}}{\partial w_1} = (\hat{y}-y)x_1 = (-0.5)(3) < 0
$$
$$
\frac{\partial L_{CE}}{\partial w_2} = (\hat{y}-y)x_2 = (-0.5)(2) < 0
$$
So both gradient components are negative.

3. Gradient descent updates:
   $$
   w_j \leftarrow w_j - h\frac{\partial L_{CE}}{\partial w_j}
   $$
   Will $w_1$ increase or decrease? Will $w_2$ increase or decrease?

The update subtracts the gradient, and subtracting a negative value increases each weight:
- $w_1$ increases
- $w_2$ increases

(Optional numeric update:)
$$
w_1 \leftarrow 0 - 0.1(-1.5)= +0.15,\quad
w_2 \leftarrow 0 - 0.1(-1.0)= +0.10.
$$

4. After this update, will the new score $z = w\cdot x + b$ be larger or smaller than before?
   Therefore, will $\hat{y}=s(z)$ move toward **1** or toward **0**?

Since $x_1,x_2>0$ and both weights increased, the score $z=w\cdot x+b$ increases. Therefore $\hat{y}=s(z)$ increases and moves toward 1.

(Optional numeric check, with $b$ left at 0 because only the weight updates were computed above:)
$$
z_{\text{new}} = (0.15)(3) + (0.10)(2) + 0 = 0.65 \Rightarrow \hat{y}_{\text{new}} = s(0.65) > 0.5.
$$
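The whole of Case 1 can be reproduced in a few lines. This is a sketch under the same setup ($x=[3,2]$, zero initial weights, $h=0.1$, $y=1$); as in the worked numbers, only the weights are updated and the bias is held at 0.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


x = [3.0, 2.0]   # feature vector from the setup
w = [0.0, 0.0]   # initial weights
b = 0.0          # bias, held fixed in this sketch
h = 0.1          # learning rate
y = 1            # Case 1 label

z = sum(wj * xj for wj, xj in zip(w, x)) + b
y_hat = sigmoid(z)                                  # 0.5 at initialization
error = y_hat - y                                   # -0.5
grad = [error * xj for xj in x]                     # [-1.5, -1.0]
w_new = [wj - h * gj for wj, gj in zip(w, grad)]    # weights move up

z_new = sum(wj * xj for wj, xj in zip(w_new, x)) + b
y_hat_new = sigmoid(z_new)

print("updated weights:", w_new)
print(f"new score: {z_new:.2f}, new probability: {y_hat_new:.3f}")
```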

---

### Case 2: The same comment is labeled negative ($y=0$)

Repeat the steps from Case 1, but with $y=0$.

5. What is the sign of $\hat{y}-y$ now?

Compute $\hat{y}-y$:
$$
\hat{y}-y = 0.5 - 0 = +0.5
$$
Positive.

6. Will $w_1$ and $w_2$ increase or decrease?

Signs of gradients:
$$
\frac{\partial L_{CE}}{\partial w_1} = (0.5)(3) > 0,\quad
\frac{\partial L_{CE}}{\partial w_2} = (0.5)(2) > 0
$$
Subtracting positive gradients decreases both weights:
- $w_1$ decreases
- $w_2$ decreases

(Optional numeric update:)
$$
w_1 \leftarrow 0 - 0.1(1.5)= -0.15,\quad
w_2 \leftarrow 0 - 0.1(1.0)= -0.10.
$$

7. Will $z$ increase or decrease? Will $\hat{y}$ move toward 1 or toward 0?

With positive features and smaller (more negative) weights, $z=w\cdot x+b$ decreases. Therefore $\hat{y}=s(z)$ decreases and moves toward 0.

(Optional numeric check, with $b$ left at 0 because only the weight updates were computed above:)
$$
z_{\text{new}} = (-0.15)(3) + (-0.10)(2) + 0 = -0.65 \Rightarrow \hat{y}_{\text{new}} = s(-0.65) < 0.5.
$$
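Repeating the Case 2 update a few times shows the prediction continuing to move toward 0. The sketch below goes slightly beyond the single step asked for in the lab: the choice of five iterations is an assumption for illustration, and the bias is again held at 0 so that only the weights change.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


x = [3.0, 2.0]   # feature vector from the setup
w = [0.0, 0.0]   # initial weights
b = 0.0          # bias, held fixed in this sketch
h = 0.1          # learning rate
y = 0            # Case 2 label

y_hats = []      # prediction before each update
for step in range(5):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    y_hat = sigmoid(z)
    y_hats.append(y_hat)
    error = y_hat - y
    # Weight-only gradient step: w_j <- w_j - h * (y_hat - y) * x_j
    w = [wj - h * error * xj for wj, xj in zip(w, x)]
    print(f"step {step}: z = {z:+.3f}, y_hat = {y_hat:.3f}")
```

The first line printed is the initialization ($z=0$, $\hat{y}=0.5$), and the second corresponds to the single step worked out above ($z=-0.65$).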

---

### Discussion (one sentence each)

8. In one sentence: explain why $(\hat{y}-y)$ makes sense as an "error signal."

$(\hat{y}-y)$ is an "error signal" because its sign encodes whether the model is overpredicting or underpredicting, and its magnitude encodes by how much:
- if $\hat{y}>y$, then $\hat{y}-y>0$ and the update pushes $z$ down
- if $\hat{y}<y$, then $\hat{y}-y<0$ and the update pushes $z$ up

9. In one sentence: explain why multiplying by $x_j$ makes sense (why a feature that appears more should change its weight more).

Multiplying by $x_j$ makes sense because if feature $j$ appears more in the document (larger $x_j$), it contributed more to the score $z$ for this example, so the update for its weight $w_j$ should be larger in magnitude.
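A small sketch makes the scaling concrete. It reuses the Case 1 error term and adds a hypothetical third feature with value 0 (not part of the lab's setup) to show that a feature absent from the document gets no update from this example.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# x1 = 3 and x2 = 2 come from the setup; the third feature is hypothetical.
x = [3.0, 2.0, 0.0]
w = [0.0, 0.0, 0.0]
b = 0.0
y = 1

y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
error = y_hat - y                    # same error term for every weight
grads = [error * xj for xj in x]     # scaled by each feature's value

for j, (xj, gj) in enumerate(zip(x, grads), start=1):
    print(f"feature {j}: x_j = {xj}, gradient = {gj}")
```

The error term is shared across all weights, so the only thing that distinguishes one weight's update from another's is the feature value $x_j$.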

We will now go back to the whole class and discuss group answers for Part 2 in a plenary session.