# CSCI 632 Machine Learning Homework 9

This homework provides a sampling of questions from last year's final, plus some
other questions that cover topics that might appear on the final.
It also covers the last bit of material in the class.


**Instructions**

* **Insert all code, plots, results, and discussion** into this Jupyter Notebook.
* Your homework should be submitted as a **single Jupyter Notebook** (.ipynb file).
* While working, you may use Google Colab by uploading this notebook and performing your work there. Once complete, export the notebook as a Jupyter Notebook (.ipynb) and submit it to **Blackboard.**

You can answer mathematical questions either by:
* using LaTeX in a markdown cell, or
* pasting a scanned or photographed handwritten answer.

### Part I: True / False

1) True/False: The ReLU activation function is differentiable everywhere.

2) True/False: Precision is the ratio of true positives to the total number of positive predictions.

3) True/False: The logistic activation function limits output to the range (0,1).

4) True/False: High accuracy guarantees good performance in a dataset with class imbalance.

5) True/False: The softmax activation function is only used in binary classification tasks.

#### Additional study

Expect other questions about anything covered in a lecture. This includes but is not limited to:
* optimal classifiers
* Bayes' risk
* GDA
* logistic regression
* information, information gain, self-entropy
* decision trees
* random forests
* multi-layer perceptrons
* backpropagation
* MLP for multi-class classifiers.

### Part II: Multiple Choice

Throughout this homework, $y$ refers to the output and $x$ to the input
of a model. NOTE: there is exactly one correct answer to each question.

6) Which activation function is appropriate for a regression task where the target values span the nonnegative real line?

  * (A) ReLU
  * (B) Logistic
  * (C) Tanh
  * (D) None (use a linear activation)

7) Which of the following is true about the difference between generative and discriminative models?

  * (A) Generative models learn the posterior probability $P(y \mid x)$, while discriminative models learn likelihood $P(x \mid y)$.
  * (B) Generative models model the likelihood $P(x \mid y)$ and the class prior $P(y)$, while discriminative models directly estimate the posterior $P(y \mid x)$.
  * (C) Logistic regression is a generative model, while GDA is a discriminative model.
  * (D) Generative models cannot be used for classification.

8) What is the primary goal of the optimal Bayes' classifier?
  * (A) To maximize the likelihood of the observed data.
  * (B) To minimize the probability of misclassification.
  * (C) To find the decision boundary that separates the classes perfectly.
  * (D) To estimate the class priors  $P(y)$  accurately.

9) Logistic regression's decision boundary is:
  * (A) Non-linear in the input space.
  * (B) Linear in the input space.
  * (C) Linear in the feature space only when using softmax.
  * (D) Based on $P(x \mid y)$.

10) Gaussian Discriminant Analysis (GDA) assumes:
  * (A) The class-conditional distributions  $P(x \mid y)$  are Gaussian.
  * (B) The covariances of the Gaussian distributions for all classes are identical.
  * (C) The class priors  $P(y)$  are known or can be estimated.
  * (D) All of the above.

#### Additional study

Expect other questions about anything covered in a lecture. This includes but is not limited to the topics listed under "Additional study" in Part I.

### Part III: Classifier Performance

A binary classifier $h$ predicts whether an image contains a car (positive class).

| # | Image | Description    | $y$ | $\hat{y}$ |
|---|-------|----------------|-----|-----------|
| 1 | 🚗    | red car        | T   | T         |
| 2 | 🚙    | SUV            | T   | F         |
| 3 | 🚕    | taxi           | T   | F         |
| 4 | 🚓    | police car     | T   | F         |
| 5 | 🚌    | bus            | F   | F         |
| 6 | 🚚    | delivery truck | F   | F         |
| 7 | 🚲    | bicycle        | F   | F         |
| 8 | 🛵    | scooter        | F   | F         |
| 9 | 🐶    | dog            | F   | F         |

11) Create a confusion matrix for $h$ and fill each of its four cells with the
number of samples that fall in that cell.

12) Compute recall


13) Compute precision

14) Compute accuracy
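
The recall, precision, and accuracy asked for above all follow from the four confusion-matrix counts. The sketch below is one possible way to check a hand calculation; the function name `binary_metrics` and the toy labels are illustrative assumptions and deliberately differ from the table above.

```python
def binary_metrics(y_true, y_pred):
    """Return confusion-matrix counts plus recall, precision, and accuracy.

    Both arguments are equal-length sequences of booleans (True = positive).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        # Guard against empty denominators (e.g. no positive predictions).
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical toy labels, NOT the homework table.
toy_true = [True, True, False, False, True]
toy_pred = [True, False, True, False, True]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

Note that precision is undefined when the classifier makes no positive predictions, which is why the sketch returns `nan` in that case rather than raising an error.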


### Part IV: Bayes' Law

**Problem 15**: A screening test is used to detect a certain type of cancer. The test has the following characteristics:

* If a person has the cancer, the test correctly gives a positive result 98% of the time.
* If a person does not have the cancer, the test correctly gives a negative result 92% of the time.
* This cancer is relatively rare and affects 2 out of every 1,000 people in the general population.

A randomly selected person takes the test, and the result is positive.
What is the probability that this person actually has the cancer?
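
A problem of this kind can be checked numerically once the sensitivity, specificity, and prevalence are identified. The following is a minimal sketch, assuming a hypothetical helper `posterior_given_positive` and toy numbers that are not the ones in Problem 15.

```python
def posterior_given_positive(sensitivity, specificity, prevalence):
    """P(condition | positive test) via Bayes' law.

    sensitivity = P(positive | condition)
    specificity = P(negative | no condition)
    prevalence  = P(condition)
    """
    true_pos = sensitivity * prevalence                    # P(+ | D) P(D)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(+ | not D) P(not D)
    return true_pos / (true_pos + false_pos)               # divide by P(+)

# Hypothetical test characteristics, chosen only for illustration.
toy_posterior = posterior_given_positive(
    sensitivity=0.9, specificity=0.9, prevalence=0.1
)
print(f"P(condition | positive) = {toy_posterior:.3f}")
```

With these toy values the true-positive and false-positive masses are equal, so the posterior is one half even though the test is 90% accurate in both directions.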

### Part V: Shannon-Entropy, Cross-Entropy, Self-Information

**Problem 16**

Consider a 6-sided die with the following probabilities:

\begin{align*}
P(1) &= 0.1 \\
P(2) &= 0.1 \\
P(3) &= 0.2 \\
P(4) &= 0.2 \\
P(5) &= 0.2 \\
P(6) &= 0.2
\end{align*}

**(a)** Compute the Shannon entropy $H(X)$ of a roll in bits.


**(b)** What would the entropy be if the die were fair?

**(c)** Briefly explain why the entropy changes when the die becomes biased.
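
For checking entropy calculations, a small helper like the one below may be useful. It is a sketch only: the name `entropy_bits` is an assumption, and the example distributions are toy cases rather than the die in Problem 16.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p = 0 contribute nothing (the limit of p*log2(p) is 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy distributions, not the homework die.
fair_coin = entropy_bits([0.5, 0.5])
skewed_three = entropy_bits([0.5, 0.25, 0.25])
print(fair_coin, skewed_three)
```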

**Additional material**

Review the cross-entropy problem in HW 6.



### Part VI: Decision Trees and Random Forests

**Problem 17** 

You are building a decision tree to predict whether a customer will **buy** a product
(`Buy = Yes` or `Buy = No`) based on simple demographic features.

You have data for 14 customers. The target variable $Y$ = `Buy` has the following
distribution:

- 9 customers: `Buy = Yes`
- 5 customers: `Buy = No`

So:

$$P(Y = \text{Yes}) = \frac{9}{14}, \quad P(Y = \text{No}) = \frac{5}{14}.$$

You are considering two different attributes for the root split:

- `AgeGroup` with values: `Young`, `Middle`, `Old`
- `Student` with values: `Yes`, `No`

The data grouped by attribute values are:

**Split by `AgeGroup`:**

- `Young`: 5 customers  
  - 2: `Buy = Yes`  
  - 3: `Buy = No`
- `Middle`: 4 customers  
  - 3: `Buy = Yes`  
  - 1: `Buy = No`
- `Old`: 5 customers  
  - 4: `Buy = Yes`  
  - 1: `Buy = No`

**Split by `Student`:**

- `Student = Yes`: 6 customers  
  - 5: `Buy = Yes`  
  - 1: `Buy = No`
- `Student = No`: 8 customers  
  - 4: `Buy = Yes`  
  - 4: `Buy = No`

Use base-2 logarithms in all entropy calculations.


**(a)** Compute the entropy of the target variable $H(Y)$ before any split:

$$H(Y) = - \sum_{y \in \{\text{Yes}, \text{No}\}} P(Y = y) \log_2 P(Y = y).$$

Show your work.


**(b)** Compute the conditional entropy $H(Y \mid \text{AgeGroup})$.

That is,

$$H(Y \mid \text{AgeGroup}) =
\sum_{a \in \{\text{Young}, \text{Middle}, \text{Old}\}}
P(\text{AgeGroup} = a) \, H(Y \mid \text{AgeGroup} = a).$$


**(c)** Compute the conditional entropy $H(Y \mid \text{Student})$.


**(d)** Compute the information gain for each attribute.

Which attribute should be chosen as the root split according to the information gain criterion?
Briefly justify your answer using the values you computed.
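
Information gain is the parent entropy minus the weighted average of the child entropies, $IG = H(Y) - H(Y \mid A)$. The sketch below works from class counts; the function names and the toy counts are assumptions for illustration and are not the customer data above.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of a node described by its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    conditional = sum(
        (sum(child) / total) * entropy_from_counts(child)
        for child in children_counts
    )
    return entropy_from_counts(parent_counts) - conditional

# Toy counts [n_yes, n_no], NOT the homework data.
perfect_split = information_gain([4, 4], [[4, 0], [0, 4]])
useless_split = information_gain([4, 4], [[2, 2], [2, 2]])
print(perfect_split, useless_split)
```

A split that separates the classes completely recovers all of the parent entropy, while a split whose children mirror the parent distribution gains nothing.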


### Part VII: Multi-Layer Perceptrons

Consider a Multi-Layer Perceptron (MLP) with the following structure:

* Input layer: 2 nodes.
* Hidden layer: 2 nodes with ReLU activation $\text{ReLU}(x) = \max(0, x)$.
* Output layer: 3 nodes with softmax activation.

Given

* Input

$$\mathbf{a}^{[0]} = \mathbf{x} = \begin{bmatrix} 1.0 \\ 2.0 \end{bmatrix}.$$

* Weights and biases for the hidden layer ($\ell=1$):

$$W^{[1]} = \begin{bmatrix}
  0.5 & -1.0 \\
  1.5 & 2.0
\end{bmatrix}, \quad
\mathbf{b}^{[1]} = \begin{bmatrix}
  0.0 \\
  -0.5
\end{bmatrix}.
$$

* Weights and biases for the output layer ($\ell=2$):

$$W^{[2]} = \begin{bmatrix}
2.0 & -1.0 \\
-1.0 & 1.0 \\
0.5 & 0.5
\end{bmatrix}, \quad
\mathbf{b}^{[2]} = \begin{bmatrix}
0.0 \\
0.5 \\
-0.5
\end{bmatrix}.
$$

**Problem 18** Draw this multi-layer perceptron, labelling each node and the weight on each edge. Draw each node as a box containing a summation symbol,
pre-activation label, activation function symbol, and activation label.
Show the bias into each node.

**Problem 19** Is this neural network most appropriate for regression
across the positive real numbers, regression across the entire real line,
binary classification, or multi-class classification? Why?


**Problem 20** Compute the pre-activation values $\mathbf{z}^{[1]}$ for the
hidden layer $\ell=1$:

$$\mathbf{z}^{[1]} = W^{[1]} \cdot \mathbf{x} + \mathbf{b}^{[1]}$$


**Problem 21** Apply the ReLU activation to compute the hidden layer activations
$\mathbf{a}^{[1]}$:

$$\mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]})$$


**Problem 22** Compute the pre-activation values
$\mathbf{z}^{[2]}$ for the output layer:

$$\mathbf{z}^{[2]} = W^{[2]} \cdot \mathbf{a}^{[1]} + \mathbf{b}^{[2]}.$$

**Problem 23** Apply the softmax activation to compute the output probabilities
$\mathbf{a}^{[2]}$:

$$a^{[2]}_i = \text{softmax}(\mathbf{z}^{[2]})_i = \frac{e^{z^{[2]}_i}}{\sum_{j=1}^3 e^{z^{[2]}_j}}$$


**Problem 24** Which class does the network predict based on the softmax output?
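
The forward pass in Problems 20 to 24 (affine map, ReLU, affine map, softmax, argmax) can be sketched in plain Python to check hand calculations. Everything below is illustrative: the helper names are assumptions, and the toy parameters are hypothetical and use two output classes rather than the three in this part.

```python
import math

def affine(W, a, b):
    """Compute W a + b for a list-of-rows matrix W and list vectors a, b."""
    return [sum(w * ai for w, ai in zip(row, a)) + bi for row, bi in zip(W, b)]

def relu(z):
    """Element-wise max(0, z)."""
    return [max(0.0, zi) for zi in z]

def softmax(z):
    """Softmax with the maximum subtracted for numerical stability."""
    shift = max(z)
    exps = [math.exp(zi - shift) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy parameters, NOT the homework network.
x = [1.0, 1.0]
W1, b1 = [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]
W2, b2 = [[0.0, 0.0], [1.0, 0.0]], [0.0, 0.0]

z1 = affine(W1, x, b1)    # hidden pre-activations
a1 = relu(z1)             # hidden activations
z2 = affine(W2, a1, b2)   # output pre-activations
a2 = softmax(z2)          # class probabilities
predicted = max(range(len(a2)), key=a2.__getitem__)  # index of largest probability
print(z1, a1, z2, a2, predicted)
```

Subtracting the maximum before exponentiating does not change the softmax output but avoids overflow for large pre-activations.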


### Part VIII: Backpropagation

You are tasked with deriving the gradient update rules for a
three-layer neural network, where layer $\ell=0$ is the inputs, with a
ReLU activation function at hidden layer $\ell=1$ and a scalar output
$\hat{y}$ in layer $\ell=2$.  We apply a mean squared error loss
function to the output when training the network.  There is only one
node in the output layer.

The network is structured as follows:

**Layer $\ell=0$ (Input Layer):** $\mathbf{x} \in \mathbb{R}^{n}$ (one
  input vector of $n$ features).

**Layer $\ell=1$ (Hidden Layer):**

 * Weights: $W^{[1]} \in \mathbb{R}^{m \times n}$
 * Bias: $\mathbf{b}^{[1]} \in \mathbb{R}^{m}$
 * Activation: ReLU, $\text{ReLU}(z) = \max(0, z)$

**Layer $\ell=2$ (Output Layer):**

 * Weights: $W^{[2]} \in \mathbb{R}^{1 \times m}$
 * Bias: $b^{[2]} \in \mathbb{R}$

The forward pass equations are:

 * Hidden layer $\ell=1$ pre-activation: $\mathbf{z}^{[1]} = W^{[1]} \mathbf{x} + \mathbf{b}^{[1]} = W^{[1]} \mathbf{a}^{[0]} + \mathbf{b}^{[1]}$
 * Hidden layer $\ell=1$ activation: $\mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]})$
 * Output layer pre-activation: $z^{[2]} = W^{[2]} \mathbf{a}^{[1]} + b^{[2]}$
 * Prediction: $\hat{y} = z^{[2]}$

The loss function is the mean squared error:

$$L = \frac{1}{2} (\hat{y} - y)^2,$$

where $y$ is the true output.


**Problem 25** Draw this neural network, labelling each node and the
weight on each edge. Draw each node as a box containing a summation
symbol, pre-activation label, activation function symbol, and
activation label. Show the bias into each node.


**Problem 26** Draw matrices showing the layout of the coefficients, using the same
label names as in your drawing of the neural network.
For example, show how the weights $W_{11}, W_{12}, \dots, W_{mn}$
are laid out in the weight matrix. This is to demonstrate an
understanding of how the diagram in Problem 25 is represented as
matrices. Show the input vector, the weights and bias at each layer,
and the output vector.

**Problem 27** Is this network most appropriate for regression across
the positive real numbers, regression across the entire real line,
binary classification, or multi-class classification? Why?


**Problem 28**  Derive the gradient of the loss function $L$ with respect to the
second layer weights $W^{[2]}$ and bias $b^{[2]}$. 


**Problem 29** Derive the gradient of the loss function $L$ with respect to the first layer weights $W^{[1]}$ and bias $\mathbf{b}^{[1]}$.



**Problem 30** Write the weight update equations for both layers using gradient descent:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L,$$

where $\theta$ represents the weights and biases, and $\alpha$ is the
learning rate.
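
Analytic gradients derived by hand can be sanity-checked against central finite differences. The sketch below is one way to do that under stated assumptions: the parameters are flattened into a single list (an arbitrary layout chosen here), the names `loss` and `numerical_gradient` are hypothetical, and the toy values are not part of the assignment.

```python
def loss(theta, x, y, n, m):
    """L = 0.5 * (y_hat - y)^2 for a one-hidden-layer ReLU network.

    theta is a flat list: W1 (m rows of n), then b1 (m), then W2 (m), then b2.
    """
    W1 = [theta[i * n:(i + 1) * n] for i in range(m)]
    b1 = theta[m * n:m * n + m]
    W2 = theta[m * n + m:m * n + 2 * m]
    b2 = theta[m * n + 2 * m]
    # Hidden activations a1 = ReLU(W1 x + b1).
    a1 = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    y_hat = sum(w * a for w, a in zip(W2, a1)) + b2  # linear output
    return 0.5 * (y_hat - y) ** 2

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference estimate of the gradient of f at theta."""
    grad = []
    for k in range(len(theta)):
        up = theta[:k] + [theta[k] + eps] + theta[k + 1:]
        down = theta[:k] + [theta[k] - eps] + theta[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

# Hypothetical toy problem with n = 2 inputs and m = 2 hidden units.
n, m = 2, 2
theta = [0.5, -0.5, 1.0, 1.0,   # W1, row by row
         0.1, -0.2,             # b1
         1.0, -1.0,             # W2
         0.0]                   # b2
x, y = [1.0, 2.0], 1.0
grad = numerical_gradient(lambda t: loss(t, x, y, n, m), theta)
print(grad)
```

Comparing each entry of `grad` with the corresponding entry of a hand-derived gradient, and then applying $\theta \leftarrow \theta - \alpha \nabla_\theta L$, is a common way to catch sign or indexing mistakes. The estimate is unreliable when a hidden pre-activation sits very close to zero, where ReLU is not differentiable.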