# Homework 3
Esther Lyon Delsordo

## Problem 1: Overparameterization
Consider a multi-class classification problem (N = 3) with a single feature (called x). Assuming that the
logits are related to the feature via a linear model, we can write the conditional class probabilities as
$$
P(y = k|x) = \frac{\exp(W_{0,k} + W_{1,k}x)}{\sum_{j=1}^{N} \exp(W_{0,j} + W_{1,j}x)}
$$
This model has 2N free parameters, two for each class. Show that this model is overparameterized; specifically, show that this model can be equivalently represented with 2(N − 1) free parameters. HINT: With two
classes, this procedure exactly recovers logistic regression.

**Solution:** Consider the case with only $N = 3$ classes. We can write the conditional class probabilities for $y=1$ as:
$$
P(y = 1|x) = \frac{\exp(W_{0,1}+W_{1,1}x)}{\exp(W_{0,1}+W_{1,1}x)+\exp(W_{0,2}+W_{1,2}x)+\exp(W_{0,3}+W_{1,3}x)}
$$
Then if we multiply by a "funky one" (an identity), we get:
$$
\begin{aligned}
P(y = 1|x) &= \frac{\frac{1}{\exp(W_{0,1}+W_{1,1}x)}}{\frac{1}{\exp(W_{0,1}+W_{1,1}x)}} \times \frac{\exp(W_{0,1}+W_{1,1}x)}{\exp(W_{0,1}+W_{1,1}x)+\exp(W_{0,2}+W_{1,2}x)+\exp(W_{0,3}+W_{1,3}x)}
\\
&= \frac{1}{1+\frac{\exp(W_{0,2}+W_{1,2}x)}{\exp(W_{0,1}+W_{1,1}x)}+\frac{\exp(W_{0,3}+W_{1,3}x)}{\exp(W_{0,1}+W_{1,1}x)}}
\\
&= \frac{1}{1+\exp(W_{0,2}+W_{1,2}x-W_{0,1}-W_{1,1}x)+\exp(W_{0,3}+W_{1,3}x-W_{0,1}-W_{1,1}x)}
\end{aligned}
$$
Now, let's redefine our variables by combining like terms. Let:
$$
\begin{aligned}
b_1 &= W_{0,2}-W_{0,1}
\\
m_1 &= W_{1,2}-W_{1,1}
\\
b_2 &= W_{0,3}-W_{0,1}
\\
m_2 &= W_{1,3}-W_{1,1}
\end{aligned}
$$
We get:
$$
P(y = 1|x) = \frac{1}{1+\exp(b_1+m_1x)+\exp(b_2+m_2x)}
$$
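This version has only $4 = 2(3-1)$ free parameters, showing that the original model was overparameterized. The other two classes follow from the same manipulation: $P(y = 2|x)$ and $P(y = 3|x)$ have numerators $\exp(b_1+m_1x)$ and $\exp(b_2+m_2x)$ over the same denominator, so no further parameters are needed. Further, taking the $N=2$ class version recovers logistic regression: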
$$
\begin{aligned}
P(y = 1|x) &= \frac{1}{1+\exp(W_{0,2}+W_{1,2}x-W_{0,1}-W_{1,1}x)}
\\
&= \frac{1}{1 + \exp(b_1+m_1x)}
\end{aligned}
$$
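
As a quick numerical sanity check of the reduction above, the sketch below compares the full softmax with $2N$ parameters against the reduced form with $2(N-1)$ parameters for $N = 3$. The weight values and the function names `softmax_full` and `softmax_reduced` are made up for illustration and are not part of the assignment.

```python
import numpy as np

def softmax_full(x, W0, W1):
    """Class probabilities from the full model with 2N parameters."""
    logits = W0 + W1 * x
    e = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return e / e.sum()

def softmax_reduced(x, b, m):
    """Class probabilities from the reduced model with 2(N-1) parameters.

    Class 1 is the reference class, so its logit is fixed at zero.
    """
    logits = np.concatenate(([0.0], b + m * x))
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Arbitrary illustrative weights (assumed values, N = 3)
W0 = np.array([0.2, -1.0, 0.5])
W1 = np.array([1.5, 0.3, -0.7])

# Reduced parameters: differences relative to class 1
b = W0[1:] - W0[0]
m = W1[1:] - W1[0]

# The two parameterizations should agree for any x
for x in (-2.0, 0.0, 1.0, 3.5):
    assert np.allclose(softmax_full(x, W0, W1), softmax_reduced(x, b, m))
```

Fixing the logit of the reference class at zero is what removes the two redundant parameters: adding the same constant to every logit leaves the softmax unchanged.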

## Problem 2: Backpropagation
### A
Derive and evaluate an expression for
$$
\frac{\partial L}{\partial w_2^{(2)}}
$$
Create a Python script that computes this derivative for the feature $x = 1.0$, data $y = 0.5$, and parameter
values
$$
\begin{aligned}
w^{(1)}_1 &= 0.5
\\ w^{(2)}_1 &= 0.7
\\ w^{(2)}_2 &= -0.3
\\ w^{(3)}_1 &= 0.1
\\ w^{(3)}_2 &= -0.8
\end{aligned}
$$
As a check, the correct value is approximately 0.099.

**Solution:** Applying the chain rule along the path from $L$ to $w_2^{(2)}$, and using the property $\sigma'(x) = \sigma(x)(1-\sigma(x))$ to compute $\frac{\partial z_2^{(2)}}{\partial a_2^{(2)}}$, we get:
$$
\begin{aligned}
\frac{\partial L}{\partial w_2^{(2)}} &=
\frac{\partial L}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z_2^{(2)}}\frac{\partial z_2^{(2)}}{\partial a_2^{(2)}}\frac{\partial a_2^{(2)}}{\partial w_2^{(2)}}
\\
&= (z^{(3)}-y)(1)(w_2^{(3)})(\sigma(a_2^{(2)})(1-\sigma(a_2^{(2)})))(z^{(1)})
\end{aligned}
$$

```python
import numpy as np

# Given values
w1_1 = 0.5
w2_1 = 0.7
w2_2 = -0.3
w3_1 = 0.1
w3_2 = -0.8
x = 1.0
y = 0.5

def sigmoid(a):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-a))

# Forward pass: a = pre-activation, z = activation
a1 = x * w1_1
z1 = sigmoid(a1)
a2_1 = z1 * w2_1
a2_2 = z1 * w2_2
z2_1 = sigmoid(a2_1)
z2_2 = sigmoid(a2_2)
a3 = z2_1 * w3_1 + z2_2 * w3_2
z3 = a3  # linear output unit

# Chain rule: dL/dz3 * dz3/da3 * da3/dz2_2 * dz2_2/da2_2 * da2_2/dw2_2
dLdw2_2 = (z3 - y) * 1 * w3_2 * (z2_2 * (1 - z2_2)) * z1
print(dLdw2_2)
```

Output:

```
0.09898163577188604
```
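
The analytic value can be cross-checked with a central finite difference. The sketch below assumes the loss is $L = \frac{1}{2}(z^{(3)} - y)^2$, which is consistent with $\frac{\partial L}{\partial z^{(3)}} = z^{(3)} - y$ in the derivation but is not stated explicitly here; the function name `loss` and the step size `h` are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-a))

def loss(w2_2, x=1.0, y=0.5, w1_1=0.5, w2_1=0.7, w3_1=0.1, w3_2=-0.8):
    """Forward pass returning L = 0.5 * (z3 - y)**2 (assumed form of the loss)."""
    z1 = sigmoid(x * w1_1)
    z2_1 = sigmoid(z1 * w2_1)
    z2_2 = sigmoid(z1 * w2_2)
    z3 = z2_1 * w3_1 + z2_2 * w3_2  # linear output unit
    return 0.5 * (z3 - y) ** 2

# Central difference approximation of dL/dw2_2 at w2_2 = -0.3
h = 1e-6
w2_2 = -0.3
grad_fd = (loss(w2_2 + h) - loss(w2_2 - h)) / (2 * h)
print(grad_fd)
```

If the chain-rule expression is right, the printed value should agree with the analytic result of approximately 0.099 to several decimal places.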

## Problem 3: Robust Cross-Entropy
Consider a binary classification problem in which the target values are $y_{obs} \in \{0, 1\}$, with a network output
$y(x, w)$ that represents $P(y = 1|x)$ (i.e. logistic regression). Now suppose that there is a probability $\epsilon$ that the class label on a training data point has been incorrectly set. Assuming independent and identically
distributed data, write down the negative log likelihood. Verify that the cross-entropy error function (Murphy
10.27) is obtained when $\epsilon = 0$. Note that this error function makes the model robust to incorrectly labelled
data, in contrast to the usual error function.
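
One possible way to set this up, offered as a sketch rather than as the worked solution: if each label is flipped independently with probability $\epsilon$, the observed label equals 1 with probability $(1-\epsilon)\,y(x, w) + \epsilon\,(1 - y(x, w))$, and the negative log likelihood is the Bernoulli negative log likelihood under that probability. The function names `robust_nll` and `cross_entropy`, and the prediction and label values, are made up for illustration.

```python
import numpy as np

def robust_nll(y_pred, y_obs, eps):
    """Negative log likelihood when each observed label is flipped with probability eps.

    P(y_obs = 1 | x) = (1 - eps) * y_pred + eps * (1 - y_pred)
    """
    p_obs1 = (1.0 - eps) * y_pred + eps * (1.0 - y_pred)
    return -np.sum(y_obs * np.log(p_obs1) + (1.0 - y_obs) * np.log(1.0 - p_obs1))

def cross_entropy(y_pred, y_obs):
    """Standard binary cross-entropy, summed over data points."""
    return -np.sum(y_obs * np.log(y_pred) + (1.0 - y_obs) * np.log(1.0 - y_pred))

# Made-up network outputs and observed labels (illustrative only)
y_pred = np.array([0.9, 0.2, 0.6, 0.05])
y_obs = np.array([1.0, 0.0, 1.0, 1.0])

# With eps = 0 the two error functions should coincide
print(robust_nll(y_pred, y_obs, eps=0.0), cross_entropy(y_pred, y_obs))

# With eps > 0 the confident-but-wrong last point is penalized less
print(robust_nll(y_pred, y_obs, eps=0.1))
```

In this example the last data point (prediction 0.05, label 1) dominates the usual cross-entropy; with $\epsilon > 0$ its contribution is bounded, which is the sense in which the modified error function is robust to mislabelled data.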