__a) Show that the naive-softmax loss given in $J_{\text{naive-softmax}}(v_c, o, U) = - \log P(O=o|C=c)$ is the same as the cross-entropy loss between $y$ and $\hat{y}$; i.e., show that:__

$$- \sum \limits_{w \in \text {Vocab}} {y_w \log \hat {y}_w} = - \log \hat {y}_o$$

Answer:
Because $y$ is one-hot, $y_w = 1$ only at the position of the true outside word $o$ and $y_w = 0$ everywhere else. Every term of the sum with $w \ne o$ therefore vanishes, and only the $w = o$ term remains: $- \sum \limits_{w \in \text {Vocab}} {y_w \log \hat {y}_w} = - y_o \log \hat{y}_o = - 1 \cdot \log \hat{y}_o = - \log \hat{y}_o$
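
As a quick numerical illustration of this identity, the sketch below applies softmax to an arbitrary score vector and compares the full cross-entropy sum with the single term $-\log \hat{y}_o$. The score values, the vocabulary size of 5 and the index `o = 2` are made up for the example.

```python
import numpy as np

# Hypothetical scores u_w^T v_c for a 5-word vocabulary (values are arbitrary).
scores = np.array([0.2, -1.0, 0.7, 0.1, 0.4])
y_hat = np.exp(scores) / np.sum(np.exp(scores))  # softmax probabilities

o = 2                      # index of the true outside word (assumed)
y = np.zeros_like(y_hat)   # one-hot label vector
y[o] = 1.0

cross_entropy = -np.sum(y * np.log(y_hat))  # full sum over the vocabulary
naive_softmax = -np.log(y_hat[o])           # the single surviving term

print(cross_entropy, naive_softmax)
```

Both printed numbers should agree up to floating-point rounding.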

__b) Compute the partial derivative of $J_{\text {naive-softmax}}(v_c, o, U)$ with respect to $v_c$ . Please write your answer in terms of $y$, $\hat{y}$, and $U$.__

Answer: 

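$\frac {\partial J} {\partial v_c} = U(\hat{y} - y)$

Here the columns of $U$ are the outside vectors $u_w$, so $\hat{y} = \text{softmax}(U^T v_c)$ and $J = -u_o^T v_c + \log \sum \limits_{w \in \text {Vocab}} \exp(u_w^T v_c)$. Differentiating with respect to $v_c$ gives $-u_o + \sum \limits_{w \in \text {Vocab}} \hat{y}_w u_w = U(\hat{y} - y)$. Written as a row vector, this is $(\hat{y} - y)^T U^T$.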

Numerical check (the finite-difference residuals should be close to zero, and the loss should decrease at every gradient step):

import numpy as np

def J(y, U, v_c):
    # softmax probability for each word in the vocabulary; shape (V,), where V is the vocabulary size
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    # cross-entropy loss; scalar value
    loss = np.sum(-y * np.log(y_hat))
    return loss

def grad_v_c(y, U, v_c):
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    # U @ (y_hat - y): weighted sum of the columns of U; shape (d,), where d is the embedding size
    d = U @ (y_hat - y)
    return d

#some random matrix U (suppose d=4, V=5, so the shape is 4x5)
U = np.random.random((4,5))

#some random word vector V_c
v_c = np.random.random(4)

#some random vector y with correct labels of words from vocabulary (size v, one-hot encoded)
y = np.array([0,0,1,0,0])

#matrix with small shift for every element in word_vector v_c to perform finite differences check
h = np.diag([0.00001]*4)

for i in range(5):
    j = J(y,U,v_c)
    dj = grad_v_c(y,U,v_c)
    print("iteration",i)
    print("Loss:",j,"v_c",v_c)
    print("gradient",dj)
    # forward differences, one coordinate of v_c at a time
    fd = np.array([(J(y,U,v_c+h[k])-j)/h[k,k] for k in range(4)])
    print("gradient check:", dj - fd)
    v_c = v_c - dj

iteration 0
Loss: 1.5890904292435637 v_c [0.45885213 0.73172628 0.24022176 0.73334472]
gradient [-0.26301605  0.01924709 -0.24670471  0.2366701 ]
gradient check: [-2.33588093e-07 -3.83483198e-07 -2.60442183e-07 -1.53415490e-07]
iteration 1
Loss: 1.4083963494249763 v_c [0.72186817 0.71247919 0.48692647 0.49667461]
gradient [-0.24973468  0.0107495  -0.23143907  0.21901466]
gradient check: [-2.38928241e-07 -3.80015229e-07 -2.56052697e-07 -1.53460447e-07]
iteration 2
Loss: 1.2497108363431604 v_c [0.97160285 0.70172969 0.71836554 0.27765996]
gradient [-0.23580589  0.00366018 -0.21633927  0.20207384]
gradient check: [-2.42400880e-07 -3.72014821e-07 -2.50647304e-07 -1.51925110e-07]
iteration 3
Loss: 1.1113080418408678 v_c [1.20740874 0.69806951 0.9347048  0.07558611]
gradient [-0.22167337 -0.0020538  -0.20163608  0.18606347]
gradient check: [-2.43786463e-07 -3.60542837e-07 -2.44207683e-07 -1.49041492e-07]
iteration 4
Loss: 0.9912210325017193 v_c [ 1.42908211  0.70012332  1.13634089 -0.1104773
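
The residuals of about 1e-7 printed above are of the size one would expect from a one-sided difference with step 1e-5, whose error shrinks only linearly with the step. A central difference usually gives a tighter check. The following is a minimal sketch under that assumption, not part of the original check; the helper names `softmax`, `loss`, `grad_v_c_analytic` and `central_diff` and the fixed seed are illustrative choices.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / np.sum(e)

def loss(y, U, v_c):
    # naive-softmax loss J = -sum_w y_w log y_hat_w
    return -np.sum(y * np.log(softmax(U.T @ v_c)))

def grad_v_c_analytic(y, U, v_c):
    # dJ/dv_c = U (y_hat - y), shape (d,)
    return U @ (softmax(U.T @ v_c) - y)

def central_diff(f, x, h=1e-5):
    # (f(x + h e_i) - f(x - h e_i)) / (2h) for every coordinate i
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
U = rng.random((4, 5))           # d = 4, V = 5, as in the check above
v_c = rng.random(4)
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])

analytic = grad_v_c_analytic(y, U, v_c)
numeric = central_diff(lambda v: loss(y, U, v), v_c)
max_err = np.max(np.abs(analytic - numeric))
print("max abs difference:", max_err)
```

With the same step size, the printed difference should be noticeably smaller than the one-sided residuals, because the central difference cancels the first-order error term.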

__c) Compute the partial derivatives of $J_{\text{naive-softmax}} (v_c, o, U)$ with respect to each of the ‘outside’ word vectors, $u_w$’s. There will be two cases: when $w = o$, the true ‘outside’ word vector, and $w \ne o$, for all other words. Please write your answer in terms of $y$, $\hat{y}$, and $v_c$.__

Answer:

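$$\frac {\partial J} {\partial U} = v_c (\hat{y} - y)^T$$

Column $w$ of this matrix is $\frac {\partial J} {\partial u_w} = (\hat{y}_w - y_w) v_c$, which gives the two cases: $\frac {\partial J} {\partial u_o} = (\hat{y}_o - 1) v_c$ for $w = o$, and $\frac {\partial J} {\partial u_w} = \hat{y}_w v_c$ for $w \ne o$.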
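
To check the two cases numerically, the sketch below builds the gradient once as a single outer product and once column by column, with $(\hat{y}_o - 1) v_c$ for the true outside word and $\hat{y}_w v_c$ for every other word, and confirms that the two agree. The sizes, the seed and the index `o = 2` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed, arbitrary
d, V, o = 4, 5, 2                # embedding size, vocabulary size, true word index (assumed)
U = rng.random((d, V))           # columns are the outside vectors u_w
v_c = rng.random(d)
y = np.zeros(V)
y[o] = 1.0

scores = U.T @ v_c
y_hat = np.exp(scores) / np.sum(np.exp(scores))

# Matrix form: one outer product, shape (d, V)
grad_matrix = np.outer(v_c, y_hat - y)

# Column-by-column form, following the two cases
grad_cols = np.zeros((d, V))
for w in range(V):
    if w == o:
        grad_cols[:, w] = (y_hat[o] - 1.0) * v_c   # true outside word
    else:
        grad_cols[:, w] = y_hat[w] * v_c           # every other word

print(np.allclose(grad_matrix, grad_cols))
```

The outer-product form is convenient in code because it produces all $V$ columns at once and has the same shape as $U$.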

Numerical check (the finite-difference residuals should be close to zero, and the loss should decrease at every gradient step):

def grad_U(y, U, v_c):
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    # outer product of v_c and (y_hat - y); shape (d, V), the same shape as U
    d = np.outer(v_c, y_hat - y)
    return d

#some random matrix U (suppose d=4, V=5, so the shape is 4x5)
U = np.random.random((4,5))

#some random word vector V_c
v_c = np.random.random(4)

#some random vector y with correct labels of words from vocabulary (size v, one-hot encoded)
y = np.array([0,0,1,0,0])

#one small shift per element of U (20 perturbation matrices of shape 4x5) to perform the finite-difference check
h = np.zeros((20,4,5))
for i in range(20):
    row = i // 5
    col = i % 5
    h[i,row,col] = 0.00001

for i in range(5):
    j = J(y,U,v_c)
    dj = grad_U(y,U,v_c)
    print("iteration",i)
    print("Loss:",j,"v_c",v_c)
    print("gradient",dj)
    diff = np.zeros_like(U)
    for k in range(20):  # separate index so the outer loop variable i is not overwritten
        row = k // 5
        col = k % 5
        diff[row,col] = (J(y,U+h[k],v_c)-j)/np.sum(h[k])
    print("gradient check:", dj - diff)
    U = U - dj

iteration 0
Loss: 2.0075040558004953 v_c [0.68634028 0.99313711 0.07836505 0.11857169]
gradient [[ 0.12666534  0.11158217 -0.59414863  0.17204079  0.18386034]
 [ 0.18328524  0.16145984 -0.85973543  0.24894371  0.26604664]
 [ 0.01446241  0.01274024 -0.06783878  0.0196433   0.02099283]
 [ 0.02188262  0.01927686 -0.10264472  0.02972165  0.03176359]]
gradient check: [[-3.54426763e-07 -3.20656466e-07 -2.73864997e-07 -4.42397011e-07
  -4.61927854e-07]
 [-7.42177257e-07 -6.71423551e-07 -5.73425745e-07 -9.26322839e-07
  -9.67190941e-07]
 [-4.59959692e-09 -4.15287579e-09 -3.55509641e-09 -5.77902785e-09
  -6.02744712e-09]
 [-1.05645871e-08 -9.53816363e-09 -8.16228823e-09 -1.31796270e-08
  -1.37556228e-08]]
iteration 1
Loss: 0.827735750645404 v_c [0.68634028 0.99313711 0.07836505 0.11857169]
gradient [[ 0.08731442  0.07945574 -0.38638368  0.10755608  0.11205744]
 [ 0.12634431  0.11497277 -0.55909872  0.15563408  0.16214756]
 [ 0.0099694   0.00907211 -0.04411657  0.01228055  0.01279451]
 [ 0.01508