__a) Show that the naive-softmax loss given by $J_{\text{naive-softmax}}(v_c, o, U) = - \log P(O=o|C=c)$ is the same as the cross-entropy loss between $y$ and $\hat{y}$; i.e., show that:__

$$- \sum \limits_{w \in \text {Vocab}} {y_w \log \hat {y}_w} = - \log \hat {y}_o$$

Answer:
Since $y$ is a one-hot vector, $y_w$ equals 1 only at the position of the true outside word $o$ and is 0 everywhere else. Every term of the sum except the one for $w = o$ therefore vanishes, and the left-hand side reduces to a single term: $- \sum \limits_{w \in \text {Vocab}} {y_w \log \hat {y}_w} = - \sum \limits_{w = o} {y_w \log \hat {y}_w} = - y_o \log \hat{y}_o = - 1 \cdot \log \hat{y}_o = - \log \hat{y}_o$
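
The identity can also be confirmed numerically. The following is a minimal sketch, assuming a hypothetical 5-word vocabulary with arbitrary scores; the names `scores`, `o`, `cross_entropy` and `naive_softmax_loss` are illustrative and not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=5)             # arbitrary scores for a 5-word vocabulary (assumption)
y_hat = np.exp(scores - scores.max())   # shift by the max for numerical stability
y_hat /= y_hat.sum()                    # softmax probabilities

o = 2                                   # index of the true outside word (assumption)
y = np.zeros(5)
y[o] = 1.0                              # one-hot label vector

cross_entropy = -np.sum(y * np.log(y_hat))   # full sum over the vocabulary
naive_softmax_loss = -np.log(y_hat[o])       # single surviving term
print(cross_entropy, naive_softmax_loss)
```

Both printed values should agree up to floating-point rounding, whichever index is chosen for `o`.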

__b) Compute the partial derivative of $J_{\text {naive-softmax}}(v_c, o, U)$ with respect to $v_c$ . Please write your answer in terms of $y$, $\hat{y}$, and $U$.__

Answer: 

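$$\frac {\partial J} {\partial v_c} = U (\hat{y} - y)$$

This follows from writing the loss as $J = - u_o^T v_c + \log \sum \limits_{w \in \text {Vocab}} \exp(u_w^T v_c)$, where the columns of $U$ are the outside vectors $u_w$. Differentiating gives $\frac {\partial J} {\partial v_c} = - u_o + \sum \limits_{w \in \text {Vocab}} \hat{y}_w u_w = - U y + U \hat{y}$.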

Numerical check (loss should be decreasing):

In [1]:
import numpy as np

def J(y, U, v_c):
    #softmax value for each word in vocabulary. Array shape [1,V], where V - size of vocabulary
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    #loss function. Scalar value.
    J = np.sum(- y * np.log(y_hat))
    return J

def grad_v_c(y, U, v_c):
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    #sum over rows. so we get shape [1,d], where d is the size of our vector representation
    d = np.sum((y_hat-y)[:,np.newaxis]*U.T, axis=0)
    return d

#some random matrix U (suppose d=4, v=5, so shape is 4x5)
U = np.array(np.random.random((4,5)))

#some random word vector V_c
v_c = np.random.random(4)

#some random vector y with correct labels of words from vocabulary (size v, one-hot encoded)
y = np.array([0,0,1,0,0])

#matrix with small shift for every element in word_vector v_c to perform finite differences check
h = np.diag([0.00001]*4)

for i in range(5):
    j = J(y,U,v_c)
    dj = grad_v_c(y,U,v_c)
    print("iteration",i)
    print("Loss:",j,"v_c",v_c)
    print("gradient",dj)
    print("gradient check:", dj - [(J(y,U,v_c+h[0])-j)/np.sum(h[0]),
                                   (J(y,U,v_c+h[1])-j)/np.sum(h[1]),
                                   (J(y,U,v_c+h[2])-j)/np.sum(h[2]),
                                   (J(y,U,v_c+h[3])-j)/np.sum(h[3])])
    v_c = v_c - dj

iteration 0
Loss: 1.6286669460358003 v_c [0.11044461 0.49637592 0.59164704 0.52777441]
gradient [ 0.12555521  0.06494604 -0.05090479  0.0246106 ]
gradient check: [-1.79127658e-07 -4.06441447e-07 -7.98708971e-08 -3.36656124e-07]
iteration 1
Loss: 1.6056614757829795 v_c [-0.0151106   0.43142988  0.64255183  0.50316381]
gradient [ 0.12319114  0.06573569 -0.05017703  0.02198738]
gradient check: [-1.75666377e-07 -4.01828840e-07 -8.04450533e-08 -3.40770149e-07]
iteration 2
Loss: 1.5833234184915304 v_c [-0.13830174  0.36569419  0.69272886  0.48117643]
gradient [ 0.12099035  0.06627528 -0.04953588  0.01958544]
gradient check: [-1.72309719e-07 -3.97353032e-07 -8.10396256e-08 -3.44556652e-07]
iteration 3
Loss: 1.561604196838108 v_c [-0.25929209  0.2994189   0.74226474  0.46159098]
gradient [ 0.11893103  0.06660231 -0.04897301  0.01738821]
gradient check: [-1.69131233e-07 -3.92990247e-07 -8.15579603e-08 -3.47884898e-07]
iteration 4
Loss: 1.5404628684279995 v_c [-0.37822312  0.23281659  0.79123775
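
The forward differences used above leave a residual of roughly 1e-7, which scales with the step size. A central difference usually tightens the check considerably. The sketch below is one way to do it, assuming the same shapes as in the cell above (`U` is 4x5, `v_c` has length 4); the helper names `softmax`, `loss`, `grad_v_c_analytic` and `grad_central` are hypothetical and chosen so they do not clash with the notebook's `J` and `grad_v_c`.

```python
import numpy as np

def softmax(z):
    # subtract the max before exponentiating to avoid overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(y, U, v_c):
    # naive-softmax (cross-entropy) loss, a scalar
    return -np.sum(y * np.log(softmax(U.T @ v_c)))

def grad_v_c_analytic(y, U, v_c):
    # analytic gradient U (y_hat - y), shape (d,)
    return U @ (softmax(U.T @ v_c) - y)

def grad_central(f, x, eps=1e-5):
    # central difference: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps) for each i
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
U = rng.random((4, 5))
v_c = rng.random(4)
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])

analytic = grad_v_c_analytic(y, U, v_c)
numeric = grad_central(lambda v: loss(y, U, v), v_c)
max_err = np.max(np.abs(analytic - numeric))
print("max abs difference:", max_err)
```

The truncation error of the central difference is quadratic in `eps` rather than linear, so the reported difference should be several orders of magnitude smaller than with the one-sided version.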

__c) Compute the partial derivatives of $J_{\text{naive-softmax}} (v_c, o, U)$ with respect to each of the ‘outside’
word vectors, $u_w$ ’s. There will be two cases: when $w = o$, the true ‘outside’ word vector, and $w \ne o$, for
all other words. Please write your answer in terms of $y$, $\hat{y}$, and $v_c$.__

Answer:

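$$\frac {\partial J} {\partial U} = v_c (\hat{y} - y)^T$$

Column by column, this covers the two cases: $\frac {\partial J} {\partial u_o} = (\hat{y}_o - 1) v_c$ for the true outside word $w = o$, and $\frac {\partial J} {\partial u_w} = \hat{y}_w v_c$ for every $w \ne o$.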

Numerical check (loss should be decreasing):

In [2]:
def grad_U(y, U, v_c):
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    #outer product of v_c and (y_hat - y). so we get shape [d,V], the same shape as U
    d = v_c.T[:,np.newaxis] @ (y_hat-y).T[np.newaxis,:]
    return d

#some random matrix U (suppose d=4, v=5, so shape is 4x5)
U = np.array(np.random.random((4,5)))

#some random word vector V_c
v_c = np.random.random(4)

#some random vector y with correct labels of words from vocabulary (size v, one-hot encoded)
y = np.array([0,0,1,0,0])

#one perturbation matrix per element of U (20 matrices of shape 4x5) to perform finite differences check
h = np.zeros((20,4,5))
for i in range(20):
    row = i // 5
    col = i % 5
    h[i,row,col] = 0.00001

for i in range(5):
    j = J(y,U,v_c)
    dj = grad_U(y,U,v_c)
    print("iteration",i)
    print("Loss:",j,"v_c",v_c)
    print("gradient",dj)
    diff = np.zeros_like(U)
    #use k here so the inner loop does not shadow the outer iteration counter i
    for k in range(20):
        row = k // 5
        col = k % 5
        diff[row,col] = (J(y,U+h[k],v_c)-j)/np.sum(h[k])
    print("gradient check:", dj - diff)
    U = U - dj

iteration 0
Loss: 1.6425340884085566 v_c [0.32366551 0.78602271 0.95299673 0.32574079]
gradient [[ 0.07071273  0.06883717 -0.26103976  0.08263298  0.03885688]
 [ 0.17172609  0.16717129 -0.63393588  0.20067446  0.09436405]
 [ 0.20820568  0.20268332 -0.76860225  0.24330353  0.11440971]
 [ 0.07116612  0.06927854 -0.2627135   0.08316281  0.03910602]]
gradient check: [[-8.94299828e-08 -8.77025546e-08 -8.17395531e-08 -9.95782286e-08
  -5.53298021e-08]
 [-5.27448146e-07 -5.17278156e-07 -4.82059162e-07 -5.87294578e-07
  -3.26345596e-07]
 [-7.75350542e-07 -7.60388967e-07 -7.08630746e-07 -8.63342035e-07
  -4.79706960e-07]
 [-9.05928829e-08 -8.88221932e-08 -8.27795587e-08 -1.00861391e-07
  -5.60309616e-08]]
iteration 1
Loss: 0.5366733735791156 v_c [0.32366551 0.78602271 0.95299673 0.32574079]
gradient [[ 0.03602409  0.03542334 -0.1344215   0.03948824  0.02348582]
 [ 0.08748462  0.08602571 -0.32644304  0.09589733  0.05703539]
 [ 0.10606889  0.10430006 -0.39578901  0.1162687   0.06915136]
 [ 0.0362
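
The two cases asked for in the question correspond to the columns of the matrix gradient that `grad_U` computes. The following minimal sketch checks that reading, assuming the columns of `U` are the outside vectors $u_w$; the names `grad_full` and `cases_match` are illustrative only.

```python
import numpy as np

def softmax(z):
    # subtract the max before exponentiating to avoid overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(1)
U = rng.random((4, 5))        # columns are the outside vectors u_w (assumption)
v_c = rng.random(4)
o = 2                         # index of the true outside word (assumption)
y = np.zeros(5)
y[o] = 1.0

y_hat = softmax(U.T @ v_c)
grad_full = np.outer(v_c, y_hat - y)   # shape (4, 5), same as U

# column o should be (y_hat_o - 1) * v_c, every other column y_hat_w * v_c
cases_match = all(
    np.allclose(grad_full[:, w], ((y_hat[w] - 1.0) if w == o else y_hat[w]) * v_c)
    for w in range(5)
)
print(cases_match)
```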

__d) The sigmoid function is given by equation:
$$ \sigma(x) = \frac 1 {1 + e^{-x}} = \frac {e^x} {e^x + 1} $$
Please compute the derivative of $\sigma(x)$ with respect to $x$, where $x$ is a vector.__

Answer:

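$$\sigma^\prime(x) = \frac {(e^x)^\prime (e^x + 1) - e^x (e^x + 1)^\prime} {(e^x + 1)^2} = \frac {e^x (e^x + 1) - e^x e^x} {(e^x + 1)^2} = \frac {e^x} {(e^x + 1)^2} = \frac {e^x} {e^x + 1} \frac 1 {e^x + 1} = \sigma(x) \left(1 - \frac {e^x} {e^x + 1}\right) = \sigma(x) (1 - \sigma(x))$$

For a vector $x$ the sigmoid is applied elementwise, so the formula holds for each component separately and the Jacobian is diagonal, with entries $\sigma(x_i) (1 - \sigma(x_i))$.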

Numerical check:

In [3]:
def sigmoid(x):
    #naive sigmoid implementation
    return 1/(1 + np.exp(-x))

def grad_sigmoid(x):
    return sigmoid(x)*(1 - sigmoid(x))

#vector with the same small shift for every element of X to perform finite differences check
h = np.array([0.0001]*4)

for i in range(5):
    #some random vector X (suppose d=4)
    X = np.random.random(4)
    s = sigmoid(X)
    ds = grad_sigmoid(X)
    print("X",X)
    print("Sigmoid value:",s)
    print("Gradient for x",ds)
    print("gradient check:", ds - (sigmoid(X+h)-s)/h[0])

X [0.6838126  0.52530528 0.87615279 0.56177687]
Sigmoid value: [0.6645891  0.62838748 0.70602435 0.63686357]
Gradient for x [0.22291043 0.23351666 0.20755397 0.23126836]
gradient check: [3.66898801e-06 2.99821760e-06 4.27620167e-06 3.16537091e-06]
X [0.79460562 0.37616876 0.76823168 0.07692841]
Sigmoid value: [0.68881939 0.59294872 0.68313825 0.51922262]
Gradient for x [0.21434724 0.24136053 0.21646038 0.24963049]
gradient check: [4.04739409e-06 2.24359611e-06 3.96432437e-06 4.80063570e-07]
X [0.43845801 0.30115655 0.61410629 0.55144119]
Sigmoid value: [0.60789154 0.57472522 0.64887693 0.63446989]
Gradient for x [0.23835941 0.24441614 0.22783566 0.23191785]
gradient check: [2.57186682e-06 1.82659335e-06 3.39208660e-06 3.11874846e-06]
X [0.32132219 0.99601131 0.13700658 0.51930419]
Sigmoid value: [0.57964645 0.73027363 0.53419817 0.62698505]
Gradient for x [0.24365644 0.19697405 0.24883049 0.2338748 ]
gradient check: [1.94082456e-06 4.53585153e-06 8.51158838e-07 2.97001772e-06]
X [0.998
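
The `sigmoid` above is marked as naive: for large negative inputs `np.exp(-x)` overflows and NumPy emits a warning. A common workaround, sketched here under the assumption that the input is a NumPy array, is to pick whichever of the two equivalent forms from the definition keeps the exponent non-positive. The names `stable_sigmoid` and `grad_stable_sigmoid` are hypothetical.

```python
import numpy as np

def stable_sigmoid(x):
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # for x >= 0 use 1 / (1 + e^{-x}); the exponent is non-positive
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # for x < 0 use e^{x} / (1 + e^{x}); again the exponent is non-positive
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

def grad_stable_sigmoid(x):
    # derivative sigma(x) * (1 - sigma(x)), applied elementwise
    s = stable_sigmoid(x)
    return s * (1.0 - s)

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(stable_sigmoid(x))
print(grad_stable_sigmoid(x))
```

For moderate inputs such as the random vectors in the check above, both implementations should return the same values.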

__e) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1$, $w_2$, ..., $w_K$ and their outside vectors as $u_1$, ..., $u_K$. Note that $o \notin \{w_1, ..., w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:__

$$J_{\text{neg-sample}}(v_c, o, U) = - \log(\sigma(u_o^T v_c)) - \sum\limits_{k=1}^{K} {\log(\sigma(-u_k^T v_c))}$$

__for a sample $w_1$, ..., $w_K$, where $\sigma(x)$ is the sigmoid function. Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$ and $u_k$, where $k \in [1,K]$.__

__After you’ve done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (d) to help compute the necessary gradients here.__

Answer:
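
All of the gradients in this answer follow from one scalar identity combined with the chain rule. As a short worked derivation using the result of part (d), with $z$ as an auxiliary symbol introduced here for the argument of the sigmoid:

$$\frac {d} {dz} \left[ - \log \sigma(z) \right] = - \frac {\sigma^\prime(z)} {\sigma(z)} = - \frac {\sigma(z) (1 - \sigma(z))} {\sigma(z)} = - (1 - \sigma(z))$$

For the first term of the loss $z = u_o^T v_c$, so $\frac {\partial z} {\partial u_o} = v_c$ and $\frac {\partial z} {\partial v_c} = u_o$. For each negative sample $z = - u_k^T v_c$, so $\frac {\partial z} {\partial u_k} = - v_c$ and $\frac {\partial z} {\partial v_c} = - u_k$, which is why the negative-sample terms enter with a positive sign.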

$$\frac {\partial{J}} {\partial{u_o}} = - (1 - \sigma(u_o^T v_c))v_c$$

$$\frac {\partial{J}} {\partial{u_k}} = (1 - \sigma(-u_k^T v_c))v_c$$

or in matrix form, with $U$ here denoting the matrix whose columns are the negative-sample vectors $u_1$, ..., $u_K$:

$$\frac {\partial{J}} {\partial{U}} = v_c (1 - \sigma(-U^T v_c))^T$$

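$$\frac {\partial{J}} {\partial{v_c}} = - (1 - \sigma(u_o^T v_c)) u_o + \sum\limits_{k=1}^{K} {(1 - \sigma(-u_k^T v_c))u_k}$$

or in matrix form, with $U$ holding the negative-sample vectors $u_1$, ..., $u_K$ as columns:

$$\frac {\partial{J}} {\partial{v_c}} = - (1 - \sigma(u_o^T v_c)) u_o + U (1 - \sigma(-U^T v_c))$$

This loss is much cheaper to compute than the naive-softmax loss because it only touches the $K + 1$ vectors $u_o$, $u_1$, ..., $u_K$, whereas the softmax normalization sums over the entire vocabulary.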

Numerical check:

In [4]:
def J(u_o, u, v_c):
    J = -np.log(sigmoid(u_o.T @ v_c)) - np.sum(np.log(sigmoid(-u.T @ v_c)), axis = 0)
    return J.ravel()

In [5]:
def grad_U_o(u_o, v_c):
    return -(1 - sigmoid(u_o.T @ v_c))*v_c

u_o = np.random.random((4,1))
u = np.random.random((4,3))
v_c = np.random.random((4,1))
h = np.diag([0.0001]*4)

for i in range(5):
    j = J(u_o,u,v_c)
    dj = grad_U_o(u_o,v_c).reshape((4,1))
    print("iteration",i)
    print("Loss:",j)
    print("gradient",dj)
    print("gradient check:", dj - np.array([(J(u_o+h[0].reshape((-1,1)),u,v_c)-j)/np.sum(h[0]),
                                            (J(u_o+h[1].reshape((-1,1)),u,v_c)-j)/np.sum(h[1]),
                                            (J(u_o+h[2].reshape((-1,1)),u,v_c)-j)/np.sum(h[2]),
                                            (J(u_o+h[3].reshape((-1,1)),u,v_c)-j)/np.sum(h[3])]))
    u_o = u_o - dj

iteration 0
Loss: [4.25447771]
gradient [[-0.13942545]
 [-0.05123947]
 [-0.02688636]
 [-0.15435727]]
gradient check: [[-5.06282006e-06]
 [-6.83794527e-07]
 [-1.88272380e-07]
 [-6.20528501e-06]]
iteration 1
Loss: [4.2131611]
gradient [[-0.10879054]
 [-0.039981  ]
 [-0.02097883]
 [-0.12044151]]
gradient check: [[-4.11703830e-06]
 [-5.56052748e-07]
 [-1.53097848e-07]
 [-5.04607738e-06]]
iteration 2
Loss: [4.18742906]
gradient [[-0.08906147]
 [-0.03273048]
 [-0.01717433]
 [-0.09859954]]
gradient check: [[-3.45826633e-06]
 [-4.67082714e-07]
 [-1.28601735e-07]
 [-4.23864607e-06]]
iteration 3
Loss: [4.16991111]
gradient [[-0.07533683]
 [-0.02768662]
 [-0.01452772]
 [-0.08340505]]
gradient check: [[-2.97702822e-06]
 [-4.02076851e-07]
 [-1.10709814e-07]
 [-3.64881489e-06]]
iteration 4
Loss: [4.15723224]
gradient [[-0.06525236]
 [-0.02398053]
 [-0.01258306]
 [-0.07224058]]
gradient check: [[-2.61142928e-06]
 [-3.52704106e-07]
 [-9.71098873e-08]
 [-3.20071715e-06]]


In [6]:
def grad_U(u, v_c):
    return v_c @ (1 - sigmoid(-u.T @ v_c)).T

u_o = np.random.random((4,1))
u = np.random.random((4,3))
v_c = np.random.random((4,1))
h = np.zeros((12,4,3))
for i in range(12):
    row = i // 3
    col = i % 3
    h[i,row,col] = 0.0001

for i in range(5):
    j = J(u_o,u,v_c)
    dj = grad_U(u,v_c)
    print("iteration",i)
    print("Loss:",j)
    print("gradient",dj)
    diff = np.zeros_like(u)
    #use k here so the inner loop does not shadow the outer iteration counter i
    for k in range(12):
        row = k // 3
        col = k % 3
        diff[row,col] = (J(u_o,u+h[k],v_c)-j)/np.sum(h[k])
    print("gradient check:", dj - diff)
    u = u - dj

iteration 0
Loss: [4.69323642]
gradient [[0.68944767 0.58271003 0.5562555 ]
 [0.69300844 0.58571953 0.55912837]
 [0.11040547 0.09331292 0.08907659]
 [0.06534438 0.05522802 0.05272071]]
gradient check: [[-4.61534981e-06 -7.01066395e-06 -7.42816962e-06]
 [-4.66315209e-06 -7.08327253e-06 -7.50509682e-06]
 [-1.18358924e-07 -1.79775024e-07 -1.90485117e-07]
 [-4.14665845e-08 -6.29782596e-08 -6.67234966e-08]]
iteration 1
Loss: [2.72423573]
gradient [[0.50823638 0.3917862  0.37004832]
 [0.51086125 0.39380964 0.3719595 ]
 [0.081387   0.06273912 0.0592581 ]
 [0.04816956 0.03713266 0.03507239]]
gradient check: [[-8.00720026e-06 -8.45377267e-06 -8.38693249e-06]
 [-8.09012194e-06 -8.54131747e-06 -8.47379099e-06]
 [-2.05337584e-07 -2.16786144e-07 -2.15070723e-07]
 [-7.19267686e-08 -7.59393574e-08 -7.53377815e-08]]
iteration 2
Loss: [1.79095642]
gradient [[0.33484305 0.26301125 0.25070819]
 [0.3365724  0.26436961 0.25200302]
 [0.05362046 0.0421176  0.04014743]
 [0.03173571 0.02492765 0.02376159]]
gra

In [7]:
def grad_v_c(u_o, u, v_c):
    return -(1 - sigmoid(u_o.T @ v_c))*u_o + u @ (1 - sigmoid(-u.T @ v_c))

u_o = np.random.random((4,1))
u = np.random.random((4,3))
v_c = np.random.random((4,1))
h = np.diag([0.0001]*4)

for i in range(5):
    j = J(u_o,u,v_c)
    dj = grad_v_c(u_o, u, v_c)
    print("iteration",i)
    print("Loss:",j)
    print("gradient",dj)
    print("gradient check:", dj - np.array([(J(u_o,u,v_c+h[0].reshape((-1,1)))-j)/np.sum(h[0]),
                                            (J(u_o,u,v_c+h[1].reshape((-1,1)))-j)/np.sum(h[1]),
                                            (J(u_o,u,v_c+h[2].reshape((-1,1)))-j)/np.sum(h[2]),
                                            (J(u_o,u,v_c+h[3].reshape((-1,1)))-j)/np.sum(h[3])]))
    v_c = v_c - dj

iteration 0
Loss: [4.74420864]
gradient [[1.44004953]
 [0.83595838]
 [1.0429235 ]
 [0.73707864]]
gradient check: [[-1.28144872e-05]
 [-6.10326807e-06]
 [-9.30622945e-06]
 [-5.57731323e-06]]
iteration 1
Loss: [1.98857823]
gradient [[0.45151339]
 [0.13523654]
 [0.1867848 ]
 [0.12469523]]
gradient check: [[-1.44567119e-05]
 [-6.63057893e-06]
 [-1.03890310e-05]
 [-5.86220130e-06]]
iteration 2
Loss: [1.78802359]
gradient [[0.27016981]
 [0.01593348]
 [0.04689248]
 [0.01601042]]
gradient check: [[-1.11459844e-05]
 [-5.08473339e-06]
 [-7.55511551e-06]
 [-4.76154026e-06]]
iteration 3
Loss: [1.72307934]
gradient [[ 0.20443395]
 [-0.02302179]
 [ 0.00313705]
 [-0.02083734]]
gradient check: [[-9.70767802e-06]
 [-4.50838810e-06]
 [-6.51477863e-06]
 [-4.33537461e-06]]
iteration 4
Loss: [1.68348328]
gradient [[ 0.17047353]
 [-0.04045148]
 [-0.01571637]
 [-0.03772361]]
gradient check: [[-8.89888078e-06]
 [-4.24235167e-06]
 [-6.02978694e-06]
 [-4.13987573e-06]]
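
The three checks above can also be folded into a single function that returns the loss together with all of its gradients, verified in one pass with central differences. This is only a sketch under stated assumptions: vectors are 1-D arrays rather than the column vectors used in the cells above, the negative samples are the columns of a `(d, K)` matrix, and the names `neg_sampling_loss_and_grads`, `central_diff` and `U_neg` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss_and_grads(u_o, U_neg, v_c):
    """u_o: (d,), U_neg: (d, K) with negative samples as columns, v_c: (d,)."""
    s_o = sigmoid(u_o @ v_c)             # scalar
    s_neg = sigmoid(-U_neg.T @ v_c)      # shape (K,)
    loss = -np.log(s_o) - np.sum(np.log(s_neg))
    grad_u_o = -(1.0 - s_o) * v_c
    grad_U_neg = np.outer(v_c, 1.0 - s_neg)                 # shape (d, K)
    grad_v_c = -(1.0 - s_o) * u_o + U_neg @ (1.0 - s_neg)
    return loss, grad_u_o, grad_U_neg, grad_v_c

def central_diff(f, x, eps=1e-5):
    # central difference of a scalar function f over every entry of x
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        g[idx] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
u_o, U_neg, v_c = rng.random(4), rng.random((4, 3)), rng.random(4)
loss, g_uo, g_U, g_vc = neg_sampling_loss_and_grads(u_o, U_neg, v_c)

errors = {
    "u_o": np.max(np.abs(g_uo - central_diff(
        lambda x: neg_sampling_loss_and_grads(x, U_neg, v_c)[0], u_o))),
    "U_neg": np.max(np.abs(g_U - central_diff(
        lambda x: neg_sampling_loss_and_grads(u_o, x, v_c)[0], U_neg))),
    "v_c": np.max(np.abs(g_vc - central_diff(
        lambda x: neg_sampling_loss_and_grads(u_o, U_neg, x)[0], v_c))),
}
print(errors)
```

Computing `s_o` and `s_neg` once and reusing them for the loss and all three gradients avoids re-evaluating the sigmoid, which is the main practical reason to bundle the computations.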


__f) Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, ..., w_{t-1}, w_t, w_{t+1}, ..., w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:__

$$J_{\text{skip-gram}}(v_c, w_{t-m}, ..., w_{t+m}, U) = \sum\limits_{-m \le j \le m, j \ne 0} {J(v_c, w_{t+j}, U)}$$

__Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$. $J(v_c, w_{t+j}, U)$ could be $J_{\text{naive-softmax}}(v_c, w_{t+j}, U)$ or $J_{\text{neg-sample}}(v_c, w_{t+j}, U)$, depending on your implementation.__

__Write down three partial derivatives:__
- $\frac {\partial J_{\text{skip-gram}}} {\partial U}$
- $\frac {\partial J_{\text{skip-gram}}} {\partial v_c}$
- $\frac {\partial J_{\text{skip-gram}}} {\partial v_w} \text {, where } w \ne c$

__Write your answers in terms of $\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial U}$ and $\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial v_c}$__

Answer:

$$\frac {\partial J_{\text{skip-gram}}} {\partial U} = \sum\limits_{-m \le j \le m, j \ne 0} {\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial U}}$$

$$\frac {\partial J_{\text{skip-gram}}} {\partial v_c} = \sum\limits_{-m \le j \le m, j \ne 0} {\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial v_c}}$$

$$\frac {\partial J_{\text{skip-gram}}} {\partial v_w} = 0 \text {, for all } w \ne c $$
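
In code, these three results amount to accumulating the per-pair gradients over the window. The following is a minimal sketch, assuming the naive-softmax loss for each pair and center vectors stored as the columns of a `(d, V)` matrix; the names `naive_softmax_loss_and_grads`, `skipgram_loss_and_grads` and `V_center` are hypothetical and not taken from the assignment code.

```python
import numpy as np

def softmax(z):
    # subtract the max before exponentiating to avoid overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

def naive_softmax_loss_and_grads(v_c, o, U):
    """Loss and gradients for one (center, outside) pair; U has shape (d, V)."""
    y_hat = softmax(U.T @ v_c)
    y = np.zeros_like(y_hat)
    y[o] = 1.0
    loss = -np.log(y_hat[o])
    grad_v_c = U @ (y_hat - y)
    grad_U = np.outer(v_c, y_hat - y)
    return loss, grad_v_c, grad_U

def skipgram_loss_and_grads(c, outside_indices, V_center, U):
    """Sum per-pair losses and gradients over all outside words in the window."""
    total_loss = 0.0
    grad_V = np.zeros_like(V_center)   # columns for w != c are never updated, so they stay zero
    grad_U = np.zeros_like(U)
    for o in outside_indices:
        loss, g_vc, g_U = naive_softmax_loss_and_grads(V_center[:, c], o, U)
        total_loss += loss
        grad_V[:, c] += g_vc           # only the center word's column accumulates
        grad_U += g_U
    return total_loss, grad_V, grad_U

rng = np.random.default_rng(0)
V_center, U = rng.random((4, 5)), rng.random((4, 5))
c, outside = 2, [0, 1, 3, 4]           # toy window: word 2 is the center (assumption)
total_loss, grad_V, grad_U = skipgram_loss_and_grads(c, outside, V_center, U)
print(np.count_nonzero(np.delete(grad_V, c, axis=1)))
```

The printed count is the number of nonzero gradient entries in the columns of center vectors other than `c`.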