__a) Show that the naive-softmax loss given by $J_{\text{naive-softmax}}(v_c, o, U) = - \log P(O=o|C=c)$ is the same as the cross-entropy loss between $y$ and $\hat{y}$; i.e., show that:__

$$- \sum \limits_{w \in \text {Vocab}} {y_w \log \hat {y}_w} = - \log \hat {y}_o$$

Answer:
$y$ is one-hot: $y_w = 1$ at the position of the true outside word $o$ and $y_w = 0$ everywhere else. Every term of the sum except $w = o$ therefore vanishes, and the left-hand side reduces to a single term: $- \sum \limits_{w \in \text {Vocab}} {y_w \log \hat {y}_w} = - y_o \log \hat{y}_o = - 1 \cdot \log \hat{y}_o = - \log \hat{y}_o$. Since $\hat{y}_o = P(O=o|C=c)$, this is exactly the naive-softmax loss.
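The identity can also be checked numerically. The snippet below is a minimal sketch, assuming a made-up 5-word vocabulary and random scores (`demo_scores`, `demo_o` and the other `demo_*` names are illustrative and not part of the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores for a 5-word vocabulary; softmax turns them into y_hat.
demo_scores = rng.normal(size=5)
demo_y_hat = np.exp(demo_scores - demo_scores.max())
demo_y_hat /= demo_y_hat.sum()

# One-hot label with the true outside word at index demo_o.
demo_o = 2
demo_y = np.zeros(5)
demo_y[demo_o] = 1.0

# Full cross-entropy sum versus the single surviving term.
cross_entropy = -np.sum(demo_y * np.log(demo_y_hat))
naive_softmax = -np.log(demo_y_hat[demo_o])
print(cross_entropy, naive_softmax)
```

Both printed values should agree up to floating-point rounding, whichever index is chosen for `demo_o`.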

__b) Compute the partial derivative of $J_{\text {naive-softmax}}(v_c, o, U)$ with respect to $v_c$. Please write your answer in terms of $y$, $\hat{y}$, and $U$.__

Answer: 

$\frac {\partial J} {\partial v_c} = (\hat{y} - y)U^T$
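A sketch of the derivation, assuming $U \in \mathbb{R}^{d \times V}$ stores the outside vectors $u_w$ as columns (the layout used in the numerical check) and $\hat{y}_w = \frac {\exp(u_w^T v_c)} {\sum_{x} \exp(u_x^T v_c)}$:

$$J = - \log \hat{y}_o = - u_o^T v_c + \log \sum \limits_{w \in \text {Vocab}} {\exp(u_w^T v_c)}$$

$$\frac {\partial J} {\partial v_c} = - u_o + \sum \limits_{w \in \text {Vocab}} {\hat{y}_w u_w} = U(\hat{y} - y)$$

Written as a column vector this is $U(\hat{y} - y)$; the expression $(\hat{y} - y)U^T$ is the same quantity with $\hat{y} - y$ treated as a row vector.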

Numerical check (loss should be decreasing):

In [1]:
import numpy as np

def J(y, U, v_c):
    #softmax value for each word in vocabulary. Array shape (V,), where V - size of vocabulary
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    #cross-entropy loss. Scalar value.
    loss = np.sum(- y * np.log(y_hat))
    return loss

def grad_v_c(y, U, v_c):
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    #weight each outside vector u_w by (y_hat_w - y_w) and sum over the vocabulary,
    #so we get shape (d,), where d is the size of our vector representation
    d = np.sum((y_hat-y)[:,np.newaxis]*U.T, axis=0)
    return d

#some random matrix U (suppose d=4, v=5, so shape is 4x5)
U = np.array(np.random.random((4,5)))

#some random word vector V_c
v_c = np.random.random(4)

#some random vector y with correct labels of words from vocabulary (size v, one-hot encoded)
y = np.array([0,0,1,0,0])

#matrix with small shift for every element in word_vector v_c to perform finite differences check
h = np.diag([0.00001]*4)

for i in range(5):
    j = J(y,U,v_c)
    dj = grad_v_c(y,U,v_c)
    print("iteration",i)
    print("Loss:",j,"v_c",v_c)
    print("gradient",dj)
    print("gradient check:", dj - [(J(y,U,v_c+h[0])-j)/np.sum(h[0]),
                                   (J(y,U,v_c+h[1])-j)/np.sum(h[1]),
                                   (J(y,U,v_c+h[2])-j)/np.sum(h[2]),
                                   (J(y,U,v_c+h[3])-j)/np.sum(h[3])])
    v_c = v_c - dj

iteration 0
Loss: 2.0011823067785066 v_c [0.65965599 0.17352562 0.16775718 0.59073473]
gradient [ 0.26157565 -0.40633321  0.42686383  0.41312018]
gradient check: [-1.71048108e-07 -2.99075358e-07 -4.82647708e-07 -3.36033180e-07]
iteration 1
Loss: 1.459945148371363 v_c [ 0.39808033  0.57985882 -0.25910666  0.17761455]
gradient [ 0.21386538 -0.34613217  0.35182788  0.35564653]
gradient check: [-1.72205029e-07 -3.09848814e-07 -5.00205743e-07 -4.02530177e-07]
iteration 2
Loss: 1.0801515250943454 v_c [ 0.18421495  0.925991   -0.61093454 -0.17803198]
gradient [ 0.17248984 -0.28981979  0.2848892   0.2970376 ]
gradient check: [-1.61428890e-07 -3.10479200e-07 -4.76503942e-07 -4.29432120e-07]
iteration 3
Loss: 0.8222445581074181 v_c [ 0.01172511  1.21581079 -0.89582374 -0.47506958]
gradient [ 0.13958321 -0.24185402  0.23081792  0.24541722]
gradient check: [-1.44873841e-07 -2.99073512e-07 -4.30959371e-07 -4.21111431e-07]
iteration 4
Loss: 0.6472140435603255 v_c [-0.12785809  1.45766481 -1.12664166

__c) Compute the partial derivatives of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to each of the ‘outside’ word vectors, $u_w$’s. There will be two cases: when $w = o$, the true ‘outside’ word vector, and $w \ne o$, for all other words. Please write your answer in terms of $y$, $\hat{y}$, and $v_c$.__

Answer:

$$\frac {\partial J} {\partial U} = v_c (\hat{y} - y)^T$$ 
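Read column by column, the matrix form gives the two cases the question asks for. A short derivation, assuming the columns of $U$ are the outside vectors $u_w$ and using $J = - u_o^T v_c + \log \sum_{x} \exp(u_x^T v_c)$:

$$\frac {\partial J} {\partial u_w} = - y_w v_c + \hat{y}_w v_c = (\hat{y}_w - y_w) v_c$$

$$\frac {\partial J} {\partial u_o} = (\hat{y}_o - 1) v_c \text {, and } \frac {\partial J} {\partial u_w} = \hat{y}_w v_c \text { for } w \ne o$$

Stacking these columns side by side recovers $v_c (\hat{y} - y)^T$.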

Numerical check (loss should be decreasing):

In [2]:
def grad_U(y, U, v_c):
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    #outer product of v_c and (y_hat - y). so we get shape (d,V), the same shape as U
    d = v_c.T[:,np.newaxis] @ (y_hat-y).T[np.newaxis,:]
    return d

#some random matrix U (suppose d=4, v=5, so shape is 4x5)
U = np.array(np.random.random((4,5)))

#some random word vector V_c
v_c = np.random.random(4)

#some random vector y with correct labels of words from vocabulary (size v, one-hot encoded)
y = np.array([0,0,1,0,0])

#one small shift per element of U (20 matrices of shape 4x5) to perform finite differences check
h = np.zeros((20,4,5))
for i in range(20):
    row = i // 5
    col = i % 5
    h[i,row,col] = 0.00001

for i in range(5):
    j = J(y,U,v_c)
    dj = grad_U(y,U,v_c)
    print("iteration",i)
    print("Loss:",j,"v_c",v_c)
    print("gradient",dj)
    diff = np.zeros_like(U)
    #use k here so the inner loop does not reuse the outer loop variable i
    for k in range(20):
        row = k // 5
        col = k % 5
        diff[row,col] = (J(y,U+h[k],v_c)-j)/np.sum(h[k])
    print("gradient check:", dj - diff)
    U = U - dj

iteration 0
Loss: 2.0432893278345134 v_c [0.63326141 0.3609223  0.6198098  0.5393403 ]
gradient [[ 0.09212016  0.09275454 -0.55118965  0.12842835  0.23788659]
 [ 0.05250315  0.05286471 -0.31414616  0.07319672  0.13558157]
 [ 0.09016337  0.09078427 -0.53948139  0.1257003   0.23283345]
 [ 0.07845752  0.07899781 -0.46944088  0.10938072  0.20260484]]
gradient check: [[-2.49288763e-07 -2.50669257e-07 -2.26188589e-07 -3.24172154e-07
  -4.70287100e-07]
 [-8.09877885e-08 -8.14829695e-08 -7.35110741e-08 -1.05339711e-07
  -1.52772112e-07]
 [-2.38810876e-07 -2.40145070e-07 -2.16695342e-07 -3.10554288e-07
  -4.50540259e-07]
 [-1.80823763e-07 -1.81827119e-07 -1.64088170e-07 -2.35155596e-07
  -3.41134131e-07]]
iteration 1
Loss: 1.0059356135417779 v_c [0.63326141 0.3609223  0.6198098  0.5393403 ]
gradient [[ 0.07632257  0.07675535 -0.40167624  0.09929353  0.1493048 ]
 [ 0.04349944  0.0437461  -0.22893218  0.05659156  0.08509508]
 [ 0.07470134  0.07512492 -0.39314392  0.09718436  0.14613329]
 [ 0.0650

__d) The sigmoid function is given by equation:
$$ \sigma(x) = \frac 1 {1 + e^{-x}} = \frac {e^x} {e^x + 1} $$
Please compute the derivative of $\sigma(x)$ with respect to $x$, where $x$ is a vector.__

Answer:

$$\sigma^\prime(x) = \frac {(e^x)^\prime (e^x + 1) - e^x (e^x + 1)^\prime} {(e^x + 1)^2} = \frac {e^x (e^x + 1) - e^x e^x} {(e^x + 1)^2} = \frac {e^x} {(e^x + 1)^2} = \frac {e^x} {e^x + 1} \frac 1 {e^x + 1} = \sigma(x) (1 - \frac {e^x} {e^x + 1}) = \sigma(x) (1 - \sigma(x))$$
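Because $x$ is a vector and $\sigma$ is applied elementwise, the result can be read as a diagonal Jacobian with entries $\sigma(x_i)(1 - \sigma(x_i))$. The sketch below illustrates that reading with central differences; the helper names `sigmoid_jacobian` and `numerical_jacobian` and the test vector are illustrative assumptions, not part of the assignment:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_jacobian(x):
    # Analytic Jacobian: diagonal matrix with sigma(x_i) * (1 - sigma(x_i)).
    s = sigmoid(x)
    return np.diag(s * (1.0 - s))

def numerical_jacobian(f, x, eps=1e-6):
    # Central differences: column i holds the derivative of f with respect to x_i.
    jac = np.zeros((x.size, x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        jac[:, i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return jac

x_demo = np.array([-1.5, 0.0, 0.3, 2.0])
max_error = np.max(np.abs(sigmoid_jacobian(x_demo) - numerical_jacobian(sigmoid, x_demo)))
print("max abs difference:", max_error)
```

The off-diagonal entries are zero because each output $\sigma(x_i)$ depends only on its own input $x_i$.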

Numerical check:

In [3]:
def sigmoid(x):
    #naive sigmoid implementation
    return 1/(1 + np.exp(-x))

def grad_sigmoid(x):
    return sigmoid(x)*(1 - sigmoid(x))

#vector with the same small shift for every element in vector X to perform finite differences check
h = np.array([0.0001]*4)

for i in range(5):
    #some random vector X (suppose d=4)
    X = np.random.random(4)
    s = sigmoid(X)
    ds = grad_sigmoid(X)
    print("X",X)
    print("Sigmoid value:",s)
    print("Gradient for x",ds)
    print("gradient check:", ds - (sigmoid(X+h)-s)/h[0])

X [0.31206212 0.45579732 0.22146572 0.56511716]
Sigmoid value: [0.57738852 0.61201671 0.55514124 0.63763572]
Gradient for x [0.24401102 0.23745226 0.24695944 0.23105641]
gradient check: [1.88855363e-06 2.66003045e-06 1.36196329e-06 3.18031137e-06]
X [0.48288811 0.03497572 0.861825   0.57506651]
Sigmoid value: [0.61842963 0.50874304 0.70304181 0.63993142]
Gradient for x [0.23597442 0.24992356 0.20877402 0.2304192 ]
gradient check: [2.79480048e-06 2.18717212e-07 4.23907474e-06 3.22443490e-06]
X [0.48865765 0.43121337 0.31837969 0.61015575]
Sigmoid value: [0.61979016 0.60616337 0.57892932 0.64797633]
Gradient for x [0.23565032 0.23872934 0.24377016 0.22810301]
gradient check: [2.82302096e-06 2.53460294e-06 1.92425042e-06 3.37552403e-06]
X [0.92942287 0.42928427 0.15898609 0.93813337]
Sigmoid value: [0.71695818 0.60570274 0.53966301 0.71872245]
Gradient for x [0.20292915 0.23882693 0.24842685 0.20216049]
gradient check: [4.40278851e-06 2.52463943e-06 9.85538183e-07 4.42177640e-06]
X [0.670

__e) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1$, $w_2$, ..., $w_K$ and their outside vectors as $u_1$, ..., $u_K$. Note that $o \notin \{w_1, ..., w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:__

$$J_{\text{neg-sample}}(v_c, o, U) = - \log(\sigma(u_o^T v_c)) - \sum\limits_{k=1}^{K} {\log(\sigma(-u_k^T v_c))}$$

__for a sample $w_1$, ..., $w_K$, where $\sigma(x)$ is the sigmoid function. Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$ and $u_k$, where $k \in [1,K]$.__

__After you’ve done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (d) to help compute the necessary gradients here.__

Answer:

$$\frac {\partial{J}} {\partial{u_o}} = - (1 - \sigma(u_o^T v_c))v_c$$

$$\frac {\partial{J}} {\partial{u_k}} = (1 - \sigma(-u_k^T v_c))v_c$$

or in matrix form, where the columns of $U_k$ are the negative-sample vectors $u_1$, ..., $u_K$:

$$\frac {\partial{J}} {\partial{U_k}} = v_c (1 - \sigma(-U_k^T v_c))^T$$

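$$\frac {\partial{J}} {\partial{v_c}} = - (1 - \sigma(u_o^T v_c)) u_o + \sum\limits_{k=1}^{K} {(1 - \sigma(-u_k^T v_c))u_k}$$

or in matrix form:

$$\frac {\partial{J}} {\partial{v_c}} = - (1 - \sigma(u_o^T v_c)) u_o + U_k (1 - \sigma(-U_k^T v_c))$$

This loss is much cheaper to compute than the naive-softmax loss because it only involves the $K + 1$ outside vectors $u_o$, $u_1$, ..., $u_K$, instead of normalizing over every word in the vocabulary.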
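These gradients follow from part (d) by the chain rule. A short sketch of the key step, using $\sigma^\prime(x) = \sigma(x)(1 - \sigma(x))$:

$$\frac {d} {dx} \left[ - \log \sigma(x) \right] = - \frac {\sigma(x)(1 - \sigma(x))} {\sigma(x)} = - (1 - \sigma(x))$$

Setting $x = u_o^T v_c$ gives $\frac {\partial x} {\partial u_o} = v_c$ and hence the gradient for $u_o$. Setting $x = -u_k^T v_c$ gives $\frac {\partial x} {\partial u_k} = -v_c$, and the two minus signs cancel:

$$\frac {\partial} {\partial u_k} \left[ - \log \sigma(-u_k^T v_c) \right] = - (1 - \sigma(-u_k^T v_c)) (-v_c) = (1 - \sigma(-u_k^T v_c)) v_c$$

Since $1 - \sigma(-x) = \sigma(x)$, the last expression can equivalently be written as $\sigma(u_k^T v_c) v_c$.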

Numerical check:

In [4]:
def J(u_o, u, v_c):
    J = -np.log(sigmoid(u_o.T @ v_c)) - np.sum(np.log(sigmoid(-u.T @ v_c)), axis = 0)
    return J.ravel()

In [5]:
def grad_U_o(u_o, v_c):
    return -(1 - sigmoid(u_o.T @ v_c))*v_c

u_o = np.random.random((4,1))
u = np.random.random((4,3))
v_c = np.random.random((4,1))
h = np.diag([0.0001]*4)

for i in range(5):
    j = J(u_o,u,v_c)
    dj = grad_U_o(u_o,v_c).reshape((4,1))
    print("iteration",i)
    print("Loss:",j)
    print("gradient",dj)
    print("gradient check:", dj - np.array([(J(u_o+h[0].reshape((-1,1)),u,v_c)-j)/np.sum(h[0]),
                                            (J(u_o+h[1].reshape((-1,1)),u,v_c)-j)/np.sum(h[1]),
                                            (J(u_o+h[2].reshape((-1,1)),u,v_c)-j)/np.sum(h[2]),
                                            (J(u_o+h[3].reshape((-1,1)),u,v_c)-j)/np.sum(h[3])]))
    u_o = u_o - dj

iteration 0
Loss: [6.07813947]
gradient [[-0.19303651]
 [-0.19234558]
 [-0.08921333]
 [-0.21368096]]
gradient check: [[-5.57318569e-06]
 [-5.53336025e-06]
 [-1.19038373e-06]
 [-6.82897454e-06]]
iteration 1
Loss: [5.97260761]
gradient [[-0.12876768]
 [-0.12830678]
 [-0.05951099]
 [-0.14253884]]
gradient check: [[-4.13143483e-06]
 [-4.10190960e-06]
 [-8.82437264e-07]
 [-5.06235576e-06]]
iteration 2
Loss: [5.92317826]
gradient [[-0.09625192]
 [-0.09590741]
 [-0.04448357]
 [-0.10654566]]
gradient check: [[-3.24465983e-06]
 [-3.22147360e-06]
 [-6.93034122e-07]
 [-3.97576532e-06]]
iteration 3
Loss: [5.89470655]
gradient [[-0.07677999]
 [-0.07650518]
 [-0.03548447]
 [-0.08499129]]
gradient check: [[-2.66300498e-06]
 [-2.64397413e-06]
 [-5.68790962e-07]
 [-3.26304674e-06]]
iteration 4
Loss: [5.87622734]
gradient [[-0.06384204]
 [-0.06361353]
 [-0.0295051 ]
 [-0.07066968]]
gradient check: [[-2.25557089e-06]
 [-2.23945667e-06]
 [-4.81775591e-07]
 [-2.76380689e-06]]


In [6]:
def grad_U(u, v_c):
    return v_c @ (1 - sigmoid(-u.T @ v_c)).T

u_o = np.random.random((4,1))
u = np.random.random((4,3))
v_c = np.random.random((4,1))
h = np.zeros((12,4,3))
for i in range(12):
    row = i // 3
    col = i % 3
    h[i,row,col] = 0.0001

for i in range(5):
    j = J(u_o,u,v_c)
    dj = grad_U(u,v_c)
    print("iteration",i)
    print("Loss:",j)
    print("gradient",dj)
    diff = np.zeros_like(u)
    #use k here so the inner loop does not reuse the outer loop variable i
    for k in range(12):
        row = k // 3
        col = k % 3
        diff[row,col] = (J(u_o,u+h[k],v_c)-j)/np.sum(h[k])
    print("gradient check:", dj - diff)
    u = u - dj

iteration 0
Loss: [4.00174345]
gradient [[0.08410885 0.08734542 0.08612594]
 [0.27522971 0.28582075 0.28183023]
 [0.57606874 0.59823629 0.58988395]
 [0.11556171 0.12000861 0.11833309]]
gradient check: [[-1.58441093e-07 -1.50402988e-07 -1.53557925e-07]
 [-1.69661183e-06 -1.61053867e-06 -1.64428377e-06]
 [-7.43253846e-06 -7.05547401e-06 -7.20331500e-06]
 [-2.99100117e-07 -2.83929832e-07 -2.89877106e-07]]
iteration 1
Loss: [2.79749007]
gradient [[0.06645817 0.06957243 0.06837995]
 [0.21747131 0.22766214 0.22375996]
 [0.4551777  0.47650758 0.46834014]
 [0.09131048 0.09558934 0.09395092]]
gradient check: [[-1.83845701e-07 -1.81628859e-07 -1.82587124e-07]
 [-1.96860731e-06 -1.94485756e-06 -1.95517794e-06]
 [-8.62416497e-06 -8.52009592e-06 -8.56532466e-06]
 [-3.47055095e-07 -3.42863840e-07 -3.44685901e-07]]
iteration 2
Loss: [2.04854715]
gradient [[0.05163626 0.05405248 0.05311973]
 [0.16896954 0.17687613 0.17382391]
 [0.35366119 0.37021007 0.36382163]
 [0.07094586 0.07426563 0.07298408]]
gra

In [7]:
def grad_v_c(u_o, u, v_c):
    return -(1 - sigmoid(u_o.T @ v_c))*u_o + u @ (1 - sigmoid(-u.T @ v_c))

u_o = np.random.random((4,1))
u = np.random.random((4,3))
v_c = np.random.random((4,1))
h = np.diag([0.0001]*4)

for i in range(5):
    j = J(u_o,u,v_c)
    dj = grad_v_c(u_o, u, v_c)
    print("iteration",i)
    print("Loss:",j)
    print("gradient",dj)
    print("gradient check:", dj - np.array([(J(u_o,u,v_c+h[0].reshape((-1,1)))-j)/np.sum(h[0]),
                                            (J(u_o,u,v_c+h[1].reshape((-1,1)))-j)/np.sum(h[1]),
                                            (J(u_o,u,v_c+h[2].reshape((-1,1)))-j)/np.sum(h[2]),
                                            (J(u_o,u,v_c+h[3].reshape((-1,1)))-j)/np.sum(h[3])]))
    v_c = v_c - dj

iteration 0
Loss: [3.42952005]
gradient [[0.15843803]
 [0.67268655]
 [0.7644406 ]
 [0.49613987]]
gradient check: [[-3.93701870e-06]
 [-7.80428311e-06]
 [-1.60732726e-05]
 [-8.63920169e-06]]
iteration 1
Loss: [2.60456207]
gradient [[-0.12619046]
 [ 0.27975382]
 [ 0.14575329]
 [ 0.03940952]]
gradient check: [[-5.45835467e-06]
 [-9.42912001e-06]
 [-2.08298835e-05]
 [-1.14212178e-05]]
iteration 2
Loss: [2.50534519]
gradient [[-0.17796141]
 [ 0.20167916]
 [ 0.03102633]
 [-0.04452132]]
gradient check: [[-5.31022726e-06]
 [-9.18296141e-06]
 [-2.00781510e-05]
 [-1.10417578e-05]]
iteration 3
Loss: [2.43123423]
gradient [[-0.18229266]
 [ 0.18634234]
 [ 0.01667763]
 [-0.05384099]]
gradient check: [[-5.30961325e-06]
 [-9.11101527e-06]
 [-1.99573277e-05]
 [-1.09987201e-05]]
iteration 4
Loss: [2.36094051]
gradient [[-0.18044678]
 [ 0.17938176]
 [ 0.01528714]
 [-0.05353138]]
gradient check: [[-5.33081499e-06]
 [-9.06585886e-06]
 [-1.99372555e-05]
 [-1.10083244e-05]]


__f) Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, ..., w_{t-1}, w_t, w_{t+1}, ..., w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:__

$$J_{\text{skip-gram}}(v_c, w_{t-m}, ..., w_{t+m}, U) = \sum\limits_{\substack{-m \le j \le m \\ j \ne 0}} {J(v_c, w_{t+j}, U)}$$

__Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$. $J(v_c, w_{t+j}, U)$ could be $J_{\text{naive-softmax}}(v_c, w_{t+j}, U)$ or $J_{\text{neg-sample}}(v_c, w_{t+j}, U)$, depending on your implementation.__

__Write down three partial derivatives:__
- $\frac {\partial J_{\text{skip-gram}}} {\partial U}$
- $\frac {\partial J_{\text{skip-gram}}} {\partial v_c}$
- $\frac {\partial J_{\text{skip-gram}}} {\partial v_w} \text {, where } w \ne c$

__Write your answers in terms of $\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial U}$ and $\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial v_c}$__

Answer:

$$\frac {\partial J_{\text{skip-gram}}} {\partial U} = \sum\limits_{\substack{-m \le j \le m \\ j \ne 0}} {\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial U}}$$

$$\frac {\partial J_{\text{skip-gram}}} {\partial v_c} = \sum\limits_{\substack{-m \le j \le m \\ j \ne 0}} {\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial v_c}}$$

$$\frac {\partial J_{\text{skip-gram}}} {\partial v_w} = 0 \text {, for all } w \ne c $$

The last derivative is zero because the only center vector that appears in any term of the loss is $v_c$.
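The three results can be sketched in code as a simple accumulation over the window. This is a minimal sketch under stated assumptions: the per-pair loss is the naive-softmax loss from parts (b) and (c), both matrices have shape (d, V) with word vectors as columns, and the names `naive_softmax_loss_and_grads`, `skip_gram`, `V_center` and `U_outside` are illustrative rather than taken from the assignment code:

```python
import numpy as np

def naive_softmax_loss_and_grads(v_c, o, U):
    # U has shape (d, V) with the outside vectors u_w as columns.
    scores = U.T @ v_c
    y_hat = np.exp(scores - scores.max())   # shift by the max for numerical stability
    y_hat /= y_hat.sum()
    y = np.zeros_like(y_hat)
    y[o] = 1.0
    loss = -np.log(y_hat[o])
    grad_v_c = U @ (y_hat - y)              # shape (d,)
    grad_U = np.outer(v_c, y_hat - y)       # shape (d, V)
    return loss, grad_v_c, grad_U

def skip_gram(center, outside_words, V_center, U):
    # V_center has shape (d, V): one center vector per vocabulary word.
    loss = 0.0
    grad_V_center = np.zeros_like(V_center)
    grad_U = np.zeros_like(U)
    for o in outside_words:
        term_loss, g_vc, g_U = naive_softmax_loss_and_grads(V_center[:, center], o, U)
        loss += term_loss
        grad_V_center[:, center] += g_vc    # only the center word's column is touched
        grad_U += g_U
    return loss, grad_V_center, grad_U

rng = np.random.default_rng(0)
d, vocab_size = 4, 5
V_center = rng.random((d, vocab_size))
U_outside = rng.random((d, vocab_size))

# Center word index 1 with outside word indices 0, 2, 2, 4 (a made-up window).
sg_loss, sg_grad_V, sg_grad_U = skip_gram(1, [0, 2, 2, 4], V_center, U_outside)
print("loss:", sg_loss)
print("gradient w.r.t. center vectors:\n", sg_grad_V)
```

In the printed gradient for the center vectors, every column except the one for the center word stays at zero, which matches the third derivative above.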