__a) Show that the naive-softmax loss given in $J_{\text{naive-softmax}}(v_c, o, U) = - \log P(O=o|C=c)$ is the same as the cross-entropy loss between $y$ and $\hat{y}$; i.e., show that:__

$$- \sum_{w \in \text {Vocab}} {y_w \log \hat {y}_w} = - \log \hat {y}_o$$

Answer:
$y$ is one-hot: $y_w$ is nonzero (and equal to 1) only at the position of the true word $o$. Every other term of the sum therefore vanishes, and the left-hand side reduces to its single nonzero term:

$$- \sum_{w \in \text {Vocab}} {y_w \log \hat {y}_w} = - \sum_{w = o} {y_w \log \hat {y}_w} = - y_o \log \hat{y}_o = - 1 \cdot \log \hat{y}_o = - \log \hat{y}_o$$
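
The identity can also be checked on a small example. This is only a sketch: the probabilities in `y_hat` and the index `o` are made-up values, not taken from any trained model.

```python
import numpy as np

# Hypothetical prediction over a 5-word vocabulary (values are made up).
y_hat = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
o = 2                       # index of the true outside word (arbitrary choice)

y = np.zeros_like(y_hat)    # one-hot label vector
y[o] = 1.0

# Full cross-entropy sum versus the single surviving term.
cross_entropy = -np.sum(y * np.log(y_hat))
naive_softmax_loss = -np.log(y_hat[o])

print(cross_entropy, naive_softmax_loss)
```

Both printed values should agree up to floating-point rounding, since all terms with $y_w = 0$ contribute nothing to the sum.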

__b) Compute the partial derivative of $J_{\text {naive-softmax}}(v_c, o, U)$ with respect to $v_c$ . Please write your answer in terms of $y$, $\hat{y}$, and $U$.__

Answer: 

$\frac {\partial J} {\partial v_c} = (\hat{y} - y)U^T$
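
One way to arrive at this result, sketched under the assumption that the columns of $U$ are the outside vectors $u_w$ (the layout the code in this notebook uses via `U.T @ v_c`), is to expand the log of the softmax first:

$$J = - \log \frac {\exp(u_o^T v_c)} {\sum_{w \in \text{Vocab}} \exp(u_w^T v_c)} = - u_o^T v_c + \log \sum_{w \in \text{Vocab}} \exp(u_w^T v_c)$$

$$\frac {\partial J} {\partial v_c} = - u_o + \sum_{w \in \text{Vocab}} \frac {\exp(u_w^T v_c)} {\sum_{x \in \text{Vocab}} \exp(u_x^T v_c)} u_w = - u_o + \sum_{w \in \text{Vocab}} \hat{y}_w u_w = U (\hat{y} - y)$$

Here $\hat{y}$ and $y$ are treated as column vectors; with row vectors the same quantity is written $(\hat{y} - y)U^T$, as above.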

Numerical check (loss should be decreasing):

In [1]:
import numpy as np

def J(y, U, v_c):
    #softmax value for each word in vocabulary. Array shape [1,V], where V - size of vocabulary
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    #loss function. Scalar value.
    J = np.sum(- y * np.log(y_hat))
    return J

def grad_v_c(y, U, v_c):
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    #sum over rows. so we get shape [1,d], where d is the size of our vector representation
    d = np.sum((y_hat-y)[:,np.newaxis]*U.T, axis=0)
    return d

#some random matrix U (suppose d=4, v=5, so shape is 4x5)
U = np.array(np.random.random((4,5)))

#some random word vector V_c
v_c = np.random.random(4)

#some random vector y with correct labels of words from vocabulary (size v, one-hot encoded)
y = np.array([0,0,1,0,0])

#matrix with small shift for every element in word_vector v_c to perform finite differences check
h = np.diag([0.00001]*4)

for i in range(5):
    j = J(y,U,v_c)
    dj = grad_v_c(y,U,v_c)
    print("iteration",i)
    print("Loss:",j,"v_c",v_c)
    print("gradient",dj)
    print("gradient check:", dj - [(J(y,U,v_c+h[0])-j)/np.sum(h[0]),
                                   (J(y,U,v_c+h[1])-j)/np.sum(h[1]),
                                   (J(y,U,v_c+h[2])-j)/np.sum(h[2]),
                                   (J(y,U,v_c+h[3])-j)/np.sum(h[3])])
    v_c = v_c - dj

iteration 0
Loss: 1.396645847775709 v_c [0.3277639  0.2355097  0.63730662 0.45045224]
gradient [-0.17616601  0.2021204  -0.21494512 -0.10431832]
gradient check: [-2.57400052e-07 -4.90028697e-07 -2.87680664e-07 -2.42110742e-07]
iteration 1
Loss: 1.2765575948229395 v_c [0.50392991 0.0333893  0.85225174 0.55477056]
gradient [-0.15197113  0.16622224 -0.18520328 -0.1070215 ]
gradient check: [-2.29392500e-07 -4.93436727e-07 -2.65967333e-07 -2.27544151e-07]
iteration 2
Loss: 1.1860828542365922 v_c [ 0.65590104 -0.13283294  1.03745501  0.66179206]
gradient [-0.13297829  0.13725236 -0.16194976 -0.10813718]
gradient check: [-2.04244209e-07 -4.87419743e-07 -2.45045411e-07 -2.13958033e-07]
iteration 3
Loss: 1.115752711021939 v_c [ 0.78887933 -0.2700853   1.19940477  0.76992924]
gradient [-0.11793785  0.1140384  -0.14368071 -0.10823274]
gradient check: [-1.82480118e-07 -4.76700683e-07 -2.26278084e-07 -2.01832791e-07]
iteration 4
Loss: 1.0593584938930638 v_c [ 0.90681718 -0.3841237   1.34308548  0.8

__c) Compute the partial derivatives of $J_{\text{naive-softmax}} (v_c, o, U)$ with respect to each of the ‘outside’
word vectors, $u_w$ ’s. There will be two cases: when $w = o$, the true ‘outside’ word vector, and $w \ne o$, for
all other words. Please write your answer in terms of $y$, $\hat{y}$, and $v_c$.__

Answer:

$$\frac {\partial J} {\partial U} = v_c (\hat{y} - y)^T$$ 
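
The two cases the question asks for can be read off this matrix column by column, assuming column $w$ of $U$ is $u_w$. As a sketch of the step, note that $u_x^T v_c$ depends on $u_w$ only when $x = w$:

$$\frac {\partial J} {\partial u_w} = - y_w v_c + \hat{y}_w v_c = (\hat{y}_w - y_w) v_c$$

$$\frac {\partial J} {\partial u_o} = (\hat{y}_o - 1) v_c \qquad \text{and} \qquad \frac {\partial J} {\partial u_w} = \hat{y}_w v_c \text{ for } w \ne o$$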

Numerical check (loss should be decreasing):

In [2]:
def grad_U(y, U, v_c):
    y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
    #outer product of v_c and (y_hat - y). so we get shape [d,V], the same shape as U
    d = v_c.T[:,np.newaxis] @ (y_hat-y).T[np.newaxis,:]
    return d

#some random matrix U (suppose d=4, v=5, so shape is 4x5)
U = np.array(np.random.random((4,5)))

#some random word vector V_c
v_c = np.random.random(4)

#some random vector y with correct labels of words from vocabulary (size v, one-hot encoded)
y = np.array([0,0,1,0,0])

#tensor with small shift for every element in matrix U to perform finite differences check
h = np.zeros((20,4,5))
for i in range(20):
    row = i // 5
    col = i % 5
    h[i,row,col] = 0.00001

for i in range(5):
    j = J(y,U,v_c)
    dj = grad_U(y,U,v_c)
    print("iteration",i)
    print("Loss:",j,"v_c",v_c)
    print("gradient",dj)
    diff = np.zeros_like(U)
    #use a separate index k so the outer iteration counter i is not overwritten
    for k in range(20):
        row = k // 5
        col = k % 5
        diff[row,col] = (J(y,U+h[k],v_c)-j)/np.sum(h[k])
    print("gradient check:", dj - diff)
    U = U - dj

iteration 0
Loss: 2.246555386582317 v_c [0.70716958 0.36076553 0.60210062 0.27327096]
gradient [[ 0.14966649  0.14402201 -0.63237727  0.15196379  0.18672498]
 [ 0.07635299  0.07347343 -0.32260992  0.07752497  0.09525853]
 [ 0.12742953  0.12262369 -0.5384207   0.1293855   0.15898199]
 [ 0.0578355   0.05565431 -0.24436903  0.05872324  0.07215598]]
gradient check: [[-4.17208270e-07 -4.05524080e-07 -2.36476755e-07 -4.21894367e-07
  -4.85900865e-07]
 [-1.08593897e-07 -1.05563286e-07 -6.15675031e-08 -1.09810992e-07
  -1.26454771e-07]
 [-3.02471737e-07 -2.93976248e-07 -1.71463211e-07 -3.05837219e-07
  -3.52266448e-07]
 [-6.23206819e-08 -6.05856751e-08 -3.53434825e-08 -6.30155270e-08
  -7.25884637e-08]]
iteration 1
Loss: 1.2692155010583435 v_c [0.70716958 0.36076553 0.60210062 0.27327096]
gradient [[ 0.12215456  0.11855346 -0.50841814  0.12360022  0.14410991]
 [ 0.06231766  0.06048054 -0.25937165  0.06305517  0.07351828]
 [ 0.10400523  0.10093917 -0.43287902  0.1052361   0.12269853]
 [ 0.04720

__d) The sigmoid function is given by equation:
$$ \sigma(x) = \frac 1 {1 + e^{-x}} = \frac {e^x} {e^x + 1} $$
Please compute the derivative of $\sigma(x)$ with respect to $x$, where $x$ is a vector.__

Answer:

$$\sigma^\prime(x) = \frac {(e^x)^\prime (e^x + 1) - e^x (e^x + 1)^\prime} {(e^x + 1)^2} = \frac {e^x (e^x + 1) - e^x e^x} {(e^x + 1)^2} = \frac {e^x} {(e^x + 1)^2} = \frac {e^x} {e^x + 1} \frac 1 {e^x + 1} = \sigma(x) (1 - \frac {e^x} {e^x + 1}) = \sigma(x) (1 - \sigma(x))$$
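
Since the question takes $x$ to be a vector, $\sigma$ is applied elementwise and the result above holds for each component. Written as a Jacobian (a restatement of the same result, not an extra one), the off-diagonal entries vanish because $\sigma(x_i)$ depends only on $x_i$:

$$\frac {\partial \sigma(x)} {\partial x} = \text{diag} \left( \sigma(x) \odot (1 - \sigma(x)) \right)$$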

Numerical check:

In [3]:
def sigmoid(x):
    #naive sigmoid implementation
    return 1/(1 + np.exp(-x))

def grad_sigmoid(x):
    return sigmoid(x)*(1 - sigmoid(x))

#vector with the same small shift for every element in vector X to perform finite differences check
h = np.array([0.0001]*4)

for i in range(5):
    #some random vector X (suppose d=4)
    X = np.random.random(4)
    s = sigmoid(X)
    ds = grad_sigmoid(X)
    print("X",X)
    print("Sigmoid value:",s)
    print("Gradient for x",ds)
    print("gradient check:", ds - (sigmoid(X+h)-s)/h[0])

X [0.44669405 0.00276901 0.99923837 0.62713255]
Sigmoid value: [0.60985293 0.50069225 0.73090881 0.65183899]
Gradient for x [0.23793233 0.24999952 0.19668112 0.22694492]
gradient check: [2.61392540e-06 1.75151171e-08 4.54159833e-06 3.44604698e-06]
X [0.96305427 0.61912816 0.58678562 0.29704091]
Sigmoid value: [0.7237329  0.65002024 0.64262728 0.57371898]
Gradient for x [0.19994359 0.22749393 0.22965746 0.24456551]
gradient check: [4.47346122e-06 3.41300807e-06 3.27568567e-06 1.80310283e-06]
X [0.52834246 0.71798239 0.88569308 0.52539044]
Sigmoid value: [0.62909643 0.67216257 0.70800058 0.62840737]
Gradient for x [0.23333411 0.22036005 0.20673576 0.23351155]
gradient check: [3.01241582e-06 3.79389377e-06 4.30019823e-06 2.99861661e-06]
X [0.82416312 0.09784161 0.06883076 0.38658814]
Sigmoid value: [0.69511934 0.52444091 0.5172009  0.59546109]
Gradient for x [0.21192844 0.24940264 0.24970413 0.24088718]
gradient check: [4.13522946e-06 6.09769257e-07 4.29720708e-07 2.29971461e-06]
X [0.110

__e) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1$, $w_2$, ..., $w_K$ and their outside vectors as $u_1$, ..., $u_K$. Note that $o \notin \{w_1, \ldots, w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:__

$$J_{\text{neg-sample}}(v_c, o, U) = - \log(\sigma(u_o^T v_c)) - \sum_{k=1}^{K} {\log(\sigma(-u_k^T v_c))}$$

__for a sample $w_1$, ..., $w_K$, where $\sigma(x)$ is the sigmoid function. Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$ and $u_k$, where $k \in [1,K]$.__

__After you’ve done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (d) to help compute the necessary gradients here.__

Answer:

$$\frac {\partial{J}} {\partial{u_o}} = - (1 - \sigma(u_o^T v_c))v_c$$

$$\frac {\partial{J}} {\partial{u_k}} = (1 - \sigma(-u_k^T v_c))v_c$$

or in matrix form, where $U_k$ denotes the matrix whose columns are the negative-sample vectors $u_1, \ldots, u_K$:

$$\frac {\partial{J}} {\partial{U_k}} = v_c (1 - \sigma(-U_k^T v_c))^T$$

$$\frac {\partial{J}} {\partial{v_c}} = - (1 - \sigma(u_o^T v_c)) u_o + \sum_{k=1}^{K} {(1 - \sigma(-u_k^T v_c))u_k}$$

or in matrix form:

$$\frac {\partial{J}} {\partial{v_c}} = - (1 - \sigma(u_o^T v_c)) u_o + U_k (1 - \sigma(-U_k^T v_c))$$
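
These gradients follow from part (d) and the chain rule. A sketch of the common step, with $z$ standing for either $u_o^T v_c$ or $-u_k^T v_c$:

$$\frac {\partial} {\partial z} \left( - \log \sigma(z) \right) = - \frac {\sigma(z)(1 - \sigma(z))} {\sigma(z)} = - (1 - \sigma(z))$$

For $z = u_o^T v_c$ the inner derivative with respect to $u_o$ is $v_c$, which gives the first result. For $z = -u_k^T v_c$ the inner derivative with respect to $u_k$ is $-v_c$, which flips the sign. Because $1 - \sigma(-z) = \sigma(z)$, the negative-sample gradient can equivalently be written as

$$\frac {\partial{J}} {\partial{u_k}} = \sigma(u_k^T v_c) v_c$$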

The naive-softmax loss requires the softmax denominator, which is a sum over the entire vocabulary and has to be recomputed for every training pair, so it is expensive for large vocabularies. Negative sampling instead sums over only the true outside word and a small sample of $K$ negative words.
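
The difference in cost can be made concrete with a small sketch. The sizes `d`, `V`, `K`, the index `o`, and the random vectors are all hypothetical, and the negative words are drawn uniformly here purely for simplicity, which is an assumption rather than the sampling scheme of any particular implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, V, K = 4, 10_000, 5                    # hypothetical sizes
U = rng.normal(scale=0.1, size=(d, V))    # columns are outside vectors u_w
v_c = rng.normal(scale=0.1, size=d)       # center word vector
o = 42                                    # index of the true outside word (arbitrary)

# Naive softmax: the normaliser needs one dot product per vocabulary word.
scores = U.T @ v_c                        # V dot products
naive_loss = -scores[o] + np.log(np.sum(np.exp(scores)))

# Negative sampling: only the true word and K sampled words are touched.
candidates = np.delete(np.arange(V), o)   # enforce that o is not in the sample
neg_idx = rng.choice(candidates, size=K, replace=False)
neg_loss = (-np.log(sigmoid(U[:, o] @ v_c))
            - np.sum(np.log(sigmoid(-U[:, neg_idx].T @ v_c))))

print("dot products, naive softmax:    ", V)
print("dot products, negative sampling:", K + 1)
```

The same ratio carries over to the gradients: the naive-softmax gradient with respect to $U$ is dense over all $V$ columns, while the negative-sampling gradient is nonzero only for the $K + 1$ columns involved.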

Numerical check:

In [4]:
def J(u_o, u, v_c):
    J = -np.log(sigmoid(u_o.T @ v_c)) - np.sum(np.log(sigmoid(-u.T @ v_c)), axis = 0)
    return J.ravel()

In [5]:
def grad_U_o(u_o, v_c):
    return -(1 - sigmoid(u_o.T @ v_c))*v_c

u_o = np.random.random((4,1))
u = np.random.random((4,3))
v_c = np.random.random((4,1))
h = np.diag([0.0001]*4)

for i in range(5):
    j = J(u_o,u,v_c)
    dj = grad_U_o(u_o,v_c).reshape((4,1))
    print("iteration",i)
    print("Loss:",j)
    print("gradient",dj)
    print("gradient check:", dj - np.array([(J(u_o+h[0].reshape((-1,1)),u,v_c)-j)/np.sum(h[0]),
                                            (J(u_o+h[1].reshape((-1,1)),u,v_c)-j)/np.sum(h[1]),
                                            (J(u_o+h[2].reshape((-1,1)),u,v_c)-j)/np.sum(h[2]),
                                            (J(u_o+h[3].reshape((-1,1)),u,v_c)-j)/np.sum(h[3])]))
    u_o = u_o - dj

iteration 0
Loss: [4.19528017]
gradient [[-0.00896   ]
 [-0.19028008]
 [-0.10124527]
 [-0.19798658]]
gradient check: [[-1.09191948e-08]
 [-4.92392033e-06]
 [-1.39404451e-06]
 [-5.33083867e-06]]
iteration 1
Loss: [4.11903751]
gradient [[-0.00702925]
 [-0.14927748]
 [-0.07942838]
 [-0.15532334]]
gradient check: [[-9.24857534e-09]
 [-4.16891238e-06]
 [-1.18029527e-06]
 [-4.51343474e-06]]
iteration 2
Loss: [4.0712289]
gradient [[-0.00574127]
 [-0.12192523]
 [-0.06487465]
 [-0.12686331]]
gradient check: [[-7.91848551e-09]
 [-3.57177811e-06]
 [-1.01122875e-06]
 [-3.86695212e-06]]
iteration 3
Loss: [4.03887326]
gradient [[-0.004834  ]
 [-0.10265788]
 [-0.05462277]
 [-0.10681561]]
gradient check: [[-6.88342674e-09]
 [-3.10622776e-06]
 [-8.79424876e-07]
 [-3.36293252e-06]]
iteration 4
Loss: [4.0156801]
gradient [[-0.00416535]
 [-0.08845799]
 [-0.04706721]
 [-0.09204062]]
gradient check: [[-6.07583483e-09]
 [-2.73937679e-06]
 [-7.75562530e-07]
 [-2.96576035e-06]]


In [6]:
def grad_U(u, v_c):
    return v_c @ (1 - sigmoid(-u.T @ v_c)).T

u_o = np.random.random((4,1))
u = np.random.random((4,3))
v_c = np.random.random((4,1))
h = np.zeros((12,4,3))
for i in range(12):
    row = i // 3
    col = i % 3
    h[i,row,col] = 0.0001

for i in range(5):
    j = J(u_o,u,v_c)
    dj = grad_U(u,v_c)
    print("iteration",i)
    print("Loss:",j)
    print("gradient",dj)
    diff = np.zeros_like(u)
    #use a separate index k so the outer iteration counter i is not overwritten
    for k in range(12):
        row = k // 3
        col = k % 3
        diff[row,col] = (J(u_o,u+h[k],v_c)-j)/np.sum(h[k])
    print("gradient check:", dj - diff)
    u = u - dj

iteration 0
Loss: [4.07665146]
gradient [[0.10957902 0.10383585 0.1062522 ]
 [0.20609141 0.19528992 0.19983448]
 [0.68047203 0.64480769 0.65981291]
 [0.24770133 0.23471901 0.24018112]]
gradient check: [[-2.22327617e-07 -2.40490840e-07 -2.33253835e-07]
 [-7.86431977e-07 -8.50683979e-07 -8.25073774e-07]
 [-8.57345177e-06 -9.27395184e-06 -8.99472842e-06]
 [-1.13604805e-06 -1.22886904e-06 -1.19186639e-06]]
iteration 1
Loss: [2.63410449]
gradient [[0.08256756 0.07716268 0.07937657]
 [0.15528945 0.14512418 0.14928797]
 [0.51273428 0.4791706  0.49291861]
 [0.18664244 0.1744248  0.17942926]]
gradient check: [[-2.79034245e-07 -2.81624731e-07 -2.80917147e-07]
 [-9.87025978e-07 -9.96178771e-07 -9.93673004e-07]
 [-1.07603836e-05 -1.08601645e-05 -1.08329183e-05]
 [-1.42582273e-06 -1.43903550e-06 -1.43542871e-06]]
iteration 2
Loss: [1.8387801]
gradient [[0.06034834 0.05658271 0.0581089 ]
 [0.11350052 0.10641828 0.10928868]
 [0.37475569 0.35137159 0.36084905]
 [0.13641631 0.12790417 0.1313541 ]]
grad

In [7]:
def grad_v_c(u_o, u, v_c):
    return -(1 - sigmoid(u_o.T @ v_c))*u_o + u @ (1 - sigmoid(-u.T @ v_c))

u_o = np.random.random((4,1))
u = np.random.random((4,3))
v_c = np.random.random((4,1))
h = np.diag([0.0001]*4)

for i in range(5):
    j = J(u_o,u,v_c)
    dj = grad_v_c(u_o, u, v_c)
    print("iteration",i)
    print("Loss:",j)
    print("gradient",dj)
    print("gradient check:", dj - np.array([(J(u_o,u,v_c+h[0].reshape((-1,1)))-j)/np.sum(h[0]),
                                            (J(u_o,u,v_c+h[1].reshape((-1,1)))-j)/np.sum(h[1]),
                                            (J(u_o,u,v_c+h[2].reshape((-1,1)))-j)/np.sum(h[2]),
                                            (J(u_o,u,v_c+h[3].reshape((-1,1)))-j)/np.sum(h[3])]))
    v_c = v_c - dj

iteration 0
Loss: [4.72987672]
gradient [[1.1069796 ]
 [0.8000436 ]
 [1.35019401]
 [1.2601191 ]]
gradient check: [[-1.27866749e-05]
 [-6.36308131e-06]
 [-1.07693757e-05]
 [-1.15491254e-05]]
iteration 1
Loss: [1.99581972]
gradient [[-0.14393218]
 [-0.07702408]
 [ 0.22377566]
 [ 0.13736781]]
gradient check: [[-1.18785756e-05]
 [-5.98597260e-06]
 [-9.90202495e-06]
 [-1.08015012e-05]]
iteration 2
Loss: [1.90321053]
gradient [[-0.16445899]
 [-0.0934769 ]
 [ 0.1950267 ]
 [ 0.11039826]]
gradient check: [[-1.13935756e-05]
 [-5.76798298e-06]
 [-9.30438815e-06]
 [-1.02310958e-05]]
iteration 3
Loss: [1.81862326]
gradient [[-0.16654173]
 [-0.09649856]
 [ 0.18279018]
 [ 0.1006706 ]]
gradient check: [[-1.12929319e-05]
 [-5.71486405e-06]
 [-9.04350038e-06]
 [-9.97925962e-06]]
iteration 4
Loss: [1.73924734]
gradient [[-0.16327516]
 [-0.09556005]
 [ 0.17547567]
 [ 0.0960344 ]]
gradient check: [[-1.12928979e-05]
 [-5.70480407e-06]
 [-8.87860299e-06]
 [-9.81921726e-06]]
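
The residuals printed by the checks above are on the order of 1e-6 to 1e-5. That is what a forward difference with step `h` is expected to leave, since its truncation error is proportional to `h`. A central difference has error proportional to `h**2` and gives a tighter check. The sketch below reuses the negative-sampling loss and the $v_c$ gradient from part (e) with 1-D arrays; the helper name `numeric_grad` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(u_o, u, v_c):
    # u_o: (d,), u: (d, K) with negative samples as columns, v_c: (d,)
    return -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-u.T @ v_c)))

def grad_v_c(u_o, u, v_c):
    # analytic gradient with respect to v_c from part (e)
    return -(1 - sigmoid(u_o @ v_c)) * u_o + u @ (1 - sigmoid(-u.T @ v_c))

def numeric_grad(f, x, h=1e-4, central=True):
    # finite-difference gradient of a scalar function f at a 1-D point x
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        if central:
            g[i] = (f(x + e) - f(x - e)) / (2 * h)
        else:
            g[i] = (f(x + e) - f(x)) / h
    return g

rng = np.random.default_rng(0)
u_o, u, v_c = rng.random(4), rng.random((4, 3)), rng.random(4)

f = lambda v: neg_sample_loss(u_o, u, v)
analytic = grad_v_c(u_o, u, v_c)
err_forward = np.max(np.abs(analytic - numeric_grad(f, v_c, central=False)))
err_central = np.max(np.abs(analytic - numeric_grad(f, v_c, central=True)))
print("max error, forward difference:", err_forward)
print("max error, central difference:", err_central)
```

With the same step size, the central-difference error should come out several orders of magnitude smaller, which makes a genuine gradient bug easier to tell apart from discretisation error.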


__f) Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, ..., w_{t-1}, w_t, w_{t+1}, ..., w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:__

$$J_{\text{skip-gram}}(v_c, w_{t-m}, ..., w_{t+m}, U) = \sum_{-m \le j \le m} {J(v_c, w_{t+j}, U)}$$

__Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$. $J(v_c, w_{t+j}, U)$ could be $J_{\text{naive-softmax}}(v_c, w_{t+j}, U)$ or $J_{\text{neg-sample}}(v_c, w_{t+j}, U)$, depending on your implementation.__

__Write down three partial derivatives:__
- $\frac {\partial J_{\text{skip-gram}}} {\partial U}$
- $\frac {\partial J_{\text{skip-gram}}} {\partial v_c}$
- $\frac {\partial J_{\text{skip-gram}}} {\partial v_w} \text {, where } w \ne c$

__Write your answers in terms of $\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial U}$ and $\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial v_c}$__

Answer:

$$\frac {\partial J_{\text{skip-gram}}} {\partial U} = \sum_{-m \le j \le m} {\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial U}}$$

$$\frac {\partial J_{\text{skip-gram}}} {\partial v_c} = \sum_{-m \le j \le m} {\frac {\partial {J(v_c, w_{t+j}, U)}} {\partial v_c}}$$

$$\frac {\partial J_{\text{skip-gram}}} {\partial v_w} = 0 \text {, for all } w \ne c $$
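
In code, these three results amount to accumulating the per-pair gradients over the window. The sketch below assumes the naive-softmax loss as the per-pair term and stores both center and outside vectors as matrix columns. The function names and the `(d, V)` layout are illustrative choices, not the assignment's starter-code API.

```python
import numpy as np

def naive_softmax_loss_and_grads(v_c, o, U):
    # U: (d, V) with outside vectors as columns; o: index of the outside word
    scores = U.T @ v_c
    scores = scores - np.max(scores)        # shift for numerical stability
    y_hat = np.exp(scores) / np.sum(np.exp(scores))
    loss = -np.log(y_hat[o])
    delta = y_hat.copy()
    delta[o] -= 1.0                         # y_hat - y
    grad_v_c = U @ delta                    # result of part (b)
    grad_U = np.outer(v_c, delta)           # result of part (c)
    return loss, grad_v_c, grad_U

def skipgram(center_idx, outside_idxs, V_center, U):
    # V_center: (d, V) with center vectors as columns (hypothetical layout)
    v_c = V_center[:, center_idx]
    loss = 0.0
    grad_V = np.zeros_like(V_center)        # columns for w != c stay zero
    grad_U = np.zeros_like(U)
    for o in outside_idxs:                  # one loss term per outside word
        l, g_vc, g_U = naive_softmax_loss_and_grads(v_c, o, U)
        loss += l
        grad_V[:, center_idx] += g_vc
        grad_U += g_U
    return loss, grad_V, grad_U

rng = np.random.default_rng(0)
V_center, U = rng.random((4, 5)), rng.random((4, 5))
loss, grad_V, grad_U = skipgram(2, [0, 1, 3, 4], V_center, U)
print(loss)
print(grad_V)
```

Only the column of `grad_V` belonging to the center word is ever written to, which is the third result: the loss does not depend on any other center vector $v_w$.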

## Coding: Implementing word2vec
__After 40,000 iterations, the script will finish and a visualization for your word vectors will appear. It will
also be saved as `word_vectors.png` in your project directory. Include the plot in your homework
write-up. Briefly explain in at most three sentences what you see in the plot.__

Result:
![SVD visualization of trained word2vec model](word_vectors.png)

We can see that words with opposite meanings are placed far away from each other. Moreover, if the difference in meaning between two pairs of words is similar, the pairs show a similar shift along X and Y (in other words, the difference between the two word vectors is similar for both pairs). For example, the vector drawn from "male" to "female" is close to the vector from "king" to "queen".
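
One way to quantify "a similar shift" is the cosine similarity between the two difference vectors. The coordinates below are made up for illustration only and are not read from `word_vectors.png`; with a trained model, the projected 2-D coordinates of the words would take their place.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two 1-D vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 2-D coordinates (NOT taken from the actual plot).
coords = {
    "male":   np.array([0.10, 0.20]),
    "female": np.array([0.30, 0.60]),
    "king":   np.array([0.50, 0.10]),
    "queen":  np.array([0.72, 0.48]),
}

shift_gender = coords["female"] - coords["male"]
shift_royal = coords["queen"] - coords["king"]
print(cosine(shift_gender, shift_royal))
```

A value close to 1 means the two shifts point in nearly the same direction, which is the pattern described above.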