<a href="https://colab.research.google.com/github/fkhafizov/w2v_intro/blob/main/w2v_skpgr_noW1_v6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# w2v_skpgr_noW1_v6.ipynb
#  2021.08.19
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
import time
import matplotlib.pyplot as plt

# Introduction to Skip-gram

### The Skip-gram algorithm was introduced in [1] and improved in [2]
* [1] [Mikolov et al., 2013.09](https://arxiv.org/pdf/1301.3781.pdf)
* [2] [Mikolov et al., 2013.10](https://arxiv.org/pdf/1310.4546.pdf)

The optimization problem for the **baseline algorithm**, described in Section 2 of [2], is formulated as follows.

### Notation 

Given a text with vocabulary $V$ of size $|V|$, a window size $c$ ($c=2$ in our example below), and an embedding-space dimension (2 in our example), we construct a set of training word pairs $(w_{center}, w_{context})$.

$idx(w) := $ index of word $w$ in $V$.

$W_0$ is a $|V| \times 2$ matrix. Rows are embeddings of center words.

$W_1$ is a $2 \times |V|$ matrix. Columns are embeddings of context words.

$x:=W_0[idx(w_{center}),:]$ is a 2D row-vector = embedding of $w_{center}$ (also denoted as $w_x$).



$y':=W_1[:,idx(w_{context})]$ is a 2D column-vector =  embedding of $w_{context}=w_y$.


$x\cdot W_1$ is a row-vector of length $|V|$. Its $v$-th entry is the dot product $x\cdot y_v'$, where $y_v'$ is the $v$-th column of $W_1$.
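
To make the shapes in this notation concrete, here is a minimal NumPy sketch. The vocabulary size of 5, the random values, and the two indices are illustrative assumptions, not values from the example in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed; the values are illustrative only
V, D = 5, 2                          # assumed toy sizes: |V| = 5, embedding dimension 2

W0 = rng.normal(size=(V, D))         # rows are center-word embeddings
W1 = rng.normal(size=(D, V))         # columns are context-word embeddings

center_ix, context_ix = 1, 3         # hypothetical idx(w_center) and idx(w_context)
x = W0[center_ix, :]                 # embedding of the center word, shape (2,)
y_prime = W1[:, context_ix]          # embedding of the context word, shape (2,)

scores = x @ W1                      # shape (5,): one dot product per vocabulary word
print(scores.shape)
print(np.isclose(scores[context_ix], x @ y_prime))
```

The entry of `scores` at position `context_ix` is exactly the dot product $x\cdot y'$ used in the model below.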






### Skip-Gram Model

$$P(w_{context} | w_{center}) = P(w_y|w_x) := \frac{\exp(x\cdot y')}{\sum_{v=1}^{|V|}\exp(x\cdot y'_v)}
$$
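
As a small worked example of the softmax above, the sketch below turns a list of dot products into probabilities using only the standard library. The three scores are made-up numbers for a hypothetical 3-word vocabulary.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)                              # subtracting the max avoids overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical dot products x . y'_v for a 3-word vocabulary
scores = [2.0, 0.0, -1.0]
probs = softmax(scores)
print(probs)
```

A larger dot product gives a larger probability, and the probabilities sum to 1 over the vocabulary.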

\begin{equation} \label{eq1}
\begin{split}
U(W_0, W_1) & := \frac{1}{T} \sum_{t=1}^T \sum_{-c\le j \le c, j\ne 0} \log P(y_{t+j}|x_t) 
\end{split}
\end{equation}

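Given a sequence of training words $w_1, w_2, w_3, \dots, w_T$, where $x_t$ is the embedding of the center word $w_t$ and $y_{t+j}$ is the embedding of the context word $w_{t+j}$, the objective of the Skip-gram model is to maximize $U(W_0, W_1)$: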
$$\max_{W_0, W_1} U(W_0, W_1)  $$
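
For a single training pair, the gradient that drives this maximization follows directly from the softmax definition. The step below is a standard derivation written in the notation above; $w_v$ denotes the $v$-th vocabulary word, a shorthand assumed here.

$$\log P(w_y|w_x) = x\cdot y' - \log\sum_{v=1}^{|V|}\exp(x\cdot y'_v)$$

$$\frac{\partial}{\partial x}\log P(w_y|w_x) = y' - \sum_{v=1}^{|V|} P(w_v|w_x)\, y'_v$$

The gradient pulls $x$ toward the observed context embedding $y'$ and away from the probability-weighted average of all context embeddings.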

---

### Note: 

A common question is whether two weight matrices, $W_0$ and $W_1$, are really necessary. It turns out that one matrix is sufficient, as our example below illustrates. The code below therefore implements the simplified model defined next.

### Simplified Skip-Gram Model (with only one weight matrix)

Let $\displaystyle  W_1:=W_0^T$ (the transpose, so that the shapes match the notation above), let $T$ be the size of the set of training word pairs, and define

$$U=U(W_0)  := \frac{1}{T} \sum_{t=1}^T \log P(y_t|x_t)$$

Then the objective is  to maximize $U(W_0)$.
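
The following NumPy sketch evaluates this simplified objective for a tiny case. It assumes the tied weights are applied as $W_0^T$ so that the shapes agree with the notation above; the 3-word embedding matrix and the two index pairs are hypothetical.

```python
import numpy as np

def log_softmax(scores):
    """Numerically stable log-softmax of a 1-D array."""
    shifted = scores - scores.max()
    return shifted - np.log(np.exp(shifted).sum())

def objective(W0, pairs):
    """Average log-probability over (center_ix, context_ix) pairs with tied weights."""
    total = 0.0
    for center_ix, context_ix in pairs:
        scores = W0 @ W0[center_ix]          # dot product of x with every row of W0
        total += log_softmax(scores)[context_ix]
    return total / len(pairs)

W0 = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])   # hypothetical 3-word embedding
pairs = [(0, 1), (1, 0)]                                # hypothetical training pairs
print(objective(W0, pairs))
```

One consequence of tying the weights is that the score vector includes the dot product of the center word with itself, $x\cdot x$, which a two-matrix model does not have.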






# Auxiliary Functions

In [None]:
def remove_stop_words(corpus):
    # Drop every occurrence of each stop word from each sentence.
    # (list.remove() would delete only the first occurrence.)
    stop_words = {'in','very','are','the','to','of','is', 'a', 'and',
                  'on','will', 'be', 'и', 'он', 'этот', 'она'}
    results = []
    for text in corpus:
        tmp = [w for w in text.split(' ') if w not in stop_words]
        results.append(" ".join(tmp))
    return results


def plot_words(W, vocab, ttl, wcolor):
    # plot words according to their embedding
    x1 = W[:,0]
    x2 = W[:,1]
    x_axis_min, x_axis_max = np.min(x1)-get_padding(x1), np.max(x1)+get_padding(x1)
    y_axis_min, y_axis_max = np.min(x2)-get_padding(x2), np.max(x2)+get_padding(x2)

    # Size this figure directly; changing rcParams after the figure exists has no effect on it
    figsz = 6
    fig, ax = plt.subplots(figsize=(figsz, figsz))

    plt.scatter(x1, x2, c='yellow', s=500, alpha=1.0)
    for ix, word, x1i, x2i in zip(range(len(wcolor)), vocab, x1, x2):
        ax.annotate(word, (x1i,x2i ), fontsize=18, color=wcolor[ix])

    # Plot Center
    plt.scatter([0], [0], c='g', marker='+', s=500, alpha=0.9)     
    plt.grid()
    plt.xlim(x_axis_min,x_axis_max)
    plt.ylim(y_axis_min,y_axis_max)
    
    # We change the fontsize of minor ticks label 
    ax.tick_params(axis='both', which='major', labelsize=20)
    ax.tick_params(axis='both', which='minor', labelsize=18)
    
    plt.title(ttl, fontsize=18)

    #     fig.savefig(fname=figfn, formatstr='png')
    plt.show()

def get_padding(x):
    return 2*(np.max(x)-np.min(x))/10

# Prepare Text

In [None]:
text = "Fish swim in deep water. Ocean is very deep. Fish swim in darkness. \
          Birds are high in the sky. Birds fly very high. On a sunny day the sky is full of light."

# text = "Fish and water. Birds and sky."

# The text should have an even number of sentences; this helps with the illustration.
# Words from the first half of the sentences are plotted in 'red',
# words from the second half in 'cyan'.
print('TEXT = ', text)

corpus=[s.lower().strip() for s in text.split('.')][:-1]
corpus = remove_stop_words(corpus)
print('CLEAN CORPUS = ',corpus)
sentences = [s.split(' ') for s in corpus]
print('SENTENCES = ',sentences)

# GET VOCAB
vocab=[]
for ss in sentences:
  # print(ss)
  vocab += ss
    
vocab = sorted(set(vocab))
print('VOCABULARY = ', vocab)


# DICs mapping each word to a number and vice-versa
word2idx = {w: idx for (idx, w) in enumerate(vocab)}
idx2word = {idx: w for (idx, w) in enumerate(vocab)}
print('word2idx=', word2idx)
print('idx2word=', idx2word)


# COUNT WORDS
vocab_word_count = np.zeros(len(vocab))
for ss in sentences:
    for word in ss:
        # dictionary lookup instead of a linear search with vocab.index()
        vocab_word_count[word2idx[word]] += 1
print('vocab_word_count = ',vocab_word_count)


# Create master DF for our data: one row per vocabulary word, with its plot color
# ASSUME: the corpus has an even number of sentences.
# Words from the first half of the sentences are 'red', words from the second half 'cyan'.
rows = []
for i, ss in enumerate(sentences):
    wordcolor = 'red' if i<len(sentences)/2 else 'cyan'
    for w in ss:
        rows.append([w, wordcolor])

total_number_of_words = len(rows)
vocab_df = pd.DataFrame(rows, columns=['vocab','color'])
# Keep one row per word; if a word occurs in both halves, its first color wins,
# so the row count always matches len(vocab)
vocab_df = vocab_df.drop_duplicates(subset='vocab').sort_values(by='vocab').reset_index(drop=True)
vocab_df['count']=vocab_word_count
vocab_df['freq'] = vocab_df['count']/total_number_of_words

TEXT =  Fish swim in deep water. Ocean is very deep. Fish swim in darkness.           Birds are high in the sky. Birds fly very high. On a sunny day the sky is full of light.
CLEAN CORPUS =  ['fish swim deep water', 'ocean deep', 'fish swim darkness', 'birds high sky', 'birds fly high', 'sunny day sky full light']
SENTENCES =  [['fish', 'swim', 'deep', 'water'], ['ocean', 'deep'], ['fish', 'swim', 'darkness'], ['birds', 'high', 'sky'], ['birds', 'fly', 'high'], ['sunny', 'day', 'sky', 'full', 'light']]
VOCABULARY =  ['birds', 'darkness', 'day', 'deep', 'fish', 'fly', 'full', 'high', 'light', 'ocean', 'sky', 'sunny', 'swim', 'water']
word2idx= {'birds': 0, 'darkness': 1, 'day': 2, 'deep': 3, 'fish': 4, 'fly': 5, 'full': 6, 'high': 7, 'light': 8, 'ocean': 9, 'sky': 10, 'sunny': 11, 'swim': 12, 'water': 13}
idx2word= {0: 'birds', 1: 'darkness', 2: 'day', 3: 'deep', 4: 'fish', 5: 'fly', 6: 'full', 7: 'high', 8: 'light', 9: 'ocean', 10: 'sky', 11: 'sunny', 12: 'swim', 13: 'water'}
vocab_wor
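
As a cross-check on the counting step, the same counts can be obtained with `collections.Counter`. This is only an alternative sketch; the sentence list is copied from the SENTENCES output above.

```python
from collections import Counter

# Tokenized sentences copied from the SENTENCES output above
sentences = [['fish', 'swim', 'deep', 'water'], ['ocean', 'deep'],
             ['fish', 'swim', 'darkness'], ['birds', 'high', 'sky'],
             ['birds', 'fly', 'high'], ['sunny', 'day', 'sky', 'full', 'light']]

word_counts = Counter(w for s in sentences for w in s)   # word -> number of occurrences
vocab = sorted(word_counts)                              # same ordering as sorted(set(...))
total = sum(word_counts.values())                        # total number of tokens

print(total, word_counts['deep'], word_counts['ocean'])
print({w: word_counts[w] / total for w in vocab})        # relative frequencies
```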

# Build Training Data Word Pairs

In [None]:
# initialize dictionary of context words
context_words = {}
for w in vocab:
    context_words[w] = []

WINDOW_SIZE = 2
data = []
for sentence in sentences:
    for idx, word in enumerate(sentence):
        for neighbor in sentence[max(idx - WINDOW_SIZE, 0) :\
                                 min(idx + WINDOW_SIZE, len(sentence)) + 1] : 
            if neighbor != word:
                data.append([word, neighbor])
                context_words[word].append(neighbor)

df_context_words = pd.DataFrame({'vocab': list(context_words.keys()),   \
                                 'c_words': list(context_words.values())})

df_context_words['num_c_words_pairs'] = [ len(wrds) for wrds in list(context_words.values()) ]
df_context_words['set_c_words'] = [ set(wrds) for wrds in list(context_words.values()) ]
df_context_words['num_c_words'] = [ len(set(wrds)) for wrds in list(context_words.values()) ]
vocab_df2 = vocab_df.merge( df_context_words, how='outer',on=['vocab'] )
tr_pairs = pd.DataFrame(data, columns = ['center', 'context'])
tr_pairs.head()

Unnamed: 0,center,context
0,fish,swim
1,fish,deep
2,swim,fish
3,swim,deep
4,swim,water
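
The window logic can also be written as a small standalone function. The sketch below is a variant, not the cell above: it skips the center word by position rather than by comparing word strings, so a word repeated inside one sentence would still be paired with its other occurrence. No sentence in this corpus repeats a word, so both versions give the same pairs here. The name `make_pairs` is hypothetical.

```python
def make_pairs(sentence, window_size=2):
    """Return [center, context] pairs, skipping only the center position itself."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(i - window_size, 0)                      # first index inside the window
        hi = min(i + window_size, len(sentence) - 1)      # last index inside the window
        for j in range(lo, hi + 1):
            if j != i:
                pairs.append([center, sentence[j]])
    return pairs

pairs = make_pairs(['fish', 'swim', 'deep', 'water'])
print(pairs[:5])
```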


In [None]:
vocab_df2

Unnamed: 0,vocab,color,count,freq,c_words,num_c_words_pairs,set_c_words,num_c_words
0,birds,cyan,2.0,0.1,"[high, sky, fly, high]",4,"{fly, high, sky}",3
1,darkness,red,1.0,0.05,"[fish, swim]",2,"{swim, fish}",2
2,day,cyan,1.0,0.05,"[sunny, sky, full]",3,"{sunny, full, sky}",3
3,deep,red,2.0,0.1,"[fish, swim, water, ocean]",4,"{swim, water, fish, ocean}",4
4,fish,red,2.0,0.1,"[swim, deep, swim, darkness]",4,"{swim, deep, darkness}",3
5,fly,cyan,1.0,0.05,"[birds, high]",2,"{birds, high}",2
6,full,cyan,1.0,0.05,"[day, sky, light]",3,"{light, day, sky}",3
7,high,cyan,2.0,0.1,"[birds, sky, birds, fly]",4,"{birds, fly, sky}",3
8,light,cyan,1.0,0.05,"[sky, full]",2,"{full, sky}",2
9,ocean,red,1.0,0.05,[deep],1,{deep},1


# Simplified Skip-Gram Model (no W1)

In [None]:
EMBEDDING_DIMENSION = 2
# Single weight matrix: row i is the embedding of vocab[i].
# torch.autograd.Variable is deprecated; a tensor with requires_grad=True is enough.
W0 = torch.randn(len(vocab), EMBEDDING_DIMENSION, requires_grad=True)
W0_init = W0.detach().clone()  # snapshot of the initial embedding for plotting

In [None]:
%matplotlib notebook
plot_words(W0_init.cpu().detach().numpy(), vocab, ttl='Initial embedding', wcolor=vocab_df.color)


# Main Loop


In [None]:
U_array = []

In [None]:
EPOCHS = 1000
LEARNING_RATE = 0.001
pair_ix = 24  # index of the training pair whose progress is printed
sttime = time.time()

# PyTorch training: stochastic gradient ascent on the log-likelihood
for epoch in range(EPOCHS):
    U = 0  # running sum of log-probabilities over the pairs (not divided by T)
    for i in range(len(tr_pairs)):

        w_center     = tr_pairs.center[i]
        w_context    = tr_pairs.context[i]

        w_center_ix  = word2idx[w_center]
        w_context_ix = word2idx[w_context]

        # Scores: dot product of the center embedding with every row of W0
        w_center_embed = W0[w_center_ix, :]
        dot_products = torch.matmul( w_center_embed, torch.transpose(W0, 0, 1) )
        log_softmax = F.log_softmax(dot_products, dim=0)

        # log P(w_context | w_center) for this pair
        U_increment = log_softmax[w_context_ix]
        U += U_increment.item()

        # Progress print: U here is the partial sum up to pair i within this epoch
        if (epoch % 300 == 0) and \
           (w_center  == tr_pairs.center[pair_ix]) and \
           (w_context == tr_pairs.context[pair_ix]):
            U_incr_val = round(U_increment.item(), 3)
            if epoch == 0:
                print(f'    w_center={w_center}, w_center_ix={w_center_ix}',
                      f'  w_context={w_context}, w_context_ix={w_context_ix} ')
            print( f'  * epoch={epoch},  i={i}, U=',np.round(U,4),
                 '   U_increment=',  U_incr_val  )

        # Gradient ascent step: move W0 along the gradient of U_increment
        U_increment.backward()
        with torch.no_grad():
            W0 += LEARNING_RATE * W0.grad
        W0.grad.zero_()

    U_array.append(U)

endtime = time.time()
ttl = f'Epochs={len(U_array)}, LR={LEARNING_RATE}, time={int(endtime-sttime)} sec'

print('Finished training with: U=',np.round(U,4),'\n', ttl)

    w_center=birds, w_center_ix=0   w_context=fly, w_context_ix=5 
  * epoch=0,  i=24, U= -81.629    U_increment= -3.354
  * epoch=300,  i=24, U= -55.9912    U_increment= -2.906
  * epoch=600,  i=24, U= -47.254    U_increment= -2.106
  * epoch=900,  i=24, U= -43.8685    U_increment= -1.73
Finished training with: U= -75.1981 
 Epochs=1000, LR=0.001, time=34 sec
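
The training loop relies on autograd for the gradient. For readers who want to see what that gradient looks like with a single tied matrix, here is a NumPy sketch of the analytic gradient of one pair's log-probability, followed by one gradient-ascent step. The 4-word random matrix, the pair (0, 2), and the function names are illustrative assumptions; this is a sketch for checking intuition, not the code used above.

```python
import numpy as np

def log_prob(W0, center_ix, context_ix):
    """log P(context | center) with tied weights (scores = W0 @ W0[center_ix])."""
    scores = W0 @ W0[center_ix]
    shifted = scores - scores.max()
    return shifted[context_ix] - np.log(np.exp(shifted).sum())

def grad_log_prob(W0, center_ix, context_ix):
    """Analytic gradient of log_prob with respect to every entry of W0."""
    x = W0[center_ix]
    scores = W0 @ x
    p = np.exp(scores - scores.max())
    p /= p.sum()                         # softmax probabilities p_v
    coeff = -p
    coeff[context_ix] += 1.0             # (1 if v is the context word else 0) - p_v
    grad = np.outer(coeff, x)            # row v of W0 enters score v as the context vector
    grad[center_ix] += coeff @ W0        # row center_ix enters every score as x
    return grad

rng = np.random.default_rng(0)
W0 = rng.normal(size=(4, 2))             # hypothetical 4-word vocabulary
before = log_prob(W0, 0, 2)
W0_new = W0 + 0.001 * grad_log_prob(W0, 0, 2)   # one gradient-ascent step
after = log_prob(W0_new, 0, 2)
print(before, after)
```

Because the same matrix plays both roles, the row of the center word receives two contributions: one as the vector $x$ and one as an ordinary row of the score matrix.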


In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
# Set the size on this figure directly; rcParams changed after creation would not apply to it
fig, ax = plt.subplots(figsize=(6, 6))

epochs = range(1, len(U_array)+1)
plt.plot(epochs, U_array, 'bo')

plt.title(ttl, fontsize=18)
plt.xlabel('EPOCH', fontsize=18)
plt.ylabel('U', fontsize=18)
plt.grid()
ax.tick_params(axis='both', which='major', labelsize=18)
ax.tick_params(axis='both', which='minor', labelsize=18)

vocab_df['embedding']=[v for v in  W0.cpu().detach().numpy()]
vocab_df


Unnamed: 0,vocab,color,count,freq,embedding
0,birds,cyan,2.0,0.1,"[-0.6278515, -1.5386766]"
1,darkness,red,1.0,0.05,"[-0.9173088, 1.3822806]"
2,day,cyan,1.0,0.05,"[1.4878625, -0.6285384]"
3,deep,red,2.0,0.1,"[0.27573723, 1.8041941]"
4,fish,red,2.0,0.1,"[-0.6531924, 1.7705522]"
5,fly,cyan,1.0,0.05,"[-1.038932, -1.0519396]"
6,full,cyan,1.0,0.05,"[1.2515254, -0.9088243]"
7,high,cyan,2.0,0.1,"[-0.6285649, -1.5387623]"
8,light,cyan,1.0,0.05,"[0.9479898, -0.93907964]"
9,ocean,red,1.0,0.05,"[0.54138255, 1.1440163]"
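
One way to read the learned embeddings is to compare directions with cosine similarity. The sketch below uses only the standard library and four rows copied from the table above; the helper name `cosine` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Embeddings copied from the table above
emb = {
    'birds': [-0.6278515, -1.5386766],
    'high':  [-0.6285649, -1.5387623],
    'fish':  [-0.6531924, 1.7705522],
    'deep':  [0.27573723, 1.8041941],
}

print(round(cosine(emb['birds'], emb['high']), 3))   # two 'cyan' words
print(round(cosine(emb['fish'], emb['deep']), 3))    # two 'red' words
print(round(cosine(emb['birds'], emb['fish']), 3))   # one 'cyan', one 'red'
```

For these rows, words from the same half of the text point in similar directions, while `birds` and `fish` point in roughly opposite ones.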


In [None]:
%matplotlib notebook
plot_words(W0.cpu().detach().numpy(), vocab, ttl=ttl, wcolor=vocab_df.color)
