# Word2Vec Implementation:
###### Yassin Bahid


It is said that the complete works of Shakespeare contain every single human emotion. What better source, then, for a word2vec dictionary? We focus on the Sonnets only and implement a word2vec algorithm with a standard sigmoid function. We then look at the similarity, or closeness, of certain words after training.

In [35]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine
from sklearn.decomposition import PCA

## Data:

We use the MIT database of Shakespeare's works: http://shakespeare.mit.edu/. We clean the data by removing "'s", commas, and apostrophes, replacing hyphens with spaces, stripping line breaks, lowercasing the text, and dropping stopwords. When deciding which words are close to each other, we treat each verse as a sentence, i.e., we do not consider the word at the end of a verse and the word at the beginning of the following verse to be "close". We also add negative examples by sampling words at random from the whole vocabulary. The data is not stemmed yet.

In [2]:
## Getting stopwords
with open('stopwords.txt') as f:
    stopwords = f.read().replace('\n',' ').split()


In [17]:
with open('TheSonnets.txt') as Shake_Sonnets:
    lines = Shake_Sonnets.readlines()
##cleaning the data:
for line_index in range(0, len(lines)):
    
    lines[line_index] = lines[line_index].replace("\n", "")
    lines[line_index] = lines[line_index].replace("-", " ")
    lines[line_index] = lines[line_index].replace("'s", "")
    lines[line_index] = lines[line_index].replace("  ", " ")
    lines[line_index] = lines[line_index].replace(",", "")
    lines[line_index] = lines[line_index].replace("'", "")
    
    if lines[line_index][0:2]=="  ":
        lines[line_index] = lines[line_index][2:]
    
    ## Dropping lines that contain any digit (the original range(0,9) skipped 9):
    if any(ch.isdigit() for ch in lines[line_index]):
        lines[line_index] = ""
    lines[line_index] = lines[line_index].lower()
    
    lines[line_index] = lines[line_index].split()
    
    lines[line_index] = [ wor for wor in lines[line_index] if wor not in stopwords]
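
The cleaning cell above removes only commas, apostrophes, and hyphens, so other punctuation (periods, parentheses, question marks) stays attached to words. A minimal sketch of a stricter cleaner is given below. The names `clean_line` and `PUNCT_TABLE` are hypothetical and not part of the notebook, and the sketch does not reproduce the step that drops lines containing digits.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_line(line, stopwords=frozenset()):
    """Lowercase a verse, drop possessive 's and punctuation, remove stopwords."""
    # Hyphens become spaces first so that compound words split into two tokens.
    line = line.lower().replace("-", " ").replace("'s", "")
    line = line.translate(PUNCT_TABLE)
    return [word for word in line.split() if word not in stopwords]


print(clean_line("  Shall I compare thee to a summer's day?", {"i", "a", "to"}))
```

Using `str.translate` handles all punctuation in one pass instead of one `replace` call per character.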

In [18]:
all_words = list(set([w for i in range(0,len(lines))  for w in lines[i]]))

In [19]:
## Creating the training data:

WINDOW_SIZE = 2
NUM_NEGATIVE_SAMPLES = 2

data = []

for ln in lines:
    for idx, center_word in enumerate(ln[WINDOW_SIZE-1:-WINDOW_SIZE+1]):
        ## Getting the context words:
        context_words = [context_word 
                             for context_word 
                             in ln[idx:idx+2*WINDOW_SIZE-1] 
                             if context_word != center_word]
        
        for context_word in context_words:
            data.append([center_word, context_word, 1])
        ## Getting negative samples for words not present around the center word:
        candidates = [w for w
                      in all_words[WINDOW_SIZE-1:-WINDOW_SIZE]
                      if w != center_word
                      and w not in context_words]
        if candidates:
            negative_samples = np.random.choice(candidates, NUM_NEGATIVE_SAMPLES)
            for negative_samp in negative_samples:
                data.append([center_word, negative_samp, 0])

sonnet_df = pd.DataFrame(columns=['center_word', 'context_word', 'label'], data=data)
words = np.intersect1d(sonnet_df.context_word, sonnet_df.center_word)
sonnet_df = sonnet_df[(sonnet_df.center_word.isin(words)) & (sonnet_df.context_word.isin(words))].reset_index(drop=True)
                                                                                                           

In [20]:
sonnet_df 

| | center_word | context_word | label |
|---|---|---|---|
| 0 | creatures | fairest | 1 |
| 1 | creatures | desire | 1 |
| 2 | creatures | heavenly | 0 |
| 3 | desire | creatures | 1 |
| 4 | desire | destroys | 0 |
| ... | ... | ... | ... |
| 15923 | water | rigour | 0 |
| 15924 | water | cools | 1 |
| 15925 | water | dully | 0 |
| 15926 | water | pursuit | 0 |


## Word2Vec Code:

We use the data generated in the previous section to create an embedding, or vector, representing each word. Words that appear close to each other should end up with vectors that are close to each other. All embeddings are normalized to unit length, so they lie on the unit sphere.

Each word has a main embedding and a context embedding. The first is the actual vector-space representation of the word. The second is the vector representation of the word when it appears as context. We use the context vectors to update the main embeddings. The score is the sigmoid of the dot product of the main and context vectors: it is higher when the two vectors are close and lower when they are far apart. Because both vectors have unit length, the dot product lies in [-1, 1], so the score stays between roughly 0.27 and 0.73 rather than reaching 0 or 1. The error is then the difference between the actual label and the score. We update the vectors by moving them closer to each other or further apart, depending on the sign of the error.

The benefit of this method is twofold. First, it moves the main embeddings and context embeddings of co-occurring words towards each other. Second, it moves main vectors that have similar context words closer to each other.
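
To make the update rule concrete, here is a minimal sketch of one update for a single (center, context) pair. The helper names `unit` and `update_pair` and the learning rate of 0.1 are assumptions made for illustration. The notebook's own function works on whole DataFrames at once.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def unit(v):
    """Rescale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)


def update_pair(main_vec, context_vec, label, learning_rate=0.1):
    """Apply one update to a single pair and re-normalize both vectors."""
    score = sigmoid(main_vec @ context_vec)   # predicted closeness
    error = label - score                     # > 0 for label 1, < 0 for label 0
    step = learning_rate * error * (context_vec - main_vec)
    # The main vector moves along +step, the context vector along -step.
    return unit(main_vec + step), unit(context_vec - step)


rng = np.random.default_rng(0)
m = unit(rng.normal(size=5))
c = unit(rng.normal(size=5))

m_pos, c_pos = update_pair(m, c, label=1)   # positive pair: pulled together
m_neg, c_neg = update_pair(m, c, label=0)   # negative pair: pushed apart

print("before:", m @ c)
print("after positive update:", m_pos @ c_pos)
print("after negative update:", m_neg @ c_neg)
```

With a positive label the cosine similarity of the pair goes up, and with a negative label it goes down, which is the behaviour described in the paragraphs above.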

In [31]:
## Training the embeddings

In [32]:
def sigmoid(v, scale=1):
    return 1 / (1 + np.exp(-scale*v))

In [34]:
def normalize_data(data):
    #rescale every row (word vector) to unit length
    #note: this helper is not defined in the cells shown; it mirrors the row
    #normalization used when the embeddings are initialized
    row_norms = np.sqrt((data.values**2).sum(axis=1)).reshape(-1,1)
    return pd.DataFrame(data=data.values / row_norms, index=data.index)

def update_embeddings(df, main_embeddings, context_embeddings, learning_rate, debug=False):
    #note: debug is accepted but not used in this function
    
    #get differences between main embeddings and corresponding context embeddings
    main_embeddings_center = main_embeddings.loc[df.center_word].values
    context_embeddings_context = context_embeddings.loc[df.context_word].values
    diffs = context_embeddings_context - main_embeddings_center
    
    #get similarities, scores, and errors between main embeddings and corresponding context embeddings
    dot_prods = np.sum(main_embeddings_center * context_embeddings_context, axis=1)
    scores = sigmoid(dot_prods)
    errors = (df.label - scores).values.reshape(-1,1)
    
    #calculate updates and sum them per word (only numeric columns are summed)
    updates = diffs*errors*learning_rate
    updates_df = pd.DataFrame(data=updates)
    updates_df_center = updates_df.groupby(df.center_word.values).sum()
    updates_df_context = updates_df.groupby(df.context_word.values).sum()
    
    #apply updates
    main_embeddings += updates_df_center.loc[main_embeddings.index]
    context_embeddings -= updates_df_context.loc[context_embeddings.index]
    
    #normalize embeddings
    main_embeddings = normalize_data(main_embeddings)
    context_embeddings = normalize_data(context_embeddings)
    
    #return the updated embeddings
    return main_embeddings, context_embeddings

In [26]:
EMBEDDING_SIZE = 50

main_embeddings = np.random.normal(0,0.1,(len(words), EMBEDDING_SIZE))
row_norms = np.sqrt((main_embeddings**2).sum(axis=1)).reshape(-1,1)
main_embeddings = main_embeddings / row_norms

context_embeddings = np.random.normal(0,0.1,(len(words), EMBEDDING_SIZE))
row_norms = np.sqrt((context_embeddings**2).sum(axis=1)).reshape(-1,1)
context_embeddings = context_embeddings / row_norms

main_embeddings = pd.DataFrame(data=main_embeddings, index=words)
context_embeddings = pd.DataFrame(data=context_embeddings, index=words)
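
The cells shown here do not include the loop that repeatedly applies the update to the embeddings. The sketch below shows what such a loop could look like on a toy dataset. Everything in it is an assumption made for illustration: the toy pairs, the helper names `normalize_rows` and `train_step`, and the values `NUM_EPOCHS = 25` and `LEARNING_RATE = 0.1` are hypothetical and are not taken from the notebook.

```python
import numpy as np
import pandas as pd


def sigmoid(v):
    return 1 / (1 + np.exp(-v))


def normalize_rows(frame):
    """Rescale every row of a DataFrame to unit Euclidean length."""
    norms = np.linalg.norm(frame.values, axis=1, keepdims=True)
    return pd.DataFrame(frame.values / norms, index=frame.index)


def train_step(df, main, context, learning_rate):
    """One full-batch update of both embedding tables."""
    m = main.loc[df.center_word].values
    c = context.loc[df.context_word].values
    errors = (df.label.values - sigmoid((m * c).sum(axis=1))).reshape(-1, 1)
    updates = pd.DataFrame((c - m) * errors * learning_rate)
    # Sum per-pair updates for each word; words without pairs get a zero update.
    center_upd = updates.groupby(df.center_word.values).sum()
    center_upd = center_upd.reindex(main.index, fill_value=0.0)
    context_upd = updates.groupby(df.context_word.values).sum()
    context_upd = context_upd.reindex(context.index, fill_value=0.0)
    return normalize_rows(main + center_upd), normalize_rows(context - context_upd)


# Toy stand-in for sonnet_df (hypothetical pairs, not real notebook data).
toy_df = pd.DataFrame(
    [["love", "mistress", 1], ["loves", "mistress", 1],
     ["love", "weeds", 0], ["loves", "weeds", 0],
     ["mistress", "love", 1], ["weeds", "love", 0]],
    columns=["center_word", "context_word", "label"],
)
toy_words = sorted(set(toy_df.center_word) | set(toy_df.context_word))

rng = np.random.default_rng(0)
main = normalize_rows(pd.DataFrame(rng.normal(0, 0.1, (len(toy_words), 8)), index=toy_words))
context = normalize_rows(pd.DataFrame(rng.normal(0, 0.1, (len(toy_words), 8)), index=toy_words))

NUM_EPOCHS = 25       # assumed value
LEARNING_RATE = 0.1   # assumed value
for _ in range(NUM_EPOCHS):
    main, context = train_step(toy_df, main, context, LEARNING_RATE)

print(main.round(3))
```

Re-normalizing after every step keeps each vector on the unit sphere, so the dot products used for the scores remain cosine similarities throughout training.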

## Results:


In [27]:
L = []
for w1 in words:
    for w2 in words:
        if w1 != w2:
            sim = 1 - cosine(main_embeddings.loc[w1], main_embeddings.loc[w2])
            L.append((w1,w2,sim))
sorted([item for item in L if item[0] == 'love'], key=lambda t: -t[2])[:10]

[('love', 'loves', 0.47626850299071566),
 ('love', 'saturn', 0.4304067248743966),
 ('love', 'quill', 0.42985441939037183),
 ('love', 'mistress', 0.4246136590489624),
 ('love', 'temptation', 0.422496543854154),
 ('love', 'willed', 0.4064902775730783),
 ('love', 'less', 0.39182032918226883),
 ('love', 'wink', 0.36580739655958316),
 ('love', 'feathered', 0.3559349176205441),
 ('love', 'soundless', 0.35249907387580737)]

In [28]:
sorted([item for item in L if item[0] == 'love'], key=lambda t: -t[2])[-10:]

[('love', 'replete', -0.3608068036960259),
 ('love', 'argument', -0.36246532523482977),
 ('love', 'shalt', -0.36349737452243347),
 ('love', 'believed', -0.36554522130009204),
 ('love', 'travail', -0.36895822444962434),
 ('love', 'weeds', -0.3727390968083115),
 ('love', 'carve', -0.37726877337258946),
 ('love', 'express', -0.38897830275574696),
 ('love', 'concealed', -0.396334234335856),
 ('love', 'pencil)', -0.4374159622020506)]


We focus here on one word: love. We look at the 10 words closest to it and the 10 furthest from it. Interestingly, "loves" is the closest word. While these two words are unlikely to appear next to each other, they do have very similar context words. One possible way to merge such variants is stemming, which reduces every word to its root. We can also see that love, for Shakespeare, is most closely related to temptation, mistress, and Saturn, while "concealed" is among the words furthest from it. Saturn is often paired with Venus, symbolizing love, in Shakespearean literature. The entry "pencil)" also shows that some punctuation survives the cleaning step.
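
The nearest-word lookup above loops over every pair of words and calls `cosine` once per pair. A minimal sketch of a vectorized alternative is given below. The function name `most_similar` and the toy vectors are hypothetical. The idea is that once the rows are scaled to unit length, cosine similarity reduces to a dot product.

```python
import numpy as np
import pandas as pd


def most_similar(embeddings, word, top_n=10):
    """Return the top_n words closest to `word` by cosine similarity."""
    vectors = embeddings.values
    # Scale rows to unit length so that a dot product equals cosine similarity.
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit_vectors[embeddings.index.get_loc(word)]
    sims = pd.Series(unit_vectors @ query, index=embeddings.index)
    # Drop the query word itself before ranking.
    return sims.drop(word).sort_values(ascending=False).head(top_n)


# Toy vectors for illustration only (not trained embeddings).
toy = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]],
    index=["love", "loves", "quill", "weeds"],
)
print(most_similar(toy, "love", top_n=2))
```

Passing `ascending=True` to `sort_values` would give the furthest words instead.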