# Word Embeddings

In NLP, your features are words. We encode these words as 'word embeddings,' which are dense vectors of real numbers. Each index in the vector represents a word in your vocabulary. <br><br>

An ASCII character representation could be stored, but this does not convey meaning. We can also try a one-hot encoding, but these are typically huge and do not encode for semantic similarity in addition to orthographic similarity. <br><br>

Thus, we must generate dense word embeddings. The fundamental assumption underlying these embeddings is the distributional hypothesis of linguistics, stating that words appearing in similar contexts have similar meanings. <br><br>

We use a neural network to learn latent semantic attributes, which will form a new dense vector. We use a normalized dot-product to find the cosine-similarity, which gives the angle between the two vectors. These new vectors are called word embeddings, and they efficiently encode semantic information which is not necessarily interpretable.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7fb6a41576d0>

In [2]:
word_idx = {"hello": 0, "world": 1}

# the embedding function takes two args: vocab size, embed dim.
embeddings = nn.Embedding(2, 5)

# Define index for each word, keys to lookup table
lookup_tensor = torch.tensor([word_idx["hello"]], dtype=torch.long)
hello_embedding = embeddings(lookup_tensor)
print(hello_embedding)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


In [12]:
# the embedding for hello
embeddings(torch.tensor(0))

tensor([ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519],
       grad_fn=<EmbeddingBackward0>)

In [11]:
# the embedding for world
embeddings(torch.tensor(1))

tensor([-0.1661, -1.5228,  0.3817, -1.0276, -0.5631],
       grad_fn=<EmbeddingBackward0>)

### N-Gram Language Modeling

In [13]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

In [14]:
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

In [15]:
ngrams = [
    (
        [test_sentence[i - j - 1] for j in range(CONTEXT_SIZE)],
        test_sentence[i]
    )
    for i in range(CONTEXT_SIZE, len(test_sentence))
]

In [17]:
ngrams[0:10]

[(['forty', 'When'], 'winters'),
 (['winters', 'forty'], 'shall'),
 (['shall', 'winters'], 'besiege'),
 (['besiege', 'shall'], 'thy'),
 (['thy', 'besiege'], 'brow,'),
 (['brow,', 'thy'], 'And'),
 (['And', 'brow,'], 'dig'),
 (['dig', 'And'], 'deep'),
 (['deep', 'dig'], 'trenches'),
 (['trenches', 'deep'], 'in')]

In [26]:
vocab = set(test_sentence)
word_idx = {word: i for i, word in enumerate(vocab)}

In [27]:
class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

In [28]:
losses = []
loss_function = nn.NLLLoss()

In [29]:
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

In [31]:
for epoch in range(10):
    total_loss = 0
    for context, target in ngrams:
        context_idx = torch.tensor([word_idx[w] for w in context], dtype=torch.long)
        model.zero_grad()
        
        log_probs = model(context_idx)
        loss = loss_function(log_probs, torch.tensor([word_idx[target]], dtype=torch.long))
        
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    losses.append(total_loss)
        

In [32]:
print(losses)

[518.5340082645416, 516.1016693115234, 513.6846833229065, 511.2812280654907, 508.8921465873718, 506.5155596733093, 504.14948534965515, 501.79528069496155, 499.44975066185, 497.1138048171997]


In [34]:
print(model.embeddings.weight[word_idx["beauty"]])

tensor([-0.4274,  1.2996, -1.0202, -0.8538, -1.3634, -0.1731,  1.5392, -1.1700,
        -1.0086, -1.1237], grad_fn=<SelectBackward0>)


### Continuous Bag of Words

Continuous context is provided; a few words before and after the target word (surrounding context). CBOW is non-sequential and not necessarily probabilistic, unliek language modeling. <br><br>

Typically used to quickly train word embeddings (pre-training), improving performance. The model tries to minimize the negative log probability of word i given the context C (words before and after word i). This is equivalent to the negative log softmax of the affine map of context word embeddings.

In [35]:
CONTEXT_SIZE = 2

In [36]:
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

In [37]:
vocab = set(raw_text)
vocab_size = len(vocab)

In [38]:
word_idx = {word: i for i, word in enumerate(vocab)}

In [39]:
data = [] 
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (
        [raw_text[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [raw_text[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = raw_text[i]
    data.append((context, target))

In [41]:
print(data[:3])

[(['are', 'We', 'to', 'study'], 'about'), (['about', 'are', 'study', 'the'], 'to'), (['to', 'about', 'the', 'idea'], 'study')]


In [80]:
def make_context_vector(context, word_idx):
    idxs = [word_idx[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

In [43]:
make_context_vector(data[0][0], word_idx)

tensor([42, 10, 34, 24])

In [142]:
CONTEXT_SIZE = 4
EMBEDDING_DIM = 20

In [143]:
class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

In [144]:
losses = []
loss_function = nn.NLLLoss()

In [145]:
model = CBOW(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.1)

In [146]:
for epoch in range(10):
    total_loss = 0
    for context, target in data:
        context_idx = make_context_vector(data[0][0], word_idx)
        model.zero_grad()
        
        log_probs = model(context_idx)
        loss = loss_function(log_probs, torch.tensor([word_idx[target]], dtype=torch.long))
        
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    losses.append(total_loss)
        

In [147]:
print(losses)

[230.36214447021484, 225.46933817863464, 223.80172324180603, 222.60078740119934, 221.75739216804504, 221.17100763320923, 220.76076102256775, 220.46780276298523, 220.25217199325562, 220.0878508090973]


In [148]:
print(model.embeddings.weight[word_idx["computational"]])

tensor([-1.4399, -0.5098, -0.6951, -0.6175, -0.6868, -1.4988,  0.6709,  0.7892,
         0.0599,  0.8013,  0.4626, -1.2088, -0.1457, -0.0925, -0.6336, -0.4352,
        -1.3568,  1.0663, -1.3798, -2.2706], grad_fn=<SelectBackward0>)
