# Embeddings

Embeddings map the discrete values of a feature variable into a continuous vector space. This notebook works with an embedding to better understand how embeddings work... We start by importing everything necessary and defining key variables...

In [13]:
import os
import numpy as np
import matplotlib.pyplot as plt
from pymagnitude import Magnitude
from scipy import spatial
from sklearn.manifold import TSNE

import torch
import torch.nn as nn

In [2]:
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

RED, BLUE = '#FF4136', '#0074D9'

## Turning words to numbers

We want a mathematical function to work on our input data, as discussed in notebook 1. To do so, we map our data (in this case, text) to numeric quantities. A sentence is split into words as shown below. Each word is then assigned a number, and the resulting list of numbers is finally converted into a Tensor...

In [6]:
sentence = 'Why are you running? No! Come back here! Charlie!'
words = sentence.split()
words

['Why', 'are', 'you', 'running?', 'No!', 'Come', 'back', 'here!', 'Charlie!']

In [7]:
wordToIdx = {word : i for i,word in enumerate(sorted(set(words)))}
wordToIdx

{'Charlie!': 0,
 'Come': 1,
 'No!': 2,
 'Why': 3,
 'are': 4,
 'back': 5,
 'here!': 6,
 'running?': 7,
 'you': 8}

In [10]:
idxs = torch.LongTensor([wordToIdx[word] for word in sentence.split()])
idxs

tensor([3, 4, 8, 7, 2, 1, 5, 6, 0])
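Note that a plain split() keeps punctuation attached to the words ('running?', 'No!'), so 'running?' and 'running' would get different indices. A minimal sketch of one alternative, assuming a simple regular expression is good enough for this sentence (the names tokens and token_to_idx are illustrative, not part of the notebook):

```python
import re

sentence = 'Why are you running? No! Come back here! Charlie!'

# \w+ matches a run of letters/digits; [^\w\s] matches a single punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Build the vocabulary the same way as before: sorted unique tokens -> index
token_to_idx = {token: i for i, token in enumerate(sorted(set(tokens)))}
idxs = [token_to_idx[token] for token in tokens]

print(tokens)
print(idxs)
```

With this split, punctuation marks become tokens of their own, and every '!' maps to the same index.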

## Understanding Embeddings and Pymagnitude

A word embedding is a learned representation for text in which words with similar meanings have similar representations. Each word is represented by a real-valued vector, often with tens or hundreds of dimensions. This contrasts with the thousands or millions of dimensions required for sparse word representations, such as a one-hot encoding.
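To make the contrast with one-hot encoding concrete, here is a small sketch using numpy. The embedding table is filled with random numbers as a stand-in for learned weights, and the size of 4 dimensions is an arbitrary choice for illustration:

```python
import numpy as np

vocab_size, n_dims = 9, 4   # 9 words as in the sentence above; 4 dims chosen arbitrarily
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, n_dims))  # stand-in for learned weights

idxs = np.array([3, 4, 8, 7, 2, 1, 5, 6, 0])  # the word indices computed earlier

# Sparse route: one row of length vocab_size per word, with a single 1 in it
one_hot = np.eye(vocab_size)[idxs]            # shape (9, 9)
dense_via_matmul = one_hot @ embedding_table  # shape (9, 4)

# Dense route: simply pick the rows of the table
dense_via_lookup = embedding_table[idxs]      # shape (9, 4)

print(np.allclose(dense_via_matmul, dense_via_lookup))
```

Both routes give the same vectors, which is why an embedding layer can be thought of as a row lookup rather than a multiplication with a huge sparse matrix.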

To put it simply, a word embedding learns patterns in how words are used and produces a vector for each word. Since an embedding is itself a model, you can either:
1. Train your own Word Embedding
2. Use a Pre-trained Word Embedding

Some of the most widely used pre-trained word embeddings are provided by Google, Facebook and Stanford NLP. They are openly available for download and use. We will be using the "GloVe" embedding by Stanford NLP, but instead of loading the vanilla embedding, we will use the "magnitude" file of the same.

"Magnitude" is a library that allows for more memory efficient vector loading. Their library can be installed via pip
by typing:
pip install pymagnitude

After that, you can download any of the pre-converted embeddings listed here:
https://github.com/plasticityai/magnitude#pre-converted-magnitude-formats-of-popular-embeddings-models


We will be using **GloVe 6B-50D magnitude lemmatized by plasticity** as our embedding. The "6B" means it was trained on a corpus of roughly 6 billion tokens (Wikipedia plus Gigaword news text), while "50D" specifies that each word vector returned by this embedding has 50 dimensions. More on word embeddings can be read here:
https://machinelearningmastery.com/what-are-word-embeddings/

In [23]:
# To use Magnitude, we simply create an object of the class with the magnitude filepath...
# And then we use the query() method on the object while passing our tokens(or list of words)...

glove_vectors = Magnitude(os.path.join(os.getcwd(),"embeddings","glove-lemmatized.6B.50d.magnitude"))
glove_embeddings = glove_vectors.query(words)

In [24]:
# This returns us a numpy array...
glove_embeddings.shape

(9, 50)

As we can see, our 9 tokenized words yield a numpy array of shape (9, 50), which is (nWords, nDims).

## Checking out the embedding

We first try to see how things work with the embedding... To do so, we take a few pairs of words and measure how similar they are. This is done by querying GloVe for the embeddings of both words and computing the cosine distance between them. scipy's spatial.distance.cosine returns the cosine distance, which is 1 minus the cosine similarity, so subtracting it from 1 gives back the cosine similarity. The result ranges from -1 (opposite directions) through 0 (orthogonal vectors) to 1 (same direction)...

In [29]:
def cosine_similarity(word1, word2):
    vector1, vector2 = glove_vectors.query(word1), glove_vectors.query(word2)
    return 1 - spatial.distance.cosine(vector1, vector2)
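For reference, the same quantity can be written out directly with numpy. This is a sketch on small hand-made vectors rather than GloVe vectors, and the helper name cosine_similarity_np is made up for this example:

```python
import numpy as np

def cosine_similarity_np(v1, v2):
    """Cosine of the angle between v1 and v2: dot product over the product of norms."""
    v1, v2 = np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

same_direction = cosine_similarity_np([1, 2], [2, 4])   # parallel vectors
orthogonal = cosine_similarity_np([1, 0], [0, 1])       # perpendicular vectors
opposite = cosine_similarity_np([1, 2], [-1, -2])       # opposite directions

print(same_direction, orthogonal, opposite)
```

Because only the angle matters, the length of the vectors has no effect on the score.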

In [41]:
word_pairs = [
    ('Stapler', 'cat'),
    ('dog', 'cat'),
    ('Sodium', 'the'),
    ('king', 'queen'),
    ('duck','goose'),

]

for word1, word2 in word_pairs:
    print(f'Similarity between "{word1}" and "{word2}":\t{cosine_similarity(word1, word2):.2f}')

Similarity between "Stapler" and "cat":	0.05
Similarity between "dog" and "cat":	0.92
Similarity between "Sodium" and "the":	-0.01
Similarity between "king" and "queen":	0.78
Similarity between "duck" and "goose":	0.74


As we can see above, words that are related have a high similarity (low distance), while unrelated words have a similarity close to zero.

In [42]:
ANIMALS = [
    'whale',
    'fish',
    'horse',
    'rabbit',
    'sheep',
    'lion',
    'dog',
    'cat',
    'tiger',
    'hamster',
    'pig',
    'goat',
    'lizard',
    'elephant',
    'giraffe',
    'hippo',
    'zebra',
]

HOUSEHOLD_OBJECTS = [
    'stapler',
    'screw',
    'nail',
    'tv',
    'dresser',
    'keyboard',
    'hairdryer',
    'couch',
    'sofa',
    'lamp',
    'chair',
    'desk',
    'pen',
    'pencil',
    'table',
    'sock',
    'floor',
    'wall',
]

In [52]:
tsne_words_embedded = TSNE(n_components=2).fit_transform(glove_vectors.query(ANIMALS + HOUSEHOLD_OBJECTS))
tsne_words_embedded.shape

(35, 2)

Plotting this made the Jupyter notebook behave erratically, so the cell below was intentionally made to raise an error. The output recorded under it is that error rather than the plot.

In [51]:
x, y = zip(*tsne_words_embedded)

fig, ax = plt.subplots(figsize=(10, 8))

for i, label in enumerate(ANIMALS + HOUSEHOLD_OBJECTS):
    if label in ANIMALS:
        color = BLUE
    elif label in HOUSEHOLD_OBJECTS:
        color = RED
        
    ax.scatter(x[i], y[i], c=color)
    ax.annotate(label, (x[i], y[i]))

ax.axis('off')

plt.show()

ValueError: too many values to unpack (expected 2)
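One way an error like this can arise is unpacking an array that has more than two columns. The sketch below uses random numbers as stand-ins for the t-SNE output and for the raw 50-dimensional vectors, so it only illustrates the unpacking behaviour and is not necessarily how the error above was produced:

```python
import numpy as np

rng = np.random.default_rng(0)
points_2d = rng.normal(size=(35, 2))     # stand-in for the (35, 2) t-SNE output
vectors_50d = rng.normal(size=(35, 50))  # stand-in for the raw 50-dimensional vectors

# zip(*array) yields one tuple per column, so a 2-column array unpacks cleanly
x, y = zip(*points_2d)

# Slicing the columns is an equivalent way to get the coordinates
x_col, y_col = points_2d[:, 0], points_2d[:, 1]

# A 50-column array yields 50 tuples, which cannot be unpacked into two names
try:
    x_bad, y_bad = zip(*vectors_50d)
    unpack_error = None
except ValueError as err:
    unpack_error = str(err)

print(unpack_error)
```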

In [54]:
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import torch
import torch.nn as nn

In [55]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()



BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Lin

In [56]:
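def toBertEmbeddings(text,return_tokens=False):
    """
        Create one embedding per token of the given text (a string or a list of words)
        by concatenating the outputs of the last four BERT encoder layers
    """
    
    if isinstance(text,list):
        tokens = tokenizer.tokenize(' '.join(text))
    else:
        tokens = tokenizer.tokenize(text)
        
    # BERT expects the special [CLS] and [SEP] tokens around the sentence...
    tokenTags = ['[CLS]'] + tokens + ['[SEP]']
    indices = tokenizer.convert_tokens_to_ids(tokenTags)
    
    # Gradients are not needed since we only run inference...
    with torch.no_grad():
        out = model(torch.LongTensor(indices).unsqueeze(0))
    
    # out[0] holds one tensor per encoder layer; keep the last four -> (4, nTokens, 768)
    embMatrix = torch.stack(out[0]).squeeze(1)[-4:]
    emb = []
    
    # Concatenate the four layer outputs of each token into a single vector...
    for j in range(embMatrix.shape[1]):
        emb.append(embMatrix[:,j,:].flatten().detach().numpy())
    
    # Drop the embeddings of [CLS] and [SEP]...
    emb = emb[1:-1]
    
    if return_tokens:
        assert len(emb) == len(tokens)
        return emb,tokens
    
    return emb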
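The shape handling in the function above can be followed on a stand-in array without loading BERT. The sketch below assumes 12 encoder layers and a hidden size of 768 (as in bert-base-uncased) and uses random numbers in place of real layer outputs; the sequence length of 6 is arbitrary:

```python
import numpy as np

n_layers, seq_len, hidden = 12, 6, 768   # seq_len counts [CLS] + 4 tokens + [SEP]
rng = np.random.default_rng(0)
layer_outputs = rng.normal(size=(n_layers, seq_len, hidden))  # stand-in for the encoder outputs

# Keep only the last four layers -> (4, seq_len, 768)
last_four = layer_outputs[-4:]

# For each token position, lay the four 768-dim vectors end to end -> 3072 values
per_token = [last_four[:, j, :].flatten() for j in range(seq_len)]

# Drop the first and last positions, which belong to [CLS] and [SEP]
per_token = per_token[1:-1]

print(len(per_token), per_token[0].shape)
```

Each remaining token ends up with a 4 * 768 = 3072-dimensional vector, and the first 768 values come from the fourth-to-last layer.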

In [57]:
words_sentences = [
    ('mouse', 'I saw a mouse run off with some cheese.'),
    ('mouse', 'I bought a new computer mouse yesterday.'),
    ('cat', 'My cat jumped on the bed.'),
    ('keyboard', 'My computer keyboard broke when I spilled juice on it.'),
    ('dessert', 'I had a banana fudge sunday for dessert.'),
    ('dinner', 'What did you eat for dinner?'),
    ('lunch', 'Yesterday I had a bacon lettuce tomato sandwich for lunch. It was tasty!'),
    ('computer', 'My computer broke after the motherdrive was overloaded.'),
    ('program', 'I like to program in Java and Python.'),
    ('pasta', 'I like to put tomatoes and cheese in my pasta.'),
]

In [58]:
words = [words_sentence[0] for words_sentence in words_sentences]
sentences = [words_sentence[1] for words_sentence in words_sentences]

In [60]:
embeddings_lst, tokens_lst = zip(*[toBertEmbeddings(sentence, return_tokens=True) for sentence in sentences])
words, tokens_lst, embeddings_lst = zip(*[(word, tokens, embeddings) for word, tokens, embeddings in zip(words, tokens_lst, embeddings_lst) if word in tokens])

# Convert tuples to lists
words, tokens_lst, embeddings_lst = map(list, [words, tokens_lst, embeddings_lst])

In [69]:
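# For some reason, plotting this breaks the Jupyter notebook,
# so again, this cell intentionally raises an error before any plotting happens...

raise FileNotFoundError

# target_embeddings was not defined in the original cell; this definition is an assumption:
# for each sentence, take the embedding of the token that matches the target word
target_embeddings = [embeddings[tokens.index(word)] for word, tokens, embeddings in zip(words, tokens_lst, embeddings_lst)]

tsne_words_embedded = TSNE(n_components=2).fit_transform(target_embeddings)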
x, y = zip(*tsne_words_embedded)

fig, ax = plt.subplots(figsize=(5, 10))

for word, tokens, x_i, y_i in zip(words, tokens_lst, x, y):
    ax.scatter(x_i, y_i, c=RED)
    ax.annotate(' '.join([f'$\\bf{x}$' if x == word else x for x in tokens]), (x_i, y_i))

plt.show()

FileNotFoundError: 