We use embeddings to represent text in numerical form: either as a one-hot encoding, called a sparse vector, or as a fixed-size real-valued representation, called a dense vector.
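As a minimal sketch of the difference, the snippet below builds both representations for a tiny vocabulary. The five-word vocabulary, the helper names `one_hot` and `dense`, and the random values in `dense_table` are assumptions made up for illustration; real dense vectors are learned from a corpus rather than drawn at random.

```python
import numpy as np

# Hypothetical five-word vocabulary, for illustration only.
vocab = ["cat", "dog", "plane", "drone", "king"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: all zeros except a single 1 at the word's index."""
    vector = np.zeros(len(vocab), dtype=np.float32)
    vector[word_to_index[word]] = 1.0
    return vector

# Dense representation: a table of real-valued vectors, one row per word.
# The values are random placeholders here; in practice they are learned.
rng = np.random.default_rng(seed=0)
embedding_dim = 3
dense_table = rng.normal(size=(len(vocab), embedding_dim)).astype(np.float32)

def dense(word):
    """Look up the dense vector for a word by its row index."""
    return dense_table[word_to_index[word]]

print(one_hot("plane"))  # [0. 0. 1. 0. 0.], length equals the vocabulary size
print(dense("plane"))    # length equals embedding_dim, whatever the vocabulary size

# Any two different one-hot vectors have a dot product of 0,
# so one-hot encoding by itself carries no notion of similarity.
print(np.dot(one_hot("cat"), one_hot("dog")))
```

The one-hot vector grows with the vocabulary and treats every pair of words as equally unrelated, which is why dense vectors are the form used in the rest of this notebook.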

Every word gets its meaning from the words that surround it. So when we train our embeddings, we want words with similar meanings, or words used in similar contexts, to end up close together.

For example:
1. Words like aeroplane, chopper, helicopter, and drone should be very close to each other because they share the same feature: they are all flying objects.

2. Words like man and woman should differ mainly along one direction (gender) while staying close to each other in most other respects.

3. In sentences like "Coders are boring people." and "Programmers are boring.", the words `coders` and `programmers` are used in similar contexts, so they should be close to each other.

Word embeddings are nothing but vectors in a vector space. Using some vector calculations, we can easily:
1. Find synonyms or similar words
2. Find analogies
3. Build a spell checker (if trained on a large corpus)
4. Do pretty much anything else you can do with vectors.


In [12]:
import torchtext
import numpy as np
import torch

In [2]:
glove = torchtext.vocab.GloVe(name = '6B', dim = 100)

print(f'There are {len(glove.itos)} words in the vocabulary')

There are 400000 words in the vocabulary


In [9]:
glove.itos[:10]

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

In [11]:
glove.stoi["cat"]

5450

In [7]:
def get_embedding(word):
    return glove.vectors[glove.stoi[word]]

In [8]:
get_embedding("cat")

tensor([ 0.2309,  0.2828,  0.6318, -0.5941, -0.5860,  0.6326,  0.2440, -0.1411,
         0.0608, -0.7898, -0.2910,  0.1429,  0.7227,  0.2043,  0.1407,  0.9876,
         0.5253,  0.0975,  0.8822,  0.5122,  0.4020,  0.2117, -0.0131, -0.7162,
         0.5539,  1.1452, -0.8804, -0.5022, -0.2281,  0.0239,  0.1072,  0.0837,
         0.5501,  0.5848,  0.7582,  0.4571, -0.2800,  0.2522,  0.6896, -0.6097,
         0.1958,  0.0442, -0.3114, -0.6883, -0.2272,  0.4618, -0.7716,  0.1021,
         0.5564,  0.0674, -0.5721,  0.2374,  0.4717,  0.8277, -0.2926, -1.3422,
        -0.0993,  0.2814,  0.4160,  0.1058,  0.6220,  0.8950, -0.2345,  0.5135,
         0.9938,  1.1846, -0.1636,  0.2065,  0.7385,  0.2406, -0.9647,  0.1348,
        -0.0072,  0.3302, -0.1236,  0.2719, -0.4095,  0.0219, -0.6069,  0.4076,
         0.1957, -0.4180,  0.1864, -0.0327, -0.7857, -0.1385,  0.0440, -0.0844,
         0.0491,  0.2410,  0.4527, -0.1868,  0.4618,  0.0891, -0.1819, -0.0152,
        -0.7368, -0.1453,  0.1510, -0.71

# Similar Context

To find words similar to an input word, we first take the vector representation of every word in the vocabulary, compute the Euclidean distance between the input word's vector and each of them, and then pick the n closest words by sorting the distances in ascending order. The input word itself always comes first, at distance 0.


In [189]:
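def get_closest_word(word,n=10):
    # accept either a word (str) or an already computed embedding tensor
    input_vector = get_embedding(word).numpy() if isinstance(word,str) else word.numpy()
    # Euclidean distance from the input vector to every vector in the vocabulary
    distance = np.linalg.norm(input_vector-glove.vectors.numpy(),axis=1)
    # indices of the n smallest distances, closest first
    sort_dis = np.argsort(distance)[:n]
    return list(zip(np.array(glove.itos)[sort_dis] , distance[sort_dis]))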

In [140]:
get_closest_word("sad",n=10)

[('sad', 0.0),
 ('sorry', 3.8980238),
 ('awful', 3.9397242),
 ('tragic', 4.1759458),
 ('horrible', 4.2391276),
 ('heartbreaking', 4.3083005),
 ('unfortunate', 4.3767824),
 ('pathetic', 4.384075),
 ('scary', 4.3990903),
 ('happy', 4.427562)]
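The distance computation in `get_closest_word` relies on NumPy broadcasting: a single vector is subtracted from every row of the embedding matrix at once. The sketch below replays that step on a toy matrix so the shapes are easy to follow. The four words and their 2-dimensional vectors are made up for illustration and are not GloVe values.

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings", one row per word.
words = np.array(["sad", "unhappy", "happy", "table"])
vectors = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [3.0, 0.0],
    [0.0, 5.0],
])

query = vectors[0]  # the vector for "sad", shape (2,)

# Broadcasting subtracts the query from every row of the (4, 2) matrix;
# axis=1 then collapses each row into a single Euclidean norm.
distances = np.linalg.norm(vectors - query, axis=1)

# argsort returns row indices ordered from smallest to largest distance.
order = np.argsort(distances)
for word, dist in zip(words[order], distances[order]):
    print(word, round(float(dist), 3))
```

Note that `happy` appears among the ten nearest neighbours of `sad` in the GloVe output above. Antonyms tend to occur in the same contexts, so distance in embedding space reflects shared context rather than strictly shared meaning.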

In [168]:
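def get_similarity_angle(word1,word2):
    # reshape each embedding to (1, dim) so CosineSimilarity can compare along dim=1
    word1 = get_embedding(word1).view(1,-1)
    word2 = get_embedding(word2).view(1,-1)
    simi = torch.nn.CosineSimilarity(dim=1)(word1,word2).numpy()
    # return the cosine similarity and the corresponding angle in degrees
    return simi,np.rad2deg(np.arccos(simi))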


In [180]:
get_similarity_angle("sad","awful")

(array([0.7284238], dtype=float32), array([43.245583], dtype=float32))
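Cosine similarity measures the angle between two vectors and ignores their lengths, whereas Euclidean distance is sensitive to both. The sketch below shows the difference on hand-picked vectors. The helper `cosine_similarity` and the vectors `a`, `b`, and `c` are illustrative assumptions, not part of the notebook's GloVe setup.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                    # same direction as a, ten times longer
c = np.array([-2.0, 1.0, 0.0])  # perpendicular to a (their dot product is 0)

print(cosine_similarity(a, b))  # close to 1.0: same direction
print(cosine_similarity(a, c))  # 0.0: perpendicular
print(np.linalg.norm(a - b))    # large Euclidean distance despite the identical direction
```

Which measure is preferable depends on whether vector length should count. Ranking neighbours by cosine similarity instead of Euclidean distance may give a somewhat different ordering on the same embeddings.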

# Analogies

In [203]:
def analogy( word1, word2, word3, n=5):
    
    #get vectors for each word
    word1_vector = get_embedding(word1)
    word2_vector = get_embedding(word2)
    word3_vector = get_embedding(word3)
    
    #calculate analogy vector
    analogy_vector = word2_vector - word1_vector + word3_vector
    
    #find closest words to analogy vector; fetch 3 extra because up to 3 input words are filtered out below
    candidate_words = get_closest_word( analogy_vector, n=n+3)
    
    #filter out words already in analogy
    candidate_words = [(word, dist) for (word, dist) in candidate_words 
                       if word not in [word1, word2, word3]][:n]
    
    print(f'{word1} is to {word2} as {word3} is to...')
    
    return candidate_words

In [259]:
analogy('man', 'king', 'woman')

man is to king as woman is to...


[('queen', 4.081079),
 ('monarch', 4.6429076),
 ('throne', 4.9055004),
 ('elizabeth', 4.921559),
 ('prince', 4.9811463)]

This is the canonical example that shows off the analogy property of word embeddings. So why does it work? Why does the vector of 'king' minus the vector of 'man', plus the vector of 'woman', give us 'queen'?

If we think about it, the vector calculated from 'king' minus 'man' gives us a "royalty vector". This is the vector associated with traveling from a man to his royal counterpart, a king. If we add this "royalty vector" to 'woman', we should arrive at her royal equivalent, which is a queen!
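The "royalty vector" argument can be checked by hand on a toy space. In the sketch below, the dictionary `toy` and its 2-dimensional vectors are invented for illustration (axis 0 stands for royalty, axis 1 for gender). Real embeddings are not this clean, so the arithmetic only lands near 'queen' rather than exactly on it.

```python
import numpy as np

# Hypothetical hand-made vectors: axis 0 = "royalty", axis 1 = "gender".
toy = {
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

# The offset from man to king isolates the royalty axis.
royalty = toy["king"] - toy["man"]   # [1, 0]

# Adding that offset to woman moves her along the royalty axis only.
target = royalty + toy["woman"]      # [1, -1]

# Nearest remaining word to the target, skipping the three input words.
candidates = {w: v for w, v in toy.items() if w not in ("man", "king", "woman")}
best = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))
print(best)  # queen
```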

In [258]:
analogy('india', 'delhi', 'australia')

india is to delhi as australia is to...


[('sydney', 3.075962),
 ('melbourne', 3.1802053),
 ('canberra', 3.2988422),
 ('perth', 3.3058352),
 ('brisbane', 3.543623)]

In [262]:
get_closest_word("reliable")

[('reliable', 0.0),
 ('dependable', 3.4597418),
 ('accurate', 3.9092937),
 ('unreliable', 4.1367426),
 ('trustworthy', 4.178639),
 ('consistent', 4.251687),
 ('useful', 4.2767124),
 ('efficient', 4.4032826),
 ('credible', 4.4201684),
 ('authoritative', 4.514892)]

# Case Studies
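1. Spell Checking: [fast.ai forum thread on libraries and dictionaries for fixing common spelling errors](https://forums.fast.ai/t/nlp-any-libraries-dictionaries-out-there-for-fixing-common-spelling-errors/16411)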

2. Multilingual and Cross-lingual Analysis: If you study works in translation, or the influence of writers in one language on those who write in another, word vectors can provide valuable ways to study these kinds of cross-lingual relationships algorithmically.
[Case Study: Using word vectors to study endangered languages](https://raw.githubusercontent.com/YaleDHLab/lab-workshops/master/word-vectors/papers/coeckelbergs.pdf)

3. Studying Language Change over Time: If you want to study the way the meaning of a word has changed over time, word vectors provide an exceptional method for this kind of study.
[Case Study: Using word vectors to analyze the changing meaning of the word "gay" in the twentieth century.](https://nlp.stanford.edu/projects/histwords/)

4. Analyzing Historical Concept Formation: If you want to analyze the ways writers in a given historical period understood particular concepts like "honor" and "chivalry", then word vectors can provide excellent opportunities to uncover these hidden associations.
[Case Study: Using word vectors to study the ways eighteenth-century authors organized moral abstractions](https://raw.githubusercontent.com/YaleDHLab/lab-workshops/master/word-vectors/papers/heuser.pdf)

5. Uncovering Text Reuse: If you want to study text reuse or literary imitation (either within one language or across multiple languages), word vectors can provide excellent tools for identifying similar passages of text.
[Case Study: Using word vectors to uncover cross-lingual text reuse in eighteenth-century writing](https://douglasduhaime.com/posts/crosslingual-plagiarism-detection.html)
