Text embeddings are maps from words to vectors of numbers. The design goal is: words with *similar* meaning should map to vectors that are *near* each other.
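
To make the design goal concrete, here is a minimal sketch using toy vectors. The three-dimensional values below are invented for illustration only (real GloVe vectors in this notebook have 300 dimensions), and `toy_embedding` and `euclidean` are hypothetical names, not part of `GloVe_tools`.

```python
import math

# Toy 3-d vectors, made up for illustration; these are NOT real GloVe values.
toy_embedding = {
    "awful": [0.9, 0.1, 0.0],
    "terrible": [0.8, 0.2, 0.1],
    "actor": [0.0, 0.9, 0.7],
}


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.dist(u, v)


# Words with similar meaning should be closer than unrelated words.
d_similar = euclidean(toy_embedding["awful"], toy_embedding["terrible"])
d_different = euclidean(toy_embedding["awful"], toy_embedding["actor"])
print(d_similar < d_different)
```

Euclidean distance is only one choice here; cosine similarity is another common way to compare embedding vectors.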

This allows very powerful models to be built quickly.

In this notebook we will take a quick look at:

> *GloVe: Global Vectors for Word Representation*. Jeffrey Pennington, Richard Socher, Christopher D. Manning.

[https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/)

One of the earliest widely used embeddings was word2vec, and modern systems such as BERT and GPT-3 are based on related mapping and memorization concepts.

In [1]:
import os
import sys

In [2]:
sys.path.append('.')
sys.path.append('../data/GloVe')
from GloVe_tools import scan_for_knn, GloveKNN

In [3]:
glove_path = '../data/GloVe/glove.840B.300d.zip'
k = 5

In [4]:
code, neighbors = scan_for_knn(
    glove_path=glove_path,
    k=k,
    vec_key='awful')

In [5]:
code

array([-3.5771e-01,  1.3224e-01, -1.2629e-01,  1.9399e-02, -1.2629e-01,
        6.7209e-02,  5.5793e-01, -2.2874e-01,  7.2385e-02,  2.1437e+00,
        1.8868e-01,  1.4195e-02, -1.6686e-01, -1.1108e-02, -1.4443e-01,
        8.2411e-02, -1.0485e-01,  6.8161e-02,  1.7655e-01, -1.2898e-01,
        2.8546e-02,  2.7927e-02, -1.8166e-01, -1.1873e-01, -1.9247e-01,
       -1.5619e-01, -2.8282e-01, -2.8844e-01,  8.0158e-02, -2.6151e-01,
       -2.7874e-01, -1.1319e-01, -4.1970e-01,  5.6307e-02, -1.4667e-01,
        2.6969e-01,  1.1773e-01,  1.8393e-01, -2.1825e-01,  4.8422e-01,
        3.8373e-01, -6.3800e-02, -4.5418e-01,  3.1067e-01,  1.8604e-01,
       -1.6346e-01, -4.0040e-01,  5.3268e-01,  1.6649e-02, -1.2839e-01,
        1.0279e-01,  4.9500e-01,  2.4030e-01,  9.1514e-02,  1.1895e-01,
       -5.5230e-02, -5.5255e-02, -1.0431e-01,  1.3750e-02, -4.3006e-02,
       -3.2369e-02, -2.7714e-01, -2.0849e-01, -1.4907e-01,  3.9801e-02,
        1.2099e-01, -4.6359e-02, -3.4227e-01,  1.5219e-01,  2.04

In [6]:
neighbors

{'terrible': 5.781587394434634,
 'horrible': 5.9903435930876885,
 'dreadful': 10.75734373694929,
 'horrendous': 12.032913736421703,
 'horrid': 8.202688529045705}
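
The internals of `scan_for_knn` are not shown in this notebook. As a rough sketch of how a low-memory scan of this kind could work, the code below streams over lines in the GloVe text format (a token followed by space-separated floats) and keeps only the `k` closest words in a heap. The names `parse_glove_line` and `stream_knn`, the `skip` parameter, and the use of Euclidean distance are all assumptions made for illustration.

```python
import heapq
import math


def parse_glove_line(line, dim):
    """Split 'token v1 ... v_dim' into (token, vector).

    Splitting from the right keeps tokens that themselves contain spaces intact.
    """
    parts = line.rstrip("\n").rsplit(" ", dim)
    return parts[0], [float(x) for x in parts[1:]]


def stream_knn(lines, target, k, dim, skip=None):
    """Return the k words nearest to `target`, reading one line at a time."""
    heap = []  # entries are (-distance, word); the root is the farthest kept word
    for line in lines:
        word, vec = parse_glove_line(line, dim)
        if word == skip:
            continue  # do not report the query word as its own neighbor
        d = math.dist(vec, target)
        if len(heap) < k:
            heapq.heappush(heap, (-d, word))
        elif -d > heap[0][0]:
            heapq.heapreplace(heap, (-d, word))
    # Sort so the nearest word comes first.
    return {word: -neg_d for neg_d, word in sorted(heap, reverse=True)}


# Toy 2-d data, invented for illustration only.
toy_lines = ["good 1.0 0.0", "great 0.9 0.1", "bad -1.0 0.0", "fine 0.5 0.2"]
toy_result = stream_knn(toy_lines, target=[1.0, 0.0], k=2, dim=2, skip="good")
print(toy_result)
```

Because only `k` candidates are held at any time, memory use stays small no matter how large the embedding file is; the cost is a full pass over the file for every query.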

In [7]:
# similar calculation using sklearn KNN code in memory
# needs about 20 GB of RAM
nbhds = GloveKNN(glove_path=glove_path, k=k + 1)
nbhds.kneighbors_k('awful')

['awful', 'terrible', 'horrible', 'horrid', 'dreadful', 'horrendous']
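
`GloveKNN` is a helper from this repository and its implementation is not shown. A minimal sketch of the same idea using scikit-learn's `NearestNeighbors` on a toy in-memory matrix might look like the following. The vectors are invented for illustration, and `words`, `vectors`, and `nearest_words` are hypothetical names.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2-d vocabulary, invented for illustration only.
words = ["good", "great", "bad", "fine"]
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [-1.0, 0.0],
    [0.5, 0.2],
])

# Fit once, then answer many queries from memory.
index = NearestNeighbors(n_neighbors=3).fit(vectors)


def nearest_words(word):
    """Return the words nearest to `word`, nearest first."""
    row = words.index(word)
    _, idx = index.kneighbors(vectors[row:row + 1])
    return [words[i] for i in idx[0]]


print(nearest_words("good"))
```

The query word is at distance zero from itself, so it comes back as the first result. That is presumably why the cell above asks for `k + 1` neighbors.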

In [8]:
nbhds.kneighbors_k('plot')

['plot', 'plots', 'plotting', 'Plot', 'plotline', 'storyline']

In [9]:
nbhds.kneighbors_k('actor')

['actor', 'actors', 'Actor', 'actress', 'starred', 'starring']