## Experiment with Word Embeddings

**Note:** lesson notebooks are created for the purpose of in-class illustrations and student experimentation after class.

Back to [week 1 slides](https://docs.google.com/presentation/d/1sQvImcSg-IAikN6nsQmvoUmgcKjJ8oJ9DXHWSOiyUZY/edit?slide=id.g1223b215f96_0_6#slide=id.g1223b215f96_0_6)

**Description:** Review some basics about embeddings which we will use to represent words through the remainder of the class.

How should we think about embeddings relative to language? How do they represent words? Are they like dictionary definitions of words with clear boundaries? Are they a sharp, clear representation of the meaning, or are they more nebulous?

Gensim is a library that is used for working with similarity of words and documents.  We'll use it to experiment with embeddings and to sharpen our intuition.  Make sure you use the correct version of gensim for this notebook.


<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [Setup](#setup)
  * 2. [Fits Like a GloVe?](#glove)    
  * 3. [Cosine Similarity](#cosineSimilarity)
  * 4. [Analogies](#analogies)
  * 5. [Similarity and Language](#similarity)



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-fall-main/blob/master/materials/lesson_notebooks/lesson_1_WordEmbeddings_and_Analogies.ipynb)


[Return to Top](#returnToTop)  
<a id = 'setup'></a>

## 1. Setup

Gensim has a preloaded set of different well known static word embeddings.  We'll take advantage of these to explore how they represent words.

In [1]:
import sys, scipy, numpy
print(sys.executable)
print("scipy:", scipy.__version__)
print("numpy:", numpy.__version__)

/share/crsp/lab/pkaiser/ddlin/mids/datasci-266/2025-fall-main/.venv/bin/python
scipy: 1.12.0
numpy: 1.26.4


In [4]:
import numpy as np
import gensim
import gensim.downloader as api
print(list(api.info()['models'].keys()))


['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


**Note: If the gensim downloader fails to load some numpy libraries, you must go to the Runtime menu and select 'Restart Session'.  Once the session is restarted, you can run all of the cells in order and they will work.**

[Return to Top](#returnToTop)  
<a id = 'glove'></a>

## 2. Fits Like a GloVe?

Word embeddings take a long time to train. Since the goal is to provide a good representation for as many words as possible, generating good embeddings often requires making several passes over a very large corpus. A number of different pretrained sets are available for download.

Fortunately, it's possible to learn fairly general embeddings from large corpora that are useful for many downstream tasks. We'll use the GloVe vectors available at https://nlp.stanford.edu/projects/glove/ - specifically, a set trained with a vocabulary of 400,000 on a corpus of 6B tokens from Wikipedia and Gigaword.

The vectors are distributed as a (very) large text file, with one word per line followed by its vector:
```
the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459
```
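
To make the file format concrete, here is a minimal sketch of how one such line could be split into a word and its vector. The helper name `parse_glove_line` is made up for this illustration, and the sample string is just the values shown above. Gensim does this loading for us, so this is only to build intuition.

```python
def parse_glove_line(line):
    """Split one line of a GloVe-style text file into (word, vector)."""
    parts = line.rstrip().split(" ")
    word = parts[0]                          # first field is the word
    vector = [float(x) for x in parts[1:]]   # remaining fields are the numbers
    return word, vector

# The sample line from above (the values shown there, nothing more).
sample = "the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459"

word, vector = parse_glove_line(sample)
print(word, len(vector), vector[:3])
```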

Gensim makes a number of these static embeddings available for your use. It provides access to fasttext, conceptnet, word2vec, and GloVe embeddings.

You can select the set you want to use by reading the names of the models provided by gensim. Each name encodes three things. First is the method that was used to build the embeddings. Second is the corpus used to generate the embeddings. Last is the dimensionality (size) of each embedding. For example, 'glove-wiki-gigaword-50' means the GloVe approach was used to create the embeddings, wiki-gigaword is the corpus used for training, and each embedding has 50 dimensions.
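
As a rough sketch of that naming pattern, the hypothetical helper below splits a model name into its three parts. It assumes the `<method>-<corpus>-<dimensions>` layout described above and returns `None` for names that do not end in a number, such as the `__testing_...` entry in the list. It is not part of gensim.

```python
def describe_model_name(name):
    """Split a model name into (method, corpus, dimensions).

    Assumes the '<method>-<corpus>-<dimensions>' pattern; returns None
    when the name does not end in an integer.
    """
    parts = name.split("-")
    if len(parts) < 3 or not parts[-1].isdigit():
        return None
    method = parts[0]
    corpus = "-".join(parts[1:-1])   # everything between method and size
    dimensions = int(parts[-1])
    return method, corpus, dimensions

for name in ["glove-wiki-gigaword-50", "glove-twitter-200", "word2vec-google-news-300"]:
    print(name, "->", describe_model_name(name))
```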



Downloading word2vec takes a long time (up to 15 minutes). The glove-wiki-gigaword-50 set takes significantly less time to download (around 1 minute).

In [5]:
# Download the "glove-wiki-gigaword-50" embeddings (a much smaller download than "word2vec-google-news-300")

embed_vectors = gensim.downloader.load('glove-wiki-gigaword-50')




Embeddings consist of a set of vectors, and each vector is associated with a word. This makes it easy to look up the vector for a specific word. The list of words for which there are vectors is held in a vocabulary list. Let's get the list and see what's inside.

In [9]:
embed_vectors.__dict__.keys()

dict_keys(['vector_size', 'index_to_key', 'next_index', 'key_to_index', 'vectors', 'norms', 'expandos', 'mapfile_path', 'lifecycle_events'])
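
The attribute names above suggest how a lookup works: a list maps positions to words, a dictionary maps words back to positions, and the position selects a row of the vector table. Here is a toy stand-in that borrows those three attribute names. The words and numbers are made up, and the exact internals of gensim are an assumption here.

```python
# Toy stand-in for the lookup structure; all values are invented.
index_to_key = ["the", "king", "queen"]                     # position -> word
key_to_index = {w: i for i, w in enumerate(index_to_key)}   # word -> position
vectors = [
    [0.1, 0.2],    # row 0: vector for "the"
    [0.9, 0.8],    # row 1: vector for "king"
    [0.9, -0.7],   # row 2: vector for "queen"
]

def lookup(word):
    """Return the vector stored for a word, via its row position."""
    return vectors[key_to_index[word]]

print(lookup("queen"))
print("squirrel" in key_to_index)   # membership test for out-of-vocabulary words
```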

In [7]:
embed_vectors.most_similar('king', topn=5)

[('prince', 0.8236179351806641),
 ('queen', 0.7839043140411377),
 ('ii', 0.7746230363845825),
 ('emperor', 0.7736247777938843),
 ('son', 0.766719400882721)]

In [6]:
#How many words in our vocabulary?
embed_vectors_vocab = embed_vectors.index_to_key
len(embed_vectors_vocab)

400000

Let's look at some of the words in the vocabulary.  We can retrieve them by position in the list.

In [6]:
embed_vectors_vocab[399995:400000]

['chanty', 'kronik', 'rolonda', 'zsombor', 'sandberger']

Let's look at one vector and see what it contains.  You can try various words to see which ones are in the vocabulary.

In [10]:
vec0 = embed_vectors['globules']

vec0

array([ 0.8932  , -0.74191 ,  0.74033 , -1.2181  ,  0.38596 ,  0.22143 ,
        0.87075 ,  0.089573, -0.258   ,  0.87708 , -0.22681 ,  0.45953 ,
        1.321   ,  1.132   , -0.72951 ,  0.92633 ,  1.2474  , -0.5903  ,
        0.23094 ,  0.017891,  0.023733, -1.1873  ,  1.8662  ,  0.18656 ,
       -0.63184 ,  0.95206 , -0.55096 ,  1.0844  , -0.43715 ,  0.11234 ,
       -0.38582 , -0.36061 ,  0.63165 ,  0.57202 ,  0.38001 ,  0.081306,
       -0.41507 ,  0.015277,  0.79508 , -0.24794 , -0.24346 ,  0.65275 ,
       -0.69327 ,  0.37766 ,  0.13726 ,  0.18225 , -0.05881 , -0.18726 ,
        0.37905 , -0.74832 ], dtype=float32)

In [11]:
vec0 = embed_vectors['unk']

vec0

array([-7.9149e-01,  8.6617e-01,  1.1998e-01,  9.2287e-04,  2.7760e-01,
       -4.9185e-01,  5.0195e-01,  6.0792e-04, -2.5845e-01,  1.7865e-01,
        2.5350e-01,  7.6572e-01,  5.0664e-01,  4.0250e-01, -2.1388e-03,
       -2.8397e-01, -5.0324e-01,  3.0449e-01,  5.1779e-01,  1.5090e-02,
       -3.5031e-01, -1.1278e+00,  3.3253e-01, -3.5250e-01,  4.1326e-02,
        1.0863e+00,  3.3910e-02,  3.3564e-01,  4.9745e-01, -7.0131e-02,
       -1.2192e+00, -4.8512e-01, -3.8512e-02, -1.3554e-01, -1.6380e-01,
        5.2321e-01, -3.1318e-01, -1.6550e-01,  1.1909e-01, -1.5115e-01,
       -1.5621e-01, -6.2655e-01, -6.2336e-01, -4.2150e-01,  4.1873e-01,
       -9.2472e-01,  1.1049e+00, -2.9996e-01, -6.3003e-03,  3.9540e-01],
      dtype=float32)

In [12]:
vec1 = embed_vectors['computer']  # get numpy vector of a word

vec1

array([ 0.079084, -0.81504 ,  1.7901  ,  0.91653 ,  0.10797 , -0.55628 ,
       -0.84427 , -1.4951  ,  0.13418 ,  0.63627 ,  0.35146 ,  0.25813 ,
       -0.55029 ,  0.51056 ,  0.37409 ,  0.12092 , -1.6166  ,  0.83653 ,
        0.14202 , -0.52348 ,  0.73453 ,  0.12207 , -0.49079 ,  0.32533 ,
        0.45306 , -1.585   , -0.63848 , -1.0053  ,  0.10454 , -0.42984 ,
        3.181   , -0.62187 ,  0.16819 , -1.0139  ,  0.064058,  0.57844 ,
       -0.4556  ,  0.73783 ,  0.37203 , -0.57722 ,  0.66441 ,  0.055129,
        0.037891,  1.3275  ,  0.30991 ,  0.50697 ,  1.2357  ,  0.1274  ,
       -0.11434 ,  0.20709 ], dtype=float32)

In [14]:
vec2 = embed_vectors['memory']  # get numpy vector of a word

vec2

array([ 0.23279  ,  0.86839  ,  1.1254   ,  0.51721  ,  0.75241  ,
        0.29429  ,  0.49561  , -0.32712  ,  0.073683 ,  0.65221  ,
        0.25048  ,  0.18102  ,  0.048204 , -0.58776  ,  0.50913  ,
        0.31256  , -1.4426   ,  0.2717   ,  0.63258  , -0.30259  ,
        0.1134   , -0.091346 , -0.72572  , -0.63157  ,  1.1439   ,
       -0.66678  , -0.98826  , -0.789    ,  0.92111  ,  0.024868 ,
        2.7919   , -0.17054  ,  0.16879  , -0.92214  ,  0.0043812,
        1.1182   ,  0.57454  ,  0.11623  ,  0.35822  , -0.29916  ,
        0.3413   , -0.32128  , -0.94669  ,  0.39784  ,  0.14316  ,
        0.60063  ,  0.59717  , -0.57258  , -0.20082  , -0.40977  ],
      dtype=float32)

[Return to Top](#returnToTop)  
<a id = 'cosineSimilarity'></a>

## 3. Cosine Similarity

To measure the similarity of two words, we'll use the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between their representation vectors:

$$ D^{cos}_{ij} = \frac{v_i^T v_j}{||v_i||\ ||v_j||}$$

*Note that this is called cosine similarity because $D^{cos}_{ij} = \cos(\theta_{ij})$, where $\theta_{ij}$ is the angle between the two vectors.*


Let's use numpy to calculate the cosine similarity between two vectors.  The higher the cosine similarity (that is, the smaller the angle between the two vectors), the more alike the words are.  In word2vec, and similarly in GloVe, the embeddings are built so that words that are used in the same contexts have more similar vectors than words used in different contexts.  Names of cities like London or Paris, for example, are used in the same contexts, such as "I want to visit ..." or "I flew in from ..." or "I once lived in ..."

In [15]:
from numpy.linalg import norm

cos_sim = np.dot(vec1, vec2)/(norm(vec1)*norm(vec2))

cos_sim

0.69093883

In [None]:
# Gensim already has a built-in function to compute the similarity between two words
embed_vectors.similarity(w1='computer', w2='memory')

0.6909388
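
One property worth checking by hand is that cosine similarity depends only on the direction of the vectors, not their length. The sketch below spells out the formula term by term in plain Python on made-up toy vectors (not real embeddings), so it runs without downloading anything.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, for illustration only.
u = [1.0, 2.0, 3.0]
v = [2.0, 0.0, 1.0]
scaled_u = [10 * a for a in u]   # same direction as u, ten times longer

print(cosine_similarity(u, v))
print(cosine_similarity(scaled_u, v))                # same value: length is ignored
print(cosine_similarity(u, u))                       # a vector compared with itself
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))     # perpendicular vectors
```

Because the score ignores length, two words can have very similar directions even if one vector has much larger entries than the other.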

Let's use a word like 'flies', which is ambiguous.  As a noun it refers to insects.  As a verb it refers to the act of flying.  Notice how both of these senses are incorporated into the single embedding and how this is reflected in the list of most similar words.

In [16]:
# Gensim allows us to easily see which words are most similar to the word we specify.
embed_vectors.most_similar('flies')

[('fly', 0.7638436555862427),
 ('flying', 0.7358503937721252),
 ('birds', 0.6993444561958313),
 ('moths', 0.6920357346534729),
 ('butterflies', 0.6906600594520569),
 ('plane', 0.6849700808525085),
 ('circling', 0.6725693941116333),
 ('spotted', 0.672272801399231),
 ('sea', 0.6711999177932739),
 ('tsetse', 0.6690365672111511)]
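
Conceptually, a nearest-neighbor query like this can be thought of as scoring every other word by cosine similarity and keeping the top few. Here is a minimal sketch on a hypothetical five-word vocabulary with invented 2-d vectors. The name `toy_most_similar` is ours, and this is not how gensim implements the search internally.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-d vectors, invented for illustration only.
toy_vectors = {
    "fly":    [0.9, 0.5],
    "flies":  [0.8, 0.6],
    "moths":  [0.2, 0.9],
    "plane":  [0.9, 0.1],
    "banana": [-0.7, -0.4],
}

def toy_most_similar(word, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = toy_vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in toy_vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(toy_most_similar("flies"))
```

In this toy setup 'flies' sits between an insect-like neighbor and a flight-like neighbor, which is one way to picture a single vector carrying more than one sense.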

The results are not always what you might expect.  It is useful to think about what you expect to see, what associations are actually represented, and why.  For example, we think of 'happy' as an emotion, so we might expect other emotion words as its neighbors.  Notice that most similar doesn't necessarily mean synonymous.  It just means often used in the same contexts.

In [14]:
embed_vectors.most_similar('happy')

[("'m", 0.9142323136329651),
 ('everyone', 0.8976402282714844),
 ('everybody', 0.8965489864349365),
 ('really', 0.88397616147995),
 ('me', 0.8784631490707397),
 ('definitely', 0.8762788772583008),
 ('maybe', 0.8756702542304993),
 ("'d", 0.8718011975288391),
 ('feel', 0.8707678318023682),
 ('i', 0.8707453012466431)]

In [17]:
# To see the least similar words, use the following construct
embed_vectors.most_similar(negative=['happy'])

[('then-director', 0.693364143371582),
 ('carneades', 0.6816099286079407),
 ('bb94', 0.6682854294776917),
 ('vopi', 0.6628192067146301),
 ('endel', 0.6545295119285583),
 ('dovers', 0.6493183374404907),
 ('cw96', 0.6464378237724304),
 ('synesius', 0.6453444957733154),
 ('kd94', 0.6443259119987488),
 ('25aou94', 0.6437768340110779)]
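
One way to read these scores, assuming `negative=` amounts to ranking words against the negated 'happy' vector (our reading, not something stated in this notebook): negating a vector flips the sign of its cosine similarity with every other vector. Under that assumption, a score of 0.69 here would correspond to a cosine similarity of about -0.69 with 'happy'. The toy check below uses made-up numbers.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up vectors, for illustration only.
u = [0.8, 0.6]
v = [-0.5, 0.2]
negated_u = [-a for a in u]

# The two printed values have the same magnitude and opposite signs.
print(cosine(u, v), cosine(negated_u, v))
```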

[Return to Top](#returnToTop)  
<a id = 'analogies'></a>

## 4. Analogies

We can also use these embeddings to perform analogies.  This is an interesting way to test how well they represent the meaning of the words.

In [18]:
result = embed_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

queen: 0.8524
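
The intuition behind this query is vector arithmetic: start at 'king', subtract 'man', add 'woman', and look for the nearest remaining word. The sketch below plays this out on hypothetical 2-d vectors in which one axis loosely stands for royalty and the other for gender. The vectors and the `toy_analogy` helper are invented for illustration. It also simplifies what gensim does, since it uses the raw vectors without any normalization, and it excludes the input words from the candidates by assumption.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.1, 1.0],
    "woman":  [0.1, -1.0],
    "prince": [0.8, 0.9],
    "apple":  [-1.0, 0.1],
}

def toy_analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    target = [0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, toy[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy[word])]
    inputs = set(positive) | set(negative)
    candidates = [w for w in toy if w not in inputs]   # skip the query words
    return max(candidates, key=lambda w: cosine(target, toy[w]))

print(toy_analogy(positive=["woman", "king"], negative=["man"]))
```

Real embeddings have no such cleanly labeled axes, which is part of why analogies work well in some cases and fail in others.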


Let's define a function and try some simple analogies.  When it works well it is great, and when it fails it can be equally spectacular.

In [19]:
def make_analogy(x1, x2, y1):
    result = embed_vectors.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]

In [20]:
make_analogy('long', 'longest', 'fast')

'fastest'

In [21]:
make_analogy('hands', 'arms', 'feet')

'length'

In [22]:
make_analogy('paris', 'france', 'athens')

'greece'

In [23]:
make_analogy('man', 'programmer', 'woman')

'prodigy'

[Return to Top](#returnToTop)  
<a id = 'similarity'></a>

## 5. Similarity and Language

How does the similarity score translate to your understanding of language?  It is also good to experiment with words that you think are similar and words you think are different to see what kind of similarity scores you get.

In [24]:
embed_vectors.similarity(w1='man', w2='guy')

0.75536114

In [25]:
embed_vectors.similarity(w1='man', w2='boy')

0.8564432

In [26]:
embed_vectors.similarity(w1='man', w2='squirrel')

0.30944517