[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/3.embeddings/WordEmbeddings.ipynb)

In [1]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/wiki.10K.txt


This notebook explores word embeddings using Gensim: we train new embeddings on a dataset of our own and compare them with pre-trained GloVe embeddings.

In [2]:
!pip install gensim

Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp311-cp311-macosx_12_0_arm64.whl.metadata (60 kB)
Downloading scipy-1.13.1-cp311-cp311-macosx_12_0_arm64.whl (30.3 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.16.1
    Uninstalling scipy-1.16.1:
      Successfully uninstalled scipy-1.16.1
Successfully installed scipy-1.13.1


In [3]:
import re
from gensim.models import Word2Vec, KeyedVectors
from gensim.test.utils import datapath

First, let's train a new word2vec model on our data.

In [4]:
sentences=[]
filename="wiki.10K.txt"
with open(filename) as file:
    for line in file:
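        # strip leading/trailing whitespace so splitting never yields empty tokens
        words = line.strip().lower()
        # this file is already tokenized, so we can split on whitespace,
        # but first let's replace any sequence of whitespace (space, tab, newline, etc.) with a single space
        # (raw string avoids an invalid-escape warning for "\s")
        words = re.sub(r"\s+", " ", words)
        sentences.append(words.split(" "))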

In [5]:
model = Word2Vec(
    sentences,
    vector_size=100,
    window=5,
    min_count=2,
    workers=10
)
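
To make `min_count` and `window` concrete, here is a minimal pure-Python sketch of the two ideas they control: dropping rare words and pairing each word with its neighbors. The helper names (`filter_rare`, `context_pairs`) and the toy corpus are hypothetical, and this is a simplification: Gensim's training also samples a smaller effective window at random and subsamples very frequent words.

```python
from collections import Counter

def filter_rare(sentences, min_count):
    """Drop words that occur fewer than min_count times in the whole corpus."""
    counts = Counter(w for sent in sentences for w in sent)
    return [[w for w in sent if counts[w] >= min_count] for sent in sentences]

def context_pairs(sentence, window):
    """Yield (center, context) pairs for words at most `window` positions apart."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]

# hypothetical toy corpus, already tokenized and lowercased
toy = [["the", "actor", "won", "the", "award"],
       ["the", "actor", "sang"]]

kept = filter_rare(toy, min_count=2)
pairs = list(context_pairs(kept[0], window=1))
print(kept)
print(pairs)
```

Note that rare words are removed before the window is applied, so two words that were originally separated by a rare word can end up as neighbors.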

In [6]:
my_trained_vectors = model.wv
# save vectors to file if you want to use them later
my_trained_vectors.save_word2vec_format('embeddings.txt', binary=False)

In [7]:
my_trained_vectors.most_similar("actor", topn=10)

[('actress', 0.9447002410888672),
 ('writer', 0.9032748937606812),
 ('musician', 0.9025986790657043),
 ('producer', 0.8952045440673828),
 ('artist', 0.8919627666473389),
 ('composer', 0.8810310959815979),
 ('novelist', 0.8661836385726929),
 ('comedian', 0.86481112241745),
 ('singer', 0.8614188432693481),
 ('pianist', 0.8470042943954468)]
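
The scores returned by `most_similar` are cosine similarities between the query vector and every other vector in the vocabulary. A minimal sketch of that ranking, using hypothetical three-dimensional toy vectors and hypothetical helper names (`cosine`, `nearest`) rather than Gensim's own optimized code:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, vectors, topn=10):
    """Rank every word except the query by cosine similarity to the query."""
    scores = [(w, cosine(vectors[query], vec))
              for w, vec in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# hypothetical toy vectors, not taken from the trained model
toy_vectors = {
    "actor": [0.9, 0.1, 0.0],
    "actress": [0.8, 0.2, 0.0],
    "piano": [0.0, 0.1, 0.9],
}
ranked = nearest("actor", toy_vectors, topn=2)
print(ranked)
```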

Let's load vectors that have already been trained on a much bigger dataset. [GloVe vectors](https://nlp.stanford.edu/projects/glove/) are trained using a different method than word2vec, but the resulting vectors can still be read in by Gensim. Here we'll use a 100-dimensional model trained on 6B tokens (from Wikipedia and news), but bigger models are also available.

In [8]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/glove.6B.100d.100K.txt


In [9]:
glove = KeyedVectors.load_word2vec_format("glove.6B.100d.100K.txt", binary=False, no_header=True)
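
The `no_header=True` argument is needed because the word2vec text format starts with a header line giving the vocabulary size and the vector dimensionality, while GloVe text files start directly with the first word. The following sketch illustrates the difference with a hypothetical parser (`read_vectors`) and made-up two-word files; it is not how Gensim reads the file internally.

```python
import io

def read_vectors(handle, no_header=False):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    if not no_header:
        handle.readline()  # skip the "vocab_size dimensions" header line
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# made-up file contents for illustration
word2vec_text = "2 3\nactor 0.1 0.2 0.3\nfilm 0.4 0.5 0.6\n"
glove_text = "actor 0.1 0.2 0.3\nfilm 0.4 0.5 0.6\n"

from_w2v = read_vectors(io.StringIO(word2vec_text))
from_glove = read_vectors(io.StringIO(glove_text), no_header=True)
print(from_w2v == from_glove)
```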

In [10]:
glove.most_similar("actor", topn=10)

[('actress', 0.8580666184425354),
 ('comedian', 0.7957587242126465),
 ('starring', 0.7920297384262085),
 ('starred', 0.7582032680511475),
 ('actors', 0.7394536137580872),
 ('filmmaker', 0.7349801659584045),
 ('screenwriter', 0.7342271208763123),
 ('film', 0.6941470503807068),
 ('movie', 0.6924505829811096),
 ('comedy', 0.6884661912918091)]

`most_similar` also supports vector arithmetic: it averages the `positive` and `negative` input vectors (after multiplying the negative ones by -1) and returns the words whose vectors are closest to that average, leaving out the input words themselves. Play around with this function to discover other analogies that have been learned in this representation.

In [11]:
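# analogy: one is to two as three is to ?
# (computed as two - one + three)

# example 1: man : king :: woman : ?
one="man"
two="king"
three="woman"

# example 2 (overrides example 1): paris : france :: berlin : ?
one="paris"
two="france"
three="berlin"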

glove.most_similar(positive=[two, three], negative=[one], topn=5)

[('germany', 0.8923620581626892),
 ('austria', 0.7597678899765015),
 ('poland', 0.7425416111946106),
 ('denmark', 0.7360999584197998),
 ('german', 0.6986511945724487)]
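
The arithmetic behind an analogy query can be sketched in pure Python as follows. The vectors are hypothetical and hand-built so that the analogy works exactly, and `analogy` is an illustrative helper, not Gensim's implementation: it unit-normalizes each input vector, averages them with sign +1 or -1, and ranks the remaining words by cosine similarity to that average.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy(vectors, positive, negative, topn=5):
    """Average +positive and -negative unit vectors, then rank the other words."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    mean = [0.0] * dim
    for word, sign in signed:
        for k, x in enumerate(unit(vectors[word])):
            mean[k] += sign * x / len(signed)
    inputs = set(positive) | set(negative)  # input words are never returned
    scores = [(w, cosine(mean, v)) for w, v in vectors.items() if w not in inputs]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# hypothetical dimensions: [is-capital, is-country, French, German]
toy = {
    "paris":   [1.0, 0.0, 1.0, 0.0],
    "france":  [0.0, 1.0, 1.0, 0.0],
    "berlin":  [1.0, 0.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0, 1.0],
    "italy":   [0.0, 1.0, 0.0, 0.0],
    "rome":    [1.0, 0.0, 0.0, 0.0],
}
result = analogy(toy, positive=["france", "berlin"], negative=["paris"], topn=3)
print(result)
```

Real embeddings are never this clean, which is why the top answer in practice scores below 1.0 and near-misses such as neighboring countries appear in the list.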

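We can also evaluate the quality of the learned vectors through an intrinsic evaluation that compares model similarities with human similarity judgments in the WordSim-353 dataset. `evaluate_word_pairs` returns three values: the Pearson correlation, the Spearman correlation, and the percentage of word pairs that were skipped because they contain an out-of-vocabulary word.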

In [12]:
glove.evaluate_word_pairs(datapath('wordsim353.tsv'))

(PearsonRResult(statistic=0.5483502337187542, pvalue=4.235089835437802e-29),
 SignificanceResult(statistic=0.5327354323238274, pvalue=2.86541465805589e-27),
 0.0)

In [13]:
my_trained_vectors.evaluate_word_pairs(datapath('wordsim353.tsv'))

(PearsonRResult(statistic=0.38916628095711586, pvalue=4.9702980422433095e-14),
 SignificanceResult(statistic=0.39450373055600235, pvalue=2.0864107496013423e-14),
 1.41643059490085)
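
Both correlations measure how well the model's similarity scores track the human ratings; Spearman does so on ranks rather than raw values. A minimal pure-Python sketch of that computation, using made-up ratings and hypothetical helper names (`ranks`, `pearson`, `spearman`) in place of the SciPy routines that produced the results above:

```python
import math

def ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

human = [9.0, 7.5, 3.0, 1.0]  # made-up human similarity ratings
model = [0.8, 0.9, 0.3, 0.1]  # made-up cosine similarities for the same pairs
rho = spearman(human, model)
print(rho)
```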