# word2vec

In this notebook, we are going to download a small word2vec model and try out some applications of the trained word embeddings.

To get started, run the cells of the first section below.

## Loading a pre-trained model

In [1]:
# Install necessary libraries
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [2]:
# Import necessary libraries
import gensim
from gensim.models import KeyedVectors

In [None]:
# Download the model files from a URL
!wget -c -nv 'https://zenodo.org/records/6542975/files/wiki_300_50_word2vec.model?download=1' -O wiki_300_50_word2vec.model
!wget -c -nv 'https://zenodo.org/records/6542975/files/wiki_300_50_word2vec.model.syn1neg.npy?download=1' -O wiki_300_50_word2vec.model.syn1neg.npy
!wget -c -nv 'https://zenodo.org/records/6542975/files/wiki_300_50_word2vec.model.wv.vectors.npy?download=1' -O wiki_300_50_word2vec.model.wv.vectors.npy

2026-01-05 15:55:48 URL:https://zenodo.org/records/6542975/files/wiki_300_50_word2vec.model?download=1 [29406039/29406039] -> "wiki_300_50_word2vec.model" [1]
2026-01-05 16:00:06 URL:https://zenodo.org/records/6542975/files/wiki_300_50_word2vec.model.syn1neg.npy?download=1 [995376128/995376128] -> "wiki_300_50_word2vec.model.syn1neg.npy" [1]


In [None]:
# load the full model (the file was saved together with its training weights)
model = gensim.models.Word2Vec.load('wiki_300_50_word2vec.model')

# If we do not plan to train the model further,
# we can save memory by keeping only the word vectors:
vectors = model.wv
del model

## Finding similar words

We can use a word's vector to find similar words by looking up the vector's **nearest neighbours**. Neighbours are ranked by **cosine similarity**, i.e. the cosine of the angle between two vectors: the closer the value is to 1, the more similar the directions of the two vectors.

<div>
<img src="https://kdb.ai/files/2024/01/similarity-768x348.png" width="400px" />
</div>
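To make the measure concrete, here is a minimal sketch of cosine similarity in plain Python. The function name and the toy 2-dimensional vectors are made up for illustration; they are not taken from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumed values, for illustration only)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # right angle
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite directions

print(same_direction, orthogonal, opposite)
```

Note that only the direction of the vectors matters here, not their length.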

Find the words whose vectors are most similar to the vector of 'europe':

In [None]:
# this finds the 8 most similar words and the corresponding cosine similarity
vectors.most_similar('europe',topn=8)

[('asia', 0.701633870601654),
 ('scandinavia', 0.6125940084457397),
 ('america', 0.5918585062026978),
 ('european', 0.5871455073356628),
 ('africa', 0.5690851807594299),
 ('southeast_asia', 0.5615699291229248),
 ('oceania', 0.5580213069915771),
 ('continent', 0.5538219809532166)]
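The ranking above can be sketched by hand: compute the cosine similarity between the query word and every other word, sort, and keep the top `topn`. This is only a simplified illustration with a made-up four-word vocabulary and a hypothetical helper `nearest_neighbours`; the real model does the same kind of comparison over its whole vocabulary.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_neighbours(word, embeddings, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != word  # the query word itself is not a candidate
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy 2-dimensional embeddings (assumed values, not from the model)
toy_embeddings = {
    "europe": [0.9, 0.1],
    "asia": [0.8, 0.3],
    "africa": [0.7, 0.5],
    "banana": [0.1, 0.9],
}

neighbours = nearest_neighbours("europe", toy_embeddings, topn=2)
print(neighbours)
```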

We can also use cosine similarity to pick, from a given list of candidates, the word that is most similar to a target word.

In [None]:
vectors.most_similar_to_given('europe',['dinar','euro','dollar','pound','krona'])

'euro'

**Exercise:**

Try it out on other words!

In [None]:
print(vectors.most_similar('word',topn=8))
print(vectors.most_similar_to_given('word',['phrase','sentence','saying']))

[('phrase', 0.7247493863105774), ('words', 0.7070206999778748), ('meaning', 0.6717065572738647), ('loanword', 0.6272196173667908), ('colloquial', 0.59702467918396), ('noun', 0.5963165163993835), ('calque', 0.5881631970405579), ('proverb', 0.5685657262802124)]
phrase


In [None]:
vectors.similarity('orange','phone')

0.19567843

## Finding outliers

We can also use vectors to spot 'outliers', i.e. the word that diverges most from the rest of a set.

In [None]:
vectors.doesnt_match(['lemon','plum','pear','tree'])

'tree'
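One way to think about outlier detection is to compare each word with the average direction of the whole set and pick the word that agrees with it least. The sketch below follows that idea with toy vectors; the helper names and values are assumptions for illustration, and it is a simplified approximation rather than the library's own code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, embeddings):
    """Return the word least similar to the mean direction of the set."""
    unit_vectors = [unit(embeddings[w]) for w in words]
    dims = len(unit_vectors[0])
    mean = unit([sum(vec[i] for vec in unit_vectors) / len(unit_vectors)
                 for i in range(dims)])
    # dot product of unit vectors = cosine similarity with the mean direction
    scores = [sum(a * b for a, b in zip(vec, mean)) for vec in unit_vectors]
    return min(zip(scores, words))[1]

# Toy 2-dimensional embeddings (assumed values, not from the model)
toy = {
    "lemon": [0.9, 0.2],
    "plum": [0.8, 0.3],
    "pear": [0.85, 0.25],
    "tree": [0.2, 0.9],
}

outlier = odd_one_out(["lemon", "plum", "pear", "tree"], toy)
print(outlier)
```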

## Compositionality

In some cases, we can simply add two word vectors to arrive at a joint concept:

In [None]:
vectors.most_similar(positive=['yugoslavia','currency'])

Try to find another concept for which the compositionality/addition principle holds in vector space:

In [None]:
vectors.most_similar(positive=['word1','word2'])

## Word analogies
Vectors also allow us to do mathematical operations on words (or their embeddings) that correspond to analogical relations between words.

<table style="width: 100%; table-layout: fixed;">
  <colgroup>
    <col span="1" style="width: 400px;">
    <col span="1" style="width: 200px;">
  </colgroup>
  <tr>
    <td>
      <p>
       For example, we can look for a word that would solve the equation <br>
<br>

 `king - man + woman = x`
<br><br>

 In other words, what is to *king* as *woman* is to *man* <br>
 OR <br> what is to *woman* as *king* is to *man*?
      </p>
    </td>
    <td>
      <img src="https://ai.engin.umich.edu/wp-content/uploads/sites/8/2020/06/king-queen.png" width=300px />
    </td>
  </tr>
</table>
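The equation can be worked through by hand on toy vectors. In this sketch the three dimensions loosely stand for "royalty", "male" and "female"; the values, the vocabulary and the helper `solve_analogy` are all invented for illustration. The input words are excluded from the candidates, otherwise one of them would often come out on top.

```python
import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def solve_analogy(positive, negative, embeddings):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dims = len(next(iter(embeddings.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(embeddings[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(embeddings[word]))]
    excluded = set(positive) | set(negative)
    candidates = [w for w in embeddings if w not in excluded]
    return max(candidates, key=lambda w: cosine_similarity(query, embeddings[w]))

# Toy embeddings: [royalty, male, female] (assumed values)
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.9, 1.0, 0.0],
    "apple": [0.1, 0.1, 0.1],
}

answer = solve_analogy(positive=["king", "woman"], negative=["man"], embeddings=toy)
print(answer)
```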

### Semantic analogies

> If a *king* was not a *man* but a *woman*, what would they be?

In [None]:
# the vectors for 'king' and 'woman' are added and the vector for 'man' is subtracted
vectors.most_similar(positive=['king','woman'], negative=['man'])

[('queen', 0.584174633026123),
 ('queen_consort', 0.5550029873847961),
 ('queen_dowager', 0.5135421752929688),
 ('catherine_jagiellon', 0.5011864304542542),
 ('queen_regnant', 0.5006860494613647),
 ('regent', 0.49817225337028503),
 ('gunilla_bielke', 0.49682119488716125),
 ('princess', 0.4963575601577759),
 ('jagellon', 0.4916633367538452),
 ('queen_saovabha_phongsri', 0.4906517565250397)]

> If a *carnivore* didn't eat *meat* but ate *vegetables*, what would they be?

In [None]:
vectors.most_similar(positive=['carnivore','vegetable'], negative=['meat'])

[('herbivore', 0.5727195143699646),
 ('carnivorous', 0.5675762295722961),
 ('herbivores', 0.5299584865570068),
 ('herbivorous', 0.5283064842224121),
 ('invertebrate', 0.5204532742500305),
 ('dicotyledonous', 0.517408549785614),
 ('salvinia_molesta', 0.513854444026947),
 ('earthworms', 0.5121837854385376),
 ('herbaceous', 0.5119017958641052),
 ('single_celled_organism', 0.5048058032989502)]

### Syntactic analogies

We can also compute syntactic analogies, i.e. transfer the syntactic relation between two words to other words.

> If *was* is a form of *be*, what is the corresponding form of *see*?

In [None]:
vectors.most_similar(positive=['was','see'], negative=['be'])

> If *spiders* is the plural of *spider*, what would be the plural of *octopus*?

In [None]:
vectors.most_similar(positive=['spiders','octopus'], negative=['spider'])

# Word vectors equipped with POS tags

Word vectors can also be trained on lemmatized data and/or with a part-of-speech (POS) tag attached to each word. This is very useful for disambiguating homographs, e.g. the adjective *second* (as in *second place*) from the noun *second* (the unit of time).

In [None]:
# Import necessary libraries
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

In [None]:
# here, we download the model file from Hugging Face and load it into gensim
model_path = hf_hub_download(repo_id="Word2vec/nlpl_0", filename="model.bin")
model_pos = KeyedVectors.load_word2vec_format(model_path, binary=True, unicode_errors="ignore")

Observe the difference between the neighbourhoods of the noun *second* (`second_NOUN`) and the adjective *second* (`second_ADJ`).

In [None]:
print(model_pos.most_similar('second_NOUN', topn=3))
print(model_pos.most_similar('second_ADJ', topn=3))
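Because the POS tag is part of the key, a homograph appears as several separate entries. The sketch below groups keys by their word form to show which tags are available for it. It assumes every key follows the `word_TAG` pattern seen in `second_NOUN`; the helper `split_key` and the toy key list are hypothetical.

```python
def split_key(key):
    """Split a key such as 'second_NOUN' into ('second', 'NOUN')."""
    word, _, tag = key.rpartition("_")
    return word, tag

# Toy list of keys (assumed, for illustration only)
toy_keys = ["second_NOUN", "second_ADJ", "minute_NOUN", "minute_ADJ"]

tags_by_word = {}
for key in toy_keys:
    word, tag = split_key(key)
    tags_by_word.setdefault(word, []).append(tag)

print(tags_by_word)
```

With the loaded model, the full list of keys is available as `model_pos.index_to_key` and could be passed through the same loop.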

# Extra: Training your own model

In [None]:
# Import necessary libraries
import json
import logging

import gensim
from gensim import utils
from gensim.models.word2vec import LineSentence

In [None]:
# here, we download a dataset of historical quotes to use as training data.
!wget -c "https://huggingface.co/datasets/m-ric/english_historical_quotes/resolve/main/english_historical_quotes.json?download=true" -O dataset.json

In [None]:
with open('dataset.json','r', encoding='utf-8') as f:
    dataset = json.load(f)

Let's print out some data from the `dataset` variable to see how it is structured.

In [None]:
print(len(dataset)) # print out the length of the variable
print(type(dataset)) # print out the type of the variable


print(dataset[0]) # print out the first item of the list
print(dataset[0].keys())  # print out the keys of the first (dictionary) item of the list

Here, we collect all the sentences contained under the key 'quote' in each item from the `dataset` list.




In [None]:
all_quotes = [item['quote'] for item in dataset]

Word2vec expects its **input** to be an iterable of **tokenized sentences**, i.e. each sentence is a list of string tokens.



---


Below, we define our own iterable that yields one tokenized sentence at a time from our list `all_quotes`. Each sentence is preprocessed with gensim's built-in `simple_preprocess` utility, which lowercases and tokenizes the text.

In [None]:
class MyCorpus:
    """An iterator that yields sentences (lists of strings)."""

    def __iter__(self):
        for line in all_quotes:
            yield utils.simple_preprocess(line)
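The reason for wrapping the loop in a class with `__iter__` rather than using a plain generator is that training passes over the corpus more than once (first to build the vocabulary, then once per epoch), and a generator is exhausted after a single pass. The sketch below illustrates the difference with two toy quotes; `lower().split()` is only a crude stand-in for `simple_preprocess`, and the names are made up.

```python
toy_quotes = ["Know thyself", "Time is money"]

def one_shot(quotes):
    """A generator: it can be consumed only once."""
    for line in quotes:
        yield line.lower().split()

class Restartable:
    """An iterable: every call to __iter__ starts a fresh pass."""

    def __init__(self, quotes):
        self.quotes = quotes

    def __iter__(self):
        for line in self.quotes:
            yield line.lower().split()

gen = one_shot(toy_quotes)
gen_passes = [len(list(gen)), len(list(gen))]           # second pass is empty

corpus = Restartable(toy_quotes)
corpus_passes = [len(list(corpus)), len(list(corpus))]  # both passes see all sentences

print(gen_passes, corpus_passes)
```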

When training a model, we can define the following parameters (among many, many others):

*   **`vector_size`** – Dimensionality of the word vectors: for word2vec, 100-300 is usually a good choice.

*   **`window`** – Maximum distance between the current and predicted word within a sentence. A larger window tends to capture more topical semantic information, and a smaller window more syntactic information.

*   **`min_count`** – Ignore all words with a total frequency lower than this. This prunes the internal dictionary by disregarding words that appear very rarely. In large corpora, these are usually typos and irrelevant words, and there is not enough data to learn meaningful vectors for them anyway.

*   **`sg`** – Select the training algorithm: 1 for skip-gram; 0 for CBOW (default).
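To get a feel for `min_count` and `window`, the following sketch prunes rare words from a toy corpus and then lists the (current word, context word) pairs that a skip-gram model would see for a fixed window. It is a deliberately simplified illustration: the helper names and the corpus are invented, and real implementations add further steps (for example, varying the effective window size or subsampling frequent words).

```python
from collections import Counter

# Toy tokenized corpus (assumed, for illustration only)
toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
]

def prune(corpus, min_count):
    """Drop every token whose total corpus frequency is below min_count."""
    counts = Counter(tok for sent in corpus for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count] for sent in corpus]

def skipgram_pairs(sentence, window):
    """List (current, context) pairs with context at most `window` positions away."""
    pairs = []
    for i, current in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        pairs.extend((current, sentence[j]) for j in range(start, end) if j != i)
    return pairs

pruned = prune(toy_corpus, min_count=2)
pairs = skipgram_pairs(pruned[0], window=1)

print(pruned)
print(pairs)
```

With `min_count=2`, only *the* and *sat* survive in this tiny corpus, which is why `min_count=1` is the more sensible setting for very small datasets.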

In [None]:
sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences, min_count=1, window=5, vector_size=150, sg=1)



---




### Using your own data

If you want to train on your own texts read directly from a file, you can specify the file here. For simple text files (.txt), you can use:
1. `.read()`: the whole text is read as a single string
2. `.readlines()`: the file is read line by line, creating a list of all lines in the file


For other formats, you will have to extract and preprocess the text with dedicated tools.
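The difference between the two reading methods is easiest to see on a small example. This sketch writes a temporary two-line file (the file name and its contents are made up) and reads it back both ways.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.txt")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.write("First sentence here\nSecond sentence here\n")

    # read(): one string containing the whole file
    with open(toy_path, "r", encoding="utf-8") as f:
        whole_text = f.read()

    # readlines(): a list with one string per line
    with open(toy_path, "r", encoding="utf-8") as f:
        lines = f.readlines()

print(repr(whole_text))
print(lines)
```

Each line returned by `readlines()` keeps its trailing newline character, which needs to be stripped or split away before training.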

In [None]:
# @title Opening a file
# if using a location on Google Drive, you need to mount it first:

from google.colab import drive
drive.mount('/content/drive')

# if using files directly from the runtime, you can open them without mounting Drive.

# define where the file is

filepath = 'drive/MyDrive/path_to_file.txt'



---


The sentences iterable can simply be a list of lists of tokens (Option 1), but for larger corpora it is better to create an iterable that streams the sentences directly from disk or network (Option 2), so that the whole corpus does not have to fit into memory.



In [None]:
# @title Option 1

# simple list of lists: sentences are lowercased and split into tokens on whitespace

# open and read the file
with open(filepath,'r', encoding='utf-8') as f:
    custom_data = f.readlines()

# split() without arguments also removes the trailing newline of each line
custom_dataset = [x.lower().split() for x in custom_data]

If we have a file containing one sentence per line, we can use the `LineSentence` class, which feeds the model one sentence at a time.


NB: The text must already be preprocessed; `LineSentence` only splits each line on whitespace.


In [None]:
# @title Option 2

# LineSentence class for files with one sentence per line
custom_dataset = LineSentence(filepath)
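Conceptually, streaming from disk means that the file is reopened on every pass and only one line is held in memory at a time. The class below is a hypothetical, simplified stand-in that shows the idea; it is not the `LineSentence` implementation, and the toy file is created only for the demonstration.

```python
import os
import tempfile

class StreamedLines:
    """Yield one whitespace-tokenized line at a time, reopening the file on each pass."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.txt")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.write("know thyself\n\ntime is money\n")

    stream = StreamedLines(toy_path)
    first_pass = list(stream)
    second_pass = list(stream)  # a second pass works because the file is reopened

print(first_pass)
```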

In [None]:
# @title Train the model with the selected custom dataset.
custom_model = gensim.models.Word2Vec(custom_dataset, min_count=2, window=5, vector_size=150, sg=1)