<center><img src="./images/nup_logo_dark.jpeg" width=300 style="display: inline-block;"></center>

## Advanced ML
### Topic modeling and word2vec

<br />
March 18, 2025


This notebook examines two models from the `gensim` library:
  - LDA (Latent Dirichlet Allocation), a topic model
  - word2vec, a word embedding model

Sources of inspiration:
  - https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html
  - https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html

### LDA (Latent Dirichlet Allocation)
We install the topic modeling library `gensim` (http://radimrehurek.com/gensim/) and the NLTK library (http://nltk.org/), which is needed for tokenization and lemmatization.


In [1]:
!pip install --upgrade gensim
!pip install --upgrade nltk

Collecting gensim
  Obtaining dependency information for gensim from https://files.pythonhosted.org/packages/1f/76/616bc781bc19ee76b387a101211f73e00cf59368fcc221e77f88ea907d04/gensim-4.3.3-cp312-cp312-macosx_11_0_arm64.whl.metadata
  Downloading gensim-4.3.3-cp312-cp312-macosx_11_0_arm64.whl.metadata (8.1 kB)
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Obtaining dependency information for numpy<2.0,>=1.18.5 from https://files.pythonhosted.org/packages/75/5b/ca6c8bd14007e5ca171c7c03102d17b4f4e0ceb53957e8c44343a9546dcc/numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl.metadata
  Downloading numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Obtaining dependency information for scipy<1.14.0,>=1.7.0 from https://files.pythonhosted.org/packages/dc/5a/2043a3bde1443d94014aaa41e0b50c39

In [2]:
import nltk
import numpy as np
import os
import sys

from gensim import corpora, models, similarities
from math import log
from time import time

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

We read the collection of source texts into a list of documents; after preprocessing, each document will be a list of lemmas (tokens). In this example we load the entire collection into memory, although this is not required: `gensim` can process a corpus as a stream at every stage of model building.
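
As a rough illustration of the streaming alternative, the sketch below yields one tokenized document at a time instead of building a list. The class name `StreamingDocs` and the one-document-per-line file layout are assumptions made for this example and are not used elsewhere in the notebook; `gensim` components such as `Dictionary` and `Phrases` accept any iterable of token lists that can be traversed more than once.

```python
import os
import tempfile


class StreamingDocs:
    """Iterate over a text file, yielding one tokenized document per line.

    Hypothetical helper: assumes one document per line, whitespace-separated tokens.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be traversed repeatedly.
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                yield line.lower().split()


# Tiny demonstration with a temporary file standing in for a large corpus.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('Neural networks learn\nTopic models cluster words\n')
    demo_path = tmp.name

stream = StreamingDocs(demo_path)
first_pass = list(stream)
second_pass = list(stream)  # a second pass works because __iter__ re-opens the file
os.remove(demo_path)
print(first_pass)
```

Only one line of the file is held in memory at a time, which is what makes this pattern suitable for collections that do not fit in RAM.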

The collection consists of articles from the NeurIPS conference, one of the standard collections for topic modeling. It contains 1740 documents, averaging roughly 3000 tokens each before filtering.

In [4]:
import re
import tarfile
import urllib.request


tarfile_url = 'https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'
filename = 'nips12raw_str602.tgz'
# Uncomment to download the archive on the first run.
# urllib.request.urlretrieve(tarfile_url, filename)

def extract_documents(fname=filename):
    with tarfile.open(fname, mode='r:gz') as tar:
        # Ignore directory entries, as well as files like README, etc.
        files = [
            m for m in tar.getmembers()
            if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
        ]
        for member in sorted(files, key=lambda x: x.name):
            member_bytes = tar.extractfile(member).read()
            yield member_bytes.decode('utf-8', errors='replace')


In [5]:
docs = list(extract_documents())
print(len(docs))
print(docs[0][:500])

1740
1
CONNECTIVITY VERSUS ENTROPY
Yaser S. Abu-Mostafa
California Institute of Technology
Pasadena, CA 91125
ABSTRACT
How does the connectivity of a neural network (number of synapses per
neuron) relate to the complexity of the problems it can handle (measured by
the entropy)? Switching theory would suggest no relation at all, since all Boolean
functions can be implemented using a circuit with very low connectivity (e.g.,
using two-input NAND gates). However, for a network that learns a pr


Data preparation:
- Tokenize the text and drop numbers and one-character tokens
- Perform lemmatization
- Build n-grams (bigrams)
- Create a dictionary
- Filter out tokens that are too frequent or too rare

In [6]:
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

In [7]:
print(np.sum([len(doc) for doc in docs]))

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

print(np.sum([len(doc) for doc in docs]))

5461201
5115888


In [8]:
print(docs[1][:50])

['stochastic', 'learning', 'networks', 'and', 'their', 'electronic', 'implementation', 'joshua', 'alspector', 'robert', 'b', 'allen', 'victor', 'hut', 'and', 'srinagesh', 'satyanarayana', 'bell', 'communications', 'research', 'morristown', 'nj', 'abstract', 'we', 'describe', 'a', 'family', 'of', 'learning', 'algorithms', 'that', 'operate', 'on', 'a', 'recurrent', 'symmetrically', 'connected', 'neuromorphic', 'network', 'that', 'like', 'the', 'boltzmann', 'machine', 'settles', 'in', 'the', 'presence', 'of', 'noise']


In [9]:
# Remove words that are only one character
docs = [[token for token in doc if len(token) > 1] for doc in docs]
print(np.sum([len(doc) for doc in docs]))

4629808


In [10]:
# Remove words with underscores, since we are going to use them as delimiters in bigrams
docs = [[token for token in doc if '_' not in token] for doc in docs]
print(np.sum([len(doc) for doc in docs]))

4626035


In [11]:
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/Aleksandr.Avdiushenko/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [12]:
print(lemmatizer.lemmatize('abstracts'),
      lemmatizer.lemmatize('fishes'))

abstract fish


In [13]:
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

In [14]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)


2025-01-23 10:59:55,937 : INFO : collecting all words and their counts
2025-01-23 10:59:55,937 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2025-01-23 10:59:58,083 : INFO : collected 1114271 token types (unigram + bigrams) from a corpus of 4626035 words and 1740 sentences
2025-01-23 10:59:58,084 : INFO : merged Phrases<1114271 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000>
2025-01-23 10:59:58,084 : INFO : Phrases lifecycle event {'msg': 'built Phrases<1114271 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000> in 2.15s', 'datetime': '2025-01-23T10:59:58.084307', 'gensim': '4.3.3', 'python': '3.12.6 (v3.12.6:a4a2d2b0d85, Sep  6 2024, 16:08:03) [Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-15.2-arm64-arm-64bit', 'event': 'created'}


In [15]:
for token in bigram[docs[0][:100]]:
    if '_' in token:
        print(token)

abu_mostafa
california_institute
technology_pasadena
ca_abstract
neural_network
boolean_function
can_be
very_low
learning_rule
lower_bound
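
The `threshold=10.0` shown in the `Phrases` log above is the default cut-off for the bigram score. Below is a minimal sketch of what we understand the default scoring rule to be, based on the `gensim` documentation; the function name `bigram_score` and all counts are made up for illustration and do not come from the NeurIPS corpus.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Mikolov-style collocation score.

    High when the pair co-occurs more often than the frequencies of its parts suggest.
    """
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


threshold = 10.0

# Hypothetical counts: two moderately rare words that almost always appear together...
score_frequent_pair = bigram_score(count_a=500, count_b=400, count_ab=300,
                                   vocab_size=10_000, min_count=20)
# ...versus two very common words that meet the same number of times.
score_chance_pair = bigram_score(count_a=50_000, count_b=40_000, count_ab=300,
                                 vocab_size=10_000, min_count=20)

print(score_frequent_pair > threshold, score_chance_pair > threshold)
```

Under this rule a pair of very frequent words needs far more co-occurrences to count as a phrase than a pair of rare words does.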


In [16]:
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document
            docs[idx].append(token)

In [17]:
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents
dictionary = Dictionary(docs)

2025-01-23 11:00:03,990 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2025-01-23 11:00:04,898 : INFO : built Dictionary<77939 unique tokens: ['0a', '2h', '2h2', '2he', '2n']...> from 1740 documents (total 4944995 corpus positions)
2025-01-23 11:00:04,898 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<77939 unique tokens: ['0a', '2h', '2h2', '2he', '2n']...> from 1740 documents (total 4944995 corpus positions)", 'datetime': '2025-01-23T11:00:04.898816', 'gensim': '4.3.3', 'python': '3.12.6 (v3.12.6:a4a2d2b0d85, Sep  6 2024, 16:08:03) [Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-15.2-arm64-arm-64bit', 'event': 'created'}


Remove words that are too rare (e.g., typos) and words that are too frequent (e.g., stop words or common non-topical terms). The `filter_extremes` method removes tokens from the dictionary that appear in fewer than `no_below` documents or in more than a fraction `no_above` of all documents.
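
To make the rule concrete, here is a minimal pure-Python sketch of the same document-frequency criterion. It is not `gensim`'s implementation: the helper name `filter_by_document_frequency` and the toy documents are assumptions for illustration, and the real method's additional `keep_n` limit is ignored.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Return the set of tokens kept by a no_below / no_above document-frequency rule."""
    # Document frequency: in how many documents each token occurs at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # convert the fraction into a document count
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ['the', 'neuron', 'spike'],
    ['the', 'neuron', 'kernel'],
    ['the', 'kernel', 'typo1'],
    ['the', 'policy', 'reward'],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['kernel', 'neuron']
```

Here `the` is dropped because it occurs in every document, and the tokens seen only once are dropped as too rare.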

In [18]:
# Remove rare and common tokens: those that occur in fewer than 20 documents
# or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

2025-01-23 11:00:06,494 : INFO : discarding 69316 tokens: [('0a', 19), ('2h', 16), ('2h2', 1), ('2he', 3), ('a', 1740), ('about', 1058), ('abstract', 1740), ('after', 1087), ('alently', 2), ('all', 1658)]...
2025-01-23 11:00:06,495 : INFO : keeping 8623 tokens which were in no less than 20 and no more than 870 (=50.0%) documents
2025-01-23 11:00:06,505 : INFO : resulting dictionary: Dictionary<8623 unique tokens: ['2n', 'a2', 'a_follows', 'ability', 'abu']...>


Represent each document as a bag-of-words vector: a sparse list of `(token_id, count)` pairs.

In [19]:
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [20]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 8623
Number of documents: 1740


In [22]:
print(corpus[0][:10])

[(0, 4), (1, 1), (2, 1), (3, 2), (4, 4), (5, 4), (6, 1), (7, 1), (8, 1), (9, 1)]
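
Each pair in the output above is `(token_id, count)`: the first token of the dictionary occurs four times in document 0, and so on. The conversion can be sketched in plain Python as follows; `to_bow` and the three-word vocabulary are hypothetical stand-ins for `dictionary.doc2bow` and the real dictionary.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs.

    Tokens missing from the vocabulary are skipped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


toy_token2id = {'network': 0, 'neuron': 1, 'entropy': 2}  # hypothetical vocabulary
bow = to_bow(['neuron', 'network', 'neuron', 'the', 'entropy'], toy_token2id)
print(bow)  # [(0, 1), (1, 2), (2, 1)]
```

Word order is discarded, which is exactly the assumption LDA makes about a document.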


### Training
Now we are ready to build a topic model for our collection. We will build an online LDA model, implemented in the `gensim` library. We specify the vectorized corpus of texts, the dictionary, and the number of topics (10). We will discuss the remaining parameters later.

In [23]:
start = time()
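# Set training parameters.
num_topics = 10
chunksize = 2000  # number of documents per training chunk (batch size)
epochs = 5  # full passes over the corpus
iterations = 400  # maximum inference iterations per document
eval_every = None  # Don't evaluate model perplexity, takes too much time.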

# Make an index to word dictionary
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=epochs,
    eval_every=eval_every
)
print('Evaluation time: {}'.format((time()-start) / 60))

2025-01-23 11:00:24,760 : INFO : using autotuned alpha, starting with [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
2025-01-23 11:00:24,761 : INFO : using serial LDA version on this node
2025-01-23 11:00:24,766 : INFO : running online (multi-pass) LDA training, 10 topics, 5 passes over the supplied corpus of 1740 documents, updating model once every 1740 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2025-01-23 11:00:24,767 : INFO : PROGRESS: pass 0, at document #1740/1740
2025-01-23 11:00:29,848 : INFO : optimized alpha [0.059415117, 0.08211947, 0.050295256, 0.11127745, 0.0716258, 0.0535777, 0.0707049, 0.07356323, 0.09876121, 0.08361459]
2025-01-23 11:00:29,851 : INFO : topic #2 (0.050): 0.006*"cell" + 0.004*"image" + 0.003*"component" + 0.003*"architecture" + 0.003*"net" + 0.003*"approximation" + 0.003*"rule" + 0.003*"distance" + 0.003*"direction" + 0.002*"neuron"
2025-01-23 11:00:29,851 : INFO : topic #5 (0.054): 0.00

Evaluation time: 0.25493288040161133


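Let's look at the result. We are interested in part of the $\Phi$ matrix: the most probable words in each topic. Because the whole NeurIPS collection is devoted to machine learning, the topics overlap and are hard to evaluate by eye, although some are interpretable (for example, one is dominated by reinforcement learning terms such as `policy`, `action`, and `reinforcement`). In the printout below, each column is one topic.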

In [24]:
for position in range(10):
    row = []
    for topic in range(10):
        row.append(model.show_topic(topic)[position][0].center(11, ' '))
    print(''.join(row))

  control  generalization polynomial   noise       cell      hidden     neuron     image      image     sequence 
   policy    gaussian approximation component    neuron  hidden_unit   signal  recognition   object     layer   
   action     hidden    tangent     source    response    object     memory      net       visual     hidden  
  dynamic     sample     bound      matrix    activity     code       chip      layer      class     dynamic  
   motor    prediction    cell     sequence   control      net       spike    character  classifier    net    
  optimal   regression  distance   mixture    stimulus   circuit     analog      node       tree      matrix  
reinforcement  density      net       hidden    synaptic     rule     circuit     hidden  recognition   neuron  
 trajectory   class    dimension      em       firing   activation    word     trained     filter   recurrent 
  movement   estimate  direction    signal     layer      layer    connection   class      signal      no

In [25]:
top_topics = model.top_topics(corpus)

2025-01-23 11:00:48,857 : INFO : CorpusAccumulator accumulated stats from 1000 documents


In [26]:
top_topics[0]

([(0.0054766214, 'generalization'),
  (0.0047958074, 'gaussian'),
  (0.004777913, 'hidden'),
  (0.0044106385, 'sample'),
  (0.0043535293, 'prediction'),
  (0.00389188, 'regression'),
  (0.0036687148, 'density'),
  (0.0035643762, 'class'),
  (0.0032825277, 'estimate'),
  (0.0031644437, 'noise'),
  (0.0030338306, 'training_set'),
  (0.003029318, 'approximation'),
  (0.00301275, 'component'),
  (0.0029484706, 'variance'),
  (0.0029387628, 'bound'),
  (0.0029217678, 'hidden_unit'),
  (0.00291219, 'kernel'),
  (0.0027721105, 'optimal'),
  (0.0027302164, 'layer'),
  (0.0026089945, 'matrix')],
 -0.9447828080187861)

In [27]:
model.inference([corpus[0]])[0]

array([[3.6106169e-02, 5.5974636e-02, 1.0229638e+02, 7.4336639e+01,
        5.0236102e-02, 1.6639297e+02, 5.4127239e+01, 5.5391144e-02,
        6.2961258e-02, 4.0246011e+02]], dtype=float32)
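
The array returned by `model.inference` holds unnormalized variational parameters (often written $\gamma$), not probabilities. A small sketch of turning the row above into topic proportions follows; the variable names are ours, and `gensim` also offers `model.get_document_topics` for the same purpose.

```python
# Values copied from the output above (variational parameters for document 0).
gamma = [3.6106169e-02, 5.5974636e-02, 1.0229638e+02, 7.4336639e+01,
         5.0236102e-02, 1.6639297e+02, 5.4127239e+01, 5.5391144e-02,
         6.2961258e-02, 4.0246011e+02]

total = sum(gamma)
theta = [g / total for g in gamma]  # estimated topic proportions, summing to 1

# Index of the topic with the largest share in this document.
dominant_topic = max(range(len(theta)), key=theta.__getitem__)
print(dominant_topic, round(theta[dominant_topic], 3))
```

Document 0 is therefore a mixture of a handful of topics, with the last topic taking about half of the mass and five topics contributing almost nothing.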

### Perplexity evaluation
Inspecting topic and document profiles by eye is not enough; to compare different models, for example those trained with different parameters, we need a quantitative measure. Here we measure **perplexity**. The method `model.state.get_lambda` returns the unnormalized $\Phi$ matrix, and `model.inference` estimates the unnormalized $\Theta$ matrix for a list of documents.

We iterate over the collection and accumulate the log-likelihood of every term; perplexity is the exponential of the negative average log-likelihood per token. The lower the perplexity, the better.
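
Written out, the quantity computed in the next cell is the following, where $n_{dw}$ is the count of term $w$ in document $d$, $\phi_{wt}$ and $\theta_{td}$ are the normalized entries of $\Phi$ and $\Theta$, and $n$ is the total number of tokens. The index notation is ours, chosen to match the code rather than taken from a particular reference.

$$
\mathrm{perplexity} = \exp\left(-\frac{1}{n}\sum_{d}\sum_{w \in d} n_{dw}\,\ln\sum_{t}\phi_{wt}\,\theta_{td}\right),
\qquad n = \sum_{d}\sum_{w \in d} n_{dw}
$$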

In [28]:
def perplexity(model, corpus):
    """Compute the per-token perplexity of `model` on a bag-of-words `corpus`."""
    corpus_length = 0
    log_likelihood = 0.0
    # Normalize each topic's row of lambda to get word-in-topic probabilities (Phi).
    lam = model.state.get_lambda()
    topic_profiles = lam / lam.sum(axis=1, keepdims=True)
    for document in corpus:
        # gamma holds the unnormalized topic weights (Theta) for this document.
        gamma, _ = model.inference([document])
        document_profile = gamma / np.sum(gamma)
        for term_id, term_count in document:
            corpus_length += term_count
            term_probability = np.dot(document_profile, topic_profiles[:, term_id])
            log_likelihood += term_count * log(term_probability.item())
    return np.exp(-log_likelihood / corpus_length)

In [29]:
print('Perplexity: {}'.format(perplexity(model, corpus)))

Perplexity: 2868.6156534545394


In [30]:
model_5 = models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=5,
    passes=epochs,
    eval_every=eval_every
)
model_20 = models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=20,
    passes=epochs,
    eval_every=eval_every
)

2025-01-23 11:00:57,717 : INFO : using autotuned alpha, starting with [0.2, 0.2, 0.2, 0.2, 0.2]
2025-01-23 11:00:57,718 : INFO : using serial LDA version on this node
2025-01-23 11:00:57,719 : INFO : running online (multi-pass) LDA training, 5 topics, 5 passes over the supplied corpus of 1740 documents, updating model once every 1740 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2025-01-23 11:00:57,720 : INFO : PROGRESS: pass 0, at document #1740/1740
2025-01-23 11:01:02,245 : INFO : optimized alpha [0.122527294, 0.17584074, 0.09676569, 0.1317677, 0.20862874]
2025-01-23 11:01:02,247 : INFO : topic #0 (0.123): 0.007*"image" + 0.003*"response" + 0.003*"stimulus" + 0.003*"field" + 0.003*"neuron" + 0.003*"layer" + 0.003*"cell" + 0.002*"recognition" + 0.002*"dynamic" + 0.002*"rule"
2025-01-23 11:01:02,247 : INFO : topic #1 (0.176): 0.007*"neuron" + 0.004*"signal" + 0.004*"cell" + 0.003*"hidden" + 0.003*"layer" + 0.003*"image" + 0

In [31]:
print('Perplexity 5: {}'.format(perplexity(model_5, corpus)))
print('Perplexity 20: {}'.format(perplexity(model_20, corpus)))

Perplexity 5: 3066.2736778426897
Perplexity 20: 2594.579183798892
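
Both training and evaluation above use the same `corpus`, so a model with more topics can look better partly because it has more parameters to fit that data. One common remedy, sketched here under the assumption that a random document-level split is acceptable, is to hold out part of the corpus. The helper name `split_corpus` and the 80/20 ratio are illustrative choices, not part of the original notebook.

```python
import random


def split_corpus(corpus, held_out_fraction=0.2, seed=0):
    """Shuffle document indices and split the corpus into train and held-out parts."""
    indices = list(range(len(corpus)))
    random.Random(seed).shuffle(indices)
    n_held_out = int(len(corpus) * held_out_fraction)
    held_out = [corpus[i] for i in indices[:n_held_out]]
    train = [corpus[i] for i in indices[n_held_out:]]
    return train, held_out


# Toy bag-of-words corpus standing in for the real one.
toy_corpus = [[(0, 1)], [(1, 2)], [(2, 1)], [(0, 3)], [(1, 1)]]
train_part, held_out_part = split_corpus(toy_corpus)
print(len(train_part), len(held_out_part))  # 4 1
```

The models would then be trained on the training part only, and `perplexity` would be called with the held-out part.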


### Word2Vec model
Word2Vec is one of the fundamental neural network models of the "pre-transformer" era (2013-2018). The model maps words into an $N$-dimensional space (embeddings) so that words used in similar contexts receive similar embeddings.

The `gensim` library implements two training architectures for word2vec:
  - Skip-gram (SG)
  - Continuous bag-of-words (CBOW)
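
The two architectures differ in what is predicted from what. The sketch below builds the training examples each of them would see for one sentence; the function name `training_examples` and the fixed window are simplifications for illustration (the actual algorithm also uses, for instance, frequent-word subsampling). In `gensim`'s `Word2Vec` class the choice is made with the `sg` parameter: `sg=1` for skip-gram and `sg=0` (the default) for CBOW.

```python
def training_examples(tokens, window=1):
    """Build (input, target) examples for skip-gram and CBOW from one sentence."""
    skipgram, cbow = [], []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # Skip-gram: the center word predicts each context word separately.
        skipgram.extend((center, c) for c in context)
        # CBOW: all context words together predict the center word.
        cbow.append((context, center))
    return skipgram, cbow


sg_pairs, cbow_pairs = training_examples(['topic', 'models', 'cluster', 'words'])
print(sg_pairs[:3])   # [('topic', 'models'), ('models', 'topic'), ('models', 'cluster')]
print(cbow_pairs[1])  # (['topic', 'cluster'], 'models')
```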

#### Demo
For the demonstration, we take a model pre-trained on the Google News dataset; its vocabulary contains approximately 3 million English words and phrases.

In [32]:
# download the model ~1.6GB 
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

2025-01-23 11:02:31,398 : INFO : loading projection weights from /Users/Aleksandr.Avdiushenko/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
2025-01-23 11:02:44,005 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from /Users/Aleksandr.Avdiushenko/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz', 'binary': True, 'encoding': 'utf8', 'datetime': '2025-01-23T11:02:44.005317', 'gensim': '4.3.3', 'python': '3.12.6 (v3.12.6:a4a2d2b0d85, Sep  6 2024, 16:08:03) [Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-15.2-arm64-arm-64bit', 'event': 'load_word2vec_format'}


In [33]:
for index, word in enumerate(wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")

word #0/3000000 is </s>
word #1/3000000 is in
word #2/3000000 is for
word #3/3000000 is that
word #4/3000000 is is
word #5/3000000 is on
word #6/3000000 is ##
word #7/3000000 is The
word #8/3000000 is with
word #9/3000000 is said


In [34]:
vec_king = wv['king']
print(vec_king[:10])

[ 0.12597656  0.02978516  0.00860596  0.13964844 -0.02563477 -0.03613281
  0.11181641 -0.19824219  0.05126953  0.36328125]


Using the model, you can compute the cosine similarity between words (higher means more similar).

In [35]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'car'	'minivan'	0.69
'car'	'bicycle'	0.54
'car'	'airplane'	0.42
'car'	'cereal'	0.14
'car'	'communism'	0.06
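
The numbers above are cosine similarities: `wv.similarity` returns the cosine of the angle between the two word vectors. A minimal standard-library sketch of that measure is given below; the three-dimensional vectors are hypothetical and only serve to exercise the function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings", for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, and orthogonal vectors score 0, which is why unrelated words such as `car` and `communism` end up near zero.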


You can also find the words most similar to a given word or set of words.

In [36]:
print(wv.most_similar(positive=['car', 'minivan'], topn=5))

[('SUV', 0.8532192707061768), ('vehicle', 0.8175783753395081), ('pickup_truck', 0.7763689756393433), ('Jeep', 0.7567334175109863), ('Ford_Explorer', 0.7565719485282898)]


In [37]:
vec_example = wv['king'] - wv['man'] + wv['woman']

similars = wv.most_similar(positive=[vec_example])
print(similars)

[('king', 0.8449392318725586), ('queen', 0.7300516366958618), ('monarch', 0.6454660296440125), ('princess', 0.6156251430511475), ('crown_prince', 0.5818676948547363), ('prince', 0.5777117609977722), ('kings', 0.5613663792610168), ('sultan', 0.5376776456832886), ('Queen_Consort', 0.5344247817993164), ('queens', 0.5289887189865112)]
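
In the output above, `king` itself ranks first: when `most_similar` receives a raw vector, it cannot know which words produced it, so nothing is excluded. When the words are passed by key instead, e.g. `wv.most_similar(positive=['king', 'woman'], negative=['man'])`, `gensim` omits the input words from the result. The toy sketch below reproduces the effect with hand-made two-dimensional vectors; all vectors and the helper names `cosine` and `nearest` are hypothetical.

```python
import math

# Hypothetical 2-D vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    'king': (1.0, 0.6),
    'queen': (1.0, -0.6),
    'man': (0.0, 0.2),
    'woman': (0.0, -0.2),
    'apple': (-0.5, 0.0),
}


def cosine(u, v):
    """Cosine similarity of two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest(target, exclude=()):
    """Rank vocabulary words by cosine similarity to target, skipping excluded words."""
    candidates = [w for w in toy_vectors if w not in exclude]
    return sorted(candidates, key=lambda w: cosine(target, toy_vectors[w]), reverse=True)


# king - man + woman, computed component-wise.
target = tuple(k - m + w for k, m, w in
               zip(toy_vectors['king'], toy_vectors['man'], toy_vectors['woman']))

ranked_all = nearest(target)
ranked_filtered = nearest(target, exclude=('king', 'man', 'woman'))
print(ranked_all[0], ranked_filtered[0])  # king queen
```

The analogy vector stays closest to `king` because subtracting `man` and adding `woman` moves it only a short distance; excluding the query words is what brings `queen` to the top.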


In [38]:
vec_example = wv['programmer'] - wv['man'] + wv['woman'] 

similars = wv.most_similar(positive=[vec_example])
print(similars)

[('programmer', 0.885962188243866), ('programmers', 0.6040860414505005), ('computer_programmer', 0.5623369216918945), ('coder', 0.5616979598999023), ('Programmer', 0.5576066374778748), ('programer', 0.5161396265029907), ('graphic_designer', 0.5139066576957703), ('coders', 0.48765403032302856), ('designer', 0.4822673797607422), ('librarian', 0.4649229943752289)]
