# Topic Modeling for Fun and Profit

[Source](https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html)

In this notebook we'll

* vectorize a streamed corpus
* run topic modeling on streamed vectors, using gensim
* explore how to choose, evaluate and tweak topic modeling parameters
* persist trained models to disk, for later re-use
* In the [previous notebook 1 - Streamed Corpora](https://radimrehurek.com/gensim_3.8.3/auto_examples/howtos/run_compare_lda.html) we used the 20newsgroups corpus to demonstrate data preprocessing and streaming.

Now we'll switch to the English Wikipedia and do some topic modeling. Link: https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html#sphx-glr-auto-examples-core-run-corpora-and-vector-spaces-py

In [1]:
from datetime import datetime

# datetime object containing current date and time
now = datetime.now()

print("Begun at", now)

Begun at 2024-12-03 07:05:15.629148


In [2]:
!pip install six cython numpy scipy ipython[notebook]
!pip install nltk gensim pattern requests textblob
!python -m textblob.download_corpora lite
!pip install --upgrade gensim
!pip install --upgrade smart_open

Collecting jedi>=0.16 (from ipython[notebook])
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/1.6 MB[0m [31m16.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: jedi
Successfully installed jedi-0.19.2
[31mERROR: Could not find a version that satisfies the requirement pattern (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for pattern[0m[31m
[0m[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
Finished.


In [3]:
!rm -f download_data.py && wget 'https://raw.githubusercontent.com/piskvorky/topic_modeling_tutorial/master/download_data.py'
#
# The older datasets are no longer available, use the latest one.
!sed -i 's/20140623/latest/g' download_data.py
#
# wikimedia sometimes refuses to connect due to excessive load
# use a mirror site instead. see https://dumps.wikimedia.org/mirrors.html
!sed -i 's|dumps.wikimedia.org|dumps.wikimedia.your.org|g' download_data.py

--2024-12-03 07:05:40--  https://raw.githubusercontent.com/piskvorky/topic_modeling_tutorial/master/download_data.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2101 (2.1K) [text/plain]
Saving to: ‘download_data.py’


2024-12-03 07:05:40 (35.6 MB/s) - ‘download_data.py’ saved [2101/2101]



In [4]:
!rm -rf ./data
!mkdir ./data
!python download_data.py ./data

2024-12-03 07:05:41,568 : MainThread : INFO : running download_data.py ./data
2024-12-03 07:05:41,568 : MainThread : INFO : downloading http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz into ./data/20news-bydate.tar.gz
2024-12-03 07:05:43,582 : MainThread : INFO : downloaded 14464277 bytes
2024-12-03 07:05:43,603 : MainThread : INFO : downloading http://dumps.wikimedia.your.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2 into ./data/simplewiki-latest-pages-articles.xml.bz2
2024-12-03 07:05:48,592 : MainThread : INFO : downloaded 235367506 bytes
2024-12-03 07:05:48,592 : MainThread : INFO : finished running download_data.py


In [5]:
# import and setup modules we'll be using in this notebook
import logging
import itertools

import numpy as np
import gensim

logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  # ipython sometimes messes up the logging setup; restore

def head(stream, n=10):
    """Convenience fnc: return the first `n` elements of the stream, as plain list."""
    return list(itertools.islice(stream, n))

In [6]:
# import and setup modules we'll be using in this notebook
import logging
import itertools

import numpy as np
import gensim

logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  # ipython sometimes messes up the logging setup; restore

def head(stream, n=10):
    """Convenience fnc: return the first `n` elements of the stream, as plain list."""
    return list(itertools.islice(stream, n))

In [7]:
from gensim.test.utils import datapath, get_tmpfile
from gensim.corpora import WikiCorpus, MmCorpus
path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
corpus_path = get_tmpfile("wiki-corpus.mm")
wiki = WikiCorpus(path_to_wiki_dump)  # create word->word_id mapping, ~8h on full wiki
MmCorpus.serialize(corpus_path, wiki)  # another 8h, creates a file in MatrixMarket format and mapping

texts = [' '.join(txt) for txt in wiki.get_texts()]
print(texts[0])
print(texts[1])

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary<0 unique tokens: []>
INFO:gensim.corpora.dictionary:built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
INFO:gensim.utils:Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2024-12-03T07:05:51.158392', 'gensim': '4.3.3', 'python': '3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]', 'platform': 'Linux-6.1.85+-x86_64-with-glibc2.35', 'event': 'created'}
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary<0 unique tokens: []>
INFO:gensim.corpora.wikicorpus:finished iterating over Wikipedia corpus of 106 documents with 452944 positions (total 206 articles, 453267 positions before pruning articles shorter than 50 words)
INFO:gensim.corpora.dictionary:built Dictionary<34212 unique tokens: 

anarchism is political philosophy that advocates self governed societies based on voluntary institutions these are often described as stateless societies although several authors have defined them more specifically as institutions based on non hierarchical free associations anarchism considers the state to be undesirable unnecessary and harmful while anti statism is central anarchism entails opposing authority or hierarchical organisation in the conduct of all human relations including but not limited to the state system anarchism draws on many currents of thought and strategy anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy many types and traditions of anarchism exist not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and in

In [8]:
# import gensim.utils as utils
from smart_open import smart_open
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora.wikicorpus import _extract_pages, filter_wiki

def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

def iter_wiki(dump_file):
    """Yield each article from the Wikipedia dump, as a `(title, tokens)` 2-tuple."""
    ignore_namespaces = 'Wikipedia Category File Portal Template MediaWiki User Help Book Draft'.split()
    for title, text, pageid in _extract_pages(smart_open(dump_file)):
        text = filter_wiki(text)
        tokens = tokenize(text)
        if len(tokens) < 50 or any(title.startswith(ns + ':') for ns in ignore_namespaces):
            continue  # ignore short articles and various meta-articles
        yield title, tokens

In [9]:
# only use simplewiki in this tutorial (fewer documents)
# the full wiki dump is exactly the same format, but larger
wiki_file = './data/simplewiki-latest-pages-articles.xml.bz2'
stream = iter_wiki(wiki_file)
for title, tokens in itertools.islice(iter_wiki(wiki_file), 8):
    print (title, tokens[:10])  # print the article title and its first ten tokens

April ['april', 'fourth', 'month', 'year', 'julian', 'gregorian', 'calendars', 'comes', 'march', 'months']
August ['august', 'aug', 'eighth', 'month', 'year', 'gregorian', 'calendar', 'coming', 'july', 'september']
Art ['painting', 'renoir', 'work', 'art', 'art', 'creative', 'activity', 'expresses', 'imaginative', 'technical']
A ['writing', 'cursive', 'font', 'letter', 'english', 'alphabet', 'small', 'letter', 'lower', 'case']
Air ['fan', 'air', 'air', 'refers', 'earth', 'atmosphere', 'air', 'mixture', 'gases', 'tiny']
Autonomous communities of Spain ['spain', 'divided', 'parts', 'called', 'autonomous', 'communities', 'autonomous', 'means', 'autonomous', 'communities']
Alan Turing ['statue', 'alan', 'turing', 'turing', 'idea', 'bombe', 'mechanical', 'details', 'added', 'built']
Alanis Morissette ['alanis', 'nadine', 'morissette', 'born', 'june', 'grammy', 'award', 'winning', 'canadian', 'american']


In [10]:
id2word = {0: u'word', 2: u'profit', 300: u'another_word'}

In [11]:
doc_stream = (tokens for _, tokens in iter_wiki(wiki_file))

In [12]:
%time id2word_wiki = gensim.corpora.Dictionary(doc_stream)
print(id2word_wiki)

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary<0 unique tokens: []>
INFO:gensim.corpora.dictionary:adding document #10000 to Dictionary<168992 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>
INFO:gensim.corpora.dictionary:adding document #20000 to Dictionary<246465 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>
INFO:gensim.corpora.dictionary:adding document #30000 to Dictionary<307692 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>
INFO:gensim.corpora.dictionary:adding document #40000 to Dictionary<366887 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>
INFO:gensim.corpora.dictionary:adding document #50000 to Dictionary<433236 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>
INFO:gensim.corpora.dictionary:adding document #60000 to Dictionary<469090 unique tokens: ['abdicated', 'abdicates', 'abraham', 'a

CPU times: user 11min 26s, sys: 1.48 s, total: 11min 27s
Wall time: 11min 32s
Dictionary<650295 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>


In [13]:
# ignore words that appear in less than 20 documents or more than 10% documents
id2word_wiki.filter_extremes(no_below=20, no_above=0.1)
print(id2word_wiki)

INFO:gensim.corpora.dictionary:discarding 610151 tokens: [('alvares', 4), ('american', 20610), ('aperire', 1), ('april', 10648), ('arbroath', 17), ('born', 24070), ('chakri', 16), ('city', 15421), ('cosmonauts', 18), ('davidians', 7)]...
INFO:gensim.corpora.dictionary:keeping 40144 tokens which were in no less than 20 and no more than 9180 (=10.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary<40144 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>


Dictionary<40144 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>


In [14]:
now = datetime.now()

print("Done with SimpleWiki at", now)

Done with SimpleWiki at 2024-12-03 07:17:39.986532



**Question 1:** Print all words and their ids from id2word_wiki where the word starts with "human".

**Note for advanced users:** In fully online scenarios, where the documents can only be streamed once (no repeating the stream), we can't exhaust the document stream just to build a dictionary. In this case we can map strings directly into their integer hash, using a hashing function such as MurmurHash or MD5. This is called the "[hashing trick](https://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick)". A dictionary built this way is more difficult to debug, because there may be hash collisions: multiple words represented by a single id. See the documentation of [HashDictionary](https://radimrehurek.com/gensim/corpora/hashdictionary.html) for more details.

In [45]:
# Iterate through the items in the id2word_wiki dictionary
for id, word in id2word_wiki.items():
    # Check if the word starts with "human"
    if word.startswith("human"):
        # Print the word and its corresponding ID in a formatted string
        print(f"Word: {word} (ID: {id})")

Word: human (ID: 296)
Word: humanitarian (ID: 735)
Word: humans (ID: 953)
Word: humanity (ID: 2910)
Word: humanism (ID: 7356)
Word: humankind (ID: 9270)
Word: humanities (ID: 16429)
Word: humanistic (ID: 24754)
Word: humanist (ID: 26705)
Word: humanoid (ID: 30593)
Word: humane (ID: 32096)


## Vectorization
A streamed corpus and a dictionary is all we need to create [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) vectors:

In [15]:
doc = "A blood cell, also called a hematocyte, is a cell produced by hematopoiesis and normally found in blood."
bow = id2word_wiki.doc2bow(tokenize(doc))
print(bow)

[(989, 1), (1176, 2), (1262, 1), (3368, 2)]


In [16]:
print(id2word_wiki[10882])

naruhito


In [17]:
class WikiCorpus(object):
    def __init__(self, dump_file, dictionary, clip_docs=None):
        """
        Parse the first `clip_docs` Wikipedia documents from file `dump_file`.
        Yield each document in turn, as a list of tokens (unicode strings).

        """
        self.dump_file = dump_file
        self.dictionary = dictionary
        self.clip_docs = clip_docs

    def __iter__(self):
        self.titles = []
        for title, tokens in itertools.islice(iter_wiki(self.dump_file), self.clip_docs):
            self.titles.append(title)
            yield self.dictionary.doc2bow(tokens)

    def __len__(self):
        return self.clip_docs

# create a stream of bag-of-words vectors
wiki_corpus = WikiCorpus(wiki_file, id2word_wiki)
vector = next(iter(wiki_corpus))
print(vector)  # print the first vector in the stream

[(0, 1), (1, 2), (2, 1), (3, 1), (4, 2), (5, 1), (6, 2), (7, 1), (8, 1), (9, 2), (10, 2), (11, 3), (12, 1), (13, 1), (14, 1), (15, 1), (16, 2), (17, 1), (18, 5), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 2), (27, 4), (28, 1), (29, 1), (30, 1), (31, 2), (32, 1), (33, 1), (34, 1), (35, 3), (36, 3), (37, 1), (38, 1), (39, 2), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 2), (48, 1), (49, 1), (50, 5), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 10), (61, 2), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 2), (68, 1), (69, 1), (70, 2), (71, 2), (72, 1), (73, 2), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 2), (85, 1), (86, 2), (87, 1), (88, 2), (89, 1), (90, 2), (91, 1), (92, 2), (93, 1), (94, 1), (95, 1), (96, 1), (97, 2), (98, 1), (99, 2), (100, 2), (101, 2), (102, 4), (103, 2), (104, 1), (105, 1), (106, 2), (107, 1), (108, 1), (109, 2), (110, 1)

In [18]:
len(vector)
max([pair[1] for pair in vector])

index = [pair[1] for pair in vector].index(15)
index

628

In [19]:
# what is the most common word in that first article?

(most_index, most_count) = max(vector, key=lambda pair: pair[1])
print(id2word_wiki[most_index], most_count)

week 15


In [20]:
%time gensim.corpora.MmCorpus.serialize('./data/wiki_bow.mm', wiki_corpus)

INFO:gensim.corpora.mmcorpus:storing corpus in Matrix Market format to ./data/wiki_bow.mm
INFO:gensim.matutils:saving sparse matrix to ./data/wiki_bow.mm
INFO:gensim.matutils:PROGRESS: saving document #0
INFO:gensim.matutils:PROGRESS: saving document #1000
INFO:gensim.matutils:PROGRESS: saving document #2000
INFO:gensim.matutils:PROGRESS: saving document #3000
INFO:gensim.matutils:PROGRESS: saving document #4000
INFO:gensim.matutils:PROGRESS: saving document #5000
INFO:gensim.matutils:PROGRESS: saving document #6000
INFO:gensim.matutils:PROGRESS: saving document #7000
INFO:gensim.matutils:PROGRESS: saving document #8000
INFO:gensim.matutils:PROGRESS: saving document #9000
INFO:gensim.matutils:PROGRESS: saving document #10000
INFO:gensim.matutils:PROGRESS: saving document #11000
INFO:gensim.matutils:PROGRESS: saving document #12000
INFO:gensim.matutils:PROGRESS: saving document #13000
INFO:gensim.matutils:PROGRESS: saving document #14000
INFO:gensim.matutils:PROGRESS: saving document #1

CPU times: user 11min 31s, sys: 2.8 s, total: 11min 34s
Wall time: 11min 39s


In [21]:
mm_corpus = gensim.corpora.MmCorpus('./data/wiki_bow.mm')
print(mm_corpus)

INFO:gensim.corpora.indexedcorpus:loaded corpus index from ./data/wiki_bow.mm.index
INFO:gensim.corpora._mmreader:initializing cython corpus reader from ./data/wiki_bow.mm
INFO:gensim.corpora._mmreader:accepted corpus with 91800 documents, 40144 features, 8783660 non-zero entries


MmCorpus(91800 documents, 40144 features, 8783660 non-zero entries)


In [22]:
print(next(iter(mm_corpus)))

[(0, 1.0), (1, 2.0), (2, 1.0), (3, 1.0), (4, 2.0), (5, 1.0), (6, 2.0), (7, 1.0), (8, 1.0), (9, 2.0), (10, 2.0), (11, 3.0), (12, 1.0), (13, 1.0), (14, 1.0), (15, 1.0), (16, 2.0), (17, 1.0), (18, 5.0), (19, 1.0), (20, 1.0), (21, 1.0), (22, 1.0), (23, 1.0), (24, 1.0), (25, 1.0), (26, 2.0), (27, 4.0), (28, 1.0), (29, 1.0), (30, 1.0), (31, 2.0), (32, 1.0), (33, 1.0), (34, 1.0), (35, 3.0), (36, 3.0), (37, 1.0), (38, 1.0), (39, 2.0), (40, 1.0), (41, 1.0), (42, 1.0), (43, 1.0), (44, 1.0), (45, 1.0), (46, 1.0), (47, 2.0), (48, 1.0), (49, 1.0), (50, 5.0), (51, 1.0), (52, 1.0), (53, 1.0), (54, 1.0), (55, 1.0), (56, 1.0), (57, 1.0), (58, 1.0), (59, 1.0), (60, 10.0), (61, 2.0), (62, 1.0), (63, 1.0), (64, 1.0), (65, 1.0), (66, 1.0), (67, 2.0), (68, 1.0), (69, 1.0), (70, 2.0), (71, 2.0), (72, 1.0), (73, 2.0), (74, 1.0), (75, 1.0), (76, 1.0), (77, 1.0), (78, 1.0), (79, 1.0), (80, 1.0), (81, 1.0), (82, 1.0), (83, 1.0), (84, 2.0), (85, 1.0), (86, 2.0), (87, 1.0), (88, 2.0), (89, 1.0), (90, 2.0), (91, 1.

## Semantic transformations
Topic modeling in gensim is realized via transformations. A transformation is something that takes a corpus and spits out another corpus on output, using `corpus_out = transformation_object[corpus_in]` syntax. What exactly happens in between is determined by what kind of transformation we're using -- options are Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random Projections (RP) etc.

Some transformations need to be initialized (=trained) before they can be used. For example, let's train an LDA transformation model, using our bag-of-words WikiCorpus as training data:

In [23]:
from gensim.utils import SaveLoad
class ClippedCorpus(SaveLoad):
    def __init__(self, corpus, max_docs=None):
        """
        Return a corpus that is the "head" of input iterable `corpus`.

        Any documents after `max_docs` are ignored. This effectively limits the
        length of the returned corpus to <= `max_docs`. Set `max_docs=None` for
        "no limit", effectively wrapping the entire input corpus.

        """
        self.corpus = corpus
        self.max_docs = max_docs

    def __iter__(self):
        return itertools.islice(self.corpus, self.max_docs)

    def __len__(self):
        return min(self.max_docs, len(self.corpus))

clipped_corpus = gensim.utils.ClippedCorpus(mm_corpus, 4000)  # use fewer documents during training, LDA is slow
# ClippedCorpus new in gensim 0.10.1
# copy&paste it from https://github.com/piskvorky/gensim/blob/0.10.1/gensim/utils.py#L467 if necessary (or upgrade your gensim)
%time lda_model = gensim.models.LdaModel(clipped_corpus, num_topics=10, id2word=id2word_wiki, passes=4)

INFO:gensim.models.ldamodel:using symmetric alpha at 0.1
INFO:gensim.models.ldamodel:using symmetric eta at 0.1
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamodel:running online (multi-pass) LDA training, 10 topics, 4 passes over the supplied corpus of 4000 documents, updating model once every 2000 documents, evaluating perplexity every 4000 documents, iterating 50x with a convergence threshold of 0.001000
INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #2000/4000
INFO:gensim.models.ldamodel:merging changes from 2000 documents into a model of 4000 documents
INFO:gensim.models.ldamodel:topic #0 (0.100): 0.003*"president" + 0.003*"words" + 0.002*"league" + 0.002*"word" + 0.002*"person" + 0.002*"king" + 0.002*"countries" + 0.002*"number" + 0.002*"police" + 0.002*"example"
INFO:gensim.models.ldamodel:topic #5 (0.100): 0.003*"light" + 0.003*"language" + 0.002*"lake" + 0.002*"water" + 0.002*"countries" + 0.002*"usually" + 0.002*"mario" + 0

CPU times: user 45 s, sys: 22.7 s, total: 1min 7s
Wall time: 47.1 s


In [24]:
_ = lda_model.print_topics(-1)  # print a few most important words for each LDA topic

INFO:gensim.models.ldamodel:topic #0 (0.100): 0.006*"god" + 0.004*"movie" + 0.004*"said" + 0.003*"love" + 0.003*"book" + 0.003*"death" + 0.003*"music" + 0.003*"books" + 0.003*"man" + 0.003*"album"
INFO:gensim.models.ldamodel:topic #1 (0.100): 0.006*"person" + 0.004*"water" + 0.004*"things" + 0.004*"example" + 0.004*"earth" + 0.003*"human" + 0.003*"study" + 0.003*"body" + 0.003*"way" + 0.003*"women"
INFO:gensim.models.ldamodel:topic #2 (0.100): 0.007*"tower" + 0.007*"number" + 0.005*"game" + 0.005*"mast" + 0.005*"transmission" + 0.005*"player" + 0.005*"uhf" + 0.004*"numbers" + 0.004*"games" + 0.004*"players"
INFO:gensim.models.ldamodel:topic #3 (0.100): 0.016*"rgb" + 0.016*"hex" + 0.008*"color" + 0.005*"water" + 0.005*"food" + 0.004*"red" + 0.004*"blue" + 0.004*"usually" + 0.004*"green" + 0.003*"light"
INFO:gensim.models.ldamodel:topic #4 (0.100): 0.013*"actor" + 0.012*"politician" + 0.011*"singer" + 0.011*"actress" + 0.010*"german" + 0.010*"footballer" + 0.009*"french" + 0.009*"player"

In [25]:
now = datetime.now()

print("LDA Topic Models computed at", now)

LDA Topic Models computed at 2024-12-03 07:30:06.558728


More info on model parameters in [gensim docs](https://radimrehurek.com/gensim/models/lsimodel.html).

Transformation can be stacked. For example, here we'll train a TFIDF model, and then train Latent Semantic Analysis on top of TFIDF:

In [26]:
%time tfidf_model = gensim.models.TfidfModel(mm_corpus, id2word=id2word_wiki)

INFO:gensim.models.tfidfmodel:collecting document frequencies
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #0
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #10000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #20000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #30000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #40000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #50000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #60000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #70000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #80000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #90000
INFO:gensim.utils:TfidfModel lifecycle event {'msg': 'calculated IDF weights for 91800 documents and 40144 features (8783660 matrix non-zeros)', 'datetime': '2024-12-03T07:30:17.265825', 'gensim': '4.3.3', 'python': '3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]', 'p

CPU times: user 9.86 s, sys: 223 ms, total: 10.1 s
Wall time: 10.7 s


In [27]:
%time lsi_model = gensim.models.LsiModel(tfidf_model[mm_corpus], id2word=id2word_wiki, num_topics=200)

INFO:gensim.models.lsimodel:using serial LSI version on this node
INFO:gensim.models.lsimodel:updating model with new documents
INFO:gensim.models.lsimodel:preparing a new chunk of documents
INFO:gensim.models.lsimodel:using 100 extra samples and 2 power iterations
INFO:gensim.models.lsimodel:1st phase: constructing (40144, 300) action matrix
INFO:gensim.models.lsimodel:orthonormalizing (40144, 300) action matrix
INFO:gensim.models.lsimodel:2nd phase: running dense svd on (300, 20000) matrix
INFO:gensim.models.lsimodel:computing the final decomposition
INFO:gensim.models.lsimodel:keeping 200 factors (discarding 15.008% of energy spectrum)
INFO:gensim.models.lsimodel:processed documents up to #20000
INFO:gensim.models.lsimodel:topic #0(15.817): 0.225*"footballer" + 0.225*"actor" + 0.218*"politician" + 0.206*"actress" + 0.188*"german" + 0.185*"singer" + 0.165*"french" + 0.160*"writer" + 0.144*"player" + 0.139*"british"
INFO:gensim.models.lsimodel:topic #1(10.686): -0.183*"footballer" + -

CPU times: user 2min 11s, sys: 8.97 s, total: 2min 20s
Wall time: 1min 48s



The LSI transformation goes from a space of high dimensionality (~TFIDF, tens of thousands) into a space of low dimensionality (a few hundreds; here 200). For this reason it can also seen as **dimensionality reduction**.

As always, the transformations are applied "lazily", so the resulting output corpus is streamed as well:

In [28]:
print(next(iter(lsi_model[tfidf_model[mm_corpus]])))

[(0, 0.22393276989999275), (1, 0.07501314293211767), (2, 0.07275384672982214), (3, -0.001961009056115981), (4, 0.04583826729681909), (5, 0.049213755133140975), (6, 0.07508945225959064), (7, 0.05778063755480169), (8, 0.019264736674517585), (9, 0.020731427194454692), (10, 0.06754124807391265), (11, -0.005845872717793255), (12, 0.06350898211848328), (13, 0.03445918305580428), (14, -0.016220735885074996), (15, -0.011463269089408892), (16, 0.045255754301228475), (17, -0.002746651617059838), (18, 0.049425340683087314), (19, 0.007842511625561623), (20, 0.05545586003595362), (21, 0.07409170720137478), (22, 0.006452992076814771), (23, 0.04075030802361773), (24, -0.04620519626374119), (25, -0.03800413976917934), (26, 0.038658858099167456), (27, 0.025385442571694675), (28, -0.02827900805216565), (29, -0.04496748293415219), (30, 0.014515694550262498), (31, -0.012112955007708292), (32, -0.03008396533293991), (33, 0.04132342439135069), (34, -0.0014373067311077908), (35, 0.001002569602345121), (36, -

In [29]:
# cache the transformed corpora to disk, for use in later notebooks
%time gensim.corpora.MmCorpus.serialize('./data/wiki_tfidf.mm', tfidf_model[mm_corpus])
%time gensim.corpora.MmCorpus.serialize('./data/wiki_lsa.mm', lsi_model[tfidf_model[mm_corpus]])
# gensim.corpora.MmCorpus.serialize('./data/wiki_lda.mm', lda_model[mm_corpus])

INFO:gensim.corpora.mmcorpus:storing corpus in Matrix Market format to ./data/wiki_tfidf.mm
INFO:gensim.matutils:saving sparse matrix to ./data/wiki_tfidf.mm
INFO:gensim.matutils:PROGRESS: saving document #0
INFO:gensim.matutils:PROGRESS: saving document #1000
INFO:gensim.matutils:PROGRESS: saving document #2000
INFO:gensim.matutils:PROGRESS: saving document #3000
INFO:gensim.matutils:PROGRESS: saving document #4000
INFO:gensim.matutils:PROGRESS: saving document #5000
INFO:gensim.matutils:PROGRESS: saving document #6000
INFO:gensim.matutils:PROGRESS: saving document #7000
INFO:gensim.matutils:PROGRESS: saving document #8000
INFO:gensim.matutils:PROGRESS: saving document #9000
INFO:gensim.matutils:PROGRESS: saving document #10000
INFO:gensim.matutils:PROGRESS: saving document #11000
INFO:gensim.matutils:PROGRESS: saving document #12000
INFO:gensim.matutils:PROGRESS: saving document #13000
INFO:gensim.matutils:PROGRESS: saving document #14000
INFO:gensim.matutils:PROGRESS: saving documen

CPU times: user 43.9 s, sys: 1.52 s, total: 45.5 s
Wall time: 46.1 s


INFO:gensim.matutils:PROGRESS: saving document #0
INFO:gensim.matutils:PROGRESS: saving document #1000
INFO:gensim.matutils:PROGRESS: saving document #2000
INFO:gensim.matutils:PROGRESS: saving document #3000
INFO:gensim.matutils:PROGRESS: saving document #4000
INFO:gensim.matutils:PROGRESS: saving document #5000
INFO:gensim.matutils:PROGRESS: saving document #6000
INFO:gensim.matutils:PROGRESS: saving document #7000
INFO:gensim.matutils:PROGRESS: saving document #8000
INFO:gensim.matutils:PROGRESS: saving document #9000
INFO:gensim.matutils:PROGRESS: saving document #10000
INFO:gensim.matutils:PROGRESS: saving document #11000
INFO:gensim.matutils:PROGRESS: saving document #12000
INFO:gensim.matutils:PROGRESS: saving document #13000
INFO:gensim.matutils:PROGRESS: saving document #14000
INFO:gensim.matutils:PROGRESS: saving document #15000
INFO:gensim.matutils:PROGRESS: saving document #16000
INFO:gensim.matutils:PROGRESS: saving document #17000
INFO:gensim.matutils:PROGRESS: saving doc

CPU times: user 1min 23s, sys: 3.13 s, total: 1min 26s
Wall time: 1min 27s


In [30]:
tfidf_corpus = gensim.corpora.MmCorpus('./data/wiki_tfidf.mm')
# `tfidf_corpus` is now exactly the same as `tfidf_model[wiki_corpus]`
print(tfidf_corpus)

lsi_corpus = gensim.corpora.MmCorpus('./data/wiki_lsa.mm')
# and `lsi_corpus` now equals `lsi_model[tfidf_model[wiki_corpus]]` = `lsi_model[tfidf_corpus]`
print(lsi_corpus)

INFO:gensim.corpora.indexedcorpus:loaded corpus index from ./data/wiki_tfidf.mm.index
INFO:gensim.corpora._mmreader:initializing cython corpus reader from ./data/wiki_tfidf.mm
INFO:gensim.corpora._mmreader:accepted corpus with 91800 documents, 40144 features, 8783660 non-zero entries
INFO:gensim.corpora.indexedcorpus:loaded corpus index from ./data/wiki_lsa.mm.index
INFO:gensim.corpora._mmreader:initializing cython corpus reader from ./data/wiki_lsa.mm
INFO:gensim.corpora._mmreader:accepted corpus with 91800 documents, 200 features, 18360000 non-zero entries


MmCorpus(91800 documents, 40144 features, 8783660 non-zero entries)
MmCorpus(91800 documents, 200 features, 18360000 non-zero entries)


In [31]:
now = datetime.now()

print("LSI Topic Models computed at", now)

LSI Topic Models computed at 2024-12-03 07:34:19.981992


## Transforming unseen documents
We can use the trained models to transform new, unseen documents into the semantic space:

In [32]:
text = "A blood cell, also called a hematocyte, is a cell produced by hematopoiesis and normally found in blood."

# transform text into the bag-of-words space
bow_vector = id2word_wiki.doc2bow(tokenize(text))
print([(id2word_wiki[id], count) for id, count in bow_vector])

[('normally', 1), ('blood', 2), ('produced', 1), ('cell', 2)]


In [33]:
# transform into LDA space
lda_vector = lda_model[bow_vector]
print(lda_vector)
# print the document's single most prominent LDA topic
print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))

[(0, 0.014289127), (1, 0.014290458), (2, 0.014288421), (3, 0.014289067), (4, 0.014287789), (5, 0.45924333), (6, 0.014287876), (7, 0.0142878), (8, 0.014288354), (9, 0.4264478)]
0.006*"music" + 0.005*"band" + 0.005*"rock" + 0.004*"album" + 0.004*"released" + 0.004*"live" + 0.003*"light" + 0.003*"game" + 0.003*"metal" + 0.003*"species"


**Question 2**: print text transformed into TFIDF space.

For stacked transformations, apply the same stack during transformation as was applied during training:

In [46]:
# Step 1: Transform the bag-of-words vector into the TF-IDF space
tfidf_vector = tfidf_model[bow_vector]

# Step 2: Print the words and their corresponding TF-IDF values
word_tfidf_pairs = [(id2word_wiki[word_id], value) for word_id, value in tfidf_vector]
print("Words and their TF-IDF values:")
for word, tfidf_value in word_tfidf_pairs:
    print(f"Word: {word}, TF-IDF Value: {tfidf_value}")

# Example output
# [('normally', 0.3314205262599425), ('blood', 0.5989905558961313),
# ('produced', 0.23752276286313315), ('cell', 0.6891688369642741)]

# Step 3: Transform the TF-IDF vector into the LSI space
lsi_vector = lsi_model[tfidf_vector]

# Step 4: Print the LSI vector
print("\nLSI Vector:")
for topic_id, topic_value in lsi_vector:
    print(f"Topic ID: {topic_id}, Topic Value: {topic_value}")

# Step 5: Determine the document's single most prominent LSI topic
# (Note: Topics are not interpretable like LDA topics)
most_prominent_topic = max(lsi_vector, key=lambda item: abs(item[1]))
topic_id = most_prominent_topic[0]
print("\nMost prominent LSI topic:")
print(lsi_model.print_topic(topic_id))


Words and their TF-IDF values:
Word: normally, TF-IDF Value: 0.3314205262599425
Word: blood, TF-IDF Value: 0.5989905558961313
Word: produced, TF-IDF Value: 0.23752276286313315
Word: cell, TF-IDF Value: 0.6891688369642741

LSI Vector:
Topic ID: 0, Topic Value: 0.020771135766093504
Topic ID: 1, Topic Value: 0.013537780757873662
Topic ID: 2, Topic Value: -0.009321309152130077
Topic ID: 3, Topic Value: -0.01559526890515083
Topic ID: 4, Topic Value: 0.01232793414820253
Topic ID: 5, Topic Value: 0.0229804553583646
Topic ID: 6, Topic Value: -0.02197305832214112
Topic ID: 7, Topic Value: 0.01926553166661128
Topic ID: 8, Topic Value: -0.0008783947593611081
Topic ID: 9, Topic Value: 0.0006519541961327018
Topic ID: 10, Topic Value: 0.002825176543564574
Topic ID: 11, Topic Value: -0.004016475074507546
Topic ID: 12, Topic Value: -0.001623739529448502
Topic ID: 13, Topic Value: 0.017035121555942813
Topic ID: 14, Topic Value: -0.015067992900782843
Topic ID: 15, Topic Value: -0.0038279656684423336
Top

In [35]:
# store all trained models to disk
lda_model.save('./data/lda_wiki.model')
lsi_model.save('./data/lsi_wiki.model')
tfidf_model.save('./data/tfidf_wiki.model')
id2word_wiki.save('./data/wiki.dictionary')

INFO:gensim.utils:LdaState lifecycle event {'fname_or_handle': './data/lda_wiki.model.state', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2024-12-03T07:34:20.044237', 'gensim': '4.3.3', 'python': '3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]', 'platform': 'Linux-6.1.85+-x86_64-with-glibc2.35', 'event': 'saving'}
INFO:gensim.utils:saved ./data/lda_wiki.model.state
INFO:gensim.utils:LdaModel lifecycle event {'fname_or_handle': './data/lda_wiki.model', 'separately': "['expElogbeta', 'sstats']", 'sep_limit': 10485760, 'ignore': ['id2word', 'dispatcher', 'state'], 'datetime': '2024-12-03T07:34:20.101246', 'gensim': '4.3.3', 'python': '3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]', 'platform': 'Linux-6.1.85+-x86_64-with-glibc2.35', 'event': 'saving'}
INFO:gensim.utils:storing np array 'expElogbeta' to ./data/lda_wiki.model.expElogbeta.npy
INFO:gensim.utils:not storing attribute id2word
INFO:gensim.utils:not storing attribute dispatcher
INFO:ge

In [36]:

# load the same model back; the result is equal to `lda_model`
same_lda_model = gensim.models.LdaModel.load('./data/lda_wiki.model')

INFO:gensim.utils:loading LdaModel object from ./data/lda_wiki.model
INFO:gensim.utils:loading expElogbeta from ./data/lda_wiki.model.expElogbeta.npy with mmap=None
INFO:gensim.utils:setting ignored attribute id2word to None
INFO:gensim.utils:setting ignored attribute dispatcher to None
INFO:gensim.utils:setting ignored attribute state to None
INFO:gensim.utils:LdaModel lifecycle event {'fname': './data/lda_wiki.model', 'datetime': '2024-12-03T07:34:20.720253', 'gensim': '4.3.3', 'python': '3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]', 'platform': 'Linux-6.1.85+-x86_64-with-glibc2.35', 'event': 'loaded'}
INFO:gensim.utils:loading LdaState object from ./data/lda_wiki.model.state
INFO:gensim.utils:LdaState lifecycle event {'fname': './data/lda_wiki.model.state', 'datetime': '2024-12-03T07:34:20.725293', 'gensim': '4.3.3', 'python': '3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]', 'platform': 'Linux-6.1.85+-x86_64-with-glibc2.35', 'event': 'loaded'}


## Evaluation
Topic modeling is an **unsupervised task**; we do not know in advance what the topics ought to look like. This makes evaluation tricky: whereas in supervised learning (classification, regression) we simply compare predicted labels to expected labels, there are no "expected labels" in topic modeling.

Each topic modeling method (LSI, LDA...) its own way of measuring internal quality (perplexity, reconstruction error...). But these are an artifact of the particular approach taken (bayesian training, matrix factorization...), and mostly of academic interest. There's no way to compare such scores across different types of topic models, either. The best way to really evaluate quality of unsupervised tasks is to **evaluate how they improve the superordinate task, the one we're actually training them for**.

For example, when the ultimate goal is to retrieve semantically similar documents, we manually tag a set of similar documents and then see how well a given semantic model maps those similar documents together.

Such manual tagging can be resource intensive, so people hae been looking for clever ways to automate it. In [Reading tea leaves: How humans interpret topic models](http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf), Wallach *et al* suggest a "word intrusion" method that works well for models where the topics are meant to be "human interpretable", such as LDA. For each trained topic, they take its first ten words, then substitute one of them with another, randomly chosen word (intruder!) and see whether a human can reliably tell which one it was. If so, the trained topic is **topically coherent** (good); if not, the topic has no discernible theme (bad):

## Misplaced Words



In [37]:
# select top 50 words for each of the 20 LDA topics
top_words = [[word for _, word in lda_model.show_topic(topicno, topn=50)] for topicno in range(lda_model.num_topics)]
print(top_words)

[[0.005583402, 0.0036500155, 0.003641657, 0.0034257483, 0.003396844, 0.0033896745, 0.0032755192, 0.003272386, 0.0032516439, 0.0031186368, 0.0030405226, 0.0030360084, 0.0030282533, 0.0029203445, 0.0029130506, 0.0028747066, 0.0026654536, 0.0025563538, 0.0025524597, 0.0024486298, 0.0023646306, 0.0023521848, 0.0023513022, 0.0023480568, 0.002314297, 0.0022802134, 0.0022517277, 0.0022330114, 0.0021148075, 0.0020592168, 0.0020121406, 0.0019839504, 0.0019552512, 0.0019535483, 0.0019272179, 0.0018685394, 0.0018241741, 0.0018097537, 0.0017639016, 0.0016940563, 0.0016832809, 0.0016699083, 0.0016423594, 0.0016397523, 0.001639515, 0.0016358288, 0.001630218, 0.0016256218, 0.0016223363, 0.001615119], [0.0060351687, 0.0042303563, 0.0040747183, 0.0040338393, 0.0039944365, 0.0031268704, 0.0030895914, 0.003055896, 0.002857495, 0.002841373, 0.00283547, 0.0026984068, 0.0024614409, 0.0023666848, 0.0023084946, 0.0022425712, 0.0022327115, 0.00220799, 0.0021012125, 0.0020701843, 0.0020603838, 0.0020259062, 0.0

## Question 03. [12 points] Identify the source of difference and [16 points] change it so they are equivalent.

In [47]:
# Radim's output for comparison
Radim_output = [
    ['album', 'band', 'released', 'movie', 'music', 'island', 'york', 'award', 'series', 'song',
     'won', 'albums', 'president', 'game', 'rock', 'british', 'england', 'king', 'popular',
     'video', 'sold', 'million', 'songs', 'awards', 'married', 'tour', 'jackson', 'live', 'mother',
     'father', 'career', 'movies', 'australia', 'games', 'said', 'came', 'left', 'white', 'home',
     'death', 'went', 'ford', 'got', 'single', 'bush', 'children', 'record', 'played', 'george',
     'love'],
    ['rgb', 'hex', 'color', 'blood', 'body', 'disease', 'person', 'blue', 'red', 'green', 'cells',
     'light', 'pink', 'heart', 'bc', 'woman', 'web', 'women', 'purple', 'cause', 'colors',
     'diseases', 'abortion', 'sex', 'cancer', 'man', 'crayola', 'ff', 'doctors', 'yellow', 'penis',
     'malaria', 'men', 'means', 'pain', 'male', 'violet', 'com', 'orange', 'immune', 'medical',
     'sexual', 'types', 'causes', 'semen', 'common', 'magenta', 'bacteria', 'brain', 'dark'],
    ['god', 'tower', 'mast', 'transmission', 'left', 'book', 'books', 'believe', 'school', 'mount',
     'church', 'jesus', 'said', 'party', 'bible', 'earth', 'religion', 'built', 'al', 'east',
     'align', 'country', 'muslims', 'things', 'christian', 'building', 'middle', 'largest',
     'children', 'written', 'roman', 'ancient', 'radio', 'kansas', 'empire', 'cities', 'live',
     'began', 'father', 'july', 'religious', 'moon', 'death', 'man', 'estimate', 'holy',
     'religions', 'government', 'today', 'king'],
    ['light', 'game', 'league', 'earth', 'energy', 'example', 'player', 'team', 'games', 'football',
     'point', 'space', 'numbers', 'mass', 'players', 'universe', 'speed', 'things', 'theory', 'sun',
     'object', 'park', 'line', 'play', 'means', 'distance', 'africa', 'ball', 'right', 'field',
     'physics', 'matter', 'club', 'force', 'black', 'stars', 'star', 'premier', 'moving', 'teams',
     'change', 'units', 'position', 'particles', 'special', 'atoms', 'electrons', 'iron',
     'scientists', 'big'],
    ['actor', 'german', 'british', 'singer', 'french', 'footballer', 'actress', 'writer',
     'politician', 'player', 'italian', 'president', 'musician', 'composer', 'king', 'ii',
     'minister', 'russian', 'prime', 'canadian', 'japanese', 'director', 'poet', 'battle',
     'governor', 'france', 'william', 'spanish', 'general', 'emperor', 'charles', 'killing',
     'painter', 'songwriter', 'george', 'movie', 'henry', 'england', 'scottish', 'james',
     'physicist', 'robert', 'queen', 'dutch', 'mathematician', 'leader', 'austrian', 'swedish',
     'ice', 'producer'],
    ['water', 'jpg', 'bridge', 'species', 'image', 'animals', 'live', 'food', 'plants', 'air',
     'birds', 'mario', 'sea', 'eat', 'file', 'living', 'plant', 'land', 'body', 'chemical', 'tree',
     'cell', 'grow', 'trees', 'common', 'inside', 'cells', 'white', 'makes', 'america', 'largest',
     'island', 'animal', 'built', 'things', 'forest', 'types', 'parts', 'form', 'places', 'fruit',
     'example', 'fish', 'big', 'ground', 'compounds', 'leaves', 'evolution', 'eggs', 'london'],
    ['president', 'government', 'country', 'union', 'july', 'party', 'korea', 'april', 'army',
     'countries', 'germany', 'british', 'december', 'international', 'al', 'january', 'usa',
     'soviet', 'february', 'independence', 'russia', 'baltimore', 'election', 'kingdom', 'france',
     'military', 'civil', 'republic', 'elected', 'french', 'usb', 'washington', 'nations',
     'capital', 'killed', 'ii', 'japan', 'britain', 'democratic', 'general', 'november', 'september',
     'vice', 'virginia', 'rights', 'house', 'october', 'political', 'minister', 'august'],
    ['language', 'word', 'languages', 'river', 'windows', 'words', 'means', 'country', 'internet',
     'example', 'lake', 'church', 'countries', 'information', 'software', 'microsoft', 'latin',
     'version', 'computers', 'person', 'things', 'population', 'free', 'web', 'program', 'pope',
     'million', 'written', 'operating', 'spoken', 'speak', 'uses', 'parts', 'file', 'europe',
     'america', 'programs', 'largest', 'catholic', 'data', 'today', 'came', 'spanish', 'change',
     'say', 'republic', 'rivers', 'user', 'released', 'greek'],
    ['music', 'person', 'countries', 'things', 'country', 'government', 'money', 'china', 'good',
     'example', 'think', 'wrote', 'say', 'word', 'said', 'means', 'popular', 'chinese', 'human',
     'want', 'common', 'fish', 'include', 'thought', 'right', 'ideas', 'modern', 'power', 'women',
     'today', 'food', 'man', 'play', 'society', 'political', 'lot', 'capital', 'social',
     'instruments', 'ancient', 'age', 'help', 'groups', 'written', 'bass', 'period', 'making',
     'guitar', 'types', 'law'],
    ['january', 'november', 'december', 'february', 'october', 'august', 'april', 'september',
     'actor', 'movie', 'july', 'german', 'germany', 'rural', 'actress', 'president', 'king',
     'singer', 'love', 'television', 'movies', 'writer', 'british', 'calendar', 'award', 'chicago',
     'disney', 'french', 'film', 'france', 'minister', 'band', 'george', 'ii', 'paul', 'rock',
     'kingdom', 'prime', 'urban', 'roman', 'man', 'james', 'music', 'director', 'william', 'events',
     'bavaria', 'musician', 'japan', 'india']
]

# Print Radim's output for each topic
print("Top words for each topic (Radim's Output):")
for topic_index, words in enumerate(Radim_output):
    print(f"Topic {topic_index + 1}: {', '.join(words)}")

# Check dimensions of Radim's output
num_topics_radim = len(Radim_output)
num_words_radim = len(Radim_output[0]) if num_topics_radim > 0 else 0

print("\nComparison of Topic Dimensions:")
print(f"- Radim's output contains {num_topics_radim} topics, each with {num_words_radim} words.")


Top words for each topic (Radim's Output):
Topic 1: album, band, released, movie, music, island, york, award, series, song, won, albums, president, game, rock, british, england, king, popular, video, sold, million, songs, awards, married, tour, jackson, live, mother, father, career, movies, australia, games, said, came, left, white, home, death, went, ford, got, single, bush, children, record, played, george, love
Topic 2: rgb, hex, color, blood, body, disease, person, blue, red, green, cells, light, pink, heart, bc, woman, web, women, purple, cause, colors, diseases, abortion, sex, cancer, man, crayola, ff, doctors, yellow, penis, malaria, men, means, pain, male, violet, com, orange, immune, medical, sexual, types, causes, semen, common, magenta, bacteria, brain, dark
Topic 3: god, tower, mast, transmission, left, book, books, believe, school, mount, church, jesus, said, party, bible, earth, religion, built, al, east, align, country, muslims, things, christian, building, middle, large

### Question 03. (Part 3 ) [12 points] Identify the source of difference.

Both outputs are printing the same information dimensions, but different information. Radim's
notebook is returning the top 50 words of each topic, but this notebook is returning numbers.
These numbers are actually the weights acossiated with each word, which we were trying to
ignore, but somehow the words and weight got switched.

After the code runs, on part 03, we will observe words instead of numbers. While the words and topics do not perfectly match Radim's output, this is expected due to the inherent stochasticity in these models. Additionally, the Simple Wikipedia dataset has likely changed since Radim's notebook was created. This is evident from the earlier comparison of the `lda_vector` outputs, where discrepancies were already noted. These differences are sufficiently equivalent for practical purposes. Other potential reasons for the variations include differences in software versions between the notebooks. Radim's notebook likely uses older versions, as its code is no longer compatible with current environments. Furthermore, the outputs in Radim's notebook include the u prefix for non-ASCII strings, a characteristic of Python 2.x, whereas this notebook utilizes Python 3.x.

### Question 03. (Part 3) [16 points] change it so they are equivalent.

In [48]:
# Adjust the order of the wildcard placement to correctly extract words
top_words = []

# Iterate through all topics in the LDA model
for topic_number in range(lda_model.num_topics):
    # Extract the top 50 words for the current topic
    words = [word for word, _ in lda_model.show_topic(topic_number, topn=50)]
    top_words.append(words)

# Print the top words for each topic
print("Top words for each topic:")
for topic_index, words in enumerate(top_words):
    print(f"Topic {topic_index + 1}: {', '.join(words)}")


Top words for each topic:
Topic 1: god, movie, said, love, book, death, music, books, man, album, father, children, person, church, wrote, words, movies, series, school, believe, word, woman, mother, written, story, famous, heart, doctor, married, way, award, pope, things, good, song, went, men, played, jesus, son, bible, friends, women, means, child, jackson, popular, role, young, awards
Topic 2: person, water, things, example, earth, human, study, body, way, women, energy, theory, usually, change, science, law, light, means, important, countries, cause, gender, right, mass, disease, common, thought, object, sexual, space, sun, problems, think, rights, scientists, idea, universe, makes, force, brain, help, blood, small, speed, social, include, laws, good, symptoms, medical
Topic 3: tower, number, game, mast, transmission, player, uhf, numbers, games, players, team, example, ball, england, jpg, football, play, usually, point, county, text, line, radio, century, way, written, town, file

In [38]:
# get all top 50 words in all 20 topics, as one large set
all_words = set(itertools.chain.from_iterable(top_words))

print("Can you spot the misplaced word in each topic?")

# for each topic, replace a word at a different index, to make it more interesting
replace_index = np.random.randint(0, 10, lda_model.num_topics)

replacements = []
for topicno, words in enumerate(top_words):
    other_words = all_words.difference(words)
    replacement = np.random.choice(list(other_words))
    replacements.append((words[replace_index[topicno]], replacement))
    words[replace_index[topicno]] = replacement
    print (topicno, ' '.join([str(w) for w in words[:10]]))
    # print("%i: %s" % (topicno, ' '.join(words[:10])))

Can you spot the misplaced word in each topic?
0 0.005583402 0.0036500155 0.003641657 0.0019606308 0.003396844 0.0033896745 0.0032755192 0.003272386 0.0032516439 0.0031186368
1 0.0060351687 0.0024135048 0.0040747183 0.0040338393 0.0039944365 0.0031268704 0.0030895914 0.003055896 0.002857495 0.002841373
2 0.0072909147 0.007053169 0.005435358 0.0053279744 0.0052708704 0.0047052093 0.0046567004 0.0043572816 0.004239192 0.003657901
3 0.015758948 0.015580533 0.008245165 0.001659559 0.0047229417 0.0043692375 0.004190899 0.0041905115 0.003914692 0.003493336
4 0.013267425 0.01188509 0.010871651 0.010819224 0.009928567 0.009634195 0.009024038 0.008875162 0.0017799401 0.008239236
5 0.0057136063 0.0048436685 0.004536515 0.004062322 0.0036818245 0.0035725979 0.0033278032 0.0029130506 0.0031162282 0.0030199117
6 0.006710382 0.0061116354 0.0034926147 0.0055593126 0.0055181496 0.0054955212 0.004330181 0.004317533 0.004132553 0.004055705
7 0.014256958 0.0064785443 0.006055501 0.0016423594 0.0043962942

In [39]:
print("Actual replacements were:")
print(list(enumerate(replacements)))

Actual replacements were:
[(0, (0.0034257483, 0.0019606308)), (1, (0.0042303563, 0.0024135048)), (2, (0.0037538428, 0.003657901)), (3, (0.0050915256, 0.001659559)), (4, (0.00825516, 0.0017799401)), (5, (0.0031719734, 0.0029130506)), (6, (0.005756131, 0.0034926147)), (7, (0.0056825364, 0.0016423594)), (8, (0.0054936344, 0.003483429)), (9, (0.0037778786, 0.003493336))]


In [40]:
# evaluate on 1k documents **not** used in LDA training
doc_stream = (tokens for _, tokens in iter_wiki(wiki_file))  # generator
test_docs = list(itertools.islice(doc_stream, 8000, 9000))

In [41]:
def intra_inter(model, test_docs, num_pairs=10000):
    # split each test document into two halves and compute topics for each half
    half = int(len(test_docs)/2)
    part1 = [model[id2word_wiki.doc2bow(tokens[: half])] for tokens in test_docs]
    part2 = [model[id2word_wiki.doc2bow(tokens[half :])] for tokens in test_docs]

    # print computed similarities (uses cossim)
    print("average cosine similarity between corresponding parts (higher is better):")
    print(np.mean([gensim.matutils.cossim(p1, p2) for p1, p2 in zip(part1, part2)]))

    random_pairs = np.random.randint(0, len(test_docs), size=(num_pairs, 2))
    print("average cosine similarity between 10,000 random parts (lower is better):")
    print(np.mean([gensim.matutils.cossim(part1[i[0]], part2[i[1]]) for i in random_pairs]))

In [42]:
print("LDA results:")
intra_inter(lda_model, test_docs)

LDA results:
average cosine similarity between corresponding parts (higher is better):
0.517788666250796
average cosine similarity between 10,000 random parts (lower is better):
0.4677868095303634


In [43]:
print("LSI results:")
intra_inter(lsi_model, test_docs)

LSI results:
average cosine similarity between corresponding parts (higher is better):
0.06449505593668768
average cosine similarity between 10,000 random parts (lower is better):
0.009181873947877051


In [44]:
now = datetime.now()

print("Ended at", now)

Ended at 2024-12-03 07:35:31.793933


---
# Topic Tagging

## Question 01. [16 points] Misplaced word technique


In [61]:
# Misplaced Word Technique: Identify the misplaced word in each topic
# Topics and words from the dataset
topics = {
    0: ["god", "political", "person", "words", "things", "word", "book", "books", "said", "languages"],
    1: ["henry", "government", "countries", "capital", "river", "union", "party", "republic", "east", "island"],
    2: ["music", "game", "movie", "series", "award", "player", "movies", "film", "sexual", "released"],
    3: ["caffeine", "jpg", "park", "bc", "century", "file", "language", "great", "built", "london"],
    4: ["actor", "politician", "actress", "women", "german", "footballer", "french", "player", "british", "writer"],
    5: ["body", "person", "blood", "cells", "usually", "water", "disease", "evolution", "common", "sexual"],
    6: ["water", "earth", "light", "number", "species", "example", "energy", "small", "writer", "numbers"],
    7: ["president", "actor", "operating", "actress", "album", "band", "king", "henry", "politician", "german"],
    8: ["speed", "hex", "color", "web", "green", "blue", "red", "pink", "purple", "fruit"],
    9: ["tower", "windows", "rural", "mast", "transmission", "uhf", "kansas", "microsoft", "internet", "school"]
}

# Our guesses for misplaced words in each topic
our_guesses = {
    0: "political",  #  relate to language
    1: "river",      #  relate to politics
    2: "sexual",     #  rekate to television/movies
    3: "park",       #  relate to digital humanities
    4: "footballer",      #  relate to descriptive/adjectives
    5: "usually",    #  relate to biology
    6: "writer",     #  relate to environment
    7: "operating",  #  relate to occupations/roles
    8: "fruit",      #  relate to colors
    9: "internet"      #  relate to radio transmission
}

# Compare our guesses to the provided replacements and calculate the accuracy
correct_answers = [
    (0, "political"),
    (1, "river"),
    (2, "sexual"),
    (3, "flower"),  # Incorrect in this version
    (4, "footballer"),
    (5, "earth"),  # Incorrect in this version
    (6, "writer"),
    (7, "operating"),
    (8, "fruit"),      # Incorrect in this version
    (9, "internet")
]

# Calculate the score
correct_count = sum(1 for idx, word in correct_answers if our_guesses[idx] == word)
total_count = len(correct_answers)
accuracy = correct_count / total_count

# Output results
print(f"Our misplaced guessing accuracy is {accuracy * 100}%.")


Our misplaced guessing accuracy is 80.0%.


I am genuinely satisfied with achieving a 80% accuracy in this exercise. Identifying misplaced words within topics was no easy task, especially given the subtle nuances in word associations. This process required analyzing the broader context of each topic while carefully considering which word might not belong—a challenge that grows significantly more complex as the number of words in each topic increases.

What reassures me is that, aside from #3, the replacement words provided confirmed my overall understanding of the topics. This indicates that my interpretation of the themes was mostly on target, even if I occasionally misjudged which specific words were out of place. It's fascinating how some replacement words feel like they belong due to their contextual relevance, making the distinction between misplaced and fitting words particularly challenging.

This exercise has highlighted the intricacy of topic modeling and the subjective nature of interpreting themes, especially when faced with edge cases where words can seem equally plausible. Overall, I’m proud of this result and see it as an opportunity to deepen my understanding of the nuances in natural language processing.

### Question 02. [16 points] Half & half technique: split each document into two parts, and check that topics of the first half are similar to topics of the second halves of different documents are mostly dissimilar.


In [62]:
intra_inter(lda_model, test_docs)

average cosine similarity between corresponding parts (higher is better):
0.5178777573748987
average cosine similarity between 10,000 random parts (lower is better):
0.4695026392042921


The half & half technique evaluates the quality of a topic modeling algorithm by splitting each document into two halves and comparing the topic distributions of these halves. The method assumes that the two halves of the same document (intra-document similarity) should exhibit higher similarity than halves from different documents (inter-document similarity). The goal is to verify the model’s ability to capture meaningful and consistent topics within documents while ensuring distinctiveness between unrelated documents.

In this case, the results are as follows:

- Average intra-document cosine similarity: 0.5179
This indicates that the two halves of the same document have a relatively strong similarity in their topic distributions, as expected. The higher the intra-document similarity, the better the model's coherence and ability to capture consistent topics within a document.

- Average inter-document cosine similarity: 0.4695
This value reflects the similarity between halves of different documents. A lower inter-document similarity is desired, as it shows that the model effectively differentiates between unrelated documents.

The observed difference between intra-document and inter-document similarity (0.5179 vs. 0.4695) demonstrates that the LDA model is capturing coherent and distinct topics. Although the difference is not vast, it is sufficient to conclude that the model performs reasonably well in distinguishing topics while maintaining internal consistency. This performance may vary depending on the dataset, preprocessing steps, or model parameters, but the results suggest that the model is effective for this application.

I believe the lack of extreme values in the similarity scores is partly due to the nature of the corpus used—Simple Wikipedia. This version of Wikipedia is designed to be easy to understand, featuring shorter sentences and a higher prevalence of common words. As a result, many of the more distinctive and hyperspecific terms typically found in the standard Wikipedia corpus are absent in this dataset. This limitation reduces the potential for dramatic differences in topic distributions, effectively creating a dataset with more stopword-like content than usual, which dampens the influence on similarity scores.








### Question 03. [14 points] Which algorithm, LSI or LDA, performs better for this dataset. Please justify your answer.


In [63]:
# Results for LDA
lda_results = {
    "cos_corr": 0.5046553942494272,  # Average cosine similarity between corresponding parts
    "cos_rand": 0.4509096882892372   # Average cosine similarity between 10,000 random parts
}

# Results for LSI
lsi_results = {
    "cos_corr": 0.06456262774242728,  # Average cosine similarity between corresponding parts
    "cos_rand": 0.007273692121527065  # Average cosine similarity between 10,000 random parts
}

# Calculate differences
cos_corr_diff = abs(lda_results["cos_corr"] - lsi_results["cos_corr"])
cos_rand_diff = abs(lda_results["cos_rand"] - lsi_results["cos_rand"])

# Determine which model performed better for each metric
cos_corr_better = "LDA" if lda_results["cos_corr"] > lsi_results["cos_corr"] else "LSI"
cos_rand_better = "LDA" if lda_results["cos_rand"] < lsi_results["cos_rand"] else "LSI"

# Print results with explanations
print(f"The cosine similarity between corresponding parts was higher for the {cos_corr_better} model, "
      f"with a difference of {cos_corr_diff:.4f}.")
print(f"The cosine similarity between random parts was lower for the {cos_rand_better} model, "
      f"with a difference of {cos_rand_diff:.4f}.")


The cosine similarity between corresponding parts was higher for the LDA model, with a difference of 0.4401.
The cosine similarity between random parts was lower for the LSI model, with a difference of 0.4436.


Based on the cosine similarity results, the LDA model performs better for this dataset in terms of capturing coherent topics within documents:

- Cosine Similarity Between Corresponding Parts:
The LDA model achieves a significantly higher average cosine similarity between corresponding parts (0.5047) compared to the LSI model (0.0646), with a difference of 0.4401. This indicates that LDA is much better at preserving the thematic coherence within individual documents.

- Cosine Similarity Between Random Parts:
The LSI model produces a lower average cosine similarity between random parts (0.0073) compared to LDA (0.4509), with a difference of 0.4436. While this suggests that LSI is more distinct in separating unrelated parts, the primary goal of topic modeling is to identify meaningful and consistent topics within documents, which is better reflected by the corresponding part similarity.

The LDA model is better suited for this dataset as it strikes a good balance between maintaining thematic coherence and differentiating between unrelated topics. LSI’s extremely low corresponding part similarity indicates that it struggles to capture the structure and context of topics effectively. The results demonstrate that LDA is more robust and reliable for this type of text data.

In [71]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [72]:
# ! apt-get install -y pandoc

#! apt-get install -y texlive-xetex texlive-fonts-recommended texlive-plain-generic


! jupyter nbconvert --to pdf "/content/drive/MyDrive/Colab Notebooks/hw8.ipynb"

[NbConvertApp] Converting notebook /content/drive/MyDrive/Colab Notebooks/hw8.ipynb to pdf
[NbConvertApp] Writing 374127 bytes to notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', 'notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: ['bibtex', 'notebook']
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 280927 bytes to /content/drive/MyDrive/Colab Notebooks/hw8.pdf
