# Topic Modeling with Gensim

A **topic model** is an abstraction of the major topics contained in a corpus of texts. "Topic" in this context simply means a pattern of co-occurring words. The assumption is that clearly identifiable patterns of co-occurring words reveal a latent structure in the corpus. In short, a topic model is a representation of the major themes or structures of a corpus of texts.
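
To make "co-occurring words" concrete, here is a minimal sketch that counts how often pairs of words share a document. The three toy documents are invented for illustration, and this is only a counting exercise, not how `Gensim` fits a topic model.

```python
from collections import Counter
from itertools import combinations

# Three tiny hypothetical "documents", already tokenized.
toy_docs = [
    ["emperor", "army", "rome"],
    ["emperor", "senate", "rome"],
    ["church", "bishop", "faith"],
]

# Count how often each unordered pair of words appears in the same document.
pair_counts = Counter()
for doc in toy_docs:
    for pair in combinations(sorted(set(doc)), 2):
        pair_counts[pair] += 1

# Pairs that share more than one document hint at a common theme.
repeated_pairs = [pair for pair, count in pair_counts.items() if count > 1]
print(repeated_pairs)
```

Here only the pair `emperor` and `rome` occurs together in more than one document. A topic model looks for this kind of regularity across thousands of words at once.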

`Gensim` is a popular Python library for building topic models. In this notebook we will use `Gensim` to build a topic model of Gibbon's _Decline and Fall of the Roman Empire_. After building a topic model, we will then use `pyLDAvis` to visualize the model so we can evaluate its usefulness.

I highly recommend that you read through `Gensim`'s [documentation](https://radimrehurek.com/gensim/auto_examples/index.html#core-tutorials-new-users-start-here). Much of the code below is adapted from that source.

## Set up

**NOTE**: one of the Python libraries we are using (`pyLDAvis`) can cause problems. Be sure to do the installations in the order that you see them below.
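
After the installation cells below have run, it can help to confirm which versions actually ended up in your environment. The sketch below is one way to do that with the standard library's `importlib.metadata` (Python 3.8 or later). The helper name `installed_version` is my own and is not part of any of these libraries.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the pip commands in this notebook.
for name in ["funcy", "tzdata", "pyLDAvis", "wget", "gensim"]:
    print(name, installed_version(name))
```

A `None` next to a package name means the install did not succeed in the environment the notebook kernel is using.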

In [1]:
! pip install funcy

Collecting funcy
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy
Successfully installed funcy-2.0

In [2]:
! pip install tzdata

Collecting tzdata
  Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Installing collected packages: tzdata
Successfully installed tzdata-2023.3

In [3]:
! pip install --no-dependencies pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.0-py3-none-any.whl (2.6 MB)
Installing collected packages: pyLDAvis
Successfully installed pyLDAvis-3.4.0

In [6]:
! pip install wget
! pip install gensim

Collecting gensim
  Downloading gensim-4.3.2-cp38-cp38-macosx_10_9_x86_64.whl.metadata (8.5 kB)
Collecting scipy>=1.7.0 (from gensim)
  Downloading scipy-1.10.1-cp38-cp38-macosx_10_9_x86_64.whl (35.0 MB)
Downloading gensim-4.3.2-cp38-cp38-macosx_10_9_x86_64.whl (24.1 MB)

In [30]:
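from collections import defaultdict
import warnings

import pandas as pd
import requests
import spacy
import wget
from gensim import corpora, models
import pyLDAvis

# pyLDAvis 3.3+ renamed `pyLDAvis.gensim` to `pyLDAvis.gensim_models`.
# Alias the new module so the `pyLDAvis.gensim.prepare` calls below
# work with either version.
try:
    import pyLDAvis.gensim_models as _gensim_vis
    pyLDAvis.gensim = _gensim_vis
except ImportError:
    import pyLDAvis.gensim

warnings.filterwarnings("ignore", category=DeprecationWarning)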

In [31]:
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner', 'parser')

['ner', 'parser']

## Load data

### Class example

In [9]:
url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/gibbon_sections.csv'
file_name = wget.download(url)
df = pd.read_csv(file_name)
df.head()

| | Unnamed: 0 | doc_id | text | lemmas |
|---|---|---|---|---|
| 0 | 0 | 01-0 | The extent and military force of the Roman emp... | extent force empire century comprehend part ea... |
| 1 | 1 | 01-1 | exalted situation, had much less to hope than ... | exalt situation hope fear chance arm prosecuti... |
| 2 | 2 | 01-2 | and towards the south, the sandy deserts of A... | south desert imitate successor repose mankind ... |
| 3 | 3 | 01-3 | the love of freedom without the spirit of unio... | love freedom spirit union take arm fierceness ... |
| 4 | 4 | 01-4 | line of military stations, which was afterwar... | line station fortify reign rampart erect found... |


In [13]:
response = requests.get('https://www.gutenberg.org/cache/epub/62754/pg62754.txt')
text = response.text

In [16]:
start = text.find('DO NOT THINK THAT BY TAKING AWAY MY MEMBERSHIP')
end = text.find('*** END OF THE PROJECT GUTENBERG EBOOK MUSSOLINI AS REVEALED IN HIS POLITICAL SPEECHES (NOVEMBER 1914-AUGUST 1923) ***')
data = text[start:end]
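
One caveat with this approach: `str.find()` returns `-1` when a marker is missing, and slicing with `-1` silently produces the wrong text rather than an error. The sketch below shows one way to guard against that. The helper `slice_between` and the sample string are hypothetical and only illustrate the idea.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    Raises ValueError if either marker is missing, because str.find()
    returns -1 in that case and slicing would quietly give wrong text.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end]


# Invented sample standing in for a Project Gutenberg file.
sample = "front matter BEGIN body of the book *** END *** licence"
body = slice_between(sample, "BEGIN", "*** END")
print(body)
```

With the real download you would pass the same two marker strings used in the cell above.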

In [23]:
# split the text into paragraphs on blank lines (the file uses Windows line endings)
data_p = data.split('\r\n\r\n')
# one row per paragraph; the scalar author and title values are repeated for every row
text_df = pd.DataFrame({'author': 'Mussolini', 'title': 'Speeches', 'text': data_p})
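
Splitting on `'\r\n\r\n'` assumes Windows-style line endings throughout the file. If you adapt this notebook to another text, a slightly more tolerant splitter may be useful. The function below is a sketch under that assumption; `split_paragraphs` is a name I made up, and unlike the cell above it also drops empty chunks.

```python
import re


def split_paragraphs(raw_text):
    """Split text into paragraphs on blank lines, whatever the line ending."""
    # Normalize Windows line endings so one pattern covers both styles.
    normalized = raw_text.replace("\r\n", "\n")
    # A paragraph break is a newline, optional whitespace, then another newline.
    chunks = re.split(r"\n\s*\n", normalized)
    # Strip surrounding whitespace and discard empty chunks.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = "First paragraph,\r\nstill first.\r\n\r\nSecond paragraph.\r\n\r\n\r\nThird."
paragraphs = split_paragraphs(sample)
print(len(paragraphs))
```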

In [32]:
# extract lemmas
def process_text(text):
    """Remove new line characters and lemmatize text. Returns string of lemmas"""
    text = text.replace('\n', ' ')
    doc = nlp(text)
    tokens = [token for token in doc]
    no_stops = [token for token in tokens if not token.is_stop]
    no_punct = [token for token in no_stops if token.is_alpha]
    lemmas = [token.lemma_ for token in no_punct]
    lemmas_lower = [lemma.lower() for lemma in lemmas]
    lemmas_string = ' '.join(lemmas_lower)
    return lemmas_string

text_df['lemmas'] = text_df['text'].apply(process_text)

In [41]:
# keep only documents whose lemma string is longer than 25 characters
length_filter = text_df['lemmas'].str.len() > 25
# .copy() gives an independent DataFrame, so the column assignment in the
# next cell does not trigger pandas' SettingWithCopyWarning
filter_df = text_df[length_filter].copy()

In [42]:
def remove_new_lines(text):
    """Replace newline and carriage-return characters with spaces."""
    text = text.replace('\n', ' ')
    text = text.replace('\r', ' ')
    return text

filter_df['text'] = filter_df['text'].apply(remove_new_lines)


In [43]:
filter_df = filter_df.reset_index(drop=True)
filter_df

| | author | title | text | lemmas |
|---|---|---|---|---|
| 0 | Mussolini | Speeches | DO NOT THINK THAT BY TAKING AWAY MY MEMBERSHIP... | think take away membership card away faith cau... |
| 1 | Mussolini | Speeches | FOR THE LIBERTY OF HUMANITY AND THE FUTURE OF... | liberty humanity future italy speech deliver p... |
| 2 | Mussolini | Speeches | “EITHER WAR OR THE END OF ITALY’S NAME AS A G... | war end italy great power speech deliver milan... |
| 3 | Mussolini | Speeches | “TO THE COMPLETE VANQUISHING OF THE HUNS” ... | complete vanquishing huns speech deliver sesto... |
| 4 | Mussolini | Speeches | “NO TURNING BACK!” ... | turning speech deliver rome february |
| ... | ... | ... | ... | ... |
| 1470 | Mussolini | Speeches | Working classes, post-war rights of, 63; ... | working class post war right intervention fasc... |
| 1471 | Mussolini | Speeches | Yugoslavia, pact of Rome, 126; Isonzo and... | yugoslavia pact rome isonzo porto barro delta ... |
| 1472 | Mussolini | Speeches | Zara, 53, 59; Treaty of Rapallo, 125, 262... | zara treaty rapallo fascismo adriatic question... |
| 1473 | Mussolini | Speeches | PRINTED BY ... | printed temple press letchworth great britain |


## Prepare data for topic model
The Python library we are going to use to make our topic model requires the data to be in the form of a list. Within that list, each "document" is also a list. So it looks something like this:

`[
  ['This is document 1'],
  ['This is document 2'],
  ['This is document 3']
]`

In [46]:
# extract the data out of the DataFrame
documents = filter_df['lemmas'].to_list()
documents[0]

'think take away membership card away faith cause speech deliver milan november'

`Gensim` needs each document to be tokenized, that is, split into a list of individual words. We can use a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) to quickly achieve this result. When complete, our data will look like this:

`[
  ['This', 'is', 'document', '1'],
  ['This', 'is', 'document', '2'],
  ['This', 'is', 'document', '3'],
]`

In [47]:
# tokenize - the syntax below will create a list of lists
texts =[
    [word for word in document.lower().split()]
    for document in documents
]

It takes a lot of preparation to build a useful topic model. An important part of that preparation is to eliminate "noise" from your model. One way to do this is to remove pieces of data that are irrelevant. Here we will remove tokens that occur only once in the whole corpus. **You may want to adjust this as you refine your topic model.**

In [48]:
# create a count of each token
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

In [49]:
# remove words that appear only 1 time
texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]
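
If you do want to adjust the cutoff, it helps to make it a parameter. The sketch below wraps the two cells above in a function. The name `drop_rare_tokens`, its `min_count` argument, and the toy data are my own illustrations, not part of `Gensim`.

```python
from collections import Counter


def drop_rare_tokens(tokenized_docs, min_count=2):
    """Remove tokens that occur fewer than min_count times across all documents."""
    frequency = Counter(token for doc in tokenized_docs for token in doc)
    return [
        [token for token in doc if frequency[token] >= min_count]
        for doc in tokenized_docs
    ]


# Invented toy corpus: "war" occurs 3 times, "italy" twice, the rest once.
toy_texts = [
    ["war", "italy", "victory"],
    ["italy", "treaty", "war"],
    ["war", "fiume"],
]
filtered_default = drop_rare_tokens(toy_texts)
filtered_strict = drop_rare_tokens(toy_texts, min_count=3)
print(filtered_default)
print(filtered_strict)
```

With the default, only `war` and `italy` survive; with `min_count=3`, only `war` does. `Gensim`'s `Dictionary` also has a `filter_extremes` method that filters by the number of documents a token appears in, which may be worth trying as an alternative.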

## Build topic model

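`Gensim` is built around [four core concepts](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#core-concepts): a *document* (some text), a *corpus* (a collection of documents), a *vector* (a mathematically convenient representation of a document), and a *model* (an algorithm for transforming vectors from one representation to another).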

### Basic topic model



In [50]:
# create a dictionary based off our texts
# The dictionary maps each token to a unique integer id
dictionary = corpora.Dictionary(texts)

In [51]:
# create a corpus based off our dictionary and our texts
corpus = [dictionary.doc2bow(text) for text in texts]
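
To see what the dictionary and the bag-of-words corpus contain, here is a plain-Python sketch of the same idea. It only illustrates the data structures; the names `token2id` and `to_bow` are mine, and `Gensim` may assign ids in a different order.

```python
from collections import Counter

# Invented toy documents, already tokenized.
toy_texts = [["war", "italy", "war"], ["italy", "treaty"]]

# Give each distinct token an integer id, in order of first appearance.
token2id = {}
for text in toy_texts:
    for token in text:
        token2id.setdefault(token, len(token2id))


def to_bow(text):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())


toy_corpus = [to_bow(text) for text in toy_texts]
print(token2id)
print(toy_corpus)
```

Each document becomes a list of `(token_id, count)` pairs, so word order is discarded and only word frequencies remain. That is the "bag of words" that the LDA model is trained on.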

In [52]:
# build LDA model
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=50)

In [53]:
# explore topics
lda_model.print_topics()

[(0,
  '0.009*"people" + 0.009*"italian" + 0.008*"come" + 0.007*"government" + 0.006*"fascisti" + 0.006*"law" + 0.006*"right" + 0.006*"italy" + 0.006*"applause" + 0.006*"victory"'),
 (1,
  '0.025*"policy" + 0.022*"relation" + 0.020*"italy" + 0.016*"foreign" + 0.014*"states" + 0.014*"treaty" + 0.012*"italian" + 0.012*"economic" + 0.012*"united" + 0.011*"question"'),
 (2,
  '0.024*"italy" + 0.015*"state" + 0.014*"day" + 0.011*"long" + 0.010*"war" + 0.009*"shall" + 0.009*"nation" + 0.008*"think" + 0.008*"work" + 0.007*"live"'),
 (3,
  '0.018*"government" + 0.017*"italian" + 0.013*"people" + 0.011*"italy" + 0.011*"war" + 0.009*"day" + 0.008*"rome" + 0.007*"nation" + 0.007*"great" + 0.007*"treaty"'),
 (4,
  '0.024*"italian" + 0.012*"italy" + 0.011*"fiume" + 0.008*"government" + 0.008*"agreement" + 0.007*"state" + 0.007*"association" + 0.006*"economic" + 0.005*"year" + 0.005*"know"'),
 (5,
  '0.014*"italian" + 0.012*"victory" + 0.012*"people" + 0.010*"regard" + 0.010*"government" + 0.009*"it

In [54]:
# Find topics in each document
lda_model.get_document_topics(corpus[0])

[(8, 0.63079935), (15, 0.2999671)]
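
The result is a list of `(topic_id, probability)` pairs. Only two of the twenty topics appear because, as far as I understand `Gensim`'s defaults, topics whose probability falls below the model's `minimum_probability` (0.01 by default) are left out. The sketch below uses the output above to pick the dominant topic; the variable names are my own.

```python
# Output of get_document_topics for the first document, copied from above.
doc_topics = [(8, 0.63079935), (15, 0.2999671)]

# The dominant topic is the pair with the largest probability.
dominant_topic, dominant_weight = max(doc_topics, key=lambda pair: pair[1])

# Probability mass covered by the reported topics; the remainder is
# spread over topics that fell below the reporting threshold.
reported_mass = sum(weight for _, weight in doc_topics)

print(dominant_topic, round(dominant_weight, 2), round(reported_mass, 2))
```

So the first document is mostly about topic 8, with a secondary share from topic 15.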

In [55]:
# visualize
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis



### Tf-idf topic model

In [56]:
# initialize a tfidf model
tfidf = models.TfidfModel(corpus)

In [57]:
# make a new corpus based on the tfidf model
corpus_tfidf = tfidf[corpus]
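
Tf-idf reweights each count so that words found in many documents matter less than words concentrated in a few. The sketch below follows what I understand to be `Gensim`'s default scheme (raw count times log base 2 of the inverse document frequency, then scaled to unit length). Treat the formula as an assumption and check the `TfidfModel` documentation before relying on it. The toy corpus and the helper `tfidf_vector` are invented for illustration.

```python
import math
from collections import Counter

# Toy bag-of-words corpus: lists of (token_id, count) pairs.
toy_corpus = [
    [(0, 2), (1, 1), (2, 1)],
    [(1, 1), (2, 1)],
    [(1, 3)],
]
num_docs = len(toy_corpus)

# Document frequency: in how many documents each token id appears.
doc_freq = Counter(token_id for doc in toy_corpus for token_id, _ in doc)


def tfidf_vector(bow):
    """Weight counts by log2(N / df), drop zero weights, scale to unit length."""
    weighted = [
        (token_id, count * math.log2(num_docs / doc_freq[token_id]))
        for token_id, count in bow
    ]
    weighted = [(token_id, w) for token_id, w in weighted if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(token_id, w / norm) for token_id, w in weighted] if norm else []


for bow in toy_corpus:
    print(tfidf_vector(bow))
```

Token 1 appears in every document, so its weight is zero and it drops out entirely. This is why the tf-idf topics below are dominated by rarer words than the topics of the basic model.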

In [58]:
# here we build our topic model
lda_model_tfidf = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=20, passes=50)
corpus_lda = lda_model_tfidf[corpus_tfidf]

In [59]:
lda_model_tfidf.print_topics()

[(0,
  '0.015*"american" + 0.006*"ambassador" + 0.006*"italo" + 0.005*"air" + 0.005*"association" + 0.004*"politics" + 0.004*"mastery" + 0.004*"review" + 0.004*"imprison" + 0.004*"improve"'),
 (1,
  '0.007*"january" + 0.007*"want" + 0.006*"upper" + 0.005*"parliament" + 0.005*"social" + 0.005*"adige" + 0.005*"state" + 0.005*"task" + 0.005*"united" + 0.005*"democracy"'),
 (2,
  '0.007*"xii" + 0.006*"economy" + 0.005*"dawning" + 0.005*"consular" + 0.004*"austria" + 0.004*"instruction" + 0.004*"xvii" + 0.004*"elementary" + 0.004*"war" + 0.003*"geneva"'),
 (3,
  '0.008*"declaration" + 0.007*"association" + 0.007*"second" + 0.007*"fighters" + 0.005*"national" + 0.005*"entente" + 0.004*"sauro" + 0.004*"approve" + 0.004*"vindications" + 0.004*"abbazia"'),
 (4,
  '0.009*"santa" + 0.009*"margherita" + 0.008*"demand" + 0.007*"labour" + 0.005*"fascisti" + 0.005*"difficulty" + 0.005*"socialist" + 0.005*"intervention" + 0.005*"italian" + 0.004*"action"'),
 (5,
  '0.017*"november" + 0.010*"deliver" +

In [60]:
# Find topics in each document
lda_model_tfidf.get_document_topics(corpus_tfidf[0])

[(0, 0.012390435),
 (1, 0.012390435),
 (2, 0.012390435),
 (3, 0.012390435),
 (4, 0.012390435),
 (5, 0.22170988),
 (6, 0.012390436),
 (7, 0.012390435),
 (8, 0.5552623),
 (9, 0.012390447),
 (10, 0.012390435),
 (11, 0.012390435),
 (12, 0.012390435),
 (13, 0.012390435),
 (14, 0.012390435),
 (15, 0.012390435),
 (16, 0.012390435),
 (17, 0.012390435),
 (18, 0.012390435),
 (19, 0.012390435)]

In [61]:
# visualize
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model_tfidf, corpus_tfidf, dictionary)
vis

