# Topic modeling with Gensim

Gensim's preprocessing tools are extremely fast, but preprocessing isn't what Gensim is really designed for.  Gensim is designed mostly around _unsupervised learning_ tasks in NLP, namely topic modeling and text embeddings.  This notebook covers Gensim's topic modeling _very_ briefly.

To run topic modeling in Gensim, we need to do a few things:
- Preprocess our text.
- Convert it into Gensim's bag-of-words format (this is different from the sparse matrices we get from scikit-learn).
- Run the topic model of our choice--we're going to use Latent Dirichlet Allocation (LDA).
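
To make the second step a bit more concrete: Gensim's bag-of-words format represents each document as a list of `(token_id, count)` pairs, rather than as a row in a sparse matrix.  Here's a minimal pure-Python sketch of that idea.  The helper `toy_doc2bow` and the way ids are assigned here are made up for illustration; Gensim's own `Dictionary` handles this for us later and may number tokens differently.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Hypothetical helper: turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

docs = [["usb", "drive", "usb"], ["drive", "fast"]]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

bows = [toy_doc2bow(doc, token2id) for doc in docs]
print(bows)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document only stores the tokens it actually contains, which is why this format stays compact even with a large vocabulary.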

Topic modeling tends to be _very_ data hungry, so we'll be using a much larger dataset than we've used for the previous notebooks.  Specifically: we'll grab the entirety of the Electronics reviews from Amazon.

In [1]:
import os
import pandas as pd

if not os.path.isfile("electronics_topic_modeling.parquet"):
    reviews = pd.read_json(
        "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz",
        lines=True,
    )[["reviewText", "overall"]]  # we'll use the "overall" column in the next notebook
    reviews.to_parquet("electronics_topic_modeling.parquet")
else:
    reviews = pd.read_parquet("electronics_topic_modeling.parquet")
    
reviews = reviews["reviewText"].dropna().astype(str)
print(f"{reviews.shape[0]:,} reviews.")

1,689,188 reviews.


We're going to employ some tricks to process this data as quickly as possible.  We're going to create our own preprocessing function using only some of the steps from Gensim's `preprocess_string` function.  Then we're going to call it in parallel over our texts using Joblib, for extra speed.

In [2]:
# Preprocessing
from joblib import Parallel, delayed
from gensim.parsing import preprocessing
from tqdm.notebook import tqdm

def preprocess(s):
    """Apply some of gensim's preprocessing tools."""
    s = preprocessing.strip_punctuation(s)
    s = preprocessing.strip_numeric(s)
    s = preprocessing.remove_stopwords(s.lower())
    s = preprocessing.strip_short(s)
    s = preprocessing.stem_text(s)
    return s.split()

parsed = Parallel(-1)(
    delayed(preprocess)(i)
    for i in tqdm(reviews, unit_scale=True, smoothing=0)
)

print(reviews.iloc[0])
print(parsed[0])

We got this GPS for my husband who is an (OTR) over the road trucker.  Very Impressed with the shipping time, it arrived a few days earlier than expected...  within a week of use however it started freezing up... could of just been a glitch in that unit.  Worked great when it worked!  Will work great for the normal person as well but does have the "trucker" option. (the big truck routes - tells you when a scale is coming up ect...)  Love the bigger screen, the ease of use, the ease of putting addresses into memory.  Nothing really bad to say about the unit with the exception of it freezing which is probably one in a million and that's just my luck.  I contacted the seller and within minutes of my email I received a email back with instructions for an exchange! VERY impressed all the way around!
['got', 'gp', 'husband', 'otr', 'road', 'trucker', 'impress', 'ship', 'time', 'arriv', 'dai', 'earlier', 'expect', 'week', 'us', 'start', 'freez', 'glitch', 'unit', 'work', 'great', 'work', 'wor

In [3]:
# Convert to a bag-of-words format using a Dictionary()
# object, which will handle filtering words for us and
# doing the bag of words conversion.
from gensim.corpora import Dictionary

# Dictionary([[word, word, ...], [word, word, ...], ...])
d = Dictionary(tqdm(parsed, desc="Creating gensim Dictionary", unit_scale=True))

# Remove rare and super common words
d.filter_extremes(no_above=0.5, no_below=20)

# Convert to bag of words
bow = [
    d.doc2bow(i)
    for i in tqdm(parsed, desc="Converting to bag of words", unit_scale=True)
]

# For topic modeling we usually only want texts of at least
# some minimum number of words, after filtering.  Each entry
# in a bag of words is one distinct token, so len(i) counts
# unique words.  I'm picking 20, somewhat arbitrarily.
bow = [i for i in bow if len(i) >= 20]

print(f"{len(bow):,} reviews remain.")

959,803 reviews remain.
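
If the `filter_extremes` call above feels a bit opaque: `no_below` is a minimum number of documents a token must appear in, and `no_above` is a maximum _fraction_ of documents.  Here's a rough pure-Python sketch of that filtering rule on a toy corpus.  The helper `toy_filter_extremes` is hypothetical and only approximates the idea; it is not Gensim's implementation (which, for example, also caps the vocabulary size).

```python
from collections import Counter

def toy_filter_extremes(docs, no_below, no_above):
    """Hypothetical helper: keep tokens whose document frequency is in range."""
    # Document frequency: number of documents containing each token at least once.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [
    ["work", "usb", "drive"],
    ["work", "usb", "cabl"],
    ["work", "speaker"],
    ["work", "drive"],
]

# "work" is in every document (too common); "cabl" and "speaker"
# are each in only one document (too rare).
kept = toy_filter_extremes(docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['drive', 'usb']
```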


Now, we build the LDA model on the bag of words representations.  Gensim has a few different topic models, but as mentioned, we're going to use one called Latent Dirichlet Allocation.  It tends to find more compact/coherent topics than some others, but its parameters can be very fiddly for datasets that aren't absurdly huge (this dataset is pretty moderately sized for LDA).

Gensim's LDA model does have an option to use "callbacks," which is a fancy way of saying "a function we pass that Gensim will periodically call for us."  Gensim typically calls these at the beginning and end of each pass over the dataset (LDA generally does several passes).  But, this makes it a bit hard to monitor the progress of each pass itself, so I'm going to slap together a quick class that'll print the progress on each iteration through a corpus.  Don't worry too much about the code details here.

In [4]:
class Corpus:
    def __init__(self, it):
        self.it = it
        self.n_passes = 1
        
    def __len__(self):
        return len(self.it)
    
    def __iter__(self):
        for i in tqdm(
            self.it,
            desc=f"Pass {self.n_passes}",
            unit_scale=True,
            smoothing=0,
        ):
            yield i
        self.n_passes += 1
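
If you are curious why that wrapper works, here's a stripped-down sketch without the progress bar.  `CountingCorpus` is a hypothetical name used only for this illustration.  The key point is that `__iter__` is a generator, so the counter is bumped only after a consumer has read the corpus all the way through.

```python
class CountingCorpus:
    """Hypothetical tqdm-free version of the wrapper, for illustration only."""

    def __init__(self, it):
        self.it = it
        self.n_passes = 1

    def __len__(self):
        return len(self.it)

    def __iter__(self):
        print(f"Starting pass {self.n_passes}")
        yield from self.it
        # Runs only once the consumer has exhausted the iterator.
        self.n_passes += 1

corpus = CountingCorpus([[(0, 1)], [(1, 2)]])

# Anything that loops over the corpus repeatedly triggers a new "pass".
for _ in range(3):
    n_docs = sum(1 for _doc in corpus)

print(n_docs)           # 2
print(corpus.n_passes)  # 4
```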

In [5]:
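# Set some environment variables before we run the model.
# Some of the lower-level libraries used by the LDA model
# are already multithreaded.  But we get more mileage
# out of multithreading at the LDA level.  So we need to
# single-thread the lower level libraries to avoid
# oversubscribing the CPU.
# Note: these variables are generally read when the libraries
# are first loaded, so if they don't seem to take effect, try
# setting them at the very top of the notebook, before
# importing numpy/pandas/gensim.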
import os

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

In [6]:
from gensim.models.ldamulticore import LdaMulticore

lda = LdaMulticore(
    corpus=Corpus(bow),
    id2word=d,
    num_topics=10,
    passes=5,  # 5 passes over the corpus
    workers=10,
)

Pass 1:   0%|          | 0.00/960k [00:00<?, ?it/s]

Pass 2:   0%|          | 0.00/960k [00:00<?, ?it/s]

Pass 3:   0%|          | 0.00/960k [00:00<?, ?it/s]

Pass 4:   0%|          | 0.00/960k [00:00<?, ?it/s]

Pass 5:   0%|          | 0.00/960k [00:00<?, ?it/s]

In [7]:
print(lda.print_topics())

[(0, '0.033*"drive" + 0.017*"card" + 0.017*"usb" + 0.011*"window" + 0.011*"work" + 0.009*"instal" + 0.009*"us" + 0.008*"hard" + 0.008*"laptop" + 0.008*"file"'), (1, '0.029*"cabl" + 0.022*"screen" + 0.019*"tablet" + 0.014*"work" + 0.013*"product" + 0.010*"protector" + 0.009*"ipad" + 0.009*"like" + 0.008*"us" + 0.007*"good"'), (2, '0.040*"speaker" + 0.026*"sound" + 0.014*"radio" + 0.010*"mount" + 0.009*"unit" + 0.009*"good" + 0.009*"power" + 0.008*"receiv" + 0.008*"antenna" + 0.008*"great"'), (3, '0.026*"batteri" + 0.021*"keyboard" + 0.021*"charg" + 0.017*"work" + 0.013*"us" + 0.011*"mous" + 0.010*"time" + 0.009*"kei" + 0.009*"power" + 0.009*"like"'), (4, '0.020*"connect" + 0.020*"devic" + 0.019*"work" + 0.017*"router" + 0.012*"wireless" + 0.011*"network" + 0.010*"set" + 0.008*"us" + 0.008*"support" + 0.007*"wifi"'), (5, '0.034*"sound" + 0.023*"headphon" + 0.016*"ear" + 0.013*"music" + 0.013*"qualiti" + 0.013*"good" + 0.011*"like" + 0.009*"bass" + 0.008*"listen" + 0.008*"great"'), (6, '0

In [8]:
# Reshape that output into something more readable.
for (topic_num, words) in lda.show_topics(formatted=False):
    print(f"TOPIC #{topic_num}")
    for w in words: 
        print(w)
    print()

TOPIC #0
('drive', 0.032778084)
('card', 0.017403973)
('usb', 0.016849024)
('window', 0.0111500565)
('work', 0.011146831)
('instal', 0.009073854)
('us', 0.009033086)
('hard', 0.00796388)
('laptop', 0.007916439)
('file', 0.00764729)

TOPIC #1
('cabl', 0.029098772)
('screen', 0.02167196)
('tablet', 0.01869011)
('work', 0.0135153)
('product', 0.012651278)
('protector', 0.010461979)
('ipad', 0.009393578)
('like', 0.009151966)
('us', 0.008449879)
('good', 0.0066706496)

TOPIC #2
('speaker', 0.039928056)
('sound', 0.026328849)
('radio', 0.013600866)
('mount', 0.010017371)
('unit', 0.00931071)
('good', 0.009050677)
('power', 0.008708894)
('receiv', 0.008392777)
('antenna', 0.008134712)
('great', 0.007953939)

TOPIC #3
('batteri', 0.026297644)
('keyboard', 0.021481948)
('charg', 0.020904819)
('work', 0.017413568)
('us', 0.013355687)
('mous', 0.01112192)
('time', 0.009931391)
('kei', 0.009099643)
('power', 0.009069663)
('like', 0.008928605)

TOPIC #4
('connect', 0.020471772)
('devic', 0.0197821

We can clearly see--even through the mangling from the Porter stemmer--some pretty consistent themes popping up in this data.  Cameras, keyboards and mice, headphones, speakers, etc. all show up pretty clearly as distinct topics.

We can also use the topic model to _transform_ a document into its topic distributions, i.e., into a vector that tells us "how much" of each topic is in the document.  I'll grab a review from Amazon (for the Logitech G610 Orion keyboard--a good, well priced mechanical keyboard if anyone happens to be curious).

In [9]:
sample_review = """Hello everyone this is my review on the Logitech G610 Orion Red. You can find the TL;DR at the bottom. :)

BUILD QUALITY:

The keyboard feels sturdy and has a bit of weight to it which results in a unit that feels robust and fairly high-quality. The top of the keyboard has a nice matte finish and the keycaps also have a matte, almost "satin" finish to them. The side of the keyboard is glossy and will surely attract fingerprints. Matte sides would have been much nicer and the decision to make them glossy seems a bit strange, but it does not seem to be that big a deal. The keycaps do leave a bit to be desired as they seem to be of a lower quality than the rest of the board and from the looks of it these keycaps will be glossing over fairly soon as I put this keyboard to extensive use. The media keys and volume roller situated in the upper right-hand corner of the keyboard both feel of a high quality.

KEYCAPS:

Do note, and this may be important for some folks, if you want to use custom keycaps, this keyboard does have an odd-size bottom row, particularly the spacebar which comes in at a very weird 5.75u. The entire bottom row on this keyboard is: 1.5u, 1.25u, 1.25u, 5.75u, 1.25u, 1.25u, 1.25u, 1.5u. With enough searching (and of course money) you could probably find keycaps for this bottom row, but do bear in mind that it may not be so easy and once you do find some nicer caps by time you pay for them you could be faced with a situation where the cost of those caps on top of the cost of this board exceeds that of getting a board with better keycaps to begin with. This is perhaps something for you to consider.

LIGHTING:

The lighting on the G610 only comes in white, however if you like the subdued styling of the keyboard but would prefer more lighting colors there is an RGB option, which is the Logitech G810. It is very similar to this board that in addition to the RGB lighting will feature Logitech's proprietary Romer-G switches as opposed to the Cherry brown or red switch options available on the G610. Whether one enjoys the feel of Romer-G switches is very much personal preference, but one thing to note with Romer-Gs is the backlighting will be more consistent across each key, unlike Cherry’s keys where thanks to the switch design influencing the placement of the LED, the lighting can be uneven across a key. For example, the bottom half of the keys of the number row (e.g. @, %, ^, &), and the secondary functions of the Numpad (e.g. Arrow Keys, END, PG DN, DEL) will be dimmer than the top half of the key. I have included a picture of a section the number pad demonstrating this. For many it may not be that big a deal, but if you are particular about things like that it may be an issue for you and could be one reason to look into Romer-G switches. Regardless of the inconsistent lighting, you should have no problem seeing everything in a dark room.

As far as lighting is concerned you can opt to use Logitech's software to customize between different effects such as "Breathing," "Star Effect," "Light Wave" and things like that or if you would rather not deal with the software you can forgo some of the fancier lighting schemes and simply hold down the keyboard's Brightness Setting key while pressing numbers 1, 2, 3, 4, 5, or 0 (0 is for solid) to cycle through different preset lighting schemes. It’s a pretty neat feature. The settings you select through Brightness key + numbers 1-5 and 0 should save through a cold boot. Considering my Logitech G600 mouse features a nice onboard memory that saves lighting and binding profiles, this keyboard could have certainly used something similar which for a unit that retails for $120 would have been nice to have.

FINAL THOUGHTS/OPINIONS:

This keyboard was a second choice for me as I intended to get a WASD CODE which retails for $150. The only reason I picked this Logitech up is the Cherry MX Red version was available on Amazon for $80 and I wanted to save a bit of money. Personally, I’d be reluctant to pay $100 for this keyboard and certainly not the retail asking price of $120 when for just $30 more I could have gotten a backlit board from WASD which has an even superior build quality, more choice of switches, and keycaps that are easier to replace... along with no "in your face" company logos, no media control switches (which for me are useless) and features onboard memory to easily save lighting profiles. Because of a lack of onboard memory to store lighting and key assignment profiles and because of the completely odd-ball and frankly quite annoying size of the bottom row’s spacebar which makes finding a full set of custom keycaps a nightmare, I will be removing one star.

In the end this is a nice keyboard for $80, and even while $100 would be doable it would be stretching it but I do trust the Logitech name and therefore can be pretty certain this keyboard will serve me well for the next few years, even if only as a backup. The flaws I mention are relatively minor and the quality and styling of this board far outweigh them. Personally I do not at all regret the purchase of this keyboard. If you want an easy, simple keyboard from a reputable company go ahead and pick this up and it should serve you well.

TL;DR

+Solid build quality
+Tried and true Cherry MX switches
+Backlit with five levels of brightness
+Easy to take apart (see youtube for teardowns)
+Comfortable to use

-Weird sizing for keycaps on bottom row, good luck finding a 5.75u spacebar :(
-No onboard memory to save lighting profile, key remaps, etc.
-A bit pricey considering no onboard memory, no USB passthrough, etc.
-Uneven lighting on some keys. While due to the design of Cherry switches this is not necessarily a fault of Logitech, the company could have opted to orient the keycap’s primary and secondary legends horizontally instead of vertically (e.g. as see on the backlit WASD Code) which would have provided more even lighting across the key.

Despite some flaws like a very odd sizing on the keycaps of the bottom row and lack of onboard memory to save lighting and other such profiles, The Logitech G610 Orion Red is solid board and one that I can happily recommend."""

# Run all the preprocessing
sample_review = preprocess(sample_review)
sample_review = d.doc2bow(sample_review)
# print(sample_review)

# Gensim's models override the `__getitem__()` method
# to transform data, so we can use this weird-looking
# indexing syntax.  This filters out topics that are below a
# threshold.  You can specify the threshold (the
# `minimum_probability` argument) when creating the
# LdaMulticore model--I left it at the default.
print(lda[sample_review])

[(0, 0.09168905), (3, 0.6173495), (6, 0.120697916), (7, 0.11467561), (8, 0.054508593)]


In [10]:
# Or: get all topics, regardless of their proportion.
print(lda.get_document_topics(sample_review, minimum_probability=0))

[(0, 0.09170318), (1, 0.00021587481), (2, 0.00021586963), (3, 0.6173657), (4, 0.00021585888), (5, 0.0002158707), (6, 0.12075702), (7, 0.114705995), (8, 0.054388754), (9, 0.00021587199)]
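
A quick sanity check on what that output means: the weights form a probability distribution over the 10 topics, so they should sum to (roughly) 1, and we can pull out the dominant topic with plain Python.  The numbers below are copied from the output above.  The `threshold` of 0.01 is just an illustrative value; it happens to keep the same topic ids that the filtered `lda[sample_review]` call returned.

```python
# Topic distribution copied from the output above.
doc_topics = [
    (0, 0.09170318), (1, 0.00021587481), (2, 0.00021586963),
    (3, 0.6173657), (4, 0.00021585888), (5, 0.0002158707),
    (6, 0.12075702), (7, 0.114705995), (8, 0.054388754),
    (9, 0.00021587199),
]

# The weights are proportions, so they should add up to about 1.
total = sum(p for _, p in doc_topics)

# The topic with the largest share of the document.
dominant_topic, dominant_weight = max(doc_topics, key=lambda pair: pair[1])

# Illustrative threshold (an assumption for this sketch).
threshold = 0.01
above = [topic for topic, p in doc_topics if p >= threshold]

print(round(total, 4))   # 1.0
print(dominant_topic)    # 3
print(above)             # [0, 3, 6, 7, 8]
```

Topic 3 is the batteries/keyboards/mice topic from the printout earlier, which is about what we'd hope for with a keyboard review.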


In [11]:
# Convert to a Numpy array for, e.g., input to an ML
# or statistical model.
from gensim.matutils import corpus2dense
topics = lda.get_document_topics(sample_review, minimum_probability=0)
print(topics)

# corpus2dense expects a list of documents.
# `topics` is formatted like a single document.
# We also need to specify how many "terms" are in the
# whole corpus; here the "terms" are topics, so that's 10.
# corpus2dense returns an array with each document in a
# column, so we also transpose it.
print(corpus2dense([topics], 10).T)

[(0, 0.09169665), (1, 0.00021587482), (2, 0.00021586964), (3, 0.61735773), (4, 0.00021585889), (5, 0.00021587072), (6, 0.120721094), (7, 0.11469636), (8, 0.054448795), (9, 0.000215872)]
[[9.16966498e-02 2.15874825e-04 2.15869644e-04 6.17357731e-01
  2.15858890e-04 2.15870721e-04 1.20721094e-01 1.14696361e-01
  5.44487946e-02 2.15872002e-04]]
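
For intuition, here's roughly what that conversion does, sketched in plain Python for more than one document.  `to_dense_rows` is a hypothetical helper, not part of Gensim, and the second document's weights are made up purely for illustration.  Topics missing from a sparse list (for example, because they were filtered out by the threshold) simply become zeros.

```python
def to_dense_rows(docs_topics, num_topics):
    """Hypothetical helper: one row per document, one column per topic."""
    rows = []
    for doc in docs_topics:
        row = [0.0] * num_topics
        for topic_id, prob in doc:
            row[topic_id] = prob
        rows.append(row)
    return rows

# First document: the thresholded output of lda[sample_review] from earlier.
filtered = [(0, 0.09168905), (3, 0.6173495), (6, 0.120697916),
            (7, 0.11467561), (8, 0.054508593)]
# Second document: made-up weights, just to show stacking.
other = [(2, 0.75), (5, 0.25)]

dense = to_dense_rows([filtered, other], num_topics=10)
for row in dense:
    print(row)
```

The result has one row per document and one column per topic, which is the layout most ML and statistical libraries expect.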
