## ASA Week 09: Topic Modeling and Word Embedding in Gensim


Part 1. Topic Modeling using Gensim
---

Let's first take a look at how to perform topic modeling using the Python package [*Gensim*](https://radimrehurek.com/gensim/).

We need ot install the following two packages:

- pip install gensim
- pip install pyldavis

We will use the [movie review dataset](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) in the file "reviews.csv" as the training data (each line is a review document).

In [14]:
# read textual review data
lines = open("reviews.csv",encoding="utf-8",mode="r").readlines()
reviews = [line.strip().split(',')[1] for line in lines]
print(reviews[:2])

[' with all this stuff going down at the moment with mj started listening to his music watching the odd documentary here and there watched the wiz and watched moonwalker again maybe just want to get certain insight into this guy who thought wa really cool in the eighty just to maybe make up my mind whether he is guilty or innocent moonwalker is part biography part feature film which remember going to see at the cinema when it wa originally released some of it ha subtle message about mj feeling towards the press and also the obvious message of drug are bad visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring some may call mj an egotist for consenting to the making of this movie but mj and most of his fan would say that he made it for the fan which if true is really nice of him the actual feature film bit when it finally start is only on for 20 minute or so excluding the smooth crim

In [3]:
# process data
from gensim.parsing.preprocessing import remove_stopwords

# remove stopwords
reviews = [remove_stopwords(review) for review in reviews]
# review doc: string to list
reviews = [review.split() for review in reviews]

In [4]:
# prepare data for gensim
from gensim import corpora

%time vocabs = corpora.Dictionary(reviews) # generate vocabulary dictionary
vocabs.save('movie_reviews.dict') # you can save the dictionary in disk for future use
print(vocabs)

%time corpus = [vocabs.doc2bow(review_doc) for review_doc in reviews] # generate corpus for training model
corpora.MmCorpus.serialize('movie_reviews.mm', corpus) # you can save the corpus for future use
print(corpus[0])

CPU times: user 2.92 s, sys: 29.8 ms, total: 2.95 s
Wall time: 3 s
Dictionary(65049 unique tokens: ['20', 'actual', 'attention', 'away', 'bad']...)
CPU times: user 1.82 s, sys: 85 ms, total: 1.91 s
Wall time: 1.98 s
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 3), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 2), (28, 1), (29, 1), (30, 1), (31, 3), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 2), (39, 2), (40, 1), (41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 3), (47, 1), (48, 2), (49, 2), (50, 1), (51, 3), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 2), (59, 2), (60, 1), (61, 1), (62, 3), (63, 1), (64, 1), (65, 1), (66, 3), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 3), (73, 3), (74, 2), (75, 1), (76, 1), (77, 11), (78, 1), (79, 2), (80, 3), (81, 2), (82, 1), (83, 2), (84, 1), (85, 1), (86, 1), (87, 1), 

In [5]:
# train LDA model
from gensim import models

%time lda = models.ldamodel.LdaModel(corpus=corpus, id2word=vocabs, num_topics=100)
lda.save('reviews_lda_100.model') # you can save the trained model for future use

# you can load the saved model later as follows:
# lda = models.LdaModel.load('reviews_lda_100.model')

CPU times: user 1min 22s, sys: 7.25 s, total: 1min 29s
Wall time: 52.6 s


In [6]:
# show topics
lda.show_topics(-1) # print topics

[(0,
  '0.032*"steve" + 0.026*"robot" + 0.021*"dan" + 0.020*"fantasy" + 0.014*"machine" + 0.011*"life" + 0.011*"renaissance" + 0.011*"human" + 0.010*"grandmother" + 0.010*"real"'),
 (1,
  '0.045*"film" + 0.011*"horror" + 0.009*"wa" + 0.008*"character" + 0.008*"director" + 0.007*"ha" + 0.006*"doe" + 0.006*"like" + 0.006*"time" + 0.005*"good"'),
 (2,
  '0.106*"thomas" + 0.085*"reunion" + 0.059*"overlooked" + 0.059*"kurt" + 0.036*"photograph" + 0.035*"sheet" + 0.030*"marty" + 0.029*"blowing" + 0.028*"prank" + 0.024*"ada"'),
 (3,
  '0.059*"jane" + 0.055*"animal" + 0.053*"tarzan" + 0.039*"native" + 0.035*"jungle" + 0.034*"ape" + 0.033*"chris" + 0.018*"elephant" + 0.017*"walken" + 0.015*"toilet"'),
 (4,
  '0.046*"comedy" + 0.027*"funny" + 0.015*"film" + 0.014*"laugh" + 0.012*"humour" + 0.011*"great" + 0.011*"hilarious" + 0.011*"adam" + 0.011*"humor" + 0.009*"funniest"'),
 (5,
  '0.052*"fight" + 0.050*"art" + 0.042*"action" + 0.019*"martial" + 0.016*"fighting" + 0.015*"master" + 0.013*"chines

In [7]:
# represent documents using topic distributions
corpus_lda = lda[corpus]
print(corpus_lda[0])

[(6, 0.22105563), (11, 0.22829233), (23, 0.023068018), (55, 0.1682661), (60, 0.05279133), (69, 0.18127695), (80, 0.012877543), (86, 0.034553945), (95, 0.018431295)]


In [8]:
# visualize the trained lda model using pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models # pyLDAvis version = 3.3.x
# import pyLDAvis.gensim # pyLDAvis version = 3.2.2
pyLDAvis.enable_notebook()

%time vis_data = pyLDAvis.gensim_models.prepare(lda, corpus, vocabs) # pyLDAvis version = 3.3.x
# %time vis_data = pyLDAvis.gensim_models.prepare(lda, corpus, vocabs) # pyLDAvis version = 3.2.2

CPU times: user 1min 22s, sys: 6.4 s, total: 1min 28s
Wall time: 1min 32s


In [9]:
vis_data

  and should_run_async(code)


### Exercise

In this exercise, please use Gensim to train a LDA model with *50* topics on the above movie review dataset, and visualize the trained model using pyLDAvis. Can you get more meaningful topics compared with the previous LDA model with 100 topics?

In [10]:
# write your code here

  and should_run_async(code)


In [11]:
lines = open("reviews.csv",encoding="utf-8",mode="r").readlines()
reviews = [line.strip().split(',')[1] for line in lines]

# remove stopwords
reviews = [remove_stopwords(review) for review in reviews]
# review doc: string to list
reviews = [review.split() for review in reviews]

%time vocabs = corpora.Dictionary(reviews) # generate vocabulary dictionary
%time corpus = [vocabs.doc2bow(review_doc) for review_doc in reviews] # generate corpus for training model

%time lda = models.ldamodel.LdaModel(corpus=corpus, id2word=vocabs, num_topics=50)

vis_data = pyLDAvis.gensim_models.prepare(lda, corpus, vocabs)

vis_data

  and should_run_async(code)


CPU times: user 3.67 s, sys: 24.4 ms, total: 3.7 s
Wall time: 3.85 s
CPU times: user 2.17 s, sys: 58.6 ms, total: 2.23 s
Wall time: 2.25 s
CPU times: user 1min 1s, sys: 4.96 s, total: 1min 6s
Wall time: 40.4 s


Part 2. Word Embedding using Gensim
---

Now, let's take a look at how to train word embedding model using Gensim. We will also use the [movie review dataset](https://www.kaggle.com/c/word2vec-nlp-tutorial/data). When preparing the data for training Word2Vec model, we process the textual review data as a collection of sentences rathen than documents. This is a standard routine which usually results in better model. The processed data is in the file "review_sentences.csv" (each line is a review sentence).


In [13]:
# read textual review data

review_sents = open("review_sentences.csv",encoding="utf-8",mode="r").readlines()
review_sents = [sent.strip().split() for sent in review_sents]
print(len(review_sents))
print(review_sents[:2])

  and should_run_async(code)


218725
[['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker'], ['maybe', 'want', 'certain', 'insight', 'guy', 'thought', 'wa', 'cool', 'eighty', 'maybe', 'mind', 'guilty', 'innocent']]


Different from training topic models in Gensim, we do not need to build the dictionary and corpus by ourselves when training a Word2Vec model.

In [14]:
# train word2vec skipgram model
from gensim.models import Word2Vec

# set model parameters
num_features = 100    # Word vector dimensionality                      
min_word_count = 5    # Minimum word count                        
context = 5           # Context window size
sg = 1                # skipgram=1, cbow=0
num_workers = 4       # Number of threads to run in parallel

# train the model
%time model_sg100 = Word2Vec(review_sents, vector_size=num_features, window=context, min_count=min_word_count, sg=sg, workers=num_workers)

# you can save the trained model for future use
model_sg100.save("reviews_word2vec_skipgram_100.model")

# you can load saved model later as follows:
# model_w2v = Word2Vec.load("reviews_word2vec_skipgram_100.model")

  and should_run_async(code)


CPU times: user 1min 52s, sys: 917 ms, total: 1min 53s
Wall time: 45.6 s


For more details about the Word2Vec model parameters, you can refer to the [Gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec).

Once the model is trained, we can examine it as follows:

In [15]:
# examine the trained word2vec skipgram model

print('embedding vector of the word "movie":')
print(model_sg100.wv['movie']) # word embedding vector of the word "movie"
print(model_sg100.wv.most_similar(positive=['woman', 'king'], negative=['man'])) # woman-man+king=?
print(model_sg100.wv.doesnt_match("man woman child kitchen".split()))
print(model_sg100.wv.doesnt_match("france england germany berlin".split()))
print(model_sg100.wv.doesnt_match("breakfast cereal dinner lunch".split()))
print(model_sg100.wv.most_similar("man"))
print(model_sg100.wv.most_similar("woman"))
print(model_sg100.wv.most_similar("awful"))
print(model_sg100.wv.most_similar("excellent"))
print(model_sg100.wv.most_similar("cage"))

embedding vector of the word "movie":
[-0.13262676  0.16056094  0.28365612  0.22836727  0.040193   -0.1073219
  0.19703384  0.20268294 -0.33878648 -0.18812129 -0.21795915 -0.41134977
  0.23389016  0.178141    0.10468085 -0.2573494   0.70758843 -0.15889019
 -0.21894109 -0.29213256  0.27027076  0.1569035   0.23812015 -0.33200762
  0.1388619  -0.02589815 -0.2270016   0.32956994 -0.37776718  0.16968004
  0.28669795 -0.18214868 -0.13263872 -0.42392448 -0.1299315  -0.21997157
  0.15610188  0.2358365  -0.08900692 -0.00480238  0.23624206 -0.35871303
 -0.06051981  0.08753847  0.0051061   0.02259411 -0.03256884  0.03449161
  0.091901    0.25280926 -0.05675744 -0.29356095 -0.30331823 -0.40025184
 -0.305623    0.06302369 -0.00423238  0.01069803 -0.12573676  0.01525086
 -0.18297216 -0.28861108  0.08750325  0.22940835 -0.01921438  0.5184546
 -0.04013998  0.09056639 -0.38322476  0.5284049   0.07507122 -0.10726035
  0.5324626  -0.02726773  0.00198081  0.22666588  0.3940536   0.04165513
 -0.10894325  0

  and should_run_async(code)


In [16]:
# all word embeddings
word_vectors = model_sg100.wv.vectors
print(type(word_vectors))
print(word_vectors.shape) # num_words * num_features

<class 'numpy.ndarray'>
(24137, 100)


  and should_run_async(code)


#### We can use  [gensim.models.phrases](https://radimrehurek.com/gensim/models/phrases.html#module-gensim.models.phrases) module for phrases with more than one word.

In [17]:
# train a bigram word2vec skipgram model
from gensim.models import Phrases
from gensim.models.phrases import Phraser

bigram_transformer = Phraser(Phrases(review_sents))
%time model_sg100_bigram = Word2Vec(bigram_transformer[review_sents], vector_size=100, window=5, min_count=5, sg=1, workers=4)
model_sg100_bigram.save("reviews_word2vec_skipgram_100_bigram.model")

  and should_run_async(code)


CPU times: user 2min 10s, sys: 1 s, total: 2min 11s
Wall time: 51.3 s


In [18]:
# now we can play with some bigram phrases
print(model_sg100_bigram.wv.most_similar("nicolas_cage"))
print(model_sg100_bigram.wv.most_similar("harry_potter"))

[('paul_newman', 0.935321033000946), ('quinn', 0.9315627217292786), ('olsen', 0.9299476146697998), ('chris_cooper', 0.9235896468162537), ('kurt', 0.9209234714508057), ('william_macy', 0.9201018810272217), ('tony_danza', 0.9188805222511292), ('dreyfus', 0.9177072048187256), ('sam_neill', 0.9170575141906738), ('micheal', 0.9167933464050293)]
[('talespin', 0.9301210641860962), ('undoubtedly_best', 0.9263731837272644), ('zu_warrior', 0.9237160086631775), ('pale_comparison', 0.9185023903846741), ('goodfellas', 0.9182243347167969), ('disney_animated', 0.9170673489570618), ('stayed_true', 0.9166109561920166), ('house_wax', 0.9149521589279175), ('ado', 0.9144208431243896), ('mary_poppins', 0.9143704771995544)]


  and should_run_async(code)


### Exercise

In this exercise, please use Gensim to train a Word2Vec CBOW (rather than Skipgram) model on the above movie review dataset (i.e., the file "review_sentences.csv") as follows:

- set the model parameters as follows:
    * num_features = 100 (word vector dimension)
    * min_count = 5 (minimum word count)
    * window = 5 (context window size)
    * sg = 0 (train a **CBOW** model)
- play with the learnt model by examining:
    * the most similar words to some words, e.g., excellent, cage
    * find the word that does not match with other words in a word list
- qualitatively compare the CBOW model with Skipgram model, which is better in your opinion?

In [19]:
# write your code here

  and should_run_async(code)


In [20]:
review_sents = open("review_sentences.csv",encoding="utf-8",mode="r").readlines()
review_sents = [sent.strip().split() for sent in review_sents]

# set model parameters
num_features = 100    # Word vector dimensionality                      
min_word_count = 5    # Minimum word count                        
context = 5           # Context window size
sg = 0               # skipgram=1, cbow=0
num_workers = 4       # Number of threads to run in parallel

# train the model
%time model_cbow = Word2Vec(review_sents, vector_size=num_features, window=context, min_count=min_word_count, sg=sg, workers=num_workers)

  and should_run_async(code)


CPU times: user 39.8 s, sys: 201 ms, total: 40 s
Wall time: 14.3 s


In [21]:
print(model_cbow.wv.most_similar("excellent"))
print(model_cbow.wv.most_similar("cage"))

[('fantastic', 0.8971798419952393), ('outstanding', 0.8954408168792725), ('fine', 0.877120316028595), ('superb', 0.8752452731132507), ('terrific', 0.8617680668830872), ('solid', 0.8468698859214783), ('brilliant', 0.8424371480941772), ('wonderful', 0.8410651087760925), ('exceptional', 0.8381034731864929), ('fabulous', 0.8247731328010559)]
[('hopper', 0.8688058853149414), ('wilson', 0.8406587243080139), ('goldblum', 0.8380635976791382), ('phillips', 0.8287017941474915), ('bacon', 0.8267509341239929), ('dern', 0.826496422290802), ('nicolas', 0.8223949670791626), ('jeff', 0.818874180316925), ('nicholas', 0.8181320428848267), ('freeman', 0.8176431655883789)]


  and should_run_async(code)


In [22]:
print(model_cbow.wv.doesnt_match(['breakfast', 'cereal', 'dinner', 'lunch']))

dinner


  and should_run_async(code)


In [23]:
# Skip-gram model is better

  and should_run_async(code)


Homework (Optional): Topic Modeling and Word Embedding on 10-K MD&A Corpus
---
In this homework, we will use another dataset - the MD&A texts (item 7) of 10-K forms in 2012. For your convenience, I have already processed the data in the file "10kmda2012_sentences.csv" (each line is a MD&A sentence).

**Task 1:** train a LDA model on the processed 10-K MD&A corpus:
- only use the first 100,000 sentences in the corpus (otherwise, it will cost too much time)
- set the number of topics to 100
- visualize the trained model using pyLDAvis
- manually check whether the learnt topics are meaningful

**Task 2:** train a Word2Vec model on the processed 10K MD&A corpus:
- set the model parameters as follows:
    + num_features = 100 (word vector dimension)
    + min_count = 5 (minimum word count)
    + window = 5 (context window size)
    + sg = 1 (i.e., train a **skip-gram** model)
- play with the learnt model by examining:
    + the most similar words to some words, e.g., apple, ceo, excellent
    + find the word that does not match with other words in a word list, e.g., "apple google amazon banana", ""breakfast cereal dinner lunch""

In [24]:
# write your code here

  and should_run_async(code)


In [25]:
# LDA
sents = open("10kmda2012_sentences.csv",encoding="utf-8",mode="r").readlines()
sents = [sent.strip() for sent in sents]
sents = sents[:100000]

  and should_run_async(code)


In [26]:
# process data
from gensim.parsing.preprocessing import remove_stopwords

# remove stopwords
sents = [remove_stopwords(sent) for sent in sents]
# review doc: string to list
sents = [sent.split() for sent in sents]

  and should_run_async(code)


In [27]:
# prepare data for gensim
from gensim import corpora

%time vocabs = corpora.Dictionary(sents) # generate vocabulary dictionary
# vocabs.save('movie_reviews.dict') # you can save the dictionary in disk for future use
print(vocabs)

%time corpus = [vocabs.doc2bow(sent_doc) for sent_doc in sents] # generate corpus for training model
# corpora.MmCorpus.serialize('movie_reviews.mm', corpus) # you can save the corpus for future use
print(corpus[0])

  and should_run_async(code)


CPU times: user 2.62 s, sys: 15.4 ms, total: 2.64 s
Wall time: 2.66 s
Dictionary(17151 unique tokens: ['analysis', 'condition', 'discussion', 'financial', 'management']...)
CPU times: user 1.97 s, sys: 64.4 ms, total: 2.03 s
Wall time: 2.04 s
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]


In [28]:
# train LDA model
from gensim import models

%time lda = models.ldamodel.LdaModel(corpus=corpus, id2word=vocabs, num_topics=100)

  and should_run_async(code)


CPU times: user 50.4 s, sys: 1.55 s, total: 52 s
Wall time: 49.1 s


In [29]:
# visualize the trained lda model using pyLDAvis
pyLDAvis.enable_notebook()
import pyLDAvis
import pyLDAvis.gensim_models # pyLDAvis version = 3.3.x

%time vis_data = pyLDAvis.gensim_models.prepare(lda, corpus, vocabs) # pyLDAvis version = 3.3.x
vis_data

  and should_run_async(code)


CPU times: user 32.4 s, sys: 1.4 s, total: 33.8 s
Wall time: 38.6 s


In [30]:
# word2vec

  and should_run_async(code)


In [31]:
sents = open("10kmda2012_sentences.csv",encoding="utf-8",mode="r").readlines()
sents = [sent.strip().split() for sent in sents]

  and should_run_async(code)


In [32]:
# train word2vec skipgram model
from gensim.models import Word2Vec

# set model parameters
num_features = 100    # Word vector dimensionality                      
min_word_count = 5    # Minimum word count                        
context = 5           # Context window size
sg = 1                # skipgram=1, cbow=0
num_workers = 4       # Number of threads to run in parallel

# train the model
%time model_sg = Word2Vec(review_sents, vector_size=num_features, window=context, min_count=min_word_count, sg=sg, workers=num_workers)

  and should_run_async(code)


CPU times: user 1min 48s, sys: 615 ms, total: 1min 49s
Wall time: 32.8 s


In [33]:
print(model_sg.wv.most_similar("apple"))
print(model_sg.wv.most_similar("ceo"))
print(model_sg.wv.most_similar("excellent"))

[('bicycle', 0.7788037061691284), ('fish', 0.7763434648513794), ('cone', 0.7683797478675842), ('bowl', 0.7620484828948975), ('shiny', 0.7609529495239258), ('sensation', 0.7603657841682434), ('straddling', 0.7584481239318848), ('pond', 0.7582493424415588), ('mud', 0.7581585645675659), ('lebowski', 0.7571097016334534)]
[('trautman', 0.9422417879104614), ('alphonse', 0.9385595917701721), ('rushton', 0.9356604218482971), ('commender', 0.9348047971725464), ('inefficient', 0.9346885085105896), ('murdock', 0.9346219897270203), ('considine', 0.9336346983909607), ('magnate', 0.9326514005661011), ('assan', 0.93124920129776), ('organizing', 0.9298204183578491)]
[('superb', 0.8397080302238464), ('outstanding', 0.8107491135597229), ('terrific', 0.7982479333877563), ('fantastic', 0.7792264223098755), ('fine', 0.7720721960067749), ('marvelous', 0.7704530358314514), ('great', 0.7698812484741211), ('stellar', 0.7666733264923096), ('exceptional', 0.7586615085601807), ('brilliant', 0.756466269493103)]


  and should_run_async(code)


In [34]:
print(model_sg.wv.doesnt_match(['apple', 'google', 'amazon', 'banana']))
print(model_sg.wv.doesnt_match(['breakfast', 'cereal', 'dinner', 'lunch']))

apple
dinner


  and should_run_async(code)
