# LDA Modelling - 2nd Run

In this notebook, I experiment with the filter_extremes() function to see if it helps increase coherence.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import string
import nltk
from nltk.corpus import stopwords

import gensim
from gensim import corpora, models, similarities
from gensim.models import CoherenceModel, LdaModel, LdaMulticore

import pyLDAvis
import pyLDAvis.gensim as p_gensim

import pathlib
import os
%matplotlib inline

In [2]:
final_df = pd.read_csv('./dataframes/final_df.csv',index_col=0)

In [3]:
final_df = final_df[['review','clean_reviews','2gram_reviews','3gram_reviews']]
final_df.head()

Unnamed: 0,review,clean_reviews,2gram_reviews,3gram_reviews
0,Well for me game still tons of work. i like it...,"['tons', 'work', 'recommend', 'one', 'diplomac...","['ton', 'work', 'recommend', 'diplomacy', 'jok...","['ton', 'work', 'recommend', 'diplomacy', 'jok..."
1,I pursued Lu Bu. Now I [b]AM[/b] LU BU.,"['pursued', 'lu', 'bu', 'lu', 'bu']","['pursue', 'lu_bu', 'lu_bu']","['pursue', 'lu_bu', 'lu_bu']"
2,Absolutely great game. \nAll the new diplomacy...,"['absolutely', 'new', 'diplomacy', 'options', ...","['absolutely', 'new', 'diplomacy_options', 'de...","['absolutely', 'new', 'diplomacy_options', 'de..."
3,A fine blend of Warhammer I/II: Total War and ...,"['fine', 'blend', 'warhammer', 'two', 'total',...","['fine', 'blend', 'warhammer_two', 'total_war'...","['fine', 'blend', 'warhammer_two', 'total_war'..."
4,Innovative Total Game that has lots of persona...,"['innovative', 'total', 'lots', 'personality',...","['innovative', 'total', 'lot', 'personality', ...","['innovative', 'total', 'lot', 'personality', ..."


In [4]:
final_df['clean_reviews'] = final_df['clean_reviews'].map(lambda x: ''.join(c for c in x if c=='_' or c not in string.punctuation).split()) #n-grams underscores must be preserved for readability
final_df['2gram_reviews'] = final_df['2gram_reviews'].map(lambda x: ''.join(c for c in x if c=='_' or c not in string.punctuation).split()) 
final_df['3gram_reviews'] = final_df['3gram_reviews'].map(lambda x: ''.join(c for c in x if c=='_' or c not in string.punctuation).split()) 
#Reading in the DF from a CSV turned the list of words in each cell into string, so we have to remove the punctuation and split them again

In [19]:
final_df['3gram_reviews'] = final_df['3gram_reviews'].map(lambda x: [y for y in x if len(y)>1]) #remove single characters from the dataset

In [20]:
final_df.head()

Unnamed: 0,review,clean_reviews,2gram_reviews,3gram_reviews
0,Well for me game still tons of work. i like it...,"[tons, work, recommend, one, diplomacy, joke, ...","[ton, work, recommend, diplomacy, joke, work, ...","[ton, work, recommend, diplomacy, joke, work, ..."
1,I pursued Lu Bu. Now I [b]AM[/b] LU BU.,"[pursued, lu, bu, lu, bu]","[pursue, lu_bu, lu_bu]","[pursue, lu_bu, lu_bu]"
2,Absolutely great game. \nAll the new diplomacy...,"[absolutely, new, diplomacy, options, depth, u...","[absolutely, new, diplomacy_options, depth, un...","[absolutely, new, diplomacy_options, depth, un..."
3,A fine blend of Warhammer I/II: Total War and ...,"[fine, blend, warhammer, two, total, war, shog...","[fine, blend, warhammer_two, total_war, shogun...","[fine, blend, warhammer_two, total_war, shogun..."
4,Innovative Total Game that has lots of persona...,"[innovative, total, lots, personality, brings,...","[innovative, total, lot, personality, bring, n...","[innovative, total, lot, personality, bring, n..."


# LDA Model - 3grams with filter_extremes()

In [21]:
#build dictionary and corpus from 3gram dataset -- this time with filter_extremes

documents = list(final_df['3gram_reviews'])
dictionary = gensim.corpora.Dictionary(documents)
dictionary.filter_extremes(no_below = 5,no_above=0.7)
corpus = [dictionary.doc2bow(word) for word in documents]

In [22]:
# LDA model parameters.
num_topics = 10
passes = 100
eval_every = None #Evaluation will happen later so no need to evaluate while training

In [23]:
%time ldamodel1 = LdaMulticore(corpus, num_topics=num_topics, id2word = dictionary, passes=passes, alpha='asymmetric',eval_every=eval_every,workers=3)

# Check resulting topics.
topic_list = ldamodel1.print_topics(num_topics=num_topics, num_words=15)
for index, i in enumerate(topic_list):
    str1 = str(i[1])
    for c in "0123456789+*\".":
        str1 = str1.replace(c, "")
    str1 = str1.replace("  ", " ")
    print(str1)

Wall time: 2min 14s
total_war play series love game far best_total_war fun shogun_two fan good tw three_kingdoms feel total_war_games
not time total_war do _ play fuck war battle people ai fight china buy army
unit general battle army faction ai need building thing campaign bad cavalry diplomacy time enemy
battle unit character total_war general diplomacy campaign play feel hero ui look combat graphic new
total_war character diplomacy faction feel battle new play three_kingdoms general series different fun ai add
faction turn crash play vassal army start time general liu_bei want patch war get leave
lu_bu play cao_cao turn want faction get friend win bug coalition start let sun_jian yuan_shao
total_war play thing new time love change campaign buy know work recommend hour need review
ca battle add time dlc feel release nice campaign look thing character go start little
campaign character con h rotk pro lord family sun_ce koei rtk battle fun hero total_war


In [24]:
# Compute Perplexity
print('\nPerplexity: ', ldamodel1.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda1 = CoherenceModel(model=ldamodel1, texts=documents, dictionary=dictionary, coherence='c_v')
coherence_lda1 = coherence_model_lda1.get_coherence()
print('\nCoherence Score: ', coherence_lda1)


Perplexity:  -7.136575014628985

Coherence Score:  0.40131799679490987


In [25]:
pyLDAvis.enable_notebook()
vis = p_gensim.prepare(ldamodel1, corpus, dictionary)
vis

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


In [26]:
#this model has filtered out terms that occur 70% of the time
#it has topics that seem like they can be grouped together, so maybe we are on the right track here.
#coherence score is now 0.42, slightly better than previous iteration

newpath = './models/filter_extremes/' 
if not os.path.exists(newpath):
    os.makedirs(newpath)

ldamodel1.save('./models/filter_extremes/fe1.model')

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [27]:
#build dictionary and corpus from 3gram dataset -- this time, use default settings for filter_extremes()

documents = list(final_df['3gram_reviews'])
dictionary2 = gensim.corpora.Dictionary(documents)
dictionary2.filter_extremes()
corpus2 = [dictionary2.doc2bow(word) for word in documents]

In [28]:
# LDA model parameters.
num_topics = 10
passes = 100
eval_every = None #Evaluation will happen later so no need to evaluate while training

In [29]:
%time ldamodel2 = gensim.models.ldamulticore.LdaMulticore(corpus2, num_topics=num_topics, id2word = dictionary2, passes=passes, alpha='asymmetric',eval_every=eval_every,workers=3)

# Check resulting topics.
topic_list = ldamodel2.print_topics(num_topics=num_topics, num_words=15)
for index, i in enumerate(topic_list):
    str1 = str(i[1])
    for c in "0123456789+*\".":
        str1 = str1.replace(c, "")
    str1 = str1.replace("  ", " ")
    print(str1)

Wall time: 2min 10s
play total_war buy love campaign hour time _ ca fun amazing three_kingdoms fan want 
faction general battle character play diplomacy campaign unit army feel new total_war thing different building
total_war play series game far best_total_war tw fun love diplomacy shogun_two new three_kingdoms fan good
lu_bu general _ china army need man change enemy history fight dong duel hero battle
unit battle general ai feel campaign total_war make look mechanic diplomacy time duel character cavalry
play crash fix ai get time issue patch release bad thing review bug hour problem
character faction not vassal do add time lack map get pro ai start unique campaign
army cao_cao turn war liu_bei enemy yuan_shao fight sun_jian warlord faction kill coalition win friend
ca history three_kingdoms chinese title novel release player historical time know want one_best_total_wars content work
total_war new play three_kingdoms old experience love mechanic total_wars fan medieval_two gameplay c

In [30]:
# Compute Perplexity
print('\nPerplexity: ', ldamodel2.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda2 = CoherenceModel(model=ldamodel2, texts=documents, dictionary=dictionary2, coherence='c_v')
coherence_lda2 = coherence_model_lda2.get_coherence()
print('\nCoherence Score: ', coherence_lda2)


Perplexity:  -7.136011152263055

Coherence Score:  0.4092940169787706


In [31]:
pyLDAvis.enable_notebook()
vis = p_gensim.prepare(ldamodel2, corpus, dictionary)
vis

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


# 2nd Run in Summary

Overall coherence score went down when using default settings.

Visual insepction of the topics did not yield any set of topics more coherent than the 1st run.

Some topics also have serious overlapping. This does not seem like a very useful set of topics.

Next, I will try using a dataset with nouns only.

### References

https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/