# Part 6A: LDA Topic Modelling

In parallel with KMeans clustering, I would also like to try clustering the reviews using LDA topic modelling. The key difference between the two methods is that LDA topic modelling groups reviews into topics by looking solely at **text data**, which in this case is the review text. In contrast, KMeans clustering can cluster the reviews based on **all features**: the tokenized text as well as other numeric features.

The goal of this notebook is to perform LDA topic modelling on the review text and compare the results with the clusters formed via KMeans clustering to try and identify fake reviews.
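To make the contrast concrete, here is a minimal sketch on a toy corpus. The reviews and the star ratings are invented for illustration and do not come from the dataset. It assumes the numeric features are simply stacked next to the TF-IDF columns for KMeans, while LDA only ever sees the text matrix and returns a weight per topic for each review.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews and ratings, invented for illustration only
toy_reviews = [
    "razor blade gives a close shave",
    "sharp blade and smooth shave",
    "lipstick color is a lovely pink",
    "matte lipstick with great color",
]
toy_ratings = np.array([5.0, 4.0, 2.0, 5.0])  # hypothetical numeric feature

# Text-only matrix: this is all that LDA looks at
text_matrix = TfidfVectorizer().fit_transform(toy_reviews)

# KMeans can use the text columns plus any extra numeric columns
combined = hstack([text_matrix, csr_matrix(toy_ratings.reshape(-1, 1))]).tocsr()
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(combined)

# LDA returns one row of topic weights per review (rows sum to 1)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(text_matrix)

print(kmeans_labels.shape)  # one hard cluster label per review
print(doc_topic.shape)      # one weight per topic for each review
```

KMeans assigns each review to exactly one cluster, whereas each row of `doc_topic` spreads a review across topics.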

In [1]:
#importing relevant libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import LatentDirichletAllocation as LDA

In [2]:
#importing libraries for text processing
from nltk.corpus import stopwords 
ENGLISH_STOP_WORDS = stopwords.words('english')
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
#bringing in just the reviewText from the dataset (requires the custom functions library)
import functions_library as fl
review_text = fl.cleanDF(fl.createPdDF('All_Beauty.json.gz'))['reviewText']

In [8]:
#make sure it's loaded in properly
review_text

0                                                     great
1         My  husband wanted to reading about the Negro ...
2         This book was very informative, covering all a...
3         I am already a baseball fan and knew a bit abo...
4         This was a good story of the Black leagues. I ...
                                ...                        
362247    It was awful. It was super frizzy and I tried ...
362248    I was skeptical about buying this.  Worried it...
362249                             Makes me look good fast.
362250    Way lighter than photo\nNot mix blend of color...
362251    No return instructions/phone # in packaging.  ...
Name: reviewText, Length: 362252, dtype: object

#### TF-IDF vectorization

In [5]:
#using same settings used for KMeans clustering to be consistent
vectorizer = TfidfVectorizer(min_df = 1000, tokenizer = fl.spl_tokenizer, ngram_range = (1,2))

In [9]:
#get tokens from reviewText
word_matrix = vectorizer.fit_transform(review_text)
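As a quick illustration of what `min_df` and `ngram_range` do, here is a small sketch on an invented three-review corpus. It uses the default tokenizer rather than `fl.spl_tokenizer` (which lives in the custom library and is not shown here), and a much smaller `min_df` so the effect is visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews, invented for illustration only
toy_reviews = [
    "great razor",
    "great razor blade",
    "great color",
]

# min_df=2 keeps only terms that appear in at least 2 reviews;
# ngram_range=(1, 2) adds two-word phrases alongside single words
toy_vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)             # ['great', 'great razor', 'razor']
print(toy_matrix.shape)  # (3, 3)
```

Terms such as "blade" and "color" are dropped because they occur in only one review, which is the same filtering that `min_df = 1000` applies at the scale of the full dataset.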

In [10]:
# Helper function
#source: https://github.com/kapadias/mediumposts/blob/master/nlp/published_notebooks/Introduction%20to%20Topic%20Modeling.ipynb
def print_topics(model, vectorizer, n_top_words):
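    # get_feature_names() was removed in scikit-learn 1.2; use it only as a fallback on older versions
    get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
    words = get_names()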
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(",".join([words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

## LDA Topic Modelling with 25 Topics
For LDA topic modelling, we need to pre-select the number of topics we think exist in our text. To be consistent with KMeans clustering, where 25 clusters were selected, I will choose 25 topics. Note: this is not necessarily the optimal way to determine the number of topics, and it can be improved in future iterations.
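One possible improvement, sketched below on an invented toy corpus, is to fit LDA for several candidate topic counts and compare perplexity on held-out reviews (lower is better). The candidate values and the 75/25 split are assumptions for illustration, and perplexity is only one heuristic; it was not used in this notebook.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy reviews, invented for illustration only
toy_reviews = [
    "razor blade close shave", "sharp blade smooth shave",
    "electric razor shave", "blade stays sharp",
    "lipstick color pink", "matte lipstick color",
    "lip gloss nice color", "red lipstick lasts",
    "soap smells nice", "strong scent soap",
    "natural deodorant scent", "fragrance smells great",
]

toy_matrix = TfidfVectorizer().fit_transform(toy_reviews)
train, held_out = train_test_split(toy_matrix, test_size=0.25, random_state=0)

# Fit one model per candidate topic count and score it on held-out reviews
perplexities = {}
for k in (2, 3, 4):  # hypothetical candidate values
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(train)
    perplexities[k] = model.perplexity(held_out)

best_k = min(perplexities, key=perplexities.get)
print(perplexities)
print("lowest held-out perplexity at k =", best_k)
```

On the full dataset the same loop would run over `word_matrix`, at the cost of one full LDA fit per candidate.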

In [9]:
# Setting number of topics and also the top number of words we want to see from the model
number_topics = 25
number_words = 15

In [10]:
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=4, verbose=1)
lda.fit(word_matrix)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=25, n_jobs=4,
                          perp_tol=0.1, random_state=None,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=1)

In [11]:
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, vectorizer, number_words)

Topics found via LDA:

Topic #0:
brush,perfectly,stand,hold,great,sturdy,nice,look,fit,well,look great,razor,bristle,handle,set

Topic #1:
cute,nail,polish,happy,super,every,working,coat,love,compliment,color,day,nail polish,wear,use

Topic #2:
blade,razor,shave,year,sharp,shaving,old,close,cut,get,year old,one,electric,gillette,use

Topic #3:
teeth,water,use,ok,floss,waterpik,gum,easy,easy use,clean,dentist,mouth,flossing,one,dental

Topic #4:
nice,smell,scent,soap,deodorant,awesome,like,fragrance,bar,really,love,natural,strong,smell like,product

Topic #5:
perfect,beautiful,color,worth,love,love color,blush,needed,light,dark,coverage,worth money,well,shade,tone

Topic #6:
color,lip,lipstick,stay,pink,love,apply,like,red,nice,matte,look,liner,gloss,last

Topic #7:
work,work great,great,quality,good quality,look,good,worked,picture,worked great,bad,product work,like picture,look like,like

Topic #8:
money,didnt,waste,star,waste money,okay,didnt work,pretty,work,dont,horrible,5,product,

The output above lists the top words for each topic. The results look reasonable: some topics relate to specific types of products, such as topic 2 (shaving), topic 3 (teeth) and topic 17 (skin), while others relate to logistics, such as topics 9 and 13.
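To compare topics with KMeans clusters, each review needs a topic label. A minimal sketch, assuming the dominant topic is taken as the one with the highest weight in the document-topic matrix: the toy reviews and the stand-in cluster labels below are invented, and in this notebook the equivalent call would be `lda.transform(word_matrix)`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews, invented for illustration only
toy_reviews = [
    "razor blade gives a close shave",
    "sharp blade and smooth shave",
    "lipstick color is a lovely pink",
    "matte lipstick with great color",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_reviews)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(toy_matrix)  # shape: (n_reviews, n_topics)

# Dominant topic = column with the largest weight in each row
dominant_topic = doc_topic.argmax(axis=1)

# Hypothetical cluster labels standing in for real KMeans output
stand_in_kmeans_labels = np.array([0, 0, 1, 1])

# Cross-tabulate cluster labels against dominant topics
overlap = pd.crosstab(
    pd.Series(stand_in_kmeans_labels, name="kmeans_cluster"),
    pd.Series(dominant_topic, name="lda_topic"),
)
print(overlap)
```

A table like `overlap` shows at a glance whether a given cluster maps mostly onto one topic or is spread across several.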

The pyLDAvis package (a Python port of LDAvis) lets us visualize the topics interactively and take a deeper dive into them. For instance, we can adjust a relevance metric, which controls how the terms of a topic are ranked: at one extreme purely by their topic-specific probability, and at the other by how distinctive they are for that topic relative to the whole corpus.
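The relevance metric can be sketched as a weighted sum of a term's log probability within a topic and its log lift (that probability divided by the term's overall probability), with a weight λ between 0 and 1. The function below is my own illustration of that formula, not pyLDAvis code, and the probabilities are invented.

```python
import numpy as np

def term_relevance(topic_term_probs, term_probs, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w)) for every topic and term."""
    log_p_w_given_t = np.log(topic_term_probs)
    log_lift = log_p_w_given_t - np.log(term_probs)
    return lam * log_p_w_given_t + (1 - lam) * log_lift

# Invented numbers: 2 topics over 3 terms
terms = np.array(["great", "razor", "lipstick"])
topic_term_probs = np.array([[0.6, 0.3, 0.1],
                             [0.5, 0.1, 0.4]])
term_probs = np.array([0.55, 0.20, 0.25])  # assumed overall term probabilities

# lam = 1 ranks purely by topic-specific probability
rank_by_prob = terms[np.argsort(-term_relevance(topic_term_probs, term_probs, 1.0)[0])]
# lam = 0 ranks purely by lift, favouring terms distinctive to the topic
rank_by_lift = terms[np.argsort(-term_relevance(topic_term_probs, term_probs, 0.0)[0])]

print(list(rank_by_prob))  # ['great', 'razor', 'lipstick']
print(list(rank_by_lift))  # ['razor', 'great', 'lipstick']
```

In this toy case a generic word like "great" tops topic 0 when λ = 1, but "razor" overtakes it when λ = 0 because it is rarer outside the topic.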

In [11]:
import joblib

In [14]:
#saving model to computer
joblib.dump(lda,'lda_25.pkl')

['lda_25.pkl']

In [12]:
#use this line if you need to load the model back into the notebook
lda = joblib.load('lda_25.pkl')

### Visualize
Let's use LDAvis to visualize the topics created from LDA topic modelling.

In [14]:
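#import relevant libraries
import os
import pickle
import pyLDAvis
try:
    from pyLDAvis import sklearn as sklearn_lda
except ImportError:
    # newer pyLDAvis releases renamed this module to lda_model
    from pyLDAvis import lda_model as sklearn_lda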

In [17]:
#creating file path
LDAvis_data_filepath = os.path.join('./ldavis_prepared_'+str(number_topics))

In [18]:
LDAvis_data_filepath

'./ldavis_prepared_25'

In [19]:
# preparing the LDA model to be saved in the LDAvis visualizer
LDAvis_prepared = sklearn_lda.prepare(lda, word_matrix, vectorizer)

In [20]:
#pickling the prepared LDAvis data so it can be reloaded without re-running prepare
with open(LDAvis_data_filepath, 'wb') as f:
    pickle.dump(LDAvis_prepared, f)

In [21]:
#saving the LDAvis visualization to an html file
pyLDAvis.save_html(LDAvis_prepared, './ldavis_prepared_'+ str(number_topics) +'.html')

We can now open the HTML file containing the LDA visualization and take a deeper look into the topics and their most relevant terms. An analysis of the terms will be provided in the final report.

Please proceed to the next notebook, in which cluster analysis of the KMeans model is performed.