# Capstone Project: Amazon Review Classification (Part 2)
Author: **Steven Lee**

# Topic Modelling with Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA), as implemented in the gensim library, is used to extract topics from the Amazon reviews dataset (Tools and Home Improvement). The review text has already been deduplicated, lemmatized, and stripped of stop words in a prior preprocessing notebook. LDA treats every topic as a collection of keywords in various proportions, and every document as a collection of topics in various proportions. Given the number of topics as a parameter, the algorithm estimates the keyword-topic and topic-document distributions that best fit the corpus.
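To make the "documents are mixtures of topics" idea concrete, the sketch below computes the probability of a word in a document as the topic-weighted sum of per-topic word probabilities. The topic names, words, and numbers are invented for illustration and are not taken from the fitted model.

```python
# Toy illustration of the LDA mixture assumption; all numbers are made up.
# Each topic is a probability distribution over words.
topic_word = {
    "doors": {"door": 0.6, "garage": 0.3, "blade": 0.1},
    "cutting": {"door": 0.1, "garage": 0.1, "blade": 0.8},
}

# Each document is a probability distribution over topics.
doc_topic = {"doors": 0.75, "cutting": 0.25}


def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


for word in ("door", "garage", "blade"):
    print(word, round(word_probability(word, doc_topic, topic_word), 3))
```

Fitting LDA runs this reasoning in reverse: it observes only the words in each document and infers both sets of proportions.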

A good segregation of topics depends on the following key factors:

- The quality of text processing.
- The variety of topics covered in the text.
- Choice of modelling algorithm.
- The number of topics parameter provided to the algorithm.
- The tuning parameters of the algorithm.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Libraries" data-toc-modified-id="Import-Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Libraries</a></span></li><li><span><a href="#Prepare-Data" data-toc-modified-id="Prepare-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Prepare Data</a></span></li><li><span><a href="#Tokenize-Documents" data-toc-modified-id="Tokenize-Documents-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Tokenize Documents</a></span></li><li><span><a href="#Create-Bigrams" data-toc-modified-id="Create-Bigrams-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Create Bigrams</a></span></li><li><span><a href="#Create-LDA-Dictionary-and-Corpus" data-toc-modified-id="Create-LDA-Dictionary-and-Corpus-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Create LDA Dictionary and Corpus</a></span></li><li><span><a href="#Build-Topic-Model" data-toc-modified-id="Build-Topic-Model-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Build Topic Model</a></span></li><li><span><a href="#View-Model-Topics" data-toc-modified-id="View-Model-Topics-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>View Model Topics</a></span></li><li><span><a href="#Visualize-Topics-and-Keywords" data-toc-modified-id="Visualize-Topics-and-Keywords-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Visualize Topics and Keywords</a></span></li><li><span><a href="#Compute-Model-Perplexity-and-Coherence-Score" data-toc-modified-id="Compute-Model-Perplexity-and-Coherence-Score-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Compute Model Perplexity and Coherence Score</a></span></li><li><span><a href="#Create-Visualizations" data-toc-modified-id="Create-Visualizations-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Create Visualizations</a></span></li><li><span><a href="#Save-Data-to-File" data-toc-modified-id="Save-Data-to-File-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>Save Data to File</a></span></li></ul></div>

## Import Libraries

In [1]:
import pandas as pd
from random import sample

# Set pandas display options.
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

from gensim.utils import simple_preprocess
from gensim.models import LdaMulticore, CoherenceModel
import gensim.corpora as corpora
import gensim

import os
import pyLDAvis
import pyLDAvis.gensim
import pickle 

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

## Prepare Data

In [2]:
# Read in clean reviews dataset.
reviews = pd.read_csv("../data/reviews_clean.csv")

In [3]:
# Check product categories.
reviews['main_cat'].value_counts()

Tools & Home Improvement     1199885
Sports & Outdoors              73750
Amazon Home                    73204
Industrial & Scientific        62453
Automotive                     32150
Office Products                 5718
Home Audio & Theater            3871
Arts, Crafts & Sewing           3747
Musical Instruments             3669
All Electronics                 2473
Baby                            2356
Toys & Games                    2318
Amazon Fashion                  2233
Health & Personal Care          2121
Computers                       1847
Camera & Photo                  1777
Cell Phones & Accessories       1773
All Beauty                       994
Pet Supplies                     902
Car Electronics                  630
Grocery                          207
Appliances                        98
Video Games                       50
Handmade                          23
Amazon Devices                    21
Books                             16
GPS & Navigation                   9
D

In [4]:
# Cleanup null brand values.
reviews.loc[reviews['brand'].isnull(), 'brand'] = "None"

In [None]:
# Drop unwanted columns.
unwanted = ['asin', 'summary', 'vote', 'description', 'title', 'feature', 'rank', 'price']
reviews.drop(unwanted, axis=1, inplace=True)
reviews.rename(columns = {'length':'word_cnt'}, inplace=True)
reviews.head(3)

## Tokenize Documents

In [6]:
# Custom generator to tokenize review documents one at a time.
def tokenize_docs(documents):
    for doc in documents:
        yield simple_preprocess(str(doc))

In [7]:
documents = reviews['document'].values.tolist()
documents[3][:25]

'hooked tested filled five'

In [8]:
# Original (uncleaned) review text for the same row.
reviews.loc[3, 'reviewText']

'So far I hooked it up and tested it , filled a five gallon bucket with hot water, it is the perfect temp for a shower,the flow valve that came with it broke when i tried to tighten it to get it to stop leaking,just hooked it directly to sharkbite pex fitting,seems to work fine without the valve. it is hooked up to a 2 poll 20 amp breaker, will have to wait till the rest of my bathroom is finished to give a better review.'

In [9]:
word_tokens = list(tokenize_docs(documents))
word_tokens[:][3][:4]

['hooked', 'tested', 'filled', 'five']
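As a rough picture of what `simple_preprocess` does with its default settings, the sketch below lowercases the text, keeps runs of non-digit word characters, and drops tokens shorter than 2 or longer than 15 characters. The function name `rough_preprocess` is hypothetical, and the regular expression is an approximation rather than gensim's own implementation.

```python
import re

# Runs of word characters that are not digits (approximation of gensim's tokenizer).
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop tokens that are too short or too long."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(rough_preprocess("It is hooked up to a 2 poll 20 amp breaker"))
```

Numbers and single-letter words disappear at this step, which is why the tokenized reviews contain only alphabetic tokens.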

## Create Bigrams

Gensim’s Phrases model is used to detect bigrams in the documents. `min_count` and `threshold` are its two main tuning parameters; a higher value for either results in fewer bigrams.
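The effect of the two parameters can be seen in a plain-Python sketch of the scoring idea. It uses a simplified version of the default formula described in gensim's documentation, `(bigram_count - min_count) / (count_a * count_b) * vocab_size`, and takes the number of distinct unigrams as the vocabulary size, which is an assumption made here for brevity. The helper names and toy documents are hypothetical.

```python
from collections import Counter


def score_bigrams(docs, min_count=1):
    """Score every adjacent word pair; higher means a stronger collocation."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplification: distinct unigrams only
    return {
        pair: (count - min_count) / (unigrams[pair[0]] * unigrams[pair[1]]) * vocab_size
        for pair, count in bigrams.items()
    }


def accepted(scores, threshold):
    """Pairs whose score exceeds the threshold would be merged into one token."""
    return {pair for pair, score in scores.items() if score > threshold}


docs = [
    ["garage", "door", "opener"],
    ["garage", "door", "works"],
    ["front", "door", "works"],
    ["garage", "door", "spring"],
]
scores = score_bigrams(docs, min_count=1)
print(accepted(scores, threshold=0.5))
print(accepted(scores, threshold=0.9))
```

Raising `threshold` shrinks the accepted set, and raising `min_count` lowers every score, so both push toward fewer bigrams.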

In [10]:
# Build bigram model and custom function to create bigrams.
bigrams = gensim.models.Phrases(word_tokens, min_count=5, threshold=100)    # Higher threshold fewer phrases.
bigram_model = gensim.models.phrases.Phraser(bigrams)

def create_bigrams(documents):
    return [bigram_model[doc] for doc in documents]

In [11]:
# Create Bigrams.
bigram_tokens = create_bigrams(word_tokens)
bigram_tokens[:][3][:4]

['hooked', 'tested', 'filled', 'five']

## Create LDA Dictionary and Corpus

A dictionary of every word in the documents will be created, and each word will be assigned a unique ID. The corpus, or bag of words, consists of (word ID, frequency) pairs for each document. For example, (5, 2) means that word ID 5 occurred twice in that document.
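The bag-of-words mapping can be illustrated without gensim. In the sketch below, `build_dictionary` and `to_bow` are hypothetical helpers, and IDs are assigned in order of first appearance, which is an assumption; gensim's `Dictionary` may number tokens differently.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the dictionary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [["door", "garage", "door"], ["blade", "sharp"]]
token2id = build_dictionary(docs)
print(to_bow(docs[0], token2id))  # [(0, 2), (1, 1)]
```

Word order is discarded in this representation; only which words occur, and how often, reaches the model.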

In [12]:
# Create dictionary.
id2word = corpora.Dictionary(bigram_tokens)

# Filter out extremes.
id2word.filter_extremes(no_below=15, no_above=0.5, keep_n=150000)

# Generate term document frequencies.
corpus = [id2word.doc2bow(doc) for doc in bigram_tokens]
corpus[:][3][:4]

[(14, 1), (15, 1), (16, 1), (17, 1)]
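The `filter_extremes` call above prunes the vocabulary by document frequency. The following plain-Python sketch mirrors the documented rules (keep tokens found in at least `no_below` documents and in no more than a `no_above` fraction of documents, then keep the `keep_n` most frequent). The function name `sketch_filter_extremes` and the toy documents are hypothetical, and the tie-breaking order is an assumption.

```python
from collections import Counter


def sketch_filter_extremes(tokenized_docs, no_below, no_above, keep_n=None):
    """Return the tokens that survive document-frequency filtering."""
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    max_docs = no_above * len(tokenized_docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent first; alphabetical order breaks ties (an assumption).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept if keep_n is None else kept[:keep_n]


docs = [
    ["door", "great"],
    ["door", "great", "blade"],
    ["great", "paint"],
    ["great", "blade"],
]
print(sketch_filter_extremes(docs, no_below=2, no_above=0.5))
```

Here "great" is dropped for appearing in every document and "paint" for appearing in only one, which is the same trade-off the notebook makes with `no_below=15` and `no_above=0.5`.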

## Build Topic Model

The dictionary, the corpus, and the number of topics are all required to build the LDA model.

In [13]:
# Optimal number of topics based on Topic Coherence score.  See section below.
num_topics = 26

# Build LDA model
lda_model = LdaMulticore(corpus=corpus, id2word=id2word, num_topics=num_topics, random_state=42) 
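The notebook does not show how the value 26 was selected. One common approach, sketched here as an assumption rather than the author's procedure, is to train a model for each candidate topic count and keep the one with the highest coherence. `pick_num_topics` and `c_v_coherence` are hypothetical helpers; the gensim calls inside `c_v_coherence` are the same ones used elsewhere in this notebook.

```python
def pick_num_topics(candidates, score_fn):
    """Return (best_k, scores), where scores maps each candidate k to score_fn(k)."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


def c_v_coherence(k, corpus, id2word, texts):
    """Train an LDA model with k topics and return its c_v coherence."""
    # Imported lazily so the selection helper above runs without gensim installed.
    from gensim.models import CoherenceModel, LdaMulticore

    model = LdaMulticore(corpus=corpus, id2word=id2word, num_topics=k, random_state=42)
    cm = CoherenceModel(model=model, texts=texts, dictionary=id2word, coherence="c_v")
    return cm.get_coherence()


# Stand-in scorer with made-up values, only to show the selection logic.
best_k, scores = pick_num_topics([2, 4, 6], lambda k: -(k - 4) ** 2)
print(best_k, scores)
```

In this notebook the scorer could be `lambda k: c_v_coherence(k, corpus, id2word, bigram_tokens)`, although training one model per candidate on the full corpus would be slow.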

## View Model Topics

The LDA model has been built with the number of topics supplied as a parameter. Each topic is a combination of keywords, each with its own weight (percentage contribution) in that topic. The topics and their top keywords are printed below.

In [14]:
# Print top 10 keywords and contribution of all topics.
# print(lda_model.print_topics())
for i in range(num_topics):
    print(f"\nTopic {i}:\n{lda_model.print_topic(i)}")


Topic 0:
0.025*"small" + 0.020*"tape" + 0.015*"hold" + 0.013*"space" + 0.013*"perfect" + 0.013*"sturdy" + 0.012*"size" + 0.012*"place" + 0.011*"great" + 0.011*"fit"

Topic 1:
0.075*"door" + 0.035*"garage" + 0.017*"house" + 0.014*"front" + 0.013*"open" + 0.013*"work" + 0.011*"cabinet" + 0.011*"installed" + 0.011*"foot" + 0.009*"close"

Topic 2:
0.023*"look" + 0.022*"glass" + 0.018*"would" + 0.017*"love" + 0.017*"like" + 0.016*"color" + 0.012*"great" + 0.011*"picture" + 0.010*"recommend" + 0.010*"little"

Topic 3:
0.059*"blade" + 0.019*"cutting" + 0.018*"edge" + 0.015*"cut" + 0.012*"good" + 0.011*"sharp" + 0.011*"make" + 0.009*"like" + 0.009*"wood" + 0.009*"smooth"

Topic 4:
0.182*"great" + 0.117*"work" + 0.023*"worked" + 0.022*"used" + 0.017*"ve" + 0.015*"bought" + 0.015*"love" + 0.014*"house" + 0.013*"time" + 0.011*"working"

Topic 5:
0.019*"used" + 0.018*"paint" + 0.015*"work" + 0.014*"stick" + 0.013*"floor" + 0.013*"clean" + 0.012*"wood" + 0.011*"great" + 0.010*"surface" + 0.009*"wa
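The printed topic strings can be turned back into (keyword, weight) pairs for further analysis. The sketch below parses the Topic 0 string shown above with a regular expression; `parse_topic` is a hypothetical helper, and gensim's `show_topic` method returns such pairs directly when the model is available.

```python
import re

# Topic 0 as printed in the notebook output.
topic_0 = (
    '0.025*"small" + 0.020*"tape" + 0.015*"hold" + 0.013*"space" + '
    '0.013*"perfect" + 0.013*"sturdy" + 0.012*"size" + 0.012*"place" + '
    '0.011*"great" + 0.011*"fit"'
)


def parse_topic(topic_string):
    """Convert '0.025*"small" + ...' into [('small', 0.025), ...]."""
    matches = re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in matches]


pairs = parse_topic(topic_0)
print(pairs[:3])
print(round(sum(weight for _, weight in pairs), 3))
```

The ten weights add up to well under 1, which shows that most of the topic's probability mass is spread over the rest of the vocabulary.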

In [15]:
topics = set(lda_model.get_document_topics(corpus[3]))

## Visualize Topics and Keywords

In [24]:
# Visualize the LDA topics.
pyLDAvis.enable_notebook()

# Path of the pickled visualization data.
LDAvis_data_filepath = os.path.join("..", "ldavis", f"ldavis_bigrams_{num_topics}")

# Generate visualization data.
# Warning: Time consuming activity.  Negate condition to skip.
if 1 == 1:
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
        
# load the pre-prepared pyLDAvis data from file.
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)
    
pyLDAvis.save_html(LDAvis_prepared, "../ldavis/ldavis_bigrams_" + str(num_topics) +'.html')
LDAvis_prepared

## Compute Model Perplexity and Coherence Score

In [17]:
# log_perplexity returns the per-word likelihood bound (higher is better);
# perplexity itself is 2 ** (-bound), so a lower perplexity is better.
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

# Compute Coherence Score.
coherence_model_lda = CoherenceModel(model=lda_model, texts=bigram_tokens, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.935274902180763

Coherence Score:  0.5183858632024184
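The value printed as "Perplexity" is negative because gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself. According to gensim's documentation, the two are related by perplexity = 2^(-bound). A minimal conversion using the bound printed above:

```python
# Per-word likelihood bound printed by lda_model.log_perplexity(corpus) above.
bound = -7.935274902180763

# Convert the bound to perplexity; lower perplexity indicates a better fit.
perplexity = 2 ** (-bound)
print(round(perplexity, 1))
```

This gives a perplexity on the order of 245. The figure is mainly useful for comparing models trained on the same corpus and dictionary.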


## Create Visualizations

In [18]:
# Custom function to extract dominant topic and percentage contribution.
def get_dmnt_topic():
    dmnt_topic = []
    pct_contrib = []
    
    # lda_model[corpus] yields a list of (topic_id, probability) pairs per document.
    for topics in lda_model[corpus]:
        dmnt_top = max(topics, key=lambda x: x[1])
        dmnt_topic.append(dmnt_top[0])
        pct_contrib.append(round(dmnt_top[1], 4))
    
    return dmnt_topic, pct_contrib

In [19]:
%%time

# Create new columns for dominant topic and percentage contribution.
reviews['dmnt_topic'], reviews['pct_contrib'] = get_dmnt_topic()

Wall time: 6min 4s
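The selection step inside `get_dmnt_topic` reduces to taking the pair with the largest probability. The sketch below shows this on a single made-up topic distribution; the helper name `dominant_topic` and the numbers are illustrative only.

```python
def dominant_topic(topic_probs):
    """Return (topic_id, probability) for the most probable topic of one document."""
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    return topic_id, round(prob, 4)


# Made-up (topic_id, probability) pairs for one document.
doc_topics = [(0, 0.05), (5, 0.8396), (9, 0.11)]
print(dominant_topic(doc_topics))  # (5, 0.8396)
```

A dominant topic with a low probability means the document is spread across several topics, so `pct_contrib` is worth checking before treating `dmnt_topic` as a label.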


In [20]:
# Convert categories to numeric.
reviews['category'] = reviews['main_cat'].map({
    "Tools & Home Improvement": 27, "Sports & Outdoors": 26, "Amazon Home": 25, "Industrial & Scientific": 24, "Automotive": 23, "Office Products": 22, 
    "Home Audio & Theater": 21, "Arts, Crafts & Sewing": 20, "Musical Instruments": 19, "All Electronics": 18, "Baby": 17, "Toys & Games": 16, 
    "Amazon Fashion": 15, "Health & Personal Care": 14, "Computers": 13, "Camera & Photo": 12, "Cell Phones & Accessories": 11, "All Beauty": 10, 
    "Pet Supplies": 9, "Car Electronics": 8, "Grocery": 7, "Appliances": 6, "Video Games": 5, "Handmade": 4, "Amazon Devices": 3, "Books": 2, 
    "GPS & Navigation": 1, "Digital Music": 0
})

# Reorder columns.
cols = ['dmnt_topic', 'pct_contrib', 'reviewText', 'main_cat', 'category', 'brand', 'document', 'word_cnt', 'overall']
reviews = reviews[cols]
reviews.head(3)

| | dmnt_topic | pct_contrib | reviewText | main_cat | category | brand | document | word_cnt | overall |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 0.2918 | returned, decided against this product | Tools & Home Improvement | 27 | SioGreen | returned decided product | 5 | 5.0 |
| 1 | 9 | 0.4205 | Awesome heater for the electrical requirements! Makes an awesome preheater for my talnkless system | Tools & Home Improvement | 27 | SioGreen | awesome heater electrical requirement make awesome preheater talnkless system | 14 | 5.0 |
| 2 | 5 | 0.8396 | Keeps the mist of your wood trim and on you. Bendable too. | Tools & Home Improvement | 27 | SioGreen | keep mist wood trim bendable | 12 | 5.0 |


In [21]:
# Collect the 5 reviews with the highest percentage contribution for each topic.
top5_topic_text = pd.DataFrame()

dmnt_topic_grps = reviews.groupby('dmnt_topic')

for i, grp in dmnt_topic_grps:
    top5_topic_text = pd.concat([top5_topic_text, grp.sort_values('pct_contrib', ascending=False).head(5)], axis=0)

# Reset df index.
top5_topic_text.reset_index(drop=True, inplace=True)

In [22]:
top5_topic_text.loc[:, ['dmnt_topic', 'pct_contrib', 'reviewText']][0:5]

| | dmnt_topic | pct_contrib | reviewText |
|---|---|---|---|
| 0 | 0 | 0.9771 | Fantastic. Took 2 seconds to open it up. Wheels were already attached. I do crafts and needed a small cart to move my supplies from one place to another. This one is just what I needed. Small enough to maneuver around and not take up a lot of space, but large enough to put all I need on it instead of me having to make lots of trips. The 3 shelves are great as I can keep things on them until I need them and not have everything all over the table. Nice size wheels so it maneuvers well. |
| 1 | 0 | 0.9599 | I wanted a smaller tape for woodworking projects and general measurement. I use Fat Max tapes when I need the stand-off capability for DIY carpentry projects. This thing is great! It is so much easier to have the tape stay in place when you extend it until you are ready to retract. |
| 2 | 0 | 0.9563 | ...seems like a solidly built item otherwise, but no metric scale available. Pull back is strong and the stop is holding the tape from retracting very well. If you don't need metric than this is a good item to purchase. |
| 3 | 0 | 0.9563 | ...seems like a solidly built item otherwise, but no metric scale available. Pull back is strong and the stop is holding the tape from retracting very well. If you don't need metric than this is a good item to purchase. |
| 4 | 0 | 0.9563 | We needed to attach the headboard of a king size bed to the wall rather than use the bed frame. This product did the job although we needed to use two due to the headboard size. We had to measure carefully and check that everything was level. The headboard is very securely on the wall. |


## Save Data to File

In [23]:
# Save LDA data to file.
reviews.to_csv("../data/reviews_lda.csv", index=False)