## Topic Modeling Amazon Reviews

The full dataset contains 142.8 million reviews spanning May 1996 – July 2014. Here we work with the Automotive 5-core subset, which contains 20,473 reviews.

In [25]:
# Import libraries
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from stop_words import get_stop_words
from nltk.stem.snowball import SnowballStemmer
from gensim import corpora, models
import gensim
import pandas as pd
import gzip
import operator

In [2]:
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

In [3]:
# one-review-per-line in json
def parse(path):
    # Note: eval() executes whatever it reads, so only use it on trusted data files
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield eval(l)
        
# Get data
def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

In [4]:
# Loading our data
df = getDF('../data/reviews_Automotive_5.json.gz')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20473 entries, 0 to 20472
Data columns (total 9 columns):
reviewerID        20473 non-null object
asin              20473 non-null object
reviewerName      20260 non-null object
helpful           20473 non-null object
reviewText        20473 non-null object
overall           20473 non-null float64
summary           20473 non-null object
unixReviewTime    20473 non-null int64
reviewTime        20473 non-null object
dtypes: float64(1), int64(1), object(7)
memory usage: 1.6+ MB


In [6]:
@interact
def show_articles_more_than(column=['overall','reviewerID'], x=(0,10,1)):
    return df.loc[df[column] >= x]

interactive(children=(Dropdown(description='column', options=('overall', 'reviewerID'), value='overall'), IntS…

In [70]:
import re
import datetime

In [150]:
def convert_to_timestamp(x): 
    month, day, year = re.split(' |, ',str(x))
    
    return pd.Timestamp(year=int(year), month=int(month), day=int(day))

In [151]:
def print_products_reviewed(start_date, end_date):
    # Parse the 'reviewTime' strings into timestamps
    date = df['reviewTime'].apply(convert_to_timestamp)

    start_date = pd.Timestamp(start_date)
    end_date = pd.Timestamp(end_date)

    # Keep only the reviews written inside the date range
    stat_df = df.loc[(date >= start_date) & (date <= end_date)].copy()

    num_reviews = len(stat_df)
    print(f'{num_reviews} reviews were written between {start_date.date()} and {end_date.date()}.')

In [152]:
# Create interactive version of function with DatePickers
interact(print_products_reviewed,
         start_date=widgets.DatePicker(value=pd.to_datetime('2011-01-01')),
         end_date=widgets.DatePicker(value=pd.to_datetime('2014-01-01')));

interactive(children=(DatePicker(value=Timestamp('2011-01-01 00:00:00'), description='start_date'), DatePicker…

In [7]:
df['reviewText'][0]

"I needed a set of jumper cables for my new car and these had good reviews and were at a good price.  They have been used a few times already and do what they are supposed to - no complaints there.What I will say is that 12 feet really isn't an ideal length.  Sure, if you pull up front bumper to front bumper they are plenty long, but a lot of times you will be beside another car or can't get really close.  Because of this, I would recommend something a little longer than 12'.Great brand - get 16' version though."

Now that we have a nice corpus of text, let's go through some of the standard preprocessing required for almost any topic modeling or NLP problem.

The approach involves:

- Tokenizing: converting a document to its atomic elements
- Stopping: removing meaningless words
- Stemming: merging words that are equivalent in meaning

#### Tokenization
There are many ways to segment a document into its atomic elements. To start, we'll tokenize the document into words using NLTK’s tokenize.regexp module.

In [8]:
tokenizer = RegexpTokenizer(r'\w+')

Running through part of the first review to demonstrate:

In [9]:
doc_1 = df.reviewText[0]

In [10]:
# Using one of our docs as an example
tokens = tokenizer.tokenize(doc_1.lower())

print('{} characters in string vs {} words in a list'.format(len(doc_1), len(tokens)))
print(tokens[:10])

516 characters in string vs 103 words in a list
['i', 'needed', 'a', 'set', 'of', 'jumper', 'cables', 'for', 'my', 'new']
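
As a rough illustration of what the `\w+` pattern does, the same split can be sketched with the standard library's `re.findall`. This is an approximation for intuition only, not NLTK's implementation, and the sample sentence is taken from the review above. Note how the apostrophe in "isn't" splits the word into two tokens.

```python
import re

# A sentence from the first review, used purely as sample input
text = "12 feet really isn't an ideal length."

# \w+ matches runs of letters, digits and underscores; everything else is a separator
tokens = re.findall(r'\w+', text.lower())
print(tokens)
```

Because punctuation acts as a separator, contractions end up as fragments such as `isn` and `t`, which is one reason the stop word lists below include entries like `isn` and `don`.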


#### Stop Words

Determiners like "the" and conjunctions such as "or" and "for" do not add value to our simple topic model. We refer to these types of words as stop words and want to remove them from our list of tokens. The definition of a stop word changes depending on the context of the documents we are examining. For example, in product reviews for children's board games on Amazon.com, removing the word "and" would split "Chutes and Ladders" into the separate tokens "chutes" and "ladders", so the game's name could never surface as a single token, or later as an entity in some other model.
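
One possible way around the "Chutes and Ladders" problem is to join known multi-word phrases into single tokens before removing stop words. The sketch below is a hypothetical helper (`protect_phrases` is not part of NLTK or this notebook), and the phrase list and tiny stop list are made up for illustration.

```python
def protect_phrases(tokens, phrases):
    """Join known multi-word phrases into single tokens (hypothetical helper)."""
    out, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                out.append('_'.join(phrase))  # e.g. 'chutes_and_ladders'
                i += n
                break
        else:
            # no phrase starts at this position, keep the token as it is
            out.append(tokens[i])
            i += 1
    return out

stop = {'and', 'the', 'is', 'a'}            # tiny illustrative stop list
phrases = [('chutes', 'and', 'ladders')]    # assumed list of phrases to protect
tokens = 'chutes and ladders is a fun game'.split()

kept = [t for t in protect_phrases(tokens, phrases) if t not in stop]
print(kept)
```

The joined token survives stop word removal because `chutes_and_ladders` is not itself in the stop list.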

Let's make a super list of stop words from the stop_words and nltk packages below.

In [14]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/chrisjcc/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [15]:
nltk_stpwd = stopwords.words('english')
stop_words_stpwd = get_stop_words('en')
merged_stopwords = list(set(nltk_stpwd + stop_words_stpwd))

print(len(set(merged_stopwords)))
print(merged_stopwords[:10])

211
['by', 'don', "wouldn't", 'ought', 'yourselves', 'isn', 'her', "hasn't", "i'd", 'each']


In [16]:
stopped_tokens = [token for token in tokens if token not in merged_stopwords]
print(stopped_tokens[:10])

['needed', 'set', 'jumper', 'cables', 'new', 'car', 'good', 'reviews', 'good', 'price']


#### Stemming

Stemming allows us to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance, running and runs both reduce to run. Another example:

Amazon's catalog contains bike tires in different sizes and colors $\Rightarrow$ Amazon catalog contain bike tire in differ size and color

Stemming is a basic and crude heuristic compared to lemmatization, which relies on a vocabulary and morphological analysis instead of lopping off the ends of words. Essentially, lemmatization removes inflectional endings to return a word to its base or dictionary form, which is called the lemma. Some illustrative examples from Wikipedia:

- The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
- The word "walk" is the base form for word "walking", and hence this is matched in both stemming and lemmatisation.
- The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.
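
The difference between the two approaches can be sketched with a toy example. The suffix list and lookup table below are deliberately tiny assumptions made for illustration. They are not the Snowball algorithm or a real lemmatizer, but they show why a suffix rule handles "walking" while only a dictionary look-up can map "better" to "good".

```python
SUFFIXES = ('ing', 's')                          # toy suffix rules, far cruder than Snowball
LEMMAS = {'better': 'good', 'walking': 'walk'}   # toy dictionary standing in for a real lexicon

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary, falling back to the word itself."""
    return LEMMAS.get(word, word)

for w in ('walking', 'better'):
    print(w, '-> stem:', toy_stem(w), '| lemma:', toy_lemmatize(w))
```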

We'll start with the common Snowball stemming method, a successor of sorts to the original Porter stemmer, which is implemented in NLTK.

In [17]:
# Instantiate a Snowball stemmer
sb_stemmer = SnowballStemmer('english')

Note that sb_stemmer.stem requires each token to be of type str and returns that string in stemmed form, so we need to loop through our stopped_tokens:

In [18]:
stemmed_tokens = [sb_stemmer.stem(token) for token in stopped_tokens]
print(stemmed_tokens)

['need', 'set', 'jumper', 'cabl', 'new', 'car', 'good', 'review', 'good', 'price', 'use', 'time', 'alreadi', 'suppos', 'complaint', 'say', '12', 'feet', 'realli', 'ideal', 'length', 'sure', 'pull', 'front', 'bumper', 'front', 'bumper', 'plenti', 'long', 'lot', 'time', 'besid', 'anoth', 'car', 'get', 'realli', 'close', 'recommend', 'someth', 'littl', 'longer', '12', 'great', 'brand', 'get', '16', 'version', 'though']


## Putting together a document-term matrix

In order to create an LDA model we'll need to put the 3 steps from above (tokenizing, stopping, stemming) together to create a list of documents (a list of lists), from which we then generate a document-term matrix (documents or reviews as rows, unique terms as columns). This matrix tells us how frequently each term occurs in each individual document.
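
To make the idea of a document-term matrix concrete, here is a minimal dense sketch in plain Python. The two toy documents are made up, and gensim stores the same information in a sparse form rather than as a full matrix like this one.

```python
from collections import Counter

# Two made-up, already preprocessed documents (a list of lists)
docs = [['car', 'batteri', 'charg'],
        ['car', 'wash', 'wax', 'wash']]

# One column per unique term, in sorted order
vocab = sorted({term for doc in docs for term in doc})

# One row per document, holding the count of each vocabulary term
matrix = [[Counter(doc)[term] for term in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```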

In [19]:
%%time

num_reviews = df.shape[0]

doc_set = [df.reviewText[i] for i in range(num_reviews)]

texts = []

for doc in doc_set:
    # putting our three steps together
    tokens = tokenizer.tokenize(doc.lower())
    stopped_tokens = [token for token in tokens if token not in merged_stopwords]
    stemmed_tokens = [sb_stemmer.stem(token) for token in stopped_tokens]
    
    # add tokens to list
    texts.append(stemmed_tokens)

CPU times: user 16.2 s, sys: 110 ms, total: 16.3 s
Wall time: 16.4 s


In [21]:
print(texts[0]) # examine review 1

['need', 'set', 'jumper', 'cabl', 'new', 'car', 'good', 'review', 'good', 'price', 'use', 'time', 'alreadi', 'suppos', 'complaint', 'say', '12', 'feet', 'realli', 'ideal', 'length', 'sure', 'pull', 'front', 'bumper', 'front', 'bumper', 'plenti', 'long', 'lot', 'time', 'besid', 'anoth', 'car', 'get', 'realli', 'close', 'recommend', 'someth', 'littl', 'longer', '12', 'great', 'brand', 'get', '16', 'version', 'though']


## Transform tokenized documents into an id-term dictionary

Gensim's Dictionary class encapsulates the mapping between normalized words and their integer ids.

Note: each term is assigned an integer id, and in the subsequent bag-of-words step each id gets a count associated with it.

In [22]:
# Gensim's Dictionary encapsulates the mapping between normalized words and their integer ids.
texts_dict = corpora.Dictionary(texts)
texts_dict.save('auto_review.dict') # let's save to disk for later use
# Examine each token’s unique id
print(texts_dict)

Dictionary(19216 unique tokens: ['12', '16', 'alreadi', 'anoth', 'besid']...)


To see the mapping between words and their ids we can use the token2id attribute:

In [24]:
print("IDs 0 through 9: {}".format(sorted(texts_dict.token2id.items(),
                                          key=operator.itemgetter(1),
                                          reverse = False)[:10]))

IDs 0 through 9: [('12', 0), ('16', 1), ('alreadi', 2), ('anoth', 3), ('besid', 4), ('brand', 5), ('bumper', 6), ('cabl', 7), ('car', 8), ('close', 9)]


Let's try to guess the original word and examine the count difference between our #1 most frequent term and our #10 most frequent term:

print(df.reviewText.str.contains("complaint").value_counts())
print()
print(df.reviewText.str.contains("lot").value_counts())

We have a lot of unique tokens. Let's see what happens if we ignore tokens that appear in fewer than 30 documents or in more than 15% of documents. Granted, these thresholds are arbitrary, but a quick search shows tons of methods for reducing noise.
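
The filtering rule described above can be sketched in plain Python as a document-frequency filter. This is an approximation for intuition and not gensim's implementation (the helper name `filter_by_doc_freq` and the toy corpus are assumptions): a token is kept only if it appears in at least `no_below` documents and in at most a fraction `no_above` of all documents.

```python
from collections import Counter

def filter_by_doc_freq(texts, no_below, no_above):
    """Return the tokens whose document frequency lies inside the given bounds."""
    n_docs = len(texts)
    # Count each token once per document, not once per occurrence
    doc_freq = Counter(token for doc in texts for token in set(doc))
    return {token for token, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}

# Made-up corpus: 'car' is in every document, 'tire' and 'hose' in only one
toy_texts = [['car', 'wax'], ['car', 'tire'], ['car', 'wax'], ['car', 'hose']]
kept_tokens = filter_by_doc_freq(toy_texts, no_below=2, no_above=0.75)
print(kept_tokens)
```

In this toy corpus only `wax` survives: `car` is too common, while `tire` and `hose` are too rare.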

In [26]:
texts_dict.filter_extremes(no_below=30, no_above=0.15) # in-place filter
print(texts_dict)
print("top terms:")
print(sorted(texts_dict.token2id.items(), key=operator.itemgetter(1), reverse = False)[:10])

Dictionary(2464 unique tokens: ['12', '16', 'alreadi', 'anoth', 'besid']...)
top terms:
[('12', 0), ('16', 1), ('alreadi', 2), ('anoth', 3), ('besid', 4), ('brand', 5), ('bumper', 6), ('cabl', 7), ('close', 8), ('complaint', 9)]


We went from 19216 unique tokens to 2464 after filtering. Looking at the first 10 tokens, the very common "car" has been dropped, so the vocabulary now leans toward more specific subjects.

#### Creating bag of words
Next let's use texts_dict to turn each document into a bag of words. doc2bow converts a document (a list of words) into the bag-of-words format (a list of (token_id, token_count) tuples).
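
As a rough mental model of the bag-of-words format, the conversion can be sketched in plain Python. The helper `to_bow` and the small `token2id` mapping are hypothetical and only mimic the shape of the result: tokens missing from the mapping are ignored, and the pairs come back sorted by token id.

```python
from collections import Counter

# Made-up token-to-id mapping standing in for a fitted dictionary
token2id = {'batteri': 0, 'car': 1, 'charg': 2}

def to_bow(tokens, token2id):
    """Convert a list of tokens into sorted (token_id, token_count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = to_bow(['car', 'batteri', 'car', 'wax'], token2id)
print(bow)
```

Here `wax` is dropped because it is not in the mapping, which mirrors what happens to tokens removed by the filtering step.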

In [27]:
corpus = [texts_dict.doc2bow(text) for text in texts]
len(corpus)

20473

The corpus is 20473 long, the number of reviews in our dataset and in our dataframe. Let's dump this bag-of-words into a file to avoid parsing the entire text again:

In [28]:
%%time 
# Matrix Market format https://radimrehurek.com/gensim/corpora/mmcorpus.html, a sparse-matrix text format that gensim can stream back from disk
gensim.corpora.MmCorpus.serialize('amzn_auto_review.mm', corpus)

CPU times: user 770 ms, sys: 35.1 ms, total: 805 ms
Wall time: 810 ms


### Training an LDA model

As a topic modeling newbie this part is unsatisfying to me. In this unsupervised learning application I can see how a lot of people would arbitrarily set a number of topics, similar to centroids in k-means clustering, and then have a human evaluate if the topics "make sense". You can go very deep very quickly by researching this online. For now let's plead ignorance and go through with a simple model FULL of assumptions!

Training an LDA model using our BOW corpus as training data.


The number of topics is arbitrary; I'll use the browse taxonomy visible at https://www.amazon.com/automotive to inform the number we choose:

1. Performance Parts & Accessories
2. Replacement Parts
3. Truck Accessories
4. Interior Accessories
5. Exterior Accessories
6. Tires & Wheels
7. Car Care
8. Tools & Equipment
9. Motorcycle & Powersports Accessories
10. Car Electronics
11. Enthusiast Merchandise

I think these categories could be compressed into 5 general topics. We might consider rolling number 9 into 4 & 5, and rolling the products in number 3 across other accessory categories and so on.

In [29]:
%%time 
lda_model = gensim.models.LdaModel(corpus,alpha='auto', 
                                   num_topics=5,
                                   id2word=texts_dict, 
                                   passes=20)
# ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = texts_dict, passes=20)

CPU times: user 2min 29s, sys: 252 ms, total: 2min 29s
Wall time: 2min 29s


Note: Gensim offers a fantastic multicore implementation of LdaModel (LdaMulticore) that reduced my training time by 75%, but it does not have the auto alpha parameter available.

#### Inferring Topics
Below are the top 5 words for each of our 5 topics. The float next to each word is its weight within that topic, showing how strongly the word is associated with it. In this case, we see that for topic 4, battery and power are the most telling words. We might interpret that topic 4 is close to Amazon's Tools & Equipment category, which has a sub-category titled "Jump Starters, Battery Chargers & Portable Power". Similarly we might infer that topic 1 refers to Car Care, maybe the sub-category "Exterior Care".

In [30]:
# For `num_topics` number of topics, return `num_words` most significant words
lda_model.show_topics(num_topics=5,num_words=5)

[(0,
  '0.017*"light" + 0.013*"blade" + 0.012*"instal" + 0.010*"wiper" + 0.009*"tire"'),
 (1,
  '0.017*"towel" + 0.016*"clean" + 0.013*"dri" + 0.013*"wash" + 0.010*"wax"'),
 (2, '0.016*"hose" + 0.009*"fit" + 0.009*"water" + 0.008*"rv" + 0.008*"tape"'),
 (3,
  '0.014*"oil" + 0.012*"drive" + 0.011*"filter" + 0.009*"engin" + 0.008*"app"'),
 (4,
  '0.032*"batteri" + 0.019*"power" + 0.018*"charg" + 0.015*"light" + 0.011*"plug"')]
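
The topic strings above are convenient to read but awkward to compute with. As a small sketch, one of them can be parsed into (word, weight) pairs with a regular expression. The string is copied from the output for topic 1, and the pattern assumes the `weight*"word"` layout shown there.

```python
import re

# Topic 1 exactly as printed above
topic = '0.017*"towel" + 0.016*"clean" + 0.013*"dri" + 0.013*"wash" + 0.010*"wax"'

# Capture the weight before '*' and the quoted word after it
pairs = [(word, float(weight))
         for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic)]
print(pairs)
```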

Note that LDA is a probabilistic mixture of mixtures (or admixture) model for grouped data. The observed data (words) within the groups (documents) are the result of probabilistically choosing words from a specific topic (multinomial over the vocabulary), where the topic is itself drawn from a document-specific multinomial that has a global Dirichlet prior. This means that words can belong to various topics in various degrees. For example, the word 'pressure' might refer to a category/topic of automotive wash products and a category of tire products (in the case where we think the topics are about classes of products).
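
The generative story in the paragraph above can be sketched with the standard library alone. Everything here is a toy assumption (two hand-written topics, a five-word vocabulary, a symmetric prior of 0.5), and the code only simulates how a document could be generated. It does not fit a model. Note that "pressure" has non-zero probability under both topics, matching the example in the text.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, topic_word, vocab, rng):
    """Generate one document: pick a topic per word, then a word from that topic."""
    theta = sample_dirichlet(alpha, rng)  # document-specific topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # topic for this word
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

vocab = ['wash', 'wax', 'pressure', 'tire', 'wheel']
topic_word = [
    [0.4, 0.4, 0.2, 0.0, 0.0],   # toy "car wash" topic
    [0.0, 0.0, 0.2, 0.4, 0.4],   # toy "tire" topic
]

rng = random.Random(0)
theta, words = generate_document(8, [0.5, 0.5], topic_word, vocab, rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's mixture.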

##### Querying the LDA Model
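We cannot pass an arbitrary raw string directly to our model. We first have to run it through the same preprocessing steps as the reviews and convert it to the bag-of-words format; only then can we evaluate which topics are most associated with it.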

In [31]:
raw_query = 'portable air compressor'

# reuse the preprocessing steps from above on the whole query string
q_tokens = tokenizer.tokenize(raw_query.lower())
q_stopped_tokens = [token for token in q_tokens if token not in merged_stopwords]
# different from above: this is a flat list of stems, not a list of lists!
query = [sb_stemmer.stem(token) for token in q_stopped_tokens]

In [32]:
print(query)

['portabl', 'air', 'compressor']


In [33]:
# build a dictionary to translate the words in the query to ids and frequencies
id2word = gensim.corpora.Dictionary()
_ = id2word.merge_with(texts_dict) # the return value is not needed

In [34]:
# translate this document into (word_id, frequency) pairs
query = id2word.doc2bow(query)
print(query)

[(595, 1), (2140, 1), (2369, 1)]


If we run this constructed query against our trained model we will get each topic and the likelihood that the query relates to that topic. Remember we arbitrarily specified 5 topics when we made the model. When we sort this list to find the most relevant topics, we see some intuitive results. Our query of 'portable air compressor' relates most to the topic we thought might align to Amazon's Tools & Equipment category, which has a sub-category titled "Jump Starters, Battery Chargers & Portable Power".

In [35]:
a = list(sorted(lda_model[query], key=lambda x: x[1])) # sort by the second entry in the tuple
print(a)

[(3, 0.026814938), (1, 0.031968977), (2, 0.03333397), (0, 0.04890519), (4, 0.8589769)]
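
If only the single best topic is needed, sorting is not required. The short sketch below works on the list printed above (copied in as literal values) and picks the pair with the highest probability using `max`.

```python
# (topic_id, probability) pairs copied from the output above
topic_probs = [(3, 0.026814938), (1, 0.031968977), (2, 0.03333397),
               (0, 0.04890519), (4, 0.8589769)]

# Pick the pair with the largest probability
best_topic, best_prob = max(topic_probs, key=lambda pair: pair[1])
print(best_topic, best_prob)

# The probabilities across all topics add up to roughly 1
print(sum(prob for _, prob in topic_probs))
```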


In [37]:
lda_model.print_topic(a[0][0]) #least related

'0.014*"oil" + 0.012*"drive" + 0.011*"filter" + 0.009*"engin" + 0.008*"app" + 0.008*"chang" + 0.008*"price" + 0.007*"vehicl" + 0.007*"code" + 0.006*"amazon"'

In [38]:
lda_model.print_topic(a[-1][0]) #most related

'0.032*"batteri" + 0.019*"power" + 0.018*"charg" + 0.015*"light" + 0.011*"plug" + 0.011*"devic" + 0.010*"phone" + 0.010*"unit" + 0.010*"connect" + 0.009*"charger"'

#### What can we do with this in production?
We could take these inferred topics and analyze the sentiment of their corresponding documents (reviews) to find out what customers are saying (or feeling) about specific products. We can also use an LDA model to extract representative statements or quotes, enabling us to summarize customers’ opinions about products, perhaps even displaying them on the site. We could also use LDA to map groups of customers to topics, where the topics are groups of products that frequently occur together in customers' orders over time.