# Topic Modeling Amazon Reviews

## Imports

In [None]:
import json
import os
import pandas as pd
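try:
    from bertopic import BERTopic
except ImportError:
    # First run only: install BERTopic, then kill the process so Colab restarts
    # the runtime and the fresh install is picked up. Re-run this cell afterwards.
    !pip install "bertopic[all]"
    os.kill(os.getpid(), 9)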

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Open Texts
I am going to use the top Nike reviews that I saved earlier as the input for my topic model. The code below loads the text of each review into a list.

In [None]:
texts = []
with open("/content/drive/MyDrive/Masters/Unsupervised Text Classification/MSDS_Lab1_TopReviews.jsonl") as top_reviews_file:
    for review in top_reviews_file:
        data = json.loads(review)
        texts.append(data["reviewText"])
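The loop above assumes every line is valid JSON with a non-empty `reviewText` field. A slightly more defensive version, as a minimal sketch: the helper name `load_review_texts` and its `field` parameter are illustrative and not part of the original notebook.

```python
import json


def load_review_texts(path, field="reviewText"):
    """Read a JSON Lines file and return the non-empty text stored under `field`."""
    texts = []
    with open(path, encoding="utf-8") as reviews_file:
        for line in reviews_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            text = record.get(field)
            # Keep only records whose field is a non-blank string.
            if isinstance(text, str) and text.strip():
                texts.append(text)
    return texts
```

Calling `load_review_texts(...)` with the same path as above should give the same list whenever every record has review text, and it drops the records that do not instead of raising a `KeyError`.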

In [None]:
texts[0]

"I've had this watch for over a year now and wear it daily. I love it and the features it provides."

In [None]:
len(texts)

3549

## Train using BERTopic

In [None]:
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(texts)

2023-10-17 21:55:48,635 - BERTopic - Transformed documents to Embeddings
2023-10-17 21:56:07,926 - BERTopic - Reduced dimensionality
2023-10-17 21:56:09,013 - BERTopic - Clustered reduced embeddings


## Extract Topics

Ignore topic -1: BERTopic uses that label for outliers, the documents it could not assign to any cluster.

In [None]:
freq = topic_model.get_topic_info()
freq.head(5)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,1217,-1_the_and_to_they,"[the, and, to, they, shoes, for, my, is, them,...",[i have horribly aching feet for the last few ...
1,0,343,0_running_shoes_for_and,"[running, shoes, for, and, are, these, run, th...",[First off I'm a bigger guy at 6' and 190 ish ...
2,1,120,1_size_11_10_small,"[size, 11, 10, small, 105, but, wear, in, 12, ...",[I normally wear size 9 but I had to get these...
3,2,102,2_sandals_these_are_comfortable,"[sandals, these, are, comfortable, and, they, ...",[They are extremely comfortable and fit perfec...
4,3,90,3_nike_the_nikes_of,"[nike, the, nikes, of, is, to, and, it, that, ...",[I haven't had a pair of Nike in years I just ...


In [None]:
topic_model.get_topic(0)

[('running', 0.03154847487079992),
 ('shoes', 0.01850835454215261),
 ('for', 0.017582386830825723),
 ('and', 0.016296741696702487),
 ('are', 0.015837375269122985),
 ('these', 0.014935020716740448),
 ('run', 0.014649570540735034),
 ('the', 0.013091906476024259),
 ('shoe', 0.013030644458058887),
 ('to', 0.012786535545074118)]
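Several of the top terms for topic 0 are function words ("for", "and", "the"). As a rough illustration of what the topic looks like without them, the sketch below filters the term list after the fact. The stop-word set is a tiny hand-picked assumption, and the scores are the values from the output above rounded to four decimals.

```python
# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"for", "and", "are", "these", "the", "to"}

# (term, score) pairs copied from the topic 0 output above, scores rounded.
topic_0 = [
    ("running", 0.0315), ("shoes", 0.0185), ("for", 0.0176), ("and", 0.0163),
    ("are", 0.0158), ("these", 0.0149), ("run", 0.0146), ("the", 0.0131),
    ("shoe", 0.0130), ("to", 0.0128),
]


def drop_stop_words(topic_terms, stop_words=STOP_WORDS):
    """Return the (term, score) pairs whose term is not a stop word."""
    return [(term, score) for term, score in topic_terms if term.lower() not in stop_words]


print(drop_stop_words(topic_0))
```

Filtering after the fact only hides the noise. To keep stop words out of the representations in the first place, BERTopic accepts a `vectorizer_model` argument, for example a scikit-learn `CountVectorizer(stop_words="english")`. That route was not run in this notebook.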

In [None]:
# Note: this count includes the outlier label -1.
unique_topics = len(set(topics))
print(f"There are {unique_topics} unique topics.")

There are 55 unique topics.
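Because the outlier label -1 is one of those 55 values, the model found 54 actual topics. The topic table above also shows 1217 of the 3549 reviews (about 34%) in the outlier group. A minimal sketch for computing both figures directly from `topics`; the helper name `summarize_assignments` is illustrative and the example labels are toy data.

```python
from collections import Counter

OUTLIER_TOPIC = -1  # label BERTopic gives to documents it leaves unclustered


def summarize_assignments(topics):
    """Return (number of real topics, number of outliers, outlier share)."""
    counts = Counter(topics)
    total = len(topics)
    n_outliers = counts.get(OUTLIER_TOPIC, 0)
    # Do not count the outlier label as a topic.
    n_topics = len(counts) - (1 if OUTLIER_TOPIC in counts else 0)
    outlier_share = n_outliers / total if total else 0.0
    return n_topics, n_outliers, outlier_share


# Toy labels for illustration only, not the notebook's data.
example_topics = [-1, 0, 0, 1, -1, 2, 0, -1]
print(summarize_assignments(example_topics))  # (3, 3, 0.375)
```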


## Inspect Reviews

In [None]:
from collections import defaultdict

In [None]:
topic_groups = defaultdict(list)
for topic, text in zip(topics, texts):
    topic_groups[topic].append(text)

In [None]:
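# Keep the 10 largest topics by review count, skipping the outlier topic -1.
sorted_topics = sorted(
    ((topic, reviews) for topic, reviews in topic_groups.items() if topic != -1),
    key=lambda item: len(item[1]),
    reverse=True,
)[:10]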

for topic, reviews in sorted_topics:
    print(f"Topic {topic}:")
    for review in reviews[:12]:
        print(f"- {review}")
    print("\n")

Topic 0:
- Although I'm rarely a fan of Nike, I bought this pair to try it out and i absolutely liked it. Very suitable for causal wear, but I'm not sure how good of a running pair this one is. I got the dark grey color with florescent green stripes.Overall very cool pair
- love these shoes use them as gym shoes i would recomend thes to anyone that is looking for a durable shoe
- not a bad entry level running shoe. good for a narrower foot like mine but i still got the m (standard) width.
- bought for my boyfriend for running.  he said they are comfortable and run true to size.  (meaning buy half size bigger since they are Nike's).  Nice looking shoe and light weight.
- I haven't taken them to the track yet but shoes feel very light and supportive.  Should be good for the price.
- As always this shoes were what I was expecting, very confortable and very good quality too. Another thing that I ike is that they are very ligth for running.
- I give this classification because it meets my e

In [None]:
topic_model.visualize_topics()

In [None]:
topic_model.visualize_barchart(top_n_topics=5)

## Product Takeaways

While our topic model still includes some noise (terms such as "nike", "are", "11", "105", and "for"), quite a few insights can still be drawn from it.

For example, the top terms for Topic 2 are "sandals, these, are, comfortable, and". While there are some filler words here, it is useful to know that reviewers describe the sandals as comfortable. A bolder claim would be that "and" appearing after "comfortable" hints at a second positive descriptor, whereas "but" would hint at a complaint to follow. That reading is speculative: the representation is a list of terms ranked by score, not a phrase, so it would need to be checked against the reviews themselves. We can also look at Topic 3, which shows a trend of complaints against Nike for "loud" and "squeaking" products. We could use this information to consider reducing the noise that the shoes make.
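Both takeaways can be checked against the review text itself. The sketch below counts which word directly follows "comfortable" and what share of reviews mention a noise-related keyword. The helper names and the sample reviews are hypothetical; in the notebook, the inputs would be the lists stored in `topic_groups[2]` and `topic_groups[3]`.

```python
import re
from collections import Counter


def words_following(term, reviews):
    """Count the word that comes right after `term` in each review (case-insensitive)."""
    # \W+ also skips punctuation, so "comfortable, and" counts "and".
    pattern = re.compile(rf"\b{re.escape(term)}\b\W+(\w+)", re.IGNORECASE)
    counts = Counter()
    for review in reviews:
        counts.update(word.lower() for word in pattern.findall(review))
    return counts


def share_mentioning(keywords, reviews):
    """Fraction of reviews containing at least one keyword as a substring."""
    if not reviews:
        return 0.0
    lowered = [keyword.lower() for keyword in keywords]
    hits = sum(any(keyword in review.lower() for keyword in lowered) for review in reviews)
    return hits / len(reviews)


# Hypothetical reviews for illustration only, not taken from the dataset.
sample_reviews = [
    "Very comfortable and light.",
    "Comfortable but they squeak when I walk.",
    "These are comfortable, and they look great.",
]
print(words_following("comfortable", sample_reviews))
print(share_mentioning(["squeak", "loud"], sample_reviews))
```

If "and" clearly outnumbers "but" after "comfortable" in the Topic 2 reviews, the bolder reading gains some support. A high share of "squeak" or "loud" mentions in Topic 3 would back the noise complaint.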