# Preface

In this notebook, we perform Topic Modeling using spaCy, incluing a variety of preprocessing steps to get us there. The **Data** for this project comes from the link: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/.

Note that only the Appliances Dataset, which contains 602,777 reviews for appliances. This was saved locally then downloaded into the notebook.

Then, we perform an investigation of these results, with the goals of being able to make sense of/interpret the topics that were output. This was the task for the work after 2/28/24. I was interested in not only the topics themselves, but what they were describing, and this was done be examining the documents that had very high representations of a specific topic (indicating that they were at the center of the cluster?)

# Below is work done before 2/28/24...

# Imports

In [4]:
import pandas as pd
import spacy
import re
import random
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis.gensim_models

from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
#Load the English Tokenizer, tagger, etc...
nlp = spacy.load('en_core_web_lg')

### Contants

In [None]:
RANDOM_SEED = 123
random.seed(RANDOM_SEED)

# Read, View Data

In [5]:
#Read in the .json data
appliances_data = pd.read_json('Appliances.json', lines=True)

In [6]:
#Head of df
appliances_data.head()

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,image
0,5,2.0,False,"11 27, 2013",A3NHUQ33CFH3VM,1118461304,{'Format:': ' Hardcover'},Greeny,Not one thing in this book seemed an obvious o...,Clear on what leads to innovation,1385510400,
1,5,,False,"11 1, 2013",A3SK6VNBQDNBJE,1118461304,{'Format:': ' Kindle Edition'},Leif C. Ulstrup,I have enjoyed Dr. Alan Gregerman's weekly blo...,Becoming more innovative by opening yourself t...,1383264000,
2,5,,False,"10 10, 2013",A3SOFHUR27FO3K,1118461304,{'Format:': ' Hardcover'},Harry Gilbert Miller III,Alan Gregerman believes that innovation comes ...,The World from Different Perspectives,1381363200,
3,5,,False,"10 9, 2013",A1HOG1PYCAE157,1118461304,{'Format:': ' Hardcover'},Rebecca Ripley,"Alan Gregerman is a smart, funny, entertaining...",Strangers are Your New Best Friends,1381276800,
4,5,10.0,False,"09 7, 2013",A26JGAM6GZMM4V,1118461304,{'Format:': ' Hardcover'},Robert Morris,"As I began to read this book, I was again remi...","How and why it is imperative to engage, learn ...",1378512000,


We see a lot of columns, however the only one that we will one for this is reviewText. Lets read the first few...

In [None]:
appliances_data.head()['reviewText']

This is weird...It seems that these are about books

In [None]:
#So they arent ALL about books, but some of them are. Lets leave them in for now, but we should note it for the analysis
appliances_data['reviewText']

# Cleaning the Reviews

### A few problems with the reviews are noted and handled here...

- There are missing values in the list of reviews
- '\n' characters are in the reviews
- '\</a>' tags are in the reviews
- Some other html are in the reviews...'\<a data-hook="product-link-linked" class="a-link-normal" href="/The-Necessity-of-Strangers-The-Intriguing-Truth-About-Insight-Innovation-and-Success/dp/1118461304/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">' for instance

In [None]:
#See number of missing reviews
print(appliances_data['reviewText'].isna().sum())

#Filter out these reviews 
appliances_data = appliances_data.dropna(subset=['reviewText'])

#Make sure this worked
print(appliances_data['reviewText'].isna().sum())

In [None]:
#Convert the reviews to a list, it will be easier to handle the \n and \' here
reviews_list = appliances_data['reviewText'].tolist()

#I only want the first 10,000 of these reviews, as 600k was taking a long time to 
#process through
random.shuffle(reviews_list)
reviews_list = reviews_list[0:9999]



#Create empty list to add the cleaned reviews to
cleaned_reviews = []

#Iterate thru this list and remove these problems from the strings in the list. 
reviews_list = [review.replace('\n','').replace('</a>', '') for review in reviews_list]

for review in reviews_list:
    review = review.replace('\n','')
    review = re.sub(r'<\s*[^<>]+\s*>', '', review)
    cleaned_reviews.append(review)

# Turn Reviews Into spaCy Documents and Tokens


### Function to do this: tokenize_reviews()

**Input**:  
- *review*: A review

**Output**:
- *tokens*: A list of tokens from the reviews

In [None]:
def tokenize_reviews(review):
    
    #Tokenize and preprocess review
    doc = nlp(review)
    
    #Lemmatize and remove stop words from the doc
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    
    return tokens

In [None]:
#Init. empty list that will be filled when preprocessing each doc
processed_docs = []

#Counter for our progress:
i=0

#Preprocess each doc
for review in cleaned_reviews:
    processed_docs.append(tokenize_reviews(review))
    i+=1
    
    #Print the progress in a way that it only stays on 1 line
    print(f'Preprocessing Iteration {i}/10,000', end='\r')


# Build Topic Modeling Model Using LDA

LDA (Latent Dirichlet Allocation) is a widely-used statistical model for topic modeling, where abstract topics in a collection of documents are discovered. The goal of **Topic Modeling** is to automatically identify the underlying topics that are found in a large corpus of text data.  

Each document is represented as a mixture of topics, where each topic is characterized by a distribution over words.

### Create dictionary

This dictionary maps the word IDs to words from our processed_docs. It has the following attributes and methods...

- token2id: A dictionary that maps tokens (words) to their unique integer IDs.
- id2token: A dictionary that maps integer IDs to their corresponding tokens.
- dfs: A dictionary that stores the document frequencies of tokens in the corpus.
- num_docs: The total number of documents in the corpus.
- num_pos: The total number of tokens in the corpus.
- num_nnz: The total number of non-zero entries in the dfs dictionary.

In [None]:
dictionary = corpora.Dictionary(processed_docs)

In [None]:
#View the contents of this dictionary
print(dictionary.num_docs)
print(dictionary.num_pos)
print(dictionary.num_nnz)

 ### Create corpus from the dictionary
 
 It converts the list of tokens into a bag of words format. It creates a compact and efficient representation of the text data that can be used as input for the models. It captures the frequency of each word, but does seem to be a rather elementary task.

In [None]:
corpus = [dictionary.doc2bow(text) for text in processed_docs]

# Train the LDA Model

#### Some Parameters: 

- corpus: The bag-of-words corpus 
- id2word: The mapping of word IDs to words, typically created using Gensim's Dictionary class.
- num_topics: The number of topics to extract from the text data. This is the number of clusters.
- passes: The number of passes (iterations) over the corpus during training. More passes can lead to better results but also increase computation time.

Others...

- alpha: The hyperparameter controlling the sparsity of the document-topic distributions. By default, this is set to 'symmetric', which means all documents have the same prior.
- eta: The hyperparameter controlling the sparsity of the topic-word distributions. By default, this is set to 'symmetric', which means all topics have the same prior.
- chunksize: The number of documents to load into memory at a time. This can affect memory usage and training speed.
- minimum_probability: The minimum probability value for a topic or word in the topic-word distribution to be considered in the output. Low values can help filter out noise.
- iterations: The maximum number of iterations for each document. This can affect the quality of the inferred topics.

In [None]:
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10, random_state=RANDOM_SEED)

In [None]:
for topic_id, topic in lda_model.print_topics():
    print(f"Topic {topic_id}: {topic}")

# Visualize the topics

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

# Below is work done AFTER 2/28...

In my 2/28 capstone class, we discussed a few ways to take this analysis to another level. Below is a summary of that discussion...

- "Relate the topics back to the documents". A document is a distribution over topics. We can output the representations/distributions of topics (clusters) on each document. Doing this, we can gain some insight about these topics. What are the documents that have the highest proportion of topic $j$ saying about topic $j$?
- Look into the PCA that was needed to create the above visual. What were the inputs used to generate these PCs?
- **Understand that the key and goal in unsupervised learning is *DISCOVERY*. I have a good model now, however I now need to use it to discover some things...**
document is typically represented as a distribution over topics

# What is with the PCA?

The PCA takes the input (I believe) of topic-term probabilites...This comes from the documentation:  https://pyldavis.readthedocs.io/en/latest/modules/API.html , although this does not have the documentaion for the gensim.prepare method.

# Relating the Topics Back to the Documents

### First step, for each document, $d$, get the distribution of each topic on $d$

In [None]:
#Init. an empty list that we will fill with these distributions
topic_distributions = []

#Iterate thru the corpus, add these dists for each document.
for document in corpus: 
    
    #Add the topic ditribution to the list
    topic_distributions.append(lda_model.get_document_topics(document))

In [None]:
#Look at this data structure's length
len(topic_distributions)

In [None]:
#Look at first few values
topic_distributions[:5]

It seems that topic 7 has a pretty high representation in document 4. Lets loop thru all of these and create a new data structure that JUST has topics with a high concentration of the distribution of a document.

### Get the docs that contain a very high proportion of a certain topic

In [None]:
#Init. empty df that will be filled. We do a df for easier retrieval of info
#Columns...
    #Index Number: The document number that the document with a high concentration of a given topic is.
    #High_Conc_Topic: The topic that is very highly represented in the document
    #Proportion: The proportion that the topic is in the given document
high_dist_amounts_df = pd.DataFrame(columns=['Index Number', 'High_Conc_Topic', 'Proportion'])

#Init a counter that we reference when adding the high conc. topic distributions.
doc_number = 0

#Traverse thru the topic distributions
for topic_dist in topic_distributions:
    
    #Traverse thru the tuples in each topic dist
    for tup in topic_dist:
        
        #If the proportion is > than .90 we will consider it to be a high concentration
        if tup[1] >= .90:
            
            #Append the tuple that gives us the topic and the associated proportion. I also want to add the document number, the index, to this so we will use a dict.
            high_dist_amounts_df.loc[len(high_dist_amounts_df)] = [doc_number, tup[0], tup[1]]
            
    #Increment the counter
    doc_number+=1

In [None]:
high_dist_amounts_df

### ***Important note: The Topic numbers in this df are between 0-9 and correspond to the LDA output, but NOT the graph of the clusters using PCA. To translate between the df values and the graph, the graph's cluster numbers are 1 higher than the topic numbers in the df. So, 1 in the graph is topic 0 in the df, 2 in the graph is 1 in the df and etc...***

#### Perform some basic analysis on this df...

In [None]:
#Table of topic values
high_dist_amounts_df['High_Conc_Topic'].value_counts()

In [None]:
#Average proportion per topic
high_dist_amounts_df.groupby('High_Conc_Topic')['Proportion'].mean()

#### This is interesting. Topics 3 and 4 are, on average, EXTREMELY well represented in the documents where they have a distribution value of over .9.

#### Now, for each of the 10 topics, I would like to examine the documents with these high concentrations of one topic. I will use a function to do this...


# Function to Examine the Documents with a high concentration of one Topic (From the data structures we have created above)

**Inputs**:
- *high_concentration_df*: The dataframe that contains the Indices, Topic, and Proportion of the corresponding document that is represented by the topic.
- *processed_docs*: A list of the processed docs
- *topic_number*: The number of topics that we want to examine

**Output**:
- *corresponding_processed_docs*: The processed documents at the indices of documents with a high concentration of the topic topic_number

In [None]:
def examine_high_conc_docs(high_concentration_df, processed_docs, topic_number):
    
    #Get the indices in the df that are in the same row as all of the topic_number topic entries in the topic column.
    indices_of_topic = high_concentration_df[high_concentration_df['High_Conc_Topic'] == topic_number]['Index Number'].to_list()
    
    #Init. list that will be filled with the processed documents at the indices in the indices_of_topic list
    corresponding_processed_docs = []
    
    #For each of these indices, get the document from processed_docs at the same index and add it to my list that will be returned
    for index in indices_of_topic:
        corresponding_processed_docs.append(processed_docs[int(index)])
    
    return corresponding_processed_docs

# One-by-One, Examine the Documents for each of the 10 Topics

One of the easiest ways to do this is through a WordCloud, so along with printing the lists, we will also display the WordCloud for each of the topics (remember, just the docs with the high distributions for that given topic)

## Topic 0: 

### TOPIC 0 SUMMARY: 

**SEEMS TO BE DESCRIBING ITEMS THAT WORK WELL, THROUGH WORD LIKE "WORK", "EASY", "EASY INSTALL", ETC... IT IS INTERESTING THAT THESE SEEM TO ALSO BE ACCOMPANIED BY A YOUTUBE VIDEO, WHICH MAY FURTHER ENHANCE THE EASE OF USE AND SETTING UP THESE PRODUCTS. THERE SEEMS TO BE DRYERS, REFRIGERATORS, ETC. MENTIONED, BUT THESE ALL SEEM TO HAVE VERY EASY INSTALLATIONS**

#### Print the list of documents (just the cleaned tokens)

In [None]:
topic0_high_conc_docs = examine_high_conc_docs(high_dist_amounts_df, processed_docs, 0)

#Print these in a readable way.
for doc in topic0_high_conc_docs:
    print(' '. join(doc))

In [None]:
#Generate wordcloud by adding these tokens to one string. Need this for the Wordcloud

#Empty string that will be filled with all of the tokens
topic0_tokens = ""

for doc in topic0_high_conc_docs:
    topic0_tokens+= " ".join(doc) + " "

#Create wordcloud
wordcloud = WordCloud(width=800, height=800, background_color='white', min_font_size=10).generate(topic0_tokens)

# Display the generated image:
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)

plt.show()

## Topic 1: 

### TOPIC 1 SUMMARY: 

**SEEMS TO BE DESCRIBING ITEMS THAT DO NOT WORK WELL, AS WE SEE WORDS LIKE "FIX" AND "PROBLEM". IT IS INTERESTING THAT THERE ARE FAR LESS DOCUMENTS HERE, POSSIBLY INDICATING THAT THERE WERE NOT THAT MANY REVIEWS WITH NEGATIVE SENTIMENTS, OR THIS TOPIC DOES NOT SEEM TO BE VERY SPECIFIC.**

#### Print the list of documents (just the cleaned tokens)

In [None]:
topic1_high_conc_docs = examine_high_conc_docs(high_dist_amounts_df, processed_docs, 1)

#Print these in a readable way.
for doc in topic1_high_conc_docs:
    print(' '. join(doc))

In [None]:
#Generate wordcloud by adding these tokens to one string. Need this for the Wordcloud

#Empty string that will be filled with all of the tokens
topic1_tokens = ""

for doc in topic1_high_conc_docs:
    topic1_tokens+= " ".join(doc) + " "

#Create wordcloud
wordcloud = WordCloud(width=800, height=800, background_color='white', min_font_size=10).generate(topic1_tokens)

# Display the generated image:
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)

plt.show()

## Topic 2: 

### TOPIC 2 SUMMARY: 

**UNLIKE THE TWO PRIOR TOPICS, TOPIC 2 SEEMS TO PRETTY CLEARLY INVOLVE REVIEWS OF A REFRIDGERATOR/WATER FILTER. IT IS VERY INTERESTING HOW SO FAR, THESE TOPICS HAVE BOTH INCLUDED DIFFERENT SENTIMENTS OF REVIEWS, AND ALSO DIFFERENT PRODUCT REVIEWS ALL TOGETHER. THERE ARE QUITE A FEW REVIEWS OF THESE FRIDGES/FILTERS, WHICH COULD BE A TESTIMENT TO THE AMOUNT OF THEIR REVIEWS THAT ARE IN THE CORPUS.**

#### Print the list of documents (just the cleaned tokens)

In [None]:
topic2_high_conc_docs = examine_high_conc_docs(high_dist_amounts_df, processed_docs, 2)

#Print these in a readable way.
for doc in topic2_high_conc_docs:
    print(' '. join(doc))

In [None]:
#Generate wordcloud by adding these tokens to one string. Need this for the Wordcloud

#Empty string that will be filled with all of the tokens
topic2_tokens = ""

for doc in topic2_high_conc_docs:
    topic2_tokens+= " ".join(doc) + " "

#Create wordcloud
wordcloud = WordCloud(width=800, height=800, background_color='white', min_font_size=10).generate(topic2_tokens)

# Display the generated image:
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)

plt.show()

## Topic 3: 

### TOPIC 3 SUMMARY: 

**THESE REVIEWS FOR SOME REASON WERE ALL THE EXACT SAME, WHICH MAY HAVE BEEN DUE TO THE FACT THAT THERE WERE SOME REVIEWS THAT WERE COPIED(...?) IF I WERE TO DO THIS AGAIN, I CERTAINLY WOULD HAVE ACCOUNTED FOR THIS. IT SEEMS THAR A DRYER IS BEING REVIEWED IN THESE, HOWEVER I WILL SKIP ANY FURTHER ANALYSIS OF THIS TOPIC**

#### Print the list of documents (just the cleaned tokens)

In [None]:
topic3_high_conc_docs = examine_high_conc_docs(high_dist_amounts_df, processed_docs, 3)

#Print these in a readable way.
for doc in topic3_high_conc_docs:
    print(' '. join(doc))

#### WELL... THIS IS WEIRD...WE SEE PRETTY MUCH THE SAME REVIEWS MANY TIMES. LETS STOP THE ANALYSIS FOR THIS TOPIC HERE...

## Topic 4: 

### TOPIC 4 SUMMARY: 

**THESE REVIEWS FOR SOME REASON WERE ALL THE EXACT SAME, WHICH MAY HAVE BEEN DUE TO THE FACT THAT THERE WERE SOME REVIEWS THAT WERE COPIED(...?) IF I WERE TO DO THIS AGAIN, I CERTAINLY WOULD HAVE ACCOUNTED FOR THIS. **

#### Print the list of documents (just the cleaned tokens)

In [None]:
topic4_high_conc_docs = examine_high_conc_docs(high_dist_amounts_df, processed_docs, 4)

#Print these in a readable way.
for doc in topic4_high_conc_docs:
    print(' '. join(doc))

## Topic 5: 

### TOPIC 5 SUMMARY: 

**THERE ARE ONLY A VIEW DOCS HERE, HOWEVER THEY DO REVEAL SOME INTERESTING INFO. FOR ONE, I AM NOT MAKING A WORDCLOUD HERE BECAUSE IT WOULD BE DOMINATED BY THE TOKEN 'BLAH', WHICH IS PRETTY FUNNY. THESE REVIEWS SEEM TO BE QUITE GOOD, AND THERE IS ALSO A CUSTOMER SERVICE ASPECT TO THESE REVIEWS. I WONDER WHAT CUSTOMER SERVICE HAD TO DO WITH THIS PRODUCT? THIS IS A PRETTY INTERESTING FINDING, THOUGH...**

#### Print the list of documents (just the cleaned tokens)

In [None]:
topic5_high_conc_docs = examine_high_conc_docs(high_dist_amounts_df, processed_docs, 5)

#Print these in a readable way.
for doc in topic5_high_conc_docs:
    print(' '. join(doc))

## Topic 6: 

### TOPIC 6 SUMMARY: 

**REVIEWERS HERE SEEMED TO RAVE ABOUT THE SLEEK LOOK OF PRODUCTS. NOW, WE HAVE ANOTHER SET OF POSITIVE REVIEWS, JUST LIKE TOPIC 0, BUT WE HAVE ONES WHERE REVIEWERS ARE IMPRESSED BY A PRODUCT'S APPEARANCE! PRETTY INTERESTING HOW THE TOPIC MODELING OUTPUT HAS DIFFERENT TOPICS FOR DIFFERENT ASPECTS OF PRODUCTS THAT WERE LIKED.**

#### Print the list of documents (just the cleaned tokens)

In [None]:
topic6_high_conc_docs = examine_high_conc_docs(high_dist_amounts_df, processed_docs, 6)

#Print these in a readable way.
for doc in topic6_high_conc_docs:
    print(' '. join(doc))

In [None]:
#Generate wordcloud by adding these tokens to one string. Need this for the Wordcloud

#Empty string that will be filled with all of the tokens
topic6_tokens = ""

for doc in topic6_high_conc_docs:
    topic6_tokens+= " ".join(doc) + " "

#Create wordcloud
wordcloud = WordCloud(width=800, height=800, background_color='white', min_font_size=10).generate(topic6_tokens)

# Display the generated image:
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)

plt.show()

## Topic 7: 

### TOPIC 7 SUMMARY: 

**IN THIS TOPIC, WE HAVE MORE POSITIVE REVIEWS ABOUT A WASHER AND A FRIDGE. THIS TIME, THOUGH, THE REVIEWERS ARE IMPRESSED BY THE ICE TRAY AND ICE-MAKING ABILITIES OF A FREEZER! THAT IS THE OVERWHELMING MAJORITY OF REVIEWS HERE.**

#### Print the list of documents (just the cleaned tokens)

In [None]:
topic7_high_conc_docs = examine_high_conc_docs(high_dist_amounts_df, processed_docs, 7)

#Print these in a readable way.
for doc in topic7_high_conc_docs:
    print(' '. join(doc))

In [None]:
#Generate wordcloud by adding these tokens to one string. Need this for the Wordcloud

#Empty string that will be filled with all of the tokens
topic7_tokens = ""

for doc in topic7_high_conc_docs:
    topic7_tokens+= " ".join(doc) + " "

#Create wordcloud
wordcloud = WordCloud(width=800, height=800, background_color='white', min_font_size=10).generate(topic7_tokens)

# Display the generated image:
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)

plt.show()

## Topic 8: 

### TOPIC 8 SUMMARY: 

**NOT A LOT OF DOCS HERE, BUT WE SEE SOME NEGATIVE REVIEWS! THIS TOPIC IS ANOTHER THAT MAY BE FALLING VICTIM TO FEW DOCUMENTS BEING VERY LARGELY REPRESENTED BY THIS TOPIC. LIKE TOPIC 1, THERE ARE FEW DOCS HERE.**

#### Print the list of documents (just the cleaned tokens)

In [None]:
topic8_high_conc_docs = examine_high_conc_docs(high_dist_amounts_df, processed_docs, 8)

#Print these in a readable way.
for doc in topic8_high_conc_docs:
    print(' '. join(doc))

## Topic 9: 

### TOPIC 9 SUMMARY: 

**LASTLY, WE HAVE A HOSE THAT IS PRETTY AWFUL, WITH SEVERAL NEGATIVE REVIEWS!**

#### Print the list of documents (just the cleaned tokens)

In [None]:
topic9_high_conc_docs = examine_high_conc_docs(high_dist_amounts_df, processed_docs, 9)

#Print these in a readable way.
for doc in topic9_high_conc_docs:
    print(' '. join(doc))

# SUMMARY OF EXPLORATION INTO THE TOPIC MODELING RESULTS...

Our goal of performing the Topic Modeling was *discovery*, where I wanted to discover different topics and themes that were present in a large corpus of Amazon appliance reviews. I was successfully able to create a topic modeling model and visualize its outputs, however the exploration did not stop there. I then wanted to dive deep into each of the 10 topics to discover similarities between them, and specifically *what* these topics were. This was done be examaning, for all topics, the documents that contained a very large proportion of one specific topic. These documents were thus very representative of a given topic. For all of these lists of documents & topics, they were examined and, when enough documents were present, visualized. I found interesting trends and themes in each of these topics, and they are listed throughout the 2nd half of this file.