# Week 13 Assignment: Topic Modeling

**DATA110**  
*Brian Roepke*  

Using the same women's clothing dataset from Midterm #2, perform topic modeling with LDA on the review text, using either Gensim or sklearn.  


Given that there are 6 departments, you can use 6 topics. Bonus if you compute coherence scores across multiple models to select the optimal number of topics.



References:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py

In [None]:
import numpy as np
import pandas as pd
import re
import itertools
import string
import warnings
warnings.filterwarnings('ignore')

#from textblob import TextBlob
#from textblob import Word

import gensim
from gensim import corpora, models
from gensim.models import CoherenceModel
from gensim.models import nmf
from gensim.models import lsimodel

import sklearn as sk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics 
from sklearn.decomposition import PCA

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import silhouette_score

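# NLTK Imports and Downloads
import nltk
from nltk import word_tokenize
from nltk.sentiment.util import *
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import FreqDist

# Resources used below: stop words, the tokenizer model, and WordNet for lemmatizing.
# Assumption: newer NLTK releases may also need 'punkt_tab' and 'omw-1.4'.
for resource in ['stopwords', 'punkt', 'wordnet']:
    nltk.download(resource, quiet=True)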

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv("ClothingReviews.csv")
df.head()

In [None]:
df.dropna(subset=['Department Name', 'Class Name', 'Review Text'], inplace=True)

In [None]:
# fill missing titles with an empty string so they can be concatenated with the review text
df['Title'] = df['Title'].fillna('')

In [None]:
# count of nulls
df.isnull().sum()

In [None]:
df['Text'] = df['Title'] + ' ' + df['Review Text']

In [None]:
df.drop(columns=['Title', 'Review Text'], inplace=True)

In [None]:
# Add column 'text_len' holding the character length of the derived 'Text' field
df['text_len'] = df['Text'].str.len()

In [None]:
# drop duplicates and report how many rows were removed
len_before = df.shape[0]
df.drop_duplicates(inplace=True)
len_after = df.shape[0]

print("Before =", len_before)
print("After =", len_after)
print('')
print("Total Removed =", len_before - len_after)

In [None]:
# number of whitespace-separated words per review
wordlen = df['Text'].str.split().str.len()
wordlen.describe()

In [None]:
df['Text'][2]

In [None]:
def process_string(text, stem=None):
    """Lowercase, strip punctuation and stop words, then optionally
    stem (stem='Stem') or lemmatize (stem='Lem') each word."""

    # Lowercase and remove punctuation
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Drop English stop words; a set keeps the membership test fast
    stop_words = set(nltk.corpus.stopwords.words("english"))
    tokens = [word for word in text.split() if word not in stop_words]

    if stem == 'Stem':
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(word) for word in tokens]
    elif stem == 'Lem':
        lem = WordNetLemmatizer()
        tokens = [lem.lemmatize(word) for word in tokens]

    # Each word is followed by a space, so the result ends with a trailing space
    return "".join(word + " " for word in tokens)

In [None]:
df['Text_Processed'] = df['Text'].apply(lambda x: process_string(x, stem='Lem'))

In [None]:
df['Text_Processed'][2]

# Topic Modeling

In [None]:
clean_docs = df['Text_Processed'].to_list()

# first 5 docs
clean_docs[:5]

In [None]:
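# Join with an explicit space so word boundaries between reviews
# do not depend on trailing whitespace in each document
texts = " ".join(clean_docs)
word_tokens = word_tokenize(texts)

fdist = FreqDist(word_tokens)
#fdist.most_common(50)

plt.figure(figsize=(15, 5))
fdist.plot(50);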

# Determine the Optimal Number of Clusters

## Elbow Method

To select the best number of clusters, we'll use the Elbow method. Per [Wikipedia](https://en.wikipedia.org/wiki/Elbow_method_(clustering)):

> *In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use. The same method can be used to choose the number of parameters in other data-driven models, such as the number of principal components to describe a data set.*

[Tutorial: How to determine the optimal number of clusters for k-means clustering](https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f)
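
For k-means, the quantity usually plotted on the y-axis is the inertia: the sum of squared distances from each point to its nearest centroid. The sketch below computes it by hand on toy data to show what the curve measures. The function name `inertia` and the sample points are illustrative assumptions, not part of this notebook or of scikit-learn.

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centroids
        )
    return total


# Toy data (hypothetical): two tight groups far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_centroid = [(5.0, 1.0)]                 # k = 1
two_centroids = [(0.0, 1.0), (10.0, 1.0)]   # k = 2

print(inertia(points, one_centroid))
print(inertia(points, two_centroids))
```

Inertia generally keeps shrinking as k grows, which is why the plot is read for the point where the drop levels off rather than for its minimum.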

In [None]:
# vectorize the cleaned documents with TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(clean_docs)

X.shape

In [None]:
Sum_of_squared_distances = []
K = range(1,10)
for k in K:
    # fix the seed so the elbow curve is reproducible between runs
    km = KMeans(n_clusters=k, random_state=42)
    km = km.fit(X)
    Sum_of_squared_distances.append(km.inertia_)

In [None]:
ax = sns.lineplot(x=list(K), y=Sum_of_squared_distances)
ax.lines[0].set_linestyle("--")
plt.xlabel('k')
plt.ylabel('Sum of Squared Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

**Conclusion:** Based on this method, the appropriate number of clusters is not clear-cut.

## Silhouette Score

The silhouette score ranges from -1 to 1. The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html?highlight=silhouette_score#sklearn.metrics.silhouette_score
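
To make the score concrete: for each sample, let `a` be the mean distance to the other members of its own cluster and `b` the mean distance to the members of the nearest other cluster. The sample's silhouette is `(b - a) / max(a, b)`, and the reported score is the mean over all samples. The following is a minimal pure-Python sketch of that definition on toy data. The function `silhouette` and the sample points are assumptions made for illustration, and it assumes at least two clusters and Euclidean distance. It is not how scikit-learn implements the metric.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient computed directly from the definition."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for a cluster with a single member
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data (hypothetical): two groups that are far apart.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_labels = [0, 0, 1, 1]   # matches the visible grouping
bad_labels = [0, 1, 0, 1]    # splits each group across clusters

print(silhouette(toy_points, good_labels))
print(silhouette(toy_points, bad_labels))
```

The well-matched labeling scores close to 1, while the mismatched labeling scores below 0, in line with the interpretation above.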

In [None]:
def get_silhouette_score(X, k):
    """Print the average silhouette score for each cluster count from 2 to k - 1."""
    for n_clusters in range(2, k):
        clusterer = KMeans(n_clusters=n_clusters, random_state=42)
        y = clusterer.fit_predict(X)

        message = "For n_clusters = {} The average silhouette_score is: {}"
        print(message.format(n_clusters, silhouette_score(X, y)))

get_silhouette_score(X, 10)

**Conclusion:** 

# Clustering Model

In [None]:
true_k = 6
kmeans = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1, random_state=42)
kmeans.fit(X)


print("Top terms per cluster:")
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out()

results_dict = {}


for i in range(true_k):
    terms_list = []
    
    for ind in order_centroids[i, :15]:  
        terms_list.append(terms[ind])
    
    results_dict[f'Cluster{i}'] = terms_list
    
df_clusters = pd.DataFrame.from_dict(results_dict)
df_clusters

In [None]:
# assign the data labels back to the dataframe
df['clusters'] = kmeans.labels_

df.sample(10, random_state=42)

In [None]:
new_docs = ['This dress is gorgeous and I love it and would gladly recommend it to all of my friends.',
            'This skirt has really horrible quality and I hate it!',
            'A super cute top with the perfect fit.',
            'The most gorgeous pair of jeans I have seen.',
            'this item is too little and tight.']

pred = kmeans.predict(vectorizer.transform(new_docs))
print(pred)

# Gensim

In [None]:
clean_docs[:1]

In [None]:
tokenized_docs = [word_tokenize(doc) for doc in clean_docs]

In [None]:
tokenized_docs[:1]

In [None]:
# create a dictionary from the corpus
dictionary = gensim.corpora.Dictionary(tokenized_docs)
print(dictionary)

# Term Document Frequency 
# convert our entire corpus to a list of vectors:
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# View the first doc
print(bow_corpus[0])

In [None]:
doc = bow_corpus[1]
for word_id, count in doc:
    print(f"Word {word_id} ({dictionary[word_id]}) appears {count} times")
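
The bag-of-words representation built above stores each document as a list of `(token_id, count)` pairs and discards word order. A minimal pure-Python sketch of the same idea follows. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins written for illustration, and the ids they assign will not necessarily match the ones Gensim's `Dictionary` assigns.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy documents (hypothetical), already tokenized.
toy_docs = [["dress", "fit", "dress"], ["fit", "love"]]
vocab = build_vocab(toy_docs)

print(vocab)
print(to_bow(toy_docs[0], vocab))
print(to_bow(["love", "jeans"], vocab))  # "jeans" was never seen, so it is dropped
```

Dropping unseen tokens mirrors what happens when a new document is converted with a dictionary that was built on the training corpus only.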

# Topic Modeling

In [None]:
NUM_TOPICS = 6

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=bow_corpus,
                                           id2word=dictionary,
                                           num_topics=NUM_TOPICS, 
                                           random_state=42,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

# View the topics in the LDA model
topics = lda_model.print_topics()
for topic in topics:
    print(topic)

In [None]:
new_doc = 'This dress is gorgeous and I love it and would gladly recommend it to all of my friends.'
# apply the same lemmatization that was used on the training documents
new_doc = process_string(new_doc, stem='Lem')
new_doc = word_tokenize(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)

print(new_doc_bow)
print(lda_model.get_document_topics(new_doc_bow))

# Model Perplexity and Coherence Score
Model perplexity and topic coherence provide convenient measures for judging how good a given topic model is.
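
One detail worth keeping in mind when reading the numbers: Gensim's `log_perplexity` returns a per-word likelihood bound (a negative number), not the perplexity itself. The Gensim documentation gives the relationship as perplexity = 2^(-bound). A small sketch of that conversion follows. The helper name `bound_to_perplexity` and the sample bounds are assumptions made for illustration.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)


# Hypothetical bounds: the less negative bound gives the lower perplexity.
print(bound_to_perplexity(-3.0))
print(bound_to_perplexity(-8.0))
```

A bound closer to zero therefore corresponds to a lower perplexity, which is the better direction.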

In [None]:
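def model_scoring(model, corpus, text, dictionary, perplex=False):

    # log_perplexity returns the per-word likelihood bound, not the perplexity itself.
    # Perplexity is 2 ** (-bound), so a bound closer to zero means lower perplexity (better).
    if perplex:
        print('\nPer-word likelihood bound: ', model.log_perplexity(corpus))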

    # Compute Coherence Score
    coherence_model = CoherenceModel(model=model, 
                                         texts=text, 
                                         dictionary=dictionary, 
                                         coherence='c_v')
    
    coherence_lda = coherence_model.get_coherence()
    print('\nCoherence Score: ', coherence_lda)

In [None]:
model_scoring(lda_model, bow_corpus, tokenized_docs, dictionary, perplex=True)

In [None]:
nmf_model = gensim.models.nmf.Nmf(corpus=bow_corpus, 
                      num_topics=NUM_TOPICS, 
                      id2word=dictionary, 
                      chunksize=2000, 
                      passes=10, 
                      random_state=42)

model_scoring(nmf_model, bow_corpus, tokenized_docs, dictionary)

In [None]:
lsi = gensim.models.lsimodel.LsiModel(corpus=bow_corpus, 
                                num_topics=NUM_TOPICS, 
                                id2word=dictionary)

model_scoring(lsi, bow_corpus, tokenized_docs, dictionary)

## Topic Visualization

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()

# feed the LDA model into the pyLDAvis instance
lda_viz = gensimvis.prepare(lda_model, bow_corpus, dictionary, sort_topics=False)

pyLDAvis.display(lda_viz)