<a href="https://colab.research.google.com/github/diwu437/diwu-github.io/blob/master/Document_Clustering_and_Topic_Modeling_ipynb%E2%80%9D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Clustering and Topic Modeling

In this project, I used unsupervised learning models to cluster unlabeled documents into different groups, visualize the results and identify their latent topics/structures.

## Contents

* [Part 0: Setup Google Drive Environment](#Part-0:-Setup-Google-Drive-Environment)
* [Part 1: Load Data](#Part-1:-Load-Data)
* [Part 2: Tokenizing and Stemming](#Part-2:-Tokenizing-and-Stemming)
* [Part 3: TF-IDF](#Part-3:-TF-IDF)
* [Part 4: K-means clustering](#Part-4:-K-means-clustering)
* [Part 5: Topic Modeling - Latent Dirichlet Allocation](#Part-5:-Topic-Modeling---Latent-Dirichlet-Allocation)


# Part 0: Setup Google Drive Environment

In [0]:
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
# https://drive.google.com/open?id=192JMR7SIqoa14vrs7Z9BXO3iK89pimJL
file = drive.CreateFile({'id':'192JMR7SIqoa14vrs7Z9BXO3iK89pimJL'}) # replace the id with id of file
file.GetContentFile('data.tsv')  

# Part 1: Load Data

In [0]:
import numpy as np
import pandas as pd
import nltk

import gensim
# REGULAR EXPRESSION
import re

from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')

In [0]:
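# Load data into a DataFrame, skipping malformed rows
# (on_bad_lines replaces error_bad_lines, which was deprecated in pandas 1.3 and removed in 2.0)
df = pd.read_csv('data.tsv', sep='\t', header=0, on_bad_lines='skip')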

In [58]:
df.head(20)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,3653882,R3O9SGZBVQBV76,B00FALQ1ZC,937001370,"Invicta Women's 15150 ""Angel"" 18k Yellow Gold ...",Watches,5,0,0,N,Y,Five Stars,Absolutely love this watch! Get compliments al...,2015-08-31
1,US,14661224,RKH8BNC3L5DLF,B00D3RGO20,484010722,Kenneth Cole New York Women's KC4944 Automatic...,Watches,5,0,0,N,Y,I love thiswatch it keeps time wonderfully,I love this watch it keeps time wonderfully.,2015-08-31
2,US,27324930,R2HLE8WKZSU3NL,B00DKYC7TK,361166390,Ritche 22mm Black Stainless Steel Bracelet Wat...,Watches,2,1,1,N,Y,Two Stars,Scratches,2015-08-31
3,US,7211452,R31U3UH5AZ42LL,B000EQS1JW,958035625,Citizen Men's BM8180-03E Eco-Drive Stainless S...,Watches,5,0,0,N,Y,Five Stars,"It works well on me. However, I found cheaper ...",2015-08-31
4,US,12733322,R2SV659OUJ945Y,B00A6GFD7S,765328221,Orient ER27009B Men's Symphony Automatic Stain...,Watches,4,0,0,N,Y,"Beautiful face, but cheap sounding links",Beautiful watch face. The band looks nice all...,2015-08-31
5,US,6576411,RA51CP8TR5A2L,B00EYSOSE8,230493695,Casio Men's GW-9400BJ-1JF G-Shock Master of G ...,Watches,5,0,0,N,Y,No complaints,"i love this watch for my purpose, about the pe...",2015-08-31
6,US,11811565,RB2Q7DLDN6TH6,B00WM0QA3M,549298279,Fossil Women's ES3851 Urban Traveler Multifunc...,Watches,5,1,1,N,Y,Five Stars,"for my wife and she loved it, looks great and ...",2015-08-31
7,US,49401598,R2RHFJV0UYBK3Y,B00A4EYBR0,844009113,INFANTRY Mens Night Vision Analog Quartz Wrist...,Watches,1,1,5,N,N,I was about to buy this thinking it was a ...,I was about to buy this thinking it was a Swis...,2015-08-31
8,US,45925069,R2Z6JOQ94LFHEP,B00MAMPGGE,263720892,G-Shock Men's Grey Sport Watch,Watches,5,1,2,N,Y,Perfect watch!,Watch is perfect. Rugged with the metal &#34;B...,2015-08-31
9,US,44751341,RX27XIIWY5JPB,B004LBPB7Q,124278407,Heiden Quad Watch Winder in Black Leather,Watches,4,0,0,N,Y,Great quality and build,Great quality and build.<br />The motors are r...,2015-08-31


In [0]:
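# Remove rows whose review text is missing
# (calling dropna on the column alone does not remove the rows from df)
df.dropna(subset=['review_body'], inplace=True)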

In [34]:
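# use the first 1,000 reviews as our training data
# (.loc[:1000] is label-based and inclusive, so it can return 1,001 rows; .iloc[:1000] returns exactly 1,000)
data = df['review_body'].iloc[:1000].tolist()
data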

['Absolutely love this watch! Get compliments almost every time I wear it. Dainty.',
 'I love this watch it keeps time wonderfully.',
 'Scratches',
 'It works well on me. However, I found cheaper prices in other places after making the purchase',
 "Beautiful watch face.  The band looks nice all around.  The links do make that squeaky cheapo noise when you swing it back and forth on your wrist which can be embarrassing in front of watch enthusiasts.  However, to the naked eye from afar, you can't tell the links are cheap or folded because it is well polished and brushed and the folds are pretty tight for the most part.<br /><br />I love the new member of my collection and it looks great.  I've had it for about a week and so far it has kept good time despite day 1 which is typical of a new mechanical watch",
 'i love this watch for my purpose, about the people complaining should of done their research better before buying. dumb people.',
 'for my wife and she loved it, looks great and a 

# Part 2: Tokenizing and Stemming

In [0]:
# Use nltk's English stopwords. Load stopwords and stemmer function from NLTK library.
stopwords = nltk.corpus.stopwords.words('english')

In [0]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# tokenization and stemming
def tokenization_and_stemming(text):
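    # tokenize the document, drop stop words, and lowercase the remaining tokens
    # note: the stop-word check runs before lowercasing, so capitalized stop words
    # such as "I" or "The" are kept (NLTK's stop-word list is lowercase)
    tokens = [word.lower() for word in nltk.word_tokenize(text) if word not in stopwords]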

    filtered_tokens = []
    
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
            
    # stemming
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [14]:
# tokenization and stemming
tokenization_and_stemming(data[0])

['absolut',
 'love',
 'watch',
 'get',
 'compliment',
 'almost',
 'everi',
 'time',
 'i',
 'wear',
 'dainti']
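
The stem `'i'` in the output above survives because the stop-word check is applied to the token in its original case, and only afterwards is the token lowercased. The sketch below illustrates the difference between the two orders of operations. It uses a tiny, hypothetical stop-word set in place of NLTK's list, and the helper names are made up for illustration.

```python
# Hypothetical, tiny stop-word set standing in for NLTK's lowercase English list.
STOPWORDS = {"i", "it", "the", "this"}

def filter_case_sensitive(tokens):
    # Same order of operations as tokenization_and_stemming:
    # membership is tested on the raw token, then the token is lowercased.
    return [t.lower() for t in tokens if t not in STOPWORDS]

def filter_case_insensitive(tokens):
    # Lowercase first, then test membership, so "I" and "The" are dropped too.
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

tokens = ["I", "wear", "it", "The", "band"]
print(filter_case_sensitive(tokens))    # ['i', 'wear', 'the', 'band']
print(filter_case_insensitive(tokens))  # ['wear', 'band']
```

Whether the stricter filter is preferable depends on the analysis; it would change the stems and the vocabulary shown in the rest of this notebook.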

Use our defined functions to analyze (i.e. tokenize, stem) our reviews.

In [0]:
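# 1. tokenize and stem every document
# 2. tokenize every document without stemming
# The goal is to map each stemmed word back to an original token for result interpretation.

# Assumed helper: tokenization() is taken to mirror tokenization_and_stemming()
# without the stemming step, so the two lists stay aligned position by position.
def tokenization(text):
    tokens = [word.lower() for word in nltk.word_tokenize(text) if word not in stopwords]
    return [token for token in tokens if re.search('[a-zA-Z]', token)]

docs_stemmed = []
docs_tokenized = []
for i in data:
    tokenized_and_stemmed_results = tokenization_and_stemming(i)
    docs_stemmed.extend(tokenized_and_stemmed_results)

    tokenized_results = tokenization(i)
    docs_tokenized.extend(tokenized_results)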

In [44]:
# create a mapping from stemmed words to original words
vocab_frame_dict = {docs_stemmed[x]:docs_tokenized[x] for x in range(len(docs_stemmed))}
vocab_frame_dict

{'absolut': 'absolutely',
 'love': 'loved',
 'watch': 'watch',
 'get': 'get',
 'compliment': 'compliments',
 'almost': 'almost',
 'everi': 'every',
 'time': 'time',
 'i': 'i',
 'wear': 'wear',
 'dainti': 'dainty',
 'keep': 'keep',
 'wonder': 'wonderful',
 'scratch': 'scratches',
 'it': 'it',
 'work': 'work',
 'well': 'well',
 'howev': 'however',
 'found': 'found',
 'cheaper': 'cheaper',
 'price': 'price',
 'place': 'place',
 'make': 'make',
 'purchas': 'purchased',
 'beauti': 'beautiful',
 'face': 'face',
 'the': 'the',
 'band': 'band',
 'look': 'looks',
 'nice': 'nice',
 'around': 'around',
 'link': 'links',
 'squeaki': 'squeaky',
 'cheapo': 'cheapo',
 'nois': 'noise',
 'swing': 'swing',
 'back': 'back',
 'forth': 'forth',
 'wrist': 'wrist',
 'embarrass': 'embarrassing',
 'front': 'front',
 'enthusiast': 'enthusiasts',
 'nake': 'naked',
 'eye': 'eyes',
 'afar': 'afar',
 'ca': 'ca',
 "n't": "n't",
 'tell': 'tell',
 'cheap': 'cheap',
 'fold': 'folds',
 'polish': 'polishing',
 'brush': '
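
Because a dict keeps only the last value assigned to a key, the comprehension above maps each stem to the last original token seen with that stem, which is why `'love'` maps to `'loved'`. One possible alternative is to keep the most frequent surface form instead, on the assumption that it makes a more readable label. The sketch below uses made-up lists and a hypothetical helper name.

```python
from collections import Counter, defaultdict

def most_common_form(stems, tokens):
    """Map each stem to the original token that occurs most often with it."""
    counts = defaultdict(Counter)
    for stem, token in zip(stems, tokens):
        counts[stem][token] += 1
    # most_common(1) returns [(token, count)]; ties keep first-seen order.
    return {stem: c.most_common(1)[0][0] for stem, c in counts.items()}

# Made-up aligned lists, in the same shape as docs_stemmed / docs_tokenized.
stems = ["love", "watch", "love", "love", "look"]
tokens = ["love", "watch", "love", "loved", "looks"]

last_seen = dict(zip(stems, tokens))            # same behavior as the comprehension
print(last_seen["love"])                        # loved
print(most_common_form(stems, tokens)["love"])  # love
```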

# Part 3: TF-IDF

In [0]:
# define vectorizer parameters
# Here I set the minimum document frequency to 0.01 and the maximum document frequency to 0.99, and use the built-in English stop words. For this project, I use unigrams only.
# The model keeps at most 1000 terms.
tfidf_model = TfidfVectorizer(max_df=0.99, max_features=1000,
                                 min_df=0.01, stop_words='english',
                                 use_idf=True, tokenizer=tokenization_and_stemming, ngram_range=(1,1))

tfidf_matrix = tfidf_model.fit_transform(data) # fit the vectorizer to the reviews

print ("In total, there are " + str(tfidf_matrix.shape[0]) + \
      " reviews and " + str(tfidf_matrix.shape[1]) + " terms.")

Save the terms identified by TF-IDF.

In [0]:
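# terms kept by the vectorizer
# (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names)
tf_selected_words = tfidf_model.get_feature_names_out().tolist()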

In [0]:
# print out words
tf_selected_words
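
To make the TF-IDF weighting used in this part concrete, the sketch below computes it by hand for a toy corpus. It assumes scikit-learn's default settings: a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The toy documents and the `tfidf` helper are made up, and this is an illustration of the formula rather than the library's implementation.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights for tokenized docs."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Made-up, already tokenized documents.
docs = [["love", "watch"], ["watch", "band", "cheap"], ["love", "band"]]
vocab, rows = tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in zip(vocab, row) if w > 0})
```

In the second toy document, "cheap" appears in only one document and therefore receives a larger weight than "watch" or "band", which each appear in two.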

# Part 4: K-means clustering

In [0]:
# k-means clustering
from sklearn.cluster import KMeans

# number of clusters
num_clusters = 5

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

# cluster id assigned to each review
clusters = km.labels_.tolist()

In [25]:
tfidf_matrix

<1000x245 sparse matrix of type '<class 'numpy.float64'>'
	with 7777 stored elements in Compressed Sparse Row format>
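
For intuition about what `km.fit` does with this matrix, here is a bare-bones sketch of Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The 2D points and starting centers are made up, and this is not scikit-learn's implementation, which among other things uses k-means++ initialization by default.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations; `centroids` are hand-picked starting centers."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's center in place
        if new_centroids == centroids:
            break  # converged: centers no longer move
        centroids = new_centroids
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```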

## 4.1. Analyze K-means Result

In [0]:
# create a DataFrame pairing each review's product title with its cluster id
# (the 'review' column holds the product title, not the review text)
product = { 'review': df[:1000].product_title, 'cluster': clusters}
frame = pd.DataFrame(product, columns = ['review', 'cluster'])

In [0]:
frame.head(10) # reviews with cluster id assigned

In [0]:
print ("Number of reviews included in each cluster:")
frame['cluster'].value_counts().to_frame()

In [0]:
# mean tf-idf weight of each feature within each cluster
# The purpose here is to find the 6 most important words for each cluster, i.e. the 6 words with the highest mean tf-idf in that cluster
km.cluster_centers_ 
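
Each row of `km.cluster_centers_` holds the mean TF-IDF weight of every term within one cluster, so the largest entries in a row point to that cluster's most characteristic terms. The ranking step can be sketched as follows, with a made-up vocabulary, made-up centroid values, and a hypothetical helper name.

```python
def top_terms(centroid, vocabulary, n=3):
    """Return the n terms with the highest centroid weight, largest first."""
    ranked = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [vocabulary[j] for j in ranked[:n]]

# Made-up vocabulary and centroids (one row per cluster, one column per term).
vocabulary = ["band", "gift", "love", "price", "watch"]
centroids = [
    [0.02, 0.40, 0.35, 0.01, 0.20],
    [0.30, 0.01, 0.05, 0.45, 0.25],
]
for i, row in enumerate(centroids):
    print(i, top_terms(row, vocabulary))
# 0 ['gift', 'love', 'watch']
# 1 ['price', 'band', 'watch']
```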

In [56]:
print ("<Document clustering result by K-means>")

order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

Cluster_keywords_summary = {}
for i in range(num_clusters):
    print ("Cluster " + str(i) + " words:", end='')
    Cluster_keywords_summary[i] = []
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        Cluster_keywords_summary[i].append(vocab_frame_dict[tf_selected_words[ind]])
        print (vocab_frame_dict[tf_selected_words[ind]] + ",", end='')
    print ()
    
    cluster_reviews = frame[frame.cluster==i].review.tolist()
    print ("Cluster " + str(i) + " reviews (" + str(len(cluster_reviews)) + " reviews): ")
    print (", ".join(cluster_reviews))
    print ()

<Document clustering result by K-means>
Cluster 0 words:loved,watch,wife,looks,husband,'s,
Cluster 0 reviews (122 reviews): 
Invicta Women's 15150 "Angel" 18k Yellow Gold Ion-Plated Stainless Steel and Brown Leather Watch, Kenneth Cole New York Women's KC4944 Automatic Silver Automatic Mesh Bracelet Analog Watch, Casio Men's GW-9400BJ-1JF G-Shock Master of G Rangeman Digital Solar Black Carbon Fiber Insert Watch, Casio - G-Shock - Gulfmaster - Black - GWN1000C-1A, Domire Fashion Accessories Trial Order New Quartz Fashion Weave Wrap Around Leather Bracelet Lady Woman Butterfly Wrist Watch, Casio Men's Slim Solar Multi-Function Analog-Digital Watch, Batman Kids' BAT4072 Black Rubber Batman Logo Strap Watch, Timex Easy Reader Day-Date Leather Strap Watch, Casio F108WH Water Resistant Digital Blue Resin Strap Watch, Stuhrling Original Women's 956.02 Symphony Gold-Tone Watch with Brown Genuine Leather Band, Seiko Men's SNKK27 Seiko 5 Stainless Steel Automatic Watch, Swiss Legend Women's 110

# Part 5: Topic Modeling - Latent Dirichlet Allocation


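LDA is a probabilistic model of text used to find topics that describe a corpus. It trades off two conflicting goals:

* For each document, allocate its words to as few topics as possible.
* For each topic, assign high probability to as few terms as possible.

The goals conflict because a document that uses a single topic needs that topic to give probability to all of its words, while a topic that covers only a few terms forces a document to draw on many topics. Trading off these goals finds groups of tightly co-occurring words. A correlated topic model, which lets topic proportions co-vary, might also work here.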
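
The generative story behind LDA can be sketched in a few lines: draw a topic mixture for the document from a Dirichlet prior, then, for each word, draw a topic from that mixture and a word from the chosen topic's distribution. The vocabulary, topic-word probabilities, and `alpha` values below are made up for illustration, and the Dirichlet draw is done by normalizing gamma variates from the standard library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocabulary, alpha, n_words, rng):
    """Generate one document following LDA's generative process."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocabulary, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

vocabulary = ["gift", "wife", "love", "band", "broke", "return"]
topic_word = [
    [0.35, 0.30, 0.30, 0.03, 0.01, 0.01],  # a made-up "gift" topic
    [0.02, 0.02, 0.06, 0.30, 0.30, 0.30],  # a made-up "defect" topic
]
rng = random.Random(0)
# alpha values below 1 favor mixtures dominated by few topics.
theta, words = generate_document(topic_word, vocabulary, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(t, 3) for t in theta], words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and topic-word distributions that plausibly generated them.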



In [0]:
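# Use LDA for topic modeling
# note: 500 topics is far more than 1,000 short reviews can support; many topics stay at their prior (see the repeated rows in the outputs below)
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=500)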

In [49]:
from sklearn.feature_extraction.text import CountVectorizer
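# LDA works on raw term counts (integers), so use CountVectorizer instead of TF-IDF weights
# (despite the tfidf_ prefix, the two variables below hold a count vectorizer and a count matrix)
tfidf_model_lda = CountVectorizer(max_df=0.99, max_features=500,
                                 min_df=0.01, stop_words='english',
                                 tokenizer=tokenization_and_stemming, ngram_range=(1,1))

tfidf_matrix_lda = tfidf_model_lda.fit_transform(data) # fit the vectorizer to the reviews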

print ("In total, there are " + str(tfidf_matrix_lda.shape[0]) + \
      " reviews and " + str(tfidf_matrix_lda.shape[1]) + " terms.")


In total, there are 1000 reviews and 245 terms.


In [50]:
# document topic matrix for tfidf_matrix_lda
lda_output = lda.fit_transform(tfidf_matrix_lda)
print(lda_output.shape)
print(lda_output)

(1000, 500)
[[0.00025    0.00025    0.00025    ... 0.00025    0.00025    0.00025   ]
 [0.0005     0.0005     0.0005     ... 0.0005     0.0005     0.0005    ]
 [0.002      0.002      0.002      ... 0.002      0.002      0.002     ]
 ...
 [0.001      0.001      0.001      ... 0.001      0.001      0.001     ]
 [0.0005     0.0005     0.0005     ... 0.0005     0.0005     0.0005    ]
 [0.00033333 0.00033333 0.00033333 ... 0.00033333 0.00033333 0.00033333]]


In [51]:
# topics and words matrix
topic_word = lda.components_
print(topic_word.shape)
print(topic_word)

(500, 245)
[[2.00000000e-03 2.00000000e-03 2.00000000e-03 ... 2.00000000e-03
  2.00000000e-03 2.00000000e-03]
 [2.00000000e-03 2.00000000e-03 2.00000000e-03 ... 2.00000000e-03
  2.00000000e-03 2.00000000e-03]
 [2.00000000e-03 2.00000000e-03 2.00000000e-03 ... 2.00000000e-03
  2.00000000e-03 2.00000000e-03]
 ...
 [2.00000000e-03 2.00200000e+00 2.00000000e-03 ... 2.00000000e-03
  2.00000000e-03 1.00200000e+00]
 [2.00000000e-03 2.00000000e-03 2.00000000e-03 ... 2.00000000e-03
  2.00000000e-03 2.00000000e-03]
 [2.00000000e-03 1.07061859e+00 2.00000000e-03 ... 2.00000000e-03
  1.10400242e+00 2.00000000e-03]]


In [57]:
# column names
topic_names = ["Topic" + str(i) for i in range(lda.n_components)]

# index names
doc_names = ["Doc" + str(i) for i in range(len(data))]

df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topic_names, index=doc_names)

# get dominant topic for each document
topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['topic'] = topic

df_document_topic.head()

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,Topic10,Topic11,Topic12,Topic13,Topic14,Topic15,Topic16,Topic17,Topic18,Topic19,Topic20,Topic21,Topic22,Topic23,Topic24,Topic25,Topic26,Topic27,Topic28,Topic29,Topic30,Topic31,Topic32,Topic33,Topic34,Topic35,Topic36,Topic37,Topic38,Topic39,...,Topic461,Topic462,Topic463,Topic464,Topic465,Topic466,Topic467,Topic468,Topic469,Topic470,Topic471,Topic472,Topic473,Topic474,Topic475,Topic476,Topic477,Topic478,Topic479,Topic480,Topic481,Topic482,Topic483,Topic484,Topic485,Topic486,Topic487,Topic488,Topic489,Topic490,Topic491,Topic492,Topic493,Topic494,Topic495,Topic496,Topic497,Topic498,Topic499,topic
Doc0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28
Doc1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,450
Doc2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
Doc3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,238
Doc4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,355


In [53]:
df_document_topic['topic'].value_counts().to_frame()

Unnamed: 0,topic
280,43
134,43
285,40
450,31
0,23
...,...
336,1
152,1
154,1
335,1
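
One caveat about the dominant-topic step: `np.argmax` returns the first index when all values tie, so a review whose topic distribution is flat (for example, the row of 0.002 = 1/500 values in `lda_output`, which corresponds to Doc2 being labeled topic 0) lands in topic 0 by default. The sketch below, on made-up rows, marks such documents as unassigned instead. The helper name and the `1e-9` tolerance are assumptions.

```python
def dominant_topic(row, tol=1e-9):
    """Return the index of the largest probability, or None if the row is flat."""
    if max(row) - min(row) <= tol:
        return None  # uniform row: no topic actually dominates
    return max(range(len(row)), key=lambda k: row[k])

doc_topic = [
    [0.05, 0.80, 0.15],     # clearly topic 1
    [1 / 3, 1 / 3, 1 / 3],  # flat: a plain argmax would report topic 0
]
print([dominant_topic(row) for row in doc_topic])  # [1, None]
```

Rounding the matrix before taking the argmax, as done above, can create additional ties, so applying the check to the unrounded values is probably the safer choice.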


In [0]:
# topic-word matrix
print(lda.components_)
df_topic_words = pd.DataFrame(lda.components_)

# column and index
# (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names)
df_topic_words.columns = tfidf_model_lda.get_feature_names_out()
df_topic_words.index = topic_names

df_topic_words.head()

In [55]:
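# collect the top n keywords for each topic
def print_topic_words(tfidf_model, lda_model, n_words):
    # vocabulary as an array so it can be indexed with the sorted positions
    # (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names)
    words = np.array(tfidf_model.get_feature_names_out())
    topic_words = []
    # each row of components_ holds one topic's weight for every word
    for topic_words_weights in lda_model.components_:
        # positions of the n_words largest weights, largest first
        top_words = topic_words_weights.argsort()[::-1][:n_words]
        topic_words.append(words.take(top_words))
    return topic_words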

topic_keywords = print_topic_words(tfidf_model=tfidf_model_lda, lda_model=lda, n_words=15)        

df_topic_words = pd.DataFrame(topic_keywords)
df_topic_words.columns = ['Word '+str(i) for i in range(df_topic_words.shape[1])]
df_topic_words.index = ['Topic '+str(i) for i in range(df_topic_words.shape[0])]
df_topic_words

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
Topic 0,year,gave,fit,finish,fine,fell,feel,featur,fast,far,fair,face,expens,expect,excel
Topic 1,year,gave,fit,finish,fine,fell,feel,featur,fast,far,fair,face,expens,expect,excel
Topic 2,valu,great,watch,look,n't,purchas,hope,amaz,better,qualiti,deal,invicta,fast,far,excel
Topic 3,disappoint,open,watch,finish,replac,purchas,pin,comfort,somewhat,came,dress,expens,fell,feel,featur
Topic 4,look,watch,gift,got,featur,big,heavi,husband,nice,button,lot,dress,love,fair,finish
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Topic 495,year,gave,fit,finish,fine,fell,feel,featur,fast,far,fair,face,expens,expect,excel
Topic 496,year,gave,fit,finish,fine,fell,feel,featur,fast,far,fair,face,expens,expect,excel
Topic 497,watch,'s,easi,year,wear,display,say,comfort,small,use,ve,like,amaz,dial,fit
Topic 498,year,gave,fit,finish,fine,fell,feel,featur,fast,far,fair,face,expens,expect,excel
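
Many rows in the table above list identical keywords. A likely reason, consistent with the `lda.components_` output where the visible entries of several rows all equal 0.002, is that those topics never moved away from the prior: all of their word weights tie, so the "top 15" words are arbitrary. The sketch below counts the topics that carry some signal, using made-up weights; the helper name and tolerance are assumptions.

```python
def informative_topics(topic_word, tol=1e-6):
    """Return indices of topics whose word weights are not all (nearly) equal."""
    return [
        k for k, weights in enumerate(topic_word)
        if max(weights) - min(weights) > tol
    ]

# Made-up topic-word weights: topics 0 and 2 sit at a flat prior of 0.002.
topic_word = [
    [0.002, 0.002, 0.002, 0.002],
    [0.002, 2.002, 0.002, 1.002],
    [0.002, 0.002, 0.002, 0.002],
]
print(informative_topics(topic_word))  # [1]
```

Applied to the real `lda.components_`, the number of informative topics could serve as a rough hint for choosing a smaller `n_components`.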
