# Document Clustering and Topic Modeling

In this simple NLP project, we use unsupervised learning models to cluster unlabeled documents into groups, visualize the results, and identify their latent topics.

## Contents

* [Part 1: Load Data](#Part-1:-Load-Data)
* [Part 2: Tokenizing and Stemming](#Part-2:-Tokenizing-and-Stemming)
* [Part 3: TF-IDF](#Part-3:-TF-IDF)
* [Part 4: K-means clustering](#Part-4:-K-means-clustering)
* [Part 5: Topic Modeling - Latent Dirichlet Allocation](#Part-5:-Topic-Modeling---Latent-Dirichlet-Allocation)


# Part 1: Load Data

In [2]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import nltk
import gensim
# REGULAR EXPRESSION
import re
import os

from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yinruideng/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yinruideng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
## Load data into dataframe
## The original dataset is about 1 gigabyte, so I simplified the 
## data to contain 1000 reviews.
# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) is the equivalent
df = pd.read_csv('review_simplified.csv', header=0, on_bad_lines='skip')

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,0,US,3653882,R3O9SGZBVQBV76,B00FALQ1ZC,937001370,"Invicta Women's 15150 ""Angel"" 18k Yellow Gold ...",Watches,5,0,0,N,Y,Five Stars,Absolutely love this watch! Get compliments al...,2015-08-31
1,1,US,14661224,RKH8BNC3L5DLF,B00D3RGO20,484010722,Kenneth Cole New York Women's KC4944 Automatic...,Watches,5,0,0,N,Y,I love thiswatch it keeps time wonderfully,I love this watch it keeps time wonderfully.,2015-08-31
2,2,US,27324930,R2HLE8WKZSU3NL,B00DKYC7TK,361166390,Ritche 22mm Black Stainless Steel Bracelet Wat...,Watches,2,1,1,N,Y,Two Stars,Scratches,2015-08-31
3,3,US,7211452,R31U3UH5AZ42LL,B000EQS1JW,958035625,Citizen Men's BM8180-03E Eco-Drive Stainless S...,Watches,5,0,0,N,Y,Five Stars,"It works well on me. However, I found cheaper ...",2015-08-31
4,4,US,12733322,R2SV659OUJ945Y,B00A6GFD7S,765328221,Orient ER27009B Men's Symphony Automatic Stain...,Watches,4,0,0,N,Y,"Beautiful face, but cheap sounding links",Beautiful watch face. The band looks nice all...,2015-08-31


In [5]:
# Remove rows whose review text is missing
df.dropna(subset=['review_body'], inplace=True)

In [6]:
# use the first 1000 data as our training data
data = df.loc[:1000, 'review_body'].tolist()


# Part 2: Tokenizing and Stemming

Load the stop-word list and the stemmer from the NLTK library.
Stop words are words like "a", "the", or "in" which don't convey significant meaning.
Stemming is the process of reducing a word to its root form (for example, "compliments" becomes "compliment").

In [7]:
# Use nltk's English stopwords.
stopwords = nltk.corpus.stopwords.words('english')

print ("We use " + str(len(stopwords)) + " stop-words from nltk library.")
print (stopwords[:10])

We use 179 stop-words from nltk library.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


Here are two helper functions:
* `tokenization_and_stemming`: tokenizes and stems.
* `tokenization`: tokenizes only.

In [8]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# tokenization and stemming
def tokenization_and_stemming(text):
    # tokenize into lowercase words, dropping stop words
    # note: the stop-word check sees the original-case token, so capitalized
    # stop words such as 'I' or 'The' are kept
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent) if word not in stopwords]

    filtered_tokens = []
    
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
            
    # stemming
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

# tokenization without stemming
def tokenization(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent) if word not in stopwords]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
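
The helpers above test each token against the stop-word list before lowercasing it, which is why entries such as 'i' and 'the' survive in the outputs. Below is a minimal sketch of the difference between that order and a lowercase-first check. The tiny `STOPWORDS` set and the regex tokenizer are stand-ins chosen to keep the example dependency-free; they are assumptions, not the NLTK resources the notebook uses.

```python
import re

# Stand-in stop-word set (assumption); the notebook uses nltk's English list.
STOPWORDS = {"i", "the", "it", "a", "in"}

def simple_tokens(text):
    """Rough regex tokenizer used only for this sketch."""
    return re.findall(r"[A-Za-z']+", text)

def filter_case_sensitive(text):
    # Mirrors the notebook's order: stop-word test first, lowercase afterwards.
    return [w.lower() for w in simple_tokens(text) if w not in STOPWORDS]

def filter_lowercase_first(text):
    # Lowercase before the stop-word test so "The" and "I" are caught too.
    return [w.lower() for w in simple_tokens(text) if w.lower() not in STOPWORDS]

sample = "The band looks nice and I wear it"
kept_original = filter_case_sensitive(sample)
kept_fixed = filter_lowercase_first(sample)
print(kept_original)
print(kept_fixed)
```

Switching the notebook's helpers to the lowercase-first check would change the vocabulary and therefore the outputs shown in later cells.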

In [9]:
# tokenization and stemming
tokenization_and_stemming(data[0])

['absolut',
 'love',
 'watch',
 'get',
 'compliment',
 'almost',
 'everi',
 'time',
 'i',
 'wear',
 'dainti']

Use our defined functions to analyze (i.e. tokenize, stem) our reviews.

In [10]:
# 1. do tokenization and stemming for all the documents
# 2. also just do tokenization for all the documents
# the goal is to create a mapping from stemmed words to original tokenized words for result interpretation.
docs_stemmed = []
docs_tokenized = []
for i in data:
    tokenized_and_stemmed_results = tokenization_and_stemming(i)
    docs_stemmed.extend(tokenized_and_stemmed_results)
    
    tokenized_results = tokenization(i)
    docs_tokenized.extend(tokenized_results)

In [11]:
# create a mapping from stemmed words to original words
vocab_frame_dict = {docs_stemmed[x]:docs_tokenized[x] for x in range(len(docs_stemmed))}
(vocab_frame_dict)

{'absolut': 'absolutely',
 'love': 'loved',
 'watch': 'watch',
 'get': 'get',
 'compliment': 'compliments',
 'almost': 'almost',
 'everi': 'every',
 'time': 'time',
 'i': 'i',
 'wear': 'wear',
 'dainti': 'dainty',
 'keep': 'keep',
 'wonder': 'wonderful',
 'scratch': 'scratches',
 'it': 'it',
 'work': 'work',
 'well': 'well',
 'howev': 'however',
 'found': 'found',
 'cheaper': 'cheaper',
 'price': 'price',
 'place': 'place',
 'make': 'make',
 'purchas': 'purchased',
 'beauti': 'beautiful',
 'face': 'face',
 'the': 'the',
 'band': 'band',
 'look': 'looks',
 'nice': 'nice',
 'around': 'around',
 'link': 'links',
 'squeaki': 'squeaky',
 'cheapo': 'cheapo',
 'nois': 'noise',
 'swing': 'swing',
 'back': 'back',
 'forth': 'forth',
 'wrist': 'wrist',
 'embarrass': 'embarrassing',
 'front': 'front',
 'enthusiast': 'enthusiasts',
 'nake': 'naked',
 'eye': 'eyes',
 'afar': 'afar',
 'ca': 'ca',
 "n't": "n't",
 'tell': 'tell',
 'cheap': 'cheap',
 'fold': 'folds',
 'polish': 'polishing',
 'brush': '

# Part 3: TF-IDF

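TF: Term Frequency (how often a term occurs in a document)

IDF: Inverse Document Frequency (down-weights terms that occur in many documents)

***Example:***

document 1: "Arthur da Jason"

document 2: "Jason da da huang"

With the vocabulary [Arthur, da, Jason, huang] and the simplified weight tf / df (the term's count in the document divided by the number of documents containing it):

document 1: [1, 0.5, 0.5, 0];  document 2: [0, 1, 0.5, 1]

These weights are illustrative only; `TfidfVectorizer` uses a smoothed logarithmic IDF and L2-normalizes each row, so its values differ.

2-gram (bigram):

document 1: Arthur da, da Jason;  document 2: Jason da, da da, da huang

3-gram (trigram):

document 1: Arthur da Jason;  document 2: Jason da da, da da huang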
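
The toy example above can be reproduced in a few lines of plain Python. This sketch assumes the weight is tf / df (term count divided by the number of documents containing the term), since that is what yields the listed vectors; it is not the formula scikit-learn applies. The names `ngrams`, `toy_weights`, and `doc_freq` are made up for illustration.

```python
from collections import Counter

docs = ["Arthur da Jason", "Jason da da huang"]
vocab = ["Arthur", "da", "Jason", "huang"]

def ngrams(tokens, n):
    """Return every contiguous window of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokenized = [d.split() for d in docs]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def toy_weights(tokens):
    # Simplified weight tf / df for each vocabulary term (0 if the term is absent).
    tf = Counter(tokens)
    return [tf[t] / doc_freq[t] for t in vocab]

weights = [toy_weights(t) for t in tokenized]
bigrams = [ngrams(t, 2) for t in tokenized]
trigrams = [ngrams(t, 3) for t in tokenized]

print(weights)   # one weight vector per document
print(bigrams)
print(trigrams)
```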

In [12]:
# define vectorizer parameters
# TfidfVectorizer will help us to create tf-idf matrix
# max_df : maximum document frequency for the given word
# min_df : minimum document frequency for the given word
# max_features: maximum number of words
# use_idf: if not true, we only calculate tf
# stop_words : built-in stop words
# tokenizer: how to tokenize the document
# ngram_range: (min_value, max_value), eg. (1, 3) means the result will include 1-gram, 2-gram, 3-gram
tfidf_model = TfidfVectorizer(max_df=0.99, max_features=1000,
                                 min_df=0.01, stop_words='english',
                                 use_idf=True, tokenizer=tokenization_and_stemming, ngram_range=(1,1))

tfidf_matrix = tfidf_model.fit_transform(data)  # fit the vectorizer to the reviews

print ("In total, there are " + str(tfidf_matrix.shape[0]) + \
      " reviews and " + str(tfidf_matrix.shape[1]) + " terms.")

In total, there are 999 reviews and 245 terms.


In [13]:
# check the parameters
tfidf_model.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 0.99,
 'max_features': 1000,
 'min_df': 0.01,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': 'english',
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': <function __main__.tokenization_and_stemming(text)>,
 'use_idf': True,
 'vocabulary': None}

Save the terms identified by TF-IDF.

In [14]:
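# terms (stemmed tokens) kept by the vectorizer
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_selected_words = tfidf_model.get_feature_names_out().tolist()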
print(tfidf_matrix)

  (0, 233)	0.3384164830529139
  (0, 221)	0.2913189484763953
  (0, 72)	0.484500633709986
  (0, 49)	0.46322326733332947
  (0, 230)	0.16534921069822942
  (0, 124)	0.2614715212707558
  (0, 3)	0.5055523483064737
  (1, 221)	0.685552290396239
  (1, 230)	0.3891114213552997
  (1, 124)	0.6153132201597241
  (3, 165)	0.3821208029236373
  (3, 126)	0.40943116621421516
  (3, 156)	0.5262736640331138
  (3, 162)	0.33627953034780667
  (3, 105)	0.45123755233282153
  (3, 240)	0.3044447264847374
  (4, 57)	0.15318142128849724
  (4, 93)	0.11064443182646445
  (4, 112)	0.22132632688591347
  (4, 81)	0.1748324461017668
  (4, 234)	0.16473736215364093
  (4, 228)	0.17843208926962345
  (4, 95)	0.10773397375148983
  (4, 136)	0.4030879419998217
  (4, 29)	0.2951346090793672
  :	:
  (994, 230)	0.22676097118144217
  (995, 12)	0.20029683517925337
  (995, 120)	0.20982895208613636
  (995, 67)	0.22211798813702993
  (995, 102)	0.23165010504391287
  (995, 151)	0.23536700662054022
  (995, 195)	0.22823088704458447
  (995, 210)	0.

In [15]:
# print out words
tf_selected_words

["'m",
 "'s",
 'abl',
 'absolut',
 'accur',
 'actual',
 'adjust',
 'alarm',
 'alreadi',
 'alway',
 'amaz',
 'amazon',
 'anoth',
 'arm',
 'arriv',
 'automat',
 'awesom',
 'bad',
 'band',
 'batteri',
 'beauti',
 'best',
 'better',
 'big',
 'bit',
 'black',
 'blue',
 'bought',
 'box',
 'br',
 'bracelet',
 'brand',
 'break',
 'bright',
 'broke',
 'button',
 'buy',
 'ca',
 'came',
 'case',
 'casio',
 'chang',
 'cheap',
 'clasp',
 'classi',
 'clock',
 'color',
 'come',
 'comfort',
 'compliment',
 'cool',
 'cost',
 'crown',
 'crystal',
 'dark',
 'date',
 'daughter',
 'day',
 'deal',
 'definit',
 'deliveri',
 'design',
 'dial',
 'differ',
 'difficult',
 'disappoint',
 'display',
 'dress',
 'durabl',
 'easi',
 'easili',
 'end',
 'everi',
 'everyday',
 'everyth',
 'exact',
 'excel',
 'expect',
 'expens',
 'face',
 'fair',
 'far',
 'fast',
 'featur',
 'feel',
 'fell',
 'fine',
 'finish',
 'fit',
 'function',
 'gave',
 'gift',
 'gold',
 'good',
 'got',
 'great',
 'hand',
 'happi',
 'hard',
 'heavi

# (Optional) Calculate Document Similarity

In [16]:
# use cosine similarity to check the similarity for two documents
from sklearn.metrics.pairwise import cosine_similarity
cos_matrix = cosine_similarity(tfidf_matrix)
print (cos_matrix)

[[1.         0.42494052 0.         ... 0.44653381 0.         0.04002091]
 [0.42494052 1.         0.         ... 0.44725476 0.         0.09418003]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.44653381 0.44725476 0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.04002091 0.09418003 0.         ... 0.         0.         1.        ]]
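
As a rough illustration of what each entry in the matrix above measures, here is a dependency-free sketch of cosine similarity between two vectors. The `cosine` helper and the sample vectors are made up for this example. Returning 0 for a zero vector is a choice made here; it matches the all-zero row in the output above, which most likely belongs to a review with no terms left in the vocabulary.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # shares no non-zero component with a
zero = [0.0, 0.0, 0.0]

print(cosine(a, b))
print(cosine(a, c))
print(cosine(zero, zero))
```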


# Part 4: K-means clustering

In [17]:
# k-means clustering
from sklearn.cluster import KMeans

num_clusters = 5

# number of clusters
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()
print(clusters)

[0, 0, 2, 2, 2, 0, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 2, 2, 2, 2, 3, 2, 1, 2, 2, 4, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 0, 3, 2, 2, 4, 0, 2, 0, 2, 2, 0, 2, 1, 0, 2, 2, 2, 2, 0, 2, 1, 2, 2, 2, 4, 2, 4, 2, 0, 0, 2, 2, 2, 2, 2, 1, 2, 0, 2, 2, 2, 2, 4, 4, 4, 1, 1, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 4, 1, 2, 2, 0, 4, 2, 2, 2, 2, 2, 4, 1, 2, 2, 2, 0, 2, 0, 2, 2, 1, 2, 2, 2, 3, 0, 0, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 2, 2, 2, 4, 2, 2, 2, 1, 4, 2, 2, 2, 2, 2, 2, 1, 2, 4, 0, 0, 3, 2, 4, 4, 2, 4, 2, 0, 2, 2, 1, 2, 2, 3, 2, 3, 2, 3, 0, 4, 2, 1, 2, 2, 1, 2, 1, 2, 4, 2, 2, 2, 2, 1, 2, 0, 4, 2, 4, 3, 2, 2, 2, 2, 2, 3, 4, 3, 2, 2, 0, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 4, 2, 1, 2, 2, 3, 4, 2, 1, 2, 3, 2, 2, 1, 2, 2, 2, 1, 3, 2, 2, 0, 3, 2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 0, 0, 3, 2, 2, 2, 1, 2, 2, 2, 2, 4, 0, 2, 2, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 2, 1, 2, 4, 1, 2, 2, 2, 3, 2, 2, 0, 1, 2, 2, 0, 1, 1, 0, 2, 2, 4, 2, 2, 0, 2, 1, 2, 2, 0, 2, 2, 3, 2, 1, 2, 2, 0, 2, 2, 4, 4, 2, 0, 4, 2, 
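
The cell above delegates the clustering to scikit-learn. For intuition, the assign-then-update loop behind k-means can be sketched in plain Python as below. This is a toy version under stated assumptions (fixed starting centroids, squared Euclidean distance, made-up 2-D points), not the implementation `KMeans` uses.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, n_iter=10):
    """Alternate assignment and update steps until the centroids stop moving."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]   # made-up data
labels, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels, centroids)
```

Because real k-means starts from random centroids, cluster numbering and sizes can change between runs unless a seed is fixed.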

## 4.1. Analyze K-means Result

In [18]:
# build a DataFrame pairing each review's product title with its cluster label
product = { 'review': df[:999].product_title, 'cluster': clusters}
frame = pd.DataFrame(product, columns = ['review', 'cluster'])

In [19]:
frame.head(10)

Unnamed: 0,review,cluster
0,"Invicta Women's 15150 ""Angel"" 18k Yellow Gold ...",0
1,Kenneth Cole New York Women's KC4944 Automatic...,0
2,Ritche 22mm Black Stainless Steel Bracelet Wat...,2
3,Citizen Men's BM8180-03E Eco-Drive Stainless S...,2
4,Orient ER27009B Men's Symphony Automatic Stain...,2
5,Casio Men's GW-9400BJ-1JF G-Shock Master of G ...,0
6,Fossil Women's ES3851 Urban Traveler Multifunc...,1
7,INFANTRY Mens Night Vision Analog Quartz Wrist...,2
8,G-Shock Men's Grey Sport Watch,1
9,Heiden Quad Watch Winder in Black Leather,2


In [20]:
print ("Number of reviews included in each cluster:")
frame['cluster'].value_counts().to_frame()

Number of reviews included in each cluster:


Unnamed: 0,cluster
2,650
0,113
1,98
4,75
3,63


In [21]:
print ("<Document clustering result by K-means>")

# km.cluster_centers_ holds the weight of each term in each centroid.
# Sort the term indices by weight in decreasing order so we can take the top k terms.
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

Cluster_keywords_summary = {}
for i in range(num_clusters):
    print ("Cluster " + str(i) + " words:", end='')
    Cluster_keywords_summary[i] = []
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        Cluster_keywords_summary[i].append(vocab_frame_dict[tf_selected_words[ind]])
        print (vocab_frame_dict[tf_selected_words[ind]] + ",", end='')
    print ()
    
    cluster_reviews = frame[frame.cluster==i].review.tolist()
    print ("Cluster " + str(i) + " reviews (" + str(len(cluster_reviews)) + " reviews): ")
    print (", ".join(cluster_reviews))
    print ()

<Document clustering result by K-means>
Cluster 0 words:loved,watch,wife,husband,looks,'s,
Cluster 0 reviews (113 reviews): 
Invicta Women's 15150 "Angel" 18k Yellow Gold Ion-Plated Stainless Steel and Brown Leather Watch, Kenneth Cole New York Women's KC4944 Automatic Silver Automatic Mesh Bracelet Analog Watch, Casio Men's GW-9400BJ-1JF G-Shock Master of G Rangeman Digital Solar Black Carbon Fiber Insert Watch, Domire Fashion Accessories Trial Order New Quartz Fashion Weave Wrap Around Leather Bracelet Lady Woman Butterfly Wrist Watch, Batman Kids' BAT4072 Black Rubber Batman Logo Strap Watch, Timex Easy Reader Day-Date Leather Strap Watch, Casio F108WH Water Resistant Digital Blue Resin Strap Watch, Stuhrling Original Women's 956.02 Symphony Gold-Tone Watch with Brown Genuine Leather Band, Seiko Men's SNKK27 Seiko 5 Stainless Steel Automatic Watch, Swiss Legend Women's 11044D-01 Neptune Black Dial Watch with Silicone Band, Michael Kors Womens MK5145 - Runway Chronograph, LEGO Star W
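
The keyword lines above come from sorting each centroid's term weights in decreasing order and keeping the first few terms. A small plain-Python sketch of that step follows; the terms and centroid weights are made-up values, and `top_terms` is a hypothetical helper rather than anything from the notebook.

```python
def top_terms(centroid, terms, k):
    """Return the k terms with the largest weights in one centroid."""
    order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in order[:k]]

terms = ["band", "gift", "love", "price"]   # made-up vocabulary
centroids = [
    [0.10, 0.05, 0.60, 0.02],               # made-up centroid weights
    [0.40, 0.01, 0.03, 0.30],
]

keywords = [top_terms(c, terms, 2) for c in centroids]
print(keywords)
```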

# Part 5: Topic Modeling - Latent Dirichlet Allocation

In [22]:
# Use LDA for clustering
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=5, learning_method = 'online')

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
# LDA expects raw term counts, so use CountVectorizer rather than TF-IDF weights
tfidf_model_lda = CountVectorizer(max_df=0.99, max_features=500,
                                 min_df=0.01, stop_words='english',
                                 tokenizer=tokenization_and_stemming, ngram_range=(1,1))

tfidf_matrix_lda = tfidf_model_lda.fit_transform(data)  # fit the vectorizer to the reviews

print ("In total, there are " + str(tfidf_matrix_lda.shape[0]) + \
      " reviews and " + str(tfidf_matrix_lda.shape[1]) + " terms.")

In total, there are 999 reviews and 245 terms.


In [24]:
# document topic matrix for tfidf_matrix_lda
lda_output = lda.fit_transform(tfidf_matrix_lda)
print(lda_output.shape)
print(lda_output)

(999, 5)
[[0.02543196 0.02513806 0.89867858 0.02526836 0.02548303]
 [0.05136517 0.05010455 0.79578137 0.05065455 0.05209437]
 [0.2        0.2        0.2        0.2        0.2       ]
 ...
 [0.06723404 0.06666776 0.7299378  0.06860748 0.06755292]
 [0.10000153 0.10000079 0.59905753 0.10093992 0.10000023]
 [0.05027305 0.05356063 0.79421285 0.05162596 0.0503275 ]]


In [25]:
# topic-word matrix: one row per topic, one column per term
topic_word = lda.components_
print(topic_word.shape)

(5, 245)


In [26]:
# column names
topic_names = ["Topic" + str(i) for i in range(lda.n_components)]

# index names
doc_names = ["Doc" + str(i) for i in range(len(data))]

df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topic_names, index=doc_names)

# get dominant topic for each document
topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['topic'] = topic

df_document_topic.head(10)

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,topic
Doc0,0.03,0.03,0.9,0.03,0.03,2
Doc1,0.05,0.05,0.8,0.05,0.05,2
Doc2,0.2,0.2,0.2,0.2,0.2,0
Doc3,0.03,0.03,0.11,0.03,0.8,4
Doc4,0.01,0.01,0.54,0.45,0.01,2
Doc5,0.04,0.04,0.84,0.04,0.04,2
Doc6,0.46,0.03,0.03,0.45,0.03,0
Doc7,0.03,0.03,0.88,0.03,0.03,2
Doc8,0.18,0.01,0.01,0.41,0.39,3
Doc9,0.02,0.02,0.36,0.58,0.02,3
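
The dominant-topic column is simply the position of the largest weight in each row. The sketch below redoes that step in plain Python on three rows copied from the table above. It also flags rows whose weights are all equal, since a uniform row such as Doc2's falls back to topic 0 only because ties resolve to the lowest index. The `is_uninformative` check is a hypothetical addition, not part of the notebook.

```python
# Rows copied from the document-topic table above.
doc_topic = {
    "Doc0": [0.03, 0.03, 0.90, 0.03, 0.03],
    "Doc2": [0.20, 0.20, 0.20, 0.20, 0.20],
    "Doc3": [0.03, 0.03, 0.11, 0.03, 0.80],
}

def dominant_topic(weights):
    """Index of the largest weight; ties go to the lowest index, as with np.argmax."""
    return max(range(len(weights)), key=lambda k: weights[k])

def is_uninformative(weights, tol=1e-9):
    """True when every topic has (almost) the same weight."""
    return max(weights) - min(weights) <= tol

dominant = {name: dominant_topic(w) for name, w in doc_topic.items()}
flat = [name for name, w in doc_topic.items() if is_uninformative(w)]
print(dominant)
print(flat)
```

One option is to leave such flat rows out of the per-topic counts, since their label carries no information.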


In [27]:
df_document_topic['topic'].value_counts().to_frame()

Unnamed: 0,topic
3,259
2,257
4,217
0,191
1,75


In [28]:
# install pyLDAvis
!pip install pyLDAvis



In [29]:
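# topic-word matrix
print(lda.components_)
df_topic_words = pd.DataFrame(lda.components_)

# label columns with the vocabulary terms and rows with the topic names
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
df_topic_words.columns = tfidf_model_lda.get_feature_names_out()
df_topic_words.index = topic_names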

df_topic_words.head()

[[10.77004285 82.89262476  0.20931888 ... 16.84757218  0.21500793
  12.10843036]
 [ 0.2050182   0.21128167  0.20305231 ...  0.23233059  0.21199673
   2.40374859]
 [33.19821352 32.16417747  0.24181739 ...  0.4096604   2.21346417
  28.31672027]
 [ 3.75945693 65.77516092  9.69876875 ...  0.2072508  44.49087783
  12.43924776]
 [ 0.21772896 50.223563    6.02417804 ...  0.20385897 24.03620931
   0.23375566]]


Unnamed: 0,'m,'s,abl,absolut,accur,actual,adjust,alarm,alreadi,alway,...,weight,went,wife,wind,wish,work,worn,worth,wrist,year
Topic0,10.770043,82.892625,0.209319,0.216249,0.213654,0.208857,9.557196,15.779351,5.699592,1.896583,...,0.544169,0.20632,19.750692,0.203742,4.095446,8.408726,2.684927,16.847572,0.215008,12.10843
Topic1,0.205018,0.211282,0.203052,3.687835,0.20515,0.204079,0.204438,0.217729,0.203883,4.061474,...,0.205105,0.211651,0.201493,0.201311,0.208417,0.212194,0.201179,0.232331,0.211997,2.403749
Topic2,33.198214,32.164177,0.241817,13.089021,3.614839,0.255155,0.326065,0.204318,0.500981,7.103969,...,2.850333,4.997812,0.205958,10.490469,2.164433,92.747212,0.212111,0.40966,2.213464,28.31672
Topic3,3.759457,65.775161,9.698769,0.21735,8.814046,0.37591,3.502937,0.211501,7.168224,0.290005,...,6.185351,8.137742,0.215017,2.000245,5.449923,25.471074,5.152407,0.207251,44.490878,12.439248
Topic4,0.217729,50.223563,6.024178,0.223503,0.201941,15.403987,10.933247,1.16454,0.223949,0.405008,...,5.375177,0.205418,0.208345,1.409033,0.211807,4.903158,5.138584,0.203859,24.036209,0.233756
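
The values in `lda.components_` are unnormalized weights (scikit-learn describes them as pseudo-counts), which is why the rows above do not sum to 1. Dividing each row by its sum turns it into a per-topic word distribution. A minimal sketch with made-up numbers, using a hypothetical `normalize_rows` helper:

```python
def normalize_rows(matrix):
    """Scale each row so that its entries sum to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total for value in row])
    return normalized

# Made-up pseudo-counts: 2 topics x 3 terms.
pseudo_counts = [
    [8.0, 1.0, 1.0],
    [2.0, 2.0, 4.0],
]

topic_word_probs = normalize_rows(pseudo_counts)
print(topic_word_probs)
```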


In [30]:
# return the top n keywords for each topic
def print_topic_words(tfidf_model, lda_model, n_words):
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
    words = np.array(tfidf_model.get_feature_names_out())
    topic_words = []
    # each row of components_ holds one topic's word weights
    for topic_words_weights in lda_model.components_:
        # indices of the n_words largest weights, in descending order
        top_words = topic_words_weights.argsort()[::-1][:n_words]
        topic_words.append(words.take(top_words))
    return topic_words

topic_keywords = print_topic_words(tfidf_model=tfidf_model_lda, lda_model=lda, n_words=15)        

df_topic_words = pd.DataFrame(topic_keywords)
df_topic_words.columns = ['Word '+str(i) for i in range(df_topic_words.shape[1])]
df_topic_words.index = ['Topic '+str(i) for i in range(df_topic_words.shape[0])]
df_topic_words

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
Topic 0,watch,like,'s,love,look,easi,face,color,light,want,littl,pretti,n't,realli,read
Topic 1,product,excel,cheap,qualiti,broke,price,came,great,fast,amaz,watch,deliveri,color,pleas,fell
Topic 2,watch,good,time,work,n't,love,day,look,qualiti,replac,buy,got,month,week,'m
Topic 3,watch,br,look,band,great,n't,like,'s,hand,wear,time,good,second,wrist,price
Topic 4,watch,nice,love,'s,perfect,fit,beauti,expect,realli,gift,order,wrist,awesom,easili,size
