## Topic Modeling in NLP

Two approaches are commonly used for topic modeling: Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF).

In [22]:
import pandas as pd  
import numpy as np

import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import NMF

In [2]:
reviews_datasets = pd.read_csv(r'amazon-fine-food-reviews/Reviews.csv')  
reviews_datasets = reviews_datasets.head(20000)  
# Note: dropna() returns a new DataFrame; the result is only displayed here, not assigned back
reviews_datasets.dropna()  

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,4,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,5,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,5,1336003200,"Wonderful, tasty taffy",This taffy is so good. It is very soft and ch...
8,9,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1,1,5,1322006400,Yay Barley,Right now I'm mostly just sprouting this so my...
9,10,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0,0,5,1351209600,Healthy Dog Food,This is a very healthy dog food. Good for thei...


In [3]:
reviews_datasets.head()  

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [4]:
reviews_datasets['Text'][350]  

'These chocolate covered espresso beans are wonderful!  The chocolate is very dark and rich and the "bean" inside is a very delightful blend of flavors with just enough caffine to really give it a zing.'

We keep only words that appear in at most 80% of the documents (`max_df=0.8`) and in at least 2 documents (`min_df=2`). We also remove English stop words, since they contribute little to topic modeling.

In [6]:
count_vect = CountVectorizer(max_df=0.8, 
                             min_df=2, 
                             stop_words='english')  
# create a document-term matrix
doc_term_matrix = count_vect.fit_transform(reviews_datasets['Text'].values.astype('U'))  
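
The filtering rule behind `max_df` and `min_df` can be sketched in plain Python. This is only an illustration of the rule on a hypothetical toy corpus, not scikit-learn's implementation; real tokenization and stop-word removal are more involved.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the reviews data set).
docs = [
    "good tea good price",
    "good coffee",
    "good dog food",
    "good tea and coffee",
]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

n_docs = len(docs)
max_df = 0.8   # drop words found in more than 80% of the documents
min_df = 2     # drop words found in fewer than 2 documents

vocabulary = sorted(
    word for word, df in doc_freq.items()
    if df >= min_df and df / n_docs <= max_df
)
print(vocabulary)  # ['coffee', 'tea']
```

Here "good" is dropped because it occurs in every document, while "price", "dog", "food" and "and" are dropped because each occurs in only one.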

Each of the 20,000 documents is represented as a 14,546-dimensional vector, which means that our vocabulary has 14,546 words.

In [7]:
doc_term_matrix.shape

(20000, 14546)

### Latent Dirichlet Allocation (LDA)

We will use LDA to learn a set of topics, each described by a distribution over the words in our vocabulary.

In [9]:
# n_components is the number of categories, or topics, that we want our text to be divided into
LDA = LatentDirichletAllocation(n_components=5,
                                random_state=42)

In [10]:
LDA.fit(doc_term_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=5, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

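Let's randomly fetch words from our vocabulary. The count vectorizer holds every word in the vocabulary: its `get_feature_names()` method returns them as a list, which we can index with the ID of the word we want to fetch. (`get_feature_names()` matches the older scikit-learn version used here; releases from 1.0 onward provide `get_feature_names_out()`, and 1.2 removed the old method.)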

In [12]:
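# randomly fetches 10 words from our vocabulary
feature_names = count_vect.get_feature_names()
for i in range(10):  
    # randrange excludes the upper bound; randint(0, len(...)) could index one past the end
    random_id = random.randrange(len(feature_names))
    print(feature_names[random_id])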

tightly
elite
critters
dander
ordinarily
intenso
ideal
philly
touching
vice
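
When sampling vocabulary entries by index, it helps to remember that `random.randint(a, b)` includes both endpoints, while `random.randrange(n)` stops at `n - 1`. A minimal sketch with a hypothetical stand-in vocabulary shows two ways to sample that cannot run past the end of the list:

```python
import random

# Hypothetical stand-in for the list returned by count_vect.get_feature_names().
vocabulary = ["coffee", "dog", "flavor", "juice", "soda", "taffy", "tea", "water"]

random.seed(42)  # fixed seed so the sketch is repeatable

# random.sample picks distinct items and never indexes past the end of the list.
sampled_words = random.sample(vocabulary, k=5)
print(sampled_words)

# Index-based alternative: randrange excludes its upper bound,
# whereas random.randint(a, b) can return b itself.
random_id = random.randrange(len(vocabulary))
print(vocabulary[random_id])
```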


Next, we find the 10 words with the highest probability for the first topic.

In [13]:
# Get first topic
first_topic = LDA.components_[0]  

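The first topic is a vector of 14,546 values, one per word in the vocabulary. The values scikit-learn stores in `components_` are unnormalized weights rather than probabilities that sum to 1, but a higher value still means a more probable word.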

In [14]:
# argsort returns indexes in ascending order of value, so the last 10 are the top words (strongest last)
top_topic_words = first_topic.argsort()[-10:]  
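
To turn a row of `components_` into actual probabilities, divide it by its sum. The sketch below does this on a small hypothetical weight matrix (the words and numbers are made up) and lists the top words from strongest to weakest by reversing the `argsort` result.

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics over a 5-word vocabulary.
feature_names = ["coffee", "cup", "dog", "tea", "treats"]
components = np.array([
    [8.0, 6.0, 0.5, 5.0, 0.5],    # topic 0
    [0.5, 1.5, 10.0, 1.0, 7.0],   # topic 1
])

# Divide each row by its sum so that every topic becomes a probability distribution.
topic_word_probs = components / components.sum(axis=1, keepdims=True)

# argsort is ascending, so reverse it to list the most probable words first.
top_ids = topic_word_probs[0].argsort()[::-1][:3]
top_words = [feature_names[i] for i in top_ids]
print(top_words)  # ['coffee', 'cup', 'tea']
```

Normalizing does not change the ranking within a topic, so the top-10 lists in this notebook are the same either way.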

In [15]:
for i in top_topic_words:  
    print(count_vect.get_feature_names()[i])

water
great
just
drink
sugar
good
flavor
taste
like
tea


The words show that the first topic might be about tea.

Let's print the 10 words with the highest probabilities for each of the five topics.

In [16]:
for i, topic in enumerate(LDA.components_):  
    print(f'Top 10 words for topic #{i}:')
    # use a separate name for the word index so it does not shadow the topic index i
    print([count_vect.get_feature_names()[word_id] for word_id in topic.argsort()[-10:]])
    print('\n')

Top 10 words for topic #0:
['water', 'great', 'just', 'drink', 'sugar', 'good', 'flavor', 'taste', 'like', 'tea']


Top 10 words for topic #1:
['br', 'chips', 'love', 'flavor', 'chocolate', 'just', 'great', 'taste', 'good', 'like']


Top 10 words for topic #2:
['just', 'drink', 'orange', 'sugar', 'soda', 'water', 'like', 'juice', 'product', 'br']


Top 10 words for topic #3:
['gluten', 'eat', 'free', 'product', 'like', 'dogs', 'treats', 'dog', 'br', 'food']


Top 10 words for topic #4:
['cups', 'price', 'great', 'like', 'amazon', 'good', 'br', 'product', 'cup', 'coffee']




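The output suggests that the second topic (#1) might contain reviews about chocolates and chips, the third topic (#2) reviews about sodas and juices, the fourth (#3) reviews about pet food, and the fifth (#4) reviews about coffee. Within each list the words run from lower to higher weight, so the last word is the strongest.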
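
The token "br" in several of these lists most likely comes from HTML line-break tags (`<br />`) left in the review text. One way to deal with it, sketched here under that assumption, is to strip tags before vectorizing. The helper name `strip_html_tags` and the sample string are hypothetical, and the pattern is deliberately crude.

```python
import re

# Hypothetical review text containing HTML line breaks.
raw_text = "Great taffy at a great price.<br /><br />Very soft and chewy."

def strip_html_tags(text):
    """Replace anything that looks like an HTML tag with a single space."""
    return re.sub(r"<[^>]+>", " ", text)

clean_text = strip_html_tags(raw_text)
print(clean_text)
```

In this notebook the helper could be applied with something like `reviews_datasets['Text'].map(strip_html_tags)` before calling the vectorizer. Adding "br" to a custom stop-word list would be an alternative.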

Next, we compute the topic probabilities for every document by passing the document-term matrix to the `transform()` method.

In [17]:
topic_values = LDA.transform(doc_term_matrix)  
topic_values.shape  

(20000, 5)

Each document has 5 columns, where each column holds the probability of a particular topic. We add a new column to the data frame that stores, for each document, the topic with the highest probability.

In [18]:
# To find the topic index with maximum value
reviews_datasets['Topic'] = topic_values.argmax(axis=1) 
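
Taking the `argmax` keeps only the winning topic and discards how decisive the assignment was. A small sketch with made-up probabilities shows how the maximum value itself can be kept alongside the topic index:

```python
import numpy as np

# Hypothetical document-topic probabilities: 3 documents, 5 topics.
topic_values = np.array([
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.40, 0.35, 0.05, 0.10, 0.10],
])

dominant_topic = topic_values.argmax(axis=1)  # column index of the largest value per row
dominant_prob = topic_values.max(axis=1)      # how strongly that topic dominates

print(dominant_topic)  # [3 1 0]
print(dominant_prob)   # [0.7 0.6 0.4]
```

The third document is assigned to topic 0 with a probability of only 0.40, barely ahead of topic 1 at 0.35, so its label is much less certain than the first document's.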

In [19]:
reviews_datasets.head()  

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Topic
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,3
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,1
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,1
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,0
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,1


We can see a new column for the topic in the output.

### Non-Negative Matrix Factorization (NMF) for Topic Modeling

Non-negative matrix factorization is also an unsupervised learning technique, which performs clustering as well as dimensionality reduction. It can be used in combination with the TF-IDF scheme to perform topic modeling.

In [23]:
tfidf_vect = TfidfVectorizer(max_df=0.8,
                             min_df=2,
                             stop_words='english')  

# Generate document term matrix 
doc_term_matrix = tfidf_vect.fit_transform(reviews_datasets['Text'].values.astype('U'))  
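
The TF-IDF weighting applied above can be reproduced by hand on a toy example. The sketch assumes scikit-learn's documented defaults (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). The corpus and the helper name `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus of three tiny documents.
docs = [["tea", "green", "tea"], ["coffee", "cup"], ["tea", "cup"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each word.
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

first_doc_weights = tfidf(docs[0])
print(first_doc_weights)
```

Words that occur in many documents get a smaller IDF factor, so common words weigh less than they would with raw counts.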

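We can create a matrix that contains a non-negative weight for every word in the vocabulary for each topic. Unlike LDA's output, these weights are not probabilities, but larger values still indicate words that matter more to a topic.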

In [24]:
nmf = NMF(n_components=5, 
          random_state=42) 

In [25]:
nmf.fit(doc_term_matrix )

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=5, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)
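
To make the factorization itself concrete, here is a minimal NumPy sketch that approximates a small non-negative matrix `V` as the product `W @ H`. It uses multiplicative updates, which is a different algorithm from the coordinate-descent solver (`solver='cd'`) reported in the output above, and the matrix values are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical non-negative document-term matrix: 6 documents, 4 terms.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [4.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 2.0],
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 0.0, 4.0, 4.0],
])

n_topics = 2
W = rng.random((V.shape[0], n_topics))  # document-topic weights
H = rng.random((n_topics, V.shape[1]))  # topic-term weights
eps = 1e-10                             # guards against division by zero

initial_error = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Multiplicative updates keep W and H non-negative at every step.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(V - W @ H)

print(initial_error, final_error)
print(W.argmax(axis=1))  # dominant topic per document
```

The rows of `H` play the role of `nmf.components_`, and `W` corresponds to what `nmf.transform()` returns for the documents.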

In [26]:
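# let's randomly get 10 words from our vocabulary
feature_names = tfidf_vect.get_feature_names()
for i in range(10):  
    # randrange excludes the upper bound; randint(0, len(...)) could index one past the end
    random_id = random.randrange(len(feature_names))
    print(feature_names[random_id])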

idiocy
cinammon
redefine
gsd
disagree
climb
luv
ate
coupled
quintessential


Next, we retrieve the weight vector of words for the first topic and the indexes of the ten words with the highest weights.

In [27]:
first_topic = nmf.components_[0]  
top_topic_words = first_topic.argsort()[-10:]  

In [28]:
# retrieve the actual words using these indexes
for i in top_topic_words:  
    print(tfidf_vect.get_feature_names()[i])

really
chocolate
love
flavor
just
product
taste
great
good
like


The words for the first topic (#0) suggest that it might contain reviews about chocolates, although most of its top words are generic.

In [29]:
# Ten words with the highest weights for each of the topics
for i, topic in enumerate(nmf.components_):  
    print(f'Top 10 words for topic #{i}:')
    # use a separate name for the word index so it does not shadow the topic index i
    print([tfidf_vect.get_feature_names()[word_id] for word_id in topic.argsort()[-10:]])
    print('\n')

Top 10 words for topic #0:
['really', 'chocolate', 'love', 'flavor', 'just', 'product', 'taste', 'great', 'good', 'like']


Top 10 words for topic #1:
['like', 'keurig', 'roast', 'flavor', 'blend', 'bold', 'strong', 'cups', 'cup', 'coffee']


Top 10 words for topic #2:
['com', 'amazon', 'orange', 'switch', 'water', 'drink', 'soda', 'sugar', 'juice', 'br']


Top 10 words for topic #3:
['bags', 'flavor', 'drink', 'iced', 'earl', 'loose', 'grey', 'teas', 'green', 'tea']


Top 10 words for topic #4:
['old', 'love', 'cat', 'eat', 'treat', 'loves', 'dogs', 'food', 'treats', 'dog']




The words for topic #1 show that this topic contains reviews about coffee. Similarly, the words for topic #2 suggest that it contains reviews about sodas and juices. Topic #3 covers another kind of drink, tea. Finally, topic #4 may contain reviews about pet food, since it contains words such as "cat", "dog", "treat", etc.

In [30]:
# add the topics to the data set
topic_values = nmf.transform(doc_term_matrix)  
reviews_datasets['Topic'] = topic_values.argmax(axis=1)  

In [31]:
reviews_datasets.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Topic
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,4
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,0
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,4
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,0
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,0


### Conclusion

Topic modeling is one of the most sought-after research areas in NLP. It is used to group large volumes of unlabeled text data. We reviewed two approaches to topic modeling and saw how Latent Dirichlet Allocation and Non-Negative Matrix Factorization can be used for it.