**LDA for Topic Modeling in Python**

The data set contains user reviews for different products in the food category. We will use LDA to group the user reviews into 5 categories. The first step, as always, is to import the data set along with the required libraries.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import pandas as pd
import numpy as np
reviews_datasets = pd.read_csv('drive/My Drive/data/tm/Reviews.csv')
reviews_datasets = reviews_datasets.head(20000)


In [5]:
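# dropna() removes rows that contain missing values. It returns a new DataFrame, so
# reviews_datasets itself is unchanged unless the result is assigned back.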
reviews_datasets.dropna()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
19995,19996,B002C50X1M,A1XRXZI5KOMVDD,"KAF1958 ""amandaf0626""",0,0,4,1307664000,Crispy and tart,Deep River Salt & Vinegar chips are thick and ...
19996,19997,B002C50X1M,A7G9M0IE7LABX,Kevin,0,0,5,1307059200,Exceeded my expectations. One of the best chip...,I was very skeptical about buying a brand of c...
19997,19998,B002C50X1M,A38J5PRUDESMZF,ray,0,0,5,1305763200,"Awesome Goodness! (deep river kettle chips, sw...",Before you turn to other name brands out there...
19998,19999,B002C50X1M,A17TPOSAG43GSM,Herrick,0,0,3,1303171200,"Pretty good, but prefer other jalapeno chips","I was expecting some ""serious flavor"" as it wa..."


In [7]:
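# Without parentheses this displays the bound method (and the Series it belongs to);
# call .describe() to get the summary statistics instead.
reviews_datasets['Text'].describe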

<bound method NDFrame.describe of 0        I have bought several of the Vitality canned d...
1        Product arrived labeled as Jumbo Salted Peanut...
2        This is a confection that has been around a fe...
3        If you are looking for the secret ingredient i...
4        Great taffy at a great price.  There was a wid...
                               ...                        
19995    Deep River Salt & Vinegar chips are thick and ...
19996    I was very skeptical about buying a brand of c...
19997    Before you turn to other name brands out there...
19998    I was expecting some "serious flavor" as it wa...
19999    I purchased the Salt and Vinegar chips and hav...
Name: Text, Length: 20000, dtype: object>

Before we can apply LDA, we need to create a vocabulary of all the words in our data.
We can do so with the help of a count vectorizer.

In [0]:
# CountVectorizer tokenizes a collection of text documents, builds a vocabulary of known words, and can encode new documents using that vocabulary.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
doc_term_matrix = count_vect.fit_transform(reviews_datasets['Text'])
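To make the idea of a vocabulary and a document term matrix concrete, here is a minimal sketch on a three-sentence toy corpus. The corpus and the `toy_` names are made up for illustration and are not part of the review data; the vocabulary is read from the `vocabulary_` attribute so the sketch does not depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-review corpus, used only for illustration.
toy_reviews = [
    "great coffee great price",
    "the dog loves this food",
    "coffee for the dog",
]

toy_vect = CountVectorizer()
toy_matrix = toy_vect.fit_transform(toy_reviews)

# vocabulary_ maps each token to its column index in the matrix.
toy_terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_terms)

# Each row is a document, each column a vocabulary word, each cell a count.
print(toy_matrix.toarray())
```

Each row of the matrix counts how often every vocabulary word occurs in one document, which is the form of input that LDA is fitted on below.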

Now let's look at our document term matrix:


In [9]:
doc_term_matrix

<20000x26618 sparse matrix of type '<class 'numpy.int64'>'
	with 1064912 stored elements in Compressed Sparse Row format>

In [11]:
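# toarray() builds a dense 20000 x 26618 array, which needs several gigabytes of memory.
# On scikit-learn 1.0 and later, use get_feature_names_out(); get_feature_names() was removed in 1.2.
df = pd.DataFrame(doc_term_matrix.toarray(), columns=count_vect.get_feature_names())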
print(df)

       00  000  0003  000kwh  002  ...  zupas  zuppa  zwieback  çaykur  ît
0       0    0     0       0    0  ...      0      0         0       0   0
1       0    0     0       0    0  ...      0      0         0       0   0
2       0    0     0       0    0  ...      0      0         0       0   0
3       0    0     0       0    0  ...      0      0         0       0   0
4       0    0     0       0    0  ...      0      0         0       0   0
...    ..  ...   ...     ...  ...  ...    ...    ...       ...     ...  ..
19995   0    0     0       0    0  ...      0      0         0       0   0
19996   0    0     0       0    0  ...      0      0         0       0   0
19997   0    0     0       0    0  ...      0      0         0       0   0
19998   0    0     0       0    0  ...      0      0         0       0   0
19999   0    0     0       0    0  ...      0      0         0       0   0

[20000 rows x 26618 columns]


Next, we will use LDA to create topics along with the probability distribution for each word in our vocabulary for each topic.

In [10]:
from sklearn.decomposition import LatentDirichletAllocation
LDA = LatentDirichletAllocation(n_components=5, random_state=42)
LDA.fit(doc_term_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=5, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
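As a rough illustration of what the fitted model holds, the sketch below fits LDA on a tiny hand-made count matrix and normalizes each row of `components_` so that it sums to 1. The matrix values and the `toy_` names are hypothetical; the normalization step is an assumption about how to read `components_` as a per-topic word distribution, not something the original notebook does.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term counts: 4 documents, 6 vocabulary words.
toy_counts = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 1, 3, 2, 2],
    [0, 1, 0, 2, 3, 3],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_lda.fit(toy_counts)

# components_ has one row per topic and one column per vocabulary word.
# Dividing each row by its sum gives a word distribution for that topic.
row_sums = toy_lda.components_.sum(axis=1, keepdims=True)
topic_word_probs = toy_lda.components_ / row_sums

print(topic_word_probs.shape)
print(topic_word_probs.sum(axis=1))
```

Ranking words by the raw `components_` values or by the normalized values gives the same order within a topic, since each row is divided by a single positive constant.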

The following script randomly fetches 10 words from our vocabulary:

In [13]:


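import random
# Fetch the vocabulary once and reuse it inside the loop.
feature_names = count_vect.get_feature_names()
for i in range(10):
  # random.randint includes both endpoints, so the upper bound must be len - 1.
  random_id = random.randint(0, len(feature_names) - 1)
  print(feature_names[random_id])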


refreshed
thrives
pinacolada
interpreter
picture
cee
phobias
surveys
proof
excellentprice
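An alternative way to peek at the vocabulary is to draw the words in one call with `random.sample`, which picks distinct items and needs no index arithmetic. The sketch below uses a short made-up word list in place of the real vocabulary; `toy_vocabulary` and the seed are assumptions for illustration only.

```python
import random

# Hypothetical stand-in for the real vocabulary list.
toy_vocabulary = ["coffee", "tea", "dog", "chips", "pasta", "juice"]

# A seeded generator makes the draw repeatable between runs.
rng = random.Random(42)

# sample() returns k distinct items, so no word is printed twice.
sampled_words = rng.sample(toy_vocabulary, k=3)
for word in sampled_words:
    print(word)
```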


Let's find 10 words with the highest probability for the first topic. To get the first
topic, we use the components_ attribute and pass a 0 index as the value:

In [0]:
first_topic = LDA.components_[0]

The argsort() method returns the word indexes ordered by ascending value, so the
indexes of the 10 words with the highest probabilities are the last 10 entries of its
result. The following script returns those indexes:

In [0]:
top_topic_words = first_topic.argsort()[-10:]
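The slicing trick can be checked on a small array. The sketch below uses five made-up weights (`toy_weights` is hypothetical) and takes the top three instead of the top ten; it also shows how to reverse the result when the strongest word should come first.

```python
import numpy as np

# Hypothetical topic weights for a five-word vocabulary.
toy_weights = np.array([0.10, 0.40, 0.05, 0.30, 0.15])

# argsort() returns indexes ordered from the smallest to the largest value.
order = toy_weights.argsort()
print(order)            # [2 0 4 3 1]

# The last three entries are the indexes of the three largest values.
top_three = order[-3:]
print(top_three)        # [4 3 1]

# Reversing the slice lists the strongest word first.
top_three_desc = top_three[::-1]
print(top_three_desc)   # [1 3 4]
```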

These indexes can then be used to retrieve the corresponding words from
the count_vect object, which can be done like this:

In [20]:
for i in top_topic_words:
  print(count_vect.get_feature_names()[i])

that
you
are
is
to
of
it
br
and
the


Let's print the 10 words with the highest probabilities for all five topics:

In [22]:
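# Fetch the vocabulary once instead of once per word.
feature_names = count_vect.get_feature_names()
for i,topic in enumerate(LDA.components_):
  print(f'Top 10 words for topic #{i}:')
  # word_id avoids reusing the loop variable i inside the comprehension.
  print([feature_names[word_id] for word_id in topic.argsort()[-10:]])
  print('\n')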

Top 10 words for topic #0:
['that', 'you', 'are', 'is', 'to', 'of', 'it', 'br', 'and', 'the']


Top 10 words for topic #1:
['they', 'br', 'is', 'for', 'of', 'my', 'it', 'to', 'and', 'the']


Top 10 words for topic #2:
['was', 'for', 'is', 'of', 'in', 'this', 'to', 'it', 'and', 'the']


Top 10 words for topic #3:
['br', 'tea', 'to', 'of', 'this', 'is', 'coffee', 'and', 'it', 'the']


Top 10 words for topic #4:
['ounce', 'pack', 'pasta', 'product', 'amazon', 'href', 'gp', 'http', 'www', 'com']




As a final step, we will add a column to the original data frame that stores the
topic for each text. To do so, we can use the LDA.transform() method and pass it our
document-term matrix. This method returns, for each document, the probability of
every topic.

In [23]:

topic_values = LDA.transform(doc_term_matrix)
topic_values.shape


(20000, 5)
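The shape above has one row per review and one column per topic. As a small check of what those rows contain, the sketch below fits LDA on a hypothetical four-document count matrix and confirms that each row returned by `transform()` sums to 1. All `toy_` names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term counts: 4 documents, 6 vocabulary words.
toy_counts = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 1, 3, 2, 2],
    [0, 1, 0, 2, 3, 3],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_lda.fit(toy_counts)

# One row per document, one column per topic.
toy_topic_values = toy_lda.transform(toy_counts)
print(toy_topic_values.shape)

# Each row is a distribution over topics, so it sums to 1.
print(toy_topic_values.sum(axis=1))
```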

To find the topic index with the maximum value, we can call the argmax() method and
pass 1 as the value for the axis parameter.
The following script adds a new column for the topic to the data frame and assigns
the topic value to each row in that column:

In [0]:

reviews_datasets['Topic'] = topic_values.argmax(axis=1)
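The effect of `argmax(axis=1)` is easy to see on a small matrix. In the sketch below the probabilities and the `TopicProbability` column are hypothetical additions; the extra column simply records how strong the winning topic was, which can help when a review is split almost evenly between topics.

```python
import numpy as np
import pandas as pd

# Hypothetical topic probabilities: 3 documents, 3 topics.
toy_topic_values = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.60, 0.10],
])

toy_frame = pd.DataFrame({"Text": ["review a", "review b", "review c"]})

# axis=1 searches across the columns of each row, giving one topic per document.
toy_frame["Topic"] = toy_topic_values.argmax(axis=1)

# The matching maximum shows how confident the assignment is.
toy_frame["TopicProbability"] = toy_topic_values.max(axis=1)

print(toy_frame)
```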


Let's now see how the data set looks:


In [25]:
reviews_datasets.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Topic
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,1
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,2
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,3
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,0
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,2


**NMF for Topic Modeling in Python**

In [26]:
import pandas as pd
import numpy as np
reviews_datasets = pd.read_csv('drive/My Drive/data/tm/Reviews.csv')
reviews_datasets = reviews_datasets.head(20000)
# dropna() returns a new DataFrame; the result is only displayed here, not assigned.
reviews_datasets.dropna()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
19995,19996,B002C50X1M,A1XRXZI5KOMVDD,"KAF1958 ""amandaf0626""",0,0,4,1307664000,Crispy and tart,Deep River Salt & Vinegar chips are thick and ...
19996,19997,B002C50X1M,A7G9M0IE7LABX,Kevin,0,0,5,1307059200,Exceeded my expectations. One of the best chip...,I was very skeptical about buying a brand of c...
19997,19998,B002C50X1M,A38J5PRUDESMZF,ray,0,0,5,1305763200,"Awesome Goodness! (deep river kettle chips, sw...",Before you turn to other name brands out there...
19998,19999,B002C50X1M,A17TPOSAG43GSM,Herrick,0,0,3,1303171200,"Pretty good, but prefer other jalapeno chips","I was expecting some ""serious flavor"" as it wa..."


In the previous section we used a count vectorizer, but in this section we will use
a TF-IDF vectorizer, since NMF works with TF-IDF. We will create a document term
matrix with TF-IDF.

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
doc_term_matrix = tfidf_vect.fit_transform(reviews_datasets['Text'])
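To show how TF-IDF differs from raw counts, the sketch below vectorizes three made-up two-word documents. The word "the" occurs in every document and therefore ends up with a lower weight than a word that occurs in only one. The corpus and `toy_` names are hypothetical, and the comparison relies on the default `TfidfVectorizer` settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "the" appears everywhere, the other words only once.
toy_reviews = ["the coffee", "the tea", "the dog"]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_reviews).toarray()

# vocabulary_ maps each word to its column index.
column = toy_tfidf.vocabulary_
first_doc = toy_matrix[0]

# Within the first document, the rare word outweighs the common one.
print(first_doc[column["coffee"]], first_doc[column["the"]])

# With default settings each row is scaled to unit Euclidean length.
print(np.linalg.norm(toy_matrix, axis=1))
```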

Once the document term matrix is generated, we can create a topic-word matrix that
contains a weight for every word in the vocabulary for every topic. These weights are
non-negative scores rather than normalized probabilities, but they can be ranked in
the same way. To do so, we can use the NMF class from the sklearn.decomposition module.

In [28]:
from sklearn.decomposition import NMF
nmf = NMF(n_components=5, random_state=42)
nmf.fit(doc_term_matrix)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=5, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)
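As a minimal sketch of what NMF produces, the example below factorizes a hypothetical 4 x 6 matrix into a document-topic matrix `W` and a topic-word matrix `H`. The matrix values, the `init` choice, and the `max_iter` setting are assumptions made for this illustration; they are not the settings used on the review data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative document-term matrix: 4 documents, 6 words.
toy_matrix = np.array([
    [3.0, 2.0, 0.0, 0.0, 1.0, 0.0],
    [2.0, 3.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0, 2.0, 2.0],
    [0.0, 1.0, 0.0, 2.0, 3.0, 3.0],
])

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)

# W holds one row of topic weights per document.
W = toy_nmf.fit_transform(toy_matrix)

# H (components_) holds one row of word weights per topic.
H = toy_nmf.components_

print(W.shape, H.shape)

# The product W @ H approximates the original matrix.
print(np.round(W @ H, 1))
```

Both factors are non-negative, but the rows of `H` are not scaled to sum to 1, which is why they are better read as weights than as probabilities.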

The following script randomly gets 10 words from our vocabulary:

In [29]:
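import random
# Fetch the vocabulary once and reuse it inside the loop.
feature_names = tfidf_vect.get_feature_names()
for i in range(10):
  # random.randint includes both endpoints, so the upper bound must be len - 1.
  random_id = random.randint(0, len(feature_names) - 1)
  print(feature_names[random_id])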

mid
foldiers
mir
hartshorn
platoon
parts
junkyards
cannister
garfield
hesitantly


Next, we will retrieve the weight vector of words for the first topic and will
retrieve the indexes of the ten words with the highest weights:

In [0]:
first_topic = nmf.components_[0]
top_topic_words = first_topic.argsort()[-10:]

These indexes can now be passed to the tfidf_vect object to retrieve the actual words.

In [31]:
for i in top_topic_words:
  print(tfidf_vect.get_feature_names()[i])

in
tea
was
of
to
and
this
is
the
it


Let's now print the ten words with the highest weights for each of the topics:

In [32]:
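# Fetch the vocabulary once instead of once per word.
feature_names = tfidf_vect.get_feature_names()
for i,topic in enumerate(nmf.components_):
  print(f'Top 10 words for topic #{i}:')
  # word_id avoids reusing the loop variable i inside the comprehension.
  print([feature_names[word_id] for word_id in topic.argsort()[-10:]])
  print('\n')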

Top 10 words for topic #0:
['in', 'tea', 'was', 'of', 'to', 'and', 'this', 'is', 'the', 'it']


Top 10 words for topic #1:
['you', 'of', 'chips', 'to', 'and', 'the', 'them', 'these', 'are', 'they']


Top 10 words for topic #2:
['with', 'juice', 'in', 'or', 'that', 'to', 'you', 'of', 'the', 'br']


Top 10 words for topic #3:
['and', 'bold', 'strong', 'cups', 'of', 'is', 'this', 'the', 'cup', 'coffee']


Top 10 words for topic #4:
['and', 'to', 'treats', 'we', 'my', 'her', 'he', 'food', 'she', 'dog']




The following script adds the topics to the data set and displays the first five rows:

In [33]:

topic_values = nmf.transform(doc_term_matrix)
reviews_datasets['Topic'] = topic_values.argmax(axis=1)
reviews_datasets.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Topic
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,4
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,0
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,0
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,0
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,0
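Once every row carries a topic label, a quick way to see how the documents are spread across topics is to count the labels. The sketch below does this on a small made-up `Topic` column; the values are hypothetical and do not come from the review data.

```python
import pandas as pd

# Hypothetical topic labels for eight documents.
toy_frame = pd.DataFrame({"Topic": [4, 0, 0, 0, 0, 3, 1, 4]})

# value_counts() tallies the labels; sort_index() orders them by topic number.
topic_counts = toy_frame["Topic"].value_counts().sort_index()
print(topic_counts)
```

A topic that never wins the argmax, such as topic 2 in this toy column, simply does not appear in the counts.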
