### Topic modeling
Topic modeling is an unsupervised technique for analyzing large volumes of text by clustering documents into groups. <br>
The text has no labels attached to it. Instead, topic modeling groups documents into clusters based on shared characteristics. <br>
An example is clustering a large number of newspaper articles so that articles in the same category end up together.

Two methods:
* Latent Dirichlet Allocation (LDA)
* Non-Negative Matrix Factorization (NMF)

#### Latent Dirichlet Allocation (LDA)
LDA is based on two general assumptions:
* Documents that contain similar words usually share a topic.
* Documents in which groups of words frequently occur together usually share a topic.

Mathematically, these two assumptions can be represented as:
* Documents are probability distributions over latent topics.
* Topics are probability distributions over words.
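
The two statements above can be made concrete with a small simulation. This is a minimal sketch, not part of the original notebook: the vocabulary, the number of topics, and all variable names are made up for illustration. It draws one distribution over words per topic and one distribution over topics for a document, then generates the document word by word.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["coffee", "cup", "dog", "treats", "tea"]  # toy vocabulary (hypothetical)
n_topics = 2

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(np.ones(len(vocab)), size=n_topics)  # shape (2, 5)

# Each document is a probability distribution over the topics.
doc_topic = rng.dirichlet(np.ones(n_topics))  # shape (2,)

# Generate an 8-word document: pick a topic for each word,
# then pick the word from that topic's distribution.
words = []
for _ in range(8):
    z = rng.choice(n_topics, p=doc_topic)
    w = rng.choice(len(vocab), p=topic_word[z])
    words.append(vocab[w])

print(doc_topic)
print(words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the two sets of distributions.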

<i>Download datasets from: <a href="https://www.kaggle.com/sdxingaijing/topic-model-lda-algorithm/data?select=Reviews.csv">Here</a></i>

In [1]:
import pandas as pd
import numpy as np

In [3]:
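# load the reviews and keep the first 20,000 rows
reviews_datasets = pd.read_csv('datasets/reviews.csv')
reviews_datasets = reviews_datasets.head(20000)
# dropna() returns a new DataFrame; the result is only displayed here, so reviews_datasets itself is unchanged
reviews_datasets.dropna()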

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
19995,19996,B002C50X1M,A1XRXZI5KOMVDD,"KAF1958 ""amandaf0626""",0,0,4,1307664000,Crispy and tart,Deep River Salt & Vinegar chips are thick and ...
19996,19997,B002C50X1M,A7G9M0IE7LABX,Kevin,0,0,5,1307059200,Exceeded my expectations. One of the best chip...,I was very skeptical about buying a brand of c...
19997,19998,B002C50X1M,A38J5PRUDESMZF,ray,0,0,5,1305763200,"Awesome Goodness! (deep river kettle chips, sw...",Before you turn to other name brands out there...
19998,19999,B002C50X1M,A17TPOSAG43GSM,Herrick,0,0,3,1303171200,"Pretty good, but prefer other jalapeno chips","I was expecting some ""serious flavor"" as it wa..."


In [4]:
reviews_datasets['Text'][350]

'These chocolate covered espresso beans are wonderful!  The chocolate is very dark and rich and the "bean" inside is a very delightful blend of flavors with just enough caffine to really give it a zing.'

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

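count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
# astype('U') casts every entry to a Unicode string, so a missing value (NaN) becomes the string 'nan' instead of raising an error
doc_term_matrix = count_vect.fit_transform(reviews_datasets['Text'].values.astype('U'))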

Here we use `CountVectorizer` to create a document-term matrix. We keep only words that appear in at most 80% of the documents (`max_df=0.8`) and in at least 2 documents (`min_df=2`), and we remove all English stop words (`stop_words='english'`).
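
To see what these three filters do, here is a small sketch on a made-up five-document corpus (the documents and the names `toy_docs`, `vect`, `dtm`, and `vocab` are illustrative and not part of the original notebook). The vocabulary is read from the fitted `vocabulary_` mapping so that the snippet does not depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "the product coffee great",
    "the product coffee price",
    "product tea great",
    "product tea sugar",
    "product dog treats",
]

vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
dtm = vect.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

print(vocab)          # ['coffee', 'great', 'tea']
print(dtm.toarray())  # one row per document, one column per kept term
```

"product" is dropped because it appears in all five documents (more than 80%), "price", "sugar", "dog", and "treats" are dropped because they appear in only one document, and "the" is dropped as a stop word. The last document is left with an all-zero row.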


In [7]:
# each of the 20,000 documents is represented as a 14,546-dimensional vector, which means the vocabulary has 14,546 words
doc_term_matrix

<20000x14546 sparse matrix of type '<class 'numpy.int64'>'
	with 594703 stored elements in Compressed Sparse Row format>

In [8]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term_matrix)

LatentDirichletAllocation(n_components=5, random_state=42)

We use `LatentDirichletAllocation` to perform LDA on the document-term matrix. The `n_components` parameter is the number of categories or topics that we want the text to be divided into. `random_state` seeds the random number generator, so the results are reproducible.
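
The effect of both parameters can be checked on a tiny corpus. The following is a sketch under assumptions: the four documents and the names `toy_docs`, `toy_dtm`, `lda_a`, and `lda_b` are made up for illustration. It fits the model twice with the same seed and compares the results.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

toy_docs = [
    "coffee cup coffee",
    "tea cup sugar",
    "dog treats dog",
    "dog food treats",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)  # 4 documents, 7 distinct words

lda_a = LatentDirichletAllocation(n_components=2, random_state=42).fit(toy_dtm)
lda_b = LatentDirichletAllocation(n_components=2, random_state=42).fit(toy_dtm)

# one row per topic, one column per vocabulary word
print(lda_a.components_.shape)

# the same seed on the same data gives the same fitted topics
print(np.allclose(lda_a.components_, lda_b.components_))
```

Changing `random_state` may produce different topics or the same topics in a different order, which is why fixing it helps when comparing runs.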

In [9]:
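import random

# on scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
feature_names = count_vect.get_feature_names()
for i in range(10):
    # randrange excludes the upper bound; randint(0, n) could return n and raise an IndexError
    random_id = random.randrange(len(feature_names))
    print(feature_names[random_id])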

trouble
dispense
woof
qualities
mcdougalls
puffed
eyeball
manufacture
brash
mastiff


In [10]:
# finding 10 words with highest probability for first topic
first_topic = lda.components_[0]

In [12]:
# first_topic holds one weight (an unnormalized probability) for each of the 14,546 words. argsort() returns the indices in ascending order of weight, so the 10 words with the highest probabilities are at the last 10 positions of the sorted index array.
top_topic_words = first_topic.argsort()[-10:]
top_topic_words

array([14106,  5892,  7088,  4290, 12596,  5771,  5187, 12888,  7498,
       12921])
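
The `argsort()[-10:]` idiom is easier to follow on a small array. Below is a minimal sketch with made-up weights for a hypothetical six-word vocabulary (`toy_vocab` and `toy_topic` are illustrative names, not values from the fitted model). It also shows that normalizing the weights into probabilities leaves the ranking unchanged.

```python
import numpy as np

# hypothetical weights of one topic over a six-word vocabulary
toy_vocab = np.array(["tea", "dog", "coffee", "sugar", "cup", "water"])
toy_topic = np.array([9.0, 0.5, 2.0, 4.0, 1.0, 7.0])

top3_ascending = toy_topic.argsort()[-3:]  # indices of the 3 largest weights, smallest first
top3_descending = top3_ascending[::-1]     # most probable word first

print(toy_vocab[top3_ascending])   # ['sugar' 'water' 'tea']
print(toy_vocab[top3_descending])  # ['tea' 'water' 'sugar']

# dividing by the row sum gives probabilities; the order of the words stays the same
toy_probs = toy_topic / toy_topic.sum()
print(toy_probs.argsort()[-3:])    # [3 5 0]
```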

In [13]:
# the words are printed in ascending order of probability, so the last word has the highest probability
for i in top_topic_words:
    print(count_vect.get_feature_names()[i])

water
great
just
drink
sugar
good
flavor
taste
like
tea


In [15]:
for i, topic in enumerate(lda.components_):
    print(f'Top 10 words for topic #{i}:')
    # idx avoids reusing the loop variable i inside the comprehension
    print([count_vect.get_feature_names()[idx] for idx in topic.argsort()[-10:]])
    print('\n')

Top 10 words for topic #0:
['water', 'great', 'just', 'drink', 'sugar', 'good', 'flavor', 'taste', 'like', 'tea']


Top 10 words for topic #1:
['br', 'chips', 'love', 'flavor', 'chocolate', 'just', 'great', 'taste', 'good', 'like']


Top 10 words for topic #2:
['just', 'drink', 'orange', 'sugar', 'soda', 'water', 'like', 'juice', 'product', 'br']


Top 10 words for topic #3:
['gluten', 'eat', 'free', 'product', 'like', 'dogs', 'treats', 'dog', 'br', 'food']


Top 10 words for topic #4:
['cups', 'price', 'great', 'like', 'amazon', 'good', 'br', 'product', 'cup', 'coffee']




In [16]:
# compute the topic distribution of every document: one row per document, one column per topic
topic_values = lda.transform(doc_term_matrix)
topic_values.shape

(20000, 5)

In [17]:
# add a column to the original data frame that stores the most probable topic for each text
reviews_datasets['topic'] = topic_values.argmax(axis=1)

In [18]:
reviews_datasets.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,topic
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,3
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,1
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,1
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,0
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,1
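
The `topic` column keeps only the single most probable topic per review. The sketch below uses a made-up document-topic matrix (`toy_topic_values` is illustrative, not output of the fitted model) to show how `argmax(axis=1)` picks that topic, and how the maximum value in each row indicates how strong the assignment is.

```python
import numpy as np

# hypothetical topic distributions for 3 documents over 5 topics (each row sums to 1)
toy_topic_values = np.array([
    [0.05, 0.05, 0.05, 0.80, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.30, 0.28, 0.02, 0.20, 0.20],
])

dominant_topic = toy_topic_values.argmax(axis=1)  # column index of the largest value per row
topic_strength = toy_topic_values.max(axis=1)     # probability of that dominant topic

print(dominant_topic)  # [3 1 0]
print(topic_strength)  # [0.8 0.6 0.3]
```

The third document is assigned topic 0 even though topic 1 is almost as likely, so keeping the maximum probability next to the label can help spot reviews whose assignment is uncertain.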
