## Topic Modelling

Topic modeling lets us discover labels for an unlabeled collection of documents by clustering the documents around the topics they have in common.

### Latent Dirichlet Allocation Theory

LDA is based on the Dirichlet distribution, a probability distribution over vectors of proportions that sum to 1. LDA was first published as a graphical model for topic discovery in 2003.

Assumptions:

1. Documents with similar topics use similar groups of words
2. Latent topics can then be found by searching for groups of words that frequently occur together in documents

The number of topics, K, must be chosen up front, before the model is fitted.
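To make the assumptions concrete, here is a minimal sketch of the generative story LDA assumes: every topic is a distribution over words, every document is a distribution over topics, and both are drawn from Dirichlet priors. The toy vocabulary, the prior value of 0.5, and all variable names are assumptions chosen for illustration; this is not how scikit-learn fits the model, only the process the model imagines produced the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

vocab = ["health", "care", "vote", "election", "music", "song"]  # toy vocabulary (assumed)
n_topics = 3      # K, chosen up front
doc_length = 8    # number of words to generate for one toy document

# Each topic is a distribution over the vocabulary, drawn from a Dirichlet prior.
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

# Each document is a distribution over topics, also drawn from a Dirichlet prior.
doc_topic = rng.dirichlet(alpha=np.full(n_topics, 0.5))

# Generate the document: pick a topic for each word slot, then a word from that topic.
words = []
for _ in range(doc_length):
    z = rng.choice(n_topics, p=doc_topic)        # latent topic for this word
    w = rng.choice(len(vocab), p=topic_word[z])  # word index drawn from that topic
    words.append(vocab[w])

print(doc_topic.round(2))
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word and document-topic distributions that most plausibly generated them.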





### Latent Dirichlet Allocation

In [9]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer # For preprocessing
from sklearn.decomposition import LatentDirichletAllocation # For topic modeling

npr = pd.read_csv('npr.csv')
print(npr.head()) # No label column
print(len(npr))


                                             Article
0  In the Washington of 2016, even when the polic...
1    Donald Trump has used Twitter  —   his prefe...
2    Donald Trump is unabashedly praising Russian...
3  Updated at 2:50 p. m. ET, Russian President Vl...
4  From photography, illustration and video, to d...
11992


In [10]:
# PREPROCESSING

cv = CountVectorizer(max_df=0.9,min_df=2,stop_words='english')
# max_df: ignore terms that appear in more than 90% of the documents
# min_df: ignore terms that appear in fewer than 2 documents
# stop_words: ignore common English words like 'the', 'a', 'an'
dtm = cv.fit_transform(npr['Article']) # Document-Term Matrix
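The effect of `max_df` and `min_df` is easiest to see on a tiny corpus. The block below is a hand-rolled sketch of the document-frequency rule, not scikit-learn's own code: it splits on whitespace and skips stop-word removal, and the three toy sentences are made up for the example.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
max_df = 0.9  # float: a fraction of the documents
min_df = 2    # int: an absolute number of documents

# Document frequency: in how many documents does each term appear at least once?
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

max_count = max_df * len(docs)  # 0.9 * 3 = 2.7 documents
kept = sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

print(sorted(doc_freq.items()))
print(kept)  # ['cat', 'dog', 'on', 'sat']
```

Here `the` is dropped because it appears in all three documents, while `mat`, `log` and `chased` are dropped because each appears in only one.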

In [11]:
LDA = LatentDirichletAllocation(n_components=7,random_state=42)
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)
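Once fitted, the model stores one row of word weights per topic in `LDA.components_`. The scikit-learn documentation describes these values as pseudo-counts rather than probabilities; dividing each row by its sum turns it into a distribution over words. A minimal sketch, using a small hypothetical array in place of the real `components_`:

```python
import numpy as np

# Hypothetical stand-in for LDA.components_: 2 topics x 4 vocabulary words.
components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [0.2, 0.8, 6.0, 3.0],
])

# Divide each row by its sum so every topic becomes a distribution over words.
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs.round(3))
print(topic_word_probs.sum(axis=1))  # each row now sums to 1
```

Because each row is only divided by a positive constant, the ranking of words within a topic is the same before and after normalization.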

In [15]:
# Grab the vocabulary of words
import random

vocab = cv.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2
print("Number of words in the vocabulary:", len(vocab))
# randrange excludes the upper bound, so the index is always valid
print("Random Word from vocabulary:", vocab[random.randrange(len(vocab))])

Number of words in the vocabulary: 54777
Random Word from vocabulary: congregation




In [19]:
# Grab the topics
print(len(LDA.components_)) # Number of topics

topic = LDA.components_[0] # Word weights for the first topic
top_ten_words = topic.argsort()[-10:]
# argsort() returns index positions sorted from least to greatest
# We need the 10 greatest values,
# so we take the last 10 positions of argsort using [-10:]

feature_names = cv.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2
for i in top_ten_words:
    print(feature_names[i])

7
new
percent
government
company
million
care
people
health
said
says
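The `argsort()[-10:]` idiom used in the cell above can be checked on a toy array. The words and weights below are made up for illustration:

```python
import numpy as np

words = np.array(["care", "vote", "health", "music", "said"])  # toy vocabulary (assumed)
weights = np.array([5.0, 0.2, 9.1, 0.7, 1.3])                   # toy weights for one topic

order = weights.argsort()    # indices from smallest to largest weight
top_two = order[-2:]         # the last two indices point at the two largest weights

print(order)                 # [1 3 4 0 2]
print(words[top_two])        # ['care' 'health']
print(words[top_two[::-1]])  # ['health' 'care'], strongest word first
```

Note that the slice keeps ascending order, so the strongest word comes last; reverse it with `[::-1]` if you want the strongest word first.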


In [21]:
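# Grab the highest-weighted words for each topic

feature_names = cv.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2
for i,topic in enumerate(LDA.components_):
    print("Topic {}: ".format(i))
    # Use a separate name (idx) so the topic number i is not shadowed
    print([feature_names[idx] for idx in topic.argsort()[-15:]])
    print('\n\n')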

Topic 0: 
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']



Topic 1: 
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']



Topic 2: 
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']



Topic 3: 
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']



Topic 4: 
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']



Topic 5: 
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think', 'people', 'just', 'like']



Topic 6: 
['student', 'years', 'data', 'science', 'university', 'people', 'time', 'sc