# Topic Modeling Overview
* Topic modeling lets us efficiently analyse large volumes of text by clustering documents into topics. A large amount of text data is unlabeled, which means we cannot apply the supervised learning approaches used previously to build machine learning models for it.
* If we have unlabeled data, we can attempt to "discover" labels.
* In the case of text data, this means attempting to discover clusters of documents, grouped together by topic.
* A very important idea to keep in mind is that we don't know the "correct" topic or "right answer"!
* All we know is that the documents clustered together share similar topic ideas. It is up to the user to identify what these topics represent.

<font color='purple'>For topic modeling, we will look at two algorithms:</font>

**1. Latent Dirichlet Allocation**
    
**2. Non-Negative Matrix Factorization**
    
### Latent Dirichlet Allocation
**Johann Peter Gustav Lejeune Dirichlet** was a German mathematician in the 1800s who contributed widely to the field of modern mathematics. A probability distribution is named after him, the `Dirichlet distribution`, and Latent Dirichlet Allocation is based on it.

LDA was first published in 2003 as a graphical model for topic discovery in the Journal of Machine Learning Research by David Blei, Andrew Ng and Michael I. Jordan.
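
To make the Dirichlet distribution a little more concrete, here is a minimal sketch that draws topic mixtures from it with NumPy. Each draw is a vector of non-negative numbers that sums to 1, which is why it is a natural fit for "how much of each topic is in a document". The number of topics (3) and the concentration values are illustrative assumptions, not settings taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Concentration parameters for 3 hypothetical topics (illustrative values only).
sparse_alpha = [0.1, 0.1, 0.1]     # small values: most mass lands on one or two topics
dense_alpha = [10.0, 10.0, 10.0]   # large values: mass is spread fairly evenly

# Each row is one sampled topic mixture for one imaginary document.
sparse_mix = rng.dirichlet(sparse_alpha, size=5)
dense_mix = rng.dirichlet(dense_alpha, size=5)

print(sparse_mix.round(2))
print(dense_mix.round(2))
print(sparse_mix.sum(axis=1))  # every row sums to 1 (up to floating-point error)
```

With small concentration values the sampled documents tend to be dominated by a single topic, which is roughly the behaviour we hope to see in real articles.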
    
### Non-negative Matrix Factorization
Non-negative matrix factorization (NMF) is an unsupervised algorithm that simultaneously performs dimensionality reduction and clustering. We can use it in conjunction with TF-IDF to model topics across documents.
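
As a rough sketch of how the two pieces fit together, the snippet below builds a TF-IDF matrix with scikit-learn's `TfidfVectorizer` and factorizes it with `NMF`. The four-sentence corpus, the choice of 2 components, and the `init`/`max_iter` settings are assumptions made for illustration; the NPR data is not used here.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus, for illustration only.
toy_docs = [
    "doctors treat patients in the hospital",
    "patients visit doctors for health care",
    "voters choose a president in the election",
    "the election campaign targets voters",
]

# TF-IDF turns each document into a non-negative weighted term vector.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(toy_docs)   # shape: (n_docs, n_terms)

# Factorize X into W (document-topic weights) and H (topic-term weights).
nmf = NMF(n_components=2, init="nndsvda", random_state=42, max_iter=500)
W = nmf.fit_transform(X)
H = nmf.components_

print(W.shape, H.shape)
print(W.argmax(axis=1))  # strongest component for each document
```

Because both `W` and `H` are non-negative, each row of `H` can be read as a weighted list of words, and each row of `W` as how strongly a document expresses each of those lists.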
    

## Now we will perform topic modeling with the help of Latent Dirichlet Allocation

### Load the data

In [130]:
import pandas as pd 
import numpy as np


In [131]:
npr=pd.read_csv("npr.csv")
npr.head()

|   | Article |
|---|---------|
| 0 | In the Washington of 2016, even when the polic... |
| 1 | Donald Trump has used Twitter — his prefe... |
| 2 | Donald Trump is unabashedly praising Russian... |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... |
| 4 | From photography, illustration and video, to d... |


As you can see, we have no labels in this data set.

In [132]:
npr["Article"][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

In [133]:
len(npr)

11992

### Preprocessing 

In [134]:
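# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Ignore terms found in more than 95% of documents or in fewer than 2 documents,
# and drop common English stop words
cv=CountVectorizer(max_df=0.95,min_df=2,stop_words='english')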

In [135]:
# Transform the data
dtm=cv.fit_transform(npr["Article"])

In [136]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
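
To see what `max_df=0.95` and `min_df=2` do to the vocabulary, here is a small sketch on a made-up four-document corpus (not the NPR data). A word that appears in every document exceeds the 95% threshold and is dropped, and a word that appears in only one document falls below `min_df` and is dropped as well. The `getattr` fallback is an assumption to keep the example working across scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "costs" is in all 4 documents, "health" and "election" in 2 each,
# and every other word in exactly 1.
toy_docs = [
    "health care costs rise",
    "health insurance costs fall",
    "election campaign costs money",
    "election voters costs turnout",
]

toy_cv = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
toy_dtm = toy_cv.fit_transform(toy_docs)

# get_feature_names_out() exists in scikit-learn >= 1.0; older releases only
# provide get_feature_names().
get_names = getattr(toy_cv, "get_feature_names_out", None) or toy_cv.get_feature_names
vocab = [str(term) for term in get_names()]

print(vocab)              # ['election', 'health']
print(toy_dtm.toarray())  # one row per document, one column per kept term
```

The same filtering explains why the NPR matrix has far fewer columns than the number of distinct words in the articles.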

### LDA 

In [137]:
# Import LDA
from sklearn.decomposition import LatentDirichletAllocation
# Look for 7 topics; random_state fixes the seed so results are reproducible
lda=LatentDirichletAllocation(n_components=7,random_state=42)

lda.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=7, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [142]:
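# Note: on scikit-learn >= 1.0 use cv.get_feature_names_out(); get_feature_names() was removed in 1.2
len(cv.get_feature_names())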

54777

In [151]:
cv.get_feature_names()[50000]

'transcribe'

In [152]:
# Print ten randomly chosen words from the vocabulary
feature_names=cv.get_feature_names()
for i in range(10):
    j=np.random.randint(0,len(feature_names))
    print(feature_names[j])

surveilled
speculators
oblivious
vignesh
malpractice
recompense
emailed
estrada
flamethrower
radius


In [153]:
len(lda.components_)

7

In [155]:
lda.components_.shape

(7, 54777)

In [160]:
lda.components_[0]

array([8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
       1.43006821e-01, 1.42902042e-01, 1.42861626e-01])
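
Note that the values above are not probabilities (one of them is in the thousands). As far as I understand the scikit-learn documentation, each row of `components_` holds per-word pseudo-counts for a topic, and dividing a row by its sum turns it into a distribution over words. The sketch below shows that normalization on a hypothetical 2-topic, 4-word array rather than on the fitted model.

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics over a 4-word vocabulary.
toy_components = np.array([
    [8.0, 2.0, 0.5, 0.5],
    [0.2, 0.2, 3.0, 0.6],
])

# Divide each row by its own sum so every topic becomes a distribution over words.
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_probs.round(3))
print(topic_word_probs.sum(axis=1))  # each row now sums to 1
```

Normalizing does not change the ranking of words within a topic, so the top-word lists that follow are the same either way.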

In [161]:
## Find the words with the highest weight in topic #0
single_topic=lda.components_[0]
# argsort() orders indices from lowest to highest value, so the last ten are the top ten words
top_ten_words=single_topic.argsort()[-10:]

In [162]:
for index in top_ten_words:
    print(cv.get_feature_names()[index])

new
percent
government
company
million
care
people
health
said
says


It seems that our first topic relates to the public health sector.
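
One detail worth keeping in mind: `argsort()` sorts in ascending order, so slicing with `[-10:]` lists the top words from weakest to strongest, which is why the heaviest word is printed last. The sketch below uses a made-up five-word vocabulary and made-up weights to show the ordering and how to reverse it.

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only.
toy_vocab = ["care", "health", "music", "vote", "water"]
toy_topic = np.array([5.0, 9.0, 0.1, 0.3, 2.0])

# Indices of the 3 largest weights, smallest of the three first.
ascending = toy_topic.argsort()[-3:]
# Reverse the slice to list the strongest word first.
descending = ascending[::-1]

print([toy_vocab[i] for i in ascending])   # ['water', 'care', 'health']
print([toy_vocab[i] for i in descending])  # ['health', 'care', 'water']
```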

In [163]:
## Now grab the highest-weight words for each topic
feature_names=cv.get_feature_names()
for i,topic in enumerate(lda.components_):
    print(f"Top 15 words for topic # {i}")
    print([feature_names[index] for index in topic.argsort()[-15:]])
    print("\n")
    print("\n")

Top 15 words for topic # 0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']




Top 15 words for topic # 1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']




Top 15 words for topic # 2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']




Top 15 words for topic # 3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']




Top 15 words for topic # 4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']




Top 15 words for topic # 5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think', 'people

### Attaching Discovered Topic Labels to Original Articles

In [164]:
len(npr)

11992

In [165]:
dtm.shape

(11992, 54777)

In [166]:
topic_results=lda.transform(dtm)

In [167]:
topic_results.shape

(11992, 7)

In [171]:
topic_results[0].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

In [173]:
topic_results[0].argmax()

1
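
Taking the `argmax` gives a single "hard" label, but the row itself carries more information: the first article is mostly topic 1 with a sizeable share of topic 4. The sketch below reuses the rounded row printed above to pull out the dominant topic, its share, and the runner-up. The variable names are illustrative.

```python
import numpy as np

# Rounded topic mixture for the first article, copied from the output above.
doc_topics = np.array([0.02, 0.68, 0.0, 0.0, 0.3, 0.0, 0.0])

dominant = int(doc_topics.argmax())          # topic with the largest share
confidence = float(doc_topics.max())         # how large that share is
runner_up = int(doc_topics.argsort()[-2])    # topic with the second-largest share

print(dominant, confidence, runner_up)  # 1 0.68 4
```

Keeping the share alongside the label makes it easy to flag articles whose assignment is weak or split between two topics.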

### Attach a new topic column to the original data

In [174]:
npr["topic"]=topic_results.argmax(axis=1)

In [176]:
npr.head()

|   | Article | topic |
|---|---------|-------|
| 0 | In the Washington of 2016, even when the polic... | 1 |
| 1 | Donald Trump has used Twitter — his prefe... | 1 |
| 2 | Donald Trump is unabashedly praising Russian... | 1 |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... | 1 |
| 4 | From photography, illustration and video, to d... | 2 |
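
Since the topic numbers carry no meaning on their own, a common next step is to map them to names chosen by a person after reading each topic's top words. The sketch below does this on a small stand-in DataFrame. The label strings are hypothetical examples and are not part of the original analysis.

```python
import pandas as pd

# Stand-in for the `topic` column built above (values copied from the table).
toy = pd.DataFrame({"topic": [1, 1, 1, 1, 2]})

# Hypothetical, human-chosen names for a few topic ids.
topic_labels = {
    0: "health care & economy",
    1: "politics & security",
    2: "daily life",
}

# Translate each numeric topic id into its readable label.
toy["topic_label"] = toy["topic"].map(topic_labels)

print(toy)
print(toy["topic"].value_counts().sort_index())  # number of articles per topic
```

The per-topic counts are a quick sanity check: a topic that swallows almost every article, or one that gets almost none, may suggest trying a different number of components.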


In [177]:
npr["Article"][4]

'From photography, illustration and video, to data visualizations and immersive experiences, visuals are an important part of our storytelling at NPR. Interwoven with the written and the spoken word, images  —   another visual language  —   can create deeper understanding and empathy for the struggles and triumphs we face together. We told a lot of stories in 2016  —   far more than we can list here. So, instead, here’s a small selection of our favorite pieces, highlighting some of the work we’re most proud of, some of the biggest stories we reported, and some of the stories we had the most fun telling. Transport yourself to Rocky Mountain National Park, with all its sights and sounds, in an immersive geology lesson with Oregon State University geology professor Eric Kirby, who discusses the geologic history of the Rockies in   video. ”Today, Indians use much less energy per person than Americans or Chinese people. Many of its 1. 2   population live on roughly $2 a day. But what if all