In [1]:
import pandas as pd

In [2]:
npr = pd.read_csv('npr.csv')

In [3]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [4]:
# Printing the first article
npr['Article'][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

In [5]:
# Checking the number of articles (rows) in the data set
len(npr)

11992

# Data preprocessing

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
cv = CountVectorizer(max_df = 0.95, min_df = 2, stop_words='english' )
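# max_df=0.95 drops very common words: terms that appear in more than 95% of the documents (a float is a proportion, 0-1).
# min_df sets the minimum document frequency. You can pass an int (a number of documents) or a float (a proportion).
# min_df=2 means a word must appear in at least 2 documents to be kept; rarer words are discarded.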

In [8]:
# fit_transform on the whole data set: this is unsupervised learning, so there are no labels to hold out
dtm = cv.fit_transform(npr['Article'])
# dtm = document-term matrix

In [9]:
dtm

# 11992 documents (rows) by 54777 terms (columns)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 3033388 stored elements and shape (11992, 54777)>

# Performing LDA

In [10]:
# LDA is built-in in sklearn, import it
from sklearn.decomposition import LatentDirichletAllocation

In [11]:
LDA = LatentDirichletAllocation(n_components=7, random_state = 42)
# n_components -> how many topics we want

(We are working with a large number of documents. LDA is an iterative process: it makes repeated passes over the documents, updating the weights for each word and topic, until they stabilize or the iteration limit (max_iter) is reached. That is why fitting takes a while.)

In [12]:
# Fit LDA to the document-term matrix
LDA.fit(dtm) # This should take a while
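
As a rough illustration of the iterative fit, the sketch below trains LDA on a four-document toy corpus with a small cap on the number of passes. The corpus and the names dtm_toy and lda_toy are assumptions made for this example; max_iter and n_iter_ are part of scikit-learn's LatentDirichletAllocation.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): two health-like and two election-like documents
docs = [
    "health care insurance cost",
    "insurance health patients care",
    "election vote campaign party",
    "party vote election voters",
]
dtm_toy = CountVectorizer().fit_transform(docs)

# max_iter caps the number of passes over the documents
lda_toy = LatentDirichletAllocation(n_components=2, max_iter=5, random_state=42)
lda_toy.fit(dtm_toy)

print(lda_toy.n_iter_)            # passes actually performed
print(lda_toy.components_.shape)  # (number of topics, vocabulary size)
```

Fewer passes make the fit faster but may leave the topics less settled, so this is a trade-off to try out rather than a recommended setting.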

## Now we have 3 steps left:
1. Grab the vocabulary of words
2. Grab the topics
3. Grab the highest probability words per topic

Step 1. Grab the vocabulary of words

get_feature_names_out() returns every word in the vocabulary exactly once. If you print its len, you will get 54777, which matches the number of columns in the dtm.

In [15]:
len(cv.get_feature_names_out())

54777

In [17]:
type(cv.get_feature_names_out()) # A NumPy array holding every word in the vocabulary

numpy.ndarray

In [20]:
cv.get_feature_names_out()[43000] # You can grab a word by index

'scabbed'

Step 2. Grab the topics

In [21]:
len(LDA.components_) # It will return the number of topics

7

In [22]:
type(LDA.components_) # This is actually a NumPy array

numpy.ndarray

In [23]:
LDA.components_.shape # It's an array of 7 topics by 54777 words

(7, 54777)

In [24]:
LDA.components_ # checking the actual components

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])
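
These numbers are not probabilities yet. The scikit-learn documentation describes components_ as pseudo-counts that become a per-topic word distribution once each row is divided by its sum. The many values near 0.1428 are about 1/7, which is consistent with words that received almost no weight in that topic beyond the default prior. A minimal sketch of the normalization, using a made-up 2 x 4 array rather than the real LDA.components_:

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics over 4 words
components = np.array([
    [8.0, 1.5, 0.5, 0.0],
    [0.5, 0.5, 3.0, 6.0],
])

# Divide every row by its own sum so that each topic sums to 1
topic_word_probs = components / components.sum(axis=1, keepdims=True)
print(topic_word_probs)
print(topic_word_probs.sum(axis=1))
```

The ranking of words inside a topic is the same before and after normalization, so argsort can be applied to components_ directly.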

Now we will use this information, together with the vocabulary (cv.get_feature_names_out()), to get the highest-probability words per topic.

In [26]:
single_topic = LDA.components_[0] # The very first topic

In [27]:
single_topic.argsort() # argsort() returns the indices that would sort this array (lowest -> highest)

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993])

Let me make it clear with an example

In [29]:
import numpy as np

In [30]:
arr = np.array([10,200,2])

In [31]:
arr.argsort() # this is the index position that will sort this array (lowest -> highest)

array([2, 0, 1])
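
Extending the same toy array a little: indexing with the argsort result gives the sorted values, and slicing from the end gives the positions of the largest values. The names order and top_two are illustrative.

```python
import numpy as np

arr = np.array([10, 200, 2])
order = arr.argsort()        # indices from the smallest to the largest value
print(arr[order])            # the values in ascending order

top_two = order[-2:][::-1]   # last two indices, reversed so the largest comes first
print(top_two, arr[top_two])
```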

Back to the actual task

In [32]:
# argsort() gives index positions ordered from the lowest to the highest value.
# We want the 10 highest values, so we take the last 10 index positions.

single_topic.argsort()[-10:] # indices of the 10 highest-weighted words

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993])

In [33]:
top_ten_words = single_topic.argsort()[-10:]

In [34]:
for i in top_ten_words:
  print(cv.get_feature_names_out()[i])

new
percent
government
company
million
care
people
health
said
says


Judging by these words, this topic seems to be about government, companies, money (percent, million) and health care. Keep in mind that these are the top words of one topic, not of a single article. Let's look at more words to be sure.

In [35]:
top_twenty_words = single_topic.argsort()[-20:]

for i in top_twenty_words:
  print(cv.get_feature_names_out()[i])

president
state
tax
insurance
trump
companies
money
year
federal
000
new
percent
government
company
million
care
people
health
said
says


I can see words like president, state, tax, insurance, federal, government, and percent. Now it's becoming clearer that this topic is about health insurance and politics.

LDA is now revealing an underlying topic. It's indicating that these specific words are likely to appear in this particular single_topic. Let's set up a simple loop to print out the top 15 words for each of the seven topics.



Code explanation: LDA.components_ is an array of 7 topics by 54777 words. For each topic (each row), argsort()[-15:] gives the index positions of the 15 highest-weighted words, and the list comprehension looks up each of those positions in get_feature_names_out() to turn them into words.

In [38]:
for index,topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([cv.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think',
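
The lists above run from the least to the most important of the 15 words, because argsort sorts in ascending order. If you prefer the most important word first, one option is a small helper like the one below. top_words is a hypothetical function written for this example (not part of scikit-learn), and the array and vocabulary are made up.

```python
import numpy as np

def top_words(components, feature_names, n=3):
    """Return the n highest-weighted words per topic, most important first."""
    return [
        [str(feature_names[i]) for i in row.argsort()[-n:][::-1]]
        for row in components
    ]

# Hypothetical weights for 2 topics over a 4-word vocabulary
toy_components = np.array([[5.0, 0.1, 3.0, 0.2],
                           [0.1, 4.0, 0.3, 6.0]])
toy_vocab = ["health", "vote", "care", "trump"]

print(top_words(toy_components, toy_vocab, n=2))
```

With the notebook's objects, the equivalent call would presumably be top_words(LDA.components_, cv.get_feature_names_out(), n=15).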

Notice that we get exactly seven topics, numbered zero through six, because we set n_components=7. LDA does not discover the number of topics for you; you decide it beforehand. If the topics aren't clear or seem too similar, choose fewer topics. If you want finer-grained topics, ask for more from the start.
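
One rough way to compare different topic counts is to fit a model for each candidate and look at its perplexity, which scikit-learn exposes as the perplexity() method. The sketch below does this on a made-up corpus; the candidate values and variable names are assumptions. Perplexity on the training data is only a loose guide, so reading the top words of each topic remains the main check.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical)
docs = [
    "health care insurance cost",
    "insurance health patients care",
    "election vote campaign party",
    "party vote election voters",
    "music album song band",
    "band song music concert",
]
dtm_toy = CountVectorizer().fit_transform(docs)

scores = {}
for k in (2, 3, 4):
    lda_k = LatentDirichletAllocation(n_components=k, random_state=42)
    lda_k.fit(dtm_toy)
    scores[k] = lda_k.perplexity(dtm_toy)  # lower is better, but read with care

for k, score in scores.items():
    print(k, round(score, 1))
```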

Assigning a topic to each article

In [39]:
dtm #we have a dtm

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 3033388 stored elements and shape (11992, 54777)>

In [40]:
npr #and our npr

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."
...,...
11987,The number of law enforcement officers shot an...
11988,"Trump is busy these days with victory tours,..."
11989,It’s always interesting for the Goats and Soda...
11990,The election of Donald Trump was a surprise to...


We need to create a new column that holds each article's topic number.

In [42]:
# Topic probabilities for every article, computed from the dtm (document-term matrix)
topic_results = LDA.transform(dtm)

In [43]:
topic_results.shape

(11992, 7)
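
Each row of this array is one article and each column is one topic. The values in a row are that article's topic proportions, so they add up to 1. A minimal sketch on a made-up corpus (docs, lda_toy and doc_topics are illustrative names):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical)
docs = [
    "health care insurance cost",
    "insurance health patients care",
    "election vote campaign party",
    "party vote election voters",
]
dtm_toy = CountVectorizer().fit_transform(docs)
lda_toy = LatentDirichletAllocation(n_components=2, random_state=42).fit(dtm_toy)

doc_topics = lda_toy.transform(dtm_toy)
print(doc_topics.shape)        # (number of documents, number of topics)
print(doc_topics.sum(axis=1))  # each row sums to 1, up to rounding
```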

In [47]:
# For each article (row), take the index of the topic with the highest probability
npr['Topic'] = topic_results.argmax(axis=1)
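
argmax(axis=1) keeps only the winning topic and throws away how strong the win was. The sketch below, on a made-up probability array, also keeps the winning probability so that weak assignments can be spotted. toy_results, labels and confidence are illustrative names.

```python
import numpy as np

# Hypothetical topic probabilities for 2 articles over 3 topics
toy_results = np.array([
    [0.05, 0.90, 0.05],
    [0.40, 0.35, 0.25],
])

labels = toy_results.argmax(axis=1)   # index of the most probable topic per row
confidence = toy_results.max(axis=1)  # probability of that topic
print(labels)
print(confidence)
```

The first article clearly belongs to topic 1, while the second is assigned to topic 0 with only 0.40, so its label deserves less trust.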

In [48]:
npr

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
...,...,...
11987,The number of law enforcement officers shot an...,1
11988,"Trump is busy these days with victory tours,...",4
11989,It’s always interesting for the Goats and Soda...,3
11990,The election of Donald Trump was a surprise to...,4
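
Topic numbers are hard to read on their own. Once you have looked at the top words, you can attach your own names to the topics. The labels below are one possible reading of the word lists, not something LDA produces, and the small DataFrame stands in for npr.

```python
import pandas as pd

# Hypothetical labels chosen by reading the top words of each topic
topic_labels = {0: "health/economy", 1: "security/politics", 4: "elections"}

toy = pd.DataFrame({"Topic": [1, 1, 4, 0]})
toy["Topic Label"] = toy["Topic"].map(topic_labels)

print(toy)
print(toy["Topic"].value_counts())  # how many articles fall into each topic
```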


Testing

In [49]:
npr['Article'][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

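The article is about sanctions on Russia and partisan politics in Washington, which fits its assigned topic #1 (russia, security, government, president, trump) rather than the health care topic #0.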