Importing all the necessary libraries.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('/content/drive/My Drive/nlp/UPDATED_NLP_COURSE/05-Topic-Modeling/npr.csv')

In [None]:
df.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


* <font color = Red>Note:</font> There is no target variable here because this is an unsupervised method. LDA groups the documents into topics, and each topic is characterized by the words most likely to appear in it.

In [None]:
df.shape

(11992, 1)

In [None]:
print(df.loc[0]['Article'])

In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing o

In [None]:
print(df.Article[0])

In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing o

## Preprocessing

* To preprocess the text we use `CountVectorizer` with three important parameters: `max_df`, `min_df`, and `stop_words`. The first two are described below.

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
* (In plain terms: ignore words that are so common that they appear in more than the specified proportion of documents.)

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
* In other words, ignore words that appear in too few documents. For example, `min_df = 2` means a word must appear in at least two documents to be kept; words that appear in only one document are ignored.
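To make the two thresholds concrete, here is a minimal pure-Python sketch of the same filtering idea on a made-up four-document corpus. The function `filter_vocabulary` and the toy sentences are hypothetical, and the sketch splits on whitespace only, so it is not how `CountVectorizer` tokenizes text internally.

```python
# Toy corpus; each string stands in for one article (made-up data).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang at dawn",
    "the dog barked",
]

def filter_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency is >= min_df and <= max_df * n_docs."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # set() so a word repeated inside one document is counted once
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    max_count = max_df * n_docs  # proportion converted to an absolute count
    return sorted(t for t, c in doc_freq.items() if min_df <= c <= max_count)

kept = filter_vocabulary(docs)
print(kept)  # "the" is in all 4 documents (too common); most words are in only 1 (too rare)
```

Only `cat` and `dog` survive here: each appears in exactly two documents, which satisfies both thresholds.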

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Ignore words that appear in more than 95% of the documents
# Keep only words that appear in at least two documents
# Also remove English stop words

cv = CountVectorizer(max_df = 0.95, min_df = 2, stop_words = 'english')


In [None]:
dtm = cv.fit_transform(df.Article)

In [None]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
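The matrix is stored in sparse format because almost all of its entries are zero. A quick back-of-the-envelope check, using only the shape and the stored-element count reported above (the variable names are just for illustration):

```python
n_docs, n_terms = 11992, 54777   # shape reported for dtm
stored = 3033388                 # stored (non-zero) elements reported for dtm

density = stored / (n_docs * n_terms)   # fraction of cells that are non-zero
avg_terms_per_doc = stored / n_docs     # distinct vocabulary words per article, on average

print(f"density: {density:.4%}")
print(f"average distinct terms per article: {avg_terms_per_doc:.1f}")
```

Less than half a percent of the cells hold a count, which is why a dense array would waste a lot of memory here.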

## LDA

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
lda = LatentDirichletAllocation(n_components=7,random_state=42)

In [None]:
lda.fit(dtm)

In [None]:
# Grab the vocabulary of words

In [None]:
# get_feature_names_out() returns every word in the fitted vocabulary
len(cv.get_feature_names_out())

54777

In [None]:
cv.get_feature_names_out().shape

(54777,)

In [None]:
type(cv.get_feature_names_out())

numpy.ndarray

In [None]:
# each index position holds one word of the vocabulary
cv.get_feature_names_out()[1300:1350]

array(['abo', 'aboard', 'abode', 'abolish', 'abolished', 'abolishing',
       'abolition', 'abolitionist', 'abolitionists', 'abominable',
       'abomination', 'abominations', 'aboraya', 'aboriginal', 'abort',
       'aborted', 'aborting', 'abortion', 'abortions', 'abortive', 'abou',
       'aboulafia', 'abound', 'abounded', 'abounds', 'abraham',
       'abrahamic', 'abrahams', 'abrahamsen', 'abramoff', 'abrams',
       'abrasion', 'abrasions', 'abrasive', 'abreast', 'abridged',
       'abrini', 'abroad', 'abrupt', 'abruptly', 'abs', 'abscess',
       'abscesses', 'absconded', 'absence', 'absences', 'absent',
       'absentee', 'absenteeism', 'absentia'], dtype=object)

In [None]:
# print a random word from the vocabulary

import random

# randint includes both endpoints, so the upper bound must be the last valid index
random_word_id = random.randint(0, len(cv.get_feature_names_out()) - 1)

cv.get_feature_names_out()[random_word_id]

'dispositions'

Each time the code above runs, it returns a different word. The randomly generated number serves as the index of the word to retrieve from the vocabulary.
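One thing to watch when drawing a random index: `random.randint(a, b)` can return `b` itself, so an upper bound equal to the vocabulary size can occasionally point one position past the end. A minimal sketch of two alternatives, using a short stand-in list instead of the real vocabulary:

```python
import random

# Stand-in for cv.get_feature_names_out(); the real vocabulary is much longer.
vocabulary = ["abo", "aboard", "abode", "abolish"]

random.seed(0)  # only to make the sketch repeatable

# randrange(n) draws from 0 .. n-1, so every result is a valid index
indices = [random.randrange(len(vocabulary)) for _ in range(1000)]
words = [vocabulary[i] for i in indices]

# random.choice picks an element directly, with no index bookkeeping
word = random.choice(vocabulary)
print(word)
```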

In [None]:
for i in range(10):
  random_word_id = random.randint(0, 54776)  # upper bound is inclusive, so use the last valid index
  print(cv.get_feature_names_out()[random_word_id])


deeds
891
hermaphrodites
tasked
ummah
uttar
zuma
reduction
listens
collegial


In [None]:
for i in range(10):
  random_word_id = random.randint(0, 54776)  # upper bound is inclusive, so use the last valid index
  print(cv.get_feature_names_out()[random_word_id])


treating
counterterrorism
voyeurism
sliced
leapfrog
origami
camel
hawks
chuck
dieting


## Showing Top words per Topic
* Grab the topics

In [None]:
# the model was fitted with 7 topics, so components_ has 7 rows
len(lda.components_)

7

In [None]:
type(lda.components_)

numpy.ndarray

In [None]:
lda.components_.shape

(7, 54777)

In [None]:
lda.components_

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])

In [None]:
# to get the highest-weighted words per topic, we combine lda.components_ with cv.get_feature_names_out()

In [None]:
# let's get the first topic
single_topic = lda.components_[0]

In [None]:
# argsort returns the indices that would sort the array in ascending order
# let's look at a basic example first

sample_array = np.array([200,10,30,0])

In [None]:
print(sample_array.argsort())

[3 1 2 0]


The result above shows that the smallest value (0) sits at index 3 and the largest value (200) sits at index 0. In other words, `argsort` returns the index positions ordered from the smallest value to the largest.
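Since we usually care about the largest values, it helps to see how the ascending index order can be reversed or sliced. A small sketch on the same sample array:

```python
import numpy as np

sample_array = np.array([200, 10, 30, 0])

ascending = sample_array.argsort()      # indices from smallest to largest value
descending = ascending[::-1]            # reversed: index of the largest value first
top_two = sample_array.argsort()[-2:]   # indices of the two largest values (still ascending)

print(ascending)                 # [3 1 2 0]
print(descending)                # [0 2 1 3]
print(sample_array[descending])  # [200  30  10   0]
print(top_two)                   # [2 0]
```

Slicing with `[-n:]` is the pattern used on the topic vectors: the last `n` positions of the `argsort` output are the `n` largest entries.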

In [None]:
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993])

In [None]:
# One of the words least representative of this topic (second index in the argsort output)
single_topic[18302]

0.14285714309286987

In [None]:
# Word most representative of this topic
single_topic[42993]

6247.245510521074
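The small value above is very close to 1/7, and that is probably no coincidence. As far as I know, when `topic_word_prior` is not passed, scikit-learn defaults it to `1 / n_components`, and a word that is almost never assigned to a topic keeps roughly that prior as its weight. Treat this as an assumption about the library defaults; the arithmetic itself is easy to check:

```python
n_components = 7                        # number of topics used for the model above
default_prior = 1 / n_components        # assumed default when topic_word_prior is not set
smallest_weight = 0.14285714309286987   # value shown above for single_topic[18302]

print(default_prior)
print(smallest_weight - default_prior)  # tiny positive difference
```

This also suggests that the values in `lda.components_` behave like word weights (pseudo-counts) rather than probabilities: the largest one above is in the thousands.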

In [None]:
# let's grab the top ten words for this topic

# argsort returns indices in ascending order of weight, so the last 10 indices belong to the top 10 words

single_topic.argsort()[-10:]  # last 10 indices, i.e. the highest-weighted words

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993])

In [None]:
top_words_index = single_topic.argsort()[-10:]

In [None]:
for i in top_words_index:
  print(cv.get_feature_names_out()[i])

new
percent
government
company
million
care
people
health
said
says


In [None]:
# for top 15 words

for i in single_topic.argsort()[-15:]:
  print(cv.get_feature_names_out()[i])

companies
money
year
federal
000
new
percent
government
company
million
care
people
health
said
says
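The index lookups above can be wrapped in a small helper. This is only a sketch: `top_words`, the toy weight matrix, and the five-word vocabulary are made up for illustration and stand in for `lda.components_` and `cv.get_feature_names_out()`.

```python
import numpy as np

def top_words(components, feature_names, n_top=3):
    """Return, for each topic row, the n_top highest-weighted words, strongest first."""
    feature_names = np.asarray(feature_names)
    result = []
    for topic in components:
        top_idx = topic.argsort()[::-1][:n_top]  # indices in descending order of weight
        result.append(feature_names[top_idx].tolist())
    return result

# Made-up weights: 2 topics x 5 words
toy_components = np.array([
    [0.1, 5.0, 0.2, 3.0, 9.0],
    [7.0, 0.1, 4.0, 0.3, 0.2],
])
toy_vocab = ["court", "health", "vote", "care", "says"]

topics = top_words(toy_components, toy_vocab)
for i, words in enumerate(topics):
    print(i, words)
```

With the real model, it is worth calling `cv.get_feature_names_out()` once and reusing the array, rather than calling it again for every word inside a loop.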


These look like business articles, perhaps. Let's confirm by using .transform() on our vectorized articles to attach a label number. But first, let's view all 7 topics found.

In [None]:
# let's print the top fifteen words for each of the seven topics

for index,topic in enumerate(lda.components_):
  print(f"The top 15 words for topic #{index}:")
  print([cv.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
  print('\n')


The top 15 words for topic #0:
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


The top 15 words for topic #1:
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


The top 15 words for topic #2:
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


The top 15 words for topic #3:
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


The top 15 words for topic #4:
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


The top 15 words for topic #5:
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 't

### Attaching Discovered Topic Labels to Original Articles

In [None]:
# this is the original document-term matrix
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [None]:
df.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [None]:
# compute the topic distribution of every document in the document-term matrix

topic_result = lda.transform(dtm)

In [None]:
type(topic_result)

numpy.ndarray

In [None]:
topic_result.shape

(11992, 7)

In [None]:
topic_result[0]

array([1.61040465e-02, 6.83341493e-01, 2.25376318e-04, 2.25369288e-04,
       2.99652737e-01, 2.25479379e-04, 2.25497980e-04])

Each value above is the probability that the document belongs to the corresponding topic.



<font color = red> Note: </font> Remember that we previously looked at `lda.components_`, which holds the weight of each word within a topic.
* Now we are looking at the probability of a document belonging to a particular topic.
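The two kinds of numbers are on different scales: each row returned by `.transform()` already sums to 1, while the rows of `lda.components_` do not. If word probabilities per topic are wanted, each row can be divided by its sum. A minimal sketch on a made-up 2 x 3 weight matrix (the names are hypothetical):

```python
import numpy as np

# Made-up word weights: 2 topics x 3 words, standing in for lda.components_
toy_components = np.array([
    [8.0, 1.0, 1.0],
    [0.5, 0.5, 4.0],
])

# Divide every row by its own sum so that each topic becomes a distribution over words
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_probs)
print(topic_word_probs.sum(axis=1))  # each row now sums to 1
```

Normalizing does not change which words rank highest within a topic, so the top-word lists stay the same.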

From the result above we can see that the first document (index 0) most likely belongs to topic #1, the second of the seven topics, because the probability at index position 1 is the highest.

In [None]:
topic_result[0].round(3)  # rounded to three decimals; these are proportions (0.683 is about 68%)

array([0.016, 0.683, 0.   , 0.   , 0.3  , 0.   , 0.   ])

It looks like the first document has its highest probability, about 68%, for topic #1 (the second topic, stored at index 1).
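The highest probability is not the whole story for this document: about 30% of its weight sits on another topic. A short sketch that ranks the topics for one document, using the rounded values printed above:

```python
import numpy as np

# Rounded topic probabilities of the first document, copied from the output above
doc_topics = np.array([0.016, 0.683, 0.0, 0.0, 0.3, 0.0, 0.0])

ranked = doc_topics.argsort()[::-1]       # topic indices, most probable first
best, runner_up = ranked[0], ranked[1]

print(best, doc_topics[best])             # 1 0.683
print(runner_up, doc_topics[runner_up])   # 4 0.3
```

Going by the top-word lists, index 4 is the election-related topic, which fits an article about sanctions over election interference.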

In [None]:
df.Article[0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

Let's assign the topic number with the highest probability to each document.

In [None]:
topic_result[0].argmax()   # this returns the index of the largest value

1

In [None]:
df['Topic'] = topic_result.argmax(axis = 1)
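With `axis = 1`, `argmax` works row by row, so every document gets the index of its own most probable topic. A toy sketch of that step, followed by a count of how many documents land in each topic (the 4 x 3 matrix is made up and stands in for `topic_result`):

```python
import numpy as np

# Made-up document-topic probabilities: 4 documents x 3 topics
toy_topic_result = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
    [0.10, 0.85, 0.05],
])

labels = toy_topic_result.argmax(axis=1)   # one topic index per document (row)
counts = np.bincount(labels, minlength=toy_topic_result.shape[1])  # documents per topic

print(labels)   # [0 2 0 1]
print(counts)   # [2 1 1]
```

Counting the labels this way is a quick sanity check that the documents are not all piling into a single topic.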

In [None]:
df.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
