# Topic Modeling - LDA

For this project you will be working with a dataset of around 12,000 articles that have no labeled category, and attempting to find a set of categories (topics) to assign these articles to. The model below uses 7 topics.


# Data Exploration

#### Task: Explore Data. Import pandas and read in the npr.csv file

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('~/data/npr.csv')

Let's view the Articles dataset.

In [3]:
data.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [4]:
data.shape # There are 11,992 articles in total.

(11992, 1)

Let's take a look at one of the articles.

In [5]:
data['Article'][10]



- As we can see, each record represents one article.
- The "Article" field contains the full text of that article.
- There is no label column telling us what each article is about.

# Preprocessing

#### Objective 
- To come up with a Document Term Matrix (DTM), which is the input to LDA.

#### Steps
- Instantiate CountVectorizer with required hyperparameters.
- Fit data to CountVectorizer and generate Document Term Matrix (DTM).

#### Task: Use CountVectorizer Vectorization to create a vectorized document term matrix. You may want to explore the max_df and min_df parameters.

#### CountVectorizer

In [6]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
# Instantiate CountVectorizer with required hyperparameters
# max_df (0.9) => Ignore words that show up in more than 90% of documents
# min_df (2)   => Ignore words that show up in fewer than 2 documents
# stop_words   => Remove English stopwords
# We could use spaCy to tokenize and remove stopwords, but CountVectorizer handles both for us.

cv = CountVectorizer(max_df=0.9,min_df=2,stop_words='english')
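
To make the effect of `max_df` and `min_df` concrete, here is a minimal pure-Python sketch of the document-frequency rule on a toy corpus. The corpus and the helper name `filter_by_document_frequency` are made up for illustration, and the whitespace tokenizer is a simplification. It is not scikit-learn's actual implementation.

```python
from collections import Counter


def filter_by_document_frequency(docs, max_df=0.9, min_df=2):
    """Return the sorted terms whose document frequency passes both thresholds.

    Assumption: max_df is a proportion of documents and min_df is an
    absolute document count, as in the CountVectorizer call above.
    """
    # Count in how many documents each term appears (not how often overall).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))

    max_count = max_df * len(docs)  # 0.9 * 4 documents = 3.6
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_count
    )


# Hypothetical toy corpus: "apple" is in every document, most words in only one.
toy_docs = [
    "apple banana cherry",
    "apple banana date",
    "apple cherry egg",
    "apple fig grape",
]

kept_terms = filter_by_document_frequency(toy_docs)
print(kept_terms)
```

Here "apple" is dropped for being too common (4 of 4 documents), and "date", "egg", "fig" and "grape" are dropped for being too rare (1 document each), leaving only "banana" and "cherry".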

In [8]:
print(cv)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.9, max_features=None, min_df=2,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)


Now that we have defined the CountVectorizer, let's fit it on the entire dataset and generate the Document Term Matrix.

In [9]:
%%time
# Create Document Term Matrix
dtm = cv.fit_transform(data['Article'])

CPU times: user 10.8 s, sys: 594 ms, total: 11.3 s
Wall time: 12.5 s


In [10]:
# It creates a sparse matrix with Number of Articles (Documents) (11992) and Number of Words (Terms) (54777)
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
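
The output above already tells us how sparse the matrix is. As a quick worked calculation using only the numbers printed above (the variable names are just for illustration):

```python
n_docs, n_terms = 11992, 54777  # shape of the DTM printed above
stored = 3033388                # stored (non-zero) elements printed above

density = stored / (n_docs * n_terms)  # fraction of cells that are non-zero
avg_terms_per_doc = stored / n_docs    # average distinct vocabulary terms per article

print(f"density: {density:.4%}")
print(f"average distinct terms per article: {avg_terms_per_doc:.1f}")
```

Fewer than 1% of the cells are non-zero, which is why the DTM is stored in a sparse format rather than as a dense array.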

# Latent Dirichlet Allocation

#### Steps
- 1) Fit the Model with DTM.
- 2) Get the Vocabulary of Words.
- 3) Get the topics.
- 4) Get the highest probability words per topic.
- 5) Attach Topic Numbers to each of the articles.

#### 1) Fit the Model with DTM

In [11]:
from sklearn.decomposition import LatentDirichletAllocation

In [12]:
# n_components => We want 7 general topics here.
# random_state => Fixed seed so the results are reproducible

topic_model = LatentDirichletAllocation(n_components=7,random_state=42)

In [13]:
%%time
topic_model.fit(dtm)

CPU times: user 7min 4s, sys: 5.42 s, total: 7min 9s
Wall time: 4min 10s


LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=7, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

#### 2) Get the Vocabulary of Words

In [14]:
# "get_feature_names()" method returns a list containing all the words from "CountVectorizer"
# (in scikit-learn 1.0 and later, use "get_feature_names_out()" instead)

cv.get_feature_names()[10000] # Get the word at index 10000 in the list.

'coelho'

#### 3) Get the topics

In [15]:
# Total Number of Topics
len(topic_model.components_)

7

In [16]:
# It returns a NumPy array of 7 topics and 54777 words.
topic_model.components_.shape

(7, 54777)

In [17]:
# The array contains the weight (an unnormalized pseudo-count, not a probability) of each word in each topic
topic_model.components_

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])
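
The values in `components_` are clearly not probabilities (many are far above 1). As the scikit-learn documentation describes them, they can be read as pseudo-counts of how often each word was assigned to each topic, and dividing each row by its sum turns them into a per-topic word distribution. The many values near 0.142857 (about 1/7) are consistent with the default prior of `1 / n_components` for words that were essentially never assigned to a topic. The sketch below shows the normalization on a small made-up array, not on the real model output.

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics over a 4-word vocabulary.
toy_components = np.array([
    [8.0, 2.0, 0.5, 0.5],
    [1.0, 1.0, 6.0, 2.0],
])

# Divide every row by its own sum so that each topic's weights add up to 1.
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_probs.round(3))
print(topic_word_probs.sum(axis=1))  # every row now sums to 1
```

Normalizing does not change the ranking of words within a topic, so the top words found with `argsort` below are the same either way.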

#### 4) Get the highest probability words per topic

Let's understand how it works for a single topic first. After that we will generalize this for all topics.

In [18]:
single_topic = topic_model.components_[0] # Let's get the first topic

In [19]:
# This will print the array of weights of each of the 54777 words for the first topic
single_topic

array([8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
       1.43006821e-01, 1.42902042e-01, 1.42861626e-01])

In [20]:
# "argsort" returns the index positions that would sort the array from least to greatest value.
# First index position, i.e. the word at index 2475, has the lowest weight.
# Last index position, i.e. the word at index 42993, has the highest weight.
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993])

In [21]:
# ARGSORT => Index positions sorted from Least to Highest
# Top 10 Values (10 Greatest Values) => Last 10 values of ARGSORT()
top_ten_words_index = single_topic.argsort()[-10:]
top_ten_words_index

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993])
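
Since `argsort` returns positions rather than values, it can help to see it on an array small enough to check by hand. The weights and vocabulary below are made up for illustration.

```python
import numpy as np

# Hypothetical weights for a 5-word vocabulary.
toy_weights = np.array([0.2, 5.0, 1.5, 9.0, 0.1])
toy_vocab = ["apple", "bank", "court", "doctor", "egg"]

order = toy_weights.argsort()   # indices from lowest to highest weight
top3 = order[-3:]               # last 3 indices = 3 highest weights, ascending
top3_desc = top3[::-1]          # reverse to list the strongest word first

print(order)                               # [4 0 2 1 3]
print([toy_vocab[i] for i in top3_desc])   # ['doctor', 'bank', 'court']
```

The notebook keeps the ascending order, so in its printed lists the most important word for each topic appears last.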

In [22]:
# Print the Top 10 words for first topic (index : weight : word)
feature_names = cv.get_feature_names() # Fetch the vocabulary once instead of on every iteration
for index in top_ten_words_index:
    print(str(index) + ' : ' + str(single_topic[index]) + ' : ' + feature_names[index])

33390 : 2454.8251962315935 : new
36310 : 2529.319133884274 : percent
21228 : 2533.2742253067827 : government
10425 : 2626.8155405972557 : company
31464 : 2628.992410621358 : million
8149 : 2760.197440709356 : care
36283 : 3643.8261838842077 : people
22673 : 3699.3397941251337 : health
42561 : 4608.957060251319 : said
42993 : 6247.245510521101 : says


Now let's generalize the above logic to all the topics.

In [23]:
%%time
feature_names = cv.get_feature_names() # Fetch the vocabulary once instead of once per topic
for index,topic in enumerate(topic_model.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC # {index}")
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n\n')
    

THE TOP 15 WORDS FOR TOPIC # 0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']



THE TOP 15 WORDS FOR TOPIC # 1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']



THE TOP 15 WORDS FOR TOPIC # 2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']



THE TOP 15 WORDS FOR TOPIC # 3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']



THE TOP 15 WORDS FOR TOPIC # 4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']



THE TOP 15 WORDS FOR TOPIC # 5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know

#### 5) Attach Topic Numbers to each of the articles.

`transform` returns, for each document, the probability of it belonging to each topic.

In [24]:
%%time
topic_probability = topic_model.transform(dtm)

CPU times: user 20.8 s, sys: 188 ms, total: 20.9 s
Wall time: 10.5 s


In [25]:
# Array containing the probabilities (rounded to 2 decimal places) of the 1st article belonging to each of the 7 topics.
topic_probability[0].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

In [26]:
# Let's get the index position of the highest probability
topic_probability[0].argmax(axis=0)

1

In [27]:
# Create a New Field called "Topic" and attach the Topic Number
data['Topic'] = topic_probability.argmax(axis=1)
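
Taking the `argmax` of each row keeps only the single most likely topic and discards how confident that assignment was. A minimal sketch on a made-up document-topic matrix (3 documents, 3 topics) shows both the assigned topic and its probability. The `confidence` column is an illustrative addition, not part of the original notebook.

```python
import numpy as np

# Hypothetical document-topic probabilities; each row sums to 1.
toy_probs = np.array([
    [0.02, 0.68, 0.30],
    [0.50, 0.10, 0.40],
    [0.05, 0.05, 0.90],
])

assigned_topic = toy_probs.argmax(axis=1)  # column index of the largest value per row
confidence = toy_probs.max(axis=1)         # the probability of that assigned topic

print(assigned_topic)  # [1 0 2]
print(confidence)      # [0.68 0.5  0.9 ]
```

The second toy document is a close call between topics 0 and 2, which the hard `argmax` assignment alone would hide.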

In [28]:
data.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2


In [29]:
# Create Topic Dictionary
topic_dict = {0:'topic_0',1:'topic_1',2:'topic_2',3:'topic_3',4:'topic_4',5:'topic_5',6:'topic_6'}

In [30]:
data['Topic_Label'] = data['Topic'].map(topic_dict)
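
The hand-written dictionary above works for 7 topics but has to be retyped if the number of topics changes. One possible alternative, sketched here with an illustrative `n_topics` variable, is to build the same mapping with a dictionary comprehension.

```python
n_topics = 7  # assumed to match the n_components value used for the model

# Build {0: 'topic_0', 1: 'topic_1', ...} without typing each entry.
topic_dict = {i: f"topic_{i}" for i in range(n_topics)}

print(topic_dict)
```

The placeholder names could later be swapped for descriptive labels chosen by reading the top words of each topic.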

In [31]:
data.head()

Unnamed: 0,Article,Topic,Topic_Label
0,"In the Washington of 2016, even when the polic...",1,topic_1
1,Donald Trump has used Twitter — his prefe...,1,topic_1
2,Donald Trump is unabashedly praising Russian...,1,topic_1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1,topic_1
4,"From photography, illustration and video, to d...",2,topic_2
