# Topic Modeling

Topic modeling scans a collection of documents, identifies patterns in word usage, and groups related words into topics.

### Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is, in simple terms, a way to find hidden topics in a collection of documents.
Imagine you have a bunch of articles and you want to figure out what topics they cover without reading each one.
LDA helps by looking at the words in the articles and grouping them into topics based on which words tend to appear together in the same articles. Each article is then described as a mixture of those topics.
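
To make the intuition concrete, the minimal sketch below counts how often pairs of words share a document in a made-up four-document corpus. It only illustrates the co-occurrence signal: LDA itself fits a probabilistic model rather than counting pairs, and `toy_docs` and its contents are invented for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: two health articles and two politics articles.
toy_docs = [
    ["diet", "weight", "obesity", "doctor"],
    ["weight", "diet", "exercise"],
    ["election", "vote", "senate"],
    ["vote", "election", "campaign"],
]

# Count, for every pair of distinct words, how many documents contain both.
pair_counts = Counter()
for doc in toy_docs:
    for pair in combinations(sorted(set(doc)), 2):
        pair_counts[pair] += 1

# Pairs that co-occur in at least two documents hint at a shared topic.
frequent_pairs = sorted(pair for pair, n in pair_counts.items() if n >= 2)
print(frequent_pairs)
```

Here the only pairs that repeat are one health pair and one politics pair, which is the kind of regularity a topic model picks up at scale.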

In [1]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from gensim.models import LsiModel # LSI (Latent Semantic Indexing) is another name for LSA (Latent Semantic Analysis)
import gensim # topic modeling library used for LDA
import gensim.corpora as corpora # used to create the dictionary and corpus

In [2]:
data = pd.read_csv('news_articles.csv')

In [3]:
data.head()

Unnamed: 0,id,title,content
0,25626,"One Weight-Loss Approach Fits All? No, Not Eve...","Dr. Frank Sacks, a professor of nutrition at H..."
1,19551,South Carolina Stuns Baylor to Reach the Round...,South Carolina’s win over Duke was not only ...
2,25221,"U.S. Presidential Race, Apple, Gene Wilder: Yo...",(Want to get this briefing by email? Here’s th...
3,18026,"His Predecessor Gone, Gambia’s New President F...","BANJUL, Gambia — A week after he was inaugu..."
4,21063,‘Harry Potter and the Cursed Child’ Goes From ...,The biggest book of the summer isn’t a blockbu...


In [4]:
data['content'][0]

'Dr. Frank Sacks, a professor of nutrition at Harvard, likes to challenge his audience when he gives lectures on obesity. “If you want to make a great discovery,” he tells them, figure out this: Why do some people lose 50 pounds on a diet while others on the same diet gain a few pounds? Then he shows them data from a study he did that found exactly that effect. Dr. Sacks’s challenge is a question at the center of obesity research today. Two people can have the same amount of excess weight, they can be the same age, the same socioeconomic class, the same race, the same gender. And yet a treatment that works for one will do nothing for the other. The problem, researchers say, is that obesity and its precursor  —   being overweight  —   are not one disease but instead, like cancer, they are many. “You can look at two people with the same amount of excess body weight and they put on the weight for very different reasons,” said Dr. Arya Sharma, medical director of the obesity program at the

In [5]:
articles = data['content'] # Extract the 'content' column and store it in a variable called articles

In [6]:
# Clean the text for LDA: lowercase, strip punctuation, remove stop words, tokenize, and stem.
# Stemming is used here because it is fast; lemmatization can be used instead for better results.
# NLTK's stop word list and tokenizer data must be downloaded beforehand with nltk.download().

def preprocess_text(text):
    # Convert to lowercase and strip punctuation
    text = text.str.lower().apply(lambda x: re.sub(r'[^\w\s]', '', x))

    # Remove stop words (a set makes the membership test fast)
    stop_words = set(stopwords.words('english'))
    text = text.apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))

    # Tokenize the text
    text = text.apply(word_tokenize)

    # Perform stemming
    stemmer = PorterStemmer()
    text = text.apply(lambda x: [stemmer.stem(token) for token in x])
    return text

processed_articles = preprocess_text(articles)


In [7]:
processed_articles

0     [dr, frank, sack, professor, nutrit, harvard, ...
1     [south, carolina, win, duke, surpris, fan, pos...
2     [want, get, brief, email, here, good, even, he...
3     [banjul, gambia, week, inaugur, anoth, countri...
4     [biggest, book, summer, isnt, blockbust, thril...
                            ...                        
95    [want, get, brief, email, here, good, even, he...
96    [tallinn, estonia, guard, brought, ahm, abdul,...
97    [gov, scott, walker, wisconsin, activ, wiscons...
98    [social, media, shook, emot, headlin, shout, n...
99    [moment, joanna, acevedo, first, set, foot, bo...
Name: content, Length: 100, dtype: object
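
The token `isnt` in row 4 above hints at an ordering effect: punctuation is stripped before stop words are removed, so a contraction loses its apostrophe and no longer matches a stop list that stores it with one. The sketch below reproduces the effect with a tiny hand-picked stop list (`STOP_WORDS` is an assumption for illustration, not NLTK's list) and a straight apostrophe; the articles use typographic apostrophes, which would need normalizing first for the alternative order to help.

```python
import re

# Tiny hand-picked stop list, for illustration only.
STOP_WORDS = {"the", "of", "a", "is", "isn't"}

def strip_punct(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

sentence = "The biggest book of the summer isn't a thriller."
lowered = sentence.lower()

# Order used in preprocess_text: strip punctuation, then drop stop words.
punct_first = [w for w in strip_punct(lowered).split() if w not in STOP_WORDS]

# Alternative order: drop stop words, then strip punctuation from what is left.
stop_first = [strip_punct(w) for w in lowered.split() if w not in STOP_WORDS]
stop_first = [w for w in stop_first if w]

print(punct_first)
print(stop_first)
```

With the first order `isnt` survives as a token; with the second it is removed along with the other stop words.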

In [8]:
# Assign each unique token an integer id, which later lets the models work with the text in a structured way
dictionary = corpora.Dictionary(processed_articles)

print(dictionary)
print(dictionary.token2id) # shows the mapping of tokens to their unique ids

Dictionary<8693 unique tokens: ['10', '100', '108', '15', '155']...>
{'10': 0, '100': 1, '108': 2, '15': 3, '155': 4, '180': 5, '185': 6, '190': 7, '1970': 8, '20': 9, '200': 10, '2006': 11, '2007': 12, '2011': 13, '2014': 14, '220': 15, '240': 16, '25': 17, '252': 18, '265': 19, '30': 20, '300': 21, '31': 22, '40': 23, '42': 24, '5': 25, '50': 26, '51': 27, '53': 28, '55': 29, '56': 30, '59': 31, '7': 32, '75': 33, '811': 34, '9': 35, 'abl': 36, 'accept': 37, 'accumul': 38, 'act': 39, 'activ': 40, 'ad': 41, 'add': 42, 'adult': 43, 'adulthood': 44, 'aerob': 45, 'affair': 46, 'age': 47, 'ago': 48, 'alberta': 49, 'allow': 50, 'almost': 51, 'along': 52, 'alreadi': 53, 'also': 54, 'alter': 55, 'alway': 56, 'amount': 57, 'andrea': 58, 'ankylos': 59, 'ann': 60, 'anoth': 61, 'answer': 62, 'antidepress': 63, 'anyth': 64, 'apovian': 65, 'app': 66, 'appetit': 67, 'appoint': 68, 'april': 69, 'aronn': 70, 'around': 71, 'arriv': 72, 'arthriti': 73, 'arya': 74, 'asham': 75, 'ask': 76, 'assign': 77, 

The value 8693 is the number of unique tokens in the dataset.

In [9]:
doc_term = [dictionary.doc2bow(doc) for doc in processed_articles] #create the document-term matrix 
#doc2bow converts each document into a list of (token_id, token_count) tuples, a bag of words format
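
To show what the dictionary and the bag-of-words step compute, here is a pure-Python sketch on two made-up documents. The helper `to_bow` and the id-assignment order are assumptions chosen to resemble the output in this notebook; this is an illustration of the data format, not gensim's implementation.

```python
from collections import Counter

# Hypothetical tokenized documents.
toy_docs = [
    ["weight", "diet", "weight", "doctor"],
    ["vote", "weight", "vote"],
]

# Give each previously unseen token the next free integer id,
# visiting a document's tokens in sorted order.
token2id = {}
for doc in toy_docs:
    for token in sorted(set(doc)):
        if token not in token2id:
            token2id[token] = len(token2id)

def to_bow(doc, token2id):
    """Return a sorted list of (token_id, token_count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Each document becomes a sparse list: only tokens that actually occur are stored, together with their counts.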


In [10]:
print(doc_term)

[[(0, 1), (1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 2), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 3), (21, 3), (22, 1), (23, 3), (24, 2), (25, 4), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 2), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 2), (48, 2), (49, 1), (50, 2), (51, 2), (52, 1), (53, 1), (54, 2), (55, 1), (56, 2), (57, 6), (58, 1), (59, 1), (60, 1), (61, 4), (62, 2), (63, 1), (64, 1), (65, 2), (66, 1), (67, 1), (68, 1), (69, 1), (70, 5), (71, 4), (72, 1), (73, 1), (74, 1), (75, 2), (76, 2), (77, 1), (78, 2), (79, 2), (80, 1), (81, 1), (82, 1), (83, 4), (84, 2), (85, 1), (86, 1), (87, 3), (88, 1), (89, 3), (90, 1), (91, 2), (92, 3), (93, 6), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 2), (104, 5), (105, 1), (106, 2), (107, 1), (108, 1), (109, 3), (110, 4)

In [11]:
number_of_topics = 2 #specify the number of topics you want the model to find

# LDA Model

In [12]:
# Training is randomized, so the topics can differ between runs unless random_state is set
lda_model = gensim.models.LdaModel(corpus=doc_term,
                                   id2word=dictionary,
                                   num_topics=number_of_topics)

In [13]:
lda_model.print_topics(num_topics=number_of_topics, num_words=5) 

#num_topics specifies how many topics to show
#num_words specifies how many words to show for each topic

[(0,
  '0.021*"mr" + 0.011*"said" + 0.006*"trump" + 0.005*"would" + 0.005*"state"'),
 (1,
  '0.018*"said" + 0.013*"mr" + 0.005*"trump" + 0.004*"like" + 0.004*"year"')]
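
Both topics above are led by the same very common tokens (`mr`, `said`, `trump`), which makes them hard to tell apart. One common remedy is to drop tokens that occur in too many or too few documents before training; gensim's `Dictionary.filter_extremes(no_below=..., no_above=...)` does this on a dictionary. The pure-Python sketch below shows the idea on made-up data; `filter_by_doc_freq` and its thresholds are hypothetical and would need tuning on the real corpus.

```python
from collections import Counter

# Hypothetical tokenized documents with two near-universal tokens.
toy_docs = [
    ["mr", "said", "weight", "diet"],
    ["mr", "said", "vote", "senate"],
    ["mr", "said", "weight", "doctor"],
    ["said", "vote", "campaign"],
]

def filter_by_doc_freq(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    keep = {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }
    return [[token for token in doc if token in keep] for doc in docs]

filtered_docs = filter_by_doc_freq(toy_docs)
print(filtered_docs)
```

After filtering, only the tokens that distinguish the two groups of documents remain.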

# LSA Model

In [14]:
lsa_model = LsiModel(doc_term, num_topics=number_of_topics, id2word=dictionary)

In [15]:
print(lsa_model.print_topics(num_topics=number_of_topics, num_words=5))

[(0, '0.615*"mr" + 0.429*"said" + 0.187*"trump" + 0.130*"state" + 0.119*"would"'), (1, '-0.537*"mr" + -0.319*"trump" + 0.286*"said" + 0.242*"saudi" + 0.142*"weight"')]
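
Unlike the LDA weights, which are probabilities, LSA weights can be negative, so the magnitude matters more than the sign when reading a topic. As a small sketch, the formatted string for topic 1 above can be parsed into (weight, term) pairs and ranked by absolute weight. The regular expression is an assumption about this string format; gensim can also return unformatted pairs via `show_topics(formatted=False)`.

```python
import re

# Topic 1 string copied from the LSA output above.
topic_str = '-0.537*"mr" + -0.319*"trump" + 0.286*"said" + 0.242*"saudi" + 0.142*"weight"'

# Extract (weight, term) pairs from the 'weight*"term"' pieces.
weighted_terms = [
    (float(weight), term)
    for weight, term in re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic_str)
]

# Rank terms by the magnitude of their weight, ignoring the sign.
weighted_terms.sort(key=lambda pair: abs(pair[0]), reverse=True)

negative_terms = [term for weight, term in weighted_terms if weight < 0]
print(weighted_terms)
print(negative_terms)
```

Terms with opposite signs pull the topic in opposite directions, so `mr` and `trump` sit at one end of this component and `said`, `saudi`, and `weight` at the other.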
