# CISB5123 Text Analytics
## Lab 9.1 - Topic Modelling

Name: Abdul Hakiim bin Ahmad Rosli (SW01081337)

Topic modeling is a type of statistical modeling used to identify topics or themes within a collection of documents. It involves automatically clustering words that tend to co-occur frequently across multiple documents, with the aim of identifying groups of words that represent distinct topics. The ultimate goal is to identify the underlying themes or topics that run through a large corpus of text data.
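
To make the idea of "words that tend to co-occur" concrete, the following minimal sketch counts how often word pairs appear in the same document. The toy corpus and the variable names (`toy_docs`, `pair_counts`) are assumptions made for illustration. This is not how LDA itself works, only the intuition behind it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each document is already a list of tokens
toy_docs = [
    ["nadal", "federer", "open"],
    ["nadal", "open"],
    ["biden", "virus", "plan"],
    ["biden", "virus"],
]

# Count in how many documents each unordered word pair appears together
pair_counts = Counter()
for tokens in toy_docs:
    for pair in combinations(sorted(set(tokens)), 2):
        pair_counts[pair] += 1

# Show the most frequent pairs
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Pairs such as ("nadal", "open") and ("biden", "virus") share documents, while a pair such as ("nadal", "virus") never does. A topic model exploits this kind of pattern to group words.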

In this lab, you will learn how to perform topic modeling using Gensim, a popular Python library for topic modeling, and the Latent Dirichlet Allocation (LDA) algorithm to discover abstract topics within a collection of documents.

The general process of topic modeling includes:
- Import the necessary libraries (e.g. NLTK, Gensim, Scikit-learn)
- Load data
- Preprocess the data
- Create a document-term matrix
- Run topic modeling algorithm
- Interpret the results
- Refine the model

#### Import libraries

In [2]:
# For text processing
import nltk 
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# For topic modelling
from gensim import corpora
from gensim.models import LdaModel

# Download nltk resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

#### Load the data

In [3]:
documents = [
    "Rafael Nadal Joins Roger Federer in Missing U.S. Open",
    "Rafael Nadal Is Out of the Australian Open",
    "Biden Announces Virus Measures",
    "Biden's Virus Plans Meet Reality",
    "Where Biden's Virus Plan Stands"
]

#### Preprocess the data

In [4]:
stop_words = set(stopwords.words('english'))  # Create a set of English stopwords
lemmatizer = WordNetLemmatizer()  # Initialize a WordNet lemmatizer

def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Tokenize the text and convert to lowercase
    tokens = [token for token in tokens if token.isalnum()]  # Keep only alphanumeric tokens (drops punctuation)
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords from the tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize each token
    return tokens  # Return the preprocessed tokens

preprocessed_documents = [preprocess_text(doc) for doc in documents]  # Preprocess each document in the list
preprocessed_documents

[['rafael', 'nadal', 'join', 'roger', 'federer', 'missing', 'open'],
 ['rafael', 'nadal', 'australian', 'open'],
 ['biden', 'announces', 'virus', 'measure'],
 ['biden', 'virus', 'plan', 'meet', 'reality'],
 ['biden', 'virus', 'plan', 'stand']]
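
The output above contains neither "u.s." nor the possessive "'s". A likely reason is the `isalnum()` filter, which rejects any token containing a character that is not a letter or digit. The sketch below shows that behaviour on a hand-written token list. The list is an assumption that only resembles tokenizer output; it was not produced by `word_tokenize`.

```python
# Hypothetical token list, written by hand to resemble tokenizer output
raw_tokens = ["biden", "'s", "virus", "plans", "u.s.", "open", "2016", ","]

# str.isalnum() is True only when every character is a letter or a digit
kept = [t for t in raw_tokens if t.isalnum()]
dropped = [t for t in raw_tokens if not t.isalnum()]

print("kept:   ", kept)
print("dropped:", dropped)
```

If abbreviations such as "U.S." matter for the analysis, one option is to normalise them before this filter is applied.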

#### Create a document-term matrix

In [5]:
# Create a Gensim Dictionary object from the preprocessed documents
dictionary = corpora.Dictionary(preprocessed_documents)

# Convert each preprocessed document into a bag-of-words representation using dictionary
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_documents]
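
A bag-of-words document is a list of `(token_id, count)` pairs. The plain-Python sketch below approximates the idea without Gensim. The names `token2id` and `to_bow` and the toy documents are assumptions for illustration, and the IDs Gensim assigns may differ from the ones produced here.

```python
from collections import Counter

# Hypothetical preprocessed documents
toy_docs = [
    ["biden", "virus", "plan"],
    ["biden", "virus", "virus"],
]

# Assign an integer ID to each distinct token, in order of first appearance
token2id = {}
for tokens in toy_docs:
    for token in tokens:
        token2id.setdefault(token, len(token2id))

def to_bow(tokens):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Word order is discarded in this representation; only the number of times each word occurs in a document is kept.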

#### Run LDA

In [6]:
# corpus: bag-of-words representation of the documents
# num_topics: number of topics to be extracted by the model
# id2word=dictionary: dictionary mapping from word IDs to words
# passes: number of passes through the corpus during training
# Train an LDA model on the corpus with 2 topics using Gensim's LdaModel class
lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)

#### Interpret Results

In [8]:
# Empty list to store dominant topic labels for each document
article_labels = []

# Iterate over each processed document
for i, doc in enumerate(preprocessed_documents):
    # for each document, convert to bag-of-words representation
    bow = dictionary.doc2bow(doc)
    # get list of topic probabilities
    topics = lda_model.get_document_topics(bow)
    # determine topic with highest probability
    dominant_topic = max(topics, key=lambda x: x[1])[0]
    # append to the list
    article_labels.append(dominant_topic)
    
# Import pandas
import pandas as pd
    
# Create DataFrame
df = pd.DataFrame({"Article": documents, "Topic": article_labels})

# Print the DataFrame
print("Table with Articles and Topic: ")
print(df)
print()

Table with Articles and Topic: 
                                             Article  Topic
0  Rafael Nadal Joins Roger Federer in Missing U....      0
1         Rafael Nadal Is Out of the Australian Open      0
2                     Biden Announces Virus Measures      1
3                   Biden's Virus Plans Meet Reality      1
4                    Where Biden's Virus Plan Stands      1
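
The dominant topic is chosen with `max(..., key=...)` over a list of `(topic_id, probability)` pairs. The sketch below isolates that step. The probabilities are hypothetical values chosen for illustration, not output from the model trained above.

```python
# Hypothetical list shaped like the result of lda_model.get_document_topics(bow)
topics = [(0, 0.08), (1, 0.92)]

# Compare the pairs by their second element (the probability)
dominant_topic, probability = max(topics, key=lambda pair: pair[1])

print(f"dominant topic: {dominant_topic} (probability {probability})")
```

Keeping the probability alongside the label can help spot documents that are split almost evenly between topics.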



In [9]:
# Print the top terms of each topic
print("Top Terms for each topic:")
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx}")
    terms = [term.strip() for term in topic.split("+")]
    for term in terms:
        weight, word = term.split("*")
        print(f"- {word.strip()} (weight: {weight.strip()})")
    print()

Top Terms for each topic:
Topic 0
- "nadal" (weight: 0.131)
- "rafael" (weight: 0.131)
- "open" (weight: 0.131)
- "federer" (weight: 0.079)
- "missing" (weight: 0.079)
- "roger" (weight: 0.079)
- "join" (weight: 0.079)
- "australian" (weight: 0.079)
- "virus" (weight: 0.027)
- "biden" (weight: 0.027)

Topic 1
- "biden" (weight: 0.166)
- "virus" (weight: 0.166)
- "plan" (weight: 0.119)
- "meet" (weight: 0.071)
- "reality" (weight: 0.071)
- "measure" (weight: 0.071)
- "announces" (weight: 0.071)
- "stand" (weight: 0.071)
- "australian" (weight: 0.024)
- "open" (weight: 0.024)
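
The loop above relies on the string format returned by `print_topics()`, where terms are joined by `+` and each term has the form `weight*"word"`. The sketch below applies the same splitting logic to a shortened sample string (an assumption built from the first terms of Topic 0) and converts it into `(word, weight)` pairs.

```python
# Shortened sample in the same format as one print_topics() entry
topic_string = '0.131*"nadal" + 0.131*"rafael" + 0.079*"federer"'

parsed = []
for term in topic_string.split("+"):
    weight, word = term.strip().split("*")
    # Remove the surrounding double quotes and convert the weight to a number
    parsed.append((word.strip('"'), float(weight)))

print(parsed)
```

Parsing strings like this is fragile. `lda_model.show_topic()`, used later in this lab, returns the terms and weights without any string handling.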



Topic 1 appears to cover politics and the virus response: "biden" and "virus" carry the
highest weights (0.166 each), followed by "plan" (0.119).

Topic 0 appears to cover tennis: "nadal", "rafael" and "open" carry the highest weights
(0.131 each), and most of the remaining top terms ("federer", "roger", "australian") are also tennis-related.
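
One way to read these weights: in LDA, the probability of a word in a document is a mixture of the per-topic word weights, weighted by how much of each topic the document contains. The sketch below works through that arithmetic. The word weights are copied from the printed output above, while the document's topic mixture (`doc_topics`) is a hypothetical value chosen for illustration.

```python
# Topic-word weights copied from the printed output above
topic_word = {
    0: {"open": 0.131, "biden": 0.027},
    1: {"open": 0.024, "biden": 0.166},
}

# Hypothetical topic mixture for one document (an assumption, not model output)
doc_topics = {0: 0.9, 1: 0.1}

def word_probability(word):
    """Sum over topics of P(topic | document) * P(word | topic)."""
    return sum(doc_topics[k] * topic_word[k][word] for k in doc_topics)

print("open: ", round(word_probability("open"), 4))
print("biden:", round(word_probability("biden"), 4))
```

For a document that is mostly Topic 0, "open" ends up far more probable than "biden", which matches the tennis interpretation.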

### Using NPR (National Public Radio) articles (‘npr.csv’) obtained from the NPR website

#### Import libraries

In [1]:
# For text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# For topic modeling
from gensim import corpora
from gensim.models import LdaModel
import pandas as pd

# Download NLTK Resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

#### Load the data

In [2]:
df = pd.read_csv('npr.csv')
documents = df['Article'].tolist()

#### Preprocess the data

In [3]:
stop_words = set(stopwords.words('english')) # Create a set of English stopwords
lemmatizer = WordNetLemmatizer() # Initialize a WordNet lemmatizer

def preprocess_text(text):
    tokens = word_tokenize(text.lower()) # Tokenize the text into words and convert to lowercase
    tokens = [token for token in tokens if token.isalnum()] # Filter out non-alphanumeric tokens
    tokens = [token for token in tokens if token not in stop_words] # Remove stopwords from the tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens] # Lemmatize each token
    return tokens # Return the preprocessed tokens

preprocessed_documents = [preprocess_text(doc) for doc in documents] # Preprocess each document in the list

print(preprocessed_documents[0])

['washington', '2016', 'even', 'policy', 'bipartisan', 'politics', 'sense', 'year', 'show', 'little', 'sign', 'ending', 'president', 'obama', 'moved', 'sanction', 'russia', 'alleged', 'interference', 'election', 'concluded', 'republican', 'long', 'called', 'similar', 'severe', 'measure', 'could', 'scarcely', 'bring', 'approve', 'house', 'speaker', 'paul', 'ryan', 'called', 'obama', 'measure', 'appropriate', 'also', 'overdue', 'prime', 'example', 'administration', 'ineffective', 'foreign', 'policy', 'left', 'america', 'weaker', 'eye', 'gop', 'leader', 'sounded', 'much', 'theme', 'urging', 'president', 'obama', 'year', 'take', 'strong', 'action', 'deter', 'russia', 'worldwide', 'aggression', 'including', 'operation', 'wrote', 'devin', 'nunes', 'chairman', 'house', 'intelligence', 'committee', 'week', 'left', 'office', 'president', 'suddenly', 'decided', 'stronger', 'measure', 'indeed', 'appearing', 'cnn', 'frequent', 'obama', 'critic', 'trent', 'frank', 'called', 'much', 'tougher', 'acti

#### Create document-term matrix

In [4]:
# Create a Gensim Dictionary object from the preprocessed documents
dictionary = corpora.Dictionary(preprocessed_documents)

# Filter out tokens that appear in fewer than 15 documents or in more than 50% of the documents
dictionary.filter_extremes(no_below=15, no_above=0.5)

# Convert each preprocessed document into a bag-of-words representation using the dictionary
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_documents]
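
The filtering step works on document frequency, i.e. the number of documents a token appears in. The plain-Python sketch below mirrors the two thresholds on a toy corpus. The documents and the small threshold values are assumptions for illustration, and Gensim's own `filter_extremes` may apply further limits (such as a cap on vocabulary size) that are not reproduced here.

```python
from collections import Counter

# Hypothetical preprocessed documents
toy_docs = [
    ["trump", "election", "said"],
    ["health", "school", "said"],
    ["trump", "vote", "said"],
    ["food", "study", "said"],
]
no_below = 2    # keep tokens found in at least 2 documents
no_above = 0.5  # keep tokens found in at most 50% of the documents

# Document frequency: count each token once per document
doc_freq = Counter()
for tokens in toy_docs:
    doc_freq.update(set(tokens))

n_docs = len(toy_docs)
kept = sorted(
    token for token, df in doc_freq.items()
    if df >= no_below and df / n_docs <= no_above
)
print(kept)
```

Here "said" is removed because it appears in every document, and the words that appear only once are removed as too rare. Tokens removed from the dictionary are then ignored when documents are converted to bag-of-words.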

#### Run LDA

In [5]:
# Train an LDA model on the corpus with 5 topics using Gensim's LdaModel class
lda_model = LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)

#### Interpret Results

In [6]:
# empty list to store dominant topic labels for each document
article_labels = []

# iterate over each processed document
for i, doc in enumerate(preprocessed_documents):
    # for each document, convert to bag-of-words representation
    bow = dictionary.doc2bow(doc)
    # get list of topic probabilities
    topics = lda_model.get_document_topics(bow)
    # determine topic with highest probability
    dominant_topic = max(topics, key=lambda x: x[1])[0]
    # append to the list
    article_labels.append(dominant_topic)
    
# Create DataFrame
df_result = pd.DataFrame({"Article": documents, "Topic": article_labels})

# Print the DataFrame
print("Table with Articles and Topic:")
print(df_result)
print()

Table with Articles and Topic:
                                                 Article  Topic
0      In the Washington of 2016, even when the polic...      4
1        Donald Trump has used Twitter  —   his prefe...      4
2        Donald Trump is unabashedly praising Russian...      4
3      Updated at 2:50 p. m. ET, Russian President Vl...      4
4      From photography, illustration and video, to d...      0
...                                                  ...    ...
11987  The number of law enforcement officers shot an...      1
11988    Trump is busy these days with victory tours,...      4
11989  It’s always interesting for the Goats and Soda...      2
11990  The election of Donald Trump was a surprise to...      4
11991  Voters in the English city of Sunderland did s...      1

[11992 rows x 2 columns]
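
With almost 12,000 rows, a count of articles per topic is easier to read than the table itself. The sketch below shows the idea with `collections.Counter`. The `sample_labels` list contains only the ten labels visible in the truncated output above, so the counts are illustrative and say nothing about the full corpus.

```python
from collections import Counter

# The ten topic labels visible in the truncated table above (not the full data)
sample_labels = [4, 4, 4, 4, 0, 1, 4, 2, 4, 1]

# Count how many articles were assigned to each topic
topic_counts = Counter(sample_labels)

for topic_id, count in topic_counts.most_common():
    print(f"Topic {topic_id}: {count} article(s)")
```

The same call on the full `article_labels` list would give the size of each topic across all articles.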



In [7]:
# Print top terms for each topic
for topic_id in range(lda_model.num_topics):
    print(f"Top terms for Topic #{topic_id}:")
    top_terms = lda_model.show_topic(topic_id, topn=10)
    print([term[0] for term in top_terms])
    print()

Top terms for Topic #0:
['know', 'think', 'thing', 'life', 'woman', 'really', 'story', 'show', 'book', 'back']

Top terms for Topic #1:
['police', 'country', 'report', 'city', 'government', 'state', 'attack', 'told', 'war', 'two']

Top terms for Topic #2:
['food', 'study', 'water', 'disease', 'human', 'scientist', 'university', 'science', 'animal', 'research']

Top terms for Topic #3:
['health', 'school', 'percent', 'state', 'company', 'student', 'care', 'program', 'child', 'million']

Top terms for Topic #4:
['trump', 'clinton', 'president', 'state', 'republican', 'campaign', 'election', 'obama', 'vote', 'voter']



In [8]:
# Print the top terms for each topic with weight
print("Top Terms for Each Topic:")
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx}:")
    terms = [term.strip() for term in topic.split("+")]
    for term in terms:
        weight, word = term.split("*")
        print(f"- {word.strip()} (weight: {weight.strip()})")
    print()

Top Terms for Each Topic:
Topic 0:
- "know" (weight: 0.005)
- "think" (weight: 0.005)
- "thing" (weight: 0.005)
- "life" (weight: 0.005)
- "woman" (weight: 0.005)
- "really" (weight: 0.004)
- "story" (weight: 0.004)
- "show" (weight: 0.003)
- "book" (weight: 0.003)
- "back" (weight: 0.003)

Topic 1:
- "police" (weight: 0.007)
- "country" (weight: 0.006)
- "report" (weight: 0.005)
- "city" (weight: 0.005)
- "government" (weight: 0.005)
- "state" (weight: 0.004)
- "attack" (weight: 0.004)
- "told" (weight: 0.004)
- "war" (weight: 0.003)
- "two" (weight: 0.003)

Topic 2:
- "food" (weight: 0.007)
- "study" (weight: 0.005)
- "water" (weight: 0.004)
- "disease" (weight: 0.004)
- "human" (weight: 0.003)
- "scientist" (weight: 0.003)
- "university" (weight: 0.003)
- "science" (weight: 0.003)
- "animal" (weight: 0.003)
- "research" (weight: 0.003)

Topic 3:
- "health" (weight: 0.009)
- "school" (weight: 0.008)
- "percent" (weight: 0.007)
- "state" (weight: 0.007)
- "company" (weight: 0.007)
- "