Name : Ahmad Nabil Bin Yusoff - IS01081782
Name : Ikmal Kamil Bin Mohd Kamil - IS01081793

## Import Libraries


In [24]:
# For text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# For topic modeling
from gensim import corpora
from gensim.models import LdaModel
import pandas as pd

# Download NLTK Resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\amakn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\amakn\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\amakn\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Load the Data

In [25]:
# Load the Data
df = pd.read_csv('news_dataset.csv')

# Remove rows with missing values
df = df.dropna()

documents = df['text'].tolist()

## Preprocess the Data

In [26]:
# Preprocess the Data
stop_words = set(stopwords.words('english'))  # Create a set of English stopwords
lemmatizer = WordNetLemmatizer()  # Initialize a WordNet lemmatizer

def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Tokenize the text into words and convert to lowercase
    # Keep alphabetic tokens longer than one character (drops punctuation, numbers, and mixed tokens)
    tokens = [token for token in tokens if token.isalpha() and len(token) > 1]
    
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords from the tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize each token
    return tokens  # Return the preprocessed tokens


preprocessed_documents = [preprocess_text(doc) for doc in documents]  # Preprocess each document in the list
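
The effect of the filtering steps can be checked on a hand-made token list. The sketch below is a minimal illustration, not the notebook's pipeline: `filter_tokens` and the tiny `DEMO_STOPWORDS` set are hypothetical names, and tokenization and lemmatization are left out so that it runs without NLTK.

```python
# Hypothetical stand-in for NLTK's English stopword list (illustration only)
DEMO_STOPWORDS = {"the", "is", "a", "of", "in"}

def filter_tokens(tokens, stop_words=DEMO_STOPWORDS):
    """Lowercase tokens, then keep alphabetic ones longer than one character that are not stopwords."""
    kept = []
    for token in tokens:
        token = token.lower()
        # str.isalpha() rejects punctuation, numbers, and mixed tokens such as "80-bit"
        if token.isalpha() and len(token) > 1 and token not in stop_words:
            kept.append(token)
    return kept

sample = ["The", "serial", "number", "of", "the", "chip", "is", "80-bit", ",", "x", "1993"]
print(filter_tokens(sample))
```

Running the same kind of check on a few real posts is a quick way to spot tokens that survive filtering but carry little meaning.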

## Create a document-term matrix

In [27]:
# Create a Gensim Dictionary object from the preprocessed documents
dictionary = corpora.Dictionary(preprocessed_documents) 

# Filter out tokens that appear in fewer than 15 documents or in more than 50% of the documents
dictionary.filter_extremes(no_below=15, no_above=0.5)

# Convert each preprocessed document into a bag-of-words representation using the dictionary
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_documents]  
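
To make the two steps above concrete, here is a rough pure-Python sketch of document-frequency filtering and bag-of-words conversion. It is an approximation under stated assumptions, not gensim's implementation: `build_vocabulary` and `to_bow` are hypothetical helpers, ids are assigned alphabetically for determinism, and gensim's additional `keep_n` cap on vocabulary size is ignored.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, no_below=2, no_above=0.5):
    """Map each kept token to an integer id, filtering by document frequency."""
    num_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    kept = sorted(
        token for token, df in doc_freq.items()
        if no_below <= df <= no_above * num_docs
    )
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Toy corpus (made up for illustration)
toy_docs = [
    ["car", "engine", "price"],
    ["car", "engine", "dealer"],
    ["key", "chip", "engine"],
    ["key", "chip", "car"],
]
token2id = build_vocabulary(toy_docs)
print(token2id)
print(to_bow(["key", "chip", "key", "car"], token2id))
```

In the toy corpus, "car" and "engine" are dropped for being too common and "price" and "dealer" for being too rare, which mirrors what `no_above` and `no_below` do at a larger scale.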

## Run LDA

In [28]:
# Run LDA
# Train an LDA model with 4 topics using Gensim's LdaModel class, making 15 passes over the corpus
lda_model = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=15)


## Interpret Results

In [29]:
# Interpret Results

# empty list to store dominant topic labels for each document
article_labels = []

# iterate over the bag-of-words representation of each document
for bow in corpus:
    # get list of (topic_id, probability) pairs
    topics = lda_model.get_document_topics(bow)
    # determine topic with highest probability
    dominant_topic = max(topics, key=lambda x: x[1])[0]
    # append to the list
    article_labels.append(dominant_topic)
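
The selection step can be sketched on its own, without a trained model. The helper below is a minimal illustration under one assumption: a document whose tokens were all removed by the dictionary filter has an empty bag-of-words, so any label it receives does not reflect its content. The name `pick_dominant_topic`, the sentinel `-1`, and the probabilities in the example are hypothetical.

```python
def pick_dominant_topic(bow, topic_probs, unlabeled=-1):
    """Return the most probable topic id, or `unlabeled` for a document with no usable tokens."""
    if not bow or not topic_probs:
        return unlabeled
    # topic_probs is a list of (topic_id, probability) pairs
    return max(topic_probs, key=lambda pair: pair[1])[0]

# A document with two known tokens and a clear dominant topic (made-up numbers)
print(pick_dominant_topic([(3, 1), (7, 2)], [(0, 0.12), (2, 0.81), (3, 0.07)]))

# A document with an empty bag-of-words is marked as unlabeled
print(pick_dominant_topic([], [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.25)]))
```

Whether to keep, drop, or separately label such empty documents is a design choice; the sketch only makes them visible.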



In [30]:
# Create DataFrame
df_result = pd.DataFrame({"Article": documents, "Topic": article_labels})

# Print the DataFrame
print("Table with Articles and Topic:")
print(df_result)
print()

Table with Articles and Topic:
                                                 Article  Topic
0      I was wondering if anyone out there could enli...      0
1      I recently posted an article asking what kind ...      0
2      \nIt depends on your priorities.  A lot of peo...      0
3      an excellent automatic can be found in the sub...      0
4      : Ford and his automobile.  I need information...      0
...                                                  ...    ...
11091  Secrecy in Clipper Chip\n\nThe serial number o...      2
11092  Hi !\n\nI am interested in the source of FEAL ...      2
11093  The actual algorithm is classified, however, t...      0
11094  \n\tThis appears to be generic calling upon th...      0
11095  \nProbably keep quiet and take it, lest they g...      0

[11096 rows x 2 columns]



In [31]:
# Print top terms for each topic
for topic_id in range(lda_model.num_topics):
    print(f"Top terms for Topic #{topic_id}:")
    top_terms = lda_model.show_topic(topic_id, topn=10)
    print([term[0] for term in top_terms])
    print()

Top terms for Topic #0:
['would', 'one', 'get', 'like', 'know', 'think', 'good', 'time', 'year', 'could']

Top terms for Topic #1:
['people', 'would', 'one', 'government', 'say', 'think', 'god', 'u', 'right', 'law']

Top terms for Topic #2:
['key', 'file', 'use', 'system', 'chip', 'encryption', 'program', 'window', 'information', 'data']

Top terms for Topic #3:
['max', 'db', 'team', 'university', 'game', 'year', 'new', 'space', 'league', 'season']



Topic 0 is the hardest to label. Its top terms ("would," "one," "get," "like," "know," "think") are generic conversational words with weights between 0.006 and 0.013. One possible reading is everyday discussion of personal opinions and experiences, but the terms alone do not support a more specific label.

Topic 1 appears to be related to social, political, and religious issues, with "people," "government," "god," "right," and "law" among its top ten terms. It shares "would," "one," and "think" with Topic 0, so the two topics overlap in general vocabulary.

Topic 2 is the most clearly defined topic and is focused on technology and security. Terms such as "key," "chip," and "encryption" point to cryptography, while "file," "system," "program," "window," and "data" point to general computing.

Topic 3 mixes sports terms ("team," "game," "league," "season") with "university" and "space," so it likely combines more than one subject. Its two heaviest terms, "max" (0.037) and "db" (0.018), fit neither theme and may be artifacts of the raw posts rather than meaningful words.
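
One quick check on how distinct the topics are is to look for terms that appear in more than one top-ten list. The sketch below does this with the term lists printed above; `shared_terms` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

# Top-ten terms copied from the output above
top_terms = {
    0: ["would", "one", "get", "like", "know", "think", "good", "time", "year", "could"],
    1: ["people", "would", "one", "government", "say", "think", "god", "u", "right", "law"],
    2: ["key", "file", "use", "system", "chip", "encryption", "program", "window", "information", "data"],
    3: ["max", "db", "team", "university", "game", "year", "new", "space", "league", "season"],
}

def shared_terms(top_terms_by_topic):
    """Map each term found in more than one topic to the sorted ids of the topics containing it."""
    topics_by_term = defaultdict(list)
    for topic_id, terms in top_terms_by_topic.items():
        for term in terms:
            topics_by_term[term].append(topic_id)
    return {term: sorted(ids) for term, ids in topics_by_term.items() if len(ids) > 1}

print(shared_terms(top_terms))
```

Terms shared between topics are candidates for an extended stopword list, although whether to remove them is a judgment call that depends on the goal of the analysis.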

In [32]:
# Print the top terms for each topic
print("Top Terms for Each Topic:")
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx}:")
    terms = [term.strip() for term in topic.split("+")]
    for term in terms:
        weight, word = term.split("*")
        print(f"- {word.strip()} (weight: {weight.strip()})")
    print()

Top Terms for Each Topic:
Topic 0:
- "would" (weight: 0.013)
- "one" (weight: 0.011)
- "get" (weight: 0.011)
- "like" (weight: 0.009)
- "know" (weight: 0.008)
- "think" (weight: 0.007)
- "good" (weight: 0.007)
- "time" (weight: 0.007)
- "year" (weight: 0.006)
- "could" (weight: 0.006)

Topic 1:
- "people" (weight: 0.011)
- "would" (weight: 0.009)
- "one" (weight: 0.008)
- "government" (weight: 0.006)
- "say" (weight: 0.005)
- "think" (weight: 0.005)
- "god" (weight: 0.005)
- "u" (weight: 0.005)
- "right" (weight: 0.005)
- "law" (weight: 0.004)

Topic 2:
- "key" (weight: 0.015)
- "file" (weight: 0.010)
- "use" (weight: 0.010)
- "system" (weight: 0.010)
- "chip" (weight: 0.008)
- "encryption" (weight: 0.007)
- "program" (weight: 0.007)
- "window" (weight: 0.006)
- "information" (weight: 0.006)
- "data" (weight: 0.005)

Topic 3:
- "max" (weight: 0.037)
- "db" (weight: 0.018)
- "team" (weight: 0.012)
- "university" (weight: 0.007)
- "game" (weight: 0.007)
- "year" (weight: 0.006)
- "new" (

## Calculate Coherence Score

In [33]:
# Import the library for the coherence score
from gensim.models.coherencemodel import CoherenceModel

# Calculate the coherence score for the LDA model
coherence_model_lda = CoherenceModel(model=lda_model, texts=preprocessed_documents, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

# Display the score
print(f'Topic Coherence Score (C_V): {coherence_lda:.4f}')

Topic Coherence Score (C_V): 0.5465


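The C_V coherence score of 0.5465 suggests that the topics are moderately coherent. This matches the top-term lists: Topic 2 is sharply defined, while Topics 0 and 1 share generic terms such as "would," "one," and "think."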
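
To build intuition for what a coherence score rewards, the sketch below computes a much simpler document co-occurrence measure in the style of UMass coherence. It is not the C_V measure used above, and its values are not comparable to 0.5465. The function name `cooccurrence_coherence` and the toy documents are assumptions made for illustration.

```python
import math
from itertools import combinations

def cooccurrence_coherence(top_terms, tokenized_docs):
    """Average of log((D(earlier, later) + 1) / D(earlier)) over pairs of top terms.

    D(...) counts the documents that contain all of the given terms.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_count(*terms):
        return sum(1 for tokens in doc_sets if all(term in tokens for term in terms))

    scores = []
    # combinations() yields index pairs (j, i) with j < i, so `earlier` ranks higher
    for j, i in combinations(range(len(top_terms)), 2):
        earlier, later = top_terms[j], top_terms[i]
        denom = doc_count(earlier)
        if denom == 0:
            continue  # skip terms that never occur in the reference documents
        scores.append(math.log((doc_count(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy documents (made up): two about cryptography, two about sports
toy_docs = [["key", "chip"], ["key", "chip"], ["team", "game"], ["team", "game"]]
coherent = cooccurrence_coherence(["key", "chip"], toy_docs)
mixed = cooccurrence_coherence(["key", "team"], toy_docs)
print(coherent > mixed)
```

Terms that tend to appear in the same documents score higher than terms that rarely co-occur, which is the intuition behind calling a topic coherent.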