# CISB5123 - Text Analytics section 1A
## Lab Assignment 3 - Topic Modeling

**Group members:**
1. Abdul Hakiim bin Ahmad Rosli (SW01081337)
2. Muhammad Bazly bin Burhan (SW01081224)

### Import the necessary libraries

In [1]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim import corpora
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

### Read the data (use only the 'text' column)

In [2]:
df = pd.read_csv('news_dataset.csv')
documents = df['text']

# Display the head of the documents
num_docs_to_display = 5  # Specify the number of documents to display
for i, doc in enumerate(documents[:num_docs_to_display], 1):
    print(f"Document {i}:")
    print(doc)
    print()

Document 1:
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Document 2:
I recently posted an article asking what kind of rates single, male
drivers under 25 yrs old were paying on performance cars. Here's a summary of
the replies I received.
 
 
 
 
-------------------------------------------------------------------------------
 
I'm not under 25 anymore (but is 27 close enough).
 
1992 Dodge Stealth RT/Twin Turbo (300hp model).
No tickets, no accidents, own a house, have taken defensive driving 1,
airbag, abs, security alarm, single.
 
$1500/year  $500 decut. Stat

### Perform text pre-processing

In [3]:
print(df['text'].dtype)

object
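A dtype of `object` does not guarantee that every entry is a string: pandas reads empty CSV cells as float `NaN`, and these can sit in the same column. The sketch below shows one way to drop such entries before preprocessing. The helper name `drop_missing` and the sample list are hypothetical, not part of the original notebook.

```python
def drop_missing(docs):
    """Keep only entries that are non-empty strings."""
    # Anything that is not a str (for example a float NaN from an empty
    # CSV cell) or that contains only whitespace is discarded.
    return [doc for doc in docs if isinstance(doc, str) and doc.strip()]

# Hypothetical sample standing in for df['text']
sample = ["A Bricklin is a sports car.", float("nan"), "   ", "Hockey season starts soon."]
cleaned = drop_missing(sample)
print(f"{len(sample) - len(cleaned)} entries dropped, {len(cleaned)} kept")
```

On a pandas Series, `documents.dropna()` would remove the `NaN` entries in a similar way.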


In [4]:
# Download nltk resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Create a set of English stopwords
stop_words = set(stopwords.words('english'))

# Initialize a WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Coerce non-string entries to str; note that a missing value (NaN) becomes the token "nan"
    if not isinstance(text, str):
        text = str(text)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    
    # Remove everything except letters and whitespace (digits, punctuation, apostrophes)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Convert text to lowercase
    text = text.lower()
    
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    
    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    
    # Lemmatize the tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return tokens

preprocessed_documents = [preprocess_text(doc) for doc in documents]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abdul\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abdul\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\abdul\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
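The preprocessing above keeps every alphabetic token, including single letters and very long garbage strings. As a possible extra step, the sketch below filters tokens by length and by repeated character patterns. The function name `filter_noise_tokens` and the thresholds (`min_len=3`, `max_len=20`, four repeats) are assumptions chosen for illustration; they were not used to produce the results in this notebook.

```python
import re

# A unit of 1-3 characters repeated at least four times in a row, e.g. "axaxaxax"
REPEATED_UNIT = re.compile(r'(.{1,3})\1{3,}')

def filter_noise_tokens(tokens, min_len=3, max_len=20):
    """Drop tokens that are too short, too long, or built from a repeated unit."""
    kept = []
    for token in tokens:
        if not (min_len <= len(token) <= max_len):
            continue  # single letters, two-letter fragments, very long strings
        if REPEATED_UNIT.search(token):
            continue  # repeated-pattern garbage
        kept.append(token)
    return kept

tokens = ["x", "db", "maxaxaxaxaxaxaxaxaxaxaxaxaxaxax", "engine", "u", "hockey"]
print(filter_noise_tokens(tokens))
```

A minimum length of 3 also removes legitimate short words, so the thresholds would need checking against the actual vocabulary.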


### Perform LDA using Gensim

In [5]:
# Create a Gensim Dictionary object from the preprocessed documents
dictionary = corpora.Dictionary(preprocessed_documents)

# Convert each preprocessed document into a bag-of-words representation using dictionary
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_documents]

# Train the LDA model
num_topics = 4
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)
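To make the bag-of-words step concrete, here is a plain-Python sketch of what a dictionary plus `doc2bow` conceptually produce: each document becomes a list of `(token_id, count)` pairs. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins, and the IDs are assigned in first-seen order as an assumption; Gensim's own ID assignment may differ.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Map each distinct token to an integer ID (first-seen order)."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for tokens present in vocab."""
    counts = Counter(token for token in doc if token in vocab)
    return sorted((vocab[token], n) for token, n in counts.items())

docs = [["car", "engine", "car"], ["game", "team", "game", "game"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order is discarded in this representation; only the counts per document are passed to LDA.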

### Evaluate the LDA model using Coherence score

In [6]:
# Calculate the coherence score for the LDA model
coherence_model_lda = CoherenceModel(model=lda_model, texts=preprocessed_documents, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

# Display the coherence score
print(f'Topic Coherence Score (C_V): {coherence_lda:.4f}')

Topic Coherence Score (C_V): 0.5484


#### Interpretation of the Coherence Score:
The C_V coherence score of 0.5484 is a quantitative proxy for how interpretable the generated topics are. It measures how strongly the top terms within each topic are associated, based on how often they co-occur in the corpus. A higher score generally indicates topics whose top terms belong together and reflect real themes in the documents.

A score of 0.5484 suggests that the topics are moderately coherent overall. Because the score is averaged over all four topics, it can hide a weak topic among stronger ones. Coherence could likely be improved by tuning the model parameters (for example `num_topics` and `passes`), refining the preprocessing, or trying alternative topic modeling approaches.
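The idea of co-occurrence behind coherence can be illustrated with normalized pointwise mutual information (NPMI) for a pair of terms. The sketch below is a simplified, document-level version and is not Gensim's C_V computation, which additionally uses sliding windows and a further similarity step. The function `npmi` and the toy documents are hypothetical.

```python
import math

def npmi(word_a, word_b, tokenized_docs):
    """Document-level NPMI in [-1, 1]; higher means the words co-occur more than chance."""
    n = len(tokenized_docs)
    doc_sets = [set(doc) for doc in tokenized_docs]
    p_a = sum(word_a in d for d in doc_sets) / n
    p_b = sum(word_b in d for d in doc_sets) / n
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n
    if p_ab == 0:
        return -1.0  # the two words never appear in the same document
    if p_ab == 1:
        return 0.0   # both words in every document: NPMI is 0/0, treated as 0 here
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# Toy corpus for illustration only
docs = [
    ["game", "team", "season"],
    ["game", "team", "player"],
    ["file", "system", "program"],
    ["file", "system", "key"],
]
print(npmi("game", "team", docs))    # always together: close to 1
print(npmi("game", "season", docs))  # sometimes together: between 0 and 1
print(npmi("game", "file", docs))    # never together: -1
```

A topic whose top terms mostly have high pairwise association would score well under measures of this kind.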

### Interpret the result

In [7]:
# Print the top terms for each topic
print("Top Terms for each topic:")
for idx, topic in lda_model.print_topics(num_topics=num_topics, num_words=10):
    print(f"Topic {idx+1}:")
    print(topic)
    print()

Top Terms for each topic:
Topic 1:
0.007*"key" + 0.006*"use" + 0.006*"system" + 0.006*"file" + 0.005*"one" + 0.004*"x" + 0.004*"program" + 0.004*"chip" + 0.004*"get" + 0.004*"would"

Topic 2:
0.012*"game" + 0.010*"team" + 0.006*"player" + 0.006*"year" + 0.005*"play" + 0.005*"season" + 0.004*"league" + 0.004*"new" + 0.004*"win" + 0.004*"hockey"

Topic 3:
0.056*"x" + 0.038*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxax" + 0.021*"b" + 0.021*"db" + 0.006*"entry" + 0.005*"mov" + 0.004*"car" + 0.003*"nan" + 0.002*"byte" + 0.002*"engine"

Topic 4:
0.008*"would" + 0.008*"one" + 0.007*"people" + 0.005*"dont" + 0.005*"think" + 0.004*"know" + 0.004*"time" + 0.004*"like" + 0.004*"u" + 0.004*"say"



In [8]:
# Print only the top terms (without their weights) for each topic
print("Top Terms for each topic:")
for idx, topic in lda_model.print_topics(num_topics=num_topics, num_words=10):
    print(f"Topic {idx+1}:")
    terms = re.findall(r'"([^"]+)"', topic)
    print(", ".join(terms))
    print()

Top Terms for each topic:
Topic 1:
key, use, system, file, one, x, program, chip, get, would

Topic 2:
game, team, player, year, play, season, league, new, win, hockey

Topic 3:
x, maxaxaxaxaxaxaxaxaxaxaxaxaxaxax, b, db, entry, mov, car, nan, byte, engine

Topic 4:
would, one, people, dont, think, know, time, like, u, say
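The regular expression in the cell above discards the weights. If the weights are also of interest, the formatted topic string can be parsed into `(term, weight)` pairs, as in this sketch. The name `parse_topic` is hypothetical, and the sample string is a shortened copy of the Topic 2 output.

```python
import re

# Matches fragments such as 0.012*"game" and captures the weight and the term
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a formatted topic string into a list of (term, weight) pairs."""
    return [(term, float(weight)) for weight, term in TERM_PATTERN.findall(topic_string)]

topic = '0.012*"game" + 0.010*"team" + 0.006*"player"'
pairs = parse_topic(topic)
for term, weight in pairs:
    print(f"{term:<8} {weight:.3f}")
```

Gensim can also return the pairs directly via `lda_model.show_topics(num_topics=num_topics, num_words=10, formatted=False)`, which avoids string parsing altogether.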



### Discussion

1. Top Terms for each Topic:
    - Topic 1 appears to relate to computing, with terms such as "key", "system", "file", "program", and "chip". Generic terms like "use", "one", "get", and "would" add little meaning.
    - Topic 2 is clearly about sports, with terms such as "game", "team", "player", "play", "season", "league", "win", and "hockey".
    - Topic 3 is dominated by noise: single letters ("x", "b"), a long repeated "ax" sequence ("maxaxax..."), and low-level tokens such as "db", "entry", "mov", and "byte", mixed with "car" and "engine". The token "nan" most likely comes from missing values in the 'text' column, which `preprocess_text` converts to the string "nan".
    - Topic 4 consists of general conversational words ("would", "one", "people", "dont", "think", "know", "time", "like", "u", "say"), which suggests discussion and opinion posts rather than a specific subject. The token "dont" is "don't" with its apostrophe stripped, so it no longer matches the stopword list.

2. Interpretation:
    - The LDA model separated the documents into four topics, of which Topics 1 and 2 are the most clearly interpretable.
    - Topic 1 relates to computer systems and programs, indicating discussions about software, files, and system-related concepts.
    - Topic 2 is centered on sports, suggesting that some documents discuss teams, players, and seasons.
    - Topic 3 is hard to interpret because of the repeated "ax" sequence and single-letter tokens, which point to data quality issues or irrelevant content. The remaining terms could relate to technical or domain-specific content.
    - Topic 4 captures opinions and general discussion, implying that some documents contain subjective or personal views on different subjects.

3. Considerations:
    - The quality and interpretability of the topics depend on the nature and content of the input documents.
    - Preprocessing steps, such as removing stopwords, lemmatization, and handling special characters, play a crucial role in improving topic coherence and interpretability.
    - The number of topics (num_topics) is a hyperparameter that can be tuned based on the specific dataset and desired granularity of topics.
    - Increasing the number of topics might lead to more specific and fine-grained topics, while decreasing it could result in broader and more general topics.
    - The presence of noise or irrelevant data, as observed in Topic 3, can impact the quality of the generated topics. Further data cleaning and preprocessing might be necessary to improve the results.
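
One way to tune `num_topics`, as mentioned in the considerations above, is to train a model for several candidate values and compare their coherence scores. The sketch below assumes the `corpus`, `dictionary`, and `preprocessed_documents` objects from this notebook. The function names, the candidate values, and the fixed `random_state` are assumptions added for illustration, and the example scores at the end are made-up numbers, not results.

```python
def sweep_num_topics(corpus, dictionary, texts, candidates=(2, 4, 6, 8), passes=15, seed=42):
    """Train one LDA model per candidate topic count and return {k: C_V score}."""
    # Imports kept local so the helper below works without Gensim installed
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel

    scores = {}
    for k in candidates:
        # random_state fixes the seed so repeated runs give the same topics
        model = LdaModel(corpus, num_topics=k, id2word=dictionary,
                         passes=passes, random_state=seed)
        cm = CoherenceModel(model=model, texts=texts,
                            dictionary=dictionary, coherence='c_v')
        scores[k] = cm.get_coherence()
    return scores

def pick_best(scores):
    """Return the topic count with the highest coherence score."""
    return max(scores, key=scores.get)

# Usage with the notebook's objects (not run here):
# scores = sweep_num_topics(corpus, dictionary, preprocessed_documents)
# print(scores, pick_best(scores))

example_scores = {2: 0.30, 4: 0.45, 6: 0.40}  # made-up values for illustration only
print(pick_best(example_scores))
```

The highest coherence score does not always give the most useful topics, so the top terms of the selected model should still be inspected by hand.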