# **Lab 4: Topic Modeling with Latent Semantic Analysis (LSA)**

In [15]:
# Install the necessary libraries
%pip install nltk scikit-learn pandas

Note: you may need to restart the kernel to use updated packages.



### **Part 1: Loading and Preprocessing the BBC News Dataset**

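In this section, we’ll load the dataset, preprocess the text (lowercasing, whitespace tokenization, and stopword removal), and build a TF-IDF matrix. scikit-learn returns this matrix with one row per document and one column per term, so it is strictly a document-term matrix; this lab follows common usage and calls it a term-document matrix.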


In [16]:
import pandas as pd

# TODO: Load the BBC news dataset
data = pd.read_csv("bbc_news_data.csv")

# Check the first few rows
data.head()

Unnamed: 0,filename,title,content
0,001.txt,Gallery unveils interactive tree,A Christmas tree that can receive text messag...
1,002.txt,Jarre joins fairytale celebration,French musician Jean-Michel Jarre is to perfo...
2,003.txt,Musical treatment for Capra film,The classic film It's A Wonderful Life is to ...
3,004.txt,Richard and Judy choose top books,The 10 authors shortlisted for a Richard and ...
4,005.txt,Poppins musical gets flying start,The stage adaptation of children's film Mary ...


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
# Use a set so each membership check in preprocess_text is O(1) instead of a list scan
stop_words = set(stopwords.words('english'))

# A simple function to preprocess text
def preprocess_text(text):
    return ' '.join([word.lower() for word in text.split() if word.lower() not in stop_words])

# TODO: Apply the preprocessing function to the 'content' column
data['processed_content'] = data['content'].apply(preprocess_text)

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_df=0.05, max_features=1000, ngram_range=(1, 2))

# TODO: Create the term-document matrix
term_doc_matrix = vectorizer.fit_transform(data['processed_content'])

# Get the terms (feature names) from the vectorizer
terms = vectorizer.get_feature_names_out()

# Display the shape of the term-document matrix
print(f"Term-Document Matrix Shape: {term_doc_matrix.shape}")


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jaylee/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Term-Document Matrix Shape: (1204, 1000)



### **Part 2: Applying SVD to the Term-Document Matrix**

In this section, we will apply **Singular Value Decomposition (SVD)** to reduce the term-document matrix into its latent structure and identify the topics.
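
As a rough illustration of what the decomposition does, the sketch below runs a full SVD with NumPy on a small hand-made matrix and keeps only the two largest singular values. The matrix `X` and the names `doc_coords` and `X_k` are assumptions chosen for this example; the lab itself uses scikit-learn's `TruncatedSVD`, which computes the truncated factors directly.

```python
import numpy as np

# Hypothetical 4-documents x 4-terms matrix with two obvious blocks
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 3.0, 1.0],
])

# Full SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions ("topics") to keep

# Document coordinates in the latent space (what fit_transform returns, up to sign)
doc_coords = U[:, :k] * s[:k]

# Rank-k approximation of the original matrix
X_k = doc_coords @ Vt[:k, :]

print("Singular values:", s)
print("Reconstruction error:", np.linalg.norm(X - X_k))
```

The rows of `Vt[:k, :]` play the role of `components_` in scikit-learn: each row holds one latent dimension's weights over the terms.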


In [23]:
from sklearn.decomposition import TruncatedSVD

# TODO: Define the number of components
num_components = 3

# Fix the seed so the randomized solver gives repeatable results across runs
svd_model = TruncatedSVD(n_components=num_components, random_state=42)

# TODO: Fit the SVD model
svd_matrix = svd_model.fit_transform(term_doc_matrix)

# Show the resulting latent space (topic space)
print(f"Latent Topic Matrix Shape: {svd_matrix.shape}")

Latent Topic Matrix Shape: (1204, 3)


In [24]:
import numpy as np

# Get the top terms for each topic

num_top_words = 20 # TODO: Adjust this until you can easily identify the topics

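for i, topic in enumerate(svd_model.components_):
    # TODO: Get the indices of the top terms
    # np.argsort sorts ascending, so the last num_top_words indices carry the largest weights
    top_term_indices = np.argsort(topic)[-num_top_words:]
    # Use a distinct loop variable so it does not shadow the topic index i
    top_terms = [terms[idx] for idx in top_term_indices]
    print(f"Topic {i+1}: {', '.join(top_terms)}")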

Topic 1: cabinet, fox, asylum, schools, virus, economy, dvd, tour, musical, women, apple, search, taxes, lords, aid, immigration, eu, kennedy, mr howard, festival
Topic 2: named best, film festival, best film, baby, theatre, dollar, dollar baby, ray, million dollar, box office, musical, category, academy, named, drama, starring, oscars, nominations, aviator, festival
Topic 3: products, electronics, viruses, ipod, mobile phone, spyware, programs, messages, nintendo, sites, program, windows, portable, search, gadget, mobiles, spam, gadgets, apple, virus
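
The loop above prints each topic's terms from lowest to highest weight, because `np.argsort` sorts in ascending order. If you would rather read the strongest terms first, one option is to reverse the sorted indices. The sketch below shows this on made-up weights; `toy_terms` and `toy_topic` are hypothetical stand-ins for `terms` and one row of `svd_model.components_`.

```python
import numpy as np

# Hypothetical terms and weights for a single topic
toy_terms = np.array(["film", "oscar", "phone", "virus", "tax"])
toy_topic = np.array([0.10, 0.70, -0.20, 0.05, 0.40])

top_n = 3

# Reverse the ascending argsort, then keep the first top_n indices
order = np.argsort(toy_topic)[::-1][:top_n]
top = toy_terms[order].tolist()

print(top)  # ['oscar', 'tax', 'film']
```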



### **Part 3: Labeling the Topics**

TODO: Using the terms extracted from each topic, try to assign labels that best describe what each topic is about.

- **Topic 1**: Politics and social issues
- **Topic 2**: Movie or film
- **Topic 3**: Computer science or software engineering


### **Summary & Takeaways**

In this lab, you have:
1. Preprocessed the BBC News dataset and created a term-document matrix using TF-IDF.
2. Applied SVD to reduce the term-document matrix into a lower-dimensional space, revealing hidden topics.
3. Examined the most significant terms in each topic and interpreted their meaning.
4. Labeled the topics based on their top terms.

You now have a better understanding of how **LSA** can reveal hidden topics in a collection of text documents and how similar documents can group together based on their content.
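
The lab does not measure document similarity directly, so here is a minimal sketch of how that grouping could be checked: project documents into the latent space and compare them with cosine similarity. The four-document corpus and the `toy_*` names are assumptions for illustration; on the BBC data you would pass `svd_matrix` to `cosine_similarity` instead.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: two film documents, then two technology documents
toy_docs = [
    "film oscar award actor film",
    "oscar film director award",
    "phone software virus gadget",
    "gadget phone spam software",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Project the documents into a 2-dimensional latent space
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_latent = toy_svd.fit_transform(toy_tfidf)

# Pairwise cosine similarity between documents in the latent space
sim = cosine_similarity(toy_latent)

print(sim.round(2))
```

Documents that share a theme should score close to 1 with each other and close to 0 with documents from the other theme.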