topic modeling and assigned each course a dominant_topic. Because text_for_embedding is already lowercased and stripped of punctuation and HTML, it can be passed directly to a vectorizer without extra cleaning. Using the same field for embeddings and topic modeling keeps the two representations consistent. The "text_for_embedding" column concatenates the cleaned title and cleaned description (and possibly skills/tags), which gives a fuller representation of each course's content.
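
As a rough illustration of how such a combined field could be built, here is a minimal sketch. The `clean_text` helper and the toy description are assumptions made for this example; the actual cleaning step that produced the catalog is not shown in this notebook and may differ.

```python
import re
import pandas as pd

def clean_text(text):
    """Hypothetical cleaner: strip HTML tags, lowercase, drop punctuation."""
    text = re.sub(r"<[^>]+>", " ", str(text))         # remove HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # keep letters, digits, whitespace
    return re.sub(r"\s+", " ", text).strip()          # collapse repeated whitespace

# Toy row; the description is made up for illustration.
toy = pd.DataFrame({
    "title": ["CS50's Introduction to Computer Science"],
    "description": ["<p>Learn essential strategies!</p>"],
})

toy["clean_title"] = toy["title"].map(clean_text)
toy["clean_description"] = toy["description"].map(clean_text)
# Concatenate the cleaned fields into a single text field.
toy["text_for_embedding"] = (
    toy["clean_title"] + " " + toy["clean_description"]
).str.strip()

print(toy["text_for_embedding"].iloc[0])
```

Keeping the title first means the most specific words of a course appear at the start of the combined string, which can matter if a downstream model truncates long inputs.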

In [3]:
import pandas as pd
import os
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

# Load cleaned catalog
process_dir = "../data/processed"
catalog_path = os.path.join(process_dir, "courses_combined_cleaned.csv")

df = pd.read_csv(catalog_path)
print("Loaded catalog:", df.shape)
df.head()

Loaded catalog: (1343, 9)


Unnamed: 0,global_id,platform,title,provider,level,description,clean_title,clean_description,text_for_embedding
0,edx_0,edx,How to Learn Online,edX,Beginner,Learn essential strategies for successful onli...,how to learn online,learn essential strategies for successful onli...,how to learn online learn essential strategies...
1,edx_1,edx,Programming for Everybody (Getting Started wit...,The University of Michigan,Beginner,"This course is a ""no prerequisite"" introductio...",programming for everybody getting started with...,this course is a no prerequisite introduction ...,programming for everybody getting started with...
2,edx_2,edx,CS50's Introduction to Computer Science,Harvard University,Beginner,An introduction to the intellectual enterprise...,cs50 s introduction to computer science,an introduction to the intellectual enterprise...,cs50 s introduction to computer science an int...
3,edx_3,edx,The Analytics Edge,Massachusetts Institute of Technology,Intermediate,"Through inspiring examples and stories, discov...",the analytics edge,through inspiring examples and stories discove...,the analytics edge through inspiring examples ...
4,edx_4,edx,Marketing Analytics: Marketing Measurement Str...,"University of California, Berkeley",Beginner,This course is part of a MicroMasters® Program...,marketing analytics marketing measurement stra...,this course is part of a micromasters program ...,marketing analytics marketing measurement stra...


1. LDA Topic Modeling on text_for_embedding
Here we use scikit-learn's CountVectorizer and LatentDirichletAllocation.


In [8]:
# Vectorize texts with CountVectorizer; min_df and max_df can be adjusted for the dataset size.
# 'clean_title' is used for topic modeling here (switch to "clean_description" or "text_for_embedding" to compare).

text_col = 'clean_title'
if text_col not in df.columns:
    raise KeyError(f"Column '{text_col}' not found. Ensure your DataFrame has this column.")

texts = df[text_col].fillna("").astype(str).tolist()

n_features = 10000  # upper bound on vocabulary size; adjust based on corpus size
vectorizer = CountVectorizer(max_df=0.95, min_df=5, stop_words='english', max_features=n_features)
dtm = vectorizer.fit_transform(texts)
print("Document-term matrix shape:", dtm.shape)

Document-term matrix shape: (1343, 232)


In [12]:
# Fit LDA

n_topics = 30  # choose based on desired granularity
lda = LatentDirichletAllocation(n_components=n_topics,
                                max_iter=10,
                                learning_method='batch',
                                random_state=0,
                                verbose=1)
topic_distributions = lda.fit_transform(dtm)  # shape: (n_docs, n_topics)
print("Topic distributions shape:", topic_distributions.shape)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10
Topic distributions shape: (1343, 30)
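
The number of topics is chosen by hand above. One common, if imperfect, sanity check is to fit a few candidate values and compare perplexity (lower is better). The sketch below runs on a synthetic count matrix, not the course data, and `perplexity_for` is a hypothetical helper; in-sample perplexity tends to favor more topics, so a held-out split and manual inspection of the top words are usually more informative.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic count matrix standing in for the document-term matrix.
rng = np.random.default_rng(0)
toy_dtm = rng.poisson(0.3, size=(60, 20))

def perplexity_for(k, dtm):
    """Fit LDA with k topics and return its perplexity on the same data."""
    model = LatentDirichletAllocation(
        n_components=k,
        max_iter=10,
        learning_method="batch",
        random_state=0,
    )
    model.fit(dtm)
    return model.perplexity(dtm)

scores = {k: perplexity_for(k, toy_dtm) for k in (2, 5, 10)}
for k, score in scores.items():
    print(f"n_topics={k}: perplexity={score:.1f}")
```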


In [13]:
# Assign dominant topic and full distributions
dominant_topics = np.argmax(topic_distributions, axis=1)
df['dominant_topic'] = dominant_topics
# Optionally store the full distribution in separate columns:
for topic_idx in range(n_topics):
    df[f"topic_{topic_idx}"] = topic_distributions[:, topic_idx]
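
Taking the argmax hides how strongly a course belongs to its dominant topic: with 30 topics, a near-uniform row still gets a label. A minimal sketch of keeping the winning weight alongside the label and counting courses per topic follows; the toy matrix and the 0.5 threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical doc-topic matrix standing in for topic_distributions.
toy_dist = np.array([
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],  # nearly uniform: the argmax is not meaningful here
    [0.10, 0.10, 0.80],
    [0.05, 0.90, 0.05],
])

toy = pd.DataFrame({
    "dominant_topic": toy_dist.argmax(axis=1),
    "dominant_weight": toy_dist.max(axis=1),
})

min_weight = 0.5  # assumed threshold; tune for the real data
toy["confident"] = toy["dominant_weight"] >= min_weight

# Number of documents assigned to each topic.
topic_sizes = toy["dominant_topic"].value_counts().sort_index()
print(toy)
print(topic_sizes)
```

The same two columns can be added to df from topic_distributions, which makes it easy to filter out weakly assigned courses before interpreting a topic.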

In [14]:
# Inspect top words per topic

def print_top_words(model, feature_names, n_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[:-n_top_words - 1:-1]
        top_features = [feature_names[i] for i in top_features_ind]
        print(f"Topic {topic_idx}: {', '.join(top_features)}")

feature_names = vectorizer.get_feature_names_out()
print("Top words per topic:")
print_top_words(lda, feature_names, n_top_words=10)

Top words per topic:
Topic 0: fundamentals, iot, things, internet, neuroscience, political, networks, microsoft, customer, cybersecurity
Topic 1: learning, machine, operations, deep, music, tensorflow, risk, economics, production, healthcare
Topic 2: english, advanced, ap, writing, learn, scientific, sales, literature, real, intermediate
Topic 3: ibm, methods, applications, developer, meta, end, research, basic, analyst, javascript
Topic 4: foundations, design, marketing, user, ux, analytics, technology, process, product, approach
Topic 5: cybersecurity, finance, systems, accounting, decentralized, modeling, banking, investment, future, infrastructure
Topic 6: engineering, essentials, entrepreneurship, sql, materials, databases, urban, customer, artificial, coding
Topic 7: digital, fintech, cloud, aws, professional, markets, emerging, people, marketing, devops
Topic 8: introduction, blockchain, services, supply, global, chain, psychology, culture, financial, sigma
Topic 9: computer, cr