Use a Sentence-BERT model (or a similar sentence encoder) to encode each course into an embedding, then save the embeddings for fast lookup. After this notebook runs, we have:
A DataFrame with the course data, a NumPy array of shape (num_courses, embedding_dim), and the saved file data/processed/course_embeddings.npy.

In [2]:
import os
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the cleaned combined catalog
process_dir = "../data/processed"
catalog_path = os.path.join(process_dir, "courses_combined_cleaned.csv")
df = pd.read_csv(catalog_path)

# Check that text_for_embedding exists
if 'text_for_embedding' not in df.columns:
    raise KeyError("Column 'text_for_embedding' not found in DataFrame.")

print("Loaded combined courses catalog with shape:", df.shape)

Loaded combined courses catalog with shape: (1343, 9)
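
The encoding cell later replaces missing text with an empty string, so any course without text still receives an embedding, just not a meaningful one. A minimal sketch of a check that counts such rows before encoding is shown here; the helper name `count_blank_texts` and the toy DataFrame are hypothetical and stand in for the real catalog.

```python
import pandas as pd


def count_blank_texts(frame: pd.DataFrame, column: str = "text_for_embedding") -> int:
    """Return how many rows have missing or whitespace-only text in `column`."""
    as_text = frame[column].fillna("").astype(str)
    return int((as_text.str.strip() == "").sum())


# Toy frame standing in for the real catalog (hypothetical values).
toy = pd.DataFrame(
    {"text_for_embedding": ["Intro to Python", None, "   ", "Linear Algebra"]}
)
blank_count = count_blank_texts(toy)
print("Rows with blank text:", blank_count)
```

On the real catalog this would be called as `count_blank_texts(df)`; a non-zero count suggests dropping or repairing those rows before encoding.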


In [3]:
# Choose a Sentence-BERT model
model_name = 'all-MiniLM-L6-v2'  
model = SentenceTransformer(model_name)

# Encode in batches; missing texts are replaced with empty strings first
texts = df['text_for_embedding'].fillna("").astype(str).tolist()
print("Encoding", len(texts), "courses into embeddings...")
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)  # returns a numpy array



Encoding 1343 courses into embeddings...


Batches: 100%|██████████| 42/42 [01:44<00:00,  2.49s/it]
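
Before saving, it can help to confirm that the array has one row per course and contains no NaN or infinite values. With all-MiniLM-L6-v2 the embedding dimension is 384, so the real array here should have shape (1343, 384). The sketch below also L2-normalizes each row so that a later dot product equals cosine similarity. The helper name `validate_and_normalize` and the toy array are assumptions for illustration, not part of the original notebook.

```python
import numpy as np


def validate_and_normalize(embeddings: np.ndarray, expected_rows: int) -> np.ndarray:
    """Check shape and finiteness, then return row-wise L2-normalized embeddings."""
    emb = np.asarray(embeddings, dtype=np.float32)
    if emb.ndim != 2 or emb.shape[0] != expected_rows:
        raise ValueError(
            f"Expected a 2-D array with {expected_rows} rows, got shape {emb.shape}"
        )
    if not np.isfinite(emb).all():
        raise ValueError("Embeddings contain NaN or infinite values")
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged instead of dividing by zero
    return emb / norms


# Toy array standing in for the real model output (hypothetical values).
toy_embeddings = np.array([[3.0, 4.0], [0.0, 2.0], [0.0, 0.0]])
unit_embeddings = validate_and_normalize(toy_embeddings, expected_rows=3)
print(unit_embeddings.shape)
```

In the notebook this would be called as `validate_and_normalize(embeddings, expected_rows=len(df))`. Whether to store normalized or raw vectors is a design choice; storing raw vectors keeps the option of other distance measures later.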


In [None]:
# Save embeddings to disk
emb_path = os.path.join(process_dir, "course_embeddings.npy")
np.save(emb_path, embeddings)
print("Saved embeddings to:", emb_path)

# Save the model name so we know later which model produced these embeddings
with open(os.path.join(process_dir, "embedding_model.txt"), "w", encoding="utf-8") as f:
    f.write(model_name)

Saved embeddings to: ../data/processed\course_embeddings.npy
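
The saved array can later be reloaded with `np.load` and queried by cosine similarity, which is the "fast lookup" this notebook prepares for. The following is a minimal sketch under stated assumptions: the helper `top_k_similar` is hypothetical, and a small toy array written to a temporary directory stands in for the real course_embeddings.npy so the example runs on its own.

```python
import os
import tempfile

import numpy as np


def top_k_similar(embeddings: np.ndarray, query_index: int, k: int = 3) -> list:
    """Return indices of the k rows most cosine-similar to row `query_index`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # exclude the query row itself
    k = min(k, len(scores) - 1)
    return np.argsort(-scores)[:k].tolist()


# Toy embeddings standing in for the real array (hypothetical values).
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])

# Round-trip through disk the same way the notebook saves its array.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "course_embeddings.npy")
    np.save(toy_path, toy)
    loaded = np.load(toy_path)

neighbors = top_k_similar(loaded, query_index=0, k=2)
print("Nearest rows to row 0:", neighbors)
```

Row positions in the array match row positions in the DataFrame only if the catalog is loaded in the same order, so the returned indices should be used with `df.iloc` on an unmodified copy of the catalog.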
