2. Text cleaning.
We now have a combined DataFrame of 720 courses from the edX platform and 623 from Coursera. Before moving on to text embedding, we need to clean the text (strip HTML tags, lowercase, remove URLs, collapse excess whitespace) and then check how many similar courses the catalog contains.

In [4]:
import os
import pandas as pd

# read file

process_dir = "../../data/processed"
file = "courses_combined_catalog.csv"

file_path = os.path.join(process_dir, file)
df_catalog_combined= pd.read_csv(file_path)

df_catalog_combined.head()


Unnamed: 0,global_id,platform,title,provider,level,description
0,edx_0,edx,How to Learn Online,edX,Beginner,Learn essential strategies for successful onli...
1,edx_1,edx,Programming for Everybody (Getting Started wit...,The University of Michigan,Beginner,"This course is a ""no prerequisite"" introductio..."
2,edx_2,edx,CS50's Introduction to Computer Science,Harvard University,Beginner,An introduction to the intellectual enterprise...
3,edx_3,edx,The Analytics Edge,Massachusetts Institute of Technology,Intermediate,"Through inspiring examples and stories, discov..."
4,edx_4,edx,Marketing Analytics: Marketing Measurement Str...,"University of California, Berkeley",Beginner,This course is part of a MicroMasters® Program...
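
The Unnamed: 0 column in the output above typically appears when a CSV was written with its index and then read back without index_col. Assuming that is what happened here, the sketch below shows the effect on a small hypothetical CSV (csv_text is made up for illustration, not taken from the catalog file).

```python
import io
import pandas as pd

# Hypothetical two-row CSV, written the way to_csv does with its default index=True
csv_text = (
    ",global_id,platform,title\n"
    "0,edx_0,edx,How to Learn Online\n"
    "1,edx_1,edx,The Analytics Edge\n"
)

# Default read: the unnamed index column comes back as a regular column
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Writing the file with index=False in the first place avoids the extra column as well.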


In [3]:
print("Working directory:", os.getcwd())

Working directory: c:\Users\jvlas\source\repos\TrioLearn\notebooks\courses


Text cleaning

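clean_text(text): lowercases a string and strips HTML-like tags, URLs, and any character other than a-z, 0-9, and whitespace. prepare_catalog_for_embedding(df): given a normalized DataFrame with title and description columns, adds clean_title, clean_description, and text_for_embedding (the cleaned title and description joined by a space).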

In [9]:
import re

def clean_text(text):
    """
    Basic text cleaning: lowercase, remove HTML-like tags, URLs, non-alphanumeric chars.
    """
    if not isinstance(text, str):
        return ""
    text = text.lower()
    # Remove HTML tags
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove URLs
    text = re.sub(r"http\S+", " ", text)
    # Remove non-alphanumeric characters (keep spaces)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text
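
A quick sanity check of the cleaning steps on a few strings. The function body is repeated so the snippet runs on its own, and the sample strings are made up for illustration. Note that the a-z filter also drops accented letters, which may matter for non-English titles.

```python
import re

def clean_text(text):
    """Same steps as the cell above, repeated so this snippet is self-contained."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # HTML-like tags
    text = re.sub(r"http\S+", " ", text)      # URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # anything outside a-z, 0-9, whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

# Hypothetical inputs: HTML with a URL, an apostrophe, accented letters, a missing value
samples = [
    "<p>Learn <b>Python</b> at https://example.com/course!</p>",
    "CS50's Introduction to Computer Science",
    "Café Économie",
    None,
]
cleaned = [clean_text(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

The apostrophe in "CS50's" becomes a space, which is why the cleaned catalog shows "cs50 s introduction to computer science".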

In [24]:


from sentence_transformers import SentenceTransformer


# Clean the title and description, then join them into one text field for embedding

def prepare_catalog_for_embedding(df, title_col='title', desc_col='description'):
    df2 = df.copy()
    # Ensure columns exist
    if title_col not in df2.columns:
        df2[title_col] = ""
    if desc_col not in df2.columns:
        df2[desc_col] = ""
    # Clean
    df2['clean_title'] = df2[title_col].fillna("").astype(str).apply(clean_text)
    df2['clean_description'] = df2[desc_col].fillna("").astype(str).apply(clean_text)
    # Combine title + description (you can also add subjects/tags if available)
    df2['text_for_embedding'] = (df2['clean_title'] + " " + df2['clean_description']).str.strip()
    
    return df2

prepared_catalog = prepare_catalog_for_embedding(df_catalog_combined,  title_col='title', desc_col='description')
prepared_catalog.head()

Unnamed: 0,global_id,platform,title,provider,level,description,clean_title,clean_description,text_for_embedding
0,edx_0,edx,How to Learn Online,edX,Beginner,Learn essential strategies for successful onli...,how to learn online,learn essential strategies for successful onli...,how to learn online learn essential strategies...
1,edx_1,edx,Programming for Everybody (Getting Started wit...,The University of Michigan,Beginner,"This course is a ""no prerequisite"" introductio...",programming for everybody getting started with...,this course is a no prerequisite introduction ...,programming for everybody getting started with...
2,edx_2,edx,CS50's Introduction to Computer Science,Harvard University,Beginner,An introduction to the intellectual enterprise...,cs50 s introduction to computer science,an introduction to the intellectual enterprise...,cs50 s introduction to computer science an int...
3,edx_3,edx,The Analytics Edge,Massachusetts Institute of Technology,Intermediate,"Through inspiring examples and stories, discov...",the analytics edge,through inspiring examples and stories discove...,the analytics edge through inspiring examples ...
4,edx_4,edx,Marketing Analytics: Marketing Measurement Str...,"University of California, Berkeley",Beginner,This course is part of a MicroMasters® Program...,marketing analytics marketing measurement stra...,this course is part of a micromasters program ...,marketing analytics marketing measurement stra...


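find_similar_between_platforms(df, platform1, platform2, text_field, threshold): identifies pairs of similar courses across two platforms in the combined DataFrame, based on the cosine similarity of TF-IDF vectors built from text_field (by default text_for_embedding, i.e. the cleaned title plus description). It returns a DataFrame listing index pairs (idx1, idx2), the similarity score, and the original titles. The indices are positions within each platform's subset after reset_index, not rows of the combined catalog. The default threshold is 0.2; the run below uses 0.7, and the value can be adjusted to be more or less strict.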
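
To make the similarity score concrete, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity. It assumes smoothed IDF and L2 normalisation, which is my understanding of the TfidfVectorizer defaults; it skips stop-word removal and scikit-learn's tokenizer, so its numbers will not match the notebook's exactly. The helper names tfidf_vectors and cosine are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumption: mirrors scikit-learn's default formula)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalise so that cosine similarity reduces to a dot product
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Dot product of two L2-normalised sparse vectors."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

docs = [
    "python data structures",
    "python data structures",
    "introduction to cloud computing",
]
vecs = tfidf_vectors(docs)
identical = cosine(vecs[0], vecs[1])  # same text on both sides
disjoint = cosine(vecs[0], vecs[2])   # no shared terms
print(round(identical, 3), round(disjoint, 3))
```

Identical texts score 1 and texts with no shared terms score 0; real course pairs fall in between, which is what the threshold cuts on.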

Notes:
Threshold tuning: A higher threshold (e.g., 0.8–0.9) keeps only very similar texts; a lower threshold (e.g., 0.5–0.7) finds more approximate matches but may include false positives.
Performance: The pairwise comparison is O(n*m) in time and memory. For large catalogs (thousands of courses per platform), consider optimizations such as approximate nearest neighbors on embeddings (Sentence-BERT) or sparse matrix operations that only inspect high-similarity pairs.
Alternative matching:
Use title embeddings (via Sentence-BERT) and approximate nearest neighbor search (e.g., FAISS) for semantic matching beyond exact words.
Match on a different field: the default text_field combines title and description; passing text_field='clean_title' compares titles only.
Cleaning: Clean the text (remove punctuation, lowercase) before vectorizing.
Workflow: It is fine to normalize and clean each dataset separately, then combine them and run similarity matching. Cleaning the combined DataFrame also works, but normalizing per dataset first ensures a consistent schema.

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
def find_similar_between_platforms(df, platform1='edx', platform2='coursera', 
                                   text_field='text_for_embedding', threshold=0.2):
    """
    Identify similar courses between two platforms within a combined catalog DataFrame.
    - df: DataFrame containing at least columns ['platform', 'title', 'description'].
    - platform1, platform2: names of the two platforms to compare (e.g., 'edx' and 'coursera').
    - text_field: which cleaned text column to vectorize for similarity (default: 'text_for_embedding').
    - threshold: cosine similarity threshold (0 to 1). Pairs with similarity >= threshold are returned.

    Returns a DataFrame with columns ['idx1', 'idx2', 'similarity', 'title1', 'title2'],
    where idx1 and idx2 are positions within each platform's subset (after reset_index).
    """
    # Prepare cleaned text columns
    df_clean = prepare_catalog_for_embedding(df)
    
    # Split by platform
    df1 = df_clean[df_clean['platform'] == platform1].reset_index(drop=True)
    df2 = df_clean[df_clean['platform'] == platform2].reset_index(drop=True)
    if df1.empty or df2.empty:
        print(f"No entries found for one of the platforms: {platform1} or {platform2}.")
        return pd.DataFrame(columns=['idx1', 'idx2', 'similarity', 'title1', 'title2'])
    
    # Vectorize the text_field for both subsets using a shared TF-IDF vectorizer
    # Combine corpora to ensure same feature space
    corpus = pd.concat([df1[text_field], df2[text_field]], ignore_index=True)
    vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
    vectorizer.fit(corpus)
    tfidf1 = vectorizer.transform(df1[text_field])
    tfidf2 = vectorizer.transform(df2[text_field])
    
    # Compute cosine similarity matrix between df1 and df2
    sim_matrix = cosine_similarity(tfidf1, tfidf2)
    
    # Collect matches above threshold
    matches = []
    for i in range(sim_matrix.shape[0]):
        for j in range(sim_matrix.shape[1]):
            score = sim_matrix[i, j]
            if score >= threshold and df1.loc[i, 'clean_title'] and df2.loc[j, 'clean_title']:
                matches.append({
                    'idx1': i,
                    'idx2': j,
                    'similarity': score,
                    'title1': df1.loc[i, 'title'],
                    'title2': df2.loc[j, 'title']
                })
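    # Pass the columns explicitly so an empty result still has the expected schema
    matches_df = pd.DataFrame(matches, columns=['idx1', 'idx2', 'similarity', 'title1', 'title2'])
    return matches_df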
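
The nested loop above visits every (i, j) pair in Python. A possible vectorized alternative, sketched here on a small hypothetical similarity matrix, lets NumPy find the entries at or above the threshold; the same boolean mask also gives a quick count of matches at several thresholds, which helps with tuning. This sketch leaves out the check for empty cleaned titles that the function performs.

```python
import numpy as np

# Hypothetical 3 x 4 matrix standing in for sim_matrix (rows: platform1, cols: platform2)
sim_matrix = np.array([
    [0.10, 0.83, 0.05, 0.00],
    [0.72, 0.20, 0.15, 0.71],
    [0.00, 0.30, 0.69, 0.10],
])
threshold = 0.7

# Row and column positions of all entries at or above the threshold
rows, cols = np.nonzero(sim_matrix >= threshold)
pairs = [(int(i), int(j), float(sim_matrix[i, j])) for i, j in zip(rows, cols)]
print(pairs)

# Number of candidate pairs at a few thresholds
counts = {t: int((sim_matrix >= t).sum()) for t in (0.5, 0.7, 0.8)}
print(counts)
```

Each tuple in pairs corresponds to one (idx1, idx2, similarity) record that the loop would have appended.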


In [38]:

matches = find_similar_between_platforms(df_catalog_combined, platform1='edx', platform2='coursera', threshold=0.7)
print(f"Found {len(matches)} similar course pairs between EdX and Coursera")
display(matches)


Found 11 similar course pairs between EdX and Coursera


Unnamed: 0,idx1,idx2,similarity,title1,title2
0,1,180,0.82781,Programming for Everybody (Getting Started wit...,Programming for Everybody (Getting Started wit...
1,72,53,0.715437,AI for Everyone: Master the Basics,AI For Everyone
2,72,323,0.761874,AI for Everyone: Master the Basics,Introduction to Back-End Development
3,91,136,0.761833,Introduction to Data Science,What is Data Science?
4,174,255,0.783938,Python Data Structures,Python Data Structures
5,228,433,0.720452,Linear Algebra - Foundations to Frontiers,Essential Linear Algebra for Data Science
6,382,301,0.763361,Computer Vision Fundamentals with Watson and O...,Introduction to Computers and Operating System...
7,384,290,0.763439,Introduction to Cloud Computing,Introduction to Computer Science and Programming
8,384,500,0.707846,Introduction to Cloud Computing,"Cloud Computing Applications, Part 1: Cloud Sy..."
9,468,109,0.702595,AWS: Getting Started with Cloud Security,AWS Cloud Solutions Architect
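
Some pairs in the table appear unrelated judging by their titles (for example "AI for Everyone: Master the Basics" and "Introduction to Back-End Development"), which suggests the match came from overlapping description text. One possible secondary check, not part of the notebook, is the token overlap (Jaccard index) of the cleaned titles. The helper title_jaccard is hypothetical, and the inputs are the titles from the table after cleaning by hand.

```python
def title_jaccard(title_a, title_b):
    """Jaccard index of the word sets of two already-cleaned titles."""
    tokens_a, tokens_b = set(title_a.split()), set(title_b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Cleaned versions of three title pairs from the table above
same = title_jaccard("python data structures", "python data structures")
partial = title_jaccard("ai for everyone master the basics", "ai for everyone")
unrelated = title_jaccard("ai for everyone master the basics",
                          "introduction to back end development")
print(same, partial, unrelated)
```

A pair with a high TF-IDF score but zero title overlap is a candidate for manual review rather than an automatic duplicate.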


In [40]:
# Save the cleaned catalog next to the combined catalog it was read from
os.makedirs(process_dir, exist_ok=True)
prepared_catalog.to_csv(os.path.join(process_dir, "courses_combined_cleaned.csv"), index=False)

print(os.listdir(process_dir))

['courses_combined_catalog.csv', 'courses_combined_cleaned.csv']
