<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **Calculate Course Similarity using BoW Features**


Estimated time needed: **45** minutes


Similarity measurement between items is the foundation of many recommendation algorithms, especially content-based ones. For example, if a new course is similar to a user's enrolled courses, we could recommend that new course to the user. Or, if user A is similar to user B, we can recommend to user A some of user B's courses that user A has not yet seen, because the two users may have similar interests.


In a previous course, you learned many similarity measurements such as `cosine`, `jaccard index`, or `euclidean distance`. These methods operate on either two vectors or two sets (and sometimes on matrices or tensors).

In previous labs, we extracted the BoW features from course textual content. Given the course BoW feature vectors, we can easily apply a similarity measurement to calculate the course similarity, as shown in the figure below.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_2/images/course_sim.png)


## Objectives


After completing this lab you will be able to:


* Calculate the similarity between any two courses using BoW feature vectors


----


## Prepare and set up lab environment


First let's install and import required libraries:


In [116]:
!pip install nltk
!pip install gensim
!pip install scipy==1.10
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn


In [117]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import gensim
import nltk

from scipy.spatial.distance import cosine
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import ngrams
from gensim import corpora

%matplotlib inline

In [118]:
# also set a random state
rs = 123

### Calculate the cosine similarity between two example courses


Suppose we have two simple example courses:


In [119]:
course1 = "machine learning for everyone"

In [120]:
course2 = "machine learning for beginners"

Next we can quickly tokenize them using the split() method (or using `word_tokenize()` method provided in `nltk` as we did in the previous lab).


In [121]:
tokens = set(course1.split() + course2.split())

In [122]:
tokens = list(tokens)
tokens

['machine', 'for', 'learning', 'everyone', 'beginners']

Then we generate BoW features (token counts) for these two courses (alternatively, you could use the `tokens_dict.doc2bow()` method provided in `gensim`, similar to what we did in the previous lab).


In [123]:
def generate_sparse_bow(course, tokens):
    """
    Generate a binary bag-of-words (BoW) vector for a given course.

    Despite the function name, the result is a dense list with one entry per token.

    Parameters:
    course (str): The input course text to generate the BoW representation for.
    tokens (list): The vocabulary of unique tokens; it defines the order of the vector entries.

    Returns:
    list: A BoW vector where each element indicates the presence (1) or absence (0)
    of the corresponding token in the input course text.
    """

    # Initialize an empty list to store the BoW vector
    bow_vector = []

    # Tokenize the course text by splitting it into words
    words = course.split()

    # Iterate through the shared vocabulary in its given order
    for token in tokens:
        # Check if the token is present in the course text
        if token in words:
            # If the token is present, append 1 to the BoW vector
            bow_vector.append(1)
        else:
            # If the token is not present, append 0 to the BoW vector
            bow_vector.append(0)

    # Return the sparse BoW vector
    return bow_vector


In [124]:
bow1 = generate_sparse_bow(course1,tokens)
bow1

[1, 1, 1, 1, 0]

In [125]:
bow2 = generate_sparse_bow(course2,tokens)
bow2

[1, 1, 1, 0, 1]

From the above cell outputs, we can see that the two vectors are very similar: they differ in only two dimensions, the ones for `everyone` and `beginners`.
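The `generate_sparse_bow` function records only whether a token is present. A count-based variant, closer to the token counts used later in this lab, might look like the following minimal sketch. The helper name `generate_count_bow`, the vocabulary order, and the example sentence are hypothetical and not part of the lab.

```python
from collections import Counter

def generate_count_bow(text, vocabulary):
    """Return token counts for `text`, ordered like `vocabulary` (hypothetical helper)."""
    counts = Counter(text.split())  # whitespace tokenization, as in the cells above
    # A Counter returns 0 for tokens it has never seen
    return [counts[token] for token in vocabulary]

# Made-up vocabulary order and sentence, for illustration only
vocabulary = ["machine", "learning", "for", "everyone", "beginners"]
count_bow = generate_count_bow("machine learning for machine learning beginners", vocabulary)
print(count_bow)
```

Repeated tokens now contribute more than once, so the vector carries frequency information instead of plain presence flags.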


Now we can quickly apply the cosine similarity measurement on the two vectors:


In [126]:
cos_sim = 1 - cosine(bow1, bow2)

In [127]:
print(f"The cosine similarity between course `{course1}` and course `{course2}` is {round(cos_sim, 2) * 100}%")

The cosine similarity between course `machine learning for everyone` and course `machine learning for beginners` is 75.0%
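To see where the 75% comes from, the cosine similarity can also be computed by hand: the dot product of the two vectors is 3 and each vector has length 2, so the result is 3 / (2 × 2). The sketch below assumes plain Python lists; `cosine_similarity` is a hypothetical helper, not a function from `scipy`, and returning 0 for an all-zero vector is an assumption made here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)

bow_a = [1, 1, 1, 1, 0]  # "machine learning for everyone"
bow_b = [1, 1, 1, 0, 1]  # "machine learning for beginners"
sim = cosine_similarity(bow_a, bow_b)  # 3 / (2 * 2)
print(sim)
```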


_Practice: Try other similarity measurements such as Euclidean Distance or Jaccard index._


In [128]:
# WRITE YOUR CODE HERE
from sklearn.metrics import jaccard_score

jaccard_score(bow1, bow2)



0.6

In [129]:
from scipy.spatial.distance import euclidean

euclidean(np.array(bow1) , np.array(bow2))

1.4142135623730951

For example, the Euclidean distance between two $n$-dimensional points $p$ and $q$ is given by $d(p,q)=\sqrt{\sum_{i=1}^{n}(p_{i}-q_{i})^{2}}$. You can use the `euclidean(p, q)` function from the `scipy` package to calculate it.
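Both practice measures can also be checked without any third-party library. The following is a minimal sketch using the two example vectors; `jaccard_index` is a hypothetical helper written for illustration, and `math.dist` is the standard-library Euclidean distance (Python 3.8+).

```python
import math

def jaccard_index(u, v):
    """Jaccard index of two binary vectors: |intersection| / |union| (hypothetical helper)."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0  # assumption: two empty vectors score 0

p = [1, 1, 1, 1, 0]
q = [1, 1, 1, 0, 1]
print(jaccard_index(p, q))  # 3 shared tokens out of 5 tokens in the union
print(math.dist(p, q))      # square root of the number of differing dimensions (2)
```

Note that Jaccard index and cosine similarity grow with similarity, while Euclidean distance shrinks, so a threshold has to be read in the opposite direction for distances.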


### TASK: Find similar courses to the course `Machine Learning with Python`


Now you have learned how to calculate cosine similarity between two sample BoW feature vectors. Let's work on some real course BoW feature vectors.


In [130]:
# Load the BoW features as Pandas dataframe
bows_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/courses_bows.csv"
bows_df = pd.read_csv(bows_url)
bows_df = bows_df[['doc_id', 'token', 'bow']]

In [131]:
bows_df.head(10)

Unnamed: 0,doc_id,token,bow
0,ML0201EN,ai,2
1,ML0201EN,apps,2
2,ML0201EN,build,2
3,ML0201EN,cloud,1
4,ML0201EN,coming,1
5,ML0201EN,create,1
6,ML0201EN,data,1
7,ML0201EN,developer,1
8,ML0201EN,found,1
9,ML0201EN,fun,1


The `bows_df` dataframe stores the BoW feature vectors for all courses in a vertical (long) format, with one row per course-token pair. It has three columns: `doc_id` is the course ID, `token` is the token value, and `bow` is the BoW value (token count).
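One way to think about this layout is as a list of `(doc_id, token, bow)` records that can be grouped into one token-count mapping per course. The sketch below uses plain Python rather than Pandas; `group_bows` is a hypothetical helper, and the `DEMO01` course and its counts are made up for illustration.

```python
from collections import defaultdict

# Rows mirror the (doc_id, token, bow) layout; the DEMO01 rows are invented.
rows = [
    ("ML0201EN", "ai", 2),
    ("ML0201EN", "apps", 2),
    ("ML0201EN", "cloud", 1),
    ("DEMO01", "ai", 1),
    ("DEMO01", "data", 3),
]

def group_bows(rows):
    """Collect long-format rows into {doc_id: {token: count}} (hypothetical helper)."""
    grouped = defaultdict(dict)
    for doc_id, token, count in rows:
        grouped[doc_id][token] = count
    return dict(grouped)

course_bows = group_bows(rows)
print(course_bows["ML0201EN"])
```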


Then, let's load another course content dataset which contains the course title and description:


In [132]:
# Load the course dataframe
course_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_processed.csv"
course_df = pd.read_csv(course_url)

In [133]:
course_df.head(10)

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
0,ML0201EN,robots are coming build iot apps with watson ...,have fun with iot and learn along the way if ...
1,ML0122EN,accelerating deep learning with gpu,training complex deep learning models with lar...
2,GPXX0ZG0EN,consuming restful services using the reactive ...,learn how to use a reactive jax rs client to a...
3,RP0105EN,analyzing big data in r using apache spark,apache spark is a popular cluster computing fr...
4,GPXX0Z2PEN,containerizing packaging and running a sprin...,learn how to containerize package and run a ...
5,CNSC02EN,cloud native security conference data security,introduction to data security on cloud
6,DX0106EN,data science bootcamp with r for university pr...,a multi day intensive in person data science ...
7,GPXX0FTCEN,learn how to use docker containers for iterati...,learn how to use docker containers for iterati...
8,RAVSCTEST1,scorm test 1,scron test course
9,GPXX06RFEN,create your first mongodb database,in this guided project you will get started w...


Given course ID `ML0101ENv3`, let's find out its title and description:


In [134]:
course_df[course_df['COURSE_ID'] == 'ML0101ENv3']

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
158,ML0101ENv3,machine learning with python,machine learning can be an incredibly benefici...


We can see it is a course on machine learning with Python, so we can expect other machine learning or Python related courses to be similar.


Then, let's print its associated BoW features:


In [135]:
ml_course = bows_df[bows_df['doc_id'] == 'ML0101ENv3']
ml_course

Unnamed: 0,doc_id,token,bow
2747,ML0101ENv3,course,1
2748,ML0101ENv3,learning,4
2749,ML0101ENv3,machine,3
2750,ML0101ENv3,need,1
2751,ML0101ENv3,get,1
2752,ML0101ENv3,started,1
2753,ML0101ENv3,python,2
2754,ML0101ENv3,tool,1
2755,ML0101ENv3,tools,1
2756,ML0101ENv3,predict,1


We can see the BoW feature vector is in vertical format but normally feature vectors are in horizontal format. One way to transpose the feature vector from vertical to horizontal is to use the Pandas `pivot()` method:


In [136]:
ml_courseT = ml_course.pivot(index=['doc_id'], columns='token').reset_index(level=[0])
ml_courseT

Unnamed: 0_level_0,doc_id,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow
token,Unnamed: 1_level_1,beneficial,course,free,future,get,give,hidden,insights,learning,machine,need,predict,python,started,supervised,tool,tools,trends,unsupervised
0,ML0101ENv3,1,1,1,1,1,1,1,1,4,3,1,1,2,1,1,1,1,1,1


To compare the BoWs of any two courses, which normally have a different set of tokens, we need to create a union token set and then transpose them. We have provided a method called `pivot_two_bows` as follows:


In [137]:
def pivot_two_bows(basedoc, comparedoc):
    """
    Pivot two bag-of-words (BoW) representations for comparison.

    Parameters:
    basedoc (DataFrame): DataFrame containing the bag-of-words representation for the base document.
    comparedoc (DataFrame): DataFrame containing the bag-of-words representation for the document to compare.

    Returns:
    DataFrame: A DataFrame with pivoted BoW representations for the base and compared documents,
    facilitating direct comparison of word occurrences between the two documents.
    """

    # Create copies of the input DataFrames to avoid modifying the originals
    base = basedoc.copy()
    base['type'] = 'base'  # Add a 'type' column indicating base document
    compare = comparedoc.copy()
    compare['type'] = 'compare'  # Add a 'type' column indicating compared document

    # Concatenate the two DataFrames vertically
    join = pd.concat([base, compare])

    # Pivot the concatenated DataFrame based on 'doc_id' and 'type', with words as columns
    joinT = join.pivot(index=['doc_id', 'type'], columns='token').fillna(0).reset_index(level=[0, 1])

    # Assign meaningful column names to the pivoted DataFrame
    joinT.columns = ['doc_id', 'type'] + [t[1] for t in joinT.columns][2:]

    # Return the pivoted DataFrame for comparison
    return joinT


In [138]:
course1 = bows_df[bows_df['doc_id'] == 'ML0151EN']
course2 = bows_df[bows_df['doc_id'] == 'ML0101ENv3']

In [139]:
bow_vectors = pivot_two_bows(course1, course2)
bow_vectors

Unnamed: 0,doc_id,type,approachable,basics,beneficial,comparison,course,dives,free,future,...,relates,started,statistical,supervised,tool,tools,trends,unsupervised,using,vs
0,ML0101ENv3,compare,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
1,ML0151EN,base,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0


Similarly, we can use the cosine method to calculate their similarity:


In [140]:
similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])
similarity

0.662622139954909
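Conceptually, `pivot_two_bows` aligns both courses on the union of their tokens and fills missing counts with 0, so that the two vectors have the same dimensions. A plain-Python sketch of the same idea follows; `align_bows` is a hypothetical helper and the token counts are made up, not taken from the dataset.

```python
def align_bows(bow_a, bow_b):
    """Align two {token: count} dicts on their union vocabulary (hypothetical helper)."""
    vocabulary = sorted(set(bow_a) | set(bow_b))
    # Tokens missing from a course get a count of 0, like fillna(0) after the pivot
    vec_a = [bow_a.get(token, 0) for token in vocabulary]
    vec_b = [bow_b.get(token, 0) for token in vocabulary]
    return vocabulary, vec_a, vec_b

# Made-up token counts, for illustration only
bow_a = {"machine": 3, "learning": 4, "python": 2}
bow_b = {"machine": 1, "learning": 1, "r": 1}

vocabulary, vec_a, vec_b = align_bows(bow_a, bow_b)
print(vocabulary)
print(vec_a)
print(vec_b)
```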

Now it's your turn to perform a task of finding all courses similar to the course `Machine Learning with Python`:


In [141]:
course_df[course_df['COURSE_ID'] == 'ML0101ENv3']

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
158,ML0101ENv3,machine learning with python,machine learning can be an incredibly benefici...


You can set a similarity threshold such as 0.5 to determine if two courses are similar enough.


_TODO: Find courses that are similar to the course `Machine Learning with Python (ML0101ENv3)`. You also need to show the titles and descriptions of those courses._


In [142]:
# WRITE YOUR CODE HERE

## For each course other than ML0101ENv3, use pivot_two_bows to convert it, together with course ML0101ENv3, into two horizontal BoW feature vectors
## Then use the cosine method to calculate the similarity
## Report all courses with similarities larger than a specific threshold (such as 0.5)

docs = set(bows_df['doc_id'])
docs = list(docs)
docs


['excourse08',
 'ST0101EN',
 'excourse93',
 'CB0103EN',
 'RP0101EN',
 'excourse87',
 'excourse61',
 'PA0103EN',
 'DW0101EN',
 'excourse92',
 'excourse56',
 'excourse62',
 'excourse64',
 'excourse47',
 'GPXX04P5EN',
 'GPXX06RFEN',
 'BD0121EN',
 'GPXX07UGEN',
 'CO0101EN',
 'SC0105EN',
 'excourse07',
 'excourse51',
 'DA0201EN',
 'excourse16',
 'GPXX03HFEN',
 'CB0101EN',
 'excourse14',
 'TMP0105EN',
 'excourse12',
 'GPXX04TNEN',
 'excourse04',
 'GPXX0UN5EN',
 'excourse84',
 'excourse85',
 'excourse50',
 'GPXX0YBFEN',
 'excourse39',
 'DX0108EN',
 'GPXX04XJEN',
 'GPXX0742EN',
 'BD0137EN',
 'excourse55',
 'GPXX0KY1EN',
 'RP0105EN',
 'BD0133EN',
 'DV0151EN',
 'excourse44',
 'excourse21',
 'TA0106EN',
 'CC0150EN',
 'GPXX0ZG0EN',
 'SC0103EN',
 'GPXX0YMEEN',
 'excourse68',
 'excourse03',
 'CL0101EN',
 'excourse27',
 'GPXX0G3KEN',
 'excourse90',
 'excourse74',
 'excourse88',
 'excourse78',
 'ML0151EN',
 'excourse15',
 'DS0301EN',
 'excourse59',
 'EE0101EN',
 'BD0143EN',
 'BD0123EN',
 'GPXX0ZYVEN',

In [143]:
similar_courses = []
course1 = bows_df[bows_df['doc_id'] =='ML0101ENv3']
for i in docs:
    if i == 'ML0101ENv3':
        continue
    course2 = bows_df[bows_df['doc_id'] == i] 
    courses = pivot_two_bows(course1, course2)
    similarity = (1 - cosine(courses.iloc[0, 2:], courses.iloc[1, 2:]))
    if similarity > 0.5:
        print(similarity)
        similar_courses.append(i)
similar_courses


0.6347547807096177
0.662622139954909
0.5490400192158565
0.5217491947499509
0.6120541193300345


['excourse47', 'ML0151EN', 'excourse60', 'ML0109EN', 'excourse46']

In [144]:
course_df[course_df['COURSE_ID'].isin(similar_courses)]

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
157,ML0109EN,machine learning dimensionality reduction,machine learning dimensionality reduction
200,ML0151EN,machine learning with r,this machine learning with r course dives into...
259,excourse46,machine learning,machine learning is the science of getting com...
260,excourse47,machine learning for all,machine learning often called artificial inte...
273,excourse60,introduction to tensorflow for artificial inte...,if you are a software developer who wants to b...
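The search above prints similarities and course IDs separately. As an alternative sketch, the scores can be kept next to their IDs and sorted so the most similar course comes first. This version works on `{token: count}` dictionaries instead of pivoted dataframes; `sparse_cosine` and `rank_similar` are hypothetical helpers, and the course IDs and counts below are made up.

```python
import math

def sparse_cosine(bow_a, bow_b):
    """Cosine similarity of two {token: count} dicts (hypothetical helper)."""
    dot = sum(count * bow_b.get(token, 0) for token, count in bow_a.items())
    norm_a = math.sqrt(sum(c * c for c in bow_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bow_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_similar(target_id, course_bows, threshold=0.5):
    """Return (course_id, similarity) pairs above `threshold`, most similar first."""
    target = course_bows[target_id]
    scored = [
        (course_id, sparse_cosine(target, bow))
        for course_id, bow in course_bows.items()
        if course_id != target_id  # skip the target course itself
    ]
    return sorted(
        (pair for pair in scored if pair[1] > threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Made-up course IDs and token counts, for illustration only
course_bows = {
    "A": {"machine": 1, "learning": 1, "python": 1},
    "B": {"machine": 1, "learning": 1, "r": 1},
    "C": {"cloud": 1, "docker": 1},
}
ranked = rank_similar("A", course_bows)
print(ranked)
```

Only tokens shared by both courses contribute to the dot product, so no union vocabulary has to be built for each pair.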


In [None]:
#!/usr/bin/env python3
"""
Create data/external/courses.csv (columns: item,title)

Sources (in priority order):
1) Official IBM catalog (course_processed.csv) via HTTPS
2) Hand-maintained overrides for common IDs (top/popular)
3) Fallback: readable title from the code (underscores/hyphens → spaces, title case)

Outputs:
- data/external/courses.csv
- logs/missing_courses.txt (any items we couldn't title from the official source)
"""
from __future__ import annotations
from pathlib import Path
import sys
import pandas as pd

# --- Paths
ROOT = Path(__file__).resolve().parents[1]
RATINGS_PATH = ROOT / "data" / "external" / "course_ratings.csv"
OUT_PATH = ROOT / "data" / "external" / "courses.csv"
LOGS_DIR = ROOT / "logs"
MISS_PATH = LOGS_DIR / "missing_courses.txt"

# --- Official catalog (same family as the course labs)
CATALOG_URL = (
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
    "IBM-ML321EN-SkillsNetwork/labs/datasets/course_processed.csv"
)

# --- Optional overrides for frequent/popular IDs (exactly as you want them displayed)
OVERRIDES = {
    "PY0101EN": "Python for Data Science",
    "DS0101EN": "Introduction to Data Science",
    "BD0101EN": "Big Data 101",
    "BD0111EN": "Hadoop 101",
    "DA0101EN": "Data Analysis with Python",
    "DS0103EN": "Data Science Methodology",
    "ML0101ENv3": "Machine Learning with Python",
    "BD0211EN": "Spark Fundamentals I",
    "DS0105EN": "Data Science Hands-On with Open Source Tools",
    "BC0101EN": "Blockchain Essentials",
    "DV0101EN": "Data Visualization with Python",
    "ML0115EN": "Deep Learning 101",
    "CB0103EN": "Build Your Own Chatbot",
    "RP0101EN": "R for Data Science",
    "ST0101EN": "Statistics 101",
    "CC0101EN": "Introduction to Cloud",
    "CO0101EN": "Docker Essentials: A Developer Introduction",
    "DB0101EN": "SQL and Relational Databases 101",
    "BD0115EN": "MapReduce and YARN",
    "DS0301EN": "Data Privacy Fundamentals",
}

def prettify(code: str) -> str:
    """Readable title from an item code as a last-resort fallback."""
    import re  # local import keeps this helper self-contained

    # Replace separators with spaces
    t = code.replace("_", " ").replace("-", " ")
    # Simple heuristics: split at letter/digit boundaries
    t = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", t)
    t = re.sub(r"(?<=\d)(?=[A-Za-z])", " ", t)
    # Title-case, then restore common acronyms only where they are whole words
    # (a plain str.replace would also rewrite matches inside longer words)
    title = t.title()
    for acro in ["AI", "ML", "SQL", "GPU", "NLP", "R", "IoT"]:
        title = re.sub(rf"\b{acro.title()}\b", acro, title)
    return title.strip()

def main() -> int:
    if not RATINGS_PATH.exists():
        print(f"ERROR: {RATINGS_PATH} not found. Export ratings first.", file=sys.stderr)
        return 2

    LOGS_DIR.mkdir(parents=True, exist_ok=True)
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)

    # 1) Course IDs actually present in your project
    ratings = pd.read_csv(RATINGS_PATH, usecols=["user", "item", "rating"])
    items = pd.DataFrame({"item": sorted(ratings["item"].astype(str).unique())})

    # 2) Try official catalog (COURSE_ID, TITLE)
    catalog = None
    try:
        cat_raw = pd.read_csv(CATALOG_URL, dtype=str, usecols=["COURSE_ID", "TITLE"])
        catalog = (
            cat_raw.drop_duplicates(subset=["COURSE_ID"])
            .rename(columns={"COURSE_ID": "item", "TITLE": "title"})
            .assign(item=lambda d: d["item"].astype(str))
        )
    except Exception as e:
        print(f"WARNING: could not download official catalog ({e}). Proceeding with overrides/fallbacks.", file=sys.stderr)

    # Start assembling mapping
    out = items.copy()
    out["title"] = None

    # Join from official catalog if available
    if catalog is not None:
        out = out.merge(catalog, on="item", how="left", suffixes=("", "_cat"))
        out["title"] = out["title"].where(out["title"].notna(), out.get("title_cat"))
        if "title_cat" in out:
            out = out.drop(columns=["title_cat"])

    # Apply overrides
    mask_missing = out["title"].isna()
    if mask_missing.any():
        out.loc[mask_missing, "title"] = [
            OVERRIDES.get(i, None) for i in out.loc[mask_missing, "item"]
        ]

    # Fallback prettify for any remaining gaps
    mask_missing = out["title"].isna()
    if mask_missing.any():
        out.loc[mask_missing, "title"] = [
            prettify(i) for i in out.loc[mask_missing, "item"]
        ]

    # Save and log any that still look suspicious (rare)
    out = out[["item", "title"]].drop_duplicates()
    out.to_csv(OUT_PATH, index=False)
    print(f"Wrote {len(out):,} rows → {OUT_PATH}")

    # Log entries that were not from the official catalog (for your review)
    if catalog is not None:
        from_cat = set(catalog["item"])
        guessed = out[~out["item"].isin(from_cat)]
        if not guessed.empty:
            MISS_PATH.write_text(
                "\n".join([f"{r.item},{r.title}" for r in guessed.itertuples(index=False)])
            )
            print(f"Logged {len(guessed)} non-catalog titles → {MISS_PATH}")

    return 0

if __name__ == "__main__":
    raise SystemExit(main())

<details>
    <summary>Click here for Hints</summary>
    
You can use `bows_df[bows_df['doc_id'] == 'ML0101ENv3']` to find the BoW of course 'ML0101ENv3'. In a similar manner, you can find the BoW for each course ID other than 'ML0101ENv3'. Then join the two BoWs with the predefined `pivot_two_bows` function and calculate the similarity using the cosine method, as we just did. Print the course IDs with similarity > 0.5.
</details>


### Summary


Congratulations, you have finished the course similarity lab. In this lab, you used cosine similarity and course BoW features to calculate the similarities among courses. Such similarity measurement is the core of many content-based recommender systems, which you will learn about and practice in later labs.


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2021-10-25|1.0|Yan|Created the initial version|


Copyright © 2021 IBM Corporation. All rights reserved.
