# **Calculate Course Similarity using BoW Features**


Measuring similarity between items is the foundation of many recommendation algorithms, especially content-based ones. For example, if a new course is similar to the courses a user has already enrolled in, we could recommend that new course to the user. Or, if user A is similar to user B, we can recommend to user A some of user B's courses that user A has not yet seen, because the two users may have similar interests.

We have learned many similarity measurements, such as cosine similarity, the Jaccard index, and Euclidean distance. These methods operate on either two vectors or two sets (sometimes even matrices or tensors).
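For reference, cosine similarity and the Jaccard index are usually defined as follows for two vectors $p, q \in \mathbb{R}^n$ and two sets $A$, $B$. These are the standard textbook definitions, stated here as a reminder rather than taken from this lab:

$$
\cos(p, q) = \frac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert}
           = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}},
\qquad
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

Both values lie between 0 and 1 for non-negative count vectors and non-empty sets, and a larger value means the two items are more similar.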

In previous labs, we extracted the BoW features from course textual content. Given the course BoW feature vectors, we can easily apply a similarity measurement to calculate the course similarity, as shown in the figure below.

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_2/images/course_sim.png)

## Objectives

* Calculate the similarity between any two courses using BoW feature vectors

In [1]:
!pip install nltk
!pip install gensim
!pip install scipy==1.10
!pip install pandas
!pip install matplotlib
!pip install seaborn

Collecting scipy==1.10
  Downloading scipy-1.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
Downloading scipy-1.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.13.1
    Uninstalling scipy-1.13.1:
      Successfully uninstalled scipy-1.13.1
Successfully installed scipy-1.10.0


In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import gensim
import nltk

from scipy.spatial.distance import cosine
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import ngrams
from gensim import corpora

%matplotlib inline

In [3]:
# also set a random state
rs = 123

### Calculate the cosine similarity between two example courses

Suppose we have two simple example courses:

In [4]:
course1 = "machine learning for everyone"

In [5]:
course2 = "machine learning for beginners"

In [6]:
# Next we can quickly tokenize them using the split() method (or using `word_tokenize()` method provided in `nltk` as we did in the previous lab).

tokens = set(course1.split() + course2.split())

In [7]:
tokens = list(tokens)
tokens

['beginners', 'learning', 'machine', 'for', 'everyone']
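The order of elements in a Python `set` is not guaranteed to be the same from one run to the next (string hashing is randomized per process by default), so the token order shown above may differ on your machine. A minimal sketch of one way to get a reproducible vocabulary is to sort it. The name `vocab` is introduced here only for illustration and is not used elsewhere in the lab.

```python
course1 = "machine learning for everyone"
course2 = "machine learning for beginners"

# Sorting the union of tokens fixes the position of each token in the vector
vocab = sorted(set(course1.split()) | set(course2.split()))
print(vocab)
# ['beginners', 'everyone', 'for', 'learning', 'machine']
```

With a fixed order, the same token always maps to the same vector dimension, which makes outputs comparable across runs.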

In [8]:
# Then generate BoW features (token counts) for these two courses (or use the `tokens_dict.doc2bow()` method provided in `gensim`, similar to what we did in the previous lab).

def generate_sparse_bow(course):
    """
    Generate a binary bag-of-words (BoW) vector for a given course.

    Parameters:
    course (str): The input course text to generate the BoW representation for.

    Returns:
    list: A BoW vector over the shared `tokens` vocabulary, where each element is 1 if the
    corresponding token is present in the course text and 0 otherwise.
    """

    # Initialize an empty list to store the BoW vector
    bow_vector = []

    # Tokenize the course text by splitting it into words
    words = course.split()

    # Iterate through the shared vocabulary (the global `tokens` list built above),
    # not just this course's own words, so both vectors have the same dimensions
    for token in tokens:
        # Check if the token is present in the course text
        if token in words:
            # If the token is present, append 1 to the BoW vector
            bow_vector.append(1)
        else:
            # If the token is not present, append 0 to the BoW vector
            bow_vector.append(0)

    # Return the BoW vector
    return bow_vector


In [9]:
bow1 = generate_sparse_bow(course1)
bow1

[0, 1, 1, 1, 1]

In [10]:
bow2 = generate_sparse_bow(course2)
bow2

[1, 1, 1, 1, 0]

From the above cell outputs, we can see the two vectors are very similar. Only two dimensions are different.

Now we can quickly apply the cosine similarity measurement on the two vectors:


In [11]:
cos_sim = 1 - cosine(bow1, bow2)

In [12]:
print(f"The cosine similarity between course `{course1}` and course `{course2}` is {round(cos_sim, 2) * 100}%")

The cosine similarity between course `machine learning for everyone` and course `machine learning for beginners` is 75.0%


_Practice: Try other similarity measurements such as Euclidean distance or the Jaccard index._

For example, the Euclidean distance between two points $p$ and $q$ can be summarized by this equation: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$. You can use the `euclidean(p, q)` function from the `scipy` package to calculate it.

In [13]:
# Try other similarity measurements such as Euclidean Distance or Jaccard index.

from scipy.spatial.distance import euclidean
from sklearn.metrics import jaccard_score

# Define the two courses
course1 = "machine learning for everyone"
course2 = "machine learning for beginners"

# Tokenize both courses
tokens = list(set(course1.split() + course2.split()))  # Combine both courses to get a common token list

# Function to generate BoW based on a common token list
def generate_bow(course, tokens):
    """
    Generate a Bag-of-Words (BoW) representation for a course based on a common token list.

    Parameters:
    course (str): The input course text.
    tokens (list): A list of unique tokens (vocabulary) for both courses.

    Returns:
    list: A BoW representation where each element is the count of the token in the course.
    """
    bow_vector = []
    words = course.split()

    # For each token in the combined token list, count its occurrences in the course text
    for token in tokens:
        bow_vector.append(words.count(token))

    return bow_vector

# Generate BoW vectors for both courses based on the combined token list
bow1 = generate_bow(course1, tokens)
bow2 = generate_bow(course2, tokens)

# 1. Cosine Similarity
cos_sim = 1 - cosine(bow1, bow2)
print(f"The cosine similarity between course `{course1}` and course `{course2}` is {round(cos_sim * 100, 2)}%")

# 2. Euclidean Distance
euclidean_dist = euclidean(bow1, bow2)
print(f"The Euclidean distance between course `{course1}` and course `{course2}` is {euclidean_dist:.2f}")

# 3. Jaccard Index (convert BoW into binary presence/absence vectors)
bow1_binary = [1 if count > 0 else 0 for count in bow1]
bow2_binary = [1 if count > 0 else 0 for count in bow2]
jaccard_sim = jaccard_score(bow1_binary, bow2_binary)
print(f"The Jaccard index between course `{course1}` and course `{course2}` is {round(jaccard_sim * 100, 2)}%")


The cosine similarity between course `machine learning for everyone` and course `machine learning for beginners` is 75.0%
The Euclidean distance between course `machine learning for everyone` and course `machine learning for beginners` is 1.41
The Jaccard index between course `machine learning for everyone` and course `machine learning for beginners` is 60.0%
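The three numbers above can be cross-checked directly from the definitions, without SciPy or scikit-learn. The following is a sketch using only the standard library. The names `vocab`, `u`, and `v` are illustrative and not part of the lab.

```python
import math

course1 = "machine learning for everyone"
course2 = "machine learning for beginners"

words1, words2 = course1.split(), course2.split()
vocab = sorted(set(words1) | set(words2))

# Count vectors over the shared vocabulary
u = [words1.count(t) for t in vocab]
v = [words2.count(t) for t in vocab]

# Cosine similarity: dot product divided by the product of the norms
dot = sum(a * b for a, b in zip(u, v))
cos_sim = dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Euclidean distance: square root of the summed squared differences
euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Jaccard index on the token sets: |intersection| / |union|
jaccard = len(set(words1) & set(words2)) / len(set(words1) | set(words2))

print(f"{cos_sim:.2f} {euclid:.2f} {jaccard:.2f}")  # 0.75 1.41 0.60
```

The two courses share three of five distinct tokens, which gives 3 / (2 × 2) = 0.75 for cosine similarity, √2 ≈ 1.41 for Euclidean distance, and 3 / 5 = 0.60 for the Jaccard index.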


### TASK: Find courses similar to the course `Machine Learning with Python`
Now we have learned how to calculate cosine similarity between two sample BoW feature vectors. Let's work on some real course BoW feature vectors.


In [14]:
# Load the BoW features as Pandas dataframe
bows_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/courses_bows.csv"
bows_df = pd.read_csv(bows_url)
bows_df = bows_df[['doc_id', 'token', 'bow']]

In [15]:
bows_df.head(10)

Unnamed: 0,doc_id,token,bow
0,ML0201EN,ai,2
1,ML0201EN,apps,2
2,ML0201EN,build,2
3,ML0201EN,cloud,1
4,ML0201EN,coming,1
5,ML0201EN,create,1
6,ML0201EN,data,1
7,ML0201EN,developer,1
8,ML0201EN,found,1
9,ML0201EN,fun,1


The `bows_df` dataframe contains the BoW feature vectors for each course in a vertical and compact format: there is one row per course-token pair, and only tokens that actually occur in a course are listed. It has three columns: `doc_id` is the course ID, `token` is the token value, and `bow` is the BoW value (token count).

Then, let's load another course content dataset which contains the course title and description:

In [16]:
# Load the course dataframe
course_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_processed.csv"
course_df = pd.read_csv(course_url)

In [17]:
course_df.head(10)

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
0,ML0201EN,robots are coming build iot apps with watson ...,have fun with iot and learn along the way if ...
1,ML0122EN,accelerating deep learning with gpu,training complex deep learning models with lar...
2,GPXX0ZG0EN,consuming restful services using the reactive ...,learn how to use a reactive jax rs client to a...
3,RP0105EN,analyzing big data in r using apache spark,apache spark is a popular cluster computing fr...
4,GPXX0Z2PEN,containerizing packaging and running a sprin...,learn how to containerize package and run a ...
5,CNSC02EN,cloud native security conference data security,introduction to data security on cloud
6,DX0106EN,data science bootcamp with r for university pr...,a multi day intensive in person data science ...
7,GPXX0FTCEN,learn how to use docker containers for iterati...,learn how to use docker containers for iterati...
8,RAVSCTEST1,scorm test 1,scron test course
9,GPXX06RFEN,create your first mongodb database,in this guided project you will get started w...


In [18]:
# Given course ID `ML0101ENv3`, let's find out its title and description:

course_df[course_df['COURSE_ID'] == 'ML0101ENv3']

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
158,ML0101ENv3,machine learning with python,machine learning can be an incredibly benefici...


We can see it is a machine learning course that uses Python, so we can expect courses related to machine learning or Python to be similar to it.


In [20]:
# Then, let's print its associated BoW features:

ml_course = bows_df[bows_df['doc_id'] == 'ML0101ENv3']
ml_course

Unnamed: 0,doc_id,token,bow
2747,ML0101ENv3,course,1
2748,ML0101ENv3,learning,4
2749,ML0101ENv3,machine,3
2750,ML0101ENv3,need,1
2751,ML0101ENv3,get,1
2752,ML0101ENv3,started,1
2753,ML0101ENv3,python,2
2754,ML0101ENv3,tool,1
2755,ML0101ENv3,tools,1
2756,ML0101ENv3,predict,1


We can see the BoW feature vector is in vertical format, whereas feature vectors are normally in horizontal format. One way to transpose the feature vector from vertical to horizontal is to use the Pandas `pivot()` method:


In [21]:
ml_courseT = ml_course.pivot(index=['doc_id'], columns='token').reset_index(level=[0])
ml_courseT

Unnamed: 0_level_0,doc_id,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow,bow
token,Unnamed: 1_level_1,beneficial,course,free,future,get,give,hidden,insights,learning,machine,need,predict,python,started,supervised,tool,tools,trends,unsupervised
0,ML0101ENv3,1,1,1,1,1,1,1,1,4,3,1,1,2,1,1,1,1,1,1
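To see what `pivot()` does on a smaller scale, here is a sketch with a hypothetical three-row frame in the same `doc_id`/`token`/`bow` layout. The course ID `TOY01` and its counts are made up for illustration. Passing `values='bow'` keeps the column index flat, which avoids the two-level header seen above.

```python
import pandas as pd

# Hypothetical long-format BoW rows for a single made-up course
toy = pd.DataFrame({
    "doc_id": ["TOY01", "TOY01", "TOY01"],
    "token": ["python", "learning", "machine"],
    "bow": [2, 4, 3],
})

# One row per doc_id, one column per token, cell values taken from 'bow'
toy_wide = toy.pivot(index="doc_id", columns="token", values="bow")
print(toy_wide)
```

The result has a single row for `TOY01` and one column per token, so each course becomes one horizontal feature vector.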


In [22]:
# To compare the BoWs of any two courses, which normally have different sets of tokens, we need to create the union of their token sets and then pivot them. A helper function called `pivot_two_bows` is provided as follows:

def pivot_two_bows(basedoc, comparedoc):
    """
    Pivot two bag-of-words (BoW) representations for comparison.

    Parameters:
    basedoc (DataFrame): DataFrame containing the bag-of-words representation for the base document.
    comparedoc (DataFrame): DataFrame containing the bag-of-words representation for the document to compare.

    Returns:
    DataFrame: A DataFrame with pivoted BoW representations for the base and compared documents,
    facilitating direct comparison of word occurrences between the two documents.
    """

    # Create copies of the input DataFrames to avoid modifying the originals
    base = basedoc.copy()
    base['type'] = 'base'  # Add a 'type' column indicating base document
    compare = comparedoc.copy()
    compare['type'] = 'compare'  # Add a 'type' column indicating compared document

    # Concatenate the two DataFrames vertically
    join = pd.concat([base, compare])

    # Pivot the concatenated DataFrame based on 'doc_id' and 'type', with words as columns
    joinT = join.pivot(index=['doc_id', 'type'], columns='token').fillna(0).reset_index(level=[0, 1])

    # Assign meaningful column names to the pivoted DataFrame
    joinT.columns = ['doc_id', 'type'] + [t[1] for t in joinT.columns][2:]

    # Return the pivoted DataFrame for comparison
    return joinT


In [23]:
course1 = bows_df[bows_df['doc_id'] == 'ML0151EN']
course2 = bows_df[bows_df['doc_id'] == 'ML0101ENv3']

In [24]:
bow_vectors = pivot_two_bows(course1, course2)
bow_vectors

Unnamed: 0,doc_id,type,approachable,basics,beneficial,comparison,course,dives,free,future,...,relates,started,statistical,supervised,tool,tools,trends,unsupervised,using,vs
0,ML0101ENv3,compare,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
1,ML0151EN,base,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0


In [25]:
# Similarly, we can use the cosine method to calculate their similarity:

similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])
similarity

0.6626221399549089
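Because a token that is missing from either course contributes zero to the dot product, the same similarity can also be computed without pivoting, by joining the two long-format frames on `token`. The sketch below illustrates this on two small made-up courses. The IDs `A` and `B` and their counts are hypothetical, `cosine_from_long` is an illustrative helper and not part of the lab, and it assumes each token appears at most once per course, as in `bows_df`.

```python
import numpy as np
import pandas as pd

def cosine_from_long(base, compare):
    """Cosine similarity of two long-format BoW frames with 'token' and 'bow' columns."""
    # Only tokens present in both courses contribute to the dot product
    shared = base.merge(compare, on="token", suffixes=("_base", "_compare"))
    dot = (shared["bow_base"] * shared["bow_compare"]).sum()
    # Each norm uses all tokens of its own course
    norm_base = np.sqrt((base["bow"] ** 2).sum())
    norm_compare = np.sqrt((compare["bow"] ** 2).sum())
    return dot / (norm_base * norm_compare)

# Hypothetical toy courses
a = pd.DataFrame({"doc_id": "A", "token": ["machine", "learning", "python"], "bow": [1, 2, 2]})
b = pd.DataFrame({"doc_id": "B", "token": ["machine", "learning", "r"], "bow": [2, 1, 2]})

print(round(cosine_from_long(a, b), 4))
```

For the toy data, the shared tokens give a dot product of 1×2 + 2×1 = 4 and both norms equal 3, so the similarity is 4/9.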

In [26]:
# finding all courses similar to the course `Machine Learning with Python`:
course_df[course_df['COURSE_ID'] == 'ML0101ENv3']

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
158,ML0101ENv3,machine learning with python,machine learning can be an incredibly benefici...


Set a similarity threshold, such as 0.5, to determine whether two courses are similar enough.
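One possible way to apply such a threshold to every course at once is to pivot all courses into a single document-term matrix and compare the target row against all rows with one matrix product. The sketch below uses a made-up three-course frame. The name `toy_bows`, the IDs `A`, `B`, `C`, and the counts are all hypothetical. With the lab data, `bows_df` would take the place of `toy_bows`.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format BoW data for three made-up courses
toy_bows = pd.DataFrame({
    "doc_id": ["A", "A", "B", "B", "C"],
    "token": ["machine", "learning", "machine", "python", "cooking"],
    "bow": [2, 1, 2, 1, 1],
})

# Document-term matrix: one row per course, missing tokens filled with 0
dtm = toy_bows.pivot(index="doc_id", columns="token", values="bow").fillna(0)

# Scale each row to unit length so that dot products equal cosine similarities
unit = dtm.div(np.linalg.norm(dtm.values, axis=1), axis=0)

# Similarity of the target course to every other course
target = "A"
sims = unit.dot(unit.loc[target]).drop(target)

threshold = 0.5
print(sims[sims > threshold])
```

In the toy data only course `B` shares a token with `A`, so it is the only one that passes the threshold. A full matrix can use a lot of memory when the vocabulary is large, so this is a trade-off rather than a drop-in replacement for a pairwise loop.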


In [27]:
# Find courses which are similar to course Machine Learning with Python (ML0101ENv3), you also need to show the title and descriptions of those courses.

from scipy.spatial.distance import cosine

# Set similarity threshold
similarity_threshold = 0.5

# Define the target course
target_course_id = 'ML0101ENv3'

# Get BoW features of the target course
target_bow = bows_df[bows_df['doc_id'] == target_course_id]

# Store similarities with course titles and descriptions
similar_courses = []

# Loop through all unique course ids in the bows_df
for course_id in bows_df['doc_id'].unique():
    if course_id != target_course_id:
        # Get BoW features for the current course
        compared_bow = bows_df[bows_df['doc_id'] == course_id]

        # Pivot the BoW vectors of the target course and the current course
        bow_vectors = pivot_two_bows(target_bow, compared_bow)

        # Calculate cosine similarity between the two courses
        similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])

        # If the similarity is above the threshold, retrieve the course details
        if similarity > similarity_threshold:
            course_details = course_df[course_df['COURSE_ID'] == course_id][['TITLE', 'DESCRIPTION']].values[0]
            similar_courses.append((course_id, similarity, course_details[0], course_details[1]))

# Sort similar courses by similarity in descending order
similar_courses = sorted(similar_courses, key=lambda x: x[1], reverse=True)

# Display the similar courses
for course_id, similarity, title, description in similar_courses:
    print(f"Course ID: {course_id}")
    print(f"Title: {title}")
    print(f"Description: {description}")
    print(f"Similarity: {similarity:.2f}")
    print("-" * 50)


Course ID: ML0151EN
Title: machine learning with r
Description: this machine learning with r course dives into the basics of machine learning using an approachable  and well known  programming language  you ll learn about supervised vs unsupervised learning  look into how statistical modeling relates to machine learning  and do a comparison of each 
Similarity: 0.66
--------------------------------------------------
Course ID: excourse47
Title: machine learning for all
Description: machine learning  often called artificial intelligence or ai  is one of the most exciting areas of technology at the moment  we see daily news stories that herald new breakthroughs in facial recognition technology  self driving cars or computers that can have a conversation just like a real person  machine learning technology is set to revolutionise almost any area of human life and work  and so will affect all our lives  and so you are likely to want to find out more about it  machine learning has a reputat