# Course Recommender System

*In case you do not have sentence_transformers installed in your environment, please execute the cell below to install it.*

In [1]:
#pip install -U sentence-transformers

Necessary imports for the notebook.

In [2]:
import pandas as pd
import numpy as np
import pickle
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from sklearn.metrics.pairwise import cosine_similarity
import json
import random

## 1. Data Loading

Load the data needed for this notebook:

- **df_courses**: information on the Udemy courses, extracted with the Udemy API (see the "Udemy_api" notebook).
    - *Title*: course name.
    - *Category*: course category.
    - *SubCategory*: course subcategory.
    - *Description*: course description. This is what will be used to pair courses with users.
    - *Rating*: course rating from Udemy users.
    - *Price*: course price.
    - *IsPaid*: boolean. `False` means the course is free; `True` means it is paid.
    - *Subscribers*: number of people subscribed to the course.
    - *Level*: difficulty of the course.
    - *URL*: link to the course Udemy page.

- **df_users**: an xlsx file with the questionnaire answers and ground truth for several users.
    - *id*: user id.
    - *user_answer*: the user's answers to the questionnaire.
    - *ground_truth_10*: the user's own ranking of their top 10 courses, chosen from the 20 most suitable courses.
    - *ground_truth_5*: the user's own ranking of their top 5 courses, chosen from the 20 most suitable courses.
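
The evaluation cells later parse the ranking columns with `json.loads`, which suggests that the spreadsheet stores each ranking as a bracketed string rather than as a list. A minimal sketch of that parsing step on a made-up value (the variable names are illustrative and not part of the notebook):

```python
import json

# Made-up cell value, in the same bracketed format as the ground_truth_5 column.
raw_ranking = "[15, 1967, 765, 564, 272]"

# json.loads turns the string into a list of integers (course row positions).
ranking = json.loads(raw_ranking)

print(type(ranking).__name__, ranking[0], len(ranking))
```

The first element is the course the user ranked highest.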

In [3]:
# Read the courses catalog
df_courses = pd.read_csv("courses_udemy.csv")
# Read the user information
df_users = pd.read_excel("user_info.xlsx")

Display of the *df_courses* dataset.

In [4]:
# Display of the df_courses
df_courses.head()

Unnamed: 0.1,Unnamed: 0,Title,Category,SubCategory,Description,Rating,Price,IsPaid,Subscribers,Level,URL
0,0,Web Design for Web Developers: Build Beautiful...,Development,Web Development,IMPORTANT NOTE: The material of this course is...,4.408938,Free,False,660246,All Levels,/course/web-design-secrets/
1,1,Introduction To Python Programming,Development,Programming Languages,\tDo you want to become a programmer? \t...,4.414743,Free,False,856821,Beginner,/course/pythonforbeginnersintro/
2,2,Java Tutorial for Complete Beginners,Development,Programming Languages,\t Learn to program in the Java progr...,4.477733,Free,False,1816465,All Levels,/course/java-tutorial/
3,3,Deep Learning Prerequisites: The Numpy Stack i...,Development,Data Science,"Welcome! This is Deep Learning, Machine Learni...",4.591549,Free,False,52116,All Levels,/course/numpy-python/
4,4,Javascript Essentials,Development,Web Development,Learn the Javascript essentials for web develo...,4.495833,Free,False,375148,Beginner,/course/javascript-essentials/


Display of the *df_users* dataset.

In [5]:
# Display of the df_users
df_users.head()

Unnamed: 0,id,user_answer,ground_truth_10,ground_truth_5
0,1,My favourite subject in school is Art. I belie...,"[15, 1967, 765, 564, 272, 71, 966, 1589, 1303,...","[15, 1967, 765, 564, 272]"
1,2,My favourite subject in school is Fine Arts.\n...,"[15, 1967, 272, 71, 1589, 1902, 966, 2513, 546...","[15, 1967, 272, 71, 1589]"
2,3,My favourite subject is Music.\nI use technolo...,"[1875, 2552, 2683, 2342, 958, 2032, 1912, 2155...","[1875, 2552, 2683, 2342, 958]"
3,4,My favorite subject in school right now is def...,"[1233, 2302, 130, 600, 1165, 2418, 2609, 1710,...","[1233, 2302, 130, 600, 1165]"
4,5,Economics is my favorite subject in school. Un...,"[1195, 1379, 166, 2114, 931, 1916, 1598, 1762,...","[1195, 1379, 166, 2114, 931]"


## 2. Course Description Embeddings

To process and compare the Udemy course descriptions, we use a pretrained model hosted on Hugging Face. Specifically, we use the `SentenceTransformer()` class from the *sentence_transformers* library to embed the course descriptions with the **all-MiniLM-L6-v2** model ([more information about the model](https://arxiv.org/abs/2002.10957)).

In an attempt to improve the recommendation system further, we explored summarizing the descriptions before embedding them. The motivation was that the **all-MiniLM-L6-v2** sentence transformer is trained on shorter texts (according to its model card, it truncates inputs longer than 256 word pieces by default), while some course descriptions are lengthy. We hypothesized that a summary would give the model a more concise input to encode and so improve the recommendations.

However, after thorough experimentation, we discovered that the summarization process significantly decreased the quality of our recommendations. Consequently, we decided to discard this approach in favor of maintaining the original course descriptions. Nonetheless, we have retained the code in this notebook for documentation purposes.

------

### *Discarded code*

Summarize the course descriptions, but only those with more than 400 whitespace-separated words. At the end, the courses DataFrame is saved with the summaries to a csv file, so that the summarization does not have to be repeated every time the notebook runs.

In [6]:
# Create the summarizer
#summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# Creation of summ_Description
#df_courses["sum_Description"] = ""
# Iterate over the rows of the DataFrame
#for index, row in df_courses.iterrows():
    # Select the description of the course
    #description = str(row["Description"])
    # Check if the description is too long in order to summarize it
    #if len(description.split()) > 400:
        #try: 
            # Set the value to sum_Description
            #df_courses.at[index, "sum_Description"] = summarizer(description, max_length=300, min_length=30, do_sample=False)[0]["summary_text"]
        #except Exception as e:
            # If the description is still too long for the summarizer, keep the original and continue
            #print("The course with index ", index, " exceeds the maximum number of tokens permitted by the summarizer. The original description will be kept.")
            #df_courses.at[index, "sum_Description"] = row["Description"]
    #else:
        # Pass the original description if it is not big enough
        #df_courses.at[index, "sum_Description"] = row["Description"]
# Save the DataFrame with the summaries to a new csv
#df_courses.to_csv("courses_udemy_sum.csv")
# Display the first rows
#df_courses.head()

-----

Create the course embeddings.

In [7]:
# Read the courses catalog description
df_courses_sum = pd.read_csv("courses_udemy.csv")
# Course description list
courses_desc_sum = df_courses_sum["Description"]
# Vectorize the courses description
# Model instance
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Encode the course description
course_embeddings = model.encode(np.array(courses_desc_sum))

Save the course embeddings in a pickle file.

In [8]:
# Open a file in write binary mode
with open('course_embeddings_all_mini.pkl', 'wb') as file:
    # Use pickle to dump the list into the file
    pickle.dump(course_embeddings, file)

We save the course embeddings in a pickle file because this step only needs to be done once. This optimizes the run time of our front-end app.
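
The save-once, load-many pattern can be sketched with a small stand-in array. The toy array, the temporary directory and the file name below are assumptions for illustration; the notebook itself writes `course_embeddings_all_mini.pkl` to the working directory.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy embeddings standing in for the encoded course descriptions:
# 3 "courses", each a 4-dimensional vector.
toy_embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.pkl")

    # Write once (the expensive encoding step would happen before this).
    with open(path, "wb") as file:
        pickle.dump(toy_embeddings, file)

    # Read back, as an app would do at startup.
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored array should match the original in shape, dtype and values.
print(restored.shape, restored.dtype, np.array_equal(toy_embeddings, restored))
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code. For a plain NumPy array, `np.save` and `np.load` are an alternative.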

## 3. Users Course Predictions

Read the pickle file to extract the course embeddings. *This is how it will be done in the front-end app*.

In [9]:
# Course embeddings variable
course_embeddings = []
# Open the file in read binary mode
with open('course_embeddings_all_mini.pkl', 'rb') as file:
    # Use pickle to load the list of course embeddings from the file
    course_embeddings = pickle.load(file)

In order to calculate the predicted courses per user, we are going to wrap the code in a function called `course_pred()`. This function follows these steps:

1. Vectorizes the user answers with the same sentence transformer model used for the course descriptions.
2. Calculates the *cosine_similarity* between each course embedding and the user-answer embedding.
3. Builds a dictionary that maps each course to its *cosine_similarity* with the user.
4. Selects the 20 most relevant courses for that user (those with the highest *cosine_similarity*).
5. Returns the top N of those courses.
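
Steps 2 to 5 can also be sketched without the per-course loop. The version below is an illustration on toy 2-D vectors, not the notebook's implementation: it computes cosine similarity directly with NumPy instead of scikit-learn's `cosine_similarity`, it takes an already-encoded user vector, and the name `top_k_by_cosine` is hypothetical. It assumes no embedding is all zeros.

```python
import numpy as np

def top_k_by_cosine(user_emb, course_embs, k):
    """Return the indices of the k rows of course_embs most similar to user_emb."""
    # Normalise to unit length so that the dot product equals the cosine similarity.
    # Assumes no vector has zero norm.
    user_unit = user_emb / np.linalg.norm(user_emb)
    course_unit = course_embs / np.linalg.norm(course_embs, axis=1, keepdims=True)
    similarities = course_unit @ user_unit
    # Sort by descending similarity; the stable sort keeps the lower index first on ties.
    order = np.argsort(-similarities, kind="stable")
    return order[:k].tolist()

# Toy "embeddings": course 1 points the same way as the user, course 2 the opposite way.
toy_courses = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0], [0.0, 3.0]])
toy_user = np.array([1.0, 0.0])

print(top_k_by_cosine(toy_user, toy_courses, 2))
```

Computing all similarities in one matrix product avoids calling the similarity function once per course, which matters as the catalog grows.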

In [10]:
def course_pred(user_ans, courses, top):
    # Model instance
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    # Encode the user answer
    user_ans_emb = model.encode(user_ans)
    # Dic with cosine_similarity per course
    array_cos_sim = {}
    
    # Loop in courses
    for i in range(len(courses)):
        # Pair of vectors to compare
        comp_pair = [user_ans_emb, courses[i]]
        # Calculation of the cosine_similarity (the result is a 1x1 array, so take the scalar)
        array_cos_sim[i] = cosine_similarity(comp_pair[0].reshape(1, -1), comp_pair[1].reshape(1, -1))[0, 0]

    # Sort the dictionary
    sorted_dict = dict(sorted(array_cos_sim.items(), key=lambda item: item[1], reverse=True))
    # Select the top 20 courses
    top_20_items = dict(list(sorted_dict.items())[:20])
    # Extract the keys from the top 20 dictionary
    top_20_keys = list(top_20_items.keys())
    # Select only the number of predictions you want to return
    sel_keys = top_20_keys[0:top]
    # Return the selected ids courses
    return sel_keys

Furthermore, we introduce a variation of the `course_pred()` function to serve as our benchmark model. This function, called `course_pred_rand()`, shuffles the 20 most relevant courses before selecting the top N recommendations. Randomizing the order gives a baseline against which to compare the ranking produced by our recommendation system.
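
The notebook's baseline uses the global `random.shuffle` without a seed, so its scores change from run to run. As a sketch of how the shuffling step could be made reproducible, the hypothetical helper below (`shuffled_baseline` and its `seed` parameter are not part of the notebook) shuffles a copy of the candidate list with a local, seeded generator:

```python
import random

def shuffled_baseline(top_candidates, top, seed=None):
    """Shuffle a copy of the candidate list and return its first `top` items."""
    rng = random.Random(seed)        # local generator; leaves the global random state alone
    shuffled = list(top_candidates)  # copy, so the caller's list keeps its ranking order
    rng.shuffle(shuffled)
    return shuffled[:top]

toy_top_20 = list(range(100, 120))   # made-up course indices, already ranked
first = shuffled_baseline(toy_top_20, 5, seed=0)
second = shuffled_baseline(toy_top_20, 5, seed=0)

# The same seed gives the same selection on every run.
print(first == second)
```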

In [11]:
def course_pred_rand(user_ans, courses, top):
    # Model instance
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    # Encode the user answer
    user_ans_emb = model.encode(user_ans)
    # Dic with cosine_similarity per course
    array_cos_sim = {}
    
    # Loop in courses
    for i in range(len(courses)):
        # Pair of vectors to compare
        comp_pair = [user_ans_emb, courses[i]]
        # Calculation of the cosine_similarity (the result is a 1x1 array, so take the scalar)
        array_cos_sim[i] = cosine_similarity(comp_pair[0].reshape(1, -1), comp_pair[1].reshape(1, -1))[0, 0]

    # Sort the dictionary
    sorted_dict = dict(sorted(array_cos_sim.items(), key=lambda item: item[1], reverse=True))
    # Select the top 20 most relevant courses
    top_20_items = dict(list(sorted_dict.items())[:20])
    # Extract the keys from the top 20 dictionary
    top_20_keys = list(top_20_items.keys())
    # Shuffle the top_20_keys
    random.shuffle(top_20_keys)
    # Select only the number of predictions you want to return
    sel_keys = top_20_keys[0:top]
    # Return the selected ids courses
    return sel_keys

Use the `course_pred()` and `course_pred_rand()` functions to calculate the recommended courses for every user in *df_users*, and save them in new columns.

Top 10 ranked courses.

In [12]:
# Creation of the top 10 courses predictions column
df_users["predictions_10"] = ""
df_users["predictions_rand_10"] = ""
# Iterate over the rows of the DataFrame
for index, row in df_users.iterrows():
    # Calculate the predictions courses: cosine_similarity and random
    predictions = course_pred(row["user_answer"], course_embeddings, 10)
    predictions_rand = course_pred_rand(row["user_answer"], course_embeddings, 10)
    # Set the value to predictions
    df_users.at[index, "predictions_10"] = predictions
    df_users.at[index, "predictions_rand_10"] = predictions_rand

Top 5 ranked courses.

In [13]:
# Creation of the top 5 courses predictions column
df_users["predictions_5"] = ""
df_users["predictions_rand_5"] = ""
# Iterate over the rows of the DataFrame
for index, row in df_users.iterrows():
    # Calculate the predictions courses
    predictions = course_pred(row["user_answer"], course_embeddings, 5)
    predictions_rand = course_pred_rand(row["user_answer"], course_embeddings, 5)    
    # Set the value to predictions
    df_users.at[index, "predictions_5"] = predictions
    df_users.at[index, "predictions_rand_5"] = predictions_rand

Display the *df_users* DataFrame with the prediction columns.

In [14]:
# Display the df_users with the predictions
df_users.head()

Unnamed: 0,id,user_answer,ground_truth_10,ground_truth_5,predictions_10,predictions_rand_10,predictions_5,predictions_rand_5
0,1,My favourite subject in school is Art. I belie...,"[15, 1967, 765, 564, 272, 71, 966, 1589, 1303,...","[15, 1967, 765, 564, 272]","[15, 71, 1967, 966, 1237, 1650, 130, 272, 765,...","[1145, 1337, 1237, 1902, 765, 1594, 1632, 71, ...","[15, 71, 1967, 966, 1237]","[1612, 1902, 1632, 1967, 1589]"
1,2,My favourite subject in school is Fine Arts.\n...,"[15, 1967, 272, 71, 1589, 1902, 966, 2513, 546...","[15, 1967, 272, 71, 1589]","[1841, 15, 1902, 1967, 71, 1650, 1589, 1237, 1...","[1524, 140, 1585, 1902, 1841, 1237, 1650, 531,...","[1841, 15, 1902, 1967, 71]","[531, 2513, 546, 2190, 1145]"
2,3,My favourite subject is Music.\nI use technolo...,"[1875, 2552, 2683, 2342, 958, 2032, 1912, 2155...","[1875, 2552, 2683, 2342, 958]","[1199, 1875, 985, 634, 2552, 2067, 1301, 1912,...","[1306, 2342, 2610, 2155, 2552, 2032, 2683, 191...","[1199, 1875, 985, 634, 2552]","[985, 1306, 2552, 2610, 2234]"
3,4,My favorite subject in school right now is def...,"[1233, 2302, 130, 600, 1165, 2418, 2609, 1710,...","[1233, 2302, 130, 600, 1165]","[2302, 130, 2609, 596, 600, 1233, 2360, 1165, ...","[2365, 1165, 41, 2051, 1557, 2302, 596, 1420, ...","[2302, 130, 2609, 596, 600]","[1233, 600, 130, 1762, 2365]"
4,5,Economics is my favorite subject in school. Un...,"[1195, 1379, 166, 2114, 931, 1916, 1598, 1762,...","[1195, 1379, 166, 2114, 931]","[41, 166, 1195, 931, 2682, 1598, 1916, 1762, 1...","[166, 275, 1669, 1629, 1379, 1916, 1762, 648, ...","[41, 166, 1195, 931, 2682]","[41, 1255, 1629, 275, 2114]"


## 4. Model Evaluation

To evaluate the performance of our recommendation system, we will employ the NDCG (normalized discounted cumulative gain) metric. NDCG is a widely used measure to assess the effectiveness of ranking systems, considering the positions of relevant items within the ranked list. Its calculation takes into account the relevance of each item and the position at which it appears in the recommendation list (see more [here](https://medium.com/@readsumant/understanding-ndcg-as-a-metric-for-your-recomendation-system-5cd012fb3397#:~:text=Normalized%20Discounted%20Cumulative%20Gain%20or,relevant%20products%20are%20ranked%20higher.)).

In our evaluation, we will specifically focus on the NDCG scores generated by our model for the top 5 and top 10 recommended courses. By examining these scores, we can gauge the accuracy and effectiveness of our model in suggesting relevant and valuable courses to users.
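
A compact statement of the metric, written to match the `calculate_dcg()` function used in this notebook (linear gain, base-2 logarithmic discount). Here $rel_i$ is the relevance of the item at position $i$ of the list being scored, $k$ is the list length (5 or 10), and $\mathrm{IDCG@}k$ is the DCG of the ground-truth ordering:

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$

Some references use an exponential gain, $2^{rel_i} - 1$, in the numerator; the linear form above is the one this notebook computes.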

To make it easier, we have created the function `calculate_dcg()` that calculates the DCG given a list of scores.

In [15]:
# Function to calculate DCG given a list of relevance scores
def calculate_dcg(scores):
    # Create the DCG variable
    dcg = 0
    # Calculate the DCG
    for i, score in enumerate(scores, start=1):
        dcg += score / np.log2(i + 1)
    # Return the DCG
    return dcg

Using this function, we score our model on the top 10 and top 5 predictions. We also score the random recommender, since it is the benchmark we compare against.
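
A sketch of how these scores are obtained for a single user, on made-up data. It assumes the relevance scheme used in the cells that follow: the course ranked first in the ground truth gets relevance k, the next k-1, and so on, and courses outside the ground truth get 0. The helper name `ndcg_for_user` is hypothetical (the notebook inlines this logic), and `calculate_dcg` is restated with `math.log2` so that the example runs without NumPy.

```python
import math

def calculate_dcg(scores):
    """Discounted cumulative gain of a list of relevance scores (position 1 first)."""
    return sum(score / math.log2(i + 1) for i, score in enumerate(scores, start=1))

def ndcg_for_user(ground_truth, predictions):
    """NDCG of `predictions` against a ranked `ground_truth` list (hypothetical helper)."""
    k = len(predictions)
    # Relevance k for the top ground-truth course, k-1 for the next, and so on.
    relevance = {}
    for course in ground_truth:
        if course not in relevance:
            relevance[course] = k - len(relevance)
    dcg = calculate_dcg([relevance.get(course, 0) for course in predictions])
    idcg = calculate_dcg([relevance.get(course, 0) for course in ground_truth])
    return dcg / idcg if idcg != 0 else 0

# Toy user: the prediction swaps the two best courses and misses the third.
toy_truth = [15, 1967, 765]
toy_predictions = [1967, 15, 999]

print(ndcg_for_user(toy_truth, toy_predictions))
print(ndcg_for_user(toy_truth, toy_truth))      # a perfect ranking scores 1.0
print(ndcg_for_user(toy_truth, [1, 2, 3]))      # no relevant course scores 0.0
```

A helper of this shape could replace the two near-identical evaluation cells, since they differ only in the column names they read and write.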

NDCGs for the top 10 courses.

In [16]:
# New column for the NDCG of each user
df_users["pred_ndcg_10"] = ""
df_users["random_ndcg_10"] = ""

for index, row in df_users.iterrows():
    # Select the data for this user
    ground_truth_str = row['ground_truth_10']
    ground_truth = json.loads(ground_truth_str) # Necessary to convert ground_truth_10 from a string to a list
    predictions = row['predictions_10']
    predictions_rand = row['predictions_rand_10']
    # Create the relevance scores in order to calculate the DCG
    #Initialize the dictionary and score
    relevance_scores = {}
    score = len(predictions)
    # Iterate over the ground_truth values to assign the relevance score
    for value in ground_truth:
        if value not in relevance_scores:
            relevance_scores[value] = score
            score -= 1
    # Score the ground truth and the predicted courses
    ground_truth_scores = [relevance_scores.get(item, 0) for item in ground_truth]
    prediction_scores = [relevance_scores.get(item, 0) for item in predictions]
    prediction_rand_scores = [relevance_scores.get(item, 0) for item in predictions_rand]
    # Calculate the DCG and IDCG
    dcg = calculate_dcg(prediction_scores)
    dcg_rand = calculate_dcg(prediction_rand_scores)
    idcg = calculate_dcg(ground_truth_scores)
    # Calculate the NDCG
    ndcg = dcg / idcg if idcg != 0 else 0
    # Calculate the NDCG for the random
    ndcg_rand= dcg_rand / idcg if idcg != 0 else 0
    # Pass the value to the new column for the NDCG
    df_users.at[index, "pred_ndcg_10"] = ndcg
    df_users.at[index, "random_ndcg_10"] = ndcg_rand

NDCG for the top 5 courses.

In [17]:
# New column for the NDCG of each user
df_users["pred_ndcg_5"] = ""
df_users["random_ndcg_5"] = ""

for index, row in df_users.iterrows():
    # Select the data for this user
    ground_truth_str = row['ground_truth_5']
    ground_truth = json.loads(ground_truth_str) # Necessary to convert ground_truth_5 from a string to a list
    predictions = row['predictions_5']
    predictions_rand = row['predictions_rand_5']
    # Create the relevance scores in order to calculate the DCG
    #Initialize the dictionary and score
    relevance_scores = {}
    score = len(predictions)
    # Iterate over the ground_truth values to assign the relevance score
    for value in ground_truth:
        if value not in relevance_scores:
            relevance_scores[value] = score
            score -= 1
    # Score the ground truth and the predicted courses
    ground_truth_scores = [relevance_scores.get(item, 0) for item in ground_truth]
    prediction_scores = [relevance_scores.get(item, 0) for item in predictions]
    prediction_rand_scores = [relevance_scores.get(item, 0) for item in predictions_rand]    
    # Calculate the DCG and IDCG
    dcg = calculate_dcg(prediction_scores)
    dcg_rand = calculate_dcg(prediction_rand_scores)
    idcg = calculate_dcg(ground_truth_scores)    
    # Calculate the NDCG
    ndcg = dcg / idcg if idcg != 0 else 0
    # Calculate the NDCG for the random
    ndcg_rand= dcg_rand / idcg if idcg != 0 else 0
    # Pass the value to the new column for the NDCG
    df_users.at[index, "pred_ndcg_5"] = ndcg
    df_users.at[index, "random_ndcg_5"] = ndcg_rand

Display of the scores for each user.

In [18]:
# Display the scores
df_scores = df_users.loc[:, ["id", "pred_ndcg_10", "random_ndcg_10", "pred_ndcg_5", "random_ndcg_5"]]
df_scores

Unnamed: 0,id,pred_ndcg_10,random_ndcg_10,pred_ndcg_5,random_ndcg_5
0,1,0.790172,0.170286,0.681469,0.16771
1,2,0.580434,0.209306,0.550146,0.0
2,3,0.489273,0.497221,0.457758,0.194705
3,4,0.838059,0.28807,0.649002,0.755638
4,5,0.793862,0.475961,0.469578,0.075322
5,6,0.443897,0.347213,0.307114,0.317551
6,7,0.798135,0.480225,0.742892,0.083855
7,8,0.396569,0.415256,0.394027,0.304805
8,9,0.572692,0.529272,0.486764,0.223135
9,10,0.4547,0.323269,0.171522,0.245691


Average model scores.

In [19]:
df_scores.loc[:,["pred_ndcg_10", "random_ndcg_10", "pred_ndcg_5", "random_ndcg_5"]].mean()

pred_ndcg_10      0.629343
random_ndcg_10    0.391898
pred_ndcg_5       0.510534
random_ndcg_5     0.246666
dtype: float64

Looking at the results, there are three major insights:

- For course recommendations, the top 5 score is the more relevant one, because students generally focus on a curriculum of 5 classes rather than 10. With this in mind, maximizing the NDCG for the top 5 should be a priority.

- On average, the model performs better than the random recommender at both cutoffs, and the gap is proportionally larger for the top 5: 0.51 against 0.25, compared with 0.63 against 0.39 for the top 10. The random recommender sometimes has an NDCG score of 0 for the top 5, which means that none of the courses it returned are in that user's ground truth.

- Because both average scores are well above the benchmark, the model appears to capture both the general preference of the individual (NDCG 10) and a specific curriculum (NDCG 5). The advantage does not hold for every user, though: in the per-user table the benchmark scores higher in a few cases, such as user 4 for the top 5.
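
As a rough check on the per-user picture, the sketch below copies the per-user scores from the table above and counts the users for whom the model beats the shuffled benchmark at each cutoff. The variable names are illustrative.

```python
# (pred_ndcg_10, random_ndcg_10, pred_ndcg_5, random_ndcg_5) per user id,
# copied from the per-user scores table.
scores = {
    1: (0.790172, 0.170286, 0.681469, 0.16771),
    2: (0.580434, 0.209306, 0.550146, 0.0),
    3: (0.489273, 0.497221, 0.457758, 0.194705),
    4: (0.838059, 0.28807, 0.649002, 0.755638),
    5: (0.793862, 0.475961, 0.469578, 0.075322),
    6: (0.443897, 0.347213, 0.307114, 0.317551),
    7: (0.798135, 0.480225, 0.742892, 0.083855),
    8: (0.396569, 0.415256, 0.394027, 0.304805),
    9: (0.572692, 0.529272, 0.486764, 0.223135),
    10: (0.4547, 0.323269, 0.171522, 0.245691),
}

# Count the users for whom the model's NDCG is strictly higher than the benchmark's.
wins_10 = sum(1 for p10, r10, _, _ in scores.values() if p10 > r10)
wins_5 = sum(1 for _, _, p5, r5 in scores.values() if p5 > r5)

print(f"top 10: model ahead for {wins_10} of {len(scores)} users")
print(f"top 5:  model ahead for {wins_5} of {len(scores)} users")
```

On these numbers the model is ahead for 8 of 10 users at the top 10 and for 7 of 10 at the top 5.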