# Course Recommender

Design and implement a course recommender that suggests relevant courses based on a user's text input, drawing on a dataset of more than 4,000 courses extracted from the Coursera catalog.

## Import Packages

In [23]:
import os
import json
import re
import ast

import pandas as pd
import numpy as np
import openai

from IPython.display import display

In [7]:
# Set up OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")

## Load Data

The dataset is sourced from [Kaggle](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021).

In [8]:
path = 'data/Coursera.csv'
data = pd.read_csv(path)

In [9]:
data.head()

Unnamed: 0,course_name,university,difficulty_level,course_rating,course_url,description,skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...


In [10]:
# Check for missing values
print(data.isnull().sum())

course_name         0
university          0
difficulty_level    0
course_rating       0
course_url          0
description         0
skills              0
dtype: int64


## Generate Course Embeddings

The `create_embeddings` function below creates an embedding for a given text using OpenAI's API. An embedding is a numerical vector representation of text, and embeddings are widely used in natural language processing tasks. The function calls `openai.Embedding.create`, the interface provided by versions of the `openai` package before 1.0. Read more about [OpenAI's embedding models](https://platform.openai.com/docs/guides/embeddings).

In [11]:
def create_embeddings(text, model_engine="text-embedding-ada-002"):
    """
    Create embeddings using OpenAI for a given text.
    
    Args:
        - text (string): The text for which you want to generate the embedding.
        - model_engine (string, default: "text-embedding-ada-002"): The specific model to use for generating embeddings.
        
    Returns:
        - A list of numerical values representing the embedding of the input text.
    """
    try:
        response = openai.Embedding.create(
            input=text,
            model=model_engine,
        )["data"][0]["embedding"]
        return response
    except Exception as e:
        print(e)
        return None

In [12]:
# Get embeddings for all course names. This line might take up to 10 mins to complete.
data['embedding'] = data['course_name'].apply(create_embeddings)
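
The `apply` call above sends one API request per course, which is why it is slow. The embeddings endpoint also accepts a list of inputs in a single request, so batching can reduce the number of calls. The following is a minimal sketch of that idea, not the notebook's method: `embed_in_batches` and the `embed_batch` callable are hypothetical names, and `fake_embed_batch` stands in for a real API call so the example runs offline.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed texts in fixed-size batches, preserving input order.

    embed_batch is a hypothetical callable that takes a list of strings and
    returns one embedding per string, in the same order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("embed_batch returned the wrong number of embeddings")
        embeddings.extend(result)
    return embeddings


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    # Stand-in for a real API call: a 2-dimensional "embedding" per text
    # (character count and space count). Not a meaningful embedding.
    return [[float(len(text)), float(text.count(" "))] for text in batch]


names = ["Finance for Managers", "Silicon Thin Film Solar Cells", "Machine Learning"]
vectors = embed_in_batches(names, fake_embed_batch, batch_size=2)
print(len(vectors))  # 3
```

With a real client, `embed_batch` would wrap a single request that passes the whole batch as `input`; the exact call depends on the installed `openai` version.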

In [13]:
data.head()

Unnamed: 0,course_name,university,difficulty_level,course_rating,course_url,description,skills,embedding
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...,"[0.004957424942404032, -0.013018687255680561, ..."
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...,"[-0.011336499825119972, -0.022729190066456795,..."
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...,"[0.002505358075723052, -0.006338656414300203, ..."
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...,"[-0.0029222306329756975, -0.03425585851073265,..."
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...,"[-0.016929104924201965, 0.013173501938581467, ..."


In [14]:
# Save the dataframe with embeddings
data.to_csv('data/Coursera_embeddings.csv', index=False)

## Testing

In [15]:
# Load the updated dataframe for testing
data_test = pd.read_csv('data/Coursera_embeddings.csv')

In [16]:
# Convert string representation of embeddings back to list format
data_test['embedding'] = data_test['embedding'].apply(ast.literal_eval)

In [17]:
# Check that the reloaded embeddings are indeed in list format
type(data_test['embedding'][0])

list
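
The conversion step is needed because a CSV file stores every cell as text, so a list-valued column comes back from `read_csv` as strings. The small round trip below illustrates this on made-up data; the frame and its values are assumptions for illustration only, and an in-memory buffer replaces the file on disk.

```python
import ast
import io

import pandas as pd

# A tiny frame with a list-valued column, standing in for the real data.
frame = pd.DataFrame({
    "course_name": ["Course A", "Course B"],
    "embedding": [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]],
})

# Write to an in-memory CSV and read it back, as the notebook does on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The list was serialized as text, so each cell is now a string.
type_after_read = type(restored["embedding"][0])
print(type_after_read)

# ast.literal_eval parses the Python literal back into a list
# without executing arbitrary code.
restored["embedding"] = restored["embedding"].apply(ast.literal_eval)
print(type(restored["embedding"][0]))
```

Formats that keep nested values natively, such as Parquet or pickle, would avoid the parsing step, at the cost of a file that is no longer human-readable.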

The following function computes the similarity score between two embeddings. The score indicates how closely related the two embeddings (and thereby the original texts) are.

The function uses numpy's `inner` function to compute the dot product of the two embeddings. For vectors of unit length, the dot product equals the cosine similarity, and OpenAI documents its embeddings as normalized to length 1. A higher dot product therefore indicates that the two texts are closer in meaning, so the value can serve directly as a similarity score.

In [18]:
def calculate_similarity(embedding1, embedding2):
    """
    Calculate the similarity score between two embeddings.
    
    Args:
        - embedding1 (list): The first embedding.
        - embedding2 (list): The second embedding.
    
    Returns:
        - A numerical value (float) representing the similarity between the two embeddings.
    """
    return np.inner(embedding1, embedding2)
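
The dot product only matches cosine similarity when both vectors have unit length. The sketch below shows the difference on toy 3-dimensional vectors, which are not real embeddings; `cosine_similarity` is a hypothetical helper added for comparison and is not part of the notebook.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors, each of length 5, pointing in similar directions.
u = np.array([3.0, 4.0, 0.0])
v = np.array([4.0, 3.0, 0.0])

# The raw dot product grows with vector length...
print(np.inner(u, v))            # 24.0
# ...while cosine similarity depends only on direction.
print(cosine_similarity(u, v))   # 0.96

# After scaling both vectors to unit length, the two measures coincide.
u_unit = u / np.linalg.norm(u)
v_unit = v / np.linalg.norm(v)
print(np.isclose(np.inner(u_unit, v_unit), cosine_similarity(u, v)))  # True
```

If embeddings from a different model were not normalized, dividing by the norms as in `cosine_similarity` would keep scores comparable across rows.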

In [33]:
def recommend_courses(user_prompt, data, default_display=True, similarity_score_threshold=0.5, top_n=5):
    """
    Recommend top N courses based on the similarity of the user's input to the course embeddings.

    Args:
        - user_prompt (str): The user's input text for which course recommendations are sought.
        - data (DataFrame): The dataset containing course information and embeddings.
        - default_display (bool, default: True): If True, display full course info, else only specific columns.
        - similarity_score_threshold (float, default: 0.5): The minimum score for a course to be considered relevant.
        - top_n (int, default: 5): The number of top courses to recommend.

    Returns:
        - DataFrame: A DataFrame containing recommended courses based on the user's input.
    """
    
    user_embedding = create_embeddings(user_prompt)
    if user_embedding is None:
        # create_embeddings returns None when the API call fails
        raise ValueError("Could not create an embedding for the user prompt.")
    
    # Compute scores on a new DataFrame so the caller's data is not modified
    scores = data['embedding'].apply(lambda x: calculate_similarity(user_embedding, x))
    scored_data = data.assign(similarity_score=scores)
    
    filtered_data = scored_data[scored_data['similarity_score'] >= similarity_score_threshold]
    sorted_data = filtered_data.sort_values(by='similarity_score', ascending=False)
    
    # Drop the similarity score column after sorting
    sorted_data = sorted_data.drop('similarity_score', axis=1)
    
    if not default_display:
        # Only retain the specified columns if default_display is set to False
        columns_to_display = ['course_name', 'university', 'difficulty_level', 'course_rating']
        sorted_data = sorted_data[columns_to_display]
    
    return sorted_data.head(top_n)
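
Scoring every row with `apply` works, but the same dot products can be computed with a single matrix-vector product. The sketch below shows that variant on toy data. `rank_by_similarity` is a hypothetical helper, not the notebook's function: it takes a precomputed query embedding so that it runs without an API call, and the 2-dimensional vectors are made up for illustration.

```python
import numpy as np
import pandas as pd


def rank_by_similarity(query_embedding, data, threshold=0.5, top_n=5):
    """Return the top_n rows whose dot-product score meets the threshold.

    Hypothetical helper: expects a precomputed query embedding and leaves
    the input DataFrame unmodified.
    """
    # Turn the per-row lists into one (n_rows, dim) matrix.
    matrix = np.array(data["embedding"].tolist(), dtype=float)
    # One matrix-vector product yields every row's dot product at once.
    scores = matrix @ np.asarray(query_embedding, dtype=float)
    ranked = data.assign(similarity_score=scores)
    ranked = ranked[ranked["similarity_score"] >= threshold]
    return ranked.sort_values("similarity_score", ascending=False).head(top_n)


# Toy 2-dimensional unit vectors standing in for real embeddings.
toy = pd.DataFrame({
    "course_name": ["Course A", "Course B", "Course C"],
    "embedding": [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]],
})

result = rank_by_similarity([1.0, 0.0], toy, threshold=0.5, top_n=2)
print(result["course_name"].tolist())  # ['Course A', 'Course B']
```

Course C scores 0.0 against the query and falls below the threshold, so it is filtered out before the top-N cut.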

## Recommendations

The `recommend_courses` function compares the embedding of the user's input with the embedding generated for each course, so we can now provide recommendations for any topic or course name the user enters.

In [35]:
user_input = input("Enter a course topic or name: ")
recommendations = recommend_courses(user_input, data_test, default_display=False)
display(recommendations)

Enter a course topic or name: Machine Learning


Unnamed: 0,course_name,university,difficulty_level,course_rating
2707,Machine Learning for Data Analysis,Wesleyan University,Intermediate,4.2
2405,Machine Learning With Big Data,University of California San Diego,Beginner,4.6
903,Machine Learning: Classification,University of Washington,Beginner,4.7
3230,Machine Learning for All,University of London,Conversant,4.7
384,Introduction to Machine Learning,Duke University,Beginner,4.6
