# **Content-based Course Recommender System Using User Profile and Course Genres**

The most common type of content-based recommender system recommends items to users based on their profiles. A user's profile captures that user's preferences and tastes. It is built from the user's ratings and from implicit signals, such as how many times the user has clicked on or liked different items.

The recommendation process is based on the similarity between items, which is measured from the similarity of their content. By content, we mean things like an item's category, tag, or genre: essentially, the features of the item.

For online course recommender systems, we already know how to extract features from courses (such as genres or BoW features). Next, based on the course genres and users' ratings, we want to build user profiles when they are not already known.

A user profile can be seen as the user feature vector that mathematically represents a user's learning interests.

With the user profile feature vectors and course genre feature vectors constructed, we can use several computational methods, such as a simple dot product, to compute or predict an interest score for each course and recommend those courses with high interest scores.

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/user_profile_score.png)

## Objectives

* Generate a user profile based on course genres and ratings
* Generate course recommendations based on a user's profile and course genres

In [2]:
%pip install scikit-learn
%pip install pandas



In [3]:
import pandas as pd
import numpy as np
from sklearn import preprocessing

In [4]:
# also set a random state
rs = 123

### Let's generate user profiles using course genres and ratings
Suppose we have a very simple course genre dataset that contains only three genres: Python, Database, and MachineLearning.

We also have two courses, Machine Learning with Python and SQL with Python, with the following genres:

In [5]:
course_genres = ['Python', 'Database', 'MachineLearning']
courses = [['Machine Learning with Python', 1, 0, 1], ["SQL with Python", 1, 1, 0]]
courses_df = pd.DataFrame(courses, columns = ['Title'] + course_genres)
courses_df

Unnamed: 0,Title,Python,Database,MachineLearning
0,Machine Learning with Python,1,0,1
1,SQL with Python,1,1,0


As we can see from the dataset:

* Course Machine Learning with Python has Python and MachineLearning genres
* Course SQL with Python has Python and Database genres

Then let's create another simple user rating dataframe containing the ratings from two users.

In [6]:
users = [['user0', 'Machine Learning with Python', 3], ['user1', 'SQL with Python', 2]]
users_df = pd.DataFrame(users, columns = ['User', 'Title', 'Rating'])
users_df

Unnamed: 0,User,Title,Rating
0,user0,Machine Learning with Python,3
1,user1,SQL with Python,2


Suppose user0 rated Machine Learning with Python as 3 (completed with a certificate) and user1 rated SQL with Python as 2 (just audited or not completed).

Based on their course ratings and course genres, can we generate a profile vector for each user?

Intuitively, since user0 has completed the course Machine Learning with Python, they are likely interested in the genres associated with that course, i.e., MachineLearning and Python.

On the other hand, user0 has not taken SQL with Python, so it is likely they are not interested in the Database genre.


To quantify such user interests, we can multiply user0's rating vector by the course genre matrix to get a weighted genre vector for that user:

In [7]:
# User 0 rated course 0 as 3 and course 1 as 0/NA (unknown or not interested)
u0 = np.array([[3, 0]])

In [8]:
# The course genre matrix
C = courses_df[['Python', 'Database', 'MachineLearning']].to_numpy()

In [9]:
# Before multiplying them, let's first print their shapes:
print(f"User profile vector shape {u0.shape} and course genre matrix shape {C.shape}")

User profile vector shape (1, 2) and course genre matrix shape (2, 3)


If we multiply a $1 \times 2$ vector by a $2 \times 3$ matrix, we get a $1 \times 3$ vector, which is the user profile vector.

$$u_0C = \begin{bmatrix} 3 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 3 & 0 & 3 \end{bmatrix}$$

In [10]:
u0_weights = np.matmul(u0, C)
u0_weights

array([[3, 0, 3]])

In [11]:
course_genres

['Python', 'Database', 'MachineLearning']

Let's take a look at the result. This `u0_weights` is also called the weighted genre vector and represents the user's interest in each genre based on the courses they have rated. As we can see from the results, user0 seems interested in Python and MachineLearning, each with a weight of 3.

Similarly, we can calculate the weighted genre vector for user1:

$$u_1C = \begin{bmatrix} 0 & 2 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 2 & 0 \end{bmatrix}$$

In [12]:
# User 1 rated course 0 as 0 (unknown or not interested) and course 1 as 2
u1 = np.array([[0, 2]])

In [13]:
u1_weights = np.matmul(u1, C)
u1_weights

array([[2, 2, 0]])

As we can see from the `u1_weights` vector, user1 seems interested in Python and Database, each with a weight of 2.

Let's combine the two weighted genre vectors and create a user profile dataframe:

In [14]:
weights = np.concatenate((u0_weights.reshape(1, 3), u1_weights.reshape(1, 3)), axis=0)
profiles_df = pd.DataFrame(weights, columns=['Python', 'Database', 'MachineLearning'])
profiles_df.insert(0, 'user', ['user0', 'user1'])

In [15]:
profiles_df

Unnamed: 0,user,Python,Database,MachineLearning
0,user0,3,0,3
1,user1,2,2,0


Now this `profiles_df` clearly shows the user profiles or course interests.
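The per-user computation above can also be done for all users in one step. The following is a minimal sketch, assuming the ratings are stored in long format as in `users_df`; the names `rating_matrix`, `genre_matrix`, and `all_profiles_df` are introduced here for illustration only. It pivots the ratings into a user-by-course matrix (unrated courses become 0) and multiplies it by the course genre matrix.

```python
import numpy as np
import pandas as pd

# Toy data copied from the cells above
course_genres = ['Python', 'Database', 'MachineLearning']
courses_df = pd.DataFrame(
    [['Machine Learning with Python', 1, 0, 1], ['SQL with Python', 1, 1, 0]],
    columns=['Title'] + course_genres,
)
users_df = pd.DataFrame(
    [['user0', 'Machine Learning with Python', 3], ['user1', 'SQL with Python', 2]],
    columns=['User', 'Title', 'Rating'],
)

# Pivot the long rating table into a (users x courses) matrix; unrated courses become 0.
# Reindexing the columns keeps the course order aligned with the rows of the genre matrix.
rating_matrix = (
    users_df.pivot_table(index='User', columns='Title', values='Rating', fill_value=0)
    .reindex(columns=courses_df['Title'], fill_value=0)
)

# (users x courses) @ (courses x genres) -> (users x genres)
genre_matrix = courses_df[course_genres].to_numpy()
all_profiles_df = pd.DataFrame(
    rating_matrix.to_numpy() @ genre_matrix,
    index=rating_matrix.index,
    columns=course_genres,
)
print(all_profiles_df)
```

Each row of `all_profiles_df` should match the corresponding weighted genre vector computed by hand above.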


### Generate recommendation scores for some new courses

With the user profiles generated, we can see that user0 is very interested in Python and machine learning, and user1 is very interested in Python and database.

Now, suppose we published some new courses titled Python 101, Database 101, and Machine Learning with R:

In [16]:
new_courses = [['Python 101', 1, 0, 0], ["Database 101", 0, 1, 0], ["Machine Learning with R", 0, 0, 1]]
new_courses_df = pd.DataFrame(new_courses, columns = ['Title', 'Python', 'Database', 'MachineLearning'])
new_courses_df

Unnamed: 0,Title,Python,Database,MachineLearning
0,Python 101,1,0,0
1,Database 101,0,1,0
2,Machine Learning with R,0,0,1


Next, how can we calculate a recommendation score for each new course with respect to user0 and user1, using user profile vectors and genre vectors?

One simple but effective way is to apply the dot product to the user profile vector and course genre vector (as they always have the same shape). Since we have two users and three courses, we need to perform a matrix multiplication:

In [17]:
profiles_df

Unnamed: 0,user,Python,Database,MachineLearning
0,user0,3,0,3
1,user1,2,2,0


Let's convert the course genre dataframe into a 2-D numpy array:


In [18]:
# Drop the title column
new_courses_df = new_courses_df.loc[:, new_courses_df.columns != 'Title']
course_matrix = new_courses_df.values
course_matrix

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])

In [19]:
# course matrix shape
course_matrix.shape

(3, 3)

As we can see from the above output, the course matrix is a 3 x 3 matrix and each row vector is a course genre vector.

Then we can convert the user profile dataframe into another 2-D numpy array:

In [20]:
# Drop the user column
profiles_df = profiles_df.loc[:, profiles_df.columns != 'user']
profile_matrix = profiles_df.values
profile_matrix

array([[3, 0, 3],
       [2, 2, 0]])

In [21]:
profile_matrix.shape

(2, 3)

The profile matrix is a 2 x 3 matrix and each row is a user profile vector.

If we multiply the course matrix by the transpose of the user profile matrix, we get a 3 x 2 course recommendation matrix in which each element `(i, j)` is the recommendation score of course `i` for user `j`. Intuitively, if user `j` is interested in some topics (genres) and course `i` covers the same topics, the user profile vector and the course genre vector share many non-zero dimensions, so their dot product is likely to be large.
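As a quick sanity check of this interpretation, the sketch below rebuilds the two toy matrices and confirms that every entry of the matrix product equals the dot product of one course genre vector with one user profile vector. The variable name `product` is introduced here for illustration.

```python
import numpy as np

course_matrix = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])  # 3 courses x 3 genres
profile_matrix = np.array([[3, 0, 3], [2, 2, 0]])            # 2 users x 3 genres

# (3 x 3) @ (3 x 2) -> (3 x 2): one row per course, one column per user
product = course_matrix @ profile_matrix.T
print(product.shape)

# Entry (i, j) is the dot product of course i's genre vector and user j's profile vector
for i in range(course_matrix.shape[0]):
    for j in range(profile_matrix.shape[0]):
        assert product[i, j] == np.dot(course_matrix[i], profile_matrix[j])
```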

In [22]:
scores = np.matmul(course_matrix, profile_matrix.T)
scores

array([[3, 2],
       [0, 2],
       [3, 0]])

In [23]:
# Now let's add the course titles and user ids back to make the results clearer:

scores_df = pd.DataFrame(scores, columns=['User0', 'User1'])
scores_df.index = ['Python 101', 'Database 101', 'Machine Learning with R']

In [24]:
# recommendation score dataframe
scores_df

Unnamed: 0,User0,User1
Python 101,3,2
Database 101,0,2
Machine Learning with R,3,0


From the score results, we can see that:
- For user0, the recommended courses are `Python 101` and `Machine Learning with R` because user0 is very interested in Python and machine learning
- For user1, the recommended courses are `Python 101` and `Database 101` because user1 seems very interested in topics like Python and database
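
The reading of the score table above can be automated. This is a minimal sketch that ranks courses for each user and keeps those with a positive score; the cutoff `min_score = 0` and the `recommendations` dictionary are assumptions made for this toy example, not part of the lab.

```python
import pandas as pd

# Score table copied from the output above
scores_df = pd.DataFrame(
    [[3, 2], [0, 2], [3, 0]],
    columns=['User0', 'User1'],
    index=['Python 101', 'Database 101', 'Machine Learning with R'],
)

# Hypothetical cutoff for this toy example: keep any course with a positive score
min_score = 0

recommendations = {}
for user in scores_df.columns:
    user_scores = scores_df[user]
    # Filter out low scores, then sort the remaining courses from highest to lowest score
    ranked = user_scores[user_scores > min_score].sort_values(ascending=False, kind='mergesort')
    recommendations[user] = ranked.index.to_list()

print(recommendations)
```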

### TASK: Generate course recommendations based on user profile and course genre vectors

By now you have learned how to calculate recommendation scores using a user profile vector and a course genre vector. Now, let's work on some real-world datasets to generate real personalized course recommendations.


In [25]:
# First, we will load a user's profile dataframe and a course genre dataframe:

course_genre_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_genre.csv"
course_genres_df = pd.read_csv(course_genre_url)

In [26]:
course_genres_df.head()

Unnamed: 0,COURSE_ID,TITLE,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
0,ML0201EN,robots are coming build iot apps with watson ...,0,0,0,0,0,0,0,0,0,0,0,1,1,0
1,ML0122EN,accelerating deep learning with gpu,0,1,0,0,0,1,0,1,0,0,0,0,0,0
2,GPXX0ZG0EN,consuming restful services using the reactive ...,0,0,0,0,0,0,0,0,0,0,0,1,1,0
3,RP0105EN,analyzing big data in r using apache spark,1,0,0,1,0,0,0,0,1,0,1,0,0,0
4,GPXX0Z2PEN,containerizing packaging and running a sprin...,0,0,0,0,1,0,0,0,0,0,0,1,0,0


In [27]:
profile_genre_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/user_profile.csv"
profile_df = pd.read_csv(profile_genre_url)

In [28]:
profile_df.head()

Unnamed: 0,user,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
0,2,52.0,14.0,6.0,43.0,3.0,33.0,0.0,29.0,41.0,2.0,18.0,34.0,9.0,6.0
1,4,40.0,2.0,4.0,28.0,0.0,14.0,0.0,20.0,24.0,0.0,6.0,6.0,0.0,2.0
2,5,24.0,8.0,18.0,24.0,0.0,30.0,0.0,22.0,14.0,2.0,14.0,26.0,4.0,6.0
3,7,2.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
4,8,6.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,6.0,0.0,2.0,0.0,0.0,0.0


In [29]:
# The profile dataframe contains the course interests for each user, for example, user 8 is very interested in R, data analysis, database, and big data:


profile_df[profile_df['user'] == 8]

Unnamed: 0,user,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
4,8,6.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,6.0,0.0,2.0,0.0,0.0,0.0


Next, let's load a test dataset, containing test users to whom we want to make course recommendations:


In [30]:
test_users_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-ML0321EN-Coursera/labs/v2/module_3/ratings.csv"
test_users_df = pd.read_csv(test_users_url)

In [31]:
test_users_df.head()

Unnamed: 0,user,item,rating
0,1889878,CC0101EN,5
1,1342067,CL0101EN,3
2,1990814,ML0120ENv3,5
3,380098,BD0211EN,5
4,779563,DS0101EN,3


Let's look at how many test users we have in the dataset.


In [32]:
# Group the test users DataFrame by the 'user' column and take the maximum value in each group,
# then reset the index so that 'user' becomes a regular column again; this yields one row per unique user ID
test_users = test_users_df.groupby(['user']).max().reset_index(drop=False)

# Extract the 'user' column from the test_users DataFrame and convert it to a list of user IDs
test_user_ids = test_users['user'].to_list()

# Print the total number of test users by obtaining the length of the test_user_ids list
print(f"Total numbers of test users {len(test_user_ids)}")


Total numbers of test users 33901
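
If only the unique user IDs are needed, the group-by step is not strictly required. The following sketch shows an alternative on a small, made-up ratings table (`toy_ratings` and its values are illustrative only) using `Series.nunique()` and `Series.unique()`.

```python
import pandas as pd

# Made-up ratings in the same (user, item, rating) layout as test_users_df
toy_ratings = pd.DataFrame({
    'user': [10, 10, 20, 30, 30, 30],
    'item': ['A', 'B', 'A', 'A', 'B', 'C'],
    'rating': [3, 5, 5, 3, 3, 5],
})

# Number of distinct users, and the distinct IDs in order of first appearance
n_users = toy_ratings['user'].nunique()
unique_ids = toy_ratings['user'].unique().tolist()
print(n_users, unique_ids)
```

Note that `unique()` returns IDs in order of first appearance, whereas the group-by approach returns them sorted.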


Then for each test user in the test dataset, you need to first find out which courses are unknown/unselected to them. For example, suppose we have a user `1078030` with profile:


In [33]:
test_user_profile = profile_df[profile_df['user'] == 1078030]
test_user_profile

Unnamed: 0,user,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
18204,1078030,0.0,12.0,0.0,9.0,0.0,12.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
# Now let's get the test user vector by excluding the `user` column
test_user_vector = test_user_profile.iloc[0, 1:].values
test_user_vector

array([ 0., 12.,  0.,  9.,  0., 12.,  0.,  6.,  0.,  0.,  0.,  0.,  0.,
        0.])

We can first find their enrolled courses in `test_users_df`:


In [36]:
enrolled_courses = test_users_df[test_users_df['user'] == 1078030]['item'].to_list()
enrolled_courses = set(enrolled_courses)

In [37]:
enrolled_courses

{'DA0101EN',
 'DV0101EN',
 'ML0101ENv3',
 'ML0115EN',
 'ML0120ENv2',
 'ML0122ENv1',
 'PY0101EN',
 'ST0101EN'}

In [38]:
# We then print the entire course list:

all_courses = set(course_genres_df['COURSE_ID'].values)
all_courses

{'AI0111EN',
 'BC0101EN',
 'BC0201EN',
 'BC0202EN',
 'BD0101EN',
 'BD0111EN',
 'BD0115EN',
 'BD0121EN',
 'BD0123EN',
 'BD0131EN',
 'BD0133EN',
 'BD0135EN',
 'BD0137EN',
 'BD0141EN',
 'BD0143EN',
 'BD0145EN',
 'BD0151EN',
 'BD0153EN',
 'BD0211EN',
 'BD0212EN',
 'BD0221EN',
 'BD0223EN',
 'BENTEST4',
 'CB0101EN',
 'CB0103EN',
 'CB0105ENv1',
 'CB0201EN',
 'CC0101EN',
 'CC0103EN',
 'CC0120EN',
 'CC0121EN',
 'CC0150EN',
 'CC0201EN',
 'CC0210EN',
 'CC0250EN',
 'CC0271EN',
 'CL0101EN',
 'CNSC02EN',
 'CO0101EN',
 'CO0193EN',
 'CO0201EN',
 'CO0301EN',
 'CO0302EN',
 'CO0401EN',
 'COM001EN',
 'CP0101EN',
 'DA0101EN',
 'DA0151EN',
 'DA0201EN',
 'DAI101EN',
 'DB0101EN',
 'DB0111EN',
 'DB0113EN',
 'DB0115EN',
 'DB0151EN',
 'DE0205EN',
 'DJ0101EN',
 'DP0101EN',
 'DS0101EN',
 'DS0103EN',
 'DS0105EN',
 'DS0107',
 'DS0110EN',
 'DS0132EN',
 'DS0201EN',
 'DS0301EN',
 'DS0321EN',
 'DV0101EN',
 'DV0151EN',
 'DW0101EN',
 'DX0106EN',
 'DX0107EN',
 'DX0108EN',
 'EE0101EN',
 'GPXX01AVEN',
 'GPXX01DCEN',
 'GPXX01

Then we can subtract the enrolled courses from the set of all courses to get the set of unknown courses for user `1078030`. We want to find courses of potential interest hidden in this unknown course list.


In [39]:
unknown_courses = all_courses.difference(enrolled_courses)
unknown_courses

{'AI0111EN',
 'BC0101EN',
 'BC0201EN',
 'BC0202EN',
 'BD0101EN',
 'BD0111EN',
 'BD0115EN',
 'BD0121EN',
 'BD0123EN',
 'BD0131EN',
 'BD0133EN',
 'BD0135EN',
 'BD0137EN',
 'BD0141EN',
 'BD0143EN',
 'BD0145EN',
 'BD0151EN',
 'BD0153EN',
 'BD0211EN',
 'BD0212EN',
 'BD0221EN',
 'BD0223EN',
 'BENTEST4',
 'CB0101EN',
 'CB0103EN',
 'CB0105ENv1',
 'CB0201EN',
 'CC0101EN',
 'CC0103EN',
 'CC0120EN',
 'CC0121EN',
 'CC0150EN',
 'CC0201EN',
 'CC0210EN',
 'CC0250EN',
 'CC0271EN',
 'CL0101EN',
 'CNSC02EN',
 'CO0101EN',
 'CO0193EN',
 'CO0201EN',
 'CO0301EN',
 'CO0302EN',
 'CO0401EN',
 'COM001EN',
 'CP0101EN',
 'DA0151EN',
 'DA0201EN',
 'DAI101EN',
 'DB0101EN',
 'DB0111EN',
 'DB0113EN',
 'DB0115EN',
 'DB0151EN',
 'DE0205EN',
 'DJ0101EN',
 'DP0101EN',
 'DS0101EN',
 'DS0103EN',
 'DS0105EN',
 'DS0107',
 'DS0110EN',
 'DS0132EN',
 'DS0201EN',
 'DS0301EN',
 'DS0321EN',
 'DV0151EN',
 'DW0101EN',
 'DX0106EN',
 'DX0107EN',
 'DX0108EN',
 'EE0101EN',
 'GPXX01AVEN',
 'GPXX01DCEN',
 'GPXX01RYEN',
 'GPXX03HFEN',
 'GP

In [40]:
# We can get the genre vectors for those unknown courses as well:

unknown_course_genres = course_genres_df[course_genres_df['COURSE_ID'].isin(unknown_courses)]
# Now let's get the course matrix by excluding `COURSE_ID` and `TITLE` columns:
course_matrix = unknown_course_genres.iloc[:, 2:].values
course_matrix

array([[0, 0, 0, ..., 1, 1, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 1, 0]])

Given the user profile vector for user `1078030` and the genre vectors of all the unknown courses above, you can use the dot product to calculate the recommendation score for each unknown course. For example, the recommendation score for the course `accelerating deep learning with gpu` is:


In [41]:
score = np.dot(course_matrix[1], test_user_vector)
score

30.0

Later, we will need to choose a recommendation score threshold. If the score of any course is above the threshold, we may recommend that course to the user.
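
Thresholding can be sketched on a small example. The genre matrix, user vector, course IDs, and threshold below are all made up for illustration; the point is that one matrix-vector product scores every unknown course at once, and a boolean mask then keeps the courses that reach the threshold.

```python
import numpy as np

# Illustrative data: 4 unknown courses described by 3 genres, and one user profile vector
course_ids = np.array(['course_a', 'course_b', 'course_c', 'course_d'])
toy_course_matrix = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
])
toy_user_vector = np.array([12.0, 0.0, 6.0])

# One dot product per course: (4 x 3) @ (3,) -> (4,)
course_scores = toy_course_matrix @ toy_user_vector

threshold = 10.0  # assumed value; raising it returns fewer courses

# Boolean mask selecting the courses whose score reaches the threshold
mask = course_scores >= threshold
recommended = dict(zip(course_ids[mask].tolist(), course_scores[mask].tolist()))
print(recommended)
```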


The workflow can be summarized in the following flowchart:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/recommend_courses.png)


Next, let's calculate the recommendation scores of all unknown courses for all the test users.


In [42]:
# Reload the test users dataset from the specified URL using pandas and store it in test_users_df
test_users_df = pd.read_csv(test_users_url)

# Reload the user profiles dataset from the specified URL containing user profiles and their associated genres using pandas and store it in profile_df
profile_df = pd.read_csv(profile_genre_url)

# Reload the course genres dataset from the specified URL containing course genres using pandas and store it in course_genres_df
course_genres_df = pd.read_csv(course_genre_url)

# Create an empty dictionary to store the results of the recommendation process
res_dict = {}


We only want to recommend courses with very high scores so we may set a score threshold to filter out those courses with low scores.


In [43]:
# Only keep the score larger than the recommendation threshold
# The threshold can be fine-tuned to adjust the size of generated recommendations
score_threshold = 10.0

We defined a function called generate_recommendation_scores() to compute the recommendation scores of all the unknown courses for all test users.

TODO: Complete the generate_recommendation_scores() function below to generate recommendation scores for all users. You may also implement the task with a different solution.

In [44]:
def generate_recommendation_scores():
    """
    Generate recommendation scores for users and courses.

    Returns:
    users (list): List of user IDs.
    courses (list): List of recommended course IDs.
    scores (list): List of recommendation scores.
    """

    users = []      # List to store user IDs
    courses = []    # List to store recommended course IDs
    scores = []     # List to store recommendation scores

    # Iterate over each user ID in the test_user_ids list
    for user_id in test_user_ids:
        # Get the user profile data for the current user
        test_user_profile = profile_df[profile_df['user'] == user_id]

        # Get the user vector for the current user id (replace with your method to obtain the user vector)
        test_user_vector = test_user_profile.iloc[0, 1:].values

        # Get the known course ids for the current user
        enrolled_courses = test_users_df[test_users_df['user'] == user_id]['item'].to_list()

        # Calculate the unknown course ids
        unknown_courses = all_courses.difference(enrolled_courses)

        # Filter the course_genres_df to include only unknown courses
        unknown_course_df = course_genres_df[course_genres_df['COURSE_ID'].isin(unknown_courses)]
        unknown_course_ids = unknown_course_df['COURSE_ID'].values

        # Calculate the recommendation scores using dot product
        recommendation_scores = np.dot(unknown_course_df.iloc[:, 2:].values, test_user_vector)

        # Append the results into the users, courses, and scores list
        for i in range(0, len(unknown_course_ids)):
            score = recommendation_scores[i]

            # Only keep the courses with high recommendation score
            if score >= score_threshold:
                users.append(user_id)
                courses.append(unknown_course_ids[i])
                scores.append(recommendation_scores[i])

    return users, courses, scores



NOTE: Instead of using some absolute score threshold, you may also try sorting the scores for each user and return the top-ranked courses.
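
The top-ranked alternative mentioned in the note could look like the sketch below. The helper `top_n_courses` and the sample IDs and scores are hypothetical and only illustrate the idea; inside `generate_recommendation_scores()` it would be applied to each user's `unknown_course_ids` and `recommendation_scores`.

```python
import numpy as np

def top_n_courses(course_ids, scores, n=3):
    """Return the n (course_id, score) pairs with the highest scores."""
    scores = np.asarray(scores, dtype=float)
    # Sorting the negated scores gives descending order; 'stable' keeps ties in input order
    order = np.argsort(-scores, kind='stable')[:n]
    return [(course_ids[i], float(scores[i])) for i in order]

# Illustrative values only
ids = ['course_a', 'course_b', 'course_c', 'course_d']
vals = [18.0, 0.0, 12.0, 6.0]
top_two = top_n_courses(ids, vals, n=2)
print(top_two)
```

Unlike a fixed threshold, this always returns at most `n` courses per user, even for users whose scores are all low, so it may be worth combining with a minimum score.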

After you have completed the function generate_recommendation_scores() above, you can test it to generate recommendation scores and save the course recommendations into a dataframe with three columns: USER, COURSE_ID, SCORE:

In [45]:
# Call the generate_recommendation_scores function to obtain recommendation scores for users and courses,
# and assign the returned lists to variables users, courses, and scores
users, courses, scores = generate_recommendation_scores()

# Create an empty dictionary named res_dict to store the results of the recommendation process
res_dict = {}

# Store the lists of users, courses, and scores into the res_dict dictionary with corresponding keys
res_dict['USER'] = users
res_dict['COURSE_ID'] = courses
res_dict['SCORE'] = scores

# Create a DataFrame named res_df using the res_dict dictionary, specifying the column order as ['USER', 'COURSE_ID', 'SCORE']
res_df = pd.DataFrame(res_dict, columns=['USER', 'COURSE_ID', 'SCORE'])

# Save the res_df DataFrame to a CSV file named "profile_rs_results.csv" without including the index
res_df.to_csv("profile_rs_results.csv", index=False)

# Output the res_df DataFrame
res_df


Unnamed: 0,USER,COURSE_ID,SCORE
0,2,ML0201EN,43.0
1,2,GPXX0ZG0EN,43.0
2,2,GPXX0Z2PEN,37.0
3,2,DX0106EN,47.0
4,2,GPXX06RFEN,52.0
...,...,...,...
1500419,2102680,excourse62,15.0
1500420,2102680,excourse69,14.0
1500421,2102680,excourse77,14.0
1500422,2102680,excourse78,14.0
