# **Content-based Course Recommender System Using User Profile and Course Genres**


## Objectives

* Part 1. Engineer user profile features from course genres and ratings
* Part 2. Generate course recommendations based on a user's profile and course genres

----

The most common type of content-based recommender system recommends items to users based on their profiles. A user's profile captures that user's preferences and tastes. It is built from the user's ratings, which may include signals such as how often the user has clicked on or liked different items.

The recommendation process is based on the similarity between those items. The similarity or closeness of items is measured by the similarity of their content. By content, we mean things like the item's category, tag, or genre: essentially, the item's features.


For online course recommender systems, we already know how to extract features from courses (such as genres or BoW features). Next, based on the course genres and users' ratings, we want to build user profiles when they are not already available.

A user profile can be seen as the user feature vector that mathematically represents a user's learning interests.


With the user profile feature vectors and course genre feature vectors constructed, we can use several computational methods, such as a simple dot product, to compute or predict an interest score for each course and recommend those courses with high interest scores.
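
As a minimal sketch of the dot-product scoring described above, the example below uses a made-up three-genre space and three hypothetical courses. The genre order, the profile values, and the course rows are all illustrative assumptions, not data from this lab.

```python
import numpy as np

# Hypothetical genre order: [Python, MachineLearning, Database]
user_profile = np.array([3.0, 2.0, 0.0])   # illustrative user profile vector

# One row per course, one column per genre (1 = course belongs to the genre)
course_genres = np.array([
    [1, 1, 0],   # hypothetical course A: Python + MachineLearning
    [0, 0, 1],   # hypothetical course B: Database only
    [1, 0, 1],   # hypothetical course C: Python + Database
])

# Interest score per course = dot product of its genre vector with the profile
interest_scores = course_genres @ user_profile
print(interest_scores)  # [5. 0. 3.]
```

Course A scores highest because it covers both genres in which this illustrative user has a non-zero profile value.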


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/user_profile_score.png)


----


## Part 1. Feature Engineering - The **user-item interaction matrix**


In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing

In [2]:
# also set a random state
rs = 42

### Let's generate user profiles using course genres and ratings


First, we load a course genre dataframe and a user ratings dataframe:


In [3]:
course_genres_df = pd.read_csv('course_genre.csv')
print(course_genres_df.shape)
course_genres_df.head()

(307, 16)


Unnamed: 0,COURSE_ID,TITLE,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
0,ML0201EN,robots are coming build iot apps with watson ...,0,0,0,0,0,0,0,0,0,0,0,1,1,0
1,ML0122EN,accelerating deep learning with gpu,0,1,0,0,0,1,0,1,0,0,0,0,0,0
2,GPXX0ZG0EN,consuming restful services using the reactive ...,0,0,0,0,0,0,0,0,0,0,0,1,1,0
3,RP0105EN,analyzing big data in r using apache spark,1,0,0,1,0,0,0,0,1,0,1,0,0,0
4,GPXX0Z2PEN,containerizing packaging and running a sprin...,0,0,0,0,1,0,0,0,0,0,0,1,0,0


In [4]:
ratings_df = pd.read_csv('ratings.csv')
print(ratings_df.shape)
ratings_df.head()

(233306, 3)


Unnamed: 0,user,item,rating
0,1889878,CC0101EN,5
1,1342067,CL0101EN,3
2,1990814,ML0120ENv3,5
3,380098,BD0211EN,5
4,779563,DS0101EN,3


We can now reshape the `ratings_df` dataset with the `pivot` method to obtain the user-item interaction matrix. The dataset contains three columns: `user` (learner ID), `item` (course ID), and `rating` (enrollment mode).

Note that `ratings_df` is stored in long (vertical) form, with one row per user-item pair. `pivot` converts it to a wide matrix with one row per user and one column per course. Because most users rate only a few courses, most entries are missing and are filled with 0, which makes the matrix sparse:

In [5]:
rating_sparse_df = ratings_df.pivot(index='user', columns='item',values='rating').fillna(0)#.reset_index().rename_axis(index=None, columns=None)
print(rating_sparse_df.shape)
rating_sparse_df.head()

(33901, 126)


user,AI0111EN,BC0101EN,BC0201EN,BC0202EN,BD0101EN,BD0111EN,BD0115EN,BD0121EN,BD0123EN,BD0131EN,...,SW0201EN,TA0105,TA0105EN,TA0106EN,TMP0101EN,TMP0105EN,TMP0106,TMP107,WA0101EN,WA0103EN
2,0.0,4.0,0.0,0.0,5.0,4.0,0.0,5.0,3.0,3.0,...,0.0,5.0,0.0,4.0,0.0,3.0,3.0,0.0,5.0,0.0
4,0.0,0.0,0.0,0.0,5.0,3.0,4.0,5.0,3.0,4.0,...,0.0,4.0,0.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0
5,3.0,5.0,5.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,...,0.0,0.0,4.0,4.0,4.0,4.0,4.0,5.0,0.0,3.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
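
The long-to-wide reshaping can be seen more easily on a toy example. The sketch below uses made-up users, items, and ratings; only the `pivot(...).fillna(0)` pattern is taken from the cell above.

```python
import pandas as pd

# Toy ratings in long format: one row per (user, item) pair (illustrative values)
toy_ratings = pd.DataFrame({
    "user": [1, 1, 2, 3],
    "item": ["A", "B", "A", "C"],
    "rating": [5, 3, 4, 3],
})

# Wide format: one row per user, one column per item, 0 where no rating exists
toy_wide = toy_ratings.pivot(index="user", columns="item", values="rating").fillna(0)
print(toy_wide.shape)  # (3, 3)
```

Note that `pivot` raises an error if the same (user, item) pair appears more than once; in that case `pivot_table` with an aggregation function is a possible alternative.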


We now filter `course_genres_df` to keep only the courses rated in `ratings_df`. Then we sort the rows of the filtered dataframe so that they match the columns of `rating_sparse_df`, as required for the matrix multiplication.

In [6]:
# Keep only the courses that appear in ratings_df
rated_courses = ratings_df['item'].unique()
course_genres_rated = course_genres_df[course_genres_df['COURSE_ID'].isin(rated_courses)]
print(course_genres_rated.shape)

# Set COURSE_ID as the index to facilitate the upcoming matrix multiplication
# (reassigning avoids modifying a filtered slice in place)
course_genres_rated = course_genres_rated.set_index('COURSE_ID')

(126, 16)


We use `merge` to sort the rows of `course_genres_rated` according to the column order of `rating_sparse_df`.

In [7]:
# we use merge to sort course_genres_rated.
course_genres_rated = pd.merge(pd.Series(rating_sparse_df.columns), course_genres_rated, how='inner', left_on=pd.Series(rating_sparse_df.columns), right_on='COURSE_ID')
course_genres_rated.drop(columns=['item','TITLE'],inplace=True)
course_genres_rated.head()

Unnamed: 0,COURSE_ID,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
0,AI0111EN,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,BC0101EN,0,0,0,0,0,0,0,0,0,0,0,1,0,1
2,BC0201EN,0,0,1,0,0,0,0,0,0,0,0,0,0,1
3,BC0202EN,0,0,0,0,0,0,0,0,0,0,0,1,0,1
4,BD0101EN,1,0,0,0,0,0,0,0,1,0,0,0,0,0
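
The row alignment can also be expressed with `reindex`. This is an alternative sketch on toy data, not the approach used in the cell above; the course IDs and genre values are made up.

```python
import pandas as pd

# Toy genre table whose row order differs from the desired column order
toy_genres = pd.DataFrame({
    "COURSE_ID": ["C", "A", "B"],
    "Python": [0, 1, 0],
    "Database": [1, 0, 1],
})
desired_order = ["A", "B", "C"]   # e.g. the columns of the pivoted ratings

# Index by COURSE_ID, then reorder the rows to match desired_order
aligned = toy_genres.set_index("COURSE_ID").reindex(desired_order)
print(aligned.index.tolist())  # ['A', 'B', 'C']
```

One difference to keep in mind: `reindex` inserts a row of `NaN` for any label missing from the table, whereas an inner `merge` silently drops it.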


We can spot-check a row of the sorted `course_genres_rated`, for example `ML0201EN`, which now sits at position 94:

In [8]:
course_genres_rated[course_genres_rated['COURSE_ID'] == 'ML0201EN']

Unnamed: 0,COURSE_ID,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
94,ML0201EN,0,0,0,0,0,0,0,0,0,0,0,1,1,0


Next, we finally perform the matrix multiplication between `rating_sparse_df` and `course_genres_rated` to generate the user profile matrix `profile_df`:

In [9]:
c_mat = course_genres_rated.iloc[:,1:]
u_mat = rating_sparse_df.to_numpy()
print(u_mat.shape, c_mat.shape)
profile = np.matmul(u_mat,c_mat)

(33901, 126) (126, 14)
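
To make the shapes concrete, here is the same multiplication on a toy scale: 2 users, 3 courses, and 2 genres. All values are illustrative.

```python
import numpy as np

# Toy ratings: 2 users x 3 courses (0 = not rated)
ratings = np.array([
    [5, 0, 3],
    [0, 4, 0],
])
# Toy genres: 3 courses x 2 genres, rows in the same course order as the columns above
genres = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
])

# Each profile entry sums the user's ratings over the courses tagged with that genre
profiles = ratings @ genres
print(profiles)
# [[8 3]
#  [0 4]]
```

The first user rated courses 1 and 3, which both carry the first genre, so that entry is 5 + 3 = 8. This only works if the course order of the rating columns matches the course order of the genre rows.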


In [10]:
# We define the user profile matrix as profile_df
profile_df = pd.concat([pd.Series(rating_sparse_df.index), profile],axis=1)
print(profile_df.shape)
profile_df.head(10)

(33901, 15)


Unnamed: 0,user,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
0,2,83.0,21.0,8.0,68.0,4.0,49.0,0.0,47.0,66.0,3.0,27.0,41.0,9.0,8.0
1,4,78.0,5.0,6.0,48.0,0.0,30.0,0.0,45.0,46.0,0.0,12.0,9.0,0.0,4.0
2,5,47.0,18.0,36.0,46.0,0.0,59.0,0.0,47.0,29.0,4.0,27.0,49.0,7.0,13.0
3,7,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0
4,8,13.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0,13.0,0.0,5.0,0.0,0.0,0.0
5,9,32.0,0.0,0.0,13.0,0.0,5.0,0.0,0.0,28.0,5.0,5.0,0.0,0.0,0.0
6,12,11.0,5.0,0.0,16.0,0.0,9.0,0.0,9.0,11.0,5.0,0.0,5.0,5.0,3.0
7,16,16.0,11.0,0.0,8.0,0.0,10.0,0.0,11.0,16.0,0.0,0.0,4.0,0.0,4.0
8,17,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,19,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,4.0,0.0,0.0


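In the user-item interaction matrix `rating_sparse_df`, each row vector represents the rating history of a user and each column vector represents the users who rated an item. Such a matrix is usually very sparse, because one user typically interacts with only a small subset of items and one item is typically rated by only a small subset of users. Multiplying it by the course genre matrix aggregates each user's ratings by genre, so each row of `profile_df` above holds the sum of that user's ratings over the courses belonging to each genre.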
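
Sparsity can be quantified as the share of empty cells. The sketch below shows the calculation on a made-up 3 x 4 matrix.

```python
import numpy as np

# Toy interaction matrix: 3 users x 4 items, 0 = no interaction
interactions = np.array([
    [5, 0, 0, 0],
    [0, 0, 3, 0],
    [0, 4, 0, 0],
])

# Sparsity = share of user-item pairs with no recorded rating
sparsity = 1.0 - np.count_nonzero(interactions) / interactions.size
print(sparsity)  # 0.75
```

For the data in this notebook, 233,306 ratings spread over 33,901 x 126 cells means that only about 5.5% of the cells are filled, assuming every recorded rating is non-zero.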

The profile dataframe contains the course interests of each user. For example, user 8 is mostly interested in database, big data, data analysis, and R:


In [11]:
profile_df[profile_df['user'] == 8]

Unnamed: 0,user,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
4,8,13.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0,13.0,0.0,5.0,0.0,0.0,0.0


In [12]:
# save profile_df DataFrame to csv
profile_df.to_csv("user_profile.csv", index=False)

----

## Part 2. Course recommendations based on user profile:


Next, let's sample a test dataset, containing test users to whom we want to make course recommendations.

In [13]:
np.random.seed(rs)
test_users = np.random.choice(ratings_df['user'].unique(), size=1000, replace=False)

In [14]:
test_users_df = ratings_df[ratings_df['user'].isin(test_users)]
test_users_df

Unnamed: 0,user,item,rating
197,393930,ST0101EN,3
242,470972,BD0111EN,3
264,1137644,BD0211EN,5
347,970117,CB0103EN,3
372,590871,BD0111EN,5
...,...,...,...
233151,882207,BD0211EN,4
233172,1055271,BC0201EN,3
233211,1170033,CO0301EN,4
233219,1525198,CNSC02EN,5


Let's look at how many test users we have in the dataset.

In [15]:
# Print the total number of test users by counting the unique user IDs in test_users_df
print(f"Total numbers of test users { len(test_users_df['user'].unique()) }")

Total numbers of test users 1000


Then, for each test user in the test dataset, we first need to find out which courses are unknown to them, i.e. which courses they have not enrolled in. For example, suppose we have a user `1231456` with the following profile:


In [16]:
# Example user ID = 1231456 (the variable keeps the name course_ex because later cells use it)
course_ex = 1231456
test_user_profile = profile_df[profile_df['user'] == course_ex]
test_user_profile

Unnamed: 0,user,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
21561,1231456,3.0,10.0,3.0,8.0,0.0,0.0,0.0,9.0,3.0,4.0,0.0,0.0,0.0,0.0


In [17]:
# Now let's get the test user vector by excluding the `user` column
test_user_vector = test_user_profile.iloc[0, 1:].to_numpy()
test_user_vector

array([ 3., 10.,  3.,  8.,  0.,  0.,  0.,  9.,  3.,  4.,  0.,  0.,  0.,
        0.])

We can first find their enrolled courses in `test_users_df`:

In [18]:
test_users_df[test_users_df['user'] == course_ex]['item'].to_list()

['DA0101EN',
 'DS0101EN',
 'PY0101EN',
 'CB0103EN',
 'BD0101EN',
 'ST0101EN',
 'CC0103EN']

In [19]:
enrolled_courses = test_users_df[test_users_df['user'] == course_ex]['item'].to_list()
enrolled_courses = set(enrolled_courses)
enrolled_courses

{'BD0101EN',
 'CB0103EN',
 'CC0103EN',
 'DA0101EN',
 'DS0101EN',
 'PY0101EN',
 'ST0101EN'}

In [20]:
all_courses = set(course_genres_rated['COURSE_ID'].values)
len(all_courses)

126

Then we can subtract the enrolled courses from the set of all courses to get the set of courses unknown to user `1231456`. We want to find courses of potential interest within this set.

In [21]:
unknown_courses = all_courses.difference(enrolled_courses)
len(unknown_courses)

119
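
The same set subtraction on a toy catalogue, with made-up course IDs:

```python
# Toy course catalogue and one user's enrolled courses (illustrative IDs)
all_course_ids = {"A", "B", "C", "D"}
enrolled = {"B", "D"}

# Courses the user has not taken yet are the candidates for recommendation
unknown = all_course_ids - enrolled
print(sorted(unknown))  # ['A', 'C']
```

The `-` operator and `set.difference()` are equivalent when both operands are sets. For the example user above, 126 courses minus 7 enrolled courses leaves the 119 unknown courses.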

We can get the genre vectors for those unknown courses as well:

In [22]:
unknown_course_genres = course_genres_df[course_genres_df['COURSE_ID'].isin(unknown_courses)]
# Now let's get the course matrix by excluding `COURSE_ID` and `TITLE` columns:
course_matrix = unknown_course_genres.iloc[:, 2:].to_numpy()
course_matrix

array([[0, 0, 0, ..., 1, 1, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Given the user profile vector for user `1231456` and the genre vectors of all the unknown courses above, you can use the dot product to calculate the recommendation score for each unknown course. For example, the recommendation score for the course `accelerating deep learning with gpu` is:

In [23]:
score = np.dot(course_matrix[1], test_user_vector)
score

19.0
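
The score can be checked by hand. The sketch below copies the genre vector of `ML0122EN` and the profile vector of user `1231456` from the outputs above and lists the contribution of each genre the course carries. The dictionary `contributions` is only an illustration aid.

```python
import numpy as np

genre_names = ["Database", "Python", "CloudComputing", "DataAnalysis", "Containers",
               "MachineLearning", "ComputerVision", "DataScience", "BigData", "Chatbot",
               "R", "BackendDev", "FrontendDev", "Blockchain"]

# Genre vector of ML0122EN and profile of user 1231456, copied from the outputs above
course_vector = np.array([0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0])
user_vector = np.array([3., 10., 3., 8., 0., 0., 0., 9., 3., 4., 0., 0., 0., 0.])

# Per-genre contributions: only genres the course carries can add to the score
contributions = {g: float(c * u)
                 for g, c, u in zip(genre_names, course_vector, user_vector) if c}
print(contributions)  # {'Python': 10.0, 'MachineLearning': 0.0, 'DataScience': 9.0}
print(sum(contributions.values()))  # 19.0
```

The course is tagged Python, MachineLearning, and DataScience, and the user's profile values for those genres are 10, 0, and 9, which add up to 19.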

Later, we will need to choose a recommendation score threshold. If the score of any course is above the threshold, we may recommend that course to the user.


The workflow can be summarized in the following flowchart:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_3/images/recommend_courses.png)


Next, let's calculate the recommendation scores of all courses for all the 1000 test users sampled from `ratings_df`. 


We only want to recommend courses with very high scores so we may set a score threshold to filter out those courses with low scores.


In [24]:
# Only keep scores greater than or equal to the recommendation threshold
# The threshold can be fine-tuned to adjust the size of generated recommendations
score_threshold = 20.0

We define a function called `generate_recommendation_scores()` to compute the recommendation scores of all the unknown courses for all test users.

In [25]:
def generate_recommendation_scores():
    """
    Generate recommendation scores for users and courses.

    Returns:
    users (list): List of user IDs.
    courses (list): List of recommended course IDs.
    scores (list): List of recommendation scores.
    """

    users = []      # List to store user IDs
    courses = []    # List to store recommended course IDs
    scores = []     # List to store recommendation scores
    test_user_ids = test_users_df['user'].unique() # list of 1000 test users

    # Iterate over each user ID in the test_user_ids list
    for user_id in test_user_ids:
        # Get the user profile data for the current user
        test_user_profile = profile_df[profile_df['user'] == user_id]

        # Get the user vector for the current user id 
        test_user_vector = test_user_profile.iloc[0, 1:].values

        # Get the known course ids for the current user
        enrolled_courses = test_users_df[test_users_df['user'] == user_id]['item'].to_list()

        # Calculate the unknown course ids
        unknown_courses = all_courses.difference(enrolled_courses)

        # Keep only the genre rows of the unknown courses.
        # We use course_genres_rated instead of the original course_genres_df so that only rated courses are considered
        unknown_course_df = course_genres_rated[course_genres_rated['COURSE_ID'].isin(unknown_courses)]
        unknown_course_ids = unknown_course_df['COURSE_ID'].values

        # Calculate the recommendation scores using dot product
        recommendation_scores = np.dot(unknown_course_df.iloc[:, 1:].values, test_user_vector)

        # Append the results into the users, courses, and scores list
        for i in range(0, len(unknown_course_ids)):
            score = recommendation_scores[i]

            # Only keep the courses with high recommendation score
            if score >= score_threshold:
                users.append(user_id)
                courses.append(unknown_course_ids[i])
                scores.append(recommendation_scores[i])

    return users, courses, scores
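
The per-user loop above can also be written as one matrix product followed by a mask. The sketch below is an alternative under stated assumptions, not the notebook's implementation: `score_unknown_courses` is a hypothetical helper, the inputs are toy dataframes, and `rated_mask` is assumed to be a boolean users x courses table marking enrolled courses.

```python
import numpy as np
import pandas as pd

def score_unknown_courses(profiles, genres, rated_mask, threshold):
    """Hypothetical helper: score every (user, course) pair at once."""
    scores = profiles.to_numpy() @ genres.to_numpy().T      # users x courses
    keep = (scores >= threshold) & ~rated_mask.to_numpy()   # drop enrolled courses
    user_idx, course_idx = np.nonzero(keep)
    return pd.DataFrame({
        "USER": profiles.index[user_idx],
        "COURSE_ID": genres.index[course_idx],
        "SCORE": scores[user_idx, course_idx],
    })

# Toy inputs (illustrative values only)
profiles = pd.DataFrame([[10.0, 0.0], [2.0, 8.0]], index=[1, 2], columns=["Python", "R"])
genres = pd.DataFrame([[1, 0], [0, 1], [1, 1]], index=["A", "B", "C"], columns=["Python", "R"])
rated_mask = pd.DataFrame([[True, False, False], [False, False, False]],
                          index=[1, 2], columns=["A", "B", "C"])

toy_res = score_unknown_courses(profiles, genres, rated_mask, threshold=8.0)
print(toy_res)
```

User 1 already took course A, so only course C (score 10) is kept for them; user 2 gets courses B (score 8) and C (score 10). The trade-off is memory: the full users x courses score matrix is materialized at once, which is cheap here but may not be for much larger catalogues.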
 


With the function `generate_recommendation_scores()` defined above, you can run it to generate recommendation scores and save the course recommendations into a dataframe with three columns: `USER`, `COURSE_ID`, `SCORE`:


In [26]:
# Call the generate_recommendation_scores function to obtain recommendation scores for users and courses,
# and assign the returned lists to variables users, courses, and scores
users, courses, scores = generate_recommendation_scores()

# Create an empty dictionary named res_dict to store the results of the recommendation process
res_dict = {}

# Store the lists of users, courses, and scores into the res_dict dictionary with corresponding keys
res_dict['USER'] = users
res_dict['COURSE_ID'] = courses
res_dict['SCORE'] = scores

# Create a DataFrame named res_df using the res_dict dictionary, specifying the column order as ['USER', 'COURSE_ID', 'SCORE']
res_df = pd.DataFrame(res_dict, columns=['USER', 'COURSE_ID', 'SCORE'])

# Optionally save the res_df DataFrame to a CSV file named "profile_rs_results.csv" without the index (uncomment to write the file)
#res_df.to_csv("profile_rs_results.csv", index=False)

# Output the res_df DataFrame
res_df


Unnamed: 0,USER,COURSE_ID,SCORE
0,470972,BD0123EN,28.0
1,470972,BD0131EN,50.0
2,470972,BD0133EN,28.0
3,470972,BD0135EN,28.0
4,470972,BD0137EN,28.0
...,...,...,...
10623,228478,TMP0105EN,35.0
10624,1383536,ML0101EN,23.0
10625,1383536,ML0122EN,23.0
10626,1174740,ML0101EN,21.0


With the course recommendation list generated for each test user, we perform some analytic tasks to answer the following two questions:


- On average, how many new courses have been recommended per test user?
- What are the most frequently recommended courses? Return the top-10 commonly recommended courses across all test users.

In [27]:
count_df = pd.DataFrame(res_df.groupby('USER').size().sort_values(ascending=False),columns=['Count']).reset_index().rename_axis(index=None, columns=None)
count_df.head(10)

Unnamed: 0,USER,Count
0,762476,75
1,507506,75
2,752457,74
3,1501711,73
4,746163,71
5,1185467,68
6,1559851,66
7,573764,66
8,1048308,66
9,1109665,65


In [28]:
count_df.describe()

Unnamed: 0,USER,Count
count,619.0,619.0
mean,1067032.0,17.169628
std,469679.6,14.655659
min,40303.0,1.0
25%,710950.5,5.0
50%,1036494.0,13.0
75%,1425568.0,24.0
max,2093050.0,75.0


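The mean of the `Count` column shows approximately 17 recommendations per user. Note that `count_df` covers only the 619 test users who received at least one recommendation; the remaining 381 of the 1000 test users had no unknown course scoring at or above the threshold. Averaged over all 1000 test users, this is roughly 10.6 recommendations per user.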
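
Because `count_df` is built from `res_df`, users without any recommendation do not appear in it. The sketch below, on made-up data, shows how the mean changes when such users are counted as 0; `toy_test_users` is a hypothetical list of all test user IDs.

```python
import pandas as pd

# Toy recommendation results: users 1 and 2 received recommendations, user 3 none
toy_res = pd.DataFrame({"USER": [1, 1, 1, 2], "COURSE_ID": ["A", "B", "C", "A"]})
toy_test_users = [1, 2, 3]

per_user = toy_res.groupby("USER").size()

# Mean over users with at least one recommendation
mean_with_recs = per_user.mean()
print(mean_with_recs)  # 2.0

# Mean over all test users, counting users without recommendations as 0
mean_all_users = per_user.reindex(toy_test_users, fill_value=0).mean()
print(mean_all_users)
```

Which of the two averages is more useful depends on the question: the first describes the typical list length for users who get a list, the second describes coverage across the whole test set.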

Next, we find the 10 most frequently recommended courses:

In [29]:
courses_1 = pd.DataFrame(res_df.groupby('COURSE_ID').size().sort_values(ascending=False),columns=['Count']).reset_index().rename_axis(index=None, columns=None).head(10)
courses_1

Unnamed: 0,COURSE_ID,Count
0,TA0106EN,379
1,ML0122EN,351
2,RP0105EN,343
3,TMP0105EN,341
4,SC0103EN,306
5,ML0101EN,304
6,BD0212EN,299
7,DX0108EN,251
8,TMP107,251
9,BD0143EN,245


In [30]:
mask = course_genres_df['COURSE_ID'].isin(courses_1['COURSE_ID'].values)
df_2 = course_genres_df[mask][['COURSE_ID','TITLE']].reset_index(drop=True)

pd.merge(courses_1,df_2, left_on='COURSE_ID', right_on='COURSE_ID')


Unnamed: 0,COURSE_ID,Count,TITLE
0,TA0106EN,379,text analytics at scale
1,ML0122EN,351,accelerating deep learning with gpu
2,RP0105EN,343,analyzing big data in r using apache spark
3,TMP0105EN,341,getting started with the data apache spark ma...
4,SC0103EN,306,spark overview for scala analytics
5,ML0101EN,304,machine learning with python
6,BD0212EN,299,spark fundamentals ii
7,DX0108EN,251,data science bootcamp with python for universi...
8,TMP107,251,data science bootcamp with python
9,BD0143EN,245,using hbase for real time access to your big data


The top-10 commonly recommended courses across all test users are listed above.

The winner with 379 recommendations is **'text analytics at scale'**.
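
A shorter way to get the same kind of ranking is `value_counts`, shown here on made-up data as an alternative to the `groupby(...).size().sort_values(...)` chain used above.

```python
import pandas as pd

toy_res = pd.DataFrame({"COURSE_ID": ["A", "B", "A", "C", "A", "B"]})

# value_counts sorts by frequency in descending order; head(n) keeps the top n
top_courses = toy_res["COURSE_ID"].value_counts().head(2)
print(top_courses.to_dict())  # {'A': 3, 'B': 2}
```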

----

In the next notebook, we use unsupervised machine learning to cluster similar items so that we can build a clustering-based recommender system.