<a href="https://colab.research.google.com/github/ehoppenstedt/recommendation_systems/blob/main/course_recomendation_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Course Recommendations

This is a course recommendation system.
It is trained on data from an EdTech platform where users can select courses and create customized learning paths.
I check how users combine courses, then use k-nearest neighbors to build a system that suggests the 5 courses most likely to be combined with the course you are viewing.

This is useful because it can suggest learning paths to users based on what other users consider "adequate" combinations. The recommendation system was inspired by Spotify's playlist creator.

The steps here are:
1. Preparation of the data
2. A quick check for sparsity in the data to determine whether the chosen model is adequate
3. The model and recommendation system
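
The idea of checking how courses are combined can be sketched on toy data before touching the real dataset. The learning paths and course IDs below are hypothetical, and plain pair counting is used here as a simpler stand-in for the k-nearest neighbors model built later; it is only meant to illustrate what "courses combined together" means.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy learning paths: each inner list holds the course IDs in one path.
toy_paths = [
    [101, 102, 103],
    [101, 102],
    [102, 103],
    [101, 104],
]

# Count how often each unordered pair of courses appears in the same path.
pair_counts = Counter()
for path in toy_paths:
    for a, b in combinations(sorted(set(path)), 2):
        pair_counts[(a, b)] += 1

def most_combined_with(course_id, k=5):
    """Return up to k courses most often combined with course_id."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == course_id:
            scores[b] += n
        elif b == course_id:
            scores[a] += n
    return [course for course, _ in scores.most_common(k)]

print(most_combined_with(101))
```

In this toy data, course 102 shares two paths with course 101, so it ranks first.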


# **Preparation of the Data**

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

In [7]:
#read both data sources and append them to have one df
df1 = pd.read_csv('./lp_1M.csv')
df2 = pd.read_csv('./lp_2M.csv')

# Concatenate the two dataframes
df = pd.concat([df1, df2], ignore_index=True)

In [163]:
# Save the whole DataFrame.
# The file is fairly large, so I advise caution.
df.to_csv(r'./lps.csv')

In [131]:
df.shape, df.columns

((2000000, 6),
 Index(['id', 'title', 'courses', 'order', 'level', 'user_learning_path_id'], dtype='object'))

In [132]:
print(type(df['courses'].iloc[0]))

<class 'str'>


In [10]:
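# 'courses' is stored as a string (see the type check above), so it would need to be split into a list of course IDs first:
#df['courses'] = df['courses'].apply(lambda x: x.strip('{}').split(','))
# That line was skipped in this run, so MultiLabelBinarizer treats each character of the string as a label
# (hence the ',', digit, '{' and '}' columns in the output below).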

# Convert list of courses to one-hot encoded vectors
mlb = MultiLabelBinarizer()
encoded_courses = pd.DataFrame(mlb.fit_transform(df['courses']), columns=mlb.classes_, index=df.index)

# Combine original dataframe with encoded courses
df_combined = pd.concat([df, encoded_courses], axis=1)
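
Since the type check above shows that `courses` holds strings, a parsing step is needed before the binarizer can see whole course IDs. The sketch below assumes the strings look like `{12,34}`, which is inferred from the commented-out `strip('{}')` line and is not confirmed by the data shown. The sample rows and the `parse_courses` helper are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample in the assumed '{id,id,...}' string format.
toy = pd.DataFrame({"courses": ["{12,34}", "{34,56,78}", "{12}"]})

def parse_courses(raw):
    """Turn '{12,34}' into [12, 34]; an empty '{}' becomes an empty list."""
    inner = raw.strip("{}")
    return [int(part) for part in inner.split(",") if part.strip()]

course_lists = toy["courses"].apply(parse_courses)

# One column per course ID, one row per learning path.
toy_mlb = MultiLabelBinarizer()
toy_encoded = pd.DataFrame(
    toy_mlb.fit_transform(course_lists),
    columns=toy_mlb.classes_,
    index=toy.index,
)
print(toy_encoded)
```

With parsed lists, the encoded columns are course IDs (12, 34, 56, 78 here) rather than single characters.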


In [11]:
df_combined.iloc[:, 4:].apply(pd.Series.unique)

level                    [custom, optional, advanced, intermediate, bas...
user_learning_path_id    [128322, 128323, 128324, 128325, 128326, 12832...
,                                                                   [1, 0]
0                                                                   [1, 0]
1                                                                   [1, 0]
2                                                                   [0, 1]
3                                                                   [1, 0]
4                                                                   [0, 1]
5                                                                   [0, 1]
6                                                                   [1, 0]
7                                                                   [1, 0]
8                                                                   [1, 0]
9                                                                   [1, 0]
{                        

In [15]:
df_combined.select_dtypes(include=[np.number])

              id  order  user_learning_path_id  ,  0  1  2  3  4  5  6  7  8  9  {  }
0         107416      5                 128322  1  1  1  0  1  0  0  1  1  1  1  1  1
1         107417      5                 128323  1  1  1  1  1  1  1  1  1  1  1  1  1
2         107418      5                 128324  1  1  1  0  0  1  0  0  0  0  1  1  1
3         107419      5                 128325  1  1  1  1  0  0  1  0  1  1  1  1  1
4         107420      5                 128326  0  0  0  0  0  1  0  0  0  0  1  1  1
...          ...    ...                    ... .. .. .. .. .. .. .. .. .. .. .. .. ..
1999995  1495364      5                4762414  1  1  1  0  0  1  0  1  1  1  1  1  1
1999996  1495363      5                4762402  1  1  1  0  0  1  0  1  1  1  1  1  1
1999997  1495362      5                4762387  1  1  1  0  0  1  0  1  1  1  1  1  1
1999998  1495361      5                4762369  1  1  1  0  0  1  0  1  1  1  1  1  1


# Sparsity of Data Test

Formula:

$\text{Sparsity} = \left(\frac{\text{Number of Missing Interactions}}{\text{Total Possible Interactions}}\right) \times 100\% $

Sparsity of the data: In a recommendation system, we often deal with sparse data, where most users have not interacted with most items (courses, in this case). The sparser the data, the more data is generally needed to make accurate recommendations.
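
As a worked instance of the formula above, the sketch below computes sparsity for a small user-by-course matrix. The 4 x 5 matrix is hypothetical and only illustrates the arithmetic; it is not derived from the platform data.

```python
import numpy as np

# Hypothetical 4 users x 5 courses matrix; 1 means the user took the course.
interactions = np.array([
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
])

n_possible = interactions.size               # 4 * 5 = 20 possible interactions
n_observed = np.count_nonzero(interactions)  # 6 observed interactions
n_missing_toy = n_possible - n_observed      # 14 missing interactions

toy_sparsity = n_missing_toy / n_possible * 100
print(f"{toy_sparsity:.1f}%")  # prints 70.0%
```

Here 14 of the 20 possible user-course pairs are missing, which gives a sparsity of 70%.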


In [14]:
import numpy as np

# Note: `df` is the raw learning-path table (one row per learning path), not a user-by-course matrix,
# so this measures the percentage of empty (NaN) cells in that table.
n_total = np.prod(df.shape)  # total number of cells
n_missing = df.isnull().sum().sum()  # number of empty (NaN) cells

sparsity = (n_missing / n_total) * 100  # already expressed as a percentage

sparsity


0.14396666666666666

Given that my data consists of 2 million observations and the computed sparsity is about 0.14% (the code above already multiplies by 100, so the table is far from sparse), the K-Nearest Neighbors (KNN) approach could potentially work well for the recommendation system.

A potential approach could be to start with KNN to establish a baseline, then try a more complex deep learning model and see if it offers significant improvements.

We might also consider other recommendation system algorithms like matrix factorization techniques, or even hybrid systems that combine the advantages of multiple approaches.

# Model building

In [16]:
from sklearn.neighbors import NearestNeighbors

In [130]:
# Building the model
knn = NearestNeighbors(metric='cosine', algorithm='brute')
# Keep only the numeric columns (IDs, order, and the one-hot encoded columns)
df_combined_numerical = df_combined.select_dtypes(include=[np.number])
knn.fit(df_combined_numerical)

# Defining the recommendation function
# Note: course_id is used here as a row position in df_combined_numerical
def recommend_courses(course_id, num_recommendations=5):
    # Query with the same numeric columns the model was fitted on
    query = df_combined_numerical.iloc[[course_id]]
    distances, indices = knn.kneighbors(query, n_neighbors=num_recommendations)
    return df_combined_numerical.index[indices.flatten()].tolist()
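
For comparison, here is a minimal sketch of an item-based variant of the same nearest-neighbors idea, in which each row fed to the model describes a course rather than a learning path. It is not the notebook's fitted model: the toy matrix, the course IDs, and the `similar_courses` helper are all hypothetical. It only uses `NearestNeighbors` with the same `metric='cosine'` and `algorithm='brute'` settings as above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: rows are learning paths, columns are courses.
toy_course_ids = np.array([101, 102, 103, 104])
path_by_course = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

# Transpose so each row describes one course by the paths it appears in.
course_vectors = path_by_course.T

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute")
toy_knn.fit(course_vectors)

def similar_courses(course_id, k=2):
    """Return the k courses whose path membership is closest to course_id."""
    row = int(np.where(toy_course_ids == course_id)[0][0])
    # Ask for one extra neighbor so the query course itself can be dropped.
    _, idx = toy_knn.kneighbors(course_vectors[[row]], n_neighbors=k + 1)
    neighbors = [int(i) for i in idx[0] if i != row][:k]
    return toy_course_ids[neighbors].tolist()

print(similar_courses(101))
```

Because the rows are courses, the returned neighbors are course IDs directly, and the query course is filtered out of its own results.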


# Finding courses in the course catalogue

In [21]:
org_df = pd.read_csv('./course_catalogue.csv')
org_df.columns

Index(['course_id', 'title', 'description', 'slug', 'course_launch_date'], dtype='object')

In [22]:
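# Look up course titles in the course catalogue (org_df) by course_id.
# Note: isin() returns the titles in catalogue order, not in the order of course_ids.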
def get_course_titles(course_ids):
    return org_df.loc[org_df['course_id'].isin(course_ids), 'title'].tolist()
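
Filtering with `isin` returns titles in catalogue order. If the ranking of the recommendations should be kept, one option is to look titles up one ID at a time. The sketch below assumes `course_id` values are unique in the catalogue; the toy catalogue and the `get_titles_in_order` helper are hypothetical.

```python
import pandas as pd

# Hypothetical catalogue with the same 'course_id' and 'title' columns as org_df.
toy_catalogue = pd.DataFrame({
    "course_id": [10, 20, 30],
    "title": ["Course A", "Course B", "Course C"],
})

def get_titles_in_order(course_ids, catalogue):
    """Return titles in the same order as course_ids, skipping unknown IDs."""
    # Assumes course_id is unique, so each lookup yields a single title.
    title_by_id = catalogue.set_index("course_id")["title"]
    return [title_by_id[cid] for cid in course_ids if cid in title_by_id.index]

print(get_titles_in_order([30, 10, 99], toy_catalogue))  # prints ['Course C', 'Course A']
```

Unknown IDs such as 99 are skipped here rather than raising an error; that is a design choice for the sketch, not something the notebook specifies.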

# Making this interactive

In [128]:
# Return the nearest neighbors of a course from the fitted model
def get_top5_recommendations(model, course_id, n_total):
    # n_total - 1 neighbors are requested; with n_total=6 that is 5 results
    distances, indices = model.kneighbors(df_combined_numerical.loc[[course_id]], n_neighbors=n_total-1)
    return indices[0]  # Neighbors of the single query row; the closest one is usually the input itself

# Handle user input and print recommendations
# (this redefines the earlier recommend_courses function)
def recommend_courses():
    course_id = int(input("Please enter a course ID: "))  # IDs are integers

    # Get the top 5 recommendations
    recommended_ids = get_top5_recommendations(knn, course_id, 6)

    # Get course titles
    recommended_titles = get_course_titles(recommended_ids)

    print(f"Courses recommended for: id: {course_id} Title: {recommended_titles[0]}")
    for title in recommended_titles:
        print(title)


In [129]:
recommend_courses()

Please enter a course ID: 1753




Courses recommended for: id: 1753 Title: Curso de Introducción a la Terminal y Línea de Comandos 2019
Curso de Introducción a la Terminal y Línea de Comandos 2019
Curso de Web Scraping: Extracción de Datos en la Web
Curso de Gestión Efectiva del Tiempo
Curso de Fidelización de Clientes
Curso de CRM con Salesforce 2019
