# User-Based Collaborative Filtering

A user-based collaborative filtering recommender suggests items to a target user based on the preferences of other users with similar tastes.

The underlying assumption is that users who agreed in the past tend to agree again in the future.

We analyse user ratings of courses to identify users with similar tastes, then use their past opinions to predict which courses the current user might like.
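To make the idea concrete, here is a minimal sketch of user-based collaborative filtering on a toy rating matrix. It assumes cosine similarity between users and a similarity-weighted sum of the neighbours' ratings. The function name `recommend_user_based` and the toy data are hypothetical, and this is not necessarily how `UserBasedCF` in `coursemate` is implemented.

```python
import numpy as np

def recommend_user_based(ratings, target, n_neighbors=2, k=2):
    """Rank unseen items for `target` using its most similar users.

    ratings: 2-D array (users x items) where 0 means "not rated".
    """
    ratings = np.asarray(ratings, dtype=float)

    # Cosine similarity between the target user and every user
    norms = np.linalg.norm(ratings, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    sims = (ratings @ ratings[target]) / (norms * norms[target])
    sims[target] = -np.inf  # never pick the target as its own neighbour

    # Keep the n most similar users
    neighbors = np.argsort(sims)[::-1][:n_neighbors]

    # Similarity-weighted sum of the neighbours' ratings per item
    scores = sims[neighbors] @ ratings[neighbors]
    scores[ratings[target] > 0] = -np.inf  # hide items already rated

    ranked = np.argsort(scores)[::-1][:k]
    # Only return items that at least one neighbour actually rated
    return [int(i) for i in ranked if scores[i] > 0]

# Toy data (assumption): 4 users x 4 courses
toy_ratings = [
    [5, 4, 0, 0],  # target user
    [5, 5, 4, 0],  # similar taste, also rated course 2
    [0, 0, 5, 4],  # no overlap with the target
    [4, 5, 5, 0],  # similar taste, also rated course 2
]
toy_result = recommend_user_based(toy_ratings, target=0)
print(toy_result)
```

Users 1 and 3 overlap most with the target, and both rated course 2, so it is the only course recommended here.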

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from coursemate.dataset import Dataset
from coursemate.model import UserBasedCF

pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)


In [2]:
dataset = Dataset('data/Coursera_courses.csv', 'data/Coursera.csv', 'data/Coursera_reviews.csv')
dataset.set_interaction_counts(3, 50)
dataset.show_dataset_details()

Loading Coursera courses...
Loading Coursera reviews...
Segmenting out students with less than 3 or more than 50 reviews...
30719 students, 468 courses, 174219 reviews
Sparsity: 1.21%
Duplicates: 4.54%
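The reported figure appears to be the fraction of filled cells in the student-course matrix (often called density). This is inferred from the numbers in the log above, not from the `Dataset` source, so treat it as an assumption. A quick check of the arithmetic:

```python
# Counts copied from the log output above
n_students, n_courses, n_reviews = 30719, 468, 174219

# Share of (student, course) pairs that have a review
density = n_reviews / (n_students * n_courses)
print(f"{density:.2%}")  # prints 1.21%
```

In other words, roughly 99% of the possible student-course pairs have no rating, which is typical for recommender datasets.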


In [3]:
dataset.set_train_test_split_by_user()

Setting the train-test split by user...


In [4]:
train_Xmatrix, train_ymatrix, df_train_X, df_train_y = dataset.get_train_matrix_split(ratio=0.8)
test_Xmatrix, test_ymatrix, df_test_X, df_test_y = dataset.get_test_matrix_split(ratio=0.5)

Computing the training and test rating matrix...


131100it [00:08, 15194.05it/s]


Computing the test rating matrix split...
Computing the training and test rating matrix...


131100it [00:08, 14745.26it/s]
43119it [00:02, 16292.72it/s]


In [5]:
user_based_cf_model = UserBasedCF()

In [6]:
all_train = pd.concat([df_train_X, df_train_y])
user_based_cf_model.fit(all_train)

In [7]:
# Get recommendations for a particular user
user_id = 'By Kelvin k'
recommendations = user_based_cf_model.recommend(user_id, k=5)

In [8]:
recommendations

['aws-fundamentals-going-cloud-native',
 'information-security-data',
 'sql-for-data-science',
 'python-basics',
 'intro-sql']

### Evaluation

In [9]:
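def calculate_hit_rate(model, test_X, test_y, k=5):
    """Return the share of test users with at least one held-out course in their top-k list."""
    hit_count = 0
    total = 0

    # NOTE: relies on the notebook-level `dataset` for the list of test users
    for user_id in tqdm(dataset.test_students):
        # Courses held out as the prediction target for this user
        actual_next_courses = test_y[test_y['reviewers'] == user_id]['course_id'].values

        # Skip users with nothing to predict
        if len(actual_next_courses) == 0:
            continue

        # test_X (the user's observed history) is not passed to the model here,
        # so the recommendations depend only on what the model saw in fit()
        recommended_courses = model.recommend(user_id, k=k)

        # A hit means at least one held-out course appears in the top-k list
        if any(course in recommended_courses for course in actual_next_courses):
            hit_count += 1
        total += 1

    # Overall hit rate; 0.0 if no user had any held-out courses
    return hit_count / total if total else 0.0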
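The metric computed by `calculate_hit_rate` can be checked by hand on a tiny example. The sketch below is self-contained: `StubModel` is a hypothetical stand-in that returns fixed lists (it is not the `coursemate` model), and `hit_rate` is an illustrative variant that groups the held-out frame by user instead of filtering it once per user.

```python
import pandas as pd

class StubModel:
    """Hypothetical stand-in that returns fixed recommendation lists."""
    def __init__(self, recs):
        self.recs = recs

    def recommend(self, user_id, k=5):
        # Unknown users get an empty list
        return self.recs.get(user_id, [])[:k]

def hit_rate(model, test_y, k=5):
    """Share of users with at least one held-out course in their top-k."""
    hits = 0
    total = 0
    for user_id, group in test_y.groupby('reviewers'):
        held_out = set(group['course_id'])
        recommended = set(model.recommend(user_id, k=k))
        hits += bool(held_out & recommended)  # True counts as 1
        total += 1
    return hits / total if total else 0.0

# Toy held-out data (assumption): three users, four held-out courses
toy_test_y = pd.DataFrame({
    'reviewers': ['a', 'a', 'b', 'c'],
    'course_id': ['c1', 'c2', 'c3', 'c4'],
})
toy_model = StubModel({'a': ['c9', 'c2'], 'b': ['c7']})

rate_at_2 = hit_rate(toy_model, toy_test_y, k=2)
rate_at_1 = hit_rate(toy_model, toy_test_y, k=1)
print(rate_at_2, rate_at_1)
```

With `k=2` only user `a` gets a hit (course `c2`), so the rate is 1/3. With `k=1` the list for `a` is cut to `['c9']` and the rate drops to 0.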

In [10]:
user_based_cf_model = UserBasedCF()

all_train = pd.concat([df_train_X, df_train_y])

user_based_cf_model.fit(all_train)

In [11]:
calculate_hit_rate(user_based_cf_model, df_test_X, df_test_y, k=10)

100%|██████████| 7680/7680 [02:59<00:00, 42.78it/s]


0.0

In [12]:
calculate_hit_rate(user_based_cf_model, df_test_X, df_test_y, k=5)

100%|██████████| 7680/7680 [03:29<00:00, 36.57it/s]


0.0
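A hit rate of exactly 0.0 at both `k=5` and `k=10` is worth investigating before drawing conclusions about the model. One possible explanation, which is an assumption and not confirmed by the outputs above, is that splitting by user leaves the test users absent from the frame passed to `fit()`, so `recommend()` has no history for them. A minimal check of that hypothesis, using the `reviewers` column name from the notebook and a hypothetical helper `unseen_user_share`:

```python
import pandas as pd

def unseen_user_share(train_df, test_df, user_col='reviewers'):
    """Fraction of test users that never appear in the training frame."""
    train_users = set(train_df[user_col])
    test_users = set(test_df[user_col])
    if not test_users:
        return 0.0
    return len(test_users - train_users) / len(test_users)

# Toy frames (assumption) mimicking a split by user
toy_train = pd.DataFrame({'reviewers': ['a', 'b'], 'course_id': ['c1', 'c2']})
toy_test = pd.DataFrame({'reviewers': ['c', 'd'], 'course_id': ['c1', 'c3']})

share = unseen_user_share(toy_train, toy_test)
print(share)
```

In the notebook this would be called as `unseen_user_share(all_train, df_test_X)`. If the share is close to 1, one option to try is fitting on `pd.concat([all_train, df_test_X])` so that each test user's observed history is visible to the model while `df_test_y` stays held out.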