# Implementing Recommender Systems - Lab

## Introduction

In this lab, you'll practice creating a recommender system model using `surprise`. You'll also get the chance to create a more complete recommender system pipeline to obtain the top recommendations for a specific user.


## Objectives

In this lab you will: 

- Use surprise's built-in `Reader` class to process data to work with recommender algorithms 
- Obtain a prediction for a specific user for a particular item 
- Introduce a new user with ratings to the rating matrix and make recommendations for them 
- Create a function that will return the top n recommendations for a user 


For this lab, we will be using the well-known MovieLens dataset (the `ml-latest-small` version, with roughly 100,000 ratings). It contains a collection of user ratings for many different movies. In the last lesson, you worked with the datasets built into `surprise`. In this lab, you will also go through the process of reading a dataset into the `surprise` dataset format. To begin, load the dataset into a Pandas DataFrame. Determine which columns are necessary for your recommendation system and drop any extraneous ones.

In [1]:
import pandas as pd
df = pd.read_csv('./ml-latest-small/ratings.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [2]:
# Drop unnecessary columns
new_df = df[['userId', 'movieId', 'rating']]

It's now time to transform the dataset into something compatible with `surprise`. To do this, you'll need the `Reader` and `Dataset` classes. `Dataset` has a method specifically for loading DataFrames: `Dataset.load_from_df()`.

In [3]:
!pip install scikit-surprise


  error: subprocess-exited-with-error
  
  × Building wheel for scikit-surprise (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [155 lines of output]
      !!
      
              ********************************************************************************
              Please use a simple string containing a SPDX expression for `project.license`. You can also use `project.license-files`. (Both options available on setuptools>=77.0.0).
      
              By 2026-Feb-18, you need to update your project and remove deprecated calls
              or your builds will no longer be supported.
      
              See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
              ********************************************************************************
      
      !!
        corresp(dist, value, root_dir)
      !!
      
              ********************************************************************************

Collecting scikit-surprise
  Using cached scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies: started
  Installing build dependencies: still running...
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml): started
  Building wheel for scikit-surprise (pyproject.toml): finished with status 'error'
Failed to build scikit-surprise


In [4]:
import sys
!"{sys.executable}" -m pip install scikit-surprise


Collecting scikit-surprise
  Using cached scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'


  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [44 lines of output]
      Compiling surprise/similarities.pyx because it changed.
      Compiling surprise/prediction_algorithms/matrix_factorization.pyx because it changed.
      Compiling surprise/prediction_algorithms/optimize_baselines.pyx because it changed.
      Compiling surprise/prediction_algorithms/slope_one.pyx because it changed.
      Compiling surprise/prediction_algorithms/co_clustering.pyx because it changed.
      [1/5] Cythonizing surprise/prediction_algorithms/co_clustering.pyx
      
      Error compiling Cython file:
      ------------------------------------------------------------
      ...
              self.avg_cltr_i = avg_cltr_i
              self.avg_cocltr = avg_cocltr
      
              return self
      
          def compute_averages(self, np.ndarray[np.int_t] cltr_u,
                                                

In [4]:
from surprise import Reader, Dataset
print("Surprise imported successfully 🎉")


Surprise imported successfully 🎉


In [5]:
import sys
print(sys.executable)


c:\Users\Moringa School\anaconda3\python.exe


In [6]:
from surprise import Reader, Dataset
# read in values as Surprise dataset 
# Define the rating scale
reader = Reader(rating_scale=(0.5, 5.0))

# Load the DataFrame into a Surprise Dataset
data = Dataset.load_from_df(new_df[['userId', 'movieId', 'rating']], reader)


Let's look at how many users and items we have in our dataset. If we use neighborhood-based methods, these counts will help us decide whether to compute user-user or item-item similarity.

In [7]:
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  610 

Number of items:  9724
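
The counts above matter because a neighborhood model needs one similarity value per pair of users (user-user) or per pair of items (item-item). The following is a rough sketch of that arithmetic using the counts printed above; the helper name `n_pairs` is hypothetical and not part of `surprise`.

```python
def n_pairs(n):
    """Number of distinct unordered pairs among n entities."""
    return n * (n - 1) // 2

n_users, n_items = 610, 9724  # counts printed by the cell above

user_pairs = n_pairs(n_users)  # similarities needed for a user-user model
item_pairs = n_pairs(n_items)  # similarities needed for an item-item model

print(f"user-user pairs: {user_pairs:,}")
print(f"item-item pairs: {item_pairs:,}")
print(f"item-item needs about {item_pairs / user_pairs:.0f}x more similarities")
```

With far fewer users than items, the user-user similarity matrix is much smaller here. In `surprise`, this choice is made through the `user_based` key of the `sim_options` dictionary passed to a KNN algorithm.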


## Determine the best model 

Now, compare the different models and see which ones perform best. For consistency's sake, use RMSE to evaluate models. Remember to cross-validate! Can you get a model with a lower average RMSE on test data than 0.869?
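
As a reminder of what is being measured, RMSE is the square root of the mean squared difference between actual and predicted ratings, so lower values are better. Below is a minimal sketch of the calculation for a single test fold; the function name `rmse` and the ratings are made up for illustration and do not come from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of ratings."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings for illustration only
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 2.5]

fold_score = rmse(actual, predicted)
print(round(fold_score, 4))  # 0.6124
```

Cross-validation repeats this on each held-out fold and reports the mean across folds.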

In [8]:
# importing relevant libraries
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV
import numpy as np

In [9]:
## Perform a gridsearch with SVD
# ⏰ This cell may take several minutes to run
# Define the parameter grid for SVD
param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30, 40],
    'lr_all': [0.002, 0.005, 0.007],
    'reg_all': [0.02, 0.05, 0.1]
}

# Perform Grid Search
gs = GridSearchCV(
    SVD,
    param_grid,
    measures=['rmse'],
    cv=5,          # 5-fold cross validation
    joblib_verbose=2,
    n_jobs=-1
)

gs.fit(data)

# Print best RMSE and parameters
print("Best RMSE:", gs.best_score['rmse'])
print("Best Params:", gs.best_params['rmse'])


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   19.9s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed:  5.5min


Best RMSE: 0.8550552663141554
Best Params: {'n_factors': 150, 'n_epochs': 40, 'lr_all': 0.007, 'reg_all': 0.1}


[Parallel(n_jobs=-1)]: Done 405 out of 405 | elapsed:  6.9min finished
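
The 405 tasks in the joblib log above follow directly from the size of the grid: every combination of hyperparameter values is fitted once per cross-validation fold. A short sketch of that count, reusing the same `param_grid` (the variable names `combinations` and `n_fits` are illustrative):

```python
from itertools import product

param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30, 40],
    'lr_all': [0.002, 0.005, 0.007],
    'reg_all': [0.02, 0.05, 0.1],
}
cv_folds = 5

# One dictionary per combination of hyperparameter values
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

n_fits = len(combinations) * cv_folds
print(len(combinations), "combinations x", cv_folds, "folds =", n_fits, "fits")
```

Adding one more value to any list, or one more hyperparameter, multiplies the run time accordingly, which is worth keeping in mind before widening the grid.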


In [10]:
# print out optimal parameters for SVD after GridSearch
# Print best RMSE score
print("Best RMSE score:", gs.best_score['rmse'])

# Print best parameters
print("Best parameters for SVD:", gs.best_params['rmse'])


Best RMSE score: 0.8550552663141554
Best parameters for SVD: {'n_factors': 150, 'n_epochs': 40, 'lr_all': 0.007, 'reg_all': 0.1}


In [11]:
# cross validating with KNNBasic
# Define the KNNBasic algorithm
knn = KNNBasic()

# Cross-validate using RMSE
results = cross_validate(knn, data, measures=['rmse'], cv=5, verbose=True)

# Print the average RMSE
print("Average RMSE:", np.mean(results['test_rmse']))

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9467  0.9434  0.9442  0.9413  0.9538  0.9459  0.0043  
Fit time          0.40    0.53    0.51    0.44    0.40    0.46    0.05    
Test time         3.19    3.62    3.19    2.49    2.29    2.96    0.49    
Average RMSE: 0.945853474790642


In [12]:
# print out the average RMSE score for the test set
# Define the KNNBasic algorithm
knn = KNNBasic()

# Perform 5-fold cross validation
results = cross_validate(knn, data, measures=['rmse'], cv=5, verbose=True)

# Calculate average RMSE on test set
avg_rmse = np.mean(results['test_rmse'])

print("Average RMSE (Test Set):", avg_rmse)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9456  0.9532  0.9552  0.9474  0.9457  0.9494  0.0040  
Fit time          0.43    0.67    0.49    0.30    0.58    0.50    0.13    
Test time         3.39    2.83    2.76    1.80    2.79    2.71    0.51    
Average RMSE (Test Set): 0.9494257140724554


In [13]:
# cross validating with KNNBaseline
# Define the KNNBaseline algorithm
knn_baseline = KNNBaseline()

# Perform 5-fold cross validation
results = cross_validate(knn_baseline, data, measures=['rmse'], cv=5, verbose=True)

# Calculate average RMSE on test set
avg_rmse = np.mean(results['test_rmse'])

print("Average RMSE (Test Set):", avg_rmse)


Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8743  0.8777  0.8721  0.8743  0.8788  0.8754  0.0024  
Fit time          1.20    0.85    1.00    0.94    1.78    1.15    0.33    
Test time         6.67    2.28    4.58    4.25    3.17    4.19    1.48    
Average RMSE (Test Set): 0.8754400754264673


In [14]:
# print out the average score for the test set
# Define the KNNBaseline algorithm
knn_baseline = KNNBaseline()

# Perform 5-fold cross validation
results = cross_validate(knn_baseline, data, measures=['rmse'], cv=5, verbose=True)

# Calculate average RMSE on test set
avg_rmse = np.mean(results['test_rmse'])

print("Average RMSE (Test Set):", avg_rmse)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8781  0.8708  0.8747  0.8777  0.8742  0.8751  0.0026  
Fit time          0.93    0.69    0.89    0.75    0.88    0.83    0.09    
Test time         2.68    3.28    2.75    2.72    2.71    2.83    0.23    
Average RMSE (Test Set): 0.8751107449021346


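Based on these outputs, the best-performing model is SVD: in the run above, the grid search reached an RMSE of about 0.855 with `n_factors = 150`, `n_epochs = 40`, `lr_all = 0.007`, and `reg_all = 0.1`, compared with about 0.95 for KNNBasic and about 0.875 for KNNBaseline. The cells below use an SVD model with `n_factors = 50` and a regularization rate of 0.05; feel free to substitute the grid-search parameters, or any model you found that performs better, to make some predictions.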

## Making Recommendations

It's important that the output for the recommendation is interpretable to people. Rather than returning the `movie_id` values, it would be far more valuable to return the actual title of the movie. As a first step, let's read in the movies to a dataframe and take a peek at what information we have about them.

In [15]:
df_movies = pd.read_csv('./ml-latest-small/movies.csv')

In [16]:
df_movies.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


## Making simple predictions
Just as a reminder, let's look at how you make a prediction for an individual user and item. First, we'll fit the SVD model we had from before.

In [17]:
svd = SVD(n_factors= 50, reg_all=0.05)
svd.fit(dataset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2b57d6c6ed0>

In [18]:
svd.predict(2, 4)

Prediction(uid=2, iid=4, r_ui=None, est=3.106333630907246, details={'was_impossible': False})

This prediction is a named tuple, so each value within it can be accessed by index or by field name (for example, `est` holds the predicted rating). Now let's put our knowledge of recommendation systems to use by doing something interesting: making predictions for a new user!
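
To make that access concrete, here is a small sketch that rebuilds the prediction shown above with a stand-in named tuple. The `Prediction` defined here only mirrors the field names in the output; it is an illustration, not `surprise`'s own class definition.

```python
from collections import namedtuple

# Stand-in with the same field names as the Prediction printed above
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])

pred = Prediction(uid=2, iid=4, r_ui=None, est=3.106333630907246,
                  details={'was_impossible': False})

print(pred[3])   # positional access to the estimated rating
print(pred.est)  # the same value by field name

# Tuple unpacking also works
uid, iid, r_ui, est, details = pred
```

Accessing fields by name is usually easier to read than positional indexing.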

## Obtaining User Ratings 

It's great that we have working models, but wouldn't it be nice to get recommendations specifically tailored to your preferences? That's what we'll be doing now. The first step is to create a function that presents users with randomly selected movies and asks them to rate each one. If they have not seen a movie, they should be able to skip rating it. 

The function `movie_rater()` should take as parameters: 

* `movie_df`: DataFrame - a dataframe containing the movie ids, name of movie, and genres
* `num`: int - number of ratings
* `genre`: string - a specific genre from which to draw movies

The function returns:
* `rating_list`: list - a collection of dictionaries in the format `{'userId': int, 'movieId': int, 'rating': float}`

#### This function is optional, but fun :) 

In [19]:
def movie_rater(movie_df,num, genre=None):
    pass
        

In [21]:
# try out the new function here!


import random

def movie_rater(movie_df, num, genre=None):
    """
    Asks the user to rate a number of movies.
    Returns a list of dictionaries: {'userId': int, 'movieId': int, 'rating': float}
    """

    # Filter by genre if provided
    if genre:
        movie_df = movie_df[movie_df['genres'].str.contains(genre, regex=False)]

    rating_list = []

    # Randomly pick movies
    sampled_movies = movie_df.sample(num)

    # Create a new user ID: one above the largest movieId in movie_df.
    # In this dataset movieIds run far higher than userIds, so the new ID should not collide.
    new_user_id = movie_df['movieId'].max() + 1

    for _, row in sampled_movies.iterrows():
        print(f"\nMovie: {row['title']} | Genres: {row['genres']}")
        rating = input("Rate this movie from 1-5 (or press enter to skip): ")

        if rating.strip() == "":
            continue

        try:
            rating = float(rating)
            if rating < 1 or rating > 5:
                print("Rating must be between 1 and 5. Skipping...")
                continue

            rating_list.append({
                'userId': new_user_id,
                'movieId': int(row['movieId']),
                'rating': rating
            })
        except ValueError:
            print("Invalid rating. Skipping...")

    return rating_list


If you're struggling to come up with the above function, you can use the list of user ratings in the output below to complete the next segment.

In [23]:
user_ratings = movie_rater(df_movies, num=5, genre='Comedy')
user_ratings



Movie: Revenge of the Nerds II: Nerds in Paradise (1987) | Genres: Comedy



Movie: Tropic Thunder (2008) | Genres: Action|Adventure|Comedy|War

Movie: In July (Im Juli) (2000) | Genres: Comedy|Romance

Movie: Amazon Women on the Moon (1987) | Genres: Comedy|Sci-Fi

Movie: Journey 2: The Mysterious Island (2012) | Genres: Action|Adventure|Comedy|Sci-Fi|IMAX


[{'userId': 193610, 'movieId': 27178, 'rating': 5.0},
 {'userId': 193610, 'movieId': 4079, 'rating': 5.0},
 {'userId': 193610, 'movieId': 92681, 'rating': 5.0}]

### Making Predictions With the New Ratings
Now that you have new ratings, you can use them to make predictions for this new user. The proper way this should work is:

* add the new ratings to the original ratings DataFrame, read into a `surprise` dataset 
* train a model using the new combined DataFrame
* make predictions for the user
* order those predictions from highest rated to lowest rated
* return the top n recommendations with the text of the actual movie (rather than just the index number) 
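
The last three steps above can be sketched independently of any particular model. The example below is a minimal illustration, not the lab's reference solution: `top_n_for_user`, `toy_predict`, and `toy_scores` are hypothetical names, and the fixed scores stand in for a fitted model.

```python
def top_n_for_user(user_id, all_movie_ids, rated_movie_ids, predict, n=5):
    """Rank movies the user has not rated by predicted score, highest first."""
    rated = set(rated_movie_ids)  # set membership tests are fast
    scored = [(m, predict(user_id, m)) for m in all_movie_ids if m not in rated]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical stand-in for a fitted model: fixed score per movie
toy_scores = {1: 4.3, 2: 3.9, 3: 3.6, 4: 3.4, 5: 4.8}

def toy_predict(user_id, movie_id):
    return toy_scores[movie_id]

top = top_n_for_user(user_id=999, all_movie_ids=[1, 2, 3, 4, 5],
                     rated_movie_ids=[5], predict=toy_predict, n=3)
print(top)  # [(1, 4.3), (2, 3.9), (3, 3.6)]
```

With a fitted `surprise` model, `predict` could be a small wrapper such as `lambda u, m: model.predict(u, m).est`. Movie 5 has the highest score but is excluded because the user has already rated it.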

In [24]:
## add the new ratings to the original ratings DataFrame
# Convert list of dicts to DataFrame
new_ratings_df = pd.DataFrame(user_ratings)

# Append to original ratings DataFrame
combined_ratings = pd.concat([new_df, new_ratings_df], ignore_index=True)

# Load the combined ratings into a Surprise dataset
reader = Reader(rating_scale=(0.5, 5.0))

data_combined = Dataset.load_from_df(
    combined_ratings[['userId', 'movieId', 'rating']],
    reader
)



In [25]:
# train a model using the new combined DataFrame
# Use the best parameters you found
svd_new = SVD(n_factors=50, reg_all=0.05)

# Train on full combined dataset
trainset = data_combined.build_full_trainset()
svd_new.fit(trainset)


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2b5049570d0>

In [26]:
# make predictions for the user
# you'll probably want to create a list of tuples in the format (movie_id, predicted_score)
new_user_id = new_ratings_df['userId'].iloc[0]

# All movie IDs in the dataset
all_movie_ids = df_movies['movieId'].unique()

# Movies already rated by the user
rated_movie_ids = new_ratings_df['movieId'].unique()

# Filter out already rated movies (a set makes each membership test fast)
rated_set = set(rated_movie_ids)
unrated_movies = [m for m in all_movie_ids if m not in rated_set]
predictions = []

for movie_id in unrated_movies:
    pred = svd_new.predict(new_user_id, movie_id)
    predictions.append((movie_id, pred.est))
predictions[:10]



[(1, 4.27833989381878),
 (2, 3.868292434734969),
 (3, 3.6490136211503637),
 (4, 3.449692796862232),
 (5, 3.3606272839430704),
 (6, 4.356401983178466),
 (7, 3.4874386273464895),
 (8, 3.5772367949992705),
 (9, 3.3845526047583148),
 (10, 3.8071623092945805)]

In [27]:
# order the predictions from highest to lowest rated

ranked_movies = sorted(predictions, key=lambda x: x[1], reverse=True)
ranked_movies[:10]


[(1104, 4.762043357909304),
 (1204, 4.728169895600267),
 (318, 4.727411522111893),
 (7153, 4.69346321984254),
 (750, 4.655678304601412),
 (2959, 4.645714475904911),
 (1276, 4.6393233349211185),
 (58559, 4.635212748769364),
 (904, 4.631611297155196),
 (56782, 4.627023075547773)]

For the final component of this challenge, it could be useful to create a function `recommended_movies()` that takes in the parameters:
* `user_ratings`: list - list of tuples formulated as (movie_id, predicted_score), ordered from best to worst for this individual
* `movie_title_df`: DataFrame 
* `n`: int - number of recommended movies 

The function should use a `for` loop to print out each of the *n* recommended movies in order from best to worst.

In [28]:
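# return the top n recommendations using the ranked (movie_id, predicted_score) tuples
def recommended_movies(user_ratings, movie_title_df, n):
    # Map each movieId to its title for quick lookup
    titles = movie_title_df.set_index('movieId')['title']
    for rank, (movie_id, score) in enumerate(user_ratings[:n], start=1):
        print(f"Recommendation #{rank}: {titles[movie_id]} (predicted rating: {score:.2f})")

recommended_movies(ranked_movies, df_movies, 5)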

## Level Up (Optional)

* Try to chain all of the steps together into one function that asks users to rate a certain number of movies, then performs all of the above steps to return the top $n$ recommendations
* Make a recommender system that only returns items that come from a specified genre
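
As a possible starting point for the genre idea, the sketch below filters an already ranked list of `(movie_id, score)` tuples by genre. The names `filter_by_genre` and `genres_by_movie` are hypothetical; the genre strings are copied from the first rows of `movies.csv` shown earlier, and the scores are made up.

```python
def filter_by_genre(ranked, genres_by_movie, genre):
    """Keep (movie_id, score) pairs whose pipe-separated genres include `genre`."""
    return [
        (movie_id, score)
        for movie_id, score in ranked
        if genre in genres_by_movie.get(movie_id, "").split("|")
    ]

# Genres from the first rows of movies.csv; scores are illustrative only
genres_by_movie = {
    1: "Adventure|Animation|Children|Comedy|Fantasy",
    2: "Adventure|Children|Fantasy",
    3: "Comedy|Romance",
}
ranked = [(1, 4.3), (2, 3.9), (3, 3.6)]

comedy_only = filter_by_genre(ranked, genres_by_movie, "Comedy")
print(comedy_only)  # [(1, 4.3), (3, 3.6)]
```

Because the filter preserves the input order, the result is still sorted from best to worst and can be truncated to the top $n$ afterwards.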

## Summary

In this lab, you got the chance to implement a collaborative filtering model and retrieve recommendations from it. You also got the opportunity to add your own ratings to the system to get new recommendations for yourself! Next, you will learn how to use Spark to make recommender systems.