<a href="https://colab.research.google.com/github/arafatro/Recommender-Sys/blob/main/01_Recommender_Systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommender Systems Practice Lab

This lab demonstrates how to build a simple recommender system using the MovieLens dataset. We will cover:
- Data loading and basic statistics
- A simple recommender using mean ratings
- Collaborative filtering using a user-movie sparse matrix
- Finding similar movies with Nearest Neighbors (using cosine similarity)
- Incorporating user bias to filter recommendations

*Note: Ensure you have an active internet connection, as the data is loaded from S3.*


In [None]:
# Import Libraries and Suppress Warnings
import warnings
warnings.simplefilter(action='ignore')

import pandas as pd
import numpy as np
import random
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid")


*   We first suppress warnings to keep the output clean.
*   We import essential libraries:
    *   Pandas and NumPy for data manipulation and numerical operations.
    *   SciPy's `csr_matrix` to efficiently create a sparse matrix (used in collaborative filtering).
*   `NearestNeighbors` from scikit-learn to find similar movies using cosine similarity.
*   For visualization, we import Matplotlib and Seaborn and use the `%matplotlib inline` magic command so that plots appear directly in the notebook.

## Data Loading and Basic Statistics
We load the ratings and movies datasets from S3 and compute some basic statistics.


In [None]:
# Load Datasets from S3
ratings_url = "https://s3-us-west-2.amazonaws.com/recommender-tutorial/ratings.csv"
movies_url = "https://s3-us-west-2.amazonaws.com/recommender-tutorial/movies.csv"

ratings = pd.read_csv(ratings_url)
movies = pd.read_csv(movies_url)

print("Ratings dataset (first 5 rows):")
display(ratings.head())

print("Movies dataset (first 5 rows):")
display(movies.head())

Ratings dataset (first 5 rows):


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Movies dataset (first 5 rows):


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


*   The ratings and movies datasets are loaded directly from S3 URLs.
*   We then display the first five rows of each dataset to ensure the data has been loaded correctly.

In [None]:
# Compute basic statistics
n_movies = ratings['movieId'].nunique()
n_ratings = len(ratings)
n_users = ratings['userId'].nunique()

print(f"Number of ratings: {n_ratings}")
print(f"Number of unique movieIds: {n_movies}")
print(f"Number of unique users: {n_users}")
print(f"Average ratings per user: {round(n_ratings/n_users, 2)}")
print(f"Average ratings per movie: {round(n_ratings/n_movies, 2)}")

Number of ratings: 100836
Number of unique movieIds: 9724
Number of unique users: 610
Average ratings per user: 165.3
Average ratings per movie: 10.37


In [None]:
# Compute user frequency (number of ratings per user)
user_freq = ratings.groupby('userId')['movieId'].count().reset_index()
user_freq.columns = ['userId', 'n_ratings']
print("User frequency (first 5 rows):")
display(user_freq.head())

User frequency (first 5 rows):


Unnamed: 0,userId,n_ratings
0,1,232
1,2,29
2,3,39
3,4,216
4,5,44


*   Here, we group the ratings by `userId` and count the number of movies each user rated.
*   This gives us an idea of user engagement within the dataset.

## Simple Recommender Using Mean Ratings

We compute the average rating for each movie to identify the lowest and highest rated movies.


In [None]:
# Mean Ratings Recommender
mean_rating = ratings.groupby('movieId')['rating'].mean()

# Find movie with lowest and highest average rating
lowest_rated = mean_rating.idxmin()
highest_rated = mean_rating.idxmax()

print("Lowest rated movie:")
display(movies.loc[movies['movieId'] == lowest_rated])

print("Highest rated movie:")
display(movies.loc[movies['movieId'] == highest_rated])


Lowest rated movie:


Unnamed: 0,movieId,title,genres
2689,3604,Gypsy (1962),Musical


Highest rated movie:


Unnamed: 0,movieId,title,genres
48,53,Lamerica (1994),Adventure|Drama


*   The mean rating for each movie is computed.
*   Using these averages, we identify and display the movie with the lowest and highest average ratings.
*   This simple recommender method is based solely on average ratings.

In [None]:
# Optionally, inspect ratings for these movies
print("Ratings for the lowest rated movie:")
display(ratings[ratings['movieId'] == lowest_rated])
print("Ratings for the highest rated movie:")
display(ratings[ratings['movieId'] == highest_rated])

Ratings for the lowest rated movie:


Unnamed: 0,userId,movieId,rating,timestamp
13633,89,3604,0.5,1520408880


Ratings for the highest rated movie:


Unnamed: 0,userId,movieId,rating,timestamp
13368,85,53,5.0,889468268
96115,603,53,5.0,963180003


*   This section shows all the rating entries for the lowest and highest rated movies to further inspect the data.
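The inspection output above shows that the lowest rated movie has a single rating and the highest rated movie has only two, so the extremes of a plain mean are driven by very small samples. One common way to handle this, sketched below on hypothetical toy data, is to compute the rating count next to the mean and only rank movies that reach a minimum count. The `toy_ratings` table and the `min_count` threshold are assumptions for illustration, not part of the lab.

```python
import pandas as pd

# Toy ratings table (hypothetical data, same column names as the lab's `ratings`)
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 3, 3],
    "rating":  [4.0, 5.0, 3.0, 5.0, 2.0, 4.0],
})

# Rating count and mean rating per movie
movie_stats = toy_ratings.groupby("movieId")["rating"].agg(["count", "mean"])

# Keep only movies with at least `min_count` ratings (threshold is an assumption)
min_count = 2
reliable = movie_stats[movie_stats["count"] >= min_count]

print(movie_stats)
print("Highest mean overall:", movie_stats["mean"].idxmax())
print("Highest mean among movies with enough ratings:", reliable["mean"].idxmax())
```

In the toy data, movie 2 has the highest mean but only one rating, so it drops out once the count threshold is applied. The right threshold depends on the dataset and is a judgment call.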

## Collaborative Filtering Setup

We now prepare for collaborative filtering by creating a user-movie sparse matrix.


In [None]:
# Build User-Movie Sparse Matrix
# Map userId and movieId to continuous indices
N = ratings['userId'].nunique()  # Number of users
M = ratings['movieId'].nunique()  # Number of movies

user_mapper = {user: idx for idx, user in enumerate(ratings["userId"].unique())}
movie_mapper = {movie: idx for idx, movie in enumerate(ratings["movieId"].unique())}

user_index = [user_mapper[i] for i in ratings['userId']]
movie_index = [movie_mapper[i] for i in ratings['movieId']]

# Create a sparse matrix with shape (number of movies, number of users)
X = csr_matrix((ratings["rating"], (movie_index, user_index)), shape=(M, N))

# For demonstration, convert the sparse matrix to dense (not recommended for large datasets)
X_df = pd.DataFrame(X.toarray())
print("User-Movie Ratings Matrix (first 5 rows):")
display(X_df.head())

User-Movie Ratings Matrix (first 5 rows):


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,600,601,602,603,604,605,606,607,608,609
0,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
1,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,3.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,5.0
3,5.0,0.0,0.0,2.0,0.0,4.0,0.0,4.0,0.0,0.0,...,4.0,5.0,0.0,0.0,0.0,3.0,0.0,4.5,0.0,5.0
4,5.0,0.0,0.0,0.0,4.0,1.0,4.5,5.0,0.0,0.0,...,5.0,5.0,0.0,0.0,0.0,4.5,0.0,4.5,0.0,4.0


*   We first determine the number of unique users and movies.
*   Then, we create mappings to convert user and movie IDs into continuous index values.
*   Using these mappings, we build a sparse matrix `X` where each row corresponds to a movie, each column corresponds to a user, and each cell contains the rating.
*   For demonstration, the sparse matrix is converted to a dense DataFrame and the first five rows are displayed.

## Finding Similar Movies Using Nearest Neighbors

We implement a function to find similar movies based on cosine similarity.


In [None]:
# Find Similar Movies Function
from sklearn.neighbors import NearestNeighbors

# Create an inverse mapping from matrix indices to original movieIds
movie_inv_mapper = {idx: movie for idx, movie in enumerate(ratings["movieId"].unique())}

def find_similar_movies(movie_id, X, k=10):
    """
    Given a movie_id, find k similar movies based on cosine similarity.
    """
    neighbour_ids = []
    movie_ind = movie_mapper[movie_id]
    movie_vec = X[movie_ind]

    # Increase k by 1 to account for the movie itself
    kNN = NearestNeighbors(n_neighbors=k+1, algorithm="brute", metric="cosine")
    kNN.fit(X)
    movie_vec = movie_vec.reshape(1, -1)
    neighbor = kNN.kneighbors(movie_vec, return_distance=False)

    # Collect indices for similar movies, excluding the first one (self)
    for i in range(1, k+1):
        n = neighbor.item(i)
        neighbour_ids.append(movie_inv_mapper[n])
    return neighbour_ids

# Create a mapping from movieId to movie title for easy reference
movie_titles = dict(zip(movies['movieId'], movies['title']))

# Example: Find similar movies for a specific movie (e.g., movie_id 586)
selected_movie_id = 586
selected_movie_title = movie_titles[selected_movie_id]
print(f"Since you watched '{selected_movie_title}', you might also like:")

similar_ids = find_similar_movies(selected_movie_id, X, k=10)
for mid in similar_ids:
    print(movie_titles[mid])

Since you watched 'Home Alone (1990)', you might also like:
Mrs. Doubtfire (1993)
Lion King, The (1994)
Pretty Woman (1990)
Jurassic Park (1993)
Jumanji (1995)
Speed (1994)
Forrest Gump (1994)
Aladdin (1992)
Mask, The (1994)
Indiana Jones and the Temple of Doom (1984)


*   This function uses cosine similarity to find similar movies:
    *   It maps the input movie ID to its corresponding index.
    *   Then, using the Nearest Neighbors algorithm, it finds the k nearest movies (excluding the movie itself).
    *   The function returns a list of similar movie IDs, which are then converted to movie titles for display.

## Incorporating User Bias in Recommendations

We calculate each user's average rating (user bias) and filter recommendations based on whether the rating meets or exceeds the user's bias.


In [None]:
# Compute and Apply User Bias
# Transpose the user-movie matrix so rows represent users
df_user = X_df.T.copy()

# Calculate user bias (mean rating per user, ignoring zeros)
df_user['userBias'] = df_user[df_user != 0].mean(numeric_only=True, axis=1)
print("User Bias (first 5 rows):")
display(df_user.head())

User Bias (first 5 rows):


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9715,9716,9717,9718,9719,9720,9721,9722,9723,userBias
0,4.0,4.0,4.0,5.0,5.0,3.0,5.0,4.0,5.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.366379
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.948276
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.435897
3,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.555556
4,4.0,0.0,0.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.636364


In [None]:
# Retrieve ratings for the selected movie
rated = ratings[ratings["movieId"] == selected_movie_id].reset_index(drop=True)
print("Ratings for the selected movie:")
display(rated)

Ratings for the selected movie:


Unnamed: 0,userId,movieId,rating,timestamp
0,8,586,3.0,839463702
1,14,586,3.0,835441451
2,18,586,3.5,1455748696
3,19,586,3.0,965707079
4,20,586,3.0,1054038279
...,...,...,...,...
111,592,586,4.0,837350242
112,594,586,5.0,1109036952
113,599,586,3.0,1498525239
114,602,586,1.0,840875757


In [None]:
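# Get user bias for users who rated the movie.
# df_user is indexed by matrix position (0..N-1), not by raw userId,
# so translate each userId through user_mapper before the lookup.
rater_idx = rated["userId"].map(user_mapper).values
usrBias = df_user.loc[rater_idx, "userBias"].reset_index(drop=True)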
print("User biases for raters:")
display(usrBias)

User biases for raters:


Unnamed: 0,userBias
0,3.260870
1,3.448148
2,2.607397
3,3.590909
4,3.260722
...,...
111,3.266990
112,4.200000
113,2.991481
114,3.507953


In [None]:
# Filter recommendations: keep ratings where rating >= user's bias
filtering = rated["rating"] >= usrBias
recommend = rated[filtering]
print("Filtered Recommendations (ratings above user bias):")
display(recommend)

Filtered Recommendations (ratings above user bias):


Unnamed: 0,userId,movieId,rating,timestamp
2,18,586,3.5,1455748696
10,62,586,4.0,1521489913
19,102,586,4.0,835877270
20,103,586,4.0,1431957135
23,116,586,3.5,1337199910
24,117,586,4.0,844162913
30,169,586,5.0,1078284644
38,220,586,4.5,1230061714
40,229,586,3.0,838143590
50,280,586,4.0,1348532002


*   The user-movie matrix is transposed so each row now represents a user.
*   The code calculates the average rating (user bias) for each user, ignoring zeros.
*   Then, it retrieves the ratings for the selected movie and matches each rating with the corresponding user's bias.
*   Finally, it filters and displays only those ratings that are equal to or higher than the user's bias. This step is intended to refine recommendations based on how lenient or strict each user tends to be in their ratings.

## Summary

In this lab, we:
- Loaded and inspected the MovieLens ratings and movies datasets.
- Computed basic statistics on ratings, movies, and users.
- Built a simple recommender by analyzing mean ratings.
- Created a user-movie sparse matrix for collaborative filtering.
- Implemented a function to find similar movies using cosine similarity.
- Incorporated user bias to filter recommendations.

This hands-on practice provides a solid foundation for building more advanced recommender systems.

Happy coding!
