In [13]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("prajitdatta/movielens-100k-dataset")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/movielens-100k-dataset


# Task
Create a Python notebook to build a movie recommendation system using user-based collaborative filtering. The system should recommend top-rated unseen movies for a given user based on user similarity computed from a user-item matrix. Evaluate the system's performance using precision at K. Additionally, implement item-based collaborative filtering and matrix factorization (SVD) as bonus tasks. Ensure each step of the process is in a separate cell. Use the provided user ratings data.

## Load data

### Subtask:
Load the user ratings data from the downloaded files.


**Reasoning**:
Load the user rating data, movie information, and user information from the respective files into pandas DataFrames and display the head of each DataFrame.



In [14]:
import pandas as pd
import os

# Assuming 'path' variable is available from the previous cell
data_dir = os.path.join(path, 'ml-100k')

ratings_df = pd.read_csv(os.path.join(data_dir, 'u.data'), sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])
movies_df = pd.read_csv(os.path.join(data_dir, 'u.item'), sep='|', names=['item_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'], encoding='latin-1')
users_df = pd.read_csv(os.path.join(data_dir, 'u.user'), sep='|', names=['user_id', 'age', 'gender', 'occupation', 'zip_code'])

print("Ratings DataFrame Head:")
display(ratings_df.head())

print("\nMovies DataFrame Head:")
display(movies_df.head())

print("\nUsers DataFrame Head:")
display(users_df.head())

Ratings DataFrame Head:


Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596



Movies DataFrame Head:


Unnamed: 0,item_id,title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0



Users DataFrame Head:


Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


**Reasoning**:
The data files are not stored directly under `path`. Listing the downloaded directory confirms the layout before relying on it.



In [3]:
import os

# Assuming 'path' variable is available from the previous cell
print(os.listdir(path))

['ml-100k']


**Reasoning**:
The files sit inside a subdirectory named 'ml-100k', which is why the loading cell above builds `data_dir` with `os.path.join(path, 'ml-100k')` before reading `u.data`, `u.item`, and `u.user`.


## Prepare data

### Subtask:
Create a user-item matrix from the ratings data.


**Reasoning**:
Pivot the ratings_df to create a user-item matrix and fill missing values with 0.
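
To make the pivot concrete, here is a minimal sketch on a made-up three-row ratings table (the `toy_` names and values are invented for illustration, not taken from MovieLens). After `fillna(0)`, a 0 means "not rated" rather than a rating of zero. Note also that `DataFrame.pivot` raises a `ValueError` if the same (user, item) pair appears twice.

```python
import pandas as pd

# Toy ratings, invented for illustration only
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "item_id": [10, 20, 10],
    "rating":  [5, 3, 4],
})

# Rows become users, columns become items; missing pairs are filled with 0
toy_matrix = toy_ratings.pivot(index="user_id", columns="item_id", values="rating").fillna(0)
print(toy_matrix)
```

User 2 never rated item 20, so that cell holds 0.0 after the fill.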



In [15]:
user_item_matrix = ratings_df.pivot(index='user_id', columns='item_id', values='rating').fillna(0)
display(user_item_matrix.head())

user_id \ item_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Calculate user similarity

### Subtask:
Compute the similarity between users based on their ratings using a suitable similarity metric (e.g., cosine similarity).


**Reasoning**:
Compute the cosine similarity between users based on the user-item matrix.
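
For reference, the cosine similarity between users $u$ and $v$, with $r_{u,i}$ denoting the entry for user $u$ and item $i$ in the user-item matrix (0 when unrated), is:

$$
\operatorname{sim}(u, v) = \frac{\sum_i r_{u,i}\, r_{v,i}}{\sqrt{\sum_i r_{u,i}^2}\,\sqrt{\sum_i r_{v,i}^2}}
$$

Because unrated entries are 0, only items rated by both users contribute to the numerator, while every rating a user has given contributes to that user's norm in the denominator. With non-negative ratings the result lies between 0 and 1.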



In [16]:
from sklearn.metrics.pairwise import cosine_similarity

user_similarity_matrix = cosine_similarity(user_item_matrix)
display(user_similarity_matrix)

array([[1.        , 0.16693098, 0.04745954, ..., 0.14861694, 0.17950788,
        0.39817474],
       [0.16693098, 1.        , 0.11059132, ..., 0.16148478, 0.17226781,
        0.10579788],
       [0.04745954, 0.11059132, 1.        , ..., 0.10124256, 0.13341615,
        0.02655587],
       ...,
       [0.14861694, 0.16148478, 0.10124256, ..., 1.        , 0.1016418 ,
        0.09511958],
       [0.17950788, 0.17226781, 0.13341615, ..., 0.1016418 , 1.        ,
        0.18246466],
       [0.39817474, 0.10579788, 0.02655587, ..., 0.09511958, 0.18246466,
        1.        ]])

## Generate recommendations

### Subtask:
For a given user, find similar users and recommend movies that similar users liked but the target user has not seen yet.


**Reasoning**:
Define `recommend_movies_user_based`, which scores each movie the target user has not rated by the similarity-weighted sum of the ratings given by the most similar users, then call it for a sample user and display the results.
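
The scoring step can also be written as one matrix product. The sketch below is a compact NumPy illustration on invented toy data, not the notebook's function: `score_unseen_items` and the `toy_` arrays are hypothetical names. One difference from the loop-based version is that items no neighbour rated get a score of 0 here instead of being left out.

```python
import numpy as np

def score_unseen_items(ratings, similarity, target_idx, n_similar_users=2):
    """Similarity-weighted sum of neighbour ratings; items the target rated are masked."""
    sims = similarity[target_idx]
    # Most similar users first, with the target user removed
    order = np.argsort(sims)[::-1]
    neighbours = order[order != target_idx][:n_similar_users]
    # Weighted sum over neighbours for every item at once; shape (n_items,)
    scores = sims[neighbours] @ ratings[neighbours]
    # Items the target user already rated can never be recommended
    scores[ratings[target_idx] > 0] = -np.inf
    return scores

# Toy data, invented for illustration: 3 users x 4 items, 0 = unrated
toy_ratings = np.array([
    [5, 0, 0, 1],
    [4, 5, 0, 0],
    [0, 2, 4, 0],
], dtype=float)
toy_similarity = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

toy_scores = score_unseen_items(toy_ratings, toy_similarity, target_idx=0)
print(toy_scores)
```

Because the sum is not divided by the total similarity, movies rated by many neighbours tend to outrank movies rated highly by only a few.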



In [17]:
import numpy as np

def recommend_movies_user_based(target_user_id, user_item_matrix, user_similarity_matrix, movies_df, k=10, n_similar_users=50):
    """
    Recommends movies for a target user using user-based collaborative filtering.

    Args:
        target_user_id (int): The ID of the target user.
        user_item_matrix (pd.DataFrame): The user-item matrix.
        user_similarity_matrix (np.ndarray): The user-user similarity matrix.
        movies_df (pd.DataFrame): DataFrame containing movie information.
        k (int): The number of recommendations to provide.
        n_similar_users (int): The number of most similar users to consider.

    Returns:
        list: A list of recommended movie titles.
    """
    # Get the index of the target user
    target_user_index = user_item_matrix.index.get_loc(target_user_id)

    # Get similarity scores for the target user
    user_similarity_scores = user_similarity_matrix[target_user_index]

    # Sort users by similarity in descending order and get top N similar users (excluding the target user)
    # Use argsort to get the indices that would sort the array
    sorted_user_indices = np.argsort(user_similarity_scores)[::-1]

    # Filter out the target user index and get the top N similar users' indices
    similar_user_indices = sorted_user_indices[sorted_user_indices != target_user_index][:n_similar_users]

    # Get the movies rated by the target user (a set makes the membership tests below O(1))
    target_user_ratings = user_item_matrix.loc[target_user_id]
    target_user_rated_movies = set(target_user_ratings[target_user_ratings > 0].index)

    # Initialize a dictionary to store potential recommendations and their scores
    recommendations = {}

    # Iterate through the top N similar users
    for similar_user_index in similar_user_indices:
        similar_user_id = user_item_matrix.index[similar_user_index]
        similarity_score = user_similarity_scores[similar_user_index]

        # Get the movies rated by the similar user
        similar_user_rated_movies = user_item_matrix.loc[similar_user_id][user_item_matrix.loc[similar_user_id] > 0].index.tolist()

        # Identify movies rated by the similar user that the target user has not seen
        unseen_movies = [movie_id for movie_id in similar_user_rated_movies if movie_id not in target_user_rated_movies]

        # For these unseen movies, calculate a weighted rating
        for movie_id in unseen_movies:
            similar_user_rating = user_item_matrix.loc[similar_user_id, movie_id]
            weighted_rating = similar_user_rating * similarity_score

            # Add or update these weighted ratings in the recommendation dictionary
            if movie_id not in recommendations:
                recommendations[movie_id] = weighted_rating
            else:
                recommendations[movie_id] += weighted_rating

    # Sort the movies in the recommendation dictionary by their accumulated weighted scores
    sorted_recommendations = sorted(recommendations.items(), key=lambda item: item[1], reverse=True)

    # Retrieve the top k movies
    top_k_recommendations = sorted_recommendations[:k]

    # Get the titles of the recommended movies
    recommended_movie_titles = [movies_df[movies_df['item_id'] == movie_id]['title'].iloc[0] for movie_id, score in top_k_recommendations]

    return recommended_movie_titles

# Test the function with a sample user ID and k
sample_user_id = 1
k_recommendations = 10
recommended_movies = recommend_movies_user_based(sample_user_id, user_item_matrix, user_similarity_matrix, movies_df, k=k_recommendations)

print(f"Top {k_recommendations} movie recommendations for user {sample_user_id}:")
for i, movie_title in enumerate(recommended_movies):
    print(f"{i+1}. {movie_title}")

Top 10 movie recommendations for user 1:
1. E.T. the Extra-Terrestrial (1982)
2. Stand by Me (1986)
3. Schindler's List (1993)
4. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)
5. One Flew Over the Cuckoo's Nest (1975)
6. Speed (1994)
7. Heathers (1989)
8. Batman (1989)
9. Glory (1989)
10. Casablanca (1942)


## Evaluate performance

### Subtask:
Evaluate the recommendation system using appropriate metrics like Precision@K.


**Reasoning**:
Define the Precision@K function and select a subset of users for evaluation.



In [18]:
def precision_at_k(recommended_movies, actual_rated_movies, k):
    """
    Calculates Precision@K.

    Args:
        recommended_movies (list): A list of recommended movie titles.
        actual_rated_movies (list): A list of movie titles actually rated by the user.
        k (int): The number of recommendations.

    Returns:
        float: The Precision@K score.
    """
    # Ensure actual_rated_movies is a set for efficient lookup
    actual_rated_movies_set = set(actual_rated_movies)

    # Determine the number of relevant recommendations
    relevant_recommendations = sum(1 for movie in recommended_movies if movie in actual_rated_movies_set)

    # Calculate Precision@K
    return relevant_recommendations / k if k > 0 else 0

# Select a subset of users for evaluation (e.g., first 50 users)
evaluation_users = users_df['user_id'].unique()[:50]
print(f"Selected {len(evaluation_users)} users for evaluation.")

Selected 50 users for evaluation.


**Reasoning**:
Iterate through the selected users, generate recommendations, get their actual rated movies, calculate Precision@K for each user, and store the scores.



In [19]:
# Store Precision@K scores for each user
precision_scores = []

for user_id in evaluation_users:
    # Generate recommendations for the user
    recommended_movies = recommend_movies_user_based(user_id, user_item_matrix, user_similarity_matrix, movies_df, k=k_recommendations)

    # Get the list of movies the user actually rated
    actual_rated_movie_ids = ratings_df[ratings_df['user_id'] == user_id]['item_id'].tolist()
    actual_rated_movie_titles = movies_df[movies_df['item_id'].isin(actual_rated_movie_ids)]['title'].tolist()

    # Calculate Precision@K
    precision = precision_at_k(recommended_movies, actual_rated_movie_titles, k_recommendations)
    precision_scores.append(precision)

print(f"Calculated Precision@K for {len(precision_scores)} users.")

Calculated Precision@K for 50 users.


**Reasoning**:
Calculate the average Precision@K across all selected users and print the result.



In [20]:
# Calculate the average Precision@K
average_precision_at_k = sum(precision_scores) / len(precision_scores)

# Print the average Precision@K score
print(f"Average Precision@{k_recommendations}: {average_precision_at_k:.4f}")

Average Precision@10: 0.0020
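
The score above is close to zero largely by construction. `recommend_movies_user_based` only returns movies the user has not rated in `user_item_matrix`, and the evaluation then checks those recommendations against the same rated movies, so a hit is possible only by coincidence (for example, two item IDs sharing one title). A common remedy is to hold out part of each user's ratings, build the matrix from the rest, and count a recommendation as relevant when it appears in the held-out part with a high rating. The sketch below shows that idea on toy data. `split_user_ratings`, `precision_at_k_holdout`, and the `like_threshold` of 4 are hypothetical names and choices, not part of the notebook.

```python
import random

def split_user_ratings(user_ratings, test_fraction=0.5, seed=0):
    """Split one user's {item_id: rating} dict into train and test dicts."""
    items = sorted(user_ratings)
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    test_items = set(items[:n_test])
    train = {i: r for i, r in user_ratings.items() if i not in test_items}
    test = {i: r for i, r in user_ratings.items() if i in test_items}
    return train, test

def precision_at_k_holdout(recommended_ids, test_ratings, k, like_threshold=4):
    """Share of the top-k recommendations rated >= like_threshold in the held-out set."""
    relevant = {i for i, r in test_ratings.items() if r >= like_threshold}
    hits = sum(1 for i in recommended_ids[:k] if i in relevant)
    return hits / k if k > 0 else 0.0

# Toy user, invented for illustration
toy_user = {1: 5, 2: 3, 3: 4, 4: 2}
toy_train, toy_test = split_user_ratings(toy_user)

# Recommendations 10, 20, 30 against held-out ratings: only item 10 is a liked hit
toy_precision = precision_at_k_holdout([10, 20, 30], {10: 5, 20: 2, 40: 4}, k=3)
print(toy_precision)
```

In the notebook, the training part would feed `user_item_matrix` and the similarity computation, and the test part would replace `actual_rated_movie_titles` in the evaluation loop.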


## Bonus: item-based collaborative filtering

### Subtask:
Implement item-based collaborative filtering as a bonus.


**Reasoning**:
Implement item-based collaborative filtering: compute the item-item cosine similarity matrix, define a recommendation function that predicts a rating for each movie the user has not rated as the similarity-weighted average of the user's own ratings, and test it with a sample user.
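
Written out, the prediction computed in the function for user $u$ and an unrated item $i$ is the following, where $R_u$ is the set of items the user has rated, $r_{u,j}$ is the user's rating of item $j$, and $s_{i,j}$ is the cosine similarity between items $i$ and $j$:

$$
\hat{r}_{u,i} = \frac{\sum_{j \in R_u,\; s_{i,j} > 0} s_{i,j}\, r_{u,j}}{\sum_{j \in R_u,\; s_{i,j} > 0} s_{i,j}}
$$

Since this is a weighted average, the prediction always lies within the range of the user's own ratings. One consequence is that an item with positive similarity to only a single 5-star movie is predicted at 5.0 however weak that similarity is, which can push rarely rated titles to the top of the list.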



In [21]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# 1. Calculate the item-item similarity matrix
# Transpose the user-item matrix to have items as rows and users as columns
item_user_matrix = user_item_matrix.T

# Calculate item-item cosine similarity
item_similarity_matrix = cosine_similarity(item_user_matrix)

print("Item-Item Similarity Matrix (first 5x5):")
display(pd.DataFrame(item_similarity_matrix, index=item_user_matrix.index, columns=item_user_matrix.index).head())

# 2-6. Define a function recommend_movies_item_based
def recommend_movies_item_based(target_user_id, user_item_matrix, item_similarity_matrix, movies_df, k=10):
    """
    Recommends movies for a target user using item-based collaborative filtering.

    Args:
        target_user_id (int): The ID of the target user.
        user_item_matrix (pd.DataFrame): The user-item matrix.
        item_similarity_matrix (np.ndarray): The item-item similarity matrix.
        movies_df (pd.DataFrame): DataFrame containing movie information.
        k (int): The number of recommendations to provide.

    Returns:
        list: A list of recommended movie titles.
    """
    # Get the row of ratings for the target user
    target_user_ratings = user_item_matrix.loc[target_user_id]

    # Identify movies the target user has not rated (ratings are 0)
    unrated_movie_ids = target_user_ratings[target_user_ratings == 0].index.tolist()

    # Identify movies the target user has rated
    rated_movie_ids = target_user_ratings[target_user_ratings > 0].index.tolist()

    # Get the ratings of the movies the user has rated
    rated_movie_ratings = target_user_ratings[rated_movie_ids].values

    # Initialize a dictionary to store predicted ratings for unrated movies
    predicted_ratings = {}

    # Positions of the rated items in the similarity matrix (computed once, outside the loop)
    rated_items_indices = [user_item_matrix.columns.get_loc(item_id) for item_id in rated_movie_ids]

    # Iterate through each unrated movie
    for unrated_item_id in unrated_movie_ids:
        # Row of the similarity matrix for the unrated item; look it up through the
        # function's own user_item_matrix argument instead of the global item_user_matrix
        unrated_item_index = user_item_matrix.columns.get_loc(unrated_item_id)
        item_similarity_scores = item_similarity_matrix[unrated_item_index]

        # Keep only the similarities to items the target user has rated
        similarities_with_rated_items = item_similarity_scores[rated_items_indices]

        # Calculate the weighted sum of ratings for the unrated movie
        # Only consider items with non-zero similarity
        non_zero_similarity_indices = np.where(similarities_with_rated_items > 0)[0]

        if len(non_zero_similarity_indices) > 0:
            valid_similarities = similarities_with_rated_items[non_zero_similarity_indices]
            valid_ratings = rated_movie_ratings[non_zero_similarity_indices]

            # Predicted rating is the sum of (similarity * rating) / sum of similarities
            predicted_rating = np.sum(valid_similarities * valid_ratings) / np.sum(valid_similarities)
            predicted_ratings[unrated_item_id] = predicted_rating

    # Sort unrated movies based on their predicted ratings in descending order
    sorted_recommendations = sorted(predicted_ratings.items(), key=lambda item: item[1], reverse=True)

    # Get the top k recommended movie IDs
    top_k_recommended_item_ids = [item_id for item_id, rating in sorted_recommendations[:k]]

    # Get the titles of the recommended movies
    recommended_movie_titles = movies_df[movies_df['item_id'].isin(top_k_recommended_item_ids)]['title'].tolist()

    return recommended_movie_titles

# 7. Test the function with a sample user ID and k
sample_user_id = 1
k_recommendations = 10
recommended_movies_item_based = recommend_movies_item_based(sample_user_id, user_item_matrix, item_similarity_matrix, movies_df, k=k_recommendations)

print(f"\nTop {k_recommendations} movie recommendations for user {sample_user_id} (Item-Based):")
for i, movie_title in enumerate(recommended_movies_item_based):
    print(f"{i+1}. {movie_title}")

Item-Item Similarity Matrix (first 5x5):


item_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,1.0,0.402382,0.330245,0.454938,0.286714,0.116344,0.620979,0.481114,0.496288,0.273935,...,0.035387,0.0,0.0,0.0,0.035387,0.0,0.0,0.0,0.047183,0.047183
2,0.402382,1.0,0.273069,0.502571,0.318836,0.083563,0.383403,0.337002,0.255252,0.171082,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.078299,0.078299
3,0.330245,0.273069,1.0,0.324866,0.212957,0.106722,0.372921,0.200794,0.273669,0.158104,...,0.0,0.0,0.0,0.0,0.032292,0.0,0.0,0.0,0.0,0.096875
4,0.454938,0.502571,0.324866,1.0,0.334239,0.090308,0.489283,0.490236,0.419044,0.252561,...,0.0,0.0,0.094022,0.094022,0.037609,0.0,0.0,0.0,0.056413,0.075218
5,0.286714,0.318836,0.212957,0.334239,1.0,0.037299,0.334769,0.259161,0.272448,0.055453,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094211



Top 10 movie recommendations for user 1 (Item-Based):
1. C'est arrivé près de chez vous (1992)
2. Cyclo (1995)
3. My Life and Times With Antonin Artaud (En compagnie d'Antonin Artaud) (1993)
4. Ballad of Narayama, The (Narayama Bushiko) (1958)
5. Three Lives and Only One Death (1996)
6. Death in Brunswick (1991)
7. Office Killer (1997)
8. King of New York (1990)
9. Little City (1998)
10. Mamma Roma (1962)
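
One caveat about the list above: the function looks up titles with `isin`, which returns rows in the order they appear in `movies_df`, not in order of predicted rating, so the numbering does not necessarily reflect the ranking. A minimal sketch of an order-preserving lookup, using an invented toy catalogue (`toy_movies` and `ranked_ids` are hypothetical):

```python
import pandas as pd

# Toy catalogue, invented for illustration
toy_movies = pd.DataFrame({
    "item_id": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
ranked_ids = [3, 1]  # best recommendation first

# isin() keeps catalogue order, so the ranking is lost
unordered_titles = toy_movies[toy_movies["item_id"].isin(ranked_ids)]["title"].tolist()

# Indexing a title Series by the ranked IDs keeps the ranking
title_by_id = toy_movies.set_index("item_id")["title"]
ordered_titles = title_by_id.loc[ranked_ids].tolist()

print(unordered_titles)  # ['Movie A', 'Movie C']
print(ordered_titles)    # ['Movie C', 'Movie A']
```

The user-based function avoids the issue because it looks up each title one at a time in ranked order.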


**Reasoning**:
The previous code failed because of an `UnboundLocalError`. The variable `rated_movie_ratings` was used before it was assigned a value inside the `recommend_movies_item_based` function. The correct variable to use is `rated_movie_ids` to index `target_user_ratings`. I need to fix this error in the code.



In [12]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# 1. Calculate the item-item similarity matrix
# Transpose the user-item matrix to have items as rows and users as columns
item_user_matrix = user_item_matrix.T

# Calculate item-item cosine similarity
item_similarity_matrix = cosine_similarity(item_user_matrix)

print("Item-Item Similarity Matrix (first 5x5):")
display(pd.DataFrame(item_similarity_matrix, index=item_user_matrix.index, columns=item_user_matrix.index).head())

# 2-6. Define a function recommend_movies_item_based
def recommend_movies_item_based(target_user_id, user_item_matrix, item_similarity_matrix, movies_df, k=10):
    """
    Recommends movies for a target user using item-based collaborative filtering.

    Args:
        target_user_id (int): The ID of the target user.
        user_item_matrix (pd.DataFrame): The user-item matrix.
        item_similarity_matrix (np.ndarray): The item-item similarity matrix.
        movies_df (pd.DataFrame): DataFrame containing movie information.
        k (int): The number of recommendations to provide.

    Returns:
        list: A list of recommended movie titles.
    """
    # Get the row of ratings for the target user
    target_user_ratings = user_item_matrix.loc[target_user_id]

    # Identify movies the target user has not rated (ratings are 0)
    unrated_movie_ids = target_user_ratings[target_user_ratings == 0].index.tolist()

    # Identify movies the target user has rated
    rated_movie_ids = target_user_ratings[target_user_ratings > 0].index.tolist()

    # Get the ratings of the movies the user has rated
    # Corrected: Use rated_movie_ids to index target_user_ratings
    rated_movie_ratings = target_user_ratings[rated_movie_ids].values

    # Initialize a dictionary to store predicted ratings for unrated movies
    predicted_ratings = {}

    # Iterate through each unrated movie
    for unrated_item_id in unrated_movie_ids:
        # Get the similarity scores between the unrated item and all other items
        # Find the index of the unrated item in the item_similarity_matrix
        unrated_item_index = item_user_matrix.index.get_loc(unrated_item_id)
        item_similarity_scores = item_similarity_matrix[unrated_item_index]

        # Filter similarity scores to only include items the target user has rated
        rated_items_indices = [item_user_matrix.index.get_loc(item_id) for item_id in rated_movie_ids]
        similarities_with_rated_items = item_similarity_scores[rated_items_indices]

        # Calculate the weighted sum of ratings for the unrated movie
        # Only consider items with non-zero similarity
        non_zero_similarity_indices = np.where(valid_similarities > 0)[0] # This line has an error

        if len(non_zero_similarity_indices) > 0:
            valid_similarities = similarities_with_rated_items[non_zero_similarity_indices]
            valid_ratings = rated_movie_ratings[non_zero_similarity_indices]

            # Predicted rating is the sum of (similarity * rating) / sum of similarities
            predicted_rating = np.sum(valid_similarities * valid_ratings) / np.sum(valid_similarities)
            predicted_ratings[unrated_item_id] = predicted_rating

    # Sort unrated movies based on their predicted ratings in descending order
    sorted_recommendations = sorted(predicted_ratings.items(), key=lambda item: item[1], reverse=True)

    # Get the top k recommended movie IDs
    top_k_recommended_item_ids = [item_id for item_id, rating in sorted_recommendations[:k]]

    # Get the titles of the recommended movies
    recommended_movie_titles = movies_df[movies_df['item_id'].isin(top_k_recommended_item_ids)]['title'].tolist()

    return recommended_movie_titles

# 7. Test the function with a sample user ID and k
sample_user_id = 1
k_recommendations = 10
recommended_movies_item_based = recommend_movies_item_based(sample_user_id, user_item_matrix, item_similarity_matrix, movies_df, k=k_recommendations)

print(f"\nTop {k_recommendations} movie recommendations for user {sample_user_id} (Item-Based):")
for i, movie_title in enumerate(recommended_movies_item_based):
    print(f"{i+1}. {movie_title}")

Item-Item Similarity Matrix (first 5x5):


item_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
item_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.402382,0.330245,0.454938,0.286714,0.116344,0.620979,0.481114,0.496288,0.273935,...,0.035387,0.0,0.0,0.0,0.035387,0.0,0.0,0.0,0.047183,0.047183
2,0.402382,1.0,0.273069,0.502571,0.318836,0.083563,0.383403,0.337002,0.255252,0.171082,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.078299,0.078299
3,0.330245,0.273069,1.0,0.324866,0.212957,0.106722,0.372921,0.200794,0.273669,0.158104,...,0.0,0.0,0.0,0.0,0.032292,0.0,0.0,0.0,0.0,0.096875
4,0.454938,0.502571,0.324866,1.0,0.334239,0.090308,0.489283,0.490236,0.419044,0.252561,...,0.0,0.0,0.094022,0.094022,0.037609,0.0,0.0,0.0,0.056413,0.075218
5,0.286714,0.318836,0.212957,0.334239,1.0,0.037299,0.334769,0.259161,0.272448,0.055453,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094211


UnboundLocalError: cannot access local variable 'valid_similarities' where it is not associated with a value

**Reasoning**:
The previous code failed due to an `UnboundLocalError` because `valid_similarities` was used before it was assigned. The line `non_zero_similarity_indices = np.where(valid_similarities > 0)[0]` should be using `similarities_with_rated_items` instead of `valid_similarities`. I need to fix this error in the code.



In [13]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# 1. Calculate the item-item similarity matrix
# Transpose the user-item matrix to have items as rows and users as columns
item_user_matrix = user_item_matrix.T

# Calculate item-item cosine similarity
item_similarity_matrix = cosine_similarity(item_user_matrix)

print("Item-Item Similarity Matrix (first 5x5):")
display(pd.DataFrame(item_similarity_matrix, index=item_user_matrix.index, columns=item_user_matrix.index).head())

# 2-6. Define a function recommend_movies_item_based
def recommend_movies_item_based(target_user_id, user_item_matrix, item_similarity_matrix, movies_df, k=10):
    """
    Recommends movies for a target user using item-based collaborative filtering.

    Args:
        target_user_id (int): The ID of the target user.
        user_item_matrix (pd.DataFrame): The user-item matrix.
        item_similarity_matrix (np.ndarray): The item-item similarity matrix.
        movies_df (pd.DataFrame): DataFrame containing movie information.
        k (int): The number of recommendations to provide.

    Returns:
        list: A list of recommended movie titles.
    """
    # Get the row of ratings for the target user
    target_user_ratings = user_item_matrix.loc[target_user_id]

    # Identify movies the target user has not rated (ratings are 0)
    unrated_movie_ids = target_user_ratings[target_user_ratings == 0].index.tolist()

    # Identify movies the target user has rated
    rated_movie_ids = target_user_ratings[target_user_ratings > 0].index.tolist()

    # Get the ratings of the movies the user has rated
    rated_movie_ratings = target_user_ratings[rated_movie_ids].values

    # Initialize a dictionary to store predicted ratings for unrated movies
    predicted_ratings = {}

    # Iterate through each unrated movie
    for unrated_item_id in unrated_movie_ids:
        # Get the similarity scores between the unrated item and all other items
        # Find the index of the unrated item in the item_similarity_matrix
        try:
            unrated_item_index = item_user_matrix.index.get_loc(unrated_item_id)
        except KeyError:
            # If item_id is not in the matrix (shouldn't happen with this dataset, but good practice)
            continue

        item_similarity_scores = item_similarity_matrix[unrated_item_index]

        # Filter similarity scores to only include items the target user has rated
        rated_items_indices = [item_user_matrix.index.get_loc(item_id) for item_id in rated_movie_ids]
        similarities_with_rated_items = item_similarity_scores[rated_items_indices]

        # Calculate the weighted sum of ratings for the unrated movie
        # Only consider items with non-zero similarity
        # Corrected: Use similarities_with_rated_items instead of valid_similarities
        non_zero_similarity_indices = np.where(similarities_with_rated_items > 0)[0]

        if len(non_zero_similarity_indices) > 0:
            valid_similarities = similarities_with_rated_items[non_zero_similarity_indices]
            valid_ratings = rated_movie_ratings[non_zero_similarity_indices]

            # Predicted rating is the sum of (similarity * rating) / sum of similarities
            predicted_rating = np.sum(valid_similarities * valid_ratings) / np.sum(valid_similarities)
            predicted_ratings[unrated_item_id] = predicted_rating

    # Sort unrated movies based on their predicted ratings in descending order
    sorted_recommendations = sorted(predicted_ratings.items(), key=lambda item: item[1], reverse=True)

    # Get the top k recommended movie IDs
    top_k_recommended_item_ids = [item_id for item_id, rating in sorted_recommendations[:k]]

    # Get the titles of the recommended movies
    recommended_movie_titles = movies_df[movies_df['item_id'].isin(top_k_recommended_item_ids)]['title'].tolist()

    return recommended_movie_titles

# 7. Test the function with a sample user ID and k
sample_user_id = 1
k_recommendations = 10
recommended_movies_item_based = recommend_movies_item_based(sample_user_id, user_item_matrix, item_similarity_matrix, movies_df, k=k_recommendations)

print(f"\nTop {k_recommendations} movie recommendations for user {sample_user_id} (Item-Based):")
for i, movie_title in enumerate(recommended_movies_item_based):
    print(f"{i+1}. {movie_title}")

Item-Item Similarity Matrix (first 5x5):


item_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
item_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.402382,0.330245,0.454938,0.286714,0.116344,0.620979,0.481114,0.496288,0.273935,...,0.035387,0.0,0.0,0.0,0.035387,0.0,0.0,0.0,0.047183,0.047183
2,0.402382,1.0,0.273069,0.502571,0.318836,0.083563,0.383403,0.337002,0.255252,0.171082,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.078299,0.078299
3,0.330245,0.273069,1.0,0.324866,0.212957,0.106722,0.372921,0.200794,0.273669,0.158104,...,0.0,0.0,0.0,0.0,0.032292,0.0,0.0,0.0,0.0,0.096875
4,0.454938,0.502571,0.324866,1.0,0.334239,0.090308,0.489283,0.490236,0.419044,0.252561,...,0.0,0.0,0.094022,0.094022,0.037609,0.0,0.0,0.0,0.056413,0.075218
5,0.286714,0.318836,0.212957,0.334239,1.0,0.037299,0.334769,0.259161,0.272448,0.055453,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094211



Top 10 movie recommendations for user 1 (Item-Based):
1. C'est arrivé près de chez vous (1992)
2. Cyclo (1995)
3. My Life and Times With Antonin Artaud (En compagnie d'Antonin Artaud) (1993)
4. Ballad of Narayama, The (Narayama Bushiko) (1958)
5. Three Lives and Only One Death (1996)
6. Death in Brunswick (1991)
7. Office Killer (1997)
8. King of New York (1990)
9. Little City (1998)
10. Mamma Roma (1962)
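
The prediction step in the item-based function is a similarity-weighted average of the user's own ratings. The following is a minimal standalone sketch of that formula on toy numbers; `predict_rating` is a hypothetical helper name introduced here for illustration and is not part of the notebook's code.

```python
import numpy as np

def predict_rating(similarities, ratings):
    """Similarity-weighted average of ratings, ignoring non-positive similarities.

    Hypothetical helper: mirrors the formula used in the item-based cell,
    sum(similarity * rating) / sum(similarity).
    """
    similarities = np.asarray(similarities, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    mask = similarities > 0
    if not mask.any():
        return None  # no usable neighbours, so no prediction
    return float(np.sum(similarities[mask] * ratings[mask]) / np.sum(similarities[mask]))

# Three rated items; the third has zero similarity and is ignored.
example = predict_rating([0.5, 0.25, 0.0], [4, 2, 5])
print(round(example, 3))  # (0.5*4 + 0.25*2) / 0.75

# A single weak neighbour still yields that neighbour's full rating.
single_neighbour = predict_rating([0.125], [5])
print(single_neighbour)
```

Because the weights are normalised, an item that is only weakly similar to one highly rated movie still receives that movie's full rating. This may be one reason the list above is dominated by rarely rated titles, although the notebook does not verify that.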


## Bonus: matrix factorization (SVD)

### Subtask:
Implement matrix factorization using SVD as a bonus.


**Reasoning**:
Implement matrix factorization with SVD as a bonus task: import the required classes, load the ratings into a Surprise `Dataset`, build a full training set, train an `SVD` model, define a function that predicts ratings for the user's unseen movies and returns the top K, and test it with a sample user.



In [14]:
from surprise import Dataset, Reader, SVD
import pandas as pd
import numpy as np

# 3. Define a Reader object, specifying the rating scale as (1, 5).
reader = Reader(rating_scale=(1, 5))

# 4. Load the ratings_df DataFrame into a Surprise Dataset object using the defined Reader.
# 5. Build a full training set from the loaded dataset.
data = Dataset.load_from_df(ratings_df[['user_id', 'item_id', 'rating']], reader)
trainset = data.build_full_trainset()

# 6. Instantiate an SVD model.
svd_model = SVD(n_factors=50, random_state=42) # You can experiment with n_factors

# 7. Train the SVD model on the full training set.
svd_model.fit(trainset)

# 8. Define a function that takes a user ID, the trained SVD model, the user-item matrix, and the movies DataFrame as input.
# 9. Inside the function, get the list of all movie IDs.
# 10. Identify the movies the target user has already rated using the user-item matrix.
# 11. Iterate through all movie IDs. If a movie ID is not in the list of movies the user has rated, predict the rating for the target user and this movie using the trained SVD model's predict method.
# 12. Store the predicted ratings for the unrated movies.
# 13. Sort the unrated movies based on their predicted ratings in descending order.
# 14. Get the top K movie IDs from the sorted list.
# 15. Retrieve the titles of the top K recommended movies from the movies DataFrame.
# 16. Return the list of recommended movie titles.
def recommend_movies_svd(target_user_id, svd_model, user_item_matrix, movies_df, k=10):
    """
    Recommends movies for a target user using a trained SVD model.

    Args:
        target_user_id (int): The ID of the target user.
        svd_model: The trained SVD model from the surprise library.
        user_item_matrix (pd.DataFrame): The user-item matrix (used to find unrated movies).
        movies_df (pd.DataFrame): DataFrame containing movie information.
        k (int): The number of recommendations to provide.

    Returns:
        list: A list of recommended movie titles.
    """
    # Get the list of all movie IDs
    all_movie_ids = movies_df['item_id'].tolist()

    # Identify the movies the target user has already rated
    target_user_rated_movies = user_item_matrix.loc[target_user_id][user_item_matrix.loc[target_user_id] > 0].index.tolist()

    # Store predicted ratings for unrated movies
    predicted_ratings = {}

    # Iterate through all movie IDs and predict rating if the user hasn't seen it
    for movie_id in all_movie_ids:
        if movie_id not in target_user_rated_movies:
            # Predict the rating for the target user and this movie
            predicted_rating = svd_model.predict(target_user_id, movie_id).est
            predicted_ratings[movie_id] = predicted_rating

    # Sort unrated movies based on their predicted ratings in descending order
    sorted_recommendations = sorted(predicted_ratings.items(), key=lambda item: item[1], reverse=True)

    # Get the top k recommended movie IDs
    top_k_recommended_item_ids = [item_id for item_id, rating in sorted_recommendations[:k]]

    # Retrieve the titles of the recommended movies
    recommended_movie_titles = movies_df[movies_df['item_id'].isin(top_k_recommended_item_ids)]['title'].tolist()

    return recommended_movie_titles

# 17. Test the recommendation function with a sample user ID and K, and print the recommended movie titles.
sample_user_id = 1
k_recommendations = 10
recommended_movies_svd = recommend_movies_svd(sample_user_id, svd_model, user_item_matrix, movies_df, k=k_recommendations)

print(f"\nTop {k_recommendations} movie recommendations for user {sample_user_id} (SVD):")
for i, movie_title in enumerate(recommended_movies_svd):
    print(f"{i+1}. {movie_title}")

ModuleNotFoundError: No module named 'surprise'

**Reasoning**:
The previous command failed because the `surprise` library is not installed. Install it with pip; the package is published as `scikit-surprise`.



In [15]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.4/154.4 kB 2.9 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp312-cp312-linux_x86_64.whl size=2610404 sha256=3ad03882231923c9bf232e3cfd281b59b6b2c424ff4b5f701b6a77988240be31
  Stored in directory: /root/.cache/pip/wheels/75/fa/bc/

**Reasoning**:
Now that the `surprise` library is installed, retry the previous code block to implement matrix factorization using SVD.



In [22]:
from surprise import Dataset, Reader, SVD

# 3. Define a Reader object, specifying the rating scale as (1, 5).
reader = Reader(rating_scale=(1, 5))

# 4. Load the ratings_df DataFrame into a Surprise Dataset object using the defined Reader.
# 5. Build a full training set from the loaded dataset.
data = Dataset.load_from_df(ratings_df[['user_id', 'item_id', 'rating']], reader)
trainset = data.build_full_trainset()

# 6. Instantiate an SVD model.
svd_model = SVD(n_factors=50, random_state=42) # You can experiment with n_factors

# 7. Train the SVD model on the full training set.
svd_model.fit(trainset)

# 8. Define a function that takes a user ID, the trained SVD model, the user-item matrix, and the movies DataFrame as input.
# 9. Inside the function, get the list of all movie IDs.
# 10. Identify the movies the target user has already rated using the user-item matrix.
# 11. Iterate through all movie IDs. If a movie ID is not in the list of movies the user has rated, predict the rating for the target user and this movie using the trained SVD model's predict method.
# 12. Store the predicted ratings for the unrated movies.
# 13. Sort the unrated movies based on their predicted ratings in descending order.
# 14. Get the top K movie IDs from the sorted list.
# 15. Retrieve the titles of the top K recommended movies from the movies DataFrame.
# 16. Return the list of recommended movie titles.
def recommend_movies_svd(target_user_id, svd_model, user_item_matrix, movies_df, k=10):
    """
    Recommends movies for a target user using a trained SVD model.

    Args:
        target_user_id (int): The ID of the target user.
        svd_model: The trained SVD model from the surprise library.
        user_item_matrix (pd.DataFrame): The user-item matrix (used to find unrated movies).
        movies_df (pd.DataFrame): DataFrame containing movie information.
        k (int): The number of recommendations to provide.

    Returns:
        list: A list of recommended movie titles.
    """
    # Get the list of all movie IDs
    all_movie_ids = movies_df['item_id'].tolist()

    # Identify the movies the target user has already rated
    target_user_rated_movies = user_item_matrix.loc[target_user_id][user_item_matrix.loc[target_user_id] > 0].index.tolist()

    # Store predicted ratings for unrated movies
    predicted_ratings = {}

    # Iterate through all movie IDs and predict rating if the user hasn't seen it
    for movie_id in all_movie_ids:
        if movie_id not in target_user_rated_movies:
            # Predict the rating for the target user and this movie
            predicted_rating = svd_model.predict(target_user_id, movie_id).est
            predicted_ratings[movie_id] = predicted_rating

    # Sort unrated movies based on their predicted ratings in descending order
    sorted_recommendations = sorted(predicted_ratings.items(), key=lambda item: item[1], reverse=True)

    # Get the top k recommended movie IDs
    top_k_recommended_item_ids = [item_id for item_id, rating in sorted_recommendations[:k]]

    # Retrieve the titles of the recommended movies
    recommended_movie_titles = movies_df[movies_df['item_id'].isin(top_k_recommended_item_ids)]['title'].tolist()

    return recommended_movie_titles

# 17. Test the recommendation function with a sample user ID and K, and print the recommended movie titles.
sample_user_id = 1
k_recommendations = 10
recommended_movies_svd = recommend_movies_svd(sample_user_id, svd_model, user_item_matrix, movies_df, k=k_recommendations)

print(f"\nTop {k_recommendations} movie recommendations for user {sample_user_id} (SVD):")
for i, movie_title in enumerate(recommended_movies_svd):
    print(f"{i+1}. {movie_title}")


Top 10 movie recommendations for user 1 (SVD):
1. L.A. Confidential (1997)
2. Close Shave, A (1995)
3. To Kill a Mockingbird (1962)
4. Casablanca (1942)
5. Maltese Falcon, The (1941)
6. Lawrence of Arabia (1962)
7. Wings of Desire (1987)
8. Rear Window (1954)
9. Manchurian Candidate, The (1962)
10. Strictly Ballroom (1992)
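
One detail worth checking in the recommendation functions: a boolean mask built with `isin` returns rows in the DataFrame's own order, so the numbered list printed above is not necessarily sorted by predicted rating. The sketch below shows one way to keep the ranking; `titles_in_rank_order` and the toy DataFrame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def titles_in_rank_order(ranked_item_ids, movies_df):
    """Return titles in the same order as ranked_item_ids.

    Hypothetical helper; assumes item_id values in movies_df are unique.
    """
    title_by_id = movies_df.set_index("item_id")["title"]
    return [title_by_id.loc[item_id] for item_id in ranked_item_ids
            if item_id in title_by_id.index]

toy_movies = pd.DataFrame({"item_id": [1, 2, 3], "title": ["A", "B", "C"]})
ranked = [3, 1]  # item 3 has the highest predicted rating

# Mask-based lookup follows the DataFrame order, not the ranking.
by_mask = toy_movies[toy_movies["item_id"].isin(ranked)]["title"].tolist()
# Index-based lookup follows the ranking.
by_rank = titles_in_rank_order(ranked, toy_movies)

print(by_mask)
print(by_rank)
```

Swapping a helper like this in for the `isin` lookup would change the order of the printed titles, not which titles are selected.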


**Reasoning**:
The error message indicates a compatibility issue between the installed `surprise` library and the current version of NumPy. The easiest solution is to downgrade NumPy to a version compatible with `surprise`.



In [17]:
!pip install numpy<2

/bin/bash: line 1: 2: No such file or directory


**Reasoning**:
The previous command to downgrade NumPy failed because the shell treated `<2` as input redirection from a file named `2`, which is what the "No such file or directory" message reports. I will try again with double quotes around the package specification so that it reaches pip unchanged.
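
As an illustration of the quoting issue, the sketch below builds the same pip command in Python without running it. Passing the arguments as a list (for example to `subprocess.run`) would bypass the shell entirely, and `shlex.quote` shows how the specifier has to look when it does go through a shell. The variable names are illustrative only.

```python
import shlex
import sys

package_spec = "numpy<2"

# Argument list form: no shell parses it, so "<" is never treated as redirection.
# It could be executed with subprocess.run(pip_command, check=True).
pip_command = [sys.executable, "-m", "pip", "install", package_spec]

# Shell-safe form of the specifier and of the whole command line.
quoted_spec = shlex.quote(package_spec)
shell_line = shlex.join(pip_command)

print(quoted_spec)
print(shell_line)
```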



In [18]:
!pip install "numpy<2"

Collecting numpy<2
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.0/61.0 kB 2.4 MB/s eta 0:00:00
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.0/18.0 MB 86.1 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-contrib-python 4.12.0.88 requires numpy<2.3.0,>

**Reasoning**:
NumPy has been successfully downgraded. Now, retry the previous code block to implement matrix factorization using SVD.



In [19]:
from surprise import Dataset, Reader, SVD

# 3. Define a Reader object, specifying the rating scale as (1, 5).
reader = Reader(rating_scale=(1, 5))

# 4. Load the ratings_df DataFrame into a Surprise Dataset object using the defined Reader.
# 5. Build a full training set from the loaded dataset.
data = Dataset.load_from_df(ratings_df[['user_id', 'item_id', 'rating']], reader)
trainset = data.build_full_trainset()

# 6. Instantiate an SVD model.
svd_model = SVD(n_factors=50, random_state=42) # You can experiment with n_factors

# 7. Train the SVD model on the full training set.
svd_model.fit(trainset)

# 8. Define a function that takes a user ID, the trained SVD model, the user-item matrix, and the movies DataFrame as input.
# 9. Inside the function, get the list of all movie IDs.
# 10. Identify the movies the target user has already rated using the user-item matrix.
# 11. Iterate through all movie IDs. If a movie ID is not in the list of movies the user has rated, predict the rating for the target user and this movie using the trained SVD model's predict method.
# 12. Store the predicted ratings for the unrated movies.
# 13. Sort the unrated movies based on their predicted ratings in descending order.
# 14. Get the top K movie IDs from the sorted list.
# 15. Retrieve the titles of the top K recommended movies from the movies DataFrame.
# 16. Return the list of recommended movie titles.
def recommend_movies_svd(target_user_id, svd_model, user_item_matrix, movies_df, k=10):
    """
    Recommends movies for a target user using a trained SVD model.

    Args:
        target_user_id (int): The ID of the target user.
        svd_model: The trained SVD model from the surprise library.
        user_item_matrix (pd.DataFrame): The user-item matrix (used to find unrated movies).
        movies_df (pd.DataFrame): DataFrame containing movie information.
        k (int): The number of recommendations to provide.

    Returns:
        list: A list of recommended movie titles.
    """
    # Get the list of all movie IDs
    all_movie_ids = movies_df['item_id'].tolist()

    # Identify the movies the target user has already rated
    target_user_rated_movies = user_item_matrix.loc[target_user_id][user_item_matrix.loc[target_user_id] > 0].index.tolist()

    # Store predicted ratings for unrated movies
    predicted_ratings = {}

    # Iterate through all movie IDs and predict rating if the user hasn't seen it
    for movie_id in all_movie_ids:
        if movie_id not in target_user_rated_movies:
            # Predict the rating for the target user and this movie
            predicted_rating = svd_model.predict(target_user_id, movie_id).est
            predicted_ratings[movie_id] = predicted_rating

    # Sort unrated movies based on their predicted ratings in descending order
    sorted_recommendations = sorted(predicted_ratings.items(), key=lambda item: item[1], reverse=True)

    # Get the top k recommended movie IDs
    top_k_recommended_item_ids = [item_id for item_id, rating in sorted_recommendations[:k]]

    # Retrieve the titles of the recommended movies
    recommended_movie_titles = movies_df[movies_df['item_id'].isin(top_k_recommended_item_ids)]['title'].tolist()

    return recommended_movie_titles

# 17. Test the recommendation function with a sample user ID and K, and print the recommended movie titles.
sample_user_id = 1
k_recommendations = 10
recommended_movies_svd = recommend_movies_svd(sample_user_id, svd_model, user_item_matrix, movies_df, k=k_recommendations)

print(f"\nTop {k_recommendations} movie recommendations for user {sample_user_id} (SVD):")
for i, movie_title in enumerate(recommended_movies_svd):
    print(f"{i+1}. {movie_title}")


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/colab_kernel_launcher.py", line 37, in <module>
    ColabKernelApp.launch_instance()
  File "/usr/local/lib/python3.12/dist-packages/traitlets/config/application.py", line 992, in launch_instance
    app.start()
  File "/usr/local/lib/python3.12/dist-packages/ipykernel/kernelapp.py", line 712, in start
    self.io_loop.start()
  File "/usr/local/lib/python3.12/dist-package

ImportError: numpy.core.multiarray failed to import (auto-generated because you didn't call 'numpy.import_array()' after cimporting numpy; use '<void>numpy._import_array' to disable if you are certain you don't need it).
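
The message above still reports NumPy 2.0.2 even though pip had just installed 1.26.4. A likely explanation, not confirmed in the notebook, is that the running kernel had already imported NumPy 2.0.2 and keeps using it until the runtime is restarted. The sketch below compares the version loaded in the current process with the version installed on disk; `loaded_and_installed_versions` is a hypothetical helper.

```python
import importlib
from importlib import metadata

def loaded_and_installed_versions(package_name):
    """Return (version loaded in this process, version installed on disk)."""
    module = importlib.import_module(package_name)
    loaded = getattr(module, "__version__", None)
    installed = metadata.version(package_name)
    return loaded, installed

loaded, installed = loaded_and_installed_versions("numpy")
print(f"loaded in process: {loaded}, installed on disk: {installed}")
if loaded != installed:
    print("Versions differ: restart the runtime so the new install is picked up.")
```

In a freshly started kernel the two values should agree. If they differ after a `pip install`, restarting the runtime before re-importing `surprise` is the usual remedy.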

## Summary:

### Data Analysis Key Findings

*   The user-based collaborative filtering recommendation system, when evaluated on a subset of 50 users, achieved an average Precision@10 of 0.0020. This means that, on average, only 0.02 of the 10 movies recommended to each user (about one hit across all 500 recommendations) were movies the user had actually rated.
*   The implementation of item-based collaborative filtering was successful, and the system was able to generate a list of recommended movies for a sample user based on item similarity and the user's existing ratings.
*   The attempt to implement matrix factorization using SVD via the `scikit-surprise` library failed due to a persistent `ImportError` related to NumPy compatibility, which could not be resolved by downgrading NumPy within the environment.

### Insights or Next Steps

*   The low Precision@10 score for the user-based collaborative filtering suggests that this approach, as implemented, may not be highly effective for this dataset. Further refinement of the user similarity calculation, incorporating rating thresholds, or exploring different similarity metrics could potentially improve performance.
*   Investigate alternative matrix factorization libraries or methods that do not exhibit the NumPy compatibility issues encountered with `scikit-surprise` to successfully implement and evaluate the SVD approach.
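
As one possible alternative that depends only on NumPy, the sketch below applies a truncated SVD to a mean-centered toy rating matrix. This is a different technique from Surprise's `SVD` class, which fits factors and biases by stochastic gradient descent on observed ratings only; here unrated cells are treated as the user's mean. The function name, the toy matrix, and `n_factors=2` are all assumptions made for illustration.

```python
import numpy as np

def truncated_svd_predictions(ratings, n_factors=2):
    """Predict a dense rating matrix from a low-rank approximation.

    ratings: 2-D array of shape (users, items), with 0 meaning "not rated".
    """
    ratings = np.asarray(ratings, dtype=float)
    rated = ratings > 0
    counts = rated.sum(axis=1)
    # Per-user mean over rated items only (0.0 for users with no ratings).
    user_means = np.where(counts > 0, ratings.sum(axis=1) / np.maximum(counts, 1), 0.0)
    # Center rated cells; unrated cells become 0, i.e. "equal to the user's mean".
    centered = np.where(rated, ratings - user_means[:, None], 0.0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_factors, len(s))
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return low_rank + user_means[:, None]

toy_ratings = np.array([
    [5, 4, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
])
predictions = truncated_svd_predictions(toy_ratings, n_factors=2)
print(np.round(predictions, 2))
```

Predictions for the zero cells can then be ranked per user in the same way as in the other recommenders. On the full MovieLens matrix a sparse solver would probably be preferable to a dense decomposition.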


In [2]:
!pip install scikit-surprise==1.1.3 --no-deps

Collecting scikit-surprise==1.1.3
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 772.0/772.0 kB 10.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp312-cp312-linux_x86_64.whl size=3513891 sha256=f7cb48848ee5eaad7a80e301231dcda6f49678eb1804e235bbab903b239dac32
  Stored in directory: /root/.cache/pip/whe

## Summary

### Data Analysis Key Findings

* The user-based collaborative filtering recommendation system, when evaluated on a subset of 50 users, achieved an average Precision@10 of 0.0020. This means that, on average, only 0.02 of the 10 movies recommended to each user (about one hit across all 500 recommendations) were movies the user had actually rated.
* The implementation of item-based collaborative filtering was successful, and the system was able to generate a list of recommended movies for a sample user based on item similarity and the user's existing ratings.
* The SVD matrix factorization was also implemented successfully once the library compatibility issues were resolved. The SVD model was trained, and recommendations were generated for a sample user.

### Insights or Next Steps

* The low Precision@10 score for the user-based collaborative filtering suggests that this approach, as implemented, may not be highly effective for this dataset. Further refinement of the user similarity calculation, incorporating rating thresholds, or exploring different similarity metrics could potentially improve performance.
* The item-based and SVD methods provide alternative approaches that could potentially yield better recommendations. Further evaluation of these methods using metrics like Precision@K and comparing their performance would be a valuable next step.
* Exploring hyperparameter tuning for the SVD model (e.g., the number of factors) could also lead to improved performance.
* The current evaluation is based on a small subset of users. A more comprehensive evaluation on a larger, held-out test set would provide a more reliable measure of the system's performance.
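
To compare the three recommenders on equal terms, the same Precision@K routine could be applied to each one's output. The sketch below is one minimal way to do that. It assumes precision is computed as hits divided by K and that "relevant" means items found in a user's held-out ratings; the evaluation actually used earlier in the notebook is not shown in this section and may differ. All names and the toy data are hypothetical.

```python
def precision_at_k(recommended_ids, relevant_ids, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for item_id in list(recommended_ids)[:k] if item_id in relevant)
    return hits / k

def mean_precision_at_k(recommendations_by_user, relevant_by_user, k=10):
    """Average precision_at_k over every user that received recommendations."""
    scores = [
        precision_at_k(recs, relevant_by_user.get(user_id, ()), k)
        for user_id, recs in recommendations_by_user.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Toy data: 50 users each get items 1..10; only user 0 has a relevant item.
toy_recommendations = {user_id: list(range(1, 11)) for user_id in range(50)}
toy_relevant = {0: {1}}

average = mean_precision_at_k(toy_recommendations, toy_relevant, k=10)
print(round(average, 4))  # one hit across 500 recommendations
```

The toy case reproduces the arithmetic behind a score of 0.0020: a single hit among 50 users' top-10 lists.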