## Step 1: Import Necessary Libraries

Import the libraries needed to build the recommendation system. Each library has a specific role:

+ pandas and numpy are used for data manipulation.

+ scikit-learn provides machine learning tools such as train_test_split, NearestNeighbors, cosine_similarity, and the metrics module.

+ scipy.sparse provides sparse matrices, which store only non-zero entries and are therefore memory-efficient for large, mostly empty rating matrices.

+ matplotlib is available for visualization if needed.

In [1]:
import pandas as pd       # To handle dataframes and datasets
import numpy as np        # To perform numerical operations
from sklearn.model_selection import train_test_split  # For splitting datasets
from sklearn.metrics.pairwise import cosine_similarity  # To compute similarity between items
from sklearn.metrics import mean_squared_error  # To calculate model accuracy (RMSE)
from scipy.sparse import csr_matrix  # For creating sparse matrices (efficient memory usage)
from sklearn.neighbors import NearestNeighbors  # KNN model for collaborative filtering
import matplotlib.pyplot as plt  # For visualization




## Step 2: Load Datasets
Load the provided datasets (ratings.csv, movies.csv, tags.csv, and links.csv) into pandas DataFrames, then display the first few rows of each with .head() to understand their structure.

In [2]:
# Load datasets into DataFrames
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')
tags = pd.read_csv('tags.csv')
links = pd.read_csv('links.csv')

In [3]:
print(ratings.head())

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


In [4]:
print(movies.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


In [5]:
print(tags.head())

   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992
3       2    89774     Boxing story  1445715207
4       2    89774              MMA  1445715200


In [6]:
print(links.head())

   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0


+ ratings.csv: Contains user ratings for movies (columns userId, movieId, rating, timestamp).

+ movies.csv: Contains metadata for movies (movieId, title, genres).

+ tags.csv: Contains free-text tags assigned to movies by users (userId, movieId, tag, timestamp).

+ links.csv: Contains identifiers for looking up each movie on external sites (movieId, imdbId, tmdbId).
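
The timestamp columns shown above look like Unix epoch seconds. Assuming that interpretation is right, a minimal sketch of converting them to readable datetimes follows; `toy_ratings` and the `rated_at` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Two rows copied from the ratings preview above
toy_ratings = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 3],
    "rating": [4.0, 4.0],
    "timestamp": [964982703, 964981247],
})

# Assumption: timestamps are seconds since 1970-01-01 UTC
toy_ratings["rated_at"] = pd.to_datetime(toy_ratings["timestamp"], unit="s")
print(toy_ratings[["timestamp", "rated_at"]])
```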

## Step 3: Data Preprocessing

Prepare the data for building the recommendation system. The ratings and movies datasets are merged into a movie_ratings DataFrame, which is then pivoted into a user-movie matrix: rows are users, columns are movie titles, and values are ratings. Missing values are filled with 0 to mark movies a user has not rated; because 0 is not a real rating, later averages and distances treat "unrated" as a very low score. Finally, the matrix is converted to a sparse matrix for memory efficiency.

In [7]:
# Merge ratings and movies datasets based on the common movieId
movie_ratings = pd.merge(ratings, movies, on='movieId')


In [8]:
# Create a user-movie matrix where rows are users and columns are movie titles
user_movie_matrix = movie_ratings.pivot_table(index='userId', columns='title', values='rating')


In [9]:
# Fill NaN values with 0 since missing values indicate no rating provided
user_movie_matrix.fillna(0, inplace=True)


In [10]:
# Convert the matrix to a sparse matrix to optimize memory usage
user_movie_sparse_matrix = csr_matrix(user_movie_matrix.values)

+ pivot_table() creates the user-movie matrix where each user’s ratings for specific movies are stored.

+ fillna(0) replaces missing values (NaN) with 0.

+ csr_matrix stores the data more efficiently for matrix operations.
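
As an alternative to pivoting into a dense DataFrame first, the sparse matrix can also be assembled straight from the long-format ratings. The sketch below is only an illustration on toy data: it keys columns by movieId rather than title, and `toy_ratings`, `sparse_ratings`, and `density` are hypothetical names.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for ratings.csv (hypothetical values)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# Map raw ids to contiguous row/column positions
user_codes, user_ids = pd.factorize(toy_ratings["userId"], sort=True)
movie_codes, movie_ids = pd.factorize(toy_ratings["movieId"], sort=True)

# Build the matrix from (value, (row, col)) triples; absent cells stay implicit
sparse_ratings = csr_matrix(
    (toy_ratings["rating"].to_numpy(), (user_codes, movie_codes)),
    shape=(len(user_ids), len(movie_ids)),
)

# Fraction of user-movie cells that actually hold a rating
density = sparse_ratings.nnz / (sparse_ratings.shape[0] * sparse_ratings.shape[1])
print(sparse_ratings.toarray())
print(f"density: {density:.2f}")
```

Keeping `user_ids` and `movie_ids` around makes it possible to translate matrix positions back to the original identifiers later.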


## Step 4: Build Collaborative Filtering Model (KNN-Based)

Use the K-nearest neighbors (KNN) algorithm to find users with similar preferences (similar movie ratings). We can later use these neighbors to recommend movies. The metric is cosine distance, defined as 1 minus the cosine similarity of two users' rating vectors, so a smaller distance means more similar users.
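
To make the metric concrete, here is a small sketch that computes cosine distance by hand and compares it with what NearestNeighbors reports. The rating vectors and the helper `cosine_distance` are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical rating vectors for three users over four movies (0 = unrated)
toy = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
])

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b)."""
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

manual = cosine_distance(toy[0], toy[1])

# Same metric through scikit-learn; the nearest neighbor of a row is itself
model = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2).fit(toy)
distances, indices = model.kneighbors(toy[[0]], n_neighbors=2)

print(manual, distances[0, 1], indices[0])
```

Users 0 and 2 share no rated movies, so their cosine distance is 1, the largest value possible for non-negative rating vectors.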

In [11]:
# Instantiate the KNN model using cosine distance and brute-force search
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=10)

In [12]:
# Train the KNN model using the user-movie matrix
knn.fit(user_movie_sparse_matrix)

In [13]:
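# Example: query the user at row position 0 (the first row of user_movie_matrix,
# not the user whose userId is 0). Ask for 6 neighbors because the nearest
# one is the query user itself, which leaves 5 similar users.
user_id = 0
distances, indices = knn.kneighbors(user_movie_matrix.iloc[user_id, :].values.reshape(1, -1), n_neighbors=6)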

In [14]:
# Output the indices of similar users (excluding the first, which is the user itself)
print(f"Top 5 similar users to user {user_id}: {indices.flatten()[1:]}")

Top 5 similar users to user 0: [265 312 367  56  90]


+ NearestNeighbors creates a KNN model using cosine distance (for finding similar users).

+ kneighbors() finds the nearest users to a given user.

+ indices holds the row positions of the most similar users in user_movie_matrix; these are positions, not userId values.
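
Because kneighbors() returns row positions, translating them to userId labels takes one extra lookup on the matrix index. A minimal sketch on a toy matrix follows; `toy_matrix` and `neighbor_positions` are hypothetical.

```python
import pandas as pd

# Toy user-movie matrix whose index holds userId labels (hypothetical values)
toy_matrix = pd.DataFrame(
    [[5.0, 0.0], [0.0, 3.0], [4.0, 4.0]],
    index=pd.Index([1, 2, 7], name="userId"),
    columns=["Movie A", "Movie B"],
)

# Row positions such as those kneighbors() would return
neighbor_positions = [2, 0]

# Position -> userId label
neighbor_user_ids = toy_matrix.index[neighbor_positions].tolist()
print(neighbor_user_ids)

# userId label -> position, for querying a specific user
position_of_user_7 = toy_matrix.index.get_loc(7)
print(position_of_user_7)
```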


## Step 5: Generate Recommendations Based on Collaborative Filtering

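Now that we’ve identified similar users, we aggregate their movie ratings to generate recommendations: the higher the mean rating from similar users, the higher the movie ranks. Because unrated movies are stored as 0, the mean is taken over all five neighbors, so a movie scores well only if most of them rated it highly. Movies the target user has already rated are not excluded from the ranking.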

In [15]:
# Extract ratings from similar users
similar_users = indices.flatten()[1:]  # Skip first index (the input user itself)
similar_users_ratings = user_movie_matrix.iloc[similar_users]

In [16]:
# Calculate mean ratings for each movie across the similar users
mean_ratings = similar_users_ratings.mean(axis=0)

In [17]:
mean_ratings

title
'71 (2014)                                   0.0
'Hellboy': The Seeds of Creation (2004)      0.0
'Round Midnight (1986)                       0.0
'Salem's Lot (2004)                          0.0
'Til There Was You (1997)                    0.0
                                            ... 
eXistenZ (1999)                              0.0
xXx (2002)                                   0.0
xXx: State of the Union (2005)               0.0
¡Three Amigos! (1986)                        0.6
À nous la liberté (Freedom for Us) (1931)    0.0
Length: 9719, dtype: float64

In [18]:
# Sort the mean ratings in descending order to recommend highest-rated movies
recommended_movies = mean_ratings.sort_values(ascending=False).head(10)

In [19]:
print(f"Top recommended movies for user {user_id}:")
print(recommended_movies)

Top recommended movies for user 0:
title
Aliens (1986)                                                                     4.8
Matrix, The (1999)                                                                4.8
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    4.8
Saving Private Ryan (1998)                                                        4.7
Usual Suspects, The (1995)                                                        4.5
Pulp Fiction (1994)                                                               4.5
Princess Bride, The (1987)                                                        4.4
Hunt for Red October, The (1990)                                                  4.3
Terminator, The (1984)                                                            4.3
Batman (1989)                                                                     4.2
dtype: float64


+ We calculate the mean rating for each movie across similar users.

+ We sort the movies by that mean rating and keep the top 10 as recommendations for the user.
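
One possible refinement, sketched here on toy data rather than taken from the notebook, is to average only over neighbors who actually rated a movie and to drop movies the target user has already rated. All names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

toy_matrix = pd.DataFrame(
    [
        [5.0, 0.0, 0.0, 0.0],   # target user
        [4.0, 5.0, 0.0, 2.0],   # neighbor
        [5.0, 4.0, 3.0, 0.0],   # neighbor
    ],
    columns=["Movie A", "Movie B", "Movie C", "Movie D"],
)

target_row = toy_matrix.iloc[0]
neighbor_rows = toy_matrix.iloc[[1, 2]]

# Treat 0 as "not rated" so it does not drag the average down
neighbor_means = neighbor_rows.replace(0, np.nan).mean(axis=0)

# Keep only movies the target user has not rated yet
unseen = target_row[target_row == 0].index
recommendations = neighbor_means[unseen].sort_values(ascending=False)
print(recommendations)
```

A trade-off of this variant is that a movie rated highly by a single neighbor can outrank one rated by many, so a minimum rating count may be worth adding.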


## Step 6: Build Content-Based Filtering System

In content-based filtering, we recommend movies based on their own features (such as genres and tags) rather than on other users' ratings. Here, we combine each movie's genres and tags into a single metadata string, which we’ll use to calculate content similarity.

In [20]:
# Split the pipe-separated genres string into a list of genre names
movies['genres'] = movies['genres'].str.split('|')

In [21]:
# Merge movies with tags (joining on movieId)
movies_with_tags = pd.merge(movies, tags, on='movieId', how='left')

In [22]:
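# Replace NaN tags with empty strings (assign back instead of using
# inplace=True on a selected column, which newer pandas versions warn about)
movies_with_tags['tag'] = movies_with_tags['tag'].fillna('')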

In [23]:
# Combine genres and tags into a single metadata column
movies_with_tags['metadata'] = movies_with_tags['genres'].apply(lambda x: ' '.join(x)) + ' ' + movies_with_tags['tag']

In [24]:
# Vectorize the metadata using TF-IDF (Term Frequency-Inverse Document Frequency)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_with_tags['metadata'])

In [25]:
# Compute cosine similarity between movies based on their metadata
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

+ We split genres into separate terms and append each tag; because of the left merge, a movie with several tags appears once per tag in movies_with_tags.

+ TF-IDF helps convert the text metadata (genres + tags) into numerical vectors.

+ Cosine similarity is computed to measure how similar movies are based on their content.
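
If one row per movie is preferred, the tags can be collapsed before merging. The following is a sketch under that assumption, using toy frames; `toy_movies`, `toy_tags`, `tags_per_movie`, and `toy_meta` are hypothetical names and the data is invented.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Comedy|Romance", "Comedy", "Action|Thriller"],
})
toy_tags = pd.DataFrame({
    "movieId": [1, 1, 3],
    "tag": ["funny", "wedding", "heist"],
})

# Collapse all tags of a movie into one space-separated string
tags_per_movie = (
    toy_tags.groupby("movieId")["tag"].agg(" ".join).rename("tags").reset_index()
)

# Left merge keeps movies without tags; their tags become empty strings
toy_meta = toy_movies.merge(tags_per_movie, on="movieId", how="left")
toy_meta["tags"] = toy_meta["tags"].fillna("")
toy_meta["metadata"] = (
    toy_meta["genres"].str.replace("|", " ", regex=False) + " " + toy_meta["tags"]
)

# One TF-IDF row per movie
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_meta["metadata"])
print(toy_meta[["title", "metadata"]])
print(toy_tfidf.shape)
```

With one row per movie, the similarity matrix has one row and column per title, so a title lookup never lands on a duplicate of the query movie.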

## Step 7: Generate Recommendations Based on Content Similarity

Given a movie title, we look up its cosine similarity scores against every other row and recommend the movies with the most similar content (similar genres or tags).

In [26]:
# Define a function to recommend movies based on cosine similarity
def recommend_movies_based_on_content(movie_title, top_n=5):
    # Find the index of the first row for the given movie in the dataset
    idx = movies_with_tags[movies_with_tags['title'] == movie_title].index[0]

    # Get similarity scores for the movie, sorted by highest score
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Map row positions to titles, then exclude the movie itself and repeated
    # titles (movies_with_tags has one row per movie-tag pair)
    titles = movies_with_tags['title'].iloc[[i for i, _ in sim_scores]]
    titles = titles[titles != movie_title].drop_duplicates()

    # Return the top_n most similar movies
    return titles.head(top_n)

In [29]:
# Example: Recommend movies similar to 'Toy Story (1995)'
similar_movies = recommend_movies_based_on_content('Toy Story (1995)')
print(f"Movies similar to 'Toy Story (1995)':\n{similar_movies}")


+ We compute the similarity score of the given movie against all other movies.

+ The movies with the highest similarity scores are returned as recommendations; ratings play no part in this ranking.
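
The lookup can be tried in isolation on a tiny similarity matrix. This sketch uses invented feature vectors and a hypothetical helper `top_similar`, which also returns an empty list when the title is unknown instead of raising an IndexError.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

toy_titles = pd.Series(["Movie A", "Movie B", "Movie C", "Movie D"])
toy_features = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.9, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])
toy_sim = cosine_similarity(toy_features)

def top_similar(title, k=2):
    """Return the k titles most similar to `title`, excluding itself."""
    matches = toy_titles[toy_titles == title].index
    if len(matches) == 0:
        return []  # unknown title: nothing to recommend
    idx = matches[0]
    order = np.argsort(-toy_sim[idx])   # positions sorted by descending similarity
    order = order[order != idx][:k]     # drop the query row, keep k
    return toy_titles.iloc[order].tolist()

print(top_similar("Movie A"))
print(top_similar("Not In Catalog"))
```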


## Step 8: Evaluation (RMSE Calculation)

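We evaluate the collaborative filtering model using Root Mean Squared Error (RMSE), a common metric for prediction accuracy. The check below covers a single user: it compares that user's own ratings with the mean ratings of their nearest neighbors. No ratings are held out, so the result is a rough sanity check rather than an estimate of performance on unseen data.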
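
For reference, RMSE over a set $R$ of rated (user, movie) pairs can be written as follows, where $r_{ui}$ is the actual rating and $\hat{r}_{ui}$ the predicted one. The symbols are notation chosen here for illustration, not names from the notebook.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|R|} \sum_{(u,i) \in R} \left( r_{ui} - \hat{r}_{ui} \right)^2}
$$

In the cells that follow, $R$ would contain only the movies rated by the single test user, and $\hat{r}_{ui}$ is the neighbors' mean rating for movie $i$.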

In [30]:
# Flatten actual and predicted values to calculate RMSE
actual = np.array(user_movie_matrix.iloc[user_id, :]).flatten()
predicted = np.array(mean_ratings).flatten()


In [31]:
# Keep only the movies the user has actually rated (a 0 means "not rated")
mask = actual > 0
rmse = np.sqrt(mean_squared_error(actual[mask], predicted[mask]))


In [32]:
# Print the RMSE score
print(f"RMSE for collaborative filtering: {rmse}")

RMSE for collaborative filtering: 3.1777174089327835


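+ mean_squared_error calculates the mean of the squared differences between actual and predicted ratings.

+ RMSE is the square root of that value, expressed in the same units as the ratings; lower is better. The large value above is likely inflated because neighbors who did not rate a movie contribute 0 to its predicted rating.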
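
A more conventional evaluation holds out some ratings and predicts them from the rest. The sketch below shows that workflow with train_test_split on toy data. Note the swap: to stay short, it uses a simple per-movie mean as the predictor, not the KNN model from this notebook, and every name and value in it is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy long-format ratings (hypothetical values)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 40, 10, 30, 40, 20, 30, 40],
    "rating":  [4.0, 5.0, 3.0, 4.5, 4.0, 2.0, 3.5, 3.0, 2.5, 5.0, 3.5, 1.5],
})

# Hold out 25% of the ratings for testing
train, test = train_test_split(toy_ratings, test_size=0.25, random_state=0)

# Baseline predictor: each movie's mean training rating,
# falling back to the global training mean for movies unseen in training
global_mean = train["rating"].mean()
movie_means = train.groupby("movieId")["rating"].mean()
predicted = test["movieId"].map(movie_means).fillna(global_mean)

rmse = np.sqrt(mean_squared_error(test["rating"], predicted))
print(f"Hold-out RMSE (toy baseline): {rmse:.3f}")
```

The same split could be reused for the KNN approach by building the user-movie matrix from `train` only and scoring predictions against `test`.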
