### Introduction: The Collaborative Filter leverages user-item interactions to generate personalized anime recommendations. This approach is particularly effective for users who have rated a substantial number of animes, specifically those with 50 or more ratings. The system employs the Singular Value Decomposition (SVD) algorithm from the Surprise library, a popular choice for collaborative filtering tasks.
### Data Preprocessing: The script begins by loading two datasets: a complete ratings file and an anime information file. To ensure data quality and improve model performance, we apply filtering techniques. Users with fewer than 50 ratings and anime with fewer than 100 ratings are removed from the dataset. This step sidesteps the cold-start problem by excluding sparse users and items, so the model is trained only on users and items with sufficient interaction data.
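### Model Configuration: The final SVD model is initialized with a fixed random state for reproducibility, 25 latent factors, and 10 training epochs. The notebook compares this configuration against Surprise's default `SVD()` and a 50-factor variant on the 1% sample: the defaults gave the lowest mean RMSE (1.4384 versus 1.4517 for 25 factors) but took roughly twice as long to fit per fold (10.87 s versus 5.56 s), so the chosen settings trade a small amount of accuracy for faster training.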
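For reference, the model behind Surprise's `SVD` is a biased matrix factorization. A sketch of its prediction rule and training objective, using the notation of the Surprise documentation (here $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are latent vectors whose length is `n_factors`, and $\lambda$ is the regularization strength):

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

\min_{b, p, q} \sum_{r_{ui} \in R_{\text{train}}}
  \left( r_{ui} - \hat{r}_{ui} \right)^2
  + \lambda \left( b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)
```

The objective is minimized by stochastic gradient descent, and `n_epochs` sets the number of passes over the training ratings.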
### Training and Evaluation: The model is trained on a subset of the data (1% sample) for initial testing and hyperparameter tuning. Cross-validation is performed using 5-fold CV, with Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) as evaluation metrics. This process helps in assessing the model's predictive accuracy and generalization capability.
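The two metrics can be written out directly. The following is a minimal plain-Python sketch, not the Surprise implementation; `compute_rmse`, `compute_mae`, and the toy ratings are hypothetical names and values chosen for illustration.

```python
import math


def compute_rmse(actual, predicted):
    """Root mean square error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def compute_mae(actual, predicted):
    """Mean absolute error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical ratings on the 1-10 scale (not taken from the dataset)
actual = [9, 10, 6, 8]
predicted = [8.0, 9.0, 7.0, 8.0]

print("RMSE:", compute_rmse(actual, predicted))
print("MAE:", compute_mae(actual, predicted))
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE does, which is why RMSE is never smaller than MAE on the same predictions.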
### Recommendation Generation: To generate recommendations for a specific user, the system first identifies all anime titles the user hasn't rated. It then predicts the user's potential ratings for these unseen animes using the trained SVD model. The predictions are sorted in descending order of estimated ratings, and the top 10 recommendations are selected.
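The steps described here can be sketched independently of the model. In the sketch below, `top_n_unseen` and the toy scores are hypothetical; the `predict` argument stands in for any function that maps an item id to an estimated rating.

```python
def top_n_unseen(predict, candidate_items, rated_items, n=10):
    """Return the n highest-scoring (item, score) pairs among unrated items.

    predict: callable mapping an item id to an estimated rating.
    candidate_items: iterable of all item ids that could be recommended.
    rated_items: set of item ids the user has already rated.
    """
    scored = [
        (item, predict(item))
        for item in candidate_items
        if item not in rated_items
    ]
    # Highest estimated rating first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical estimated ratings keyed by item id
toy_scores = {1: 8.8, 5: 8.4, 6: 8.2, 20: 7.9}
already_rated = {5}

top = top_n_unseen(toy_scores.get, toy_scores.keys(), already_rated, n=2)
print(top)
```

With a fitted Surprise model, `predict` could be something like `lambda iid: algo.predict(user_id, iid).est`, where `algo` and `user_id` are the fitted model and the raw user id.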
### Model Persistence: The trained model is saved to disk, allowing for quick loading and recommendation generation in future sessions without the need for retraining. This feature is particularly useful for deploying the recommendation system in a production environment.
### Scalability: While initial experiments were conducted on a 1% sample of the data, the script also trains the same configuration (25 factors, 10 epochs) on the entire filtered dataset, which took about 23 minutes of wall-clock time in the recorded run. Training on all ratings lets the model capture more user-item interactions, which may improve recommendation quality; the notebook does not cross-validate the full-data model, so that improvement is not measured here.
### Output: The system provides a list of top 10 recommended anime titles for a given user, based on the collaborative filtering approach. These recommendations are personalized, taking into account the user's rating history and the rating patterns of similar users in the dataset.

In [1]:
import pandas as pd

from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
from surprise.accuracy import rmse


In [2]:
df = pd.read_csv('data/rating_complete.csv')
anime_df = pd.read_csv('data/anime.csv')

Remove users who have rated only a few anime.

Remove anime that have only a few ratings.

In [3]:
# Set thresholds for minimum number of ratings per user and per anime
min_user_ratings = 50
min_movie_ratings = 100

# Filter users with fewer than min_user_ratings ratings
user_counts = df['user_id'].value_counts()
valid_users = user_counts[user_counts >= min_user_ratings].index
dropped_users = user_counts[user_counts < min_user_ratings].index.tolist()

# Filter anime with fewer than min_movie_ratings ratings
movie_counts = df['anime_id'].value_counts()
valid_movies = movie_counts[movie_counts >= min_movie_ratings].index
dropped_movies = movie_counts[movie_counts < min_movie_ratings].index.tolist()

# Keep only the rows whose user and anime both meet their thresholds
df = df[df['user_id'].isin(valid_users) & df['anime_id'].isin(valid_movies)]

#print("Dropped Users:", dropped_users)
#print("Dropped Movies:", dropped_movies)
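One subtlety of the filter above: both counts are computed before any rows are dropped, so removing low-count anime can leave some users with fewer than 50 ratings in the filtered frame (and vice versa). If strict thresholds are wanted, one option is to repeat the filter until nothing changes. A minimal plain-Python sketch of that idea, using hypothetical toy data and thresholds of 2; `filter_until_stable` is not part of the notebook:

```python
from collections import Counter


def filter_until_stable(ratings, min_user_ratings, min_item_ratings):
    """Repeatedly drop sparse users and items until both thresholds hold.

    ratings: list of (user_id, item_id, rating) tuples.
    """
    current = list(ratings)
    while True:
        user_counts = Counter(user for user, _, _ in current)
        item_counts = Counter(item for _, item, _ in current)
        kept = [
            (user, item, rating)
            for user, item, rating in current
            if user_counts[user] >= min_user_ratings
            and item_counts[item] >= min_item_ratings
        ]
        if len(kept) == len(current):
            return kept  # nothing was removed, so the result is stable
        current = kept


# Hypothetical toy data: item "z" is rated only by user "C"
toy_ratings = [
    ("A", "x", 9), ("A", "y", 7),
    ("B", "x", 8), ("B", "y", 6),
    ("C", "x", 5), ("C", "z", 10),
]

stable = filter_until_stable(toy_ratings, min_user_ratings=2, min_item_ratings=2)
print(stable)
```

In the toy data, a single pass removes only the `("C", "z", 10)` row and leaves user `C` with one rating; the second pass removes that user as well.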


In [4]:
print(df['rating'].unique())

df.head()

[ 9 10  6  8  7  5  3  1  2  4]


|    | user_id | anime_id | rating |
|---:|--------:|---------:|-------:|
| 35 | 1 | 22535 | 9 |
| 36 | 1 | 32281 | 10 |
| 37 | 1 | 38000 | 9 |
| 38 | 1 | 18679 | 6 |
| 39 | 1 | 37497 | 8 |


In [5]:
anime_df.head(3)

|   | MAL_ID | Name | Score | Genres | English name | Japanese name | Type | Episodes | Aired | Premiered | ... | Score-10 | Score-9 | Score-8 | Score-7 | Score-6 | Score-5 | Score-4 | Score-3 | Score-2 | Score-1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Cowboy Bebop | 8.78 | Action, Adventure, Comedy, Drama, Sci-Fi, Space | Cowboy Bebop | カウボーイビバップ | TV | 26 | Apr 3, 1998 to Apr 24, 1999 | Spring 1998 | ... | 229170.0 | 182126.0 | 131625.0 | 62330.0 | 20688.0 | 8904.0 | 3184.0 | 1357.0 | 741.0 | 1580.0 |
| 1 | 5 | Cowboy Bebop: Tengoku no Tobira | 8.39 | Action, Drama, Mystery, Sci-Fi, Space | Cowboy Bebop:The Movie | カウボーイビバップ 天国の扉 | Movie | 1 | Sep 1, 2001 | Unknown | ... | 30043.0 | 49201.0 | 49505.0 | 22632.0 | 5805.0 | 1877.0 | 577.0 | 221.0 | 109.0 | 379.0 |
| 2 | 6 | Trigun | 8.24 | Action, Sci-Fi, Adventure, Comedy, Drama, Shounen | Trigun | トライガン | TV | 26 | Apr 1, 1998 to Sep 30, 1998 | Spring 1998 | ... | 50229.0 | 75651.0 | 86142.0 | 49432.0 | 15376.0 | 5838.0 | 1965.0 | 664.0 | 316.0 | 533.0 |


In [6]:
filtered_df = df.sample( frac=0.01, random_state=42)
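Note that `df.sample(frac=0.01)` draws rows, not users, so each user keeps on average about 1% of their ratings in `filtered_df`. If complete per-user histories are wanted in the sample, one alternative is to sample user ids and keep all of their rows. A minimal plain-Python sketch under that assumption; `sample_users` and the toy data are hypothetical and not part of the notebook:

```python
import random


def sample_users(ratings, frac, seed=42):
    """Keep every rating of a random fraction of users.

    ratings: list of (user_id, item_id, rating) tuples.
    frac: fraction of distinct users to keep (at least one user is kept).
    """
    users = sorted({user for user, _, _ in ratings})
    n_keep = max(1, round(len(users) * frac))
    chosen = set(random.Random(seed).sample(users, n_keep))
    return [row for row in ratings if row[0] in chosen]


# Hypothetical toy data: 100 users with 3 ratings each
toy_ratings = [(user, item, 5) for user in range(100) for item in range(3)]

subset = sample_users(toy_ratings, frac=0.1)
print(len(subset), "ratings from", len({user for user, _, _ in subset}), "users")
```

The trade-off is that a user-level sample contains far fewer distinct users, so it exercises the model differently from a row-level sample of the same size.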

In [7]:
%%time

# Define the Reader object
reader = Reader(rating_scale=(1, 10))

# Load the data into the surprise Dataset
data = Dataset.load_from_df(filtered_df[['user_id', 'anime_id', 'rating']], reader)

# Define the SVD algorithm
algo = SVD()

# Run 5-fold cross-validation
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Train the algorithm on the entire dataset
trainset = data.build_full_trainset()
algo.fit(trainset)

# Create a mapping dictionary from anime_id to anime_name
anime_id_to_name = dict(zip(anime_df['MAL_ID'], anime_df['Name']))

# Example usage: recommend anime for a specific user
user_id = 255419  # Specify the user for whom you want recommendations
# Get a list of all anime ids
all_anime_ids = df['anime_id'].unique()
# Filter out the anime ids the user has already rated (raw ids looked up in df)
rated_anime_ids = set(df.loc[df['user_id'] == user_id, 'anime_id'])
anime_ids_to_predict = [anime_id for anime_id in all_anime_ids if anime_id not in rated_anime_ids]
# Predict ratings for the filtered anime ids
predictions = [algo.predict(user_id, anime_id) for anime_id in anime_ids_to_predict]
# Sort the predictions by estimated rating in descending order
sorted_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)
# Get the top N recommendations (e.g., top 10)
top_n_recommendations = sorted_predictions[:10]
# Extract anime names from recommendations using the mapping dictionary
recommended_anime_names = [anime_id_to_name[pred.iid] for pred in top_n_recommendations]

print("Top 10 recommended anime for user", user_id, ":", recommended_anime_names)



Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.4374  1.4442  1.4416  1.4326  1.4361  1.4384  0.0041  
MAE (testset)     1.0926  1.0987  1.0955  1.0919  1.0950  1.0947  0.0024  
Fit time          9.21    9.54    10.56   11.13   13.90   10.87   1.67    
Test time         1.62    1.70    2.24    1.90    2.30    1.95    0.27    
Top 10 recommended anime for user 255419 : ['Ookami Kodomo no Ame to Yuki', 'Made in Abyss', '3-gatsu no Lion 2nd Season', 'Mushishi', 'Ginga Eiyuu Densetsu', 'Fullmetal Alchemist: Brotherhood', 'Steins;Gate', "Gintama'", 'Mushishi Zoku Shou 2nd Season', 'Hajime no Ippo']
CPU times: total: 34.6 s
Wall time: 1min 33s


1. Set the SVD hyperparameters explicitly instead of relying on the defaults.
2. Try running the training in a VM.

In [8]:
%%time

# Define the Reader object
reader = Reader(rating_scale=(1, 10))

# Load the data into the surprise Dataset
data = Dataset.load_from_df(filtered_df[['user_id', 'anime_id', 'rating']], reader)

# Define the SVD algorithm
algo = SVD(random_state=42, n_factors=50, n_epochs=10)

# Run 5-fold cross-validation
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Train the algorithm on the entire dataset
trainset = data.build_full_trainset()
algo.fit(trainset)

# Create a mapping dictionary from anime_id to anime_name
anime_id_to_name = dict(zip(anime_df['MAL_ID'], anime_df['Name']))

# Example usage: recommend anime for a specific user
user_id = 255419  # Specify the user for whom you want recommendations
# Get a list of all anime ids
all_anime_ids = df['anime_id'].unique()
# Filter out the anime ids the user has already rated (raw ids looked up in df)
rated_anime_ids = set(df.loc[df['user_id'] == user_id, 'anime_id'])
anime_ids_to_predict = [anime_id for anime_id in all_anime_ids if anime_id not in rated_anime_ids]
# Predict ratings for the filtered anime ids
predictions = [algo.predict(user_id, anime_id) for anime_id in anime_ids_to_predict]
# Sort the predictions by estimated rating in descending order
sorted_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)
# Get the top N recommendations (e.g., top 10)
top_n_recommendations = sorted_predictions[:10]
# Extract anime names from recommendations using the mapping dictionary
recommended_anime_names = [anime_id_to_name[pred.iid] for pred in top_n_recommendations]

print("Top 10 recommended anime for user", user_id, ":", recommended_anime_names)


Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.4556  1.4471  1.4583  1.4565  1.4539  1.4543  0.0038  
MAE (testset)     1.1058  1.1007  1.1085  1.1065  1.1045  1.1052  0.0026  
Fit time          4.91    4.13    3.71    5.91    6.74    5.08    1.12    
Test time         1.71    1.45    1.48    2.63    1.97    1.85    0.43    
Top 10 recommended anime for user 255419 : ['3-gatsu no Lion 2nd Season', 'Ginga Eiyuu Densetsu', 'Hunter x Hunter (2011)', 'Gintama°', 'Steins;Gate', 'Gintama', "Gintama'", 'Owarimonogatari 2nd Season', 'Shingeki no Kyojin Season 3 Part 2', 'Fullmetal Alchemist: Brotherhood']
CPU times: total: 24 s
Wall time: 1min 1s


In [9]:
%%time

# Define the Reader object
reader = Reader(rating_scale=(1, 10))

# Load the data into the surprise Dataset
data = Dataset.load_from_df(filtered_df[['user_id', 'anime_id', 'rating']], reader)

# Define the SVD algorithm
algo = SVD(random_state=42, n_factors=25, n_epochs=10)

# Run 5-fold cross-validation
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Train the algorithm on the entire dataset
trainset = data.build_full_trainset()
algo.fit(trainset)

# Create a mapping dictionary from anime_id to anime_name
anime_id_to_name = dict(zip(anime_df['MAL_ID'], anime_df['Name']))

# Example usage: recommend anime for a specific user
user_id = 255419  # Specify the user for whom you want recommendations
# Get a list of all anime ids
all_anime_ids = df['anime_id'].unique()
# Filter out the anime ids the user has already rated (raw ids looked up in df)
rated_anime_ids = set(df.loc[df['user_id'] == user_id, 'anime_id'])
anime_ids_to_predict = [anime_id for anime_id in all_anime_ids if anime_id not in rated_anime_ids]
# Predict ratings for the filtered anime ids
predictions = [algo.predict(user_id, anime_id) for anime_id in anime_ids_to_predict]
# Sort the predictions by estimated rating in descending order
sorted_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)
# Get the top N recommendations (e.g., top 10)
top_n_recommendations = sorted_predictions[:10]
# Extract anime names from recommendations using the mapping dictionary
recommended_anime_names = [anime_id_to_name[pred.iid] for pred in top_n_recommendations]

print("Top 10 recommended anime for user", user_id, ":", recommended_anime_names)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.4503  1.4562  1.4501  1.4552  1.4468  1.4517  0.0035  
MAE (testset)     1.1008  1.1047  1.1019  1.1069  1.0997  1.1028  0.0026  
Fit time          5.82    5.50    5.48    5.40    5.60    5.56    0.14    
Test time         2.84    2.00    1.96    2.50    2.04    2.27    0.35    
Top 10 recommended anime for user 255419 : ['Steins;Gate', "Gintama'", 'Gintama', 'Gintama°', 'Hunter x Hunter (2011)', '3-gatsu no Lion 2nd Season', 'Shingeki no Kyojin Season 3 Part 2', 'Ginga Eiyuu Densetsu', 'Fullmetal Alchemist: Brotherhood', 'Koe no Katachi']
CPU times: total: 28.8 s
Wall time: 1min 8s
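The three cross-validation runs above can be compared with a little arithmetic. The sketch below copies the mean RMSE, MAE, and fit time from the recorded outputs and expresses each run relative to the default `SVD()`; the dictionary layout and variable names are illustrative only. The default run was not seeded and each configuration was run once, so the differences are indicative rather than conclusive.

```python
# Mean 5-fold results copied from the three recorded outputs above
runs = {
    "SVD() defaults": {"rmse": 1.4384, "mae": 1.0947, "fit_s": 10.87},
    "n_factors=50, n_epochs=10": {"rmse": 1.4543, "mae": 1.1052, "fit_s": 5.08},
    "n_factors=25, n_epochs=10": {"rmse": 1.4517, "mae": 1.1028, "fit_s": 5.56},
}

baseline = runs["SVD() defaults"]
summary = {}
for name, result in runs.items():
    # Relative RMSE change versus the defaults, in percent
    rmse_change_pct = (result["rmse"] - baseline["rmse"]) / baseline["rmse"] * 100
    # Fit time as a fraction of the default fit time
    fit_time_ratio = result["fit_s"] / baseline["fit_s"]
    summary[name] = (rmse_change_pct, fit_time_ratio)
    print(f"{name:<28} RMSE {rmse_change_pct:+.2f}%  fit time x{fit_time_ratio:.2f}")
```

On these numbers, the 25-factor model gives up a little under 1% in RMSE in exchange for roughly half the fit time per fold.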


In [10]:
%%time 

# Train the algorithm on the entire dataset
trainset = data.build_full_trainset()
algo.fit(trainset)

CPU times: total: 3.84 s
Wall time: 9.87 s


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2010120b400>

In [14]:
%%time

# Example usage: recommend anime for a specific user
user_id = 255419  # Specify the user for whom you want recommendations
# Get a list of all anime ids
all_anime_ids = df['anime_id'].unique()
# Filter out the anime ids the user has already rated (raw ids looked up in df)
rated_anime_ids = set(df.loc[df['user_id'] == user_id, 'anime_id'])
anime_ids_to_predict = [anime_id for anime_id in all_anime_ids if anime_id not in rated_anime_ids]
# Predict ratings for the filtered anime ids
predictions = [algo.predict(user_id, anime_id) for anime_id in anime_ids_to_predict]
# Sort the predictions by estimated rating in descending order
sorted_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)
# Get the top N recommendations (e.g., top 10)
top_n_recommendations = sorted_predictions[:10]
# Extract anime names from recommendations using the mapping dictionary
recommended_anime_names = [anime_id_to_name[pred.iid] for pred in top_n_recommendations]

print("Top 10 recommended anime for user", user_id, ":", recommended_anime_names)

Top 10 recommended anime for user 255419 : ['Steins;Gate', "Gintama'", 'Gintama', 'Gintama°', 'Hunter x Hunter (2011)', '3-gatsu no Lion 2nd Season', 'Shingeki no Kyojin Season 3 Part 2', 'Ginga Eiyuu Densetsu', 'Fullmetal Alchemist: Brotherhood', 'Koe no Katachi']
CPU times: total: 422 ms
Wall time: 1.23 s


In [11]:
from surprise.dump import dump

# Define the file path where you want to save the model
file_path = "trained_svd_model.pkl"

# Save the trained model
dump(file_path, algo=algo)

print("Model saved successfully to", file_path)

Model saved successfully to trained_svd_model.pkl


In [13]:
from surprise.dump import load

# Define the file path where the model is saved
file_path = "trained_svd_model.pkl"

# Load the model; load() returns a (predictions, algo) tuple
_, algo = load(file_path)


# Define the Reader object
reader = Reader(rating_scale=(1, 10))

# Load the data into the surprise Dataset
data = Dataset.load_from_df(filtered_df[['user_id', 'anime_id', 'rating']], reader)
trainset = data.build_full_trainset()


# Create a mapping dictionary from anime_id to anime_name
anime_id_to_name = dict(zip(anime_df['MAL_ID'], anime_df['Name']))

print("Model loaded successfully from", file_path)


Model loaded successfully from trained_svd_model.pkl


## Let's try the entire dataset

In [7]:
%%time

# Define the Reader object
reader = Reader(rating_scale=(1, 10))

# Load the data into the surprise Dataset
data = Dataset.load_from_df(df[['user_id', 'anime_id', 'rating']], reader)

# Define the SVD algorithm
algo = SVD(random_state=42, n_factors=25, n_epochs=10)

# Train the algorithm on the entire dataset
trainset = data.build_full_trainset()
algo.fit(trainset)


CPU times: total: 6min 36s
Wall time: 23min 21s


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x256814c6e80>

In [11]:
%%time

# Create a mapping dictionary from anime_id to anime_name
anime_id_to_name = dict(zip(anime_df['MAL_ID'], anime_df['Name']))

# Example usage: recommend anime for a specific user
user_id = 255419  # Specify the user for whom you want recommendations
# Get a list of all anime ids
all_anime_ids = df['anime_id'].unique()
# Filter out the anime ids the user has already rated (raw ids looked up in df)
rated_anime_ids = set(df.loc[df['user_id'] == user_id, 'anime_id'])
anime_ids_to_predict = [anime_id for anime_id in all_anime_ids if anime_id not in rated_anime_ids]
# Predict ratings for the filtered anime ids
predictions = [algo.predict(user_id, anime_id) for anime_id in anime_ids_to_predict]
# Sort the predictions by estimated rating in descending order
sorted_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)
# Get the top N recommendations (e.g., top 10)
top_n_recommendations = sorted_predictions[:10]
# Extract anime names from recommendations using the mapping dictionary
recommended_anime_names = [anime_id_to_name[pred.iid] for pred in top_n_recommendations]

print("Top 10 recommended anime for user", user_id, ":", recommended_anime_names)


Top 10 recommended anime for user 255419 : ['Monster', 'Neon Genesis Evangelion: The End of Evangelion', '3-gatsu no Lion 2nd Season', 'Aria the Origination', 'Ginga Eiyuu Densetsu', 'Gintama: The Final', 'Hunter x Hunter (2011)', 'Ashita no Joe 2', 'Fullmetal Alchemist: Brotherhood', 'Neon Genesis Evangelion']
CPU times: total: 703 ms
Wall time: 4.08 s
