
# Anime Recommender - Collaborative filtering

This project builds an anime recommender using collaborative filtering, 
a method that predicts what a user will enjoy from the ratings of users with similar tastes. 

Collaborative filtering comes in two forms:

- **User-based:** Recommends items by finding similar users and suggesting items that they have liked or interacted with.
- **Item-based:** Recommends items by finding items similar to those the user has shown interest in. This differs from content-based filtering because similarity is computed from how users rate the items, not from the items' content.

By leveraging user interactions and item similarities, this recommender provides personalized anime recommendations based on user preferences and behaviors.
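
To make the distinction concrete, here is a minimal sketch on made-up ratings (the users, items, and scores are hypothetical, not taken from the dataset). It compares two users by the items they rated, then transposes the data and compares two items by the users who rated them. Missing ratings are treated as zeros, which is an assumption of this sketch.

```python
import math

# Toy ratings (hypothetical data): user -> {item: rating}
ratings = {
    "alice": {"A": 9, "B": 8},
    "bob":   {"A": 8, "B": 9, "C": 7},
    "carol": {"C": 9, "D": 8},
}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# User-based view: compare users by the items they rated.
user_sim = cosine(ratings["alice"], ratings["bob"])

# Item-based view: transpose to item -> {user: rating}, then compare items.
by_item = {}
for user, items in ratings.items():
    for item, score in items.items():
        by_item.setdefault(item, {})[user] = score

item_sim_ab = cosine(by_item["A"], by_item["B"])  # rated by the same users
item_sim_ad = cosine(by_item["A"], by_item["D"])  # no users in common

print(user_sim, item_sim_ab, item_sim_ad)
```

Items A and B come out as highly similar because the same users rated both, while A and D share no raters and get a similarity of 0, regardless of what the shows are about.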


## Import required libraries

In [106]:
import pandas as pd
from surprise import Dataset, Reader, KNNBasic, accuracy
from surprise.model_selection import train_test_split as surprise_train_test_split
from sklearn.model_selection import train_test_split

## Import cleaned dataset

In [107]:

anime_reviews = pd.read_csv("datasets/anime_review_clean.csv")

anime_data = pd.read_csv("datasets/anime_2020_clean.csv")
anime_uid_list = anime_data.uid.unique()

# Keep only reviews from users with at least 5 reviews, to reduce noise from sparse profiles
counts = anime_reviews['profile'].value_counts()
anime_reviews_improved = anime_reviews[anime_reviews['profile'].isin(counts[counts >= 5].index)]
print("anime_reviews")
print("Size before: ", anime_reviews.shape)
print("Size after: ", anime_reviews_improved.shape)

anime_reviews_uids = anime_reviews_improved.anime_uid.unique()
anime_data_improved = anime_data[anime_data['uid'].isin(anime_reviews_uids)]
print("anime_data")
print("Size before: ", anime_data.shape)
print("Size after: ", anime_data_improved.shape)


anime_reviews
Size before:  (129988, 4)
Size after:  (66425, 4)
anime_data
Size before:  (8094, 9)
Size after:  (7002, 9)



# Preparation: Merging the dataset

Before fitting the models, we merge the anime metadata with the reviews. The user-item ratings used by both the user-based and item-based approaches come from this merged table.
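
As a rough illustration of what a user-item matrix looks like, the sketch below pivots a few made-up long-format reviews into a small dense matrix with plain Python. The rows, names, and scores are hypothetical, and 0 is used to mark "no review", which is an assumption of this sketch.

```python
# Long-format reviews (hypothetical rows): (profile, anime_uid, score)
reviews = [
    ("alice", 1, 10),
    ("alice", 5, 8),
    ("bob", 1, 9),
    ("carol", 6, 7),
]

users = sorted({profile for profile, _, _ in reviews})
items = sorted({uid for _, uid, _ in reviews})
col_of = {user: j for j, user in enumerate(users)}
row_of = {uid: i for i, uid in enumerate(items)}

# Rows are anime, columns are users; 0 marks "no review".
matrix = [[0] * len(users) for _ in items]
for profile, uid, score in reviews:
    matrix[row_of[uid]][col_of[profile]] = score

for uid, row in zip(items, matrix):
    print(uid, row)
```

Most cells stay at 0 because each user reviews only a handful of titles, which is why a sparse representation is a natural fit for the real data.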

In [17]:

# Merge data
merged_data = pd.merge(anime_data.drop('score', axis=1), anime_reviews, left_on='uid', right_on='anime_uid')
merged_data_improved = pd.merge(anime_data_improved.drop('score', axis=1), anime_reviews_improved, left_on='uid', right_on='anime_uid')

merged_data_improved.head()

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,profile,anime_uid,score,scores
0,1,Cowboy Bebop,"In the year 2071, humanity has colonized sever...","['Action', 'Adventure', 'Comedy', 'Drama', 'Sc...","Apr 3, 1998 to Apr 24, 1999",26.0,930311,39,RangFlash,1,10,"{'Overall': '10', 'Story': '8', 'Animation': '..."
1,1,Cowboy Bebop,"In the year 2071, humanity has colonized sever...","['Action', 'Adventure', 'Comedy', 'Drama', 'Sc...","Apr 3, 1998 to Apr 24, 1999",26.0,930311,39,reinis-jan,1,9,"{'Overall': '9', 'Story': '7', 'Animation': '9..."
2,1,Cowboy Bebop,"In the year 2071, humanity has colonized sever...","['Action', 'Adventure', 'Comedy', 'Drama', 'Sc...","Apr 3, 1998 to Apr 24, 1999",26.0,930311,39,Sephiroth1335,1,8,"{'Overall': '8', 'Story': '8', 'Animation': '8..."
3,1,Cowboy Bebop,"In the year 2071, humanity has colonized sever...","['Action', 'Adventure', 'Comedy', 'Drama', 'Sc...","Apr 3, 1998 to Apr 24, 1999",26.0,930311,39,iHitokage,1,10,"{'Overall': '10', 'Story': '10', 'Animation': ..."
4,1,Cowboy Bebop,"In the year 2071, humanity has colonized sever...","['Action', 'Adventure', 'Comedy', 'Drama', 'Sc...","Apr 3, 1998 to Apr 24, 1999",26.0,930311,39,GrimmChicken,1,9,"{'Overall': '9', 'Story': '9', 'Animation': '9..."



# Attempt 1: User-based collaborative filtering

User-based filtering takes an existing user in the dataset, finds users with similar ratings, and recommends anime based on what those users rated.  

We use KNN (k-nearest neighbours) as the algorithm.
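
The idea behind a basic KNN prediction can be sketched as a similarity-weighted average: take the k most similar users who rated the item and average their ratings, weighting each by its similarity to the target user. The function `knn_predict`, the similarity values, and the ratings below are all hypothetical and only illustrate the idea; this is not the library's implementation.

```python
def knn_predict(item, neighbour_ratings, similarities, k=2, fallback=None):
    """Predict a rating as the similarity-weighted mean of the k most similar raters."""
    # Keep only neighbours who rated the item and have a positive similarity
    candidates = [
        (similarities[user], user_ratings[item])
        for user, user_ratings in neighbour_ratings.items()
        if item in user_ratings and similarities.get(user, 0) > 0
    ]
    top_k = sorted(candidates, reverse=True)[:k]
    weight_sum = sum(sim for sim, _ in top_k)
    if weight_sum == 0:
        return fallback  # no usable neighbours for this item
    return sum(sim * rating for sim, rating in top_k) / weight_sum

# Hypothetical similarities between the target user and three other users
similarities = {"bob": 0.9, "carol": 0.5, "dave": 0.1}
neighbour_ratings = {
    "bob": {"X": 8},
    "carol": {"X": 6, "Y": 9},
    "dave": {"X": 2},
}

prediction = knn_predict("X", neighbour_ratings, similarities, k=2)
print(round(prediction, 2))
```

With k=2, only bob and carol contribute, so dave's low rating is ignored. When no neighbour has rated an item, the sketch returns `fallback`; a real system needs some default for that case.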

### Create and fit model

In [18]:
# Define rating scale
reader = Reader(rating_scale=(1, 10))

# Load data into Surprise dataset format
data = Dataset.load_from_df(merged_data_improved[['profile', 'uid', 'score']], reader)

# Split data into train and test sets
trainset, testset = surprise_train_test_split(data, test_size=0.2, random_state=42)

# Build user-based collaborative filtering model
sim_options = {'name': 'cosine', 'user_based': True}
model_1 = KNNBasic(sim_options=sim_options)
model_1.fit(trainset)



Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x1546a6b90>

### Evaluating the model

In [79]:
# Make predictions
predictions = model_1.test(testset)

# Calculate RMSE and MAE
accuracy.rmse(predictions)
accuracy.mae(predictions)

RMSE: 2.1341
MAE:  1.6463


1.6462646278669615
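
For reference, the two metrics can be computed by hand. The sketch below uses four made-up (true, predicted) pairs, not values from the model above. MAE is the average absolute error, while RMSE squares the errors before averaging, so large misses weigh more.

```python
import math

# Hypothetical (true rating, predicted rating) pairs
pairs = [(8, 7.0), (6, 8.0), (10, 9.0), (7, 7.0)]

errors = [true - predicted for true, predicted in pairs]

mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print("MAE: ", mae)
print("RMSE:", round(rmse, 4))
```

On a 1 to 10 scale, an MAE of about 1.65 means the predicted rating is off by roughly 1.6 points on average. An RMSE noticeably above the MAE suggests that some predictions miss by a wide margin.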

### Testing the recommendation system
We will attempt to generate recommendations for a specific user

In [93]:
# Generate recommendations for a specific user
user_profile = 'skrn'

# Find the anime the user has reviewed (a set makes the membership test below fast)
watched_anime_ids = set(merged_data[merged_data['profile'] == user_profile]['uid'])

# Then find the anime the user has not reviewed
not_watched_anime_ids = [uid for uid in anime_uid_list if uid not in watched_anime_ids]

# Predict ratings for items not rated by the user
predicted_ratings = {}
for anime_id in not_watched_anime_ids:
    predicted_rating = model_1.predict(user_profile, anime_id).est
    predicted_ratings[anime_id] = predicted_rating

# Recommend top 10 unwatched animes
top_n = 10
recommended_anime_ids = sorted(predicted_ratings, key=predicted_ratings.get, reverse=True)[:top_n]


# Show some of the anime the user rated highly
threshold_score = 7  # Minimum score treated as "liked"; the same value is used in the hit-rate evaluation
watched_anime_ratings = merged_data[(merged_data['profile'] == user_profile) & (merged_data['score'] >= threshold_score)]
watched_anime_ratings = watched_anime_ratings.sort_values(by='score', ascending=False).head(10)

watched_anime_titles_ids = anime_data[anime_data['uid'].isin(watched_anime_ratings['uid'])]

print("Some of the animes user watched with their titles and IDs:\n")
for index, row in watched_anime_ratings.iterrows():
    anime_info = watched_anime_titles_ids[watched_anime_titles_ids['uid'] == row['uid']]
    if not anime_info.empty:
        title = anime_info.iloc[0]['title']
        uid = anime_info.iloc[0]['uid']
        print("Title:", title, "| ID:", uid, "| Rating:", row['score'])


recommended_anime_titles = anime_data[anime_data['uid'].isin(recommended_anime_ids)]['title'].values
print("\n\nRecommended Anime Titles:\n")
for title in recommended_anime_titles:
    print(title)

Some of the animes user watched with their titles and IDs:

Title: Tengen Toppa Gurren Lagann Movie 2: Lagann-hen | ID: 4565 | Rating: 8
Title: Haikyuu!!: Karasuno Koukou vs. Shiratorizawa Gakuen Koukou | ID: 32935 | Rating: 8
Title: Death Note | ID: 1535 | Rating: 7
Title: Code Geass: Hangyaku no Lelouch | ID: 1575 | Rating: 7
Title: Tengen Toppa Gurren Lagann Movie 1: Gurren-hen | ID: 4107 | Rating: 7
Title: Haikyuu!! | ID: 20583 | Rating: 7
Title: Haikyuu!! Second Season | ID: 28891 | Rating: 7
Title: One Punch Man | ID: 30276 | Rating: 7


Recommended Anime Titles:

eX-Driver
The Big O
Itsudatte My Santa!
Kazemakase Tsukikage Ran
Mahoromatic Summer Special
MÄR
Dragon Ball Z Movie 13: Ryuuken Bakuhatsu!! Gokuu ga Yaraneba Dare ga Yaru
Dragon Ball Z Special 1: Tatta Hitori no Saishuu Kessen
Bishoujo Senshi Sailor Moon: Sailor Stars
Chicchana Yukitsukai Sugar Specials



# Attempt 2: Item-based collaborative filtering

Item-based filtering takes an anime as input and returns 'similar' anime.  
We again use KNN (k-nearest neighbours) as the algorithm.

### Preparation 
1. Split the data by user rather than by review, since the model is item-based and we want to avoid excluding items
2. Create a user-item matrix for training the model
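
The user-wise split in step 1 can be sketched with the standard library alone. This sketch swaps in `random.shuffle` for scikit-learn's `train_test_split`, and the review rows are hypothetical. It holds out whole users, then checks which items, if any, no longer appear in the training rows.

```python
import random

# Hypothetical long-format reviews: (profile, anime_uid, score)
reviews = [
    ("alice", 1, 9), ("alice", 5, 7),
    ("bob", 1, 8), ("bob", 6, 6),
    ("carol", 5, 10), ("dave", 6, 9),
    ("erin", 1, 7), ("erin", 5, 8),
]

users = sorted({profile for profile, _, _ in reviews})
rng = random.Random(42)  # fixed seed so the split is reproducible
rng.shuffle(users)

n_test = max(1, round(len(users) * 0.2))  # hold out roughly 20% of users
test_users = set(users[:n_test])
train_users = set(users[n_test:])

train_rows = [row for row in reviews if row[0] in train_users]
test_rows = [row for row in reviews if row[0] in test_users]

# Items reviewed only by held-out users would be missing from training
train_items = {uid for _, uid, _ in train_rows}
all_items = {uid for _, uid, _ in reviews}
print("test users:", sorted(test_users))
print("items missing from train:", sorted(all_items - train_items))
```

In this toy data every item has at least two reviewers, so holding out one user never removes an item. On real data, an item reviewed only by held-out users would still drop out of the training matrix.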

In [95]:
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

# Split the data into training and test sets based on users
train_users, test_users = train_test_split(merged_data_improved['profile'].unique(), test_size=0.1, random_state=42)

# Filter the data based on the split users
train_data = merged_data_improved[merged_data_improved['profile'].isin(train_users)]
test_data = merged_data_improved[merged_data_improved['profile'].isin(test_users)]

# Create the matrix for training data (rows = anime, columns = users); missing ratings are filled with 0
train_anime_pivot = train_data.pivot_table(index='uid', columns='profile', values='score').fillna(0)
train_anime_matrix = csr_matrix(train_anime_pivot.values)

print("User-Item Matrix")
print("Shape", train_anime_pivot.shape)
train_anime_pivot.head()

User-Item Matrix
Shape (6776, 4669)


uid,--Sunclaudius,-Alians-,-Animewatcher-,-Elina-,-Ereya-,-FlameHaze-,-Ghosxuto-,-Haoto-,-HippySnob-,-Lupa-,...,zerrubabbel,zeru02,zeruon,ziggyopolous,zillion29,zimmercj,znyggisen,zoddtheimmortal,zombie_pegasus,zperson5
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Create and fit model

In [105]:

from sklearn.neighbors import NearestNeighbors

# Fit the Nearest Neighbors model using cosine similarity on the training data
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(train_anime_matrix)

# Function to find similar items for a given item
def find_similar_animes(anime_id, k=5):
    try:
        query_index = train_anime_pivot.index.get_loc(anime_id)
    except KeyError:
        print(f"Anime ID {anime_id} not found in the index.")
        return []

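    # Request k+1 neighbours: the closest match is normally the query anime itself, which the loop below skips
    distances, indices = model_knn.kneighbors(train_anime_matrix[query_index], n_neighbors=k+1)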
    similar_animes = []
    for i in range(1, len(distances.flatten())):
        similar_anime_id = train_anime_pivot.index[indices.flatten()[i]]
        similar_animes.append((similar_anime_id, distances.flatten()[i]))
    return similar_animes


# Example usage: Find 5 similar items for a given item
sample_anime_id=37999

similar_animes = find_similar_animes(anime_id=sample_anime_id, k=5)

print("Input Anime:")
print(anime_data[anime_data['uid'] == sample_anime_id]['title'].values[0])

print("\nSimilar Animes:")
for anime_id, distance in similar_animes:
    anime_title = anime_data[anime_data['uid'] == anime_id]['title'].values[0]
    print(f"Anime ID: {anime_id}, Title: {anime_title}, Distance: {distance}")



Input Anime:
Kaguya-sama wa Kokurasetai: Tensai-tachi no Renai Zunousen

Similar Animes:
Anime ID: 37779, Title: Yakusoku no Neverland, Distance: 0.5904089801375199
Anime ID: 37450, Title: Seishun Buta Yarou wa Bunny Girl Senpai no Yume wo Minai, Distance: 0.6903990092752446
Anime ID: 38814, Title: Nobunaga-sensei no Osanazuma, Distance: 0.7184959349593496
Anime ID: 37349, Title: Goblin Slayer, Distance: 0.7209471418737278
Anime ID: 38680, Title: Fruits Basket 1st Season, Distance: 0.7350672899748115
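
The distance reported here is cosine distance, that is, 1 minus the cosine similarity of two item rows, so smaller values mean more similar rating patterns. The sketch below reproduces the idea by brute force on made-up item vectors. The IDs, ratings, and the helper names `cosine_distance` and `nearest_items` are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical item rows: one rating per user, 0 meaning "no review"
item_vectors = {
    101: [9, 0, 8, 0],
    102: [8, 0, 9, 0],
    103: [0, 7, 0, 0],
    104: [9, 0, 0, 6],
}

def nearest_items(query_id, vectors, k=2):
    """Return the k closest items to query_id as (item_id, distance) pairs."""
    query = vectors[query_id]
    scored = [
        (cosine_distance(query, vector), other_id)
        for other_id, vector in vectors.items()
        if other_id != query_id  # skip the query item itself
    ]
    return [(other_id, dist) for dist, other_id in sorted(scored)[:k]]

neighbours = nearest_items(101, item_vectors, k=2)
print(neighbours)
```

Item 103 shares no raters with item 101, so its distance is 1.0 and it is never picked ahead of items with overlapping reviewers.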


### Evaluating the model - Hit rate at K
For each test user, the share of the top K recommendations that the user actually liked.
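
As a minimal sketch of the metric as it is used in this notebook (hits divided by K, which is close to what is often called precision at K), with hypothetical anime IDs:

```python
def hit_rate_at_k(recommended_ids, liked_ids, k):
    """Share of the top-k recommendations that appear in the user's liked set."""
    top_k = recommended_ids[:k]
    hits = len(set(top_k) & set(liked_ids))
    return hits / k

# Hypothetical example: 2 of the 5 recommendations are in the liked set
recommended = [1, 2, 3, 4, 5]
liked = [2, 5, 9]

rate = hit_rate_at_k(recommended, liked, k=5)
print(rate)
```

A value of 0.4 here means two of the five recommended titles were ones the user liked. Averaging this value over users gives a single score for the model.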

In [108]:

from collections import Counter

# Step 1: Get Users in the Test Set
test_users = test_data['profile'].unique()

# Step 2: Identify liked anime for each user
threshold_score = 7  # The median score observed in EDA is 6.82, so we use 7 as the threshold.
liked_anime_dict = {}

# Set value of K
k = 20

for user in test_users:
    
    user_ratings = test_data[(test_data['profile'] == user) & (test_data['score'] >= threshold_score)] 
    user_ratings_unique = user_ratings.drop_duplicates(subset=['uid'])
    
    # Keep only anime IDs present in train_anime_pivot.index
    user_ratings_filtered = user_ratings_unique[user_ratings_unique['uid'].isin(train_anime_pivot.index)]
    
    if len(user_ratings_filtered) >= k:  # Evaluate only users with at least k liked anime that appear in the training matrix
        liked_anime_dict[user] = list(zip(user_ratings_filtered['uid'].values, user_ratings_filtered['score'].values))
        
# Step 3: Use Model to Get Similar Anime and Evaluate
n_closest_anime = 8  # Number of closest anime to retrieve
total_users = len(test_users)
total_hit_rate = 0


for user, liked_anime_list in liked_anime_dict.items():
    
    # Get top k animes
    liked_animes_sorted = sorted(liked_anime_list, key=lambda x: x[1], reverse=True)
    top_k_animes = liked_animes_sorted[:k]
    
    similar_anime_counter = Counter()

    for anime_id, score in top_k_animes:
        similar_animes = find_similar_animes(anime_id, k=n_closest_anime)
        # Update the counter with similar anime IDs
        for similar_anime_id, _ in similar_animes:
            similar_anime_counter[similar_anime_id] += 1

    # Select the top k most frequent similar animes
    top_k_similar_animes = similar_anime_counter.most_common(k)
    top_k_similar_ids = [anime_id for anime_id, _ in top_k_similar_animes]
    
    # Check how many of the users liked animes were hit
    correct_predictions = len(set(anime_id for anime_id, score in liked_anime_list) & set(top_k_similar_ids))
    
    hit_rate_for_user = correct_predictions/k
    total_hit_rate += hit_rate_for_user

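# Step 4: Calculate performance metric
# Note: the sum runs only over users in liked_anime_dict, but the denominator counts every test user,
# so users with fewer than k liked anime effectively contribute a hit rate of 0.
avg_hit_rate = total_hit_rate / (total_users)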

print("Average Hit Rate at K:", avg_hit_rate)


Average Hit Rate at K: 0.012524084778420036


## Testing by inputting your liked anime
Here is a sample:  
Tokyo Ghoul, Ansatsu Kyoushitsu, Kono Subarashii Sekai ni Shukufuku wo!, Kaguya-sama, Nichijou  

liked_anime_ids = [22319, 24833, 30831, 37999, 10165]

#### First, let's fit the model on all the data instead of just the training data

In [76]:

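# Create the user-item matrix from all the data
merged_pivot = merged_data_improved.pivot_table(index='uid', columns='profile', values='score').fillna(0)

merged_matrix = csr_matrix(merged_pivot.values)

# Fit the Nearest Neighbors model using cosine similarity.
# Note: as written, this fits on train_anime_matrix, not merged_matrix, and find_similar_animes
# looks items up in train_anime_pivot, so the recommendations below come from the training split.
# Using all the data would require fitting on merged_matrix and querying merged_pivot / merged_matrix.
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(train_anime_matrix)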

In [104]:
liked_anime_ids = [22319, 24833, 30831, 37999, 10165]

similar_anime_counter = Counter()

for anime_id in liked_anime_ids:
    similar_animes = find_similar_animes(anime_id, k=n_closest_anime)
    # Update the counter with similar anime IDs
    for similar_anime_id, _ in similar_animes:
        similar_anime_counter[similar_anime_id] += 1

# Select the top 10 most frequent similar animes
top_10_similar_animes = similar_anime_counter.most_common(10)
top_10_similar_ids = [anime_id for anime_id, _ in top_10_similar_animes]


print("Liked Animes:")
for anime_id in liked_anime_ids:
    anime_info = anime_data[anime_data['uid'] == anime_id]
    if not anime_info.empty:
        title = anime_info.iloc[0]['title']
        print("Title:", title, "| ID:", anime_id)

        
print("\nRecommended")
for anime_id in top_10_similar_ids:
    anime_title = anime_data[anime_data['uid'] == anime_id]['title'].values[0]
    print(f"Anime ID: {anime_id}, Title: {anime_title}")



Liked Animes:
Title: Tokyo Ghoul | ID: 22319
Title: Ansatsu Kyoushitsu | ID: 24833
Title: Kono Subarashii Sekai ni Shukufuku wo! | ID: 30831
Title: Kaguya-sama wa Kokurasetai: Tensai-tachi no Renai Zunousen | ID: 37999
Title: Nichijou | ID: 10165

Recommended
Anime ID: 27899, Title: Tokyo Ghoul √A
Anime ID: 22199, Title: Akame ga Kill!
Anime ID: 23281, Title: Psycho-Pass 2
Anime ID: 21881, Title: Sword Art Online II
Anime ID: 11111, Title: Another
Anime ID: 1535, Title: Death Note
Anime ID: 28223, Title: Death Parade
Anime ID: 11757, Title: Sword Art Online
Anime ID: 30654, Title: Ansatsu Kyoushitsu 2nd Season
Anime ID: 25517, Title: Magic Kaito 1412
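
The aggregation step used above, counting how often each anime appears among the neighbours of the liked titles, can be sketched in isolation. The neighbour lists below are hypothetical, and unlike the cell above this sketch also filters out titles that are already in the liked list, which is an added assumption.

```python
from collections import Counter

# Hypothetical neighbour lists: liked anime ID -> IDs of its nearest neighbours
neighbours_of = {
    1: [10, 11, 2],
    2: [11, 12, 1],
    3: [11, 10, 13],
}
liked_ids = [1, 2, 3]

votes = Counter()
for seed_id in liked_ids:
    for neighbour_id in neighbours_of[seed_id]:
        if neighbour_id not in liked_ids:  # do not recommend what is already liked
            votes[neighbour_id] += 1

# Anime that are close to several liked titles collect the most votes
top_recommendations = [anime_id for anime_id, _ in votes.most_common(2)]
print(top_recommendations)
```

Anime 11 is a neighbour of all three liked titles, so it ranks first. Titles with equal vote counts keep the order in which they were first seen.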
