
# Anime Item-Based Collaborative Filtering Recommender System

This notebook builds a simple **item‑based collaborative filtering (CF) model** for recommending anime titles.

The goal is to simulate how a recommender system could be built for the cold‑start case (e.g., during onboarding), when no user identifier or interaction history is available. Instead of finding similar users, the model recommends new titles based on the similarities between the items themselves (anime series).

In a real production scenario you would start from large tables like `animelist.csv`, `rating_complete.csv` and `anime.csv`. Those files contain tens of millions of user interactions and rich metadata. For illustration purposes this notebook creates a **tiny synthetic dataset** that follows the same structure but is small enough to run quickly.  You can later replace the synthetic data with the full datasets once they are available.


In [None]:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)



## 1. Prepare the data

We create two small tables:

1. `anime_metadata` – contains information about each anime, analogous to `anime.csv`.  In the real task this would include fields like MAL_ID, Name, Score, Type, Source and Synopsis.  Here we provide a few example titles.
2. `ratings` – contains user interactions (user_id, anime_id, score).  This mirrors the structure of `animelist.csv` and `rating_complete.csv`, but with a handful of users instead of millions.

Feel free to expand these tables or load the real data if you have the files available.  The workflow remains the same.


In [None]:

# Sample anime metadata (analogous to anime.csv)
anime_metadata = pd.DataFrame([
    {"MAL_ID": 20, "Name": "Naruto", "Score": 7.9, "Type": "TV", "Source": "Manga", "Synopsis": "A young ninja strives to become the leader of his village."},
    {"MAL_ID": 21, "Name": "One Piece", "Score": 8.5, "Type": "TV", "Source": "Manga", "Synopsis": "A boy with rubber powers searches for the ultimate treasure."},
    {"MAL_ID": 3,  "Name": "Death Note", "Score": 8.6, "Type": "TV", "Source": "Manga", "Synopsis": "A notebook allows its owner to kill anyone simply by writing their name."},
    {"MAL_ID": 4,  "Name": "Attack on Titan", "Score": 9.1, "Type": "TV", "Source": "Manga", "Synopsis": "Humans fight for survival inside walled cities against giant humanoids."},
    {"MAL_ID": 2,  "Name": "Fullmetal Alchemist: Brotherhood", "Score": 9.2, "Type": "TV", "Source": "Manga", "Synopsis": "Two brothers search for the Philosopher's Stone after a failed alchemy experiment."},
    {"MAL_ID": 5,  "Name": "Steins;Gate", "Score": 9.1, "Type": "TV", "Source": "Visual Novel", "Synopsis": "A group of friends discover time travel and face unforeseen consequences."},
    {"MAL_ID": 6,  "Name": "My Hero Academia", "Score": 7.9, "Type": "TV", "Source": "Manga", "Synopsis": "In a world where super powers are the norm, a powerless boy dreams of becoming a hero."},
    {"MAL_ID": 7,  "Name": "Cowboy Bebop", "Score": 8.8, "Type": "TV", "Source": "Original", "Synopsis": "A ragtag crew of bounty hunters travel the galaxy in search of criminals."},
    {"MAL_ID": 8,  "Name": "Demon Slayer", "Score": 8.7, "Type": "TV", "Source": "Manga", "Synopsis": "A boy becomes a demon slayer after his family is slaughtered."},
    {"MAL_ID": 22, "Name": "Hunter x Hunter", "Score": 9.0, "Type": "TV", "Source": "Manga", "Synopsis": "A young boy embarks on a journey to find his father by becoming a Hunter."},
])
anime_metadata


In [None]:

# Sample user‑item interactions (analogous to rating_complete.csv)
ratings = pd.DataFrame([
    {"user_id": 1, "anime_id": 20, "score": 9},  # Naruto
    {"user_id": 1, "anime_id": 21, "score": 8},  # One Piece
    {"user_id": 1, "anime_id": 22, "score": 10}, # Hunter x Hunter
    {"user_id": 2, "anime_id": 3,  "score": 7},  # Death Note
    {"user_id": 2, "anime_id": 5,  "score": 9},  # Steins;Gate
    {"user_id": 2, "anime_id": 4,  "score": 8},  # Attack on Titan
    {"user_id": 3, "anime_id": 2,  "score": 10}, # Fullmetal Alchemist: Brotherhood
    {"user_id": 3, "anime_id": 22, "score": 9},  # Hunter x Hunter
    {"user_id": 3, "anime_id": 6,  "score": 7},  # My Hero Academia
    {"user_id": 4, "anime_id": 7,  "score": 8},  # Cowboy Bebop
    {"user_id": 4, "anime_id": 8,  "score": 9},  # Demon Slayer
    {"user_id": 4, "anime_id": 20, "score": 6},  # Naruto
    {"user_id": 5, "anime_id": 4,  "score": 7},  # Attack on Titan
    {"user_id": 5, "anime_id": 20, "score": 9},  # Naruto
    {"user_id": 5, "anime_id": 2,  "score": 8},  # Fullmetal Alchemist: Brotherhood
])
ratings



## 2. Create the user–item matrix

An item‑based CF model operates on an **item–item similarity matrix**.  To build it we first pivot the ratings into a table where each row corresponds to a user and each column corresponds to an anime.  Missing ratings are filled with zeros.


In [None]:

# Create user–item matrix (users as rows, anime as columns)
user_item_matrix = ratings.pivot_table(index='user_id', columns='anime_id', values='score', fill_value=0)

# For demonstration, show the matrix with anime names rather than IDs
id_to_name = anime_metadata.set_index('MAL_ID')['Name'].to_dict()  # build the lookup once
matrix_named = user_item_matrix.rename(columns=id_to_name)
matrix_named



## 3. Compute the item–item similarity matrix

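We use cosine similarity to measure how similar two anime series are based on their user rating vectors.  Before computing similarities we center each item's rating vector by subtracting its mean, which reduces popularity bias.  Because missing ratings were filled with zeros in the previous step, those zeros are included in each mean.  On raw, non‑negative ratings cosine similarity ranges from 0 (no similarity) to 1 (identical direction); after centering it ranges from -1 (opposite rating patterns) to 1.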
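
As a small worked illustration of centered cosine similarity, the sketch below centers each item vector on the mean of its *observed* ratings only. This is one common variant and differs from the zero‑filled centering used in this notebook. The helper name `centered_cosine` and the toy ratings are assumptions made for the example.

```python
import numpy as np

def centered_cosine(a, b):
    """Cosine similarity after centering each vector on the mean of its observed ratings.

    NaN marks a missing rating; missing entries contribute zero after centering.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = np.where(np.isnan(a), 0.0, a - np.nanmean(a))
    b_c = np.where(np.isnan(b), 0.0, b - np.nanmean(b))
    denom = np.linalg.norm(a_c) * np.linalg.norm(b_c)
    return 0.0 if denom == 0 else float(a_c @ b_c / denom)

# Hypothetical ratings of three items by the same three users (NaN = not rated)
item_a = [9, 6, np.nan]
item_b = [10, 7, 8]
item_c = [5, 9, np.nan]

sim_ab = centered_cosine(item_a, item_b)  # users rank A and B the same way: positive
sim_ac = centered_cosine(item_a, item_c)  # users rank A and C in opposite ways: negative
```

Items A and C receive exactly opposite deviations from their means, so their similarity sits at the negative end of the range, which raw (uncentered) ratings could never produce.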

Since our dataset is very small, we can compute the full similarity matrix directly with `sklearn.metrics.pairwise.cosine_similarity`.  For large datasets you would typically use sparse matrices and approximate nearest neighbors to scale to millions of interactions.
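
For larger data, the sparse route mentioned above might look like the following sketch. It assumes the interactions have already been mapped to contiguous integer user and item indices; the `toy_*` arrays are made‑up values, not rows from the real files.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy (user, item, score) triples standing in for a large ratings table
toy_user_idx = np.array([0, 0, 1, 1, 2])
toy_item_idx = np.array([0, 1, 1, 2, 0])
toy_scores = np.array([9.0, 8.0, 7.0, 9.0, 6.0])
n_toy_users, n_toy_items = 3, 3

# Rows are items and columns are users, so each row is one item's rating vector.
# Only the observed ratings are stored; everything else is an implicit zero.
toy_item_user = csr_matrix(
    (toy_scores, (toy_item_idx, toy_user_idx)),
    shape=(n_toy_items, n_toy_users),
)

# dense_output=False keeps the similarity matrix sparse as well
toy_item_sim = cosine_similarity(toy_item_user, dense_output=False)
```

Note that mean‑centering a sparse matrix would turn the implicit zeros into non‑zeros, so this sketch works on raw ratings; items 0 and 2 share no raters and therefore get a similarity of zero.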


In [None]:

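# Center each item column by subtracting its column mean.
# Note: the matrix is zero-filled, so unrated entries count as zeros in the mean.
item_mean = user_item_matrix.mean(axis=0)
normalized_matrix = user_item_matrix - item_mean
normalized_matrix = normalized_matrix.fillna(0)  # defensive; a no-op for a zero-filled matrix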

# Compute cosine similarity between items (transpose so items are rows)
sim_matrix = pd.DataFrame(
    cosine_similarity(normalized_matrix.T),
    index=user_item_matrix.columns,
    columns=user_item_matrix.columns
)

# For readability, show the matrix with anime names instead of IDs
id_to_name = anime_metadata.set_index('MAL_ID')['Name'].to_dict()  # build the lookup once
sim_matrix_named = sim_matrix.rename(index=id_to_name, columns=id_to_name)
sim_matrix_named.round(2)



## 4. Build the recommendation function

The `recommend_anime` function accepts a list of dictionaries representing the current user's ratings.  Each dictionary has two keys: `anime_id` and `score`.  The function computes a weighted sum of similarity scores for all candidate items (those not already rated by the user) and returns the top `N` recommendations.

The function then looks up each recommended ID in `anime_metadata` to return a human‑readable table containing the MAL_ID, Name, Score, Type, Source and Synopsis for each recommended anime, plus a `similarity_score` column holding the aggregated score used for ranking.
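
In symbols, the aggregation can be written as below. The notation is introduced here only for illustration: $R_u$ is the set of anime the user has rated, $r_{u,i}$ is the user's score for anime $i$, and $s_{ij}$ is the entry of `sim_matrix` for anime $i$ and $j$.

$$
\mathrm{score}(j) = \sum_{i \in R_u} s_{ij}\, r_{u,i}, \qquad j \notin R_u
$$

The `top_n` candidates $j$ with the highest score are returned. The sum is not divided by $\sum_{i \in R_u} |s_{ij}|$, so it should be read as a ranking score rather than a predicted rating.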


In [None]:

def recommend_anime(user_ratings, sim_matrix, anime_metadata, top_n=5):
    '''
    Generate item-based recommendations.

    Parameters
    ----------
    user_ratings : list of dicts
        A list where each element is {'anime_id': id, 'score': rating}.  This represents the user's past interactions.
    sim_matrix : pd.DataFrame
        Precomputed item–item similarity matrix with anime IDs as both index and columns.
    anime_metadata : pd.DataFrame
        Metadata table to enrich the recommendations.
    top_n : int
        Number of recommendations to return.

    Returns
    -------
    pd.DataFrame
        A table with MAL_ID, Name, Score, Type, Source and Synopsis for the recommended anime,
        plus a similarity_score column holding the aggregated, rating-weighted score.
    '''
    user_series = pd.Series({item['anime_id']: item['score'] for item in user_ratings})
    
    # Accumulate scores for all items
    scores_accum = pd.Series(0.0, index=sim_matrix.index)
    for anime_id, rating in user_series.items():
        if anime_id in sim_matrix.columns:
            scores_accum += sim_matrix[anime_id] * rating
    
    # Remove already seen items
    scores_accum = scores_accum.drop(labels=user_series.index, errors='ignore')
    
    # Select top candidates
    top_candidates = scores_accum.sort_values(ascending=False).head(top_n)
    
    # Fetch metadata for the top candidates, in ranked order
    recommendations = (
        anime_metadata
        .set_index('MAL_ID')
        .loc[top_candidates.index]
        .rename_axis('MAL_ID')  # make sure the index is named 'MAL_ID' whatever the lookup key was called
        .reset_index()
    )
    
    # Attach each candidate's aggregated, rating-weighted similarity score
    recommendations['similarity_score'] = top_candidates.values
    
    return recommendations[['MAL_ID', 'Name', 'Score', 'Type', 'Source', 'Synopsis', 'similarity_score']]



## 5. Test the system with a sample user

Imagine a new user has finished watching three anime series and provides the following scores:

```python
input_ratings = [
    {"anime_id": 20, "score": 10},  # Naruto
    {"anime_id": 4,  "score": 8},   # Attack on Titan
    {"anime_id": 5,  "score": 9},   # Steins;Gate
]
```

The recommender should return the top five anime titles that the user has not yet watched (`top_n=5`; seven unseen titles are available in this dataset).  Below we call the function and show the results.


In [None]:

input_ratings = [
    {"anime_id": 20, "score": 10},
    {"anime_id": 4,  "score": 8},
    {"anime_id": 5,  "score": 9},
]

recommendations = recommend_anime(input_ratings, sim_matrix, anime_metadata, top_n=5)
recommendations



## 6. Discussion and next steps

This simple example illustrates how to build an item‑based collaborative filtering recommender system.  The key steps are:

1. **Load and clean the data.**  Filter out incomplete or low‑quality ratings.  In this notebook we used a synthetic dataset for demonstration purposes.
2. **Construct the user–item matrix.**  Each column represents an anime and each row a user.  Missing values are typically filled with zeros or the mean rating for that item.
3. **Compute item–item similarities.**  Cosine similarity is a common choice.  For large datasets you would use a sparse matrix representation and consider approximate methods for scalability.
4. **Generate recommendations.**  Aggregate the similarity scores for items the user has already rated, apply weighting by the user's own ratings, exclude items already seen, and pick the top candidates.

To adapt this system for production use, consider the following improvements:

- **Use the full `rating_complete.csv` dataset** to build a high‑quality similarity matrix, ensuring that only strong signals (completed anime with scores) are used.
- **Experiment with normalization techniques** such as centering ratings by user or by item, or applying TF–IDF weighting to penalize very popular titles.
- **Incorporate additional metadata** (genres, release year, etc.) as features for hybrid models, or to diversify recommendations.
- **Optimize performance** by using approximate nearest neighbor libraries (e.g., `annoy`, `faiss`) or precomputing the top k similar items offline.
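
As a rough sketch of the "precompute the top k similar items offline" idea, the snippet below keeps only each item's `k` nearest neighbors from a dense similarity matrix. The helper name `top_k_neighbors` and the toy matrix are assumptions for illustration; an approximate nearest neighbor library would replace this exact search at larger scale.

```python
import numpy as np

def top_k_neighbors(sim, k):
    """Return (indices, similarities) of the k most similar other items for each item.

    Assumes a square similarity matrix and k <= n_items - 1.
    """
    sim = np.array(sim, dtype=float)   # copy so the caller's matrix is untouched
    np.fill_diagonal(sim, -np.inf)     # exclude each item from its own neighbor list
    # argpartition finds the k largest entries per row without a full sort
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    # order those k neighbors by descending similarity
    rows = np.arange(sim.shape[0])[:, None]
    order = np.argsort(-sim[rows, idx], axis=1)
    idx = idx[rows, order]
    return idx, sim[rows, idx]

# Made-up similarity matrix for four items
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.5],
    [0.1, 0.2, 1.0, 0.9],
    [0.3, 0.5, 0.9, 1.0],
])
neighbor_idx, neighbor_sim = top_k_neighbors(toy_sim, k=2)
```

Storing only `neighbor_idx` and `neighbor_sim` reduces memory from one value per item pair to `k` values per item, at the cost of ignoring weaker neighbors when scoring.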

Because the dataset used here is small, the recommendations may not always match real‑world preferences.  However, the core workflow remains the same when scaling up to millions of users and items.
