
# Anime Item‑Based CF Recommender (Full Data Version)

In this notebook we upgrade the simple prototype built earlier to work efficiently with the **full MyAnimeList datasets**.  The files are large (`animelist.csv` is ~1.9 GB and `rating_complete.csv` is ~780 MB), so reading them naïvely into memory and creating a dense pivot table will likely exceed the RAM available on most machines.  To make the model scalable we will:

* Read only the necessary columns and specify appropriate dtypes to reduce memory usage.
* Filter to the `rating_complete.csv` subset (completed and rated anime only) to capture stronger signals.
* Build a **sparse user–item matrix** using SciPy rather than a dense DataFrame.
* Use `NearestNeighbors` with a cosine metric to compute the top‑`k` similar anime for each title without materialising a full similarity matrix.
* Accumulate recommendations on the fly by looking up precomputed neighbours.

The end result still returns a list of anime with `MAL_ID`, `Name`, `Score`, `Type`, `Source` and `Synopsis` columns, but it scales to millions of interactions because neither a dense user–item table nor a full item–item similarity matrix is ever built.  The `n_neighbors` parameter trades speed and memory against recommendation quality.
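As a minimal sketch of the first step in the list above (reading only the needed columns with compact dtypes), the snippet below parses a tiny in-memory stand-in for `rating_complete.csv`. The column names `user_id`, `anime_id` and `score` follow the later cells of this notebook and are an assumption about the real file header; `sample_csv`, `ratings_sample` and the chunk size are illustrative names and values only.

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-in for the real CSV (column names are assumed).
sample_csv = io.StringIO(
    "user_id,anime_id,score,extra\n"
    "0,430,9,a\n"
    "0,1004,5,b\n"
    "1,430,7,c\n"
    "2,5114,10,d\n"
)

# Compact dtypes: 32-bit IDs and an 8-bit score instead of the default int64.
dtypes = {"user_id": np.int32, "anime_id": np.int32, "score": np.int8}

chunks = pd.read_csv(
    sample_csv,
    usecols=list(dtypes),  # skip every column we do not need
    dtype=dtypes,
    chunksize=2,           # tiny here; a much larger value suits the real file
)
ratings_sample = pd.concat(chunks, ignore_index=True)

print(ratings_sample.dtypes)
print(ratings_sample.memory_usage(index=False).sum(), "bytes")
```

With these dtypes each row costs 9 bytes instead of the 24 bytes needed for three `int64` columns. Check the real header (for example with `nrows=5`) before relying on the assumed column names.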


In [3]:

import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix
from sklearn.neighbors import NearestNeighbors


In [14]:
# Peek at the first few rows of each file to inspect its columns without loading the large files in full
df = pd.read_csv('anime/anime.csv', nrows=6)  # adjust path and number of rows if needed
df1 = pd.read_csv('anime/animelist.csv', nrows=6)
df2 = pd.read_csv('anime/anime_with_synopsis.csv', nrows=6)
df3 = pd.read_csv('anime/rating_complete.csv', nrows=6)
df4 = pd.read_csv('anime/watching_status.csv')  # read without a row limit (assumed to be a small lookup table)

# Get the column headers
headers = df.columns.tolist()
headers1 = df1.columns.tolist()
headers2 = df2.columns.tolist()
headers3 = df3.columns.tolist()
headers4 = df4.columns.tolist()

print("anime.csv:")
print(df.iloc[:, :6])
print("animelist.csv:")
print(df1.iloc[:, :6])
print("anime_with_synopsis.csv:")
print(df2.iloc[:, :6]) 
print("rating_complete.csv:") 
print(df3.iloc[:, :6])
print("watching_status.csv:")
print(df4.iloc[:, :6])

anime.csv:
   MAL_ID                             Name  Score  \
0       1                     Cowboy Bebop   8.78   
1       5  Cowboy Bebop: Tengoku no Tobira   8.39   
2       6                           Trigun   8.24   
3       7               Witch Hunter Robin   7.27   
4       8                   Bouken Ou Beet   6.98   
5      15                     Eyeshield 21   7.95   

                                              Genres            English name  \
0    Action, Adventure, Comedy, Drama, Sci-Fi, Space            Cowboy Bebop   
1              Action, Drama, Mystery, Sci-Fi, Space  Cowboy Bebop:The Movie   
2  Action, Sci-Fi, Adventure, Comedy, Drama, Shounen                  Trigun   
3  Action, Mystery, Police, Supernatural, Drama, ...      Witch Hunter Robin   
4          Adventure, Fantasy, Shounen, Supernatural  Beet the Vandel Buster   
5                    Action, Sports, Comedy, Shounen                 Unknown   

                      Japanese name  
0                 


## 2. Create a sparse user–item matrix

To avoid allocating a dense matrix with millions of rows or columns, we map each `user_id` and `anime_id` to contiguous indices and build a SciPy **Compressed Sparse Row (CSR)** matrix.  Each non-zero element of the matrix stores a rating.


In [None]:

# Map user_ids and anime_ids to 0-based indices via pandas categoricals
user_cat = ratings_df['user_id'].astype('category')
anime_cat = ratings_df['anime_id'].astype('category')
user_codes = user_cat.cat.codes
anime_codes = anime_cat.cat.codes

# Keep track of the mapping back to original IDs (for recommendations).
# Codes index into the sorted category list, so build the maps from
# cat.categories; unique() returns IDs in order of appearance and would
# misalign with the codes unless the file happened to be sorted by ID.
user_id_map = pd.Series(user_cat.cat.categories)
anime_id_map = pd.Series(anime_cat.cat.categories)

# Build the sparse matrix (users x items)
data = ratings_df['score'].astype(np.float32).values
user_item_sparse = coo_matrix((data, (user_codes, anime_codes))).tocsr()

print('User–item matrix shape:', user_item_sparse.shape)
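The index mapping is easy to get subtly wrong, so a quick round-trip check on toy data can help. The sketch below uses made-up IDs and scores, deliberately unsorted, and takes the lookup tables from `cat.categories`, which is the list that categorical codes index into. All `toy_*` and `*_lookup` names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Toy ratings, deliberately not sorted by ID.
toy = pd.DataFrame({
    "user_id":  [42, 7, 42, 13],
    "anime_id": [5114, 20, 20, 1],
    "score":    [10, 8, 9, 6],
})

user_cat = toy["user_id"].astype("category")
anime_cat = toy["anime_id"].astype("category")

# Codes are positions in the sorted category list.
user_lookup = np.asarray(user_cat.cat.categories)
anime_lookup = np.asarray(anime_cat.cat.categories)

toy_matrix = coo_matrix(
    (
        toy["score"].to_numpy(dtype=np.float32),
        (
            user_cat.cat.codes.to_numpy(dtype=np.int32),
            anime_cat.cat.codes.to_numpy(dtype=np.int32),
        ),
    ),
    shape=(len(user_lookup), len(anime_lookup)),
).tocsr()

# Round trip: look up the cell for user 42 and anime 5114.
row = int(np.searchsorted(user_lookup, 42))
col = int(np.searchsorted(anime_lookup, 5114))
print(toy_matrix[row, col])
print(toy_matrix.toarray())
```

If the printed cell does not match the score in the source rows, the code-to-ID mapping is misaligned.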



## 3. Compute item similarities using NearestNeighbors

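We compute the top-`k` similar anime for each title using a cosine distance metric.  This avoids creating a dense similarity matrix.  The parameter `n_neighbors` determines how many neighbours are returned for each item; because we query with the same vectors the model was fitted on, each item normally appears in its own list at distance 0, so the number of *other* titles returned is `n_neighbors - 1`.  A value between 20 and 50 is a reasonable starting point for recommendation quality; larger values cost more time and memory.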

We transpose the user–item matrix so that items correspond to rows in the matrix passed to `NearestNeighbors`.


In [None]:

# Configure the number of neighbours to retrieve
n_neighbors = 30  # adjust based on memory and desired diversity

# Transpose to get items as rows; the transpose of a CSR matrix is CSC,
# so convert back to CSR for efficient row access
item_user_sparse = user_item_sparse.T.tocsr()

# Fit NearestNeighbors model on sparse item vectors
nn_model = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=n_neighbors, n_jobs=-1)

print('Fitting NearestNeighbors model...')
nn_model.fit(item_user_sparse)

# Retrieve neighbours for all items (distances and indices)
print('Computing neighbours...')
distances, indices = nn_model.kneighbors(item_user_sparse, return_distance=True)

print('Computed nearest neighbours for', item_user_sparse.shape[0], 'items')
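To see what `kneighbors` returns, here is a small sketch on a hand-made item–user matrix. The three items and their ratings are invented; the estimator settings mirror the cell above except for a smaller `n_neighbors`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Rows are items, columns are users (toy data).
toy_items = csr_matrix(np.array([
    [5.0, 4.0, 0.0],  # item 0
    [4.0, 5.0, 0.0],  # item 1: rated by the same users as item 0
    [0.0, 0.0, 7.0],  # item 2: no users in common with items 0 and 1
], dtype=np.float32))

toy_nn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
toy_nn.fit(toy_items)
toy_dist, toy_idx = toy_nn.kneighbors(toy_items)

# Cosine similarity is 1 minus cosine distance.
toy_sim = 1.0 - toy_dist
print(toy_idx)
print(np.round(toy_sim, 3))
```

Each row of `toy_idx` lists an item's neighbours from nearest to farthest, and here the first entry is the item itself. Items 0 and 1 have similarity 40/41 (about 0.976), while item 2 shares no users with the others and so has similarity 0 to both.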



## 4. Define the recommendation function

When a new user provides a list of anime they have watched with corresponding scores (like the structure of `animelist.csv` but without `user_id`), we generate candidate recommendations by looking up the precomputed nearest neighbours for each item.  We weight each neighbour by the rating the user gave to the source item, accumulate scores, and then select the top items that the user hasn't already seen.

The function below accepts a list of dictionaries with keys `anime_id` and `score`, the neighbour index matrices (`indices` and `distances`), the `anime_id_map`, and the metadata.  It returns a DataFrame of the top `N` recommendations enriched with metadata.
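The weighting and accumulation step can be traced by hand on a tiny example. The sketch below uses hypothetical neighbour arrays for four items and works directly with internal indices, leaving out the ID mapping and metadata join that the full function performs. All `toy_*` names and values are made up for illustration.

```python
import numpy as np

# Hypothetical precomputed neighbours for 4 items (internal indices 0..3).
toy_indices = np.array([
    [0, 2, 3],
    [1, 3, 0],
    [2, 0, 3],
    [3, 1, 2],
])
toy_distances = np.array([
    [0.0, 0.5, 0.75],
    [0.0, 0.5, 0.875],
    [0.0, 0.5, 0.625],
    [0.0, 0.5, 0.625],
])

# The new user rated items 0 and 1 (internal index -> score).
toy_user = {0: 8, 1: 6}

toy_scores = {}
for item, rating in toy_user.items():
    for neigh, dist in zip(toy_indices[item], toy_distances[item]):
        neigh = int(neigh)
        if neigh in toy_user:
            continue  # already rated, so not a candidate
        # Weight the neighbour's similarity by the user's rating of the source item.
        toy_scores[neigh] = toy_scores.get(neigh, 0.0) + (1.0 - dist) * rating

ranked = sorted(toy_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Item 3 is a neighbour of both rated items, so it accumulates 0.25 × 8 + 0.5 × 6 = 5.0 and outranks item 2, which scores 0.5 × 8 = 4.0 from a single source. Because the scores are sums, titles that neighbour many of the user's rated items are favoured.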


In [None]:

def recommend_anime_full(user_ratings, anime_id_map, indices, distances, metadata, top_n=5):
    '''
    Recommend anime for a new user using precomputed nearest neighbours.

    Parameters
    ----------
    user_ratings : list of dict
        Each dict should contain an anime_id and a score. This list represents the past ratings of a user.
    anime_id_map : pd.Series
        Mapping from internal item index to original anime_id.
    indices : ndarray
        Precomputed indices of nearest neighbours for each item (output of NearestNeighbors.kneighbors).
    distances : ndarray
        Precomputed cosine distances for the nearest neighbours.
    metadata : pd.DataFrame
        Anime metadata with MAL_ID, Name, Score, Type, Source and optionally synopsis.
    top_n : int
        Number of recommendations to return.

    Returns
    -------
    pd.DataFrame
        Recommended titles with metadata and aggregated similarity scores.
    '''
    # Build a mapping from anime_id to its internal index
    id_to_internal = pd.Series(anime_id_map.index, index=anime_id_map.values)

    # Convert user_ratings list into a Series for easier lookup
    user_series = pd.Series({item['anime_id']: item['score'] for item in user_ratings})

    candidate_scores = {}

    for anime_id, rating in user_series.items():
        if anime_id not in id_to_internal:
            continue  # skip unknown anime ids
        internal_idx = id_to_internal[anime_id]
        neigh_indices = indices[internal_idx]
        neigh_distances = distances[internal_idx]

        # Convert distances to similarity (1 - distance)
        neigh_similarities = 1.0 - neigh_distances

        for neigh_internal, sim in zip(neigh_indices, neigh_similarities):
            candidate_anime_id = anime_id_map[neigh_internal]
            # Skip if the user already rated this anime
            if candidate_anime_id in user_series.index:
                continue
            candidate_scores[candidate_anime_id] = candidate_scores.get(candidate_anime_id, 0.0) + sim * rating

    if not candidate_scores:
        return pd.DataFrame(columns=['MAL_ID', 'Name', 'Score', 'Type', 'Source', 'similarity_score'])

    # Sort candidates by accumulated score
    sorted_candidates = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
    candidate_ids, scores = zip(*sorted_candidates)

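    # Create a DataFrame with metadata. reindex (rather than .loc) returns a NaN row
    # for a candidate missing from the metadata instead of raising KeyError;
    # it assumes MAL_ID values are unique.
    recs = metadata.set_index('MAL_ID').reindex(list(candidate_ids)).reset_index()
    recs['similarity_score'] = scores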

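    # Include the synopsis column if present. The metadata header is not shown in
    # this notebook, so accept a few spellings (these alternatives are assumptions).
    synopsis_cols = [col for col in ['synopsis', 'Synopsis', 'sypnopsis'] if col in recs.columns]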

    return recs[['MAL_ID', 'Name', 'Score', 'Type', 'Source'] + synopsis_cols + ['similarity_score']]



## 5. Example usage

After fitting the nearest neighbour model and computing the `distances` and `indices` arrays, you can call `recommend_anime_full` with a list of the user's rated anime.  The example below shows the structure of the call.  Replace `input_ratings` with the actual list of anime IDs and scores for your new user.  This code is commented out by default because fitting the model on the full dataset may take several minutes.


In [None]:

# Example user input (replace with real user ratings)
input_ratings = [
    {'anime_id': 20, 'score': 9},
    {'anime_id': 5114, 'score': 8},
    {'anime_id': 32281, 'score': 10},
]

# Generate recommendations (top 5)
# recommendations = recommend_anime_full(input_ratings, anime_id_map, indices, distances, anime_meta, top_n=5)
# display(recommendations)

print("To generate recommendations, uncomment the call above once the nearest neighbour model has been fitted.")
