Anime Recommendation System assignment

In [2]:
#Step 1: Data Preprocessing

In [4]:
import pandas as pd

# Load dataset
anime = pd.read_excel("anime.xlsx")

# Explore dataset
print(anime.head())
print(anime.info())
print(anime.describe())

# Check for missing values
print(anime.isnull().sum())

# Handle missing values (example: fill NaN genres with 'Unknown')
anime['genre'] = anime['genre'].fillna('Unknown')


   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                         GintamaÂ°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   
3                                   Sci-Fi, Thriller     TV       24    9.17   
4  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   

   members  
0   200630  
1   793665  
2   114262  
3   673572  
4   151266  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  

In [6]:
#Step 2: Feature Extraction

We will use genres and rating to compute similarity.

Step 2.1: Convert genres into numeric using One-Hot Encoding



In [10]:
# Split genres by comma and one-hot encode
anime['genre'] = anime['genre'].astype(str)
genre_dummies = anime['genre'].str.get_dummies(sep=',')

# Combine with rating
features = pd.concat([genre_dummies, anime[['rating']].fillna(anime['rating'].mean())], axis=1)

print(features.head())


    Adventure   Cars   Comedy   Dementia   Demons   Drama   Ecchi   Fantasy  \
0           0      0        0          0        0       0       0         0   
1           1      0        0          0        0       1       0         1   
2           0      0        1          0        0       0       0         0   
3           0      0        0          0        0       0       0         0   
4           0      0        1          0        0       0       0         0   

    Game   Harem  ...  Slice of Life  Space  Sports  Super Power  \
0      0       0  ...              0      0       0            0   
1      0       0  ...              0      0       0            0   
2      0       0  ...              0      0       0            0   
3      0       0  ...              0      0       0            0   
4      0       0  ...              0      0       0            0   

   Supernatural  Thriller  Unknown  Vampire  Yaoi  rating  
0             0         0        0        0     0    9.3

In [12]:
#Step 2.2: Normalize features (optional for rating)

In [14]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
features[['rating']] = scaler.fit_transform(features[['rating']])


In [16]:
#Step 3: Recommendation System using Cosine Similarity

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity between all anime
cos_sim = cosine_similarity(features)

# Create a function to get top N recommendations
def recommend_anime(anime_title, anime_df, similarity_matrix, top_n=5):
    # Get index of target anime
    idx = anime_df[anime_df['name'] == anime_title].index[0]
    
    # Get similarity scores for this anime
    sim_scores = list(enumerate(similarity_matrix[idx]))
    
    # Sort by similarity score in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get top N indices (skip first one because it is the anime itself)
    top_indices = [i[0] for i in sim_scores[1:top_n+1]]
    
    # Return recommended anime names
    return anime_df['name'].iloc[top_indices].values

# Example recommendation
recommend_anime("Naruto", anime, cos_sim, top_n=5)


array(['Naruto: Shippuuden',
       'Boruto: Naruto the Movie - Naruto ga Hokage ni Natta Hi',
       'Boruto: Naruto the Movie', 'Naruto x UT',
       'Naruto: Shippuuden Movie 4 - The Lost Tower'], dtype=object)

In [20]:
#Step 4: Evaluation

Split dataset into training/testing is not always necessary for content-based recommendation (cosine similarity).

Can evaluate by precision@k or recall@k if you have user-item interactions (ratings per user).

Otherwise, analyze recommendations qualitatively:

Do the recommended anime match the genres of the target anime?

Are the top-rated anime appearing in recommendations?

In [23]:
#Step 5: Interview Questions

Difference between user-based and item-based collaborative filtering:

User-based: recommends items liked by similar users.

Item-based: recommends items similar to items a user liked.

What is collaborative filtering, and how does it work?

Collaborative filtering uses user behavior (ratings/purchases) to find patterns.

It assumes users with similar tastes will like similar items.

In [26]:
#Anime Recommendation System – Complete Notebook#

In [28]:
# Step 1: Import Libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

# Step 2: Load Dataset
anime = pd.read_excel("anime.xlsx")

# Explore Dataset
print("First 5 rows:")
print(anime.head())
print("\nDataset Info:")
print(anime.info())
print("\nMissing Values:")
print(anime.isnull().sum())

# Fill missing genres
anime['genre'] = anime['genre'].fillna('Unknown')

# Step 3: Feature Extraction
# One-hot encode genres
anime['genre'] = anime['genre'].astype(str)
genre_dummies = anime['genre'].str.get_dummies(sep=',')

# Combine with normalized rating
features = pd.concat([genre_dummies, anime[['rating']].fillna(anime['rating'].mean())], axis=1)
scaler = MinMaxScaler()
features[['rating']] = scaler.fit_transform(features[['rating']])

print("\nFeature matrix:")
print(features.head())

# Step 4: Compute Cosine Similarity
cos_sim = cosine_similarity(features)

# Step 5: Recommendation Function
def recommend_anime(anime_title, anime_df, similarity_matrix, top_n=5):
    # Check if anime exists
    if anime_title not in anime_df['name'].values:
        return f"Anime '{anime_title}' not found in dataset."
    
    idx = anime_df[anime_df['name'] == anime_title].index[0]
    sim_scores = list(enumerate(similarity_matrix[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    top_indices = [i[0] for i in sim_scores[1:top_n+1]]
    return anime_df['name'].iloc[top_indices].values

# Example Recommendations
target_anime = "Naruto"
recommendations = recommend_anime(target_anime, anime, cos_sim, top_n=5)
print(f"\nTop 5 recommendations for '{target_anime}':")
print(recommendations)

# Step 6: Evaluation Notes
print("\nEvaluation Notes:")
print("- Precision, recall, or F1-score can be computed if user-item interactions are available.")
print("- Otherwise, evaluate recommendations qualitatively: Do genres match? Are popular animes included?")


First 5 rows:
   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                         GintamaÂ°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   
3                                   Sci-Fi, Thriller     TV       24    9.17   
4  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   

   members  
0   200630  
1   793665  
2   114262  
3   673572  
4   151266  

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Colu

In [6]:
#### Corrections

✅ 1. Evaluation Metrics Implementation (Precision, Recall, F1-Score)

Since recommendation systems usually provide a list of recommended items, we will evaluate top-N recommendations using these metrics:

Precision@N = Relevant items recommended / Total items recommended

Recall@N = Relevant items recommended / Total relevant items

F1-Score@N = 2 * (Precision * Recall) / (Precision + Recall)

In [9]:
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_recommendations(actual_items, recommended_items):
    """
    actual_items: list of relevant items (ground truth)
    recommended_items: list of items recommended by the system
    """
    # Convert to binary for comparison
    y_true = [1 if item in actual_items else 0 for item in recommended_items]
    y_pred = [1]*len(recommended_items)

    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)

    return precision, recall, f1

# Example Usage:
actual = ['Movie A', 'Movie B', 'Movie C']
recommended = ['Movie A', 'Movie D', 'Movie E']

precision, recall, f1 = evaluate_recommendations(actual, recommended)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1-Score: {f1:.2f}")


Precision: 0.33, Recall: 1.00, F1-Score: 0.50


In [11]:
##✅ 2. Analysis of Results


After calculating metrics for multiple users:

If Precision is high but Recall is low → Your recommendations are accurate but not covering enough relevant items.

If Recall is high but Precision is low → You are recommending many items, but most are irrelevant.

If F1-score is balanced → Your system is performing well.

Improvements:

Try different similarity measures (cosine, Pearson, Jaccard).

Use matrix factorization (SVD) for better predictions.



In [14]:
##✅ 3. Interview Questions with Answers

Q1. What are the types of Recommendation Systems?

Content-Based Filtering: Recommends similar items based on item features.

Collaborative Filtering: Based on user-user or item-item similarity.

Hybrid Systems: Combination of both.

Q2. What is Cosine Similarity and why is it used?

Cosine similarity measures the angle between two vectors (ignores magnitude).

It is useful for text or rating-based similarity because it normalizes values.

Q3. What are limitations of Content-Based Filtering?

Requires detailed item features.

Struggles with cold-start problem for new users or new items.

Q4. How do you evaluate a Recommendation System?

Metrics: Precision@K, Recall@K, F1-score@K, MAP, NDCG.

Offline vs Online evaluation (A/B testing).