# Recommendation System


# Tasks:


## Data Preprocessing:

In [1]:
import pandas as pd

In [3]:
# Load the dataset
anime_df = pd.read_csv("anime.csv")

In [5]:
# Display basic information
print("Initial Data Overview:")
print(anime_df.info())
print("\nSample Data:")
print(anime_df.head())

Initial Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
None

Sample Data:
   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie 

In [7]:
# Handle missing values
# Fill missing 'genre' and 'type' with 'Unknown'
anime_df['genre'] = anime_df['genre'].fillna('Unknown')
anime_df['type'] = anime_df['type'].fillna('Unknown')

In [9]:
# Convert 'episodes' to numeric; 'Unknown' and other non-numeric values become NaN, then fill NaN with the median
anime_df['episodes'] = pd.to_numeric(anime_df['episodes'], errors='coerce')
anime_df['episodes'] = anime_df['episodes'].fillna(anime_df['episodes'].median()) 
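
As a small illustration of the coercion step above, the sketch below runs the same two calls on a toy Series. The values are made up for the example and are not taken from anime.csv.

```python
import pandas as pd

# Toy column mimicking 'episodes': numeric strings plus an 'Unknown' placeholder
# (hypothetical values, not rows from the dataset)
episodes = pd.Series(["1", "64", "Unknown", "24"])

# errors='coerce' turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(episodes, errors="coerce")

# The median ignores NaN, so it is computed over 1, 24 and 64
filled = numeric.fillna(numeric.median())
print(filled.tolist())  # [1.0, 64.0, 24.0, 24.0]
```

Median imputation keeps the column numeric without letting a few very long-running series pull the fill value upward, which a mean would do.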

In [11]:
# Fill missing ratings with the mean (median imputation or dropping the rows are alternatives)
anime_df['rating'] = anime_df['rating'].fillna(anime_df['rating'].mean())

In [13]:
# Confirm cleaned data
print("\nCleaned Data Overview:")
print(anime_df.info())


Cleaned Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  12294 non-null  float64
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 672.5+ KB
None


## Feature Extraction:


In [15]:
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder, MinMaxScaler
import numpy as np

In [17]:
# 1. Genre splitting (multi-label)
anime_df['genre'] = anime_df['genre'].apply(lambda x: x.split(', ') if isinstance(x, str) else [])

In [19]:
# 2. Multi-label binarizer for genres
mlb = MultiLabelBinarizer()
genre_features = mlb.fit_transform(anime_df['genre'])

In [21]:
# 3. One-hot encoding for 'type'
ohe = OneHotEncoder(sparse_output=False)
type_features = ohe.fit_transform(anime_df[['type']])

In [23]:
# 4. Numeric features
numeric_features = anime_df[['episodes', 'rating', 'members']]

In [25]:
# 5. Normalize numeric features
scaler = MinMaxScaler()
numeric_features_scaled = scaler.fit_transform(numeric_features)

In [27]:
# 6. Combine all features (all three blocks are dense arrays, so np.hstack is sufficient)
feature_matrix = np.hstack((genre_features, type_features, numeric_features_scaled))

print("Final feature matrix shape:", feature_matrix.shape)

Final feature matrix shape: (12294, 54)
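
The 54 columns are the genre indicators, the type indicators, and the three scaled numeric columns placed side by side. A minimal sketch of the same construction on three made-up rows (the `toy_*` names and values are illustrative assumptions, not part of the notebook):

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy rows, not taken from anime.csv
toy_genres = [["Action", "Sci-Fi"], ["Sci-Fi"], ["Drama", "Romance"]]
toy_numeric = np.array([[0.0, 0.9], [0.5, 0.7], [1.0, 0.1]])  # assumed already scaled to [0, 1]

# One indicator column per distinct genre, ordered alphabetically
toy_mlb = MultiLabelBinarizer()
toy_genre_features = toy_mlb.fit_transform(toy_genres)
print(list(toy_mlb.classes_))  # ['Action', 'Drama', 'Romance', 'Sci-Fi']
print(toy_genre_features)

# Horizontal stacking appends the numeric columns to the genre indicators
toy_matrix = np.hstack((toy_genre_features, toy_numeric))
print(toy_matrix.shape)  # (3, 6)
```

Because the genre block contributes many more columns than the numeric block, genre overlap tends to dominate any distance computed on the stacked matrix.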


## Recommendation System:


In [31]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

In [33]:
# Step 1: Compute cosine similarity matrix
cosine_sim_matrix = cosine_similarity(feature_matrix)

In [35]:
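# Step 2: Build reverse lookup from anime titles to row indices
# Keep the first row when a title appears more than once, so a lookup always returns a single index
# (calling drop_duplicates() on the Series would compare the index values, which are already unique)
anime_indices = pd.Series(anime_df.index, index=anime_df['name'])
anime_indices = anime_indices[~anime_indices.index.duplicated(keep='first')]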

In [37]:
# Step 3: Recommendation function
def recommend_anime(title, top_n=10, min_similarity=0.5):
    if title not in anime_indices:
        return f"Anime '{title}' not found in the dataset."
    
    idx = anime_indices[title]
    sim_scores = list(enumerate(cosine_sim_matrix[idx]))
    
    # Filter out the anime itself and sort by similarity
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = [(i, score) for i, score in sim_scores if i != idx and score >= min_similarity]
    
    # Get top N recommendations
    top_recommendations = sim_scores[:top_n]
    
    # .copy() so that adding the 'similarity' column does not write into a slice of anime_df
    results = anime_df.iloc[[i for i, _ in top_recommendations]][['name', 'genre', 'type', 'rating']].copy()
    results['similarity'] = [score for _, score in top_recommendations]
    
    return results.reset_index(drop=True)

In [39]:
# Example usage:
recommend_anime("Steins;Gate", top_n=5)

|   | name | genre | type | rating | similarity |
| --- | --- | --- | --- | --- | --- |
| 0 | Fireball Charming | [Sci-Fi] | TV | 6.94 | 0.805503 |
| 1 | Escha Chron | [Sci-Fi] | TV | 6.473902 | 0.800088 |
| 2 | Hoshi no Ko Poron | [Sci-Fi] | TV | 6.76 | 0.799925 |
| 3 | Yuusei Kamen | [Sci-Fi] | TV | 6.44 | 0.799589 |
| 4 | RoboDz | [Sci-Fi] | TV | 5.0 | 0.77878 |
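
The full pairwise matrix is convenient but large: 12294 × 12294 float64 values come to about 1.2 GB. A hedged alternative is to compute only the row for the query title. The sketch below uses a hypothetical helper, `top_n_similar`, on a toy feature matrix; it is not the notebook's `recommend_anime`, only an illustration of the same ranking logic.

```python
import numpy as np

def top_n_similar(features, idx, top_n=2, min_similarity=0.5):
    """Return (row index, cosine similarity) pairs for the rows closest to row `idx`."""
    # L2-normalise the rows so a dot product equals cosine similarity
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    scores = unit @ unit[idx]          # similarity of every row to the query row
    scores[idx] = -np.inf              # never recommend the query itself
    order = np.argsort(-scores)[:top_n]
    return [(int(i), float(scores[i])) for i in order if scores[i] >= min_similarity]

# Toy feature rows (illustrative values only)
toy = np.array([
    [1.0, 0.0, 1.0],   # query
    [1.0, 0.0, 0.9],   # near-duplicate of the query
    [0.0, 1.0, 0.0],   # unrelated
    [1.0, 1.0, 1.0],   # partial overlap
])
result = top_n_similar(toy, idx=0)
print(result)  # rows 1 and 3, in that order
```

This trades a one-off precomputation for a small amount of work per query, which is usually the better deal when only a few titles are looked up.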


## Evaluation:


In [41]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

In [43]:
# Step 1: Split the data
train_df, test_df = train_test_split(anime_df, test_size=0.2, random_state=42)

In [45]:
# Step 2: Restrict similarity to only train items
train_feature_matrix = feature_matrix[train_df.index]
test_feature_matrix = feature_matrix[test_df.index]

In [47]:
# Rebuild cosine sim between test items and all train items
cosine_sim_eval = cosine_similarity(test_feature_matrix, train_feature_matrix)

In [49]:
# Step 3: Evaluation loop
def evaluate_system(top_n=5, threshold=0.5):
    precision_list = []
    recall_list = []

    for i, test_index in enumerate(test_df.index):
        test_genres = set(anime_df.loc[test_index, 'genre'])

        # Get top-N similar anime from train set
        sim_scores = list(enumerate(cosine_sim_eval[i]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        sim_scores = [(idx, score) for idx, score in sim_scores if score >= threshold]
        top_recommendations = sim_scores[:top_n]
        recommended_indices = [train_df.index[idx] for idx, _ in top_recommendations]

        # Relevance: same genre = relevant
        relevant_items = 0
        for ridx in recommended_indices:
            genres = set(anime_df.loc[ridx, 'genre'])
            if test_genres.intersection(genres):
                relevant_items += 1

        # Precision@N = relevant / top_n (a query with fewer than top_n candidates above the threshold is penalised)
        precision = relevant_items / top_n if top_n > 0 else 0
        # Approximate true relevant items in training set (same genre anime)
        true_relevant = sum(1 for idx in train_df.index 
                    if test_genres.intersection(set(anime_df.loc[idx, 'genre'])))

        recall = relevant_items / true_relevant if true_relevant > 0 else 0


        precision_list.append(precision)
        recall_list.append(recall)

    # Aggregate
    avg_precision = sum(precision_list) / len(precision_list)
    avg_recall = sum(recall_list) / len(recall_list)
    f1 = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall + 1e-8)

    return {
        'Precision': round(avg_precision, 4),
        'Recall': round(avg_recall, 4),
        'F1-Score': round(f1, 4)
    }

In [51]:
# Example evaluation
evaluate_system(top_n=5, threshold=0.5)

{'Precision': 0.9996, 'Recall': 0.003, 'F1-Score': 0.006}
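
The gap between the two numbers follows from the definitions used in `evaluate_system`: precision divides by `top_n`, while recall divides by every training item that shares a genre with the query. A minimal sketch with made-up counts (the figure of 2000 relevant items is an assumption for illustration, not a measurement from the dataset):

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one query, given a ranked list and a set of relevant ids."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 2000 training items share a genre, and all 5 recommendations are among them
relevant = set(range(2000))
recommended = [0, 1, 2, 3, 4]

p, r = precision_recall_at_k(recommended, relevant, k=5)
print(p, r)  # 1.0 0.0025
```

Even a perfect top-5 list cannot exceed a recall of 5 divided by the number of relevant items, so a recall near 0.003 mostly reflects the size of the relevant set rather than poor ranking.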

### Analyze the performance of the recommendation system and identify areas of improvement.


#### What’s Going Right (Strengths)

- High precision: the recommendations it does give are almost always relevant (0.9996). Relevance is defined here as sharing at least one genre, and genre is also one of the features driving similarity, so this figure is partly circular.
- Simplicity: content-based filtering is easy to explain and interpret.
- No user data needed: the system does not rely on user ratings or history, which helps in cold-start scenarios.

#### What’s Going Wrong (Weaknesses)

- Extremely low recall
  - Only top_n anime are recommended per query, while far more training items share a genre with it, so most relevant items are necessarily missed.
  - The similarity threshold makes the system more conservative still.
- Shallow feature usage
  - Only genre, type, episodes, rating, and members are used; there is no semantic understanding of an anime's plot or context.
  - Genre overlap is binary (a genre either matches or it doesn’t), which ignores how closely related two different genres are.
- Lack of diversity
  - Recommendations are too "safe", likely limited to anime with near-exact genre/type matches.
  - Users might get bored of seeing near-identical suggestions.

#### Areas of Improvement

1. Enhance Feature Representation

| Current | Suggested |
| --- | --- |
| Genre (already multi-hot via MultiLabelBinarizer) | Move to learned genre embeddings |
| Type, episodes | Add more metadata such as studio, year, themes |
| No synopsis usage | Use TF-IDF, BERT, or Doc2Vec on anime descriptions |

2. Use Semantic Similarity

- Vectorize anime descriptions/synopses with TF-IDF or transformers (e.g., Sentence-BERT). This requires a synopsis column, which anime.csv does not contain.
- Combine description-based similarity with genre/type features for more nuanced recommendations.
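
A minimal sketch of description-based similarity with TF-IDF, assuming synopsis text were available. The three synopses are invented for the example; anime.csv has no such column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical synopses (not from the dataset)
synopses = [
    "A scientist discovers time travel and tries to change the past.",
    "Students build a machine for time travel and alter the past.",
    "A high school girl joins the volleyball club and trains for nationals.",
]

# Each synopsis becomes a sparse TF-IDF vector; common English stop words are dropped
tfidf = TfidfVectorizer(stop_words="english")
synopsis_vectors = tfidf.fit_transform(synopses)

# Pairwise cosine similarity between synopses
sim = cosine_similarity(synopsis_vectors)
print(sim.round(2))
```

The first two synopses share vocabulary such as "travel" and therefore score higher with each other than with the third. TF-IDF only matches on shared words; sentence embeddings would be needed to capture paraphrases.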

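3. Lower Filtering Threshold

- Reduce min_similarity from 0.5 to 0.3 or less to admit more candidates. This only helps queries that currently have fewer than top_n candidates above the threshold.
- Increase top_n to return more, and more diverse, recommendations; this is the change that directly raises the recall ceiling.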

4. Introduce Collaborative Filtering (if possible)

If user rating data is available:

- Use matrix factorization or k-NN collaborative filtering.
- Combine it with content features for a hybrid system.

5. Use Weighted Similarity

- Not all features need to contribute equally.
- Example: similarity = 0.7 * genre_similarity + 0.3 * description_similarity
- Alternatively, give rating and popularity (members) more weight.
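
The weighted combination can be sketched as a convex blend of two similarity matrices. `blend_similarities` and the toy matrices below are hypothetical; the 0.7/0.3 split is the example weighting from the text, not a tuned value.

```python
import numpy as np

def blend_similarities(genre_sim, description_sim, genre_weight=0.7):
    """Convex combination of two similarity matrices of identical shape."""
    if genre_sim.shape != description_sim.shape:
        raise ValueError("similarity matrices must have the same shape")
    if not 0.0 <= genre_weight <= 1.0:
        raise ValueError("genre_weight must lie in [0, 1]")
    return genre_weight * genre_sim + (1.0 - genre_weight) * description_sim

# Toy 2x2 similarity matrices (illustrative values only)
genre_sim = np.array([[1.0, 0.8], [0.8, 1.0]])
description_sim = np.array([[1.0, 0.2], [0.2, 1.0]])

blended = blend_similarities(genre_sim, description_sim)
print(blended[0, 1])  # 0.7 * 0.8 + 0.3 * 0.2, approximately 0.62
```

Keeping the weights non-negative and summing to one means the blended score stays in the same range as its inputs, so an existing threshold such as min_similarity remains meaningful.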


# Interview Questions:

## 1. Can you explain the difference between user-based and item-based collaborative filtering?

Both are types of memory-based collaborative filtering used in recommendation systems, but they differ in how similarity is calculated and applied.

### 1. User-Based Collaborative Filtering (UBCF)

Idea: Find users who are similar to the target user and recommend items that those similar users liked.

How it works:

1. Calculate similarity between users (e.g., using cosine similarity or Pearson correlation).
2. For a target user, find the k most similar users.
3. Recommend items that these similar users liked but the target user hasn’t rated yet.

Example: If Alice and Bob have watched many of the same anime, and Bob liked "Attack on Titan", but Alice hasn’t seen it, the system recommends it to Alice.

Pros:

- Captures user taste effectively.
- Personalized recommendations.

Cons:

- Doesn’t scale well with many users.
- Sparse data (users rate few items) can hurt performance.
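
The three steps can be sketched on a toy rating matrix. Everything here is an illustrative assumption: the ratings, the convention that 0 means "not rated", and the helper names `cosine_rows` and `recommend_user_based`. Real systems usually mean-centre ratings and handle missing values explicitly.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated" (toy data)
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],   # Alice
    [5.0, 5.0, 4.0, 0.0],   # Bob
    [1.0, 0.0, 2.0, 5.0],   # Carol
])

def cosine_rows(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def recommend_user_based(ratings, user, k=1):
    """Index of the unrated item with the highest neighbour-weighted score."""
    sim = cosine_rows(ratings)[user]
    sim[user] = -np.inf                      # a user is not their own neighbour
    neighbours = np.argsort(-sim)[:k]        # k most similar users
    weights = sim[neighbours]
    # Similarity-weighted average of the neighbours' ratings for every item
    scores = weights @ ratings[neighbours] / weights.sum()
    scores[ratings[user] > 0] = -np.inf      # hide items the user already rated
    return int(np.argmax(scores))

print(recommend_user_based(ratings, user=0))  # 2: Bob is Alice's nearest neighbour and rated item 2 highly
```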



### 2. Item-Based Collaborative Filtering (IBCF)

Idea: Find items similar to what the user has already liked and recommend those.

How it works:

1. Calculate similarity between items based on user ratings.
2. For a target user, look at items they've rated highly.
3. Recommend similar items based on item-to-item similarity.

Example: If Alice liked "Fullmetal Alchemist: Brotherhood", and users who liked that also liked "Steins;Gate", the system recommends "Steins;Gate" to Alice.

Pros:

- More scalable than user-based.
- Item relationships are more stable over time.

Cons:

- Less personalized if users have few ratings.
- Might miss niche preferences.
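
A matching sketch for the item-based variant, again on made-up ratings with 0 meaning "not rated". The helper name `recommend_item_based` is hypothetical, and scoring by a rating-weighted sum of item similarities is one simple choice among several.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated" (toy data)
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],   # Alice
    [5.0, 5.0, 4.0, 0.0],   # Bob
    [1.0, 0.0, 2.0, 5.0],   # Carol
])

def recommend_item_based(ratings, user):
    """Index of the unrated item most similar to what `user` has already rated."""
    # Each item is described by its column of user ratings
    item_vectors = ratings.T
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    unit = item_vectors / np.clip(norms, 1e-12, None)
    item_sim = unit @ unit.T                 # item-to-item cosine similarity
    # Score each item by its similarity to the user's rated items, weighted by those ratings
    scores = item_sim @ ratings[user]
    scores[ratings[user] > 0] = -np.inf      # hide items the user already rated
    return int(np.argmax(scores))

print(recommend_item_based(ratings, user=0))  # 2: item 2 is rated by the same users as items 0 and 1
```

The item-to-item matrix depends only on the rating columns, so it can be precomputed and reused across users, which is the main reason this variant tends to scale better.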

## 2. What is collaborative filtering, and how does it work?

Collaborative filtering is a popular technique used in recommendation systems to suggest items (like movies, anime, books) based on user behavior rather than item content.

It’s called “collaborative” because it relies on the opinions of many users to make recommendations for an individual.

How It Works:

Collaborative filtering assumes:

- If two users liked similar items in the past, they will likely enjoy similar items in the future.
- If two items were liked by the same users, they are likely similar.

Two Main Types:

| Type | Description |
| --- | --- |
| User-Based Collaborative Filtering | Recommends items liked by similar users. |
| Item-Based Collaborative Filtering | Recommends items similar to what the user already liked. |

