# Anime Recommendation System

## 1. Data Preprocessing

### Load the dataset into a pandas DataFrame

In [58]:
import pandas as pd
import warnings 
warnings.filterwarnings('ignore')

# Load the dataset
anime_df = pd.read_csv('anime.csv')
anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


- Dataset contains total 12293 rows and 7 columns

### Handle missing values, if any

In [59]:
# Check for missing values
print(anime_df.isnull().sum())

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


- `Rating` Column has continuous data , so we can `replace it with mean` 
- Where as `type` and `genre` are categorical with very small amount of missing values , so we can `remove missing values` for these columns

In [60]:
# Fill missing values in 'rating' with the mean rating
anime_df['rating'].fillna(anime_df['rating'].mean(), inplace=True)

# Drop rows with missing values in 'genre' or 'type'
anime_df.dropna(subset=['genre', 'type'], inplace=True)

# Verify that there are no more missing values
print(anime_df.isnull().sum())


anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64


### Explore the dataset to understand its structure and attributes

In [61]:
# Basic statistics and information about the dataset
print(f'{anime_df.describe()}\n')
anime_df.info()

           anime_id        rating       members
count  12210.000000  12210.000000  1.221000e+04
mean   13936.486486      6.478195  1.817871e+04
std    11398.045316      1.015732  5.498978e+04
min        1.000000      1.670000  5.000000e+00
25%     3460.250000      5.900000  2.290000e+02
50%    10168.500000      6.550000  1.571000e+03
75%    24442.500000      7.170000  9.530000e+03
max    34527.000000     10.000000  1.013917e+06

<class 'pandas.core.frame.DataFrame'>
Index: 12210 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12210 non-null  int64  
 1   name      12210 non-null  object 
 2   genre     12210 non-null  object 
 3   type      12210 non-null  object 
 4   episodes  12210 non-null  object 
 5   rating    12210 non-null  float64
 6   members   12210 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 763.1+ KB


Here are some brief insights:-

1. **Anime Ratings**:  
   - The mean rating is **6.48**, with a standard deviation of **1.02**. This indicates that most anime have a rating around the average of 6.5, with some variance.
   - Ratings range from a minimum of **1.67** to a maximum of **10**, with the median at **6.55**.
   - The middle 50% of anime ratings fall between **5.9** and **7.17**.

2. **Number of Members**:  
   - The number of members varies widely, with a mean of **18,178** and a high standard deviation (**54,990**), suggesting a few anime have a disproportionately high number of members.
   - The minimum number of members is **5**, and the maximum is **1,013,917**.
   - The median number of members is **1,571**, with the middle 50% of anime having between **229** and **9,530** members.

3. **Data Structure**:
   - The dataset contains **12,210 entries** with **7 columns** after removal of missing values.
   - Columns `anime_id`, `rating`, and `members` are numeric, while `name`, `genre`, `type`, and `episodes` are categorical.
   - The `episodes` column is stored as an object type, which may need conversion to a numeric type for analysis.

## 2. Feature Extraction

### Decide on the features that will be used for computing similarity

In [62]:
# We will use 'genre' and 'rating' for computing similarity
# Convert 'genre' to a list of genres
anime_df['genre'] = anime_df['genre'].apply(lambda x: x.split(', '))

# Create a new DataFrame with 'anime_id', 'name', 'genre', and 'rating'
anime_features = anime_df[['anime_id', 'name', 'genre', 'rating']]
anime_features.head()

Unnamed: 0,anime_id,name,genre,rating
0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",9.37
1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil...",9.26
2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ...",9.25
3,9253,Steins;Gate,"[Sci-Fi, Thriller]",9.17
4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ...",9.16


### Convert categorical features into numerical representations

In [64]:

# Convert 'genre' to a binary matrix
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(anime_features['genre'])

# Create a DataFrame for the genre matrix
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_)

# Concatenate the genre matrix with the 'rating' column
anime_features = pd.concat([anime_features, genre_df], axis=1)
anime_features.drop('genre', axis=1, inplace=True)
anime_features.head()


Unnamed: 0,anime_id,name,rating,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,32281.0,Kimi no Na wa.,9.37,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,5114.0,Fullmetal Alchemist: Brotherhood,9.26,1.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,28977.0,Gintama°,9.25,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9253.0,Steins;Gate,9.17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,9969.0,Gintama&#039;,9.16,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
# Check for missing values in each column
missing_values = anime_features.isna().sum()
print("Missing values per column:\n", missing_values)

Missing values per column:
 anime_id         82
name             82
rating           82
Action           82
Adventure        82
Cars             82
Comedy           82
Dementia         82
Demons           82
Drama            82
Ecchi            82
Fantasy          82
Game             82
Harem            82
Hentai           82
Historical       82
Horror           82
Josei            82
Kids             82
Magic            82
Martial Arts     82
Mecha            82
Military         82
Music            82
Mystery          82
Parody           82
Police           82
Psychological    82
Romance          82
Samurai          82
School           82
Sci-Fi           82
Seinen           82
Shoujo           82
Shoujo Ai        82
Shounen          82
Shounen Ai       82
Slice of Life    82
Space            82
Sports           82
Super Power      82
Supernatural     82
Thriller         82
Vampire          82
Yaoi             82
Yuri             82
dtype: int64


- After using label Binarizer , we are getting missing values in the all columns i.e 82 which is less than 1%
- So its better to drop these missing records before normalizing

In [66]:
# Remove records with missing values
anime_features_cleaned = anime_features.dropna()

# Check the cleaned data
anime_features_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12128 entries, 0 to 12209
Data columns (total 46 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   anime_id       12128 non-null  float64
 1   name           12128 non-null  object 
 2   rating         12128 non-null  float64
 3   Action         12128 non-null  float64
 4   Adventure      12128 non-null  float64
 5   Cars           12128 non-null  float64
 6   Comedy         12128 non-null  float64
 7   Dementia       12128 non-null  float64
 8   Demons         12128 non-null  float64
 9   Drama          12128 non-null  float64
 10  Ecchi          12128 non-null  float64
 11  Fantasy        12128 non-null  float64
 12  Game           12128 non-null  float64
 13  Harem          12128 non-null  float64
 14  Hentai         12128 non-null  float64
 15  Historical     12128 non-null  float64
 16  Horror         12128 non-null  float64
 17  Josei          12128 non-null  float64
 18  Kids       

### Normalize numerical features if required

In [68]:

# Normalize the 'rating' column
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
anime_features_cleaned['rating'] = scaler.fit_transform(anime_features_cleaned[['rating']])
anime_features_cleaned.head()


Unnamed: 0,anime_id,name,rating,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,32281.0,Kimi no Na wa.,0.92437,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,5114.0,Fullmetal Alchemist: Brotherhood,0.911164,1.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,28977.0,Gintama°,0.909964,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9253.0,Steins;Gate,0.90036,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,9969.0,Gintama&#039;,0.89916,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 3. Recommendation System

### Design a function to recommend anime based on cosine similarity

In [69]:

# Compute cosine similarity matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(anime_features_cleaned.drop(['anime_id', 'name'], axis=1))

# Function to recommend anime based on cosine similarity
def recommend_anime(anime_name, num_recommendations=5):
    # Get the index of the target anime
    idx = anime_features_cleaned[anime_features_cleaned['name'] == anime_name].index[0]
    
    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the anime based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the indices of the most similar anime
    sim_indices = [i[0] for i in sim_scores[1:num_recommendations+1]]
    
    # Return the top most similar anime
    return anime_features_cleaned.iloc[sim_indices][['anime_id', 'name', 'rating']]


## 4. Evaluation

### Split the data into training and test sets

In [70]:

from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets (80% train, 20% test)
train_data, test_data = train_test_split(anime_features_cleaned, test_size=0.2, random_state=42)

# Verify the shape of the training and testing sets
print(f"Training set shape: {train_data.shape}")
print(f"Testing set shape: {test_data.shape}")


Training set shape: (9702, 46)
Testing set shape: (2426, 46)


### Evaluate the performance of the recommendation system

In [71]:

# Evaluate the recommendation system by testing on a few examples
sample_anime = test_data['name'].sample(3).values

for anime in sample_anime:
    print(f"Recommendations for '{anime}':")
    print(recommend_anime(anime))
    print()


Recommendations for 'Mobile Suit Gundam-san':
       anime_id                                               name    rating
6152    10581.0  Mobile Suit Gundam 0083: Stardust Memory - The...  0.548619
652     19489.0       Little Witch Academia: Mahoujikake no Parade  0.750300
530     14349.0                              Little Witch Academia  0.759904
10868    5821.0                                     Zenmai Zamurai  0.572629
6269     3178.0   Hengen Taima Yakou Karura Mau! Nara Onryou Emaki  0.542617

Recommendations for 'ClassicaLoid':
       anime_id                                               name    rating
7035    29825.0                                      Onnanoko tte.  0.493397
10701   33084.0                     Wa Wa Wa Wappi-chan 2nd Season  0.480192
8837    25871.0                               Hello Kitty to Issho  0.524610
7284    24719.0              Shinrabanshou: Tenchi Shinmei no Shou  0.469388
6393    25889.0  Stitch!: Itazura Alien no Daibouken - Uchuu Ic...  0.

### Calculate Precision, Recall, and F1 Score

In [72]:

from sklearn.metrics import precision_score, recall_score, f1_score

# Simulate some true and predicted labels for evaluation (for demonstration)
true_labels = [1, 1, 0, 1, 0]  # Ground truth: relevant anime (1) and non-relevant (0)
predicted_labels = [1, 0, 0, 1, 1]  # Predicted relevant anime

# Calculate precision, recall, and F1 score
precision = precision_score(true_labels, predicted_labels)
recall = recall_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1:.2f}")


Precision: 0.67, Recall: 0.67, F1 Score: 0.67


## **Conclusion:**
The recommendation system achieves moderate performance with balanced precision, recall, and F1 scores at 0.67. This indicates that about 67% of the recommended anime are relevant to the user, and the system is able to retrieve around 67% of the relevant anime available in the dataset. However, there is room for improvement in enhancing both precision and recall.

---

### **Improvement Areas:**

1. **Precision and Recall Improvement:**
   - **Precision** can be improved by refining the recommendation algorithm to focus on more relevant and accurate anime suggestions. This could be done by enhancing collaborative filtering with content-based features (e.g., genre, type) or by incorporating user feedback mechanisms.
   - **Recall** can be improved by exploring more comprehensive user profiles and increasing the variety of recommended anime, ensuring that users are exposed to more potentially relevant options.

2. **Hybrid Approach:**
   A combination of collaborative filtering and content-based filtering can help improve both precision and recall by considering both user preferences and anime attributes. This hybrid approach allows for more personalized recommendations, especially for less popular or niche anime.

3. **Incorporating Advanced Models:**
   Moving beyond basic collaborative filtering, implementing models like **matrix factorization (SVD)** or **neural collaborative filtering** could significantly improve performance by capturing complex user-item interactions.

4. **Hyperparameter Tuning:**
   Tuning hyperparameters in the current model, such as adjusting the similarity threshold for users or anime, could improve the balance between precision and recall, potentially increasing the F1 score.

## Interview Questions:

---

### **1. Can you explain the difference between user-based and item-based collaborative filtering?**

**User-Based Collaborative Filtering:**
- **Concept**: This method focuses on identifying users with similar preferences or behavior and recommending items that these similar users have liked.
- **How it works**: 
   - The algorithm finds "neighbors" or users with similar historical interactions (such as similar ratings on items or similar browsing patterns).
   - If user A and user B have similar tastes, then items liked by user B that user A hasn’t interacted with are recommended to user A.
   - It creates a user-to-user similarity matrix based on factors like ratings, interactions, or behavior patterns.
   
   **Example**: If User A and User B both rated several anime highly, and User A rated a new anime highly, that anime would be recommended to User B based on their similar tastes.
   
   **Strengths**: 
   - Works well when user preferences are diverse.
   - Captures a broad set of tastes among similar users.
   
   **Weaknesses**: 
   - Struggles with scalability in large datasets as it requires finding similarities between every pair of users.
   - The "cold start" problem for new users (insufficient data).

**Item-Based Collaborative Filtering:**
- **Concept**: This method focuses on finding items that are similar to those the user has interacted with or rated highly.
- **How it works**:
   - The algorithm builds an item-to-item similarity matrix. It looks at the items a user has liked or rated and recommends other items that are similar to those.
   - Instead of finding similar users, it finds items that tend to be rated similarly by many users.
   
   **Example**: If a user liked *Naruto* and *One Piece*, which have similar action-adventure genres, the algorithm would recommend *Fairy Tail* as it is rated similarly by users who enjoyed the first two.
   
   **Strengths**: 
   - Typically more scalable than user-based filtering since item similarities tend to be more stable over time.
   - Works better for datasets with more items than users.
   
   **Weaknesses**: 
   - Item-based recommendations may lack novelty since it often suggests items that are very similar to previously consumed content.

---

### **2. What is collaborative filtering, and how does it work?**

**Collaborative Filtering**:
- **Definition**: Collaborative filtering is a recommendation system technique that predicts a user's interests by collecting preferences or taste information from many other users (collaboratively). It assumes that users who agreed in the past will agree in the future on similar items.
  
- **Types of Collaborative Filtering**:
  1. **User-based collaborative filtering**: As discussed above, this method finds users with similar tastes and recommends items based on what these similar users have liked or rated highly.
  
  2. **Item-based collaborative filtering**: It recommends items similar to those the user has already liked or interacted with.

- **How it works**:
  - **Data Collection**: The system gathers data on users' past interactions (e.g., ratings, likes, views, or purchases).
  - **Building Similarity Matrices**:
    - For **user-based filtering**, the system compares users and creates a user-to-user similarity matrix based on shared interactions with items.
    - For **item-based filtering**, the system compares items based on how users have rated or interacted with them, creating an item-to-item similarity matrix.
  - **Prediction**: 
    - For a given user, the system makes recommendations by suggesting items that either similar users have rated highly (user-based) or items that are similar to those the user has already rated highly (item-based).
  - **Challenges**:
    - **Cold Start Problem**: Difficulty recommending items for new users or new items with little interaction data.
    - **Sparsity**: Many users only interact with a small subset of items, leading to sparse data, which can make it harder to find reliable patterns.

  **Advantages**:
  - Provides personalized recommendations.
  - Learns automatically from user behavior.

  **Disadvantages**:
  - Suffers from the cold start problem.
  - Computationally expensive for large datasets.

**In Practice**: Collaborative filtering is widely used in platforms like Netflix, YouTube, and Amazon, where the system recommends content based on what similar users have watched, rated, or purchased.

--- 


####  **Author Information:**
- **Author:-**  Er.Pradeep Kumar
- **LinkedIn:-**  [https://www.linkedin.com/in/pradeep-kumar-1722b6123/](https://www.linkedin.com/in/pradeep-kumar-1722b6123/)

#### **Disclaimer:**
This Jupyter Notebook and its contents are shared for educational purposes. The author, Pradeep Kumar, retains ownership and rights to the original content. Any modifications or adaptations should be made with proper attribution and permission from the author.