# **Recommendation System**

**Data Description:**

anime_id: Unique ID of each anime.

name: Anime title.

type: Anime broadcast type, such as TV, OVA, etc.

genre: Comma-separated list of genres for each anime.

episodes: The number of episodes of each anime.

rating: The average user rating for each anime.

members: Number of community members for each anime.

**Objective:**

The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset.

**Dataset:**
Use the Anime Dataset, which contains information about various anime, including their titles, genres, number of episodes, and user ratings.

**Tasks:**

Task 1: Data Preprocessing

1) Load the dataset into a suitable data structure (e.g., pandas DataFrame).

2) Handle missing values, if any.

3) Explore the dataset to understand its structure and attributes.

Task 2: Feature Extraction

1) Decide on the features that will be used for computing similarity (e.g., genres, user ratings).

2) Convert categorical features into numerical representations if necessary.

3) Normalize numerical features if required.

Task 3: Recommendation System

1) Design a function to recommend anime based on cosine similarity.

2) Given a target anime, recommend a list of similar anime based on cosine similarity scores.

3) Experiment with different threshold values for similarity scores to adjust the recommendation list size.

Task 4: Evaluation

1) Split the dataset into training and testing sets.

2) Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.

3) Analyze the performance of the recommendation system and identify areas of improvement.


**Task 1: Data Preprocessing**

1) Load the dataset into a suitable data structure (e.g., pandas DataFrame).

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/anime.csv')
df


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12290,5543,Under World,Hentai,OVA,1,4.28,183
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175


In [2]:
# Display the first few rows and column names
print(df.head())

   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   
3      9253                       Steins;Gate   
4      9969                     Gintama&#039;   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   
3                                   Sci-Fi, Thriller     TV       24    9.17   
4  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.16   

   members  
0   200630  
1   793665  
2   114262  
3   673572  
4   151266  


In [3]:
print(df.columns)

Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members'], dtype='object')


2) Handle missing values, if any.

In [4]:
# Check for missing values
print(df.isnull().sum())


anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


In [5]:
# Handle missing values (e.g., drop rows with missing values in critical columns)
df = df.dropna(subset=['rating','type','genre'])
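One side effect of `dropna` is worth keeping in mind for the later steps: it keeps the original row labels, so labels and row positions no longer line up. The sketch below illustrates this on a made-up three-row frame (the names and ratings are hypothetical, not taken from anime.csv).

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "rating": [8.0, None, 7.0],
})

cleaned = toy.dropna(subset=["rating"])

# Row "C" keeps its original label 2 but now sits at position 1.
label_of_c = cleaned[cleaned["name"] == "C"].index[0]
position_of_c = cleaned.index.get_loc(label_of_c)
print(label_of_c, position_of_c)

# reset_index(drop=True) realigns labels with positions.
realigned = cleaned.reset_index(drop=True)
print(realigned.index.tolist())
```

This matters whenever an index label is later used to address a row of a NumPy array, which is indexed by position.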

3) Explore the dataset to understand its structure and attributes.

In [6]:
# Basic statistics and info
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 12017 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12017 non-null  int64  
 1   name      12017 non-null  object 
 2   genre     12017 non-null  object 
 3   type      12017 non-null  object 
 4   episodes  12017 non-null  object 
 5   rating    12017 non-null  float64
 6   members   12017 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 751.1+ KB
None


In [7]:
print(df.describe())


           anime_id        rating       members
count  12017.000000  12017.000000  1.201700e+04
mean   13638.001165      6.478264  1.834888e+04
std    11231.076675      1.023857  5.537250e+04
min        1.000000      1.670000  1.200000e+01
25%     3391.000000      5.890000  2.250000e+02
50%     9959.000000      6.570000  1.552000e+03
75%    23729.000000      7.180000  9.588000e+03
max    34519.000000     10.000000  1.013917e+06


In [8]:
print(df['genre'].value_counts())

genre
Hentai                                                   816
Comedy                                                   521
Music                                                    297
Kids                                                     197
Comedy, Slice of Life                                    174
                                                        ... 
Adventure, Comedy, Horror, Shounen, Supernatural           1
Comedy, Harem, Romance, School, Seinen, Slice of Life      1
Comedy, Ecchi, Sci-Fi, Shounen                             1
Adventure, Shounen, Sports                                 1
Hentai, Slice of Life                                      1
Name: count, Length: 3229, dtype: int64


**Task 2: Feature Extraction**

1) Decide on the features that will be used for computing similarity (e.g., genres, user ratings).

2) Convert categorical features into numerical representations if necessary.

In [9]:
# One-hot encode the 'genre' column
df_genres = df['genre'].str.get_dummies(sep=', ')
df = pd.concat([df, df_genres], axis=1)

# Drop the original 'genre' column
df = df.drop(columns=['genre'])
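To see what `str.get_dummies` produces, here is a small sketch on three made-up genre strings (hypothetical values, not rows from the dataset). Each distinct genre becomes its own 0/1 column.

```python
import pandas as pd

# Hypothetical genre strings in the same "a, b, c" format as the dataset.
genres = pd.Series(["Action, Comedy", "Comedy", "Drama, Action"])

# Split on ", " and create one indicator column per distinct genre.
encoded = genres.str.get_dummies(sep=", ")
print(encoded)
```

The separator must match the dataset's format exactly: with `sep=','` the labels would keep a leading space and " Comedy" would be treated as a different genre from "Comedy".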


In [10]:
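# Optional: drop rows whose episode count is the placeholder string "Unknown".
# Left disabled because 'episodes' is not used as a feature below.
# df = df[df['episodes'] != 'Unknown']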

3) Normalize numerical features if required.

In [11]:
from sklearn.preprocessing import MinMaxScaler

# Normalize the 'rating' column
scaler = MinMaxScaler()
df['rating'] = scaler.fit_transform(df[['rating']])
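As a quick check of what `MinMaxScaler` does, the sketch below compares it with the formula (x - min) / (max - min) on three made-up ratings (hypothetical values).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ratings as a single-column 2D array, the shape sklearn expects.
ratings = np.array([[2.0], [6.0], [10.0]])

scaled = MinMaxScaler().fit_transform(ratings)

# The same rescaling written out by hand.
manual = (ratings - ratings.min()) / (ratings.max() - ratings.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.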


**Task 3: Recommendation System**

1) Design a function to recommend anime based on cosine similarity.

In [12]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

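# Create a feature matrix from the normalized rating, the raw member count,
# and the one-hot genre columns.
# Note: 'members' is not scaled, so its large values dominate the cosine
# similarity compared with the 0-1 rating and genre columns.
features = df.drop(columns=['anime_id', 'name', 'type','episodes'])
feature_matrix = features.values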

# Compute cosine similarity
similarity_matrix = cosine_similarity(feature_matrix)
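Cosine similarity between two feature vectors a and b is (a · b) / (‖a‖ ‖b‖), so a column with very large values can dominate the result. The sketch below uses three made-up rows (hypothetical member counts and two hypothetical genre flags) to show how scaling the large column changes which rows look similar. It is an illustration under those assumptions, not a re-run of the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: [members, is_action, is_romance]
X = np.array([
    [500_000.0, 1.0, 0.0],  # row 0: popular action title
    [400_000.0, 0.0, 1.0],  # row 1: popular romance title
    [200.0,     1.0, 0.0],  # row 2: obscure action title
])

def cosine(u, v):
    """Cosine similarity written out from its definition."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With raw member counts, every pair is close to 1 regardless of genre.
raw_sim = cosine_similarity(X)

# Scale only the members column to [0, 1], then recompute.
X_scaled = X.copy()
X_scaled[:, [0]] = MinMaxScaler().fit_transform(X[:, [0]])
scaled_sim = cosine_similarity(X_scaled)

print(raw_sim[0, 1], raw_sim[0, 2])
print(scaled_sim[0, 1], scaled_sim[0, 2])
```

After scaling, row 0 is closer to row 2 (same genre) than to row 1 (similar popularity only); with the raw counts the order is the other way around.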


2) Given a target anime, recommend a list of similar anime based on cosine similarity scores.

In [13]:
def recommend_anime(anime_name, df, similarity_matrix, top_n=5):
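    # Get the row position of the target anime. Positions (not index labels)
    # line up with the rows of similarity_matrix once rows have been dropped.
    anime_idx = np.flatnonzero((df['name'] == anime_name).to_numpy())[0]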

    # Get similarity scores for the target anime
    sim_scores = list(enumerate(similarity_matrix[anime_idx]))

    # Sort the anime based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the indices of the most similar anime
    sim_indices = [i[0] for i in sim_scores[1:top_n+1]]

    # Get the names of the most similar anime
    similar_animes = df['name'].iloc[sim_indices]

    return similar_animes.tolist()

# Example usage
print(recommend_anime('Naruto', df, similarity_matrix))


['Naruto: Shippuuden', 'Bleach', 'Shingeki no Kyojin', 'Kill la Kill', 'Angel Beats!']
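Skipping the first sorted entry assumes the target itself always sorts first. An alternative is to exclude the target by its position, which stays correct when another row ties at similarity 1.0. A minimal sketch, assuming a hypothetical helper `top_n_indices` and a made-up similarity row:

```python
import numpy as np

def top_n_indices(sim_row, target_idx, top_n=5):
    """Return positions of the top_n most similar rows, excluding the target.

    Hypothetical helper for illustration; not part of the notebook above.
    """
    order = np.argsort(sim_row)[::-1]      # positions sorted by descending score
    order = order[order != target_idx]     # drop the target by position
    return order[:top_n].tolist()

# Made-up similarity row: position 3 ties with the target (position 0) at 1.0.
sim_row = np.array([1.0, 0.2, 0.9, 1.0, 0.5])
print(top_n_indices(sim_row, target_idx=0, top_n=3))
```

Here position 3 is kept as the best match even though it ties with the target.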


3) Experiment with different threshold values for similarity scores to adjust the recommendation list size.

In [14]:
def recommend_anime_with_threshold(anime_name, df, similarity_matrix, threshold=0.5):
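    # Get the row position of the target anime. Positions (not index labels)
    # line up with the rows of similarity_matrix once rows have been dropped.
    anime_idx = np.flatnonzero((df['name'] == anime_name).to_numpy())[0]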

    # Get similarity scores for the target anime
    sim_scores = list(enumerate(similarity_matrix[anime_idx]))

    # Filter by threshold (the target has similarity 1.0 with itself,
    # so it also appears in the returned list)
    sim_scores = [s for s in sim_scores if s[1] > threshold]

    # Sort the anime based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the indices of the most similar anime
    sim_indices = [i[0] for i in sim_scores]

    # Get the names of the most similar anime
    similar_animes = df['name'].iloc[sim_indices]

    return similar_animes.tolist()

# Example usage with threshold
print(recommend_anime_with_threshold('Naruto', df, similarity_matrix, threshold=0.6))


['Naruto', 'Naruto: Shippuuden', 'Bleach', 'Shingeki no Kyojin', 'Kill la Kill', 'Angel Beats!', 'Hunter x Hunter (2011)', 'Fairy Tail', 'Soul Eater', 'Sword Art Online', 'Code Geass: Hangyaku no Lelouch', 'Fullmetal Alchemist: Brotherhood', 'Noragami', 'Steins;Gate', 'Durarara!!', 'One Piece', 'Ao no Exorcist', 'Death Note', 'Mirai Nikki (TV)', 'Toradora!', 'Tengen Toppa Gurren Lagann', 'Guilty Crown', 'Akame ga Kill!', 'One Punch Man', 'Fullmetal Alchemist', 'Darker than Black: Kuro no Keiyakusha', 'Code Geass: Hangyaku no Lelouch R2', 'Deadman Wonderland', 'Highschool of the Dead', 'Fate/Zero', 'Psycho-Pass', 'Black Lagoon', 'D.Gray-man', 'Dragon Ball Z', 'No Game No Life', 'Sword Art Online II', 'Elfen Lied', 'Tokyo Ghoul', 'Clannad', 'Cowboy Bebop', 'Sen to Chihiro no Kamikakushi', 'Ano Hi Mita Hana no Namae wo Bokutachi wa Mada Shiranai.', 'Ansatsu Kyoushitsu (TV)', 'K', 'Hataraku Maou-sama!', 'Bakemonogatari', 'Another', 'Shokugeki no Souma', 'Samurai Champloo', 'Zankyou no Terr
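To compare thresholds, it can help to look at how many titles pass each one. The sketch below does this for a made-up similarity row (hypothetical scores), using a hypothetical helper `count_above` that excludes the target itself.

```python
import numpy as np

# Made-up similarity scores for one target; position 0 is the target itself.
sim_row = np.array([1.0, 0.95, 0.8, 0.65, 0.4, 0.1])
target_idx = 0

def count_above(sim_row, target_idx, threshold):
    """Count rows whose similarity exceeds the threshold, ignoring the target."""
    mask = sim_row > threshold
    mask[target_idx] = False
    return int(mask.sum())

# Higher thresholds give shorter recommendation lists.
sizes = {t: count_above(sim_row, target_idx, t) for t in (0.3, 0.6, 0.9)}
print(sizes)
```

In the notebook, the same loop could be run over `similarity_matrix[anime_idx]` to choose a threshold that yields a list of manageable size.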

**Task 4: Evaluation**

1) Split the dataset into training and testing sets.

In [15]:
from sklearn.model_selection import train_test_split

# Create a training set and a test set
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)


2) Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.

In [16]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Placeholder evaluation: y_true and y_pred are hard-coded to the same value,
# so the perfect scores below do not measure the recommender's quality.
y_true = ['Naruto']  # True labels or preferences
y_pred = ['Naruto']  # Predicted labels or preferences

precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')

print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')


Precision: 1.0
Recall: 1.0
F1 Score: 1.0
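For a recommendation list, precision, recall, and F1 are usually computed by comparing the recommended items with a set of items known to be relevant. The notebook does not define such a ground truth, so the sketch below uses made-up item labels and a hypothetical helper `precision_recall_f1`. How "relevant" is defined (for example, titles a held-out user actually rated highly) is an assumption that has to be settled separately.

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall, and F1 of a recommendation list against a relevant set.

    Hypothetical helper for illustration; not part of the notebook above.
    """
    recommended = list(recommended)
    relevant = set(relevant)
    hits = sum(1 for item in recommended if item in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up example: 4 recommendations, 3 relevant items, 2 of them recommended.
recommended = ["B", "C", "D", "E"]
relevant = {"B", "D", "F"}

precision, recall, f1 = precision_recall_f1(recommended, relevant)
print(precision, recall, f1)
```

Precision is the share of recommendations that are relevant (2 of 4), and recall is the share of relevant items that were recommended (2 of 3).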


**Interview Questions:**

1) Can you explain the difference between user-based and item-based collaborative filtering?

Ans:

**User-Based Collaborative Filtering**

Concept:

User-based collaborative filtering recommends items to a user based on the preferences of similar users. The core idea is that if User A and User B had similar preferences in the past, then User B’s likes and dislikes can help predict what User A might like in the future.

How It Works:

Find Similar Users: Calculate the similarity between users based on their ratings or interactions. Common similarity metrics include Pearson correlation, cosine similarity, or Jaccard similarity.

Identify Neighbors: Identify a set of users who are most similar to the target user.

Aggregate Preferences: Use the preferences of these similar users to recommend items that the target user hasn’t interacted with yet but are liked by the similar users.

Example:

If User A and User B have similar movie ratings, and User B likes a movie that User A hasn’t watched, then the system might recommend that movie to User A.
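The steps above can be sketched on a tiny made-up ratings matrix. Everything here is an assumption for illustration: the ratings are hypothetical, 0 stands for "not rated", and the prediction is a simple similarity-weighted average over the other users who rated the item.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
R = np.array([
    [5, 4, 0, 1],  # user 0: the target user, has not rated item 2
    [5, 5, 4, 1],  # user 1: tastes close to user 0
    [1, 0, 2, 5],  # user 2: tastes differ from user 0
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = 0

# Step 1: similarity between the target user and every user.
sims = np.array([cosine(R[target], R[u]) for u in range(len(R))])
sims[target] = 0.0  # the target is not its own neighbour

# Steps 2-3: weight each neighbour's rating by similarity,
# counting only neighbours who actually rated the item.
rated = R > 0
weights = sims[:, None] * rated
predicted = (weights * R).sum(axis=0) / weights.sum(axis=0)

print(sims)
print(predicted[2])  # predicted rating of user 0 for the unrated item 2
```

User 1 gets the larger weight, so the prediction for item 2 lands closer to user 1's rating of 4 than to user 2's rating of 2.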

Strengths:

Can effectively capture complex user preferences and interactions.

Can work well if there is a rich set of user interactions.

Limitations:

Scalability: As the number of users grows, calculating similarities and finding neighbors can become computationally expensive.

Sparsity: If the user-item matrix is sparse (many users have rated only a few items), finding similar users can be challenging.

**Item-Based Collaborative Filtering**

Concept:

Item-based collaborative filtering recommends items based on the similarity between items rather than users. The idea is that if an item is similar to other items that a user has liked, then the user might also like the similar items.

How It Works:

Find Similar Items: Calculate the similarity between items based on user interactions. This involves comparing items based on their co-occurrence in user interactions or ratings.

Identify Similar Items: For each item that the user has interacted with, find items that are most similar to it.

Aggregate Recommendations: Recommend items that are similar to the items the user has liked or interacted with.

Example:

If a user likes “Inception” and “Interstellar,” the system might recommend “The Prestige” if it is similar to “Inception” and “Interstellar.”
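Item-to-item similarity can be sketched by comparing the columns of a ratings matrix instead of its rows. The matrix below is made up (hypothetical ratings, 0 meaning "not rated"), and the item indices stand in for titles.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
R = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 0, 5],
    [0, 1, 4],
], dtype=float)

# Transpose so that each row is an item, then compare items pairwise.
item_sim = cosine_similarity(R.T)

# For item 0, rank the other items by similarity and take the best match.
order = np.argsort(item_sim[0])[::-1]
order = order[order != 0]
most_similar_to_item_0 = int(order[0])

print(item_sim.round(3))
print(most_similar_to_item_0)
```

Items 0 and 1 are rated highly by the same users, so item 1 comes out as the closest match to item 0. Because this matrix depends only on the ratings, it can be computed once and reused for every user.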

Strengths:

Scalability: Generally more scalable than user-based methods, as the number of items is often smaller than the number of users.

Simplicity: Often easier to implement and compute since item similarities can be precomputed and reused.

Limitations:

Cold Start Problem: New items with few interactions may not be well-represented, making recommendations for these items less accurate.

Limited Personalization: May not capture the full complexity of user preferences as well as user-based methods.

**Key Differences**

Focus:

User-Based: Focuses on finding similar users to make recommendations.

Item-Based: Focuses on finding similar items to make recommendations.

Similarity Calculation:

User-Based: Similarity is calculated between users based on their ratings or interactions.

Item-Based: Similarity is calculated between items based on user ratings or interactions.

Scalability:

User-Based: Can be less scalable due to the large number of users.

Item-Based: Typically more scalable due to fewer items compared to users.

Handling New Items:

User-Based: Can recommend a new item as soon as some similar users have rated it, but not before it has any interactions.

Item-Based: Can struggle with new items that do not have enough user interactions.


2) What is collaborative filtering, and how does it work?

Ans:

Collaborative filtering is a popular technique used in recommendation systems to predict a user’s preferences from the collective preferences and behaviors of other users. It works on the assumption that users who had similar preferences or behaviors in the past are likely to have similar preferences in the future. The technique does not rely on content-based information but rather on the patterns of user interactions or ratings. Collaborative filtering can be broadly categorized into two main types: user-based and item-based.

How Collaborative Filtering Works:

1) Data Collection:

User-Item Interactions: Collaborative filtering requires a dataset that includes user interactions with items, such as ratings, clicks, purchases, or other forms of feedback.

2) Similarity Calculation:

User-Based Collaborative Filtering: Computes similarity between users based on their interaction patterns. For example, if two users have rated a set of movies similarly, they are considered similar.

Item-Based Collaborative Filtering: Computes similarity between items based on the users who have interacted with them. For example, if two movies are liked by the same users, they are considered similar.

3) Recommendation Generation:

User-Based: For a given user, find other users who have similar preferences and recommend items that these similar users have liked but the target user has not yet interacted with.

Item-Based: For a given item that the user has interacted with, find similar items and recommend those that the user has not yet interacted with.


Summary:

Collaborative filtering is a powerful and widely used technique for generating recommendations based on the behaviors and preferences of users. By leveraging the wisdom of the crowd, it can provide personalized recommendations even without detailed knowledge of the content. However, it does face challenges like the cold start problem and scalability, which can be mitigated through hybrid approaches and advanced techniques such as matrix factorization or deep learning.