# Recommendation System

Data Description:

anime_id: Unique ID of each anime.
name: Anime title.
type: Anime broadcast type, such as TV, OVA, etc.
genre: Anime genre.
episodes: The number of episodes of each anime.
rating: The average rating for each anime, compared to the number of users who gave ratings.
members: Number of community members for each anime.

Objective:

The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset.

Dataset:

Use the Anime Dataset, which contains information about various anime, including their titles, genres, number of episodes, and user ratings.

Tasks:

Data Preprocessing:

Load the dataset into a suitable data structure (e.g., pandas DataFrame).
Handle missing values, if any.
Explore the dataset to understand its structure and attributes.

Feature Extraction:

Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
Convert categorical features into numerical representations if necessary.
Normalize numerical features if required.

Recommendation System:

Design a function to recommend anime based on cosine similarity.
Given a target anime, recommend a list of similar anime based on cosine similarity scores.
Experiment with different threshold values for similarity scores to adjust the recommendation list size.

Evaluation:

Split the dataset into training and testing sets.
Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
Analyze the performance of the recommendation system and identify areas of improvement.

Interview Questions:
1. Can you explain the difference between user-based and item-based collaborative filtering?
2. What is collaborative filtering, and how does it work?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")


# Data Preprocessing:

Load the dataset into a suitable data structure (e.g., pandas DataFrame).
Handle missing values, if any.
Explore the dataset to understand its structure and attributes.


In [2]:
df=pd.read_csv("anime.csv")
df.head(5)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


In [4]:
print(df['genre'].isnull().sum())
print(df['type'].isnull().sum())
print(df['rating'].isnull().sum())
print(df['name'].isnull().sum())
print(df['episodes'].isnull().sum())
print(df['anime_id'].isnull().sum())
print(df['members'].isnull().sum())

62
25
230
0
0
0
0
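The per-column checks above can also be collapsed into a single summary table. The sketch below shows the pattern on a tiny made-up frame; `toy` and its values are assumptions for illustration, not rows from `anime.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the anime frame; the values are made up.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "genre": ["Action", None, "Drama", "Comedy"],
    "rating": [8.1, np.nan, np.nan, 6.0],
})

# One row per column: absolute number of nulls and their share in percent.
null_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(null_summary)
```

Applied to the real frame, the same two expressions would replace the seven separate `print` calls.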


In [5]:
#handling null values of the genre column
print((df['genre'].isnull().sum()/df['genre'].shape[0])*100)
#only about 0.5% of the rows have no genre, so they can be dropped


0.504311046038718


In [6]:

#handling null values of the type column
print((df['type'].isnull().sum()/df['type'].shape[0])*100)
#only about 0.2% of the rows have no type, so they can be dropped


0.20335122824141857


In [7]:

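#handling null values of the rating column
print((df['rating'].isnull().sum()/df['rating'].shape[0])*100)
#about 1.9% of the ratings are missing; these are imputed in the next cell instead of being dropped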


1.8708312998210508


In [8]:
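print(df['rating'].value_counts())
print("mode",df['rating'].mode())
print("mean",df['rating'].mean())
print("median",df['rating'].median())
# Work on a copy so the raw frame df keeps its missing values
df_cleaned=df.copy()
# The mode of rating is 6.0, so missing ratings are filled with it
df_cleaned['rating']=df_cleaned['rating'].fillna(6)
# Drop the few remaining rows that have a missing genre or type
df_cleaned.dropna(inplace=True)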

rating
6.00    141
7.00     99
6.50     90
6.25     84
5.00     76
       ... 
3.47      1
3.71      1
3.87      1
3.91      1
3.14      1
Name: count, Length: 598, dtype: int64
mode 0    6.0
Name: rating, dtype: float64
mean 6.473901690981432
median 6.57


In [9]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12210 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12210 non-null  int64  
 1   name      12210 non-null  object 
 2   genre     12210 non-null  object 
 3   type      12210 non-null  object 
 4   episodes  12210 non-null  object 
 5   rating    12210 non-null  float64
 6   members   12210 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 763.1+ KB


In [10]:
#missing ratings were filled with the mode (6.0) in the previous step; confirm that no nulls remain
print(df_cleaned['rating'].isnull().sum())
print(df_cleaned['episodes'].isnull().sum())
print(df_cleaned['members'].isnull().sum())
print(df_cleaned['anime_id'].isnull().sum())


0
0
0
0


# Feature Extraction:

Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
Convert categorical features into numerical representations if necessary.
Normalize numerical features if required.

In [11]:

#Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
#Features used here: genre, type, episodes, rating, members

In [12]:
df_cleaned.dtypes
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12210 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12210 non-null  int64  
 1   name      12210 non-null  object 
 2   genre     12210 non-null  object 
 3   type      12210 non-null  object 
 4   episodes  12210 non-null  object 
 5   rating    12210 non-null  float64
 6   members   12210 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 763.1+ KB


In [13]:
# value_counts must be called with parentheses; without them, print() only shows
# the bound method wrapped around the Series instead of the counts per title
print(df['name'].value_counts())

In [14]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Note: this cell starts again from the raw frame df, not from df_cleaned,
# so rows with a missing rating are still present at this point.

# Step 1: One-hot encode the categorical 'type' column
df = pd.get_dummies(df, columns=['type'], prefix='type')

# 'genre' holds several genres separated by ", ",
# so build one 0/1 indicator column per distinct genre
genre_dummies = df['genre'].str.get_dummies(sep=', ')
df = pd.concat([df, genre_dummies], axis=1)

# Drop the original 'genre' column
df = df.drop('genre', axis=1)

# Step 2: Convert 'episodes' to a number; non-numeric entries become NaN
df['episodes'] = pd.to_numeric(df['episodes'], errors='coerce')

# Fill the resulting missing episode counts with 0
df['episodes'] = df['episodes'].fillna(0)

# Step 3: Scale the numerical columns to the [0, 1] range.
# MinMaxScaler ignores NaN when fitting and keeps it when transforming,
# so missing ratings stay NaN here and are dropped in a later cell.
numerical_cols = ['episodes', 'rating', 'members']
scaler = MinMaxScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])


In [15]:
df.head()

Unnamed: 0,anime_id,name,episodes,rating,members,type_Movie,type_Music,type_ONA,type_OVA,type_Special,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,32281,Kimi no Na wa.,0.00055,0.92437,0.197872,True,False,False,False,False,...,0,0,0,0,0,1,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,0.035204,0.911164,0.78277,False,False,False,False,False,...,0,0,0,0,0,0,0,0,0,0
2,28977,Gintama°,0.028053,0.909964,0.112689,False,False,False,False,False,...,0,0,0,0,0,0,0,0,0,0
3,9253,Steins;Gate,0.013201,0.90036,0.664325,False,False,False,False,False,...,0,0,0,0,0,0,1,0,0,0
4,9969,Gintama&#039;,0.028053,0.89916,0.149186,False,False,False,False,False,...,0,0,0,0,0,0,0,0,0,0
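To make the two transformations in the previous cell easier to follow, here is a small sketch on made-up data. The frame `toy` and the series `episodes` are assumptions for illustration; the scaling line writes out by hand the formula that `MinMaxScaler` applies with its default range.

```python
import pandas as pd

# Hypothetical rows in the same shape as the 'genre' column.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Drama, Romance", "Action", "Action, Drama"],
})

# Split on ", " and create one 0/1 indicator column per distinct genre.
genre_dummies = toy["genre"].str.get_dummies(sep=", ")
print(genre_dummies)

# Min-max scaling by hand: (x - min) / (max - min) maps the column onto [0, 1].
episodes = pd.Series([1.0, 64.0, 24.0])
scaled = (episodes - episodes.min()) / (episodes.max() - episodes.min())
print(scaled)
```

Each anime ends up as a row of genre indicators plus scaled numeric columns, which is the layout shown in the `df.head()` output above.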


In [16]:
#Calculating cosine similarity between anime (each row of the frame is one anime, not a user)
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, correlation

In [21]:
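# 'name' is text and cannot enter a numeric distance computation, so drop it;
# then drop the rows that still contain NaN (the missing ratings)
df_cleaned = df.drop(columns=['name']).dropna()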

In [22]:
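# Cosine similarity = 1 - cosine distance, computed between every pair of rows (anime).
# Note: df_cleaned still contains the raw anime_id column, whose large values dominate
# each row vector; this is why almost every score below is close to 1.
user_sim = 1 - pairwise_distances(df_cleaned.values,metric='cosine')
user_sim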

array([[1.        , 0.99999983, 0.99999999, ..., 0.99999997, 0.99999997,
        1.        ],
       [0.99999983, 1.        , 0.99999984, ..., 0.9999998 , 0.9999998 ,
        0.99999982],
       [0.99999999, 0.99999984, 1.        , ..., 0.99999996, 0.99999997,
        0.99999999],
       ...,
       [0.99999997, 0.9999998 , 0.99999996, ..., 1.        , 1.        ,
        0.99999997],
       [0.99999997, 0.9999998 , 0.99999997, ..., 1.        , 1.        ,
        0.99999998],
       [1.        , 0.99999982, 0.99999999, ..., 0.99999997, 0.99999998,
        1.        ]])
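Nearly every entry of the matrix above is 0.9999 or higher. A likely reason is that `df_cleaned` still contains the unscaled `anime_id` column (it appears in the row lookups further down), and one large component dominates the cosine formula $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The sketch below illustrates the effect with two made-up feature vectors; `cosine_sim` and the vectors are assumptions for the example.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two hypothetical anime with no genre in common (features already in [0, 1]).
features_a = np.array([1.0, 0.0, 0.9])
features_b = np.array([0.0, 1.0, 0.1])

# The same vectors with a raw, unscaled ID prepended.
with_id_a = np.concatenate(([32281.0], features_a))
with_id_b = np.concatenate(([5114.0], features_b))

print(cosine_sim(features_a, features_b))  # driven by the features
print(cosine_sim(with_id_a, with_id_b))    # driven almost entirely by the IDs
```

Under this reading, computing the similarity on the feature columns only (for example after `df_cleaned.drop(columns=['anime_id'])`) should give scores that reflect genre, type, and rating instead of the identifier.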

In [23]:
#Store the results in a dataframe
user_sim_df = pd.DataFrame(user_sim)
user_sim_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12054,12055,12056,12057,12058,12059,12060,12061,12062,12063
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12059,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
12060,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
12061,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
12062,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [24]:
#Set the index and column names to the anime ids
user_sim_df.index = df_cleaned.anime_id.unique()
user_sim_df.columns = df_cleaned.anime_id.unique()

In [25]:
user_sim_df.shape[0]

12064

In [26]:
# Zero the diagonal so that an anime is not reported as its own most similar title
np.fill_diagonal(user_sim,0)
user_sim_df.iloc[0:10, 0:10]

Unnamed: 0,32281,5114,28977,9253,9969,32935,11061,820,15335,15417
32281,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0
5114,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0
28977,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0
9253,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.999996,1.0,1.0
9969,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.999996,1.0,1.0
32935,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.999996,1.0,1.0
11061,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.999996,1.0,1.0
820,0.999996,0.999996,0.999996,0.999996,0.999996,0.999996,0.999996,0.0,0.999996,0.999996
15335,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,0.0,1.0
15417,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,0.0


In [27]:
user_sim_df

Unnamed: 0,32281,5114,28977,9253,9969,32935,11061,820,15335,15417,...,29992,26031,10368,9352,5541,9316,5543,5621,6133,26081
32281,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5114,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
28977,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9253,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9969,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9316,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
5543,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0
5621,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
6133,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999996,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0


In [28]:
#Most similar anime for each title (first 40 rows)
user_sim_df.idxmax(axis=1)[0:40]

Unnamed: 0,0
32281,28725
5114,6702
28977,25313
9253,10348
9969,15417
32935,28891
11061,13271
820,1241
15335,21899
15417,28977
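The task list asks for a function that returns similar anime for a target title and for experiments with a similarity threshold. The following is a minimal sketch of such a function, not the notebook's own method: `recommend`, `features`, `top_n`, and `threshold` are hypothetical names, the catalogue is made up, and only genre indicators are used as features (no `anime_id`).

```python
import numpy as np
import pandas as pd

# Hypothetical catalogue: genre indicators only, indexed by title.
features = pd.DataFrame(
    {"Action": [1, 1, 0, 0], "Drama": [0, 1, 1, 1], "Romance": [0, 0, 1, 1]},
    index=["A", "B", "C", "D"],
)

def recommend(title, features, top_n=5, threshold=0.0):
    """Return up to top_n titles whose cosine similarity to `title` is >= threshold."""
    matrix = features.to_numpy(dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0                  # avoid dividing by zero for all-zero rows
    unit = matrix / norms[:, None]           # each row scaled to unit length
    target = unit[features.index.get_loc(title)]
    scores = pd.Series(unit @ target, index=features.index)
    scores = scores.drop(title)              # never recommend the query itself
    scores = scores[scores >= threshold]     # keep only sufficiently similar titles
    return scores.sort_values(ascending=False).head(top_n)

print(recommend("C", features, top_n=2, threshold=0.3))
```

Raising `threshold` shortens the list and lowering it lengthens the list, which is the experiment described in the task; `top_n` caps the length independently of the threshold.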


In [29]:
df_cleaned[(df_cleaned['anime_id']==32281) | (df_cleaned['anime_id']==28725)]

Unnamed: 0,anime_id,episodes,rating,members,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,32281,0.00055,0.92437,0.197872,True,False,False,False,False,False,...,0,0,0,0,0,1,0,0,0,0
208,28725,0.00055,0.798319,0.058829,True,False,False,False,False,False,...,0,0,0,0,0,0,0,0,0,0


In [31]:
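# 34096 does not occur in df_cleaned, so only 28977 is returned here;
# the idxmax output above lists 25313 as the closest match for 28977
df_cleaned[(df_cleaned['anime_id']==34096) | (df_cleaned['anime_id']==28977)]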

Unnamed: 0,anime_id,episodes,rating,members,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
2,28977,0.028053,0.909964,0.112689,False,False,False,False,False,True,...,0,0,0,0,0,0,0,0,0,0


In [32]:
df_cleaned[(df_cleaned['anime_id']==10348) | (df_cleaned['anime_id']==9253)]

Unnamed: 0,anime_id,episodes,rating,members,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
3,9253,0.013201,0.90036,0.664325,False,False,False,False,False,True,...,0,0,0,0,0,0,1,0,0,0
3581,10348,0.007151,0.632653,0.005558,False,False,False,False,False,True,...,0,0,0,0,0,0,0,0,0,0


In [33]:
df_cleaned[(df_cleaned['anime_id']==1) | (df_cleaned['anime_id']==15) | (df_cleaned['anime_id']==164)]

Unnamed: 0,anime_id,episodes,rating,members,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
22,1,0.014301,0.858343,0.480139,False,False,False,False,False,True,...,0,0,1,0,0,0,0,0,0,0
24,164,0.00055,0.857143,0.334892,True,False,False,False,False,False,...,0,0,0,0,0,0,0,0,0,0
433,15,0.079758,0.769508,0.082495,False,False,False,False,False,True,...,0,0,0,1,0,0,0,0,0,0


In [35]:
#Evaluation:

#Split the dataset into training and testing sets.
#Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
#Analyze the performance of the recommendation system and identify areas of improvement.

In [36]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score,recall_score,f1_score

In [38]:
x=df_cleaned.drop('anime_id',axis=1)
y=df_cleaned['anime_id']

In [39]:

#Splitting the data into training and test sets
train_x,test_x,train_y,test_y=train_test_split(x,y,test_size=0.3,random_state=0)

In [51]:
print(user_sim.shape)  # Check the size of user_sim
print(max(test_y.index))

(12064, 12064)
12293
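The two numbers above differ because `test_y.index` holds index labels from the original frame (up to 12293), while `user_sim` is addressed by row position (0 to 12063). A small sketch of translating labels into positions is given below; the frame and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose index has gaps after rows were dropped.
frame = pd.DataFrame({"anime_id": [10, 20, 30, 40]}, index=[0, 2, 5, 9])
test_labels = [5, 9]

# Translate index labels into 0-based row positions.
positions = frame.index.get_indexer(test_labels)
print(positions)
```

`Index.get_indexer` returns -1 for a label that is not present, so such entries can be filtered out before indexing into a similarity matrix.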


In [46]:
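# test_y.index holds index labels (up to 12293), while user_sim has 12064 rows,
# so keep only the labels that are also valid row positions
valid_indices = [i for i in test_y.index if i < user_sim.shape[0]]

# Predict 1 when the similarity to the first anime (column 0) is at least 0.8
y_pred = [1 if user_sim[i, 0] >= 0.8 else 0 for i in valid_indices]

# Every item is labelled as relevant (1); with no negative labels, the scores below
# only reflect how many similarities pass the threshold, not recommendation quality
y_true = [1] * len(valid_indices)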

In [47]:

precision=precision_score(y_true,y_pred,average='binary')
print('Precision score:- ',precision)

Precision score:-  1.0


In [48]:

recall=recall_score(y_true,y_pred,average='binary')
print('Recall score:- ',recall)

Recall score:-  1.0


In [49]:

f1=f1_score(y_true,y_pred,average='binary')
print('F1 score:- ',f1)

F1 score:-  1.0
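In the evaluation above, `y_true` contains only ones, so precision, recall, and F1 cannot drop below 1.0 as long as every similarity passes the threshold. For comparison, here is a sketch of how the three metrics behave once there is a ground truth that can disagree with the recommendations. The sets `recommended` and `relevant` are made up; the dataset used here has no user-level relevance labels, so what counts as relevant is an assumption to be defined.

```python
# Hypothetical example: what the system recommended vs. what is actually relevant.
recommended = {"A", "B", "C", "D"}
relevant = {"B", "D", "E"}

# Items that are both recommended and relevant.
true_positives = len(recommended & relevant)

precision = true_positives / len(recommended)  # share of recommendations that are relevant
recall = true_positives / len(relevant)        # share of relevant items that were recommended
f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0

print(precision, recall, f1)
```

One possible proxy for relevance with this dataset would be sharing at least one genre with the target anime, although that choice is itself an assumption and partly circular when genres are also used as features.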


In [None]:
#Interview Questions:

#1. Can you explain the difference between user-based and item-based collaborative filtering?

#User-based collaborative filtering finds users similar to the target user and recommends items that those users liked.
#It computes user-user similarities (for example with cosine similarity or Pearson correlation), selects the most similar users,
#and aggregates their preferences into personalized recommendations.
#Item-based collaborative filtering finds items similar to those the user has already interacted with and recommends them.
#It computes item-item similarities from ratings or co-occurrence, selects the items most similar to the user's history,
#and recommends those items to the user.

#2. What is collaborative filtering, and how does it work?

#Collaborative filtering is a recommendation technique that predicts a user's preferences from the preferences of other users
#or from the relationships between items. It relies on the idea that people who have agreed in the past will likely agree in the future.

#It analyzes patterns of user interactions, such as ratings, purchases, or likes,
#and identifies similarities either between users (user-based) or between items (item-based) to suggest relevant options.
#There are two main types:
#1)User-Based Collaborative Filtering
#2)Item-Based Collaborative Filtering
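The difference between the two types can be sketched on a small user-item rating matrix. Everything below is a made-up illustration: the `ratings` matrix, the helper `cosine_matrix`, and the variable names are assumptions, and the anime dataset used in this notebook does not contain per-user ratings.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

# User-based: compare rows, i.e. users by the items they rated.
user_similarity = cosine_matrix(ratings)

# Item-based: compare columns, i.e. items by the users who rated them.
item_similarity = cosine_matrix(ratings.T)

print(user_similarity.shape)  # (3, 3)
print(item_similarity.shape)  # (4, 4)
```

In this toy matrix, users 0 and 1 rate alike and items 0 and 1 are rated alike, so both views pick out a close pair; user-based filtering would then borrow ratings from the similar user, while item-based filtering would suggest the similar item.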

