# Recommendation System

Data Description:

anime_id: Unique ID of each anime.
name: Anime title.
type: Anime broadcast type, such as TV, OVA, etc.
genre: Anime genre(s), given as a comma-separated list.
episodes: The number of episodes of each anime.
rating: The average user rating for each anime (out of 10).
members: Number of community members for each anime.

Objective:

The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset.

Dataset:

Use the Anime Dataset, which contains information about various anime, including their titles, genres, number of episodes, and user ratings.

Tasks:

Data Preprocessing:

Load the dataset into a suitable data structure (e.g., pandas DataFrame).
Handle missing values, if any.
Explore the dataset to understand its structure and attributes.

Feature Extraction:

Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
Convert categorical features into numerical representations if necessary.
Normalize numerical features if required.

Recommendation System:

Design a function to recommend anime based on cosine similarity.
Given a target anime, recommend a list of similar anime based on cosine similarity scores.
Experiment with different threshold values for similarity scores to adjust the recommendation list size.

Evaluation:

Split the dataset into training and testing sets.
Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
Analyze the performance of the recommendation system and identify areas of improvement.

Interview Questions:
1. Can you explain the difference between user-based and item-based collaborative filtering?
2. What is collaborative filtering, and how does it work?

In [1]:
# importing basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# loading data
data = pd.read_csv('anime.csv', encoding='latin1')

# Data Exploration

In [3]:
data.shape

(12294, 7)

In [4]:
data.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,GintamaÂ°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [5]:
data.tail()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12290,5543,Under World,Hentai,OVA,1,4.28,183
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175
12293,26081,Yasuji no Pornorama: Yacchimae!!,Hentai,Movie,1,5.46,142


# Quick Data Check

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


# Statistical summary

In [7]:
data.describe()

Unnamed: 0,anime_id,rating,members
count,12294.0,12064.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.026746,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.88,225.0
50%,10260.5,6.57,1550.0
75%,24794.5,7.18,9437.0
max,34527.0,10.0,1013917.0


# Data Preprocessing

### Identifying Duplicates

In [8]:
data.duplicated().sum()

0

There are no duplicate rows in the dataset.

### Identifying Missing Values

In [14]:
data.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [16]:
data['genre'].unique()  # Explore unique genres

array(['Drama, Romance, School, Supernatural',
       'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen',
       'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen',
       ..., 'Hentai, Sports', 'Drama, Romance, School, Yuri',
       'Hentai, Slice of Life'], dtype=object)

# Feature Extraction:

### Convert categorical features: For genres, use one-hot encoding to represent each genre as a binary vector.

In [19]:
data['genre'] = data['genre'].str.split(', ')  # Convert string genres to lists

In [20]:
genres_dummies = data['genre'].str.join('|').str.get_dummies()  # One-hot encode

In [21]:
data = pd.concat([data, genres_dummies], axis=1)
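
To make the encoding step concrete, here is a minimal sketch on a toy frame (the titles `A`, `B`, `C` are hypothetical, not taken from the dataset). It assumes the genre column still holds the original comma-separated strings and uses `Series.str.get_dummies` with `sep=', '`, which splits and encodes in one call. Rows with a missing genre come out as all zeros.

```python
import pandas as pd

# Toy data (hypothetical titles), mirroring the comma-separated 'genre' column.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Comedy", "Comedy", None],
})

# str.get_dummies splits each string on the separator and builds
# one 0/1 column per distinct genre; missing genres become all-zero rows.
toy_dummies = toy["genre"].str.get_dummies(sep=", ")
print(toy_dummies)
```

Each anime is now a binary vector over the genre vocabulary, which is the representation cosine similarity operates on later.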

### Normalize numerical features: Scale features like the number of episodes or ratings if they vary widely.

### Convert 'Unknown' values to NaN:
You can replace all occurrences of 'Unknown' with NaN (Not a Number) using pandas, which allows you to handle missing values more easily.

In [24]:
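# Replace 'Unknown' with NaN. Assign the result back instead of calling
# replace(..., inplace=True) on a column, which newer pandas versions
# warn about and which may not modify the DataFrame.
data['episodes'] = data['episodes'].replace('Unknown', np.nan)
# 'rating' is already float64 here, so this line is only a safeguard.
data['rating'] = data['rating'].replace('Unknown', np.nan)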


### Convert columns to numeric:
Once the unknown values are replaced, you can convert these columns to a numeric type, which will also handle any remaining issues of non-numeric data.

In [25]:
data['episodes'] = pd.to_numeric(data['episodes'], errors='coerce')
data['rating'] = pd.to_numeric(data['rating'], errors='coerce')
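
As a small illustration of what `errors='coerce'` does, the sketch below parses a toy Series (the values are made up for the example). Anything that cannot be parsed as a number, including the literal string `'Unknown'`, becomes NaN, so the conversion also works without a prior replacement step.

```python
import pandas as pd

# Toy episode counts stored as strings, with one placeholder and one missing value.
raw = pd.Series(["12", "Unknown", "1", None])

# Unparseable entries are turned into NaN instead of raising an error.
parsed = pd.to_numeric(raw, errors="coerce")
n_missing = int(parsed.isna().sum())
print(parsed)
print("missing after conversion:", n_missing)
```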


### Handle missing values:
Decide how you want to handle the missing values. You can either fill them with a meaningful value (such as the median or mean), or drop the rows where important values like ratings are missing.

In [26]:
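# Option 1: Fill missing values with the column median
data['episodes'] = data['episodes'].fillna(data['episodes'].median())
data['rating'] = data['rating'].fillna(data['rating'].median())

# Option 2 (alternative to Option 1): drop rows with missing values instead.
# After Option 1 no NaNs remain in these columns, so this would be a no-op here.
# Resetting the index keeps row labels aligned with row positions.
# data = data.dropna(subset=['episodes', 'rating']).reset_index(drop=True)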


### Continue with preprocessing:
Now that you have clean numerical data, you can proceed with normalization and further processing.

In [27]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data[['rating', 'episodes']] = scaler.fit_transform(data[['rating', 'episodes']])
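
For reference, min-max scaling maps each column to the range [0, 1] via `(x - min) / (max - min)`. The sketch below checks that formula against `MinMaxScaler` on a toy column (the numbers are illustrative only).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column of episode counts.
values = np.array([[1.0], [13.0], [25.0]])

# Scaling done by scikit-learn.
scaled = MinMaxScaler().fit_transform(values)

# The same scaling written out by hand, column by column.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(scaled.ravel())
```

Because the genre columns are already 0/1, scaling `rating` and `episodes` to the same range keeps any single numeric column from dominating the similarity.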


# Recommendation System:

### Cosine similarity: To recommend anime based on cosine similarity, create a feature matrix that includes both genre and numerical features.

In [32]:
from sklearn.metrics.pairwise import cosine_similarity

# Define feature matrix (genre + rating + episodes, etc.)
feature_matrix = data[genres_dummies.columns.tolist() + ['rating', 'episodes']].values

# Compute cosine similarity
similarity_matrix = cosine_similarity(feature_matrix)
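
Cosine similarity between two vectors is their dot product divided by the product of their norms. The following sketch computes it by hand for two toy vectors and compares the result with `cosine_similarity` (the vectors are illustrative, not rows from the dataset).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two toy feature vectors, e.g. genre flags for two anime.
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])

# cos(a, b) = (a . b) / (||a|| * ||b||)
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine_similarity returns the full pairwise matrix; entry [0, 1] compares a with b.
from_sklearn = cosine_similarity([a, b])[0, 1]

print(manual, from_sklearn)
```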


### Recommendation function: Write a function that returns the most similar anime based on cosine similarity.

In [33]:
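def recommend_anime(anime_title, similarity_matrix, data, threshold=0.8):
    # The title column in this dataset is 'name', not 'title'
    matches = np.flatnonzero((data['name'] == anime_title).to_numpy())
    if len(matches) == 0:
        raise ValueError(f"Anime not found: {anime_title!r}")
    idx = matches[0]  # positional row, which matches the similarity matrix

    # Sort all anime by similarity to the target, highest first
    sim_scores = sorted(enumerate(similarity_matrix[idx]), key=lambda x: x[1], reverse=True)

    # Keep scores at or above the threshold, excluding the target itself
    anime_indices = [i for i, score in sim_scores if i != idx and score >= threshold]
    return data.iloc[anime_indices]['name'].tolist()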
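
The task list asks for experiments with different thresholds. The sketch below shows the effect on a hypothetical four-title catalogue. The helper `similar_titles` and the toy data are assumptions made for illustration; they are not part of the notebook's code.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue: two genre flags plus an already scaled rating.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "Action": [1, 1, 0, 0],
    "Romance": [0, 0, 1, 1],
    "rating": [0.9, 0.8, 0.7, 0.2],
})
sim = cosine_similarity(toy[["Action", "Romance", "rating"]].to_numpy())

def similar_titles(title, threshold):
    """Titles whose similarity to `title` is at least `threshold`, best first."""
    pos = int(np.flatnonzero(toy["name"].to_numpy() == title)[0])
    order = np.argsort(-sim[pos])  # positions sorted by descending similarity
    return [toy["name"].iloc[j] for j in order
            if j != pos and sim[pos, j] >= threshold]

# Number of recommendations for "A" at several thresholds.
sizes = {t: len(similar_titles("A", t)) for t in (0.1, 0.3, 0.9)}
print(sizes)
```

Raising the threshold shortens the list. On the real dataset a fixed top-N cut-off may be easier to tune than a threshold, since the number of titles above any given score varies a lot between anime.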


# Evaluation:

### Train-test split: Since it's a recommendation system, you can evaluate by holding out some user ratings or using cross-validation.

In [34]:
from sklearn.model_selection import train_test_split
# Fix the seed so the split is reproducible between runs
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)


### Evaluation metrics:

###### Precision: The fraction of recommended anime that are relevant.
###### Recall: The fraction of all relevant anime in the dataset that appear in the recommendation list.
###### F1-Score: The harmonic mean of precision and recall.
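
A minimal sketch of these three metrics for a single recommendation list follows. The function name `precision_recall_f1` and the example titles are hypothetical. The dataset has no per-user relevance labels, so what counts as "relevant" has to be defined separately (for example, titles sharing a genre with the target, which is an assumption rather than something the dataset provides).

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall and F1 for one recommendation list.

    recommended: iterable of recommended titles
    relevant:    iterable of titles considered relevant (ground truth)
    """
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)  # relevant items that were recommended

    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy example: 4 recommendations, 3 relevant titles, 2 of them recommended.
p, r, f1 = precision_recall_f1(["A", "B", "C", "D"], ["A", "B", "E"])
print(p, r, f1)
```

Averaging these values over many target anime gives one summary score for each threshold setting.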