# Anime Recommender system

<tt>Dated: 26-09-2022</tt>

<tt>
    <ul>
        <li><b>Prepared by: Aditya Kahol </b></li>
        <li><tt>Dataset: <a href = "https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews?resource=download">Kaggle - animes.csv</a></tt></li>
    </ul>
</tt>

<p>This is a mini project built to understand a use case of the <i>k-nearest neighbours classifier</i>.</p>
<p>It surprised me that such a simple classifier can be used to build a recommender system. Without further ado, let's begin writing the code.</p>
<p>A few references for the reader to follow:</p>
<tt>
<ul>
        <li>Reference 1: <a href = "https://www.youtube.com/watch?v=ngLyX54e1LU&list=PLqnslRFeH2Upcrywf-u2etjdxxkL8nl7E&index=1">Youtube</a> </li>
        <li>Reference 2: <a href = "https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761">Medium</a> </li>
</ul>
</tt>

<b>The notebook is partitioned into the following sections</b>
<ul>
    <li>Reading the data</li>
    <li>Preprocessing and feature engineering</li>
    <li>Building the k-nn classifier</li>
    <li>Results and Conclusion</li>
</ul>

### Read and understand the data

In [1]:
#importing necessary modules first.
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

In [2]:
#read the data
anime_df = pd.read_csv("animes.csv")

In [3]:
anime_df.shape

(19311, 12)

In [4]:
anime_df.head(5)

| | uid | title | synopsis | genre | aired | episodes | members | popularity | ranked | score | img_url | link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28891 | Haikyuu!! Second Season | Following their participation at the Inter-Hig... | ['Comedy', 'Sports', 'Drama', 'School', 'Shoun... | Oct 4, 2015 to Mar 27, 2016 | 25.0 | 489888 | 141 | 25.0 | 8.82 | https://cdn.myanimelist.net/images/anime/9/766... | https://myanimelist.net/anime/28891/Haikyuu_Se... |
| 1 | 23273 | Shigatsu wa Kimi no Uso | Music accompanies the path of the human metron... | ['Drama', 'Music', 'Romance', 'School', 'Shoun... | Oct 10, 2014 to Mar 20, 2015 | 22.0 | 995473 | 28 | 24.0 | 8.83 | https://cdn.myanimelist.net/images/anime/3/671... | https://myanimelist.net/anime/23273/Shigatsu_w... |
| 2 | 34599 | Made in Abyss | The Abyss—a gaping chasm stretching down into ... | ['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F... | Jul 7, 2017 to Sep 29, 2017 | 13.0 | 581663 | 98 | 23.0 | 8.83 | https://cdn.myanimelist.net/images/anime/6/867... | https://myanimelist.net/anime/34599/Made_in_Abyss |
| 3 | 5114 | Fullmetal Alchemist: Brotherhood | "In order for something to be obtained, someth... | ['Action', 'Military', 'Adventure', 'Comedy', ... | Apr 5, 2009 to Jul 4, 2010 | 64.0 | 1615084 | 4 | 1.0 | 9.23 | https://cdn.myanimelist.net/images/anime/1223/... | https://myanimelist.net/anime/5114/Fullmetal_A... |
| 4 | 31758 | Kizumonogatari III: Reiketsu-hen | After helping revive the legendary vampire Kis... | ['Action', 'Mystery', 'Supernatural', 'Vampire'] | Jan 6, 2017 | 1.0 | 214621 | 502 | 22.0 | 8.83 | https://cdn.myanimelist.net/images/anime/3/815... | https://myanimelist.net/anime/31758/Kizumonoga... |


As can be seen, the dataset contains features of mixed types. A k-NN classifier works best with numeric data, so we will have to remove some of the columns.

Columns such as `img_url`, `link` and `uid` are not important for a recommender system. `synopsis`, `aired` and `members` could be useful, but we will remove them as well and work only with:
    <ul>
        <li>`title`</li>
        <li>`genre`</li>
        <li>`episodes`</li>
        <li>`popularity`</li>
        <li>`ranked`</li>
        <li>`score`</li>
     </ul>

### Preprocess the data for the task at hand

In [5]:
anime_df.columns

Index(['uid', 'title', 'synopsis', 'genre', 'aired', 'episodes', 'members',
       'popularity', 'ranked', 'score', 'img_url', 'link'],
      dtype='object')

In [6]:
anime_df = anime_df[['title',
                      'genre',
                      'episodes',
                      'popularity',
                      'ranked',
                      'score']]

In [7]:
anime_df.head(2)

| | title | genre | episodes | popularity | ranked | score |
|---|---|---|---|---|---|---|
| 0 | Haikyuu!! Second Season | ['Comedy', 'Sports', 'Drama', 'School', 'Shoun... | 25.0 | 141 | 25.0 | 8.82 |
| 1 | Shigatsu wa Kimi no Uso | ['Drama', 'Music', 'Romance', 'School', 'Shoun... | 22.0 | 28 | 24.0 | 8.83 |


In [8]:
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19311 entries, 0 to 19310
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   title       19311 non-null  object 
 1   genre       19311 non-null  object 
 2   episodes    18605 non-null  float64
 3   popularity  19311 non-null  int64  
 4   ranked      16099 non-null  float64
 5   score       18732 non-null  float64
dtypes: float64(3), int64(1), object(2)
memory usage: 905.3+ KB


As can be seen, there are a few null entries in the columns `episodes`, `ranked` and `score`. Since this is just a sample project, we can simply remove those rows, which still leaves 15,875 of the 19,311 rows.

In [9]:
#removing rows with null entries.
anime_df = anime_df.dropna(axis = 0, how = 'any').reset_index(drop = True)
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15875 entries, 0 to 15874
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   title       15875 non-null  object 
 1   genre       15875 non-null  object 
 2   episodes    15875 non-null  float64
 3   popularity  15875 non-null  int64  
 4   ranked      15875 non-null  float64
 5   score       15875 non-null  float64
dtypes: float64(3), int64(1), object(2)
memory usage: 744.3+ KB


In [10]:
anime_df.shape

(15875, 6)

In [11]:
anime_df.head()

| | title | genre | episodes | popularity | ranked | score |
|---|---|---|---|---|---|---|
| 0 | Haikyuu!! Second Season | ['Comedy', 'Sports', 'Drama', 'School', 'Shoun... | 25.0 | 141 | 25.0 | 8.82 |
| 1 | Shigatsu wa Kimi no Uso | ['Drama', 'Music', 'Romance', 'School', 'Shoun... | 22.0 | 28 | 24.0 | 8.83 |
| 2 | Made in Abyss | ['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F... | 13.0 | 98 | 23.0 | 8.83 |
| 3 | Fullmetal Alchemist: Brotherhood | ['Action', 'Military', 'Adventure', 'Comedy', ... | 64.0 | 4 | 1.0 | 9.23 |
| 4 | Kizumonogatari III: Reiketsu-hen | ['Action', 'Mystery', 'Supernatural', 'Vampire'] | 1.0 | 502 | 22.0 | 8.83 |


### Feature engineering step

- Working with the `genre` column in its present form is difficult, because each entry is a list of genres stored as a single string.
- Solution: One-hot encoding for each genre type.
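As an illustration of the idea, here is a minimal sketch of one-hot encoding a list-like `genre` column on a tiny made-up dataframe. The names `toy_df`, `genre_lists` and `one_hot` are hypothetical, and this pandas-based approach is only one possible way to do it, not necessarily the one used in the cells below.

```python
import ast

import pandas as pd

# Hypothetical toy data; each genre entry is a list stored as a string.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["['Comedy', 'Sports']", "['Drama']", "[]"],
})

# Turn the string "['Comedy', 'Sports']" into a real Python list.
genre_lists = toy_df["genre"].apply(ast.literal_eval)

# One row per (anime, genre) pair, then one indicator column per genre,
# collapsed back to one row per anime.
one_hot = (
    pd.get_dummies(genre_lists.explode())
    .groupby(level=0)
    .max()
    .astype(int)
)

encoded = toy_df.drop(columns="genre").join(one_hot)
print(encoded)
```

Multiplying `one_hot` by a constant (for example 20) would give each genre a larger weight in a distance computation.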

In [12]:
#Selecting each genre from the 'genre' category column
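def get_genre(df):
    """
        Collect the unique genre names, in order of first appearance.
        This method is specifically defined for anime_df, whose 'genre'
        column stores each list as a string, e.g. "['Comedy', 'Sports']".
    """
    genres = list()
    for i in range(df.shape[0]):
        #make a list out of a string of list.
        for genre in df['genre'][i].strip('[]').split(', '):
            if genre == '':
                continue #skip rows with an empty genre list.
            if genre[1:-1] not in genres:
                genres.append(genre[1:-1]) #strip the surrounding quotes.
    return genres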

In [13]:
#check
genres = get_genre(anime_df)
print(f"Genres: {genres}")

Genres: ['Comedy', 'Sports', 'Drama', 'School', 'Shounen', 'Music', 'Romance', 'Sci-Fi', 'Adventure', 'Mystery', 'Fantasy', 'Action', 'Military', 'Magic', 'Supernatural', 'Vampire', 'Slice of Life', 'Demons', 'Historical', 'Super Power', 'Mecha', 'Parody', 'Samurai', 'Seinen', 'Police', 'Psychological', 'Josei', 'Space', 'Kids', 'Shoujo Ai', 'Ecchi', 'Shoujo', 'Horror', 'Shounen Ai', 'Cars', 'Martial Arts', 'Game', 'Thriller', 'Dementia', 'Harem']


In [14]:
#add columns for each genre
for genre in genres:
    anime_df[genre] = 0

In [15]:
anime_df.head(2)

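| | title | genre | episodes | popularity | ranked | score | Comedy | Sports | Drama | School | ... | Ecchi | Shoujo | Horror | Shounen Ai | Cars | Martial Arts | Game | Thriller | Dementia | Harem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Haikyuu!! Second Season | ['Comedy', 'Sports', 'Drama', 'School', 'Shoun... | 25.0 | 141 | 25.0 | 8.82 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Shigatsu wa Kimi no Uso | ['Drama', 'Music', 'Romance', 'School', 'Shoun... | 22.0 | 28 | 24.0 | 8.83 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |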


In [16]:
#one-hot encoding, with a weight of 20 instead of 1.
for i in range(anime_df.shape[0]):
    for genre in anime_df['genre'][i].strip('[]').split(', '):
        if genre == '':
            continue
        genre = genre[1:-1]
        anime_df.loc[i,genre] = 20
        #giving a weight of 20 for each genre present, so that genres count for more in the distance.
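To see why a genre weight larger than 1 matters for a distance-based recommender, here is a small sketch with made-up rows. The feature order (score, Action, Comedy) and the helper names `euclidean` and `make_rows` are assumptions for illustration only.

```python
import numpy as np

def euclidean(a, b):
    # Square each coordinate difference, sum, then take the square root.
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

def make_rows(weight):
    # Hypothetical feature order: score, Action, Comedy
    query = [8.0, weight, 0]
    same_genre = [6.0, weight, 0]    # same genre, score differs by 2
    other_genre = [8.0, 0, weight]   # same score, different genre
    return query, same_genre, other_genre

for weight in (1, 20):
    query, same_genre, other_genre = make_rows(weight)
    print(weight,
          euclidean(query, same_genre),
          euclidean(query, other_genre))
```

With a weight of 1, the anime from a different genre ends up closer to the query than the one sharing its genre. With a weight of 20, the genre match wins.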

In [17]:
anime_df.head(2)

| | title | genre | episodes | popularity | ranked | score | Comedy | Sports | Drama | School | ... | Ecchi | Shoujo | Horror | Shounen Ai | Cars | Martial Arts | Game | Thriller | Dementia | Harem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Haikyuu!! Second Season | ['Comedy', 'Sports', 'Drama', 'School', 'Shoun... | 25.0 | 141 | 25.0 | 8.82 | 20 | 20 | 20 | 20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Shigatsu wa Kimi no Uso | ['Drama', 'Music', 'Romance', 'School', 'Shoun... | 22.0 | 28 | 24.0 | 8.83 | 0 | 0 | 20 | 20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [18]:
#remove genre column
anime_df.drop('genre', axis = 1, inplace = True)

In [19]:
#have a look at the summary table for anime_df now.
anime_df.describe()

| | episodes | popularity | ranked | score | Comedy | Sports | Drama | School | Shounen | Music | ... | Ecchi | Shoujo | Horror | Shounen Ai | Cars | Martial Arts | Game | Thriller | Dementia | Harem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | ... | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 |
| mean | 13.052976 | 7773.616756 | 6816.238929 | 6.483913 | 7.627087 | 0.952441 | 3.504882 | 2.088819 | 2.80063 | 2.391181 | ... | 0.954961 | 0.967559 | 0.546772 | 0.122205 | 0.167559 | 0.486299 | 0.462362 | 0.186457 | 0.519055 | 0.435906 |
| std | 51.737151 | 4978.638763 | 4373.450523 | 1.040107 | 9.714691 | 4.259439 | 7.603755 | 6.116827 | 6.940613 | 6.489108 | ... | 4.264787 | 4.291407 | 3.261463 | 1.558626 | 1.822996 | 3.0806 | 3.005667 | 1.922134 | 3.179988 | 2.920382 |
| min | 1.0 | 1.0 | 1.0 | 1.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 3118.0 | 2856.5 | 5.79 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2.0 | 7803.0 | 6899.0 | 6.46 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 12.0 | 12196.5 | 10503.5 | 7.26 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 3057.0 | 16320.0 | 14675.0 | 9.23 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | ... | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 |


As can be seen, the mean and std of the `popularity` and `ranked` columns are far larger than those of the other features. We need to rescale them to a smaller range; otherwise these two columns will dominate the distance and the recommendations will be poor!

In [20]:
#importing MinMaxScaler from sklearn.
from sklearn.preprocessing import MinMaxScaler

In [21]:
scaler = MinMaxScaler(feature_range = (0,10))

scaler.fit(anime_df[[
    'popularity','ranked'
]])

anime_df[['popularity','ranked']] = scaler.transform(anime_df[['popularity','ranked']])
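For intuition, min-max scaling to the range (0, 10) maps the column minimum to 0 and the column maximum to 10, linearly in between. The sketch below writes that formula out by hand; `min_max_scale` is a hypothetical helper, and the minimum and maximum values are taken from the `describe()` table above.

```python
import numpy as np

def min_max_scale(x, x_min, x_max, low=0.0, high=10.0):
    # Linear map: x_min -> low, x_max -> high.
    x = np.asarray(x, dtype=float)
    return low + (x - x_min) * (high - low) / (x_max - x_min)

# popularity ranges from 1 to 16320, ranked from 1 to 14675.
print(min_max_scale(574, 1, 16320))    # roughly 0.3511
print(min_max_scale(2556, 1, 14675))   # roughly 1.7412
```

The inputs 574 and 2556 are the `popularity` and `ranked` values of the query used in the Results section, so the printed numbers can be compared with the scaled query vector there.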

In [22]:
anime_df.describe()

| | episodes | popularity | ranked | score | Comedy | Sports | Drama | School | Shounen | Music | ... | Ecchi | Shoujo | Horror | Shounen Ai | Cars | Martial Arts | Game | Thriller | Dementia | Harem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | ... | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 | 15875.0 |
| mean | 13.052976 | 4.762925 | 4.644432 | 6.483913 | 7.627087 | 0.952441 | 3.504882 | 2.088819 | 2.80063 | 2.391181 | ... | 0.954961 | 0.967559 | 0.546772 | 0.122205 | 0.167559 | 0.486299 | 0.462362 | 0.186457 | 0.519055 | 0.435906 |
| std | 51.737151 | 3.050823 | 2.980408 | 1.040107 | 9.714691 | 4.259439 | 7.603755 | 6.116827 | 6.940613 | 6.489108 | ... | 4.264787 | 4.291407 | 3.261463 | 1.558626 | 1.822996 | 3.0806 | 3.005667 | 1.922134 | 3.179988 | 2.920382 |
| min | 1.0 | 0.0 | 0.0 | 1.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 1.910044 | 1.945959 | 5.79 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2.0 | 4.78093 | 4.700831 | 6.46 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 12.0 | 7.473191 | 7.157217 | 7.26 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 3057.0 | 10.0 | 10.0 | 9.23 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | ... | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 |


In [23]:
anime_df.head(2)

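| | title | episodes | popularity | ranked | score | Comedy | Sports | Drama | School | Shounen | ... | Ecchi | Shoujo | Horror | Shounen Ai | Cars | Martial Arts | Game | Thriller | Dementia | Harem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Haikyuu!! Second Season | 25.0 | 0.08579 | 0.016355 | 8.82 | 20 | 20 | 20 | 20 | 20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Shigatsu wa Kimi no Uso | 22.0 | 0.016545 | 0.015674 | 8.83 | 0 | 0 | 20 | 20 | 20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |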


### Build Classifier

Now that each genre is a separate feature for our anime recommender system, we are ready to build the recommender using the k-nearest neighbours algorithm.

In [24]:
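def euclidean_distance(x1,x2):
    #cast to float so that object-dtype rows (from a mixed-type dataframe) work too.
    diff = np.asarray(x1, dtype = float) - np.asarray(x2, dtype = float)
    return np.sqrt(np.sum(diff**2)) #square each difference first, then sum.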
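The Euclidean distance between two feature vectors is the square root of the sum of squared coordinate differences. The position of the square matters: squaring the sum of the differences instead lets positive and negative differences cancel. A minimal sketch on two made-up points (the names `euclidean`, `a` and `b` are for illustration only):

```python
import numpy as np

def euclidean(a, b):
    # sqrt( sum_i (a_i - b_i)^2 )
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

a = [0.0, 3.0]
b = [4.0, 0.0]

# Differences are -4 and 3: squared and summed they give 25.
print(euclidean(a, b))  # 5.0

# Summing first gives -1, so squaring afterwards yields only 1.
print(float(np.sqrt(np.sum(np.subtract(a, b)) ** 2)))  # 1.0
```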

In [25]:
class KNN:
    def __init__(self, num_recommendations = 5):
        self.num_recommendations = num_recommendations
    
    def fit(self,X):
        """
            X is a dataframe
        """
        self.anime_bucket = X.values
        #note: anime_bucket is an array
    
    def predict(self, query_anime):
        x1 = query_anime[1:]
        #The anime title is not needed to calculate the distance.
        
        distances = self.eval_distances(x1)
        #calculate the distance of the query anime features with all the training features.
        #note: 'title' feature is not considered for distance evaluation.
        
        indices = np.argsort(distances)[:self.num_recommendations]
        #look for the indices of the most similar matches for the query anime
        
        animes_recommended = [self.anime_bucket[idx, 0] for idx in indices]
        #making a list of anime names from our anime bucket
        
        return animes_recommended
    
    def eval_distances(self,query):
        distances = []
        
        for feature in self.anime_bucket:
            x2 = feature[1:] #Since, first entry is the anime title.
            distances.append(euclidean_distance(query,x2))
        
        return distances
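The row-by-row loop in `eval_distances` can also be expressed with NumPy broadcasting, which computes all distances in one array operation. The following is a sketch under the assumption that the features are purely numeric (title column already removed); `top_k_nearest` is a hypothetical helper, not part of the class above.

```python
import numpy as np

def top_k_nearest(features, query, k=5):
    """Return the indices of the k rows closest to `query`, plus all distances."""
    features = np.asarray(features, dtype=float)
    query = np.asarray(query, dtype=float)
    # Broadcasting subtracts the query from every row at once.
    distances = np.sqrt(((features - query) ** 2).sum(axis=1))
    return np.argsort(distances)[:k], distances

features = [[0, 0], [1, 1], [5, 5], [0.5, 0]]
indices, distances = top_k_nearest(features, [0, 0], k=2)
print(indices)  # [0 3]
```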

In [26]:
def feature_builder(features_dict):
    """
        input: a dictionary object with following keys:
        0) title (string)
        1) episodes (number)
        2) popularity (number)
        3) ranked (number)
        4) score (number)
        5) genres (list)
    """
    L = []
    for feature in ['title','episodes','popularity','ranked','score']:
        L.append(features_dict[feature])
    
    P = scaler.transform(np.array(L[2:4]).reshape(1,-1))
    L[2],L[3] = P[0,0],P[0,1]
    
    for genre in genres:
        if genre in features_dict['genres']:
            L.append(20)
        else:
            L.append(0)
            
    return L

### Results

In [27]:
#making a query.
query_dict = dict(title = 'Baki',
                  episodes = 26,
                  popularity = 574,
                  ranked = 2556,
                  score = 7.28,
                  genres = ['Action','Martial Arts','Shounen']
                 )
query_anime = feature_builder(query_dict)
print(query_anime)

['Baki', 26, 0.3511244561554017, 1.7411748671118987, 7.28, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0]


In [28]:
#build knn class object
knn_ob = KNN(num_recommendations = 5)

In [29]:
#we will find recommendations from a random sample of 1000 rows of anime_df.
#note: the sample is random, so the recommendations below change from run to run.
df = anime_df.sample(1000)
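Since `sample` draws rows at random, passing a fixed `random_state` is one way to make the drawn subset repeatable. A small sketch on a made-up dataframe (`toy_df` and the seed 42 are arbitrary choices for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "title": [f"anime_{i}" for i in range(10)],
    "score": range(10),
})

# The same seed yields the same rows on every call.
first = toy_df.sample(3, random_state=42)
second = toy_df.sample(3, random_state=42)
print(first.equals(second))  # True
```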

In [30]:
knn_ob.fit(df)

In [31]:
knn_ob.predict(query_anime)

['Pokemon: Pikachu no Natsumatsuri',
 'Ryo',
 'Mobile Suit Gundam MS IGLOO: The Hidden One Year War',
 'Go! Princess Precure Movie: Go! Go!! Gouka 3-bondate!!!',
 'Kurogane no Linebarrels']

### Concluding remarks

- The recommendations are working well, for now :) :P
- This was just a personal project that showcases the use of a KNN classifier to build a recommendation system.
- It must be noted that every query is compared against every stored row, so this brute-force approach becomes slower as the number of data points increases; as it stands, it is not suited to industrial applications.
- Also, important features such as `synopsis` were removed. Including such information in our recommender system would likely make it even better.
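For larger datasets, one common option is to hand the neighbour search to a library. The sketch below uses scikit-learn's `NearestNeighbors` on random placeholder data; the array shape (1000 rows, 44 features) and the variable names are assumptions for illustration, and whether this is actually faster depends on the data and the chosen algorithm.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder feature matrix standing in for the numeric anime features.
rng = np.random.default_rng(0)
features = rng.random((1000, 44))

# Build the index once, then query it as often as needed.
index = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(features)

# kneighbors expects a 2-D array: one row per query.
query = features[:1]
distances, indices = index.kneighbors(query)

# The query is itself a stored row, so its nearest neighbour is row 0.
print(indices[0])
print(distances[0])
```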