**Anime Recommendation System: Content-Based Item-to-Item Filtering**

MSCI 623 Project Spring 2020

**Importing the Dataset & Libraries**

In [None]:
# Mount Google Drive in Colab so the dataset files can be read
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#Importing required libraries for this project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import seaborn as sns
from sklearn.preprocessing import MaxAbsScaler
from sklearn.neighbors import NearestNeighbors
%matplotlib inline

In [None]:
# Load our anime dataset, which includes 30 more anime than the original dataset. It is not preprocessed and contains NaN and "Unknown" values
dataset = pd.read_csv("/content/drive/My Drive/Anime_Recommend_System/anime_new.csv")

**Pre-Processing of Dataset**

In [None]:
# Check the first few rows of the dataset
dataset.head(3)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,7081,Danball Senki,"Action, Kids, Mecha",TV,44,7.07,6305
1,731,Interstellar 5555,"Adventure, Drama, Music, Sci-Fi",Music,1,8.31,52585
2,5876,Izumo,"Adventure, Historical, Fantasy",OVA,2,5.87,682


In [None]:
# View the full dataset to see its columns and what the data looks like
dataset

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,7081,Danball Senki,"Action, Kids, Mecha",TV,44,7.07,6305
1,731,Interstellar 5555,"Adventure, Drama, Music, Sci-Fi",Music,1,8.31,52585
2,5876,Izumo,"Adventure, Historical, Fantasy",OVA,2,5.87,682
3,2169,Ice,"Action , Military, Sci-Fi, Shoujo Ai",OVA,3,5.41,7198
4,37675,Overlord III,"Action, Magic, Fantasy, Game, Supernatural",TV,13,8.02,434458
...,...,...,...,...,...,...,...
12319,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12320,5543,Under World,Hentai,OVA,1,4.28,183
12321,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12322,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175


In [None]:
# Filter the rows whose episode count is not available (recorded as "Unknown")
filtered_data = dataset[dataset["episodes"]=="Unknown"]


In [None]:
# Looking at the anime with unknown episode counts, we see that quite a lot of them have missing data
filtered_data

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
104,21,One Piece,"Action, Adventure, Comedy, Drama, Fantasy, Sho...",TV,Unknown,8.58,504862
282,235,Detective Conan,"Adventure, Comedy, Mystery, Police, Shounen",TV,Unknown,8.25,114702
645,1735,Naruto: Shippuuden,"Action, Comedy, Martial Arts, Shounen, Super P...",TV,Unknown,7.94,533578
1021,966,Crayon Shin-chan,"Comedy, Ecchi, Kids, School, Shounen, Slice of...",TV,Unknown,7.73,26267
1051,33157,Tanaka-kun wa Itsumo Kedaruge Specials,"Comedy, School, Slice of Life",Special,Unknown,7.72,5400
...,...,...,...,...,...,...,...
12295,34361,Kyonyuu Reijou MC Gakuen,Hentai,OVA,Unknown,,205
12304,34492,Nuki Doki! Tenshi to Akuma no Sakusei Battle -...,Hentai,OVA,Unknown,,392
12310,34312,Saimin Class,Hentai,OVA,Unknown,,240
12312,34388,Shikkoku no Shaga The Animation,Hentai,OVA,Unknown,,195


In [None]:
# Import our scraped file, which has the updated ratings and episode counts that we scraped from the original source
new_ratings=pd.read_csv("/content/drive/My Drive/Anime_Recommend_System/anime_updated.csv")

In [None]:
# View the newly imported table, which is a replica of the old dataset but with new ratings and updated episode counts
new_ratings

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12290,5543,Under World,Hentai,OVA,1,4.28,183
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175


In [None]:
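# Now that we have updated ratings and episode counts from the scraped source website, we update our main dataset with the values from new_ratings
# Note: DataFrame.update aligns rows on the index label, not on anime_id, so each row of new_ratings overwrites the row of dataset that has the same index
dataset.update(new_ratings)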
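
Because `DataFrame.update` matches rows by index label, two tables that list the same anime in a different order are only updated title by title if both are indexed by `anime_id` first. The following is a minimal sketch of that idea on toy data; the frames `old` and `new` and all of their values are illustrative assumptions, not rows from the project files.

```python
import pandas as pd

# Toy stand-ins for `dataset` and `new_ratings` (all values are illustrative only).
old = pd.DataFrame({
    "anime_id": [21, 235, 731],
    "episodes": ["Unknown", "Unknown", "1"],
    "rating": [8.58, 8.25, 8.31],
})
new = pd.DataFrame({
    "anime_id": [731, 21],          # different order, and one id missing
    "episodes": ["1", "900"],
    "rating": [8.30, 8.60],
})

# Index both frames by anime_id so update() matches rows by id, not by position.
old_by_id = old.set_index("anime_id")
old_by_id.update(new.set_index("anime_id"))

# Bring anime_id back as an ordinary column.
merged = old_by_id.reset_index()
print(merged)
```

Rows whose id is absent from the second table (here id 235) keep their original values.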


In [None]:
# Even after scraping the data, there might still be some unknown episode counts or ratings, so we create another table to view them
new_filtered_data = dataset[dataset["episodes"]=="Unknown"]
new_filtered_data

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
9118,30151.0,Kamiusagi Rope: Warau Asa ni wa Fukuraitaru tt...,"Comedy, Slice of Life",TV,Unknown,6.25,143.0
9284,24775.0,Koishite!! Namashi-chan,"Kids, Slice of Life",TV,Unknown,6.14,112.0
9295,30129.0,Konna Ko Iru kana,Kids,TV,Unknown,4.0,38.0
9389,33782.0,Life!,"Comedy, Slice of Life",TV,Unknown,5.25,91.0
10760,31150.0,Xi Yang Yang Yu Hui Tai Lang,"Adventure, Comedy, Kids",TV,Unknown,5.87,72.0
10836,30119.0,Yowamushi Monsters,Kids,TV,Unknown,6.33,85.0
10936,33839.0,Alice in Deadly School,"Comedy, School, Shounen",TV,Unknown,,1648.0
10991,32455.0,Gekidol,Music,TV,Unknown,,586.0
10995,28613.0,Ginga Jinpuu Jinraiger,"Action, Adventure, Mecha",ONA,Unknown,,627.0
11031,34151.0,Landreaall,"Action, Adventure, Fantasy, Martial Arts, Romance",OVA,Unknown,,414.0


In [None]:
# A few types of anime have a predictable number of episodes, so we can impute their missing values
# Hentai (adult-rated) titles are assumed here to have 1 episode
dataset.loc[(dataset["genre"]=="Hentai") & (dataset["episodes"]=="Unknown"),"episodes"] = "1"
# OVAs are special releases, assumed here to have 1 episode
dataset.loc[(dataset["type"]=="OVA") & (dataset["episodes"]=="Unknown"),"episodes"] = "1"
# Movies count as 1 episode (i.e. movies are not split into episodes)
dataset.loc[(dataset["type"] == "Movie") & (dataset["episodes"] == "Unknown"),"episodes"] = "1"

In [None]:
#After our above values are updated, checking again for the unknown values
dataset[dataset["episodes"]=="Unknown"]

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
9118,30151.0,Kamiusagi Rope: Warau Asa ni wa Fukuraitaru tt...,"Comedy, Slice of Life",TV,Unknown,6.25,143.0
9284,24775.0,Koishite!! Namashi-chan,"Kids, Slice of Life",TV,Unknown,6.14,112.0
9295,30129.0,Konna Ko Iru kana,Kids,TV,Unknown,4.0,38.0
9389,33782.0,Life!,"Comedy, Slice of Life",TV,Unknown,5.25,91.0
10760,31150.0,Xi Yang Yang Yu Hui Tai Lang,"Adventure, Comedy, Kids",TV,Unknown,5.87,72.0
10836,30119.0,Yowamushi Monsters,Kids,TV,Unknown,6.33,85.0
10936,33839.0,Alice in Deadly School,"Comedy, School, Shounen",TV,Unknown,,1648.0
10991,32455.0,Gekidol,Music,TV,Unknown,,586.0
10995,28613.0,Ginga Jinpuu Jinraiger,"Action, Adventure, Mecha",ONA,Unknown,,627.0
11032,33735.0,Locker Room,Sports,ONA,Unknown,,162.0


In [None]:
# For the remaining missing values we cannot predict or assume the episode count or rating, so we remove these data points
dataset.drop(dataset.loc[dataset["episodes"]=="Unknown"].index, inplace=True)
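
As a side note, the episode counts are still stored as strings at this point. One possible alternative, sketched below on toy data, is to let `pd.to_numeric` with `errors="coerce"` turn every non-numeric entry such as "Unknown" into NaN, and then drop incomplete rows in one step. The frame `toy` and its values are assumptions made for illustration.

```python
import pandas as pd

# Toy frame mimicking the columns used above (values are illustrative only).
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "episodes": ["12", "Unknown", "1"],
    "rating": [7.5, 6.1, None],
})

# Non-numeric strings such as "Unknown" become NaN instead of raising an error.
toy["episodes"] = pd.to_numeric(toy["episodes"], errors="coerce")

# Drop rows missing either value, then store the episode count as an integer.
clean = toy.dropna(subset=["episodes", "rating"]).copy()
clean["episodes"] = clean["episodes"].astype(int)
print(clean)
```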


In [None]:
#converting rating column to float data type
dataset["rating"] = dataset["rating"].astype(float)

In [None]:
#dropping all null values in rating column
dataset.dropna(subset = ["rating"], inplace=True)
dataset

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281.0,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630.0
1,5114.0,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665.0
2,28977.0,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262.0
3,9253.0,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572.0
4,9969.0,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266.0
...,...,...,...,...,...,...,...
12319,9316.0,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211.0
12320,5543.0,Under World,Hentai,OVA,1,4.28,183.0
12321,5621.0,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219.0
12322,6133.0,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175.0


In [None]:
#Checking again the whole dataset if there are still any unknown values left or not
(dataset[dataset["episodes"]=="Unknown"]).count()

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

**Implementation of Content Based Recommendation System**

In [None]:
# Now we create dummy variables where each row represents an anime and each column represents a type of anime. If an anime belongs to a type, that column is set to 1,
# or else 0 (one-hot encoding)
pd.get_dummies(dataset[["type"]]).head()

Unnamed: 0,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV
0,1,0,0,0,0,0
1,0,0,0,0,0,1
2,0,0,0,0,0,1
3,0,0,0,0,0,1
4,0,0,0,0,0,1


In [None]:
#Converting members column as float datatype
dataset["members"] = dataset["members"].astype(float)

In [None]:
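# Feature selection: one-hot encoded genres and types, plus rating, members and episodes
# Note: the genre strings are separated by ", " (comma plus space), so splitting on "," alone yields both "Adventure" and " Adventure" columns (visible in the output below)
anime_features = pd.concat([dataset["genre"].str.get_dummies(sep=","),pd.get_dummies(dataset[["type"]]),dataset[["rating"]],dataset[["members"]],dataset["episodes"]],axis=1)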
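
The genre strings use ", " as a separator (and occasionally " , "), so splitting on a bare comma creates two columns for the same genre. A minimal sketch of one way to avoid this is to normalise the whitespace around the commas before encoding. The series `genres` below is toy data modelled on the rows shown earlier, and the variable names are illustrative.

```python
import pandas as pd

genres = pd.Series([
    "Action, Kids, Mecha",
    "Action , Military, Sci-Fi",   # stray space before the comma, as in the raw data
    "Kids",
])

# Splitting on "," alone keeps the surrounding spaces,
# so " Kids" and "Kids" end up as two separate columns.
naive = genres.str.get_dummies(sep=",")

# Remove whitespace around every comma and at both ends, then encode.
normalised = genres.str.replace(r"\s*,\s*", ",", regex=True).str.strip()
fixed = normalised.str.get_dummies(sep=",")

print(sorted(naive.columns))
print(sorted(fixed.columns))
```

With the normalised strings each genre maps to exactly one column, which also halves the weight that duplicated genres would otherwise carry in the distance computation.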

In [None]:
# Anime titles are Japanese and may contain many special characters, so we use a regex to replace every run of non-alphanumeric characters with a space
dataset["name"] = dataset["name"].map(lambda name:re.sub(r'[^A-Za-z0-9]+', " ", name))

In [None]:
#Viewing the animes_features table we created one step above
anime_features.head()

Unnamed: 0,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,Harem,Hentai,Historical,Horror,Josei,Kids,Magic,Martial Arts,Mecha,Military,Music,Mystery,Parody,Police,Psychological,Romance,Samurai,School,Sci-Fi,Seinen,Shoujo,Shoujo Ai,Shounen,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,...,Game.1,Harem.1,Hentai.1,Historical.1,Horror.1,Josei.1,Kids.1,Magic.1,Martial Arts.1,Mecha.1,Military.1,Music.1,Mystery.1,Parody.1,Police.1,Psychological.1,Romance.1,Samurai.1,School.1,Sci-Fi.1,Seinen.1,Shoujo.1,Shounen.1,Slice of Life.1,Space.1,Sports.1,Super Power.1,Supernatural.1,Thriller.1,Vampire.1,Yaoi,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,rating,members,episodes
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,9.37,200630.0,1
1,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,9.26,793665.0,64
2,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,9.25,114262.0,51
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,9.17,673572.0,24
4,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,9.16,151266.0,51


In [None]:
#Viewing all the columns of the above dataframe, to get glimpse of various genre of animes
anime_features.columns

Index([' Adventure', ' Cars', ' Comedy', ' Dementia', ' Demons', ' Drama',
       ' Ecchi', ' Fantasy', ' Game', ' Harem', ' Hentai', ' Historical',
       ' Horror', ' Josei', ' Kids', ' Magic', ' Martial Arts', ' Mecha',
       ' Military', ' Music', ' Mystery', ' Parody', ' Police',
       ' Psychological', ' Romance', ' Samurai', ' School', ' Sci-Fi',
       ' Seinen', ' Shoujo', ' Shoujo Ai', ' Shounen', ' Shounen Ai',
       ' Slice of Life', ' Space', ' Sports', ' Super Power', ' Supernatural',
       ' Thriller', ' Vampire', ' Yaoi', ' Yuri', 'Action', 'Adventure',
       'Cars', 'Comedy', 'Dementia', 'Demons', 'Drama', 'Ecchi', 'Fantasy',
       'Game', 'Harem', 'Hentai', 'Historical', 'Horror', 'Josei', 'Kids',
       'Magic', 'Martial Arts', 'Mecha', 'Military', 'Music', 'Mystery',
       'Parody', 'Police', 'Psychological', 'Romance', 'Samurai', 'School',
       'Sci-Fi', 'Seinen', 'Shoujo', 'Shounen', 'Slice of Life', 'Space',
       'Sports', 'Super Power', 'Supernatural'

In [None]:
# Scale each feature individually so that its maximal absolute value in the training set is 1.0. This does not shift/center the data,
# and thus does not destroy any sparsity.
max_abs_scaler = MaxAbsScaler()
anime_features = max_abs_scaler.fit_transform(anime_features)
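
To make the effect of `MaxAbsScaler` concrete, here is a small sketch on a toy matrix whose three columns stand for a one-hot flag, a rating and a member count. The numbers are made up for illustration. Each column is divided by its own largest absolute value, so zeros stay zero and the large member counts no longer dominate the distances.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Columns: a one-hot flag, a rating, a member count (toy values).
X = np.array([
    [1.0, 9.0, 200000.0],
    [0.0, 6.0,  50000.0],
    [1.0, 4.5, 800000.0],
])

# Each column is divided by its maximum absolute value (1.0, 9.0, 800000.0).
scaled = MaxAbsScaler().fit_transform(X)
print(scaled)
```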

**Applying KNN Machine Learning Algorithm**

In [None]:
# Fit a nearest-neighbours model on anime_features, then query the 5 nearest neighbours of every anime (the nearest neighbour of each anime is the anime itself)
nearest_neighbours = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(anime_features)
distances, indices = nearest_neighbours.kneighbors(anime_features)
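
The shape of what `kneighbors` returns is easier to see on a tiny example. The sketch below uses four made-up 2-D points; the names `toy_distances` and `toy_indices` are illustrative. Row `i` of each array describes point `i`: the first column is the point itself at distance 0, and the remaining columns are its neighbours in order of increasing distance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Two well-separated pairs of points (toy data).
X = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [5.0, 5.0],
    [5.0, 6.0],
])

nn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(X)
toy_distances, toy_indices = nn.kneighbors(X)

print(toy_indices)    # row positions of the neighbours, self first
print(toy_distances)  # Euclidean distances, 0.0 in the first column
```

Note that the returned indices are row positions in the fitted matrix, not DataFrame index labels.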

In [None]:
#Function to get index of anime by inputting anime name
def get_index_from_name(name):
    return dataset[dataset["name"]==name].index.tolist()[0]
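
The lookup above returns an index label, while the arrays produced by `kneighbors` are addressed by row position. The two agree only as long as no earlier row has been dropped. The toy sketch below (frame and names are illustrative assumptions) shows how they drift apart after a drop and how `Index.get_loc` converts a label into a position.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B", "C", "D"]})
toy = toy.drop(index=[1])           # remaining labels: 0, 2, 3

# Label of the row named "D", as the lookup function would return it.
label = toy[toy["name"] == "D"].index[0]

# Position of that row in the frame, which is what kneighbors() arrays use.
position = toy.index.get_loc(label)
print(label, position)
```

Calling `reset_index(drop=True)` after the last row-dropping step is another way to keep labels and positions identical.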

In [None]:
#converting and putting all anime names in a list
all_anime_names = list(dataset.name.values)

In [None]:
# calling the function to get index of anime
get_index_from_name("Hajime no Ippo")

20

In [None]:
# Print the distances from the anime at the index computed in the previous cell to its 5 nearest neighbours (the first entry is the anime itself)
distances[20]

array([0.        , 1.5285287 , 2.09990476, 2.53300217, 2.68717324])

In [None]:
# Print the row indices of those nearest neighbours
indices[20]

array([  20,  254,  318, 1189,  183])

**Getting Recommendations from the Model**

In [None]:
# Function which takes an anime name as input and prints the names of the anime most similar to it
def recommend_animes(query=None):
    if query:
        found_label = get_index_from_name(query)
        # kneighbors returns row positions, so translate the index label into a position
        found_pos = dataset.index.get_loc(found_label)
        print("Your Search:",query," | Its  Genre is :",dataset.loc[found_label]["genre"]," || Rating:",dataset.loc[found_label]["rating"])
        print("==================================")
        print("RECOMMENDATIONS--")
        print("==================================")
        # Skip the first neighbour, which is the queried anime itself
        for pos in indices[found_pos][1:]:
            row = dataset.iloc[pos]
            print(row["name"],"|| Genre :",row["genre"]," || Rating:",row["rating"])
    else:
        print("Please enter an anime present in the dataset to get recommendation")


In [None]:
#Inputting an anime into the function and getting the closely related animes
recommend_animes("Hajime no Ippo")

Your Search: Hajime no Ippo  | Its  Genre is : Comedy, Drama, Shounen, Sports  || Rating: 8.83
RECOMMENDATIONS--
Diamond no Ace || Genre : Comedy, School, Shounen, Sports  || Rating: 8.25
Hikaru no Go || Genre : Comedy, Game, Shounen, Supernatural  || Rating: 8.19
Marmalade Boy || Genre : Comedy, Drama, Romance, Shoujo  || Rating: 7.64
SKET Dance || Genre : Comedy, School, Shounen  || Rating: 8.36


**Trying subsets of features for KNN Model**

In [None]:
# What if we change our feature selection? Here we remove the members feature from our model.
# Feature selection
anime_features_2 = pd.concat([dataset["genre"].str.get_dummies(sep=","),pd.get_dummies(dataset[["type"]]),dataset[["rating"]],dataset["episodes"]],axis=1)
anime_features_2.head(4)
max_abs_scaler_2 = MaxAbsScaler()
anime_features = max_abs_scaler_2.fit_transform(anime_features_2)

# Using KNN again
# Note: the scaled matrix is stored in anime_features, but the model below is fit on the unscaled anime_features_2, so the output reflects unscaled features
nearest_neighbours = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(anime_features_2)
distances, indices = nearest_neighbours.kneighbors(anime_features_2)
recommend_animes("Hajime no Ippo")

Your Search: Hajime no Ippo  | Its  Genre is : Comedy, Drama, Shounen, Sports  || Rating: 8.83
RECOMMENDATIONS--
Diamond no Ace || Genre : Comedy, School, Shounen, Sports  || Rating: 8.25
Hikaru no Go || Genre : Comedy, Game, Shounen, Supernatural  || Rating: 8.19
Marmalade Boy || Genre : Comedy, Drama, Romance, Shoujo  || Rating: 7.64
SKET Dance || Genre : Comedy, School, Shounen  || Rating: 8.36
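
Comparing feature subsets is easier when building, scaling and fitting happen in one place, so that the model is always fit on the scaled matrix. The following is a rough sketch of such a helper on toy data. The function name `neighbours_for`, the frame `toy` and all values are hypothetical and not part of the project code.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MaxAbsScaler

# Toy catalogue (values are illustrative only).
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "genre": ["Comedy, Sports", "Comedy, Sports", "Drama", "Comedy, Sports"],
    "rating": [8.8, 8.2, 7.6, 6.0],
    "members": [1000.0, 900000.0, 1200.0, 1100.0],
})

def neighbours_for(frame, numeric_columns, k=2):
    """Hypothetical helper: scale the chosen features and fit KNN on the scaled matrix."""
    genres = frame["genre"].str.get_dummies(sep=", ")
    features = pd.concat([genres, frame[numeric_columns]], axis=1)
    scaled = MaxAbsScaler().fit_transform(features)
    model = NearestNeighbors(n_neighbors=k).fit(scaled)
    _, idx = model.kneighbors(scaled)
    return idx

with_members = neighbours_for(toy, ["rating", "members"])
without_members = neighbours_for(toy, ["rating"])

# Nearest neighbour of "A" (row 0) under each feature set; column 0 is "A" itself.
print(toy["name"].iloc[with_members[0, 1]])
print(toy["name"].iloc[without_members[0, 1]])
```

In this toy setup the popularity gap pushes "B" away from "A" when members is included, while dropping members lets the closer rating decide.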


In [None]:
# What if we change our feature selection? Here we remove both the members and type features from our feature set.
# Feature selection (rating is already numeric, so it is used directly without dummy encoding)
anime_features_2 = pd.concat([dataset["genre"].str.get_dummies(sep=","),dataset[["rating"]],dataset["episodes"]],axis=1)
anime_features_2.head(4)
max_abs_scaler_2 = MaxAbsScaler()
anime_features = max_abs_scaler_2.fit_transform(anime_features_2)

# Using KNN again
# Note: as in the previous cell, the model is fit on the unscaled anime_features_2
nearest_neighbours = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(anime_features_2)
distances, indices = nearest_neighbours.kneighbors(anime_features_2)
recommend_animes("Hajime no Ippo")

# Here we also see some discovery (the Slice of Life genre is recommended, rather than just Comedy, Drama, Shounen and Sports), but out of the 4 recommendations,
# 2 (rated 7.64 and 7.35) are not rated close to the queried anime (8.83).

Your Search: Hajime no Ippo  | Its  Genre is : Comedy, Drama, Shounen, Sports  || Rating: 8.83
RECOMMENDATIONS--
Diamond no Ace || Genre : Comedy, School, Shounen, Sports  || Rating: 8.25
Hikaru no Go || Genre : Comedy, Game, Shounen, Supernatural  || Rating: 8.19
Marmalade Boy || Genre : Comedy, Drama, Romance, Shoujo  || Rating: 7.64
Puchimas Petit Petit iDOLM STER || Genre : Comedy, Slice of Life  || Rating: 7.35
