# Anime Recommendation System

<img src="https://piunikaweb.com/wp-content/uploads/2020/02/EQNwyKLUUAUGPen-750x354.jpg" style="width: 100%; height: 100%" align = "left">


# Table of contents

[<h3>1. Exploratory data analysis and data cleaning</h3>](#1)

[<h3>2. Collaborative Recommendation System</h3>](#2)

[<h3>3. Recommendations</h3>](#3)

   [<h4>3.1. Naruto</h4>](#4)

   [<h4>3.2. Death Note</h4>](#5)

In this notebook we will build a basic collaborative anime recommendation system. First, let's have a look at the dataset.

This dataset contains preference data from 73,516 users on 12,294 anime. Each user can add anime to their completed list and give it a rating; this dataset is a compilation of those ratings. Its composition in numbers:
* 7,813,737 ratings
* 73,516 users
* 12,294 anime

<h2> Content:</h2>

**Anime.csv contains information about each anime:**
* **anime_id** - myanimelist.net's unique id identifying an anime.
* **name** - full name of anime.
* **genre** - comma separated list of genres for this anime.
* **type** - movie, TV, OVA, etc.
* **episodes** - how many episodes in this show. (1 if movie).
* **rating** - average rating out of 10 for this anime.
* **members** - number of community members that are in this anime's "group".

**Rating.csv contains the ratings given by users:**
* **user_id** - non identifiable randomly generated user id.
* **anime_id** - the anime that this user has rated.
* **rating** - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating).





# 1. Exploratory data analysis and data cleaning<a class="anchor" id="1"></a>

Before we start with the recommender system, let's have a closer look at the datasets.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
anime = pd.read_csv('../data/anime.csv')
rating = pd.read_csv('../data/rating.csv')

In [2]:
anime.head()

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |


In [3]:
rating.head()

| | user_id | anime_id | rating |
|---|---|---|---|
| 0 | 1 | 20 | -1 |
| 1 | 1 | 24 | -1 |
| 2 | 1 | 79 | -1 |
| 3 | 1 | 226 | -1 |
| 4 | 1 | 241 | -1 |


In [4]:
anime.describe()

| | anime_id | rating | members |
|---|---|---|---|
| count | 12294.0 | 12064.0 | 12294.0 |
| mean | 14058.221653 | 6.473902 | 18071.34 |
| std | 11455.294701 | 1.026746 | 54820.68 |
| min | 1.0 | 1.67 | 5.0 |
| 25% | 3484.25 | 5.88 | 225.0 |
| 50% | 10260.5 | 6.57 | 1550.0 |
| 75% | 24794.5 | 7.18 | 9437.0 |
| max | 34527.0 | 10.0 | 1013917.0 |


In [5]:
rating.describe()

| | user_id | anime_id | rating |
|---|---|---|---|
| count | 7813737.0 | 7813737.0 | 7813737.0 |
| mean | 36727.96 | 8909.072 | 6.14403 |
| std | 20997.95 | 8883.95 | 3.7278 |
| min | 1.0 | 1.0 | -1.0 |
| 25% | 18974.0 | 1240.0 | 6.0 |
| 50% | 36791.0 | 6213.0 | 7.0 |
| 75% | 54757.0 | 14093.0 | 9.0 |
| max | 73516.0 | 34519.0 | 10.0 |


In [6]:
# Let's have a look at the distribution of ratings, because those "-1" values are suspicious
rating.rating.value_counts()

 8     1646019
-1     1476496
 7     1375287
 9     1254096
 10     955715
 6      637775
 5      282806
 4      104291
 3       41453
 2       23150
 1       16649
Name: rating, dtype: int64

`Ratings go from 1 to 10. As described above, -1 means that the user watched the anime but did not assign a rating. Therefore we will drop the rows with "-1" in rating.`

In [2]:
rating = rating[rating["rating"] != -1]
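
Dropping the unrated rows is one option. Another is to keep them and mark the rating as missing, which preserves the fact that the user watched the title. The sketch below compares both on a small made-up frame (`toy_rating` and its values are assumptions for illustration, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for rating.csv (values are made up for illustration)
toy_rating = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "anime_id": [20, 24, 20, 79, 24],
    "rating":   [-1, 8, 10, -1, 7],
})

# Option A (used in this notebook): drop the rows without an explicit rating
dropped = toy_rating[toy_rating["rating"] != -1]

# Option B: keep every row, but replace the -1 placeholder with NaN
marked = toy_rating.assign(rating=toy_rating["rating"].replace(-1, np.nan))

print(len(dropped))                   # rows with an explicit rating
print(marked["rating"].isna().sum())  # rows watched but not rated
```

For a correlation-based recommender both options behave the same, since missing ratings are ignored when correlations are computed; option A simply produces a smaller frame.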

In [3]:
print(f"anime.csv - rows: {anime.shape[0]}, columns: {anime.shape[1]}")
print(f"rating.csv - rows: {rating.shape[0]}, columns: {rating.shape[1]}")

anime.csv - rows: 12294, columns: 7
rating.csv - rows: 6337241, columns: 3


In [4]:
plt.figure(figsize=(8,6))
sns.heatmap(anime.isnull())
plt.title("Missing values in anime?", fontsize = 15)
plt.show()

`The anime dataset has some missing values in rating and genre, but we can ignore them, because we won't use those columns later.`
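
A heatmap shows where values are missing, but not how many. The `describe()` output above already hints at the scale: 12,064 ratings for 12,294 rows, i.e. 230 anime without an average rating. A per-column count can be obtained as sketched below, here on a made-up frame (`toy_anime` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for anime.csv with a few missing values (made-up data)
toy_anime = pd.DataFrame({
    "anime_id": [1, 2, 3, 4],
    "name": ["A", "B", "C", "D"],
    "genre": ["Action", None, "Drama", "Comedy"],
    "rating": [8.1, np.nan, np.nan, 6.5],
})

# Number of missing values per column
missing_per_column = toy_anime.isnull().sum()
print(missing_per_column)
```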

In [10]:
plt.figure(figsize=(8,6))
sns.heatmap(rating.isnull())
plt.title("Missing values in rating?", fontsize = 15)
plt.show()

## 1.1. Prepare the data

In [4]:
# Merge rating and anime using "anime_id" as the key
# Keep only the columns we will use: user_id, rating and name
df = pd.merge(rating, anime[["anime_id", "name"]], on="anime_id").drop("anime_id", axis=1)
df.head()

| | user_id | rating | name |
|---|---|---|---|
| 0 | 1 | 10 | Highschool of the Dead |
| 1 | 3 | 6 | Highschool of the Dead |
| 2 | 5 | 2 | Highschool of the Dead |
| 3 | 12 | 6 | Highschool of the Dead |
| 4 | 14 | 6 | Highschool of the Dead |
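
`pd.merge` performs an inner join by default, so ratings whose `anime_id` has no match in anime.csv are silently dropped. The row counts printed in this notebook (6,337,241 rated rows before the merge, 6,337,239 rows in `df`) suggest that this happens for two ratings. A minimal sketch of how such rows could be spotted, using made-up frames (`toy_rating` and `toy_anime` are assumptions for illustration):

```python
import pandas as pd

# Made-up data: anime_id 999 has no entry in the anime table
toy_rating = pd.DataFrame({
    "user_id": [1, 2, 3],
    "anime_id": [20, 20, 999],
    "rating": [10, 6, 7],
})
toy_anime = pd.DataFrame({"anime_id": [20, 24], "name": ["Title A", "Title B"]})

# A left join with indicator=True adds a "_merge" column telling where each row came from
checked = pd.merge(toy_rating, toy_anime, on="anime_id", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["user_id", "anime_id"]])
```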


In [5]:
# Count the number of ratings for each anime
count_rating = df.groupby("name")["rating"].count().sort_values(ascending = False)
count_rating

name
Death Note                         34226
Sword Art Online                   26310
Shingeki no Kyojin                 25290
Code Geass: Hangyaku no Lelouch    24126
Angel Beats!                       23565
                                   ...  
Ashita no Eleventachi                  1
Ashita e Mukau Hito                    1
Shounen Ninja Kaze no Fujimaru         1
Hi no Tori: Hagoromo-hen               1
Mechakko Dotakon                       1
Name: rating, Length: 9926, dtype: int64

In [6]:
# Some anime have only 1 rating, so it is better for the recommender system to ignore them
# We will keep only the anime with at least r ratings
r = 5000
more_than_r_ratings = count_rating[count_rating >= r].index

# Keep only the anime with at least r ratings in the DataFrame
df_r = df[df["name"].isin(more_than_r_ratings)]

In [7]:
before = len(df.name.unique())
after = len(df_r.name.unique())
rows_before = df.shape[0]
rows_after = df_r.shape[0]
print(f'''There are {before} animes in the dataset before filtering and {after} animes after the filtering.

{before} animes => {after} animes
{rows_before} rows before filtering => {rows_after} rows after filtering''')

There are 9926 animes in the dataset before filtering and 279 animes after the filtering.

9926 animes => 279 animes
6337239 rows before filtering => 2517097 rows after filtering
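
The threshold `r = 5000` is strict: it keeps 279 of 9,926 titles. To get a feeling for how the choice of `r` changes the number of titles kept, the counts can be compared for several thresholds. A small sketch on made-up data (`toy_df` and the thresholds are assumptions for illustration):

```python
import pandas as pd

# Made-up ratings: title A has 3 ratings, B has 2, C has 1
toy_df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 1],
    "name": ["A", "A", "A", "B", "B", "C"],
    "rating": [8, 7, 9, 6, 5, 10],
})

counts = toy_df["name"].value_counts()

# Number of titles that survive each minimum-rating threshold
kept_by_threshold = {r: int((counts >= r).sum()) for r in (1, 2, 3)}
print(kept_by_threshold)
```

A lower threshold widens the catalogue but makes the correlations for rarely rated titles less reliable.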


# 2. Collaborative Recommendation System<a class="anchor" id="2"></a>

In [8]:
# Create a matrix with user_id as rows and the anime titles as columns.
# Each cell holds the rating given by the user to that anime.
# There will be a lot of NaN values, because each user has rated only a few of the anime
df_recom = df_r.pivot_table(index='user_id',columns='name',values='rating')
df_recom.iloc[:5,:5]

| user_id | Accel World | Afro Samurai | Air | Air Gear | Akame ga Kill! |
|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN |
| 3 | 7.0 | 6.0 | NaN | NaN | 8.0 |
| 5 | 3.0 | NaN | NaN | NaN | 4.0 |
| 7 | 8.0 | NaN | NaN | NaN | NaN |
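
How sparse such a user-by-title matrix is can be measured directly. The sketch below builds the matrix from a made-up frame (`toy_r` is an assumption for illustration) and computes the share of filled cells. Note that `pivot_table` aggregates with the mean by default, so a user who rated the same title twice would end up with the average of both ratings.

```python
import pandas as pd

# Made-up ratings for 3 users and 3 titles
toy_r = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "name": ["A", "B", "A", "B", "C"],
    "rating": [8, 6, 7, 9, 5],
})

toy_matrix = toy_r.pivot_table(index="user_id", columns="name", values="rating")

# Share of cells that actually contain a rating
filled = int(toy_matrix.notna().sum().sum())
density = filled / toy_matrix.size
print(f"{filled} of {toy_matrix.size} cells filled ({density:.0%})")
```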


In [9]:
df_r.name.value_counts().head(10)

Death Note                            34226
Sword Art Online                      26310
Shingeki no Kyojin                    25290
Code Geass: Hangyaku no Lelouch       24126
Angel Beats!                          23565
Elfen Lied                            23528
Naruto                                22071
Fullmetal Alchemist: Brotherhood      21494
Fullmetal Alchemist                   21332
Code Geass: Hangyaku no Lelouch R2    21124
Name: name, dtype: int64

In [12]:
# Pickle the recommender DataFrame
df_recom.to_pickle("../models/df.zip", compression='zip')

In [13]:
new_df= pd.read_pickle("../models/df.zip")
new_df.head()


| user_id | Accel World | Afro Samurai | Air | Air Gear | Akame ga Kill! | Akira | Aldnoah.Zero | Amagi Brilliant Park | Angel Beats! | Angel Beats!: Another Epilogue | ... | Yahari Ore no Seishun Love Comedy wa Machigatteiru. | Yahari Ore no Seishun Love Comedy wa Machigatteiru. Zoku | Yosuga no Sora: In Solitude, Where We Are Least Alone. | Yuu☆Yuu☆Hakusho | Zankyou no Terror | Zero no Tsukaima | Zero no Tsukaima F | Zero no Tsukaima: Futatsuki no Kishi | Zero no Tsukaima: Princesses no Rondo | Zetsuen no Tempest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 7.0 | 6.0 | NaN | NaN | 8.0 | NaN | 7.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.0 |
| 5 | 3.0 | NaN | NaN | NaN | 4.0 | 8.0 | NaN | 6.0 | 3.0 | NaN | ... | 3.0 | 4.0 | NaN | 7.0 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | NaN |
| 7 | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 7.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [12]:
def find_corr(df, name):
    '''
    Get the correlation of one anime with the others
    
    Args
        df (DataFrame): user_id as rows, anime titles as columns and ratings as values
        name (str): Name of the anime
    
    Return
        DataFrame with the correlation of the anime with all others
    '''
    
    similar_to_movie = df.corrwith(df[name])
    similar_to_movie = pd.DataFrame(similar_to_movie,columns=['Correlation'])
    similar_to_movie = similar_to_movie.sort_values(by = 'Correlation', ascending = False)
    return similar_to_movie
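
`corrwith` computes each correlation only over the users who rated both titles, so a coefficient can rest on very few co-raters. The variant below is a sketch, not part of the original notebook: the function name `find_corr_min_overlap`, the `min_overlap` parameter and the toy data are all assumptions. It reports the number of co-raters next to each correlation and drops titles below a minimum overlap.

```python
import numpy as np
import pandas as pd

def find_corr_min_overlap(df, name, min_overlap=2):
    """Hypothetical variant of find_corr that also counts co-raters."""
    target = df[name]
    corr = df.corrwith(target)
    # For each title: number of users who rated both it and the target
    overlap = df[target.notna()].notna().sum()
    out = pd.DataFrame({"Correlation": corr, "Overlap": overlap})
    out = out[out["Overlap"] >= min_overlap]
    return out.sort_values("Correlation", ascending=False)

# Made-up user-by-title matrix; NaN means "not rated"
toy = pd.DataFrame(
    {
        "A": [5, 4, 3, 2, np.nan],
        "B": [4, 5, 3, 1, 1],
        "D": [2, 3, np.nan, np.nan, 4],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

result = find_corr_min_overlap(toy, "A", min_overlap=3)
print(result)
```

In this toy matrix, title D shares only two raters with A, which yields a correlation of about -1 from just two data points; with `min_overlap=3` it is filtered out.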

# 3. Recommendations<a class="anchor" id="3"></a>

Let's try the recommendation system on two anime.

* The higher the correlation, the more likely it is that a viewer who liked the selected anime will also like the recommended one.
* A negative correlation means that the viewer is likely to dislike the anime.
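
An anime always correlates perfectly with itself, so the queried title is expected at the top of its own list. A small helper can remove it before showing recommendations. The sketch below assumes a table sorted by descending correlation, as `find_corr` returns; the helper name `top_n_similar` and the toy values are assumptions for illustration.

```python
import pandas as pd

def top_n_similar(corr_table, name, n=3):
    """Hypothetical helper: drop the queried title and missing values, keep the n best."""
    return corr_table.drop(index=name, errors="ignore").dropna().head(n)

# Made-up correlation table, already sorted in descending order
toy_corr = pd.DataFrame(
    {"Correlation": [1.0, 0.36, 0.35, -0.10]},
    index=["Death Note", "Title X", "Title Y", "Title Z"],
)

recommended = top_n_similar(toy_corr, "Death Note", n=2)
print(recommended)
```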

## 3.1. [Naruto](https://en.wikipedia.org/wiki/Naruto)<a class="anchor" id="4"></a>
<img src="https://upload.wikimedia.org/wikipedia/en/9/94/NarutoCoverTankobon1.jpg" style="width: 20%; height: 20%" align = "left">


In [18]:
# Let's choose an anime
anime1 = 'Naruto'

# Recommendations
find_corr(df_recom, anime1).head(10)

| name | Correlation |
|---|---|
| Fullmetal Alchemist | 1.0 |
| Fullmetal Alchemist: The Conqueror of Shamballa | 0.549444 |
| Fullmetal Alchemist: Brotherhood | 0.329108 |
| Soul Eater | 0.326991 |
| Ao no Exorcist | 0.326512 |
| Vampire Knight Guilty | 0.318152 |
| Ouran Koukou Host Club | 0.312287 |
| Full Metal Panic! | 0.310996 |
| Claymore | 0.310494 |
| Vampire Knight | 0.304586 |


In [15]:
# Not recommended
find_corr(df_recom, anime1).tail(10)

| name | Correlation |
|---|---|
| Suzumiya Haruhi no Shoushitsu | 0.109187 |
| Neon Genesis Evangelion | 0.092781 |
| Byousoku 5 Centimeter | 0.091883 |
| Bakemonogatari | 0.075505 |
| Neon Genesis Evangelion: The End of Evangelion | 0.070812 |
| Mahou Shoujo Madoka★Magica | 0.070293 |
| Cowboy Bebop | 0.064961 |
| NHK ni Youkoso! | 0.063474 |
| FLCL | 0.046725 |
| Baccano! | 0.033748 |


## 3.2. [Death Note](https://en.wikipedia.org/wiki/Death_Note)<a class="anchor" id="5"></a>
<img src="https://upload.wikimedia.org/wikipedia/en/6/6f/Death_Note_Vol_1.jpg" style="width: 20%; height: 20%" align = "left">

In [16]:
# Let's choose an anime
anime2 = 'Death Note'

# Recommendations
find_corr(df_recom, anime2).head(10)

| name | Correlation |
|---|---|
| Death Note | 1.0 |
| Code Geass: Hangyaku no Lelouch R2 | 0.358927 |
| Code Geass: Hangyaku no Lelouch | 0.35129 |
| Shingeki no Kyojin | 0.34677 |
| Naruto | 0.313676 |
| Bleach | 0.301795 |
| Elfen Lied | 0.297404 |
| Dragon Ball Z | 0.296798 |
| Mirai Nikki (TV) | 0.2938 |
| Kuroshitsuji | 0.293409 |


In [17]:
# Not recommended
find_corr(df_recom, anime2).tail(10)

| name | Correlation |
|---|---|
| Mononoke Hime | 0.143574 |
| Baccano! | 0.142693 |
| Ookami to Koushinryou | 0.137454 |
| Byousoku 5 Centimeter | 0.135572 |
| Akira | 0.128373 |
| NHK ni Youkoso! | 0.126408 |
| Neon Genesis Evangelion | 0.108453 |
| Neon Genesis Evangelion: The End of Evangelion | 0.104087 |
| Cowboy Bebop | 0.074552 |
| FLCL | 0.062253 |
