# Content Based Recommendation System
In this notebook, I will be developing a content based recommendation system for the anime and rating datasets.

In [31]:
# Import libraries
import pandas as pd
import numpy as np

# Change pandas settings so we can see the all columns in the dataframe
# pd.set_option('max_columns', 99)

## Preprocessing the Data
Before we develop the system, the data must be preprocessed first.
### Anime Dataset
Let's read the data into a pandas dataframe:

In [32]:
# Read in the anime dataset
anime_df = pd.read_csv("datasets/cleaned_anime.csv")
anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


For the content based filtering, I will only be using the genre of an anime to create recommendations. As this is the case, I will drop columns that won't be needed to save memory.

In [33]:
# We will drop columns that will not be needed
anime_df.drop(["type", "episodes","rating", "members"], axis=1, inplace=True)
anime_df.head()

Unnamed: 0,anime_id,name,genre
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural"
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili..."
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S..."
3,9253,Steins;Gate,"Sci-Fi, Thriller"
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S..."


As we are using the genre of the animes to make our recommendations, it is a good idea to remove any animes that are uncategorised.

In [34]:
# Drop rows where there are empty values in the "genre" column
anime_df.dropna(subset=["genre"], inplace=True)

Finally, I will be hot one encoding the genres of each anime. To do this, I first needed to clean the data a bit more before encoding.

In [35]:
# I found that the list of genres that a row contains was inconsistently formatted with some rows having ", " as a delimiter and others ","
anime_df["genre"] = anime_df["genre"].str.replace(", ", ",")

# Then convert the genre column into a list so you can hot one encode the genres.
anime_df["genre"] = anime_df["genre"].str.split(",")

anime_df.head()

Unnamed: 0,anime_id,name,genre
0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]"
1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil..."
2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ..."
3,9253,Steins;Gate,"[Sci-Fi, Thriller]"
4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ..."


In [36]:
# Using scikit learn's MLB package to one hot encode the genres
from sklearn.preprocessing import MultiLabelBinarizer

# Code from https://stackoverflow.com/questions/45312377/how-to-one-hot-encode-from-a-pandas-column-containing-a-list
mlb = MultiLabelBinarizer(sparse_output=True)

anime_df = anime_df.join(pd.DataFrame.sparse.from_spmatrix(
                mlb.fit_transform(anime_df["genre"]),
                index=anime_df.index,
                columns=mlb.classes_))

# Drop the origininal genre column
anime_df.drop("genre", axis=1, inplace=True)
anime_df

Unnamed: 0,anime_id,name,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,32281,Kimi no Na wa.,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,1,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,28977,Gintama°,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9253,Steins;Gate,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,9969,Gintama&#039;,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12290,5543,Under World,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12291,5621,Violence Gekiga David no Hoshi,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
# mlb = MultiLabelBinarizer()
# anime_df = anime_df.join(pd.DataFrame(mlb.fit_transform(anime_df.pop('genre')),
#                           columns=mlb.classes_,
#                           index=anime_df.index))


The final anime dataframe:

In [38]:
anime_df.head()

Unnamed: 0,anime_id,name,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,32281,Kimi no Na wa.,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,1,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,28977,Gintama°,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9253,Steins;Gate,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,9969,Gintama&#039;,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Ratings Dataset
Read in the data as a pandas dataframe:

In [39]:
rating_df = pd.read_csv("datasets/cleaned_rating.csv")
rating_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,
1,1,24,
2,1,79,
3,1,226,
4,1,241,


There isn't much to do to the ratings dataset except for remove missing values.

In [40]:
# Remove missing values from the data
rating_df.dropna(inplace=True)

In [41]:
# How many missing values do we have?
rating_df.isnull().sum()

user_id     0
anime_id    0
rating      0
dtype: int64

## Building the Recommendation System
Now that preprocessing of the data is complete, it is time to build the recommendation system. For now, I will only be developing the recommendation based on a single user example.

In [42]:
# Use the random library to generate a random user id
import random
# Set random seed (for reproducibility)
# random.seed(10)

# Pick a random id from the ratings dataset
user = random.randrange(rating_df["user_id"].min(), rating_df["user_id"].max())
user

7209

We have a user id to base our recommendation system now. From this we create a dataframe containing the animes that user 4271 has viewed and rated.

In [43]:
user_df = rating_df[rating_df["user_id"]==64237]

# Reset the indexes
user_df.reset_index(drop=True, inplace=True)

# Drop the columns that are not needed
user_df = user_df.drop("user_id", axis=1)

In [44]:
user_df

Unnamed: 0,anime_id,rating
0,20,7.0
1,226,4.0
2,249,7.0
3,269,7.0
4,1887,9.0
5,5114,9.0
6,8841,9.0
7,9919,7.0


After that, we find the relevant animes and their genre information from the anime dataset.

In [45]:
user_genre_df = anime_df[anime_df["anime_id"].isin(user_df["anime_id"])]
user_genre_df

Unnamed: 0,anime_id,name,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
1,5114,Fullmetal Alchemist: Brotherhood,1,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
582,269,Bleach,1,0,0,1,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
643,9919,Ao no Exorcist,1,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
690,249,InuYasha,1,1,0,1,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
729,1887,Lucky☆Star,0,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
760,226,Elfen Lied,1,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
841,20,Naruto,1,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1123,8841,Kore wa Zombie Desu ka?,1,0,0,1,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0


Note that this dataframe contains fewer values as some of the animes they have rated have not been categorised into at least 1 genre.

In [46]:
# Sort the genre animes by the anime_id's so that the rows correspond to the same anime in the user's rated dataframe
user_genre_df = user_genre_df.sort_values("anime_id")
user_genre_df.reset_index(drop=True, inplace=True)
user_genre_df

Unnamed: 0,anime_id,name,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,20,Naruto,1,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,226,Elfen Lied,1,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
2,249,InuYasha,1,1,0,1,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,269,Bleach,1,0,0,1,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
4,1887,Lucky☆Star,0,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
5,5114,Fullmetal Alchemist: Brotherhood,1,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
6,8841,Kore wa Zombie Desu ka?,1,0,0,1,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
7,9919,Ao no Exorcist,1,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0


In [47]:
user_df

Unnamed: 0,anime_id,rating
0,20,7.0
1,226,4.0
2,249,7.0
3,269,7.0
4,1887,9.0
5,5114,9.0
6,8841,9.0
7,9919,7.0


After that, we find the relevant animes and their genre information from the anime dataset.

In [48]:
user_genre_matrix = user_genre_df.drop(["anime_id", "name"], axis=1)
user_genre_matrix

Unnamed: 0,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,1,1,0,1,0,1,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
5,1,1,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
6,1,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
7,1,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0


With content based filtering, we create a matrix of the genres that the user watches. We then create weights for each genre where the higher the weight, the more likely the user enjoys that genre. To do this, we use the dot product between a matrix and a vector: the genre matrix and the ratings of each of the animes they watched (vector).

In [49]:
# Vector 
user_df["rating"]


0    7.0
1    4.0
2    7.0
3    7.0
4    9.0
5    9.0
6    9.0
7    7.0
Name: rating, dtype: float64

In [50]:
# Dot product
weights = user_genre_matrix.transpose().dot(user_df["rating"])

weights

Action           50.0
Adventure        16.0
Cars              0.0
Comedy           39.0
Dementia          0.0
Demons           14.0
Drama            13.0
Ecchi             9.0
Fantasy          23.0
Game              0.0
Harem             9.0
Hentai            0.0
Historical        0.0
Horror            4.0
Josei             0.0
Kids              0.0
Magic            25.0
Martial Arts      7.0
Mecha             0.0
Military          9.0
Music             0.0
Mystery           0.0
Parody            9.0
Police            0.0
Psychological     4.0
Romance          11.0
Samurai           0.0
School            9.0
Sci-Fi            0.0
Seinen            4.0
Shoujo            0.0
Shoujo Ai         0.0
Shounen          37.0
Shounen Ai        0.0
Slice of Life     9.0
Space             0.0
Sports            0.0
Super Power      14.0
Supernatural     34.0
Thriller          0.0
Vampire           0.0
Yaoi              0.0
Yuri              0.0
dtype: float64

These are the weights or in other words, the genre preferences, of user 4271. We can then use these weights to recommend animes to the user. First let's grab the full anime dataset with the genres hot one encoded. We then set the anime id as the index and remove the name column which will not be needed.

In [51]:
# Set the index of the dataframe to the anime_id
recommendation_table = anime_df.set_index("anime_id")
# Drop the name column
recommendation_table.drop("name", axis=1, inplace=True)
recommendation_table.head()

Unnamed: 0_level_0,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
32281,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
5114,1,1,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
28977,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9253,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9969,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


After this, we multiply the genres in the matrix with the weights and then divide it by the sum of the weights for a weighted average of the animes. This will mean the resulting values will be based on the user's weights and the genre(s) an anime is categorised as.

In [52]:
# Get the weighted average
recommendation_series = (recommendation_table * weights).sum(axis=1) / weights.sum()
recommendation_series.head()

anime_id
32281    0.191977
5114     0.495702
28977    0.386819
9253     0.000000
9969     0.386819
dtype: float64

The last step before we can finally recommend our user anime shows is to sort the values in descending order so we get the animes that would most appeal to the user at the top.

In [53]:
# Sort in descending order
recommendations = recommendation_series.sort_values(ascending=False)
recommendations.head(10)

anime_id
231      0.724928
249      0.713467
6811     0.713467
25157    0.704871
808      0.667622
33581    0.607450
121      0.607450
9135     0.607450
157      0.601719
28285    0.598854
dtype: float64

## Final Result
After all that work creating the recommendation system, we now have the top 10 recommendations for 
The final top 10 recommendations for user 4271:

In [54]:
# Find the top 10 animes in the recommendations in the anime dataset and put it in a new dataframe
recommendations_df = anime_df.loc[anime_df["anime_id"].isin(recommendations.head(10).keys())]
# Set the index of the dataframe to the anime ids
recommendations_df.set_index("anime_id", inplace=True)
# Use loc and the anime ids of the top 10 anime recommendations to preserve the order and output that to the user
recommendations_df.loc[recommendations.head(10).keys()][["name"]]
# recommendations_df

Unnamed: 0_level_0,name
anime_id,Unnamed: 1_level_1
231,Asagiri no Miko
249,InuYasha
6811,InuYasha: Kanketsu-hen
25157,Trinity Seven
808,Bakuretsu Hunters OVA
33581,Trinity Seven Movie: Eternity Library to Alche...
121,Fullmetal Alchemist
9135,Fullmetal Alchemist: The Sacred Star of Milos
157,Mahou Sensei Negima!
28285,Trinity Seven OVA
