# Content-Based Recommendation System
In this notebook, I develop a content-based recommendation system using the anime and rating datasets.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

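# Change pandas settings so we can see all the columns in the dataframe.
# The full option name is used because the short form 'max_columns' is
# ambiguous in recent pandas versions and raises an OptionError.
pd.set_option('display.max_columns', 99)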


## Preprocessing the Data
Before building the system, the data must be preprocessed.
### Anime Dataset
Let's read the data into a pandas dataframe:

In [2]:
# Read in the anime dataset
anime_df = pd.read_csv("datasets/cleaned_anime.csv")
anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.26,793665
1,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.25,114262
2,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.17,673572
3,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.16,151266
4,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10.0,9.15,93351


For content-based filtering, I will use only the genre of each anime to create recommendations, so I will drop the columns that are not needed to save memory.

In [3]:
# We will drop columns that will not be needed
anime_df.drop(["type", "episodes", "rating", "members"], axis=1, inplace=True)
anime_df.head()

Unnamed: 0,anime_id,name,genre
0,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili..."
1,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S..."
2,9253,Steins;Gate,"Sci-Fi, Thriller"
3,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S..."
4,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports"


Since the recommendations rely on genre, it is a good idea to remove any anime that are uncategorised, i.e. have no genre listed.

In [4]:
# Drop rows where there are empty values in the "genre" column
anime_df.dropna(subset=["genre"], inplace=True)
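
As a minimal sketch of what the `subset` argument does, the example below uses a small made-up dataframe (`toy_df` and its values are hypothetical, not taken from the anime dataset). It compares dropping rows only where `genre` is missing with dropping rows that have a missing value in any column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Comedy", np.nan, "Drama"],
    "rating": [8.1, 7.5, np.nan],
})

# Only the row with a missing genre is removed;
# row "C" is kept even though its rating is missing
by_genre = toy_df.dropna(subset=["genre"])

# Without subset, a missing value in any column removes the row
any_missing = toy_df.dropna()

print(by_genre["name"].tolist())
print(any_missing["name"].tolist())
```

Restricting the check to `genre` keeps rows that are still usable for genre-based recommendations even if other fields are incomplete.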

Finally, I will one-hot encode the genres of each anime. Before encoding, the genre column needs a little more cleaning.

In [5]:
# The genre strings are inconsistently delimited: some rows use ", " and others ","
# regex=False treats ", " as a literal string rather than a regular expression
anime_df["genre"] = anime_df["genre"].str.replace(", ", ",", regex=False)

# Then split each genre string into a list so the genres can be one-hot encoded
anime_df["genre"] = anime_df["genre"].str.split(",")

anime_df.head()

Unnamed: 0,anime_id,name,genre
0,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil..."
1,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ..."
2,9253,Steins;Gate,"[Sci-Fi, Thriller]"
3,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ..."
4,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"[Comedy, Drama, School, Shounen, Sports]"
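
The two cleaning steps can be checked on a small example. This is a sketch on a hypothetical series (`toy_genres` is made up for illustration) that mixes both delimiter styles:

```python
import pandas as pd

# Hypothetical genre strings mixing ", " and "," as delimiters
toy_genres = pd.Series(["Action, Comedy", "Sci-Fi,Thriller", "Drama"])

# Normalise the delimiter to "," (literal replacement, not a regex)
normalised = toy_genres.str.replace(", ", ",", regex=False)

# Split each string into a list of genre names
genre_lists = normalised.str.split(",")

print(genre_lists.tolist())
# [['Action', 'Comedy'], ['Sci-Fi', 'Thriller'], ['Drama']]
```

Normalising first means no genre name ends up with a leading space, which would otherwise be encoded as a separate category (for example `" Comedy"` alongside `"Comedy"`).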


In [6]:
# Use scikit-learn's MultiLabelBinarizer to one-hot encode the genres
from sklearn.preprocessing import MultiLabelBinarizer

# Code from https://stackoverflow.com/questions/45312377/how-to-one-hot-encode-from-a-pandas-column-containing-a-list
mlb = MultiLabelBinarizer(sparse_output=True)

anime_df = anime_df.join(pd.DataFrame.sparse.from_spmatrix(
                mlb.fit_transform(anime_df["genre"]),
                index=anime_df.index,
                columns=mlb.classes_))

# Drop the original genre column
anime_df.drop("genre", axis=1, inplace=True)

The final anime dataframe:

In [10]:
anime_df.head()

Unnamed: 0,anime_id,name,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,Harem,Historical,Horror,Josei,Kids,Magic,Martial Arts,Mecha,Military,Music,Mystery,Parody,Police,Psychological,Romance,Samurai,School,Sci-Fi,Seinen,Shoujo,Shoujo Ai,Shounen,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire
0,5114,Fullmetal Alchemist: Brotherhood,1,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,28977,Gintama°,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0
2,9253,Steins;Gate,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
3,9969,Gintama&#039;,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0
4,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0
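
To make the encoding easier to follow, here is a minimal sketch of the same idea on a hypothetical three-row dataframe (`toy_df` is made up for illustration). For simplicity it uses the default dense output of `MultiLabelBinarizer` rather than the sparse output used in the cell above.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: each row holds a list of genres
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": [["Action", "Comedy"], ["Sci-Fi", "Thriller"], ["Action", "Drama"]],
})

# Dense output here (an assumption for readability); the notebook uses sparse_output=True
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(toy_df["genre"]),
    index=toy_df.index,
    columns=mlb.classes_,
)

# Replace the list column with one 0/1 column per genre
toy_encoded = toy_df.drop(columns="genre").join(encoded)

print(list(mlb.classes_))
print(toy_encoded)
```

Each genre becomes its own column, ordered alphabetically in `mlb.classes_`, with a 1 where the anime has that genre and a 0 otherwise.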


### Ratings Dataset
Read in the data as a pandas dataframe:

In [9]:
rating_df = pd.read_csv("datasets/cleaned_rating.csv")
rating_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,
1,1,24,
2,1,79,
3,1,226,
4,1,241,


There isn't much to do to the ratings dataset except remove the rows with missing values.

In [11]:
# Remove missing values from the data
rating_df.dropna(inplace=True)

In [17]:
# Confirm that no missing values remain in any column
rating_df.isnull().sum()

user_id     0
anime_id    0
rating      0
dtype: int64

## Building the Recommendation System