# Content Based Recommendation System
In this notebook, I will be developing a content based recommendation system for the anime and rating datasets.

In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Change pandas settings so we can see all the columns in the dataframe
pd.set_option('display.max_columns', 99)

## Preprocessing the Data
Before we develop the system, the data must be preprocessed first.
### Anime Dataset
Let's read the data into a pandas dataframe:

In [2]:
# Read in the anime dataset
anime_df = pd.read_csv("datasets/cleaned_anime.csv")
anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.26,793665
1,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.25,114262
2,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.17,673572
3,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.16,151266
4,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10.0,9.15,93351


For the content based filtering, I will only be using the genre of an anime to create recommendations. As this is the case, I will drop columns that won't be needed to save memory.

In [3]:
# We will drop columns that will not be needed
anime_df.drop(["type", "episodes","rating", "members"], axis=1, inplace=True)
anime_df.head()

Unnamed: 0,anime_id,name,genre
0,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili..."
1,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S..."
2,9253,Steins;Gate,"Sci-Fi, Thriller"
3,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S..."
4,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports"


As we are using the genre of the animes to make our recommendations, it is a good idea to remove any animes that are uncategorised.

In [4]:
# Drop rows where there are empty values in the "genre" column
anime_df.dropna(subset=["genre"], inplace=True)

Finally, I will one-hot encode the genres of each anime. To do this, the data first needs a little more cleaning.

In [5]:
# The genre lists are inconsistently delimited: some rows use ", " and others ","
anime_df["genre"] = anime_df["genre"].str.replace(", ", ",", regex=False)

# Then split the genre column into lists so the genres can be one-hot encoded
anime_df["genre"] = anime_df["genre"].str.split(",")

anime_df.head()

Unnamed: 0,anime_id,name,genre
0,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil..."
1,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ..."
2,9253,Steins;Gate,"[Sci-Fi, Thriller]"
3,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ..."
4,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"[Comedy, Drama, School, Shounen, Sports]"
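The replace-then-split step above handles the two delimiters seen in this dataset. As a minimal sketch on made-up rows (the sample strings and the names `genres` and `genre_lists` are hypothetical, not taken from the dataset), splitting on the comma and stripping whitespace from each piece gives the same kind of result and also tolerates stray spaces:

```python
import pandas as pd

# Hypothetical genre strings with inconsistent delimiters
genres = pd.Series(["Action, Comedy", "Sci-Fi,Thriller", "Drama ,School"])

# Split on the comma, then strip whitespace around every genre name
genre_lists = genres.str.split(",").apply(lambda parts: [p.strip() for p in parts])

print(genre_lists.tolist())
```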


In [6]:
# Use scikit-learn's MultiLabelBinarizer to one-hot encode the genres
from sklearn.preprocessing import MultiLabelBinarizer

# Code from https://stackoverflow.com/questions/45312377/how-to-one-hot-encode-from-a-pandas-column-containing-a-list
mlb = MultiLabelBinarizer(sparse_output=True)

anime_df = anime_df.join(pd.DataFrame.sparse.from_spmatrix(
                mlb.fit_transform(anime_df["genre"]),
                index=anime_df.index,
                columns=mlb.classes_))

# Drop the original genre column
anime_df.drop("genre", axis=1, inplace=True)
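To make the encoding step concrete, here is a small sketch on made-up data. It is not the method used in this notebook: it swaps in pandas' own `Series.str.get_dummies`, which works on the comma-delimited strings (before they are split into lists) and produces a dense result. The dataframe `toy_df` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the anime dataframe (genres still as strings)
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action,Comedy", "Sci-Fi,Thriller", "Comedy"],
})

# One column per genre: 1 where the anime has that genre, 0 otherwise
encoded = toy_df["genre"].str.get_dummies(sep=",")

# Replace the original genre column with the encoded columns
toy_encoded = toy_df.drop("genre", axis=1).join(encoded)

print(toy_encoded)
```

For a dataset of this size the dense form is probably fine; the sparse `MultiLabelBinarizer` output used above should save memory when there are many labels.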

The final anime dataframe:

In [7]:
anime_df.head()

Unnamed: 0,anime_id,name,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,Harem,Historical,Horror,Josei,Kids,Magic,Martial Arts,Mecha,Military,Music,Mystery,Parody,Police,Psychological,Romance,Samurai,School,Sci-Fi,Seinen,Shoujo,Shoujo Ai,Shounen,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire
0,5114,Fullmetal Alchemist: Brotherhood,1,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,28977,Gintama°,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0
2,9253,Steins;Gate,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
3,9969,Gintama&#039;,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0
4,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0


### Ratings Dataset
Read in the data as a pandas dataframe:

In [None]:
rating_df = pd.read_csv("datasets/cleaned_rating.csv")
rating_df.head()

There isn't much to do to the ratings dataset except remove missing values.

In [None]:
# Remove missing values from the data
rating_df.dropna(inplace=True)

In [None]:
# Check that no missing values remain (every count should be 0)
rating_df.isnull().sum()

## Building the Recommendation System
Now that preprocessing of the data is complete, it is time to build the recommendation system. For now, I will only be developing the recommendation based on a single user example.

In [None]:
# Use the random library to generate a random user id
import random
# Set random seed (for reproducibility)
random.seed(10)

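# Pick a random id between the smallest and largest user ids in the ratings dataset
# (randint can return an id with no ratings if the ids are not contiguous)
user = random.randint(rating_df["user_id"].min(), rating_df["user_id"].max())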
user
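`random.randint` draws from every integer between the smallest and largest id, so if the ids have gaps it could return a user with no ratings at all. A minimal sketch of one way to avoid that, on made-up data (`toy_ratings`, `existing_ids` and `toy_user` are hypothetical names), is to choose only among ids that actually appear:

```python
import random

import pandas as pd

# Hypothetical ratings with non-contiguous user ids
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 5, 9, 9],
    "anime_id": [20, 30, 20, 40, 50],
    "rating": [8, 7, 9, 6, 10],
})

# Set random seed (for reproducibility)
random.seed(10)

# Choose among the ids that actually appear, so the pick always has ratings
existing_ids = sorted(toy_ratings["user_id"].unique().tolist())
toy_user = random.choice(existing_ids)

print(toy_user)
```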

We now have a user id to base our recommendations on. From this, we create a dataframe containing the animes that user 4271 has viewed and rated.

In [None]:
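# Select this user's ratings (copy so later in-place changes don't act on a slice of rating_df)
user_df = rating_df[rating_df["user_id"]==4271].copy()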

# Reset the indexes
user_df.reset_index(drop=True, inplace=True)
# Drop the columns that are not needed
user_df = user_df.drop("user_id", axis=1)

In [None]:
user_df

After that, we find the relevant animes and their genre information from the anime dataset.

In [None]:
user_genre_df = anime_df[anime_df["anime_id"].isin(user_df["anime_id"])]
user_genre_df

Note that this dataframe contains fewer rows, as some of the animes the user has rated have not been categorised into at least one genre and were dropped during preprocessing.

In [None]:
# Sort the genre rows by anime_id so that each row corresponds to the same anime in the user's rated dataframe
user_genre_df = user_genre_df.sort_values("anime_id")
user_genre_df.reset_index(drop=True, inplace=True)
user_genre_df

In [None]:
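# Drop the animes in the user's rated dataframe that are not categorised by at least 1 genre
# (filter on anime_id rather than hard-coding row positions), then sort to match user_genre_df
user_df = user_df[user_df["anime_id"].isin(user_genre_df["anime_id"])]
user_df = user_df.sort_values("anime_id").reset_index(drop=True)
user_df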
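The steps above rely on both dataframes ending up in the same row order. As a sketch of an alternative, assuming made-up data (`toy_user_df`, `toy_genres` and `aligned` are hypothetical names), an inner merge on `anime_id` keeps only the animes present in both frames and pairs each rating with its genre row regardless of input order:

```python
import pandas as pd

# Hypothetical ratings for one user; anime 99 has no genre row
toy_user_df = pd.DataFrame({"anime_id": [30, 99, 20], "rating": [7, 5, 9]})

# Hypothetical one-hot encoded genres for the categorised animes
toy_genres = pd.DataFrame({
    "anime_id": [20, 30],
    "Action": [1, 0],
    "Comedy": [0, 1],
})

# Inner merge: uncategorised anime 99 is dropped, and each rating
# sits on the same row as its genres
aligned = (
    toy_user_df.merge(toy_genres, on="anime_id", how="inner")
    .sort_values("anime_id")
    .reset_index(drop=True)
)

print(aligned)
```

The ratings vector and the genre matrix can then both be taken from `aligned`, so they cannot drift out of step.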

For the rated animes with genres, all we need are the genres.

In [None]:
user_genre_matrix = user_genre_df.drop(["anime_id", "name"], axis=1)
user_genre_matrix

With content-based filtering, we create a matrix of the genres that the user watches. We then compute a weight for each genre, where a higher weight means the user is more likely to enjoy that genre. To do this, we take the dot product of the transposed genre matrix and the vector of ratings the user gave each anime, so each genre's weight is the sum of the ratings of the watched animes in that genre.

In [None]:
# Vector of the user's ratings
user_df["rating"]

In [None]:
# Dot product
weights = user_genre_matrix.transpose().dot(user_df["rating"])

weights
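As a worked example of the same dot product on made-up numbers (the two animes, three genres and ratings below are hypothetical), each genre's weight comes out as the sum of the ratings of the animes tagged with that genre:

```python
import pandas as pd

# Hypothetical genre matrix: one row per rated anime, one column per genre
toy_genre_matrix = pd.DataFrame(
    {"Action": [1, 0], "Comedy": [0, 1], "Drama": [1, 1]}
)

# Hypothetical ratings for the same two animes, in the same row order
toy_ratings = pd.Series([10, 6], name="rating")

# Genres become rows after the transpose; the dot product sums the
# ratings of every anime that carries each genre
toy_weights = toy_genre_matrix.transpose().dot(toy_ratings)

print(toy_weights)
```

Here Action gets 10, Comedy gets 6, and Drama gets 10 + 6 = 16 because both animes are dramas.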

These are the weights, or in other words the genre preferences, of user 4271. We can then use these weights to recommend animes to the user. First, let's grab the full anime dataset with the genres one-hot encoded. We then set the anime id as the index and remove the name column, which will not be needed.

In [None]:
# Set the index of the dataframe to the anime_id
recommendation_table = anime_df.set_index("anime_id")
# Drop the name column
recommendation_table.drop("name", axis=1, inplace=True)
recommendation_table.head()

After this, we multiply each anime's genre columns by the weights, sum across the genres, and divide by the sum of the weights to get a weighted average for each anime. The resulting score therefore depends on the user's weights and the genre(s) the anime is categorised as.

In [None]:
# Get the weighted average
recommendation_series = (recommendation_table * weights).sum(axis=1) / weights.sum()
recommendation_series.head()
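The weighted average can be checked by hand on a small made-up case (the weights, anime ids and genres below are hypothetical). An anime that covers every genre the user has weights for scores 1, and the score falls as fewer of the preferred genres are covered:

```python
import pandas as pd

# Hypothetical genre weights for one user
toy_weights = pd.Series({"Action": 10, "Comedy": 6, "Drama": 16})

# Hypothetical one-hot encoded candidates, indexed by anime_id
toy_table = pd.DataFrame(
    {"Action": [1, 0, 1], "Comedy": [0, 1, 1], "Drama": [1, 0, 1]},
    index=pd.Index([101, 102, 103], name="anime_id"),
)

# Multiply each genre column by its weight, sum per anime,
# then normalise by the total weight
toy_scores = (toy_table * toy_weights).sum(axis=1) / toy_weights.sum()

print(toy_scores.sort_values(ascending=False))
```

Anime 101 (Action, Drama) scores 26 / 32, anime 102 (Comedy only) scores 6 / 32, and anime 103 (all three genres) scores 32 / 32.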

The last step before we can finally recommend our user anime shows is to sort the values in descending order so we get the animes that would most appeal to the user at the top.

In [None]:
# Sort in descending order
recommendations = recommendation_series.sort_values(ascending=False)
recommendations.head(10)

## Final Result
After all that work creating the recommendation system, we now have the final top 10 recommendations for user 4271:

In [None]:
# Find the top 10 animes in the recommendations in the anime dataset and put them in a new dataframe
top_ids = recommendations.head(10).index
recommendations_df = anime_df.loc[anime_df["anime_id"].isin(top_ids)]
# Set the index of the dataframe to the anime ids (returns a new dataframe rather than modifying a slice in place)
recommendations_df = recommendations_df.set_index("anime_id")
# Use loc and the anime ids of the top 10 anime recommendations to preserve the order and output that to the user
recommendations_df.loc[top_ids][["name"]]