# Collaborative Filtering
This notebook walks through the steps to build a collaborative filtering recommendation system from the anime and ratings datasets.

In [None]:
import pandas as pd
import numpy as np

## Preprocessing the Data
### Anime Dataset
Let's read the anime dataset into a pandas dataframe.

In [2]:
anime_df = pd.read_csv("datasets/cleaned_anime.csv")
anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.26,793665
1,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.25,114262
2,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.17,673572
3,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.16,151266
4,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10.0,9.15,93351


With collaborative filtering, we don't need most of an anime's metadata, because recommendations are driven by how users rated each anime rather than by its content. This contrasts with content-based filtering, which uses information about the anime itself to predict what a user would like. We can therefore drop the columns we don't need, which makes the process clearer and saves memory.

In [3]:
# Remove information we don't need
anime_df = anime_df.loc[:, ["anime_id", "name", "rating"]]
anime_df.head()

Unnamed: 0,anime_id,name,rating
0,5114,Fullmetal Alchemist: Brotherhood,9.26
1,28977,Gintama°,9.25
2,9253,Steins;Gate,9.17
3,9969,Gintama&#039;,9.16
4,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,9.15


### Ratings Dataset
Read the ratings data into a dataframe.

In [4]:
rating_df = pd.read_csv("datasets/cleaned_rating.csv")
rating_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,
1,1,24,
2,1,79,
3,1,226,
4,1,241,


The only preprocessing needed for the ratings data is to remove rows with missing values.

In [5]:
# Remove rows with missing values
rating_df.dropna(inplace=True)
# Confirm that no missing values remain
rating_df.isnull().sum()

user_id     0
anime_id    0
rating      0
dtype: int64

## Building the Recommendation System
To build the collaborative filtering system, I will use the Surprise library (scikit-surprise), together with sample user input based on anime I have watched in the past.

In [6]:
%%time
# Import the surprise library
from surprise import Reader
from surprise import Dataset

# reader parses the file containing the ratings
# Our rating scale is from 1 to 10 inclusive
reader = Reader(rating_scale=(1, 10))

# Load the dataframe into the model's dataset
data = Dataset.load_from_df(rating_df[["user_id", "anime_id", "rating"]], reader)
data

Wall time: 16.5 s


<surprise.dataset.DatasetAutoFolds at 0x25d62585070>

Now that the rating data is loaded into Surprise, we can create the collaborative filtering system. I decided to use the well-known SVD algorithm that ships with Surprise. I will also use Surprise's GridSearchCV, which works much like sklearn's GridSearchCV, to find the parameter combination with the lowest error.
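
For orientation, Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of a user factor vector and an item factor vector. The sketch below illustrates only that prediction rule; the function name and every number in it are made up for the example and are not taken from the fitted model.

```python
# Toy illustration of the biased matrix-factorization prediction rule.
# All values below are invented for the example.

def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   low=1.0, high=10.0):
    """Return mean + biases + dot(user_factors, item_factors), clipped to the rating scale."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    estimate = global_mean + user_bias + item_bias + interaction
    return min(high, max(low, estimate))


example = predict_rating(
    global_mean=7.8,
    user_bias=0.3,
    item_bias=-0.1,
    user_factors=[0.2, -0.5],
    item_factors=[0.4, 0.1],
)
print(round(example, 2))
```

During training, `lr_all` sets the learning rate used for all of these parameters and `reg_all` sets the regularization strength applied to them, which is why both appear in the grid search.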

In [7]:
# Import the SVD algorithm
from surprise import SVD

In [12]:
# Import GridSearchCV to tune parameters
from surprise.model_selection import GridSearchCV

In [13]:
%%time
# This cell took a LONG time to run on my computer (about 42 minutes)
# I don't recommend running it unless you have time to spare

# Parameter values to combine in the grid search
params = {
    "n_epochs": [10, 15],
    "lr_all": [0.003, 0.005, 0.007],
    "reg_all": [0.01, 0.02, 0.03],
}

# Run the grid search with SVD to find the parameters with the lowest
# Root Mean Square Error (RMSE) and Mean Absolute Error (MAE)
gs = GridSearchCV(SVD, params, measures=["rmse", "mae"], cv=3, joblib_verbose=2, n_jobs=-2)

gs.fit(data)

[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  27 tasks      | elapsed: 18.3min


Wall time: 41min 48s


[Parallel(n_jobs=-2)]: Done  54 out of  54 | elapsed: 41.1min finished
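
The 54 tasks in the log follow from the size of the grid: every combination of parameter values is fitted once per cross-validation fold. A minimal sketch of that count, using only the standard library (the names `combinations` and `total_fits` are illustrative):

```python
from itertools import product

# Same grid and fold count as in the grid search cell
params = {
    "n_epochs": [10, 15],
    "lr_all": [0.003, 0.005, 0.007],
    "reg_all": [0.01, 0.02, 0.03],
}
cv_folds = 3

# Every combination of one value per parameter
combinations = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(combinations) * cv_folds

print(len(combinations), "combinations x", cv_folds, "folds =", total_fits, "fits")
```

Adding one more value to any parameter list increases the number of fits by a whole multiple of the other lists, which is worth keeping in mind given the run time above.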


In [14]:
# Best RMSE score
print(gs.best_score["rmse"])

# Best combination of parameters for the best RMSE
print(gs.best_params["rmse"])

1.138933300338331
{'n_epochs': 15, 'lr_all': 0.007, 'reg_all': 0.03}
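
Both metrics are measured in rating points, so an RMSE of about 1.14 suggests that predictions are typically off by roughly one point on the 1 to 10 scale. As a sketch of how the two metrics are computed, here they are on a few invented ratings (the lists below are not from the dataset):

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared difference."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented example ratings
actual = [8, 6, 9, 7]
predicted = [7, 7, 9, 5]

print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

RMSE penalizes large misses more heavily than MAE, so it is never smaller than MAE on the same predictions.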


Now that we have the best parameters from the grid search, we can take the estimator configured with them and fit it on the full dataset.

In [16]:
%%time
algo = gs.best_estimator["rmse"]
algo.fit(data.build_full_trainset())

Wall time: 4min 40s


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x25d62ac7880>

This is as far as my Google searches and tutorials got me. As far as I can tell, `algo.predict(user_id, anime_id).est` gives the estimated rating that a user would give a specific anime. If I stored the estimated ratings of many anime for a given user in a numpy array, I could pick the ones with the highest estimates and recommend those.

In [46]:
# Just testing out the prediction method on user 4271 on anime id 7088
pred = algo.predict(4271, 7088).est
pred

8.01538233688656
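
One way the idea above could be sketched is a helper that scores every candidate anime for a user and keeps the highest estimates. `top_n_for_user` is a hypothetical helper, not part of Surprise, and it uses a plain Python heap rather than a numpy array. It assumes only that `algo.predict(uid, iid)` returns an object with an `.est` attribute, as in the cell above.

```python
import heapq


def top_n_for_user(algo, user_id, candidate_ids, seen_ids=(), n=10):
    """Return the n (anime_id, estimated_rating) pairs with the highest estimates.

    `algo` is any object with a Surprise-style predict(uid, iid) method whose
    result has an `.est` attribute. Anime in `seen_ids` are skipped.
    """
    seen = set(seen_ids)
    scored = (
        (anime_id, algo.predict(user_id, anime_id).est)
        for anime_id in candidate_ids
        if anime_id not in seen
    )
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Possible usage with the objects from this notebook (not run here):
# seen = rating_df.loc[rating_df["user_id"] == 4271, "anime_id"]
# recommendations = top_n_for_user(algo, 4271, anime_df["anime_id"], seen, n=10)
```

Skipping anime the user has already rated keeps the list to titles that would actually be new to them.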

The last thing to do is export the recommendation system to a Pickle file for potential future use.

In [48]:
# Import the dump function
from surprise.dump import dump
# Create the Pickle file for the SVD algorithm
# The second positional argument of dump is `predictions`, so pass the algorithm by keyword
dump("recommender.pkl", algo=algo)
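
To use the saved model later, it can be read back with `surprise.dump.load`, which returns a `(predictions, algo)` tuple. The helper below is a sketch under the assumption that the model was stored through the `algo=` keyword argument of `dump`; `load_recommender` is a hypothetical name.

```python
def load_recommender(path="recommender.pkl"):
    """Load an algorithm saved with surprise.dump.dump(path, algo=...)."""
    # Imported here so Surprise is only required when the helper is called
    from surprise.dump import load

    # load() returns a (predictions, algo) tuple; only the algorithm is needed
    _, loaded_algo = load(path)
    return loaded_algo


# Possible usage (not run here):
# algo = load_recommender("recommender.pkl")
# algo.predict(4271, 7088).est
```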