# **Anime Recommender System**
The number of anime titles keeps growing, which makes a recommender system for them useful. Such a system can be especially helpful for users who are new to anime and are not sure where to start.

In this notebook we explore several types of recommender systems. In particular, we build:
*  Popularity Based
*  Content Based Filtering. We build two versions of this type so that we can compare their effectiveness.
*  Collaborative Filtering, using the ALS algorithm.

To evaluate the performance of these recommendation systems, we use various metrics, including Precision and Recall, RMSE, MAE, and MSE. These metrics help to assess the quality of the recommendations generated by each system.





---


## **1) Setup, installing packages and dependencies**

In [None]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
!pip install pyspark
import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf


In [None]:
from google.colab import drive

GDRIVE_DIR = "/content/gdrive"

drive.mount(GDRIVE_DIR, force_remount=True)

Mounted at /content/gdrive


In [None]:
# create the session
conf = (
    SparkConf()
    .set("spark.ui.port", "4051")
    .set("spark.executor.memory", "12G")
    .set("spark.driver.memory", "45G")
    .set("spark.driver.maxResultSize", "10G")
    .set("spark.worker.memory", "12G")
)

# create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()



---


## **2) Datasets Preprocessing**
The dataset is taken from [Kaggle](https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020) and it contains information about 
*  **17,562 anime** 
*  **325,772 different users**
*  **109 Million rows**



### **2.1) Loading and cleaning the dataset containing ratings**
The columns of this file are as follows:
*   **user_id**: randomly generated, non-identifiable user ID.
*   **anime_id**: MyAnimeList ID of the anime (e.g. 1).
*   **rating**: score from 1 to 10 given by the user; 0 if the user didn't assign a score (e.g. 10).
*   **watching_status**: status ID of this anime in the user's anime list (e.g. 2).
*   **watched_episodes**: number of episodes watched by the user (e.g. 24).

In [None]:
GDRIVE_RATINGS_FILE = "../models/animelist.csv"

spark = SparkSession.builder.appName("Recommendation").getOrCreate()
ratings_df = spark.read.load(GDRIVE_RATINGS_FILE,  format="csv",  sep=",",  inferSchema="true",  header="true" )

In [None]:
#Limit the dataset to 20 million rows to keep the construction of the recommender system tractable
ratings_df = ratings_df.limit(20000000)

In [None]:
#Let's visualize the structure of the dataset
df = ratings_df.filter(ratings_df.user_id == 8)
df.show(truncate=False)

In [None]:
print("The shape of the dataset is {:d} rows by {:d} columns".format(ratings_df.count(), len(ratings_df.columns)))

In [None]:
for c in ratings_df.columns:
  print("N. of missing values of column `{:s}` = {:d}".format(c, ratings_df.where(col(c).isNull()).count()))

In [None]:
ratings_df = ratings_df.select(ratings_df.user_id, ratings_df.anime_id, ratings_df.rating)

In [None]:
print(f'Data types of all the columns are: {ratings_df.dtypes}')

In [None]:
ratings_df = ratings_df.withColumn("user_id", ratings_df["user_id"].cast('int'))
ratings_df = ratings_df.withColumn("anime_id", ratings_df["anime_id"].cast('int'))
ratings_df = ratings_df.withColumn("rating", ratings_df["rating"].cast('int'))

In [None]:
print("The number of unique users is: {:d}".format(ratings_df.select("user_id").distinct().count()))
print("The number of unique anime is: {:d}".format(ratings_df.select("anime_id").distinct().count()))

In [None]:
dropDisDF = ratings_df.dropDuplicates(["user_id","anime_id"])
ratings_df = dropDisDF

In [None]:
ratings_df.describe().show()

In [None]:
# Remove rows with rating = 0: the user did not rate the anime, so these rows carry no preference information and could distort the model
ratings_df = ratings_df.filter(ratings_df.rating != 0)

In [None]:
print(ratings_df.count())

62397712


### **2.2) Loading and cleaning the dataset containing anime data**
The columns of this file are as follows:
*   **MAL_ID**: MyAnimeList ID of the anime. (e.g. 1)
*   **Name**: full name of the anime. (e.g. Cowboy Bebop)
*   **Score**: average score of the anime given by all users in the MyAnimeList database. (e.g. 8.78)
*   **Genres**: comma-separated list of genres for this anime. (e.g. Action, Adventure, Comedy, Drama, Sci-Fi, Space)
*   **English name**: full name of the anime in English. (e.g. Cowboy Bebop)
*   **Japanese name**: full name of the anime in Japanese. (e.g. カウボーイビバップ)
*   **Type**: TV, movie, OVA, etc. (e.g. TV)
*   **Episodes**: number of episodes. (e.g. 26)
*   **Aired**: broadcast date. (e.g. Apr 3, 1998 to Apr 24, 1999)
*   **Premiered**: season premiere. (e.g. Spring 1998)
*   **Producers**: comma-separated list of producers. (e.g. Bandai Visual)
*   **Licensors**: comma-separated list of licensors. (e.g. Funimation, Bandai Entertainment)
*   **Studios**: comma-separated list of studios. (e.g. Sunrise)
*   **Source**: Manga, Light novel, Book, etc. (e.g. Original)
*   **Duration**: duration of the anime per episode. (e.g. 24 min. per ep.)
*   **Rating**: age rating. (e.g. R - 17+ (violence & profanity))
*   **Ranked**: position based on the score. (e.g. 28)
*   **Popularity**: position based on the number of users who have added the anime to their list. (e.g. 39)
*   **Members**: number of community members that are in this anime's "group". (e.g. 1251960)
*   **Favorites**: number of users who have the anime as "favorites". (e.g. 61971)
*   **Watching**: number of users who are watching the anime. (e.g. 105808)
*   **Completed**: number of users who have completed the anime. (e.g. 718161)
*   **On-Hold**: number of users who have the anime on hold. (e.g. 71513)
*   **Dropped**: number of users who have dropped the anime. (e.g. 26678)
*   **Plan to Watch**: number of users who plan to watch the anime. (e.g. 329800)

In [None]:
GDRIVE_CONTENT_FILE = "../models/anime.csv"

content_df = spark.read.load(GDRIVE_CONTENT_FILE,  format="csv",  sep=",",  inferSchema="true",  header="true" )

In [None]:
for c in content_df.columns:
  print("N. of missing values of column `{:s}` = {:d}".format(c, content_df.where(col(c).isNull()).count()))

In [None]:
print(f'Data types of all the columns are: {content_df.dtypes}')

In [None]:
#5144 rows have the value "Unknown" in the "Score" column; casting to double turns these into null
content_df = content_df.withColumn("Score", content_df["Score"].cast('double'))

In [None]:
#We get rid of anime whose genre is not known. It is a fundamental feature to carry out our recommendations
content_df = content_df.filter(content_df.Genres != "Unknown")

In [None]:
#removing 3 anime that have a number in the Genres column.
content_df = content_df.filter( (content_df.MAL_ID != 37490) & (content_df.MAL_ID != 31630) & (content_df.MAL_ID != 16187) )



---


## **3) Popularity Based Recommender System**
We use multiple factors to determine the popularity of an anime: its score, number of members, and number of favorites. Combining them gives a more balanced view of popularity than any single one. Each value is min-max scaled to [0, 1] and then weighted according to its importance:
1. **Score** 50%
2. **Members** 25% 
3. **Favorites** 25%

In [None]:
popularity_df = content_df.select(content_df.MAL_ID, content_df.Name, content_df.Score, content_df.Members, content_df.Favorites)

In [None]:
popularity_df.show(truncate=False)

In [None]:
popularity_pdf = popularity_df.toPandas()

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler=MinMaxScaler()
popularity_pdf[['Score','Members','Favorites']]=scaler.fit_transform(popularity_pdf[['Score','Members','Favorites']])

In [None]:
popularity_pdf['Weighted_score']=popularity_pdf['Score']*0.5 + popularity_pdf['Members']*0.25 + popularity_pdf['Favorites']*0.25
popularity_pdf.sort_values('Weighted_score',ascending=False).head(50)

Unnamed: 0,MAL_ID,Name,Score,Members,Favorites,Weighted_score
3971,5114,Fullmetal Alchemist: Brotherhood,1.0,0.86828,1.0,0.96707
1393,1535,Death Note,0.923706,1.0,0.789505,0.909229
7448,16498,Shingeki no Kyojin,0.90327,0.977542,0.706004,0.872521
5683,9253,Steins;Gate,0.989101,0.683965,0.807182,0.867337
6474,11061,Hunter x Hunter (2011),0.987738,0.646414,0.800776,0.855667
11,21,One Piece,0.908719,0.522377,0.68861,0.757107
11281,32281,Kimi no Na wa.,0.968665,0.666779,0.386344,0.747613
1431,1575,Code Geass: Hangyaku no Lelouch,0.935967,0.611643,0.492007,0.743896
10438,30276,One Punch Man,0.915531,0.820167,0.295981,0.736803
9881,28851,Koe no Katachi,0.974114,0.535848,0.339572,0.705912
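
The weighting used above can be written out as follows. This is a restatement of the code in this section rather than additional methodology; the symbols $S'$, $M'$, $F'$ (the min-max scaled Score, Members, and Favorites) and $W$ (the weighted score) are notation introduced here for illustration.

\begin{align}
x' &= \frac{x - x_{\min}}{x_{\max} - x_{\min}} \\
W &= 0.5\,S' + 0.25\,M' + 0.25\,F'
\end{align}

For the first row of the table above, $0.5 \cdot 1.0 + 0.25 \cdot 0.86828 + 0.25 \cdot 1.0 = 0.96707$, which matches the `Weighted_score` column.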




---


## **4) Collaborative Filtering** 

Now we generate anime recommendations using collaborative filtering. In particular, we use the [Alternating Least Squares (ALS)](https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html) model.

We took inspiration from the [notebook](https://github.com/gtolomei/big-data-computing/blob/master/notebooks/MF_Recommender_Systems.ipynb) provided by Professor Tolomei.


### **4.1) Splitting the ratings dataset into Training and Test Set**
80% training, 20% test

In [None]:
# Randomly split our original dataset `ratings_df` into 80/20 for training and test, respectively
RANDOM_SEED = 42 # for reproducibility

train_df, test_df = ratings_df.randomSplit([0.8, 0.2], seed=RANDOM_SEED)

In [None]:
print("Training set size: {:d} instances".format(train_df.count()))
print("Test set size: {:d} instances".format(test_df.count()))

### **4.2) Alternating Least Squares (ALS)**
The ALS algorithm is a matrix factorization method that decomposes a matrix $R$ into two factors $U$ and $V$ such that $R \approx U^T V$. In the context of recommendation systems, $U$ and $V$ can be thought of as the user and item matrices, respectively. ALS minimizes the (regularized) squared error iteratively: it holds one factor fixed and solves a least-squares problem for the other, then swaps their roles, repeating until convergence or until the maximum number of iterations is reached. Since the overall problem is not convex, this finds a good, though not necessarily globally optimal, approximation of $R$.
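
To make the alternating step concrete, here is a minimal NumPy sketch on a tiny ratings matrix. It is an illustration only, not Spark's implementation: the function name `als_factorize`, the toy matrix, and the plain L2 regularization are assumptions made for this example.

```python
import numpy as np

def als_factorize(R, mask, rank=2, reg=0.1, n_iters=20, seed=0):
    """Toy ALS: factor R (users x items) as U.T @ V using only observed entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(rank, n_users))
    V = rng.normal(scale=0.1, size=(rank, n_items))
    eye = np.eye(rank)
    for _ in range(n_iters):
        # Fix V and solve a small regularized least-squares problem per user
        for u in range(n_users):
            obs = mask[u]                      # items rated by user u
            Vo = V[:, obs]
            U[:, u] = np.linalg.solve(Vo @ Vo.T + reg * eye, Vo @ R[u, obs])
        # Fix U and solve the same kind of problem per item
        for i in range(n_items):
            obs = mask[:, i]                   # users who rated item i
            Uo = U[:, obs]
            V[:, i] = np.linalg.solve(Uo @ Uo.T + reg * eye, Uo @ R[obs, i])
    return U, V

# Hypothetical 5 users x 4 items matrix; 0 marks a missing rating
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

U, V = als_factorize(R, mask)
pred = U.T @ V
rmse_obs = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print("RMSE on observed entries:", round(rmse_obs, 3))
```

Each inner update has a closed-form solution, which is why no learning rate appears anywhere in the loop.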

**Advantages** 
* Scalability: can handle large-scale datasets
* Sparsity: can handle sparse matrices. This makes it possible to make predictions even for users and items with few ratings.
* Customization: allows for the specification of different parameters, such as the number of latent factors and the regularization parameter.

**Disadvantages** 
* Cold start: may have difficulty making predictions for users or items with no ratings.
* Sensitivity to initialization: may be sensitive to the initial values of the factors U and V, which can affect the quality of the final solution.

In [None]:
from pyspark.ml.recommendation import ALS
# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5, regParam=0.01, userCol="user_id", itemCol="anime_id", ratingCol="rating", coldStartStrategy='drop')
model = als.fit(train_df)

In [None]:
predictions =  model.transform(test_df)

In [None]:
predictions.show(truncate=False)

### **4.3) Model Evaluation with Root Mean Square Error (RMSE)**
The RMSE measures how close the predictions are to the actual values; a lower value indicates better performance. Because errors are squared before averaging, the RMSE is sensitive to outliers: large errors have a disproportionately large effect on the final value.

\begin{align}
\hspace{2cm}
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}
\hspace{2cm}
\end{align}



*  $\sum$ denotes the sum over all $n$ observations
*  $\hat{y}_i$ is the predicted value for the i-th observation
*  $y_i$ is the observed value for the i-th observation
*  $n$ is the number of data points

In [None]:
#evaluate the model by computing the RMSE on the test data
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))


**Hyperparameter Tuning**

We now wrap the whole pipeline in a function and use $k$-fold cross validation to get a better estimate of the generalization performance of our matrix factorization model.

More specifically, we search over the hyperparameters rank and regParam, keeping maxIter fixed at a single value.

In [None]:
# This function defines the general pipeline for ALS matrix factorization with k-fold cross validation
def matrix_factorization(train, k_fold=5):

    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator

    als = ALS(userCol="user_id", itemCol="anime_id", ratingCol="rating", coldStartStrategy="drop")

    # A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
    # We use a ParamGridBuilder to construct a grid of parameters to search over.
    # With 2 values for als.rank, 2 values for als.regParam, and 1 value for als.maxIter,
    # this grid will have 2 x 2 x 1 = 4 parameter settings for CrossValidator to choose from.
    param_grid = ParamGridBuilder()\
    .addGrid(als.rank, [10, 25]) \
    .addGrid(als.regParam, [0.01, 0.1]) \
    .addGrid(als.maxIter, [10]) \
    .build()
    
    cross_val = CrossValidator(estimator=als, 
                               estimatorParamMaps=param_grid,
                               evaluator=RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction"),
                               numFolds=k_fold,
                               collectSubModels=True # this flag allows us to store ALL the models trained during k-fold cross validation
                               )

    # Run cross-validation, and choose the best set of parameters.
    cv_model = cross_val.fit(train)

    return cv_model

In [None]:
cv_model = matrix_factorization(train_df) 

In [None]:
# This function summarizes all the models trained during k-fold cross validation
def summarize_all_models(cv_models):
    for k, models in enumerate(cv_models):
        print("*************** Fold #{:d} ***************\n".format(k+1))
        for i, m in enumerate(models):
            print("--- Model #{:d} out of {:d} ---".format(i+1, len(models)))
            print("\tParameters: rank=[{:d}]".format(m.rank))
            print("\tModel summary: {}\n".format(m))
        print("***************************************\n")
summarize_all_models(cv_model.subModels)

In [None]:
for i, avg_rmse in enumerate(cv_model.avgMetrics):
    print("Avg. RMSE computed across k-fold cross validation for model setting #{:d}: {:.3f}".format(i+1, avg_rmse))
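
The setting numbers printed above are positions in the parameter grid. As a plain-Python illustration of how 2 x 2 x 1 candidate values expand into four settings, the sketch below enumerates them with `itertools.product`. The enumeration order shown here is an assumption; `cv_model.getEstimatorParamMaps()` returns the order Spark actually used and can be zipped with `cv_model.avgMetrics`.

```python
from itertools import product

# Same candidate values as in the grid built by matrix_factorization
ranks = [10, 25]
reg_params = [0.01, 0.1]
max_iters = [10]

# One dictionary per combination of candidate values
settings = [
    {"rank": r, "regParam": reg, "maxIter": it}
    for r, reg, it in product(ranks, reg_params, max_iters)
]

for i, s in enumerate(settings, start=1):
    print(f"setting #{i}: {s}")
```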

In [None]:
print("Best model according to k-fold cross validation: rank=[{:d}]".
      format(cv_model.bestModel.rank)
      )
print(cv_model.bestModel)

**Using the best model for making prediction**

In [None]:
test_predictions = cv_model.transform(test_df)

In [None]:
def evaluate_model(predictions, metric="rmse", labelCol="rating", predictionCol="prediction"):
    
    from pyspark.ml.evaluation import RegressionEvaluator

    evaluator = RegressionEvaluator(metricName=metric, labelCol=labelCol, predictionCol=predictionCol)

    return evaluator.evaluate(predictions)

In [None]:
print("***** Test Set *****")
print("RMSE: {:.3f}".format(evaluate_model(test_predictions)))
print("***** Test Set *****")

In [None]:
k = 10 # number of recommended items for each user
cv_model.bestModel.recommendForAllUsers(k).show(10, truncate=False)

### **4.4) Model Evaluation with Mean Absolute Error (MAE)**
The mean absolute error is the average of the absolute differences between the actual and predicted values in the dataset. It gives an idea of how accurate the predictions are, and a lower value indicates better performance. Compared with the RMSE, the MAE is less sensitive to outliers: each error contributes in proportion to its magnitude rather than its square, so a few large errors do not dominate the final value.

\begin{align}
\hspace{2cm}
MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|
\hspace{2cm}
\end{align}

*  $\sum$ denotes the sum over all $n$ observations
*  $y_i$ is the observed value for the i-th observation
*  $\hat{y}_i$ is the predicted value for the i-th observation
*  $n$ is the number of data points


In [None]:
evaluator = RegressionEvaluator(metricName="mae", labelCol="rating", predictionCol="prediction")
mae = evaluator.evaluate(predictions)

print("Mean Absolute Error = {:.5f}".format(mae))
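
The difference in outlier sensitivity between MAE and RMSE can be seen on a toy example. The helper names `toy_mae` and `toy_rmse` and the numbers below are made up for illustration and are unrelated to the anime ratings.

```python
import numpy as np

def toy_mae(y, y_hat):
    # Mean of absolute errors
    return np.mean(np.abs(y - y_hat))

def toy_rmse(y, y_hat):
    # Square root of the mean of squared errors
    return np.sqrt(np.mean((y - y_hat) ** 2))

y_true = np.array([8.0, 7.0, 9.0, 6.0])
small_errors = np.array([7.0, 8.0, 8.0, 7.0])   # every prediction is off by 1
one_big_error = np.array([8.0, 7.0, 9.0, 2.0])  # a single prediction is off by 4

print("small errors :", toy_mae(y_true, small_errors), toy_rmse(y_true, small_errors))
print("one big error:", toy_mae(y_true, one_big_error), toy_rmse(y_true, one_big_error))
```

Both prediction vectors have the same total absolute error, so the MAE is identical, while the RMSE is twice as large when the error is concentrated in one prediction.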

### **4.5) Model Evaluation with Mean Squared Error (MSE)**
Mean Squared Error represents the average of the squared difference between the original and predicted values in the data set. It gives an idea of how close the predictions are to the actual values, and a lower value indicates a better performance. Similar to the RMSE, the MSE is also sensitive to outliers, meaning that large errors will have a greater effect on the final value of the MSE.

\begin{align}
\hspace{2cm}
MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\hspace{2cm}
\end{align}



*  $\sum$ denotes the sum over all $n$ observations
*  $\hat{y}_i$ is the predicted value for the i-th observation
*  $y_i$ is the observed value for the i-th observation
*  $n$ is the number of data points


In [None]:
evaluator = RegressionEvaluator(metricName="mse", labelCol="rating", predictionCol="prediction")
mse = evaluator.evaluate(predictions)

print("Mean Squared Error = {:.5f}".format(mse))



---


## **5) Content-Based Filtering with User Profile**

**Advantages** 
* Personalized recommendations: uses the user's own ratings, so recommendations reflect the genres that user prefers.
* Easy to implement: one-hot encoding and a dot product are simple to implement.
* Flexible: the approach can be extended to include other features.


**Disadvantages** 
* Limited by user ratings: the user must have rated enough anime for the profile to be meaningful.
* Ignores other factors: apart from genres, no other information about the anime is used.

In [None]:
content_pdf_splitted = content_df.toPandas()
content_pdf_splitted['Genres'] = content_pdf_splitted.Genres.str.split(',')

In [None]:
content_pdf_splitted.head()

Unnamed: 0,MAL_ID,Name,Score,Genres,English name,Japanese name,Type,Episodes,Aired,Premiered,...,Score-10,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1
0,1,Cowboy Bebop,8.78,"[Action, Adventure, Comedy, Drama, Sci-Fi,...",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,...,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"[Action, Drama, Mystery, Sci-Fi, Space]",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",Unknown,...,30043.0,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0
2,6,Trigun,8.24,"[Action, Sci-Fi, Adventure, Comedy, Drama,...",Trigun,トライガン,TV,26,"Apr 1, 1998 to Sep 30, 1998",Spring 1998,...,50229.0,75651.0,86142.0,49432.0,15376.0,5838.0,1965.0,664.0,316.0,533.0
3,7,Witch Hunter Robin,7.27,"[Action, Mystery, Police, Supernatural, Dr...",Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),TV,26,"Jul 2, 2002 to Dec 24, 2002",Summer 2002,...,2182.0,4806.0,10128.0,11618.0,5709.0,2920.0,1083.0,353.0,164.0,131.0
4,8,Bouken Ou Beet,6.98,"[Adventure, Fantasy, Shounen, Supernatural]",Beet the Vandel Buster,冒険王ビィト,TV,52,"Sep 30, 2004 to Sep 29, 2005",Fall 2004,...,312.0,529.0,1242.0,1713.0,1068.0,634.0,265.0,83.0,50.0,27.0


We use one-hot encoding to convert the list of genres into a dataframe with one column per genre (1 if the anime has that genre, 0 if it doesn't).

In [None]:
#Start from a dataframe with only MAL_ID and Name, then add one column per genre (1 where the anime has that genre)
animeWithGenre_pdf = content_df.select(content_df.MAL_ID, content_df.Name).toPandas()
for index, row in content_pdf_splitted.iterrows():
  for genre in row['Genres']:
      clean_genre = genre.strip()
      animeWithGenre_pdf.at[index,clean_genre] = 1

animeWithGenre_pdf = animeWithGenre_pdf.fillna(0)
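
As a possible alternative to the row-by-row loop, pandas can build the same kind of 0/1 genre columns in one call with `Series.str.get_dummies`. The sketch below runs on a tiny made-up frame (`toy`) and assumes genres are separated by exactly `", "`; the loop above is more forgiving because it strips whitespace.

```python
import pandas as pd

# Hypothetical three-row sample in the same shape as the anime data
toy = pd.DataFrame({
    "MAL_ID": [1, 5, 8],
    "Genres": ["Action, Comedy, Drama", "Action, Mystery", "Adventure, Fantasy"],
})

# One 0/1 column per genre; columns come out in alphabetical order
genre_flags = toy["Genres"].str.get_dummies(sep=", ")

# Attach the ID so each row can still be traced back to its anime
toy_encoded = pd.concat([toy[["MAL_ID"]], genre_flags], axis=1)
print(toy_encoded)
```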



In [None]:
animeWithGenre_pdf.sample(10)

Unnamed: 0,MAL_ID,Name,Action,Adventure,Comedy,Drama,Sci-Fi,Space,Mystery,Shounen,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
10718,30940,Shounen Muku Hatojuu Monogatari,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
13775,36732,Qin Shi Ming Yue: Tian Xing Jiu Ge,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1286,1417,Lupin III: Moeyo Zantetsuken!,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14386,37695,Pa Para Papa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
9902,28955,Columbos,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9788,28497,Rokka no Yuusha,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7288,15605,Brothers Conflict,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10923,31422,Minami Kamakura Koukou Joshi Jitenshabu,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3895,4941,Tezuka Osamu ga Kieta?! 20 Seiki Saigo no Kaij...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
251,275,Love♥Love?,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We create an example user profile and transform it into a pandas DataFrame

In [None]:
userInput = [
            {'Name':'Doraemon: Nobita to Mirai Note', 'rating':7},
            {'Name':'Imouto Paradise! 3 The Animation', 'rating':5},
            {'Name':'Hatena☆Illusion', 'rating':8},
            {'Name':"Lupin tai Holmes", 'rating':6},
            {'Name':'Captain Tsubasa J', 'rating':7},
            {'Name':'Uma Musume: Pretty Derby (TV) Season 2', 'rating':5},
            {'Name':'Ansatsu Kyoushitsu', 'rating':3},
            {'Name':'Kyoei Tankou-sho', 'rating':4},
            {'Name':"Tate no Yuusha no Nariagari Season 3", 'rating':10},
            {'Name':'Sazae-san', 'rating':1}
         ] 
inputAnime = pd.DataFrame(userInput)
inputAnime

Unnamed: 0,Name,rating
0,Doraemon: Nobita to Mirai Note,7
1,Imouto Paradise! 3 The Animation,5
2,Hatena☆Illusion,8
3,Lupin tai Holmes,6
4,Captain Tsubasa J,7
5,Uma Musume: Pretty Derby (TV) Season 2,5
6,Ansatsu Kyoushitsu,3
7,Kyoei Tankou-sho,4
8,Tate no Yuusha no Nariagari Season 3,10
9,Sazae-san,1


We add the MAL_ID column for those anime

In [None]:
#Selecting the anime by title
inputId = content_pdf_splitted[content_pdf_splitted['Name'].isin(inputAnime['Name'].tolist())]

inputAnime = pd.merge(inputId, inputAnime)

inputAnime = inputAnime[['MAL_ID', 'Name', 'rating']]
inputAnime.head(20)

Unnamed: 0,MAL_ID,Name,rating
0,1674,Captain Tsubasa J,7
1,2406,Sazae-san,1
2,10755,Lupin tai Holmes,6
3,16702,Doraemon: Nobita to Mirai Note,7
4,24833,Ansatsu Kyoushitsu,3
5,35252,Hatena☆Illusion,8
6,37360,Imouto Paradise! 3 The Animation,5
7,39841,Kyoei Tankou-sho,4
8,40357,Tate no Yuusha no Nariagari Season 3,10
9,42941,Uma Musume: Pretty Derby (TV) Season 2,5


We also apply One-Hot-Encoding for the user profile

In [None]:
#Selecting the input anime from the one-hot encoded table
userAnime = animeWithGenre_pdf[animeWithGenre_pdf['MAL_ID'].isin(inputAnime['MAL_ID'].tolist())]
userAnime

Unnamed: 0,MAL_ID,Name,Action,Adventure,Comedy,Drama,Sci-Fi,Space,Mystery,Shounen,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
1522,1674,Captain Tsubasa J,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2203,2406,Sazae-san,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6359,10755,Lupin tai Holmes,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7500,16702,Doraemon: Nobita to Mirai Note,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
9210,24833,Ansatsu Kyoushitsu,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12901,35252,Hatena☆Illusion,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14161,37360,Imouto Paradise! 3 The Animation,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
15771,39841,Kyoei Tankou-sho,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16046,40357,Tate no Yuusha no Nariagari Season 3,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17114,42941,Uma Musume: Pretty Derby (TV) Season 2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We clean up a bit by resetting the index and dropping the MAL_ID and name columns.

In [None]:
#Resetting the index so that rows align with the ratings in inputAnime
userAnime = userAnime.reset_index(drop=True)
#Dropping the columns that are not genres
userGenreTable = userAnime.drop(columns=['MAL_ID', 'Name'])
userGenreTable



Unnamed: 0,Action,Adventure,Comedy,Drama,Sci-Fi,Space,Mystery,Shounen,Police,Supernatural,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We turn each genre into a weight by taking the dot product between the transposed genre matrix and the vector of the user's ratings.

In [None]:
#Dot product to get weights
userProfile = userGenreTable.transpose().dot(inputAnime['rating'])
#The user profile
userProfile

Action           20.0
Adventure        17.0
Comedy           17.0
Drama            10.0
Sci-Fi           11.0
Space             0.0
Mystery           6.0
Shounen          17.0
Police            0.0
Supernatural      8.0
Magic             0.0
Fantasy          17.0
Sports           12.0
Josei             0.0
Romance           8.0
Slice of Life     6.0
Cars              0.0
Seinen            0.0
Horror            0.0
Psychological     0.0
Thriller          0.0
Super Power       0.0
Martial Arts      0.0
School            3.0
Ecchi             8.0
Vampire           0.0
Military          0.0
Historical        0.0
Dementia          0.0
Mecha             4.0
Demons            0.0
Samurai           0.0
Game              0.0
Shoujo            0.0
Harem             0.0
Music             0.0
Shoujo Ai         0.0
Shounen Ai        0.0
Kids              7.0
Hentai            5.0
Parody            0.0
Yuri              0.0
Yaoi              0.0
dtype: float64
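
To see what the dot product does, here is a hand-checkable sketch with three made-up anime and three genres. The names `toy_genres`, `toy_ratings`, and `toy_profile` are illustrative only: each genre weight is the sum of the ratings of the anime that carry that genre.

```python
import pandas as pd

# Rows: anime rated by the user, columns: genres (1 = anime has the genre)
toy_genres = pd.DataFrame(
    [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0]],
    columns=["Action", "Comedy", "Shounen"],
)
# Ratings given to those three anime, in the same row order
toy_ratings = pd.Series([7, 1, 10])

# genres x anime matrix times ratings vector -> one weight per genre
toy_profile = toy_genres.T.dot(toy_ratings)
print(toy_profile)
```

Here Action gets 7 + 10 = 17, Comedy gets 1 + 10 = 11, and Shounen gets 7 + 1 = 8.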

We extract the genre table from the original dataframe

In [None]:
#Now let's get the genres of every anime in our original dataframe, indexed by MAL_ID
genreTable = animeWithGenre_pdf.set_index('MAL_ID')

#And drop the unnecessary information
genreTable = genreTable.drop(columns='Name')
genreTable



MAL_ID,Action,Adventure,Comedy,Drama,Sci-Fi,Space,Mystery,Shounen,Police,Supernatural,...,Shoujo,Harem,Music,Shoujo Ai,Shounen Ai,Kids,Hentai,Parody,Yuri,Yaoi
1,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48481,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48483,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48488,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48491,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
genreTable.shape

(17495, 43)

We compute, for every anime, the weighted average of its genres under the user profile, and recommend the anime with the highest values. We also remove the anime already in the user input from the recommendation table.

In [None]:
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable_df.head(10)

MAL_ID
1     0.426136
5     0.267045
6     0.522727
7     0.250000
8     0.335227
15    0.375000
16    0.232955
17    0.295455
18    0.238636
19    0.090909
dtype: float64
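
The scoring step can be checked by hand on a small example. The profile weights and the three-anime catalog below are made up for illustration (`toy_profile`, `toy_catalog`, and the IDs 101 to 103 are not from the dataset): each score is the sum of the profile weights of the genres an anime has, divided by the sum of all profile weights.

```python
import pandas as pd

# Hypothetical genre weights for one user
toy_profile = pd.Series({"Action": 17, "Comedy": 11, "Shounen": 8})

# Hypothetical one-hot genre table indexed by anime ID
toy_catalog = pd.DataFrame(
    [[1, 1, 0],
     [0, 0, 1],
     [1, 1, 1]],
    index=pd.Index([101, 102, 103], name="MAL_ID"),
    columns=["Action", "Comedy", "Shounen"],
)

# Multiply each genre column by its weight, sum per anime, normalize to [0, 1]
toy_scores = (toy_catalog * toy_profile).sum(axis=1) / toy_profile.sum()
print(toy_scores.sort_values(ascending=False))
```

An anime that has every genre in the profile scores 1, and an anime sharing none of them scores 0.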

In [None]:
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
recommendationTable_df = recommendationTable_df.drop(inputAnime["MAL_ID"].to_list()) 
#Just a peek at the values
recommendationTable_df.head()

MAL_ID
451     0.647727
450     0.647727
449     0.647727
452     0.647727
1186    0.607955
dtype: float64

In [None]:
#The final recommendation table (rows are shown in dataset order, not ranked by score)
content_pdf_splitted.loc[content_pdf_splitted['MAL_ID'].isin(recommendationTable_df.head(20).keys())]

Unnamed: 0,MAL_ID,Name,Score,Genres,English name,Japanese name,Type,Episodes,Aired,Premiered,...,Score-10,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1
201,225,Dragon Ball GT,6.48,"[Action, Sci-Fi, Adventure, Comedy, Super ...",Dragon Ball GT,ドラゴンボールGT,TV,64,"Feb 7, 1996 to Nov 19, 1997",Winter 1996,...,22680.0,23836.0,44045.0,74704.0,67689.0,42801.0,26843.0,11272.0,5903.0,4003.0
225,249,InuYasha,7.85,"[Action, Adventure, Comedy, Historical, De...",InuYasha,犬夜叉,TV,167,"Oct 16, 2000 to Sep 13, 2004",Fall 2000,...,48201.0,51683.0,79233.0,71630.0,29876.0,14032.0,4881.0,1804.0,906.0,776.0
272,296,Dragon Drive,6.73,"[Action, Sci-Fi, Adventure, Comedy, Fantas...",Dragon Drive,ドラゴンドライブ,TV,38,"Jul 4, 2002 to Mar 27, 2003",Summer 2002,...,410.0,648.0,1444.0,2546.0,1902.0,1154.0,434.0,171.0,110.0,47.0
421,449,InuYasha Movie 4: Guren no Houraijima,7.54,"[Action, Adventure, Comedy, Historical, De...",InuYasha the Movie 4:Fire on the Mystic Island,犬夜叉 紅蓮の蓬莱島,Movie,1,"Dec 23, 2004",Unknown,...,5230.0,6127.0,9865.0,11837.0,5135.0,2190.0,671.0,225.0,92.0,73.0
422,450,InuYasha Movie 2: Kagami no Naka no Mugenjo,7.66,"[Action, Adventure, Comedy, Historical, De...",InuYasha the Movie 2:The Castle Beyond the Loo...,犬夜叉 鏡の中の夢幻城,Movie,1,"Dec 21, 2002",Unknown,...,6722.0,7566.0,11990.0,12862.0,5409.0,2184.0,607.0,206.0,96.0,71.0
423,451,InuYasha Movie 3: Tenka Hadou no Ken,7.8,"[Action, Adventure, Comedy, Historical, De...",InuYasha the Movie 3:Swords of an Honorable Ruler,犬夜叉 天下覇道の剣,Movie,1,"Dec 20, 2003",Unknown,...,6718.0,7647.0,11985.0,11322.0,4397.0,1687.0,395.0,160.0,60.0,66.0
424,452,InuYasha Movie 1: Toki wo Koeru Omoi,7.56,"[Action, Adventure, Comedy, Historical, De...",InuYasha the Movie:Affections Touching Across ...,犬夜叉 時代を越える想い,Movie,1,"Dec 22, 2001",Unknown,...,6033.0,6802.0,11048.0,13002.0,5767.0,2369.0,620.0,255.0,104.0,93.0
538,573,Saber Marionette J,7.35,"[Action, Adventure, Comedy, Drama, Harem, ...",Saber Marionette J,セイバーマリオネットJ,TV,25,"Oct 1, 1996 to Mar 25, 1997",Fall 1996,...,973.0,1552.0,3065.0,3663.0,1765.0,842.0,270.0,101.0,53.0,30.0
730,808,Bakuretsu Hunters OVA,6.54,"[Action, Adventure, Harem, Comedy, Super P...",Sorcerer Hunters,元祖　爆れつハンター,OVA,3,"Dec 21, 1996 to Apr 23, 1997",Unknown,...,101.0,171.0,346.0,638.0,477.0,315.0,119.0,63.0,30.0,18.0
816,901,"Dragon Ball Z Movie 08: Moetsukiro!! Nessen, R...",7.34,"[Action, Adventure, Comedy, Fantasy, Sci-F...",Dragon Ball Z:Broly – The Legendary Super Saiyan,ドラゴンボールZ 燃えつきろ!!熱戦・烈戦・超激戦,Movie,1,"Mar 6, 1993",Unknown,...,8445.0,9804.0,19412.0,22386.0,11181.0,5261.0,1924.0,735.0,312.0,238.0


We can now wrap everything in a single function, which makes the content-based recommendation much easier to call.




In [None]:
def content_recommendation(user_input,n):
  #Ratings of the user indexed by MAL_ID, so that they can be aligned with the genre rows
  userRatings = user_input.drop_duplicates('MAL_ID').set_index('MAL_ID')['rating']
  userAnime = animeWithGenre_pdf[animeWithGenre_pdf['MAL_ID'].isin(userRatings.index)]
  #Dropping the non-genre columns to save memory and to avoid issues
  userGenreTable = userAnime.set_index('MAL_ID').drop(columns=['Name'])
  #User profile: rating-weighted sum of the genre vectors (ratings matched to rows by MAL_ID)
  userProfile = userGenreTable.transpose().dot(userRatings.loc[userGenreTable.index])
  #Now let's get the genres of every anime in our original dataframe
  genreTable = animeWithGenre_pdf.set_index('MAL_ID')
  #And drop the unnecessary information
  genreTable = genreTable.drop(columns=['Name'])
  recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
  #Sort our recommendations in descending order
  recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
  #Remove the anime of the user input from the final recommendation table
  recommendationTable_df = recommendationTable_df.drop(user_input["MAL_ID"].to_list(), errors='ignore')
  #The final recommendation table
  return content_pdf_splitted.loc[content_pdf_splitted['MAL_ID'].isin(recommendationTable_df.head(n).keys())]

In [None]:
content_recommendation(inputAnime,20)


Unnamed: 0,MAL_ID,Name,Score,Genres,English name,Japanese name,Type,Episodes,Aired,Premiered,...,Score-10,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1
201,225,Dragon Ball GT,6.48,"[Action, Sci-Fi, Adventure, Comedy, Super ...",Dragon Ball GT,ドラゴンボールGT,TV,64,"Feb 7, 1996 to Nov 19, 1997",Winter 1996,...,22680.0,23836.0,44045.0,74704.0,67689.0,42801.0,26843.0,11272.0,5903.0,4003.0
225,249,InuYasha,7.85,"[Action, Adventure, Comedy, Historical, De...",InuYasha,犬夜叉,TV,167,"Oct 16, 2000 to Sep 13, 2004",Fall 2000,...,48201.0,51683.0,79233.0,71630.0,29876.0,14032.0,4881.0,1804.0,906.0,776.0
272,296,Dragon Drive,6.73,"[Action, Sci-Fi, Adventure, Comedy, Fantas...",Dragon Drive,ドラゴンドライブ,TV,38,"Jul 4, 2002 to Mar 27, 2003",Summer 2002,...,410.0,648.0,1444.0,2546.0,1902.0,1154.0,434.0,171.0,110.0,47.0
421,449,InuYasha Movie 4: Guren no Houraijima,7.54,"[Action, Adventure, Comedy, Historical, De...",InuYasha the Movie 4:Fire on the Mystic Island,犬夜叉 紅蓮の蓬莱島,Movie,1,"Dec 23, 2004",Unknown,...,5230.0,6127.0,9865.0,11837.0,5135.0,2190.0,671.0,225.0,92.0,73.0
422,450,InuYasha Movie 2: Kagami no Naka no Mugenjo,7.66,"[Action, Adventure, Comedy, Historical, De...",InuYasha the Movie 2:The Castle Beyond the Loo...,犬夜叉 鏡の中の夢幻城,Movie,1,"Dec 21, 2002",Unknown,...,6722.0,7566.0,11990.0,12862.0,5409.0,2184.0,607.0,206.0,96.0,71.0
423,451,InuYasha Movie 3: Tenka Hadou no Ken,7.8,"[Action, Adventure, Comedy, Historical, De...",InuYasha the Movie 3:Swords of an Honorable Ruler,犬夜叉 天下覇道の剣,Movie,1,"Dec 20, 2003",Unknown,...,6718.0,7647.0,11985.0,11322.0,4397.0,1687.0,395.0,160.0,60.0,66.0
424,452,InuYasha Movie 1: Toki wo Koeru Omoi,7.56,"[Action, Adventure, Comedy, Historical, De...",InuYasha the Movie:Affections Touching Across ...,犬夜叉 時代を越える想い,Movie,1,"Dec 22, 2001",Unknown,...,6033.0,6802.0,11048.0,13002.0,5767.0,2369.0,620.0,255.0,104.0,93.0
538,573,Saber Marionette J,7.35,"[Action, Adventure, Comedy, Drama, Harem, ...",Saber Marionette J,セイバーマリオネットJ,TV,25,"Oct 1, 1996 to Mar 25, 1997",Fall 1996,...,973.0,1552.0,3065.0,3663.0,1765.0,842.0,270.0,101.0,53.0,30.0
730,808,Bakuretsu Hunters OVA,6.54,"[Action, Adventure, Harem, Comedy, Super P...",Sorcerer Hunters,元祖　爆れつハンター,OVA,3,"Dec 21, 1996 to Apr 23, 1997",Unknown,...,101.0,171.0,346.0,638.0,477.0,315.0,119.0,63.0,30.0,18.0
816,901,"Dragon Ball Z Movie 08: Moetsukiro!! Nessen, R...",7.34,"[Action, Adventure, Comedy, Fantasy, Sci-F...",Dragon Ball Z:Broly – The Legendary Super Saiyan,ドラゴンボールZ 燃えつきろ!!熱戦・烈戦・超激戦,Movie,1,"Mar 6, 1993",Unknown,...,8445.0,9804.0,19412.0,22386.0,11181.0,5261.0,1924.0,735.0,312.0,238.0




---


## **6) Content Based Filtering with Cosine Similarity**
We use **cosine similarity**, which measures the similarity between two vectors as the cosine of the angle between them. In general it ranges over [-1,1], but for non-negative vectors such as TF-IDF weights it lies in [0,1]. With scikit-learn's *TfidfVectorizer* we build a matrix that represents the genres of each anime in numerical form, and we use it to compute the similarity between every pair of anime.

**Advantage**
* efficient to compute and easy to interpret

**Disadvantage** 
* does not take into account the preferences of individual users
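The idea can be sketched on a toy example. The three genre strings below are invented for illustration; the sketch only checks that, with the default L2 normalization of `TfidfVectorizer`, the plain dot product (`linear_kernel`) gives the same values as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical genre strings for three anime
toy_genres = [
    "Action, Adventure, Comedy",
    "Action, Adventure, Drama",
    "Romance, Music",
]

# Each row of the TF-IDF matrix is L2-normalized by default (norm='l2')
toy_tfidf = TfidfVectorizer().fit_transform(toy_genres)

# Dot product between unit-length rows ...
sim_linear = linear_kernel(toy_tfidf, toy_tfidf)
# ... compared with the explicit cosine similarity
sim_cosine = cosine_similarity(toy_tfidf, toy_tfidf)

print(np.allclose(sim_linear, sim_cosine))
print(sim_linear.round(2))
```

The first two anime share two genres and get a similarity strictly between 0 and 1, while the third shares none with them and gets 0.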


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

content_pdf = content_df.toPandas()
#Define a TF-IDF Vectorizer Object.
tf = TfidfVectorizer(analyzer='word', stop_words='english')
#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tf.fit_transform(content_pdf['Genres'])
#TF-IDF rows are L2-normalized by default, so the dot product (linear kernel) equals
#the cosine similarity between all pairs of anime in the dataset
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
tfidf_matrix.shape

(17495, 46)

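We see that the TF-IDF matrix has 46 columns, one per token extracted from the Genres strings, for the 17495 anime in our dataset. Since the default word tokenizer splits multi-word genres such as Sci-Fi or Slice of Life into separate tokens, this number does not have to match the number of distinct genres.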
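To see how the tokenizer affects the number of columns, here is a minimal sketch on two invented genre strings, assuming the Genres column holds comma-separated text. The helper `split_genres` is hypothetical and not part of the notebook; it shows one possible way to get exactly one token per genre.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical genre strings
toy_genres = ["Action, Sci-Fi, Super Power", "Slice of Life, Comedy"]

# Default word tokenizer with English stop words, as used in the notebook
word_tf = TfidfVectorizer(analyzer='word', stop_words='english')
word_tf.fit(toy_genres)
word_tokens = sorted(word_tf.vocabulary_)

# Hypothetical alternative: split on commas so each genre stays a single token
def split_genres(text):
    return [g.strip() for g in text.split(",") if g.strip()]

genre_tf = TfidfVectorizer(tokenizer=split_genres)
genre_tf.fit(toy_genres)
genre_tokens = sorted(genre_tf.vocabulary_)

print(word_tokens)   # "Sci-Fi" is split into two tokens, "of" is dropped as a stop word
print(genre_tokens)  # one (lowercased) token per genre
```

Whether the word-level or the genre-level vocabulary works better for the recommendations is an open design choice; the sketch only shows why the two counts can differ.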

In [None]:
# Build a series with the anime titles, and a series mapping each title to its row position
titles = content_pdf['Name']
indices = pd.Series(content_pdf.index, index=content_pdf['Name'])

# Function that takes in anime title as input and outputs most similar anime
def anime_recommendations(title):
    # Get the index of the anime that matches the title
    idx = indices[title]
    # Get the cosine similarity scores of that anime with all anime, and
    # convert them into a list of tuples (position, similarity score)
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the anime based on the similarity scores (the second position)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the 20 most similar anime, explicitly excluding the input anime itself
    sim_scores = [s for s in sim_scores if s[0] != idx][:20]
    # Get the anime indices
    anime_indices = [i[0] for i in sim_scores]
    # Return the titles of the 20 most similar anime
    return titles.iloc[anime_indices]

In [None]:
print(indices)

Name
Cowboy Bebop                           0
Cowboy Bebop: Tengoku no Tobira        1
Trigun                                 2
Witch Hunter Robin                     3
Bouken Ou Beet                         4
                                   ...  
Daomu Biji Zhi Qinling Shen Shu    17490
Mieruko-chan                       17491
Higurashi no Naku Koro ni Sotsu    17492
Yama no Susume: Next Summit        17493
Scarlet Nexus                      17494
Length: 17495, dtype: int64


In [None]:
anime_recommendations('One Piece Movie 1').head(20)

431          One Piece Movie 2: Nejimaki-jima no Daibouken
432      One Piece Movie 3: Chinjuu-jima no Chopper Oukoku
433                  One Piece Movie 4: Dead End no Bouken
434                   One Piece Movie 5: Norowareta Seiken
437                      One Piece: Taose! Kaizoku Ganzack
469             Dragon Ball Movie 1: Shen Long no Densetsu
994                One Piece: Umi no Heso no Daibouken-hen
1128     One Piece: Oounabara ni Hirake! Dekkai Dekkai ...
3328     One Piece Movie 9: Episode of Chopper Plus - F...
5260                                       One Piece Recap
5510                One Piece Film: Strong World Episode 0
7397     One Piece: Episode of Luffy - Hand Island no B...
7440                            One Piece: Glorious Island
11191                   One Piece: Adventure of Nebulandia
11952            One Piece Film: Gold Episode 0 - 711 ver.
115                                        Hunter x Hunter
117                          Hunter x Hunter: Greed Isla



---


## **7) Evaluation for both implementations of Content Based Recommendation**
We use Precision and Recall to evaluate the performance of our content-based recommendation algorithms.
* The first function, run_content1_evaluation, takes a row of the selected users (user ID and number of ratings) as input and returns a tuple containing the precision and recall of the recommendations made by the first algorithm.
* The second function, run_content2_evaluation, does the same for the second recommendation algorithm.

The code first selects the users who have made a large number of ratings (more than 500 and fewer than 1000), and then applies the evaluation functions to each of these users. It then calculates the average precision and recall for each algorithm, and shows the results.

* Precision measures the fraction of the predicted positive cases that are actually positive.
* Recall measures the fraction of the actual positive cases that are correctly predicted as positive.


\begin{align}
Precision = \frac{TP}{TP+FP}
\hspace{2cm}
Recall = \frac{TP}{TP+FN}
\end{align}
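As a worked example of the two formulas, the sketch below computes precision and recall for a toy recommendation list. The titles are placeholders and the helper name `precision_recall` is hypothetical; here a recommended title counts as a true positive when it also appears among the titles the user has reviewed.

```python
def precision_recall(recommended, relevant):
    """Return (precision, recall) for two collections of titles."""
    recommended, relevant = set(recommended), set(relevant)
    tp = len(recommended & relevant)   # recommended and relevant
    fp = len(recommended - relevant)   # recommended but not relevant
    fn = len(relevant - recommended)   # relevant but not recommended
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Placeholder titles: 4 recommendations, 5 anime actually reviewed by the user
toy_recommended = ["A", "B", "C", "D"]
toy_reviewed = ["B", "D", "E", "F", "G"]

toy_precision, toy_recall = precision_recall(toy_recommended, toy_reviewed)
print(toy_precision, toy_recall)
```

With TP = 2, FP = 2 and FN = 3, precision is 2/4 = 0.5 and recall is 2/5 = 0.4. Since the number of recommendations is fixed at 20 while the selected users have hundreds of reviews, recall is bounded well below 1 in this setting.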


In [None]:
#We keep only the anime ID and name columns of the dataframe
content1 = content_df.select(content_df.MAL_ID, content_df.Name)
#Join of the two dataframes, so as to have a single one with 4 total columns
merged_df = ratings_df.join(content1,ratings_df["anime_id"] == content1["MAL_ID"])
merged_df = merged_df.select(merged_df.user_id, merged_df.anime_id, merged_df.rating, merged_df.Name) #user-profile
merged_df = merged_df.withColumnRenamed('anime_id', 'MAL_ID')
#For each user we count the number of ratings made, in descending order
merged_df_count = merged_df.groupBy('user_id').count()
merged_df_count = merged_df_count.withColumnRenamed('count', 'tot')
merged_df_count = merged_df_count.orderBy('tot', ascending=False)
#We are interested in users who have made a large number of reviews (more than 500 and fewer than 1000)
users = merged_df_count.select('user_id','tot').where(merged_df_count.tot >500).where(merged_df_count.tot < 1000).collect()


In [None]:
users

In [None]:
def evaluation(recommendation_list, total_reviews):
    #TP: anime that are both in the recommendation_list and in the total_reviews
    #FP: anime that are in the recommendation_list but not in the total_reviews
    #FN: anime that are in the total_reviews but not in the recommendation_list
    #TN: anime that are in neither list (not needed for precision and recall)
    recommended = set(recommendation_list)
    reviewed = set(total_reviews)
    TP = len(recommended & reviewed)
    FP = len(recommended - reviewed)
    FN = len(reviewed - recommended)

    #Guard against empty lists to avoid a division by zero
    precision = TP/(TP+FP) if (TP+FP) > 0 else 0.0
    recall = TP/(TP+FN) if (TP+FN) > 0 else 0.0
    return (precision,recall)


In [None]:
# These two functions, given a row of users, run the evaluation for the two content-based algorithms

def run_content1_evaluation(row):
    from pyspark.sql.functions import rand
    #We take all the anime reviewed by the user and order them randomly
    user_input = merged_df.filter(merged_df.user_id == row[0])
    user_input = user_input.select(merged_df.MAL_ID, merged_df.Name, merged_df.rating)
    user_input = user_input.orderBy(rand(seed=42))
    #We take a quarter of them and keep only the anime that the user rated positively (rating > 5)
    limited_user_input = user_input.limit(user_input.count()//4).filter(merged_df.rating > 5)
    #The remaining anime reviewed by the user form our test set (list of titles)
    differ = user_input.subtract(limited_user_input).select('Name').rdd.map(lambda x : x[0]).collect()
    #We invoke the recommendation function and get the list of anime titles
    recommendation_list = content_recommendation(limited_user_input.toPandas(),20)["Name"].to_list()
    return evaluation(recommendation_list, differ)

def run_content2_evaluation(row):
    #All the anime reviewed by the user
    total_reviews = merged_df.filter(merged_df.user_id == row[0])
    #The anime with the highest rating is used as input for the recommendation
    review = total_reviews.orderBy('rating', ascending = False).limit(1).select('Name').first()[0]
    #We take the list with the names of all the reviewed anime
    total_reviews_list = total_reviews.select('Name').rdd.map(lambda x : x[0]).collect()
    recommendation_list = anime_recommendations(review).head(20).to_list()
    return evaluation(recommendation_list, total_reviews_list)

In [None]:
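#pyspark.sql.functions was imported with *, which shadows Python's built-in sum,
#so we access the built-in version explicitly through the builtins module
import builtins as p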


precision1 = []
precision2 = []
recall1 = []
recall2 = []



for row in users[:1000]:
    print(row)
    t1 = run_content1_evaluation(row)
    precision1.append(t1[0])
    recall1.append(t1[1])

    t2 = run_content2_evaluation(row)
    precision2.append(t2[0])
    recall2.append(t2[1])
    

prec1 = p.sum(precision1)/len(precision1)
rec1 = p.sum(recall1)/len(recall1)
prec2 = p.sum(precision2)/len(precision2)
rec2 = p.sum(recall2)/len(recall2)
 

In [None]:
prec1,rec1

In [None]:
prec2,rec2



---


# **References📖**
* [Are They Making Too Much Anime?](https://www.youtube.com/watch?v=GCBUZP9MA-w)
* [Anime Recommendation Database 2020](https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020)
* [Evaluation Metrics](https://www.microsoft.com/en-us/research/publication/evaluating-recommender-systems/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F115396%2Fevaluationmetrics.tr.pdf)
* [Collaborative recommender system](https://github.com/gtolomei/big-data-computing/blob/master/notebooks/MF_Recommender_Systems.ipynb)



---

