# Recommender System for Animes using a Content-Based Approach with PySpark

## What this notebook includes
* Introduction to the problem.
* Summary of the dataset being used.
* Exploration of the dataset.
* Cleaning and processing of the dataset.
* Encoding categorical features.
* Utilising NLP to vectorise text features.
* Creating similarity matrices.
* Using the similarity matrix to recommend similar items.
* Suggestions for improvement.

### Introduction
In this data science notebook, we will be creating a content-based anime recommender system. The dataset used contains details about anime titles, their ratings, genres, and synopses.

Instead of collaborative filtering, we will use a content-based method to find similar animes. We will use PySpark for data cleaning and pre-processing, TF-IDF to create vectors for the synopses, and one-hot encoding for the genres.

We will then create genre and synopsis similarity matrices for each anime, and combine them to create a final similarity matrix. This final matrix will be used in a function to retrieve similar animes above a certain threshold, which can be sorted by similarity score or by the rating of the anime.

As potential improvements, we suggest incorporating additional features such as year and voice actors, and experimenting with collaborative filtering techniques.

### The dataset

The anime_with_synopsis dataset from Kaggle is a collection of anime titles with their synopses, genres, and ratings (average score) from MyAnimeList. As loaded in this notebook it has 16,214 rows and five columns: a MyAnimeList ID, name, score, genres, and synopsis. It can be used to analyse the genres and themes of anime, and to build a content-based recommendation system from the synopsis and genre information.

## Import data and libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from scipy import sparse
from scipy.stats import mstats

# Spark
import pyspark.pandas as ps
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.ml.feature import StopWordsRemover, HashingTF, IDF
from pyspark.ml.linalg import DenseVector, Vectors
from pyspark.sql.types import DoubleType

# Text pre-processing
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer



In [2]:
# stopwords
# nltk.download('stopwords')

In [3]:
# Start Spark session
spark = SparkSession.builder.appName("AnimeRecommender").getOrCreate()

In [4]:
# Data into spark dataframes
anime_sdf = spark.read.csv('../data/raw/anime_with_synopsis.csv', header=True)

In [5]:
# Creating a copy of original dataframe to work with
anime = anime_sdf.select("*")

## Explore Data

### Shape

In [6]:
# Shape of data
print(anime.count())
print(len(anime.columns))

16214
5


The data has 16214 rows and 5 columns.

### Preview

In [7]:
# Preview data
anime.show(5)

+------+--------------------+-----+--------------------+--------------------+
|MAL_ID|                Name|Score|              Genres|           sypnopsis|
+------+--------------------+-----+--------------------+--------------------+
|     1|        Cowboy Bebop| 8.78|Action, Adventure...|"In the year 2071...|
|     5|Cowboy Bebop: Ten...| 8.39|Action, Drama, My...|other day, anothe...|
|     6|              Trigun| 8.24|Action, Sci-Fi, A...|"Vash the Stamped...|
|     7|  Witch Hunter Robin| 7.27|Action, Mystery, ...|ches are individu...|
|     8|      Bouken Ou Beet| 6.98|Adventure, Fantas...|It is the dark ce...|
+------+--------------------+-----+--------------------+--------------------+
only showing top 5 rows



In [8]:
# Preview columns
anime.columns

['MAL_ID', 'Name', 'Score', 'Genres', 'sypnopsis']

The synopsis column name is misspelled, so let's fix it.

In [9]:
# Renaming column
anime = anime.withColumnRenamed('sypnopsis', 'Synopsis')

anime.show(5)

+------+--------------------+-----+--------------------+--------------------+
|MAL_ID|                Name|Score|              Genres|            Synopsis|
+------+--------------------+-----+--------------------+--------------------+
|     1|        Cowboy Bebop| 8.78|Action, Adventure...|"In the year 2071...|
|     5|Cowboy Bebop: Ten...| 8.39|Action, Drama, My...|other day, anothe...|
|     6|              Trigun| 8.24|Action, Sci-Fi, A...|"Vash the Stamped...|
|     7|  Witch Hunter Robin| 7.27|Action, Mystery, ...|ches are individu...|
|     8|      Bouken Ou Beet| 6.98|Adventure, Fantas...|It is the dark ce...|
+------+--------------------+-----+--------------------+--------------------+
only showing top 5 rows



### Nulls check

In [10]:
# Checking for nulls
anime.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in anime.columns]).show()

+------+----+-----+------+--------+
|MAL_ID|Name|Score|Genres|Synopsis|
+------+----+-----+------+--------+
|     0|   0|    0|     0|       8|
+------+----+-----+------+--------+



Only 8 rows have a null in the Synopsis column, so we can simply drop them.

In [11]:
# Dropping null rows
anime = anime.na.drop()

anime.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in anime.columns]).show()

+------+----+-----+------+--------+
|MAL_ID|Name|Score|Genres|Synopsis|
+------+----+-----+------+--------+
|     0|   0|    0|     0|       0|
+------+----+-----+------+--------+



### Preview tail end

In [12]:
anime.tail(5)
# Can see some Scores as 'Unknown' which isn't great as we don't want to recommend titles that haven't even been scored

[Row(MAL_ID='48481', Name='Daomu Biji Zhi Qinling Shen Shu', Score='Unknown', Genres='Adventure, Mystery, Supernatural', Synopsis='No synopsis information has been added to this title. Help improve our database by adding a synopsis here .'),
 Row(MAL_ID='48483', Name='Mieruko-chan', Score='Unknown', Genres='Comedy, Horror, Supernatural', Synopsis='ko is a typical high school student whose life turns upside down when she suddenly starts to see gruesome and hideous monsters. Despite being completely terrified, Miko carries on with her daily life, pretending not to notice the horrors that surround her. She must endure the fear in order to keep herself and her friend Hana out of danger, even if that means coming face to face with the absolute worst. Blending both comedy and horror, Mieruko-chan tells the story of a girl who tries to deal with the paranormal by acting indifferent toward it.'),
 Row(MAL_ID='48488', Name='Higurashi no Naku Koro ni Sotsu', Score='Unknown', Genres='Mystery, Dem

Everything looks fine except the 'Score' column, where some values are set to 'Unknown'. Ideally, we would not recommend anime that have not been scored.

In [15]:
# Find more details about rows with an Unknown score
anime_unknown = anime.where(anime.Score == 'Unknown')
print(anime_unknown.count())
anime_unknown.show(5)

5115
+------+--------------------+-------+--------------------+--------------------+
|MAL_ID|                Name|  Score|              Genres|            Synopsis|
+------+--------------------+-------+--------------------+--------------------+
|  1547|    Obake no Q-tarou|Unknown|Comedy, School, S...|Q-taro, a monster...|
|  1656|     PostPet Momobin|Unknown|        Comedy, Kids|omo and Komomo ca...|
|  1739|Shibawanko no Wa ...|Unknown|                Kids|"Based on a japan...|
|  1863|Silk Road Shounen...|Unknown|Adventure, Fantas...|hen a boy Yuto vi...|
|  2073|Hengen Taima Yako...|Unknown|      Horror, Shoujo|"Shoko and Maiko ...|
+------+--------------------+-------+--------------------+--------------------+
only showing top 5 rows



Nearly a third of the data has 'Unknown' scores. If this were a classification or regression problem, dropping these rows might be detrimental. Here, however, dropping them fits our objective (not recommending unscored animes).

In [16]:
# Dropping unknown scored animes
anime = anime.where(anime.Score != 'Unknown')

print(anime.count())
anime.tail(5)

11091


[Row(MAL_ID='47398', Name='Kimetsu Gakuen: Valentine-hen', Score='6.59', Genres='Comedy', Synopsis="Valentine's Day special for Kimetsu no Yaiba . The first three episodes will be released on Aniplex's official YouTube channel, while the fourth and final episode will be streamed during the Kimetsu Matsuri Online: Anime 2nd Anniversary Festival ."),
 Row(MAL_ID='47402', Name='Heikousen', Score='7.52', Genres='Music, Romance', Synopsis="usic video for Eve and suis' song Heikousen , also used as commercial for the Lotte Ghana Valentine Present campaign."),
 Row(MAL_ID='47614', Name='Nu Wushen de Canzhuo Spring Festival Special', Score='6.83', Genres='Slice of Life, Comedy', Synopsis='No synopsis information has been added to this title. Help improve our database by adding a synopsis here .'),
 Row(MAL_ID='47616', Name='Yakusoku no Neverland 2nd Season: Michishirube', Score='4.81', Genres='Mystery, Psychological, Supernatural, Thriller, Shounen', Synopsis='cap of the first season of Yakuso

### Clean up

In [17]:
# Dropping columns not needed
anime = anime.drop('MAL_ID')

anime.show(5)

+--------------------+-----+--------------------+--------------------+
|                Name|Score|              Genres|            Synopsis|
+--------------------+-----+--------------------+--------------------+
|        Cowboy Bebop| 8.78|Action, Adventure...|"In the year 2071...|
|Cowboy Bebop: Ten...| 8.39|Action, Drama, My...|other day, anothe...|
|              Trigun| 8.24|Action, Sci-Fi, A...|"Vash the Stamped...|
|  Witch Hunter Robin| 7.27|Action, Mystery, ...|ches are individu...|
|      Bouken Ou Beet| 6.98|Adventure, Fantas...|It is the dark ce...|
+--------------------+-----+--------------------+--------------------+
only showing top 5 rows



In [18]:
# Saving copy to use later for recommendations
anime_sdf = anime.select("*")

anime_sdf.show(5)

+--------------------+-----+--------------------+--------------------+
|                Name|Score|              Genres|            Synopsis|
+--------------------+-----+--------------------+--------------------+
|        Cowboy Bebop| 8.78|Action, Adventure...|"In the year 2071...|
|Cowboy Bebop: Ten...| 8.39|Action, Drama, My...|other day, anothe...|
|              Trigun| 8.24|Action, Sci-Fi, A...|"Vash the Stamped...|
|  Witch Hunter Robin| 7.27|Action, Mystery, ...|ches are individu...|
|      Bouken Ou Beet| 6.98|Adventure, Fantas...|It is the dark ce...|
+--------------------+-----+--------------------+--------------------+
only showing top 5 rows



## Data Pre-processing and Cleaning

### Categorical Features

We need to perform one hot encoding for the genres.

In [19]:
# Splitting column for one hot encoding
anime = anime.withColumn('Genres_split', F.split(anime['Genres'], ', '))

anime.select('Genres', 'Genres_split').show(5)

+--------------------+--------------------+
|              Genres|        Genres_split|
+--------------------+--------------------+
|Action, Adventure...|[Action, Adventur...|
|Action, Drama, My...|[Action, Drama, M...|
|Action, Sci-Fi, A...|[Action, Sci-Fi, ...|
|Action, Mystery, ...|[Action, Mystery,...|
|Adventure, Fantas...|[Adventure, Fanta...|
+--------------------+--------------------+
only showing top 5 rows



In [20]:
# One hot encoding
genres_set = anime.withColumn('exploded_genres', F.explode('Genres_split')).agg(F.collect_set('exploded_genres')).collect()[0][0]

genres_set = sorted(genres_set)

for genre in genres_set:
    anime = anime.withColumn(genre, F.when(F.array_contains('Genres_split', genre), 1).otherwise(0))
    
anime.printSchema()
anime.show(5)

root
 |-- Name: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Synopsis: string (nullable = true)
 |-- Genres_split: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- 6.45: integer (nullable = false)
 |-- 7.14: integer (nullable = false)
 |-- Action: integer (nullable = false)
 |-- Adventure: integer (nullable = false)
 |-- Cars: integer (nullable = false)
 |-- Comedy: integer (nullable = false)
 |-- Dementia: integer (nullable = false)
 |-- Demons: integer (nullable = false)
 |-- Drama: integer (nullable = false)
 |-- Ecchi: integer (nullable = false)
 |-- Fantasy: integer (nullable = false)
 |-- Game: integer (nullable = false)
 |-- Harem: integer (nullable = false)
 |-- Historical: integer (nullable = false)
 |-- Horror: integer (nullable = false)
 |-- Josei: integer (nullable = false)
 |-- Kids: integer (nullable = false)
 |-- Magic: integer (nullable = false)
 |-- Martial Arts: integer (nullable 
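
As a minimal sketch of the encoding step above, the same idea can be shown in plain Python on a few made-up genre strings. Because a title can belong to several genres, the result is strictly a multi-hot vector (several 1s per row) rather than a classic one-hot vector. The sample rows are hypothetical and only illustrate the logic; they are not taken from the dataset.

```python
# Toy rows standing in for the Genres column (hypothetical sample data).
rows = [
    "Action, Adventure, Comedy",
    "Action, Drama",
    "Comedy",
]

# Split each string on ", " in the same way F.split is used in the notebook.
split_rows = [r.split(", ") for r in rows]

# Sorted vocabulary of every genre that appears at least once.
genres = sorted({g for row in split_rows for g in row})

# One 0/1 flag per genre per title (a multi-hot vector).
encoded = [[1 if g in row else 0 for g in genres] for row in split_rows]

print(genres)
for name, vec in zip(rows, encoded):
    print(vec, name)
```

Sorting the vocabulary keeps the column order deterministic, which matters later when the flags are stacked into a matrix.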

In [21]:
# Dropping unwanted columns ('6.45' and '7.14' look like scores, not genres, probably from misparsed CSV rows)
anime = anime.drop('Genres', 'Genres_split', '6.45', '7.14')
    
anime.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Synopsis: string (nullable = true)
 |-- Action: integer (nullable = false)
 |-- Adventure: integer (nullable = false)
 |-- Cars: integer (nullable = false)
 |-- Comedy: integer (nullable = false)
 |-- Dementia: integer (nullable = false)
 |-- Demons: integer (nullable = false)
 |-- Drama: integer (nullable = false)
 |-- Ecchi: integer (nullable = false)
 |-- Fantasy: integer (nullable = false)
 |-- Game: integer (nullable = false)
 |-- Harem: integer (nullable = false)
 |-- Historical: integer (nullable = false)
 |-- Horror: integer (nullable = false)
 |-- Josei: integer (nullable = false)
 |-- Kids: integer (nullable = false)
 |-- Magic: integer (nullable = false)
 |-- Martial Arts: integer (nullable = false)
 |-- Mecha: integer (nullable = false)
 |-- Military: integer (nullable = false)
 |-- Music: integer (nullable = false)
 |-- Mystery: integer (nullable = false)
 |-- Parody: integer (nullable = fal
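
The `6.45` and `7.14` columns dropped above look like scores rather than genre names, which suggests that some CSV rows were split in the wrong place. Synopses often contain commas, quotes, and line breaks, and a parser that does not account for them shifts fields into the wrong columns. The sketch below uses Python's standard `csv` module on a made-up record to show the effect; the record is hypothetical. Spark's CSV reader offers `quote`, `escape`, and `multiLine` options for the same purpose, but whether they fully resolve this particular file is an assumption that was not verified here.

```python
import csv
import io

# Made-up record: the synopsis contains a comma and a line break inside quotes.
raw = (
    "MAL_ID,Name,Score,Genres,sypnopsis\n"
    '1,Some Title,7.14,"Action, Drama","Line one,\nline two"\n'
)

# Naive parsing: split on newlines, then on commas.
naive = [line.split(",") for line in raw.strip().split("\n")]

# Quote-aware parsing keeps each quoted field intact.
parsed = list(csv.reader(io.StringIO(raw)))

print(len(naive), [len(r) for r in naive])    # too many rows, uneven widths
print(len(parsed), [len(r) for r in parsed])  # header plus one 5-field record

# Possible Spark equivalent (untested assumption for this dataset):
# spark.read.csv(path, header=True, quote='"', escape='"', multiLine=True)
```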

### Cleaning Text

Before using TF-IDF on the synopsis column, we need to clean it by:
1. Removing punctuation
2. Lower casing
3. Removing stop words
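
The three steps above can be sketched in plain Python on a single sentence. The stop-word set here is a tiny hypothetical list chosen for the example; Spark's `StopWordsRemover` uses a much longer default list, so real results will differ.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not Spark's default list).
STOP_WORDS = {"in", "the", "has", "is", "a", "of", "and"}

def clean_text(text):
    # 1. Remove ASCII punctuation.
    no_punct = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    # 2. Lower-case.
    lowered = no_punct.lower()
    # 3. Drop stop words (and empty tokens produced by repeated spaces).
    tokens = [t for t in lowered.split(" ") if t and t not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean_text('"In the year 2071, humanity has colonized several planets."')
print(cleaned)
```

Lower-casing before stop-word removal matters because the stop-word comparison in this sketch is case-sensitive.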

In [22]:
# Removing punctuation from synopsis
anime = anime.withColumn("Synopsis_cleaning", F.regexp_replace('Synopsis', r"""[!\"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]""", ''))

anime.select("Synopsis", "Synopsis_cleaning").show(5)

+--------------------+--------------------+
|            Synopsis|   Synopsis_cleaning|
+--------------------+--------------------+
|"In the year 2071...|In the year 2071 ...|
|other day, anothe...|other day another...|
|"Vash the Stamped...|Vash the Stampede...|
|ches are individu...|ches are individu...|
|It is the dark ce...|It is the dark ce...|
+--------------------+--------------------+
only showing top 5 rows



In [23]:
# lower case words
anime = anime.withColumn("Synopsis_cleaning", F.lower(F.col('Synopsis_cleaning')))

anime.select("Synopsis", "Synopsis_cleaning").show(5)

+--------------------+--------------------+
|            Synopsis|   Synopsis_cleaning|
+--------------------+--------------------+
|"In the year 2071...|in the year 2071 ...|
|other day, anothe...|other day another...|
|"Vash the Stamped...|vash the stampede...|
|ches are individu...|ches are individu...|
|It is the dark ce...|it is the dark ce...|
+--------------------+--------------------+
only showing top 5 rows



In [24]:
# Remove stop words

anime = anime.withColumn("Synopsis_cleaning", F.split(anime['Synopsis_cleaning'], ' '))

remover = StopWordsRemover().setInputCol('Synopsis_cleaning').setOutputCol('Synopsis_clean')

anime = remover.transform(anime)

anime = anime.withColumn('Synopsis_clean', F.concat_ws(" ",F.col('Synopsis_clean')))

anime.select("Synopsis", "Synopsis_clean").show(5)

+--------------------+--------------------+
|            Synopsis|      Synopsis_clean|
+--------------------+--------------------+
|"In the year 2071...|year 2071 humanit...|
|other day, anothe...|day another bount...|
|"Vash the Stamped...|vash stampede man...|
|ches are individu...|ches individuals ...|
|It is the dark ce...|dark century peop...|
+--------------------+--------------------+
only showing top 5 rows



In [25]:
anime.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Synopsis: string (nullable = true)
 |-- Action: integer (nullable = false)
 |-- Adventure: integer (nullable = false)
 |-- Cars: integer (nullable = false)
 |-- Comedy: integer (nullable = false)
 |-- Dementia: integer (nullable = false)
 |-- Demons: integer (nullable = false)
 |-- Drama: integer (nullable = false)
 |-- Ecchi: integer (nullable = false)
 |-- Fantasy: integer (nullable = false)
 |-- Game: integer (nullable = false)
 |-- Harem: integer (nullable = false)
 |-- Historical: integer (nullable = false)
 |-- Horror: integer (nullable = false)
 |-- Josei: integer (nullable = false)
 |-- Kids: integer (nullable = false)
 |-- Magic: integer (nullable = false)
 |-- Martial Arts: integer (nullable = false)
 |-- Mecha: integer (nullable = false)
 |-- Military: integer (nullable = false)
 |-- Music: integer (nullable = false)
 |-- Mystery: integer (nullable = false)
 |-- Parody: integer (nullable = fal

In [26]:
# Removing unwanted columns
anime = anime.drop("Synopsis", "Synopsis_cleaning")

anime.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Action: integer (nullable = false)
 |-- Adventure: integer (nullable = false)
 |-- Cars: integer (nullable = false)
 |-- Comedy: integer (nullable = false)
 |-- Dementia: integer (nullable = false)
 |-- Demons: integer (nullable = false)
 |-- Drama: integer (nullable = false)
 |-- Ecchi: integer (nullable = false)
 |-- Fantasy: integer (nullable = false)
 |-- Game: integer (nullable = false)
 |-- Harem: integer (nullable = false)
 |-- Historical: integer (nullable = false)
 |-- Horror: integer (nullable = false)
 |-- Josei: integer (nullable = false)
 |-- Kids: integer (nullable = false)
 |-- Magic: integer (nullable = false)
 |-- Martial Arts: integer (nullable = false)
 |-- Mecha: integer (nullable = false)
 |-- Military: integer (nullable = false)
 |-- Music: integer (nullable = false)
 |-- Mystery: integer (nullable = false)
 |-- Parody: integer (nullable = false)
 |-- Police: integer (nullable = fal

We will now convert the data to a pandas DataFrame for the remaining steps.

In [27]:
# Converting to pandas dataframe
anime_cleaned_df = anime.toPandas()

anime_cleaned_df.head()

Unnamed: 0,Name,Score,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Unknown,Vampire,Yaoi,Synopsis_clean
0,Cowboy Bebop,8.78,1,1,0,1,0,0,1,0,...,0,1,0,0,0,0,0,0,0,year 2071 humanity colonized several planets m...
1,Cowboy Bebop: Tengoku no Tobira,8.39,1,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,day another bounty—such life often unlucky cre...
2,Trigun,8.24,1,1,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,vash stampede man 60000000000 bounty head reas...
3,Witch Hunter Robin,7.27,1,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,ches individuals special powers like esp telek...
4,Bouken Ou Beet,6.98,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,dark century people suffering rule devil vande...


## Creating Similarity Matrices

Now we will begin creating the similarity matrices. We will create one for genres and one for synopsis, then combine them.

In [28]:
# Creating a numpy array to hold all the one hot encoded genres
genres_arr = anime_cleaned_df.drop(['Name', 'Score', 'Synopsis_clean'], axis=1).to_numpy()
genres_arr

array([[1, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Now we can create a similarity matrix for genres:

In [29]:
# Creating a genres matrix as the dot product of the genres 2d array with itself transposed
genres_matr = np.dot(genres_arr, np.transpose(genres_arr))
genres_matr
# Needs normalising!

array([[6, 4, 5, ..., 1, 0, 1],
       [4, 5, 3, ..., 0, 1, 0],
       [5, 3, 6, ..., 1, 1, 1],
       ...,
       [1, 0, 1, ..., 2, 0, 1],
       [0, 1, 1, ..., 0, 5, 0],
       [1, 0, 1, ..., 1, 0, 2]])

These raw counts need normalising, especially because we want to combine them with the synopsis similarity scores we will obtain soon (from the TF-IDF transformation), which lie in the 0-1 range.

In [30]:
# Retrieve a vector for the max value of the genre matrix for each row
max_genres_vector = genres_matr.max(axis=1)
max_genres_vector

array([6, 5, 6, ..., 2, 5, 2])

In [31]:
# Retrieve a vector for the min value of the genre matrix for each row
min_genres_vector = genres_matr.min(axis=1)
min_genres_vector

array([0, 0, 0, ..., 0, 0, 0])

In [32]:
# Max-min range
max_min_genres_vector = max_genres_vector - min_genres_vector
max_min_genres_vector

array([6, 5, 6, ..., 2, 5, 2])

In [33]:
# Max-min normalisation for genres matrix
norm_genres_matr = (genres_matr - min_genres_vector[:, None])/(max_min_genres_vector[:, None])
norm_genres_matr


array([[1.        , 0.66666667, 0.83333333, ..., 0.16666667, 0.        ,
        0.16666667],
       [0.8       , 1.        , 0.6       , ..., 0.        , 0.2       ,
        0.        ],
       [0.83333333, 0.5       , 1.        , ..., 0.16666667, 0.16666667,
        0.16666667],
       ...,
       [0.5       , 0.        , 0.5       , ..., 1.        , 0.        ,
        0.5       ],
       [0.        , 0.2       , 0.2       , ..., 0.        , 1.        ,
        0.        ],
       [0.5       , 0.        , 0.5       , ..., 0.5       , 0.        ,
        1.        ]])
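
The row-wise max-min normalisation above can be sketched on a toy matrix. Two things are worth seeing in a small example: each row is scaled by its own range, so the result is no longer symmetric, and a row whose range is zero would cause a division by zero. The guard below is one possible way to handle that case; whether such rows exist in this dataset is an assumption, and the toy counts are hypothetical.

```python
import numpy as np

def row_min_max(matrix):
    """Scale each row to [0, 1]; rows with zero range are left as all zeros."""
    matrix = np.asarray(matrix, dtype=float)
    row_min = matrix.min(axis=1, keepdims=True)
    row_range = matrix.max(axis=1, keepdims=True) - row_min
    out = np.zeros_like(matrix)
    # Divide only where the range is non-zero to avoid 0/0.
    np.divide(matrix - row_min, row_range, out=out, where=row_range != 0)
    return out

# Toy co-occurrence counts (hypothetical): the last title shares no genres with anything.
counts = np.array([[6, 4, 0],
                   [4, 5, 0],
                   [0, 0, 0]])
normed = row_min_max(counts)
print(normed)
```

In this toy case the pair (0, 1) scores 4/6 from the first title's point of view and 4/5 from the second's, which mirrors the 0.667 versus 0.8 seen in the notebook output.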

Now that we have the normalised genres similarity matrix, let's obtain the same for synopsis (using TF-IDF).

### TF-IDF

In [34]:
# Retrieve TFIDF transformation for the synopsis
v = TfidfVectorizer()
tfidf = v.fit_transform(anime_cleaned_df['Synopsis_clean'])
tfidf # it is a scipy sparse matrix

<11091x41760 sparse matrix of type '<class 'numpy.float64'>'
	with 361245 stored elements in Compressed Sparse Row format>

The result is a SciPy sparse matrix. To combine it with the genres matrix later, we convert the genres matrix to a sparse matrix as well.

In [35]:
# Convert genres numpy matrix to scipy matrix to use with tfidf
# This is the genres similarity matrix
genres_sm = sparse.csc_matrix(norm_genres_matr)

Now we create the similarity matrix for synopsis (TF-IDF transformed):

In [36]:
# TF-IDF similarity matrix
tfidf_sm = tfidf @ tfidf.T
tfidf_sm

<11091x11091 sparse matrix of type '<class 'numpy.float64'>'
	with 44480639 stored elements in Compressed Sparse Row format>
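
Why does a plain dot product give a similarity score here? `TfidfVectorizer` L2-normalises each row by default (`norm='l2'`), so every synopsis vector has unit length and the dot product of two rows equals their cosine similarity. Since TF-IDF weights are non-negative, the scores fall between 0 and 1. The sketch below shows this with hand-made vectors; the weights are hypothetical and are not TF-IDF values from the dataset.

```python
import numpy as np

# Hand-made non-negative term weights for three documents (hypothetical values).
weights = np.array([[3.0, 4.0, 0.0],
                    [0.0, 4.0, 3.0],
                    [0.0, 0.0, 2.0]])

# L2-normalise each row so every document vector has unit length.
unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)

# Dot product of unit vectors = cosine similarity.
cosine = unit @ unit.T
print(np.round(cosine, 2))
```

Unlike the row-wise max-min normalisation used for genres, this matrix is symmetric and has 1.0 on the diagonal by construction.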

### Combining Similarity Matrices

There are two ways we will try out to combine the similarity scores of both the genres and synopsis matrices. Note that we are doing this to obtain a combined score of similarity for each anime pair.

The first method is simple averaging: add the two matrices element-wise and divide by 2. The drawback is that this gives the synopsis and the genres equal weight in the recommendations. There is no single right answer for the weighting; it depends on whether we want recommendations driven more by synopsis or more by genre.

The second method is the geometric mean, which in our case is the element-wise square root of the element-wise product of the two matrices. It rewards pairs that score well on both measures, because a low score on either one pulls the result down.

We will try both these methods out and then combine them to give us our final similarity matrix.
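
To make the weighting question concrete, here is a minimal sketch of both combination rules on small arrays. The `genre_weight` parameter is a hypothetical knob that the notebook does not use (the notebook fixes it at 0.5 by plain averaging), and the scores are made up.

```python
import numpy as np

def blend(genre_sim, text_sim, genre_weight=0.5):
    """Weighted average of two similarity arrays; genre_weight is a hypothetical knob."""
    if not 0.0 <= genre_weight <= 1.0:
        raise ValueError("genre_weight must be between 0 and 1")
    return genre_weight * genre_sim + (1.0 - genre_weight) * text_sim

def geometric_mean(genre_sim, text_sim):
    """Element-wise geometric mean: zero whenever either input is zero."""
    return np.sqrt(genre_sim * text_sim)

genre_sim = np.array([1.0, 0.8, 0.0])
text_sim = np.array([1.0, 0.2, 0.5])

equal = blend(genre_sim, text_sim)            # equal weights = plain average
text_heavy = blend(genre_sim, text_sim, 0.25) # leans towards the synopsis
gmean = geometric_mean(genre_sim, text_sim)
print(equal, text_heavy, gmean)
```

The third pair shows the key difference: with no genre overlap the average still gives 0.25, while the geometric mean gives 0.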

In [37]:
# Add both and divide by 2 to get the average similarities of both
average_sm = (genres_sm + tfidf_sm) / 2
average_sm

<11091x11091 sparse matrix of type '<class 'numpy.float64'>'
	with 74670437 stored elements in Compressed Sparse Column format>

Let's have a look at how the average similarity matrix looks.

In [38]:
# Let us have a look at the similarities for genre, TFIDF, and both averaged for Cowboy Bebop
print(genres_sm.todense()[0])
print(tfidf_sm.todense()[0])
print(average_sm.todense()[0])

[[1.         0.66666667 0.83333333 ... 0.16666667 0.         0.16666667]]
[[1.         0.13925599 0.02920766 ... 0.         0.01335555 0.        ]]
[[1.         0.40296133 0.4312705  ... 0.08333333 0.00667778 0.08333333]]


As expected, each pair's similarity score is the average of its individual genre and synopsis scores.

Now we will create the geometric mean similarity matrix:

In [39]:
# Genre and synopsis similarities arguably should not simply be averaged
# Work out the geometric mean of both
gmean_sm = genres_sm.multiply(tfidf_sm).sqrt()
gmean_sm.todense()[0]

matrix([[1.        , 0.30469218, 0.15601191, ..., 0.        , 0.        ,
         0.        ]])

Compared to the average similarity matrix, it looks like the scores drop heavily when either of the scores is very low. E.g. if the synopsis similarity of a pair is low but the genre similarity is high, the average similarity matrix would give a similarity score that is neither low nor high, but the geometric mean similarity score would still be low.

We will average the two to form the final similarity matrix.

In [40]:
# Combine the average similarity matrix and the geometric mean similarity matrix to get the final one
sm = (average_sm + gmean_sm) / 2
sm.todense()[0]

matrix([[1.        , 0.35382676, 0.2936412 , ..., 0.04166667, 0.00333889,
         0.04166667]])
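
As a worked check, the final score for the pair (Cowboy Bebop, Cowboy Bebop: Tengoku no Tobira) can be reproduced by hand from the genre and synopsis scores printed earlier. The inputs are copied from the rounded notebook outputs, so agreement is only expected to about six decimal places.

```python
import math

# Scores for the pair copied from the printed outputs above (rounded values).
genre_score = 0.66666667
synopsis_score = 0.13925599

average = (genre_score + synopsis_score) / 2
gmean = math.sqrt(genre_score * synopsis_score)
final = (average + gmean) / 2

print(round(average, 4), round(gmean, 4), round(final, 4))
```

The three values agree with the second entries of the average, geometric mean, and final rows shown in the outputs.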

## Testing

Let us test our final similarity matrix using Cowboy Bebop.

In [41]:
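# Pandas df of original df to return recommendations
anime_df = anime_sdf.toPandas()
# Scores were read as strings; convert them so sorting by Score is numeric
anime_df['Score'] = pd.to_numeric(anime_df['Score'], errors='coerce')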

anime_df.head()

Unnamed: 0,Name,Score,Genres,Synopsis
0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","""In the year 2071, humanity has colonized seve..."
1,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ..."
2,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen","""Vash the Stampede is the man with a $$60,000,..."
3,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",ches are individuals with special powers like ...
4,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",It is the dark century and the people are suff...


In [42]:
# Use a dense matrix
dense_sm = sm.todense()

First we will retrieve the index of each anime that has a similarity score of more than 0.3 with Cowboy Bebop.

In [43]:
# Test find the anime which are similar to Cowboy Bebop by genre and synopsis
cowboy_bebop_sim_idx = np.where(dense_sm[0] > 0.3)[1].tolist()
cowboy_bebop_sim_idx

[0,
 1,
 39,
 74,
 75,
 365,
 527,
 834,
 881,
 1092,
 1145,
 1222,
 1225,
 1300,
 1411,
 1433,
 1646,
 1878,
 1907,
 1908,
 2021,
 2177,
 2191,
 2522,
 2525,
 2993,
 3093,
 3257,
 3583,
 6736]

Now we can use those indices to retrieve their scores too.

In [44]:
# Retrieve the scores for the similar anime to Cowboy Bebop so we can see which is the most similar
cowboy_bebop_sim_scores = np.array(dense_sm[0].tolist()[0])[cowboy_bebop_sim_idx].tolist()
cowboy_bebop_sim_scores

[1.0,
 0.3538267555619954,
 0.30156571791789183,
 0.3686875854350235,
 0.33455845888115077,
 0.40230339444965446,
 0.32431680375960464,
 0.3140748961880193,
 0.3389283728674283,
 0.338630613251939,
 0.4244131664629514,
 0.3432676618297479,
 0.32171427360080157,
 0.3133713712976669,
 0.30134085486704315,
 0.32263063509104695,
 0.3039864112034351,
 0.3291693394721288,
 0.3066952765778514,
 0.33187348292362223,
 0.36943748196731385,
 0.3401960563610572,
 0.36733068688783344,
 0.3118898636881699,
 0.3339970911088977,
 0.36303922700896557,
 0.4509182566105018,
 0.34624455191691395,
 0.33943826757668283,
 0.31411005766453426]

Now we can print the titles from the original dataframe and add the scores as a column to then sort by.

In [45]:
# Print Cowboy Bebop recommendations
# .copy() gives an independent dataframe, so adding a column does not trigger SettingWithCopyWarning
cowboy_bebop_recoms = anime_df.iloc[cowboy_bebop_sim_idx].copy()
cowboy_bebop_recoms['Similarity_Scores'] = cowboy_bebop_sim_scores
cowboy_bebop_recoms.sort_values(by='Similarity_Scores', ascending=False).head(10)
# Can be improved by more features (such as year, voice actors etc.)


Unnamed: 0,Name,Score,Genres,Synopsis,Similarity_Scores
0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","""In the year 2071, humanity has colonized seve...",1.0
3093,Ginga Senpuu Braiger,6.68,"Action, Sci-Fi, Adventure, Space, Mecha","In the year 2111, the solar system has been co...",0.450918
1145,Odin: Koushi Hansen Starlight,5.24,"Action, Sci-Fi, Adventure, Space, Drama","""In the year 2099, mankind has colonized parts...",0.424413
365,Seihou Bukyou Outlaw Star,7.87,"Action, Sci-Fi, Adventure, Space, Comedy",Gene Starwind has always dreamed of piloting h...,0.402303
2021,Sei Juushi Bismarck,7.23,"Action, Sci-Fi, Adventure, Space, Mecha","In the distant future, humanity has explored b...",0.369437
74,Turn A Gundam,7.7,"Action, Military, Sci-Fi, Adventure, Space, Dr...","""It is the Correct Century, two millennia afte...",0.368688
2191,Juusenki L-Gaim,6.7,"Action, Sci-Fi, Adventure, Space, Drama, Mecha...","""In the year 3990, the immortal Oldna Poseidal...",0.367331
2993,Cowboy Bebop: Yose Atsume Blues,7.44,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","""Due to the violence portrayed in the Cowboy B...",0.363039
1,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ...",0.353827
3257,Yamato Takeru,6.78,"Action, Adventure, Fantasy, Magic, Mecha, Sci-...","In the 25th century, a spaceship carrying 300 ...",0.346245


Looks okay! Let's create a function to test with any title.

In [46]:
# Create a function to try it out
def get_recommendations(anime_name, threshold=0.3, sort_by='Similarity_Scores'):

    # Return nothing if name not found
    if anime_name not in anime_df['Name'].values:
        print("Could not find an anime by that name. Please check punctuation and spelling.")
        return None
    
    # Find index of anime in anime dataframe
    anime_idx = anime_df.index[anime_df['Name'] == anime_name].to_list()[0]

    # Find indices of similar anime
    sim_idx = np.where(dense_sm[anime_idx] > threshold)[1].tolist()

    # Find similarity scores of similar anime
    sim_scores = np.array(dense_sm[anime_idx].tolist()[0])[sim_idx].tolist()

    # .copy() gives an independent dataframe, so adding a column does not trigger SettingWithCopyWarning
    recoms = anime_df.iloc[sim_idx].copy()
    recoms['Similarity_Scores'] = sim_scores
    return recoms.sort_values(by=sort_by, ascending=False).head(10)
        

Test function:

In [47]:
get_recommendations("Cowboy Bebop", sort_by='Score')


Unnamed: 0,Name,Score,Genres,Synopsis,Similarity_Scores
0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","""In the year 2071, humanity has colonized seve...",1.0
1,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ...",0.353827
2177,Mobile Suit Gundam 00,8.14,"Action, Military, Sci-Fi, Space, Drama, Mecha","""In the distant future, mankind's dependence o...",0.340196
1878,Terra e... (TV),7.92,"Action, Drama, Military, Sci-Fi, Space","In the future, humans are living on colonized ...",0.329169
834,Top wo Nerae! Gunbuster,7.89,"Action, Comedy, Drama, Mecha, Military, Sci-Fi...","""In the near future, humanity has taken its fi...",0.314075
365,Seihou Bukyou Outlaw Star,7.87,"Action, Sci-Fi, Adventure, Space, Comedy",Gene Starwind has always dreamed of piloting h...,0.402303
6736,Uchuu Senkan Yamato 2199: Hoshimeguru Hakobune,7.77,"Action, Military, Sci-Fi, Space, Drama","""2199 AD. Yamato tried to leave behind the Lar...",0.31411
881,Uchuu Kaizoku Captain Herlock,7.71,"Action, Sci-Fi, Adventure, Space, Drama, Seinen",It is 2977 AD and mankind has become stagnant....,0.338928
74,Turn A Gundam,7.7,"Action, Military, Sci-Fi, Adventure, Space, Dr...","""It is the Correct Century, two millennia afte...",0.368688
1433,Uchuu Senkan Yamato,7.59,"Action, Adventure, Drama, Military, Sci-Fi, Space","In the year 2199, Earth is a mere shell of its...",0.322631
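
One possible refinement of the lookup above is to work on a single similarity row and to exclude the query title itself, which otherwise always appears first with a score of 1.0. A dense 11091 x 11091 matrix of 64-bit floats takes roughly 1 GB, whereas a single row of a SciPy sparse matrix can be pulled out with `sm[i].toarray().ravel()`. The helper below is a hypothetical sketch on a toy row, not the function used in this notebook.

```python
import numpy as np

def top_k_similar(sim_row, self_idx, k=3, threshold=0.3):
    """Return (index, score) pairs for the k highest scores above threshold, excluding the query."""
    scores = np.asarray(sim_row, dtype=float).ravel().copy()
    scores[self_idx] = -np.inf            # never recommend the query title
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order if scores[i] > threshold]

# Toy similarity row for title 0 (hypothetical values).
row = [1.0, 0.35, 0.10, 0.45, 0.31]
result = top_k_similar(row, self_idx=0)
print(result)
```

The returned indices can then be passed to `anime_df.iloc[...]` in the same way as `sim_idx` in the function above.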


In conclusion, this notebook presented a content-based anime recommender system built on a dataset of anime titles, ratings, genres, and synopses. Using PySpark for data cleaning and pre-processing, together with TF-IDF and one-hot encoding, the system finds similar animes based on genre and synopsis. A genre similarity matrix and a synopsis similarity matrix were created and combined into a final similarity matrix. This final matrix is used in a function that retrieves similar animes above a certain threshold, sorted by similarity score or by rating. Future improvements could include incorporating more features, such as year and voice actors, and experimenting with collaborative filtering methods.

In [49]:
# Done

# Improvements:
# - Better features
# - Collaborative-based filtering