# Top 10 Similar Anime Using Hybrid Filtering

This post uses processed data (in the dataframe **anime**) showing each anime's genre(s), type (show, movie, etc.), average rating, and the number of viewers who reviewed it. In addition, there is a dataframe (**ratings**) of individual viewers' ratings for each show.

We will use these to build a system that shows the top 10 most similar anime to any given anime.

In [1]:
import pickle
import pandas as pd
import numpy as np

DIR_DATA = "data"
DIR_PROCESSED = "processed"

The data in **anime** was preprocessed into its current format and then stored in a pickle file. This process will be discussed in another post.

Both dataframes have a "rating" column, so we move the rating field in **anime** into a new field called *avg_rating*.

In [2]:
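# Load the preprocessed anime dataframe; the context manager closes the file for us
with open(DIR_PROCESSED + '/one_hot_encoded_anime.pickle', 'rb') as a_file:
    anime = pickle.load(a_file)

# "rating" is in both tables, so keep the anime-level one as avg_rating
anime["avg_rating"] = anime["rating"]
anime = anime.drop("rating", axis=1)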

ratings = pd.read_csv(DIR_DATA + '/rating.csv')

In [3]:
anime.head()

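| | anime_id | name | episodes | members | Movie | Music | ONA | OVA | Special | TV | ... | Sports | Super Power | Supernatural | Thriller | Vampire | Yaoi | Yuri | nan | int_rating | avg_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | 1 | 200630 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9 | 9.37 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | 64 | 793665 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9 | 9.26 |
| 2 | 28977 | Gintama° | 51 | 114262 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 9 | 9.25 |
| 3 | 9253 | Steins;Gate | 24 | 673572 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9 | 9.17 |
| 4 | 9969 | Gintama' | 51 | 151266 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 9 | 9.16 |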


In [4]:
ratings.head()

| | user_id | anime_id | rating |
|---|---|---|---|
| 0 | 1 | 20 | -1 |
| 1 | 1 | 24 | -1 |
| 2 | 1 | 79 | -1 |
| 3 | 1 | 226 | -1 |
| 4 | 1 | 241 | -1 |
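
Note the `-1` values in the *rating* column. In the Kaggle anime recommendations dataset, which this `rating.csv` appears to be, `-1` marks a show the user watched without assigning a score. Assuming that holds here, one option is to treat those entries as missing before computing means or correlations. The sketch below uses toy data and hypothetical names (`ratings_toy`, `cleaned_toy`); the code in the rest of this post keeps the raw values.

```python
import pandas as pd

# Toy ratings in the same shape as rating.csv (values are made up).
ratings_toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "anime_id": [20, 24, 20, 24],
    "rating": [-1, 8, 10, -1],
})

# Assumption: -1 is a "no score given" placeholder, so turn it into NaN.
# Series.where keeps values where the condition is True and inserts NaN elsewhere.
cleaned_toy = ratings_toy.assign(
    rating=ratings_toy["rating"].where(ratings_toy["rating"] != -1)
)

# Means computed on raw vs. cleaned ratings differ noticeably.
mean_raw = ratings_toy.groupby("anime_id")["rating"].mean()
mean_clean = cleaned_toy.groupby("anime_id")["rating"].mean()
print(mean_raw)
print(mean_clean)
```

Because `mean` skips NaN, the cleaned averages only reflect scores that users actually gave.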


Now, we will merge the two dataframes into one **shows** dataframe.

In [5]:
# anime_id is the only column the two dataframes share, so join on it explicitly
shows = pd.merge(anime, ratings, on="anime_id")
shows.head()

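| | anime_id | name | episodes | members | Movie | Music | ONA | OVA | Special | TV | ... | Supernatural | Thriller | Vampire | Yaoi | Yuri | nan | int_rating | avg_rating | user_id | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | 1 | 200630 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9 | 9.37 | 99 | 5 |
| 1 | 32281 | Kimi no Na wa. | 1 | 200630 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9 | 9.37 | 152 | 10 |
| 2 | 32281 | Kimi no Na wa. | 1 | 200630 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9 | 9.37 | 244 | 10 |
| 3 | 32281 | Kimi no Na wa. | 1 | 200630 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9 | 9.37 | 271 | 10 |
| 4 | 32281 | Kimi no Na wa. | 1 | 200630 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9 | 9.37 | 278 | -1 |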
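
To make the merge step concrete, here is a minimal sketch on toy data. The frames `anime_toy` and `ratings_toy` and their values are hypothetical stand-ins, not the real dataset. By default `pd.merge` performs an inner join on the columns the two frames have in common, so an anime nobody rated drops out of the result.

```python
import pandas as pd

# Toy stand-ins for the anime and ratings dataframes (made-up values).
anime_toy = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "avg_rating": [8.5, 7.0, 6.2],
})
ratings_toy = pd.DataFrame({
    "user_id": [10, 10, 11],
    "anime_id": [1, 2, 1],
    "rating": [9, 6, 10],
})

# Inner join on anime_id: one output row per (anime, user rating) pair.
shows_toy = pd.merge(anime_toy, ratings_toy, on="anime_id")
print(shows_toy)
```

Each anime's metadata is repeated once per user rating, which is why *Kimi no Na wa.* fills the first rows of **shows**.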


For this example, we will find ***Dragon Ball Z***'s top 10 most similar shows.

In [6]:
showName = "Dragon Ball Z"

# Reuse showName so the target show only has to be changed in one place
shows[shows.name == showName].head()

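| | anime_id | name | episodes | members | Movie | Music | ONA | OVA | Special | TV | ... | Supernatural | Thriller | Vampire | Yaoi | Yuri | nan | int_rating | avg_rating | user_id | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1259607 | 813 | Dragon Ball Z | 291 | 375662 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8 | 8.32 | 3 | 10 |
| 1259608 | 813 | Dragon Ball Z | 291 | 375662 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8 | 8.32 | 5 | 5 |
| 1259609 | 813 | Dragon Ball Z | 291 | 375662 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8 | 8.32 | 7 | 9 |
| 1259610 | 813 | Dragon Ball Z | 291 | 375662 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8 | 8.32 | 12 | 9 |
| 1259611 | 813 | Dragon Ball Z | 291 | 375662 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8 | 8.32 | 21 | 8 |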


Now let's make a pivot table where the value at element (*user_id*, *name*) is the rating the user with *user_id* gave to the show with that *name*, and NaN if no rating exists.

In practice you could also use the *anime_id* instead of the *name* of the anime, but for readability we went with *name*.

In [7]:
animeRatings = shows.pivot_table(index=['user_id'], columns='name', values='rating')
animeRatings.head()

| user_id | "0" | "Aesop" no Ohanashi yori: Ushi to Kaeru, Yokubatta Inu | "Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi | "Bungaku Shoujo" Memoire | "Bungaku Shoujo" Movie | "Eiji" | .hack//G.U. Returner | .hack//G.U. Trilogy | .hack//G.U. Trilogy: Parody Mode | .hack//Gift | ... | makemagic | on-chan, Yume Power Daibouken! | s.CRY.ed | vivi | xxxHOLiC | xxxHOLiC Kei | xxxHOLiC Movie: Manatsu no Yoru no Yume | xxxHOLiC Rou | xxxHOLiC Shunmuki | ◯ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN |
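
The real pivot table is too sparse to read at a glance, so here is a small sketch of the same reshaping on toy data (`long_toy` and its values are hypothetical). One detail worth knowing: `pivot_table` aggregates with the mean by default, so if a user had two ratings for the same name, the cell would hold their average.

```python
import pandas as pd

# Long format: one row per (user, show) rating, like the shows dataframe.
long_toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "name": ["A", "B", "A", "B"],
    "rating": [8, 6, 9, 7],
})

# Wide format: one row per user, one column per show, NaN where no rating exists.
wide_toy = long_toy.pivot_table(index="user_id", columns="name", values="rating")
print(wide_toy)
```

User 2 never rated show B and user 3 never rated show A, so those two cells come out as NaN.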


Let's look at the ratings that ***Dragon Ball Z*** got from users.

In [9]:
showRatings = animeRatings[showName]
showRatings.head()

user_id
1     NaN
2     NaN
3    10.0
4     NaN
5     5.0
Name: Dragon Ball Z, dtype: float64

<!-- TEASER_END -->
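## Collaborative Filtering

Now let's see which shows' ratings correlate with those of ***Dragon Ball Z***. For each show, `corrwith` computes the Pearson correlation between its ratings and ***DBZ***'s over the users who rated both. If users who scored ***DBZ*** highly also scored *Fullmetal Alchemist* highly (and low scorers scored it low), then the two are similar (correlation close to 1.0); if the scores are unrelated, then they aren't similar (correlation closer to 0).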

In [13]:
similarAnimes = animeRatings.corrwith(showRatings)
similarAnimes = similarAnimes.dropna()
df = pd.DataFrame(similarAnimes)
df.head()


| name | 0 |
|---|---|
| "0" | 0.663817 |
| "Aesop" no Ohanashi yori: Ushi to Kaeru, Yokubatta Inu | 1.0 |
| "Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi | 0.438522 |
| "Bungaku Shoujo" Memoire | 0.430201 |
| "Bungaku Shoujo" Movie | 0.542192 |

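As a rough illustration of what `corrwith` is doing, the sketch below runs it on a toy rating matrix. The column names `Target`, `Dense`, and `Sparse` and all values are hypothetical. The correlation for each column uses only the users who rated both that show and the target, so a show with just two co-raters can only score exactly +1 or -1.

```python
import numpy as np
import pandas as pd

# Toy user x show matrix; NaN means the user did not rate that show.
toy = pd.DataFrame(
    {
        "Target": [10, 5, 9, 9, 8],
        "Dense": [9, 6, 9, 8, 8],
        "Sparse": [np.nan, 2, np.nan, np.nan, 4],
    },
    index=pd.Index([3, 5, 7, 12, 21], name="user_id"),
)

# Pearson correlation of every column with the Target column.
sim_toy = toy.corrwith(toy["Target"])
print(sim_toy)
```

`Sparse` ends up with a correlation of 1.0 (up to floating point) from only two users, while `Dense`, rated by all five, lands slightly below 1.
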

Now let's see those similarity scores in order, highest to lowest.

In [14]:
# Series.order() has been removed from pandas; sort_values() is the replacement
similarAnimes.sort_values(ascending=False)


name
Tamagotchi! Miracle Friends                                     1.000000
3-tsu no Hanashi                                                1.000000
Ojiichan ga Kaizoku Datta Koro                                  1.000000
Itazura Tenshi Chippo-chan                                      1.000000
Itoshi no Betty Mamonogatari                                    1.000000
Oishinbo: Nichibei Kome Sensou                                  1.000000
Oishinbo                                                        1.000000
Jam                                                             1.000000
Jankenman                                                       1.000000
Okashina Hotel                                                  1.000000
Jarinko Chie (TV)                                               1.000000
Obocchama-kun                                                   1.000000
Wan Wan Chuushingura                                            1.000000
Wan Wan Celepoo Soreyuke! Tetsunoshin         

It seems like a lot of users gave other anime the same score that they gave to ***DBZ***, which makes sense since the ratings are just integers between 1 and 10 inclusive (apart from the -1 entries visible in `ratings.head()`).

So let's aggregate the ratings per show: for each anime, we compute how many users rated it and the mean rating they gave. The count tells us how many users could have contributed to each similarity score.

In [17]:
# String aggregation names give the same ('rating', 'size') and ('rating', 'mean') columns
animeStats = shows.groupby('name').agg({'rating': ['size', 'mean']})
animeStats.head()

| name | (rating, size) | (rating, mean) |
|---|---|---|
| "0" | 26 | 2.769231 |
| "Aesop" no Ohanashi yori: Ushi to Kaeru, Yokubatta Inu | 2 | 0.0 |
| "Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi | 782 | 5.774936 |
| "Bungaku Shoujo" Memoire | 809 | 6.155748 |
| "Bungaku Shoujo" Movie | 1535 | 6.45798 |


Hmm. It seems like we may end up in a case where a show with a handful of reviewers can seem very similar to ***DBZ*** if that same handful of people also watched and liked ***DBZ***.

To avoid this, we're going to filter out any show with fewer than 100 reviews. The threshold of 100 is arbitrary; it stands in for "enough reviews that we can draw meaningful conclusions".
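
The filter used in this post counts all of a show's reviews. A stricter variant, sketched here on toy data, counts only the users who rated both the show and the target, since those are the users the correlation is actually computed from. `MIN_OVERLAP`, `toy`, and the show names are hypothetical, and the threshold of 3 is chosen only to suit the tiny example.

```python
import numpy as np
import pandas as pd

# Toy user x show rating matrix (made-up values); NaN means "not rated".
toy = pd.DataFrame(
    {
        "Target": [10, 8, 6, 4, 2],
        "Popular": [9, 8, 5, 5, 1],
        "Obscure": [np.nan, 7, np.nan, 3, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)
target = toy["Target"]
MIN_OVERLAP = 3  # hypothetical threshold for this toy example

# Keep only users who rated the target, then count non-missing ratings per show.
overlap = toy[target.notna()].notna().sum()

# Correlate every show with the target and keep shows with enough co-raters.
similarity = toy.corrwith(target)
trusted = similarity[overlap >= MIN_OVERLAP].drop("Target")
print(trusted)
```

`Obscure` is dropped because only two users rated both it and the target, even though its raw correlation is perfect.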

Printed below are the top 10 shows by average rating among those with at least 100 reviews. These are popular, well-known anime, several of which I've watched and love (*Fullmetal Alchemist* and *Code Geass*).

In [18]:
popularAnimes = animeStats['rating']['size'] >= 100
animeStats[popularAnimes].sort_values([('rating','mean')], ascending=False).head(10)

| name | (rating, size) | (rating, mean) |
|---|---|---|
| Kimi no Na wa. | 2199 | 8.297863 |
| Ginga Eiyuu Densetsu | 903 | 8.239203 |
| Steins;Gate | 19283 | 8.126796 |
| Fullmetal Alchemist: Brotherhood | 24574 | 8.028933 |
| Gintama° | 1386 | 7.95671 |
| Hunter x Hunter (2011) | 8575 | 7.924082 |
| Clannad: After Story | 17854 | 7.835275 |
| Monster | 4594 | 7.809099 |
| Gintama | 4974 | 7.775231 |
| Code Geass: Hangyaku no Lelouch R2 | 24242 | 7.765943 |


Now, we join this table of average ratings with our previous table of similarities to ***DBZ***.

In [26]:
df = animeStats[popularAnimes].join(pd.DataFrame(similarAnimes, columns=['similarity']))
df.head()



| name | (rating, size) | (rating, mean) | similarity |
|---|---|---|---|
| "Bungaku Shoujo" Kyou no Oyatsu: Hatsukoi | 782 | 5.774936 | 0.438522 |
| "Bungaku Shoujo" Memoire | 809 | 6.155748 | 0.430201 |
| "Bungaku Shoujo" Movie | 1535 | 6.45798 | 0.542192 |
| .hack//G.U. Returner | 730 | 4.80411 | 0.673044 |
| .hack//G.U. Trilogy | 1118 | 5.347943 | 0.648628 |
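
On recent pandas versions, joining a frame with two-level columns to one with single-level columns may raise an error rather than produce the tuple-named columns shown here. One way around that, sketched on toy data, is to build the statistics with flat column names from the start. The names `stats_toy`, `similarity_toy`, `num_ratings`, `mean_rating`, and `MIN_RATINGS` are hypothetical.

```python
import pandas as pd

# Toy long-format ratings (made-up values).
ratings_toy = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B", "C"],
    "rating": [8, 9, 7, 6, 5, 10],
})

# Named aggregation yields flat column names instead of a column MultiIndex.
stats_toy = ratings_toy.groupby("name")["rating"].agg(
    num_ratings="count", mean_rating="mean"
)

# Toy similarity scores; giving the Series a name lets join() label the new column.
similarity_toy = pd.Series({"A": 1.0, "B": 0.4, "C": 0.9}, name="similarity")

MIN_RATINGS = 2  # hypothetical threshold for this toy example
joined_toy = stats_toy[stats_toy["num_ratings"] >= MIN_RATINGS].join(similarity_toy)
print(joined_toy)
```

Show C has a single rating, so it is filtered out before the join, mirroring the `popularAnimes` mask.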


Now let's see the top 10 most similar shows to ***DBZ***.

In [20]:
df.sort_values(['similarity'], ascending=False).head(10)

| name | (rating, size) | (rating, mean) | similarity |
|---|---|---|---|
| Dragon Ball Z | 17032 | 6.772135 | 1.0 |
| Soratobu Yuureisen | 163 | 5.380368 | 0.947088 |
| Omae Umasou da na | 171 | 6.824561 | 0.918776 |
| Ginga Tetsudou Monogatari | 144 | 5.638889 | 0.887691 |
| Galaxy Angel 3 Specials | 117 | 4.965812 | 0.865963 |
| Dragon Ball | 14117 | 6.597152 | 0.863344 |
| Aquarion Movie: Ippatsu Gyakuten-hen | 141 | 5.404255 | 0.861022 |
| Sol Bianca | 109 | 4.568807 | 0.85603 |
| Ojamajo Doremi Na-i-sho | 135 | 5.251852 | 0.854087 |
| Youma | 186 | 4.806452 | 0.852663 |


Unsurprisingly, the most similar show to ***Dragon Ball Z*** is ***Dragon Ball Z*** itself. In a real system we would add a step to remove this obvious but useless recommendation.
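
A minimal sketch of that removal step might look like the following. The helper `top_similar` is hypothetical, and the toy frame reuses a few similarity values from the results table.

```python
import pandas as pd

def top_similar(df, show_name, n=10):
    """Return the n rows with the highest similarity, excluding show_name itself."""
    # errors="ignore" keeps this safe if the show was already filtered out.
    candidates = df.drop(index=show_name, errors="ignore")
    return candidates.sort_values("similarity", ascending=False).head(n)

# A few similarity values copied from the results table.
toy = pd.DataFrame(
    {"similarity": [1.0, 0.947088, 0.918776, 0.863344]},
    index=["Dragon Ball Z", "Soratobu Yuureisen", "Omae Umasou da na", "Dragon Ball"],
)

top2 = top_similar(toy, "Dragon Ball Z", n=2)
print(top2)
```

Dropping by index label works here because the joined dataframe is indexed by show name.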

***Dragon Ball*** is where it should be in the list (on it, but not at the top) given its slightly different genre. It's more fantasy/*Journey to the West* style than ***DBZ***'s more Superman-y sci-fi plots with aliens and androids.