<img src="https://media.giphy.com/media/TTedQxhzd5T4A/giphy.gif" width="750" align="center">

## (ฅ^•ﻌ•^ฅ) 
## ⛩️ Content Based & Collaborative Filtering - Anime Recommender System ⛩️

### Importing libraries & packages

In [1]:
import ssl
import warnings

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn import cluster
from sklearn.feature_extraction.text import CountVectorizer

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')

# Work around SSL certificate errors before downloading the NLTK data
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Note: stopwords.words() further down also needs the 'stopwords' corpus to be installed
nltk.download('averaged_perceptron_tagger') # POS tagger used by the lemmatizer helper
nltk.download('wordnet') # WordNet data used by WordNetLemmatizer
nltk.download('punkt') # tokenizer models used by word_tokenize

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Mi Notebook\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to C:\Users\Mi
[nltk_data]     Notebook\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Mi
[nltk_data]     Notebook\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 🍥 Introduction 🍥

#### This notebook is written and illustrated by an Ironhack student - Dory, a young data enthusiast who seeks recognition from her peers and dreams of becoming a Data Scientist. With the help of Machine Learning, StackOverflow, Google and her friends from her class, Dory wants to become a Hokage and create an anime recommender.


<img src="https://media.giphy.com/media/Nzz86dByLtYTS/giphy.gif" width="750" align="center">

#### The main goal of this project is to use NLP techniques to analyze the synopses in the anime catalogue, combine them with user rating data, and build an anime recommender system.

#### This notebook uses 3 datasets:
* `anime` : contains general information about every anime (17,562 different anime), such as genre, stats, studio, etc. This file has the following columns:
* * MAL_ID: MyAnimelist ID of the anime. (e.g. 1)
* * Name: full name of the anime. (e.g. Cowboy Bebop)
* * Score: average score of the anime given from all users in MyAnimelist database. (e.g. 8.78)
* * Genres: comma separated list of genres for this anime. (e.g. Action, Adventure, Comedy, Drama, Sci-Fi, Space)
* * English name: full name of the anime in English. (e.g. Cowboy Bebop)
* * Japanese name: full name of the anime in Japanese. (e.g. カウボーイビバップ)
* * Type: TV, movie, OVA, etc. (e.g. TV)
* * Episodes: number of episodes. (e.g. 26)
* * Aired: broadcast date. (e.g. Apr 3, 1998 to Apr 24, 1999)
* * Premiered: season premiere. (e.g. Spring 1998)
* * Producers: comma separated list of producers (e.g. Bandai Visual)
* * Licensors: comma separated list of licensors (e.g. Funimation, Bandai Entertainment)
* * Studios: comma separated list of studios (e.g. Sunrise)
* * Source: Manga, Light novel, Book, etc. (e.g. Original)
* * Duration: duration of the anime per episode. (e.g. 24 min. per ep.)
* * Rating: age rating. (e.g. R - 17+ (violence & profanity))
* * Ranked: position based on the score. (e.g. 28)
* * Popularity: position based on the number of users who have added the anime to their list. (e.g. 39)
* * Members: number of community members that are in this anime's "group". (e.g. 1251960)
* * Favorites: number of users who have the anime as "favorites". (e.g. 61,971)
* * Watching: number of users who are watching the anime. (e.g. 105808)
* * Completed: number of users who have completed the anime. (e.g. 718161)
* * On-Hold: number of users who have the anime on Hold. (e.g. 71513)
* * Dropped: number of users who have dropped the anime. (e.g. 26678)
* * Plan to Watch: number of users who plan to watch the anime. (e.g. 329800)
* * Score-10: number of users who scored 10. (e.g. 229170)
* * Score-9: number of users who scored 9. (e.g. 182126)
* * Score-8: number of users who scored 8. (e.g. 131625)
* * Score-7: number of users who scored 7. (e.g. 62330)
* * Score-6: number of users who scored 6. (e.g. 20688)
* * Score-5: number of users who scored 5. (e.g. 8904)
* * Score-4: number of users who scored 4. (e.g. 3184)
* * Score-3: number of users who scored 3. (e.g. 1357)
* * Score-2: number of users who scored 2. (e.g. 741)
* * Score-1: number of users who scored 1. (e.g. 1580)
* `rating` : contains 57 million ratings given to 16,872 anime by 310,059 users. This file has the following columns:
* * user_id: non identifiable randomly generated user id.
* * anime_id: MyAnimelist ID of the anime that this user has rated.
* * rating: rating that this user has assigned.
* `synopsis` : contains the synopsis (description) of every anime.
* * mal_id : MyAnimelist ID of the anime.
* * name : the name of the anime
* * score : average score of the anime given from all users in MyAnimelist database
* * genres: comma separated list of genres for this anime
* * synopsis : the synopsis of every  anime

### Table of Contents
1. [Introduction](#introduction)
    1. [Datasets information](#this-notebook-uses-3-datasets)
2. [`Anime` dataset](#anime-dataset)
    1. [Cleaning](#cleaning)
    2. [Treating the null values](#treating-the-null-values)
3. [`Rating` dataset](#rating-dataset)
4. [`Synopsis` dataset](#animewithsynopsis-dataset)
    1. [Cleaning](#cleaning-the-columns)
    2. [NLP](#nlp)
    3. [Clustering](#clustering)
5. [The Recommender system](#the-recommender)


<img src="https://media.giphy.com/media/ZFB7q1RYGVoHu/giphy.gif" width="750" align="center">

## 🍡 `anime` dataset 🍡

This dataset contains general information about every anime (17,562 different anime), such as genre, stats, studio, etc.

In [2]:
anime = pd.read_csv('anime.csv')
anime.head(2)

Unnamed: 0,MAL_ID,Name,Score,Genres,English name,Japanese name,Type,Episodes,Aired,Premiered,Producers,Licensors,Studios,Source,Duration,Rating,Ranked,Popularity,Members,Favorites,Watching,Completed,On-Hold,Dropped,Plan to Watch,Score-10,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,Bandai Visual,"Funimation, Bandai Entertainment",Sunrise,Original,24 min. per ep.,R - 17+ (violence & profanity),28.0,39,1251960,61971,105808,718161,71513,26678,329800,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",Unknown,"Sunrise, Bandai Visual",Sony Pictures Entertainment,Bones,Original,1 hr. 55 min.,R - 17+ (violence & profanity),159.0,518,273145,1174,4143,208333,1935,770,57964,30043.0,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0


### Cleaning

<img src="https://media.giphy.com/media/hAuYWrVIyfK5G/giphy.gif" width="750" align="center">

In [3]:
# Helper functions to clean the dataset

# Standardize the column names: lowercase, spaces replaced by underscores
def to_clean_columns(df):
    df.columns = [col.lower().replace(" ", "_") for col in df.columns]

to_clean_columns(anime)

# Expose hidden missing values: the dataset marks them with the string 'Unknown'
def discovering_nans(df):
    df.replace('Unknown', np.nan, inplace=True)
    return df.head()

discovering_nans(anime) # turn every 'Unknown' in the anime dataset into NaN

# Convert a value to float; return None when it cannot be parsed
def to_number(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

anime.head(2)

Unnamed: 0,mal_id,name,score,genres,english_name,japanese_name,type,episodes,aired,premiered,producers,licensors,studios,source,duration,rating,ranked,popularity,members,favorites,watching,completed,on-hold,dropped,plan_to_watch,score-10,score-9,score-8,score-7,score-6,score-5,score-4,score-3,score-2,score-1
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,Bandai Visual,"Funimation, Bandai Entertainment",Sunrise,Original,24 min. per ep.,R - 17+ (violence & profanity),28.0,39,1251960,61971,105808,718161,71513,26678,329800,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",,"Sunrise, Bandai Visual",Sony Pictures Entertainment,Bones,Original,1 hr. 55 min.,R - 17+ (violence & profanity),159.0,518,273145,1174,4143,208333,1935,770,57964,30043.0,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0


In [4]:
anime2 = anime.copy() #to keep the original dataframe
pd.DataFrame(anime2.isna().sum()*100/len(anime2), columns=['percentage'])

Unnamed: 0,percentage
mal_id,0.0
name,0.0
score,29.273431
genres,0.358729
english_name,60.158296
japanese_name,0.273317
type,0.210682
episodes,2.938162
aired,1.759481
premiered,72.981437


In [5]:
print(f'dataframe has {anime2.shape[0]} rows and {anime2.shape[1]} columns')

dataframe has 17562 rows and 35 columns


#### Treating the null values

In [6]:
anime_nulls = pd.DataFrame(anime2.isna().sum()*100/len(anime2), columns=['percentage'])
anime_nulls.sort_values('percentage', ascending = False)

Unnamed: 0,percentage
licensors,77.531033
premiered,72.981437
english_name,60.158296
producers,44.379911
studios,40.308621
score,29.273431
source,20.310899
score-9,18.033254
ranked,10.033026
score-2,9.093497


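The columns `licensors`, `studios`, `producers`, `premiered` and `source` have too many NaN values, so we drop them. `english_name` is also mostly empty, but we want to keep it, so we fill its gaps from the `name` column instead.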

In [7]:
anime2['english_name'] = anime2['english_name'].fillna(anime2['name']) # we want to keep the English name, so we fill the missing ones with the default `name`
anime2 = anime2.drop(['licensors', 'studios', 'producers', 'premiered', 'source'], axis=1) # we drop the other columns that have a high percentage of null values
anime2.head(2)

Unnamed: 0,mal_id,name,score,genres,english_name,japanese_name,type,episodes,aired,duration,rating,ranked,popularity,members,favorites,watching,completed,on-hold,dropped,plan_to_watch,score-10,score-9,score-8,score-7,score-6,score-5,score-4,score-3,score-2,score-1
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",24 min. per ep.,R - 17+ (violence & profanity),28.0,39,1251960,61971,105808,718161,71513,26678,329800,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",1 hr. 55 min.,R - 17+ (violence & profanity),159.0,518,273145,1174,4143,208333,1935,770,57964,30043.0,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0
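
The columns above were picked by hand from the percentage table. As a rough sketch of how the same decision could be automated, the helper below lists every column whose share of nulls exceeds a cutoff. The function name `high_null_columns`, the 50% threshold and the toy data are all assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `anime2`; the values are made up for illustration.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "score": [8.1, np.nan, 7.4, 6.9],
    "licensors": [np.nan, np.nan, np.nan, "X"],
    "premiered": [np.nan, "Spring 1998", np.nan, np.nan],
})

def high_null_columns(df, threshold=50.0, keep=()):
    """Return the columns whose null percentage exceeds `threshold`,
    skipping any column listed in `keep` (hypothetical helper)."""
    null_pct = df.isna().mean() * 100
    return [col for col in null_pct[null_pct > threshold].index if col not in keep]

to_drop = high_null_columns(toy, threshold=50.0)
cleaned = toy.drop(columns=to_drop)
print(to_drop)
```

Passing `keep=('english_name',)` would protect a column that is sparse but still needed, which mirrors what was done by hand above.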


## 🎌 `Rating` dataset 🎌

This dataset contains 57 million ratings given to 16,872 anime by 310,059 users.

In [8]:
rating = pd.read_csv('rating_complete.csv')
rating.head(1)

Unnamed: 0,user_id,anime_id,rating
0,0,430,9


In [9]:
rating.rename(columns = {'anime_id': 'mal_id'}, inplace = True) # to make sure that the anime id is called mal_id in this dataset too
to_clean_columns(rating)
rating.head(1)

Unnamed: 0,user_id,mal_id,rating
0,0,430,9


In [10]:
rating.shape

(57633278, 3)

I need more information from this dataset. I want to know the following:
* how many ratings each user gave, so that I can filter out the users who did not rate much. I only care about the opinion of those anime nerds who rated a lot of anime.
* which rating each user gave most often.

In [11]:
rating['rating_counts_per_user']=rating.groupby("user_id")['rating'].transform('count') # to show how many ratings the user gave
rating['rating_mode']=rating.groupby("user_id")['rating'].transform(lambda x: x.value_counts().idxmax()) # the most common rating

It is also interesting to look at how many ratings each anime received. 

In [12]:
rating['rating_counts_per_anime']=rating.groupby("mal_id")['rating'].transform('count') # to show how many ratings each anime received
rating.head(2)

Unnamed: 0,user_id,mal_id,rating,rating_counts_per_user,rating_mode,rating_counts_per_anime
0,0,430,9,35,7,44672
1,0,1004,5,35,7,18488
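
To make the `groupby(...).transform(...)` pattern used above easier to follow, here is a small sketch on made-up data. Unlike an aggregation, `transform` returns one value per original row, which is why the results can be assigned straight back as new columns. The toy ratings are an assumption for illustration only.

```python
import pandas as pd

# Made-up ratings: user 0 rated three anime, user 1 rated two.
toy = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1],
    "mal_id":  [1, 5, 6, 1, 6],
    "rating":  [9, 7, 7, 8, 8],
})

by_user = toy.groupby("user_id")["rating"]
toy["rating_counts_per_user"] = by_user.transform("count")  # ratings given by each user
toy["rating_mode"] = by_user.transform(lambda x: x.value_counts().idxmax())  # most frequent rating per user
toy["rating_counts_per_anime"] = toy.groupby("mal_id")["rating"].transform("count")  # ratings received per anime

print(toy)
```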



<img src="https://media.giphy.com/media/GtHcfDOIyzyRq/giphy.gif" width="750" align="center">

That's a tough battle.

## 🍜 `anime_with_synopsis` dataset 🍜

This dataset contains the synopsis (description) of every anime. This is the most important dataset, since I will be using the description of each anime to create clusters.

In [13]:
s = pd.read_csv('anime_with_synopsis.csv')
s.head(2)

Unnamed: 0,MAL_ID,Name,Score,Genres,sypnopsis
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ..."


#### Cleaning the columns


In [14]:
s.rename(columns = {'sypnopsis': 'synopsis'}, inplace = True) # fix the misspelled column name
to_clean_columns(s)
s.head()

Unnamed: 0,mal_id,name,score,genres,synopsis
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ..."
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen","Vash the Stampede is the man with a $$60,000,0..."
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",ches are individuals with special powers like ...
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",It is the dark century and the people are suff...


In [15]:
s.isna().sum()

mal_id      0
name        0
score       0
genres      0
synopsis    8
dtype: int64

In [16]:
print(f'Where there is no synopsis, the column shows `{s.synopsis.mode()[0]}` so I replace the null values in `synopsis` with this text.')

Where there is no synopsis, the column shows `No synopsis information has been added to this title. Help improve our database by adding a synopsis here .` so I replace the null values in `synopsis` with this text.


In [17]:
s['synopsis'] = s['synopsis'].fillna(s['synopsis'].dropna().mode().values[0])
s.isnull().sum() #verify that all nulls are gone

mal_id      0
name        0
score       0
genres      0
synopsis    0
dtype: int64

In [18]:
s.shape

(16214, 5)

In [19]:
s['synop_genre'] = s['genres'].map(str) + ' ' + s['synopsis'].map(str)
s.head()

Unnamed: 0,mal_id,name,score,genres,synopsis,synop_genre
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever...","Action, Adventure, Comedy, Drama, Sci-Fi, Spac..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ...","Action, Drama, Mystery, Sci-Fi, Space other da..."
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen","Vash the Stampede is the man with a $$60,000,0...","Action, Sci-Fi, Adventure, Comedy, Drama, Shou..."
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",ches are individuals with special powers like ...,"Action, Mystery, Police, Supernatural, Drama, ..."
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",It is the dark century and the people are suff...,"Adventure, Fantasy, Shounen, Supernatural It i..."


In [20]:
print(s.dtypes)

mal_id          int64
name           object
score          object
genres         object
synopsis       object
synop_genre    object
dtype: object
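
Note that `score` is listed as `object` rather than a float. A likely cause, though this is an assumption and not something checked in the notebook, is that some rows hold a text placeholder such as `Unknown`, as in the `anime` dataset. A minimal sketch of converting such a column with `pd.to_numeric`, using made-up values:

```python
import pandas as pd

# Made-up scores; "Unknown" stands in for a non-numeric placeholder.
scores = pd.Series(["8.78", "Unknown", "7.27"], name="score")

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(scores, errors="coerce")
print(numeric.dtype, numeric.isna().sum())
```

The recommender does not rely on `score` being numeric, so this step is optional here.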


<div>
<img src="https://media.giphy.com/media/Md22NIX1r3Xoc/giphy.gif" width="500" align = 'center'/>
</div>

### NLP

The goal of this section is to turn the combined `synop_genre` text (genres plus synopsis) into a bag of words.

In [21]:
# tokenize, lowercase, and keep purely alphabetic tokens
# (this drops punctuation and numbers, and also hyphenated words such as "Sci-Fi")

def tokenizer_and_remove_punctuation(row):
    tokens = word_tokenize(row['synop_genre'])
    return [word.lower() for word in tokens if word.isalpha()]

s['tokenized'] = s.apply(tokenizer_and_remove_punctuation, axis = 1)
s.head(2)

Unnamed: 0,mal_id,name,score,genres,synopsis,synop_genre,tokenized
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever...","Action, Adventure, Comedy, Drama, Sci-Fi, Spac...","[action, adventure, comedy, drama, space, in, ..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ...","Action, Drama, Mystery, Sci-Fi, Space other da...","[action, drama, mystery, space, other, day, an..."


In [22]:
# pos_tag and the WordNet lemmatizer use different codes for parts of speech,
# so we map the first letter of the pos_tag result to the matching WordNet constant
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper() # gets first letter of POS categorization
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN) # falls back to NOUN when the tag is not in the dictionary

lemmatizer = WordNetLemmatizer()

def lemmatizer_with_pos(row):
    return [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in row['tokenized']]

s['lemmatized'] = s.apply(lemmatizer_with_pos,axis=1)
s.head(2)

Unnamed: 0,mal_id,name,score,genres,synopsis,synop_genre,tokenized,lemmatized
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever...","Action, Adventure, Comedy, Drama, Sci-Fi, Spac...","[action, adventure, comedy, drama, space, in, ...","[action, adventure, comedy, drama, space, in, ..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ...","Action, Drama, Mystery, Sci-Fi, Space other da...","[action, drama, mystery, space, other, day, an...","[action, drama, mystery, space, other, day, an..."


In [23]:
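# to remove the stopwords
# Note: stopwords.words() without a language returns the stop words of every language NLTK ships,
# and set() also drops repeated words and the original word order
all_stopwords = set(stopwords.words()) # build the lookup set once instead of once per row

def remove_sw(row):
    return list(set(row['lemmatized']).difference(all_stopwords))

s['no_stopwords'] = s.apply(remove_sw,axis=1)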
s.head()

Unnamed: 0,mal_id,name,score,genres,synopsis,synop_genre,tokenized,lemmatized,no_stopwords
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever...","Action, Adventure, Comedy, Drama, Sci-Fi, Spac...","[action, adventure, comedy, drama, space, in, ...","[action, adventure, comedy, drama, space, in, ...","[corgi, crew, spiegel, bebop, high, humanity, ..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ...","Action, Drama, Mystery, Sci-Fi, Space other da...","[action, drama, mystery, space, other, day, an...","[action, drama, mystery, space, other, day, an...","[crew, day, bebop, company, mass, ragtag, stak..."
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen","Vash the Stampede is the man with a $$60,000,0...","Action, Sci-Fi, Adventure, Comedy, Drama, Shou...","[action, adventure, comedy, drama, shounen, va...","[action, adventure, comedy, drama, shounen, va...","[rumor, villain, thompson, reason, flattens, c..."
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",ches are individuals with special powers like ...,"Action, Mystery, Police, Supernatural, Drama, ...","[action, mystery, police, supernatural, drama,...","[action, mystery, police, supernatural, drama,...","[source, user, craft, organization, like, rece..."
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",It is the dark century and the people are suff...,"Adventure, Fantasy, Shounen, Supernatural It i...","[adventure, fantasy, shounen, supernatural, it...","[adventure, fantasy, shounen, supernatural, it...","[devil, day, fantasy, dream, defeat, dark, pow..."
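
One side effect of `remove_sw` is worth spelling out: because it goes through `set()`, each word survives at most once per anime, so the later word counts can only be 0 or 1. The sketch below contrasts the set-based approach with a list-based one that keeps repeats. The tiny stop-word list and the sentence are made up for illustration.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and"}  # hypothetical tiny stop-word list

tokens = ["the", "crew", "of", "the", "bebop", "chase", "a",
          "bounty", "and", "the", "crew", "win"]

# Set-based removal: duplicates and word order are lost
as_set = sorted(set(tokens).difference(STOPWORDS))

# List-based removal: duplicates and word order are kept
as_list = [t for t in tokens if t not in STOPWORDS]
counts = Counter(as_list)

print(as_set)
print(as_list)
print(counts["crew"])
```

Whether binary presence or real counts works better for clustering is a modelling choice; the notebook uses binary presence.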


In [24]:
# put all this cleaning together

def re_blob(row):
  return " ".join(row['no_stopwords'])

s['clean_blob'] = s.apply(re_blob,axis=1)
s.head(2)

Unnamed: 0,mal_id,name,score,genres,synopsis,synop_genre,tokenized,lemmatized,no_stopwords,clean_blob
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever...","Action, Adventure, Comedy, Drama, Sci-Fi, Spac...","[action, adventure, comedy, drama, space, in, ...","[action, adventure, comedy, drama, space, in, ...","[corgi, crew, spiegel, bebop, high, humanity, ...",corgi crew spiegel bebop high humanity kid wes...
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ...","Action, Drama, Mystery, Sci-Fi, Space other da...","[action, drama, mystery, space, other, day, an...","[action, drama, mystery, space, other, day, an...","[crew, day, bebop, company, mass, ragtag, stak...",crew day bebop company mass ragtag stake place...


In [25]:
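bow_vect = CountVectorizer()
# fit builds the vocabulary (one column per distinct word); transform counts the words of each anime
X = bow_vect.fit_transform(s['clean_blob']).toarray() # dense array: rows are anime, columns are words
s_df = pd.DataFrame(X,columns=bow_vect.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2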

In [26]:
s_df.shape

(16214, 36260)


<img src="https://media.giphy.com/media/h2eSZKHwohF20gYc7A/giphy.gif" width="750" align="center">

### Clustering

In [27]:
from sklearn import cluster
kmeans = cluster.KMeans(n_clusters=10,random_state=100)
kmeans.fit(X)
pred = kmeans.predict(X)

In [28]:
predict_df = pd.concat([s['mal_id'],pd.DataFrame(pred,columns=['cluster'])],axis=1)
predict_df['cluster'].value_counts() # we need to check and remove the outlier clusters

5    3856
7    3661
4    2676
6    2147
1    1802
2    1346
3     717
0       6
8       2
9       1
Name: cluster, dtype: int64

In [29]:
predict_df.head()

Unnamed: 0,mal_id,cluster
0,1,6
1,5,6
2,6,6
3,7,5
4,8,6


In [30]:
anime_with_clusters = predict_df.merge(anime2, on = 'mal_id',  how = 'inner' ) # we merge the cluster information with the anime df on mal_id
anime_with_clusters.head(2)

Unnamed: 0,mal_id,cluster,name,score,genres,english_name,japanese_name,type,episodes,aired,duration,rating,ranked,popularity,members,favorites,watching,completed,on-hold,dropped,plan_to_watch,score-10,score-9,score-8,score-7,score-6,score-5,score-4,score-3,score-2,score-1
0,1,6,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",24 min. per ep.,R - 17+ (violence & profanity),28.0,39,1251960,61971,105808,718161,71513,26678,329800,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
1,5,6,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",1 hr. 55 min.,R - 17+ (violence & profanity),159.0,518,273145,1174,4143,208333,1935,770,57964,30043.0,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0


In [31]:
anime_with_clusters.cluster.value_counts()

5    3856
7    3661
4    2676
6    2147
1    1802
2    1346
3     717
0       6
8       2
9       1
Name: cluster, dtype: int64

In [32]:
# we remove the outlier clusters 0, 8 and 9, which hold only 6, 2 and 1 anime
awc = anime_with_clusters[anime_with_clusters.cluster.between(1, 7)] # awc = anime with clusters
awc.shape

(16205, 31)
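
The filter above relies on knowing that the tiny clusters happen to be numbered 0, 8 and 9, and those numbers can change with a different `random_state` or number of clusters. A sketch of a filter based on cluster size is shown below. The name `MIN_CLUSTER_SIZE`, its value and the toy data are assumptions for illustration.

```python
import pandas as pd

# Made-up cluster assignments: clusters 0, 8 and 9 each hold a single anime
toy = pd.DataFrame({
    "mal_id": range(1, 11),
    "cluster": [5, 5, 5, 5, 7, 7, 7, 0, 8, 9],
})

MIN_CLUSTER_SIZE = 3  # hypothetical cutoff; pick one that suits the real counts

sizes = toy["cluster"].value_counts()
big_clusters = sizes[sizes >= MIN_CLUSTER_SIZE].index
kept = toy[toy["cluster"].isin(big_clusters)]

print(kept["cluster"].value_counts())
```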

For the recommender, I will only use the ratings of users who have rated a lot of anime, because they are the true anime enthusiasts whose opinions matter to me.

In [33]:
people_who_rated_lots_of_anime = rating.loc[rating['rating_counts_per_user'] >= 200] # we are only interested in people who have rated 200 or more anime
print(f'{people_who_rated_lots_of_anime.user_id.nunique()} anime nerds have rated 200 or more anime')
print(f'They have rated {people_who_rated_lots_of_anime.mal_id.nunique()} anime in total') # number of distinct anime rated by these users

95143 anime nerds have rated 200 or more anime
They have rated 16864 anime in total


In [34]:
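# Save the filtered data and read it back. to_csv() writes the index by default,
# so the reloaded frames gain an extra `Unnamed: 0` column; index=False would avoid that.
people_who_rated_lots_of_anime.to_csv('filtered_ratings.csv')
people_who_rated_lots_of_anime = pd.read_csv('filtered_ratings.csv')

awc.to_csv('awc.csv')
awc = pd.read_csv('awc.csv')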
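
Saving with `to_csv` and reading back adds the old index as an extra `Unnamed: 0` column unless `index=False` is passed. The sketch below shows the difference on a made-up frame, using an in-memory buffer in place of a real file so that nothing is written to disk.

```python
import io

import pandas as pd

toy = pd.DataFrame({"mal_id": [1, 5], "cluster": [6, 6]})

# Default: the index is written as an unnamed first column
buf_default = io.StringIO()
toy.to_csv(buf_default)
buf_default.seek(0)
with_index = pd.read_csv(buf_default)

# index=False: only the real columns are written
buf_clean = io.StringIO()
toy.to_csv(buf_clean, index=False)
buf_clean.seek(0)
without_index = pd.read_csv(buf_clean)

print(list(with_index.columns))
print(list(without_index.columns))
```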

In [35]:
anime_list = people_who_rated_lots_of_anime.mal_id.unique().tolist() # putting the anime_id from the rating df into a list
awc[awc.mal_id.isin(anime_list)] # we check if those anime ids are present in our dataframe

Unnamed: 0.1,Unnamed: 0,mal_id,cluster,name,score,genres,english_name,japanese_name,type,episodes,aired,duration,rating,ranked,popularity,members,favorites,watching,completed,on-hold,dropped,plan_to_watch,score-10,score-9,score-8,score-7,score-6,score-5,score-4,score-3,score-2,score-1
0,0,1,6,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26.0,"Apr 3, 1998 to Apr 24, 1999",24 min. per ep.,R - 17+ (violence & profanity),28.0,39,1251960,61971,105808,718161,71513,26678,329800,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
1,1,5,6,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1.0,"Sep 1, 2001",1 hr. 55 min.,R - 17+ (violence & profanity),159.0,518,273145,1174,4143,208333,1935,770,57964,30043.0,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0
2,2,6,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",Trigun,トライガン,TV,26.0,"Apr 1, 1998 to Sep 30, 1998",24 min. per ep.,PG-13 - Teens 13 or older,266.0,201,558913,12944,29113,343492,25465,13925,146918,50229.0,75651.0,86142.0,49432.0,15376.0,5838.0,1965.0,664.0,316.0,533.0
3,3,7,5,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),TV,26.0,"Jul 2, 2002 to Dec 24, 2002",25 min. per ep.,PG-13 - Teens 13 or older,2481.0,1467,94683,587,4300,46165,5121,5378,33719,2182.0,4806.0,10128.0,11618.0,5709.0,2920.0,1083.0,353.0,164.0,131.0
4,4,8,6,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",Beet the Vandel Buster,冒険王ビィト,TV,52.0,"Sep 30, 2004 to Sep 29, 2005",23 min. per ep.,PG - Children,3710.0,4369,13224,18,642,7314,766,1108,3394,312.0,529.0,1242.0,1713.0,1068.0,634.0,265.0,83.0,50.0,27.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16156,16165,47614,3,Nu Wushen de Canzhuo Spring Festival Special,6.83,"Slice of Life, Comedy",Cooking with Valkyries Spring Festival Special,,Special,1.0,"Feb 11, 2021",10 min.,PG - Children,4382.0,11973,540,8,51,168,18,2,301,15.0,17.0,16.0,29.0,26.0,10.0,1.0,,3.0,
16157,16166,47616,7,Yakusoku no Neverland 2nd Season: Michishirube,4.81,"Mystery, Psychological, Supernatural, Thriller...",The Promised Neverland Season 2 Episode 5.5,約束のネバーランド 特別編「道標」,Special,1.0,"Feb 12, 2021",23 min.,R - 17+ (violence & profanity),10760.0,4398,13070,90,1183,8196,119,202,3370,188.0,141.0,317.0,565.0,998.0,1542.0,831.0,516.0,336.0,722.0
16158,16167,47618,5,Ichi Nichi Shite Narazu,,Slice of Life,Ichi Nichi Shite Narazu,一日にして成らず,ONA,1.0,"Feb 8, 2021",6 min.,,14587.0,17156,73,0,7,42,1,2,21,,,1.0,2.0,6.0,16.0,2.0,,1.0,1.0
16170,16179,48177,7,Ichiban Chikakute Tooi Hoshi,,"Music, Drama",Ichiban Chikakute Tooi Hoshi,一番近くて遠い星,ONA,1.0,"Feb 18, 2021",1 min.,PG - Children,14582.0,15826,151,2,9,74,2,4,62,,,3.0,11.0,11.0,10.0,1.0,1.0,,


<div>
<img src="https://media.giphy.com/media/RneIcLEosVuta/giphy.gif" width="500" align = 'center'/>
</div>

## 🐱‍👤 The Recommender 🐱‍👤

In [40]:
def r():
    your_anime = input('Give me an anime, bestie ^^ ') # your anime of choice
    input_with_cluster = awc.loc[awc['name'] == your_anime] # the row of your anime
    if input_with_cluster.empty:
        raise ValueError(f'"{your_anime}" is not in the catalogue') # the title must match the `name` column exactly
    cluster_no = input_with_cluster['cluster'].values[0] # storing that cluster number safe and sound <3
    similar_anime = awc.loc[awc['cluster'] == cluster_no] # finding other anime from the same cluster
    anime_list = similar_anime.mal_id.unique().tolist() # we store the ids of the anime in that cluster
    rating_df = people_who_rated_lots_of_anime[people_who_rated_lots_of_anime.mal_id.isin(anime_list)] # ratings of those anime, from the enthusiasts only
    rating_df = rating_df.drop_duplicates('mal_id') # one row per anime
    rating_df = rating_df.sort_values('rating_counts_per_anime', ascending=False).head(50) # keep the 50 anime that received the most ratings
    rating_df = rating_df[['mal_id']] # store the filtered anime ids only
    final_df = rating_df.merge(awc, on='mal_id', how='inner') # put the anime information back into a dataframe
    final_df = final_df[final_df.rating == input_with_cluster.rating.values[0]] # make sure we recommend the same age rating
    return final_df


def recommendation(temp):
    feedback = 'No'
    while feedback == 'No': # keep sampling until the answer is anything other than "No"
        print(temp['name'].sample(min(5, len(temp))).tolist()) # up to 5 random titles from the candidates
        feedback = input('Do you like this recommendation?')
    return ()

def recommender():
    temp = r()
    recommendation(temp)
    return ()

In [41]:
recommender()

['Naruto', 'SSSS.Gridman', 'Boruto: Naruto the Movie', 'Black Cat (TV)', 'Dr. Stone']
['Dr. Stone', 'Fullmetal Alchemist: The Conqueror of Shamballa', 'Enen no Shouboutai', 'One Piece Movie 1', 'SSSS.Gridman']


()
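
Because the recommender asks for keyboard input, it is hard to test or reuse. Below is a sketch of the same idea as a plain function: find the cluster and age rating of a title, keep the most-rated anime that share both, and sample a few. The function name `recommend_from_cluster`, its parameters and the toy tables are assumptions for illustration, and unlike the interactive version it leaves the input title out of the suggestions.

```python
import pandas as pd

def recommend_from_cluster(title, catalogue, ratings, top_n=50, k=5, seed=None):
    """Sample up to `k` titles from the `top_n` most-rated anime that share the
    cluster and age rating of `title`. Returns [] when nothing matches."""
    match = catalogue.loc[catalogue["name"] == title]
    if match.empty:
        return []
    cluster_no = match["cluster"].iloc[0]
    age_rating = match["rating"].iloc[0]
    candidates = catalogue[(catalogue["cluster"] == cluster_no)
                           & (catalogue["rating"] == age_rating)
                           & (catalogue["name"] != title)]
    # Count how many ratings each candidate received and keep the top_n
    counts = ratings[ratings["mal_id"].isin(candidates["mal_id"])]["mal_id"].value_counts()
    pool = candidates[candidates["mal_id"].isin(counts.head(top_n).index)]
    if pool.empty:
        return []
    return pool["name"].sample(min(k, len(pool)), random_state=seed).tolist()

# Made-up catalogue and ratings standing in for `awc` and the filtered ratings
catalogue = pd.DataFrame({
    "mal_id": [1, 2, 3, 4, 5],
    "name": ["A", "B", "C", "D", "E"],
    "cluster": [6, 6, 6, 5, 6],
    "rating": ["PG-13", "PG-13", "R", "PG-13", "PG-13"],
})
ratings = pd.DataFrame({"mal_id": [2, 2, 5, 3, 4, 1]})

picks = recommend_from_cluster("A", catalogue, ratings, seed=0)
print(sorted(picks))
```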

<img src="https://media.giphy.com/media/yALcFbrKshfoY/giphy.gif" width="750" align="center">

