# Research Question

Can we accurately predict whether an upcoming anime/manga will become popular based on its genres, ranking, and other features? Which factors make audiences value or like a particular anime/manga more?

# Dataset Exploration

To answer this question, let's use "animes.csv", which was downloaded from Kaggle and added to the repository.
Before using it, though, let's study it first.

In [1]:
import pandas as pd

In [2]:
animes_df = pd.read_csv("animes.csv")
animes_df.head()

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,img_url,link
0,28891,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...","Oct 4, 2015 to Mar 27, 2016",25.0,489888,141,25.0,8.82,https://cdn.myanimelist.net/images/anime/9/766...,https://myanimelist.net/anime/28891/Haikyuu_Se...
1,23273,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,"['Drama', 'Music', 'Romance', 'School', 'Shoun...","Oct 10, 2014 to Mar 20, 2015",22.0,995473,28,24.0,8.83,https://cdn.myanimelist.net/images/anime/3/671...,https://myanimelist.net/anime/23273/Shigatsu_w...
2,34599,Made in Abyss,The Abyss—a gaping chasm stretching down into ...,"['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F...","Jul 7, 2017 to Sep 29, 2017",13.0,581663,98,23.0,8.83,https://cdn.myanimelist.net/images/anime/6/867...,https://myanimelist.net/anime/34599/Made_in_Abyss
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...","['Action', 'Military', 'Adventure', 'Comedy', ...","Apr 5, 2009 to Jul 4, 2010",64.0,1615084,4,1.0,9.23,https://cdn.myanimelist.net/images/anime/1223/...,https://myanimelist.net/anime/5114/Fullmetal_A...
4,31758,Kizumonogatari III: Reiketsu-hen,After helping revive the legendary vampire Kis...,"['Action', 'Mystery', 'Supernatural', 'Vampire']","Jan 6, 2017",1.0,214621,502,22.0,8.83,https://cdn.myanimelist.net/images/anime/3/815...,https://myanimelist.net/anime/31758/Kizumonoga...


In [3]:
animes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19311 entries, 0 to 19310
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   uid         19311 non-null  int64  
 1   title       19311 non-null  object 
 2   synopsis    18336 non-null  object 
 3   genre       19311 non-null  object 
 4   aired       19311 non-null  object 
 5   episodes    18605 non-null  float64
 6   members     19311 non-null  int64  
 7   popularity  19311 non-null  int64  
 8   ranked      16099 non-null  float64
 9   score       18732 non-null  float64
 10  img_url     19131 non-null  object 
 11  link        19311 non-null  object 
dtypes: float64(3), int64(3), object(6)
memory usage: 1.8+ MB


The "animes.csv" file lists anime with the following features:
1. uid - The ID of the anime/manga
2. title - The title of the anime/manga
3. synopsis - A brief intro to what the anime/manga is about
4. genre - The list of genres the anime/manga is assigned to
5. aired - When it started airing and when it ended
6. episodes - The number of episodes
7. members - The number of users that have that entry in their anime/manga list
8. popularity - The overall position of the entry, ranked by how many users have it in their anime/manga list
9. ranked - The overall position of the entry, sorted by user score from highest to lowest
10. score - The overall rating on a scale from 1 to 10, from worst to best
11. img_url - The image URL of the corresponding anime/manga
12. link - The MyAnimeList link of the corresponding anime/manga
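
To make the relationship between members and popularity concrete, here is a minimal sketch on toy data (the titles and member counts are made up, not taken from "animes.csv"). It assumes popularity is simply the descending rank of members; how MyAnimeList actually breaks ties is not something the dataset tells us, so `method="min"` is an assumption.

```python
import pandas as pd

# Toy data (hypothetical): member counts for four made-up titles.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "members": [500, 2000, 1200, 50],
})

# Rank by members in descending order, so the largest audience gets position 1.
# method="min" is an assumed tie-breaking rule, not MyAnimeList's documented one.
toy_df["popularity"] = (
    toy_df["members"].rank(ascending=False, method="min").astype(int)
)

print(toy_df.sort_values("popularity"))
```

If this assumption holds for the real data, members and popularity carry largely the same information, which is worth keeping in mind when choosing features.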

# Dataset Cleaning

At first glance, uid, aired, img_url, and link look unnecessary for training any model. Thus, I'm going to remove them.

In [4]:
animes_df = animes_df.drop(columns = ['uid', 'aired', 'img_url', 'link'])
animes_df.head()

Unnamed: 0,title,synopsis,genre,episodes,members,popularity,ranked,score
0,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...",25.0,489888,141,25.0,8.82
1,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,"['Drama', 'Music', 'Romance', 'School', 'Shoun...",22.0,995473,28,24.0,8.83
2,Made in Abyss,The Abyss—a gaping chasm stretching down into ...,"['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F...",13.0,581663,98,23.0,8.83
3,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...","['Action', 'Military', 'Adventure', 'Comedy', ...",64.0,1615084,4,1.0,9.23
4,Kizumonogatari III: Reiketsu-hen,After helping revive the legendary vampire Kis...,"['Action', 'Mystery', 'Supernatural', 'Vampire']",1.0,214621,502,22.0,8.83


The genre column appears to hold a list of genres. However, each element is actually a string rather than a list, as shown below.

In [26]:
animes_df['genre'][0]

"['Comedy', 'Sports', 'Drama', 'School', 'Shounen']"

In [27]:
type(animes_df['genre'][0])

str

Thus, let's convert each element of the genre column from a string into a list using eval().

In [38]:
animes_df['genre'] = animes_df['genre'].map(eval)
type(animes_df['genre'][0])

list
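
As a side note, eval() executes whatever expression the string contains, which is risky if the CSV ever comes from an untrusted source. A possible alternative, sketched here on a single value copied from the genre column, is `ast.literal_eval` from the standard library, which only accepts Python literals. This is a suggestion, not what the notebook above ran.

```python
import ast

# Sample value copied from the genre column shown above.
raw_genre = "['Comedy', 'Sports', 'Drama', 'School', 'Shounen']"

# literal_eval only parses Python literals (lists, strings, numbers, ...),
# so it cannot run arbitrary code the way eval() can.
genres = ast.literal_eval(raw_genre)
print(type(genres), genres)

# A non-literal expression is rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

On the DataFrame, the same idea would be `animes_df['genre'].map(ast.literal_eval)` in place of `.map(eval)`.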

While exploring the DataFrame, I noticed two "Death Note" entries.

In [44]:
animes_df.query('popularity == 1')

Unnamed: 0,title,synopsis,genre,episodes,members,popularity,ranked,score
740,Death Note,"A shinigami, as a god of death, can kill any p...","[Mystery, Police, Psychological, Supernatural,...",37.0,1871043,1,52.0,8.65
17762,Death Note,"A shinigami, as a god of death, can kill any p...","[Mystery, Police, Psychological, Supernatural,...",37.0,1871043,1,52.0,8.65


This shows that the dataset contains duplicate anime/manga entries, which would bias model training and weaken the model's ability to generalize.
The code below confirms it.

In [50]:
duplicate_df = animes_df.copy()
duplicate_df['genre'] = duplicate_df['genre'].apply(tuple)
duplicate_df.duplicated().any()

True
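
To see which rows are duplicated rather than just whether any exist, something like the following sketch could help. It uses a small toy frame loosely modeled on the rows above (the genre tuples are shortened and illustrative), and relies on `keep=False` so that every copy in a duplicate group is flagged.

```python
import pandas as pd

# Toy frame (illustrative values); genre is already a tuple so every cell is hashable.
toy_df = pd.DataFrame({
    "title": ["Death Note", "Made in Abyss", "Death Note"],
    "genre": [("Mystery", "Police"), ("Sci-Fi", "Adventure"), ("Mystery", "Police")],
    "score": [8.65, 8.83, 8.65],
})

# keep=False marks every member of a duplicate group, not just the later copies.
dupe_mask = toy_df.duplicated(keep=False)
print(toy_df[dupe_mask])

# With the default keep="first", only the later copies are flagged;
# this is the number of rows drop_duplicates() would remove.
n_extra = int(toy_df.duplicated().sum())
print(n_extra)
```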

Thus, let's remove the duplicates.

In [51]:
len(animes_df)

19311

It seems that a column whose elements are lists breaks some pandas methods such as drop_duplicates().
Thus, we shall change the genre column's elements from list to tuple.

In [53]:
animes_df['genre'] = animes_df['genre'].apply(tuple)
animes_df = animes_df.drop_duplicates()
len(animes_df)

16368
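
The reason the tuple conversion helps is most likely hashability: duplicate detection needs to hash each cell, lists are mutable and unhashable, while tuples are hashable. A minimal sketch on toy data (hypothetical titles and genres) illustrates this; the exact error message may differ between pandas versions, so it is only printed here.

```python
import pandas as pd

# Toy frame (hypothetical rows) with list-valued genre cells.
toy_df = pd.DataFrame({
    "title": ["X", "X", "Y"],
    "genre": [["Action", "Drama"], ["Action", "Drama"], ["Comedy"]],
})

# Lists cannot be hashed, so duplicate detection on them typically fails.
try:
    toy_df.drop_duplicates()
    list_error = None
except TypeError as exc:
    list_error = str(exc)
print(list_error)

# Tuples are hashable, so the same call works after converting the column.
deduped = toy_df.assign(genre=toy_df["genre"].apply(tuple)).drop_duplicates()
print(len(toy_df), "->", len(deduped))
```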

------------------------------