# Data Preparation & Cleaning
---
This notebook covers the steps we took to clean the dataset.

## Importing all the necessary libraries

In [100]:
# Basic Libraries
import pandas as pd

## Importing dataset
For this mini project, we use datasets that consist of:
> Anime info  
> Anime reviews  

These datasets originate from Kaggle and their respective links can be found here:  
> **anime_info_2020:**  
https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews?select=animes.csv
>
> **anime_review_2020:**  
https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews?select=reviews.csv 
>
> **anime_info:**  
https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset?select=anime-dataset-2023.csv  

### Import Anime Reviews
Since the anime reviews dataset is too large, we broke it down in order to upload it to GitHub (max file size 100MB).
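The splitting step itself is not part of this notebook. A minimal sketch of how such a split could be done is shown below; the helper name `split_dataframe`, the number of parts and the toy data are assumptions for illustration only.

```python
import pandas as pd


def split_dataframe(df, n_parts):
    """Split df into at most n_parts consecutive row slices of near-equal size."""
    # Ceiling division so that every row lands in exactly one part
    rows_per_part = -(-len(df) // n_parts)
    return [
        df.iloc[start:start + rows_per_part]
        for start in range(0, len(df), rows_per_part)
    ]


# Toy stand-in for the full reviews table (hypothetical data)
reviews = pd.DataFrame({"uid": range(10), "score": [8, 10, 7, 9, 10, 9, 9, 3, 10, 6]})
parts = split_dataframe(reviews, n_parts=3)

# Each part could then be saved with part.to_csv(..., index=False).
# Concatenating the parts in order restores the original table.
rebuilt = pd.concat(parts, ignore_index=True)
```

Splitting by consecutive rows keeps the original row order, so reading the parts back in order and concatenating them reproduces the full table.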

In [101]:
# Directory containing the CSV files
data_dir = 'datasets'

#Pattern for the review parts
pattern = f"{data_dir}/anime_review_part_"


all_data = []

# Read each CSV part and append its DataFrame to the list
for part_number in range(1, 8):
    part_path = f"{pattern}{part_number}"
    df = pd.read_csv(part_path + ".csv")
    print(part_path)
    all_data.append(df)

print(f"Successfully combined {len(all_data)} CSV files!")


# Anime 2020 Review Ratings
anime_review_2020 = pd.concat(all_data, ignore_index=True)

datasets/anime_review_part_1
datasets/anime_review_part_2
datasets/anime_review_part_3
datasets/anime_review_part_4
datasets/anime_review_part_5
datasets/anime_review_part_6
datasets/anime_review_part_7
Successfully combined 7 CSV files!


### Import Anime Info

In [102]:
# Anime 2020 Dataset
anime_info_2020 = pd.read_csv("datasets/anime_2020.csv")

# Latest Anime dataset used to fill in NaN values
anime_info = pd.read_csv('datasets/anime-dataset-2023.csv')

In [103]:
print("The shape of the Anime 2020 CSV is:")
print(anime_info_2020.shape)
print()
print(anime_info_2020.info())
print()
print("========================================")
print("The shape of the Anime 2020 Review Rating CSV is:")
print(anime_review_2020.shape)
print()
print(anime_review_2020.info())

The shape of the Anime 2020 CSV is:
(19311, 12)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19311 entries, 0 to 19310
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   uid         19311 non-null  int64  
 1   title       19311 non-null  object 
 2   synopsis    18336 non-null  object 
 3   genre       19311 non-null  object 
 4   aired       19311 non-null  object 
 5   episodes    18605 non-null  float64
 6   members     19311 non-null  int64  
 7   popularity  19311 non-null  int64  
 8   ranked      16099 non-null  float64
 9   score       18732 non-null  float64
 10  img_url     19131 non-null  object 
 11  link        19311 non-null  object 
dtypes: float64(3), int64(3), object(6)
memory usage: 1.8+ MB
None

The shape of the Anime 2020 Review Rating CSV is:
(192112, 7)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192112 entries, 0 to 192111
Data columns (total 7 columns):
 #   Column     Non-Null Co

## Data Cleaning for Anime Info dataset
---
### Duplicate Rows
We start by checking for duplicate rows and removing any that we find.

In [104]:
# Checking EXACT row duplicates in Anime 2020 CSV dataset & dropping them
print("Shape of anime_info_2020 before dropping EXACT duplicates")
print(anime_info_2020.shape)
anime_info_2020 = anime_info_2020.drop_duplicates()
print()
print("Shape of anime_info_2020 afterwards")
print(anime_info_2020.shape)

Shape of anime_info_2020 before dropping EXACT duplicates
(19311, 12)

Shape of anime_info_2020 afterwards
(16368, 12)


In [105]:
# Check whether any duplicate uids remain after using pandas' drop_duplicates method
anime_info_2020[anime_info_2020.duplicated('uid', keep=False)].sort_values(by=['uid', 'members'], ascending=[False, False])

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,img_url,link
3028,40957,Shin Chuuka Ichiban! 2,Sequel of Shin Chuuka Ichiban,"['Comedy', 'Shounen']",Not available,,601,16334,,,https://cdn.myanimelist.net/images/anime/1684/...,https://myanimelist.net/anime/40957/Shin_Chuuk...
14699,40957,Shin Chuuka Ichiban! 2,Sequel of Shin Chuuka Ichiban,"['Comedy', 'Shounen']",Not available,,600,16334,,,https://cdn.myanimelist.net/images/anime/1684/...,https://myanimelist.net/anime/40957/Shin_Chuuk...
1653,40908,Kemono Jihen,When a series of animal bodies that rot away a...,"['Action', 'Mystery', 'Demons', 'Supernatural'...",Not available,,897,16329,,,https://cdn.myanimelist.net/images/anime/1438/...,https://myanimelist.net/anime/40908/Kemono_Jihen
14365,40908,Kemono Jihen,When a series of animal bodies that rot away a...,"['Action', 'Mystery', 'Demons', 'Supernatural'...",Not available,,896,16329,,,https://cdn.myanimelist.net/images/anime/1438/...,https://myanimelist.net/anime/40908/Kemono_Jihen
3038,40879,Love Live! Nijigasaki Gakuen School Idol Douko...,,"['Music', 'Slice of Life', 'School']",Not available,,3674,16330,,,https://cdn.myanimelist.net/images/anime/1447/...,https://myanimelist.net/anime/40879/Love_Live_...
...,...,...,...,...,...,...,...,...,...,...,...,...
4814,71,Full Metal Panic!,Equipped with cutting-edge weaponry and specia...,"['Action', 'Military', 'Sci-Fi', 'Comedy', 'Me...","Jan 8, 2002 to Jun 18, 2002",24.0,366816,244,1095.0,7.72,https://cdn.myanimelist.net/images/anime/2/752...,https://myanimelist.net/anime/71/Full_Metal_Panic
1023,60,Chrno Crusade,The 1920s was a decade of great change and uph...,"['Action', 'Demons', 'Historical', 'Romance', ...","Nov 25, 2003 to Jun 10, 2004",24.0,182796,593,1069.0,7.73,https://cdn.myanimelist.net/images/anime/13/14...,https://myanimelist.net/anime/60/Chrno_Crusade
3112,60,Chrno Crusade,The 1920s was a decade of great change and uph...,"['Action', 'Demons', 'Historical', 'Romance', ...","Nov 25, 2003 to Jun 10, 2004",24.0,182751,593,1069.0,7.73,https://cdn.myanimelist.net/images/anime/13/14...,https://myanimelist.net/anime/60/Chrno_Crusade
1031,26,Texhnolyze,"Texhnolyze takes place in the city of Lux, a ...","['Action', 'Sci-Fi', 'Psychological', 'Drama']","Apr 17, 2003 to Sep 25, 2003",22.0,154140,720,1058.0,7.74,https://cdn.myanimelist.net/images/anime/3/181...,https://myanimelist.net/anime/26/Texhnolyze


From the data frame, we can tell that there are still duplicates in the dataset. Even though they have the same uid number, the member count is different. Thus, the previous code was insufficient when it comes to dropping all the duplicate rows.

This also led us to believe that the creator of this dataset scraped the data at two different points in time. To resolve this, we keep the entry with the higher member count for each uid when removing duplicates.

In [106]:
def get_max_members(group):
    # Return the row with the highest member count within one uid group
    return group.loc[group['members'].idxmax()]

print("Shape of anime_info_2020 before dropping duplicates")
print(anime_info_2020.shape)
anime_info_2020 = anime_info_2020.groupby('uid').apply(get_max_members).reset_index(drop = True)

print()
print("Shape of anime_info_2020 after dropping duplicates")
print(anime_info_2020.shape)

Shape of anime_info_2020 before dropping duplicates
(16368, 12)

Shape of anime_info_2020 after dropping duplicates
(16216, 12)
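The same keep-the-highest-member-count rule can also be expressed without `groupby().apply()`. The sketch below is an alternative, not the method used in this notebook; the toy frame `toy` is an assumption, with member counts borrowed from the duplicate listing above.

```python
import pandas as pd

# Toy data: uid 60 appears twice with different member counts
toy = pd.DataFrame({
    "uid": [60, 60, 26, 71],
    "title": ["Chrno Crusade", "Chrno Crusade", "Texhnolyze", "Full Metal Panic!"],
    "members": [182751, 182796, 154140, 366816],
})

# Sort so the highest member count comes first within each uid,
# then keep only the first row per uid
deduped = (
    toy.sort_values(["uid", "members"], ascending=[True, False])
       .drop_duplicates(subset="uid", keep="first")
       .reset_index(drop=True)
)
```

Sorting and then dropping duplicates keeps the column dtypes unchanged and avoids calling a Python function once per group.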


Now that we have dropped all duplicate rows, we will proceed to handle NaN values.

In [107]:
anime_info_2020

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,img_url,link
0,1,Cowboy Bebop,"In the year 2071, humanity has colonized sever...","['Action', 'Adventure', 'Comedy', 'Drama', 'Sc...","Apr 3, 1998 to Apr 24, 1999",26.0,930311,39,26.0,8.81,https://cdn.myanimelist.net/images/anime/4/196...,https://myanimelist.net/anime/1/Cowboy_Bebop
1,5,Cowboy Bebop: Tengoku no Tobira,"Another day, another bounty—such is the life o...","['Action', 'Drama', 'Mystery', 'Sci-Fi', 'Space']","Sep 1, 2001",1.0,223199,475,149.0,8.40,https://cdn.myanimelist.net/images/anime/1439/...,https://myanimelist.net/anime/5/Cowboy_Bebop__...
2,6,Trigun,"Vash the Stampede is the man with a $$60,000,0...","['Action', 'Sci-Fi', 'Adventure', 'Comedy', 'D...","Apr 1, 1998 to Sep 30, 1998",26.0,460146,158,256.0,8.28,https://cdn.myanimelist.net/images/anime/7/203...,https://myanimelist.net/anime/6/Trigun
3,7,Witch Hunter Robin,Witches are individuals with special powers li...,"['Action', 'Magic', 'Police', 'Supernatural', ...","Jul 2, 2002 to Dec 24, 2002",26.0,85182,1278,2487.0,7.32,https://cdn.myanimelist.net/images/anime/1796/...,https://myanimelist.net/anime/7/Witch_Hunter_R...
4,8,Bouken Ou Beet,It is the dark century and the people are suff...,"['Adventure', 'Fantasy', 'Shounen', 'Supernatu...","Sep 30, 2004 to Sep 29, 2005",52.0,12319,3968,3704.0,7.02,https://cdn.myanimelist.net/images/anime/7/215...,https://myanimelist.net/anime/8/Bouken_Ou_Beet
...,...,...,...,...,...,...,...,...,...,...,...,...
16211,40936,Ore wo Suki nano wa Omae dake ka yo: Oretachi ...,The original video anime episode will serve as...,"['Comedy', 'Romance', 'School']",2020,1.0,7976,16332,,,https://cdn.myanimelist.net/images/anime/1443/...,https://myanimelist.net/anime/40936/Ore_wo_Suk...
16212,40938,Hige wo Soru. Soshite Joshikousei wo Hirou.,Office worker Yoshida has been crushing on his...,"['Drama', 'Romance']",Not available,,1485,16327,,,https://cdn.myanimelist.net/images/anime/1859/...,https://myanimelist.net/anime/40938/Hige_wo_So...
16213,40956,Enen no Shouboutai: Ni no Shou,Second season of Enen no Shouboutai .,"['Action', 'Supernatural', 'Shounen']","Jul, 2020 to ?",,13877,16326,,,https://cdn.myanimelist.net/images/anime/1328/...,https://myanimelist.net/anime/40956/Enen_no_Sh...
16214,40957,Shin Chuuka Ichiban! 2,Sequel of Shin Chuuka Ichiban,"['Comedy', 'Shounen']",Not available,,601,16334,,,https://cdn.myanimelist.net/images/anime/1684/...,https://myanimelist.net/anime/40957/Shin_Chuuk...


From the data frame, we can see that there are NaN values that we need to handle.

In [108]:
print("NaN count in anime_info_2020")
print()
print(anime_info_2020.isnull().sum())

NaN count in anime_info_2020

uid              0
title            0
synopsis       763
genre            0
aired            0
episodes       492
members          0
popularity       0
ranked        1663
score          341
img_url        165
link             0
dtype: int64


Looking at the data frame, we decided to drop two columns that do not seem helpful for our later data exploration: 'ranked' and 'img_url'.

Based on MyAnimeList (MAL),
> Top Anime and Top Manga rankings are ordered by weighted score where 
> 
> Weighted Score = (v / (v + m)) * S + (m / (v + m)) * C  
> S = Average score for the anime/manga  
> v = Number users giving a score for the anime/manga †  
> m = Minimum number of scored users required to get a calculated score
> C = The mean score across the entire Anime/Manga database  

As the rankings depend on a minimum number of users who have scored the anime, we decided to drop the entire column, since this criterion has led to a significant number of NaN values (1663 in 'ranked'). Furthermore, the variable 'popularity' seems to be a better alternative that is more helpful for data exploration. As for 'img_url', an image link carries no information that is useful for our analysis.
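To make the quoted formula concrete, here is a small sketch that evaluates it. The function name `weighted_score` and all the numbers passed to it (including `m=50`) are hypothetical, chosen only to show how the formula behaves; they are not MAL's actual values.

```python
def weighted_score(S, v, m, C):
    """Weighted score as quoted above: (v / (v + m)) * S + (m / (v + m)) * C."""
    return (v / (v + m)) * S + (m / (v + m)) * C


# Hypothetical inputs: same average score S, very different numbers of scoring users v
few_votes = weighted_score(S=9.0, v=50, m=50, C=6.5)        # 0.5 * 9.0 + 0.5 * 6.5 = 7.75
many_votes = weighted_score(S=9.0, v=100_000, m=50, C=6.5)  # stays just below 9.0
```

With few scoring users the result is pulled towards the database-wide mean `C`, while with many users it approaches the anime's own average `S`.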

In [109]:
#Dropping both columns 'ranked' and 'img_url'
anime_info_2020 = anime_info_2020.drop(columns=['ranked', 'img_url'])
anime_info_2020

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,score,link
0,1,Cowboy Bebop,"In the year 2071, humanity has colonized sever...","['Action', 'Adventure', 'Comedy', 'Drama', 'Sc...","Apr 3, 1998 to Apr 24, 1999",26.0,930311,39,8.81,https://myanimelist.net/anime/1/Cowboy_Bebop
1,5,Cowboy Bebop: Tengoku no Tobira,"Another day, another bounty—such is the life o...","['Action', 'Drama', 'Mystery', 'Sci-Fi', 'Space']","Sep 1, 2001",1.0,223199,475,8.40,https://myanimelist.net/anime/5/Cowboy_Bebop__...
2,6,Trigun,"Vash the Stampede is the man with a $$60,000,0...","['Action', 'Sci-Fi', 'Adventure', 'Comedy', 'D...","Apr 1, 1998 to Sep 30, 1998",26.0,460146,158,8.28,https://myanimelist.net/anime/6/Trigun
3,7,Witch Hunter Robin,Witches are individuals with special powers li...,"['Action', 'Magic', 'Police', 'Supernatural', ...","Jul 2, 2002 to Dec 24, 2002",26.0,85182,1278,7.32,https://myanimelist.net/anime/7/Witch_Hunter_R...
4,8,Bouken Ou Beet,It is the dark century and the people are suff...,"['Adventure', 'Fantasy', 'Shounen', 'Supernatu...","Sep 30, 2004 to Sep 29, 2005",52.0,12319,3968,7.02,https://myanimelist.net/anime/8/Bouken_Ou_Beet
...,...,...,...,...,...,...,...,...,...,...
16211,40936,Ore wo Suki nano wa Omae dake ka yo: Oretachi ...,The original video anime episode will serve as...,"['Comedy', 'Romance', 'School']",2020,1.0,7976,16332,,https://myanimelist.net/anime/40936/Ore_wo_Suk...
16212,40938,Hige wo Soru. Soshite Joshikousei wo Hirou.,Office worker Yoshida has been crushing on his...,"['Drama', 'Romance']",Not available,,1485,16327,,https://myanimelist.net/anime/40938/Hige_wo_So...
16213,40956,Enen no Shouboutai: Ni no Shou,Second season of Enen no Shouboutai .,"['Action', 'Supernatural', 'Shounen']","Jul, 2020 to ?",,13877,16326,,https://myanimelist.net/anime/40956/Enen_no_Sh...
16214,40957,Shin Chuuka Ichiban! 2,Sequel of Shin Chuuka Ichiban,"['Comedy', 'Shounen']",Not available,,601,16334,,https://myanimelist.net/anime/40957/Shin_Chuuk...


In [110]:
print("After dropping the 2 columns,")
print("========================================")
print("NaN count in anime_info_2020")
print()
print(anime_info_2020.isnull().sum())
print()
print("Shape of anime_info_2020")
print(anime_info_2020.shape)

After dropping the 2 columns,
NaN count in anime_info_2020

uid             0
title           0
synopsis      763
genre           0
aired           0
episodes      492
members         0
popularity      0
score         341
link            0
dtype: int64

Shape of anime_info_2020
(16216, 10)


When looking at the data frame, there seem to be a few anime shows which have not aired yet as shown below:

In [111]:
anime_info_2020[anime_info_2020['aired'] == 'Not available']

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,score,link
8232,20471,Aoki Uru,"In March 1992, Gainax had begun planning and p...","['Military', 'Sci-Fi']",Not available,1.0,4662,5923,,https://myanimelist.net/anime/20471/Aoki_Uru
8267,20715,Mint,"In the recent future, the child disease is epi...","['Drama', 'Fantasy']",Not available,1.0,2396,7529,,https://myanimelist.net/anime/20715/Mint
8824,23275,Tsubu★Doll,The series centers around 16 girls who want to...,['Music'],Not available,5.0,1766,8229,,https://myanimelist.net/anime/23275/Tsubu%E2%9...
9013,24023,Project758,,['Drama'],Not available,,974,9595,,https://myanimelist.net/anime/24023/Project758
9360,25985,Contact,A Backkom short about him meeting an alien.,"['Sci-Fi', 'Comedy', 'Kids']",Not available,1.0,266,12487,5.07,https://myanimelist.net/anime/25985/Contact
...,...,...,...,...,...,...,...,...,...,...
16208,40911,Yuukoku no Moriarty,Everyone is familiar with the story of Sherloc...,"['Mystery', 'Historical', 'Psychological', 'Sh...",Not available,,940,16338,,https://myanimelist.net/anime/40911/Yuukoku_no...
16210,40935,Beastars 2nd Season,Second season of Beastars,"['Slice of Life', 'Psychological', 'Drama', 'S...",Not available,,16035,16321,,https://myanimelist.net/anime/40935/Beastars_2...
16212,40938,Hige wo Soru. Soshite Joshikousei wo Hirou.,Office worker Yoshida has been crushing on his...,"['Drama', 'Romance']",Not available,,1485,16327,,https://myanimelist.net/anime/40938/Hige_wo_So...
16214,40957,Shin Chuuka Ichiban! 2,Sequel of Shin Chuuka Ichiban,"['Comedy', 'Shounen']",Not available,,601,16334,,https://myanimelist.net/anime/40957/Shin_Chuuk...


As such, we remove these anime shows from the dataset.

In [112]:
print("Shape of anime_info_2020 before dropping anime shows not aired")
print(anime_info_2020.shape)
print()

anime_info_2020 = anime_info_2020[anime_info_2020['aired'] != 'Not available']
print("Shape of anime_info_2020 after dropping anime shows not aired")
print(anime_info_2020.shape)
anime_info_2020

Shape of anime_info_2020 before dropping anime shows not aired
(16216, 10)

Shape of anime_info_2020 after dropping anime shows not aired
(15943, 10)


Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,score,link
0,1,Cowboy Bebop,"In the year 2071, humanity has colonized sever...","['Action', 'Adventure', 'Comedy', 'Drama', 'Sc...","Apr 3, 1998 to Apr 24, 1999",26.0,930311,39,8.81,https://myanimelist.net/anime/1/Cowboy_Bebop
1,5,Cowboy Bebop: Tengoku no Tobira,"Another day, another bounty—such is the life o...","['Action', 'Drama', 'Mystery', 'Sci-Fi', 'Space']","Sep 1, 2001",1.0,223199,475,8.40,https://myanimelist.net/anime/5/Cowboy_Bebop__...
2,6,Trigun,"Vash the Stampede is the man with a $$60,000,0...","['Action', 'Sci-Fi', 'Adventure', 'Comedy', 'D...","Apr 1, 1998 to Sep 30, 1998",26.0,460146,158,8.28,https://myanimelist.net/anime/6/Trigun
3,7,Witch Hunter Robin,Witches are individuals with special powers li...,"['Action', 'Magic', 'Police', 'Supernatural', ...","Jul 2, 2002 to Dec 24, 2002",26.0,85182,1278,7.32,https://myanimelist.net/anime/7/Witch_Hunter_R...
4,8,Bouken Ou Beet,It is the dark century and the people are suff...,"['Adventure', 'Fantasy', 'Shounen', 'Supernatu...","Sep 30, 2004 to Sep 29, 2005",52.0,12319,3968,7.02,https://myanimelist.net/anime/8/Bouken_Ou_Beet
...,...,...,...,...,...,...,...,...,...,...
16203,40902,Shokugeki no Souma: Gou no Sara,Fifth season of Shokugeki no Souma . \r\n \r\n,"['Ecchi', 'School', 'Shounen']","Apr, 2020 to ?",,31394,16335,,https://myanimelist.net/anime/40902/Shokugeki_...
16205,40906,Dragon Quest: Dai no Daibouken (2020),"A long time ago, there was a valiant swordsman...","['Action', 'Adventure', 'Fantasy', 'Shounen']","Oct, 2020 to ?",,803,16325,,https://myanimelist.net/anime/40906/Dragon_Que...
16209,40934,Bungou to Alchemist: Shinpan no Haguruma,You are in a world where you can meet famous J...,"['Action', 'Adventure', 'Fantasy']","Apr, 2020 to ?",,49,16323,,https://myanimelist.net/anime/40934/Bungou_to_...
16211,40936,Ore wo Suki nano wa Omae dake ka yo: Oretachi ...,The original video anime episode will serve as...,"['Comedy', 'Romance', 'School']",2020,1.0,7976,16332,,https://myanimelist.net/anime/40936/Ore_wo_Suk...


In [113]:
print("NaN count in anime_info_2020")
print()
print(anime_info_2020.isnull().sum())

NaN count in anime_info_2020

uid             0
title           0
synopsis      726
genre           0
aired           0
episodes      388
members         0
popularity      0
score         201
link            0
dtype: int64


Removing the anime shows that have not aired reduced the NaN counts slightly (synopsis from 763 to 726, episodes from 492 to 388, score from 341 to 201).

Next, we make use of the other dataset, anime_review_2020, and drop from anime_info_2020 the anime shows that do not have any review.

In [114]:
anime_reviews_uids = anime_review_2020['anime_uid'].unique()
print(anime_reviews_uids.shape)
print(f"The number of anime shows with reviews is {len(anime_reviews_uids)}.")
print()

# Drop those anime that do not have any review
print("Shape of anime_info_2020 before dropping anime shows with no review")
print(anime_info_2020.shape)
print()
# .copy() makes review_filtered an independent frame, so later assignments do not act on a slice of anime_info_2020
review_filtered = anime_info_2020[anime_info_2020['uid'].isin(anime_reviews_uids)].copy()

print("Shape of dataframe after dropping anime shows with no review")
print(review_filtered.shape)

(8113,)
The number of anime shows with reviews is 8113.

Shape of anime_info_2020 before dropping anime shows with no review
(15943, 10)

Shape of dataframe after dropping anime shows with no review
(8110, 10)
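The two counts above differ slightly: 8113 distinct anime have reviews, but only 8110 rows remain after filtering. One way to list the reviewed uids that have no row in the info table (for example because they were removed in an earlier step, or were never present) is sketched below. The toy frames `info` and `reviews` and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for anime_info_2020 and anime_review_2020 (hypothetical values)
info = pd.DataFrame({"uid": [1, 5, 6], "title": ["A", "B", "C"]})
reviews = pd.DataFrame({"anime_uid": [1, 1, 5, 99], "score": [8, 9, 7, 10]})

reviewed_uids = reviews["anime_uid"].unique()

# Keep only anime that have at least one review
info_with_reviews = info[info["uid"].isin(reviewed_uids)]

# Reviewed uids that have no matching row in the info table
missing_from_info = sorted(set(reviewed_uids) - set(info["uid"]))
```

Running the same set difference on the real tables would show which reviewed anime account for the gap.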


In [115]:
print("NaN count")
print(review_filtered.isnull().sum())

NaN count
uid             0
title           0
synopsis      154
genre           0
aired           0
episodes       54
members         0
popularity      0
score           0
link            0
dtype: int64


To handle the anime shows with NaN episode counts, we use another dataset from 2023 (anime_info) and attempt to fill in these values.

In [116]:
# uids of the rows whose episode count is NaN
review_filtered_uid = review_filtered.loc[review_filtered['episodes'].isnull(), 'uid'].tolist()

# Fill in NaN values from the 2023 dataset where it has a known episode count
for anime_uid in review_filtered_uid:
    row = anime_info[anime_info['anime_id'] == anime_uid]
    # Skip uids that are absent from the 2023 dataset or still marked 'UNKNOWN'
    if not row.empty and row['Episodes'].iloc[0] != 'UNKNOWN':
        review_filtered.loc[review_filtered['uid'] == anime_uid, 'episodes'] = float(row['Episodes'].iloc[0])

print("NaN count")        
print(review_filtered.isnull().sum())
print()
print("Shape of dataframe after attempting to fill in episode values")
print(review_filtered.shape)

NaN count
uid             0
title           0
synopsis      154
genre           0
aired           0
episodes       16
members         0
popularity      0
score           0
link            0
dtype: int64

Shape of dataframe after attempting to fill in episode values
(8110, 10)
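The row-by-row loop can also be written as a single lookup. The sketch below is an alternative under stated assumptions, not the code used in this notebook: the toy frames `target` and `lookup` stand in for review_filtered and anime_info, and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for review_filtered and the 2023 anime_info table (hypothetical values)
target = pd.DataFrame({"uid": [1, 2, 3, 4], "episodes": [26.0, None, None, None]})
lookup = pd.DataFrame({"anime_id": [1, 2, 3], "Episodes": ["26.0", "12.0", "UNKNOWN"]})

# Convert the lookup column to numbers; 'UNKNOWN' becomes NaN
episodes_by_id = pd.to_numeric(
    lookup.set_index("anime_id")["Episodes"], errors="coerce"
)

# Fill only the missing values; uids absent from the lookup stay NaN
target["episodes"] = target["episodes"].fillna(target["uid"].map(episodes_by_id))
```

Here uid 2 is filled from the lookup, uid 3 stays NaN because its count is 'UNKNOWN', and uid 4 stays NaN because it does not appear in the lookup at all.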


Some NaN values remain, so we inspect these rows to find out why. A quick search online suggests that these anime shows are still ongoing, some with more than 1000 episodes, which is why their episode counts are recorded as NaN. We treat these shows as anomalies and drop them.

In [117]:
review_filtered[review_filtered['episodes'].isnull()]

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,score,link
11,21,One Piece,"Gol D. Roger was known as the ""Pirate King,"" t...","['Action', 'Adventure', 'Comedy', 'Super Power...","Oct 20, 1999 to ?",,948342,35,8.53,https://myanimelist.net/anime/21/One_Piece
210,235,Detective Conan (TV),"Shinichi Kudou, a high school student of astou...","['Adventure', 'Mystery', 'Comedy', 'Police', '...","Jan 8, 1996 to ?",,196953,545,8.24,https://myanimelist.net/anime/235/Detective_Co...
864,966,Crayon Shin-chan,Just because an anime features a young protago...,"['Slice of Life', 'Comedy', 'Ecchi', 'School',...","Apr 13, 1992 to ?",,44171,2057,7.74,https://myanimelist.net/anime/966/Crayon_Shin-...
1084,1199,Nintama Rantarou,"Rantarou, Shinbei and Kirimaru are ninja appre...","['Comedy', 'Kids']","Apr 10, 1993 to ?",,4057,6242,7.13,https://myanimelist.net/anime/1199/Nintama_Ran...
1775,1960,Sore Ike! Anpanman,"One night, a Star of Life falls down the chimn...","['Comedy', 'Kids', 'Fantasy']","Oct 3, 1988 to ?",,1659,8374,6.56,https://myanimelist.net/anime/1960/Sore_Ike_An...
2187,2406,Sazae-san,The main character is a mother named Sazae-san...,"['Slice of Life', 'Comedy']","Oct 5, 1969 to ?",,3805,6414,6.21,https://myanimelist.net/anime/2406/Sazae-san
4451,6149,Chibi Maruko-chan (1995),Momoko Sakura is an elementary school student ...,"['Comedy', 'Kids', 'School', 'Slice of Life']","Jan 8, 1995 to ?",,2360,7539,7.27,https://myanimelist.net/anime/6149/Chibi_Maruk...
5008,7505,Knyacki!,Two caterpillars investigate objects on a kitc...,"['Comedy', 'Kids', 'Drama', 'Fantasy']","Apr 7, 1995 to ?",,510,10949,6.32,https://myanimelist.net/anime/7505/Knyacki
5447,8687,Doraemon (2005),Doraemon (2005) is the most recent anime serie...,"['Sci-Fi', 'Comedy', 'Kids', 'Shounen']","Apr 22, 2005 to ?",,7447,4890,7.53,https://myanimelist.net/anime/8687/Doraemon_2005
5910,9874,Touhou Niji Sousaku Doujin Anime: Musou Kakyou,"Welcome to the fascinating world of Gensokyo, ...","['Magic', 'Vampire', 'Fantasy']","Dec 29, 2008 to ?",,27712,2686,7.16,https://myanimelist.net/anime/9874/Touhou_Niji...


In [118]:
# Magia Record & Knyacki!
# Taking all these anime that seemingly never end as anomalies, we drop these rows
review_filtered = review_filtered.dropna(subset = ['episodes'])

print("NaN count") 
print()
print(review_filtered.isnull().sum())
print()
print("Shape of dataframe after dropping these anomalies")
print(review_filtered.shape)

review_filtered

NaN count

uid             0
title           0
synopsis      154
genre           0
aired           0
episodes        0
members         0
popularity      0
score           0
link            0
dtype: int64

Shape of dataframe after dropping these anomalies
(8094, 10)


Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,score,link
0,1,Cowboy Bebop,"In the year 2071, humanity has colonized sever...","['Action', 'Adventure', 'Comedy', 'Drama', 'Sc...","Apr 3, 1998 to Apr 24, 1999",26.0,930311,39,8.81,https://myanimelist.net/anime/1/Cowboy_Bebop
1,5,Cowboy Bebop: Tengoku no Tobira,"Another day, another bounty—such is the life o...","['Action', 'Drama', 'Mystery', 'Sci-Fi', 'Space']","Sep 1, 2001",1.0,223199,475,8.40,https://myanimelist.net/anime/5/Cowboy_Bebop__...
2,6,Trigun,"Vash the Stampede is the man with a $$60,000,0...","['Action', 'Sci-Fi', 'Adventure', 'Comedy', 'D...","Apr 1, 1998 to Sep 30, 1998",26.0,460146,158,8.28,https://myanimelist.net/anime/6/Trigun
3,7,Witch Hunter Robin,Witches are individuals with special powers li...,"['Action', 'Magic', 'Police', 'Supernatural', ...","Jul 2, 2002 to Dec 24, 2002",26.0,85182,1278,7.32,https://myanimelist.net/anime/7/Witch_Hunter_R...
4,8,Bouken Ou Beet,It is the dark century and the people are suff...,"['Adventure', 'Fantasy', 'Shounen', 'Supernatu...","Sep 30, 2004 to Sep 29, 2005",52.0,12319,3968,7.02,https://myanimelist.net/anime/8/Bouken_Ou_Beet
...,...,...,...,...,...,...,...,...,...,...
16064,40489,Sword Art Online: Alicization - War of Underwo...,"Recap of Sword Art Online: Alicization , aire...","['Action', 'Game', 'Adventure', 'Romance', 'Fa...","Oct 6, 2019",1.0,23440,2953,6.67,https://myanimelist.net/anime/40489/Sword_Art_...
16090,40542,Saiki Kusuo no Ψ-nan: Shidou-hen,,"['Slice of Life', 'Comedy', 'Supernatural', 'S...","Dec 30, 2019",6.0,24641,2901,8.35,https://myanimelist.net/anime/40542/Saiki_Kusu...
16166,40693,The iDOLM@STER Cinderella Girls: Spin-off!,The Idolm@ster Cinderella Girls 8th anniversar...,['Music'],"Nov 10, 2019",1.0,668,10388,6.34,https://myanimelist.net/anime/40693/The_iDOLMS...
16186,40769,Rifle Is Beautiful Recap,Recap of the first seven episodes of Rifle Is...,"['Comedy', 'School', 'Slice of Life']","Dec 1, 2019",1.0,990,9547,5.59,https://myanimelist.net/anime/40769/Rifle_Is_B...


With all of this done, the only remaining NaN values are the 154 missing synopses, which we fill with the placeholder string "Missing".

In [119]:
# Fill NaN synopses with the placeholder string "Missing" (assign back instead of a chained inplace call)
review_filtered['synopsis'] = review_filtered['synopsis'].fillna("Missing")

print("NaN count") 
print()
print(review_filtered.isnull().sum())
print()
print("Shape of dataframe after filling in NaN")
print(review_filtered.shape)

NaN count

uid           0
title         0
synopsis      0
genre         0
aired         0
episodes      0
members       0
popularity    0
score         0
link          0
dtype: int64

Shape of dataframe after filling in NaN
(8094, 10)


In [120]:
# Save the cleaned dataset as a new CSV file
review_filtered.to_csv('datasets/anime_2020_clean.csv', index=False)

With this, we are done with data cleaning for our anime_info_2020 dataset.
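Before relying on the cleaned table, a few sanity checks can confirm that the steps above had the intended effect. The helper `check_clean` below is a hypothetical sketch and is shown on toy data rather than on review_filtered.

```python
import pandas as pd


def check_clean(df, key="uid"):
    """Return a list of problems found in a cleaned table (empty if none)."""
    problems = []
    # Every key should appear at most once after de-duplication
    if df[key].duplicated().any():
        problems.append(f"duplicate {key} values")
    # No column should still contain NaN values
    nan_columns = df.columns[df.isnull().any()].tolist()
    if nan_columns:
        problems.append(f"NaN values in {nan_columns}")
    return problems


# Toy tables (hypothetical values)
good = pd.DataFrame({"uid": [1, 5], "episodes": [26.0, 1.0]})
bad = pd.DataFrame({"uid": [1, 1], "episodes": [26.0, None]})

good_report = check_clean(good)
bad_report = check_clean(bad)
```

An empty report means both checks passed; for the `bad` table the report names the duplicated key and the column that still holds NaN.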

## Data Cleaning for Anime Review dataset

In [121]:
anime_review_2020

Unnamed: 0,uid,profile,anime_uid,text,score,scores,link
0,255938,DesolatePsyche,34096,\n \n \n \n ...,8,"{'Overall': '8', 'Story': '8', 'Animation': '8...",https://myanimelist.net/reviews.php?id=255938
1,259117,baekbeans,34599,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ...",https://myanimelist.net/reviews.php?id=259117
2,253664,skrn,28891,\n \n \n \n ...,7,"{'Overall': '7', 'Story': '7', 'Animation': '9...",https://myanimelist.net/reviews.php?id=253664
3,8254,edgewalker00,2904,\n \n \n \n ...,9,"{'Overall': '9', 'Story': '9', 'Animation': '9...",https://myanimelist.net/reviews.php?id=8254
4,291149,aManOfCulture99,4181,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ...",https://myanimelist.net/reviews.php?id=291149
...,...,...,...,...,...,...,...
192107,240067,Unicorn819,1281,\n \n \n \n ...,9,"{'Overall': '9', 'Story': '5', 'Animation': '1...",https://myanimelist.net/reviews.php?id=240067
192108,285777,ShizzoSVH,1281,\n \n \n \n ...,9,"{'Overall': '9', 'Story': '7', 'Animation': '9...",https://myanimelist.net/reviews.php?id=285777
192109,286904,AlluMan96,1281,\n \n \n \n ...,3,"{'Overall': '3', 'Story': '3', 'Animation': '1...",https://myanimelist.net/reviews.php?id=286904
192110,287903,AgentK300,1281,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '3', 'Animation': '...",https://myanimelist.net/reviews.php?id=287903


Looking at anime_review_2020, we decided to drop three columns: the review's 'uid', 'text' and 'link', as they do not appear to be helpful in our future data exploration.

The 'uid' column identifies a review rather than a reviewer, so we can still identify users by 'profile', which is unique to each user.

In [122]:
# Dropping the column, 'uid', 'text' and 'link'
print("Shape of anime_review_2020 before dropping columns")
print(anime_review_2020.shape)
anime_review_2020 = anime_review_2020.drop(['text', 'uid', 'link'], axis=1)
print()
print("Shape of anime_review_2020 after dropping columns")
print(anime_review_2020.shape)

Shape of anime_review_2020 before dropping columns
(192112, 7)

Shape of anime_review_2020 after dropping columns
(192112, 4)


Aside from dropping the columns, we also remove the reviews of anime that are no longer in the cleaned anime info dataset (review_filtered), which includes the anime shows deemed anomalies.

In [123]:
# Keep only the reviews of anime that remain in the cleaned anime info dataset
print("Shape of anime_review_2020 before dropping rows of anime review")
print(anime_review_2020.shape)

anime_uids_to_keep = review_filtered['uid'].unique()
anime_review_2020 = anime_review_2020[anime_review_2020['anime_uid'].isin(anime_uids_to_keep)]
print()
print("Shape of anime_review_2020 after dropping rows of anime review")
print(anime_review_2020.shape)

Shape of anime_review_2020 before dropping rows of anime review
(192112, 4)

Shape of anime_review_2020 after dropping rows of anime review
(191091, 4)


In [124]:
# Save the cleaned dataset as a new CSV file
anime_review_2020.to_csv('datasets/anime_review_cleaned.csv', index=False)
