### Beginning of the Assignment - exploration

In [1]:
import pandas as pd
import numpy as np
#pd.set_option('display.max_columns', 200)

In [2]:
character_nicknames_df = pd.read_csv('datasets/character_nicknames.csv')

In [3]:
anime_details_df = pd.read_csv('datasets/details.csv')

In [4]:
favs_df = pd.read_csv('datasets/favs.csv')


In [5]:
person_details_df = pd.read_csv('datasets/person_details.csv')


In [6]:
person_alternate_names_df = pd.read_csv('datasets/person_alternate_names.csv')


In [7]:
person_anime_works_df = pd.read_csv('datasets/person_anime_works.csv')


In [8]:
stats_df = pd.read_csv('datasets/stats.csv')


In [9]:
# Convert string columns whose values repeat many times (e.g., username)
# to the category dtype to reduce memory usage.
# This representation stores each distinct value once and references it via integer
# codes, so the dataset can be processed on machines with limited RAM.

dtypes = {
    "is_rewatching": "Int8",
    "anime_id": "Int32",
    "score": "Int8",
    "num_watched_episodes": "Int32",
    "username": "category",
}
ratings_df = pd.read_csv("datasets/ratings.csv", dtype=dtypes)
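The memory effect of the category dtype can be checked on a toy Series. This is only a sketch: the usernames below are made up, and the actual savings on ratings.csv depend on how often each username repeats.

```python
import pandas as pd

# Made-up usernames, each repeated many times (an assumption about the real data).
usernames = pd.Series(["alice", "bob", "carol"] * 10_000, dtype="object")
as_category = usernames.astype("category")

# deep=True also counts the Python string objects, not just the pointers to them.
object_bytes = usernames.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
print(as_category.cat.categories.tolist())  # the distinct values, stored once
print(as_category.cat.codes.dtype)          # small integer codes pointing at them
```

The gain shrinks as the number of distinct values approaches the number of rows, so it is worth measuring before converting a column.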

In [10]:
characters_df = pd.read_csv('datasets/characters.csv')

In [11]:
character_anime_works_df = pd.read_csv('datasets/character_anime_works.csv')

In [12]:
person_voice_works_df = pd.read_csv('datasets/person_voice_works.csv')

In [13]:
profiles_df = pd.read_csv('datasets/profiles.csv')

In [14]:
recommendations_df = pd.read_csv('datasets/recommendations.csv')

### GUIDELINES

Things to do for each dataset:
1. Take a first look with .head() and/or .tail().
2. Run .describe() and check that all numeric values make sense (e.g. year=300 makes no sense in our context).
3. Check the formats: convert object columns to dates where we need them, and check that all dates use the same format (US or EU).
    Also check for impossible dates such as 29/02/2013 (2013 was not a leap year). This may be overkill.
4. Check for duplicates.
5. Check for correlation with df.corr(numeric_only=True) (e.g. longer durations might come with more actors).
6. Decide whether to add or remove columns.
7. groupby splits the rows into groups by key; an aggregation then combines the fields of each group into one value per group.
8. Aggregations: different aggregation functions can be applied to different columns (e.g. sum the first column, average the second, and so on).
9. Transformations: apply an operation per group and return a result aligned with the original DataFrame (same index, same length).

10. Dropping rows with NaNs is usually unnecessary, because pandas aggregations skip NaN by default.
When do we drop them? It should be safe to drop a row when EVERY field in it is empty (df.dropna(how="all")).
BE CAREFUL with foreign keys: for example, if a person has a NaN in the "anime he worked in" field, the person should not be dropped.
NEVER replace NaNs with invalid sentinel values (e.g. -1).
If we use df.dropna(subset=["name"], inplace=True),
inplace=True means that df itself is modified. Without inplace=True, dropna returns a new DataFrame that must be assigned to a variable (a new one or the same one).

11. Check that the data are consistent (e.g. normalize country names and/or numeric fields, describe them and check what they contain).

12. Do data-type normalization in one place (e.g. convert all the dates in the same notebook cell).

BONUS: NEVER USE FOR LOOPS, NEVER DUPLICATE DATA (unless necessary)
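
Guidelines 7 to 10 can be illustrated on a toy frame. This is a minimal sketch: the column names mirror the anime details file, but the values are made up. It shows per-column aggregation, a transform aligned with the original rows, and a NaN that is skipped rather than dropped.

```python
import pandas as pd

# Toy data (made up); one score is missing on purpose.
toy_df = pd.DataFrame({
    "type": ["TV", "TV", "Movie", "Movie", "OVA"],
    "score": [7.0, 8.0, 6.0, None, 5.0],
    "episodes": [12, 24, 1, 1, 2],
})

# Aggregation: one row per group, a different function per column.
per_type = toy_df.groupby("type").agg(
    mean_score=("score", "mean"),        # the NaN is skipped, not counted as 0
    total_episodes=("episodes", "sum"),
)

# Transformation: same length and index as toy_df, so it can become a new column.
toy_df["type_mean_score"] = toy_df.groupby("type")["score"].transform("mean")

print(per_type)
print(toy_df)
```

The row with the missing score is still present after both operations, which is the reason dropping it beforehand is not needed.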


##### First look

In [15]:
character_nicknames_df

Unnamed: 0,character_mal_id,nickname
0,280205,Hikaruko
1,280129,Hinacchi
2,280127,Bertha Willis
3,280066,Jimmy
4,280059,Full Body Red Square
...,...,...
37075,282159,Ling Long
37076,282159,Silvermoon
37077,282227,Mei's Mother
37078,282254,Cyrano


In [16]:
character_nicknames_df.head()

Unnamed: 0,character_mal_id,nickname
0,280205,Hikaruko
1,280129,Hinacchi
2,280127,Bertha Willis
3,280066,Jimmy
4,280059,Full Body Red Square


In [17]:
character_nicknames_df.columns
# lists all the columns; not necessary here, but kept as a pattern for the datasets that follow

Index(['character_mal_id', 'nickname'], dtype='object')

In [18]:
character_nicknames_df.dtypes

character_mal_id     int64
nickname            object
dtype: object

In [19]:
character_nicknames_df.describe(include='all')   

Unnamed: 0,character_mal_id,nickname
count,37080.0,37064
unique,,28928
top,,Princess
freq,,76
mean,115767.769579,
std,87596.063426,
min,3.0,
25%,30541.75,
50%,109656.0,
75%,192520.75,


In [20]:
character_nicknames_df.loc[character_nicknames_df.duplicated()]
# by default (keep='first'), duplicated() flags every repeat of a row except its first occurrence

Unnamed: 0,character_mal_id,nickname
328,276501,Koko
329,276501,Cao Cao
352,275883,King
356,275835,Eldest Brother
447,274437,The Half-Fool
...,...,...
37051,281206,Apemon
37052,281206,Apemon
37053,281206,Apemon
37054,281206,Apemon


102 of the 37080 rows are exact duplicates.
Why is that? A character can have several nicknames, so a repeated character_mal_id with different nicknames is expected and should be kept.
What we want to drop are the rows in which both the character_mal_id and the nickname are identical.
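
The difference between a character with several nicknames and a row repeated verbatim can be sketched on a toy frame. The IDs and nicknames below are illustrative, not taken from the file.

```python
import pandas as pd

toy_nicknames_df = pd.DataFrame({
    "character_mal_id": [1, 1, 1, 2],
    "nickname": ["Koko", "Koko", "Cao Cao", "King"],
})

# Exact duplicates: all columns equal. Only the repeat is flagged by default.
repeats_only = toy_nicknames_df.duplicated()
# keep=False flags every member of a duplicated group, handy for inspection.
all_members = toy_nicknames_df.duplicated(keep=False)
# Restricting to the ID would also flag legitimate extra nicknames.
same_id = toy_nicknames_df.duplicated(subset=["character_mal_id"])

deduplicated_df = toy_nicknames_df.drop_duplicates(keep="first")

print(repeats_only.tolist())
print(all_members.tolist())
print(same_id.tolist())
print(deduplicated_df)
```

Character 1 keeps both "Koko" and "Cao Cao" after deduplication; only the repeated "Koko" row is removed.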

In [21]:
# drop exact duplicate rows from the first dataset, keeping the first occurrence
character_nicknames_df.drop_duplicates(keep='first', inplace=True)

In [22]:
#checking for nan values
character_nicknames_df[character_nicknames_df.isna().any(axis=1)]

Unnamed: 0,character_mal_id,nickname
696,271183,
3170,248157,
4880,230372,
15899,135548,
18598,107631,
24278,48077,
25041,42325,
26604,35444,
30115,19532,
30116,19531,


In [23]:
#cleaning the df from nan values
character_nicknames_df.dropna(inplace=True)


### SECOND DATASET

In [24]:
anime_details_df["studios"].astype(str).str.strip().str.replace(r'^[\[\("\']+', "", regex=True).str.replace(r'[\]\)"\']+$', "", regex=True)  

0                        
1             Flat Studio
2               Bee Train
3           CyberConnect2
4                        
               ...       
28950       LandQ studios
28951    Qualia Animation
28952    Qualia Animation
28953     Studio 9 Maiami
28954                    
Name: studios, Length: 28955, dtype: object
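
The regex above strips the outer brackets and quotes, which is enough for single-studio entries but leaves inner quotes in place when a row lists several studios. An alternative sketch, assuming the column always holds a string that looks like a Python list literal (as the displayed values suggest), is to parse it with ast.literal_eval. The sample strings are copied from values visible in this notebook.

```python
import ast

import pandas as pd

raw_studios = pd.Series([
    "[]",
    "['Flat Studio']",
    "['Typhoon Graphics', 'Qzil.la']",
])

# literal_eval only evaluates literals, so it is safe on untrusted strings,
# but it raises on NaN or malformed input (not handled in this sketch).
parsed_studios = raw_studios.apply(ast.literal_eval)

# One row per studio; empty lists become NaN and are ignored by value_counts.
studio_counts = parsed_studios.explode().value_counts()

print(parsed_studios.tolist())
print(studio_counts)
```

The same approach would apply to the other list-like columns (genres, themes, demographics, producers), under the same assumption about their format.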

In [25]:
anime_details_df
# anime details
# title_japanese could probably be dropped
# members is the number of users who have this anime in their list
# licensors and streaming are mostly empty: do we need them?

Unnamed: 0,mal_id,title,title_japanese,url,image_url,type,status,score,scored_by,start_date,...,demographics,source,rating,episodes,season,year,producers,explicit_genres,licensors,streaming
0,59356,-Socket-,-socket-,https://myanimelist.net/anime/59356/-Socket-,https://cdn.myanimelist.net/images/anime/1043/...,Movie,Finished Airing,,,2010-01-01T00:00:00+00:00,...,[],Original,G - All Ages,1.0,,,['Nagoya Zokei University'],[],[],[]
1,56036,......,......,https://myanimelist.net/anime/56036/-,https://cdn.myanimelist.net/images/anime/1057/...,Music,Finished Airing,6.53,503.0,2023-06-11T00:00:00+00:00,...,[],Original,PG-13 - Teens 13 or older,1.0,,,[],[],[],[]
2,2928,.hack//G.U. Returner,.HACK//G.U. RETURNER,https://myanimelist.net/anime/2928/hack__GU_Re...,https://cdn.myanimelist.net/images/anime/1798/...,OVA,Finished Airing,6.65,9745.0,2007-01-18T00:00:00+00:00,...,[],Game,PG-13 - Teens 13 or older,1.0,,,"['Bandai Visual', 'CyberConnect2']",[],[],[]
3,3269,.hack//G.U. Trilogy,.hack//G.U. Trilogy,https://myanimelist.net/anime/3269/hack__GU_Tr...,https://cdn.myanimelist.net/images/anime/1566/...,Movie,Finished Airing,7.06,15373.0,2007-12-22T00:00:00+00:00,...,[],Game,PG-13 - Teens 13 or older,1.0,,,['Bandai Visual'],[],"['Funimation', 'Bandai Entertainment']",[]
4,4469,.hack//G.U. Trilogy: Parody Mode,.hack//G.U. Trilogy,https://myanimelist.net/anime/4469/hack__GU_Tr...,https://cdn.myanimelist.net/images/anime/10/86...,Special,Finished Airing,6.35,4317.0,2008-03-25T00:00:00+00:00,...,[],Game,PG-13 - Teens 13 or older,1.0,,,['Bandai Visual'],[],[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28950,59421,Zutaboro Reijou wa Ane no Moto Konyakusha ni D...,ずたぼろ令嬢は姉の元婚約者に溺愛される,https://myanimelist.net/anime/59421/Zutaboro_R...,https://cdn.myanimelist.net/images/anime/1518/...,TV,Finished Airing,7.37,15624.0,2025-07-05T00:00:00+00:00,...,['Josei'],Light novel,PG-13 - Teens 13 or older,12.0,summer,2025.0,"['Studio Pierrot', 'Mainichi Broadcasting Syst...",[],[],"['Crunchyroll', 'Aniplus TV', 'Bahamut Anime C..."
28951,31245,Zutto Mae kara Suki deshita. Kokuhaku Jikkou I...,ずっと前から好きでした。～告白実行委員会～,https://myanimelist.net/anime/31245/Zutto_Mae_...,https://cdn.myanimelist.net/images/anime/3/821...,Movie,Finished Airing,7.20,104106.0,2016-04-23T00:00:00+00:00,...,[],Music,PG-13 - Teens 13 or older,1.0,,,"['Aniplex', 'Dentsu', 'Kadokawa Shoten', 'Movi...",[],['Aniplex of America'],[]
28952,36305,Zutto Mae kara Suki deshita. Kokuhaku Jikkou I...,ずっと前から好きでした。～告白実行委員会～ 「金曜日のおはよう」,https://myanimelist.net/anime/36305/Zutto_Mae_...,https://cdn.myanimelist.net/images/anime/6/883...,Special,Finished Airing,7.17,10038.0,2016-10-26T00:00:00+00:00,...,[],Music,PG - Children,1.0,,,['Aniplex'],[],[],[]
28953,34895,Zutto Suki Datta,ずっと好きだった,https://myanimelist.net/anime/34895/Zutto_Suki...,https://cdn.myanimelist.net/images/anime/1498/...,OVA,Finished Airing,5.68,1887.0,2017-04-21T00:00:00+00:00,...,[],Manga,Rx - Hentai,2.0,,,"['Queen Bee', 'Mediabank']",[],[],[]


In [26]:
anime_details_df['status'].value_counts()

status
Finished Airing     28097
Not yet aired         544
Currently Airing      314
Name: count, dtype: int64

In [27]:
anime_details_df['demographics'].value_counts()

demographics
[]                       18140
['Kids']                  6876
['Shounen']               2099
['Seinen']                1108
['Shoujo']                 516
['Josei']                  160
['Kids', 'Shounen']         53
['Kids', 'Shoujo']           2
['Seinen', 'Shounen']        1
Name: count, dtype: int64

In [28]:
anime_details_df['genres'].value_counts()

genres
[]                                                       5983
['Comedy']                                               2705
['Fantasy']                                              1487
['Hentai']                                               1305
['Slice of Life']                                         854
                                                         ... 
['Comedy', 'Sci-Fi', 'Erotica']                             1
['Action', 'Adventure', 'Fantasy', 'Horror']                1
['Action', 'Adventure', 'Drama', 'Mystery', 'Sci-Fi']       1
['Drama', 'Fantasy', 'Horror']                              1
['Horror', 'Erotica']                                       1
Name: count, Length: 931, dtype: int64

In [29]:
anime_details_df.loc[anime_details_df.duplicated()]

Unnamed: 0,mal_id,title,title_japanese,url,image_url,type,status,score,scored_by,start_date,...,demographics,source,rating,episodes,season,year,producers,explicit_genres,licensors,streaming


There are no duplicates

In [30]:
#explicit_genres looks empty so it can be removed
#checking how many non empty values are there
(anime_details_df["explicit_genres"] != "[]").sum()

0

In [31]:
anime_details_df[['start_date','season']].query("season.notna()")
# Can season be removed? If we ever need it, it could be derived from the "start_date" field

Unnamed: 0,start_date,season
10,2006-04-06T00:00:00+00:00,spring
11,2002-04-04T00:00:00+00:00,spring
12,2003-01-09T00:00:00+00:00,winter
19,2024-07-07T00:00:00+00:00,summer
20,2025-01-05T00:00:00+00:00,winter
...,...,...
28906,2001-04-07T00:00:00+00:00,spring
28913,2011-05-18T00:00:00+00:00,spring
28932,1980-04-15T00:00:00+00:00,spring
28939,2012-04-03T00:00:00+00:00,spring
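
Deriving the season from start_date could look like the sketch below. The month-to-season mapping is an assumption: the rows displayed above are consistent with winter (January), spring (April, May) and summer (July), while the label for October to December ("fall") does not appear in the output and is a guess.

```python
import pandas as pd

# Assumed mapping: quarters of the year; "fall" is a guessed label.
MONTH_TO_SEASON = {
    1: "winter", 2: "winter", 3: "winter",
    4: "spring", 5: "spring", 6: "spring",
    7: "summer", 8: "summer", 9: "summer",
    10: "fall", 11: "fall", 12: "fall",
}

start_dates = pd.to_datetime(
    pd.Series([
        "2006-04-06T00:00:00+00:00",
        "2003-01-09T00:00:00+00:00",
        "2024-07-07T00:00:00+00:00",
        "2016-10-02T00:00:00+00:00",
    ]),
    utc=True,
)

derived_season = start_dates.dt.month.map(MONTH_TO_SEASON)
print(derived_season.tolist())
```

The derived value may not always match the original column, for example when a title that starts in late December is listed under the following year, so a comparison against the existing season column is advisable before relying on it.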


In [32]:
anime_details_df.drop(columns=['explicit_genres', 'season'], inplace=True)

In [33]:
anime_details_df['themes'].value_counts()

themes
[]                                                                     11818
['Music']                                                               4015
['Anthropomorphic']                                                      923
['School']                                                               912
['Historical']                                                           858
                                                                       ...  
['Gore', 'Martial Arts', 'Organized Crime', 'Psychological']               1
['Adult Cast', 'Mythology', 'Urban Fantasy', 'Workplace']                  1
['Anthropomorphic', 'CGDCT', 'Iyashikei', 'Mythology', 'Workplace']        1
['CGDCT', 'Educational', 'School']                                         1
['Adult Cast', 'Gore', 'Survival']                                         1
Name: count, Length: 1044, dtype: int64

In [34]:
# check which values "type" takes
anime_details_df['type'].value_counts()

type
TV            8414
Movie         4915
OVA           4184
ONA           4096
Music         3999
Special       1755
TV Special     767
CM             483
PV             275
Name: count, dtype: int64

In [35]:
anime_details_df.query('year>2025')
# these titles are scheduled to air next year, so year > 2025 is legitimate here

Unnamed: 0,mal_id,title,title_japanese,url,image_url,type,status,score,scored_by,start_date,...,studios,themes,demographics,source,rating,episodes,year,producers,licensors,streaming
134,61637,29-sai Dokushin Chuuken Boukensha no Nichijou,29歳独身中堅冒険者の日常,https://myanimelist.net/anime/61637/29-sai_Dok...,https://cdn.myanimelist.net/images/anime/1688/...,TV,Not yet aired,,,2026-01-01T00:00:00+00:00,...,['HORNETS'],[],['Shounen'],Manga,,,2026.0,['Kadokawa'],[],[]
733,62000,Akuyaku Reijou wa Ringoku no Outaishi ni Dekia...,悪役令嬢は隣国の王太子に溺愛される,https://myanimelist.net/anime/62000/Akuyaku_Re...,https://cdn.myanimelist.net/images/anime/1383/...,TV,Not yet aired,,,2026-01-01T00:00:00+00:00,...,['Studio Deen'],['Villainess'],['Shoujo'],Light novel,,,2026.0,['Kadokawa'],[],[]
1152,61333,Ao no Miburo: Serizawa Ansatsu-hen,青のミブロ 芹沢暗殺編,https://myanimelist.net/anime/61333/Ao_no_Mibu...,https://cdn.myanimelist.net/images/anime/1753/...,TV,Not yet aired,,,2025-12-20T00:00:00+00:00,...,['Maho Film'],"['Historical', 'Samurai']",['Shounen'],Manga,PG-13 - Teens 13 or older,,2026.0,[],[],[]
1355,60255,Arne no Jikenbo,アルネの事件簿,https://myanimelist.net/anime/60255/Arne_no_Ji...,https://cdn.myanimelist.net/images/anime/1289/...,TV,Not yet aired,,,2026-01-01T00:00:00+00:00,...,['SILVER LINK.'],['Vampire'],[],Game,,12.0,2026.0,['Asmik Ace'],[],[]
3361,60509,Champignon no Majo,シャンピニオンの魔女,https://myanimelist.net/anime/60509/Champignon...,https://cdn.myanimelist.net/images/anime/1135/...,TV,Not yet aired,,,2026-01-01T00:00:00+00:00,...,"['Typhoon Graphics', 'Qzil.la']",[],[],Manga,,,2026.0,[],[],[]
4687,59853,Dark Moon: Tsuki no Saidan,DARK MOON　-黒の月: 月の祭壇-,https://myanimelist.net/anime/59853/Dark_Moon_...,https://cdn.myanimelist.net/images/anime/1609/...,TV,Not yet aired,,,2026-01-01T00:00:00+00:00,...,['TROYCA'],"['School', 'Vampire']",[],Web manga,,,2026.0,[],[],[]
4700,58886,Darwin Jihen,ダーウィン事変,https://myanimelist.net/anime/58886/Darwin_Jihen,https://cdn.myanimelist.net/images/anime/1740/...,TV,Not yet aired,,,2026-01-01T00:00:00+00:00,...,['Bellnox Films'],[],['Seinen'],Manga,,,2026.0,['TOHO animation'],[],[]
4762,61196,Dead Account,デッドアカウント,https://myanimelist.net/anime/61196/Dead_Account,https://cdn.myanimelist.net/images/anime/1158/...,TV,Not yet aired,,,2026-01-01T00:00:00+00:00,...,['SynergySP'],[],['Shounen'],Manga,,,2026.0,"['Nippon Columbia', 'Bushiroad']",[],[]
5459,61325,"Douse, Koishite Shimaunda. Season 2",どうせ、恋してしまうんだ。Season 2,https://myanimelist.net/anime/61325/Douse_Kois...,https://cdn.myanimelist.net/images/anime/1473/...,TV,Not yet aired,,,2026-01-01T00:00:00+00:00,...,['Typhoon Graphics'],"['Reverse Harem', 'School']",['Shoujo'],Manga,PG-13 - Teens 13 or older,,2026.0,[],[],[]
5929,59229,Enen no Shouboutai: San no Shou Part 2,炎炎ノ消防隊 参ノ章 第2クール,https://myanimelist.net/anime/59229/Enen_no_Sh...,https://cdn.myanimelist.net/images/anime/1847/...,TV,Not yet aired,,,2026-01-01T00:00:00+00:00,...,['David Production'],['Urban Fantasy'],['Shounen'],Manga,PG-13 - Teens 13 or older,,2026.0,[],[],[]


In [36]:
anime_details_df.describe()

Unnamed: 0,mal_id,score,scored_by,rank,popularity,members,favorites,episodes,year
count,28955.0,18882.0,18882.0,21997.0,28955.0,28955.0,28955.0,28275.0,6266.0
mean,33977.521948,6.3905,29963.08,11033.157794,14500.358798,38753.37,430.848973,14.13008,2010.034153
std,19616.858566,0.892045,121966.6,6388.279473,8373.663464,167376.3,4520.610176,47.161445,13.200708
min,1.0,1.89,101.0,1.0,1.0,23.0,0.0,1.0,1961.0
25%,15454.0,5.77,332.25,5505.0,7247.5,233.0,0.0,1.0,2005.0
50%,37969.0,6.36,1528.0,11008.0,14497.0,1077.0,1.0,2.0,2014.0
75%,50434.5,7.03,10145.75,16515.0,21754.5,9193.5,17.0,13.0,2020.0
max,62590.0,9.29,2979733.0,22020.0,28999.0,4230312.0,243358.0,3000.0,2026.0


In [37]:
anime_details_df.query("episodes > 2500")
#checking outliers

Unnamed: 0,mal_id,title,title_japanese,url,image_url,type,status,score,scored_by,start_date,...,studios,themes,demographics,source,rating,episodes,year,producers,licensors,streaming
13697,9947,Lan Mao,蓝猫淘气3000问,https://myanimelist.net/anime/9947/Lan_Mao,https://cdn.myanimelist.net/images/anime/5/275...,TV,Finished Airing,6.06,228.0,1999-10-08T00:00:00+00:00,...,[],[],[],Original,PG - Children,3000.0,,['Beijing Sunchime Happy Culture Company'],[],[]


In [38]:
anime_details_df.dtypes
# start_date and end_date are stored as object (strings) but should be datetimes

mal_id              int64
title              object
title_japanese     object
url                object
image_url          object
type               object
status             object
score             float64
scored_by         float64
start_date         object
end_date           object
synopsis           object
rank              float64
popularity          int64
members             int64
favorites           int64
genres             object
studios            object
themes             object
demographics       object
source             object
rating             object
episodes          float64
year              float64
producers          object
licensors          object
streaming          object
dtype: object

In [39]:
anime_details_df[["start_date", "end_date"]] = anime_details_df[["start_date", "end_date"]].apply(
    pd.to_datetime, errors="coerce"
)
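
With errors="coerce", pd.to_datetime returns NaT for values it cannot parse instead of raising an error. A minimal sketch with made-up date strings, including an impossible 29 February:

```python
import pandas as pd

raw_dates = pd.Series(["2012-02-29", "2013-02-29", "2010-01-01"])

# 2012 was a leap year, 2013 was not: the second value cannot be parsed.
parsed_dates = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(parsed_dates)
print(parsed_dates.isna().sum(), "value(s) could not be parsed")
```

Comparing the number of missing values before and after the conversion shows how many entries were silently turned into NaT.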

In [40]:
anime_details_df.query("start_date > end_date")

Unnamed: 0,mal_id,title,title_japanese,url,image_url,type,status,score,scored_by,start_date,...,studios,themes,demographics,source,rating,episodes,year,producers,licensors,streaming
125,48010,23 Hao Niu Nai Tang,23号牛乃糖,https://myanimelist.net/anime/48010/23_Hao_Niu...,https://cdn.myanimelist.net/images/anime/1789/...,TV,Finished Airing,,,2020-03-01 00:00:00+00:00,...,[],[],['Kids'],Original,PG - Children,26.0,,[],[],[]
614,23583,Akage no Anne Specials,世界名作劇場・完結版 赤毛のアン,https://myanimelist.net/anime/23583/Akage_no_A...,https://cdn.myanimelist.net/images/anime/3/609...,TV Special,Finished Airing,6.25,355.0,2001-02-11 00:00:00+00:00,...,['Nippon Animation'],['Historical'],[],Novel,G - All Ages,2.0,,['BS Fuji'],[],[]
755,44731,Ali Diu Dongxi de Wawa,阿狸·丢东西的娃娃,https://myanimelist.net/anime/44731/Ali_Diu_Do...,https://cdn.myanimelist.net/images/anime/1472/...,ONA,Finished Airing,,,2012-08-31 00:00:00+00:00,...,[],[],['Kids'],Picture book,PG-13 - Teens 13 or older,2.0,,[],[],[]
1462,53727,Ashuai 8th Season,阿衰 第八季,https://myanimelist.net/anime/53727/Ashuai_8th...,https://cdn.myanimelist.net/images/anime/1982/...,ONA,Finished Airing,,,2022-06-03 00:00:00+00:00,...,[],['School'],['Kids'],Manga,G - All Ages,34.0,,[],[],[]
1548,45385,AU: Kaixin Tongnian,阿U之开心童年,https://myanimelist.net/anime/45385/AU__Kaixin...,https://cdn.myanimelist.net/images/anime/1394/...,TV,Finished Airing,,,2013-03-01 00:00:00+00:00,...,[],[],['Kids'],Mixed media,PG - Children,60.0,,[],[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28755,44012,Zhiqu Yang Xuetang: Yangyang Lai Yunbao,智趣羊学堂之羊羊来寻宝,https://myanimelist.net/anime/44012/Zhiqu_Yang...,https://cdn.myanimelist.net/images/anime/1439/...,ONA,Finished Airing,,,2018-02-22 00:00:00+00:00,...,[],[],['Kids'],Unknown,PG - Children,26.0,,[],[],[]
28756,44010,Zhiqu Yang Xuetang: Yangyang You Shijie,智趣羊学堂之羊羊游世界,https://myanimelist.net/anime/44010/Zhiqu_Yang...,https://cdn.myanimelist.net/images/anime/1930/...,ONA,Finished Airing,,,2017-12-01 00:00:00+00:00,...,[],[],['Kids'],Unknown,PG - Children,26.0,,[],[],[]
28830,38118,Zhu Zhu Xia: Mo Huan Zhu Luo Ji,猪猪侠 魔幻猪猡纪,https://myanimelist.net/anime/38118/Zhu_Zhu_Xi...,https://cdn.myanimelist.net/images/anime/1912/...,TV,Finished Airing,,,2006-07-01 00:00:00+00:00,...,[],"['Martial Arts', 'Super Power']",['Kids'],Original,G - All Ages,20.0,,[],[],[]
28833,38119,Zhu Zhu Xia: Wu Xia 2008,猪猪侠 武侠2008,https://myanimelist.net/anime/38119/Zhu_Zhu_Xi...,https://cdn.myanimelist.net/images/anime/1658/...,TV,Finished Airing,,,2007-07-01 00:00:00+00:00,...,[],"['Historical', 'Super Power']",['Kids'],Original,G - All Ages,20.0,,['BlueArc Animation Studios'],[],[]


In [41]:
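# Exclude end dates that fall on 1 January: these are presumably placeholders used
# when only the year is known (an assumption), so start_date > end_date is less surprising there.
anime_details_df.query("start_date > end_date and not (end_date.dt.month == 1 and end_date.dt.day == 1)")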

Unnamed: 0,mal_id,title,title_japanese,url,image_url,type,status,score,scored_by,start_date,...,studios,themes,demographics,source,rating,episodes,year,producers,licensors,streaming
6392,3705,Flashback Game,フラッシュバックゲーム,https://myanimelist.net/anime/3705/Flashback_Game,https://cdn.myanimelist.net/images/anime/1870/...,OVA,Finished Airing,5.43,921.0,2001-10-18 00:00:00+00:00,...,['Blue Cat'],[],[],Original,Rx - Hentai,3.0,,['Five Ways'],['NuTech Digital'],[]
6971,10995,Ganbare!! Nattou-san,頑張れ!!納父さん,https://myanimelist.net/anime/10995/Ganbare_Na...,https://cdn.myanimelist.net/images/anime/9/744...,TV,Finished Airing,,,2011-07-02 00:00:00+00:00,...,['Kachidoki Studio'],[],[],Unknown,G - All Ages,4.0,2011.0,[],[],[]
12673,56892,Kkomimanyeo Lara Season 2,꼬미마녀 라라 시즌2,https://myanimelist.net/anime/56892/Kkomimanye...,https://cdn.myanimelist.net/images/anime/1348/...,TV,Finished Airing,,,2023-10-14 00:00:00+00:00,...,[],"['Mahou Shoujo', 'School']",['Kids'],Original,PG - Children,13.0,2023.0,[],[],[]
17729,34234,Ohayou! Kokekkou-san,おはよう! コケッコーさん,https://myanimelist.net/anime/34234/Ohayou_Kok...,https://cdn.myanimelist.net/images/anime/2/823...,TV,Finished Airing,,,2016-10-02 00:00:00+00:00,...,"['TMS Entertainment', 'TOCSIS']",[],['Kids'],Original,G - All Ages,50.0,2016.0,['TMS Music'],[],[]
19590,43976,Qi Jiguang Yingxiong Chuan,戚继光英雄传,https://myanimelist.net/anime/43976/Qi_Jiguang...,https://cdn.myanimelist.net/images/anime/1831/...,Movie,Finished Airing,,,2012-06-01 00:00:00+00:00,...,[],['Historical'],['Kids'],Other,G - All Ages,1.0,,[],[],[]
24814,38571,Tobidase! Dokan-kun,とびだせ！土管くん,https://myanimelist.net/anime/38571/Tobidase_D...,https://cdn.myanimelist.net/images/anime/1618/...,TV,Finished Airing,,,2011-10-26 00:00:00+00:00,...,['DLE'],['Mecha'],[],Original,G - All Ages,15.0,2011.0,[],[],[]
24837,62453,Tobot: Daedosiui Yeongungdeul Season 2 Part 2,또봇 : 대도시의 영웅들 시즌2 파트 2,https://myanimelist.net/anime/62453/Tobot__Dae...,https://cdn.myanimelist.net/images/anime/1721/...,TV,Finished Airing,,,2024-07-05 00:00:00+00:00,...,[],['Mecha'],['Kids'],Original,PG - Children,8.0,2024.0,[],[],[]
27477,61524,Xiongmao He Gan Mi Xiong,熊猫和甘米熊,https://myanimelist.net/anime/61524/Xiongmao_H...,https://cdn.myanimelist.net/images/anime/1216/...,TV,Finished Airing,,,2025-05-04 00:00:00+00:00,...,[],[],['Kids'],Other,PG - Children,13.0,2025.0,[],[],[]


In [42]:
anime_details_df.query("end_date.isna()")

Unnamed: 0,mal_id,title,title_japanese,url,image_url,type,status,score,scored_by,start_date,...,studios,themes,demographics,source,rating,episodes,year,producers,licensors,streaming
0,59356,-Socket-,-socket-,https://myanimelist.net/anime/59356/-Socket-,https://cdn.myanimelist.net/images/anime/1043/...,Movie,Finished Airing,,,2010-01-01 00:00:00+00:00,...,[],[],[],Original,G - All Ages,1.0,,['Nagoya Zokei University'],[],[]
1,56036,......,......,https://myanimelist.net/anime/56036/-,https://cdn.myanimelist.net/images/anime/1057/...,Music,Finished Airing,6.53,503.0,2023-06-11 00:00:00+00:00,...,['Flat Studio'],['Music'],[],Original,PG-13 - Teens 13 or older,1.0,,[],[],[]
2,2928,.hack//G.U. Returner,.HACK//G.U. RETURNER,https://myanimelist.net/anime/2928/hack__GU_Re...,https://cdn.myanimelist.net/images/anime/1798/...,OVA,Finished Airing,6.65,9745.0,2007-01-18 00:00:00+00:00,...,['Bee Train'],['Video Game'],[],Game,PG-13 - Teens 13 or older,1.0,,"['Bandai Visual', 'CyberConnect2']",[],[]
3,3269,.hack//G.U. Trilogy,.hack//G.U. Trilogy,https://myanimelist.net/anime/3269/hack__GU_Tr...,https://cdn.myanimelist.net/images/anime/1566/...,Movie,Finished Airing,7.06,15373.0,2007-12-22 00:00:00+00:00,...,['CyberConnect2'],['Video Game'],[],Game,PG-13 - Teens 13 or older,1.0,,['Bandai Visual'],"['Funimation', 'Bandai Entertainment']",[]
4,4469,.hack//G.U. Trilogy: Parody Mode,.hack//G.U. Trilogy,https://myanimelist.net/anime/4469/hack__GU_Tr...,https://cdn.myanimelist.net/images/anime/10/86...,Special,Finished Airing,6.35,4317.0,2008-03-25 00:00:00+00:00,...,[],"['Parody', 'Video Game']",[],Game,PG-13 - Teens 13 or older,1.0,,['Bandai Visual'],[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28948,52512,Zurui Maboroshi,ズルい幻,https://myanimelist.net/anime/52512/Zurui_Mabo...,https://cdn.myanimelist.net/images/anime/1096/...,Music,Finished Airing,6.26,128.0,2022-02-22 00:00:00+00:00,...,['StudioXD'],['Music'],[],Original,PG-13 - Teens 13 or older,1.0,,[],[],[]
28949,58303,Zurukute Sugoi,ズルくてすごい,https://myanimelist.net/anime/58303/Zurukute_S...,https://cdn.myanimelist.net/images/anime/1782/...,Music,Finished Airing,,,2019-08-17 00:00:00+00:00,...,[],['Music'],[],Original,PG-13 - Teens 13 or older,1.0,,[],[],[]
28951,31245,Zutto Mae kara Suki deshita. Kokuhaku Jikkou I...,ずっと前から好きでした。～告白実行委員会～,https://myanimelist.net/anime/31245/Zutto_Mae_...,https://cdn.myanimelist.net/images/anime/3/821...,Movie,Finished Airing,7.20,104106.0,2016-04-23 00:00:00+00:00,...,['Qualia Animation'],['School'],[],Music,PG-13 - Teens 13 or older,1.0,,"['Aniplex', 'Dentsu', 'Kadokawa Shoten', 'Movi...",['Aniplex of America'],[]
28952,36305,Zutto Mae kara Suki deshita. Kokuhaku Jikkou I...,ずっと前から好きでした。～告白実行委員会～ 「金曜日のおはよう」,https://myanimelist.net/anime/36305/Zutto_Mae_...,https://cdn.myanimelist.net/images/anime/6/883...,Special,Finished Airing,7.17,10038.0,2016-10-26 00:00:00+00:00,...,['Qualia Animation'],['School'],[],Music,PG - Children,1.0,,['Aniplex'],[],[]


In [43]:
#Set invalid dates to NaT
anime_details_df.loc[anime_details_df["start_date"] > anime_details_df["end_date"], "end_date"] = pd.NaT
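
A small sketch of how this mask behaves, on made-up dates: comparisons involving NaT evaluate to False, so rows whose end_date is already missing are left untouched and only the rows with start_date after end_date are reset.

```python
import pandas as pd

toy_dates_df = pd.DataFrame({
    "start_date": pd.to_datetime(["2020-03-01", "2011-07-02", "2010-01-01"], utc=True),
    "end_date": pd.to_datetime(["2020-01-01", "2011-09-24", None], utc=True),
})

# True only where both dates exist and the start is after the end.
invalid_mask = toy_dates_df["start_date"] > toy_dates_df["end_date"]
toy_dates_df.loc[invalid_mask, "end_date"] = pd.NaT

print(invalid_mask.tolist())
print(toy_dates_df)
```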


### THIRD DATASET

In [44]:
favs_df

Unnamed: 0,username,fav_type,id
0,ishikawas,anime,45649
1,ishikawas,anime,38680
2,ishikawas,anime,795
3,ishikawas,anime,37510
4,ishikawas,anime,820
...,...,...,...
4178742,vincent0607,character,497
4178743,vincent0607,character,118739
4178744,vincent0607,character,188177
4178745,vincent0607,character,141354


In [45]:
# check which values "fav_type" takes
favs_df['fav_type'].value_counts()

fav_type
character    1598040
anime        1531857
people        862186
company       186664
Name: count, dtype: int64

In [46]:
favs_df.isna().sum()


username    4
fav_type    0
id          0
dtype: int64
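
One possible explanation for the 4 missing usernames, not verified against the data, is that pd.read_csv treats strings such as "NA", "null" or "NaN" as missing by default, so a user with such a name would be read as NaN. The sketch below uses a made-up CSV to show the effect and one way to turn it off.

```python
import io

import pandas as pd

# Made-up rows: two usernames that collide with pandas' default NA markers.
csv_text = "username,fav_type,id\nnull,anime,1\nNA,character,2\nbob,anime,3\n"

default_df = pd.read_csv(io.StringIO(csv_text))

# Keep only the empty string as a missing-value marker.
literal_df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

print(default_df["username"].isna().sum())
print(literal_df["username"].tolist())
```

Inspecting the raw lines of favs.csv at the affected positions would confirm or rule out this explanation.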

In [47]:
favs_df.dtypes

username    object
fav_type    object
id           int64
dtype: object

In [48]:
favs_df.loc[favs_df.duplicated()]
#there are no duplicates


Unnamed: 0,username,fav_type,id


### FOURTH DATASET

In [49]:
person_alternate_names_df

Unnamed: 0,person_mal_id,alt_name
0,1,Seki Mondoya
1,1,門戸 開
2,1,Monto Hiraku
3,3,雪野五月
4,10,Kevin Hatcher
...,...,...
20460,89567,Sydsnap
20461,89567,Queen of Degeneracy
20462,89826,陳浩
20463,89842,Chidori


In [50]:
person_alternate_names_df.dtypes

person_mal_id     int64
alt_name         object
dtype: object

In [51]:
person_alternate_names_df['person_mal_id'].value_counts()
# the person with the most alternate names has 29 different names

person_mal_id
246      29
548      28
406      16
10957    16
7025     15
         ..
58801     1
58795     1
58780     1
58779     1
55550     1
Name: count, Length: 12376, dtype: int64

In [52]:
person_alternate_names_df['alt_name'].value_counts()

alt_name
アイス                 6
Friendly Land       6
Aice5               6
Aice⁵               5
Studio Wallaby      4
                   ..
Mirei Miyamoto      1
宮本 深礼               1
Akihisa Matsuura    1
笹島啓一                1
Nobu                1
Name: count, Length: 20248, dtype: int64

In [53]:
person_alternate_names_df.isna().sum()
person_alternate_names_df[person_alternate_names_df.isna().any(axis=1)]


Unnamed: 0,person_mal_id,alt_name
1470,2813,
1662,3380,
2552,7070,
3932,12406,
7829,40897,
8647,44952,
8895,46425,
9899,49829,
11160,53940,
11350,54493,


In [54]:
person_alternate_names_df.dropna(inplace=True)

In [55]:
person_alternate_names_df[person_alternate_names_df.duplicated(subset=['person_mal_id','alt_name'], keep=False)].sort_values(['person_mal_id','alt_name'])

Unnamed: 0,person_mal_id,alt_name
2526,7025,Fumio Tada
2531,7025,Fumio Tada
2536,7025,Fumio Tada
2528,7025,Ichirou Miyoshi
2533,7025,Ichirou Miyoshi
...,...,...
17383,77491,Joshua Ricardo Rocha Jiménez
17386,77514,Scott Page
17387,77514,Scott Page
18080,80213,Samantha Carmichael


In [56]:
person_alternate_names_df.drop_duplicates(keep='first', inplace=True)

### FIFTH DATASET

In [57]:
person_details_df

Unnamed: 0,person_mal_id,url,website_url,image_url,name,given_name,family_name,birthday,favorites,relevant_location
0,1,https://myanimelist.net/people/1/Tomokazu_Seki,,https://cdn.myanimelist.net/images/voiceactors...,Tomokazu Seki,智一,関,1972-09-08T00:00:00+00:00,6219,"Berlin, Germany"
1,2,https://myanimelist.net/people/2/Tomokazu_Sugita,https://agrs.co.jp/,https://cdn.myanimelist.net/images/voiceactors...,Tomokazu Sugita,智和,杉田,1980-10-11T00:00:00+00:00,47666,"Los Angeles, USA"
2,3,https://myanimelist.net/people/3/Satsuki_Yukino,,https://cdn.myanimelist.net/images/voiceactors...,Satsuki Yukino,さつき,ゆきの,1970-05-25T00:00:00+00:00,1777,"Madrid, Spain"
3,4,https://myanimelist.net/people/4/Aya_Hirano,http://ayahirano.jp/,https://cdn.myanimelist.net/images/voiceactors...,Aya Hirano,綾,平野,1987-10-08T00:00:00+00:00,18374,"Paris, France"
4,5,https://myanimelist.net/people/5/Kenichi_Suzumura,https://intention-k.com,https://cdn.myanimelist.net/images/voiceactors...,Kenichi Suzumura,健一,鈴村,1974-09-12T00:00:00+00:00,5176,"Osaka, Japan"
...,...,...,...,...,...,...,...,...,...,...
76694,90011,https://myanimelist.net/people/90011/Nanako_Ki...,,https://cdn.myanimelist.net/img/sp/icon/apple-...,Nanako Kishimoto,七子,岸本,,0,"Mumbai, India"
76695,90012,https://myanimelist.net/people/90012/Pamon,,https://cdn.myanimelist.net/img/sp/icon/apple-...,Pamon,,파몬,,0,"Tokyo, Japan"
76696,90013,https://myanimelist.net/people/90013/Tomoru_Emoto,,https://cdn.myanimelist.net/img/sp/icon/apple-...,Tomoru Emoto,ともる,柄本,,0,"Tokyo, Japan"
76697,90014,https://myanimelist.net/people/90014/Hirari,https://hirari.2-d.jp/,https://cdn.myanimelist.net/img/sp/icon/apple-...,Hirari,,ひらり,,0,"Paris, France"


In [58]:
person_details_df.value_counts('relevant_location')


relevant_location
London, UK                 6989
Tokyo, Japan               6778
New York, USA              6670
Los Angeles, USA           5552
Paris, France              5517
Osaka, Japan               5516
Berlin, Germany            4199
Nagoya, Japan              4182
Houston, USA               4166
Chicago, USA               4082
Madrid, Spain              4080
Yokohama, Japan            4051
Rome, Italy                2703
Sapporo, Japan             2679
San Francisco, USA         2668
Mumbai, India              1382
São Paulo, Brazil          1380
Cape Town, South Africa    1370
Sydney, Australia          1369
Mexico City, Mexico        1366
Name: count, dtype: int64

In [59]:
person_details_df.loc[person_details_df['person_mal_id'].duplicated()]
# the rows with a duplicated person_mal_id differ only in the "relevant_location" field, which is of no interest to us, so we drop the duplicates

Unnamed: 0,person_mal_id,url,website_url,image_url,name,given_name,family_name,birthday,favorites,relevant_location
4171,4564,https://myanimelist.net/people/4564/Natsuko_Ta...,,https://cdn.myanimelist.net/images/voiceactors...,Natsuko Takahashi,ナツコ,高橋,,19,"Paris, France"
4172,4564,https://myanimelist.net/people/4564/Natsuko_Ta...,,https://cdn.myanimelist.net/images/voiceactors...,Natsuko Takahashi,ナツコ,高橋,,19,"Cape Town, South Africa"
5202,5742,https://myanimelist.net/people/5742/Sarah_Strange,,https://cdn.myanimelist.net/images/voiceactors...,Sarah Strange,,,,11,"Yokohama, Japan"
6018,6638,https://myanimelist.net/people/6638/Toshio_Suzuki,,https://cdn.myanimelist.net/images/voiceactors...,Toshio Suzuki,敏夫,鈴木,1948-08-19T00:00:00+00:00,354,"Paris, France"
6019,6638,https://myanimelist.net/people/6638/Toshio_Suzuki,,https://cdn.myanimelist.net/images/voiceactors...,Toshio Suzuki,敏夫,鈴木,1948-08-19T00:00:00+00:00,354,"Mumbai, India"
...,...,...,...,...,...,...,...,...,...,...
64387,77565,https://myanimelist.net/people/77565/Ena_Nishi...,,https://cdn.myanimelist.net/img/sp/icon/apple-...,Ena Nishikawa,絵奈,西川,,0,"Paris, France"
64402,77579,https://myanimelist.net/people/77579/Amina_Gaede,https://www.instagram.com/amina.gaede/,https://cdn.myanimelist.net/images/voiceactors...,Amina Gaede,,,1998-07-05T00:00:00+00:00,0,"São Paulo, Brazil"
64412,77588,https://myanimelist.net/people/77588/Willie_Ra...,,https://cdn.myanimelist.net/images/voiceactors...,Willie Ray Jr.,,,,0,"Berlin, Germany"
64414,77589,https://myanimelist.net/people/77589/Brandi_Ray,,https://cdn.myanimelist.net/images/voiceactors...,Brandi Ray,,,,0,"Mumbai, India"
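The claim that the duplicates differ only in "relevant_location" can be checked programmatically rather than by eye. The following is a minimal sketch on a toy frame with made-up values; the helper name `columns_that_differ` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for person_details_df (values are invented for illustration).
toy_people = pd.DataFrame({
    "person_mal_id": [1, 1, 2, 3, 3],
    "name": ["A", "A", "B", "C", "C"],
    "favorites": [19, 19, 5, 0, 0],
    "relevant_location": ["Paris, France", "Cape Town, South Africa",
                          "Tokyo, Japan", "Berlin, Germany", "Mumbai, India"],
})

def columns_that_differ(df, key):
    """Return the columns that take more than one value within any duplicated `key` group."""
    dup_rows = df[df.duplicated(subset=[key], keep=False)]
    distinct_per_group = dup_rows.groupby(key).nunique(dropna=False)
    varies = (distinct_per_group > 1).any()
    return varies[varies].index.tolist()

differing = columns_that_differ(toy_people, "person_mal_id")
print(differing)
```

On the real data the call would be `columns_that_differ(person_details_df, "person_mal_id")`; if it returns anything besides "relevant_location", dropping duplicates on the id alone would discard information.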


In [60]:
person_details_df.drop_duplicates(subset=['person_mal_id', 'url', 'name'], keep='first', inplace=True)


In [61]:
person_details_df.dtypes
#we need to change birthday from object to date

person_mal_id         int64
url                  object
website_url          object
image_url            object
name                 object
given_name           object
family_name          object
birthday             object
favorites             int64
relevant_location    object
dtype: object

In [62]:
person_details_df["birthday"] = pd.to_datetime(person_details_df["birthday"], errors='coerce')

In [63]:
person_details_df["birthday"].min(), person_details_df["birthday"].max()
#the very old minimum makes sense: these people are composers whose music is used in anime

(Timestamp('1678-03-04 00:00:00+0000', tz='UTC'),
 Timestamp('2023-06-15 00:00:00+0000', tz='UTC'))

In [64]:
person_details_df.isna().sum()
#we have to check the nan values

person_mal_id            0
url                      0
website_url          59186
image_url                0
name                     2
given_name           30211
family_name          18610
birthday             59399
favorites                0
relevant_location        0
dtype: int64

In [65]:
person_details_df[person_details_df["name"].isna()]


Unnamed: 0,person_mal_id,url,website_url,image_url,name,given_name,family_name,birthday,favorites,relevant_location
46687,59857,https://myanimelist.net/people/59857/None,,https://cdn.myanimelist.net/img/sp/icon/apple-...,,,のね,NaT,0,"Paris, France"
75920,89106,https://myanimelist.net/people/89106/NULL,,https://cdn.myanimelist.net/img/sp/icon/apple-...,,,,NaT,0,"Paris, France"


### SIXTH DATASET

In [66]:
person_anime_works_df

Unnamed: 0,person_mal_id,position,anime_mal_id
0,1,Theme Song Performance,3080
1,1,Inserted Song Performance,15699
2,1,Theme Song Performance (OP),247
3,1,Theme Song Performance,258
4,1,Theme Song Performance (ED),34825
...,...,...,...
458086,89951,In-Between Animation,11001
458087,89951,Key Animation,55092
458088,89951,"Key Animation (eps 4, 9)",20053
458089,89951,Key Animation,50553


In [67]:
person_anime_works_df.dtypes
#the types are correct

person_mal_id     int64
position         object
anime_mal_id      int64
dtype: object

In [68]:
person_anime_works_df.isna().sum()
#There's no nan value

person_mal_id    0
position         0
anime_mal_id     0
dtype: int64

In [69]:
person_anime_works_df.loc[person_anime_works_df.duplicated()]

Unnamed: 0,person_mal_id,position,anime_mal_id


No need to clean this dataset

### SEVENTH DATASET

In [70]:
stats_df

Unnamed: 0,mal_id,watching,completed,on_hold,dropped,plan_to_watch,total,score_1_votes,score_1_percentage,score_2_votes,...,score_6_votes,score_6_percentage,score_7_votes,score_7_percentage,score_8_votes,score_8_percentage,score_9_votes,score_9_percentage,score_10_votes,score_10_percentage
0,59356,7,146,4,20,20,197,2.0,2.2,0.0,...,33.0,36.3,19.0,20.9,2.0,2.2,0.0,0.0,1.0,1.1
1,56036,21,770,8,29,113,941,5.0,1.0,6.0,...,138.0,27.4,144.0,28.6,81.0,16.1,17.0,3.4,40.0,8.0
2,2928,451,14953,302,349,6472,22527,101.0,1.0,93.0,...,2054.0,21.1,2709.0,27.8,1500.0,15.4,875.0,9.0,608.0,6.2
3,3269,726,22790,452,537,9762,34267,120.0,0.8,156.0,...,2457.0,16.0,4157.0,27.0,3075.0,20.0,1919.0,12.5,1400.0,9.1
4,4469,241,6918,182,266,3528,11135,83.0,1.9,104.0,...,888.0,20.6,871.0,20.2,592.0,13.7,308.0,7.1,315.0,7.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28950,59421,11349,17353,574,1440,15729,46445,54.0,0.3,40.0,...,2303.0,14.5,5545.0,34.9,3919.0,24.7,1598.0,10.1,1235.0,7.8
28951,31245,10332,140676,2989,1416,94249,249662,357.0,0.3,484.0,...,15972.0,15.3,34370.0,33.0,24387.0,23.4,10480.0,10.1,7589.0,7.3
28952,36305,928,16119,388,243,9656,27334,48.0,0.5,28.0,...,1839.0,18.3,3258.0,32.5,2009.0,20.0,887.0,8.8,936.0,9.3
28953,34895,896,2387,375,356,1512,5526,116.0,6.1,80.0,...,437.0,23.2,280.0,14.8,170.0,9.0,68.0,3.6,125.0,6.6


In [71]:
stats_df.describe()

Unnamed: 0,mal_id,watching,completed,on_hold,dropped,plan_to_watch,total,score_1_votes,score_1_percentage,score_2_votes,...,score_6_votes,score_6_percentage,score_7_votes,score_7_percentage,score_8_votes,score_8_percentage,score_9_votes,score_9_percentage,score_10_votes,score_10_percentage
count,28955.0,28955.0,28955.0,28955.0,28955.0,28955.0,28955.0,28525.0,28525.0,28525.0,...,28525.0,28525.0,28525.0,28525.0,28525.0,28525.0,28525.0,28525.0,28525.0,28525.0
mean,33977.521948,2586.641,24838.18,974.935313,1301.307304,9058.571231,38759.63,143.433094,5.567793,126.22787,...,2132.082559,17.608512,4331.507344,17.253325,5000.052515,10.330149,3531.283155,5.578801,2853.74,14.390657
std,19616.858566,16545.37,123576.9,5151.361911,5930.202998,30335.000519,167392.1,930.822451,6.142436,711.56521,...,8478.229701,7.314723,18841.647429,8.935283,26978.836598,9.429745,24313.644255,6.550833,22869.32,12.475693
min,1.0,0.0,0.0,0.0,0.0,2.0,23.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,15454.0,11.0,86.0,4.0,33.0,52.0,233.0,4.0,1.1,1.0,...,11.0,12.5,6.0,10.0,2.0,2.4,0.0,0.0,9.0,6.5
50%,37969.0,57.0,431.0,22.0,85.0,335.0,1077.0,17.0,3.5,9.0,...,60.0,17.6,46.0,16.9,21.0,7.9,10.0,3.5,25.0,9.7
75%,50434.5,489.5,4430.0,230.0,258.0,3057.5,9199.0,57.0,8.0,46.0,...,618.0,22.4,787.0,24.4,503.0,16.3,272.0,8.5,251.0,16.8
max,62590.0,1838015.0,3716436.0,308583.0,237819.0,687110.0,4230824.0,50279.0,100.0,36397.0,...,281971.0,100.0,514959.0,100.0,756635.0,100.0,872553.0,100.0,1090469.0,100.0


In [72]:
stats_df.isna().sum()
#there are 430 series without any vote

mal_id                   0
watching                 0
completed                0
on_hold                  0
dropped                  0
plan_to_watch            0
total                    0
score_1_votes          430
score_1_percentage     430
score_2_votes          430
score_2_percentage     430
score_3_votes          430
score_3_percentage     430
score_4_votes          430
score_4_percentage     430
score_5_votes          430
score_5_percentage     430
score_6_votes          430
score_6_percentage     430
score_7_votes          430
score_7_percentage     430
score_8_votes          430
score_8_percentage     430
score_9_votes          430
score_9_percentage     430
score_10_votes         430
score_10_percentage    430
dtype: int64
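Every score column reports the same count of 430 missing values, which suggests (but does not prove) that the same 430 rows are missing all of them. A small sketch of how that could be confirmed, using a toy frame with made-up numbers:

```python
import numpy as np
import pandas as pd

# Toy stand-in for stats_df with only a few of the score columns.
toy_stats = pd.DataFrame({
    "mal_id": [1, 2, 3],
    "score_1_votes": [2.0, np.nan, 5.0],
    "score_1_percentage": [2.2, np.nan, 1.0],
    "score_10_votes": [1.0, np.nan, 40.0],
})

score_cols = [c for c in toy_stats.columns if c.startswith("score_")]
missing_per_row = toy_stats[score_cols].isna().sum(axis=1)

# True when each row has either all score columns present or all of them missing.
all_or_nothing = bool(missing_per_row.isin([0, len(score_cols)]).all())
n_unvoted = int((missing_per_row == len(score_cols)).sum())
print(all_or_nothing, n_unvoted)
```

If `all_or_nothing` is True on the real `stats_df`, the NaN values describe anime with no votes at all rather than partially missing records.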

In [73]:
stats_df[stats_df.duplicated(subset=['mal_id'], keep="first")]


Unnamed: 0,mal_id,watching,completed,on_hold,dropped,plan_to_watch,total,score_1_votes,score_1_percentage,score_2_votes,...,score_6_votes,score_6_percentage,score_7_votes,score_7_percentage,score_8_votes,score_8_percentage,score_9_votes,score_9_percentage,score_10_votes,score_10_percentage


No need to clean this dataset

### EIGHTH DATASET 

In [74]:
ratings_df

Unnamed: 0,username,anime_id,status,score,is_rewatching,num_watched_episodes
0,--------788,30276,watching,7,0,3
1,--------788,28851,completed,7,0,1
2,--------788,41168,completed,7,0,1
3,--------788,22199,completed,10,0,24
4,--------788,16498,completed,10,0,25
...,...,...,...,...,...,...
124298352,arizkim,52305,plan_to_watch,0,0,0
124298353,arizkim,4224,plan_to_watch,0,0,0
124298354,arizkim,54790,plan_to_watch,0,0,0
124298355,arizkim,53835,plan_to_watch,0,0,0


In [75]:
ratings_df.describe()

Unnamed: 0,anime_id,score,is_rewatching,num_watched_episodes
count,124298357.0,124298357.0,120501036.0,124298357.0
mean,28071.344801,4.094335,0.000858,12.68757
std,18453.816887,3.87406,0.029272,308.711587
min,1.0,0.0,0.0,0.0
25%,10165.0,0.0,0.0,0.0
50%,31832.0,5.0,0.0,4.0
75%,40902.0,8.0,0.0,12.0
max,62893.0,10.0,1.0,65535.0


In [76]:
ratings_df.columns

Index(['username', 'anime_id', 'status', 'score', 'is_rewatching',
       'num_watched_episodes'],
      dtype='object')

In [77]:
# we need to understand what "num_watched_episodes" means and how it relates to "is_rewatching"
ratings_df.query('is_rewatching == 1')



Unnamed: 0,username,anime_id,status,score,is_rewatching,num_watched_episodes
934,CKK2,6,completed,9,1,8
8939,FollowYourHeart,1210,completed,10,1,24
11047,----Haku----,39017,completed,7,1,3
22879,KarioBaka,10087,completed,9,1,3
22928,KarioBaka,14467,completed,0,1,6
...,...,...,...,...,...,...
124291819,arissabelle,2167,completed,0,1,0
124291878,arissabelle,33352,completed,10,1,0
124291883,arissabelle,6547,on_hold,0,1,1
124297299,ariyanroy04,46569,completed,0,1,13


In [78]:
ratings_df.dtypes

username                category
anime_id                   Int32
status                    object
score                       Int8
is_rewatching               Int8
num_watched_episodes       Int32
dtype: object

In [79]:
ratings_df[["anime_id","score","num_watched_episodes"]].agg(["min", "max"])


Unnamed: 0,anime_id,score,num_watched_episodes
min,1,0,0
max,62893,10,65535


In [80]:
ratings_df[ratings_df.duplicated(subset=['username','anime_id'], keep=False)].sort_values(['username','anime_id'])

Unnamed: 0,username,anime_id,status,score,is_rewatching,num_watched_episodes
44417439,W00F1234,59457,on_hold,0,0,1
44469797,W00F1234,59457,dropped,3,0,1
74566332,Pavle2009,44511,watching,0,0,5
74614909,Pavle2009,44511,watching,0,0,7
74586609,Doopelsdoo,59459,watching,0,0,8
74616569,Doopelsdoo,59459,watching,0,0,9
74594917,Door_mp3,59644,watching,8,0,5
74626628,Door_mp3,59644,watching,8,0,6
74594935,Door_mp3,60564,watching,8,0,4
74626629,Door_mp3,60564,watching,8,0,5


In [81]:
# usually we would keep the first occurrence of each duplicate and drop the rest
# in this case, though, the last occurrence looks like the most up-to-date one (it contains more information), so we keep the last and drop the earlier one
ratings_df.drop_duplicates(subset=['username', 'anime_id'], keep='last', inplace=True)

The check above shows only 5 duplicated (username, anime_id) pairs, i.e. 10 rows.
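As a quick illustration of what keep='last' does, here is a minimal sketch on a toy frame modelled on the first duplicated pair (the usernames and values are made up):

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "username": ["u1", "u1", "u2"],
    "anime_id": [10, 10, 20],
    "status": ["on_hold", "dropped", "watching"],
    "score": [0, 3, 8],
})

# keep="last" retains the row that appears later in the file for each (username, anime_id).
deduped = toy_ratings.drop_duplicates(subset=["username", "anime_id"], keep="last")
print(deduped)
```

This relies on the assumption that row order in the file reflects update order, which is what the duplicated rows seem to indicate.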

In [82]:
# we check for Nan values.
ratings_df.isna().sum()

username                      7
anime_id                      0
status                        0
score                         0
is_rewatching           3797321
num_watched_episodes          0
dtype: int64

In [83]:
# We fill the NaN values in "is_rewatching" by checking whether num_watched_episodes is greater than the number of episodes of the anime itself
episodes_map = anime_details_df.set_index("mal_id")["episodes"]

ratings_df["total_episodes"] = ratings_df["anime_id"].map(episodes_map)

valid_mask = (ratings_df["total_episodes"].notna() & (ratings_df["total_episodes"] > 0) &
              ratings_df["num_watched_episodes"].notna() & (ratings_df["num_watched_episodes"] >= 0))

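condition = (ratings_df["num_watched_episodes"] > ratings_df["total_episodes"]) & valid_mask
# cast the boolean condition to Int8 so the filled values match the dtype of the column
ratings_df["is_rewatching"] = ratings_df["is_rewatching"].fillna(condition.astype("Int8"))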

To compare num_watched_episodes with the total number of episodes of each anime without duplicating a 4 GB table in memory, we used a mapping from anime_id to episodes instead of a full merge. This approach uses significantly less RAM and is therefore better suited to machines with little memory.
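The mapping approach can be sketched on toy data as follows. The frames and values here are invented; only the column names mirror the notebook.

```python
import pandas as pd

toy_details = pd.DataFrame({"mal_id": [1, 2], "episodes": [12.0, 24.0]})
toy_ratings = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "num_watched_episodes": [30, 10, 5],
    "is_rewatching": pd.array([pd.NA, 0, pd.NA], dtype="Int8"),
})

# A Series indexed by mal_id acts as a lookup table; .map adds one column
# instead of materialising a merged copy of the whole ratings table.
episodes_lookup = toy_details.set_index("mal_id")["episodes"]
total = toy_ratings["anime_id"].map(episodes_lookup)

# Unknown totals (anime_id 3 here) never count as a rewatch.
exceeds = total.notna() & (toy_ratings["num_watched_episodes"] > total)
toy_ratings["is_rewatching"] = toy_ratings["is_rewatching"].fillna(exceeds.astype("Int8"))
print(toy_ratings)
```

Only missing flags are filled: the existing 0 in the second row is left untouched even though the comparison is also evaluated for it.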

In [84]:
ratings_df[ratings_df['username'].isna()]
# 7 rows have a NaN username (they are contiguous, so likely a single user); we drop them because they're not relevant to our analysis

Unnamed: 0,username,anime_id,status,score,is_rewatching,num_watched_episodes,total_episodes
100051758,,1482,watching,9,0,35,103.0
100051759,,1735,watching,9,0,21,500.0
100051760,,121,completed,10,0,51,51.0
100051761,,136,completed,8,0,62,62.0
100051762,,269,on_hold,7,0,33,366.0
100051763,,1818,on_hold,7,0,11,26.0
100051764,,1535,plan_to_watch,6,0,4,37.0


In [85]:
ratings_df.dropna(subset=["username"],inplace=True)

In [86]:
# first_chunk = True
# for chunk in pd.read_csv("datasets/ratings.csv", chunksize=2_000_000):
#    chunk["is_rewatching"] = chunk["is_rewatching"].astype("Int8")
#    chunk["anime_id"] = chunk["anime_id"].astype("Int32")
#    chunk["score"] = chunk["score"].astype("Int8")
#    chunk["num_watched_episodes"] = chunk["num_watched_episodes"].astype("Int32")
#
#    chunk.to_csv(
#        "datasets/ratings_half_cleaned.csv", mode="w" if first_chunk else "a",
#        index=False, header=first_chunk,lineterminator="\n" 
#    )
#    first_chunk = False

# ratings_cleaned_df = pd.read_csv("datasets/ratings_half_cleaned.csv")
# ratings_cleaned_df.drop_duplicates(subset=["username", "anime_id"], keep="last", inplace=True)
# ratings_cleaned_df.to_csv("datasets/ratings_cleaned.csv", index=False, lineterminator="\n")

This takes so long that I'm not sure it's worth it: more than 12 minutes on my most powerful machine.
The cleaned version is about the same size as the original one, and we save just a bit of memory when loading it (about 10% less).
The time goes into reading and writing the file twice, just for a small gain in memory usage.
Why twice? Because if we drop duplicates within each chunk while reading it, we may miss duplicates that fall in different chunks.
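The cross-chunk problem is easy to reproduce on a tiny in-memory CSV. This sketch uses made-up rows and a chunk size of 2 so that the two rows of one duplicated pair land in different chunks:

```python
import io
import pandas as pd

csv_text = "username,anime_id,score\nu1,10,7\nu2,20,8\nu1,10,9\nu3,30,6\n"
keys = ["username", "anime_id"]

# Deduplicating inside each chunk cannot see the (u1, 10) pair,
# because its two rows are read in different chunks.
per_chunk = pd.concat(
    [chunk.drop_duplicates(subset=keys, keep="last")
     for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]
)

# A second pass over the combined result is what finally removes it.
global_pass = per_chunk.drop_duplicates(subset=keys, keep="last")
print(len(per_chunk), len(global_pass))
```

The per-chunk result still has all 4 rows, while the global pass leaves 3, which is why the commented-out cell needs a second full read.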

In [87]:
ratings_df.to_csv("datasets/ratings_cleaned.csv", index=False, lineterminator="\n")

### NINTH DATASET

In [88]:
characters_df

Unnamed: 0,character_mal_id,url,name,name_kanji,image,favorites,about
0,280386.0,https://myanimelist.net/character/280386/Envi_...,Envi Mel Champagne,エンヴィ・メル・シャンパーニュ,https://cdn.myanimelist.net/images/characters/...,0.0,
1,280354.0,https://myanimelist.net/character/280354/Eleven,Eleven,イレヴン,https://cdn.myanimelist.net/images/characters/...,0.0,
2,280353.0,https://myanimelist.net/character/280353/Stud,Stud,スタッド,https://cdn.myanimelist.net/images/characters/...,0.0,
3,280352.0,https://myanimelist.net/character/280352/Judge,Judge,ジャッジ,https://cdn.myanimelist.net/images/characters/...,0.0,
4,280339.0,https://myanimelist.net/character/280339/Eiji_...,Eiji Kurokawa,黒川 英治,https://cdn.myanimelist.net/img/sp/icon/apple-...,0.0,
...,...,...,...,...,...,...,...
209958,282276.0,https://myanimelist.net/character/282276/Farra...,Farrah Van Dorothy,ファラ・ヴァン・ドロシー,https://cdn.myanimelist.net/images/characters/...,0.0,
209959,282277.0,https://myanimelist.net/character/282277/Harri...,Harris Mead,ハリス・ミード,https://cdn.myanimelist.net/images/characters/...,0.0,
209960,282278.0,https://myanimelist.net/character/282278/Rob,Rob,ロブ,https://cdn.myanimelist.net/images/characters/...,0.0,
209961,282281.0,https://myanimelist.net/character/282281/Grimm,Grimm,グリム,https://cdn.myanimelist.net/images/characters/...,0.0,


In [89]:
# check types of dataset columns
characters_df.dtypes

character_mal_id    float64
url                  object
name                 object
name_kanji           object
image                object
favorites           float64
about                object
dtype: object

In [90]:
characters_df.describe()

Unnamed: 0,character_mal_id,favorites
count,209961.0,209961.0
mean,150579.010283,57.616138
std,85243.624423,1197.586488
min,1.0,0.0
25%,74581.0,0.0
50%,160191.0,0.0
75%,225373.0,2.0
max,282284.0,175632.0


In [91]:
characters_df.isna().sum()

character_mal_id        2
url                     2
name                    2
name_kanji          55480
image                   2
favorites               2
about               96976
dtype: int64

In [92]:
# here we want to check whether the NaN values are concentrated in only two rows
characters_df[characters_df['character_mal_id'].isna()]

Unnamed: 0,character_mal_id,url,name,name_kanji,image,favorites,about
209733,,,,,,,
209734,,,,,,,


In [93]:
# Nan values are concentrated in two rows so we drop them
characters_df.dropna(how='all', inplace=True)

In [94]:
# we drop the name_kanji column because we don't actually need it for our analysis nor usage
characters_df.drop(columns='name_kanji', inplace=True)

In [95]:
# we want to see all duplicates to understand whether we have to drop them or not
characters_df.loc[characters_df.duplicated(subset=['character_mal_id', 'url', 'name'], keep=False)]

Unnamed: 0,character_mal_id,url,name,image,favorites,about
853,279200.0,https://myanimelist.net/character/279200/Shing...,Shingo Shimazaki,https://cdn.myanimelist.net/images/characters/...,1.0,
854,279200.0,https://myanimelist.net/character/279200/Shing...,Shingo Shimazaki,https://cdn.myanimelist.net/images/characters/...,1.0,
909,279142.0,https://myanimelist.net/character/279142/Kaho,Kaho,https://cdn.myanimelist.net/images/characters/...,0.0,
910,279142.0,https://myanimelist.net/character/279142/Kaho,Kaho,https://cdn.myanimelist.net/images/characters/...,0.0,
1166,278865.0,https://myanimelist.net/character/278865/Yoshi...,Yoshirou Sonozaki,https://cdn.myanimelist.net/images/characters/...,0.0,The owner of the restaurant Angel Mort.
...,...,...,...,...,...,...
209789,282042.0,https://myanimelist.net/character/282042/Alina,Alina,https://cdn.myanimelist.net/images/characters/...,0.0,
209849,282115.0,https://myanimelist.net/character/282115/Cavitt,Cavitt,https://cdn.myanimelist.net/images/characters/...,0.0,Lieutenant Rencon's younger sister.\n\nNo voic...
209850,282115.0,https://myanimelist.net/character/282115/Cavitt,Cavitt,https://cdn.myanimelist.net/images/characters/...,0.0,Lieutenant Rencon's younger sister.\n\nNo voic...
209858,282131.0,https://myanimelist.net/character/282131/Lieut...,Lieutenant Rencon,https://cdn.myanimelist.net/images/characters/...,0.0,Former lieutenant of Mar Expedition and older ...


In [96]:
# we drop the duplicates because all their values are identical
characters_df.drop_duplicates(subset=['character_mal_id', 'url', 'name'], keep='first', inplace=True)

### TENTH DATASET

In [97]:
# role of each character in each anime
character_anime_works_df

Unnamed: 0,anime_mal_id,character_mal_id,character_name,role
0,2928,5781,Atoli,Main
1,2928,33,Haseo,Main
2,2928,32,Ovan,Main
3,2928,34,Shino,Main
4,2928,5785,Aina,Supporting
...,...,...,...,...
236811,31245,137157,"Shibasaki, Ken",Supporting
236812,36305,136064,"Hamanaka, Midori",Main
236813,36305,133916,"Narumi, Sena",Main
236814,36305,124942,"Hayasaka, Akari",Supporting


In [98]:
# check types of columns
character_anime_works_df.dtypes

anime_mal_id         int64
character_mal_id     int64
character_name      object
role                object
dtype: object

In [99]:
# check the number of Nan value
character_anime_works_df.isna().sum()

anime_mal_id        0
character_mal_id    0
character_name      0
role                0
dtype: int64

In [100]:
# check the number of duplicates
character_anime_works_df.loc[character_anime_works_df.duplicated(subset=['anime_mal_id', 'character_mal_id'])]

Unnamed: 0,anime_mal_id,character_mal_id,character_name,role


There is no need to clean this dataset 

### ELEVENTH DATASET

In [101]:
person_voice_works_df

Unnamed: 0,person_mal_id,role,anime_mal_id,character_mal_id,language
0,1,Main,55830,2514,Japanese
1,1,Supporting,60602,2822,Japanese
2,1,Supporting,59229,140499,Japanese
3,1,Supporting,60427,275856,Japanese
4,1,Supporting,62067,190335,Japanese
...,...,...,...,...,...
489511,89839,Supporting,40111,274622,Mandarin
489512,89840,Supporting,60544,266412,Mandarin
489513,89841,Supporting,60544,266416,Mandarin
489514,89842,Supporting,36896,279922,Japanese


In [102]:
person_voice_works_df['language'].value_counts()

language
Japanese           203537
English             94902
French              43340
Spanish             39153
Portuguese (BR)     37100
Italian             30712
German              26805
Korean               7981
Mandarin             3529
Hungarian            1378
Hebrew               1079
Name: count, dtype: int64

In [103]:
person_voice_works_df.dtypes

person_mal_id        int64
role                object
anime_mal_id         int64
character_mal_id     int64
language            object
dtype: object

In [104]:
person_voice_works_df.isna().sum()

person_mal_id       0
role                0
anime_mal_id        0
character_mal_id    0
language            0
dtype: int64

In [105]:
# check for rows that are duplicated across all columns
person_voice_works_df.loc[person_voice_works_df.duplicated(keep=False)]

Unnamed: 0,person_mal_id,role,anime_mal_id,character_mal_id,language
173909,5742,Main,1011,668,English
173910,5742,Main,1008,668,English
173911,5742,Main,1010,668,English
173912,5742,Main,1007,668,English
173913,5742,Main,792,668,English
...,...,...,...,...,...
478139,77591,Supporting,41025,151299,Portuguese (BR)
478140,77591,Main,431,508,Portuguese (BR)
478141,77591,Main,2202,7952,Portuguese (BR)
478142,77591,Supporting,1069,77729,Portuguese (BR)


In [106]:
# drop the duplicates because all their values are identical
person_voice_works_df.drop_duplicates(keep='first', inplace=True)

### TWELFTH DATASET

In [107]:
profiles_df

Unnamed: 0,username,gender,birthday,location,joined,watching,completed,on_hold,dropped,plan_to_watch
0,ishikawas,,,South Korea,,,,,,
1,CKK2,,,United States,"Dec 1, 2018",3,182,15,0,405
2,--------788,Female,,Mexico,"Oct 4, 2022",1,64,0,0,1
3,potatoaris,,,Spain,"Oct 2, 2018",5,1,0,0,4
4,Rinrintan,,,Japan,"May 12, 2019",20,311,40,16,34
...,...,...,...,...,...,...,...,...,...,...
337150,ariyanroy04,,,United States,"Mar 30, 2021",35,130,0,0,489
337151,ariyanvk18,Male,,Turkey,"Jan 24, 2022",7,101,1,12,21
337152,ariyoskz,,,Germany,"Jul 7, 2022",1,155,2,11,15
337153,arizima23,Female,"Jun 23, 1998",Spain,"Mar 15, 2019",15,64,0,4,4


In [108]:
# check if the types are right for each field
profiles_df.dtypes

username         object
gender           object
birthday         object
location         object
joined           object
watching         object
completed        object
on_hold          object
dropped          object
plan_to_watch    object
dtype: object

In [109]:
# change the type of the "birthday" and "joined" columns from object to datetime; the count columns should be int as well, but they are left as object here
# errors='coerce' will set invalid parsing to NaT (e.g. for invalid dates like February 30 or 1980)
profiles_df["birthday"] = pd.to_datetime(profiles_df["birthday"], errors='coerce')
profiles_df["joined"] = pd.to_datetime(profiles_df["joined"], errors='coerce')
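The count columns ("watching", "completed", "on_hold", "dropped", "plan_to_watch") are still object after this cell. One possible way to convert them is sketched below on a toy frame. It assumes that non-numeric entries can safely become missing values, which should be verified on the real data first, since the reason these columns were read as object is not shown in the notebook.

```python
import pandas as pd

# Toy stand-in for two of the count columns in profiles_df (values are made up).
toy_profiles = pd.DataFrame({
    "watching": ["3", None, "20"],
    "completed": ["182", "abc", "311"],
})

count_cols = ["watching", "completed"]
for col in count_cols:
    # errors="coerce" turns unparseable strings into NaN;
    # the nullable Int64 dtype then keeps integers alongside missing values.
    toy_profiles[col] = pd.to_numeric(toy_profiles[col], errors="coerce").astype("Int64")

print(toy_profiles.dtypes)
```

Comparing `isna().sum()` before and after the conversion shows how many entries were coerced, which is a cheap way to see whether the assumption holds.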

In [110]:
profiles_df.value_counts('location')


location
Japan             98316
United States     65240
Germany           26348
United Kingdom    19520
Thailand          16416
Argentina         13230
China             13196
Spain             12916
France            10041
Australia          9691
Mexico             6634
South Korea        6524
Turkey             6516
Italy              6468
Indonesia          3334
Brazil             3293
Vietnam            3272
South Africa       3267
Philippines        3260
Egypt              3239
India              3228
Canada             3206
Name: count, dtype: int64

In [111]:
#checking weird birthdays
profiles_df["birthday"].agg(['min', 'max'])


min   1800-01-01
max   2100-12-31
Name: birthday, dtype: datetime64[ns]

In [112]:
weird_birthdays = profiles_df[(profiles_df["birthday"] < "1900-01-01") | (profiles_df["birthday"] > "2025-12-31")]

weird_birthdays


Unnamed: 0,username,gender,birthday,location,joined,watching,completed,on_hold,dropped,plan_to_watch
2817,Ristic17,Male,2077-07-18,United Kingdom,2022-01-15,7,316,9,51,224
34419,Killj0y6_KIJO,Male,2053-01-01,France,2020-08-14,21,831,73,63,268
37654,ChellKabutomushi,Male,2033-03-06,Japan,2018-10-08,5,233,3,19,94
41023,Mystic-0-,Male,1899-06-11,China,2019-07-15,3,100,0,8,33
43399,SDan,Male,2026-10-08,Turkey,2018-08-02,22,386,25,28,497
49805,KissxSled,Female,2098-09-10,Australia,2022-06-09,8,762,0,0,501
51556,Tychoo9600,Male,1836-05-11,Japan,2020-11-26,47,287,0,0,52
55975,Sadscientist__,Male,2100-06-15,Japan,2020-07-28,10,237,15,13,209
88691,NiceLittleNig,Non-Binary,2026-05-25,Japan,2017-10-25,17,406,12,85,175
107175,Seneto,Male,1823-10-28,United States,2020-07-10,10,82,12,6,12


In [113]:
profiles_df.loc[profiles_df["birthday"] > profiles_df["joined"],["birthday", "joined"]]


Unnamed: 0,birthday,joined
2606,2021-07-13,2020-09-14
2683,2019-07-11,2017-10-13
2817,2077-07-18,2022-01-15
4506,2013-09-03,2013-01-26
4616,2017-01-12,2015-01-30
...,...,...
333376,2022-09-05,2019-11-22
334143,2019-10-11,2012-04-21
334335,2020-04-04,2018-12-31
335340,2023-09-05,2021-09-12


In [114]:
profiles_df["joined"].agg(['min', 'max'])

min   2004-11-05
max   2025-11-02
Name: joined, dtype: datetime64[ns]

In [115]:
profiles_df["birthday"].dt.year.value_counts().sort_index().head(30)

birthday
1800.0      2
1823.0      1
1836.0      1
1869.0      1
1888.0      1
1899.0      3
1900.0     63
1901.0      3
1902.0      1
1903.0      2
1904.0      3
1906.0      1
1907.0      1
1912.0      2
1917.0      2
1921.0      1
1925.0      1
1927.0      1
1930.0    521
1931.0     11
1932.0      9
1933.0     14
1934.0      9
1935.0      4
1936.0      8
1937.0      7
1938.0      5
1939.0     30
1940.0      3
1941.0      7
Name: count, dtype: int64

In [116]:
profiles_df[profiles_df["birthday"].dt.year == 1930]

Unnamed: 0,username,gender,birthday,location,joined,watching,completed,on_hold,dropped,plan_to_watch
345,MongoDB,Female,1930-07-09,Thailand,2017-07-09,190,382,0,291,73
1469,-Andre-,Male,1930-04-11,Japan,2019-03-01,21,84,7,6,14
1820,Forgetworld,Female,1930-09-10,Argentina,2014-12-02,7,826,23,62,151
1876,pranto_zaho,Male,1930-12-10,Japan,2016-07-25,7,410,43,31,221
2095,CTx99,,1930-12-08,United States,2019-02-12,73,687,6,6,46
...,...,...,...,...,...,...,...,...,...,...
334149,alphabeta9000,Male,1930-12-31,Japan,2018-05-25,22,229,35,0,16
335977,aniyu_jp,,1930-07-16,Japan,2013-03-25,8,1425,0,0,124
336671,apricol,Female,1930-03-24,Spain,2021-11-25,12,311,21,47,119
336678,apricottt,,1930-01-01,Japan,2014-03-04,6,143,43,31,50


In [117]:
# decided to remove the birthdays of people older than 100 when they joined the website
mask_too_old = (profiles_df["joined"] - profiles_df["birthday"]).dt.days / 365.25 > 100
profiles_df.loc[mask_too_old, "birthday"] = pd.NaT


In [118]:
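# likewise, remove the birthdays of people younger than 3 when they joined the website (this also covers birthdays after the join date)
mask_too_young = (profiles_df["joined"] - profiles_df["birthday"]).dt.days / 365.25 < 3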
profiles_df.loc[mask_too_young, "birthday"] = pd.NaT
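The two age filters can be seen together on a toy frame. The dates below are illustrative pairs similar to rows shown earlier, not a faithful extract:

```python
import pandas as pd

toy = pd.DataFrame({
    "birthday": pd.to_datetime(["1836-05-11", "1998-06-23", "2077-07-18", "2019-07-11"]),
    "joined":   pd.to_datetime(["2020-11-26", "2019-03-15", "2022-01-15", "2017-10-13"]),
})

# Approximate age in years at the time of joining; negative when the
# birthday lies after the join date.
age_at_join = (toy["joined"] - toy["birthday"]).dt.days / 365.25
implausible = (age_at_join > 100) | (age_at_join < 3)

# Only the birthday is blanked; the profile row itself is kept.
toy.loc[implausible, "birthday"] = pd.NaT
print(toy)
```

Rows with a missing birthday or join date produce a NaN age, so neither comparison is True and they are left as they are.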



In [119]:
profiles_df.isna().sum()

username              1
gender           170876
birthday         251574
location              0
joined             1676
watching           1678
completed          1678
on_hold            1678
dropped            1678
plan_to_watch      1678
dtype: int64

In [120]:
#There's only one username with Nan value, we drop it because it's not relevant in our analysis
profiles_df.dropna(subset=["username"],inplace=True)

In [121]:
# check if there is any duplicate on "username"
profiles_df.loc[profiles_df.duplicated(subset=['username'], keep='first')]
# none found

Unnamed: 0,username,gender,birthday,location,joined,watching,completed,on_hold,dropped,plan_to_watch


### THIRTEENTH DATASET

In [122]:
recommendations_df

Unnamed: 0,mal_id,recommendation_mal_id
0,3269,317
1,3269,6922
2,3269,299
3,3269,3446
4,3269,5681
...,...,...
105244,31245,1689
105245,31245,35434
105246,31245,31798
105247,31245,21995


In [123]:
recommendations_df.dtypes

mal_id                   int64
recommendation_mal_id    int64
dtype: object

In [124]:
recommendations_df.isna().sum()

mal_id                   0
recommendation_mal_id    0
dtype: int64

In [125]:
recommendations_df.loc[recommendations_df.duplicated(keep='first')]

Unnamed: 0,mal_id,recommendation_mal_id


No need to clean this dataset

In [126]:
#Exporting all cleaned small datasets
small_datasets = {
    "character_anime_works": character_anime_works_df,
    "character_nicknames": character_nicknames_df,
    "characters": characters_df,
    "details": anime_details_df,
    "favs": favs_df,
    "person_alternate_names": person_alternate_names_df,
    "person_anime_works": person_anime_works_df,
    "person_details": person_details_df,
    "person_voice_works": person_voice_works_df,
    "profiles": profiles_df,
    #"ratings": ratings_df,
    "recommendations": recommendations_df,
    "stats": stats_df,
}

for name, df in small_datasets.items():
    df.to_csv(f"datasets/{name}_cleaned.csv", index=False, lineterminator="\n")
# The lineterminator argument was called line_terminator in older pandas versions (before 1.5), so it may need to be renamed there.
# We set it explicitly because on Windows line endings take 2 bytes (\r\n) instead of 1 (\n), which makes the exported files larger.
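If the notebook has to run on both old and new pandas versions, the keyword difference can be hidden behind a small wrapper. This is only a sketch; the helper name `to_csv_unix_newlines` is hypothetical.

```python
import pandas as pd

def to_csv_unix_newlines(df, path_or_buf=None, **kwargs):
    """Write df as CSV with "\n" line endings on both old and new pandas."""
    try:
        # Keyword name used by recent pandas versions.
        return df.to_csv(path_or_buf, lineterminator="\n", **kwargs)
    except TypeError:
        # Older pandas only knows the keyword under its previous name.
        return df.to_csv(path_or_buf, line_terminator="\n", **kwargs)

# With no path, to_csv returns the CSV text, which makes the line endings easy to inspect.
csv_text = to_csv_unix_newlines(pd.DataFrame({"a": [1, 2]}), index=False)
print(repr(csv_text))
```

In the export loop above, `df.to_csv(...)` could then be replaced by `to_csv_unix_newlines(df, f"datasets/{name}_cleaned.csv", index=False)`.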



In [127]:
#TODO maybe: when we need to upload the CSVs into the database (TWEB assignment),
# we may need to replace null values with "Unknown"
# PLUS we may need to convert date columns back to strings in YYYY-MM-DD format
# We could join the two tables person_details_df and person_alternate_names_df on their matching keys,
# putting the alternate names in a new column called alt_name that holds a list of them
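The join described in the TODO could look roughly like the sketch below. The columns of person_alternate_names_df are not shown in this notebook, so the input column name "alternate_name" is an assumption, and the frames are toy data.

```python
import pandas as pd

toy_details = pd.DataFrame({"person_mal_id": [1, 2, 3], "name": ["A", "B", "C"]})
# "alternate_name" is a hypothetical column name for the alternate-names table.
toy_alt = pd.DataFrame({
    "person_mal_id": [1, 1, 3],
    "alternate_name": ["a1", "a2", "c1"],
})

# Collapse the alternate names into one list per person.
alt_lists = (
    toy_alt.groupby("person_mal_id")["alternate_name"]
    .agg(list)
    .rename("alt_name")
    .reset_index()
)

# A left merge keeps people without alternate names (their alt_name is NaN).
merged = toy_details.merge(alt_lists, on="person_mal_id", how="left")
print(merged)
```

For the date part of the TODO, `Series.dt.strftime("%Y-%m-%d")` converts a datetime column back to strings; missing dates stay missing and would still need the "Unknown" replacement if the database requires it.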