### Beginning of the Assignment - exploration

In [17]:
import pandas as pd
import numpy as np
import seaborn as sns
#pd.set_option('display.max_columns', 200)

### PHIL DATA SET

In [2]:
character_nicknames_df = pd.read_csv('datasets/PHIL DATA SET/character_nicknames.csv')

In [3]:
details_df = pd.read_csv('datasets/PHIL DATA SET/details.csv')

In [4]:
favs_df = pd.read_csv('datasets/PHIL DATA SET/favs.csv')


In [5]:
person_details_df = pd.read_csv('datasets/PHIL DATA SET/person_details.csv')


In [6]:
person_alternate_names_df = pd.read_csv('datasets/PHIL DATA SET/person_alternate_names.csv')


In [7]:
person_anime_works_df = pd.read_csv('datasets/PHIL DATA SET/person_anime_works.csv')


In [8]:
stats_df = pd.read_csv('datasets/PHIL DATA SET/stats.csv')


### DENIS DATA SET

In [9]:
ratings_df = pd.read_csv('datasets/DENIS DATA SET/ratings.csv')

In [10]:
characters_df = pd.read_csv('datasets/DENIS DATA SET/characters.csv')

In [11]:
character_anime_works_df = pd.read_csv('datasets/DENIS DATA SET/character_anime_works.csv')

In [12]:
person_voice_works_df = pd.read_csv('datasets/DENIS DATA SET/person_voice_works.csv')

In [13]:
profiles_df = pd.read_csv('datasets/DENIS DATA SET/profiles.csv')

In [14]:
recommendations_df = pd.read_csv('datasets/DENIS DATA SET/recommendations.csv')

# GUIDELINES
### BEFORE STARTING
Use Conda in order to do the correct setup

Things to do for each dataset:
1. Give it a look with .head and/or .tail
2. .describe and check if all the numeric values make sense (e.g. year=300 makes no sense in our context)
3. Check the format: convert objects to dates if we need it. Check that all the dates are in the same format (US or EU);
    also check for impossible dates such as 29/02/2013 (2013 is not a leap year). Maybe this is too much lol
4. Check for duplicates
5. CHECK FOR CORRELATION: df.corr(numeric_only=True) (e.g. with longer duration, there are more actors)
6. ADD or Remove columns?
7. GROUPBY splits the rows into groups and combines the numeric fields of each group
#TODO We could group by language and count how many anime are made in Japan, to show that it's the country where the culture of making (and watching) anime is biggest
8. Aggregations: we can apply multiple different aggregated functions (e.g. for the first column you sum the data, for the second you do the average and so on)
9. Transformations: apply operations and return results aligned with the original DF

10. Removing NaNs is usually unnecessary because pandas skips them in most computations.
When do we do it? It is safe to drop a row if EVERY field in the row is empty.
BE CAREFUL if they are foreign keys: for example, if a person has a NaN in the "anime he worked in" field, the row shouldn't be dropped
NEVER replace with invalid values (e.g. -1)
IF we use df.dropna(subset=["name"], inplace=True)
inplace=True means that the df itself is modified and ends up without the NaN rows. Without inplace=True you need to assign the result to another df (or to the same one)


Shall we try the plots? df.plot()

11. Check if data are consistent (e.g. normalizing names of countries and/or numeric fields, describing them and checking what they are)

12. Normalize data types all in the same place (e.g. convert all the dates in the same cell)


BONUS: NEVER USE FOR LOOPS, NEVER DUPLICATE DATA (unless necessary)
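
Points 7, 8 and 9 are easier to tell apart on a tiny example. This is only a sketch on made-up toy data (the `toy_df` frame and its values are assumptions for illustration, not one of our datasets):

```python
import pandas as pd

# Toy data, invented for this illustration.
toy_df = pd.DataFrame({
    "language": ["Japanese", "Japanese", "English", "English", "English"],
    "episodes": [12, 24, 12, 1, 2],
    "score": [7.5, 8.0, 6.5, 7.0, 6.0],
})

# 7. groupby: one result per group.
mean_by_language = toy_df.groupby("language")["score"].mean()

# 8. aggregations: a different function for each column.
summary = toy_df.groupby("language").agg(
    total_episodes=("episodes", "sum"),
    mean_score=("score", "mean"),
)

# 9. transformations: the result has one value per ORIGINAL row,
#    so it can be stored as a new column without any loop.
toy_df["score_minus_group_mean"] = (
    toy_df["score"] - toy_df.groupby("language")["score"].transform("mean")
)
```

`summary` has one row per language, while `score_minus_group_mean` keeps the five original rows.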


##### First look

In [None]:
character_nicknames_df.shape

In [None]:
character_nicknames_df.head()

In [None]:
character_nicknames_df.columns
#will list all the columns. Not necessary here but kept as a pattern to follow with the following files

In [None]:
character_nicknames_df.dtypes

### Data preparation (cleaning)


##### On the first dataset we may need to check for duplicates.
What does that mean? 102 of the 37080 rows in the dataset are duplicates.
Why is that? Do some characters have multiple nicknames, so they are repeated in the dataset?

In [15]:
character_nicknames_df.loc[character_nicknames_df.duplicated()]
#with the default keep='first', only the second and later occurrences are flagged

Unnamed: 0,character_mal_id,nickname
328,276501,Koko
329,276501,Cao Cao
352,275883,King
356,275835,Eldest Brother
447,274437,The Half-Fool
...,...,...
37051,281206,Apemon
37052,281206,Apemon
37053,281206,Apemon
37054,281206,Apemon


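Mhh, the rows of one character differ from each other (e.g. Koko and Cao Cao), so yeah, the same character can have different nicknames.
Each flagged row is still an exact copy of an earlier row though, and those exact copies are the ones we want to drop.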

In [None]:
#this way we drop the duplicates on the first dataset

# OLD VERSION 
# character_nicknames_df = character_nicknames_df.loc[~character_nicknames_df.duplicated()].reset_index(drop=True).copy()

#we don't need to use a subset here because there are just 2 columns

#why don't we just use drop_duplicates()?
character_nicknames_df.drop_duplicates(keep='first', inplace=True)
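
To make the `keep` options concrete, here is a small sketch on toy data (the rows are invented, only the column names come from the dataset):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "character_mal_id": [1, 1, 1, 2, 2],
    "nickname": ["Koko", "Cao Cao", "Koko", "King", "King"],
})

# keep="first" (the default): the first occurrence is NOT flagged, only later copies.
later_copies = toy_df.duplicated()

# keep=False flags every member of a duplicated group, handy for inspection.
all_copies = toy_df.duplicated(keep=False)

# Without inplace=True the result has to be assigned.
deduped_df = toy_df.drop_duplicates(keep="first").reset_index(drop=True)
```

The character with id 1 keeps both of its distinct nicknames; only the repeated `Koko` and `King` rows go away.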

Let's check for nan values

In [None]:
character_nicknames_df[character_nicknames_df.isna().any(axis=1)]


In [None]:
#cleaning the df from nan values
character_nicknames_df.dropna(inplace=True)


## SECOND DATASET

##### On the second dataset we may need to check for missing values and/or inconsistent values, since there are no duplicates

In [19]:
details_df
#anime details
#japanese title could be dropped?
#members stands for how many users have this anime added to their list.
#explicit_genres is empty so can be removed
#licensor and streaming are mostly empty. Do we care?

Unnamed: 0,mal_id,title,title_japanese,url,image_url,type,status,score,scored_by,start_date,...,demographics,source,rating,episodes,season,year,producers,explicit_genres,licensors,streaming
0,59356,-Socket-,-socket-,https://myanimelist.net/anime/59356/-Socket-,https://cdn.myanimelist.net/images/anime/1043/...,Movie,Finished Airing,,,2010-01-01 00:00:00+00:00,...,[],Original,G - All Ages,1.0,,,['Nagoya Zokei University'],[],[],[]
1,56036,......,......,https://myanimelist.net/anime/56036/-,https://cdn.myanimelist.net/images/anime/1057/...,Music,Finished Airing,6.53,503.0,2023-06-11 00:00:00+00:00,...,[],Original,PG-13 - Teens 13 or older,1.0,,,[],[],[],[]
2,2928,.hack//G.U. Returner,.HACK//G.U. RETURNER,https://myanimelist.net/anime/2928/hack__GU_Re...,https://cdn.myanimelist.net/images/anime/1798/...,OVA,Finished Airing,6.65,9745.0,2007-01-18 00:00:00+00:00,...,[],Game,PG-13 - Teens 13 or older,1.0,,,"['Bandai Visual', 'CyberConnect2']",[],[],[]
3,3269,.hack//G.U. Trilogy,.hack//G.U. Trilogy,https://myanimelist.net/anime/3269/hack__GU_Tr...,https://cdn.myanimelist.net/images/anime/1566/...,Movie,Finished Airing,7.06,15373.0,2007-12-22 00:00:00+00:00,...,[],Game,PG-13 - Teens 13 or older,1.0,,,['Bandai Visual'],[],"['Funimation', 'Bandai Entertainment']",[]
4,4469,.hack//G.U. Trilogy: Parody Mode,.hack//G.U. Trilogy,https://myanimelist.net/anime/4469/hack__GU_Tr...,https://cdn.myanimelist.net/images/anime/10/86...,Special,Finished Airing,6.35,4317.0,2008-03-25 00:00:00+00:00,...,[],Game,PG-13 - Teens 13 or older,1.0,,,['Bandai Visual'],[],[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28950,59421,Zutaboro Reijou wa Ane no Moto Konyakusha ni D...,ずたぼろ令嬢は姉の元婚約者に溺愛される,https://myanimelist.net/anime/59421/Zutaboro_R...,https://cdn.myanimelist.net/images/anime/1518/...,TV,Finished Airing,7.37,15624.0,2025-07-05 00:00:00+00:00,...,['Josei'],Light novel,PG-13 - Teens 13 or older,12.0,summer,2025.0,"['Studio Pierrot', 'Mainichi Broadcasting Syst...",[],[],"['Crunchyroll', 'Aniplus TV', 'Bahamut Anime C..."
28951,31245,Zutto Mae kara Suki deshita. Kokuhaku Jikkou I...,ずっと前から好きでした。～告白実行委員会～,https://myanimelist.net/anime/31245/Zutto_Mae_...,https://cdn.myanimelist.net/images/anime/3/821...,Movie,Finished Airing,7.20,104106.0,2016-04-23 00:00:00+00:00,...,[],Music,PG-13 - Teens 13 or older,1.0,,,"['Aniplex', 'Dentsu', 'Kadokawa Shoten', 'Movi...",[],['Aniplex of America'],[]
28952,36305,Zutto Mae kara Suki deshita. Kokuhaku Jikkou I...,ずっと前から好きでした。～告白実行委員会～ 「金曜日のおはよう」,https://myanimelist.net/anime/36305/Zutto_Mae_...,https://cdn.myanimelist.net/images/anime/6/883...,Special,Finished Airing,7.17,10038.0,2016-10-26 00:00:00+00:00,...,[],Music,PG - Children,1.0,,,['Aniplex'],[],[],[]
28953,34895,Zutto Suki Datta,ずっと好きだった,https://myanimelist.net/anime/34895/Zutto_Suki...,https://cdn.myanimelist.net/images/anime/1498/...,OVA,Finished Airing,5.68,1887.0,2017-04-21 00:00:00+00:00,...,[],Manga,Rx - Hentai,2.0,,,"['Queen Bee', 'Mediabank']",[],[],[]


In [21]:
# we want to see which values "type" takes
details_df['type'].value_counts()

type
TV            8414
Movie         4915
OVA           4184
ONA           4096
Music         3999
Special       1755
TV Special     767
CM             483
PV             275
Name: count, dtype: int64

In [None]:
details_df.loc[details_df.duplicated(subset=['url'], keep='first')]

In [None]:
details_df.query('year>2025')

In [None]:
details_df.describe()

In [None]:
details_df[['start_date','season']].query("season.notna()")
#season can be removed? Do we care about the season? We can "calculate" it from the "start_date" field
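
If we do drop `season`, it could be rebuilt from `start_date`. A minimal sketch, assuming the usual quarter-based convention (winter = Jan-Mar, spring = Apr-Jun, summer = Jul-Sep, fall = Oct-Dec); the mapping and the toy frame are assumptions and should be checked against the rows where `season` is filled in:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2025-07-05", "2016-04-23", "2007-01-18", "2016-10-26", None]
    )
})

# Assumed mapping of calendar quarter -> season name.
quarter_to_season = {1: "winter", 2: "spring", 3: "summer", 4: "fall"}

# Missing dates have no quarter, so they stay missing after the map.
toy_df["season_from_date"] = toy_df["start_date"].dt.quarter.map(quarter_to_season)
```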

In [None]:
details_df.query("episodes > 2500")

In [18]:
details_df[["start_date", "end_date"]] = details_df[["start_date", "end_date"]].apply(
    pd.to_datetime, errors="coerce"
)
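
What `errors="coerce"` does, on a few made-up strings (including the impossible 29 February 2013 from the guidelines). This is just a sketch; the explicit `format` is an assumption about how the dates are written:

```python
import pandas as pd

raw_dates = pd.Series(["2013-02-28", "2013-02-29", "not a date", None])

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Values that were present but could not be parsed: worth a look before moving on.
invalid_rows = raw_dates[parsed.isna() & raw_dates.notna()]
```

Comparing the NaN count before and after the conversion tells us how many values were silently coerced.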

In [None]:
details_df[["scored_by", "rank", "episodes", "year"]] = (
    details_df[["scored_by", "rank", "episodes", "year"]].astype("Int64")
)
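
Why `"Int64"` with a capital I: a plain `int64` column cannot hold NaN, while the nullable type keeps missing values as `<NA>`. A small sketch on toy data, written with a dict passed to `astype` as an alternative spelling:

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "episodes": [1.0, 12.0, np.nan],
    "year": [2007.0, np.nan, 2025.0],
})

# Nullable integers: whole numbers stay whole, missing values become <NA>.
toy_df = toy_df.astype({"episodes": "Int64", "year": "Int64"})
```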


In [None]:
details_df.dtypes
#scored_by, rank, episodes, year can be an int instead of a float
#start and end dates are not objects but dates
#do we need to swap the empty [] with Nan or not? WE should in order to be able to use the .isna() method and other pandas methods
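
One possible way to swap the empty `[]` for missing values, assuming these columns are read from the CSV as text that looks like a Python list (e.g. `"['Bandai Visual']"`). The helper `parse_list_or_na` is a hypothetical name for this sketch:

```python
import ast
import pandas as pd

toy_df = pd.DataFrame({
    "producers": ["['Bandai Visual', 'CyberConnect2']", "[]", "['Aniplex']"],
})

def parse_list_or_na(text):
    """Parse a list written as text; return pd.NA for empty lists and missing values."""
    if pd.isna(text):
        return pd.NA
    items = ast.literal_eval(text)
    return items if items else pd.NA

toy_df["producers"] = toy_df["producers"].apply(parse_list_or_na)
```

After this, `.isna()` sees the empty lists as missing, and the non-empty cells are real lists instead of strings.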


### THIRD DATASET

In [22]:
favs_df

Unnamed: 0,username,fav_type,id
0,ishikawas,anime,45649
1,ishikawas,anime,38680
2,ishikawas,anime,795
3,ishikawas,anime,37510
4,ishikawas,anime,820
...,...,...,...
4178742,vincent0607,character,497
4178743,vincent0607,character,118739
4178744,vincent0607,character,188177
4178745,vincent0607,character,141354


In [24]:
# we want to see which values "fav_type" takes
favs_df['fav_type'].value_counts()

fav_type
character    1598040
anime        1531857
people        862186
company       186664
Name: count, dtype: int64

In [None]:
favs_df.isna().sum()


In [None]:
favs_df.dtypes

In [None]:
favs_df.duplicated().sum()
#there are no duplicates


### FOURTH DATASET

In [13]:
person_alternate_names_df

Unnamed: 0,person_mal_id,alt_name
0,1,Seki Mondoya
1,1,門戸 開
2,1,Monto Hiraku
3,3,雪野五月
4,10,Kevin Hatcher
...,...,...
20460,89567,Sydsnap
20461,89567,Queen of Degeneracy
20462,89826,陳浩
20463,89842,Chidori


In [None]:
person_alternate_names_df.dtypes

In [None]:
person_alternate_names_df.isna().sum()
person_alternate_names_df[person_alternate_names_df.isna().any(axis=1)]


In [None]:
person_alternate_names_df.dropna(inplace=True)



In [None]:
# person_alternate_names_df.loc[person_alternate_names_df['person_mal_id'].duplicated()]
person_alternate_names_df[person_alternate_names_df.duplicated(subset=['person_mal_id','alt_name'], keep=False)].sort_values(['person_mal_id','alt_name'])


In [None]:
person_alternate_names_df.drop_duplicates(keep='first', inplace=True)

### FIFTH DATASET

In [17]:
person_details_df

Unnamed: 0,person_mal_id,url,website_url,image_url,name,given_name,family_name,birthday,favorites,relevant_location
0,1,https://myanimelist.net/people/1/Tomokazu_Seki,,https://cdn.myanimelist.net/images/voiceactors...,Tomokazu Seki,智一,関,1972-09-08T00:00:00+00:00,6219,"Berlin, Germany"
1,2,https://myanimelist.net/people/2/Tomokazu_Sugita,https://agrs.co.jp/,https://cdn.myanimelist.net/images/voiceactors...,Tomokazu Sugita,智和,杉田,1980-10-11T00:00:00+00:00,47666,"Los Angeles, USA"
2,3,https://myanimelist.net/people/3/Satsuki_Yukino,,https://cdn.myanimelist.net/images/voiceactors...,Satsuki Yukino,さつき,ゆきの,1970-05-25T00:00:00+00:00,1777,"Madrid, Spain"
3,4,https://myanimelist.net/people/4/Aya_Hirano,http://ayahirano.jp/,https://cdn.myanimelist.net/images/voiceactors...,Aya Hirano,綾,平野,1987-10-08T00:00:00+00:00,18374,"Paris, France"
4,5,https://myanimelist.net/people/5/Kenichi_Suzumura,https://intention-k.com,https://cdn.myanimelist.net/images/voiceactors...,Kenichi Suzumura,健一,鈴村,1974-09-12T00:00:00+00:00,5176,"Osaka, Japan"
...,...,...,...,...,...,...,...,...,...,...
76694,90011,https://myanimelist.net/people/90011/Nanako_Ki...,,https://cdn.myanimelist.net/img/sp/icon/apple-...,Nanako Kishimoto,七子,岸本,,0,"Mumbai, India"
76695,90012,https://myanimelist.net/people/90012/Pamon,,https://cdn.myanimelist.net/img/sp/icon/apple-...,Pamon,,파몬,,0,"Tokyo, Japan"
76696,90013,https://myanimelist.net/people/90013/Tomoru_Emoto,,https://cdn.myanimelist.net/img/sp/icon/apple-...,Tomoru Emoto,ともる,柄本,,0,"Tokyo, Japan"
76697,90014,https://myanimelist.net/people/90014/Hirari,https://hirari.2-d.jp/,https://cdn.myanimelist.net/img/sp/icon/apple-...,Hirari,,ひらり,,0,"Paris, France"


In [None]:
person_details_df.loc[person_details_df['person_mal_id'].duplicated()]
#we found that the duplicates differ for the "relevant_location" field, which has no interest for us so we drop the duplicates

In [None]:
person_details_df.drop_duplicates(subset=['person_mal_id', 'url', 'name'], keep='first', inplace=True)


In [None]:
person_details_df.dtypes
#we need to change birthday from object to datetime

In [20]:
person_details_df["birthday"].min(), person_details_df["birthday"].max()
#the very old birthdays make sense: they belong to composers whose music is used in anime

(Timestamp('1678-03-04 00:00:00+0000', tz='UTC'),
 Timestamp('2023-06-15 00:00:00+0000', tz='UTC'))

In [19]:
person_details_df["birthday"] = pd.to_datetime(person_details_df["birthday"], errors='coerce')


In [None]:
person_details_df.isna().sum()
#we have to check the nan values

In [None]:
person_details_df[person_details_df["name"].isna()]


We can join the two tables person_details_df and person_alternate_names_df on the matching key (person_mal_id),
putting the alternate names in a new column called alt_name that holds them as a list.
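
A rough sketch of that join on toy frames (`people` and `alt_names` are stand-ins for the two real tables, with a few rows copied from them):

```python
import pandas as pd

people = pd.DataFrame({
    "person_mal_id": [1, 3, 4],
    "name": ["Tomokazu Seki", "Satsuki Yukino", "Aya Hirano"],
})
alt_names = pd.DataFrame({
    "person_mal_id": [1, 1, 1, 3],
    "alt_name": ["Seki Mondoya", "門戸 開", "Monto Hiraku", "雪野五月"],
})

# One row per person, with all alternate names collected in a list.
alt_lists = alt_names.groupby("person_mal_id")["alt_name"].agg(list).reset_index()

# A left join keeps the people without alternate names (their alt_name is NaN).
merged = people.merge(alt_lists, on="person_mal_id", how="left")
```

The left join matters: an inner join would silently drop everyone who has no alternate name.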

### SIXTH DATASET

In [None]:
person_anime_works_df

In [None]:
person_anime_works_df.dtypes
#the types are correct

In [None]:
person_anime_works_df.isna().sum()
#There's no nan value

### SEVENTH DATASET

In [None]:
# convert every column ending in "_votes" to nullable integers
stats_df[stats_df.filter(regex="_votes$").columns] = (
    stats_df.filter(regex="_votes$").astype("Int64")
)

In [None]:
stats_df.isna().sum()
#there are 430 series without any votes

### EIGHTH DATASET 

In [None]:
ratings_df

In [11]:
ratings_df.columns

NameError: name 'ratings_df' is not defined

In [None]:
# we have to understand the meaning of "num_watched_episodes" and its link with "is_rewatching"
ratings_df.query('is_rewatching == 1')

In [None]:
ratings_df.dtypes

In [None]:
# change "is_rewatching" from float to Int8, to save memory
ratings_df["is_rewatching"] = ratings_df["is_rewatching"].astype("Int8") 

In [None]:
ratings_df[["anime_id","score","num_watched_episodes"]].agg(["min", "max"])


In [None]:
# to save up some memory, we can change "anime_id" from Int64 to Int32 because there are no anime with id > 2,147,483,647
ratings_df["anime_id"] = ratings_df["anime_id"].astype("Int32")
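
A quick way to see what the smaller type buys us, sketched on a made-up series of ids (`toy_ids` is an assumption, not the real column):

```python
import pandas as pd

toy_ids = pd.Series([1, 795, 45649, 59421] * 1000, dtype="Int64")

# Check the range first: every value must fit in the smaller type.
assert toy_ids.max() <= 2_147_483_647

smaller_ids = toy_ids.astype("Int32")

# Memory used by the values only (index excluded), in bytes.
bytes_before = toy_ids.memory_usage(index=False, deep=True)
bytes_after = smaller_ids.memory_usage(index=False, deep=True)
```

On the real `ratings_df` the same comparison can be done with `ratings_df.memory_usage(deep=True)` before and after the cast.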

In [None]:
ratings_df[ratings_df.duplicated(subset=['username','anime_id'], keep=False)].sort_values(['username','anime_id'])

In [None]:
# usually we would drop the later occurrences of a duplicate and keep the first
# in this case though, the latest occurrence looks like the most up-to-date one, containing more info than the first, so we drop the first one
ratings_df.drop_duplicates(subset=['username', 'anime_id'], keep='last', inplace=True)

We had just 6 duplicates having the same username and anime_id

In [None]:
# we check for NaN values.
#TODO
# if necessary, check whether num_watched_episodes is greater than the number of episodes of the anime; we could then replace the NaN values with one or zero.
ratings_df.isna().sum()

In [None]:
# drop the rows with a NaN "username"?
#TODO
ratings_df[ratings_df['username'].isna()]

Check whether this username also appears in profiles_df
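
One way to run that check, sketched on toy frames (`toy_profiles`, `toy_ratings` and the `ghost_user` name are invented for illustration):

```python
import numpy as np
import pandas as pd

toy_profiles = pd.DataFrame({"username": ["ishikawas", "vincent0607", np.nan]})
toy_ratings = pd.DataFrame({
    "username": ["ishikawas", np.nan, "ghost_user"],
    "anime_id": [795, 820, 2928],
})

# Ratings whose username is missing.
missing_username = toy_ratings[toy_ratings["username"].isna()]

# Ratings whose (non-missing) username has no match in the profiles.
known_users = toy_profiles["username"].dropna()
orphans = toy_ratings[
    toy_ratings["username"].notna() & ~toy_ratings["username"].isin(known_users)
]
```

Keeping the two cases separate avoids treating a missing username as if it were a real key that happens to match.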

### NINTH DATASET

In [12]:
characters_df

Unnamed: 0,character_mal_id,url,name,name_kanji,image,favorites,about
0,280386,https://myanimelist.net/character/280386/Envi_...,Envi Mel Champagne,エンヴィ・メル・シャンパーニュ,https://cdn.myanimelist.net/images/characters/...,0,
1,280354,https://myanimelist.net/character/280354/Eleven,Eleven,イレヴン,https://cdn.myanimelist.net/images/characters/...,0,
2,280353,https://myanimelist.net/character/280353/Stud,Stud,スタッド,https://cdn.myanimelist.net/images/characters/...,0,
3,280352,https://myanimelist.net/character/280352/Judge,Judge,ジャッジ,https://cdn.myanimelist.net/images/characters/...,0,
4,280339,https://myanimelist.net/character/280339/Eiji_...,Eiji Kurokawa,黒川 英治,https://cdn.myanimelist.net/img/sp/icon/apple-...,0,
...,...,...,...,...,...,...,...
209958,282276,https://myanimelist.net/character/282276/Farra...,Farrah Van Dorothy,ファラ・ヴァン・ドロシー,https://cdn.myanimelist.net/images/characters/...,0,
209959,282277,https://myanimelist.net/character/282277/Harri...,Harris Mead,ハリス・ミード,https://cdn.myanimelist.net/images/characters/...,0,
209960,282278,https://myanimelist.net/character/282278/Rob,Rob,ロブ,https://cdn.myanimelist.net/images/characters/...,0,
209961,282281,https://myanimelist.net/character/282281/Grimm,Grimm,グリム,https://cdn.myanimelist.net/images/characters/...,0,


In [None]:
# check types of dataset columns
characters_df.dtypes

In [11]:
# change "character_mal_id" and "favorites" from float to int
characters_df["character_mal_id"] = characters_df["character_mal_id"].astype("Int64")
characters_df["favorites"] = characters_df["favorites"].astype("Int64")

In [None]:
characters_df.describe()

In [None]:
# only 2 rows have NaN in all columns; the rows with NaN only in "name_kanji" or "about" shouldn't be dropped because their other values are important.
characters_df.isna().sum()

In [None]:
# here we want to check whether the NaN values are concentrated in only two rows
characters_df[characters_df['character_mal_id'].isna()]

In [None]:
# Apart from "name_kanji" and "about", the NaN values are concentrated in two rows, so we drop those two rows where all columns are NaN
characters_df.dropna(how='all', inplace=True)
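
The difference between the `dropna` variants, on a toy frame shaped like this dataset (values partly invented; `about` is left empty on purpose):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "character_mal_id": [280386, np.nan, 280353],
    "name": ["Envi Mel Champagne", np.nan, "Stud"],
    "about": [np.nan, np.nan, np.nan],
})

# how="all": drop a row only when EVERY column is NaN.
only_empty_rows_dropped = toy_df.dropna(how="all")

# subset: look at the listed columns only.
rows_with_name = toy_df.dropna(subset=["name"])

# Default (how="any"): drop a row if ANY column is NaN.
any_nan_dropped = toy_df.dropna()
```

Here the default would throw away the whole table, because `about` is NaN everywhere; that is why we use `how='all'`.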


In [None]:
# we want to see all duplicates to understand if we have to drop or not
characters_df.loc[characters_df.duplicated(subset=['character_mal_id', 'url', 'name'], keep=False)]

In [None]:
# we drop the duplicates because all their values are the same
characters_df.drop_duplicates(subset=['character_mal_id', 'url', 'name'], keep='first', inplace=True)

### TENTH DATASET

In [13]:
# role of character anime
character_anime_works_df

Unnamed: 0,anime_mal_id,character_mal_id,character_name,role
0,2928,5781,Atoli,Main
1,2928,33,Haseo,Main
2,2928,32,Ovan,Main
3,2928,34,Shino,Main
4,2928,5785,Aina,Supporting
...,...,...,...,...
236811,31245,137157,"Shibasaki, Ken",Supporting
236812,36305,136064,"Hamanaka, Midori",Main
236813,36305,133916,"Narumi, Sena",Main
236814,36305,124942,"Hayasaka, Akari",Supporting


In [None]:
# check types of columns
character_anime_works_df.dtypes

In [None]:
# check the number of Nan value
character_anime_works_df.isna().sum()

In [None]:
# check the number of duplicates
character_anime_works_df.loc[character_anime_works_df.duplicated(subset=['anime_mal_id', 'character_mal_id'])]

There is no need to clean this dataset 

### ELEVENTH DATASET

In [14]:
person_voice_works_df

Unnamed: 0,person_mal_id,role,anime_mal_id,character_mal_id,language
0,1,Main,55830,2514,Japanese
1,1,Supporting,60602,2822,Japanese
2,1,Supporting,59229,140499,Japanese
3,1,Supporting,60427,275856,Japanese
4,1,Supporting,62067,190335,Japanese
...,...,...,...,...,...
489511,89839,Supporting,40111,274622,Mandarin
489512,89840,Supporting,60544,266412,Mandarin
489513,89841,Supporting,60544,266416,Mandarin
489514,89842,Supporting,36896,279922,Japanese


In [9]:
person_voice_works_df['language'].value_counts()

language
Japanese           203537
English             94902
French              43340
Spanish             39153
Portuguese (BR)     37100
Italian             30712
German              26805
Korean               7981
Mandarin             3529
Hungarian            1378
Hebrew               1079
Name: count, dtype: int64

In [None]:
# number of rows with a language (.sum() on a text column would just concatenate the strings)
person_voice_works_df['language'].count()

In [None]:
#I want to group by language and see the different languages available in the dataset.
person_voice_works_df.groupby('language').size()

In [None]:
person_voice_works_df.dtypes

In [None]:
person_voice_works_df.isna().sum()

In [None]:
# check if the duplicates are in all columns
person_voice_works_df.loc[person_voice_works_df.duplicated(keep=False)]

In [None]:
# drop the duplicates because all their values are the same
person_voice_works_df.drop_duplicates(keep='first', inplace=True)

### TWELFTH DATASET

In [None]:
# Should we delete the last five columns?
# TODO
profiles_df

In [None]:
# check if the types are right for each field
profiles_df.dtypes

In [None]:
# change the types of "birthday" and "joined" from object to datetime; the columns that should be int still need converting
profiles_df["birthday"] = pd.to_datetime(profiles_df["birthday"], errors='coerce')
profiles_df["joined"] = pd.to_datetime(profiles_df["joined"], errors='coerce')

In [None]:
profiles_df["birthday"].min(), profiles_df["birthday"].max()


In [None]:
weird_birthdays = profiles_df[
    (profiles_df["birthday"] < "1900-01-01") |
    (profiles_df["birthday"] > "2025-12-31")
]

weird_birthdays


In [None]:
profiles_df.loc[
    profiles_df["birthday"] > profiles_df["joined"],
    ["birthday", "joined"]
]


In [None]:
profiles_df["joined"].min(), profiles_df["joined"].max()

In [None]:
profiles_df[profiles_df["birthday"].dt.year == 1930]


In [None]:
profiles_df["birthday"].dt.year.value_counts().sort_index().head(30)


In [None]:
# decided to remove the birthdays of people older than 100 at join
mask_too_old = (profiles_df["joined"] - profiles_df["birthday"]).dt.days / 365.25 > 100
profiles_df.loc[mask_too_old, "birthday"] = pd.NaT
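
The two birthday checks (born after joining, older than 100 at join) can be seen together on a toy frame. This is only a sketch with invented dates:

```python
import pandas as pd

toy_profiles = pd.DataFrame({
    "birthday": pd.to_datetime(["1990-05-01", "2020-01-01", "1900-01-01", None]),
    "joined": pd.to_datetime(["2010-06-15", "2015-03-10", "2017-07-09", "2012-01-01"]),
})

# Approximate age in years when the user joined (NaN when the birthday is missing).
age_at_join = (toy_profiles["joined"] - toy_profiles["birthday"]).dt.days / 365.25

# Comparisons with NaN are False, so rows without a birthday are left alone.
implausible = (age_at_join < 0) | (age_at_join > 100)
toy_profiles.loc[implausible, "birthday"] = pd.NaT
```

Only the birthday is blanked; the profile row itself stays, so nothing that links to it is lost.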


In [None]:
profiles_df.isna().sum()

In [None]:
# check if there is any duplicate on "username"
profiles_df.loc[profiles_df.duplicated(subset=['username'], keep='first')]

### THIRTEENTH DATASET

In [None]:
recommendations_df

In [None]:
recommendations_df.dtypes

In [None]:
recommendations_df.isna().sum()

In [None]:
recommendations_df.loc[recommendations_df.duplicated(keep='first')]

# TO KEEP

In [None]:
# 1st
# dropping duplicates
character_nicknames_df.drop_duplicates(keep='first', inplace=True)
# dropping nan values
character_nicknames_df.dropna(inplace=True)

# 2nd
# no need to clean from duplicates nor missing values
# change column types from object to datetime
details_df[["start_date", "end_date"]] = details_df[["start_date", "end_date"]].apply(
    pd.to_datetime, errors="coerce"
)
# change column types from object to Int64
details_df[["scored_by", "rank", "episodes", "year"]] = (
    details_df[["scored_by", "rank", "episodes", "year"]].astype("Int64")
)

# 3rd
# no need to clean


# 4th
#dropping nan values because they don't give any useful information
person_alternate_names_df.dropna(inplace=True)
# dropping duplicates
person_alternate_names_df.drop_duplicates(keep='first', inplace=True)


# 5th
# Dropping duplicates choosing to keep the first occurrence. They only differ for the "relevant_location" field which has no interest for us
person_details_df.drop_duplicates(subset=['person_mal_id', 'url', 'name'], keep='first', inplace=True)
# change column types from object to datetime
person_details_df["birthday"] = pd.to_datetime(person_details_df["birthday"], errors='coerce')

# There are two weird rows with NaN values in most columns but we don't drop them because they may be linked to other datasets
# Could we join the two tables person_details_df and person_alternate_names_df on the matching key (person_mal_id)?
# The alternate names would go in a new column called alt_name, as a list
# TODO


# 6th
# no need to clean


# 7th
# change the "_votes" columns from float to nullable integers
stats_df[stats_df.filter(regex="_votes$").columns] = (
    stats_df.filter(regex="_votes$").astype("Int64")
)


# 8th
# change column types to save memory
ratings_df["is_rewatching"] = ratings_df["is_rewatching"].astype("Int8")
ratings_df["anime_id"] = ratings_df["anime_id"].astype("Int32")
# drop duplicates keeping the last entry (most recent)
ratings_df.drop_duplicates(subset=['username', 'anime_id'], keep='last', inplace=True)
# there's a row with a NaN "username"; we keep it for now because profiles_df also has exactly 1 NaN "username", so the two may match
# One way to manage it: generate a deterministic placeholder, e.g.:
# the maximum existing user_id + 1
# or a specific labeled ID like "unknown_user"
# THEN we can use this placeholder consistently across all datasets to maintain referential integrity.
# profiles_df["user_id"] = profiles_df["user_id"].fillna(new_id)
# ratings_df["user_id"] = ratings_df["user_id"].fillna(new_id)
# TODO


# 9th
# change "character_mal_id" and "favorites" from float to int
characters_df[["character_mal_id", "favorites"]] = (
    characters_df[["character_mal_id", "favorites"]].astype("Int64")
)

# we drop the two rows with all columns Nan
characters_df.dropna(how='all', inplace=True)
# we drop the duplicates because all their values are the same
characters_df.drop_duplicates(subset=['character_mal_id', 'url', 'name'], keep='first', inplace=True)


# 10th
# no need to clean


# 11th
person_voice_works_df.drop_duplicates(keep='first', inplace=True)


# 12th
# change column types from object to datetime
profiles_df[["birthday", "joined"]] = profiles_df[["birthday", "joined"]].apply(
    pd.to_datetime, errors="coerce"
)
# a birthday later than the join date is impossible, so we set it to NaT
profiles_df.loc[
    profiles_df["birthday"] > profiles_df["joined"],
    "birthday"
] = pd.NaT
# decided to remove the birthdays of people older than 100 at join
mask_too_old = (profiles_df["joined"] - profiles_df["birthday"]).dt.days / 365.25 > 100
profiles_df.loc[mask_too_old, "birthday"] = pd.NaT
#I mean MongoDB, a female born in 1930-07-09, from Thailand and joined in 2017-07-09 makes perfect sense, doesn't it?

# 13th
# no need to clean
