# Data Cleaning
This notebook cleans a MyAnimeList dataset, which is available [on Kaggle](https://www.kaggle.com/marlesson/myanimelist-dataset-animes-profiles-reviews).

The dataset contains 3 files:

- **animes.csv** contains the list of anime, with title, title synonyms, genre, duration, rank, popularity, score, airing date, episodes, and other attributes of each anime, which is enough to study trends over time. Rank appears as a float in the CSV even though it holds only integer values; this is because of NaN values and how pandas represents them.

- **profiles.csv** contains information about users who watch anime, namely username, birth date, gender, and a list of favorite anime.

- **reviews.csv** contains the reviews that users wrote for anime, with the review text and scores.
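The note on the rank column can be illustrated with a minimal sketch. The values below are made up for illustration and are not taken from animes.csv; the point is only that a single NaN forces an integer column to float, and that the nullable `Int64` dtype is one way to get integers back.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rank column: integer ranks with one missing value.
rank = pd.Series([1, 2, np.nan, 4], name="rank")
print(rank.dtype)  # float64, because NaN is itself a float

# The nullable Int64 dtype keeps whole numbers and marks the gap as <NA>.
rank_int = rank.astype("Int64")
print(rank_int.dtype)
```

Whether converting is worthwhile depends on later steps, since some libraries handle `<NA>` differently from NaN.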

In [51]:
import numpy as np
import pandas as pd
import plotly.express as px

## Profiles Dataset
This notebook cleans the profiles dataset and engineers features from it.

In [52]:
profiles = pd.read_csv('../data/profiles.csv')

In [53]:
profiles.isna().sum()

profile                0
gender             27871
birthday           34920
favorites_anime        0
link                   0
dtype: int64

In [54]:
# Parses a string holding a Python literal (such as a list of IDs) into the
# corresponding Python object; input that cannot be parsed is returned unchanged.
import ast
def perfectEval(anonstring):
    try:
        return ast.literal_eval(anonstring)
    except (ValueError, SyntaxError):
        # Not a valid Python literal (literal_eval raises either error),
        # so keep the original value as-is.
        return anonstring

In [55]:
len(profiles)

81727

In [56]:
profiles['favorites_anime'] = profiles['favorites_anime'].apply(perfectEval)
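As a rough illustration of what this parsing step does, the sketch below applies `ast.literal_eval` to a few hand-written strings. The sample values are assumptions chosen to resemble a favorites column, not rows from profiles.csv.

```python
import ast
import pandas as pd

# Hypothetical raw values: each cell is the text form of a Python list.
raw = pd.Series(["['33352', '25013']", "[]", "[918, 2904]"])

# literal_eval turns each string into a real list without executing code.
parsed = raw.apply(ast.literal_eval)

print(parsed.apply(type).unique())
print(parsed.apply(len).tolist())
```

Once the cells are real lists, `len` counts favorites rather than characters, which is what the later `fav_len` column relies on.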

## Removing Duplicates
Check whether the data contains duplicate profiles.

In [38]:
print(len(profiles))
print('unique profiles:')
print(len(profiles['profile'].unique()))

81727
unique profiles:
47885


In [39]:
profiles = profiles.sort_values('profile').drop_duplicates(subset=['profile'], keep='last')
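The deduplication step can be tried on a small frame first. This is a sketch with invented usernames; it mirrors the call above and compares the row count with the number of unique profiles before and after.

```python
import pandas as pd

# Invented example: "bob" appears twice.
toy = pd.DataFrame({
    "profile": ["bob", "alice", "bob", "carol"],
    "gender": ["Male", "Female", "Male", None],
})

# Same pattern as the notebook: sort by profile, keep one row per profile.
deduped = toy.sort_values("profile").drop_duplicates(subset=["profile"], keep="last")

print(len(toy), toy["profile"].nunique(), len(deduped))
```

If duplicate rows can differ from each other, which copy counts as "last" depends on the sort. The default sort algorithm in pandas is not guaranteed to be stable, so passing `kind="stable"` to `sort_values` may be worth considering in that case.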

## Removing Users with No Favorites
Users without favorites carry no preference information for a model to learn from, so they should be removed.

In [40]:
profiles['fav_len'] = profiles['favorites_anime'].apply(len)

In [58]:
profiles.head(20)

Unnamed: 0,profile,gender,birthday,favorites_anime,link
0,DesolatePsyche,Male,"Oct 2, 1994","[33352, 25013, 5530, 33674, 1482, 269, 18245, ...",https://myanimelist.net/profile/DesolatePsyche
1,baekbeans,Female,"Nov 10, 2000","[11061, 31964, 853, 20583, 918, 9253, 34599, 3...",https://myanimelist.net/profile/baekbeans
2,skrn,,,"[918, 2904, 11741, 17074, 23273, 32281, 9989, ...",https://myanimelist.net/profile/skrn
3,edgewalker00,Male,Sep 5,"[5680, 849, 2904, 3588, 37349]",https://myanimelist.net/profile/edgewalker00
4,aManOfCulture99,Male,"Oct 30, 1999","[4181, 7791, 9617, 5680, 2167, 4382, 849, 235,...",https://myanimelist.net/profile/aManOfCulture99
5,eneri,,,"[5114, 4898, 2904, 1575, 1482]",https://myanimelist.net/profile/eneri
6,Waffle_Empress,,"May 29, 1996","[338, 322, 440, 199, 28223, 12815, 2800, 18679...",https://myanimelist.net/profile/Waffle_Empress
7,NIGGER_BONER,Male,"Jan 1, 1985","[11061, 30, 6594, 28701, 10087, 6746, 918, 153...",https://myanimelist.net/profile/NIGGER_BONER
8,jchang,Male,"Jul 29, 1992","[846, 2904, 5114, 2924, 72]",https://myanimelist.net/profile/jchang
9,shadowsplat,,,[],https://myanimelist.net/profile/shadowsplat


In [42]:
print(len(profiles[profiles['fav_len'] == 0]))

10422


There are 10,422 users with no favorites, roughly 22% of the 47,885 unique users.

In [43]:
profiles = profiles[profiles['fav_len'] > 0]
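The count and the filter can be checked together on toy data. The sketch below uses invented profiles; taking the mean of a boolean mask gives the share of empty lists directly.

```python
import pandas as pd

# Invented example: two of four users have an empty favorites list.
toy = pd.DataFrame({
    "profile": ["a", "b", "c", "d"],
    "favorites_anime": [[5114, 4898], [], [918], []],
})
toy["fav_len"] = toy["favorites_anime"].apply(len)

# Share of users with no favorites (mean of a boolean Series).
empty_share = (toy["fav_len"] == 0).mean()

# Keep only users with at least one favorite.
kept = toy[toy["fav_len"] > 0]

print(empty_share)
print(kept["profile"].tolist())
```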

## Users Without Gender
Gender is likely to be a very useful feature for prediction, but some users do not specify a gender in their profile.

In [44]:
profiles[profiles['gender'].notna()]

Unnamed: 0,profile,gender,birthday,favorites_anime,link,fav_len
79974,--Mizu--,Female,"Jul 3, 1995","[21, 177, 6864, 4081, 5678, 23289]",https://myanimelist.net/profile/--Mizu--,6
43928,--Sunclaudius,Male,,"[34561, 6594, 13125]",https://myanimelist.net/profile/--Sunclaudius,3
2829,--animeislife--,Female,"Jul 19, 1996","[249, 14467, 13601, 9989, 10793, 16498, 8460, ...",https://myanimelist.net/profile/--animeislife--,8
69663,--d41,Male,"Jan 7, 1999","[35180, 9253, 21, 22789, 10165, 10162, 24439]",https://myanimelist.net/profile/--d41,7
41023,--mimika--,Female,"Aug 21, 1999","[813, 481, 550, 249, 32995]",https://myanimelist.net/profile/--mimika--,5
...,...,...,...,...,...,...
31692,zyke,Male,"Feb 7, 1994","[6746, 17074, 19815, 37171, 10357, 31933, 2605...",https://myanimelist.net/profile/zyke,10
54340,zyx210,Male,"Apr 21, 1995","[17265, 4382]",https://myanimelist.net/profile/zyx210,2
22012,zzeroparticle,Male,"Dec 26, 1983","[134, 164, 19]",https://myanimelist.net/profile/zzeroparticle,3
65485,zzs,Female,"Mar 22, 1993","[269, 1535, 2904, 1735, 1575]",https://myanimelist.net/profile/zzs,5


In [45]:
profiles[profiles['gender'].isna()]

Unnamed: 0,profile,gender,birthday,favorites_anime,link,fav_len
10861,-----noname-----,,"Dec 31, 2019","[6774, 245, 2001, 11061, 16592, 1575, 21]",https://myanimelist.net/profile/-----noname-----,7
74177,---SnowFlake---,,,"[2904, 6773, 10790]",https://myanimelist.net/profile/---SnowFlake---,3
55952,-Ancient,,,"[12293, 1519, 889, 6351]",https://myanimelist.net/profile/-Ancient,4
63869,-Belka,,,"[1254, 539, 263, 18679, 20973]",https://myanimelist.net/profile/-Belka,5
63271,-Candyz-,,,"[226, 33674, 35849, 32281]",https://myanimelist.net/profile/-Candyz-,4
...,...,...,...,...,...,...
39362,zygisrko,,"Apr 12, 1991","[918, 7674, 9513, 6746, 9253, 1017]",https://myanimelist.net/profile/zygisrko,6
48685,zylee,,,"[4181, 2167]",https://myanimelist.net/profile/zylee,2
15912,zyoxo,,,"[47, 572, 1535, 205, 1]",https://myanimelist.net/profile/zyoxo,5
76009,zzSorazz,,,"[11933, 11757, 4224, 23273, 13759, 390]",https://myanimelist.net/profile/zzSorazz,6


About 27,000 users specify a gender and about 9,000 do not, so roughly 75% of the remaining profiles have a gender. Dropping the entries with unknown gender should therefore be acceptable.

It might be possible to predict a user's gender with something like logistic regression, which might in turn improve the accuracy of unsupervised clustering. However, that would likely be a stretch goal.
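The share of profiles with a stated gender can be computed rather than estimated by eye. The following is a sketch on invented rows, assuming missing genders are stored as NaN as in the outputs above.

```python
import numpy as np
import pandas as pd

# Invented example: one of four users has no gender specified.
toy = pd.DataFrame({
    "profile": ["a", "b", "c", "d"],
    "gender": ["Male", np.nan, "Female", "Male"],
})

# Fraction of rows with a gender (mean of a boolean mask).
known_share = toy["gender"].notna().mean()

# Counts per category, including the missing ones.
counts = toy["gender"].value_counts(dropna=False)

print(known_share)
print(counts)
```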

In [46]:
profiles = profiles[profiles['gender'].notna()]

In [48]:
profiles.isna().sum()

profile               0
gender                0
birthday           5016
favorites_anime       0
link                  0
fav_len               0
dtype: int64

## Saving Data

In [62]:
profiles[['profile', 'gender', 'favorites_anime']].to_csv('../data/profiles_clean.csv')
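One thing to keep in mind when reading the saved file back: CSV stores the favorites lists as text, so they need to be parsed again on load. The sketch below shows one possible round trip on invented rows, using an in-memory buffer in place of `../data/profiles_clean.csv`.

```python
import ast
import io
import pandas as pd

# Invented rows standing in for the cleaned profiles.
clean = pd.DataFrame({
    "profile": ["a", "b"],
    "gender": ["Male", "Female"],
    "favorites_anime": [[5114, 4898], [918]],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
clean.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index; the converter rebuilds the lists.
reloaded = pd.read_csv(
    buffer,
    index_col=0,
    converters={"favorites_anime": ast.literal_eval},
)

print(reloaded["favorites_anime"].tolist())
```

Without `index_col=0`, the saved index comes back as an extra `Unnamed: 0` column, like the one visible in the outputs above.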

In [63]:
profiles

Unnamed: 0,profile,gender,birthday,favorites_anime,link
0,DesolatePsyche,Male,"Oct 2, 1994","[33352, 25013, 5530, 33674, 1482, 269, 18245, ...",https://myanimelist.net/profile/DesolatePsyche
1,baekbeans,Female,"Nov 10, 2000","[11061, 31964, 853, 20583, 918, 9253, 34599, 3...",https://myanimelist.net/profile/baekbeans
2,skrn,,,"[918, 2904, 11741, 17074, 23273, 32281, 9989, ...",https://myanimelist.net/profile/skrn
3,edgewalker00,Male,Sep 5,"[5680, 849, 2904, 3588, 37349]",https://myanimelist.net/profile/edgewalker00
4,aManOfCulture99,Male,"Oct 30, 1999","[4181, 7791, 9617, 5680, 2167, 4382, 849, 235,...",https://myanimelist.net/profile/aManOfCulture99
...,...,...,...,...,...
81722,lovelessxd,Female,"Aug 6, 1992","[853, 5114]",https://myanimelist.net/profile/lovelessxd
81723,Shattered_Angel,Female,"Sep 6, 1994","[150, 27, 1520, 121, 31452, 32995, 877, 14713,...",https://myanimelist.net/profile/Shattered_Angel
81724,FluffyWalrus,Male,,"[121, 43, 237, 202, 205]",https://myanimelist.net/profile/FluffyWalrus
81725,camco,Female,Sep 23,"[199, 4224, 7054, 13601, 14713]",https://myanimelist.net/profile/camco
