# Data Cleaning
This notebook cleans a MyAnimeList dataset, which is available [on Kaggle](https://www.kaggle.com/marlesson/myanimelist-dataset-animes-profiles-reviews).

The dataset contains 3 files:

- **animes.csv** contains a list of anime with title, title synonyms, genre, duration, rank, popularity, score, airing date, episodes, and other attributes of each anime, enough to study how these aspects trend over time. Rank is stored as a float in the CSV even though it only holds integer values, because the column contains NaN values and pandas represents NaN as a float.

- **profiles.csv** contains information about users who watch anime, namely username, birth date, gender, and favorite animes list.

- **reviews.csv** contains the reviews that users wrote for anime, with the review text and scores.
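
The float representation of rank mentioned above can be reproduced on a toy column. This is a minimal sketch with made-up rank values (not taken from animes.csv); it also shows pandas' nullable `Int64` dtype as one possible way to keep whole numbers alongside missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical ranks; the third anime is unranked (missing)
ranks = pd.Series([1, 2, np.nan, 4], name="rank")
print(ranks.dtype)  # float64, because NaN is itself a float

# The nullable "Int64" dtype stores integers and marks gaps as <NA>
ranks_int = ranks.astype("Int64")
print(ranks_int.dtype)  # Int64
```

Whether the conversion is worth doing depends on the later steps; many numeric routines work fine with the float column.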

In [2]:
import pandas as pd

## Profiles Dataset
As before, we will load the dataset and explore the features.

In [3]:
profiles = pd.read_csv('../data/profiles.csv')

In [4]:
profiles.isna().sum()

profile                0
gender             27871
birthday           34920
favorites_anime        0
link                   0
dtype: int64

In [5]:
profiles.head()

Unnamed: 0,profile,gender,birthday,favorites_anime,link
0,DesolatePsyche,Male,"Oct 2, 1994","['33352', '25013', '5530', '33674', '1482', '2...",https://myanimelist.net/profile/DesolatePsyche
1,baekbeans,Female,"Nov 10, 2000","['11061', '31964', '853', '20583', '918', '925...",https://myanimelist.net/profile/baekbeans
2,skrn,,,"['918', '2904', '11741', '17074', '23273', '32...",https://myanimelist.net/profile/skrn
3,edgewalker00,Male,Sep 5,"['5680', '849', '2904', '3588', '37349']",https://myanimelist.net/profile/edgewalker00
4,aManOfCulture99,Male,"Oct 30, 1999","['4181', '7791', '9617', '5680', '2167', '4382...",https://myanimelist.net/profile/aManOfCulture99


We can see that there are profile, gender, birthday, favorites_anime, and link features.
1. profile. This is a unique identifier.
2. gender. The user's gender. There are four groups: male, female, other, and none. None means the user declined to specify a gender.
3. birthday. The user's birthday. As shown above, there are quite a few missing birthdays.
4. favorites_anime. An array containing the IDs of the anime that a user has favorited.
5. link. The link to a user's profile page.

Of these 5 features, the ones that are useful to us are profile, gender, and favorites_anime. Birthdays don't typically relate to a user's genre preferences, and even if they do, it would be difficult to determine the missing birthdates with any reasonable accuracy. Links are obviously not useful for a clustering algorithm, so they can also be dropped.

If we look at the gender feature, we find that a large number of users (27,871 rows in the raw file) don't have a gender specified. We will have to decide what to do with these missing values. As discussed before, we could try dropping the rows with missing values, but this is a poor choice because it throws away a large share of the data. Gender is also likely to have a large impact on a user's genre preferences. For example, we would expect the romance genre to have a larger proportion of female users than male users. We could try filling in with the mode, the more common of the two genders. This is also bad, since it will skew the gender distribution even further towards the majority group.
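
To make the skew from mode imputation concrete, here is a small sketch on invented data (the counts are hypothetical and chosen only for illustration, not drawn from profiles.csv).

```python
import pandas as pd

# Hypothetical sample: 3 Male, 2 Female, 5 missing
gender = pd.Series(["Male"] * 3 + ["Female"] * 2 + [None] * 5)

# Share of each gender among users who specified one
observed_share = gender.value_counts(normalize=True)

# Fill every missing value with the most common gender
mode_filled = gender.fillna(gender.mode()[0])
filled_share = mode_filled.value_counts(normalize=True)

print(observed_share["Male"])  # 0.6
print(filled_share["Male"])    # 0.8
```

In this toy case the majority share jumps from 60% to 80% purely because of the imputation, which is the distortion described above.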

Instead, we will use a simple classification algorithm to fill in the missing gender values. We won't cover this right now, but rather later on, after we have finished cleaning up the rest of the data.

Let's look at the favorites_anime feature. Again we can see that it's in a string formatted as an array, rather than an actual array. We will need to use our trusty conversion method again.

In [6]:
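# Parses a string holding a Python literal (e.g. "['1', '2']") into the
# corresponding object; anything that is not a valid literal is returned
# as a plain string.
import ast

def perfectEval(anonstring):
    try:
        return ast.literal_eval(anonstring)
    except (ValueError, SyntaxError):
        # Not a valid literal: wrap it in quotes and evaluate it as a string
        corrected = "'" + anonstring + "'"
        return ast.literal_eval(corrected)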

In [7]:
profiles['favorites_anime'] = profiles['favorites_anime'].apply(perfectEval)
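
As a quick illustration of what the conversion does, the sketch below parses a single string shaped like the `favorites_anime` values shown earlier. The final cast to `int` is an optional extra step that the notebook itself does not perform.

```python
import ast

# How the CSV stores a favorites list: one string, not a list
raw = "['33352', '25013', '5530']"
parsed = ast.literal_eval(raw)

print(type(raw).__name__, type(parsed).__name__)  # str list
print(parsed[0])  # 33352

# Optional (not in the original notebook): cast the ids to integers
ids = [int(x) for x in parsed]
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for data read from a file.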

## Removing Duplicates
Let's take a look at duplicated rows in the profiles dataset and remove them.

In [8]:
print('Total profiles: ')
print(len(profiles))
print('unique profiles:')
print(len(profiles['profile'].unique()))

Total profiles: 
81727
unique profiles:
47885


In [9]:
profiles = profiles.sort_values('profile').drop_duplicates(subset=['profile'], keep='last')

In [10]:
print(len(profiles))

47885


>Why is removing duplicates important in data science? Duplicates can skew the data by shifting statistics such as the mean or the mode, which in turn can throw off hypothesis tests and the p-values they produce. Another reason to remove duplicates is that they can slow down the training process: more entries mean the algorithm has to spend more time processing the data.
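
The effect of duplicates on a summary statistic can be seen on a toy frame. This is a sketch with invented rows; the `n_favs` column is hypothetical and exists only to have a number to average.

```python
import pandas as pd

# Hypothetical scraped rows: user "b" was captured twice
df = pd.DataFrame({
    "profile": ["a", "b", "b", "c"],
    "n_favs": [2, 10, 10, 3],
})

mean_with_dupes = df["n_favs"].mean()  # 6.25, "b" is counted twice

# Same pattern as above: sort, then keep the last row per profile
deduped = df.sort_values("profile").drop_duplicates(subset=["profile"], keep="last")
mean_deduped = deduped["n_favs"].mean()  # 5.0

print(mean_with_dupes, mean_deduped)
```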

## Removing Users with No Favorites
Next, we have to consider the case of users who have not made any favorites. This could be a new user or a user who uses the website for other purposes than keeping a log of favorites. Our algorithm cannot handle predicting users without favorites because there isn't enough information for the algorithm to work on. We will have to remove these.

In [11]:
profiles['fav_len'] = profiles['favorites_anime'].apply(len)

In [12]:
profiles.head()

Unnamed: 0,profile,gender,birthday,favorites_anime,link,fav_len
10861,-----noname-----,,"Dec 31, 2019","[6774, 245, 2001, 11061, 16592, 1575, 21]",https://myanimelist.net/profile/-----noname-----,7
74177,---SnowFlake---,,,"[2904, 6773, 10790]",https://myanimelist.net/profile/---SnowFlake---,3
66586,---was-----,,,[],https://myanimelist.net/profile/---was-----,0
70699,--EYEPATCH--,Male,"Oct 28, 2000",[],https://myanimelist.net/profile/--EYEPATCH--,0
79974,--Mizu--,Female,"Jul 3, 1995","[21, 177, 6864, 4081, 5678, 23289]",https://myanimelist.net/profile/--Mizu--,6


In [13]:
print(len(profiles[profiles['fav_len'] == 0]))

10422


We can see that there are over 10k users who have no favorites: 10,422 of 47,885, or about 22% of all unique users.

In [14]:
profiles = profiles[profiles['fav_len'] > 0]
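
The same length-based filter can be tried on a tiny frame to check the logic before running it on the full data. The rows below are invented for illustration.

```python
import pandas as pd

# Hypothetical users; "b" has an empty favorites list
toy = pd.DataFrame({
    "profile": ["a", "b", "c"],
    "favorites_anime": [["21", "177"], [], ["2904"]],
})

toy["fav_len"] = toy["favorites_anime"].apply(len)

# Fraction of users with no favorites (mean of a boolean mask)
share_empty = (toy["fav_len"] == 0).mean()

# Keep only users with at least one favorite
kept = toy[toy["fav_len"] > 0]
print(kept["profile"].tolist())  # ['a', 'c']
```

Taking the mean of the boolean mask is a convenient way to get the share of empty lists directly, rather than dividing two counts by hand.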

With duplicates and users without favorites removed, we should now save the data to a CSV file so we don't lose our progress.

In [15]:
profiles.head()

Unnamed: 0,profile,gender,birthday,favorites_anime,link,fav_len
10861,-----noname-----,,"Dec 31, 2019","[6774, 245, 2001, 11061, 16592, 1575, 21]",https://myanimelist.net/profile/-----noname-----,7
74177,---SnowFlake---,,,"[2904, 6773, 10790]",https://myanimelist.net/profile/---SnowFlake---,3
79974,--Mizu--,Female,"Jul 3, 1995","[21, 177, 6864, 4081, 5678, 23289]",https://myanimelist.net/profile/--Mizu--,6
43928,--Sunclaudius,Male,,"[34561, 6594, 13125]",https://myanimelist.net/profile/--Sunclaudius,3
2829,--animeislife--,Female,"Jul 19, 1996","[249, 14467, 13601, 9989, 10793, 16498, 8460, ...",https://myanimelist.net/profile/--animeislife--,8


In [16]:
profiles[['profile', 'gender', 'favorites_anime']].to_csv('../data/profiles_clean.csv', index=False)
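
One thing to keep in mind when reloading this file: CSV has no list type, so `favorites_anime` is written out as text and comes back as a string unless it is parsed again. The sketch below shows one way to handle that with the `converters` argument of `pd.read_csv`, using an in-memory buffer and a made-up row in place of the real file.

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame standing in for the cleaned profiles
toy = pd.DataFrame({"profile": ["a"], "favorites_anime": [["21", "177"]]})

buffer = io.StringIO()  # stands in for a file on disk
toy.to_csv(buffer, index=False)
buffer.seek(0)

# The converter turns the stored text "['21', '177']" back into a list
reloaded = pd.read_csv(buffer, converters={"favorites_anime": ast.literal_eval})
print(type(reloaded.loc[0, "favorites_anime"]).__name__)  # list
```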

Finally, we are done pre-processing the data. All of our data has been cleaned, duplicates removed, and formatted properly for our next task, which is to combine the two datasets together. We will be covering the prediction of user gender as well as how to combine the two datasets in preparation for machine learning in the next blog post.