# Data Cleaning
This notebook cleans a MyAnimeList dataset, which can be found [on Kaggle](https://www.kaggle.com/marlesson/myanimelist-dataset-animes-profiles-reviews).

The dataset contains 3 files:

- **animes.csv** contains a list of anime with title, title synonyms, genre, duration, rank, popularity, score, airing date, episodes, and other fields that describe each title and how its attributes trend over time. Rank is stored as a float in the CSV even though it holds only integer values; the column contains NaN values, and pandas represents those as floats.

- **profiles.csv** contains information about users who watch anime, namely username, birth date, gender, and favorite animes list.

- **reviews.csv** contains the reviews that users wrote for anime, with the review text and scores.
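
The note about rank being stored as a float can be reproduced on a toy Series. This is a minimal sketch with made-up values (not taken from the dataset); it also shows pandas' nullable `Int64` dtype as one possible way to keep whole numbers alongside missing entries, which this notebook does not use.

```python
import numpy as np
import pandas as pd

# Toy values (invented for illustration): integer ranks with one missing entry.
ranked = pd.Series([1, 2, np.nan])
print(ranked.dtype)  # float64, because NaN is a float

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
ranked_int = ranked.astype("Int64")
print(ranked_int.dtype)  # Int64
```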

In [82]:
import numpy as np
import pandas as pd
import plotly.express as px

## Animes Dataset
This notebook will clean and feature engineer the animes dataset.

In [83]:
animes = pd.read_csv('../data/animes.csv')
# profiles = pd.read_csv('./data/profiles.csv')
# reviews = pd.read_csv('./data/reviews.csv')

In [84]:
animes.head()

| | uid | title | synopsis | genre | aired | episodes | members | popularity | ranked | score | img_url | link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28891 | Haikyuu!! Second Season | Following their participation at the Inter-Hig... | ['Comedy', 'Sports', 'Drama', 'School', 'Shoun... | Oct 4, 2015 to Mar 27, 2016 | 25.0 | 489888 | 141 | 25.0 | 8.82 | https://cdn.myanimelist.net/images/anime/9/766... | https://myanimelist.net/anime/28891/Haikyuu_Se... |
| 1 | 23273 | Shigatsu wa Kimi no Uso | Music accompanies the path of the human metron... | ['Drama', 'Music', 'Romance', 'School', 'Shoun... | Oct 10, 2014 to Mar 20, 2015 | 22.0 | 995473 | 28 | 24.0 | 8.83 | https://cdn.myanimelist.net/images/anime/3/671... | https://myanimelist.net/anime/23273/Shigatsu_w... |
| 2 | 34599 | Made in Abyss | The Abyss—a gaping chasm stretching down into ... | ['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F... | Jul 7, 2017 to Sep 29, 2017 | 13.0 | 581663 | 98 | 23.0 | 8.83 | https://cdn.myanimelist.net/images/anime/6/867... | https://myanimelist.net/anime/34599/Made_in_Abyss |
| 3 | 5114 | Fullmetal Alchemist: Brotherhood | "In order for something to be obtained, someth... | ['Action', 'Military', 'Adventure', 'Comedy', ... | Apr 5, 2009 to Jul 4, 2010 | 64.0 | 1615084 | 4 | 1.0 | 9.23 | https://cdn.myanimelist.net/images/anime/1223/... | https://myanimelist.net/anime/5114/Fullmetal_A... |
| 4 | 31758 | Kizumonogatari III: Reiketsu-hen | After helping revive the legendary vampire Kis... | ['Action', 'Mystery', 'Supernatural', 'Vampire'] | Jan 6, 2017 | 1.0 | 214621 | 502 | 22.0 | 8.83 | https://cdn.myanimelist.net/images/anime/3/815... | https://myanimelist.net/anime/31758/Kizumonoga... |


The `animes` dataframe contains data about each anime, identified by its `uid`. The preliminary features to consider are `genre`, `members`, `popularity`, and `score`.
1. *genre* - The genres of the title, stored in an array format for each title.
2. *members* - How many members follow a title.
3. *popularity* - Global ranking by the number of members following a title.
4. *score* - The average score members gave a title.

### Exploring the Anime Dataframe
Let's look at the data in the anime dataframe

In [85]:
animes.describe()

| | uid | episodes | members | popularity | ranked | score |
|---|---|---|---|---|---|---|
| count | 19311.0 | 18605.0 | 19311.0 | 19311.0 | 16099.0 | 18732.0 |
| mean | 19358.904096 | 11.460414 | 34726.09 | 7720.830304 | 6866.524194 | 6.436107 |
| std | 14271.446515 | 47.950386 | 112177.2 | 4676.786104 | 4390.018768 | 1.007941 |
| min | 1.0 | 1.0 | 25.0 | 1.0 | 1.0 | 1.25 |
| 25% | 4833.5 | 1.0 | 388.0 | 3725.0 | 2895.5 | 5.77 |
| 50% | 18327.0 | 2.0 | 2389.0 | 7539.0 | 6963.0 | 6.41 |
| 75% | 33896.5 | 12.0 | 14501.5 | 11613.0 | 10601.5 | 7.15 |
| max | 40960.0 | 3057.0 | 1871043.0 | 16338.0 | 14675.0 | 9.23 |


In [86]:
animes.isna().sum()

uid              0
title            0
synopsis       975
genre            0
aired            0
episodes       706
members          0
popularity       0
ranked        3212
score          579
img_url        180
link             0
dtype: int64

In [87]:
animes.count()

uid           19311
title         19311
synopsis      18336
genre         19311
aired         19311
episodes      18605
members       19311
popularity    19311
ranked        16099
score         18732
img_url       19131
link          19311
dtype: int64

#### Null Values and Un-needed Features

The `aired`, `img_url`, and `link` features are unnecessary for clustering, so they can be dropped.
The `ranked` feature is simply an ordering of titles by score, so it can also be dropped. `popularity` is likewise a ranking derived from `members`, so it is dropped as well.

The `episodes` feature has some missing values, likely because some shows have been announced but have no episodes yet. Since users can't really judge a show with no episodes out, it's best to drop these rows.

`synopsis` can be dropped for now. It won't be relevant for clustering users but could be useful for NLP.

`score` is an important feature. Only 579 of the 19,311 rows are missing a score, so those rows can be dropped.
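
Before dropping rows, it can help to check how much of each column is missing and how many rows the drop removes. The sketch below uses a small invented frame standing in for `animes`; the column names match the dataset, but the values do not. In the notebook itself, the row count goes from 19,311 to 18,419, so 892 rows (about 4.6%) are removed.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `animes`; the values are invented for illustration.
toy = pd.DataFrame({
    "uid": [1, 2, 3, 4],
    "episodes": [12.0, np.nan, 1.0, 24.0],
    "score": [7.5, 6.0, np.nan, 8.1],
})

# Fraction of missing values per column.
missing_share = toy.isna().mean()
print(missing_share)

# Rows that survive when nulls in the key columns are dropped.
kept = toy.dropna(subset=["episodes", "score"])
print(len(toy) - len(kept), "rows dropped")
```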

In [88]:
animes = animes.drop(['aired', 'img_url', 'link', 'synopsis', 'ranked', 'popularity'], axis=1)

In [89]:
animes.dropna(subset=['episodes', 'score'], inplace=True)

In [90]:
animes.count()

uid         18419
title       18419
genre       18419
episodes    18419
members     18419
score       18419
dtype: int64

In [91]:
animes.head()

| | uid | title | genre | episodes | members | score |
|---|---|---|---|---|---|---|
| 0 | 28891 | Haikyuu!! Second Season | ['Comedy', 'Sports', 'Drama', 'School', 'Shoun... | 25.0 | 489888 | 8.82 |
| 1 | 23273 | Shigatsu wa Kimi no Uso | ['Drama', 'Music', 'Romance', 'School', 'Shoun... | 22.0 | 995473 | 8.83 |
| 2 | 34599 | Made in Abyss | ['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F... | 13.0 | 581663 | 8.83 |
| 3 | 5114 | Fullmetal Alchemist: Brotherhood | ['Action', 'Military', 'Adventure', 'Comedy', ... | 64.0 | 1615084 | 9.23 |
| 4 | 31758 | Kizumonogatari III: Reiketsu-hen | ['Action', 'Mystery', 'Supernatural', 'Vampire'] | 1.0 | 214621 | 8.83 |


#### Unpacking and encoding the genres feature
The `genre` feature is an array of genres that describe a title. To use it in clustering, it must be unpacked from its array form and then encoded as features.

The genres are written as arrays, but stored in a string format. To get the data back out, use `ast.literal_eval` to convert each string back into an array.
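
As a quick illustration of the conversion, the sketch below parses one genre string of the shape shown in the preview above. It only demonstrates `ast.literal_eval` on a well-formed value.

```python
import ast

# A genre value as it appears in the raw CSV: a list written out as a string.
raw = "['Action', 'Mystery', 'Supernatural', 'Vampire']"

parsed = ast.literal_eval(raw)
print(type(parsed).__name__)  # list
print(parsed[0])              # Action
```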

In [92]:
# Parses a string such as "['Comedy', 'Drama']" into a Python list
import ast

def perfectEval(anonstring):
    try:
        return ast.literal_eval(anonstring)
    except (ValueError, SyntaxError):
        # Not a valid literal: keep the raw text, wrapped in a list so that
        # later loops iterate over genre names rather than single characters.
        return [anonstring]

In [93]:
animes['genre'] = animes['genre'].apply(perfectEval)

### Finding the Most Popular Genres
We are interested in the most used genres. Find all the genres from all titles and count which genres are the most common.

In [96]:
# get all the genres of every title
genres = []
for entry in animes['genre']:
    for genre in entry:
        genres.append(genre)

In [97]:
genres

['Comedy',
 'Sports',
 'Drama',
 'School',
 'Shounen',
 'Drama',
 'Music',
 'Romance',
 'School',
 'Shounen',
 'Sci-Fi',
 'Adventure',
 'Mystery',
 'Drama',
 'Fantasy',
 'Action',
 'Military',
 'Adventure',
 'Comedy',
 'Drama',
 'Magic',
 'Fantasy',
 'Shounen',
 'Action',
 'Mystery',
 'Supernatural',
 'Vampire',
 'Action',
 'Slice of Life',
 'Comedy',
 'Supernatural',
 'Adventure',
 'Supernatural',
 'Drama',
 'Action',
 'Demons',
 'Historical',
 'Shounen',
 'Supernatural',
 'Mystery',
 'Comedy',
 'Supernatural',
 'Vampire',
 'Action',
 'Military',
 'Sci-Fi',
 'Super Power',
 'Drama',
 'Mecha',
 'Comedy',
 'Sports',
 'Drama',
 'School',
 'Shounen',
 'Action',
 'Comedy',
 'Historical',
 'Parody',
 'Samurai',
 'Sci-Fi',
 'Shounen',
 'Action',
 'Sci-Fi',
 'Comedy',
 'Historical',
 'Parody',
 'Samurai',
 'Shounen',
 'Action',
 'Comedy',
 'Historical',
 'Parody',
 'Samurai',
 'Sci-Fi',
 'Shounen',
 'Slice of Life',
 'Comedy',
 'Supernatural',
 'Drama',
 'Romance',
 'Action',
 'Comedy',
 ...]

In [98]:
# count the different genres from all the titles
from collections import Counter
genre_count = Counter(genres)

In [99]:
# Keep the genre names, ordered from most to least common (at most 50)
top_genres = [name for name, _ in genre_count.most_common(50)]

In [100]:
top_genres

['Comedy',
 'Action',
 'Fantasy',
 'Adventure',
 'Drama',
 'Sci-Fi',
 'Hentai',
 'Kids',
 'Shounen',
 'Romance',
 'Slice of Life',
 'Music',
 'School',
 'Supernatural',
 'Historical',
 'Mecha',
 'Magic',
 'Seinen',
 'Mystery',
 'Sports',
 'Ecchi',
 'Shoujo',
 'Super Power',
 'Parody',
 'Military',
 'Demons',
 'Space',
 'Horror',
 'Harem',
 'Dementia',
 'Martial Arts',
 'Psychological',
 'Game',
 'Police',
 'Samurai',
 'Vampire',
 'Thriller',
 'Cars',
 'Josei',
 'Shounen Ai',
 'Shoujo Ai',
 'Yuri',
 'Yaoi']
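
The same ranking can also be obtained directly in pandas. This is an alternative sketch rather than the approach used in this notebook; it assumes the `genre` column already holds Python lists, and the toy values are invented.

```python
import pandas as pd

# Toy frame; assumes `genre` already holds lists, as after `ast.literal_eval`.
toy = pd.DataFrame({
    "uid": [1, 2, 3],
    "genre": [["Comedy", "Drama"], ["Comedy"], ["Action", "Comedy"]],
})

# One row per (title, genre) pair, then count how often each genre occurs.
genre_counts = toy["genre"].explode().value_counts()

# Genre names ordered from most to least common.
top = genre_counts.index.tolist()
print(genre_counts)
```

`value_counts` sorts in descending order by default, so `top` plays the same role as `top_genres` above.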

In [70]:
# Create feature columns for each of the top genres and encode each show's genres in these features
def encode_genre(genre, genre_list):
    if genre in genre_list:
        return 1
    return 0

In [71]:
for genre_feat in top_genres:
    animes[genre_feat] = animes['genre'].apply(lambda x: encode_genre(genre_feat, x))
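
The encoding step depends on every `genre` entry being a list. If an entry were still a string, `in` would do substring matching and iteration would yield single characters, which would be consistent with columns named `'`, `,` or `e`. The sketch below adds a guard for that and shows an alternative, vectorized encoding; the toy frame and the names `dummies` and `encoded` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy frame; assumes `genre` holds lists of genre names.
toy = pd.DataFrame({
    "uid": [1, 2, 3],
    "genre": [["Comedy", "Drama"], ["Comedy"], ["Action", "Comedy"]],
})

# Guard: fail early if any entry is not a list (for example, a raw string).
assert toy["genre"].map(lambda g: isinstance(g, list)).all()

# One indicator column per genre: explode to one genre per row,
# build 0/1 dummies, then collapse back to one row per title.
dummies = pd.get_dummies(toy["genre"].explode(), dtype=int).groupby(level=0).max()
encoded = toy.drop(columns="genre").join(dummies)
print(encoded)
```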

In [72]:
animes

Unnamed: 0,uid,title,genre,episodes,members,score,',Unnamed: 8,",",e,...,Police,Samurai,Vampire,Thriller,Cars,Josei,Shounen Ai,Shoujo Ai,Yuri,Yaoi
0,28891,Haikyuu!! Second Season,"[Comedy, Sports, Drama, School, Shounen]",25.0,489888,8.82,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,23273,Shigatsu wa Kimi no Uso,"[Drama, Music, Romance, School, Shounen]",22.0,995473,8.83,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,34599,Made in Abyss,"[Sci-Fi, Adventure, Mystery, Drama, Fantasy]",13.0,581663,8.83,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3,5114,Fullmetal Alchemist: Brotherhood,"[Action, Military, Adventure, Comedy, Drama, M...",64.0,1615084,9.23,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
4,31758,Kizumonogatari III: Reiketsu-hen,"[Action, Mystery, Supernatural, Vampire]",1.0,214621,8.83,1,1,1,1,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19306,32979,Flip Flappers,"[Sci-Fi, Adventure, Comedy, Magic]",13.0,134252,7.73,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
19307,123,Fushigi Yuugi,"[Adventure, Fantasy, Magic, Martial Arts, Come...",52.0,84407,7.73,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
19308,1281,Gakkou no Kaidan,"[Mystery, Horror, Supernatural]",19.0,83093,7.73,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
19309,450,InuYasha Movie 2: Kagami no Naka no Mugenjo,"[Action, Adventure, Comedy, Historical, Demons...",1.0,71989,7.73,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0


In [73]:
animes.drop('genre', axis=1, inplace=True)

In [74]:
animes.isna().sum()

uid           0
title         0
episodes      0
members       0
score         0
             ..
Josei         0
Shounen Ai    0
Shoujo Ai     0
Yuri          0
Yaoi          0
Length: 92, dtype: int64

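### Remove Duplicates
Titles appear to be duplicated in the dataset; for example, `uid` 28891 shows up at both index 0 and index 3077. Drop the duplicates before saving, keeping the last occurrence of each `uid`.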

In [75]:
animes[animes['uid'] == 28891]

Unnamed: 0,uid,title,episodes,members,score,',Unnamed: 7,",",e,i,...,Police,Samurai,Vampire,Thriller,Cars,Josei,Shounen Ai,Shoujo Ai,Yuri,Yaoi
0,28891,Haikyuu!! Second Season,25.0,489888,8.82,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3077,28891,Haikyuu!! Second Season,25.0,489888,8.82,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [76]:
animes = animes.drop_duplicates(subset='uid', keep='last')
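
To see how many rows a de-duplication removes and which occurrence survives, a check along these lines can be run first. This is a minimal sketch on an invented frame; only the `uid` column name and the `keep='last'` choice come from the notebook.

```python
import pandas as pd

# Toy frame with one repeated uid; titles are placeholders.
toy = pd.DataFrame({
    "uid": [28891, 23273, 28891],
    "title": ["A", "B", "A"],
})

# Number of rows whose uid has already appeared earlier in the frame.
n_dupes = toy.duplicated(subset="uid").sum()

# keep="last" retains the final occurrence of each uid.
deduped = toy.drop_duplicates(subset="uid", keep="last")
print(n_dupes, deduped.index.tolist())
```

Because the last occurrence is kept, the surviving rows carry their later index labels, which is why a preview after this step need not start at index 0.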

### Saving Cleaned Data

In [77]:
animes.head()

Unnamed: 0,uid,title,episodes,members,score,',Unnamed: 7,",",e,i,...,Police,Samurai,Vampire,Thriller,Cars,Josei,Shounen Ai,Shoujo Ai,Yuri,Yaoi
3046,9317,Doll Saaya,1.0,609,4.61,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3048,38339,Suzumi-bune,1.0,137,5.0,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3051,39731,Na Bbeun Sang Sa,1.0,149,5.61,1,1,1,1,1,...,0,0,0,1,0,1,0,0,0,0
3052,40131,Junjou Juugeki Cosplay Shoujo,1.0,117,3.95,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3057,5569,Tsui no Sora,1.0,1821,2.84,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0


In [29]:
animes.to_csv('../data/animes_clean.csv', index=False)

### Genres List
Save the list of distinct genres (`top_genres`) for future use as column names.

In [80]:
genres


In [101]:
import pickle

# Persist the ordered genre list; the context manager closes the file
with open('../data/genres.pickle', 'wb') as f:
    pickle.dump(top_genres, f)
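
Reading the list back in a later notebook is the mirror image of the save. The sketch below round-trips a stand-in list through a temporary file so it does not touch `../data`; `example_genres` is a placeholder for `top_genres`.

```python
import os
import pickle
import tempfile

# Stand-in for `top_genres`; the real list comes from the cells above.
example_genres = ["Comedy", "Action", "Fantasy"]

# Write to a temporary directory so the sketch leaves ../data untouched.
path = os.path.join(tempfile.mkdtemp(), "genres.pickle")
with open(path, "wb") as f:
    pickle.dump(example_genres, f)

# Load the list back, preserving its order.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == example_genres)  # True
```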