# Data Cleaning
This notebook cleans a MyAnimeList dataset, which is available [on Kaggle](https://www.kaggle.com/marlesson/myanimelist-dataset-animes-profiles-reviews).

The dataset contains 3 files:

- **animes.csv** contains the list of anime, with title, title synonyms, genre, duration, rank, popularity, score, airing date, episodes, and other attributes of each anime, which is enough to study trends over time. Rank appears as a float in the CSV even though it holds only integer values. This is because the column contains NaN values, which pandas represents as floats.

- **profiles.csv** contains information about the users who watch anime, namely username, birth date, gender, and list of favorite anime.

- **reviews.csv** contains the reviews that users wrote for anime, with the review text and the scores.
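
The note about rank in the **animes.csv** bullet can be reproduced on a small scale. The sketch below uses a hypothetical two-row sample with a column named `rank` (the real column name may differ) to show that a missing value turns an integer column into `float64`, and that the nullable `Int64` dtype is one way to keep integers.

```python
import io

import pandas as pd

# Hypothetical sample, not taken from animes.csv: the second row has no rank.
csv_text = "title,rank\nA,1\nB,\n"
sample = pd.read_csv(io.StringIO(csv_text))

# The missing value is stored as NaN, which forces the column to float64.
print(sample["rank"].dtype)

# The nullable Int64 dtype keeps integer values and marks the gap as <NA>.
ranked_int = sample["rank"].astype("Int64")
print(ranked_int.dtype)
```

Whether the conversion is worth doing depends on how the rank is used later; for plotting or sorting, the float column usually works as is.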

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

## Reviews Dataset
This notebook cleans the reviews dataset and engineers features from it.

In [2]:
reviews = pd.read_csv('../data/reviews.csv')

In [3]:
reviews.head()

Unnamed: 0,uid,profile,anime_uid,text,score,scores,link
0,255938,DesolatePsyche,34096,\n \n \n \n ...,8,"{'Overall': '8', 'Story': '8', 'Animation': '8...",https://myanimelist.net/reviews.php?id=255938
1,259117,baekbeans,34599,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ...",https://myanimelist.net/reviews.php?id=259117
2,253664,skrn,28891,\n \n \n \n ...,7,"{'Overall': '7', 'Story': '7', 'Animation': '9...",https://myanimelist.net/reviews.php?id=253664
3,8254,edgewalker00,2904,\n \n \n \n ...,9,"{'Overall': '9', 'Story': '9', 'Animation': '9...",https://myanimelist.net/reviews.php?id=8254
4,291149,aManOfCulture99,4181,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ...",https://myanimelist.net/reviews.php?id=291149


In [4]:
reviews.isna().sum()

uid          0
profile      0
anime_uid    0
text         0
score        0
scores       0
link         0
dtype: int64

In [5]:
reviews.describe()

Unnamed: 0,uid,anime_uid,score
count,192112.0,192112.0,192112.0
mean,187648.127525,15273.300283,7.570235
std,98748.902397,13480.565379,2.255167
min,1.0,1.0,0.0
25%,101779.5,2167.0,6.0
50%,210913.5,10793.0,8.0
75%,270383.0,30205.0,9.0
max,325747.0,40807.0,11.0
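
The summary above reports a minimum score of 0 and a maximum of 11. If the expected rating scale is 1 to 10 (an assumption, not something stated in the dataset description), rows outside that range can be flagged for inspection. A minimal sketch on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be the reviews dataframe.
sample = pd.DataFrame({"uid": [1, 2, 3, 4], "score": [0, 8, 11, 10]})

# Assumed valid range; adjust if the real scale is different.
VALID_MIN, VALID_MAX = 1, 10

# between() is inclusive on both ends by default.
out_of_range = ~sample["score"].between(VALID_MIN, VALID_MAX)
print(sample[out_of_range])
```

Whether such rows should be dropped, clipped, or kept is a separate decision that this sketch does not make.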


### Structure of Scores Feature
The scores feature holds a dictionary of per-category scores, but it is stored as a string. Parse it into a dictionary, then unpack the values and add them as individual features.

In [7]:
type(reviews.iloc[0]['scores'])

str

In [9]:
# Use ast to convert the string into a dictionary
import ast
reviews['scores'] = reviews['scores'].apply(ast.literal_eval)

In [14]:
reviews.iloc[0]['scores']

{'Overall': '8',
 'Story': '8',
 'Animation': '8',
 'Sound': '10',
 'Character': '9',
 'Enjoyment': '8'}
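
As a side note on the parsing step, `ast.literal_eval` accepts only Python literals, so it does not execute arbitrary code the way `eval` would. `json.loads` would not work on these strings as they stand, because they use single quotes. A small sketch on a shortened sample string:

```python
import ast

# Shortened sample in the same format as the scores column.
raw = "{'Overall': '8', 'Story': '8'}"
parsed = ast.literal_eval(raw)
print(type(parsed).__name__, parsed["Overall"])

# Anything that is not a plain literal is rejected instead of being executed.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print(rejected)
```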

In [16]:
score_labels = ['Overall', 'Story', 'Animation', 'Sound', 'Character', 'Enjoyment']

In [17]:
# Look up one score category in the scores dict and convert it from a string to an int
def encode_score(score_label, scores):
    return int(scores[score_label])

In [21]:
for label in score_labels:
    reviews[label] = reviews['scores'].apply(lambda x: encode_score(label, x))
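
An alternative to the per-label loop is to expand all dictionaries in one step. The sketch below is one possible approach, not the method used in this notebook; it runs on a two-row sample built from the values shown in the outputs, and the names `sample` and `expanded` are illustrative.

```python
import pandas as pd

score_labels = ['Overall', 'Story', 'Animation', 'Sound', 'Character', 'Enjoyment']

# Two already-parsed score dictionaries, copied from the rows shown above.
sample = pd.DataFrame({"scores": [
    {'Overall': '8', 'Story': '8', 'Animation': '8',
     'Sound': '10', 'Character': '9', 'Enjoyment': '8'},
    {'Overall': '7', 'Story': '7', 'Animation': '9',
     'Sound': '8', 'Character': '8', 'Enjoyment': '8'},
]})

# Build one column per key, fix the column order, and cast the strings to int.
expanded = pd.DataFrame(sample["scores"].tolist(), index=sample.index)
expanded = expanded[score_labels].astype(int)

# Attach the new columns to the original rows.
sample = sample.join(expanded)
print(sample[score_labels])
```

The cast with `astype(int)` assumes every row has all six keys with numeric strings; a row with a missing key would produce NaN and make the cast fail.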

In [22]:
reviews.head()

Unnamed: 0,uid,profile,anime_uid,text,score,scores,link,Overall,Story,Animation,Sound,Character,Enjoyment
0,255938,DesolatePsyche,34096,\n \n \n \n ...,8,"{'Overall': '8', 'Story': '8', 'Animation': '8...",https://myanimelist.net/reviews.php?id=255938,8,8,8,10,9,8
1,259117,baekbeans,34599,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ...",https://myanimelist.net/reviews.php?id=259117,10,10,10,10,10,10
2,253664,skrn,28891,\n \n \n \n ...,7,"{'Overall': '7', 'Story': '7', 'Animation': '9...",https://myanimelist.net/reviews.php?id=253664,7,7,9,8,8,8
3,8254,edgewalker00,2904,\n \n \n \n ...,9,"{'Overall': '9', 'Story': '9', 'Animation': '9...",https://myanimelist.net/reviews.php?id=8254,9,9,9,10,10,9
4,291149,aManOfCulture99,4181,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ...",https://myanimelist.net/reviews.php?id=291149,10,10,8,9,10,10


### Removing Unnecessary Features

In [23]:
reviews.drop(['link', 'scores'], axis=1, inplace=True)

### Saving Cleaned Dataframes
For now, the review text is unnecessary for the analysis. However, if there is time, it can be used for NLP tasks such as sentiment analysis, so it will be saved as a separate file.

In [24]:
reviews[['uid', 'profile', 'anime_uid', 'text']].to_csv('../data/reviews_text.csv')

In [26]:
reviews.drop('text', axis=1).to_csv('../data/reviews_clean.csv')
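
One thing to keep in mind when reloading these files: `to_csv` writes the dataframe index as an unnamed first column by default, which appears as `Unnamed: 0` after `read_csv`. If that column is not wanted, passing `index=False` is one option. A sketch using in-memory buffers and a hypothetical two-row sample instead of the real files:

```python
import io

import pandas as pd

# Hypothetical sample standing in for the cleaned reviews dataframe.
sample = pd.DataFrame({"uid": [255938, 259117], "score": [8, 10]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# With index=False only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```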