# Data Cleaning
This notebook covers the preprocessing of the datasets before they are used in the two recommendation systems (content-based and collaborative filtering) that I will develop later on.

In [None]:
# Import libraries

import pandas as pd
import numpy as np

# Suppress scientific notation in pandas output
# (note: '%.0f' also rounds every displayed float to a whole number)
pd.set_option('display.float_format', lambda x: '%.0f' % x)

## Anime Dataset

Keys:
- anime_id - myanimelist.net's unique id identifying an anime.
- name - full name of anime.
- genre - comma separated list of genres for this anime.
- type - movie, TV, OVA, etc.
- episodes - number of episodes in this show (1 if movie).
- rating - average rating out of 10 for this anime.
- members - number of community members that are in this anime's "group".

In [None]:
# load anime dataset
anime_df = pd.read_csv("datasets/anime.csv")

anime_df.head(10)

Just by looking at the first 10 entries in the dataframe, we can already see what seem to be duplicates caused by formatting or by differences in the way values were entered. I will need to clean this.
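One possible way to surface such near-duplicates is to compare a normalized version of each name. The sketch below is only an illustration: the titles are hypothetical toy values (not rows from anime.csv), and `normalize_name` and `name_key` are names I made up for this example. It assumes the differences come from HTML entities, stray whitespace, or letter case, which may not cover every case in the real data.

```python
import html

import pandas as pd

# Hypothetical toy titles, NOT taken from anime.csv
toy_df = pd.DataFrame({
    "anime_id": [1, 2, 3, 4],
    "name": ["Show A", "show a ", "Show B&#039;s Story", "Show B's Story"],
})


def normalize_name(name: str) -> str:
    """Decode HTML entities, trim whitespace, and lowercase a title."""
    return html.unescape(name).strip().lower()


# Hypothetical helper column used only for comparison
toy_df["name_key"] = toy_df["name"].map(normalize_name)

# keep=False flags every member of a duplicate group, not just the later ones
possible_dupes = toy_df[toy_df.duplicated("name_key", keep=False)]
print(possible_dupes)
```

Rows flagged this way are only candidates; they would still need a manual look before anything is dropped.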

In [None]:
# number of rows and columns in the dataframe
anime_df.shape

In [None]:
# are the columns using suitable datatypes
anime_df.dtypes

In [None]:
anime_df[["rating", "members"]].describe()

In [None]:
# Check which columns have missing values
anime_df.isnull().any()

In [None]:
# How many missing values do we have for each column?
anime_df.isnull().sum()

It seems there are missing values in the genre (62), type (25), and rating (230) columns. For a recommendation system, missing values may make a content-based filtering system less accurate, since features such as the genre of a show may influence how much someone enjoys a given anime.
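To judge how much the gaps matter, it can help to view the counts next to the share of rows they represent. This is a minimal sketch on a hypothetical toy dataframe (the values are invented, and `missing_summary` is a name chosen for this example), not a result from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps, NOT taken from anime.csv
toy_df = pd.DataFrame({
    "genre": ["Action", None, "Drama", "Comedy"],
    "type": ["TV", "TV", None, "Movie"],
    "rating": [8.5, np.nan, np.nan, 7.0],
})

# isnull().sum() counts missing cells; isnull().mean() gives their fraction
missing_summary = pd.DataFrame({
    "missing": toy_df.isnull().sum(),
    "percent": toy_df.isnull().mean() * 100,
})
print(missing_summary)
```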

In this project, I am only going to use anime that have a "type" value of "TV", so I will remove all rows where "type" is not equal to "TV". This also removes the rows with a missing "type", since a null value never compares equal to "TV".

In [None]:
# keep only rows where the anime is classified as "TV"
anime_df = anime_df[anime_df["type"] == "TV"]
anime_df.head(10)

In [None]:
# check how many missing values we have now
anime_df.isnull().sum()

It seems we still have 10 rows with missing genre values and 116 rows with missing rating values. Content-based filtering requires the genre of each show, so when I develop the content-based model I will drop the 10 rows with missing genre values. Collaborative filtering, on the other hand, does not need the genre values at all, so there I will simply use the whole dataset without the genre column.
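The two views described above could be derived roughly as follows. This is a sketch on hypothetical toy data; the names `content_df` and `collab_df` are assumptions for illustration, and the later modelling notebooks may structure this differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from anime.csv
toy_anime = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["Show A", "Show B", "Show C"],
    "genre": ["Action, Drama", None, "Comedy"],
    "rating": [8.1, 7.4, np.nan],
})

# Content-based view: genre is required, so drop rows where it is missing
content_df = toy_anime.dropna(subset=["genre"])

# Collaborative view: genre is not used, so keep every row and drop the column
collab_df = toy_anime.drop(columns=["genre"])

print(content_df.shape, collab_df.shape)
```

Neither call modifies `toy_anime`, so both views can be built from the same cleaned dataframe.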

## Ratings Dataset

Keys:
- user_id - non-identifiable randomly generated user id.
- anime_id - the anime that this user has rated.
- rating - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating).

In [None]:
# load ratings dataset
rating_df = pd.read_csv("datasets/rating.csv")

rating_df.head()

In [None]:
rating_df.tail()

In [None]:
rating_df.shape

In [None]:
rating_df.dtypes

In [None]:
rating_df.isnull().any()

Using -1 to mark a missing rating may skew future analysis and the building of the recommender, because -1 would be treated as a very low score. Instead of using -1, I will replace all ratings of -1 with a null value.
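A small example shows the kind of skew this refers to. The ratings below are hypothetical toy values, not rows from rating.csv; the point is only that pandas skips NaN when averaging but counts -1 as a real score.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings, NOT taken from rating.csv
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "anime_id": [10, 20, 10, 30],
    "rating": [8, -1, 10, -1],
})

# With the placeholder left in, -1 drags the average down
mean_with_placeholder = toy_ratings["rating"].mean()

# After replacing -1 with NaN, mean() ignores the missing ratings
cleaned = toy_ratings["rating"].replace(-1, np.nan)
mean_without_placeholder = cleaned.mean()

print(mean_with_placeholder, mean_without_placeholder)
```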

In [None]:
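# Replace the -1 placeholder with NaN (assign back instead of using inplace on a column)
rating_df["rating"] = rating_df["rating"].replace(-1, np.nan)
# np.sort returns a sorted copy and places NaN last
values = np.sort(rating_df["rating"].unique())
print(values)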

Now the rating column has values from 1 to 10, plus NaN for missing ratings.

The ratings dataset doesn't seem to need any more cleaning unless other issues arise.

## Exporting Dataframes to CSV

In [None]:
anime_df.to_csv("datasets/cleaned_anime.csv", index=False)
rating_df.to_csv("datasets/cleaned_rating.csv", index=False)