# Data Cleaning
This notebook covers the preprocessing of the datasets before they are used in the two recommendation systems (content-based and collaborative filtering) that I will develop later on.

In [15]:
# Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress scientific notation in pandas output (note: '%.0f' also rounds displayed floats to whole numbers)
pd.set_option('display.float_format', lambda x: '%.0f' % x)

## Anime Dataset

Keys:
- anime_id - myanimelist.net's unique id identifying an anime.
- name - full name of anime.
- genre - comma separated list of genres for this anime.
- type - movie, TV, OVA, etc.
- episodes - how many episodes in this show. (1 if movie).
- rating - average rating out of 10 for this anime.
- members - number of community members that are in this anime's "group".

In [16]:
# load anime dataset
anime_df = pd.read_csv("datasets/anime.csv")

anime_df.head(10)

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9 | 151266 |
| 5 | 32935 | Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga... | Comedy, Drama, School, Shounen, Sports | TV | 10 | 9 | 93351 |
| 6 | 11061 | Hunter x Hunter (2011) | Action, Adventure, Shounen, Super Power | TV | 148 | 9 | 425855 |
| 7 | 820 | Ginga Eiyuu Densetsu | Drama, Military, Sci-Fi, Space | OVA | 110 | 9 | 80679 |
| 8 | 15335 | Gintama Movie: Kanketsu-hen - Yorozuya yo Eien... | Action, Comedy, Historical, Parody, Samurai, S... | Movie | 1 | 9 | 72534 |
| 9 | 15417 | Gintama&#039;: Enchousen | Action, Comedy, Historical, Parody, Samurai, S... | TV | 13 | 9 | 81109 |


Just by looking at the first 10 entries in the dataframe, we can already see what appear to be duplicates caused by formatting or by differences in the way values were entered. I will need to clean this.
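
One formatting difference visible in the output above is the `&#039;` sequence, which is an HTML-escaped apostrophe. The snippet below is a minimal sketch of decoding such entities with the standard library's `html.unescape`. The helper name `normalise_name` and the sample list are illustrative assumptions, not part of the original notebook.

```python
import html


def normalise_name(name: str) -> str:
    """Decode HTML entities and tidy whitespace in a title (illustrative helper)."""
    decoded = html.unescape(name)      # e.g. "&#039;" becomes "'"
    return " ".join(decoded.split())   # trim and collapse runs of whitespace


# A few titles taken from the head() output above, plus one with stray spaces
names = ["Gintama&#039;", "Gintama&#039;: Enchousen", "  Steins;Gate "]
cleaned = [normalise_name(n) for n in names]
print(cleaned)
```

If this behaves as intended on the full data, it could be applied with something like `anime_df["name"].map(normalise_name)`.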

In [17]:
# number of rows and columns in the dataframe
anime_df.shape

(12294, 7)

In [18]:
# are the columns using suitable datatypes
anime_df.dtypes

anime_id      int64
name         object
genre        object
type         object
episodes     object
rating      float64
members       int64
dtype: object
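
The `episodes` column is stored as `object` rather than as a number, which usually means at least one value is not a plain integer. The following is a hedged sketch of one way to convert such a column. The toy frame and the `"Unknown"` placeholder are assumptions for illustration, since the notebook does not show which values prevent numeric parsing.

```python
import pandas as pd

# Toy frame standing in for anime_df; the "Unknown" placeholder is an assumption
toy = pd.DataFrame({"name": ["A", "B", "C"], "episodes": ["64", "1", "Unknown"]})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising
toy["episodes"] = pd.to_numeric(toy["episodes"], errors="coerce")

print(toy.dtypes)
print(toy)
```

Coercing makes unparseable entries show up as missing values, so they would then appear in the `isnull()` checks alongside the other columns.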

In [19]:
anime_df[["rating", "members"]].describe()

| | rating | members |
|---|---|---|
| count | 12064 | 12294 |
| mean | 6 | 18071 |
| std | 1 | 54821 |
| min | 2 | 5 |
| 25% | 6 | 225 |
| 50% | 7 | 1550 |
| 75% | 7 | 9437 |
| max | 10 | 1013917 |


In [20]:
# Check which columns have missing values
anime_df.isnull().any()

anime_id    False
name        False
genre        True
type         True
episodes    False
rating       True
members     False
dtype: bool

In [21]:
# How many missing values do we have for each column?
anime_df.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

There are missing values in the genre (62), type (25), and rating (230) columns. For a recommendation system, missing values may make a content-based filtering system less accurate, as features like an anime's genre may influence how much a viewer enjoys it.
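
Two common ways of handling missing feature values are sketched below on a toy frame: dropping the affected rows, or keeping them with an explicit placeholder. The toy data and the `"Unknown"` label are assumptions, and the notebook itself does not commit to either option.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for anime_df (values are made up for illustration)
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Drama", np.nan, "Comedy"],
    "rating": [8.5, 7.0, np.nan],
})

# Option 1: drop rows that have no genre, since genre is a key content feature
with_genre = toy.dropna(subset=["genre"])

# Option 2: keep every row and mark the gap explicitly (placeholder is an assumption)
filled = toy.assign(genre=toy["genre"].fillna("Unknown"))

print(len(with_genre))
print(filled)
```

Dropping keeps the feature space clean at the cost of losing titles, while a placeholder keeps every title but groups all unlabelled anime under one artificial genre.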

In [22]:
from fuzzywuzzy import process, fuzz
anime_df["name"] = anime_df["name"].str.strip()

process.extract("Naruto", anime_df["name"], scorer=fuzz.token_set_ratio)

[('Boruto: Naruto the Movie', 100, 486),
 ('Naruto: Shippuuden', 100, 615),
 ('The Last: Naruto the Movie', 100, 719),
 ('Naruto: Shippuuden Movie 6 - Road to Ninja', 100, 784),
 ('Naruto', 100, 841)]

In [27]:
import re

def ngrams(string, n=3):
    """Split a title into overlapping character n-grams after basic normalisation."""
    string = string.encode("ascii", errors="ignore").decode()  # remove non-ASCII chars
    string = string.lower()  # make lower case
    chars_to_remove = [")", "(", ".", "|", "[", "]", "{", "}", "'"]
    rx = "[" + re.escape("".join(chars_to_remove)) + "]"
    string = re.sub(rx, "", string)  # remove the characters listed above
    string = string.replace("&", "and")
    string = string.replace(",", " ")
    string = string.replace("-", " ")
    string = string.title()  # normalise case: capital at start of each word
    string = re.sub(" +", " ", string).strip()  # collapse repeated spaces into one
    string = " " + string + " "  # pad names so word boundaries form n-grams
    string = re.sub(r"[,-./]|\sBD", r"", string)
    grams = zip(*[string[i:] for i in range(n)])  # local name avoids shadowing the function
    return ["".join(gram) for gram in grams]

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectored_anime = anime_df["name"]
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(vectored_anime)

In [37]:
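# Compare every title against every other title. This is quadratic in the
# number of rows (12,294 titles), so it is very slow.
all_matches = [process.extract(x, vectored_anime, scorer=fuzz.token_set_ratio) for x in vectored_anime]
for matches in all_matches:
    print(matches)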

KeyboardInterrupt: 

In [30]:
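# process.extract on a Series returns (match, score, index) tuples,
# so each row has four fields once the query name is prepended
score_sort = [(x,) + i for x in vectored_anime for i in process.extract(x, vectored_anime, scorer=fuzz.token_set_ratio)]

similarity_sort = pd.DataFrame(score_sort, columns=["anime_name", "match_name", "sort_score", "match_index"])
similarity_sort.head()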

KeyboardInterrupt: 
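
Running `process.extract` for every title against every other title means millions of Python-level string comparisons, which is likely why the cells above were interrupted. A cheaper route, sketched here under assumptions, is to reuse the character n-gram TF-IDF idea and let scikit-learn find each title's nearest neighbour by cosine distance. The sample list stands in for `anime_df["name"]`, and the built-in `char_wb` analyzer is swapped in for the custom `ngrams` function to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Small stand-in for anime_df["name"] (illustrative values)
names = [
    "Gintama",
    "Gintama'",
    "Gintama': Enchousen",
    "Steins;Gate",
    "Hunter x Hunter (2011)",
]

# Character trigrams within word boundaries, weighted by TF-IDF
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
matrix = vectorizer.fit_transform(names)

# For each title, fetch its 2 nearest neighbours by cosine distance;
# for distinct titles the closest one is the title itself
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(matrix)
distances, indices = nn.kneighbors(matrix)

for i, name in enumerate(names):
    j = indices[i, 1]
    print(f"{name!r} -> {names[j]!r} (cosine distance {distances[i, 1]:.2f})")
```

Pairs with a small cosine distance are candidates for manual review. The search is still a brute-force comparison, but it runs as sparse matrix operations rather than as a Python loop over string pairs.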

<br>
The pairwise fuzzy matching above was interrupted before it finished, so any near-duplicate names remain in the data. Even so, I believe that for now the anime dataset does not need any more cleaning.




## Ratings Dataset

Keys:
- user_id - non identifiable randomly generated user id.
- anime_id - the anime that this user has rated.
- rating - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating).

In [15]:
# load ratings dataset
rating_df = pd.read_csv("datasets/rating.csv")

rating_df.head()

| | user_id | anime_id | rating |
|---|---|---|---|
| 0 | 1 | 20 | -1 |
| 1 | 1 | 24 | -1 |
| 2 | 1 | 79 | -1 |
| 3 | 1 | 226 | -1 |
| 4 | 1 | 241 | -1 |


In [16]:
rating_df.tail()

| | user_id | anime_id | rating |
|---|---|---|---|
| 7813732 | 73515 | 16512 | 7 |
| 7813733 | 73515 | 17187 | 9 |
| 7813734 | 73515 | 22145 | 10 |
| 7813735 | 73516 | 790 | 9 |
| 7813736 | 73516 | 8074 | 9 |


In [17]:
rating_df.shape

(7813737, 3)

In [18]:
rating_df.dtypes

user_id     int64
anime_id    int64
rating      int64
dtype: object

In [19]:
rating_df.isnull().any()

user_id     False
anime_id    False
rating      False
dtype: bool

Using -1 to mark a missing rating may skew later analysis and the recommender itself, because -1 would be treated as a real score. Instead of keeping -1, I will replace every rating of -1 with a null value.

In [None]:
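# Replace the -1 sentinel with NaN (assigning back avoids chained in-place modification)
rating_df["rating"] = rating_df["rating"].replace(-1, np.nan)
values = rating_df["rating"].unique()
values.sort()  # sort in place; NaN is placed last
print(values)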

After this replacement, the rating column should hold values from 1 to 10, with NaN for missing ratings.
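
To illustrate why the sentinel matters, the toy example below compares the mean of a few ratings with -1 left in against the mean after converting -1 to NaN. The four values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two unrated entries (-1) and two real ratings
ratings = pd.Series([-1, -1, 8, 10])

mean_with_sentinel = ratings.mean()                   # -1 counted as a real score
mean_with_nan = ratings.replace(-1, np.nan).mean()    # NaN is skipped by default

print(mean_with_sentinel)  # 4.0
print(mean_with_nan)       # 9.0
```

The same effect would apply to any per-user or per-anime average computed on the full ratings table.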

The ratings dataset doesn't seem to need any more cleaning unless other issues arise.