# Anime Recommendation Engine: Data Cleaning & Preparation
- Reduced the input data fed to the machine learning algorithm by over 95% (99.85% as measured at the end of this notebook) by analyzing the database and extracting only the data that improves the algorithm's performance, while minimizing training and inference time
- Contains the scripts I used to extract the relevant data from the CSV files, transform it to JSON, and load it into the machine learning backend of the web application
- Part of [MyAnimeButler](https://github.com/arjun-krishna1/MyAnimeButlerData), a winning project at the [Hacktoon](https://devpost.com/software/myanimebutler?ref_content=my-projects-tab&ref_feature=my_projects) hackathon
- Inspired by [Collaborative Filtering On Anime Data](https://www.kaggle.com/ajmichelutti/collaborative-filtering-on-anime-data)

In [30]:
# Import relevant libraries 

import pandas as pd
import numpy as np
import scipy as sp
from sklearn.metrics.pairwise import cosine_similarity
import operator
import json
import os
%matplotlib inline

In [31]:
anime = pd.read_csv('../input/anime.csv')
rating = pd.read_csv('../input/rating.csv')

### How big is the initial database?

In [32]:
print(anime.shape)
print(rating.shape)

(12294, 7)
(7813737, 3)


### Replace missing values (-1 in source) with NaN so they don't affect mean

In [33]:
# -1 marks "no rating given"; assign the result back instead of using inplace on a column
rating['rating'] = rating['rating'].replace(-1, np.nan)
rating.head()

| | user_id | anime_id | rating |
|---|---|---|---|
| 0 | 1 | 20 | NaN |
| 1 | 1 | 24 | NaN |
| 2 | 1 | 79 | NaN |
| 3 | 1 | 226 | NaN |
| 4 | 1 | 241 | NaN |
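
To see why the -1 placeholder matters, here is a minimal sketch on a made-up three-value series (the values are illustrative and not taken from the dataset). pandas skips NaN when averaging, so after the replacement the mean reflects only real ratings.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: -1 marks "no rating given"
toy = pd.Series([-1, 8, 6])

mean_with_placeholder = toy.mean()          # -1 is counted as a real score
cleaned = toy.replace(-1, np.nan)           # replace the placeholder with NaN
mean_without_placeholder = cleaned.mean()   # NaN is skipped by default

print(mean_with_placeholder, mean_without_placeholder)
```

The first mean is pulled down by the placeholder, while the second averages only the two real ratings.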


### We only want TV shows (not movies, OVAs, etc.)

In [34]:
anime_tv = anime[anime['type']=='TV']
anime_tv.head()

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama' | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |
| 5 | 32935 | Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga... | Comedy, Drama, School, Shounen, Sports | TV | 10 | 9.15 | 93351 |


### Join the ratings and anime tables to create one table with all the information

In [35]:
# Inner join on anime_id; the user's rating gets the '_user' suffix, the anime's average rating keeps its name
merged = rating.merge(anime_tv, on='anime_id', suffixes=('_user', ''))
merged = merged.rename(columns={'rating_user': 'user_rating'})
merged.head()

| | user_id | anime_id | user_rating | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20 | NaN | Naruto | Action, Comedy, Martial Arts, Shounen, Super P... | TV | 220 | 7.81 | 683297 |
| 1 | 3 | 20 | 8.0 | Naruto | Action, Comedy, Martial Arts, Shounen, Super P... | TV | 220 | 7.81 | 683297 |
| 2 | 5 | 20 | 6.0 | Naruto | Action, Comedy, Martial Arts, Shounen, Super P... | TV | 220 | 7.81 | 683297 |
| 3 | 6 | 20 | NaN | Naruto | Action, Comedy, Martial Arts, Shounen, Super P... | TV | 220 | 7.81 | 683297 |
| 4 | 10 | 20 | NaN | Naruto | Action, Comedy, Martial Arts, Shounen, Super P... | TV | 220 | 7.81 | 683297 |
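
The join does two things that are easy to miss: `merge` defaults to an inner join, so ratings for anime that are not in `anime_tv` are dropped, and the suffixes keep the two `rating` columns apart. A small sketch with made-up frames (`toy_rating` and `toy_anime_tv` are hypothetical stand-ins, not the real tables) illustrates both effects.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
toy_rating = pd.DataFrame({
    "user_id": [1, 1, 2],
    "anime_id": [20, 99, 20],
    "rating": [8.0, 7.0, 6.0],
})
toy_anime_tv = pd.DataFrame({
    "anime_id": [20],
    "name": ["Naruto"],
    "rating": [7.81],
})

# Inner join (the default): anime_id 99 has no match in toy_anime_tv and is dropped
toy_merged = toy_rating.merge(toy_anime_tv, on="anime_id", suffixes=("_user", ""))
toy_merged = toy_merged.rename(columns={"rating_user": "user_rating"})

print(toy_merged.columns.tolist())
print(len(toy_merged))
```

Only the two ratings for anime_id 20 survive, and the user's score ends up in `user_rating` while the anime's average stays in `rating`.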


### Reduce table size (three columns, user_id <= 10000) to reduce machine learning training time

In [36]:
merged = merged[['user_id', 'anime_id', 'user_rating']]
merged_sub = merged[merged.user_id <= 10000]
merged_sub.head()

| | user_id | anime_id | user_rating |
|---|---|---|---|
| 0 | 1 | 20 | NaN |
| 1 | 3 | 20 | 8.0 |
| 2 | 5 | 20 | 6.0 |
| 3 | 6 | 20 | NaN |
| 4 | 10 | 20 | NaN |


For collaborative filtering, we need a pivot table with users on one axis and TV shows (identified by anime_id) on the other. Each cell holds the user's rating for that show, or NaN if the user has not rated it. The pivot table lets us measure the similarity between users and between shows, to better predict who will like what.

In [37]:
piv = merged_sub.pivot_table(index=['user_id'], columns=['anime_id'], values='user_rating')
piv.head()

| user_id / anime_id | 1 | 6 | 7 | 8 | 15 | 16 | 17 | 18 | 19 | 20 | ... | 33028 | 33037 | 33046 | 33113 | 33222 | 33241 | 33274 | 33341 | 33394 | 33421 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | 8.0 | NaN | NaN | 6.0 | NaN | 6.0 | 6.0 | NaN | 6.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.0 | NaN | NaN |
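
This notebook does not show the similarity step itself, so the following is only a sketch of one common choice, cosine similarity between users, on a made-up pivot table. Filling unrated cells with 0 is an assumption of this example, and plain NumPy is used so the sketch needs no extra dependency.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings (user_id, anime_id, user_rating)
toy = pd.DataFrame({
    "user_id":     [1, 1, 2, 2, 3],
    "anime_id":    [20, 24, 20, 24, 79],
    "user_rating": [8.0, 6.0, 8.0, 6.0, 9.0],
})

toy_piv = toy.pivot_table(index="user_id", columns="anime_id", values="user_rating")

# Assumption for this sketch: treat unrated shows as 0 before comparing users
filled = toy_piv.fillna(0).to_numpy()
norms = np.linalg.norm(filled, axis=1, keepdims=True)
unit = filled / norms  # each row scaled to unit length

# Dot products of unit vectors are cosine similarities
user_sim = pd.DataFrame(unit @ unit.T, index=toy_piv.index, columns=toy_piv.index)

print(user_sim.round(2))
```

Users 1 and 2 gave identical ratings, so their similarity is 1, while user 3 shares no rated show with them and scores 0 against both.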


### Turn the pivot table into a JSON-friendly format
### Create a list of all the anime each user has rated, and do the same for the transpose (all the users that rated each anime)

In [38]:
def get_idx_not_nan(row):
    return list(row[row.notna()].index)
get_idx_not_nan(piv.iloc[0])

[8074, 11617, 11757, 15451]

In [39]:
def df_to_filter_dict(df, filter_fncn):
    res = {}
    for i in range(df.shape[0]):
        # i is the row position (0, 1, 2, ...), not the index label
        res[i] = filter_fncn(df.iloc[i])
    return res

# key is the row position of a user in piv (0 is the first user_id, and so on),
# value is a list of the anime_id values that user rated
user_to_watched_anime_map = df_to_filter_dict(piv, get_idx_not_nan)

# the output is the list of anime rated by the users in rows 0 and 1
print(user_to_watched_anime_map[0])
print(user_to_watched_anime_map[1])

[8074, 11617, 11757, 15451]
[11771]
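
Because `df_to_filter_dict` uses `range(df.shape[0])`, its keys are row positions rather than real ids. If a consumer expects the actual user_id and anime_id as keys, one alternative is to build both maps directly from long-format rows. This is a sketch in plain Python with made-up rows and hypothetical names (`user_to_anime`, `anime_to_users`), not the method used in this notebook.

```python
from collections import defaultdict
import math

# Hypothetical (user_id, anime_id, user_rating) rows; NaN means "not rated"
rows = [
    (1, 20, float("nan")),
    (3, 20, 8.0),
    (5, 20, 6.0),
    (5, 6, 8.0),
]

user_to_anime = defaultdict(list)   # user_id  -> anime_ids the user rated
anime_to_users = defaultdict(list)  # anime_id -> user_ids that rated it

for user_id, anime_id, user_rating in rows:
    if math.isnan(user_rating):
        continue  # skip unrated entries, like notna() in the pivot-based version
    user_to_anime[user_id].append(anime_id)
    anime_to_users[anime_id].append(user_id)

print(dict(user_to_anime))
print(dict(anime_to_users))
```

User 1 has no real rating in these rows, so it does not appear in either map.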


In [40]:
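# same mapping on the transposed table: key is the row position of an anime,
# value is the list of user_id values that rated it
anime_to_users_that_reviewed_map = df_to_filter_dict(piv.transpose(), get_idx_not_nan)

# the output is the (truncated) list of users that rated the anime in row 0
print(str(anime_to_users_that_reviewed_map[0])[:100] + "...")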

[19, 21, 23, 32, 34, 43, 46, 50, 51, 55, 68, 72, 80, 81, 103, 126, 129, 139, 152, 160, 163, 173, 175...


### Our efforts reduced the input data to the machine learning algorithms by 99.85%, comparing the row counts of the two source tables with the entry counts of the two final maps
### This reduced the training and inference time

In [48]:
init_rows = anime.shape[0] + rating.shape[0]
final_rows = len(anime_to_users_that_reviewed_map) + len(user_to_watched_anime_map)

red_percent = round(((init_rows - final_rows) / init_rows)*100, 2)
print(f"Reduction in input data {red_percent}%")

Reduction in input data 99.85%
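
The percentage is the relative drop from the initial count to the final count. As a small sketch, the same arithmetic can be wrapped in a helper (`reduction_percent` is a hypothetical name, not part of the notebook) and checked on round numbers. The initial count below uses the two row counts printed earlier.

```python
def reduction_percent(initial_count: int, final_count: int) -> float:
    """Percentage by which final_count is smaller than initial_count."""
    if initial_count <= 0:
        raise ValueError("initial_count must be positive")
    return round((initial_count - final_count) / initial_count * 100, 2)

# Row counts printed earlier: anime has 12294 rows, rating has 7813737 rows
init_rows = 12294 + 7813737
print(init_rows)

# Illustrative round numbers only: going from 1000 items to 50 is a 95% reduction
print(reduction_percent(1000, 50))
```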


### Output data to JSON format for website backend

In [29]:
# Create the output directory if it does not exist yet
os.makedirs("data", exist_ok=True)

with open("data/users_key_anime_value1.json", "w") as outfile:
    json.dump(user_to_watched_anime_map, outfile)
with open("data/anime_key_users_value1.json", "w") as outfile:
    json.dump(anime_to_users_that_reviewed_map, outfile)
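
One detail worth knowing when the backend reads these files: JSON object keys are always strings, so the integer keys of the maps come back as `"0"`, `"1"`, and so on. The sketch below uses a small made-up map (`toy_map`) shaped like the ones written above and shows one way to restore integer keys after loading.

```python
import json

# Hypothetical map in the same shape as the ones written above
toy_map = {0: [8074, 11617], 1: [11771]}

encoded = json.dumps(toy_map)
decoded = json.loads(encoded)

# Keys come back as strings after a JSON round trip
print(list(decoded.keys()))

# Convert keys back to int if the consumer needs integer lookups
restored = {int(k): v for k, v in decoded.items()}
print(restored == toy_map)
```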