# Classifying Users by Genre Preferences

## Introduction
Ratings websites are a great way to learn about a variety of entertainment media. From TV shows and movies to songs and games, ratings give users a general overview of a title's popularity. However, these overall ratings can be too coarse to be informative. Not every user rates the same way: users tend to prefer certain genres and dislike others. As a result, an overall rating may hide distinct groups of users who, on average, rate a title differently from users in other groups. For example, shoujo is a demographic label for shows aimed at younger girls, so it makes sense that female users may rate such a show a bit higher than male users.

The question now is, how do we find these groups of users? It can be very difficult to discern what a user likes simply by looking at the shows they watch or the music they listen to. How do we determine whether one user is similar to another from the genres they tend to watch? One answer is unsupervised learning, a family of machine learning techniques that group data points by how similar they are. Unlike supervised methods, it does not need pre-classified examples; it learns from unlabeled data and assigns group labels itself. For this project, we will be using an unsupervised learning algorithm called hierarchical agglomerative clustering. To use this algorithm, we need data, and data is often not in a format suitable for learning algorithms, which is why it must be preprocessed first. Preprocessing is what we will cover in this blog post.
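
To make the idea concrete: agglomerative clustering starts with every data point in its own cluster and repeatedly merges the two closest clusters until the desired number remains. Below is a minimal, deliberately naive sketch on made-up data. The `toy_users` array and the `agglomerative` helper are hypothetical and written only for illustration; they are not the implementation used later in this series, and a real project would normally use a library implementation.

```python
import numpy as np

# Hypothetical data: each row is a user, each column the share of
# their watched shows in one genre (say action, romance, comedy).
toy_users = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.9, 0.0, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
])

def agglomerative(points, n_clusters):
    """Naive average-linkage agglomerative clustering (illustration only)."""
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest average distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = np.mean([
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                ])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        # Merge the closest pair and repeat.
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Turn the cluster membership lists into one label per point.
    labels = np.empty(len(points), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

labels = agglomerative(toy_users, n_clusters=2)
print(labels)
```

The three action-heavy users end up with one label and the two romance-heavy users with another, even though no labels were supplied up front.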


## Getting the Data
For this blog series, we will be analyzing a dataset from the website MyAnimeList. It can be found here: https://www.kaggle.com/marlesson/myanimelist-dataset-animes-profiles-reviews

This dataset contains information on anime titles, users, and reviews. For this project, we will only be looking at the titles and users data files.

To start, let's look at the `animes.csv` file.


In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

In [2]:
animes = pd.read_csv('../data/animes.csv')
# profiles = pd.read_csv('./data/profiles.csv')
# reviews = pd.read_csv('./data/reviews.csv')

In [3]:
animes.head()

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,img_url,link
0,28891,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...","Oct 4, 2015 to Mar 27, 2016",25.0,489888,141,25.0,8.82,https://cdn.myanimelist.net/images/anime/9/766...,https://myanimelist.net/anime/28891/Haikyuu_Se...
1,23273,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,"['Drama', 'Music', 'Romance', 'School', 'Shoun...","Oct 10, 2014 to Mar 20, 2015",22.0,995473,28,24.0,8.83,https://cdn.myanimelist.net/images/anime/3/671...,https://myanimelist.net/anime/23273/Shigatsu_w...
2,34599,Made in Abyss,The Abyss—a gaping chasm stretching down into ...,"['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F...","Jul 7, 2017 to Sep 29, 2017",13.0,581663,98,23.0,8.83,https://cdn.myanimelist.net/images/anime/6/867...,https://myanimelist.net/anime/34599/Made_in_Abyss
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...","['Action', 'Military', 'Adventure', 'Comedy', ...","Apr 5, 2009 to Jul 4, 2010",64.0,1615084,4,1.0,9.23,https://cdn.myanimelist.net/images/anime/1223/...,https://myanimelist.net/anime/5114/Fullmetal_A...
4,31758,Kizumonogatari III: Reiketsu-hen,After helping revive the legendary vampire Kis...,"['Action', 'Mystery', 'Supernatural', 'Vampire']","Jan 6, 2017",1.0,214621,502,22.0,8.83,https://cdn.myanimelist.net/images/anime/3/815...,https://myanimelist.net/anime/31758/Kizumonoga...


We can see that there are 12 feature columns: uid, title, synopsis, genre, aired, episodes, members, popularity, ranked, score, img_url, and link. They come in various forms: numbers, text, links, and dates.

### Removing Unnecessary Columns

For the purposes of learning user preferences from genres, a number of these columns are unnecessary. For starters, the URLs are not useful for machine learning. The air date doesn't factor into the genre of a show, so it can also be eliminated. We can see that popularity and ranked are actually derived from members and score, respectively: popularity is simply a ranking by the number of members an anime has, and ranked is a ranking by score. Finally, the synopsis is irrelevant for machine learning, at least with the algorithm that we are using. It might be useful for some kind of natural language processing task, but that is beyond the scope of this blog post.

So that leaves us with uid, title, genre, episodes, members, and score.

In [4]:
animes = animes.drop(['aired', 'img_url', 'link', 'synopsis', 'ranked', 'popularity'], axis=1)
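
One way to sanity-check the claim that `popularity` and `ranked` carry no extra information is to recompute the ranks and compare. The sketch below uses a small made-up frame (`toy`), not the real dataset, and assumes ties are broken with pandas' `min` method, which may not match how MyAnimeList ranks tied titles.

```python
import pandas as pd

# Made-up values, chosen only to illustrate the check.
toy = pd.DataFrame({
    'members':    [500, 300, 200, 100],
    'popularity': [1, 2, 3, 4],
    'score':      [9.23, 8.83, 8.81, 8.82],
    'ranked':     [1, 2, 4, 3],
})

# Rank 1 goes to the largest value, so rank in descending order.
members_rank = toy['members'].rank(ascending=False, method='min').astype(int)
score_rank = toy['score'].rank(ascending=False, method='min').astype(int)

popularity_matches = bool((members_rank == toy['popularity']).all())
ranked_matches = bool((score_rank == toy['ranked']).all())
print(popularity_matches, ranked_matches)
```

On the real data an exact match is unlikely, because the ranks were computed on the full site rather than on this sample, so a rank correlation would be a more forgiving check.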

### Dealing with Missing Values

We will need to figure out how to handle missing values. Typically missing values don't play well with learning algorithms, so we can take several approaches to dealing with them.

1. Drop the values (rows)
    - If the number of rows with null values is very small in proportion to the data set then they can be dropped.
2. Drop the column
    - If a column has a large share of missing values, say more than 40%, you can elect to drop that entire column. However, you should first consider whether the column contains important information that a machine learning algorithm could benefit from.
3. 'Fill in' the values (imputation)
    - Sometimes a missing value may represent an unknown. If the feature is categorical in nature, these null values could be represented as an 'unknown' category rather than a null value.
    - For numerical data, missing values can be filled in with the mean. While this keeps the mean the same, it shrinks the standard deviation and can cause other unwanted effects.
    - A similar approach can be done with categorical values, using the mode, or most common value instead. This comes with its own caveats as well.
4. Asking a domain expert
    - It may be the case that for your project, you have access to a 'domain expert', someone who is familiar with the data. They may be able to tell you what a missing value means, or how to fill in your data based on their past experience. However, a domain expert is not always available, and filling in data this way can be very time consuming.
5. Generating new values
    - We can use a different machine learning algorithm to fill in the missing values. Using the already known, labeled data the algorithm can learn to fill in the missing values.
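
The first three approaches can be sketched in pandas as follows. The `toy` frame and its `source` and `notes` columns are made up for illustration, and the 40% threshold is just the rule of thumb mentioned above, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in every column.
toy = pd.DataFrame({
    'episodes': [12.0, np.nan, 24.0, 12.0, 1.0],
    'score':    [7.5, 8.0, np.nan, 6.5, 7.0],
    'source':   ['Manga', None, 'Original', 'Manga', 'Manga'],
    'notes':    [None, None, None, 'rerun', None],
})

# 1. Drop rows that are missing a value in the columns we care about.
dropped_rows = toy.dropna(subset=['episodes', 'score'])

# 2. Drop columns whose share of missing values exceeds a threshold.
missing_share = toy.isna().mean()
dropped_cols = toy.loc[:, missing_share <= 0.4]

# 3. Impute: the mean for a numeric column, and an explicit
#    'Unknown' category for a categorical one.
imputed = toy.copy()
imputed['score'] = imputed['score'].fillna(imputed['score'].mean())
imputed['source'] = imputed['source'].fillna('Unknown')

print(len(dropped_rows), list(dropped_cols.columns))
```

Filling with the mode would look the same as the mean case, using `imputed['source'].mode()[0]` as the fill value.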

Let's take a look at which features have missing values.

In [5]:
animes.isna().sum()

uid           0
title         0
genre         0
episodes    706
members       0
score       579
dtype: int64

About 700 values are missing from episodes and about 600 from score. Let's compare that with the total number of entries in the dataset.

In [6]:
len(animes)

19311

The dataset has about 19,000 entries, so the rows with missing values make up around 5% of it. With so few rows affected, it's easiest to simply drop them. There are also good reasons to do so. If we look up the shows with missing episode counts, it becomes clear that these are shows that haven't aired yet. Their scores are therefore not meaningful, since a user can't rate a show that has no episodes. For the shows without a score, there are no user ratings to work with, so there is nothing to learn from them. For these reasons, we should drop both groups.

In [7]:
animes.dropna(inplace=True)
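
Adding up the per-column counts overstates how many rows `dropna` removes, because a row that is missing both values is only dropped once. The row counts printed in this post (19311 before, 18419 afterwards) suggest that 892 rows were removed, fewer than 706 + 579. A small sketch on made-up data shows how to count affected rows directly:

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 is missing both values.
toy = pd.DataFrame({
    'episodes': [12.0, np.nan, 24.0, np.nan, 1.0],
    'score':    [7.5, np.nan, np.nan, 6.5, 7.0],
})

# A row is affected if any of its columns is missing.
rows_with_missing = toy.isna().any(axis=1)

n_affected = int(rows_with_missing.sum())          # rows that dropna() would remove
share_affected = float(rows_with_missing.mean())   # the same, as a fraction of all rows
print(n_affected, share_affected)
```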

In [8]:
animes.head()

Unnamed: 0,uid,title,genre,episodes,members,score
0,28891,Haikyuu!! Second Season,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...",25.0,489888,8.82
1,23273,Shigatsu wa Kimi no Uso,"['Drama', 'Music', 'Romance', 'School', 'Shoun...",22.0,995473,8.83
2,34599,Made in Abyss,"['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F...",13.0,581663,8.83
3,5114,Fullmetal Alchemist: Brotherhood,"['Action', 'Military', 'Adventure', 'Comedy', ...",64.0,1615084,9.23
4,31758,Kizumonogatari III: Reiketsu-hen,"['Action', 'Mystery', 'Supernatural', 'Vampire']",1.0,214621,8.83


## Checking for Duplicated Data

Sometimes when gathering data, data can be duplicated. This can be a problem when trying to perform machine learning because duplicated data can cause biases or increase training time.

In [9]:
print(len(animes))
print(len(animes['uid'].unique()))

18419
15613


We can see that there are more entries than unique IDs, which means some shows appear more than once. It is important to run a duplicate check on a feature that is supposed to have unique values, such as this ID feature. Duplicated IDs can sometimes indicate a one-to-many relationship, where a show legitimately has more than one entry. In this case, the entries are simply duplicates, likely because the website was scraped multiple times. We will drop them before moving on to encoding the genre features.

In [10]:
animes = animes.drop_duplicates()
print(len(animes))

15690
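
Note that `drop_duplicates` only removes rows that match in every column. The count above (15690) is still a little higher than the number of unique IDs (15613), which suggests that a few IDs remain with rows that differ in some column. The sketch below, on made-up data, shows one way to inspect such rows. Keeping the first row per ID is only one possible policy and is an assumption here, not something done in the original notebook.

```python
import pandas as pd

# Made-up data: uid 1 is an exact duplicate, uid 2 differs in 'members'.
toy = pd.DataFrame({
    'uid':     [1, 1, 2, 2, 3],
    'title':   ['A', 'A', 'B', 'B', 'C'],
    'members': [100, 100, 200, 205, 300],
})

# Exact duplicates: every column matches an earlier row.
deduped = toy.drop_duplicates()

# IDs that still occur more than once after removing exact duplicates.
# keep=False marks every row of a repeated ID, not just the later ones.
conflicts = deduped[deduped.duplicated(subset='uid', keep=False)]
print(conflicts)

# One possible policy: keep the first row seen for each ID.
one_per_id = deduped.drop_duplicates(subset='uid', keep='first')
print(len(deduped), len(one_per_id))
```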


## Encoding the Genres as Features

The genre feature contains an array of the genres associated with a particular show. This is what is known as nested data, which is not the format ML algorithms typically expect. What we want is 'flat' data, where each genre is encoded as its own feature so that ML algorithms can understand and use the data to make predictions.

But first, we will have to deal with the improperly formatted data in the genre column.

In [11]:
type(animes.iloc[0].genre)

str

It may look like it's an array, but it is actually a string. This is because when the creator of the dataset saved their data into a csv, the arrays were converted into string format. We will have to convert it back before we can continue.

We will use the following function to convert the stringified array back into a proper array.

In [12]:
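# Converts a stringified Python literal (e.g. "['Comedy', 'Sports']") back
# into the object it represents. Strings that are not valid literals are
# wrapped in quotes so that they come back as plain strings.
import ast

def perfectEval(anonstring):
    try:
        return ast.literal_eval(anonstring)
    except ValueError:
        # Not a literal: quote it so literal_eval returns it as a string
        corrected = "'" + anonstring + "'"
        return ast.literal_eval(corrected)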

In [13]:
animes['genre'] = animes['genre'].apply(perfectEval)

In [14]:
type(animes.iloc[0].genre)

list

If we look at a genre entry, we can see that it has changed from a string to a list, which is what we want. The function above is handy for this kind of task, which comes up often when working with data that has been saved to CSV, so it is worth keeping around for future use.
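
As a small standalone illustration of why `ast.literal_eval` is a reasonable choice here, the sketch below parses a made-up genre string and shows that anything other than a plain literal is rejected rather than executed, which is the main difference from the built-in `eval`.

```python
import ast

raw = "['Comedy', 'Sports', 'Drama']"

parsed = ast.literal_eval(raw)
print(type(raw).__name__, '->', type(parsed).__name__)
print(parsed[0])

# literal_eval only accepts Python literals (strings, numbers, lists, ...),
# so an expression such as a function call raises instead of running.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except (ValueError, SyntaxError):
    rejected = True
print('rejected:', rejected)
```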

### Creating the Feature Labels
Right now, we don't know how many genres there are. We need a list of them in order to create a feature label for every genre associated with a show. We can build it by collecting the genre entries of every show into one list and then finding the unique values. We could use `.unique()` to get them, but I will use a `Counter` object instead. Not only does it give the unique values, it also counts how many times each one appears. This is useful for sorting the genres by how many shows have them, which will be helpful later on when we graph the data.

In [15]:
# get all the genres of every title
genres = []
for entry in animes['genre']:
    for genre in entry:
        genres.append(genre)

In [16]:
# count the different genres from all the titles
from collections import Counter
genre_count = Counter(genres)

A `Counter` works like a dictionary: each unique item is a key, and its value is the number of times that item occurs.
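
A quick illustration on a made-up list (`toy_genres` is not taken from the dataset):

```python
from collections import Counter

toy_genres = ['Comedy', 'Action', 'Comedy', 'Drama', 'Action', 'Comedy']
toy_count = Counter(toy_genres)

print(toy_count['Comedy'])        # 3
print(toy_count.most_common(2))   # [('Comedy', 3), ('Action', 2)]
print(toy_count['Horror'])        # 0, missing keys count as zero instead of raising
```

`most_common(n)` returns `(item, count)` pairs ordered from most to least frequent, which is what makes it convenient for picking the top genres.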

In [17]:
# Sort the genres
top_genres = []
for item in genre_count.most_common(50):
    # item[0] contains the genre itself. item[1] is the count of the genre
    top_genres.append(item[0])

Now that we have the genre feature labels, we can begin encoding the genre features. We will turn each show's genres into a binary encoding: if a show has a particular genre, that genre's column is set to 1, and to 0 otherwise.

In [18]:
# Create feature columns for each of the top genres and encode each show's genres in these features
def encode_genre(genre, genre_list):
    if genre in genre_list:
        return 1
    return 0

In [19]:
for genre_feat in top_genres:
    animes[genre_feat] = animes['genre'].apply(lambda x: encode_genre(genre_feat, x))
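
For comparison, the same kind of binary encoding can be produced without an explicit loop. This is an alternative sketch on made-up data, not the approach used in this post, and it encodes every genre it finds rather than only the top 50.

```python
import pandas as pd

# Made-up frame with a list-valued 'genre' column.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genre': [['Comedy', 'Sports'], ['Drama'], ['Comedy', 'Drama']],
})

# explode() yields one row per (show, genre) pair and keeps the original index.
exploded = toy['genre'].explode()

# get_dummies() creates one 0/1 column per genre; taking the max per
# original index collapses the pairs back to one row per show.
encoded = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

result = toy.join(encoded)
print(result)
```

The explicit loop is easier to follow and makes it simple to restrict the columns to a chosen genre list, so either approach is reasonable for a dataset of this size.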

In [20]:
animes

Unnamed: 0,uid,title,genre,episodes,members,score,Comedy,Action,Fantasy,Adventure,...,Police,Samurai,Vampire,Cars,Thriller,Josei,Shounen Ai,Shoujo Ai,Yaoi,Yuri
0,28891,Haikyuu!! Second Season,"[Comedy, Sports, Drama, School, Shounen]",25.0,489888,8.82,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,23273,Shigatsu wa Kimi no Uso,"[Drama, Music, Romance, School, Shounen]",22.0,995473,8.83,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,34599,Made in Abyss,"[Sci-Fi, Adventure, Mystery, Drama, Fantasy]",13.0,581663,8.83,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,5114,Fullmetal Alchemist: Brotherhood,"[Action, Military, Adventure, Comedy, Drama, M...",64.0,1615084,9.23,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
4,31758,Kizumonogatari III: Reiketsu-hen,"[Action, Mystery, Supernatural, Vampire]",1.0,214621,8.83,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19002,10075,Naruto x UT,"[Action, Comedy, Super Power, Martial Arts, Sh...",1.0,34155,7.50,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
19003,35828,Miira no Kaikata,"[Slice of Life, Comedy, Supernatural]",12.0,61459,7.50,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19004,10378,Shinryaku!? Ika Musume,"[Slice of Life, Comedy, Shounen]",12.0,67422,7.56,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19005,33082,Kingsglaive: Final Fantasy XV,[Action],1.0,41077,7.56,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


Now that we have our animes dataset formatted correctly, we should save it to a file, in case we mess something up or we lose the data.

In [29]:
animes.to_csv('../data/animes_clean.csv', index=False)

We can also save the genres labels for future use.

In [101]:
import pickle
with open('../data/genres.pickle', 'wb') as f:
    pickle.dump(top_genres, f)
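
One thing to keep in mind when loading these files again: CSV stores the genre lists as text, so they come back as strings, just like in the original dataset. The sketch below round-trips a made-up frame through a temporary directory and parses the lists on load. The file names and the `toy` data are hypothetical.

```python
import ast
import os
import pickle
import tempfile

import pandas as pd

# Made-up stand-ins for the cleaned frame and the genre labels.
toy = pd.DataFrame({'uid': [1, 2], 'genre': [['Comedy'], ['Action', 'Drama']]})
toy_genres = ['Comedy', 'Action', 'Drama']

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'toy_clean.csv')
    pickle_path = os.path.join(tmp, 'toy_genres.pickle')

    toy.to_csv(csv_path, index=False)
    with open(pickle_path, 'wb') as f:
        pickle.dump(toy_genres, f)

    # The converter turns the stringified lists back into real lists.
    reloaded = pd.read_csv(csv_path, converters={'genre': ast.literal_eval})
    with open(pickle_path, 'rb') as f:
        reloaded_genres = pickle.load(f)

print(type(reloaded.loc[0, 'genre']).__name__)
print(reloaded_genres == toy_genres)
```

Since the encoded genre columns are already saved as 0/1 values, parsing the `genre` column again is only needed if the lists themselves are used later.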

Next, we will move on to cleaning the profiles dataset.