# Combining the Two Datasets
In this post, we will combine the two datasets we created last time into one. Right now, the shows are in a genre representation and each user has a list of favorites. We need to combine the two so that each user has an overall representation of the genres they like to watch. To do this, we will create a vector that captures the distribution of genres across the shows a user has favorited.

## Vectorizing User Favorites
We will use each user's favorites_anime array to fetch the matching genre rows from the animes dataframe, then sum those rows to create the final vector. In effect, we are counting how many times each genre appears across all the shows a user has favorited. Genres that appear in more of a user's favorites get a higher number, while the others get a lower number or zero. In other words, the vector is a numerical representation of a user's genre preferences.
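As a rough illustration of the summing step, here is a plain-Python sketch on made-up data. The show IDs, the genre lists, and the `genre_vector` helper are all hypothetical; they stand in for the real dataframes, which store one 0/1 column per genre.

```python
from collections import Counter

# Hypothetical toy data: show id -> genres that show belongs to.
show_genres = {
    "1": ["Action", "Drama"],
    "2": ["Action", "Comedy"],
    "3": ["Romance"],
}

def genre_vector(favorites, genres_by_show):
    """Count how often each genre appears across a user's favorites."""
    counts = Counter()
    for show_id in favorites:
        # Favorites that are missing from the lookup are skipped.
        counts.update(genres_by_show.get(show_id, []))
    return counts

user_vector = genre_vector(["1", "2"], show_genres)
print(user_vector["Action"], user_vector["Drama"], user_vector["Romance"])
```

A user who favorited shows 1 and 2 ends up with a count of 2 for Action, 1 for Drama, and 0 for Romance.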

To begin, let's load the data we saved last time.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/profiles_clean.csv', index_col=0)

We will need to convert the arrays once again, since saving to a CSV file turns arrays into strings.

In [3]:
import ast
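def perfectEval(anonstring):
    # Parse the string form of a list back into a Python list.
    try:
        return ast.literal_eval(anonstring)
    except ValueError:
        # Fall back to treating the value as a plain string literal.
        corrected = "'" + anonstring + "'"
        return ast.literal_eval(corrected)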

In [4]:
df['favorites_anime'] = df['favorites_anime'].apply(perfectEval)
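To see why this conversion is needed, here is a small sketch of the round trip. It uses the standard-library `csv` module as a stand-in for the pandas save and load, and the user name is made up for illustration.

```python
import ast
import csv
import io

# Write a row whose second field is a Python list; the csv module stores
# non-string values as their str() form.
buffer = io.StringIO()
csv.writer(buffer).writerow(["example_user", ["5114", "23273"]])

# Read the row back: the list is now just a string.
buffer.seek(0)
profile, raw_favorites = next(csv.reader(buffer))
print(type(raw_favorites).__name__)

# ast.literal_eval parses that string back into a real list.
favorites = ast.literal_eval(raw_favorites)
print(favorites)
```

Note that `ast.literal_eval` only rebuilds the container. The elements keep whatever type they had inside the string, so quoted IDs come back as strings.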

## Converting Favorited Anime into a Genre Representation
We will now create a function that takes a user's favorites array and converts it into a vector. Before we can do that, we have to check the data types of the animes uid column and the favorites_anime entries.

In [5]:
animes = pd.read_csv('../data/animes_clean.csv')

In [6]:
animes.head()

| Unnamed: 0 | uid | title | genre | episodes | members | score | Comedy | Action | Fantasy | Adventure | ... | Police | Samurai | Vampire | Cars | Thriller | Josei | Shounen Ai | Shoujo Ai | Yaoi | Yuri |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 28891 | Haikyuu!! Second Season | ['Comedy', 'Sports', 'Drama', 'School', 'Shoun... | 25.0 | 489888 | 8.82 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 23273 | Shigatsu wa Kimi no Uso | ['Drama', 'Music', 'Romance', 'School', 'Shoun... | 22.0 | 995473 | 8.83 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 34599 | Made in Abyss | ['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F... | 13.0 | 581663 | 8.83 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 5114 | Fullmetal Alchemist: Brotherhood | ['Action', 'Military', 'Adventure', 'Comedy', ... | 64.0 | 1615084 | 9.23 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 31758 | Kizumonogatari III: Reiketsu-hen | ['Action', 'Mystery', 'Supernatural', 'Vampire'] | 1.0 | 214621 | 8.83 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [7]:
print('Animes uid datatype:')
type(animes.iloc[0]['uid'])

Animes uid datatype:


numpy.int64

In [8]:
print('favorites_anime datatype: ')
type(df.iloc[0]['favorites_anime'][0])

favorites_anime datatype: 


str

Here, we can see that the uid is a number, while the favorites_anime array holds strings. This is probably because the function that converted each string into an array produced an array of strings rather than an array of numbers as intended. We can remedy this by converting the uid column to strings.

In [9]:
animes['uid'] = animes['uid'].astype('str')
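The sketch below shows why the types have to agree before we look up favorites. It uses made-up stand-ins for the uid column and one user's favorites list (the variable names are illustrative only) and checks membership with `isin`.

```python
import pandas as pd

# Toy stand-ins for animes['uid'] and one user's favorites_anime list.
uids = pd.Series([28891, 23273, 34599])
favorites = ["23273", "34599"]

# Integer uids never equal string favorites, so nothing matches.
matches_before = int(uids.isin(favorites).sum())
print(matches_before)

# After casting the uids to strings, both favorites are found.
matches_after = int(uids.astype(str).isin(favorites).sum())
print(matches_after)
```

Converting the favorites to integers instead would work just as well; what matters is that both sides share one type.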

In [10]:
def genre_sum(user):
    # Select the rows for the shows this user has favorited.
    favs = animes[animes['uid'].isin(user['favorites_anime'])]
    # Keep only the genre columns; drop returns a new dataframe, so animes is untouched.
    favs = favs.drop(['uid', 'title', 'genre', 'episodes', 'members', 'score'], axis=1)
    # Sum each genre column to count how often it appears in the favorites.
    return favs.sum()

In [11]:
# Build one row of genre counts per user; 'expand' turns each returned Series into columns.
user_sum = df.apply(genre_sum, axis='columns', result_type='expand')
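To make the shape of the result concrete, here is a miniature run of the same pattern. The two toy dataframes and the `toy_genre_sum` helper are hypothetical and only mirror the structure of the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the animes and profiles dataframes.
toy_animes = pd.DataFrame({
    "uid": ["1", "2", "3"],
    "Action": [1, 1, 0],
    "Drama": [1, 0, 0],
    "Romance": [0, 0, 1],
})
toy_users = pd.DataFrame({
    "profile": ["user_a", "user_b"],
    "favorites_anime": [["1", "2"], ["3"]],
})

def toy_genre_sum(user):
    # Select the user's favorite shows and add up their genre flags.
    favs = toy_animes[toy_animes["uid"].isin(user["favorites_anime"])]
    return favs.drop(columns=["uid"]).sum()

# One row per user, one column per genre.
toy_sums = toy_users.apply(toy_genre_sum, axis="columns", result_type="expand")
print(toy_sums)
```

The result keeps the index of the users dataframe, which is what lets it be joined column-wise with the profile data.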

In [12]:
genre_total = pd.concat([df, user_sum], axis='columns')

In [13]:
genre_total.drop('favorites_anime', axis=1, inplace=True)

In [14]:
genre_total

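| Unnamed: 0 | profile | gender | Comedy | Action | Fantasy | Adventure | Drama | Sci-Fi | Kids | Shounen | ... | Police | Samurai | Vampire | Cars | Thriller | Josei | Shounen Ai | Shoujo Ai | Yaoi | Yuri |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10861 | -----noname----- |  | 3.0 | 3.0 | 1.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 74177 | ---SnowFlake--- |  | 1.0 | 3.0 | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 79974 | --Mizu-- | Female | 2.0 | 1.0 | 2.0 | 1.0 | 2.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 43928 | --Sunclaudius | Male | 0.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2829 | --animeislife-- | Female | 2.0 | 6.0 | 2.0 | 1.0 | 4.0 | 2.0 | 0.0 | 3.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 76009 | zzSorazz |  | 3.0 | 1.0 | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 22012 | zzeroparticle | Male | 0.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 65485 | zzs | Female | 2.0 | 4.0 | 0.0 | 2.0 | 2.0 | 2.0 | 0.0 | 3.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 56449 | zzzb | Male | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |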


In [16]:
genre_total.to_csv('../data/dataset.csv', index=False)