# Analysis of Clustered Users
For the final entry in this series, we will be analyzing the clusterings that we obtained in the last post. We last left off with 5 distinct groupings of users.
To better compare user preferences, we will convert each group's genre counts into percentages. For each group, we sum the favorites in every genre and then divide by the group's total number of favorites across all genres. The resulting genre distribution is shown as a bar plot. We then compare each group's percentages with the percentages for all users to determine which genres stand out for that group.
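
As a minimal sketch of that calculation, the snippet below uses a small made-up table instead of the real data. The `toy` frame and its `action` and `comedy` columns are hypothetical and only illustrate the arithmetic.

```python
import pandas as pd

# Hypothetical toy data: one row per user, one column per genre count.
toy = pd.DataFrame({
    "group": [0, 0, 1],
    "action": [3, 1, 0],
    "comedy": [1, 1, 4],
})
genre_cols = ["action", "comedy"]

# Sum each genre within a group, then divide by that group's total favorites.
group_counts = toy.groupby("group")[genre_cols].sum()
group_pct = group_counts.div(group_counts.sum(axis=1), axis=0)

# Same calculation over every user, regardless of group.
all_counts = toy[genre_cols].sum()
all_pct = all_counts / all_counts.sum()

print(group_pct)
print(all_pct)
```

Each row of `group_pct` sums to 1, so groups of very different sizes can be compared on the same scale.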

In [1]:
import pandas as pd
import plotly.express as px

In [2]:
final_df = pd.read_csv('../data/clustered_df.csv')

In [3]:
fig = px.pie(final_df.groupby('group').count().reset_index(), names='group', values='profile', title='Distribution of Groups')
fig.show()

To compare a group's genre distribution against all users, we divide the group's percentages by the percentages for all users. This becomes a problem when a genre has 0 favorites, since dividing by zero leaves the ratio undefined. To prevent this, we add 1 to each genre's favorite count before converting the counts to percentages.
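
The effect of adding 1 can be seen in a small sketch with made-up counts. The series below and the `smoothed_share` helper are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical genre counts; "yuri" has no favorites at all.
group_counts = pd.Series({"action": 8, "comedy": 2, "yuri": 0})
all_counts = pd.Series({"action": 50, "comedy": 50, "yuri": 0})


def smoothed_share(counts: pd.Series) -> pd.Series:
    """Add 1 to every genre count, then normalize so the shares sum to 1."""
    smoothed = counts + 1
    return smoothed / smoothed.sum()


# Without smoothing, the empty genre yields 0 / 0, which pandas reports as NaN.
raw_ratio = (group_counts / group_counts.sum()) / (all_counts / all_counts.sum())

# With smoothing, every genre has a finite, positive ratio.
smooth_ratio = smoothed_share(group_counts) / smoothed_share(all_counts)

print(raw_ratio)
print(smooth_ratio)
```

Adding 1 shifts every share slightly, so the effect is negligible for popular genres and only noticeable for genres with very few favorites.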

In [21]:
all_users = final_df.drop(['profile', 'group', 'male'], axis=1)
# Add 1 to every genre count, then normalize by the smoothed total so the shares sum to 1
all_users = (all_users.sum() + 1) / (all_users.sum() + 1).sum()

In [22]:
fig = px.pie(final_df[final_df['group'] == 0].groupby('male').count().reset_index(), names='male', values='profile', 
        title='Gender of Group 0')
fig.show()

In [34]:
group_0 = final_df[final_df['group'] == 0].drop(['profile', 'group', 'male'], axis=1)
group_0 = (group_0.sum() + 1) / (group_0.sum() + 1).sum()
px.bar(group_0, color_discrete_sequence=['#00CC96'], title='Group 0 Favorites Distribution')

Group 0 has a gender distribution of roughly 1/4 female and 3/4 male. When plotted side-by-side against the average percentage of all user favorites, we can see that samurai and historical genres stand out considerably. We can further highlight this difference by taking a ratio of group percentages against all user percentages.

In [35]:
group_0_diff = group_0 / all_users
px.bar(group_0_diff, color_discrete_sequence=['#00CC96'], title='Group 0 Favorites Ratio')

When we take a ratio of the favorites percentages, a ratio of 1 means that the group favorites a genre about as often as all users do. A ratio between 0 and 1 means that the group favorites a genre less often than average, and a ratio higher than 1 means that the group favorites it more often than average. Looking at the above graph, we can see that this group of users also favors the cars, kids, martial arts, and parody genres. They favorite hentai, josei, shounen/shoujo ai, yaoi, and yuri far less often than average.
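
To make the reading of these ratios concrete, here is a minimal sketch with invented shares. The numbers and the 0.2 tolerance band around 1 are assumptions chosen for illustration, not values from the data.

```python
import pandas as pd

# Hypothetical genre shares, for illustration only.
group_share = pd.Series({"samurai": 0.06, "comedy": 0.20, "josei": 0.01})
all_share = pd.Series({"samurai": 0.02, "comedy": 0.20, "josei": 0.04})

# Ratio > 1: favorited more often than average; ratio < 1: less often.
ratio = (group_share / all_share).sort_values(ascending=False)

# Treat ratios within 20% of 1 as "about average" (an arbitrary cutoff).
tolerance = 0.2
over_represented = ratio[ratio > 1 + tolerance].index.tolist()
under_represented = ratio[ratio < 1 - tolerance].index.tolist()

print(ratio)
print("Over-represented:", over_represented)
print("Under-represented:", under_represented)
```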

In [None]:
px.pie(final_df[final_df['group'] == 1].groupby('male').count().reset_index(), names='male', values='profile', title='Gender of Group 1')

In [32]:
group_1 = final_df[final_df['group'] == 1].drop(['profile', 'group', 'male'], axis=1)
group_1 = (group_1.sum() + 1) / (group_1.sum() + 1).sum()
px.bar(group_1, color_discrete_sequence=['#EF553B'] , title='Group 1 Favorites Distribution')

For group 1, it's rather interesting that most of this group's favorites are very similar to those of the average user. This shows up as many of the genre ratios being close to 1. Of note is the higher value for the 'dementia' theme, which is something of a catch-all term for weird and experimental shows.

In [31]:
px.bar(group_1 / all_users, color_discrete_sequence=['#EF553B'] , title='Group 1 Favorites Ratio')
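
Since the same steps repeat for every group, they could be wrapped in a small helper. The sketch below is one possible way to do that, not the notebook's own code: `smoothed_distribution`, `group_ratio`, and the `toy` frame are hypothetical, and only the `profile`, `group`, and `male` column names come from the cells above.

```python
import pandas as pd

# Non-genre columns, matching the columns dropped in the cells above.
META_COLS = ["profile", "group", "male"]


def smoothed_distribution(df: pd.DataFrame) -> pd.Series:
    """Genre shares with 1 added to every genre count before normalizing."""
    counts = df.drop(columns=META_COLS).sum() + 1
    return counts / counts.sum()


def group_ratio(df: pd.DataFrame, group: int) -> pd.Series:
    """Ratio of one group's genre shares to the shares of all users."""
    return smoothed_distribution(df[df["group"] == group]) / smoothed_distribution(df)


# Hypothetical toy data standing in for final_df.
toy = pd.DataFrame({
    "profile": ["a", "b", "c", "d"],
    "group": [0, 0, 1, 1],
    "male": [1, 0, 1, 1],
    "action": [4, 2, 0, 1],
    "comedy": [1, 1, 5, 3],
})

ratios = {int(g): group_ratio(toy, int(g)) for g in sorted(toy["group"].unique())}
print(ratios[0])
```

Each resulting series could then be passed to `px.bar` in the same way as the individual cells do.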

In [None]:
px.pie(final_df[final_df['group'] == 2].groupby('male').count().reset_index(), names='male', values='profile',
        title='Gender of Group 2')

In [28]:
group_2 = final_df[final_df['group'] == 2].drop(['profile', 'group', 'male'], axis=1)
group_2 = (group_2.sum() + 1) / (group_2.sum() + 1).sum()

In [29]:
px.bar(group_2, color_discrete_sequence=['#636EFA'], title='Group 2 Favorites Distribution')

In [36]:
px.bar(group_2 / all_users, color_discrete_sequence=['#636EFA'], title='Group 2 Favorites Ratio')

Group 2 has even fewer features that stand out against all users. Most of the genres are more or less equal in proportion to those of the average user. It is likely that this group collects the users who lack any well-defined features that would put them in one of the other groups.

In [None]:
px.pie(final_df[final_df['group'] == 3].groupby('male').count().reset_index(), names='male', values='profile',
        title='Gender of Group 3')

In [39]:
group_3 = final_df[final_df['group'] == 3].drop(['profile', 'group', 'male'], axis=1)
group_3 = (group_3.sum() + 1) / (group_3.sum() + 1).sum()
px.bar(group_3, color_discrete_sequence=['#FFA15A'] , title='Group 3 Favorites Distribution')

In [41]:
px.bar(group_3 / all_users, color_discrete_sequence=['#FFA15A'] , title='Group 3 Favorites Ratio')

For group 3 we have an extremely distinct group. This group basically accounts for all hentai, shoujo/shounen ai, yaoi, and yuri favorites. It is no surprise that the clustering algorithm put them in their own group given how unique these users are.

In [None]:
px.pie(final_df[final_df['group'] == 4].groupby('male').count().reset_index(), names='male', values='profile',
        title='Gender of Group 4')

In [42]:
group_4 = final_df[final_df['group'] == 4].drop(['profile', 'group', 'male'], axis=1)
group_4 = (group_4.sum() + 1) / (group_4.sum() + 1).sum()
px.bar(group_4, color_discrete_sequence=['#AB63FA'] , title='Group 4 Favorites Distribution')

In [43]:
px.bar(group_4 / all_users, color_discrete_sequence=['#AB63FA'] , title='Group 4 Favorites Ratio')

Finally, we have group 4. This group is distinct in that almost half of the users identify as female. We can see from the genre ratio graph that josei, shounen/shoujo ai, yaoi, harem, and hentai have high ratios. This seems to correlate with the higher share of female users, possibly because genres such as josei are aimed at a female audience.

There are many more ways that we could improve the clusterings or use them for other purposes. We could try balancing the number of male and female users by either undersampling male users or oversampling female users. This would likely change how the clusters form. Another thing we could try is using our grouped data to classify future users without clustering techniques. Instead, we could use a supervised learning algorithm such as logistic regression to predict which of our five user groups a new user falls under. There are nearly limitless possibilities for how to use this data. I hope that this blog series can help others learn more about machine learning or provide inspiration for future projects.
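
As a rough sketch of the undersampling idea, the snippet below balances a made-up table by gender using pandas alone. The `users` frame is hypothetical, and it assumes that `male` is coded as 1 for male and 0 for female users, which the posts do not confirm.

```python
import pandas as pd

# Hypothetical stand-in for final_df; 'male' is assumed to be 1 for male, 0 for female.
users = pd.DataFrame({
    "profile": [f"user_{i}" for i in range(10)],
    "male": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
})

# Undersample: keep as many rows per gender as the smaller class has.
n_minority = users["male"].value_counts().min()
balanced = users.groupby("male").sample(n=n_minority, random_state=0)

print(balanced["male"].value_counts())
```

The clustering would then be rerun on `balanced` rather than on the full table, at the cost of discarding some male users.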