# Recommender system

Carlos Pinto Pérez
## Exploratory Data Analysis

In [2]:
import pandas as pd
import numpy as np

### Users lists:

Users who have problems accessing their data

In [2]:
bad_users = pd.read_csv('bad_users.csv', names=['user_name'], header=0)
print(f'Bad users shape: {bad_users.shape}')
print(f'Unique users: {len(bad_users.user_name.unique())}')
bad_users.head()

Bad users shape: (354, 1)
Unique users: 354


Unnamed: 0,user_name
0,young_kappa
1,Meew2
2,no_forehead
3,Alejandro13
4,kanno_


Clean users

In [3]:
good_users = pd.read_csv('good_users.csv', names=['user_name'], header=0)
print(f'Good users shape: {good_users.shape}')
print(f'Unique users: {len(good_users.user_name.unique())}')
good_users.head()

Good users shape: (11156, 1)
Unique users: 11156


Unnamed: 0,user_name
0,TheCriticsClub
1,Polyphemus
2,qrdel
3,Cobbles
4,Aja


### Mangas (items)

In [3]:
mangas = pd.read_csv('mangas.csv')
print(f'Mangas shape: {mangas.shape}')
print(f'Unique manga names: {len(mangas.manga_name.unique())}')
print(f'Unique manga ids: {len(mangas.manga_id.unique())}')
mangas.head()

Mangas shape: (20000, 3)
Unique manga names: 19249
Unique manga ids: 20000


Unnamed: 0,manga_id,manga_name,manga_rank
0,2,Berserk,1
1,1706,JoJo no Kimyou na Bouken Part 7: Steel Ball Run,2
2,25,Fullmetal Alchemist,3
3,13,One Piece,4
4,1,Monster,5


In [34]:
mangas['manga_name'].value_counts()

Red                                  5
Kanon                                5
Hatsukoi                             4
Zero                                 4
Clover                               4
Boyfriend                            4
Blue                                 4
Kurogane                             4
Honey                                3
Trickster                            3
Cherish                              3
Himawari                             3
Nostalgia                            3
Nurarihyon no Mago                   3
Taimadou Gakuen 35 Shiken Shoutai    3
Ningen Shikkaku                      3
Pinocchio                            3
Love Letter                          3
Slow Starter                         3
Ginga Eiyuu Densetsu                 3
Seirei Gensouki                      3
Step                                 3
Kitsune no Yomeiri                   3
Seikai no Monshou                    3
IS: Infinite Stratos                 3
Orange                   

#### Careful: manga names are not unique!

In [32]:
mangas[mangas['manga_name'] == 'Red']

Unnamed: 0,manga_id,manga_name,manga_rank
2193,12713,Red,2194
2201,24665,Red,2202
10198,4191,Red,10199
16830,8588,Red,16831
18286,18653,Red,18287
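A minimal sketch of how all repeated names could be listed at once, instead of checking them one by one. The toy frame `mangas_demo` is a hypothetical stand-in for `mangas` (three of the 'Red' ids come from the table above); `duplicated(keep=False)` marks every row whose name occurs more than once.

```python
import pandas as pd

# Hypothetical stand-in for mangas.csv: three 'Red' rows plus one unique title
mangas_demo = pd.DataFrame({
    'manga_id': [12713, 24665, 4191, 2],
    'manga_name': ['Red', 'Red', 'Red', 'Berserk'],
})

# keep=False flags every occurrence of a repeated name, not only the later ones
dup_mask = mangas_demo['manga_name'].duplicated(keep=False)
duplicated_names = mangas_demo[dup_mask].sort_values('manga_name')

# The ids stay unique, so manga_id is the safe key for merges
ids_are_unique = mangas_demo['manga_id'].is_unique
print(duplicated_names)
print(f'manga_id unique: {ids_are_unique}')
```

This is why every merge later in the notebook is done on `manga_id` and never on `manga_name`.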


### Reviews

Note that MyAnimeList mangas (items) can be enumerated by id (1, 2, 3...), but I scraped them in the order of the MyAnimeList top manga list. Users don't have enumerated ids, so there is no straightforward way to list them all. Users were therefore collected with the criterion that they are the authors of the public reviews written for the top 20,000 mangas. With that, I scraped 11,156 unique users and the list of scores that each of them published.

Review scores are more valuable than ordinary scores, but there are far fewer of them. I also scraped a 'weight' measure: the number of likes each review received. It could give better insights into the data. However, for simplicity, I didn't use it.

In [35]:
reviews = pd.read_csv('reviews_scores.csv')
print(f'Reviews shape: {reviews.shape}')
reviews.head()

Reviews shape: (21987, 4)


Unnamed: 0,manga_id,score,user,weight
0,2,10,TheCriticsClub,1315
1,2,7,Polyphemus,771
2,2,10,qrdel,704
3,2,10,Cobbles,322
4,2,10,Aja,127
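The 'weight' column is not used anywhere in this notebook. As an illustration only, one possible use would be a like-weighted mean score per manga. The sketch below is an assumption about how that could look, not part of the pipeline: `reviews_demo` copies the five rows shown above, and adding 1 to each weight (so that reviews with zero likes still count) is my own choice.

```python
import pandas as pd

# The five review rows printed above, used as a toy sample
reviews_demo = pd.DataFrame({
    'manga_id': [2, 2, 2, 2, 2],
    'score': [10, 7, 10, 10, 10],
    'user': ['TheCriticsClub', 'Polyphemus', 'qrdel', 'Cobbles', 'Aja'],
    'weight': [1315, 771, 704, 322, 127],
})

# Assumption: likes + 1, so a review with no likes keeps a small weight
w = reviews_demo['weight'] + 1

# Weighted mean per manga = sum(score * w) / sum(w)
numerator = (reviews_demo['score'] * w).groupby(reviews_demo['manga_id']).sum()
denominator = w.groupby(reviews_demo['manga_id']).sum()
weighted_mean = numerator / denominator

plain_mean = reviews_demo.groupby('manga_id')['score'].mean()
print(weighted_mean)
print(plain_mean)
```

In this sample the weighted mean ends up below the plain mean, because the only score of 7 belongs to one of the most-liked reviews.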


### Scores

About 630k scores from the 11k unique users obtained, a mean of roughly 57 scores per user (633,718 / 11,156). This indicates that the chosen users are really prolific.

In [34]:
scores = pd.read_csv('scores.csv')
print(f'Scores shape: {scores.shape}')
scores.head()

Scores shape: (633718, 3)


Unnamed: 0,manga_id,score,user
0,2,10,TheCriticsClub
1,4,10,TheCriticsClub
2,1067,4,Polyphemus
3,13,8,Polyphemus
4,682,2,Polyphemus


## Data processing

Merge items data with their scores

In [35]:
ratings = pd.merge(mangas, scores, on='manga_id')
print(f'Ratings shape: {ratings.shape}')
ratings.head()

Ratings shape: (548675, 5)


Unnamed: 0,manga_id,manga_name,manga_rank,score,user
0,2,Berserk,1,10,TheCriticsClub
1,2,Berserk,1,7,Polyphemus
2,2,Berserk,1,10,qrdel
3,2,Berserk,1,10,Aja
4,2,Berserk,1,6,Tumerking


Discard the manga names and other fields. Keep in mind that manga names can be duplicated, but their ids cannot.

In [36]:
ratings = ratings[['manga_id', 'user', 'score']]
ratings.head()

Unnamed: 0,manga_id,user,score
0,2,TheCriticsClub,10
1,2,Polyphemus,7
2,2,qrdel,10
3,2,Aja,10
4,2,Tumerking,6


Count the number of scores of each manga. Is it possible that there are mangas with no scores, or with only a few of them?

In [37]:
# Dict with (key, value): (manga_id, scores count)
item_scores = {}
for item in ratings.manga_id:
    # dict.get returns 0 the first time a manga_id is seen
    item_scores[item] = item_scores.get(item, 0) + 1

df_item_scores = pd.DataFrame(data = {'manga_id': list(item_scores.keys()), 'number_scores': list(item_scores.values())})
df_item_scores.head()

Unnamed: 0,manga_id,number_scores
0,2,1913
1,1706,532
2,25,2054
3,13,2228
4,1,981
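The same count can be obtained without an explicit Python loop. This is a minimal sketch on a toy frame (`ratings_demo` and `df_item_scores_demo` are hypothetical names); `sort=False` keeps the ids in order of first appearance, which is the order the dictionary approach produces.

```python
import pandas as pd

# Toy stand-in for the ratings DataFrame
ratings_demo = pd.DataFrame({
    'manga_id': [2, 2, 2, 1706, 25, 25],
    'user': ['a', 'b', 'c', 'a', 'b', 'c'],
    'score': [10, 7, 10, 9, 8, 6],
})

# size() counts the rows of each group; reset_index turns the result into two columns
df_item_scores_demo = (
    ratings_demo.groupby('manga_id', sort=False)
    .size()
    .reset_index(name='number_scores')
)
print(df_item_scores_demo)
```

On the full 500k-row table this vectorized version should also be noticeably faster than iterating over the column.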


Unique mangas with at least 20 scores:

In [7]:
sum(df_item_scores['number_scores'] > 19)

4248

In [39]:
mangas_filtered = df_item_scores[df_item_scores['number_scores'] > 19]
print(f'Mangas filtered shape: {mangas_filtered.shape}')
mangas_filtered.head()

Mangas filtered shape: (4248, 2)


Unnamed: 0,manga_id,number_scores
0,2,1913
1,1706,532
2,25,2054
3,13,2228
4,1,981


Mean scores

In [43]:
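# Select the numeric column explicitly: averaging the string column 'user' fails in recent pandas versions
mangas_mean = ratings.groupby('manga_id')[['score']].mean()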
mangas_mean.columns = ['mean_score']
mangas_mean

manga_id,mean_score
1,8.928644
2,9.003659
3,8.764706
4,8.618705
7,8.453782
8,7.990812
9,8.117834
10,8.172285
11,7.545304
12,7.011872


Filter the mangas to keep only the ones with at least 20 scores. This matters not only for accuracy but also for computational load.

In [44]:
mangas_cleaned = pd.merge(mangas, mangas_filtered, on=['manga_id'], how='inner')
mangas_cleaned = pd.merge(mangas_cleaned, mangas_mean, on=['manga_id'])
print(f'Mangas cleaned shape: {mangas_cleaned.shape}')
mangas_cleaned.head()

Mangas cleaned shape: (4248, 5)


Unnamed: 0,manga_id,manga_name,manga_rank,number_scores,mean_score
0,2,Berserk,1,1913,9.003659
1,1706,JoJo no Kimyou na Bouken Part 7: Steel Ball Run,2,532,8.969925
2,25,Fullmetal Alchemist,3,2054,9.022882
3,13,One Piece,4,2228,8.618492
4,1,Monster,5,981,8.928644


Save the filtered data

In [45]:
mangas_cleaned.to_csv('mangas_v2.csv', index=False)

There are still repeated names.

In [46]:
len(mangas_cleaned.manga_name.unique())

4175

Filter the scores over the cleaned mangas.

In [47]:
scores_on_mangas_v2 = pd.merge(mangas_cleaned, scores, on=['manga_id'], how='inner')
print(f'Scores cleaned shape: {scores_on_mangas_v2.shape}')
scores_on_mangas_v2.head()

Scores cleaned shape: (521678, 7)


Unnamed: 0,manga_id,manga_name,manga_rank,number_scores,mean_score,score,user
0,2,Berserk,1,1913,9.003659,10,TheCriticsClub
1,2,Berserk,1,1913,9.003659,7,Polyphemus
2,2,Berserk,1,1913,9.003659,10,qrdel
3,2,Berserk,1,1913,9.003659,10,Aja
4,2,Berserk,1,1913,9.003659,6,Tumerking


We have gone from 630k scores to about 522k. The next step is to filter by users: let's keep the ones that have given at least 20 ratings.

In [48]:
user_scores = {}
for user in scores_on_mangas_v2.user:
    # dict.get returns 0 the first time a user is seen
    user_scores[user] = user_scores.get(user, 0) + 1

df_user_scores = pd.DataFrame(data = {'user': list(user_scores.keys()), 'number_scores': list(user_scores.values())})
df_user_scores.head()

Unnamed: 0,user,number_scores
0,TheCriticsClub,2
1,Polyphemus,372
2,qrdel,10
3,Aja,35
4,Tumerking,140


Number of 'cleaned' users:

In [49]:
sum(df_user_scores['number_scores'] > 19)

5175

In [50]:
users_filtered = df_user_scores[df_user_scores['number_scores'] > 19]
print(f'Users filtered shape: {users_filtered.shape}')
users_filtered.head()

Users filtered shape: (5175, 2)


Unnamed: 0,user,number_scores
1,Polyphemus,372
3,Aja,35
4,Tumerking,140
5,aindah,113
6,infinity,36


So the final dataset is...

In [51]:
scores_cleaned = pd.merge(scores_on_mangas_v2[['manga_id', 'manga_name', 'manga_rank', 'score', 'user']], 
                          users_filtered, on=['user'], how='inner')
print(f'Scores cleaned shape: {scores_cleaned.shape}')
scores_cleaned.head()

Scores cleaned shape: (484502, 6)


Unnamed: 0,manga_id,manga_name,manga_rank,score,user,number_scores
0,2,Berserk,1,7,Polyphemus,372
1,25,Fullmetal Alchemist,3,7,Polyphemus,372
2,13,One Piece,4,8,Polyphemus,372
3,1,Monster,5,4,Polyphemus,372
4,4632,Oyasumi Punpun,6,10,Polyphemus,372


The filtering can continue, but this seems a reasonable number.
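To illustrate what continuing the filtering would mean: removing users can push some mangas back below the 20-score threshold, and the other way round. A minimal sketch of repeating both filters until nothing changes is given below. The function `iterative_filter` and the toy data are hypothetical, and the thresholds are lowered to 2 so the example stays small.

```python
import pandas as pd

def iterative_filter(df, min_item=20, min_user=20):
    """Apply the item filter and the user filter repeatedly until no row is dropped."""
    while True:
        n_before = len(df)
        # Number of (non-missing) scores of the manga in each row
        item_counts = df.groupby('manga_id')['score'].transform('count')
        df = df[item_counts >= min_item]
        # Number of (non-missing) scores of the user in each row
        user_counts = df.groupby('user')['score'].transform('count')
        df = df[user_counts >= min_user]
        if len(df) == n_before:
            return df

# Toy data: u4 only rated manga 3, so dropping u4 later invalidates manga 3 and then u3
scores_demo = pd.DataFrame({
    'user': ['u1', 'u1', 'u2', 'u2', 'u3', 'u3', 'u4'],
    'manga_id': [1, 2, 1, 2, 1, 3, 3],
    'score': [9, 8, 7, 6, 10, 5, 4],
})

filtered = iterative_filter(scores_demo, min_item=2, min_user=2)
print(filtered)
```

A single pass over this toy data would keep u3 and manga 3; only the repeated passes reach a set where every remaining user and manga meets both thresholds.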

In [23]:
scores_cleaned[['manga_id', 'score', 'user']].to_csv('scores_v2.csv', index=False)

Unique final users:

In [24]:
len(scores_cleaned.user.unique())

5175

Mean number of ratings per user:

In [25]:
scores_cleaned.shape[0]/len(scores_cleaned.user.unique())

93.62357487922705

## Pivot matrix

Load the data:

In [26]:
mangas = pd.read_csv('mangas_v2.csv')
print(f'Mangas shape: {mangas.shape}')
print(f'Unique mangas names: {len(mangas.manga_name.unique())}')
print(f'Unique mangas ids: {len(mangas.manga_id.unique())}')
mangas.head()

Mangas shape: (4248, 4)
Unique mangas names: 4175
Unique mangas ids: 4248


Unnamed: 0,manga_id,manga_name,manga_rank,number_scores
0,2,Berserk,1,1913
1,1706,JoJo no Kimyou na Bouken Part 7: Steel Ball Run,2,532
2,25,Fullmetal Alchemist,3,2054
3,13,One Piece,4,2228
4,1,Monster,5,981


In [27]:
scores = pd.read_csv('scores_v2.csv')
print(f'Scores shape: {scores.shape}')
scores.head()

Scores shape: (484502, 3)


Unnamed: 0,manga_id,score,user
0,2,7,Polyphemus
1,25,7,Polyphemus
2,13,8,Polyphemus
3,1,4,Polyphemus
4,4632,10,Polyphemus


In [28]:
ratings = pd.merge(mangas, scores, on='manga_id')
print(f'Ratings shape: {ratings.shape}')
ratings.head()

Ratings shape: (484502, 6)


Unnamed: 0,manga_id,manga_name,manga_rank,number_scores,score,user
0,2,Berserk,1,1913,7,Polyphemus
1,2,Berserk,1,1913,10,Aja
2,2,Berserk,1,1913,6,Tumerking
3,2,Berserk,1,1913,10,aindah
4,2,Berserk,1,1913,9,infinity


In [29]:
ratings = ratings[['manga_id', 'user', 'score']]
ratings.head()

Unnamed: 0,manga_id,user,score
0,2,Polyphemus,7
1,2,Aja,10
2,2,Tumerking,6
3,2,aindah,10
4,2,infinity,9


Get the pivot matrix. It has a row per user and a column per item (manga). This would be much more efficient with a sparse matrix, but in this approach I used a regular DataFrame.

In [31]:
pivot = ratings.pivot_table(index=['user'], columns=['manga_id'], values='score')
pivot

user / manga_id,1,2,3,4,7,8,9,10,11,12,...,19947,19952,19961,19968,19980,19981,19983,19984,19987,19995
--Zora--,9.0,,,,,,,,,,...,,,,,,,,,,
--ariste,,,,,,8.0,,,,,...,,,,,,,,,4.0,
-Alians-,,7.0,10.0,,,,,,,,...,,,,,,,,,,
-Anokata,10.0,,10.0,,10.0,,,,7.0,5.0,...,,,,,,,,,,
-BlackRabbit-,9.0,,,,,,6.0,,8.0,,...,,,,,6.0,,,1.0,,
-Chrissi-chan-,,,,,,,,,8.0,7.0,...,,,,,,,,,,
-Ereya-,10.0,10.0,10.0,,,,8.0,9.0,6.0,6.0,...,,,,,,,,,,
-Everlasting-,,,,,,,,,,,...,,,,,,,,,,
-FAWKYOURFACE-,,,,,,,,,,,...,,,,,,,,,,
-Gia-,,,,,,10.0,10.0,,,,...,,,,,,,,,,
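For reference, a sparse version of the same user-by-manga matrix could be built roughly as follows. This is only a sketch that assumes SciPy is available (it is not used elsewhere in this notebook); `ratings_demo` is a toy stand-in for `ratings`, and the categorical codes serve as row and column positions.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for the ratings DataFrame
ratings_demo = pd.DataFrame({
    'manga_id': [2, 2, 25, 13],
    'user': ['Polyphemus', 'Aja', 'Polyphemus', 'Aja'],
    'score': [7, 10, 7, 8],
})

# Categorical codes map each user / manga_id to a consecutive integer position
user_cat = ratings_demo['user'].astype('category')
item_cat = ratings_demo['manga_id'].astype('category')

sparse_pivot = csr_matrix(
    (
        ratings_demo['score'].to_numpy(dtype=float),
        (user_cat.cat.codes.to_numpy(), item_cat.cat.codes.to_numpy()),
    ),
    shape=(len(user_cat.cat.categories), len(item_cat.cat.categories)),
)

# Row i corresponds to user_cat.cat.categories[i], column j to item_cat.cat.categories[j]
print(sparse_pivot.toarray())
```

Two differences from `pivot_table` are worth keeping in mind: missing ratings are implicit zeros here instead of NaN, and repeated (user, manga) pairs would be summed rather than averaged.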


Note that generating this matrix is one of the most difficult steps in building a recommender system. In this example it can be done directly, but usually it requires an enormous amount of memory. Because of that, a serious project would need matrix factorization techniques at this point.
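As a rough idea of what matrix factorization means here, the sketch below applies a truncated SVD to a tiny dense toy matrix. It is an illustration under strong assumptions, not the method used in this project: `pivot_demo` is invented, missing scores are simply replaced by each user's mean before factorizing, and `k = 2` latent factors is an arbitrary choice. Practical systems usually fit the factors on the observed entries only.

```python
import numpy as np
import pandas as pd

# Hypothetical 4 users x 4 mangas pivot with missing scores
pivot_demo = pd.DataFrame(
    [[10, 9, np.nan, 2],
     [9, np.nan, 8, 1],
     [2, 1, 3, np.nan],
     [np.nan, 2, 2, 9]],
    index=['u1', 'u2', 'u3', 'u4'],
    columns=[1, 2, 3, 4],
    dtype=float,
)

# Center each user on their own mean; a missing score then becomes 0 (crude assumption)
user_means = pivot_demo.mean(axis=1)
centered = pivot_demo.sub(user_means, axis=0).fillna(0.0)

# Keep only the k largest singular values: a rank-k approximation of the matrix
k = 2
U, s, Vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Add the user means back to obtain a predicted score for every (user, manga) cell
predicted = pd.DataFrame(
    approx, index=pivot_demo.index, columns=pivot_demo.columns
).add(user_means, axis=0)
print(predicted.round(2))
```

The point of the factorization is that only the two thin factor matrices need to be stored, rather than the full user-by-manga table.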

---