# Recommender system

Carlos Pinto Pérez

## Recommender system background

In [1]:
import pandas as pd
import numpy as np

Load data

In [3]:
mangas = pd.read_csv('data/mangas_v2.csv')
scores = pd.read_csv('data/scores_v2.csv')
ratings = pd.merge(mangas, scores, on='manga_id')
ratings = ratings[['manga_id', 'user', 'score']]
print(f'Ratings shape: {ratings.shape}')
ratings.head()

Ratings shape: (484502, 3)


Unnamed: 0,manga_id,user,score
0,2,Polyphemus,7
1,2,Aja,10
2,2,Tumerking,6
3,2,aindah,10
4,2,infinity,9


Build the pivot matrix. The analysis can be done by rows (users) or by columns (items). I will start with the columns (items).

In [4]:
pivot_mangas = ratings.pivot_table(index=['user'], columns=['manga_id'], values='score')
pivot_mangas.head()

manga_id,1,2,3,4,7,8,9,10,11,12,...,19947,19952,19961,19968,19980,19981,19983,19984,19987,19995
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--Zora--,9.0,,,,,,,,,,...,,,,,,,,,,
--ariste,,,,,,8.0,,,,,...,,,,,,,,,4.0,
-Alians-,,7.0,10.0,,,,,,,,...,,,,,,,,,,
-Anokata,10.0,,10.0,,10.0,,,,7.0,5.0,...,,,,,,,,,,
-BlackRabbit-,9.0,,,,,,6.0,,8.0,,...,,,,,6.0,,,1.0,,


### Items similarity

Let's see how a particular recommendation works and then generalize it. I will use the manga with id = 2.

In [5]:
mangas[mangas['manga_id'] == 2]

Unnamed: 0,manga_id,manga_name,manga_rank,number_scores,mean_score
0,2,Berserk,1,1913,9.003659


This column of the pivot matrix holds its ratings:

In [6]:
ratings_manga_2 = pivot_mangas[2]
ratings_manga_2

user
--Zora--          NaN
--ariste          NaN
-Alians-          7.0
-Anokata          NaN
-BlackRabbit-     NaN
                 ... 
zombor11         10.0
zonnikku          8.0
zucchinichop     10.0
zuziako           NaN
zybactik          NaN
Name: 2, Length: 5175, dtype: float64

The approach is to compute the correlation of this column with every other column and then sort the values. This gives a measure of "similarity".

In [7]:
similar_mangas_of_2 = pivot_mangas.corrwith(other=ratings_manga_2, method='pearson').dropna() 
df_similar_mangas_of_2 = pd.DataFrame(similar_mangas_of_2)
df_similar_mangas_of_2.columns = ['Similarity']
df_similar_mangas_of_2 = pd.merge(df_similar_mangas_of_2, mangas, on=['manga_id'])
df_similar_mangas_of_2.sort_values(by=['Similarity'], ascending=False)



Unnamed: 0,manga_id,Similarity,manga_name,manga_rank,number_scores,mean_score
3986,19995,1.0,Tayutama: Kiss on my Deity,13561,22,6.318182
2103,5585,1.0,69,15757,20,6.500000
2985,11771,1.0,Romantist Egoist,9440,38,6.500000
2972,11690,1.0,18-sai no Kodou,13713,53,6.415094
370,559,1.0,Ura Peach Girl,10182,59,6.932203
...,...,...,...,...,...,...
1011,1632,-1.0,Princess Recipe,6013,47,6.808511
1878,4587,-1.0,Smash 1,5349,48,7.333333
212,333,-1.0,Gravitation EX,3006,39,6.897436
524,785,-1.0,SOS,7647,38,6.868421


There are many 'perfect' similarities with this particular manga, probably due to the small amount of data. To break the ties, we can add a secondary sort by manga popularity or by mean score.
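To see why sparse data produces so many values of exactly 1 or -1: a Pearson correlation computed from only two shared ratings is always +1 or -1 (unless one of the columns is constant). The following is a minimal sketch with made-up ratings; the frame `toy` and its values are assumptions for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are items, NaN means "not rated".
toy = pd.DataFrame(
    {
        "a": [9.0, 7.0, 8.0, 6.0, 10.0],
        "b": [8.0, 6.0, np.nan, np.nan, np.nan],  # 2 raters in common with "a"
        "c": [np.nan, np.nan, 7.0, 9.0, np.nan],  # 2 raters in common with "a"
        "d": [9.0, 6.0, 8.0, 7.0, 9.0],           # 5 raters in common with "a"
    }
)

# Correlate every column with "a"; only users who rated both items are used.
similarity = toy.corrwith(toy["a"])

# Number of users who rated both the column and "a".
overlap = toy[toy["a"].notna()].notna().sum()

print(pd.DataFrame({"similarity": similarity, "overlap": overlap}))
```

Items `b` and `c` get a "perfect" score from only two shared ratings each, while `d`, with five shared ratings, gets a more informative value.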

Sorting by popularity:

In [8]:
df_similar_mangas_of_2.sort_values(by=['Similarity', 'manga_rank'], ascending=[False, True])

Unnamed: 0,manga_id,Similarity,manga_name,manga_rank,number_scores,mean_score
786,1222,1.0,Little Busters! The 4-koma,1205,25,7.280000
2505,8473,1.0,Promise,3174,27,7.814815
1752,4154,1.0,"Kiss, Zekkou, Kiss",3254,26,7.192308
2329,7219,1.0,Strange Orange,3426,35,7.257143
3624,16787,1.0,All of You in the World,3641,26,7.923077
...,...,...,...,...,...,...
2126,5694,-1.0,Hot Gimmick S,13464,23,6.478261
3980,19961,-1.0,Moyashi Otoko to Tane Shoujo,13481,38,6.263158
1700,3960,-1.0,Chohatsu BABY,13702,43,6.372093
2318,7183,-1.0,Itsuka Hanayome ni,14507,43,6.325581


Sorting by mean score:

In [9]:
df_similar_mangas_of_2.sort_values(by=['Similarity', 'mean_score'], ascending=False)

Unnamed: 0,manga_id,Similarity,manga_name,manga_rank,number_scores,mean_score
3624,16787,1.0,All of You in the World,3641,26,7.923077
2505,8473,1.0,Promise,3174,27,7.814815
3603,16654,1.0,Taiyou ga Yondeiru!,7207,32,7.500000
1745,4119,1.0,Hakobune Hakusho,4298,37,7.486486
3102,12673,1.0,Shitsuren Chocolatier,4226,27,7.444444
...,...,...,...,...,...,...
1700,3960,-1.0,Chohatsu BABY,13702,43,6.372093
2318,7183,-1.0,Itsuka Hanayome ni,14507,43,6.325581
3980,19961,-1.0,Moyashi Otoko to Tane Shoujo,13481,38,6.263158
3660,17100,-1.0,Lian Ai Make♥Up!,12191,36,6.222222


However, this is still weak. We can additionally keep only the mangas with at least 100 scores:

In [10]:
df_similar_mangas_of_2_filtered = df_similar_mangas_of_2[df_similar_mangas_of_2['number_scores'] > 99]
df_similar_mangas_of_2_filtered.sort_values(by=['Similarity'], ascending=False)

Unnamed: 0,manga_id,Similarity,manga_name,manga_rank,number_scores,mean_score
1,2,1.000000,Berserk,1,1913,9.003659
2513,8519,0.917663,Yoru no Gakkou e Oide yo!,3263,109,7.486239
2882,11133,0.774194,37°C no Boyfriend,11326,100,6.630000
1242,2538,0.772172,Legend of Nereid,4097,136,7.279412
275,423,0.758929,Pichi Pichi Pitch: Mermaid Melody,8827,159,6.490566
...,...,...,...,...,...,...
2757,10409,-0.762279,Himitsu no Himegimi Uwasa no Ouji,8082,149,7.020134
2053,5339,-0.870572,Koi Suta,7428,127,6.834646
2400,7676,-0.944911,Ouchi e Kaerou,4212,176,7.500000
2253,6808,-1.000000,Saikou no Kiss wo Ageru,4106,106,7.292453


### Users similarity

This time the pivot matrix will have the users as columns:

In [11]:
pivot_users = ratings.pivot_table(index=['manga_id'], columns=['user'], values='score')
pivot_users.head()

user,--Zora--,--ariste,-Alians-,-Anokata,-BlackRabbit-,-Chrissi-chan-,-Ereya-,-Everlasting-,-FAWKYOURFACE-,-Gia-,...,zman75,znyggisen,zoddtheimmortal,zogwarg,zombiesonacid,zombor11,zonnikku,zucchinichop,zuziako,zybactik
manga_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,9.0,,,10.0,9.0,,10.0,,,,...,,7.0,,,,10.0,10.0,8.0,,
2,,,7.0,,,,10.0,,,,...,,,10.0,9.0,,10.0,8.0,10.0,,
3,,,10.0,10.0,,,10.0,,,,...,,,,10.0,,10.0,10.0,,,
4,,,,,,,,,,,...,,,,,,,,,,8.0
7,,,,10.0,,,,,,,...,,,,,,10.0,,,,


It's a good idea to take into account how many ratings each user has given. This could be useful to improve the recommender system, although it is not implemented in this demo.

In [12]:
users = pd.DataFrame(data={'user': ratings['user'].unique()})
n_scores_temp = ratings[['user', 'score']].groupby('user').count()
n_scores = pd.merge(users, n_scores_temp, on=['user'])
n_scores.columns=['user', 'n_scores']
n_scores.head()

Unnamed: 0,user,n_scores
0,Polyphemus,372
1,Aja,35
2,Tumerking,140
3,aindah,113
4,infinity,36


As I did with the manga with id == 2, this time I will focus on the user "infinity".

In [13]:
ratings[ratings['user'] == 'infinity']

Unnamed: 0,manga_id,user,score
4,2,infinity,9
1738,25,infinity,9
3359,13,infinity,8
9756,651,infinity,7
42473,44,infinity,8
56714,267,infinity,9
66206,583,infinity,7
70777,373,infinity,7
75612,735,infinity,8
80705,1076,infinity,6


Let's see which users are most similar to this one, using correlations again:

In [14]:
ratings_user_infinity = pivot_users['infinity'].dropna()
ratings_user_infinity

manga_id
2        9.0
11       7.0
12       7.0
13       8.0
15       8.0
19       7.0
25       9.0
44       8.0
47       8.0
48       7.0
114      7.0
136      9.0
221      7.0
267      9.0
278      8.0
373      7.0
447      9.0
564      7.0
572      7.0
583      7.0
598      8.0
616      7.0
648      7.0
651      7.0
671      7.0
735      8.0
908      9.0
967      8.0
1076     6.0
1534     8.0
2436     9.0
5113     9.0
5801     9.0
5911     7.0
11329    8.0
12586    7.0
Name: infinity, dtype: float64

In [15]:
similar_users_of_infinity = pivot_users.corrwith(other=ratings_user_infinity, method='pearson').dropna() 
df_similar_users_of_infinity = pd.DataFrame(similar_users_of_infinity)
df_similar_users_of_infinity.columns = ['Similarity']
df_similar_users_of_infinity = pd.merge(df_similar_users_of_infinity, n_scores, on=['user'])
df_similar_users_of_infinity.sort_values(by=['Similarity'], ascending=False)



Unnamed: 0,user,Similarity,n_scores
589,FiraDeviant,1.0,20
608,ForgotMyRice,1.0,21
944,Koutsuna_sama,1.0,29
673,Grardox,1.0,23
2757,mixing-scents,1.0,158
...,...,...,...
2278,courty_cupcake,-1.0,95
1393,PinkShippuden,-1.0,458
2737,miaka15,-1.0,345
1209,MrLegitimacy,-1.0,25


As with the first approach for mangas, there are many perfect correlations. We can use a secondary metric (the number of scores given by each user) to get a better ordering:

In [16]:
df_similar_users_of_infinity.sort_values(by=['Similarity', 'n_scores'], ascending=False)

Unnamed: 0,user,Similarity,n_scores
1317,OURANLOVERJINX,1.0,470
620,Fujaku,1.0,450
1092,MahouShoujoLain,1.0,303
1196,Moon_Light,1.0,255
2757,mixing-scents,1.0,158
...,...,...,...
1209,MrLegitimacy,-1.0,25
893,KawaiiNeko,-1.0,24
2742,mild_kitto,-1.0,24
1791,Takros_Knonnar,-1.0,23


Now we can identify which items (mangas) are rated highest by the users that are similar to the target user. First, let's take only the ten most similar users:

In [17]:
df = df_similar_users_of_infinity.sort_values(by=['Similarity', 'n_scores'], ascending=False)

# This is not the best way to choose the best 10.
# Also, I choose 10 because this is a demo.
most_similar_users_to_infinity = []
for i in range(10):
    most_similar_users_to_infinity.append(df.iloc[i].user)
    
most_similar_users_to_infinity

['OURANLOVERJINX',
 'Fujaku',
 'MahouShoujoLain',
 'Moon_Light',
 'mixing-scents',
 'lilytenjouXP',
 'MangaGreat',
 'basbas',
 'tweetlepie',
 'arrowofthenight']

The scores given by the most similar user, sorted from highest to lowest:

In [18]:
df1 = ratings[ratings['user'] == most_similar_users_to_infinity[0]].sort_values(by=['score'], ascending=False)
df1.index = df1.manga_id
df1 = df1[[column for column in df1.columns if column not in ['manga_id']]]
# Drop from the list the mangas that the target user has already read.
df1 = df1.drop(ratings_user_infinity.index, errors='ignore')
df1.head()

Unnamed: 0_level_0,user,score
manga_id,Unnamed: 1_level_1,Unnamed: 2_level_1
610,OURANLOVERJINX,10
7008,OURANLOVERJINX,10
4515,OURANLOVERJINX,10
8042,OURANLOVERJINX,10
125,OURANLOVERJINX,10


Multiply each score greater than 5 by the correlation between the current user and the target user. Scores of 5 or lower are first shifted by -5, so that they count against the item. This way I get a "recommendation score".

I also keep only the users with a positive correlation with the target user. It doesn't make sense to use the others: the fact that a person with different preferences dislikes something doesn't indicate that the target user will like it.
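The weighting rule described above can be sketched in vectorized form on a few made-up ratings. This is only an illustration under stated assumptions: the names `toy_ratings`, `toy_similarity` and `already_read`, and all the values, are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings and similarities to the target user.
toy_ratings = pd.DataFrame(
    {
        "manga_id": [1, 2, 3, 1, 3, 4, 2, 4],
        "user": ["ann", "ann", "ann", "bob", "bob", "bob", "cat", "cat"],
        "score": [9, 8, 3, 10, 4, 7, 6, 9],
    }
)
toy_similarity = pd.Series({"ann": 1.0, "bob": 0.5, "cat": -0.2})
already_read = [1]  # mangas the target user has already scored

# Keep only positively correlated users and unread mangas.
positive = toy_similarity[toy_similarity > 0]
mask = toy_ratings["user"].isin(positive.index) & ~toy_ratings["manga_id"].isin(already_read)
candidates = toy_ratings[mask].copy()

# Scores above 5 count as they are; scores of 5 or lower are shifted by -5.
adjusted = np.where(candidates["score"] > 5, candidates["score"], candidates["score"] - 5)
candidates["weighted"] = adjusted * candidates["user"].map(positive)

# Sum the weighted scores per manga and rank them.
recommendation = candidates.groupby("manga_id")["weighted"].sum().sort_values(ascending=False)
print(recommendation)
```

Here manga 2 ends up first (one high score from a very similar user), while manga 3 gets a negative total because both similar users scored it below 5.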

In [19]:
df = df_similar_users_of_infinity.sort_values(by=['Similarity', 'n_scores'], ascending=False)
df = df[df['Similarity'] > 0]

# This cell takes a long time
recomendations_by_similar_users = pd.Series()
for i in range(df.shape[0]):  # Now I get all the similar users, not only the first ten.
    current_user = df.iloc[i].user
    user_similarity = df.iloc[i].Similarity

    series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
    series_current_user.index=ratings[ratings['user'] == current_user].manga_id

    # Note: drop() and map() return new Series. Their results are not assigned here,
    # so the output below still includes the mangas already read and sums the raw scores.
    series_current_user.drop(ratings_user_infinity.index, errors='ignore')
    # I also penalize the bad scores:
    series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
    recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)
# Using the sum as the aggregation function means that the recommender system takes the popularity of the mangas into account.
# For a more specific recommendation, the aggregation function could be, for instance, the geometric mean.
recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
recomendations_by_similar_users



1        4959
2        8892
3        5301
4        2118
7        2564
         ... 
19981     154
19983     202
19984     427
19987      35
19995      50
Length: 4248, dtype: int64

In [20]:
recomendations_by_similar_users.sort_values(ascending=False)

25       10777
13       10124
11        9864
21        9262
2         8892
         ...  
7733        15
12472       14
8674        14
9606        13
213         12
Length: 4248, dtype: int64

Attach the manga names and show the sorted results:

In [21]:
df_users_infinity = pd.DataFrame(recomendations_by_similar_users.sort_values(ascending=False))
df_users_infinity.columns = ['recomended_score']
df_users_infinity['manga_id'] = df_users_infinity.index

recomendations_by_users_infinity = pd.merge(mangas[['manga_id', 'manga_name', 'mean_score']], 
                                            df_users_infinity, on=['manga_id'])
recomendations_by_users_infinity.sort_values('recomended_score', ascending=False)

Unnamed: 0,manga_id,manga_name,mean_score,recomended_score
2,25,Fullmetal Alchemist,9.022882,10777
3,13,One Piece,8.618492,10124
331,11,Naruto,7.545304,9864
20,21,Death Note,8.469379,9262
0,2,Berserk,9.003659,8892
...,...,...,...,...
3149,7733,Sonna Koe Dashicha Iya,6.545455,15
3687,12472,Bara no Kusari,6.800000,14
3525,8674,Deep Black,7.000000,14
2928,9606,Yubi to Kuchibiru to Hitomi no Ijiwaru,6.750000,13


We can also use the Spearman correlation, which seems more adequate in this case than the classical Pearson correlation. That is, instead of using the distance between the users' scores to measure their similarity, it seems more intuitive to use the distance between the users' **personal rankings** (https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php).

Take into consideration that this is only valid as a similarity measure between users, not between items.
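A small illustration of the difference, assuming two hypothetical users who rank four mangas in the same order but use the score scale differently. Spearman is computed here as the Pearson correlation of the ranks, which is its definition; the names and values are made up.

```python
import pandas as pd

# Hypothetical scores of two users for the same four mangas.
generous = pd.Series([6.0, 7.0, 8.0, 10.0])
harsh = pd.Series([2.0, 3.0, 4.0, 10.0])

# Pearson measures the linear relation between the raw scores.
pearson = generous.corr(harsh)

# Spearman is the Pearson correlation of the ranks of the scores.
spearman = generous.rank().corr(harsh.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Both users order the mangas identically, so the rank correlation is 1, while the Pearson correlation stays below 1 because the raw scores are not linearly related.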

In [22]:
similar_users_of_infinity_spearman = pivot_users.corrwith(other=ratings_user_infinity, method='spearman').dropna() 
df_similar_users_of_infinity_spearman = pd.DataFrame(similar_users_of_infinity_spearman)
df_similar_users_of_infinity_spearman.columns = ['Similarity']
df_similar_users_of_infinity_spearman = pd.merge(df_similar_users_of_infinity_spearman, n_scores, on=['user'])
df_similar_users_of_infinity_spearman.sort_values(by=['Similarity'], ascending=False).head()



Unnamed: 0,user,Similarity,n_scores
1215,MrSanjuro,1.0,33
1858,Tinhinane-Ingui,1.0,39
1545,RoxRobstah,1.0,34
1365,Paraturtle,1.0,36
1551,RukiaRocks,1.0,50


In [23]:
df_similar_users_of_infinity_spearman.sort_values(by=['Similarity', 'n_scores'], ascending=False).head()

Unnamed: 0,user,Similarity,n_scores
620,Fujaku,1.0,450
402,DORAGONFLY,1.0,177
1622,Selaht27,1.0,172
2757,mixing-scents,1.0,158
1093,Mahou_Bujin,1.0,114


From now on, user similarities are computed as a weighted combination of the Spearman and Pearson correlations, with weights of 70% and 30% respectively.
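As a sketch of how this weighting behaves with missing values, assume two hypothetical Series of correlations indexed by user (`spearman` and `pearson` below are made-up values). With plain arithmetic, a NaN in either Series propagates to the result; `Series.add` with `fill_value=0` is one possible alternative that falls back to the available term.

```python
import numpy as np
import pandas as pd

# Hypothetical correlations of three users with the target user.
spearman = pd.Series({"u1": 0.9, "u2": 0.4, "u3": np.nan})
pearson = pd.Series({"u1": 0.8, "u2": 0.6, "u3": 0.5})

# Plain weighted sum: NaN in either input gives NaN in the result.
blended = 0.7 * spearman + 0.3 * pearson

# Alternative: treat a missing term as 0, so u3 keeps 0.3 * pearson.
blended_filled = (0.7 * spearman).add(0.3 * pearson, fill_value=0)

print(pd.DataFrame({"blended": blended, "blended_filled": blended_filled}))
```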

### Generalization (recommender system by product similarities)

Calculate the correlation matrix. This process takes time.

In [24]:
corr_pearson_mangas = pivot_mangas.corr(method='pearson')  
corr_pearson_mangas.head()

manga_id,1,2,3,4,7,8,9,10,11,12,...,19947,19952,19961,19968,19980,19981,19983,19984,19987,19995
manga_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.409006,0.573935,0.296401,0.243751,0.221626,0.179385,0.053958,0.119629,0.048844,...,0.738549,-0.521817,-1.0,0.665016,0.294547,-0.354663,0.269692,0.212567,,1.0
2,0.409006,1.0,0.366907,0.261819,0.307543,0.297297,0.093504,0.07823,0.13769,0.166987,...,0.289414,0.175412,-1.0,0.050965,0.175027,0.358535,-0.203496,0.450499,,1.0
3,0.573935,0.366907,1.0,0.288774,0.192548,0.315443,0.139139,0.181722,0.072584,0.152991,...,0.944911,0.045162,1.0,0.5,-0.238073,-0.01494,0.394557,0.052437,,
4,0.296401,0.261819,0.288774,1.0,0.066961,0.437503,0.123408,0.114721,-0.053623,0.136216,...,,0.980581,,,0.503631,0.408248,0.416881,0.498585,,0.995871
7,0.243751,0.307543,0.192548,0.066961,1.0,0.04709,0.123593,0.247823,0.304135,0.303212,...,,0.794461,,,0.706897,0.326164,0.684177,0.445927,,


Apply a filter: to calculate the correlation between two items, they must share at least 100 scores.
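The effect of `min_periods` can be checked on a tiny made-up frame (the columns `x`, `y`, `z` and their values are assumptions for illustration). Pairs of columns with fewer shared observations than `min_periods` get NaN instead of a correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are items.
toy = pd.DataFrame(
    {
        "x": [9.0, 7.0, 8.0, 6.0, 10.0],
        "y": [8.0, 6.0, np.nan, np.nan, np.nan],  # only 2 ratings shared with x
        "z": [9.0, 6.0, 8.0, 7.0, 9.0],           # 5 ratings shared with x
    }
)

corr_all = toy.corr(method="pearson")
# Require at least 3 shared ratings for a pair of items.
corr_min3 = toy.corr(method="pearson", min_periods=3)

print(corr_all)
print(corr_min3)
```

The "perfect" correlation between `x` and `y` disappears in the filtered matrix, while the value for `x` and `z` is unchanged.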

In [25]:
corr_pearson_mangas_filtered = pivot_mangas.corr(method='pearson', min_periods=100)  
corr_pearson_mangas_filtered.head()

manga_id,1,2,3,4,7,8,9,10,11,12,...,19947,19952,19961,19968,19980,19981,19983,19984,19987,19995
manga_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.409006,0.573935,0.296401,0.243751,,0.179385,0.053958,0.119629,0.048844,...,,,,,,,,,,
2,0.409006,1.0,0.366907,0.261819,0.307543,,0.093504,0.07823,0.13769,0.166987,...,,,,,,,,,,
3,0.573935,0.366907,1.0,0.288774,0.192548,,0.139139,0.181722,0.072584,0.152991,...,,,,,,,,,,
4,0.296401,0.261819,0.288774,1.0,,,,,-0.053623,0.136216,...,,,,,,,,,,
7,0.243751,0.307543,0.192548,,1.0,,,,0.304135,0.303212,...,,,,,,,,,,


This matrix can be saved to avoid recomputing it, but it requires a lot of memory (its size grows quadratically with the number of items).
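A rough back-of-the-envelope estimate of the in-memory size, assuming a dense square matrix of `float64` values. The helper `dense_matrix_bytes` is hypothetical; the counts 4248 (mangas) and 5175 (users) are the lengths shown in the outputs of this notebook. The size of the CSV file on disk will differ, since CSV stores numbers as text.

```python
def dense_matrix_bytes(n, itemsize=8):
    """Bytes needed for a dense n x n matrix with `itemsize` bytes per value."""
    return n * n * itemsize


n_mangas = 4248  # number of distinct mangas in the pivot matrix
n_users = 5175   # number of distinct users in the pivot matrix

for label, n in [("mangas", n_mangas), ("users", n_users)]:
    megabytes = dense_matrix_bytes(n) / 1e6
    print(f"{label}: {n} x {n} float64 values, about {megabytes:.0f} MB")
```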

In [26]:
corr_pearson_mangas.to_csv('data/corr_pearson_mangas.csv')
corr_pearson_mangas_filtered.to_csv('data/corr_pearson_mangas_filtered.csv')

Let's choose a random user and get some recommendations for her.

In [27]:
random_user = pivot_mangas.iloc[2678].dropna()
random_user

manga_id
1        10.0
2         8.0
3         9.0
4         9.0
26        7.0
51        9.0
104       9.0
149       8.0
399      10.0
401       8.0
436       8.0
481      10.0
642       8.0
656      10.0
657      10.0
705       5.0
731      10.0
743       4.0
745       9.0
768       8.0
909       4.0
912       6.0
936       8.0
1373      9.0
1470      7.0
1471      8.0
1706      9.0
3008      5.0
3009      7.0
3258      8.0
3731      8.0
4632      9.0
5461      8.0
6604      6.0
7216      7.0
7375      9.0
8967      9.0
10690     5.0
11471     5.0
11734     3.0
14790     6.0
14893     6.0
15355     8.0
17192     5.0
17353     5.0
Name: Swarnadeep, dtype: float64

In [28]:
len(random_user)

45

For each manga _m_ that the user has scored, I take every similar manga and multiply its correlation coefficient by the score the user assigned to _m_. This way, the user's scores determine which mangas seem more appropriate for her.

To aggregate all these values for each candidate manga, I will use the sum() function. Although that choice is very intuitive, it could be improved.
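The same computation can be sketched without an explicit loop on a small made-up correlation matrix. The names `toy_corr` and `toy_scores` and all the values are hypothetical; this is an illustration of the idea, not a replacement for the cells that follow.

```python
import numpy as np
import pandas as pd

items = [1, 2, 3, 4]
# Hypothetical item-item correlation matrix (NaN: not enough shared ratings).
toy_corr = pd.DataFrame(
    [
        [1.0, 0.5, -0.2, np.nan],
        [0.5, 1.0, 0.3, 0.8],
        [-0.2, 0.3, 1.0, 0.1],
        [np.nan, 0.8, 0.1, 1.0],
    ],
    index=items,
    columns=items,
)
# Hypothetical scores given by the target user to mangas 1 and 3.
toy_scores = pd.Series({1: 10.0, 3: 4.0})

# Take the columns of the scored mangas and scale each one by the user's score.
weighted = toy_corr[toy_scores.index].mul(toy_scores, axis=1)

# Sum per candidate manga (NaN is skipped) and drop the mangas already scored.
toy_recommendation = weighted.sum(axis=1).drop(toy_scores.index)
print(toy_recommendation.sort_values(ascending=False))
```

Manga 2 scores 0.5 * 10 + 0.3 * 4 = 6.2, and manga 4 scores only 0.1 * 4 = 0.4 because its correlation with manga 1 is missing.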

In [29]:
user_recomendation = pd.Series()

for manga_index in range(len(random_user)):
    similar_mangas = corr_pearson_mangas[random_user.index[manga_index]].dropna()
    # Multiply the correlation coefficient by the score assigned by the user.
    similar_mangas = similar_mangas.map(lambda x: x * random_user.values[manga_index])
    # Get the recommendation scores
    user_recomendation = user_recomendation.append(similar_mangas)
user_recomendation



1        10.000000
2         4.090057
3         5.739346
4         2.964012
7         2.437513
           ...    
19968     2.847474
19980     0.855408
19981     5.000000
19983    -1.724072
19984    -4.682774
Length: 153809, dtype: float64

In [30]:
len(user_recomendation)

153809

The previous operation produced many repeated items:

In [31]:
len(user_recomendation.index.unique())

4248

This is where I use sum() as the aggregation function, as mentioned before.

In [32]:
user_recomendation = user_recomendation.groupby(user_recomendation.index).sum()
user_recomendation

1        110.798177
2         98.830469
3        109.449835
4         76.107387
7         82.814613
            ...    
19981     77.091644
19983     62.172647
19984     85.640385
19987     18.875690
19995     57.915941
Length: 4248, dtype: float64

Remove from the recommended mangas those already scored by the target user:

In [33]:
user_recomendation = user_recomendation.drop(random_user.index, errors='ignore')  
user_recomendation.head(10)

7     82.814613
8     79.144959
9     58.433464
10    63.848659
11    49.501681
12    52.022681
13    44.019191
14    73.508536
15    53.204647
16    62.206179
dtype: float64

In [35]:
user_recomendations = pd.DataFrame(user_recomendation)
user_recomendations.columns = ['recomended_score']
user_recomendations['manga_id'] = user_recomendations.index
user_recomendations.index = np.arange(user_recomendations.shape[0])
user_recomendations

Unnamed: 0,recomended_score,manga_id
0,82.814613,7
1,79.144959,8
2,58.433464,9
3,63.848659,10
4,49.501681,11
...,...,...
4198,77.091644,19981
4199,62.172647,19983
4200,85.640385,19984
4201,18.875690,19987


And get a full view of the results ordered by their recommendation score:

In [36]:
df_user_recomendation = pd.merge(user_recomendations, mangas, on=['manga_id'], how='inner')
df_user_recomendation.sort_values(by=['recomended_score'], ascending=False)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3673,199.198508,15484,Nemuru Baka,9452,22,6.727273
1002,186.284587,1610,Maniac Road,7413,21,7.095238
4119,177.127647,19283,Uchuu no SPARROW,9112,21,6.476190
2365,172.521669,6934,Green Beans,16328,23,5.086957
2674,170.770075,8799,Umi-chan no Otomodachi,14613,20,6.100000
...,...,...,...,...,...,...
2075,-68.730577,5182,Beast Harem,8069,76,6.894737
2295,-69.410732,6200,Ai wo Tomenaide,16252,37,5.351351
141,-71.338015,184,Shishunki Miman Okotowari,7027,42,7.500000
2758,-72.777751,9276,Aishikata mo Wakarazuni,9328,29,6.482759


### Generalization (recommender system by user similarities)

Initially I had saved three correlation matrices as .csv files: Pearson, Spearman, and a weighted one. That would store the correlations between all existing users, so that later I would only need to load the file and skip the computation. This works (and I do it) for item similarity, because the correlations between all pairs of items are data I use constantly, and storing them avoids recomputing the correlation of each new item against all the others. It does not work here. As in the example above, I only need the correlation between the single target user and every other user, and in most cases the target user will not be in the stored matrix beforehand, because it will be a new profile. So it makes no sense to save the user correlation matrices (or the final weighted matrix).

In [7]:
# The full correlation matrices would look like this:

# corr_pearson_users = pivot_users.corr(method='pearson')  
# corr_spearman_users = pivot_users.corr(method='spearman')  

# The similarity between users is measured with a weighted combination of these two matrices.
# 70% Spearman correlation - 30% Pearson correlation.

# corr_users = 0.7*corr_spearman_users + 0.3*corr_pearson_users

# Note that the Pearson correlation matrix has more values (fewer NA) than the Spearman
# correlation matrix. This operation keeps as NA the values that are NA in the Spearman
# matrix. Another option would be to use 0.3*corr_pearson_users for those NA values.
# I prefer to leave it as it is, because having fewer similarity coefficients ensures
# a closer match between the user profiles that remain.

# p, q = corr_users.shape[0]*corr_users.shape[1], corr_users.isna().sum().sum()

# print(f'Out of {p} total values in the matrix, {p-q} are filled. \nThat is, a proportion of {(p-q)/p}.')

# Out of 26780625 total values in the matrix, 16900466 are filled. 
# That is, a proportion of 0.631 (about 63%). Not bad at all.

Let's pick an arbitrary user and generate some recommendations for this user (I choose the same one as in the item-similarity recommendation, to compare the results):

In [37]:
random_user = pivot_mangas.iloc[2678].dropna()
random_user

manga_id
1        10.0
2         8.0
3         9.0
4         9.0
26        7.0
51        9.0
104       9.0
149       8.0
399      10.0
401       8.0
436       8.0
481      10.0
642       8.0
656      10.0
657      10.0
705       5.0
731      10.0
743       4.0
745       9.0
768       8.0
909       4.0
912       6.0
936       8.0
1373      9.0
1470      7.0
1471      8.0
1706      9.0
3008      5.0
3009      7.0
3258      8.0
3731      8.0
4632      9.0
5461      8.0
6604      6.0
7216      7.0
7375      9.0
8967      9.0
10690     5.0
11471     5.0
11734     3.0
14790     6.0
14893     6.0
15355     8.0
17192     5.0
17353     5.0
Name: Swarnadeep, dtype: float64

Now I go through all the users similar to this one, according to the similarity coefficient given by the weighted combination of the two types of correlation.

In [38]:
# Correlate the remaining columns, which represent the other users, with the selected one.
# Note that this statement takes a while and has to be run every
# time we want to recommend something to a user with this method.
similar_users_by_pearson = pivot_users.corrwith(other=random_user, method='pearson').dropna() 
similar_users_by_spearman = pivot_users.corrwith(other=random_user, method='spearman').dropna() 
# The similarity between users, as the weighted combination of the two types of correlation.
similar_users = 0.7*similar_users_by_spearman + 0.3*similar_users_by_pearson

df_similar_users = pd.DataFrame(similar_users)
df_similar_users.columns = ['Similarity']
# Keep only the users with a positive similarity score.
df_similar_users = df_similar_users[df_similar_users['Similarity'] > 0]
df_similar_users.sort_values(by=['Similarity'], ascending=False)



Unnamed: 0_level_0,Similarity
user,Unnamed: 1_level_1
ManU-Alchemist,1.000000
kidxatxheart,1.000000
Locokoko182,1.000000
gtzice2,1.000000
skutieos,1.000000
...,...
Morrisummer,0.005905
Eskies,0.005814
Roninski,0.004563
FeelinWoozie,0.002205


In [39]:
# This loop also takes a while:
recomendations_by_similar_users = pd.Series()
for i in range(df_similar_users.shape[0]):
    current_user = df_similar_users.index[i]
    user_similarity = df_similar_users.values[i][0]
    # Build a Series to handle the data of each similar user:
    series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
    series_current_user.index=ratings[ratings['user'] == current_user].manga_id
    # Remove from this Series the items that the target user has already consumed:
    series_current_user = series_current_user.drop(random_user.index, errors='ignore')
    # Multiply the scores by the user's similarity. Note that I penalize the
    # failing scores (although not very strongly).
    series_current_user = series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
    recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)
# The result is a pandas Series with many repeated manga ids:
recomendations_by_similar_users



21      4.729451
30      4.729451
102     5.911814
1033    5.911814
9296   -2.364725
          ...   
2985    2.770422
3651   -1.038908
5930    3.116725
5673    0.000000
3661    2.077817
Length: 170652, dtype: float64

#### Recommendation by popularity:

I use the sum as the aggregation function:

In [40]:
recomendations_by_similar_users_pop = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
recomendations_by_similar_users_pop

7         908.177542
8         478.019110
9         975.394837
10        994.361270
11       2678.260347
            ...     
19981      60.156474
19983      26.962647
19984     118.773506
19987       8.227400
19995      15.052941
Length: 4202, dtype: float64

In [41]:
# Assembled as a dataframe:
df_recomendations_by_similar_users_pop = pd.DataFrame(recomendations_by_similar_users_pop)
df_recomendations_by_similar_users_pop.columns = ['recomended_score']
df_recomendations_by_similar_users_pop['manga_id'] = df_recomendations_by_similar_users_pop.index
df_recomendations_by_similar_users_pop.index = np.arange(df_recomendations_by_similar_users_pop.shape[0])

df_recomendations_by_similar_users_pop = pd.merge(df_recomendations_by_similar_users_pop, mangas, on=['manga_id'], how='inner')
# Recommendations:
df_recomendations_by_similar_users_pop.sort_values(by=['recomended_score'], ascending=False)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
18,3305.534129,25,Fullmetal Alchemist,3,2054,9.022882
14,3263.201058,21,Death Note,35,2433,8.469379
6,3231.974949,13,One Piece,4,2228,8.618492
4,2678.260347,11,Naruto,741,3013,7.545304
2792,2382.394903,9711,Bakuman.,149,1555,8.311254
...,...,...,...,...,...,...
1905,-2.442584,4536,Gekkou Denchi Shiki Ningyou Gekijou,15474,28,4.892857
2391,-2.650803,7118,Test Flight Girls,16326,23,4.565217
3492,-16.303456,14218,My Sweet Sisters,16373,29,3.379310
1232,-18.872427,2177,Battle Royale II: Blitz Royale,16367,85,3.647059


In [42]:
# Mangas the user is least likely to enjoy:
df_recomendations_by_similar_users_pop.sort_values(by=['recomended_score'], ascending=True)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3195,-37.062626,12200,High School Musical,16375,68,1.970588
1232,-18.872427,2177,Battle Royale II: Blitz Royale,16367,85,3.647059
3492,-16.303456,14218,My Sweet Sisters,16373,29,3.379310
2391,-2.650803,7118,Test Flight Girls,16326,23,4.565217
1905,-2.442584,4536,Gekkou Denchi Shiki Ningyou Gekijou,15474,28,4.892857
...,...,...,...,...,...,...
2792,2382.394903,9711,Bakuman.,149,1555,8.311254
4,2678.260347,11,Naruto,741,3013,7.545304
6,3231.974949,13,One Piece,4,2228,8.618492
14,3263.201058,21,Death Note,35,2433,8.469379


#### Recommendation by specificity:

Arithmetic mean as the aggregation function:
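To make the difference between the two aggregations concrete, here is a minimal sketch on made-up weighted scores. The labels `popular` and `niche` and the values are hypothetical: the first item has many moderate scores, the second a single high one.

```python
import pandas as pd

# Hypothetical weighted scores, one entry per (similar user, item) pair.
weighted = pd.Series(
    [3.0, 2.5, 3.5, 2.0, 3.0, 9.0],
    index=["popular", "popular", "popular", "popular", "popular", "niche"],
)

# The sum rewards items scored by many similar users (popularity).
by_sum = weighted.groupby(weighted.index).sum()

# The mean ignores how many users scored the item (specificity).
by_mean = weighted.groupby(weighted.index).mean()

print(pd.DataFrame({"sum": by_sum, "mean": by_mean}))
```

With the sum, the widely read item comes first; with the mean, the item with a single high score does. This is consistent with the mean-based ranking being dominated by mangas with few scores.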

In [43]:
recomendations_by_similar_users_esp_1 = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).mean()
recomendations_by_similar_users_esp_1

7        3.278619
8        3.649001
9        3.306423
10       3.440696
11       2.810347
           ...   
19981    2.615499
19983    0.842583
19984    3.125619
19987    2.056850
19995    1.881618
Length: 4202, dtype: float64

In [44]:
# Assembled as a dataframe:
df_recomendations_by_similar_users_esp_1 = pd.DataFrame(recomendations_by_similar_users_esp_1)
df_recomendations_by_similar_users_esp_1.columns = ['recomended_score']
df_recomendations_by_similar_users_esp_1['manga_id'] = df_recomendations_by_similar_users_esp_1.index
df_recomendations_by_similar_users_esp_1.index = np.arange(df_recomendations_by_similar_users_esp_1.shape[0])

df_recomendations_by_similar_users_esp_1 = pd.merge(df_recomendations_by_similar_users_esp_1, mangas, on=['manga_id'], how='inner')
# Recommendations:
df_recomendations_by_similar_users_esp_1.sort_values(by=['recomended_score'], ascending=False)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3239,9.000000,12472,Bara no Kusari,13246,20,6.800000
1248,8.500000,2458,Kimi wa Boku wo Suki ni Naru,5346,41,7.365854
2543,6.890234,7972,Nayameru Hime to Mayoeru Ouji,14945,24,6.041667
2779,6.645900,9606,Yubi to Kuchibiru to Hitomi no Ijiwaru,9329,24,6.750000
1196,6.212536,1990,Ouji-sama no Renai Jijou,15668,26,6.115385
...,...,...,...,...,...,...
1232,-0.314540,2177,Battle Royale II: Blitz Royale,16367,85,3.647059
3456,-0.486279,13984,3H Before Kiss,16171,31,5.354839
2427,-0.920782,7281,Ennui na Kanojo,14172,27,6.592593
3492,-0.959027,14218,My Sweet Sisters,16373,29,3.379310


In [45]:
# Mangas the user is least likely to enjoy:
df_recomendations_by_similar_users_esp_1.sort_values(by=['recomended_score'], ascending=True)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3195,-1.090077,12200,High School Musical,16375,68,1.970588
3492,-0.959027,14218,My Sweet Sisters,16373,29,3.379310
2427,-0.920782,7281,Ennui na Kanojo,14172,27,6.592593
3456,-0.486279,13984,3H Before Kiss,16171,31,5.354839
1232,-0.314540,2177,Battle Royale II: Blitz Royale,16367,85,3.647059
...,...,...,...,...,...,...
1196,6.212536,1990,Ouji-sama no Renai Jijou,15668,26,6.115385
2779,6.645900,9606,Yubi to Kuchibiru to Hitomi no Ijiwaru,9329,24,6.750000
2543,6.890234,7972,Nayameru Hime to Mayoeru Ouji,14945,24,6.041667
1248,8.500000,2458,Kimi wa Boku wo Suki ni Naru,5346,41,7.365854


A trial with the geometric mean as the aggregation function:

In [46]:
from scipy.stats.mstats import gmean

In [47]:
recomendations_by_similar_users_esp_2 = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).aggregate(gmean)
recomendations_by_similar_users_esp_2

  log_a = np.log(np.array(a, dtype=dtype))
  log_a = np.log(np.array(a, dtype=dtype))


7        NaN
8        NaN
9        NaN
10       NaN
11       NaN
        ... 
19981    NaN
19983    NaN
19984    NaN
19987    0.0
19995    NaN
Length: 4202, dtype: float64

In [48]:
recomendations_by_similar_users_esp_2 = recomendations_by_similar_users.groupby(
                                        recomendations_by_similar_users.index).apply(lambda group: group.product() ** (1 / float(len(group) )))
recomendations_by_similar_users_esp_2

  recomendations_by_similar_users.index).apply(lambda group: group.product() ** (1 / float(len(group) )))


7        0.0
8        0.0
9        0.0
10       0.0
11       0.0
        ... 
19981    0.0
19983    0.0
19984    0.0
19987    0.0
19995    0.0
Length: 4202, dtype: float64

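This yields very few results, since the geometric mean is only defined for positive values and the weighted scores include zeros and negatives: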

In [49]:
# Assembled as a DataFrame:
df_recomendations_by_similar_users_esp_2 = pd.DataFrame(recomendations_by_similar_users_esp_2)
df_recomendations_by_similar_users_esp_2.columns = ['recomended_score']
df_recomendations_by_similar_users_esp_2['manga_id'] = df_recomendations_by_similar_users_esp_2.index
df_recomendations_by_similar_users_esp_2.index = np.arange(df_recomendations_by_similar_users_esp_2.shape[0])

df_recomendations_by_similar_users_esp_2 = pd.merge(df_recomendations_by_similar_users_esp_2, mangas, on=['manga_id'], how='inner')
df_recomendations_by_similar_users_esp_2 = df_recomendations_by_similar_users_esp_2[df_recomendations_by_similar_users_esp_2['recomended_score'] > 0].dropna()
# Recommendations:
df_recomendations_by_similar_users_esp_2.sort_values(by=['recomended_score'], ascending=False)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3239,9.000000,12472,Bara no Kusari,13246,20,6.800000
1248,8.485281,2458,Kimi wa Boku wo Suki ni Naru,5346,41,7.365854
2543,6.822470,7972,Nayameru Hime to Mayoeru Ouji,14945,24,6.041667
2779,6.555104,9606,Yubi to Kuchibiru to Hitomi no Ijiwaru,9329,24,6.750000
1196,6.212536,1990,Ouji-sama no Renai Jijou,15668,26,6.115385
...,...,...,...,...,...,...
1218,0.683184,2142,Ouji-sama no Kanojo,15431,53,6.000000
4035,0.631316,18566,Asura,8486,35,6.771429
3064,0.410497,11430,VITA Sexualis,11018,21,6.238095
2417,0.364778,7246,Kiseki no Koibito,13859,26,6.576923


In [50]:
# Mangas the user is least likely to enjoy:
df_recomendations_by_similar_users_esp_2.sort_values(by=['recomended_score'], ascending=True)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
4125,0.356221,19334,I Will Be Cinderella,2337,28,7.642857
2417,0.364778,7246,Kiseki no Koibito,13859,26,6.576923
3064,0.410497,11430,VITA Sexualis,11018,21,6.238095
4035,0.631316,18566,Asura,8486,35,6.771429
1218,0.683184,2142,Ouji-sama no Kanojo,15431,53,6.000000
...,...,...,...,...,...,...
1196,6.212536,1990,Ouji-sama no Renai Jijou,15668,26,6.115385
2779,6.555104,9606,Yubi to Kuchibiru to Hitomi no Ijiwaru,9329,24,6.750000
2543,6.822470,7972,Nayameru Hime to Mayoeru Ouji,14945,24,6.041667
1248,8.485281,2458,Kimi wa Boku wo Suki ni Naru,5346,41,7.365854


Comparing the two aggregation functions, the geometric mean discards most mangas while ranking the surviving ones almost identically, so we will use the arithmetic mean.
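
A minimal sketch of why the geometric mean loses so many mangas, assuming made-up weighted scores: it is only defined for strictly positive values, and the weighting step produces zeros and negatives. The helper `geometric_mean` is hypothetical and is not the `gmean` function from scipy.

```python
import math

import numpy as np

def geometric_mean(values):
    """Geometric mean of the values, or NaN when any value is not strictly positive."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0 or np.any(arr <= 0):
        return float("nan")
    # exp(mean(log(x))) equals the n-th root of the product.
    return float(np.exp(np.log(arr).mean()))

positive_group = [9.0, 8.0]      # every vote is positive
mixed_group = [9.0, 0.0, -0.5]   # a zero and a negative vote

gm_positive = geometric_mean(positive_group)
gm_mixed = geometric_mean(mixed_group)
am_mixed = float(np.mean(mixed_group))

print(gm_positive, gm_mixed, am_mixed)
```

For the positive group both means are close (the arithmetic mean is 8.5), while the mixed group still has an arithmetic mean but no usable geometric mean.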

### Results

#### Similarity between items

Loading the correlation matrices:

In [56]:
corr_mangas = pd.read_csv('data/corr_pearson_mangas.csv', header=0, index_col=0)
corr_mangas_reduced = pd.read_csv('data/corr_pearson_mangas_filtered.csv', header=0, index_col=0)

I condense the previous process using the filtered correlation matrix (mangas with more than 100 scores). The cell aggregates by specificity (mean); the popularity variant (sum) is left commented out:

In [57]:
random_user_id = 2678
random_user = pivot_mangas.iloc[random_user_id].dropna()
user_recomendation = pd.Series()
for manga_index in range(len(random_user)):
    similar_mangas = corr_mangas_reduced[str(random_user.index[manga_index])].dropna()
    similar_mangas = similar_mangas.drop(random_user.index, errors='ignore')
    score = random_user.values[manga_index]
    similar_mangas = similar_mangas.map(lambda x: x * score if score > 5 else (x-5) * score)
    user_recomendation = user_recomendation.append(similar_mangas)
# Popularity:
# user_recomendation = user_recomendation.groupby(user_recomendation.index).sum()
# Specificity:
user_recomendation = user_recomendation.groupby(user_recomendation.index).mean()
user_recomendations = pd.DataFrame(user_recomendation)
user_recomendations.columns = ['recomended_score']
user_recomendations['manga_id'] = user_recomendations.index
user_recomendations.index = np.arange(user_recomendations.shape[0])
df_user_recomendation = pd.merge(user_recomendations, mangas, on=['manga_id'], how='inner')
df_user_recomendation.sort_values(by=['recomended_score'], ascending=False)

  user_recomendation = pd.Series()


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
339,4.575700,6812,Kyou kara Ore wa!!,119,166,8.445783
308,4.478454,4628,Shin Yami no Koe - Kaidan,4407,127,7.000000
479,4.005417,19844,Mouryou no Yurikago,6455,218,6.954128
239,3.812048,1908,Hatsukoi Limited.,3900,267,7.142322
411,3.737173,13102,Kanojo wo Mamoru 51 no Houhou,2719,186,7.440860
...,...,...,...,...,...,...
395,-14.740539,11577,Stardust★Wink,4017,318,7.191824
186,-14.802375,1237,Love♥Monster,2314,521,7.485605
439,-14.830238,14633,Seiyuu Ka!,627,367,8.100817
288,-14.882718,3757,Sweet Black,5895,229,7.344978
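
The loop above accumulates results with `Series.append`, which was deprecated in pandas 1.4 and removed in pandas 2.0. As a sketch under stated assumptions, the same accumulation can be written with `pd.concat`. The correlation matrix, the user and the helper name `weighted_neighbours` below are all hypothetical toy stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical item-item correlations; columns are strings, as when read from CSV.
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.2], [0.8, 1.0, 0.5], [-0.2, 0.5, 1.0]],
    index=[1, 2, 3],
    columns=["1", "2", "3"],
)
toy_user = pd.Series({1: 9, 2: 4})  # manga_id -> score

def weighted_neighbours(user_scores, corr):
    """Collect the weighted correlations of every manga the user has not scored."""
    parts = []
    for manga_id, score in user_scores.items():
        similar = corr[str(manga_id)].dropna()
        similar = similar.drop(user_scores.index, errors="ignore")
        # Same weighting rule as the notebook: scores of 5 or below are penalized.
        parts.append(similar * score if score > 5 else (similar - 5) * score)
    # pd.concat needs at least one part, so user_scores must not be empty.
    return pd.concat(parts)

combined = weighted_neighbours(toy_user, toy_corr)
toy_scores = combined.groupby(combined.index).mean()
print(toy_scores)
```

Collecting the pieces in a list and concatenating once also avoids copying the growing series on every iteration.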


As functions:

In [58]:
def recomendations_by_item_similarity_for_given_user(user_id, filtered=False, subtype='specifity'):
    random_user = pivot_mangas.iloc[user_id].dropna()
    user_recomendation = pd.Series()
    for manga_index in range(len(random_user)):
        if filtered:
            similar_mangas = corr_mangas_reduced[str(random_user.index[manga_index])].dropna()
        else:
            similar_mangas = corr_mangas[str(random_user.index[manga_index])].dropna()
        similar_mangas = similar_mangas.drop(random_user.index, errors='ignore')
        score = random_user.values[manga_index]
        similar_mangas = similar_mangas.map(lambda x: x * score if score > 5 else (x-5) * score)
        user_recomendation = user_recomendation.append(similar_mangas)
    if subtype == 'popularity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).sum()
    elif subtype == 'specifity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).mean()
    user_recomendations = pd.DataFrame(user_recomendation)
    user_recomendations.columns = ['recomended_score']
    user_recomendations['manga_id'] = user_recomendations.index
    user_recomendations.index = np.arange(user_recomendations.shape[0])
    df_user_recomendation = pd.merge(user_recomendations, mangas, on=['manga_id'], how='inner')
    recomendations = df_user_recomendation.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations

def recomendations_by_item_similarity_for_new_user(new_user, filtered=False, subtype='specifity'):
    user_recomendation = pd.Series()
    for manga_index in range(len(new_user)):
        if filtered:
            similar_mangas = corr_mangas_reduced[str(new_user.index[manga_index])].dropna()
        else:
            similar_mangas = corr_mangas[str(new_user.index[manga_index])].dropna()
        similar_mangas = similar_mangas.drop(new_user.index, errors='ignore')
        score = new_user.values[manga_index]
        similar_mangas = similar_mangas.map(lambda x: x * score if score > 5 else (x-5) * score)
        user_recomendation = user_recomendation.append(similar_mangas)
    if subtype == 'popularity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).sum()
    elif subtype == 'specifity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).mean()
    user_recomendations = pd.DataFrame(user_recomendation)
    user_recomendations.columns = ['recomended_score']
    user_recomendations['manga_id'] = user_recomendations.index
    user_recomendations.index = np.arange(user_recomendations.shape[0])
    df_user_recomendation = pd.merge(user_recomendations, mangas, on=['manga_id'], how='inner')
    recomendations = df_user_recomendation.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations


In [59]:
recomendations_by_item_similarity_for_given_user(2)

  user_recomendation = pd.Series()


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,6.562621,4456,Suzunari!,9084,20,6.200000
1,4.777334,3254,Tetsuwan Birdy,3088,21,7.666667
2,4.716608,449,Miracle☆Girls,9966,33,6.939394
3,4.333647,386,Kimi no Unaji ni Kanpai!,10297,25,6.440000
4,3.826842,8473,Promise,3174,27,7.814815
...,...,...,...,...,...,...
4222,-8.270055,9276,Aishikata mo Wakarazuni,9328,29,6.482759
4223,-8.351268,8113,Tekken Chinmi Legends,1539,20,7.700000
4224,-9.743286,1366,Dennou Believers,5769,21,7.142857
4225,-10.198359,3546,Miseinen Lovers,8059,25,6.920000


Better still, the recommendation can be built from an arbitrary, user-supplied set of scores:

In [60]:
some_personal_scores = {'Berserk': 10, 'Neon Genesis Evangelion': 10, 'Gantz': 8, 'Monster': 8, 
                        'One Piece': 10, 'Akira': 5, 'Kiseijuu': 7, 'Death Note': 9,
                        'Dragon Ball': 7, 'Hunter x Hunter': 9}

for manga_name in some_personal_scores.keys():
    if manga_name not in mangas['manga_name'].values:
        raise Exception(f'The manga {manga_name} is not on the manga database.')
print('Data introduced is ok.')

Data introduced is ok.


In [61]:
def prepare_new_user(dict_scores):
    manga_dict = {}
    for manga_name in dict_scores.keys():
        manga_id = mangas[mangas['manga_name'] == manga_name].manga_id.iloc[0]
        manga_dict[manga_id] = dict_scores[manga_name]
    new_user = pd.Series(manga_dict)
    return new_user

new_user = prepare_new_user(some_personal_scores)
new_user

2      10
698    10
564     8
1       8
13     10
664     5
401     7
21      9
42      7
26      9
dtype: int64
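
As a hedged alternative to filtering the catalogue once per title, the name-to-id lookup can be built a single time and combined with the validation step. This is only a sketch: `toy_mangas` and `scores_by_id` are hypothetical names, and the three-row catalogue merely imitates the `manga_id` and `manga_name` columns of `mangas`.

```python
import pandas as pd

# Hypothetical miniature catalogue with the same two columns used by prepare_new_user.
toy_mangas = pd.DataFrame(
    {"manga_id": [2, 13, 21], "manga_name": ["Berserk", "One Piece", "Death Note"]}
)

def scores_by_id(dict_scores, catalogue):
    """Map {manga_name: score} to a Series indexed by manga_id."""
    # Keep the first id for a repeated name, like `.iloc[0]` in prepare_new_user.
    name_to_id = catalogue.drop_duplicates("manga_name").set_index("manga_name")["manga_id"]
    missing = [name for name in dict_scores if name not in name_to_id.index]
    if missing:
        raise KeyError(f"Not in the catalogue: {missing}")
    return pd.Series({int(name_to_id[name]): score for name, score in dict_scores.items()})

toy_new_user = scores_by_id({"Berserk": 10, "Death Note": 9}, toy_mangas)
print(toy_new_user)
```

Reporting every missing title at once, rather than stopping at the first one, makes it easier to correct a long list of scores in one pass.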

In [62]:
recomendations_by_item_similarity_for_new_user(new_user)

  user_recomendation = pd.Series()


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,7.873697,13844,Koisuru Yajuu,11451,23,6.521739
1,7.825673,6572,Get the Moon,8352,21,7.190476
2,7.746810,4348,Love Laboratory,10086,25,6.200000
3,7.651809,11659,Close to My Sweetheart,11876,21,6.428571
4,7.446664,7281,Ennui na Kanojo,14172,27,6.592593
...,...,...,...,...,...,...
4233,-8.720813,1366,Dennou Believers,5769,21,7.142857
4234,-8.793985,13166,Kodomo wa Tomaranai,2720,25,7.760000
4235,-9.625891,14382,Aki-chan no Iibun,10111,20,6.600000
4236,-11.920275,9428,Fukigen na Aibu,13322,25,6.760000


In [63]:
recomendations_by_item_similarity_for_new_user(new_user, filtered=True)

  user_recomendation = pd.Series()


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,5.081413,7458,Death Note Another Note: Los Angeles BB Renzok...,192,273,8.285714
1,4.148694,14721,D-Frag!,890,214,7.556075
2,4.120732,534,Spiral: Suiri no Kizuna,760,182,7.956044
3,4.045906,1110,Shiawase Kissa 3-choume,397,458,8.268559
4,3.888315,1009,Hachimitsu to Clover,213,219,8.374429
...,...,...,...,...,...,...
473,-1.899938,932,Koroshiya 1,1582,295,7.593220
474,-2.050779,1023,Koukaku Kidoutai: The Ghost in the Shell,597,265,7.509434
475,-2.264999,3614,Subarashii Sekai,1239,332,7.885542
476,-2.463196,904,Kozure Ookami,36,215,8.637209


In [64]:
recomendations_by_item_similarity_for_new_user(new_user, subtype='popularity')

  user_recomendation = pd.Series()


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,54.609464,10704,Koi no Uta,6170,49,7.265306
1,52.018524,4375,Lamp no Ousama,13987,42,6.404762
2,50.985912,4213,Operation Liberate Men,3346,24,7.666667
3,50.030282,8721,Kimi to Boku no Junjou Renai Jijou,7796,37,6.891892
4,48.566500,12748,Futari Awasete Puramai Zero,10582,71,6.816901
...,...,...,...,...,...,...
4233,-54.402380,15156,Akachan no Oshigoto,16209,22,5.454545
4234,-56.607874,641,Love♡Witch,13370,22,6.181818
4235,-60.366268,10792,Ashita mo Kitto Koishiteru,10673,21,6.904762
4236,-62.229120,2130,Benkyou Shinasai!,15189,54,5.833333


In [65]:
recomendations_by_item_similarity_for_new_user(new_user, filtered=True, subtype='popularity')

  user_recomendation = pd.Series()


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,21.879390,3537,Yankee-kun to Megane-chan,1176,644,7.818323
1,21.676449,5664,Nurarihyon no Mago,831,470,7.778723
2,21.382791,219,Alive: Saishuu Shinkateki Shounen,1043,381,7.690289
3,20.380160,15578,GE: Good Ending,1214,679,7.572901
4,20.333029,13702,Tonari no Kaibutsu-kun,367,1004,8.074701
...,...,...,...,...,...,...
473,-14.355456,1023,Koukaku Kidoutai: The Ghost in the Shell,597,265,7.509434
474,-14.779173,904,Kozure Ookami,36,215,8.637209
475,-15.854993,3614,Subarashii Sekai,1239,332,7.885542
476,-16.902517,1373,Nijigahara Holograph,2183,498,7.558233


#### Similarity between users

Building the pivot tables:

In [66]:
pivot_users = ratings.pivot_table(index=['manga_id'], columns=['user'], values='score')
pivot_mangas = ratings.pivot_table(index=['user'], columns=['manga_id'], values='score')

I condense the process seen earlier in the notebook, making the recommendation by specificity:

In [67]:
random_user = pivot_mangas.iloc[2678].dropna()
# Compute the correlations with the rest of the users:
similar_users_by_pearson = pivot_users.corrwith(other=random_user, method='pearson').dropna()
similar_users_by_spearman = pivot_users.corrwith(other=random_user, method='spearman').dropna()
# User similarity is a weighted blend of the two correlation types.
similar_users = 0.7*similar_users_by_spearman + 0.3*similar_users_by_pearson

# Table of similar users:
df_similar_users = pd.DataFrame(similar_users)
df_similar_users.columns = ['Similarity']
# Keep only the users with a positive similarity score.
df_similar_users = df_similar_users[df_similar_users['Similarity'] > 0]

# Start building the recommendation scores:
recomendations_by_similar_users = pd.Series()
for i in range(df_similar_users.shape[0]):
    current_user = df_similar_users.index[i]
    user_similarity = df_similar_users.values[i][0]
    # Build a series to handle each similar user's data:
    series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
    series_current_user.index=ratings[ratings['user'] == current_user].manga_id
    # Drop from this series the items the target user has already consumed:
    series_current_user = series_current_user.drop(random_user.index, errors='ignore')
    # Multiply the scores by the user's similarity. Note that failing scores
    # (5 or below) are penalized, although not very strongly.
    series_current_user = series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
    recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)

# Aggregation function: by popularity or by specificity:
# recomendations_by_similar_users_pop = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
recomendations_by_similar_users_esp = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).mean()

# Build the dataframe used to present the results:
df_recomendations_by_similar_users_esp = pd.DataFrame(recomendations_by_similar_users_esp)
df_recomendations_by_similar_users_esp.columns = ['recomended_score']
df_recomendations_by_similar_users_esp['manga_id'] = df_recomendations_by_similar_users_esp.index
df_recomendations_by_similar_users_esp.index = np.arange(df_recomendations_by_similar_users_esp.shape[0])

df_recomendations_by_similar_users_esp = pd.merge(df_recomendations_by_similar_users_esp, mangas, on=['manga_id'], how='inner')

# Recommendations:
df_recomendations_by_similar_users_esp.sort_values(by=['recomended_score'], ascending=False)

  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)
  recomendations_by_similar_users = pd.Series()


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3239,9.000000,12472,Bara no Kusari,13246,20,6.800000
1248,8.500000,2458,Kimi wa Boku wo Suki ni Naru,5346,41,7.365854
2543,6.890234,7972,Nayameru Hime to Mayoeru Ouji,14945,24,6.041667
2779,6.645900,9606,Yubi to Kuchibiru to Hitomi no Ijiwaru,9329,24,6.750000
1196,6.212536,1990,Ouji-sama no Renai Jijou,15668,26,6.115385
...,...,...,...,...,...,...
1232,-0.314540,2177,Battle Royale II: Blitz Royale,16367,85,3.647059
3456,-0.486279,13984,3H Before Kiss,16171,31,5.354839
2427,-0.920782,7281,Ennui na Kanojo,14172,27,6.592593
3492,-0.959027,14218,My Sweet Sisters,16373,29,3.379310
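
The similarity used above blends two correlations, 0.7 times Spearman plus 0.3 times Pearson, and keeps only users with a positive result. The following is a minimal sketch of that blend on a made-up table. The users `ana` and `bob`, the helper names, and the hand-rolled Spearman (Pearson on ranks of paired values, used here to avoid depending on scipy) are assumptions for illustration, not the notebook's `corrwith` calls.

```python
import pandas as pd

# Hypothetical ratings: rows are manga ids, columns are users (same layout as pivot_users).
toy_pivot_users = pd.DataFrame(
    {"ana": [10, 8, 6, 4], "bob": [4, 6, 8, 10]},
    index=[1, 2, 3, 4],
)
target = pd.Series([9, 8, 5, 3], index=[1, 2, 3, 4])

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the ranks of paired values."""
    paired = pd.concat([a, b], axis=1, keys=["a", "b"]).dropna()
    return paired["a"].rank().corr(paired["b"].rank())

def blended_similarity(pivot, user, w_spearman=0.7, w_pearson=0.3):
    """Weighted blend of rank-based and linear correlation, one value per column."""
    pearson = pivot.apply(lambda column: column.corr(user))
    rank_based = pivot.apply(lambda column: spearman(column, user))
    return (w_spearman * rank_based + w_pearson * pearson).dropna()

similarity = blended_similarity(toy_pivot_users, target)
neighbours = similarity[similarity > 0]
print(neighbours)
```

Here `ana` orders the mangas like the target and is kept as a neighbour, while `bob` orders them in reverse and is filtered out.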


As functions:

In [68]:
def recomendations_by_user_similarity_for_given_user(user_id, subtype='specifity'):
    random_user = pivot_mangas.iloc[user_id].dropna()
    # Compute the correlations with the rest of the users:
    similar_users_by_pearson = pivot_users.corrwith(other=random_user, method='pearson').dropna()
    similar_users_by_spearman = pivot_users.corrwith(other=random_user, method='spearman').dropna()
    # User similarity is a weighted blend of the two correlation types.
    similar_users = 0.7*similar_users_by_spearman + 0.3*similar_users_by_pearson
    # Table of similar users:
    df_similar_users = pd.DataFrame(similar_users)
    df_similar_users.columns = ['Similarity']
    # Keep only the users with a positive similarity score.
    df_similar_users = df_similar_users[df_similar_users['Similarity'] > 0]
    # Start building the recommendation scores:
    recomendations_by_similar_users = pd.Series()
    for i in range(df_similar_users.shape[0]):
        current_user = df_similar_users.index[i]
        user_similarity = df_similar_users.values[i][0]
        # Build a series to handle each similar user's data:
        series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
        series_current_user.index=ratings[ratings['user'] == current_user].manga_id
        # Drop from this series the items the target user has already consumed:
        series_current_user = series_current_user.drop(random_user.index, errors='ignore')
        # Multiply the scores by the user's similarity. Note that failing scores
        # (5 or below) are penalized, although not very strongly.
        series_current_user = series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
        recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)
    # Aggregation function: by popularity or by specificity:
    if subtype == 'popularity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
    elif subtype == 'specifity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).mean()
    # Build the dataframe used to present the results:
    recomendations = pd.DataFrame(recomendations_by_similar_users)
    recomendations.columns = ['recomended_score']
    recomendations['manga_id'] = recomendations.index
    recomendations.index = np.arange(recomendations.shape[0])
    recomendations = pd.merge(recomendations, mangas, on=['manga_id'], how='inner')
    # Recommendations:
    recomendations = recomendations.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations
    

def recomendations_by_user_similarity_for_new_user(new_user, subtype='specifity'):
    similar_users_by_pearson = pivot_users.corrwith(other=new_user, method='pearson').dropna() 
    similar_users_by_spearman = pivot_users.corrwith(other=new_user, method='spearman').dropna() 
    # User similarity is a weighted blend of the two correlation types.
    similar_users = 0.7*similar_users_by_spearman + 0.3*similar_users_by_pearson
    # Table of similar users:
    df_similar_users = pd.DataFrame(similar_users)
    df_similar_users.columns = ['Similarity']
    # Keep only the users with a positive similarity score.
    df_similar_users = df_similar_users[df_similar_users['Similarity'] > 0]
    # Start building the recommendation scores:
    recomendations_by_similar_users = pd.Series()
    for i in range(df_similar_users.shape[0]):
        current_user = df_similar_users.index[i]
        user_similarity = df_similar_users.values[i][0]
        # Build a series to handle each similar user's data:
        series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
        series_current_user.index=ratings[ratings['user'] == current_user].manga_id
        # Drop from this series the items the target user has already consumed:
        series_current_user = series_current_user.drop(new_user.index, errors='ignore')
        # Multiply the scores by the user's similarity. Note that failing scores
        # (5 or below) are penalized, although not very strongly.
        series_current_user = series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
        recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)
    # Aggregation function: by popularity or by specificity:
    if subtype == 'popularity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
    elif subtype == 'specifity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).mean()
    # Build the dataframe used to present the results:
    recomendations = pd.DataFrame(recomendations_by_similar_users)
    recomendations.columns = ['recomended_score']
    recomendations['manga_id'] = recomendations.index
    recomendations.index = np.arange(recomendations.shape[0])
    recomendations = pd.merge(recomendations, mangas, on=['manga_id'], how='inner')
    # Recommendations:
    recomendations = recomendations.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations


In [69]:
recomendations_by_user_similarity_for_given_user(2)

  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)
  recomendations_by_similar_users = pd.Series()


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,7.255016,95,Lagoon Engine,8193,20,6.750000
1,6.702769,18747,The Bear-Like Fox Meets The Wolf,5387,27,7.370370
2,6.537838,7554,Second Kiss,9878,21,7.095238
3,6.486384,4944,Cream,8345,21,6.714286
4,6.485082,4651,Shinayaka ni Kizutsuite,12368,20,6.400000
...,...,...,...,...,...,...
4222,-0.389406,2177,Battle Royale II: Blitz Royale,16367,85,3.647059
4223,-0.537210,8575,Chissana Koi no Melody,16314,27,4.666667
4224,-0.629721,14218,My Sweet Sisters,16373,29,3.379310
4225,-1.296964,13354,Hissatsu Surume Katame,16369,24,3.458333
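
The weighting rule inside these functions can be isolated to see what a single vote contributes. This is a minimal sketch: `weighted_vote` and its `pass_mark` parameter are hypothetical names, and the rule itself is copied from the lambda used above.

```python
def weighted_vote(score, similarity, pass_mark=5):
    """Contribution of one neighbour's score, weighted by that neighbour's similarity."""
    if score > pass_mark:
        # Passing scores count in full.
        return score * similarity
    # Scores at or below the pass mark contribute zero or a negative amount.
    return (score - pass_mark) * similarity

for score in (9, 6, 5, 3):
    print(score, weighted_vote(score, similarity=0.5))
```

With a similarity of 0.5, a 9 contributes 4.5, a 6 contributes 3.0, a 5 contributes 0.0 and a 3 contributes -1.0. Penalties are therefore small compared with rewards, and the zeros produced by scores of exactly 5 are one reason the geometric mean collapses for many mangas.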


Better still, the recommendation can be built from an arbitrary, user-supplied set of scores:

In [70]:
some_personal_scores = {'Berserk': 10, 'Neon Genesis Evangelion': 10, 'Gantz': 8, 'Monster': 8, 
                        'One Piece': 10, 'Akira': 5, 'Kiseijuu': 7, 'Death Note': 9,
                        'Dragon Ball': 7, 'Hunter x Hunter': 9}

for manga_name in some_personal_scores.keys():
    if manga_name not in mangas['manga_name'].values:
        raise Exception(f'The manga {manga_name} is not on the manga database.')
print('Data introduced is ok.')

Data introduced is ok.


In [71]:
def prepare_new_user(dict_scores):
    manga_dict = {}
    for manga_name in dict_scores.keys():
        manga_id = mangas[mangas['manga_name'] == manga_name].manga_id.iloc[0]
        manga_dict[manga_id] = dict_scores[manga_name]
    new_user = pd.Series(manga_dict)
    return new_user

new_user = prepare_new_user(some_personal_scores)
new_user

2      10
698    10
564     8
1       8
13     10
664     5
401     7
21      9
42      7
26      9
dtype: int64

In [72]:
recomendations_by_user_similarity_for_new_user(new_user)

  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)
  recomendations_by_similar_users = pd.Series()


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,9.000000,12352,Yuki no Kuni kara,4629,23,7.043478
1,9.000000,15667,Make Sweet,6706,34,6.970588
2,9.000000,2481,Every Day Every Night,13771,20,6.600000
3,8.500000,8239,Vampire Knight: Ice Blue no Tsumi,1088,26,8.230769
4,8.500000,17633,Love & Noise!,3836,48,7.729167
...,...,...,...,...,...,...
4220,-1.282332,9060,Kokuhaku Gokko,15262,20,5.950000
4221,-1.693992,5922,Mobius Doumei,5794,31,7.290323
4222,-1.693992,18058,Setsuna No Rakuen,13179,51,6.960784
4223,-1.835775,5924,Yoru Koi,14234,43,6.023256


In [73]:
recomendations_by_user_similarity_for_new_user(new_user, subtype='popularity')

  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)
  recomendations_by_similar_users = pd.Series()


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,3158.877188,11,Naruto,741,3013,7.545304
1,2976.971433,25,Fullmetal Alchemist,3,2054,9.022882
2,2346.336936,8967,Onanie Master Kurosawa,128,1611,8.378647
3,2238.922578,12,Bleach,2117,2527,7.011872
4,2209.539790,9711,Bakuman.,149,1555,8.311254
...,...,...,...,...,...,...
4220,-4.000000,19639,Alice's Dream Files,14721,20,5.750000
4221,-4.139878,4061,Baby Pink KISS,15433,59,5.966102
4222,-8.566258,9000,Kagami no Kuni no Alice,16360,61,4.885246
4223,-9.689808,2177,Battle Royale II: Blitz Royale,16367,85,3.647059


# Recommendation engine:

Preliminaries:

In [74]:
import pandas as pd
import numpy as np

In [75]:
# The pandas version should be 0.24.2 or newer (and older than 2.0, where Series.append was removed).
pd.__version__

'1.1.3'

In [76]:
mangas = pd.read_csv('data/mangas_v2.csv')
scores = pd.read_csv('data/scores_v2.csv')
ratings = pd.merge(mangas, scores, on='manga_id')
ratings = ratings[['manga_id', 'user', 'score']]
corr_mangas = pd.read_csv('data/corr_pearson_mangas.csv', header=0, index_col=0)
corr_mangas_reduced = pd.read_csv('data/corr_pearson_mangas_filtered.csv', header=0, index_col=0)
pivot_users = ratings.pivot_table(index=['manga_id'], columns=['user'], values='score')
pivot_mangas = ratings.pivot_table(index=['user'], columns=['manga_id'], values='score')

Functions:

In [4]:
def recomendations_by_item_similarity_for_new_user(new_user, filtered=False, subtype='specifity'):
    user_recomendation = pd.Series()
    for manga_index in range(len(new_user)):
        if filtered:
            similar_mangas = corr_mangas_reduced[str(new_user.index[manga_index])].dropna()
        else:
            similar_mangas = corr_mangas[str(new_user.index[manga_index])].dropna()
        similar_mangas = similar_mangas.drop(new_user.index, errors='ignore')
        score = new_user.values[manga_index]
        similar_mangas = similar_mangas.map(lambda x: x * score if score > 5 else (x-5) * score)
        user_recomendation = user_recomendation.append(similar_mangas)
    if subtype == 'popularity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).sum()
    elif subtype == 'specifity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).mean()
    user_recomendations = pd.DataFrame(user_recomendation)
    user_recomendations.columns = ['recomended_score']
    user_recomendations['manga_id'] = user_recomendations.index
    user_recomendations.index = np.arange(user_recomendations.shape[0])
    df_user_recomendation = pd.merge(user_recomendations, mangas, on=['manga_id'], how='inner')
    recomendations = df_user_recomendation.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations

def recomendations_by_user_similarity_for_new_user(new_user, subtype='specifity'):
    similar_users_by_pearson = pivot_users.corrwith(other=new_user, method='pearson').dropna() 
    similar_users_by_spearman = pivot_users.corrwith(other=new_user, method='spearman').dropna() 
    # User similarity is a weighted blend of the two correlation types.
    similar_users = 0.7*similar_users_by_spearman + 0.3*similar_users_by_pearson
    # Table of similar users:
    df_similar_users = pd.DataFrame(similar_users)
    df_similar_users.columns = ['Similarity']
    # Keep only the users with a positive similarity score.
    df_similar_users = df_similar_users[df_similar_users['Similarity'] > 0]
    # Start building the recommendation scores:
    recomendations_by_similar_users = pd.Series()
    for i in range(df_similar_users.shape[0]):
        current_user = df_similar_users.index[i]
        user_similarity = df_similar_users.values[i][0]
        # Build a series to handle each similar user's data:
        series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
        series_current_user.index=ratings[ratings['user'] == current_user].manga_id
        # Drop from this series the items the target user has already consumed:
        series_current_user = series_current_user.drop(new_user.index, errors='ignore')
        # Multiply the scores by the user's similarity. Note that failing scores
        # (5 or below) are penalized, although not very strongly.
        series_current_user = series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
        recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)
    # Aggregation function: by popularity or by specificity:
    if subtype == 'popularity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
    elif subtype == 'specifity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).mean()
    # Build the dataframe used to present the results:
    recomendations = pd.DataFrame(recomendations_by_similar_users)
    recomendations.columns = ['recomended_score']
    recomendations['manga_id'] = recomendations.index
    recomendations.index = np.arange(recomendations.shape[0])
    recomendations = pd.merge(recomendations, mangas, on=['manga_id'], how='inner')
    # Recommendations:
    recomendations = recomendations.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations

def recomendations_by_item_similarity_for_given_user(user_id, filtered=False, subtype='specifity'):
    user = pivot_mangas.iloc[user_id].dropna()
    return recomendations_by_item_similarity_for_new_user(user, filtered=filtered, subtype=subtype)

def recomendations_by_user_similarity_for_given_user(user_id, subtype='specifity'):
    user = pivot_mangas.iloc[user_id].dropna()
    return recomendations_by_user_similarity_for_new_user(user, subtype=subtype)
    
def get_recomendations(user, main_type='user_similarity', subtype='specifity', reduced_dtb=False):
    if main_type == 'user_similarity':
        return recomendations_by_user_similarity_for_new_user(user, subtype=subtype)
    elif main_type == 'item_similarity':
        return recomendations_by_item_similarity_for_new_user(user, filtered=reduced_dtb, subtype=subtype)
    else:
        raise ValueError("main_type must be either 'user_similarity' or 'item_similarity'.")

def validate_user(user):
    flag = True
    for manga_name in user.keys():
        if manga_name not in mangas['manga_name'].values:
            print(f'The manga {manga_name} is not on the manga database.')
            flag = False
    if flag:
        print('Data introduced is ok.')
    return flag

def prepare_new_user(user):
    manga_dict = {}
    for manga_name in user.keys():
        manga_id = mangas[mangas['manga_name'] == manga_name].manga_id.iloc[0]
        manga_dict[manga_id] = user[manga_name]
    new_user = pd.Series(manga_dict)
    return new_user
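
To make the blended similarity in `recomendations_by_user_similarity_for_new_user` concrete, here is a minimal sketch on made-up data. The users `ana`, `luis` and `eva`, their scores, and the `toy_*` names are hypothetical and not part of the dataset. Spearman is computed as Pearson over ranks, which matches its definition when the overlap has no missing values, so the sketch does not need SciPy; the function above calls `corrwith(..., method='spearman')` directly.

```python
import pandas as pd

# Hypothetical toy data: rows are manga ids, columns are users.
toy_pivot_users = pd.DataFrame(
    {
        'ana': [9, 7, 5, 8, 10],
        'luis': [2, 6, 9, 3, 8],
        'eva': [10, 9, 4, 7, 6],
    },
    index=[1, 2, 3, 4, 5],
)
# The new user has rated mangas 1 to 4 only.
toy_new_user = pd.Series({1: 10, 2: 8, 3: 6, 4: 9})

# Restrict both sides to the mangas they have in common.
common = toy_pivot_users.index.intersection(toy_new_user.index)
overlap = toy_pivot_users.loc[common]
target = toy_new_user.loc[common]

# Pearson on the raw scores, Spearman as Pearson on the ranks.
pearson = overlap.corrwith(target)
spearman = overlap.rank().corrwith(target.rank())

# Same 0.7 / 0.3 weights as in the function above.
similarity = 0.7 * spearman + 0.3 * pearson

# Keep only the positively correlated users.
positive = similarity[similarity > 0]
print(positive)
```

Here `ana` mirrors the new user's scores exactly and gets a similarity of 1, while `luis` has the opposite taste and is dropped by the `> 0` filter, as in the `Similarity > 0` step of the function.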

Usage example:

In [8]:
user_example = {'Berserk': 10, 'Neon Genesis Evangelion': 10, 'Naruto': 8}

if validate_user(user_example):
    user = prepare_new_user(user_example)
    recomendations = get_recomendations(user, 'user_similarity', 'specifity')

recomendations

Data introduced is ok.



Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,10.000000,1021,The Summit,596,43,8.372093
1,10.000000,963,9-banme no Musashi,2557,20,7.450000
2,9.000000,412,Koibumi Biyori,6003,24,7.875000
3,9.000000,2576,Takeru ~ Opera Susanoh Sword of the Devil,2381,22,7.818182
4,9.000000,13213,Ragtonia,3271,28,7.642857
5,9.000000,11128,Miku-4,5801,27,7.296296
6,9.000000,166,Chou Shinri Genshou Nouryokusha Nanaki,6404,23,7.260870
7,9.000000,2996,Missing Piece,13306,21,7.047619
8,9.000000,8595,Houkago Orange,3819,32,6.968750
9,9.000000,4678,Dokuyaku to Otome,12070,44,6.681818
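
The `recomended_score` column above comes from the `'specifity'` aggregation, a mean of similarity-weighted scores. The following is a minimal sketch of how it differs from `'popularity'` (a sum), using made-up similar users. The helper `weight_score`, the list `similar_users` and the manga ids 101 to 103 are hypothetical and only serve the illustration.

```python
import pandas as pd

def weight_score(score, similarity):
    """Same rule as the lambda in the function: scores above 5 count in full,
    scores of 5 or below are shifted down by 5, so they add zero or less."""
    return score * similarity if score > 5 else (score - 5) * similarity

# Hypothetical similar users: (similarity, {manga_id: score}).
similar_users = [
    (1.0, {101: 10}),
    (0.5, {102: 8, 103: 4}),
    (0.8, {102: 9, 103: 10}),
    (0.6, {102: 7}),
]

# One weighted series per similar user, then a single concatenation.
parts = []
for sim, scores in similar_users:
    weighted = pd.Series(scores, dtype=float).map(
        lambda s, sim=sim: weight_score(s, sim)
    )
    parts.append(weighted)
contributions = pd.concat(parts)

# 'popularity' adds the contributions up, 'specifity' averages them.
by_sum = contributions.groupby(contributions.index).sum()
by_mean = contributions.groupby(contributions.index).mean()

print(pd.DataFrame({'popularity': by_sum, 'specifity': by_mean}))
```

In this toy data the sum ranks manga 102 first because three users rated it, while the mean ranks manga 101 first on the strength of a single user with similarity 1.0.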


In [10]:
recomendations.sort_values('manga_name')

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
2022,49.500952,10000,"""Bungaku Shoujo"" Series",469,47,7.851064
1212,95.740403,11776,"""Bungaku Shoujo"" to Shi ni Tagari no Pierrot",2942,92,7.467391
367,282.688682,682,"""Kare"" First Love",2496,450,7.604444
1290,87.824673,69,"""Suki"" to Ienai.",11301,245,6.461224
2293,40.691849,4769,#000000: Ultra Black,9085,23,7.304348
229,406.872061,38,+Anima,1909,414,7.555556
698,167.534054,12922,+C: Sword and Cornett,4846,157,7.280255
1318,85.748514,17931,-Hitogatana-,5072,59,6.847458
3658,13.703018,1144,...Curtain.: Sensei to Kiyoraka ni Dousei,15626,101,6.217822
2150,44.441009,7886,...Seishunchuu!,8223,30,7.366667


Reference: http://blog.findemor.es/2018/02/sistemas-de-recomendacion-en-python/

---