# Recommender system

Carlos Pinto Pérez

## Recommender system background

In [2]:
import pandas as pd
import numpy as np

Load data

In [3]:
mangas = pd.read_csv('mangas_v2.csv')
scores = pd.read_csv('scores_v2.csv')
ratings = pd.merge(mangas, scores, on='manga_id')
ratings = ratings[['manga_id', 'user', 'score']]
print(f'Ratings shape: {ratings.shape}')
ratings.head()

Ratings shape: (484502, 3)


Unnamed: 0,manga_id,user,score
0,2,Polyphemus,7
1,2,Aja,10
2,2,Tumerking,6
3,2,aindah,10
4,2,infinity,9


Pivot matrix. Analysis can be done by rows (users) or by columns (items). I will start from the last.

In [4]:
pivot_mangas = ratings.pivot_table(index=['user'], columns=['manga_id'], values='score')
pivot_mangas.head()

manga_id,1,2,3,4,7,8,9,10,11,12,...,19947,19952,19961,19968,19980,19981,19983,19984,19987,19995
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--Zora--,9.0,,,,,,,,,,...,,,,,,,,,,
--ariste,,,,,,8.0,,,,,...,,,,,,,,,4.0,
-Alians-,,7.0,10.0,,,,,,,,...,,,,,,,,,,
-Anokata,10.0,,10.0,,10.0,,,,7.0,5.0,...,,,,,,,,,,
-BlackRabbit-,9.0,,,,,,6.0,,8.0,,...,,,,,6.0,,,1.0,,


### Items similarity

Let's see how a particular recomendation works and then generalize it. I will use the manga with id = 2

In [4]:
mangas[mangas['manga_id'] == 2]

Unnamed: 0,manga_id,manga_name,manga_rank,number_scores,mean_score
0,2,Berserk,1,1913,9.003659


This column indicates its ratings

In [5]:
ratings_manga_2 = pivot_mangas[2]
ratings_manga_2

user
--Zora--            NaN
--ariste            NaN
-Alians-            7.0
-Anokata            NaN
-BlackRabbit-       NaN
-Chrissi-chan-      NaN
-Ereya-            10.0
-Everlasting-       NaN
-FAWKYOURFACE-      NaN
-Gia-               NaN
-K0K0-              NaN
-Karoshi-          10.0
-Kazu               NaN
-Keeper-            NaN
-Lunicorn-          NaN
-Lupa-              NaN
-Mr-Stick-          NaN
-Naami-             NaN
-Phantom-Fr         8.0
-ShWeePs-           NaN
-TACHYON-           NaN
-Zekkai-            NaN
-_-KAI-_-           NaN
-everglow-          NaN
-ice-chan-          NaN
-redux              NaN
-steez-             8.0
-weilyem-           NaN
01RuneMishe         NaN
06blackheart        NaN
                   ... 
zaiav               NaN
zalti               8.0
zamieo             10.0
zarinna             NaN
zariuq              7.0
zauru               NaN
zawa113             NaN
zawardoooo         10.0
zedwardzenyz        NaN
zeltron             NaN
zenron     

The approach is getting the correlation of this column with the rest of columns stored, and then will order the values. This gives a measure of "similarity".

In [6]:
similar_mangas_of_2 = pivot_mangas.corrwith(other=ratings_manga_2, method='pearson').dropna() 
df_similar_mangas_of_2 = pd.DataFrame(similar_mangas_of_2)
df_similar_mangas_of_2.columns = ['Similarity']
df_similar_mangas_of_2 = pd.merge(df_similar_mangas_of_2, mangas, on=['manga_id'])
df_similar_mangas_of_2.sort_values(by=['Similarity'], ascending=False)

  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)


Unnamed: 0,manga_id,Similarity,manga_name,manga_rank,number_scores,mean_score
3986,19995,1.0,Tayutama: Kiss on my Deity,13561,22,6.318182
2103,5585,1.0,69,15757,20,6.500000
2985,11771,1.0,Romantist Egoist,9440,38,6.500000
2972,11690,1.0,18-sai no Kodou,13713,53,6.415094
370,559,1.0,Ura Peach Girl,10182,59,6.932203
2966,11659,1.0,Close to My Sweetheart,11876,21,6.428571
3538,15964,1.0,Otome no Sainou,6055,31,6.870968
3854,18760,1.0,Hanjuku-hime,9672,43,6.534884
2534,8595,1.0,Houkago Orange,3819,32,6.968750
1701,3963,1.0,Tau,11745,20,6.800000


There are a lot of 'perfect' similarities with this particular manga, probably dued to the little amount of data. With that, we can add an additional sorting by mangas popularity or by mean scores.

Sorting by popularity:

In [7]:
df_similar_mangas_of_2.sort_values(by=['Similarity', 'manga_rank'], ascending=[False, True])

Unnamed: 0,manga_id,Similarity,manga_name,manga_rank,number_scores,mean_score
786,1222,1.0,Little Busters! The 4-koma,1205,25,7.280000
2505,8473,1.0,Promise,3174,27,7.814815
1752,4154,1.0,"Kiss, Zekkou, Kiss",3254,26,7.192308
2329,7219,1.0,Strange Orange,3426,35,7.257143
3624,16787,1.0,All of You in the World,3641,26,7.923077
2534,8595,1.0,Houkago Orange,3819,32,6.968750
3102,12673,1.0,Shitsuren Chocolatier,4226,27,7.444444
1745,4119,1.0,Hakobune Hakusho,4298,37,7.486486
3064,12346,1.0,Ruby Doll,5484,24,7.333333
2444,7955,1.0,Buriki no Kanzume,5573,56,7.285714


Sorting by mean score:

In [8]:
df_similar_mangas_of_2.sort_values(by=['Similarity', 'mean_score'], ascending=False)

Unnamed: 0,manga_id,Similarity,manga_name,manga_rank,number_scores,mean_score
3624,16787,1.0,All of You in the World,3641,26,7.923077
2505,8473,1.0,Promise,3174,27,7.814815
3603,16654,1.0,Taiyou ga Yondeiru!,7207,32,7.500000
1745,4119,1.0,Hakobune Hakusho,4298,37,7.486486
3102,12673,1.0,Shitsuren Chocolatier,4226,27,7.444444
3064,12346,1.0,Ruby Doll,5484,24,7.333333
3531,15839,1.0,Kirei no Tamago,5586,64,7.312500
2444,7955,1.0,Buriki no Kanzume,5573,56,7.285714
786,1222,1.0,Little Busters! The 4-koma,1205,25,7.280000
2329,7219,1.0,Strange Orange,3426,35,7.257143


However, this is still weak. We can do some additional filtering:

In [7]:
df_similar_mangas_of_2_filtred = df_similar_mangas_of_2[df_similar_mangas_of_2['number_scores'] > 99]
df_similar_mangas_of_2_filtred.sort_values(by=['Similarity'], ascending=False)

Unnamed: 0,manga_id,Similarity,manga_name,manga_rank,number_scores,mean_score
1,2,1.000000,Berserk,1,1913,9.003659
2513,8519,0.917663,Yoru no Gakkou e Oide yo!,3263,109,7.486239
2882,11133,0.774194,37°C no Boyfriend,11326,100,6.630000
1242,2538,0.772172,Legend of Nereid,4097,136,7.279412
275,423,0.758929,Pichi Pichi Pitch: Mermaid Melody,8827,159,6.490566
209,329,0.737011,Gokkun! Pucho,3408,126,7.261905
583,886,0.710174,"Sekai no Chuushin de, Ai wo Sakebu",1723,151,7.185430
53,58,0.702706,W-Juliet,976,259,7.945946
55,65,0.692258,17-sai: Hajimete no H,14226,162,6.271605
2330,7220,0.686349,Bokura wa Itsumo,3351,186,7.478495


### Users similarity

This time the pivot matrix will have the users as columns:

In [3]:
pivot_users = ratings.pivot_table(index=['manga_id'], columns=['user'], values='score')
pivot_users.head()

user,--Zora--,--ariste,-Alians-,-Anokata,-BlackRabbit-,-Chrissi-chan-,-Ereya-,-Everlasting-,-FAWKYOURFACE-,-Gia-,...,zman75,znyggisen,zoddtheimmortal,zogwarg,zombiesonacid,zombor11,zonnikku,zucchinichop,zuziako,zybactik
manga_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,9.0,,,10.0,9.0,,10.0,,,,...,,7.0,,,,10.0,10.0,8.0,,
2,,,7.0,,,,10.0,,,,...,,,10.0,9.0,,10.0,8.0,10.0,,
3,,,10.0,10.0,,,10.0,,,,...,,,,10.0,,10.0,10.0,,,
4,,,,,,,,,,,...,,,,,,,,,,8.0
7,,,,10.0,,,,,,,...,,,,,,10.0,,,,


It's a good idea to take into consideration how many ratings has done each user. This could be useful to improve the recommendation system, altough it is not implemented in this demo.

In [5]:
users = pd.DataFrame(data={'user': ratings['user'].unique()})
n_scores_temp = ratings[['user', 'score']].groupby('user').count()
n_scores = pd.merge(users, n_scores_temp, on=['user'])
n_scores.columns=['user', 'n_scores']
n_scores.head()

Unnamed: 0,user,n_scores
0,Polyphemus,372
1,Aja,35
2,Tumerking,140
3,aindah,113
4,infinity,36


As I did with the manga with id == 2, this time I will focus on the user "infinity".

In [7]:
ratings[ratings['user'] == 'infinity']

Unnamed: 0,manga_id,user,score
4,2,inf,9
1738,25,inf,9
3359,13,inf,8
9756,651,inf,7
42473,44,inf,8
56714,267,inf,9
66206,583,inf,7
70777,373,inf,7
75612,735,inf,8
80705,1076,inf,6


Let's see who are the users with most similarity with this one, using the correlations again:

In [10]:
ratings_user_infinity = pivot_users['infinity'].dropna()
ratings_user_infinity

manga_id
2        9.0
11       7.0
12       7.0
13       8.0
15       8.0
19       7.0
25       9.0
44       8.0
47       8.0
48       7.0
114      7.0
136      9.0
221      7.0
267      9.0
278      8.0
373      7.0
447      9.0
564      7.0
572      7.0
583      7.0
598      8.0
616      7.0
648      7.0
651      7.0
671      7.0
735      8.0
908      9.0
967      8.0
1076     6.0
1534     8.0
2436     9.0
5113     9.0
5801     9.0
5911     7.0
11329    8.0
12586    7.0
Name: infinity, dtype: float64

In [11]:
similar_users_of_infinity = pivot_users.corrwith(other=ratings_user_infinity, method='pearson').dropna() 
df_similar_users_of_infinity = pd.DataFrame(similar_users_of_infinity)
df_similar_users_of_infinity.columns = ['Similarity']
df_similar_users_of_infinity = pd.merge(df_similar_users_of_infinity, n_scores, on=['user'])
df_similar_users_of_infinity.sort_values(by=['Similarity'], ascending=False)

  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)


Unnamed: 0,user,Similarity,n_scores
589,FiraDeviant,1.0,20
1573,SadieCahill,1.0,39
1317,OURANLOVERJINX,1.0,470
554,Eternal_Light,1.0,71
3079,tateha,1.0,51
1545,RoxRobstah,1.0,34
1551,RukiaRocks,1.0,50
2140,anniebananie,1.0,22
1044,LucaxCubix,1.0,21
1869,Ton_Koti,1.0,36


As happened with the mangas first approach, there are a lot of perfect correlations. We can use a secondary metric to get a better order:

In [10]:
df_similar_users_of_infinity.sort_values(by=['Similarity', 'n_scores'], ascending=False)

Unnamed: 0,user,Similarity,n_scores
1317,OURANLOVERJINX,1.0,470
620,Fujaku,1.0,450
1092,MahouShoujoLain,1.0,303
1196,Moon_Light,1.0,255
2757,mixing-scents,1.0,158
2665,lilytenjouXP,1.0,137
1105,MangaGreat,1.0,92
2189,basbas,1.0,92
3127,tweetlepie,1.0,91
2155,arrowofthenight,1.0,88


So now we can identify which items (mangas) have been the most valued between the users that are similar to the target user. First, let's choose only the first ten most similar users:

In [18]:
df = df_similar_users_of_infinity.sort_values(by=['Similarity', 'n_scores'], ascending=False)

# This is not the best way to choose the best 10.
# Also, I choose 10 as this is a dema.
most_similar_users_to_infinity = []
for i in range(10):
    most_similar_users_to_infinity.append(df.iloc[i].user)
    
most_similar_users_to_infinity

['OURANLOVERJINX',
 'Fujaku',
 'MahouShoujoLain',
 'Moon_Light',
 'mixing-scents',
 'lilytenjouXP',
 'MangaGreat',
 'basbas',
 'tweetlepie',
 'arrowofthenight']

The scores of the first user, ordered by ranking.

In [38]:
df1 = ratings[ratings['user'] == most_similar_users_to_infinity[0]].sort_values(by=['score'], ascending=False)
df1.index = df1.manga_id
df1 = df1[[column for column in df1.columns if column not in ['manga_id']]]
# Drop from the list the mangas that the target user has already read.
df1 = df1.drop(ratings_user_infinity.index, errors='ignore')
df1.head()

Unnamed: 0_level_0,user,score
manga_id,Unnamed: 1_level_1,Unnamed: 2_level_1
610,OURANLOVERJINX,10
7008,OURANLOVERJINX,10
4515,OURANLOVERJINX,10
8042,OURANLOVERJINX,10
125,OURANLOVERJINX,10


Multiply the score of each item with score > 5 up to the correlation the current user has with the target user. This way I get a "recomendation score".

I also chose users with positive correlation with the target user. It doesn't make sense get the other users: whether a person with different preferences than the user doesn't like something doesn't indicate that the user will like it.

In [49]:
df = df_similar_users_of_infinity.sort_values(by=['Similarity', 'n_scores'], ascending=False)
df = df[df['Similarity'] > 0]

recomendations_by_similar_users = pd.Series()
for i in range(df.shape[0]):  # Now I get all the similar users, not only the first ten.
    current_user = df.iloc[i].user
    user_similarity = df.iloc[i].Similarity

    series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
    series_current_user.index=ratings[ratings['user'] == current_user].manga_id

    series_current_user.drop(ratings_user_infinity.index, errors='ignore')
    # I also penalize the bad scores:
    series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
    recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)
# Using the sum as aggregating function means that the recommender system take into account the popularity of the mangas:
# For a specific recommendation the aggregation function can be, for instance, the geometric mean.
recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
recomendations_by_similar_users

1         4959
2         8892
3         5301
4         2118
7         2564
8         1918
9         3759
10        3522
11        9857
12        8678
13       10124
14        2638
15        3855
16        3151
17        1910
18        1204
19         830
20        2749
21        9253
22        5042
23        2375
24        5025
25       10777
26        5189
27         994
28        2386
29        1565
30        4367
31        2530
32        1839
         ...  
19802       33
19809      186
19810      201
19833       63
19839      384
19841       21
19844      821
19869       82
19871      124
19878       72
19885      843
19896      679
19911      270
19922      190
19925      217
19926      178
19931      106
19932       63
19939       88
19945       62
19947      150
19952      104
19961       59
19968      126
19980      644
19981      154
19983      202
19984      420
19987       35
19995       50
Length: 4248, dtype: int64

In [51]:
recomendations_by_similar_users.sort_values(ascending=False)

25       10777
13       10124
11        9857
21        9253
2         8892
12        8678
598       6957
9711      6615
8967      6310
564       5849
583       5812
3986      5621
10010     5509
11734     5451
3         5301
908       5247
26        5189
47        5057
22        5042
24        5025
1         4959
8586      4943
656       4923
4632      4847
7519      4805
102       4724
336       4675
42        4665
642       4507
30        4367
         ...  
8419        25
8892        25
3723        25
3555        25
2458        25
7367        24
1952        24
4456        24
7281        24
3474        24
4529        23
9428        22
19448       22
6882        22
7345        22
19841       21
13984       21
12330       21
5501        21
13021       19
17189       19
2461        19
6136        18
4348        16
1990        16
7733        15
12472       14
8674        14
9606        13
213         12
Length: 4248, dtype: int64

Sort the results and show it:

In [57]:
df_users_infinity = pd.DataFrame(recomendations_by_similar_users.sort_values(ascending=False))
df_users_infinity.columns = ['recomended_score']
df_users_infinity['manga_id'] = df_users_infinity.index

recomendations_by_users_infinity = pd.merge(mangas[['manga_id', 'manga_name', 'mean_score']], 
                                            df_users_infinity, on=['manga_id'])
recomendations_by_users_infinity.sort_values('recomended_score', ascending=False)

Unnamed: 0,manga_id,manga_name,mean_score,recomended_score
2,25,Fullmetal Alchemist,9.022882,10777
3,13,One Piece,8.618492,10124
331,11,Naruto,7.545304,9857
20,21,Death Note,8.469379,9253
0,2,Berserk,9.003659,8892
847,12,Bleach,7.011872,8678
641,598,Fairy Tail,7.189990,6957
75,9711,Bakuman.,8.311254,6615
63,8967,Onanie Master Kurosawa,8.378647,6310
248,564,Gantz,7.735890,5849


We can also use the Spearman correlation, that seems more adecuate in this case that the classical Pearson correlation. That is, instead of using the distance between the users' scores to get their similarities, it seems more intuitive to use de distances between the **personal rankings** of the users (https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php).

Take into consideration that this is only valid as a similarity measure between users, not between items.

In [14]:
similar_users_of_infinity_spearman = pivot_users.corrwith(other=ratings_user_infinity, method='spearman').dropna() 
df_similar_users_of_infinity_spearman = pd.DataFrame(similar_users_of_infinity_spearman)
df_similar_users_of_infinity_spearman.columns = ['Similarity']
df_similar_users_of_infinity_spearman = pd.merge(df_similar_users_of_infinity_spearman, n_scores, on=['user'])
df_similar_users_of_infinity_spearman.sort_values(by=['Similarity'], ascending=False).head()

Unnamed: 0,user,Similarity,n_scores
1215,MrSanjuro,1.0,33
1858,Tinhinane-Ingui,1.0,39
1545,RoxRobstah,1.0,34
1365,Paraturtle,1.0,36
1551,RukiaRocks,1.0,50


In [15]:
df_similar_users_of_infinity_spearman.sort_values(by=['Similarity', 'n_scores'], ascending=False).head()

Unnamed: 0,user,Similarity,n_scores
620,Fujaku,1.0,450
402,DORAGONFLY,1.0,177
1622,Selaht27,1.0,172
2757,mixing-scents,1.0,158
1093,Mahou_Bujin,1.0,114


From here on in, users similarities will be calculated ponderating their Spearman-Pearson correlations, with weights 70-30.

### Generalization (recommender system by product similarities)

Calculate the correlation matrix. This process takes time.

In [13]:
corr_pearson_mangas = pivot_mangas.corr(method='pearson')  
corr_pearson_mangas.head()

manga_id,1,2,3,4,7,8,9,10,11,12,...,19947,19952,19961,19968,19980,19981,19983,19984,19987,19995
manga_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.409006,0.573935,0.296401,0.243751,0.221626,0.179385,0.053958,0.119629,0.048844,...,0.738549,-0.521817,-1.0,0.665016,0.294547,-0.354663,0.269692,0.212567,,1.0
2,0.409006,1.0,0.366907,0.261819,0.307543,0.297297,0.093504,0.07823,0.13769,0.166987,...,0.289414,0.175412,-1.0,0.050965,0.175027,0.358535,-0.203496,0.450499,,1.0
3,0.573935,0.366907,1.0,0.288774,0.192548,0.315443,0.139139,0.181722,0.072584,0.152991,...,0.944911,0.045162,1.0,0.5,-0.238073,-0.01494,0.394557,0.052437,,
4,0.296401,0.261819,0.288774,1.0,0.066961,0.437503,0.123408,0.114721,-0.053623,0.136216,...,,0.980581,,,0.503631,0.408248,0.416881,0.498585,,0.995871
7,0.243751,0.307543,0.192548,0.066961,1.0,0.04709,0.123593,0.247823,0.304135,0.303212,...,,0.794461,,,0.706897,0.326164,0.684177,0.445927,,


Apply a filter: to calculate the correlation between two items, they must share at least 100 scores.

In [14]:
corr_pearson_mangas_filtered = pivot_mangas.corr(method='pearson', min_periods=100)  
corr_pearson_mangas_filtered.head()

manga_id,1,2,3,4,7,8,9,10,11,12,...,19947,19952,19961,19968,19980,19981,19983,19984,19987,19995
manga_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.409006,0.573935,0.296401,0.243751,,0.179385,0.053958,0.119629,0.048844,...,,,,,,,,,,
2,0.409006,1.0,0.366907,0.261819,0.307543,,0.093504,0.07823,0.13769,0.166987,...,,,,,,,,,,
3,0.573935,0.366907,1.0,0.288774,0.192548,,0.139139,0.181722,0.072584,0.152991,...,,,,,,,,,,
4,0.296401,0.261819,0.288774,1.0,,,,,-0.053623,0.136216,...,,,,,,,,,,
7,0.243751,0.307543,0.192548,,1.0,,,,0.304135,0.303212,...,,,,,,,,,,


This matrix can be saved to improve the times, but it requires a lot of memory (scales quickly with the amount of data).

In [54]:
corr_pearson_mangas.to_csv('corr_pearson_mangas.csv')
corr_pearson_mangas_filtered.to_csv('corr_pearson_mangas_filtered.csv')

Let's choose a random user and get some recomendation for her.

In [20]:
random_user = pivot_mangas.iloc[2678].dropna()
random_user

manga_id
1        10.0
2         8.0
3         9.0
4         9.0
26        7.0
51        9.0
104       9.0
149       8.0
399      10.0
401       8.0
436       8.0
481      10.0
642       8.0
656      10.0
657      10.0
705       5.0
731      10.0
743       4.0
745       9.0
768       8.0
909       4.0
912       6.0
936       8.0
1373      9.0
1470      7.0
1471      8.0
1706      9.0
3008      5.0
3009      7.0
3258      8.0
3731      8.0
4632      9.0
5461      8.0
6604      6.0
7216      7.0
7375      9.0
8967      9.0
10690     5.0
11471     5.0
11734     3.0
14790     6.0
14893     6.0
15355     8.0
17192     5.0
17353     5.0
Name: Swarnadeep, dtype: float64

In [21]:
len(random_user)

45

For each manga _m_ that the user has scored, I get every similar manga and multiply its correlation coefficient up to the score the user has assigned to _m_. This way, the user manga score discriminates which mangas seems more appropiate for her.

Furthermore, to aggregate all this measures for each manga _m_, I will use the sum() function. However, although that election is very intuitive, it could be improved.

In [40]:
user_recomendation = pd.Series()

for manga_index in range(len(random_user)):
    similar_mangas = corr_pearson_mangas[random_user.index[manga_index]].dropna()
    # Multiplies the correlation coefficient up to the score assigned by the user.
    similar_mangas = similar_mangas.map(lambda x: x * random_user.values[manga_index])
    # Get the recommendation scores
    user_recomendation = user_recomendation.append(similar_mangas)
user_recomendation

1        10.000000
2         4.090057
3         5.739346
4         2.964012
7         2.437513
8         2.216258
9         1.793855
10        0.539584
11        1.196292
12        0.488439
13        1.111982
14        2.418527
15        2.128045
16        3.829819
17        2.961823
18        3.408165
19        2.222374
20        2.021492
21        2.309532
22        3.563782
23        2.318606
24        2.118071
25        3.732954
26        2.643361
27        2.442395
28        0.831100
29        2.100665
30        1.809499
31        3.856466
32        2.550374
           ...    
19729    -1.106567
19736    -1.250100
19745     3.667332
19767     4.579870
19797     5.000000
19809     0.000000
19810     3.234983
19833     2.500000
19839     2.258030
19841     5.000000
19844     1.898922
19869     5.000000
19871    -2.206307
19878     2.097067
19885     1.166565
19896     2.755747
19911     2.516385
19922    -0.769752
19925     3.201061
19926    -2.500000
19931    -2.500000
19932     0.

In [41]:
len(user_recomendation)

153809

With the previous operation I repeated a lot of items:

In [42]:
len(user_recomendation.index.unique())

4248

So here is when I use the sum() as aggregation function, as said before.

In [43]:
user_recomendation = user_recomendation.groupby(user_recomendation.index).sum()
user_recomendation

1        110.798177
2         98.830469
3        109.449835
4         76.107387
7         82.814613
8         79.144959
9         58.433464
10        63.848659
11        49.501681
12        52.022681
13        44.019191
14        73.508536
15        53.204647
16        62.206179
17       106.216927
18        92.076075
19        64.241118
20        43.097227
21        66.440720
22       103.068129
23        75.521329
24        74.488788
25        73.039882
26        84.773513
27        29.206385
28        61.980264
29        64.290683
30        74.656775
31        85.060887
32        67.725963
            ...    
19802     39.208202
19809     31.384173
19810     89.382516
19833     23.547178
19839     83.842333
19841     -8.017781
19844     55.355155
19869     31.666097
19871     22.409238
19878    120.893798
19885    110.831659
19896     84.134946
19911     62.056814
19922     84.315282
19925     61.754819
19926    -29.740392
19931     44.982713
19932    -34.240551
19939     42.014997


Remove from the recommended mangas those already scored by the target user:

In [44]:
user_recomendation = user_recomendation.drop(random_user.index, errors='ignore')  
user_recomendation.head(10)

7     82.814613
8     79.144959
9     58.433464
10    63.848659
11    49.501681
12    52.022681
13    44.019191
14    73.508536
15    53.204647
16    62.206179
dtype: float64

In [48]:
user_recomendations = pd.DataFrame(user_recomendation)
user_recomendations.columns = ['recomended_score']
user_recomendations['manga_id'] = user_recomendations.index
user_recomendations.index = np.arange(user_recomendations.shape[0])
user_recomendations

Unnamed: 0,recomended_score,manga_id
0,82.814613,7
1,79.144959,8
2,58.433464,9
3,63.848659,10
4,49.501681,11
5,52.022681,12
6,44.019191,13
7,73.508536,14
8,53.204647,15
9,62.206179,16


And get a full view of the results ordered by their recommendation score:

In [51]:
df_user_recomendation = pd.merge(user_recomendations, mangas, on=['manga_id'], how='inner')
df_user_recomendation.sort_values(by=['recomended_score'], ascending=False)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3673,199.198508,15484,Nemuru Baka,9452,22,6.727273
1002,186.284587,1610,Maniac Road,7413,21,7.095238
4119,177.127647,19283,Uchuu no SPARROW,9112,21,6.476190
2365,172.521669,6934,Green Beans,16328,23,5.086957
2674,170.770075,8799,Umi-chan no Otomodachi,14613,20,6.100000
1229,170.594296,2156,Kaguya-hime,3415,27,6.851852
3681,170.589875,15536,Shoujo Nemu,7696,27,7.037037
1468,169.869810,3254,Tetsuwan Birdy,3088,21,7.666667
4013,169.106703,18327,Monochrome Myst,7701,40,7.200000
1352,166.872469,2970,Moyashimon,2319,51,7.627451


### Generalización (sistema de recomendación por similaridad entre usuarios)

En principio había guardado en .csv tres matrices de correlaciones: de Pearson, de Spearman, y una ponderada. Pero eso me guardaría las correlaciones entre todos los usuarios existentes para que luego me bastara con cargar la lista por ahorrarme el cálculo. Esto vale (y de hecho lo hago) en la similaridad entre productos, pero aquí no (me sirve tener las correlaciones de entre todos los productos entre ellos en vez de hacer el cálculo de la correlación entre cada nuevo producto por todos los productos, porque es un dato que voy a usar constantemente). Aquí, repitiendo lo de arriba, solo necesito obtener la correlación entre un usuario y el resto con el único usuario objetivo, que la mayoría de los casos no estará de antemano en esa lista, sino que será un perfil nuevo, con lo que no tiene sentido guardar las matrices (o la matriz final ponderada) de las correlaciones entre usuarios.

In [7]:
# Las matrices de correlaciones completas serían así:

# corr_pearson_users = pivot_users.corr(method='pearson')  
# corr_spearman_users = pivot_users.corr(method='spearman')  

# La similitud entre usuarios la valoramos con una ponderación entre estas dos matrices.
# 70% correlación de Spearman - 30% correlación de Pearson.

# corr_users = 0.7*corr_spearman_users + 0.3*corr_pearson_users

# Notemos que la matriz de correlaciones de Pearson tiene más valores (menos NA) que la matriz
# de correlaciones de Spearman. Al hacer esta operación se mantienen como NA los valores NA de
# la matriz de Spearman. Otra opción sería que para esos NA se usara el valor de 
# 0.3*corr_pearson_users. Prefiero dejarlo como está porque así aseguro más paralelismo entre
# perfiles de usuario al haber menos coeficientes de similaridad entre ellos.

# p, q = corr_users.shape[0]*corr_users.shape[1], corr_users.isna().sum().sum()

# print(f'De {p} valores totales de la matriz, tenemos rellenados {p-q}. \nEsto es, una proporción de {(p-q)/p}%.')

# De 26780625 valores totales de la matriz, tenemos rellenados 16900466. 
# Esto es, una proporción de 0.631 %. No está nada mal.

Vamos a escoger un usuario cualquiera y generar alguna recomendación para él (escojo el mismo que para la recomendación por similaridad de producto, para ver las diferencias o parecidos):

In [5]:
random_user = pivot_mangas.iloc[2678].dropna()
random_user

manga_id
1        10.0
2         8.0
3         9.0
4         9.0
26        7.0
51        9.0
104       9.0
149       8.0
399      10.0
401       8.0
436       8.0
481      10.0
642       8.0
656      10.0
657      10.0
705       5.0
731      10.0
743       4.0
745       9.0
768       8.0
909       4.0
912       6.0
936       8.0
1373      9.0
1470      7.0
1471      8.0
1706      9.0
3008      5.0
3009      7.0
3258      8.0
3731      8.0
4632      9.0
5461      8.0
6604      6.0
7216      7.0
7375      9.0
8967      9.0
10690     5.0
11471     5.0
11734     3.0
14790     6.0
14893     6.0
15355     8.0
17192     5.0
17353     5.0
Name: Swarnadeep, dtype: float64

Ahora recorro todos los usuarios parecidos a él según el coeficiente de similaridad asignado por la matriz ponderada entre los dos tipos de correlaciones.

In [8]:
# Correlamos el resto de columnas que representan a los demás usuarios con la seleccionada:
# Notemos que la instrucción tarda un poco y se tiene que hacer esto cada
# vez que se le quiera recomendar algo a un usuario por este método.
similar_users_by_pearson = pivot_users.corrwith(other=random_user, method='pearson').dropna() 
similar_users_by_spearman = pivot_users.corrwith(other=random_user, method='spearman').dropna() 
# La similaridad entre usuarios según la ponderación entre los dos tipos de correlaciones.
similar_users = 0.7*similar_users_by_spearman + 0.3*similar_users_by_pearson

df_similar_users = pd.DataFrame(similar_users)
df_similar_users.columns = ['Similarity']
# Me quedo solo con los usuarios con puntuación de similaridad positiva.
df_similar_users = df_similar_users[df_similar_users['Similarity'] > 0]
df_similar_users.sort_values(by=['Similarity'], ascending=False)

  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)


Unnamed: 0_level_0,Similarity
user,Unnamed: 1_level_1
ManU-Alchemist,1.000000
kidxatxheart,1.000000
okayhero,1.000000
gtzice2,1.000000
skutieos,1.000000
CreativeName42,1.000000
VAPOR_KNIGHTS_,1.000000
megarock327,1.000000
ilhoon,1.000000
Moriette,1.000000


In [18]:
# Esta función también tarda:
recomendations_by_similar_users = pd.Series()
for i in range(df_similar_users.shape[0]):
    current_user = df_similar_users.index[i]
    user_similarity = df_similar_users.values[i][0]
    # Hago una serie para tratar los datos de cada usuario similar:
    series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
    series_current_user.index=ratings[ratings['user'] == current_user].manga_id
    # Elimino de esta serie los productos que ya ha consumido el objetivo:
    series_current_user = series_current_user.drop(random_user.index, errors='ignore')
    # Multiplico las puntuaciones por la similaridad del usuario. Notemos que penalizo las
    # notas suspensas (aunque no con mucha fuerza)
    series_current_user = series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
    recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)
# Obtengo una serie de pandas con muchas ids de mangas repetidas:
recomendations_by_similar_users

1        5.320632
4632     5.320632
21       4.729451
30       4.729451
745      5.911814
102      5.911814
1033     5.911814
3731     5.911814
9296    -2.364725
8652     4.729451
936      5.911814
463      5.320632
670      4.138270
1250     5.320632
436      5.911814
17465    5.911814
142      4.138270
933     -1.182363
11438    5.911814
7107     4.138270
448      5.320632
1267     5.911814
2436     4.138270
3614     5.911814
1162     4.138270
15355    5.911814
209      5.911814
932      5.911814
1373     5.911814
156      4.729451
           ...   
13029    2.424119
16109    2.424119
900      0.000000
4291     2.424119
4060     2.424119
11323    0.000000
2894    -0.346303
6041     2.770422
1178     2.770422
6862    -1.038908
11926   -0.346303
1007     2.077817
4371     2.424119
4548     2.077817
5929     2.424119
6922     2.077817
4568     2.077817
10579    3.116725
1235     2.424119
10658    2.077817
3585     2.424119
968      2.424119
4521     2.770422
7719     2.424119
5902     0

#### Recomendación por popularidad:

Utilizo la suma como función de agregación:

In [24]:
recomendations_by_similar_users_pop = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
recomendations_by_similar_users_pop

1        2603.458876
2        3252.402201
3        2493.545951
4        1053.837368
7         908.177542
8         478.019110
9         975.394837
10        994.361270
11       2678.260347
12       1961.135060
13       3231.974949
14        677.250854
15       1102.045245
16        943.068588
17        594.494901
18        380.396346
19        261.435791
20        920.238038
21       3263.201058
22       1749.767105
23        696.700824
24       1334.217225
25       3305.534129
26       1936.574919
27        253.140707
28        852.834272
29        493.515127
30       1165.346739
31        756.419162
32        542.283005
            ...     
19802       7.298406
19809      36.712064
19810      56.858827
19833      22.624130
19839     166.413132
19841      18.126947
19844     240.102869
19869      39.446317
19871      79.519269
19878      16.478213
19885     197.132249
19896     182.672963
19911      65.818267
19922      70.938403
19925      51.435088
19926      25.352204
19931      22

In [21]:
# Montado como dataframe:
df_recomendations_by_similar_users_pop = pd.DataFrame(recomendations_by_similar_users_pop)
df_recomendations_by_similar_users_pop.columns = ['recomended_score']
df_recomendations_by_similar_users_pop['manga_id'] = df_recomendations_by_similar_users_pop.index
df_recomendations_by_similar_users_pop.index = np.arange(df_recomendations_by_similar_users_pop.shape[0])

df_recomendations_by_similar_users_pop = pd.merge(df_recomendations_by_similar_users_pop, mangas, on=['manga_id'], how='inner')
# Recomendaciones:
df_recomendations_by_similar_users_pop.sort_values(by=['recomended_score'], ascending=False)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
22,3305.534129,25,Fullmetal Alchemist,3,2054,9.022882
18,3263.201058,21,Death Note,35,2433,8.469379
1,3252.402201,2,Berserk,1,1913,9.003659
10,3231.974949,13,One Piece,4,2228,8.618492
2739,2719.443113,8967,Onanie Master Kurosawa,128,1611,8.378647
8,2678.260347,11,Naruto,741,3013,7.545304
0,2603.458876,1,Monster,5,981,8.928644
1987,2567.806063,4632,Oyasumi Punpun,6,1281,8.704918
2,2493.545951,3,20th Century Boys,11,1037,8.764706
292,2410.041364,436,Uzumaki,673,1230,7.793496


In [22]:
# Mangas que menos le van a gustar:
df_recomendations_by_similar_users_pop.sort_values(by=['recomended_score'], ascending=True)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3235,-37.062626,12200,High School Musical,16375,68,1.970588
1259,-18.872427,2177,Battle Royale II: Blitz Royale,16367,85,3.647059
3532,-16.303456,14218,My Sweet Sisters,16373,29,3.379310
2425,-2.650803,7118,Test Flight Girls,16326,23,4.565217
1936,-2.442584,4536,Gekkou Denchi Shiki Ningyou Gekijou,15474,28,4.892857
3496,-1.945117,13984,3H Before Kiss,16171,31,5.354839
3401,-1.659560,13354,Hissatsu Surume Katame,16369,24,3.458333
3924,-1.613066,17244,Truth of a Lily,14809,27,5.925926
2226,-1.377101,5689,Little Sweet Delusion,16319,48,5.375000
2462,-0.920782,7281,Ennui na Kanojo,14172,27,6.592593


#### Recomendación por especifidad:

Media aritmética como función de agregación:

In [25]:
recomendations_by_similar_users_esp_1 = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).mean()
recomendations_by_similar_users_esp_1

1        4.061558
2        3.670883
3        3.512037
4        3.466570
7        3.278619
8        3.649001
9        3.306423
10       3.440696
11       2.810347
12       2.256772
13       3.723473
14       2.919185
15       3.289687
16       2.974980
17       3.476578
18       3.426994
19       2.872921
20       3.370835
21       3.781229
22       3.585588
23       3.240469
24       3.199562
25       4.243304
26       3.452005
27       2.751529
28       3.824369
29       3.380241
30       3.676173
31       3.534669
32       2.658250
           ...   
19802    1.216401
19809    2.159533
19810    2.105882
19833    3.770688
19839    2.189646
19841    3.625389
19844    2.087851
19869    4.930790
19871    3.457360
19878    1.647821
19885    2.032291
19896    2.572859
19911    1.828285
19922    2.837536
19925    1.836967
19926    2.304746
19931    3.693490
19932    2.396587
19939    3.583580
19945    2.529141
19947    3.480379
19952    1.756928
19961    0.822337
19968    3.707914
19980    3

In [26]:
# Montado como dataframe:
df_recomendations_by_similar_users_esp_1 = pd.DataFrame(recomendations_by_similar_users_esp_1)
df_recomendations_by_similar_users_esp_1.columns = ['recomended_score']
df_recomendations_by_similar_users_esp_1['manga_id'] = df_recomendations_by_similar_users_esp_1.index
df_recomendations_by_similar_users_esp_1.index = np.arange(df_recomendations_by_similar_users_esp_1.shape[0])

df_recomendations_by_similar_users_esp_1 = pd.merge(df_recomendations_by_similar_users_esp_1, mangas, on=['manga_id'], how='inner')
# Recomendaciones:
df_recomendations_by_similar_users_esp_1.sort_values(by=['recomended_score'], ascending=False)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3279,9.000000,12472,Bara no Kusari,13246,20,6.800000
1275,8.500000,2458,Kimi wa Boku wo Suki ni Naru,5346,41,7.365854
2579,6.890234,7972,Nayameru Hime to Mayoeru Ouji,14945,24,6.041667
2816,6.645900,9606,Yubi to Kuchibiru to Hitomi no Ijiwaru,9329,24,6.750000
1223,6.212536,1990,Ouji-sama no Renai Jijou,15668,26,6.115385
2845,6.020992,9869,S+,6043,24,6.958333
1964,6.006897,4588,Con Con x Honey,9312,21,7.142857
1994,5.835351,4651,Shinayaka ni Kizutsuite,12368,20,6.400000
3999,5.786499,17790,Sandwich Girl,3282,28,7.785714
290,5.772426,431,Densha Otoko: Bijo to Junjou Otaku Seinen no N...,2623,21,8.380952


In [27]:
# Mangas que menos le van a gustar:
df_recomendations_by_similar_users_esp_1.sort_values(by=['recomended_score'], ascending=True)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3235,-1.090077,12200,High School Musical,16375,68,1.970588
3532,-0.959027,14218,My Sweet Sisters,16373,29,3.379310
2462,-0.920782,7281,Ennui na Kanojo,14172,27,6.592593
3496,-0.486279,13984,3H Before Kiss,16171,31,5.354839
1259,-0.314540,2177,Battle Royale II: Blitz Royale,16367,85,3.647059
218,-0.298747,335,Gravitation: Voice of Temptation,4087,20,6.800000
1936,-0.244258,4536,Gekkou Denchi Shiki Ningyou Gekijou,15474,28,4.892857
3924,-0.230438,17244,Truth of a Lily,14809,27,5.925926
2425,-0.220900,7118,Test Flight Girls,16326,23,4.565217
2327,-0.201946,6200,Ai wo Tomenaide,16252,37,5.351351


Una prueba con la media geométrica como función de agregación:

In [30]:
from scipy.stats.mstats import gmean

In [38]:
recomendations_by_similar_users_esp_2 = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).aggregate(gmean)
recomendations_by_similar_users_esp_2

1             NaN
2             NaN
3             NaN
4             NaN
7             NaN
8             NaN
9             NaN
10            NaN
11            NaN
12            NaN
13            NaN
14            NaN
15            NaN
16            NaN
17            NaN
18            NaN
19            NaN
20            NaN
21            NaN
22            NaN
23            NaN
24            NaN
25            NaN
26            NaN
27            NaN
28            NaN
29            NaN
30            NaN
31            NaN
32            NaN
           ...   
19802         NaN
19809    0.000000
19810         NaN
19833    2.325599
19839         NaN
19841    1.977253
19844         NaN
19869    0.000000
19871         NaN
19878         NaN
19885         NaN
19896         NaN
19911         NaN
19922    0.000000
19925         NaN
19926         NaN
19931    2.584951
19932    0.000000
19939    2.677643
19945    0.000000
19947    0.000000
19952         NaN
19961         NaN
19968    0.000000
19980     

In [41]:
recomendations_by_similar_users_esp_2 = recomendations_by_similar_users.groupby(
                                        recomendations_by_similar_users.index).apply(lambda group: group.product() ** (1 / float(len(group) )))
recomendations_by_similar_users_esp_2

  


1        0.000000
2        0.000000
3        0.000000
4        0.000000
7        0.000000
8        0.000000
9        0.000000
10       0.000000
11       0.000000
12       0.000000
13       0.000000
14       0.000000
15       0.000000
16       0.000000
17       0.000000
18       0.000000
19       0.000000
20       0.000000
21       0.000000
22       0.000000
23       0.000000
24       0.000000
25       0.000000
26       0.000000
27       0.000000
28       0.000000
29       0.000000
30       0.000000
31       0.000000
32       0.000000
           ...   
19802    0.000000
19809    0.000000
19810    0.000000
19833    2.325599
19839    0.000000
19841    1.977253
19844    0.000000
19869    0.000000
19871    0.000000
19878    1.186382
19885    0.000000
19896    0.000000
19911    0.000000
19922    0.000000
19925    0.000000
19926    0.000000
19931    2.584951
19932    0.000000
19939    2.677643
19945    0.000000
19947    0.000000
19952    0.000000
19961    0.000000
19968    0.000000
19980    0

Esto ofrece muy pocos resultados:

In [48]:
# Montado como dataframe:
df_recomendations_by_similar_users_esp_2 = pd.DataFrame(recomendations_by_similar_users_esp_2)
df_recomendations_by_similar_users_esp_2.columns = ['recomended_score']
df_recomendations_by_similar_users_esp_2['manga_id'] = df_recomendations_by_similar_users_esp_2.index
df_recomendations_by_similar_users_esp_2.index = np.arange(df_recomendations_by_similar_users_esp_2.shape[0])

df_recomendations_by_similar_users_esp_2 = pd.merge(df_recomendations_by_similar_users_esp_2, mangas, on=['manga_id'], how='inner')
df_recomendations_by_similar_users_esp_2 = df_recomendations_by_similar_users_esp_2[df_recomendations_by_similar_users_esp_2['recomended_score'] > 0].dropna()
# Recomendaciones:
df_recomendations_by_similar_users_esp_2.sort_values(by=['recomended_score'], ascending=False)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3279,9.000000,12472,Bara no Kusari,13246,20,6.800000
1275,8.485281,2458,Kimi wa Boku wo Suki ni Naru,5346,41,7.365854
2579,6.822470,7972,Nayameru Hime to Mayoeru Ouji,14945,24,6.041667
2816,6.555104,9606,Yubi to Kuchibiru to Hitomi no Ijiwaru,9329,24,6.750000
1223,6.212536,1990,Ouji-sama no Renai Jijou,15668,26,6.115385
1964,5.986506,4588,Con Con x Honey,9312,21,7.142857
1994,5.833028,4651,Shinayaka ni Kizutsuite,12368,20,6.400000
2004,5.603401,4684,Goshujinsama to Watashi,7934,21,7.285714
290,5.471588,431,Densha Otoko: Bijo to Junjou Otaku Seinen no N...,2623,21,8.380952
2696,5.448556,8722,NG Boy x Paradise,4722,26,7.500000


In [50]:
# Mangas que menos le van a gustar:
df_recomendations_by_similar_users_esp_2.sort_values(by=['recomended_score'], ascending=True)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
4170,0.356221,19334,I Will Be Cinderella,2337,28,7.642857
2452,0.364778,7246,Kiseki no Koibito,13859,26,6.576923
3102,0.410497,11430,VITA Sexualis,11018,21,6.238095
4080,0.631316,18566,Asura,8486,35,6.771429
1245,0.683184,2142,Ouji-sama no Kanojo,15431,53,6.000000
2714,0.694187,8835,Shadow of Visions,7435,22,7.181818
1267,0.721897,2428,"Darling, I Love You!",13770,40,5.975000
2256,0.767651,5854,Hanasakeru Seishounen,2076,20,7.450000
1938,0.775102,4540,Fujunna Renai,7784,44,7.136364
3997,0.843148,17770,Hieshou Danshi Kouryakuhou,11675,55,6.290909


A la vista de la diferencia entre aplicar la media aritmética o la geométrica como función de agregación, usaremos la media aritmética.

### Resultados

#### Similaridad entre productos

Cargo las matrices de correlaciones:

In [14]:
corr_mangas = pd.read_csv('corr_pearson_mangas.csv', header=0, index_col=0)
corr_mangas_reduced = pd.read_csv('corr_pearson_mangas_filtered.csv', header=0, index_col=0)

Comprimo el proceso anterior usando la matriz de correlaciones filtrada (mangas con más de 100 puntuaciones) y haciendo la recomendación por popularidad:

In [57]:
random_user_id = 2678
random_user = pivot_mangas.iloc[random_user_id].dropna()
user_recomendation = pd.Series()
for manga_index in range(len(random_user)):
    similar_mangas = corr_mangas_reduced[str(random_user.index[manga_index])].dropna()
    similar_mangas = similar_mangas.drop(random_user.index, errors='ignore')
    score = random_user.values[manga_index]
    similar_mangas = similar_mangas.map(lambda x: x * score if score > 5 else (x-5) * score)
    user_recomendation = user_recomendation.append(similar_mangas)
# Popularidad:
# user_recomendation = user_recomendation.groupby(user_recomendation.index).sum()
# Especifidad:
user_recomendation = user_recomendation.groupby(user_recomendation.index).mean()
user_recomendations = pd.DataFrame(user_recomendation)
user_recomendations.columns = ['recomended_score']
user_recomendations['manga_id'] = user_recomendations.index
user_recomendations.index = np.arange(user_recomendations.shape[0])
df_user_recomendation = pd.merge(user_recomendations, mangas, on=['manga_id'], how='inner')
df_user_recomendation.sort_values(by=['recomended_score'], ascending=False)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
339,4.575700,6812,Kyou kara Ore wa!!,119,166,8.445783
308,4.478454,4628,Shin Yami no Koe - Kaidan,4407,127,7.000000
479,4.005417,19844,Mouryou no Yurikago,6455,218,6.954128
239,3.812048,1908,Hatsukoi Limited.,3900,267,7.142322
411,3.737173,13102,Kanojo wo Mamoru 51 no Houhou,2719,186,7.440860
136,3.612332,770,Gunnm: Last Order,522,170,7.705882
92,3.603917,582,Kodomo no Jikan,1022,283,7.717314
210,3.449536,1450,Skyhigh,4191,185,7.227027
462,3.410571,17368,Blame Gakuen! And So On,11983,150,6.046667
193,3.407764,1282,Itoshi no Kana,3895,267,7.359551


En forma de función:

In [13]:
def recomendations_by_item_similarity_for_given_user(user_id, filtered=False, subtype='specifity'):
    random_user = pivot_mangas.iloc[user_id].dropna()
    user_recomendation = pd.Series()
    for manga_index in range(len(random_user)):
        if filtered:
            similar_mangas = corr_mangas_reduced[str(random_user.index[manga_index])].dropna()
        else:
            similar_mangas = corr_mangas[str(random_user.index[manga_index])].dropna()
        similar_mangas = similar_mangas.drop(random_user.index, errors='ignore')
        score = random_user.values[manga_index]
        similar_mangas = similar_mangas.map(lambda x: x * score if score > 5 else (x-5) * score)
        user_recomendation = user_recomendation.append(similar_mangas)
    if subtype is 'popularity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).sum()
    elif subtype is 'specifity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).mean()
    user_recomendations = pd.DataFrame(user_recomendation)
    user_recomendations.columns = ['recomended_score']
    user_recomendations['manga_id'] = user_recomendations.index
    user_recomendations.index = np.arange(user_recomendations.shape[0])
    df_user_recomendation = pd.merge(user_recomendations, mangas, on=['manga_id'], how='inner')
    recomendations = df_user_recomendation.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations

def recomendations_by_item_similarity_for_new_user(new_user, filtered=False, subtype='specifity'):
    user_recomendation = pd.Series()
    for manga_index in range(len(new_user)):
        if filtered:
            similar_mangas = corr_mangas_reduced[str(new_user.index[manga_index])].dropna()
        else:
            similar_mangas = corr_mangas[str(new_user.index[manga_index])].dropna()
        similar_mangas = similar_mangas.drop(new_user.index, errors='ignore')
        score = new_user.values[manga_index]
        similar_mangas = similar_mangas.map(lambda x: x * score if score > 5 else (x-5) * score)
        user_recomendation = user_recomendation.append(similar_mangas)
    if subtype is 'popularity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).sum()
    elif subtype is 'specifity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).mean()
    user_recomendations = pd.DataFrame(user_recomendation)
    user_recomendations.columns = ['recomended_score']
    user_recomendations['manga_id'] = user_recomendations.index
    user_recomendations.index = np.arange(user_recomendations.shape[0])
    df_user_recomendation = pd.merge(user_recomendations, mangas, on=['manga_id'], how='inner')
    recomendations = df_user_recomendation.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations

In [59]:
recomendations_by_item_similarity_for_given_user(2)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,6.562621,4456,Suzunari!,9084,20,6.200000
1,4.777334,3254,Tetsuwan Birdy,3088,21,7.666667
2,4.716608,449,Miracle☆Girls,9966,33,6.939394
3,4.333647,386,Kimi no Unaji ni Kanpai!,10297,25,6.440000
4,3.826842,8473,Promise,3174,27,7.814815
5,3.759763,4472,Kamisama no Orgel,8716,27,7.185185
6,3.699187,9084,Otome Youkai Zakuro,2257,59,7.661017
7,3.251008,3963,Tau,11745,20,6.800000
8,2.987644,12537,Yoru wo Utau Kodomotachi,12489,29,5.931034
9,2.827247,1375,Oni-gokko,10646,22,6.363636


Mejor aún, haciendo la recomendación a través de unas puntuaciones variables y personalizables:

In [10]:
some_personal_scores = {'Berserk': 10, 'Neon Genesis Evangelion': 10, 'Gantz': 8, 'Monster': 8, 'One Piece': 10, 'Akira': 5, 'Kiseijuu': 7, 'Death Note': 9,
                        'Dragon Ball': 7, 'Hunter x Hunter': 9}

for manga_name in some_personal_scores.keys():
    if manga_name not in mangas['manga_name'].values:
        raise Exception(f'The manga {manga_name} is not on the manga database.')
print('Data introduced is ok.')

Data introduced is ok.


In [11]:
def prepare_new_user(dict_scores):
    manga_dict = {}
    for manga_name in dict_scores.keys():
        manga_id = mangas[mangas['manga_name'] == manga_name].manga_id.iloc[0]
        manga_dict[manga_id] = dict_scores[manga_name]
    new_user = pd.Series(manga_dict)
    return new_user

new_user = prepare_new_user(some_personal_scores)
new_user

2      10
698    10
564     8
1       8
13     10
664     5
401     7
21      9
42      7
26      9
dtype: int64

In [15]:
recomendations_by_item_similarity_for_new_user(new_user)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,7.873697,13844,Koisuru Yajuu,11451,23,6.521739
1,7.825673,6572,Get the Moon,8352,21,7.190476
2,7.746810,4348,Love Laboratory,10086,25,6.200000
3,7.651809,11659,Close to My Sweetheart,11876,21,6.428571
4,7.446664,7281,Ennui na Kanojo,14172,27,6.592593
5,7.283702,4213,Operation Liberate Men,3346,24,7.666667
6,7.017995,5440,Sei Dragon Girl Miracle,5354,22,6.545455
7,6.997123,19987,"Kimi to, Sekai ga Owaru made",11786,22,6.772727
8,6.893495,18813,Megane no Koiwazurai,11131,26,6.884615
9,6.591614,18741,Going to You,7705,34,7.000000


In [17]:
recomendations_by_item_similarity_for_new_user(new_user, filtered=True)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,5.081413,7458,Death Note Another Note: Los Angeles BB Renzok...,192,273,8.285714
1,4.148694,14721,D-Frag!,890,214,7.556075
2,4.120732,534,Spiral: Suiri no Kizuna,760,182,7.956044
3,4.045906,1110,Shiawase Kissa 3-choume,397,458,8.268559
4,3.888315,1009,Hachimitsu to Clover,213,219,8.374429
5,3.813086,17915,Judge,5071,254,6.759843
6,3.736257,14483,Uchuu Kyoudai,22,164,8.859756
7,3.715459,11438,Kimi ni shika Kikoenai,860,271,7.896679
8,3.666210,19358,Ratman,3553,216,7.500000
9,3.534426,19154,Kimi no Knife,2199,188,7.521277


In [18]:
recomendations_by_item_similarity_for_new_user(new_user, subtype='popularity')

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,54.609464,10704,Koi no Uta,6170,49,7.265306
1,52.018524,4375,Lamp no Ousama,13987,42,6.404762
2,50.985912,4213,Operation Liberate Men,3346,24,7.666667
3,50.030282,8721,Kimi to Boku no Junjou Renai Jijou,7796,37,6.891892
4,48.566500,12748,Futari Awasete Puramai Zero,10582,71,6.816901
5,46.462757,17647,Tokyo Kareshi,6711,91,6.901099
6,45.165494,17594,Yuki no Project,11778,41,6.634146
7,44.979870,4099,Auto Focus,5248,59,7.525424
8,44.916559,15675,Chitose etc.,8369,39,6.948718
9,44.379693,17770,Hieshou Danshi Kouryakuhou,11675,55,6.290909


In [19]:
recomendations_by_item_similarity_for_new_user(new_user, filtered=True, subtype='popularity')

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,21.879390,3537,Yankee-kun to Megane-chan,1176,644,7.818323
1,21.676449,5664,Nurarihyon no Mago,831,470,7.778723
2,21.382791,219,Alive: Saishuu Shinkateki Shounen,1043,381,7.690289
3,20.380160,15578,GE: Good Ending,1214,679,7.572901
4,20.333029,13702,Tonari no Kaibutsu-kun,367,1004,8.074701
5,20.251616,3403,Rosario to Vampire: Season II,452,696,7.899425
6,20.222677,447,Oretama,4287,511,7.195695
7,19.769090,105,Kekkaishi,907,416,7.846154
8,19.445258,671,To LOVE-Ru,2769,654,7.045872
9,19.377674,610,Skip Beat!,62,1266,8.626382


#### Similaridad entre usuarios

Cargo la tabla pivot:

In [5]:
pivot_users = ratings.pivot_table(index=['manga_id'], columns=['user'], values='score')
pivot_mangas = ratings.pivot_table(index=['user'], columns=['manga_id'], values='score')

Comprimo el proceso visto en el notebook haciendo la recomendación por especifidad:

In [6]:
random_user = pivot_mangas.iloc[2678].dropna()
# Calculo las correlaciones con el resto de usuarios:
similar_users_by_pearson = pivot_users.corrwith(other=random_user, method='pearson').dropna() 
similar_users_by_spearman = pivot_users.corrwith(other=random_user, method='spearman').dropna() 
# La similaridad entre usuarios según la ponderación entre los dos tipos de correlaciones.
similar_users = 0.7*similar_users_by_spearman + 0.3*similar_users_by_pearson

# Tabla de usuarios similares:
df_similar_users = pd.DataFrame(similar_users)
df_similar_users.columns = ['Similarity']
# Me quedo solo con los usuarios con puntuación de similaridad positiva.
df_similar_users = df_similar_users[df_similar_users['Similarity'] > 0]

# Empiezo a construir las puntuaciones de las recomendaciones:
recomendations_by_similar_users = pd.Series()
for i in range(df_similar_users.shape[0]):
    current_user = df_similar_users.index[i]
    user_similarity = df_similar_users.values[i][0]
    # Hago una serie para tratar los datos de cada usuario similar:
    series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
    series_current_user.index=ratings[ratings['user'] == current_user].manga_id
    # Elimino de esta serie los productos que ya ha consumido el objetivo:
    series_current_user = series_current_user.drop(random_user.index, errors='ignore')
    # Multiplico las puntuaciones por la similaridad del usuario. Notemos que penalizo las
    # notas suspensas (aunque no con mucha fuerza)
    series_current_user = series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
    recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)

# Función de agregación: por popularidad o por especifidad:
# recomendations_by_similar_users_pop = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
recomendations_by_similar_users_esp = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).mean()

# Construyo el dataframe para la presentación de resultados:
df_recomendations_by_similar_users_esp = pd.DataFrame(recomendations_by_similar_users_esp)
df_recomendations_by_similar_users_esp.columns = ['recomended_score']
df_recomendations_by_similar_users_esp['manga_id'] = df_recomendations_by_similar_users_esp.index
df_recomendations_by_similar_users_esp.index = np.arange(df_recomendations_by_similar_users_esp.shape[0])

df_recomendations_by_similar_users_esp = pd.merge(df_recomendations_by_similar_users_esp, mangas, on=['manga_id'], how='inner')

# Recomendaciones:
df_recomendations_by_similar_users_esp.sort_values(by=['recomended_score'], ascending=False)

  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
3279,9.000000,12472,Bara no Kusari,13246,20,6.800000
1275,8.500000,2458,Kimi wa Boku wo Suki ni Naru,5346,41,7.365854
2579,6.890234,7972,Nayameru Hime to Mayoeru Ouji,14945,24,6.041667
2816,6.645900,9606,Yubi to Kuchibiru to Hitomi no Ijiwaru,9329,24,6.750000
1223,6.212536,1990,Ouji-sama no Renai Jijou,15668,26,6.115385
2845,6.020992,9869,S+,6043,24,6.958333
1964,6.006897,4588,Con Con x Honey,9312,21,7.142857
1994,5.835351,4651,Shinayaka ni Kizutsuite,12368,20,6.400000
3999,5.786499,17790,Sandwich Girl,3282,28,7.785714
290,5.772426,431,Densha Otoko: Bijo to Junjou Otaku Seinen no N...,2623,21,8.380952


En forma de función:

In [30]:
def recomendations_by_user_similarity_for_given_user(user_id, subtype='specifity'):
    random_user = pivot_mangas.iloc[user_id].dropna()
    # Calculo las correlaciones con el resto de usuarios:
    similar_users_by_pearson = pivot_users.corrwith(other=random_user, method='pearson').dropna() 
    similar_users_by_spearman = pivot_users.corrwith(other=random_user, method='spearman').dropna() 
    # La similaridad entre usuarios según la ponderación entre los dos tipos de correlaciones.
    similar_users = 0.7*similar_users_by_spearman + 0.3*similar_users_by_pearson
    # Tabla de usuarios similares:
    df_similar_users = pd.DataFrame(similar_users)
    df_similar_users.columns = ['Similarity']
    # Me quedo solo con los usuarios con puntuación de similaridad positiva.
    df_similar_users = df_similar_users[df_similar_users['Similarity'] > 0]
    # Empiezo a construir las puntuaciones de las recomendaciones:
    recomendations_by_similar_users = pd.Series()
    for i in range(df_similar_users.shape[0]):
        current_user = df_similar_users.index[i]
        user_similarity = df_similar_users.values[i][0]
        # Hago una serie para tratar los datos de cada usuario similar:
        series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
        series_current_user.index=ratings[ratings['user'] == current_user].manga_id
        # Elimino de esta serie los productos que ya ha consumido el objetivo:
        series_current_user = series_current_user.drop(random_user.index, errors='ignore')
        # Multiplico las puntuaciones por la similaridad del usuario. Notemos que penalizo las
        # notas suspensas (aunque no con mucha fuerza)
        series_current_user = series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
        recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)
    # Función de agregación: por popularidad o por especifidad:
    if subtype is 'popularity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
    elif subtype is 'specifity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).mean()
    # Construyo el dataframe para la presentación de resultados:
    recomendations = pd.DataFrame(recomendations_by_similar_users)
    recomendations.columns = ['recomended_score']
    recomendations['manga_id'] = recomendations.index
    recomendations.index = np.arange(recomendations.shape[0])
    recomendations = pd.merge(recomendations, mangas, on=['manga_id'], how='inner')
    # Recomendaciones:
    recomendations = recomendations.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations
    

def recomendations_by_user_similarity_for_new_user(new_user, subtype='specifity'):
    similar_users_by_pearson = pivot_users.corrwith(other=new_user, method='pearson').dropna() 
    similar_users_by_spearman = pivot_users.corrwith(other=new_user, method='spearman').dropna() 
    # La similaridad entre usuarios según la ponderación entre los dos tipos de correlaciones.
    similar_users = 0.7*similar_users_by_spearman + 0.3*similar_users_by_pearson
    # Tabla de usuarios similares:
    df_similar_users = pd.DataFrame(similar_users)
    df_similar_users.columns = ['Similarity']
    # Me quedo solo con los usuarios con puntuación de similaridad positiva.
    df_similar_users = df_similar_users[df_similar_users['Similarity'] > 0]
    # Empiezo a construir las puntuaciones de las recomendaciones:
    recomendations_by_similar_users = pd.Series()
    for i in range(df_similar_users.shape[0]):
        current_user = df_similar_users.index[i]
        user_similarity = df_similar_users.values[i][0]
        # Hago una serie para tratar los datos de cada usuario similar:
        series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
        series_current_user.index=ratings[ratings['user'] == current_user].manga_id
        # Elimino de esta serie los productos que ya ha consumido el objetivo:
        series_current_user = series_current_user.drop(new_user.index, errors='ignore')
        # Multiplico las puntuaciones por la similaridad del usuario. Notemos que penalizo las
        # notas suspensas (aunque no con mucha fuerza)
        series_current_user = series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
        recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)
    # Función de agregación: por popularidad o por especifidad:
    if subtype is 'popularity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
    elif subtype is 'specifity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).mean()
    # Construyo el dataframe para la presentación de resultados:
    recomendations = pd.DataFrame(recomendations_by_similar_users)
    recomendations.columns = ['recomended_score']
    recomendations['manga_id'] = recomendations.index
    recomendations.index = np.arange(recomendations.shape[0])
    recomendations = pd.merge(recomendations, mangas, on=['manga_id'], how='inner')
    # Recomendaciones:
    recomendations = recomendations.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations

In [24]:
recomendations_by_user_similarity_for_given_user(2)

  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,7.255016,95,Lagoon Engine,8193,20,6.750000
1,6.702769,18747,The Bear-Like Fox Meets The Wolf,5387,27,7.370370
2,6.537838,7554,Second Kiss,9878,21,7.095238
3,6.486384,4944,Cream,8345,21,6.714286
4,6.485082,4651,Shinayaka ni Kizutsuite,12368,20,6.400000
5,6.383989,4081,Princess,482,25,8.480000
6,6.371372,944,Chuuka Ichiban!,3515,20,7.300000
7,6.291557,4602,Koi*Oto,2779,27,7.481481
8,6.266328,3167,Chicchana Yukitsukai Sugar,10552,22,6.954545
9,6.262133,4684,Goshujinsama to Watashi,7934,21,7.285714


Mejor aún, haciendo la recomendación a través de unas puntuaciones variables y personalizables:

In [20]:
some_personal_scores = {'Berserk': 10, 'Neon Genesis Evangelion': 10, 'Gantz': 8, 'Monster': 8, 'One Piece': 10, 'Akira': 5, 'Kiseijuu': 7, 'Death Note': 9,
                        'Dragon Ball': 7, 'Hunter x Hunter': 9}

for manga_name in some_personal_scores.keys():
    if manga_name not in mangas['manga_name'].values:
        raise Exception(f'The manga {manga_name} is not on the manga database.')
print('Data introduced is ok.')

Data introduced is ok.


In [28]:
def prepare_new_user(dict_scores):
    manga_dict = {}
    for manga_name in dict_scores.keys():
        manga_id = mangas[mangas['manga_name'] == manga_name].manga_id.iloc[0]
        manga_dict[manga_id] = dict_scores[manga_name]
    new_user = pd.Series(manga_dict)
    return new_user

new_user = prepare_new_user(some_personal_scores)
new_user

2      10
698    10
564     8
1       8
13     10
664     5
401     7
21      9
42      7
26      9
dtype: int64

In [25]:
recomendations_by_user_similarity_for_new_user(new_user)

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,9.000000,12352,Yuki no Kuni kara,4629,23,7.043478
1,9.000000,15667,Make Sweet,6706,34,6.970588
2,9.000000,2481,Every Day Every Night,13771,20,6.600000
3,8.500000,8239,Vampire Knight: Ice Blue no Tsumi,1088,26,8.230769
4,8.500000,17633,Love & Noise!,3836,48,7.729167
5,8.500000,2613,Sonna Kimochi ga Koi datta,10995,30,7.000000
6,8.335740,13093,Sore wa Tabete wa Ikemasen,9103,40,7.375000
7,8.200000,4016,Yaya,7167,32,7.312500
8,8.000000,7387,Watashi no Cinderella,10439,46,7.065217
9,8.000000,8674,Deep Black,12380,20,7.000000


In [31]:
recomendations_by_user_similarity_for_new_user(new_user, subtype='popularity')

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,3158.877188,11,Naruto,741,3013,7.545304
1,2976.971433,25,Fullmetal Alchemist,3,2054,9.022882
2,2346.336936,8967,Onanie Master Kurosawa,128,1611,8.378647
3,2238.922578,12,Bleach,2117,2527,7.011872
4,2209.539790,9711,Bakuman.,149,1555,8.311254
5,2115.001464,583,Claymore,285,1306,8.132466
6,2002.680774,8586,The Breaker,88,1205,8.351037
7,1987.917417,3986,Deadman Wonderland,677,1492,7.767426
8,1972.840637,656,Vagabond,8,959,8.863399
9,1966.278524,598,Fairy Tail,1529,2058,7.189990


# Motor de recomendación:

Preámbulos:

In [1]:
import pandas as pd
import numpy as np

In [2]:
# La versión de pandas debería ser la 0.24.2 o más reciente.
pd.__version__

'0.24.2'

In [3]:
mangas = pd.read_csv('mangas_v2.csv')
scores = pd.read_csv('scores_v2.csv')
ratings = pd.merge(mangas, scores, on='manga_id')
ratings = ratings[['manga_id', 'user', 'score']]
corr_mangas = pd.read_csv('corr_pearson_mangas.csv', header=0, index_col=0)
corr_mangas_reduced = pd.read_csv('corr_pearson_mangas_filtered.csv', header=0, index_col=0)
pivot_users = ratings.pivot_table(index=['manga_id'], columns=['user'], values='score')
pivot_mangas = ratings.pivot_table(index=['user'], columns=['manga_id'], values='score')

Funciones:

In [4]:
def recomendations_by_item_similarity_for_new_user(new_user, filtered=False, subtype='specifity'):
    user_recomendation = pd.Series()
    for manga_index in range(len(new_user)):
        if filtered:
            similar_mangas = corr_mangas_reduced[str(new_user.index[manga_index])].dropna()
        else:
            similar_mangas = corr_mangas[str(new_user.index[manga_index])].dropna()
        similar_mangas = similar_mangas.drop(new_user.index, errors='ignore')
        score = new_user.values[manga_index]
        similar_mangas = similar_mangas.map(lambda x: x * score if score > 5 else (x-5) * score)
        user_recomendation = user_recomendation.append(similar_mangas)
    if subtype == 'popularity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).sum()
    elif subtype == 'specifity':
        user_recomendation = user_recomendation.groupby(user_recomendation.index).mean()
    user_recomendations = pd.DataFrame(user_recomendation)
    user_recomendations.columns = ['recomended_score']
    user_recomendations['manga_id'] = user_recomendations.index
    user_recomendations.index = np.arange(user_recomendations.shape[0])
    df_user_recomendation = pd.merge(user_recomendations, mangas, on=['manga_id'], how='inner')
    recomendations = df_user_recomendation.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations

def recomendations_by_user_similarity_for_new_user(new_user, subtype='specifity'):
    similar_users_by_pearson = pivot_users.corrwith(other=new_user, method='pearson').dropna() 
    similar_users_by_spearman = pivot_users.corrwith(other=new_user, method='spearman').dropna() 
    # La similaridad entre usuarios según la ponderación entre los dos tipos de correlaciones.
    similar_users = 0.7*similar_users_by_spearman + 0.3*similar_users_by_pearson
    # Tabla de usuarios similares:
    df_similar_users = pd.DataFrame(similar_users)
    df_similar_users.columns = ['Similarity']
    # Me quedo solo con los usuarios con puntuación de similaridad positiva.
    df_similar_users = df_similar_users[df_similar_users['Similarity'] > 0]
    # Empiezo a construir las puntuaciones de las recomendaciones:
    recomendations_by_similar_users = pd.Series()
    for i in range(df_similar_users.shape[0]):
        current_user = df_similar_users.index[i]
        user_similarity = df_similar_users.values[i][0]
        # Hago una serie para tratar los datos de cada usuario similar:
        series_current_user = pd.Series(ratings[ratings['user'] == current_user].score)
        series_current_user.index=ratings[ratings['user'] == current_user].manga_id
        # Elimino de esta serie los productos que ya ha consumido el objetivo:
        series_current_user = series_current_user.drop(new_user.index, errors='ignore')
        # Multiplico las puntuaciones por la similaridad del usuario. Notemos que penalizo las
        # notas suspensas (aunque no con mucha fuerza)
        series_current_user = series_current_user.map(lambda x : x * user_similarity if x > 5 else (x-5) * user_similarity)
        recomendations_by_similar_users = recomendations_by_similar_users.append(series_current_user)
    # Función de agregación: por popularidad o por especifidad:
    if subtype == 'popularity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).sum()
    elif subtype == 'specifity':
        recomendations_by_similar_users = recomendations_by_similar_users.groupby(recomendations_by_similar_users.index).mean()
    # Construyo el dataframe para la presentación de resultados:
    recomendations = pd.DataFrame(recomendations_by_similar_users)
    recomendations.columns = ['recomended_score']
    recomendations['manga_id'] = recomendations.index
    recomendations.index = np.arange(recomendations.shape[0])
    recomendations = pd.merge(recomendations, mangas, on=['manga_id'], how='inner')
    # Recomendaciones:
    recomendations = recomendations.sort_values(by=['recomended_score', 'mean_score'], ascending=False)
    recomendations.index = np.arange(recomendations.shape[0])
    return recomendations

def recomendations_by_item_similarity_for_given_user(user_id, filtered=False, subtype='specifity'):
    user = pivot_mangas.iloc[user_id].dropna()
    return recomendations_by_item_similarity_for_new_user(user, filtered=filtered, subtype=subtype)

def recomendations_by_user_similarity_for_given_user(user_id, subtype='specifity'):
    user = pivot_mangas.iloc[user_id].dropna()
    return recomendations_by_user_similarity_for_new_user(user, subtype=subtype)
    
def get_recomendations(user, main_type='user_similarity', subtype='specifity', reduced_dtb=False):
    if main_type == 'user_similarity':
        return recomendations_by_user_similarity_for_new_user(user, subtype=subtype)
    elif main_type == 'item_similarity':
        return recomendations_by_item_similarity_for_new_user(user, filtered=reduced_dtb, subtype=subtype)
    else:
        raise Exception("You must select between 'user_similarity' or 'item_similarity'.")

def validate_user(user):
    flag = True
    for manga_name in user.keys():
        if manga_name not in mangas['manga_name'].values:
            print(f'The manga {manga_name} is not on the manga database.')
            flag = False
    if flag:
        print('Data introduced is ok.')
    return flag

def prepare_new_user(user):
    manga_dict = {}
    for manga_name in user.keys():
        manga_id = mangas[mangas['manga_name'] == manga_name].manga_id.iloc[0]
        manga_dict[manga_id] = user[manga_name]
    new_user = pd.Series(manga_dict)
    return new_user

Ejemplo de uso:

In [6]:
!pip install scipy

Collecting scipy
  Downloading https://files.pythonhosted.org/packages/be/cc/6f7842a4d9aa7f51155f849185631e1201df255742de84d724ac33bff723/scipy-1.3.0-cp37-cp37m-win32.whl (27.1MB)
Installing collected packages: scipy
Successfully installed scipy-1.3.0


In [8]:
user_example = {'Berserk': 10, 'Neon Genesis Evangelion': 10, 'Naruto': 8}

if validate_user(user_example):
    user = prepare_new_user(user_example)
    recomendations = get_recomendations(user, 'user_similarity', 'specifity')

recomendations

Data introduced is ok.


  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)


Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
0,10.000000,1021,The Summit,596,43,8.372093
1,10.000000,963,9-banme no Musashi,2557,20,7.450000
2,9.000000,412,Koibumi Biyori,6003,24,7.875000
3,9.000000,2576,Takeru ~ Opera Susanoh Sword of the Devil,2381,22,7.818182
4,9.000000,13213,Ragtonia,3271,28,7.642857
5,9.000000,11128,Miku-4,5801,27,7.296296
6,9.000000,166,Chou Shinri Genshou Nouryokusha Nanaki,6404,23,7.260870
7,9.000000,2996,Missing Piece,13306,21,7.047619
8,9.000000,8595,Houkago Orange,3819,32,6.968750
9,9.000000,4678,Dokuyaku to Otome,12070,44,6.681818


In [10]:
recomendations.sort_values('manga_name')

Unnamed: 0,recomended_score,manga_id,manga_name,manga_rank,number_scores,mean_score
2022,49.500952,10000,"""Bungaku Shoujo"" Series",469,47,7.851064
1212,95.740403,11776,"""Bungaku Shoujo"" to Shi ni Tagari no Pierrot",2942,92,7.467391
367,282.688682,682,"""Kare"" First Love",2496,450,7.604444
1290,87.824673,69,"""Suki"" to Ienai.",11301,245,6.461224
2293,40.691849,4769,#000000: Ultra Black,9085,23,7.304348
229,406.872061,38,+Anima,1909,414,7.555556
698,167.534054,12922,+C: Sword and Cornett,4846,157,7.280255
1318,85.748514,17931,-Hitogatana-,5072,59,6.847458
3658,13.703018,1144,...Curtain.: Sensei to Kiyoraka ni Dousei,15626,101,6.217822
2150,44.441009,7886,...Seishunchuu!,8223,30,7.366667


Referencia: http://blog.findemor.es/2018/02/sistemas-de-recomendacion-en-python/

---