# Collaborative Recommender System

Set up the project by extracting the dataset archive, which contains the data as CSV files.

In [1]:
! rm -rf *.csv
! unzip ./dataset/archive.zip


Archive:  ./dataset/archive.zip
  inflating: anime.csv               
  inflating: rating.csv              


Import the required libraries.

In [2]:
import pandas as pd
from IPython.display import Image
from scipy.stats import pearsonr

%matplotlib inline

### User-Based Collaborative Filtering
- Approach: Finds users similar to the target user based on historical interactions.
- Process:
  1. Identify users with similar preferences.
  2. Recommend items liked by these similar users.
- Pros:
  - Simple to understand and implement.
  - Often effective with sufficient user data.
- Cons:
  - Performance degrades with large datasets.
  - Struggles with new users (cold start problem).

### Item-Based Collaborative Filtering
- Approach: Finds items similar to the ones the target user has interacted with.
- Process:
  1. Identify items similar to what the user likes.
  2. Recommend these similar items.
- Pros:
  - More scalable with large datasets.
  - Can leverage item characteristics and interactions.
- Cons:
  - Requires significant item interaction data.
  - Might not capture nuanced user preferences.

Both approaches aim to provide personalized recommendations but differ in their method and scalability.
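
The difference between the two approaches can be illustrated on a tiny ratings matrix. The sketch below is only an illustration under assumed, made-up data (the names `ratings`, `user_similarity` and `item_similarity` are not part of this notebook): correlating rows gives user-to-user similarity, correlating columns gives item-to-item similarity.

```python
import pandas as pd

# Toy ratings matrix (rows = users, columns = items); all values are made up.
ratings = pd.DataFrame(
    {"item_a": [5, 4, 1], "item_b": [3, 2, 5], "item_c": [4, 3, 2]},
    index=["u1", "u2", "u3"],
)

# User-based view: Pearson correlation between rows (users).
# DataFrame.corr works column-wise, so the matrix is transposed first.
user_similarity = ratings.T.corr(method="pearson")

# Item-based view: Pearson correlation between columns (items).
item_similarity = ratings.corr(method="pearson")

print(user_similarity.round(3))
print(item_similarity.round(3))
```

Here `u1` and `u2` rate every item exactly one point apart, so their correlation is 1, while `u3` has roughly the opposite taste and correlates negatively with both.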

In [3]:
url = "https://www.scaler.com/topics/images/collaborative.webp"
Image(url=url)

### I decided to implement the user-based approach to avoid memory issues.

Read the CSV files into pandas DataFrames, then shuffle the rows with a fixed seed so that the order is random but reproducible.

In [4]:
anime_df = pd.read_csv("anime.csv")
anime_df = anime_df.sample(frac=1.0, random_state=42).reset_index(drop=True)

rating_df = pd.read_csv("rating.csv")
rating_df = rating_df.sample(frac=1.0, random_state=42).reset_index(drop=True)

In [5]:
anime_df

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,17209,Suzy&#039;s Zoo: Daisuki! Witzy - Happy Birthday,Kids,Special,1,6.17,158
1,173,Tactics,"Comedy, Drama, Fantasy, Mystery, Shounen, Supe...",TV,25,7.34,27358
2,3616,Kamen no Maid Guy,"Action, Comedy, Ecchi, Super Power",TV,12,7.14,27761
3,18799,Take Your Way,"Action, Music, Seinen, Supernatural",Music,1,6.66,1387
4,18831,Rinkaku,"Dementia, Horror, Music",Music,1,5.60,606
...,...,...,...,...,...,...,...
12289,4638,Milkyway,"Hentai, Romance",OVA,2,5.82,695
12290,5272,Tondemo Nezumi Daikatsuyaku,Adventure,Movie,1,6.53,252
12291,1262,Macross II: Lovers Again,"Adventure, Mecha, Military, Sci-Fi, Shounen, S...",OVA,6,6.47,6760
12292,22819,Aikatsu! Movie,"Music, School, Shoujo, Slice of Life",Movie,1,7.79,2813


In [6]:
rating_df

Unnamed: 0,user_id,anime_id,rating
0,59257,650,6
1,8203,591,7
2,15395,209,7
3,1280,6702,10
4,9259,32998,8
...,...,...,...
2925805,24906,9065,9
2925806,48795,18411,7
2925807,43226,28299,7
2925808,62082,17397,6


In [7]:
rating_df['rating'].value_counts()

rating
-1     625635
 7     570953
 8     567465
 9     362429
 6     303423
 10    253867
 5     144328
 4      53456
 3      22105
 2      12758
 1       9391
Name: count, dtype: int64
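
The most frequent value is `-1` (625,635 rows, roughly a fifth of the data), which lies outside the 1 to 10 scale. It appears to be a placeholder for "watched but not rated"; that reading is an assumption and should be checked against the dataset description. If it holds, such rows are probably best removed before ratings are averaged or correlated. A minimal sketch on made-up rows (the names `toy_ratings` and `explicit_ratings` are hypothetical):

```python
import pandas as pd

# Made-up rows in the same shape as rating.csv.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "anime_id": [20, 24, 20, 79, 24],
    "rating": [-1, 8, 10, -1, 7],
})

# Assumption: -1 is a "no rating given" placeholder, not a score.
explicit_ratings = toy_ratings[toy_ratings["rating"] != -1].reset_index(drop=True)

print(len(toy_ratings), "->", len(explicit_ratings))
print(explicit_ratings["rating"].mean())
```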

In [8]:
userInput = [
    {'Title': 'Boku dake ga Inai Machi', 'Rating': 10.0},
    {'Title': 'Violet Evergarden', 'Rating': 9.5},
    {'Title': 'Goblin Slayer', 'Rating': 6.0},
    {'Title': 'Berserk', 'Rating': 8.0},
    {'Title': 'Shingeki no Kyojin', 'Rating': 7.0},
    {'Title': 'Tokyo Ghoul', 'Rating': 6.5},
    {'Title': 'Orange', 'Rating': 6.0},
    {'Title': 'Death Parade', 'Rating': 8.0},
    {'Title': 'Death Note', 'Rating': 7.5},
    {'Title': 'Bungou Stray Dogs', 'Rating': 7.5},
    {'Title': 'Tenki no Ko', 'Rating': 8.0},
    {'Title': 'Kimi no Na wa.', 'Rating': 8.0},
    {'Title': 'Kimi no Suizou wo Tabetai', 'Rating': 8.5},
    {'Title': 'Mononoke Hime', 'Rating': 7.5},
    {'Title': 'Sen to Chihiro no Kamikakushi', 'Rating': 7.5},
    {'Title': 'Koe no Katachi', 'Rating': 8.5},
    {'Title': 'Ao Haru Ride', 'Rating': 5.5},
    {'Title': 'Toki wo Kakeru Shoujo', 'Rating': 7.0},
    {'Title': 'Another', 'Rating': 7.5},
    {'Title': 'Kimetsu no Yaiba', 'Rating': 7.0},
    {'Title': 'Shigatsu wa Kimi no Uso', 'Rating': 8.0},
    {'Title': 'Byousoku 5 Centimeter', 'Rating': 6.0},
    {'Title': 'Kokoro ga Sakebitagatterunda.', 'Rating': 7.5},
    {'Title': 'Schick x Evangelion', 'Rating': 5.0}
]

inputAnime = pd.DataFrame(userInput)
print(inputAnime)

                            Title  Rating
0         Boku dake ga Inai Machi    10.0
1               Violet Evergarden     9.5
2                   Goblin Slayer     6.0
3                         Berserk     8.0
4              Shingeki no Kyojin     7.0
5                     Tokyo Ghoul     6.5
6                          Orange     6.0
7                    Death Parade     8.0
8                      Death Note     7.5
9               Bungou Stray Dogs     7.5
10                    Tenki no Ko     8.0
11                 Kimi no Na wa.     8.0
12      Kimi no Suizou wo Tabetai     8.5
13                  Mononoke Hime     7.5
14  Sen to Chihiro no Kamikakushi     7.5
15                 Koe no Katachi     8.5
16                   Ao Haru Ride     5.5
17          Toki wo Kakeru Shoujo     7.0
18                        Another     7.5
19               Kimetsu no Yaiba     7.0
20        Shigatsu wa Kimi no Uso     8.0
21          Byousoku 5 Centimeter     6.0
22  Kokoro ga Sakebitagatterunda. 

In [9]:
anime_df.columns[:25]

Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members'], dtype='object')

In [10]:
inputAnime = pd.merge(inputAnime, anime_df[['anime_id', 'name']], how='left', left_on='Title', right_on='name')
inputAnime = inputAnime.drop(columns='name')
inputAnime

Unnamed: 0,Title,Rating,anime_id
0,Boku dake ga Inai Machi,10.0,31043.0
1,Violet Evergarden,9.5,33352.0
2,Goblin Slayer,6.0,
3,Berserk,8.0,33.0
4,Shingeki no Kyojin,7.0,16498.0
5,Tokyo Ghoul,6.5,22319.0
6,Orange,6.0,32729.0
7,Death Parade,8.0,28223.0
8,Death Note,7.5,1535.0
9,Bungou Stray Dogs,7.5,31478.0
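
Titles that are missing from the catalogue (here `Goblin Slayer`) come back from the left merge with `NaN` in `anime_id`, which also turns the whole column into floats. One way to make the unmatched titles explicit is the `indicator=True` option of `merge`. The sketch below uses a hypothetical miniature catalogue rather than `anime_df`:

```python
import pandas as pd

# Hypothetical miniature catalogue and input list.
catalogue = pd.DataFrame({"anime_id": [1535, 33], "name": ["Death Note", "Berserk"]})
wanted = pd.DataFrame({
    "Title": ["Death Note", "Goblin Slayer", "Berserk"],
    "Rating": [7.5, 6.0, 8.0],
})

merged = wanted.merge(
    catalogue, how="left", left_on="Title", right_on="name", indicator=True
)

# Rows flagged "left_only" had no counterpart in the catalogue.
unmatched = merged.loc[merged["_merge"] == "left_only", "Title"].tolist()
print(unmatched)

# Keep only the matches and restore an integer id column.
matched = (
    merged[merged["_merge"] == "both"]
    .drop(columns=["name", "_merge"])
    .astype({"anime_id": "int64"})
    .reset_index(drop=True)
)
print(matched)
```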


In [11]:
inputAnime = inputAnime.dropna(subset=['anime_id'])
inputAnime = inputAnime.reset_index(drop=True)

inputAnime

Unnamed: 0,Title,Rating,anime_id
0,Boku dake ga Inai Machi,10.0,31043.0
1,Violet Evergarden,9.5,33352.0
2,Berserk,8.0,33.0
3,Shingeki no Kyojin,7.0,16498.0
4,Tokyo Ghoul,6.5,22319.0
5,Orange,6.0,32729.0
6,Death Parade,8.0,28223.0
7,Death Note,7.5,1535.0
8,Bungou Stray Dogs,7.5,31478.0
9,Kimi no Na wa.,8.0,32281.0


In [12]:
userSubset = rating_df[rating_df['anime_id'].isin(inputAnime['anime_id'].tolist())]
userSubset

Unnamed: 0,user_id,anime_id,rating
220,6853,23273,9
402,10089,21995,8
601,3539,33,10
760,28955,32281,10
762,2398,22319,-1
...,...,...,...
2925331,121,1535,10
2925458,1256,1535,10
2925556,1442,164,8
2925669,9625,21995,8


In [13]:
userSubsetGroup = userSubset.groupby('user_id').agg(
    counts=('anime_id', 'count'),                           # Count of anime
    anime_ids=('anime_id', lambda x: list(x))               # List of anime IDs
).reset_index()

userSubsetGroup = userSubsetGroup.sort_values(by='counts', ascending=False)
userSubsetGroup

Unnamed: 0,user_id,counts,anime_ids
552,687,15,"[21995, 2236, 1689, 22319, 1535, 164, 28725, 2..."
920,1145,14,"[16498, 31043, 21995, 28725, 2236, 31478, 2327..."
2479,3338,14,"[28851, 32281, 31478, 22319, 21995, 2236, 1689..."
625,784,14,"[28223, 22319, 1535, 32729, 21995, 31043, 1649..."
606,760,13,"[22319, 28223, 31043, 31478, 199, 32729, 1535,..."
...,...,...,...
7082,68091,1,[28725]
7083,68177,1,[28725]
7084,68320,1,[28725]
7085,68405,1,[28725]


In [14]:
userSubsetGroup = userSubsetGroup[userSubsetGroup['counts'] >= 10]
userSubsetGroup

Unnamed: 0,user_id,counts,anime_ids
552,687,15,"[21995, 2236, 1689, 22319, 1535, 164, 28725, 2..."
920,1145,14,"[16498, 31043, 21995, 28725, 2236, 31478, 2327..."
2479,3338,14,"[28851, 32281, 31478, 22319, 21995, 2236, 1689..."
625,784,14,"[28223, 22319, 1535, 32729, 21995, 31043, 1649..."
606,760,13,"[22319, 28223, 31043, 31478, 199, 32729, 1535,..."
...,...,...,...
960,1197,10,"[1689, 16498, 21995, 28725, 28223, 23273, 3147..."
1519,1870,10,"[21995, 164, 31478, 31043, 28223, 22319, 11111..."
1530,1889,10,"[21995, 23273, 11111, 199, 33, 31043, 164, 168..."
1752,2243,10,"[21995, 32729, 1689, 28223, 22319, 11111, 3228..."


In [15]:
animeRating_dict = inputAnime.set_index('anime_id')['Rating'].to_dict()

pearsonCorrelation_dict = {}

for index, row in userSubsetGroup.iterrows():
    user_id = row['user_id']
    user_anime_ids = row['anime_ids']
    
    # Look up the input ratings (from animeRating_dict) for the anime this user has rated
    user_ratings = [animeRating_dict[anime_id] for anime_id in user_anime_ids if anime_id in animeRating_dict]
    
    # Calculate Pearson correlation with inputAnime ratings (use it as a baseline)
    if user_ratings:
        input_ratings = [animeRating_dict[anime_id] for anime_id in inputAnime['anime_id'] if anime_id in user_anime_ids]
        
        # Ensure both lists have the same length
        if len(user_ratings) == len(input_ratings) and len(user_ratings) > 0:
            correlation, _ = pearsonr(user_ratings, input_ratings)
            pearsonCorrelation_dict[user_id] = correlation
        else:
            pearsonCorrelation_dict[user_id] = 0
    else:
        pearsonCorrelation_dict[user_id] = 0

# Display the results
for user_id, correlation in pearsonCorrelation_dict.items():
    print(f"User ID: {user_id}, Pearson Correlation: {correlation}")

User ID: 687, Pearson Correlation: -0.23953974895397492
User ID: 1145, Pearson Correlation: -0.018181818181818202
User ID: 3338, Pearson Correlation: 0.18984771573604062
User ID: 784, Pearson Correlation: 0.14410480349344976
User ID: 760, Pearson Correlation: -0.33180428134556567
User ID: 392, Pearson Correlation: 0.12532981530343007
User ID: 446, Pearson Correlation: -0.6027397260273972
User ID: 17, Pearson Correlation: -0.4708029197080291
User ID: 342, Pearson Correlation: 0.10817941952506599
User ID: 786, Pearson Correlation: -0.0633245382585752
User ID: 1497, Pearson Correlation: 0.294811320754717
User ID: 963, Pearson Correlation: 0.08401084010840111
User ID: 2378, Pearson Correlation: 0.3827751196172249
User ID: 813, Pearson Correlation: -0.1868131868131868
User ID: 958, Pearson Correlation: -0.3411764705882353
User ID: 938, Pearson Correlation: 0.6970873786407766
User ID: 198, Pearson Correlation: 0.1
User ID: 562, Pearson Correlation: 0.1047957371225577
User ID: 1013, Pearson C
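
In the loop above, both `user_ratings` and `input_ratings` are looked up in `animeRating_dict`, so both lists hold the input user's own ratings, only in a different order. A more common formulation correlates the input ratings with each candidate's own ratings on the titles they share. The following is a sketch of that idea, not the notebook's method: the function name, the `min_common` threshold and the toy data are assumptions, and any placeholder ratings such as `-1` would presumably need to be removed from the candidate ratings beforehand.

```python
import pandas as pd


def similarity_to_input(input_ratings, candidate_ratings, min_common=3):
    """Pearson correlation between the input user and each candidate user.

    input_ratings: dict mapping anime_id -> rating of the target user.
    candidate_ratings: DataFrame with user_id, anime_id and rating columns.
    Candidates sharing fewer than `min_common` titles are skipped.
    """
    target = pd.Series(input_ratings, dtype="float64")
    scores = {}
    for user_id, group in candidate_ratings.groupby("user_id"):
        # One rating per title, indexed by anime_id.
        theirs = (
            group.drop_duplicates("anime_id")
            .set_index("anime_id")["rating"]
            .astype("float64")
        )
        common = target.index.intersection(theirs.index)
        if len(common) < min_common:
            continue
        corr = target.loc[common].corr(theirs.loc[common])
        # The correlation is undefined (NaN) when one side is constant.
        scores[user_id] = 0.0 if pd.isna(corr) else float(corr)
    return scores


# Made-up example: three shared titles, four candidate users.
toy_input = {1: 10.0, 2: 6.0, 3: 8.0}
toy_candidates = pd.DataFrame({
    "user_id": [100, 100, 100, 200, 200, 200, 300, 400, 400, 400],
    "anime_id": [1, 2, 3, 1, 2, 3, 1, 1, 2, 3],
    "rating": [9, 5, 7, 5, 9, 7, 8, 7, 7, 7],
})
toy_scores = similarity_to_input(toy_input, toy_candidates)
print(toy_scores)
```

User 100 rates everything one point lower than the input and is a perfect match, user 200 has the opposite ordering, user 300 shares too few titles and is skipped, and user 400 gives constant ratings, so the correlation is undefined and is set to 0.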

In [16]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelation_dict, orient='index')
pearsonDF.columns = ['Similarity Index']
pearsonDF['user_id'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF

Unnamed: 0,Similarity Index,user_id
0,-0.239540,687
1,-0.018182,1145
2,0.189848,3338
3,0.144105,784
4,-0.331804,760
...,...,...
92,-0.526104,1197
93,-0.375000,1870
94,-0.212654,1889
95,0.065744,2243


In [17]:
topUsers = pearsonDF.sort_values(by='Similarity Index', ascending=False)[0:40]
topUsers

Unnamed: 0,Similarity Index,user_id
77,0.833333,1419
15,0.697087,938
38,0.657778,1176
18,0.602649,1013
28,0.536082,271
82,0.483066,1400
62,0.458955,744
91,0.455128,444
12,0.382775,2378
41,0.350694,2016


In [18]:
topUsersRating = topUsers.merge(rating_df, on='user_id', how='inner')
topUsersRating

Unnamed: 0,Similarity Index,user_id,anime_id,rating
0,0.833333,1419,23319,10
1,0.833333,1419,31043,10
2,0.833333,1419,2904,10
3,0.833333,1419,13391,9
4,0.833333,1419,14189,10
...,...,...,...,...
18181,0.040000,1344,239,-1
18182,0.040000,1344,30413,9
18183,0.040000,1344,9863,-1
18184,0.040000,1344,9041,7


In [19]:
topUsersRating['weighted Rating'] = topUsersRating['Similarity Index']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,Similarity Index,user_id,anime_id,rating,weighted Rating
0,0.833333,1419,23319,10,8.333333
1,0.833333,1419,31043,10,8.333333
2,0.833333,1419,2904,10,8.333333
3,0.833333,1419,13391,9,7.500000
4,0.833333,1419,14189,10,8.333333

In [20]:
tempTopUsersRating = topUsersRating.groupby('anime_id')[['Similarity Index', 'weighted Rating']].sum()
tempTopUsersRating.columns = ['sum of similarity Index','sum of weighted Rating']
tempTopUsersRating

Unnamed: 0_level_0,sum of similarity Index,sum of weighted Rating
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,2.944419,19.664045
5,0.845858,5.965146
6,1.268708,6.559716
7,1.003768,8.823478
15,1.012594,6.341339
...,...,...
34085,0.092742,0.649194
34103,0.468151,2.845150
34136,0.040000,0.320000
34173,0.274725,-0.274725


In [21]:
recommendation_df = pd.DataFrame()
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum of weighted Rating']/tempTopUsersRating['sum of similarity Index']
recommendation_df['anime_id'] = tempTopUsersRating.index
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,anime_id
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,6.678413,1
5,7.052186,5
6,5.170389,6
7,8.790355,7
15,6.262468,15
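
Written out, the score computed above is a similarity-weighted average of the neighbours' ratings. The symbols are introduced here for illustration only: $N(i)$ is the set of top users who rated anime $i$, $s_u$ is the `Similarity Index` of user $u$, and $r_{u,i}$ is that user's rating of $i$.

$$
\text{score}(i) = \frac{\sum_{u \in N(i)} s_u \, r_{u,i}}{\sum_{u \in N(i)} s_u}
$$

For `anime_id` 1, the two sums from the previous table give $19.664045 / 2.944419 \approx 6.678$, which matches the first row above.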


In [22]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

Unnamed: 0_level_0,weighted average recommendation score,anime_id
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1
3247,10.0,3247
10153,10.0,10153
28479,10.0,28479
28683,10.0,28683
441,10.0,441
446,10.0,446
2825,10.0,2825
449,10.0,449
16143,10.0,16143
1566,10.0,1566
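
A score of exactly 10.0 appears whenever every neighbour who rated a title gave it a 10. That includes titles rated by a single neighbour, because the similarity then cancels out of the ratio. One possible refinement, which is not part of this notebook, is to require a minimum number of raters per title before ranking. The sketch below uses made-up data, and the threshold `MIN_RATERS` is a hypothetical choice:

```python
import pandas as pd

# Made-up neighbour ratings: neighbour similarity, rated item, rating.
neighbour_ratings = pd.DataFrame({
    "similarity": [0.8, 0.5, 0.4, 0.8, 0.1],
    "anime_id": [1, 1, 1, 2, 3],
    "rating": [8, 9, 7, 10, 10],
})
neighbour_ratings["weighted"] = (
    neighbour_ratings["similarity"] * neighbour_ratings["rating"]
)

# Per item: sum of similarities, sum of weighted ratings, number of raters.
per_item = neighbour_ratings.groupby("anime_id").agg(
    sim_sum=("similarity", "sum"),
    weighted_sum=("weighted", "sum"),
    n_raters=("rating", "count"),
)
per_item["score"] = per_item["weighted_sum"] / per_item["sim_sum"]

MIN_RATERS = 2  # hypothetical threshold
ranked = per_item[per_item["n_raters"] >= MIN_RATERS].sort_values(
    "score", ascending=False
)

print(per_item)
print(ranked)
```

Items 2 and 3 both score 10 on the strength of one rating each, even though the neighbour behind item 3 has a similarity of only 0.1. With the threshold applied, only item 1 remains.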


In [23]:
anime_df.loc[anime_df['anime_id'].isin(recommendation_df.head(10)['anime_id'].tolist())]


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
2839,449,InuYasha: Guren no Houraijima,"Adventure, Comedy, Demons, Drama, Historical, ...",Movie,1,7.62,50008
3141,16143,One Piece: Kinkyuu Kikaku One Piece Kanzen Kou...,"Adventure, Comedy, Fantasy, Shounen",Special,1,7.36,5914
6798,10153,Mahou Shoujo Lyrical Nanoha: The Movie 2nd A&#...,"Action, Comedy, Drama, Magic, Super Power",Movie,1,8.34,13315
6895,28683,One Piece: Episode of Alabasta - Prologue,"Action, Adventure, Fantasy, Shounen",OVA,1,7.41,4225
7325,441,Shoujo Kakumei Utena: Adolescence Mokushiroku,"Dementia, Drama, Fantasy, Romance, Shoujo",Movie,1,7.59,22219
7968,446,Weiß Kreuz Glühen,"Action, Drama, Shounen",TV,13,6.72,7043
8528,3247,Love Hina Final Selection,"Comedy, Ecchi, Harem, Romance",OVA,1,7.32,21824
8636,1566,Ghost in the Shell: Stand Alone Complex - Soli...,"Mecha, Military, Mystery, Police, Sci-Fi, Seinen",Special,1,8.22,55247
11574,28479,Detective Conan Movie 19: The Hellfire Sunflowers,"Action, Mystery, Police, Shounen",Movie,1,7.77,8600
12274,2825,Arabian Nights: Sindbad no Bouken (TV),"Adventure, Fantasy, Magic, Romance",TV,52,7.26,2631
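
Filtering with `isin` returns the rows in the order of `anime_df`, so the ranking by score is lost in the table above. A left merge starting from the ranked frame keeps that order. The sketch below uses hypothetical stand-ins for `recommendation_df.head(10)` and `anime_df`. In the notebook itself, `recommendation_df` carries `anime_id` both as index name and as column, so it would likely need `reset_index(drop=True)` before such a merge.

```python
import pandas as pd

# Hypothetical stand-ins for the ranked recommendations and the catalogue.
top = pd.DataFrame({"anime_id": [30, 10, 20], "score": [9.5, 9.1, 8.7]})
catalogue = pd.DataFrame({
    "anime_id": [10, 20, 30, 40],
    "name": ["A", "B", "C", "D"],
})

# A left merge keeps the row order of `top`, i.e. the ranking by score.
ranked_titles = top.merge(catalogue, on="anime_id", how="left")
print(ranked_titles)
```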
