### Item_Based_Collaborative_Filtering.ipynb

## This notebook demonstrates how to build an Item-Based Collaborative Filtering (IBCF) recommendation system using the Surprise library.



In [1]:
!pip install "numpy<2"  # required because Surprise does not work with NumPy 2




##### Mounting Google Drive to access files


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd

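# Note: titles such as "WALLÂ·E" in later outputs suggest the file is UTF-8 encoded,
# in which case encoding="utf-8" would avoid the garbled characters.
df = pd.read_csv('/content/drive/MyDrive/0.Latest_DS_Course/RS/data/ratings_sub.csv', encoding="ISO-8859-1")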
print(df)

        userId  movieId  rating   timestamp  \
0         3218     3889     1.0  1172532894   
1         3663     3889     1.0  1044474348   
2         3704     3889     3.0   971391538   
3         8877     3889     1.0  1050744366   
4         9599     3889     0.5  1378056755   
...        ...      ...     ...         ...   
487464  130784   109159     3.5  1427063644   
487465  134800    97732     4.0  1351051540   
487466  134800    87644     4.5  1308552954   
487467  134800    99171     4.0  1356252918   
487468  134800   101581     4.0  1364789324   

                                                    title  \
0              Highlander: Endgame (Highlander IV) (2000)   
1              Highlander: Endgame (Highlander IV) (2000)   
2              Highlander: Endgame (Highlander IV) (2000)   
3              Highlander: Endgame (Highlander IV) (2000)   
4              Highlander: Endgame (Highlander IV) (2000)   
...                                                   ...   
487464  

In [4]:
# Assign DataFrame to ratings variable
ratings = df

In [5]:
ratings.head()

|  | userId | movieId | rating | timestamp | title | genres | year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3218 | 3889 | 1.0 | 1172532894 | Highlander: Endgame (Highlander IV) (2000) | Action\|Adventure\|Fantasy | 2000.0 |
| 1 | 3663 | 3889 | 1.0 | 1044474348 | Highlander: Endgame (Highlander IV) (2000) | Action\|Adventure\|Fantasy | 2000.0 |
| 2 | 3704 | 3889 | 3.0 | 971391538 | Highlander: Endgame (Highlander IV) (2000) | Action\|Adventure\|Fantasy | 2000.0 |
| 3 | 8877 | 3889 | 1.0 | 1050744366 | Highlander: Endgame (Highlander IV) (2000) | Action\|Adventure\|Fantasy | 2000.0 |
| 4 | 9599 | 3889 | 0.5 | 1378056755 | Highlander: Endgame (Highlander IV) (2000) | Action\|Adventure\|Fantasy | 2000.0 |


In [6]:
ratings.shape

(487469, 7)

Scikit-surprise (often referred to simply as Surprise) is a Python scikit (like scikit-learn) for building and analyzing recommender systems. It focuses on collaborative filtering with explicit ratings, which makes it a convenient tool for experimenting with recommendation algorithms.

## Surprise supports:

* User-based and item-based collaborative filtering

* Matrix factorization methods like SVD, SVD++, NMF

* Model selection via cross-validation

* Prediction accuracy metrics like RMSE, MAE

In [7]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp311-cp311-linux_x86_64.whl size=2463295 sha256=8e5ddbee3c69b175e3ee221a4b8d2c62c2bd660b97cbcab64cc2704a53a25624
  Stored in directory: /root/.cache/pip/wheels/2a/8f/6e/7e2899163

In [8]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 487469 entries, 0 to 487468
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     487469 non-null  int64  
 1   movieId    487469 non-null  int64  
 2   rating     487469 non-null  float64
 3   timestamp  487469 non-null  int64  
 4   title      487469 non-null  object 
 5   genres     487469 non-null  object 
 6   year       487469 non-null  float64
dtypes: float64(2), int64(3), object(2)
memory usage: 26.0+ MB


### Convert userId and movieId to strings (Surprise treats raw IDs as labels, and strings are a common convention)


In [9]:
ratings.userId = ratings.userId.astype(str)

In [10]:
ratings.movieId = ratings.movieId.astype(str)

In [11]:
# Check number of unique users and movies
print("Unique users:", ratings['userId'].nunique())
print("Unique movies:", ratings['movieId'].nunique())

Unique users: 2827
Unique movies: 6656
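
As a rough sanity check on how sparse this data is, the counts printed above can be turned into a density figure. The sketch below hard-codes the numbers from the outputs in this notebook (487469 ratings, 2827 users, 6656 movies); the variable names `n_ratings`, `n_users`, and `n_movies` are illustrative and not part of the notebook.

```python
# Numbers copied from the outputs above (ratings.shape and nunique()).
n_ratings = 487469
n_users = 2827
n_movies = 6656

# Density = observed ratings / all possible user-movie pairs.
possible_pairs = n_users * n_movies
density = n_ratings / possible_pairs
sparsity = 1 - density

print(f"Possible user-movie pairs: {possible_pairs}")
print(f"Density:  {density:.4%}")
print(f"Sparsity: {sparsity:.4%}")
```

Only a few percent of the user-movie matrix is filled, which is the usual situation for collaborative filtering and the reason neighborhood methods rely on co-rated items.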


In [12]:
# View most active users (those with most ratings)
print(ratings['userId'].value_counts().head())

userId
3218     200
75694    200
61382    200
47594    200
29990    200
Name: count, dtype: int64


In [13]:
# View most rated movies
print(ratings['movieId'].value_counts().head())

movieId
4993    2481
4306    2356
5952    2338
7153    2235
3578    2226
Name: count, dtype: int64


In [14]:
# Install surprise library if not already installed
# !pip install scikit-surprise

In [15]:
import numpy as np
print(np.__version__)


1.26.4


#### Import Surprise modules


In [16]:
from surprise import Dataset, Reader

### Define the format of the input data for Surprise


In [17]:
reader = Reader(rating_scale=(1, 5))  # Estimates are clipped to this range; the data also contains 0.5 ratings, so (0.5, 5) may fit better

#### Load data into Surprise dataset format


In [18]:
data = Dataset.load_from_df(ratings[['userId', 'title', 'rating']], reader)

| Surprise Concept | Your Columns |
| ---------------- | ------------ |
| `uid` (user ID)  | `userId`     |
| `iid` (item ID)  | `title`      |
| `rating`         | `rating`     |


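Surprise's own API (for example `algo.predict(uid=..., iid=...)`) expects raw IDs. In the mapping DataFrame built later in this notebook, the columns mean:

* `userId`: raw user ID (from your original DataFrame)

* `uid`: inner user ID (integer assigned by Surprise)

* `title`: raw item ID (the movie title)

* `iid`: inner item ID (integer assigned by Surprise)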
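
To make the raw-versus-inner distinction concrete, here is a minimal pure-Python sketch of how raw IDs could be mapped to consecutive integers. It only illustrates the idea and is not Surprise's actual code; the rows and the dictionary names (`raw2inner_users`, `raw2inner_items`, `inner2raw_users`) are assumptions made for this example.

```python
# Hypothetical (user, item, rating) rows, used only for illustration.
rows = [
    ("248", "Life of Pi (2012)", 3.0),
    ("248", "Snatch (2000)", 4.0),
    ("84115", "Snatch (2000)", 4.0),
]

raw2inner_users = {}
raw2inner_items = {}

for raw_uid, raw_iid, _ in rows:
    # Assign the next free integer the first time a raw ID is seen.
    raw2inner_users.setdefault(raw_uid, len(raw2inner_users))
    raw2inner_items.setdefault(raw_iid, len(raw2inner_items))

# Reverse lookup: inner ID -> raw ID.
inner2raw_users = {inner: raw for raw, inner in raw2inner_users.items()}

print(raw2inner_users)
print(raw2inner_items)
print(inner2raw_users[0])
```

In Surprise itself, the equivalent lookups are the trainset methods `to_inner_uid`, `to_inner_iid`, `to_raw_uid`, and `to_raw_iid`, which this notebook uses further down.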



In [19]:
data

<surprise.dataset.DatasetAutoFolds at 0x7af19cda96d0>

#### Split dataset into train and test sets (75%-25% split)


In [20]:
from surprise.model_selection import train_test_split

trainset, testset = train_test_split(data, test_size=0.25, random_state = 123)
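
Conceptually, a seeded train/test split shuffles the ratings once and cuts the shuffled list at the chosen fraction. The sketch below shows that idea in plain Python; `split_ratings` is a hypothetical helper written for this example, and its rounding and shuffling details are not guaranteed to match Surprise's `train_test_split`.

```python
import random

def split_ratings(rows, test_size=0.25, seed=123):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    rng = random.Random(seed)          # fixed seed -> reproducible split
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical ratings: 20 rows spread over 5 users.
rows = [(f"u{i % 5}", f"item{i}", 3.0) for i in range(20)]
train, test = split_ratings(rows)
print(len(train), len(test))
```

Because the split is over individual ratings rather than over users, a user's ratings usually end up in both sets, which is what lets the model predict test ratings for users it has already seen.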

In [21]:
# To build on full data

#trainset = data.build_full_trainset()

In [22]:
trainset

<surprise.trainset.Trainset at 0x7af1966de7d0>

In [23]:
# raw ID: the original userId and title values from the DataFrame

# inner ID: the integer index Surprise assigns to each user and item in the trainset

In [24]:
# View internal training data structure
print("Trainset size (interactions):", trainset.n_ratings)

Trainset size (interactions): 365601


In [25]:
# trainset.ur maps each inner user ID to a list of (inner item ID, rating) tuples
user_records = trainset.ur

In [26]:
user_records[0]

[(0, 3.0),
 (195, 4.0),
 (1066, 3.5),
 (999, 3.5),
 (237, 3.0),
 (1577, 3.0),
 (932, 2.0),
 (247, 4.5),
 (2215, 3.0),
 (221, 4.0),
 (745, 3.0),
 (133, 3.0),
 (249, 3.0),
 (1065, 2.5),
 (255, 3.5),
 (167, 4.0),
 (586, 3.5),
 (1234, 4.0),
 (259, 4.5),
 (729, 2.5),
 (236, 3.5),
 (181, 3.5),
 (3245, 3.5),
 (1014, 3.0),
 (577, 5.0),
 (2789, 3.5),
 (91, 4.0),
 (10, 4.0),
 (19, 3.5),
 (274, 4.0),
 (2135, 3.0),
 (1419, 3.5),
 (695, 4.0),
 (1373, 3.5),
 (850, 3.0),
 (334, 4.0),
 (2759, 3.0),
 (222, 3.0),
 (37, 4.0),
 (380, 2.5),
 (544, 4.0),
 (542, 4.5),
 (1135, 5.0),
 (650, 5.0),
 (4625, 3.5),
 (341, 1.0),
 (780, 4.0),
 (2371, 3.0),
 (661, 4.0),
 (4742, 4.5),
 (1660, 3.5),
 (4189, 2.5),
 (110, 2.5),
 (2349, 3.0),
 (2285, 3.5),
 (2623, 3.0),
 (1001, 4.0),
 (1490, 3.0),
 (171, 4.0),
 (465, 4.0),
 (733, 5.0),
 (894, 3.0),
 (3771, 3.0),
 (933, 3.0),
 (1083, 3.0),
 (3003, 3.0),
 (11, 3.0),
 (756, 2.5),
 (604, 3.5),
 (258, 4.0),
 (725, 5.0),
 (320, 3.5),
 (1838, 5.0),
 (383, 3.0),
 (3977, 3.0),
 (19

In [27]:
trainset.to_raw_uid(0)  # raw uid: the original userId from the dataset

'248'

In [28]:
trainset.to_raw_iid(0)  # raw iid: the original title from the dataset

'Life of Pi (2012)'

==================================================================================

## This part of the code only shows the mapping between the DataFrame and the Surprise trainset, to aid understanding

In [29]:
# Step 1: Reuse the trainset built earlier
newset = trainset  # train_test_split(...) already returns a ready-to-use Trainset

# Step 2: Create inner-to-raw mappings (optional, for reverse lookup or debugging)
# uid_map = {newset.to_inner_uid(uid): uid for uid in newset._raw2inner_id_users}
# iid_map = {newset.to_inner_iid(iid): iid for iid in newset._raw2inner_id_items}

# Step 3: Extract the raw (userId, title) pairs used in the trainset
train_records = [(newset.to_raw_uid(u), newset.to_raw_iid(i)) for (u, i, _) in newset.all_ratings()]
train_df = pd.DataFrame(train_records, columns=['userId', 'title'])

# Step 4: Join with the original DataFrame to get ratings
# Note: if df holds repeated (userId, title) pairs, this merge returns more rows
# than trainset.n_ratings.
df_train_only = pd.merge(train_df, df, on=['userId', 'title'], how='left')

# Step 5: Add inner user and item IDs using the trainset mappings
df_train_only['uid'] = df_train_only['userId'].apply(newset.to_inner_uid)
df_train_only['iid'] = df_train_only['title'].apply(newset.to_inner_iid)

# Step 6: Reorder columns for clarity
final_df = df_train_only[['userId', 'uid', 'title', 'iid', 'rating']]

# Output
print(final_df)


       userId   uid                                title   iid  rating
0         248     0                    Life of Pi (2012)     0     3.0
1         248     0                       WALLÂ·E (2008)   195     4.0
2         248     0         Step Up 2 the Streets (2008)  1066     3.5
3         248     0  Talk to Her (Hable con Ella) (2002)   999     3.5
4         248     0              Illusionist, The (2006)   237     3.0
...       ...   ...                                  ...   ...     ...
365636  84115  2826          Flags of Our Fathers (2006)  1913     3.0
365637  84115  2826                        Snatch (2000)   660     4.0
365638  84115  2826                 Terminal, The (2004)   346     2.5
365639  84115  2826          Boondock Saints, The (2000)   209     4.0
365640  84115  2826       40-Year-Old Virgin, The (2005)   878     3.0

[365641 rows x 5 columns]


In [30]:
final_df

|  | userId | uid | title | iid | rating |
| --- | --- | --- | --- | --- | --- |
| 0 | 248 | 0 | Life of Pi (2012) | 0 | 3.0 |
| 1 | 248 | 0 | WALLÂ·E (2008) | 195 | 4.0 |
| 2 | 248 | 0 | Step Up 2 the Streets (2008) | 1066 | 3.5 |
| 3 | 248 | 0 | Talk to Her (Hable con Ella) (2002) | 999 | 3.5 |
| 4 | 248 | 0 | Illusionist, The (2006) | 237 | 3.0 |
| ... | ... | ... | ... | ... | ... |
| 365636 | 84115 | 2826 | Flags of Our Fathers (2006) | 1913 | 3.0 |
| 365637 | 84115 | 2826 | Snatch (2000) | 660 | 4.0 |
| 365638 | 84115 | 2826 | Terminal, The (2004) | 346 | 2.5 |
| 365639 | 84115 | 2826 | Boondock Saints, The (2000) | 209 | 4.0 |


============================================================================================================


In [31]:
# Example: Show how user and item IDs are mapped internally
print("Internal user id 0 maps to:", trainset.to_raw_uid(0))
print("Internal item id 0 maps to:", trainset.to_raw_iid(0))

Internal user id 0 maps to: 248
Internal item id 0 maps to: Life of Pi (2012)


In [32]:
# Import algorithm and metrics
from surprise import KNNWithMeans, accuracy, Prediction

In [33]:
# Similarity measures: Pearson correlation for IBCF (used here), cosine for UBCF

# Initialize the IBCF algorithm with Pearson similarity (user_based=False means item-based)

algo = KNNWithMeans(k=51, sim_options={'name' : 'pearson' , 'user_based' : False})

In [34]:
# Train the model
algo.fit(trainset)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7af1968ac610>
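
The similarity matrix computed during `fit` holds one Pearson correlation per pair of items. As a rough illustration of what a single entry represents, the sketch below computes the Pearson correlation between two items over the users who rated both. The function `pearson_item_sim` and the toy ratings are assumptions for this example, and edge-case handling may differ from Surprise's implementation.

```python
from math import sqrt

def pearson_item_sim(ratings_a, ratings_b):
    """Pearson correlation between two items over users who rated both.

    ratings_a / ratings_b: dict mapping user id -> rating for one item.
    Returns 0.0 when there are fewer than 2 common users or no variance.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    a = [ratings_a[u] for u in common]
    b = [ratings_b[u] for u in common]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

# Hypothetical ratings for three items.
item_x = {"u1": 5.0, "u2": 3.0, "u3": 1.0, "u4": 4.0}
item_y = {"u1": 4.0, "u2": 2.0, "u3": 0.5}
item_z = {"u1": 1.0, "u2": 3.0, "u3": 5.0}

sim_xy = pearson_item_sim(item_x, item_y)  # users agree: close to +1
sim_xz = pearson_item_sim(item_x, item_z)  # users disagree: -1
print(sim_xy, sim_xz)
```

Items whose ratings rise and fall together across the same users get a similarity near +1, and those are the neighbors that drive the item-based predictions.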

In [35]:
len(testset)

121868

In [36]:
# Evaluate model on testset (predict only known interactions)
test_pred = algo.test(testset)

In [37]:
# Calculate RMSE to evaluate prediction accuracy
accuracy.rmse(test_pred)

RMSE: 0.8113


0.8113433713272009
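
The RMSE reported above is the square root of the mean squared difference between the true rating (`r_ui`) and the estimate (`est`). The sketch below computes RMSE and MAE by hand on a few made-up (true, estimated) pairs; the `rmse` and `mae` helpers are written for this example and are not the Surprise functions.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true, estimated) pairs."""
    return sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Hypothetical (true rating, estimated rating) pairs.
pairs = [(2.5, 2.5), (4.5, 3.5), (2.5, 1.5), (2.0, 3.0)]
print("RMSE:", rmse(pairs))
print("MAE: ", mae(pairs))
```

RMSE penalizes large misses more heavily than MAE, so comparing the two gives a hint about whether the error comes from many small misses or a few large ones.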

In [38]:
# View some test predictions
test_pred[0:10]

[Prediction(uid='107317', iid='Signs (2002)', r_ui=2.5, est=2.4914040802676256, details={'actual_k': 51, 'was_impossible': False}),
 Prediction(uid='103061', iid='Inconvenient Truth, An (2006)', r_ui=4.5, est=3.8431681088652874, details={'actual_k': 51, 'was_impossible': False}),
 Prediction(uid='84115', iid='Battlefield Earth (2000)', r_ui=2.5, est=1.349355244122438, details={'actual_k': 51, 'was_impossible': False}),
 Prediction(uid='130756', iid='Fast and the Furious: Tokyo Drift, The (Fast and the Furious 3, The) (2006)', r_ui=2.0, est=2.6011673998368963, details={'actual_k': 51, 'was_impossible': False}),
 Prediction(uid='24878', iid='Drive (2011)', r_ui=4.5, est=4.438749635000185, details={'actual_k': 51, 'was_impossible': False}),
 Prediction(uid='137648', iid='Matrix Reloaded, The (2003)', r_ui=4.5, est=3.8479008349158783, details={'actual_k': 51, 'was_impossible': False}),
 Prediction(uid='52242', iid='Spy Game (2001)', r_ui=2.5, est=3.462597865564213, details={'actual_k': 51,

#### Convert predictions to DataFrame for inspection


In [39]:
test_pred_df = pd.DataFrame(test_pred)

In [40]:
test_pred_df.head()

|  | uid | iid | r_ui | est | details |
| --- | --- | --- | --- | --- | --- |
| 0 | 107317 | Signs (2002) | 2.5 | 2.491404 | {'actual_k': 51, 'was_impossible': False} |
| 1 | 103061 | Inconvenient Truth, An (2006) | 4.5 | 3.843168 | {'actual_k': 51, 'was_impossible': False} |
| 2 | 84115 | Battlefield Earth (2000) | 2.5 | 1.349355 | {'actual_k': 51, 'was_impossible': False} |
| 3 | 130756 | Fast and the Furious: Tokyo Drift, The (Fast a... | 2.0 | 2.601167 | {'actual_k': 51, 'was_impossible': False} |
| 4 | 24878 | Drive (2011) | 4.5 | 4.43875 | {'actual_k': 51, 'was_impossible': False} |


#### Extract 'was_impossible' flag from prediction details


In [41]:
test_pred_df['was_impossible'] = [x['was_impossible'] for x in test_pred_df['details']]

In [42]:
test_pred_df.head()

|  | uid | iid | r_ui | est | details | was_impossible |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 107317 | Signs (2002) | 2.5 | 2.491404 | {'actual_k': 51, 'was_impossible': False} | False |
| 1 | 103061 | Inconvenient Truth, An (2006) | 4.5 | 3.843168 | {'actual_k': 51, 'was_impossible': False} | False |
| 2 | 84115 | Battlefield Earth (2000) | 2.5 | 1.349355 | {'actual_k': 51, 'was_impossible': False} | False |
| 3 | 130756 | Fast and the Furious: Tokyo Drift, The (Fast a... | 2.0 | 2.601167 | {'actual_k': 51, 'was_impossible': False} | False |
| 4 | 24878 | Drive (2011) | 4.5 | 4.43875 | {'actual_k': 51, 'was_impossible': False} | False |


#### Show count of possible vs impossible predictions


In [43]:
test_pred_df['was_impossible'].value_counts()

| was_impossible | count |
| --- | --- |
| False | 121417 |
| True | 451 |


#### Make a prediction for a specific user and item


In [44]:
example_pred = algo.predict(uid='41891', iid="Wrong Trousers, The (1993)")
print("Estimated rating:", example_pred.est)

Estimated rating: 3.511396303620614
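
The printed estimate (3.511396303620614) is identical to the placeholder rating that appears in the anti-testset output later in this notebook, which suggests this particular prediction was flagged as impossible and fell back to the global mean. One way to tell the two cases apart is to look at `details['was_impossible']`. The sketch below uses a stand-in `Prediction` namedtuple with the same field names as the tuples shown in the outputs above; the `describe` helper and the example values are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same field names as the Prediction tuples printed above.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

def describe(pred):
    """Return a short label saying whether the estimate came from the neighbors."""
    if pred.details.get("was_impossible", False):
        return f"{pred.uid} / {pred.iid}: fallback estimate {pred.est:.3f}"
    return f"{pred.uid} / {pred.iid}: KNN estimate {pred.est:.3f}"

# Hypothetical predictions for illustration only.
ok = Prediction("u1", "Movie A", None, 3.9, {"actual_k": 40, "was_impossible": False})
bad = Prediction("u1", "Movie B", None, 3.5, {"was_impossible": True})

print(describe(ok))
print(describe(bad))
```

With the real object from the cell above, the same flag is available as `example_pred.details['was_impossible']`.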


#### Recommend movies not yet rated by a user

Build the anti-testset (all user-item pairs NOT in the trainset).


#### What is build_anti_testset()?
* It builds a testset of all user-item pairs that are NOT in the training set.

* In other words, it creates a list of (user, item, rating) triples for every user-item combination where the user has NOT rated that item in the training data.

* The rating for these pairs is set to the trainset’s global mean rating (just a placeholder).
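
The idea behind the three points above can be sketched in a few lines of plain Python. This `build_anti_testset` function is a simplified stand-in written for this example, not the Surprise method, and the toy ratings are made up.

```python
def build_anti_testset(ratings):
    """ratings: list of (user, item, rating) triples.

    Returns every (user, item) pair that is NOT rated, using the
    global mean rating as the placeholder third element.
    """
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    rated = {(u, i) for u, i, _ in ratings}
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    return [(u, i, global_mean) for u in users for i in items if (u, i) not in rated]

# Hypothetical ratings: 3 users, 3 items, 4 known ratings.
ratings = [("u1", "A", 4.0), ("u1", "B", 2.0), ("u2", "A", 3.0), ("u3", "C", 5.0)]
anti = build_anti_testset(ratings)
print(len(anti))
print(anti[:3])
```

The result has (users × items) minus (known ratings) entries, so it grows very quickly with the size of the catalog.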

In [45]:
testset_new = trainset.build_anti_testset()

In [46]:
testset_new[0:5]

[('248', 'Disturbia (2007)', 3.511396303620614),
 ('248', 'Hamlet 2 (2008)', 3.511396303620614),
 ('248', 'Unbreakable (2000)', 3.511396303620614),
 ('248', 'Finding Neverland (2004)', 3.511396303620614),
 ('248', 'X2: X-Men United (2003)', 3.511396303620614)]

In [47]:
len(testset_new)

17308818

#### Predict ratings for unseen movies for first 10,000 user-item pairs


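* When we call `algo.test(testset_new[0:10000])`, each estimate is the item's mean rating plus a similarity-weighted average of the user's mean-centered ratings on the most similar items (up to `k`). If the user or item is unknown to the trainset, the estimate falls back to the global mean.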
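
As a rough illustration of how an item-based, mean-centered KNN estimate can be computed, here is a small sketch. The function `knn_with_means_estimate`, its parameters, and the toy data are assumptions for this example; it follows the general KNN-with-means formula and leaves out details of the real implementation, such as clipping the estimate to the rating scale.

```python
import heapq

def knn_with_means_estimate(target_item, user_ratings, item_means, sims, k=51, min_k=1):
    """Estimate one user's rating for target_item.

    user_ratings: dict item -> rating, for items the user already rated.
    item_means:   dict item -> mean rating of that item.
    sims:         dict (item_a, item_b) -> similarity.
    """
    # Candidate neighbors are the items this user has rated.
    neighbors = [(sims.get((target_item, j), 0.0), j, r) for j, r in user_ratings.items()]
    top_k = heapq.nlargest(k, neighbors, key=lambda t: t[0])

    weighted_dev = 0.0
    sim_total = 0.0
    used = 0
    for sim, j, r in top_k:
        if sim > 0:  # only positively similar items contribute
            weighted_dev += sim * (r - item_means[j])
            sim_total += sim
            used += 1

    est = item_means[target_item]  # start from the target item's mean
    if used >= min_k and sim_total > 0:
        est += weighted_dev / sim_total
    return est

# Hypothetical data: the user liked "A" more than its average, and rated "B" at its average.
item_means = {"T": 3.0, "A": 4.0, "B": 2.0}
user_ratings = {"A": 5.0, "B": 2.0}
sims = {("T", "A"): 0.5, ("T", "B"): 0.5}

est = knn_with_means_estimate("T", user_ratings, item_means, sims)
print(est)
```

When no neighbor has positive similarity, the sketch simply returns the target item's mean, which is why sparse items often get estimates close to their own average.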




In [48]:
predictions = algo.test(testset_new[0:10000])

In [49]:
predictions[0]

Prediction(uid='248', iid='Disturbia (2007)', r_ui=3.511396303620614, est=3.2774421731141405, details={'actual_k': 51, 'was_impossible': False})

#### What does was_impossible=True mean?
* It means the prediction could NOT be made by the algorithm for that user-item pair.

* This usually happens when the model does not have enough information to produce a meaningful prediction.

#### What happens in practice?
* When was_impossible=True, Surprise sets the predicted rating (est) to a default value, which is the trainset's global mean unless the algorithm overrides it. It is not set to NaN.




In [50]:
# Create DataFrame of predictions
predictions_df = pd.DataFrame([[x.uid, x.iid, x.est] for x in predictions],
                              columns=["userId", "movie_name", "est_rating"])

In [51]:
predictions_df.head()

|  | userId | movie_name | est_rating |
| --- | --- | --- | --- |
| 0 | 248 | Disturbia (2007) | 3.277442 |
| 1 | 248 | Hamlet 2 (2008) | 2.443231 |
| 2 | 248 | Unbreakable (2000) | 3.207692 |
| 3 | 248 | Finding Neverland (2004) | 3.693388 |
| 4 | 248 | X2: X-Men United (2003) | 3.439236 |


In [52]:
# Sort by user and estimated rating, both descending (userId is a string, so users sort lexicographically)
predictions_df.sort_values(by=["userId", "est_rating"], ascending=False, inplace=True)


In [53]:
predictions_df.head(30)

|  | userId | movie_name | est_rating |
| --- | --- | --- | --- |
| 8040 | 45844 | Elizabeth I (2005) | 5.0 |
| 9039 | 45844 | Star Wars Uncut: Director's Cut (2012) | 5.0 |
| 9147 | 45844 | Lucky Break (2001) | 5.0 |
| 9413 | 45844 | Dog Pound (2010) | 5.0 |
| 9497 | 45844 | 911 in Plane Site (2004) | 5.0 |
| 9507 | 45844 | Wild Things: Diamonds in the Rough (2005) | 5.0 |
| 9539 | 45844 | Serial (Bad) Weddings (Qu'est-ce Qu'on An Fit ... | 5.0 |
| 9877 | 45844 | Bag It (2010) | 5.0 |
| 9909 | 45844 | Triad Election (Election 2) (Hak se wui yi wo ... | 5.0 |
| 9978 | 45844 | Stromberg - Der Film (2014) | 5.0 |


In [54]:
#  Extract top 10 recommended movies for each user
top_10_recos = predictions_df.groupby('userId').head(10).reset_index(drop=True)
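
The `groupby(...).head(10)` step relies on the rows already being sorted by estimated rating. An alternative that does not depend on a prior sort is to keep the N highest-scoring items per user directly. The sketch below does this in plain Python; `top_n_per_user` and the toy predictions are hypothetical names and data for this example.

```python
from collections import defaultdict
import heapq

def top_n_per_user(predictions, n=10):
    """predictions: iterable of (user, item, est).

    Returns a dict mapping each user to a list of (item, est),
    highest estimate first, with at most n entries per user.
    """
    by_user = defaultdict(list)
    for user, item, est in predictions:
        by_user[user].append((est, item))
    return {
        user: [(item, est) for est, item in heapq.nlargest(n, scored)]
        for user, scored in by_user.items()
    }

# Hypothetical predictions for two users.
preds = [
    ("u1", "A", 4.5), ("u1", "B", 3.0), ("u1", "C", 4.9),
    ("u2", "A", 2.0), ("u2", "D", 4.0),
]
top2 = top_n_per_user(preds, n=2)
print(top2)
```

Ties in the estimate are broken by item name here, whereas the pandas version keeps whatever order the sort produced.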


In [55]:
# Show the top 10 recommendations (only users covered by the first 10,000 anti-testset pairs appear)
top_10_recos.head(50)

|  | userId | movie_name | est_rating |
| --- | --- | --- | --- |
| 0 | 45844 | Elizabeth I (2005) | 5.0 |
| 1 | 45844 | Star Wars Uncut: Director's Cut (2012) | 5.0 |
| 2 | 45844 | Lucky Break (2001) | 5.0 |
| 3 | 45844 | Dog Pound (2010) | 5.0 |
| 4 | 45844 | 911 in Plane Site (2004) | 5.0 |
| 5 | 45844 | Wild Things: Diamonds in the Rough (2005) | 5.0 |
| 6 | 45844 | Serial (Bad) Weddings (Qu'est-ce Qu'on An Fit ... | 5.0 |
| 7 | 45844 | Bag It (2010) | 5.0 |
| 8 | 45844 | Triad Election (Election 2) (Hak se wui yi wo ... | 5.0 |
| 9 | 45844 | Stromberg - Der Film (2014) | 5.0 |


In [56]:
# Save recommendations to a CSV file if needed
top_10_recos.to_csv('top_10_recommendations.csv', index=False)