**Problem Statement**

We want to build a collaborative filtering recommender system using Yelp data.

**The objective is:**

Use user–business star ratings.

Learn latent user and item representations using Matrix Factorization (ALS).

Generate personalized recommendations.


**Evaluate the system using:**

RMSE

Precision@K

Recall@K

**The system should:**

Map string IDs to numeric indices.

Create a user–item interaction matrix.

Train a factorization model.

Predict ratings.

Recommend top-N businesses.

Evaluate ranking quality.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**What this does:**

Mounts Google Drive in Colab so files can be accessed.

**Why needed:**

The dataset is stored in Drive.

In [2]:
import os
os.listdir('/content/drive/MyDrive/Recommender system')

['Dataset_User_Agreement.pdf',
 'yelp_academic_dataset_business.json',
 'yelp_academic_dataset_checkin.json',
 'yelp_academic_dataset_review.json',
 'yelp_academic_dataset_tip.json',
 'yelp_academic_dataset_user.json']

In [3]:
import pandas as pd

In [4]:
review = pd.read_json('/content/drive/MyDrive/Recommender system/yelp_academic_dataset_review.json', nrows = 100000, lines=True)

In [5]:
business = pd.read_json('/content/drive/MyDrive/Recommender system/yelp_academic_dataset_business.json', nrows = 100000, lines = True)

**Data Cleaning**

In [6]:
reviewtest = review.copy()

Creates a working copy to preserve raw data.

In [7]:
reviewtest.drop(columns=['review_id'], inplace = True)

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import io

In [9]:
reviewtest.head()

Unnamed: 0,user_id,business_id,stars,useful,funny,cool,text,date
0,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15


In [10]:
reviewtest.drop(columns=['text'], inplace = True)

In [11]:
reviewtest.drop(columns=['useful', 'funny', 'cool'], inplace = True)

**What this does:**

Removes unnecessary columns.

Why?

**For collaborative filtering, we only need:**

user_id

business_id

stars

Text and metadata are irrelevant here.
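The same reduction can also be written as a single column selection instead of successive `drop` calls. A minimal sketch on a toy frame; the rows are made up for illustration, and only the column names come from the dataset:

```python
import pandas as pd

# Toy stand-in for the review table; values are illustrative only.
toy_review = pd.DataFrame({
    "review_id": ["r1", "r2"],
    "user_id": ["u1", "u2"],
    "business_id": ["b1", "b2"],
    "stars": [3, 5],
    "useful": [0, 1],
    "text": ["ok", "great"],
})

# Keep only the three columns collaborative filtering needs.
toy_ratings = toy_review[["user_id", "business_id", "stars"]].copy()
print(list(toy_ratings.columns))
```

Selecting the wanted columns keeps the cell valid even if the raw file gains extra fields later.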

In [12]:
business.drop(columns=['address', 'state', 'postal_code', 'latitude', 'longitude', 'stars', 'review_count', 'is_open', 'attributes'], inplace = True)

In [13]:
business.drop(columns = ['hours'], inplace = True)

In [14]:
business.head()

Unnamed: 0,business_id,name,city,categories
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ",Santa Barbara,"Doctors, Traditional Chinese Medicine, Naturop..."
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,Affton,"Shipping Centers, Local Services, Notaries, Ma..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,Tucson,"Department Stores, Shopping, Fashion, Home & G..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,Philadelphia,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,Green Lane,"Brewpubs, Breweries, Food"


Removes non-essential business metadata.

Why?

We need `business_id` to map recommendations back to businesses; `name`, `city`, and `categories` are kept so the results are readable.

In [15]:
reviewtest.head()

Unnamed: 0,user_id,business_id,stars,date
0,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,2018-07-07 22:09:11
1,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,2012-01-03 15:28:18
2,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,2014-02-05 20:30:30
3,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,2015-01-04 00:01:03
4,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,2017-01-14 20:54:15


**Encoding IDs to Numeric Indices**

Matrix factorization requires integer indexing.

In [16]:
all_business_ids = pd.concat([
    reviewtest["business_id"],
    business["business_id"]
]).unique()

In [17]:
business_id_map = {
    bid: idx for idx, bid in enumerate(all_business_ids)
}

In [18]:
reviewtest["business_idx"] = reviewtest["business_id"].map(business_id_map)
business["business_idx"] = business["business_id"].map(business_id_map)

In [19]:
reviewtest.head()

Unnamed: 0,user_id,business_id,stars,date,business_idx
0,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,2018-07-07 22:09:11,0
1,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,2012-01-03 15:28:18,1
2,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,2014-02-05 20:30:30,2
3,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,2015-01-04 00:01:03,3
4,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,2017-01-14 20:54:15,4


In [20]:
reviewtest['user_idx'], user_map = pd.factorize(reviewtest['user_id'])
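`pd.factorize` returns both the integer codes and the array of unique original values, so the second return value doubles as the reverse lookup. A small sketch with made-up IDs (`toy_users` is a hypothetical stand-in for the `user_id` column):

```python
import pandas as pd

toy_users = pd.Series(["u_b", "u_a", "u_b", "u_c"])  # made-up IDs

# codes[i] is the integer index of toy_users[i]; uniques[code] recovers the string.
codes, uniques = pd.factorize(toy_users)

print(codes.tolist())      # [0, 1, 0, 2]
print(list(uniques))       # ['u_b', 'u_a', 'u_c']
print(uniques[codes[3]])   # u_c
```

Codes are assigned in order of first appearance, which is why the first rows of the table above map to 0, 1, 2, and so on.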

In [21]:
reviewtest.drop(columns=['business_id', 'user_id'], inplace = True)

In [22]:
reviewtest.head().set_index('user_idx')

user_idx,stars,date,business_idx
0,3,2018-07-07 22:09:11,0
1,5,2012-01-03 15:28:18,1
2,3,2014-02-05 20:30:30,2
3,5,2015-01-04 00:01:03,3
4,4,2017-01-14 20:54:15,4


In [23]:
business.head()

Unnamed: 0,business_id,name,city,categories,business_idx
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ",Santa Barbara,"Doctors, Traditional Chinese Medicine, Naturop...",9078
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,Affton,"Shipping Centers, Local Services, Notaries, Ma...",9004
2,tUFrWirKiKi_TAnsVWINQQ,Target,Tucson,"Department Stores, Shopping, Fashion, Home & G...",2550
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,Philadelphia,"Restaurants, Food, Bubble Tea, Coffee & Tea, B...",2289
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,Green Lane,"Brewpubs, Breweries, Food",9973


In [24]:
business.drop(columns = ['business_id'], inplace = True)

**Aggregating Ratings**

In [25]:
reviewtest = (
    reviewtest
    .groupby(['user_idx', 'business_idx'], as_index=False)['stars'].mean()
)

**What this does:**

If a user rated the same business multiple times:
→ Take mean rating.

Why?

Matrix factorization expects one rating per user-item pair.
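A toy illustration of the aggregation, using made-up ratings with the same column names as `reviewtest`:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_idx":     [0, 0, 1],
    "business_idx": [7, 7, 7],
    "stars":        [2, 4, 5],
})

# User 0 reviewed business 7 twice (2 and 4 stars); the pair collapses to one row.
agg = toy.groupby(["user_idx", "business_idx"], as_index=False)["stars"].mean()
print(agg)
```

The duplicate pair becomes a single row with a rating of 3.0, while the other pair is unchanged.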

In [26]:
rm_small = reviewtest.copy()

In [27]:
rm_small = rm_small.head(1000)

**Creating User-Item Matrix**

In [28]:
rm_small = rm_small.pivot(index = 'user_idx', columns ='business_idx', values = 'stars').fillna(0)
rm_small.head()

user_idx \ business_idx,0,1,2,3,4,5,6,7,8,9,...,8831,8862,8899,8902,9229,9252,9277,9354,9581,9774
0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**What this does:**

Creates matrix:

| user | item1 | item2 | ... |

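Missing values → filled with 0.

Why?

Matrix factorization works on numeric matrices. Here 0 means "no rating", not a zero-star review (Yelp stars range from 1 to 5), which is why the RMSE step later keeps only entries greater than 0.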
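Since a filled-in 0 is indistinguishable from a real value once the matrix is dense, it can help to record which cells were observed before filling. A sketch on made-up data; `observed` and `dense` are illustrative names, not variables from this notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_idx":     [0, 0, 1],
    "business_idx": [0, 1, 1],
    "stars":        [3.0, 5.0, 4.0],
})

wide = toy.pivot(index="user_idx", columns="business_idx", values="stars")

observed = wide.notna().to_numpy()   # True where a rating exists
dense = wide.fillna(0).to_numpy()    # 0 now stands for "no rating"

print(observed)
print(dense)
print(dense[observed].mean())        # mean over real ratings only
```

Any statistic computed on `dense` without the mask would be pulled toward zero by the unrated cells.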

In [29]:
rm = rm_small.copy()

**Preparing Data for CMF Model**

In [30]:
rm_raw = reviewtest[['user_idx', 'business_idx', 'stars']].copy()
rm_raw.columns = ['UserId', 'ItemId', 'Rating']  # Lib requires specific column names
rm_raw.head(2)

Unnamed: 0,UserId,ItemId,Rating
0,0,0,3.0
1,0,927,3.0


**Why rename?**

cmfrec requires these exact column names.

In [31]:
!pip install cmfrec

Collecting cmfrec
  Downloading cmfrec-3.5.1.post13.tar.gz (268 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting findblas (from cmfrec)
  Using cached findblas-0.1.26.post1-py3-none-any.whl
Building wheels for collected packages: cmfrec
  Building wheel for cmfrec (pyproject.toml) ... done
  Created wheel for cmfrec: filename=cmfrec-3.5.1

**Training Matrix Factorization (ALS)**

In [32]:
from cmfrec import CMF

model = CMF(method="als", k=2, lambda_=0.5, user_bias=True, item_bias=True, verbose=False)
model.fit(rm_raw)

Collective matrix factorization model
(explicit-feedback variant)


Parameters explained:

method="als" → Alternating Least Squares

k=2 → number of latent dimensions

lambda_=0.5 → L2 regularization strength

user_bias=True → fit a per-user bias term

item_bias=True → fit a per-item bias term

**What ALS does mathematically:**

Factorizes:

$$R \approx A B^{T}$$

Where:

A = user latent matrix

B = item latent matrix

It minimizes:

$$\sum_{u,i} \left(R_{ui} - A_u \cdot B_i\right)^2 + \lambda \left(\lVert A \rVert^2 + \lVert B \rVert^2\right)$$
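To make the alternating idea concrete, here is a minimal NumPy sketch of ALS on a tiny made-up matrix. It is not cmfrec's solver: it omits bias terms and mean centering, and `als_toy` and its defaults are hypothetical names chosen for illustration. With one factor matrix held fixed, each row of the other has a closed-form ridge-regression solution over the observed ratings.

```python
import numpy as np

def als_toy(R, mask, k=2, lam=0.5, n_iters=20, seed=0):
    """Alternate ridge-regression solves for user factors A and item factors B."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    A = rng.normal(scale=0.1, size=(n_users, k))
    B = rng.normal(scale=0.1, size=(n_items, k))
    reg = lam * np.eye(k)
    losses = []
    for _ in range(n_iters):
        # Fix B, solve a small regularized least-squares problem per user.
        for u in range(n_users):
            obs = mask[u]
            Bu = B[obs]
            A[u] = np.linalg.solve(Bu.T @ Bu + reg, Bu.T @ R[u, obs])
        # Fix A, solve the same kind of problem per item.
        for i in range(n_items):
            obs = mask[:, i]
            Ai = A[obs]
            B[i] = np.linalg.solve(Ai.T @ Ai + reg, Ai.T @ R[obs, i])
        # Objective from the formula above, evaluated on observed cells only.
        err = (R - A @ B.T)[mask]
        losses.append(float((err ** 2).sum() + lam * ((A ** 2).sum() + (B ** 2).sum())))
    return A, B, losses

R = np.array([[5., 4., 0.], [4., 0., 1.], [0., 2., 5.]])  # 0 = unrated
mask = R > 0
A, B, losses = als_toy(R, mask)
print(losses[0], losses[-1])
```

Because each solve exactly minimizes the objective for its block, the recorded loss does not increase from one sweep to the next.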

In [33]:
model.A_.shape, model.B_.shape

((79345, 2), (9973, 2))

In [34]:
rm_raw.Rating.mean(), model.glob_mean_

(np.float64(3.8438763269050473), 3.843876361846924)

In [35]:
import numpy as np
from sklearn.metrics import mean_squared_error as mse

**Generating Predictions**

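What this does:

Reconstructs full rating matrix:

$$\hat{R} = A B^{T} + \mu$$

Where:

μ = global mean rating

Why?

To predict ratings for all user-item pairs. Note that this reconstruction leaves out the user and item bias terms that the model learned (`user_bias=True`, `item_bias=True`), so its values can differ from the model's own predictions.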
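A prediction that includes bias terms adds a per-user and a per-item offset to the formula above: $\hat{R}_{ui} = A_u \cdot B_i + \mu + b_u + b_i$. The sketch below shows the broadcasting with small made-up arrays; `b_user` and `b_item` are hypothetical stand-ins, not values read from the fitted model.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.5, 0.5]])               # 2 users x k=2
B = np.array([[0.2, 0.0], [0.0, 0.4], [1.0, 1.0]])   # 3 items x k=2
mu = 3.8                                             # global mean
b_user = np.array([0.1, -0.2])                       # hypothetical user offsets
b_item = np.array([0.0, 0.3, -0.1])                  # hypothetical item offsets

# b_user[:, None] broadcasts down rows, b_item[None, :] across columns.
R_hat = A @ B.T + mu + b_user[:, None] + b_item[None, :]
print(R_hat.shape)   # (2, 3)
print(R_hat)
```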

In [36]:
rm__ = np.dot(model.A_, model.B_.T) + model.glob_mean_

rm_user_indices = rm.index.values
rm_business_indices = rm.columns.values


predicted_rm_aligned = rm__[np.ix_(rm_user_indices, rm_business_indices)]


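# Score only the cells that hold an actual rating (0 marks "no rating")
mask = rm.values > 0
true_ratings_subset = rm.values[mask]
predicted_ratings_subset = predicted_rm_aligned[mask]

# Calculate RMSE; these ratings were also used to fit the model, so this is training error
rmse_value = mse(true_ratings_subset, predicted_ratings_subset)**0.5
print(f"RMSE: {rmse_value}")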

RMSE: 1.3166112400301124
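The RMSE can also be computed without materializing the full user × item matrix, by scoring only the (user, item) pairs of interest. A sketch with made-up factors, using the simplified prediction $A B^{T} + \mu$ from this notebook; all arrays here are illustrative:

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0]])      # toy user factors
B = np.array([[0.5, 0.0], [0.0, -0.5]])     # toy item factors
mu = 3.0

users  = np.array([0, 1, 1])                # row index of each rating
items  = np.array([0, 0, 1])                # column index of each rating
actual = np.array([4.0, 3.0, 2.0])

# Row-wise dot products: one prediction per (user, item) pair.
pred = np.einsum("ij,ij->i", A[users], B[items]) + mu
rmse = np.sqrt(np.mean((actual - pred) ** 2))
print(pred, rmse)
```

Memory use then scales with the number of ratings scored rather than with users × items.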


**Generate Recommendations**

In [37]:
top_items = model.topN(user=1, n=10)
business.loc[business.business_idx.isin(top_items)]

Unnamed: 0,name,city,categories,business_idx
1159,Banko Overhead Doors,Tampa,"Home Services, Contractors, Building Supplies,...",5691
2858,Creole Creamery,New Orleans,"Food, Ice Cream & Frozen Yogurt",138
5830,Southern Arizona Veterinary Specialty & Emerge...,Tucson,"Pets, Veterinarians, Pet Services",4958
6287,Free Tours By Foot,Philadelphia,"Hotels & Travel, Walking Tours, Tours, Food Tours",2612
7493,Biscardi Vision,Philadelphia,"Health & Medical, Optometrists",2819
8396,Loris Soaps & Sponges Exchange,Tarpon Springs,"Cosmetics & Beauty Supply, Beauty & Spas, Shop...",6745
8402,Artmart,Saint Louis,"Shopping, Arts & Crafts, Art Supplies, Framing",2477
9834,Enjoi Sweets & Company,Tampa,"Desserts, Food, Cafes, Restaurants, Food Truck...",2712
10210,Alpine Lock and Key,Reno,"Gunsmith, Local Services, Keys & Locksmiths, B...",424
14536,Sabrina's West Street Kitchen,Reno,"Sandwiches, American (New), Restaurants, Salad...",2397


**What this does:**

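Returns the 10 items with the highest predicted rating for user 1. The `isin` filter then lists them in the `business` table's row order, not in ranking order.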
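For comparison, a top-N selection can be sketched directly in NumPy. The helper below is hypothetical (it is not part of cmfrec); it skips items the user has already rated and returns indices best first:

```python
import numpy as np

def top_n_unseen(scores, seen_items, n=3):
    """Return indices of the n highest-scoring items, best first, skipping seen ones."""
    scores = scores.astype(float).copy()
    seen = np.array(list(seen_items), dtype=int)
    scores[seen] = -np.inf                                  # never recommend rated items
    n = min(n, len(scores))
    candidates = np.argpartition(-scores, n - 1)[:n]        # unordered top n
    return candidates[np.argsort(-scores[candidates])]      # sort best first

scores = np.array([4.2, 3.1, 4.9, 2.0, 4.5])      # made-up predicted ratings
print(top_n_unseen(scores, seen_items={2}, n=3))  # [4 0 1]
```

`argpartition` avoids sorting the whole score vector, which matters when there are thousands of items per user.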

In [38]:
import numpy as np
import pandas as pd

In [39]:
def train_test_split_per_user(df, test_size=1, seed=42):
    train_rows = []
    test_rows = []

    rng = np.random.default_rng(seed)

    for u, grp in df.groupby("UserId"):
        if len(grp) <= test_size:
            continue

        test_idx = rng.choice(grp.index, size=test_size, replace=False)
        train_idx = grp.index.difference(test_idx)

        train_rows.append(grp.loc[train_idx])
        test_rows.append(grp.loc[test_idx])

    train_df = pd.concat(train_rows)
    test_df  = pd.concat(test_rows)

    return train_df, test_df
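The same leave-one-out split can be sketched without an explicit loop using `groupby(...).sample`. This is an alternative formulation on made-up data, not the function above; like that function, it drops users who have too few ratings to hold one out:

```python
import pandas as pd

toy = pd.DataFrame({
    "UserId": [0, 0, 0, 1, 1, 2],
    "ItemId": [10, 11, 12, 10, 13, 11],
    "Rating": [5.0, 3.0, 4.0, 2.0, 5.0, 4.0],
})

# Keep only users with at least two ratings so one can be held out.
counts = toy.groupby("UserId")["ItemId"].transform("size")
eligible = toy[counts >= 2]

# Hold out one random rating per eligible user; the rest is training data.
test_df = eligible.groupby("UserId").sample(n=1, random_state=42)
train_df = eligible.drop(test_df.index)

print(len(train_df), len(test_df))  # 3 2
```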

**Precision@K and Recall@K**

In [40]:
def precision_recall_at_k(
    pred_matrix,
    test_df,
    k=10,
    relevance_threshold=4
):
    """
    pred_matrix: np.ndarray (users x items) -> rm__
    test_df: DataFrame with UserId, ItemId, Rating
    """

    precisions = []
    recalls = []

    for u, grp in test_df.groupby("UserId"):
        # relevant items in test set
        relevant_items = grp.loc[
            grp.Rating >= relevance_threshold, "ItemId"
        ].values

        if len(relevant_items) == 0:
            continue

        # top-K predicted items for this user
        top_k_items = np.argsort(pred_matrix[u])[::-1][:k]

        hits = len(set(relevant_items) & set(top_k_items))

        precisions.append(hits / k)
        recalls.append(hits / len(relevant_items))

    return {
        f"precision@{k}": np.mean(precisions),
        f"recall@{k}": np.mean(recalls)
    }

**Logic:**

For each user:

1. Identify relevant items: Rating >= 4
2. Get top K predicted items.
3. Compute:

$$\text{Precision@}K = \frac{|\text{Relevant} \cap \text{Recommended}|}{K}$$

$$\text{Recall@}K = \frac{|\text{Relevant} \cap \text{Recommended}|}{|\text{Relevant}|}$$
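A worked example of the two formulas for a single user, with made-up item IDs and K = 5:

```python
relevant = {3, 8}                 # test items the user rated >= 4 (made up)
recommended = [5, 8, 1, 9, 2]     # top-K list, K = 5 (made up)

hits = len(relevant & set(recommended))      # items in both sets
precision_at_k = hits / len(recommended)     # 1 / 5
recall_at_k = hits / len(relevant)           # 1 / 2

print(precision_at_k, recall_at_k)  # 0.2 0.5
```

Only item 8 appears in both sets, so one hit out of five recommendations gives a precision of 0.2, and one hit out of two relevant items gives a recall of 0.5.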



In [41]:
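# split data
# NOTE: `model` was fit on all of rm_raw above, so test_df is not truly held out;
# refit on train_df before scoring to get an unbiased estimate.
train_df, test_df = train_test_split_per_user(rm_raw)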

# evaluate
metrics = precision_recall_at_k(
    pred_matrix=rm__,
    test_df=test_df,
    k=10,
    relevance_threshold=4
)

metrics

{'precision@10': np.float64(0.0004283734408466675),
 'recall@10': np.float64(0.004283734408466675)}

**Evaluate**

Returns:

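Mean Precision@K

Mean Recall@K

Because each user contributes exactly one held-out rating (`test_size=1`), a user has at most one relevant item, so Precision@10 equals Recall@10 divided by 10, as the output above shows.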

**🏁 Final System Flow**

1. Raw Yelp Data

2. Clean & encode IDs

3. Build interaction matrix

4. Train ALS factorization

5. Generate predictions

6. Recommend top-N items

7. Evaluate ranking quality

**To Generate the Pickle File**

In [43]:
user_ids = rm_raw['UserId'].unique()
business_ids = rm_raw['ItemId'].unique()

In [44]:
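# rm_raw already holds integer indices, so these maps go from
# user_idx / business_idx to position in order of first appearance.
user_id_map = {uid: i for i, uid in enumerate(user_ids)}
business_id_map = {bid: i for i, bid in enumerate(business_ids)}  # replaces the earlier string-ID map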

In [50]:
import pickle

model_artifact = {
    "A": model.A_,
    "B": model.B_.T,  # stored transposed: shape (k, n_items)
    "user_id_map": user_id_map,
    "business_id_map": business_id_map
}

with open("recommender_model.pkl", "wb") as f:
    pickle.dump(model_artifact, f)

print("Model saved successfully as recommender_model.pkl")

Model saved successfully as recommender_model.pkl
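A sketch of how a consumer might load the artifact and score one user. It uses a toy dictionary with the same layout as `model_artifact` and an in-memory buffer in place of the file; all values and IDs are made up. Since `"B"` is stored transposed, a user's scores are a single vector-matrix product.

```python
import io
import pickle
import numpy as np

# Toy artifact with the same layout as above: "B" is stored transposed (k x n_items).
artifact = {
    "A": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "B": np.array([[0.1, 0.9, 0.5], [0.8, 0.2, 0.5]]),
    "user_id_map": {10: 0, 11: 1},
    "business_id_map": {100: 0, 101: 1, 102: 2},
}

buf = io.BytesIO()
pickle.dump(artifact, buf)          # stands in for writing the .pkl file
buf.seek(0)
loaded = pickle.load(buf)           # only unpickle files you trust

row = loaded["user_id_map"][11]               # external ID -> row of A
scores = loaded["A"][row] @ loaded["B"]       # one score per item column
col_to_id = {c: b for b, c in loaded["business_id_map"].items()}
best = col_to_id[int(np.argmax(scores))]
print(best)  # 100
```

These scores contain only the latent-factor part; a global mean or bias terms would have to be stored in the artifact as well to reproduce full rating predictions.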


In [52]:
from google.colab import files
files.download('recommender_model.pkl')
