# Content-Based Filtering for Dating Profile Recommendations

This notebook implements a content-based filtering approach to recommending dating profiles to users. Content-based filtering uses the characteristics of items (profiles, in this case) to recommend items similar to those a user has liked in the past.

In our implementation, we combine a baseline predictor (global average + user/item biases) with a content-based approach that leverages profile features to find similarities between profiles.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Paths
DATA_DIR = Path().resolve().parent / "data"
TRAIN_FILE = DATA_DIR / "ratings.dat"
TEST_FILE = DATA_DIR / "ratings-Test.dat"
GENDER_FILE = DATA_DIR / "gender.dat"

## 1. Data Loading and Preprocessing

With the libraries imported and the file paths defined above, we load our dataset. The dataset contains:
- Ratings given by users to different profiles
- Gender information for users

We use these to build our recommendation system. The training ratings are split 80/20 into training and validation subsets so that parameters can be tuned without touching the test file.

In [11]:
# 1) Load data
train_df = pd.read_csv(TRAIN_FILE)
test_df = pd.read_csv(TEST_FILE)
gender_df = pd.read_csv(GENDER_FILE, names=["userID", "Gender"], header=None)

train_df = train_df.merge(gender_df, on="userID", how="left")

# 2) Split into train/validation for tuning
train_sub, val_sub = train_test_split(train_df, test_size=0.2, random_state=42)

## 2. Baseline Predictor

Before implementing content-based filtering, we establish a baseline prediction model. This baseline consists of three components:

1. **Global Mean (μ)**: The average rating across all user-profile interactions
2. **Item Bias (b_i)**: How much better/worse a profile is rated compared to the average
3. **User Bias (b_u)**: How much higher/lower a user tends to rate compared to the average

The formula for baseline prediction is:

$\hat{r}_{ui} = \mu + b_u + b_i$

We use regularization (λ) to prevent overfitting, especially for users/profiles with few ratings.
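
Concretely, the code in the next cell estimates the biases as $b_i = \frac{\sum_{u \in R(i)} (r_{ui} - \mu)}{|R(i)| + \lambda}$ and $b_u = \frac{\sum_{i \in R(u)} (r_{ui} - \mu - b_i)}{|R(u)| + \lambda}$, where $R(i)$ (the users who rated profile $i$) and $R(u)$ (the profiles rated by user $u$) are notation introduced here for illustration. The sketch below shows the shrinkage effect on made-up numbers: the global mean and the ratings are hypothetical, and `shrunk_bias` is an illustrative helper, not a function from this notebook.

```python
def shrunk_bias(deviations, lam=10):
    """Regularized mean of deviations: sum(d) / (n + lam)."""
    return sum(deviations) / (len(deviations) + lam)


mu = 5.0  # hypothetical global mean rating

# Two hypothetical profiles that always receive a rating of 10:
few = [10 - mu] * 2      # rated only twice
many = [10 - mu] * 200   # rated two hundred times

b_few = shrunk_bias(few)    # 10 / 12: pulled strongly toward 0
b_many = shrunk_bias(many)  # 1000 / 210: close to the raw deviation of 5

print(f"bias with 2 ratings:   {b_few:.3f}")
print(f"bias with 200 ratings: {b_many:.3f}")
```

With λ = 10, a profile needs well over ten ratings before its bias approaches its raw average deviation, which is the intended protection against sparse data.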

In [12]:
def compute_baseline(df, λ=10):
    """Compute global mean, item biases b_i and user biases b_u on df."""
    mu = df["rating"].mean()
    # item bias
    b_i = df.groupby("profileID").apply(lambda g: (g.rating - mu).sum() / (len(g) + λ))

    # user bias (using item bias)
    def user_bias(g):
        return (g.rating - mu - b_i.reindex(g.profileID).values).sum() / (len(g) + λ)

    b_u = df.groupby("userID").apply(user_bias)
    return mu, b_i, b_u


def baseline_pred(user, item, mu, b_i, b_u):
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


# 3) Compute baseline on train_sub and residuals
mu_sub, b_i_sub, b_u_sub = compute_baseline(train_sub, λ=10)
train_sub = train_sub.copy()
train_sub["residual"] = train_sub.apply(
    lambda r: r.rating - baseline_pred(r.userID, r.profileID, mu_sub, b_i_sub, b_u_sub),
    axis=1,
)

## 3. Content-Based Model Construction

The core of content-based filtering is to represent items (profiles) as feature vectors. We'll create these feature vectors from both explicit attributes (like gender distribution of raters) and implicit signals (like average rating residuals).

Key aspects of our content-based model:

1. We work with residuals (actual rating - baseline prediction) rather than raw ratings
2. We create profile-level features that capture demographic and popularity aspects
3. Features are standardized to ensure they contribute equally to similarity calculations
4. We build a mapping structure to efficiently access profile features and user histories
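
To illustrate point 3, here is a minimal sketch of z-score standardization in plain Python. The feature values are invented for the example, and `standardize` is a hypothetical helper; it uses the population standard deviation, which is also what `StandardScaler` uses by default.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores: subtract the mean, divide by the population std."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]


# Hypothetical raw features for four profiles, on very different scales.
log_count = [0.7, 2.4, 4.6, 6.9]
p_female = [0.10, 0.20, 0.30, 0.40]

z_count = standardize(log_count)
z_female = standardize(p_female)

# After standardization both columns have mean 0 and unit variance.
print([round(z, 3) for z in z_count])
print([round(z, 3) for z in z_female])
```

One consequence worth keeping in mind: because standardized features are centered on zero, cosine similarities between profiles can be negative.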

In [13]:
# 4) Function to build content-based structures on any DataFrame with 'residual'
def build_content_model(df):
    # profile-level aggregates on residuals
    pf = df.groupby("profileID").agg(
        rating_count=("residual", "count"),
        avg_residual=("residual", "mean"),
        female_count=("Gender", lambda x: (x == "F").sum()),
        male_count=("Gender", lambda x: (x == "M").sum()),
    )
    total_count = pf["rating_count"]
    # Everything that is not F or M counts as unknown: explicit "U" labels and
    # any NaN genders the left merge may have produced.
    pf["unknown_count"] = pf["rating_count"] - pf["female_count"] - pf["male_count"]
    pf["p_female"] = pf["female_count"] / total_count
    pf["p_male"] = pf["male_count"] / total_count
    pf["p_unknown"] = pf["unknown_count"] / total_count
    pf["log_count"] = np.log1p(pf["rating_count"])

    ############################
    # Add gender ratio feature
    # pf["gender_ratio"] = pf["female_count"] / (pf["male_count"] + 1e-8)  # F:M ratio

    # # Add additional statistics about ratings by gender
    # gender_stats = {}
    # for gender in ["F", "M", "U"]:
    #     gender_df = df[df["Gender"] == gender]
    #     if not gender_df.empty:
    #         gender_avg = gender_df.groupby("profileID")["residual"].mean()
    #         gender_stats[f"{gender.lower()}_avg_residual"] = gender_avg

    # # Merge these statistics with the profile dataframe
    # for col, series in gender_stats.items():
    #     pf = pf.join(series.rename(col), how="left")
    #     pf[col] = pf[col].fillna(0)  # Fill NAs with 0

    # # Expanded feature list
    # feats = [
    #     "avg_residual",
    #     "log_count",
    #     "p_female",
    #     "p_male",
    #     "p_unknown",
    #     "gender_ratio",
    # ]
    # feats.extend([col for col in gender_stats.keys()])
    ###############################

    # features matrix
    feats = ["avg_residual", "log_count", "p_female", "p_male", "p_unknown"]
    scaler = StandardScaler().fit(pf[feats])
    F = scaler.transform(pf[feats])
    norms = np.linalg.norm(F, axis=1)
    idx_map = {pid: i for i, pid in enumerate(pf.index)}
    # user histories: indices into F and their residuals
    user_hist = {}
    for uid, grp in df.groupby("userID"):
        mask = grp["profileID"].isin(idx_map)
        pids = grp.loc[mask, "profileID"]
        idxs = [idx_map[pid] for pid in pids]
        res = grp.loc[mask, "residual"].values
        user_hist[uid] = (np.array(idxs), res)
    return pf, F, norms, idx_map, user_hist, scaler


# 5) Build content model on train_sub
pf_sub, F_sub, norms_sub, idx_map_sub, user_hist_sub, scaler_sub = build_content_model(
    train_sub
)

## 4. Content-Based Prediction Function

Now we implement the core prediction function that uses content-based similarity. The key idea is:

1. For a given user-profile pair, find profiles that the user has already rated
2. Calculate the similarity between the target profile and each rated profile using cosine similarity
3. Predict the residual as a weighted average of residuals from the k most similar profiles

This approach assumes that if a user liked a profile with certain features, they'll likely have similar reactions to profiles with similar features. We use cosine similarity in the standardized feature space to find meaningful profile relationships, regardless of scale.
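
The three steps above can be sketched on a toy example in plain Python. The two-dimensional feature vectors and residuals are invented, and `cosine` and `predict_residual` are hypothetical helpers that mirror the idea rather than the notebook's vectorized implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def predict_residual(target, history, k):
    """history is a list of (feature_vector, residual) pairs for one user."""
    scored = sorted(
        ((cosine(target, vec), res) for vec, res in history),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    if not scored:
        return 0.0  # no history: leave the baseline unchanged
    total = sum(sim for sim, _ in scored)
    if total <= 0:
        # Weights are unusable: fall back to an unweighted mean.
        return sum(res for _, res in scored) / len(scored)
    return sum(sim * res for sim, res in scored) / total


target = [1.0, 0.0]
history = [
    ([1.0, 0.0], 2.0),    # identical direction, similarity close to 1
    ([0.0, 1.0], -4.0),   # orthogonal, similarity 0
    ([-1.0, 0.0], -1.0),  # opposite direction, similarity close to -1
]

pred = predict_residual(target, history, k=2)
print(f"predicted residual with k=2: {pred:.3f}")
```

With `k=2` the orthogonal profile gets zero weight, so the prediction is driven entirely by the most similar profile. The predicted residual is then added to the baseline prediction to obtain the final rating estimate.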

In [14]:
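def predict_content(user_id, profile_id, k, F, norms, idx_map, user_hist):
    """Predict the residual via cosine-weighted average of k neighbors."""
    # Cold start: an unseen profile gets no correction to the baseline.
    if profile_id not in idx_map:
        return 0.0
    tgt = idx_map[profile_id]
    v_t = F[tgt]
    n_t = norms[tgt]
    hist = user_hist.get(user_id)
    # Cold start: an unseen user gets no correction either.
    if hist is None:
        return 0.0
    idxs, res = hist
    # Cosine similarity between the target and every profile the user rated.
    sims = (F[idxs] @ v_t) / (norms[idxs] * n_t + 1e-8)
    # Keep the k most similar rated profiles.
    top = np.argsort(-sims)[:k]
    sims_k, res_k = sims[top], res[top]
    # Similarities can be negative in the standardized space; fall back to a
    # plain mean when the weights do not sum to a positive value.
    if sims_k.sum() <= 0:
        return res_k.mean()
    return np.dot(sims_k, res_k) / sims_k.sum()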


# 6) Tune k on validation set
best_mae, best_k = np.inf, None
for k in [5, 10, 20, 50, 100]:
    # predict on val_sub: baseline + content
    preds = []
    for _, r in val_sub.iterrows():
        base = baseline_pred(r.userID, r.profileID, mu_sub, b_i_sub, b_u_sub)
        res = predict_content(
            r.userID, r.profileID, k, F_sub, norms_sub, idx_map_sub, user_hist_sub
        )
        preds.append(base + res)
    mae = mean_absolute_error(val_sub["rating"], preds)
    print(f"k={k:3d} → Validation MAE = {mae:.4f}")
    if mae < best_mae:
        best_mae, best_k = mae, k

print(f"Best k = {best_k}, Validation MAE = {best_mae:.4f}")

# 7) Retrain on full training set
mu_full, b_i_full, b_u_full = compute_baseline(train_df, λ=10)
train_df_full = train_df.copy()
train_df_full["residual"] = train_df_full.apply(
    lambda r: r.rating
    - baseline_pred(r.userID, r.profileID, mu_full, b_i_full, b_u_full),
    axis=1,
)
pf_full, F_full, norms_full, idx_map_full, user_hist_full, scaler_full = (
    build_content_model(train_df_full)
)

k=  5 → Validation MAE = 1.5707
k= 10 → Validation MAE = 1.5194
k= 20 → Validation MAE = 1.4984
k= 50 → Validation MAE = 1.4898
k=100 → Validation MAE = 1.5102
Best k = 50, Validation MAE = 1.4898



## 5. Parameter Tuning and Model Evaluation

The parameter k (number of neighbors) affects the performance of our model; across the values tried above, validation MAE ranged from 1.4898 to 1.5707:

1. Small k: Predictions are influenced by only a few highly similar profiles (more specific but potentially unstable)
2. Large k: Predictions are influenced by many profiles (more stable but potentially less relevant)

We tuned k in the previous cell by evaluating the model on our validation set using Mean Absolute Error (MAE); k = 50 gave the lowest validation MAE (1.4898). With the model retrained on the full training set, we now evaluate on the test set.
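
As a reminder of what the metric measures, here is a minimal sketch of MAE and of the selection rule in plain Python. The three ratings and predictions are invented, and `mae` is a hypothetical stand-in for `sklearn.metrics.mean_absolute_error`; the `val_mae` dictionary copies the validation scores printed by this notebook.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ratings and predictions: errors are 1.0, 1.5 and 2.0.
example = mae([7, 3, 10], [6.0, 4.5, 8.0])
print(f"example MAE = {example:.2f}")

# Validation MAE per k, as printed by the tuning loop in this notebook.
val_mae = {5: 1.5707, 10: 1.5194, 20: 1.4984, 50: 1.4898, 100: 1.5102}

# Pick the k with the lowest validation error.
best_k = min(val_mae, key=val_mae.get)
print(f"best k = {best_k}")
```

The error falls as k grows from 5 to 50 and rises again at 100, consistent with the trade-off described in the list above.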

In [15]:
# 8) Evaluate on true test set
preds_test = []
for _, r in test_df.iterrows():
    # best_k = 50
    base = baseline_pred(r.userID, r.profileID, mu_full, b_i_full, b_u_full)
    res = predict_content(
        r.userID, r.profileID, best_k, F_full, norms_full, idx_map_full, user_hist_full
    )
    preds_test.append(base + res)

test_mae = mean_absolute_error(test_df["rating"], preds_test)
print(f"Test MAE at k={best_k}: {test_mae:.4f}")

Test MAE at k=50: 1.5936
