# AmazonBooks – NCF Baseline

This notebook trains a Neural Collaborative Filtering (NCF) baseline on the training interactions of a 100-user subset of the AmazonBooks dataset and evaluates it on those users' candidate pools, to be compared later with LLM-based recommenders.

---

## 1. Setup and load data

In this section we:

1. Set up paths to the `splits/` and `candidates_subset100/` folders.
2. Detect the device (CPU / GPU).
3. Load:
  - `train_indexed.parquet`  (implicit feedback positives)
  - `val_targets_indexed.parquet` / `test_targets_indexed.parquet` (ground-truth items)
  - `val.parquet` / `test.parquet` (candidate pools for the 100-user subset)
4. Run basic sanity checks on user and item counts and on the candidate pools.


In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm import tqdm

import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn

BASE   = Path(r"C:\Users\carlk\OneDrive\Documents\uoft\ECE1508H F\Project")
SPLITS = BASE / "splits"
CANDS  = BASE / "candidates_subset100"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

train_idx = pd.read_parquet(SPLITS / "train_indexed.parquet")          # [uid, iid, ts]
val_tgt   = pd.read_parquet(CANDS  / "val_targets_indexed.parquet")    # [uid, val_item(iid), ts_val]
test_tgt  = pd.read_parquet(CANDS  / "test_targets_indexed.parquet")   # [uid, test_item(iid), ts_test]

cand_val  = pd.read_parquet(CANDS  / "val.parquet")                    # [uid, candidates]
cand_test = pd.read_parquet(CANDS  / "test.parquet")                   # [uid, candidates]

print("Loaded train_idx:", train_idx.shape)
print("Loaded val_tgt  :", val_tgt.shape)
print("Loaded test_tgt :", test_tgt.shape)
print("Loaded cand_val :", cand_val.shape)
print("Loaded cand_test:", cand_test.shape)

Using device: cpu
Loaded train_idx: (60536, 3)
Loaded val_tgt  : (100, 4)
Loaded test_tgt : (100, 4)
Loaded cand_val : (100, 2)
Loaded cand_test: (100, 2)


## 2. Inspect the 100-user subset and candidate pools

Here I restrict the training data to the 100 users that appear in the validation / test splits and run sanity checks:

1. Number of users in train / val / test for the subset.
2. Distribution of train positives per user.
3. Number of candidate rows per user.
4. Verify that validation and test use the same candidate pool.
5. Derive `num_users` and `num_items` for the NCF model.
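
Since `num_users` and `num_items` are derived from the largest IDs in the training data, any candidate or target ID at or above those bounds would index outside the embedding tables. A minimal sketch of such a bounds check follows; the helper name `ids_within_bounds` and the toy data are hypothetical and not part of the notebook.

```python
def ids_within_bounds(pools, targets, num_items):
    """Return True if every candidate and target item ID is a valid embedding index.

    pools     : dict mapping uid -> list of candidate item IDs
    targets   : dict mapping uid -> ground-truth item ID
    num_items : size of the item embedding table (max train iid + 1)
    """
    all_ids = [i for cands in pools.values() for i in cands]
    all_ids.extend(targets.values())
    return all(0 <= i < num_items for i in all_ids)


# Toy data, purely illustrative (not taken from AmazonBooks).
toy_pools = {0: [1, 4, 7], 1: [2, 3, 9]}
toy_targets = {0: 4, 1: 9}

ok_result = ids_within_bounds(toy_pools, toy_targets, num_items=10)
bad_result = ids_within_bounds(toy_pools, toy_targets, num_items=9)  # ID 9 is out of range
print(ok_result, bad_result)
```

The same idea would apply to user IDs against `num_users`.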

In [2]:
subset_users = sorted(val_tgt["uid"].unique())
train_sub = train_idx[train_idx["uid"].isin(subset_users)].copy()

print("\n[1] User counts (subset)")
print("train users in subset:", train_sub["uid"].nunique())
print("val users           :", val_tgt["uid"].nunique())
print("test users          :", test_tgt["uid"].nunique())

train_counts = train_sub.groupby("uid")["iid"].nunique()
print("\n[2] Train positives per user (subset)")
print(train_counts.describe())
print("Min train positives per user:", int(train_counts.min()))
print("Max train positives per user:", int(train_counts.max()))

print("\n[3] Candidate rows per user")
print("val candidate users :", cand_val["uid"].nunique())
print("test candidate users:", cand_test["uid"].nunique())

cand_val_sorted  = cand_val.sort_values("uid").reset_index(drop=True)
cand_test_sorted = cand_test.sort_values("uid").reset_index(drop=True)

same_uid_order = cand_val_sorted["uid"].tolist() == cand_test_sorted["uid"].tolist()
val_pools  = cand_val_sorted["candidates"].tolist()
test_pools = cand_test_sorted["candidates"].tolist()

same_pools = all(np.array_equal(v, t) for v, t in zip(val_pools, test_pools))

print("\n[4] Same candidate pool for val & test?")
print("same uid order:", same_uid_order)
print("same pools   :", same_pools)

num_users = int(train_idx["uid"].max()) + 1
num_items = int(train_idx["iid"].max()) + 1

print("\nnum_users =", num_users)
print("num_items =", num_items)


[1] User counts (subset)
train users in subset: 100
val users           : 100
test users          : 100

[2] Train positives per user (subset)
count    100.000000
mean       4.090000
std        0.911154
min        3.000000
25%        3.000000
50%        4.000000
75%        5.000000
max        5.000000
Name: iid, dtype: float64
Min train positives per user: 3
Max train positives per user: 5

[3] Candidate rows per user
val candidate users : 100
test candidate users: 100

[4] Same candidate pool for val & test?
same uid order: True
same pools   : True

num_users = 14064
num_items = 18782


## 3. Define the training dataset (implicit feedback with negatives)

I define an `NCFDataset` that:

1. Takes the positive interactions `df[["uid", "iid"]]` for the 100-user subset.
2. Pre-computes, for each user, the set of items they have interacted with.
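3. For every positive (user, item) pair, samples a fixed number of negative items
  that the user has not interacted with. Negatives are drawn uniformly from all
  `num_items` item IDs and are re-sampled each time the pair is accessed.

Each sample is a triple of tensors of length `1 + num_neg`, which the `DataLoader` stacks into batches:
1. user IDs (the same user repeated)
2. item IDs (one positive + several negatives)
3. binary labels (1 for positive, 0 for negatives)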
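
The sampling step can be illustrated in isolation with the standard library. This is a sketch of rejection sampling under assumed toy inputs; `sample_negatives`, the seed, and the item IDs are hypothetical and only mirror the idea described above.

```python
import random


def sample_negatives(pos_items, num_items, num_neg, rng):
    """Draw num_neg item IDs uniformly from [0, num_items) that are not in pos_items.

    Sampling is with replacement, so the same negative may appear twice.
    """
    if len(pos_items) >= num_items:
        raise ValueError("user has interacted with every item; no negatives exist")
    negatives = []
    while len(negatives) < num_neg:
        j = rng.randrange(num_items)
        if j not in pos_items:  # rejection step: redraw when a positive is hit
            negatives.append(j)
    return negatives


rng = random.Random(0)   # seed chosen arbitrarily for the illustration
pos = {3, 5, 8}          # toy set of items the user interacted with
negs = sample_negatives(pos, num_items=20, num_neg=4, rng=rng)

# One training sample: the positive item followed by its negatives.
items = [3] + negs
labels = [1.0] + [0.0] * len(negs)
print(items, labels)
```

Rejection sampling is cheap here because each user has only a handful of positives relative to the catalogue size, so redraws are rare.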


In [3]:
class NCFDataset(Dataset):
    def __init__(self, df, num_items, num_neg=4):

        self.users     = df["uid"].values.astype(np.int64)
        self.items     = df["iid"].values.astype(np.int64)
        self.num_items = int(num_items)
        self.num_neg   = int(num_neg)

        user_pos = {}
        for u, i in zip(self.users, self.items):
            user_pos.setdefault(int(u), set()).add(int(i))
        self.user_pos = user_pos

    def __len__(self):
        return len(self.users)

    def __getitem__(self, idx):
        u = int(self.users[idx])
        i = int(self.items[idx])

        user_list  = [u]
        item_list  = [i]
        label_list = [1.0]

        pos_items = self.user_pos[u]
        for _ in range(self.num_neg):
            j = np.random.randint(0, self.num_items)

            while j in pos_items:
                j = np.random.randint(0, self.num_items)
            user_list.append(u)
            item_list.append(j)
            label_list.append(0.0)

        return (
            torch.tensor(user_list,  dtype=torch.long),
            torch.tensor(item_list,  dtype=torch.long),
            torch.tensor(label_list, dtype=torch.float32),
        )


## 4. Define the NCF model

I implement a simple Neural Collaborative Filtering (NCF) model, using only the MLP branch (no separate GMF branch):

1. Learnable user and item embeddings of size `emb_dim`.
2. Concatenate user and item embeddings.
3. Pass through a small MLP (`mlp_dims`) with ReLU activations.
4. Final linear layer + sigmoid to output a relevance score in `(0, 1)`.

This is the baseline recommender that will be compared against LLM-based methods.
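
To get a feel for the model size, the parameter count implied by this layout can be worked out by hand. The sketch below assumes the default `emb_dim=64` and `mlp_dims=(128, 64, 32)` and plugs in the `num_users` and `num_items` values printed in Section 2; `count_ncf_params` is a hypothetical helper, not part of the notebook.

```python
def count_ncf_params(num_users, num_items, emb_dim=64, mlp_dims=(128, 64, 32)):
    """Count trainable parameters for an embedding + MLP layout.

    Returns (embedding_params, mlp_and_output_params).
    """
    emb_params = (num_users + num_items) * emb_dim
    dense_params = 0
    input_dim = 2 * emb_dim               # concatenated user and item embeddings
    for h in mlp_dims:
        dense_params += input_dim * h + h  # weights + biases of one linear layer
        input_dim = h
    dense_params += input_dim * 1 + 1      # final linear layer producing one score
    return emb_params, dense_params


emb_params, dense_params = count_ncf_params(num_users=14064, num_items=18782)
print("embedding parameters   :", emb_params)
print("MLP + output parameters:", dense_params)
```

Nearly all parameters sit in the two embedding tables. Embedding rows that never appear in a training batch receive no gradient, so they stay at their random initial values.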


In [4]:
class NCF(nn.Module):
    def __init__(self, num_users, num_items,
                 emb_dim=64, mlp_dims=(128, 64, 32)):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, emb_dim)
        self.item_emb = nn.Embedding(num_items, emb_dim)

        layers = []
        input_dim = emb_dim * 2
        for h in mlp_dims:
            layers.append(nn.Linear(input_dim, h))
            layers.append(nn.ReLU())
            input_dim = h
        self.mlp = nn.Sequential(*layers)

        self.out = nn.Linear(input_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, user, item):
        u = self.user_emb(user)
        v = self.item_emb(item)
        x = torch.cat([u, v], dim=-1)
        x = self.mlp(x)
        x = self.out(x)
        x = self.sigmoid(x)
        return x.squeeze(-1)

## 5. Train the NCF baseline

I now:

1. Build the `NCFDataset` and PyTorch `DataLoader` using the 100-user subset.
2. Instantiate the NCF model and an Adam optimizer.
3. Train for a small number of epochs using binary cross-entropy loss
  on positive + sampled negative items.

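The printed epoch losses give a quick sanity check that the loss is decreasing. The subset has only about 409 positive rows (100 users with 4.09 positives on average), so with `BATCH_SIZE = 512` each epoch is a single batch and the whole run performs just five optimizer steps.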
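
Two reference values help in reading the binary cross-entropy numbers. The sketch below computes the loss of a constant predictor under the 1:4 positive-to-negative ratio used here; the helper names are hypothetical, and the calculation ignores everything except the label ratio.

```python
import math


def bce(p, y):
    """Binary cross-entropy of predicted probability p against label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def constant_predictor_loss(p, pos_frac):
    """Mean BCE when every example gets the same prediction p."""
    return pos_frac * bce(p, 1.0) + (1 - pos_frac) * bce(p, 0.0)


pos_frac = 1 / (1 + 4)  # one positive per four sampled negatives

loss_at_half = constant_predictor_loss(0.5, pos_frac)            # equals ln 2
loss_at_base_rate = constant_predictor_loss(pos_frac, pos_frac)  # predicts 0.2 everywhere

print(f"always predict 0.5: {loss_at_half:.4f}")
print(f"always predict 0.2: {loss_at_base_rate:.4f}")
```

A loss near ln 2 (about 0.693) corresponds to outputs close to 0.5, and a model that has learned only the label ratio would sit near 0.50. A loss between those two values therefore says little about ranking quality on its own.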


In [5]:
BATCH_SIZE   = 512
NUM_EPOCHS   = 5
NEG_PER_POS  = 4
LR           = 1e-3

# Build dataset + dataloader using the 100-user subset
train_ds = NCFDataset(
    train_sub[["uid", "iid"]],
    num_items=num_items,
    num_neg=NEG_PER_POS,
)
train_loader = DataLoader(
    train_ds,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=0,
)

model = NCF(num_users=num_users, num_items=num_items).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
criterion = nn.BCELoss()

for epoch in range(1, NUM_EPOCHS + 1):
    model.train()
    total_loss = 0.0

    for users, items, labels in tqdm(train_loader, desc=f"Epoch {epoch}"):
        # users/items: [batch, 1 + num_neg]
        users  = users.view(-1).to(device)
        items  = items.view(-1).to(device)
        labels = labels.view(-1).to(device)

        optimizer.zero_grad()
        preds = model(users, items)
        loss  = criterion(preds, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch}: loss = {avg_loss:.4f}")

Epoch 1: 100%|██████████| 1/1 [00:00<00:00, 13.05it/s]
Epoch 1: loss = 0.6728
Epoch 2: 100%|██████████| 1/1 [00:00<00:00, 24.85it/s]
Epoch 2: loss = 0.6614
Epoch 3: 100%|██████████| 1/1 [00:00<00:00, 32.17it/s]
Epoch 3: loss = 0.6513
Epoch 4: 100%|██████████| 1/1 [00:00<00:00, 38.09it/s]
Epoch 4: loss = 0.6414
Epoch 5: 100%|██████████| 1/1 [00:00<00:00, 34.93it/s]
Epoch 5: loss = 0.6326


## 6. Evaluation helper: HitRate / NDCG / Precision / F1@K

I define a reusable evaluation function that:

1. Joins ground-truth targets with candidate pools.
2. Scores all candidate items for each user with the trained NCF model.
3. Ranks candidates and computes:
  - HitRate@K
  - NDCG@K
  - Precision@K
  - F1@K

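Users whose ground-truth item is not in the candidate pool are skipped, so all metrics are averaged over the evaluated users only. Each user has exactly one ground-truth item, so Recall@K equals HitRate@K.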
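
The per-user metrics can be checked on a small hand-made ranking. This is a sketch for the single-target case only; `single_target_metrics`, `f1_from_means`, and the toy ranking are hypothetical names and data used for illustration.

```python
import math


def single_target_metrics(ranked_items, target, k):
    """Return (hit, ndcg, precision) at cutoff k for one user with one relevant item."""
    topk = ranked_items[:k]
    hit = int(target in topk)
    # With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
    ndcg = 1.0 / math.log2(topk.index(target) + 2) if hit else 0.0
    precision = hit / k
    return hit, ndcg, precision


def f1_from_means(precision, recall):
    """Harmonic mean of averaged precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


ranking = [42, 7, 13, 99]  # toy item IDs, best score first
hit, ndcg, prec = single_target_metrics(ranking, target=13, k=3)   # target at rank 3
miss = single_target_metrics(ranking, target=99, k=3)              # target below the cutoff

# With recall = HR and precision = HR / K, F1 reduces to 2 * HR / (K + 1).
f1_example = f1_from_means(precision=0.13 / 10, recall=0.13)
print(hit, ndcg, prec, miss, round(f1_example, 4))
```

In this setting Precision@K and F1@K are fixed multiples of HitRate@K, so they carry no information beyond HR@K; NDCG@K is the only metric here that is sensitive to the position of the hit.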


In [6]:
def eval_split(model, cand_df, tgt_df, topk=10, max_users=None):

    model.eval()

    tgt = tgt_df[["uid", "iid"]].rename(columns={"iid": "target_iid"})
    df  = cand_df.merge(tgt, on="uid", how="inner")

    if max_users is not None:
        df = df.iloc[:max_users]

    hits, ndcgs, precs = [], [], []

    with torch.no_grad():
        for _, row in tqdm(df.iterrows(), total=len(df), desc="Eval users"):
            uid    = int(row.uid)
            cands  = list(row.candidates)
            target = int(row.target_iid)

            if target not in cands:
                continue

            items = torch.tensor(cands, device=device, dtype=torch.long)
            users = torch.full_like(items, uid, device=device)

            scores      = model(users, items).cpu().numpy()
            ranking_idx = np.argsort(-scores)
            topk_items  = [cands[i] for i in ranking_idx[:topk]]

            hit = int(target in topk_items)
            hits.append(hit)

            if hit:
                rank = topk_items.index(target) + 1
                ndcgs.append(1.0 / np.log2(rank + 1))
            else:
                ndcgs.append(0.0)

            precs.append(hit / topk)

    if len(hits) == 0:
        print("No users with target inside candidate pool!")
        return {"HR": 0.0, "NDCG": 0.0, "Precision": 0.0, "F1": 0.0}

    hr    = float(np.mean(hits))
    ndcg  = float(np.mean(ndcgs))
    prec  = float(np.mean(precs))
    recall = hr
    f1    = 2 * prec * recall / (prec + recall) if (prec + recall) > 0 else 0.0

    print(f"Evaluated users: {len(hits)}")
    print(f"HR@{topk}:       {hr:.4f}")
    print(f"NDCG@{topk}:     {ndcg:.4f}")
    print(f"Precision@{topk}: {prec:.4f}")
    print(f"F1@{topk}:       {f1:.4f}")

    return {"HR": hr, "NDCG": ndcg, "Precision": prec, "F1": f1}

## 7. Evaluate on validation and test candidate pools

Finally, I evaluate the trained NCF model on:

1. The validation candidate pools for the 100-user subset.
2. The test candidate pools for the same users.

I report the top-10 ranking metrics (HR@10, NDCG@10, Precision@10, F1@10).
These results form the NCF baseline that will be compared directly
against the LLM-based recommendation approach.


In [7]:
print("\nValidation")
val_metrics = eval_split(model, cand_val, val_tgt, topk=10)

print("\nTest")
test_metrics = eval_split(model, cand_test, test_tgt, topk=10)

print("\nVal metrics :", val_metrics)
print("Test metrics:", test_metrics)


Validation


Eval users: 100%|██████████| 100/100 [00:00<00:00, 2825.68it/s]


Evaluated users: 100
HR@10:       0.1300
NDCG@10:     0.0585
Precision@10: 0.0130
F1@10:       0.0236

Test


Eval users: 100%|██████████| 100/100 [00:00<00:00, 2812.99it/s]

Evaluated users: 100
HR@10:       0.2300
NDCG@10:     0.1072
Precision@10: 0.0230
F1@10:       0.0418

Val metrics : {'HR': 0.13, 'NDCG': 0.058466059117301475, 'Precision': 0.013000000000000001, 'F1': 0.02363636363636364}
Test metrics: {'HR': 0.23, 'NDCG': 0.10723639114963507, 'Precision': 0.023000000000000003, 'F1': 0.04181818181818183}
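
With only 100 evaluated users, the hit rates above carry noticeable sampling noise. As a rough guide, the sketch below computes a binomial standard error for the two reported HR@10 values; it assumes users are independent draws, and `binomial_standard_error` is a hypothetical helper, not part of the notebook.

```python
import math


def binomial_standard_error(rate, n):
    """Standard error of a proportion estimated from n independent 0/1 outcomes."""
    return math.sqrt(rate * (1 - rate) / n)


n_users = 100
se_val = binomial_standard_error(0.13, n_users)   # validation HR@10
se_test = binomial_standard_error(0.23, n_users)  # test HR@10

print(f"val  HR@10 = 0.13 +/- {se_val:.3f}")
print(f"test HR@10 = 0.23 +/- {se_test:.3f}")
```

Under this approximation the standard errors are roughly 3 to 4 percentage points, so differences of a few points between recommenders on this subset may fall within noise.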



