# LLM Zero-shot Reranker Baseline on MovieLens

This notebook runs the **prompted LLM zero-shot reranker** on the MovieLens dataset using `Llama-3.2-3B-Instruct-bnb-4bit`.

**What it does:**
- Loads precomputed splits + candidate pools (history = 3, candidates = 50).
- Loads movie metadata from `movies.dat` and `item_id_map.parquet`.
- Builds 3‑item histories from `train_indexed.parquet`.
- Uses a chat‑style prompt to ask the LLM to **rank the 50 candidate movies**.
- Evaluates **HR@10** and **NDCG@10** on VAL and TEST.
- Saves per-user logs (prompt, raw model output, parsed ranking) to the JSON files `llama32_3b_val_logs.json` and `llama32_3b_test_logs.json` for inspection.
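
As a reference point for reading HR@10 and NDCG@10, the sketch below computes what a uniformly random ranking would score. It assumes one relevant item per user, a pool of 50 candidates that always contains the target, and the single-target NDCG used in this notebook. `random_baseline` is a hypothetical helper for illustration, not part of the pipeline.

```python
import math

def random_baseline(pool_size: int = 50, k: int = 10):
    """Expected HR@k and NDCG@k when the single target is placed uniformly at random."""
    # The target lands in each of the pool_size positions with equal probability.
    hr = k / pool_size
    # With one relevant item the ideal DCG is 1, so NDCG at rank r is 1 / log2(r + 1).
    ndcg = sum(1.0 / math.log2(r + 1) for r in range(1, k + 1)) / pool_size
    return hr, ndcg

hr, ndcg = random_baseline()
print(f"random HR@10={hr:.3f}  NDCG@10={ndcg:.3f}")
```

Under these assumptions a random ranker scores roughly HR@10 = 0.200 and NDCG@10 = 0.091, which gives a floor against which the LLM's numbers can be compared.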



In [13]:
# Install dependencies
!pip install -q "unsloth[colab-new]" pandas pyarrow numpy scipy tqdm



In [14]:
from google.colab import drive
drive.mount("/content/drive")

from pathlib import Path
import numpy as np
import pandas as pd
from tqdm import tqdm
import json

# ----- Paths -----
# ROOT is expected to contain:
#   - ratings.csv, movies.dat
#   - splits/ (train_indexed.parquet, val_targets_indexed_100.parquet, etc.)
#   - candidates/ (val_100.parquet, test_100.parquet)
ROOT = Path("/content/drive/MyDrive/deep learning/project 2")

SPLITS = ROOT / "splits"
CANDS = ROOT / "candidates"

# ----- Hyperparameters -----
HISTORY_LEN = 3        # number of recent items per user
TOP_K = 10             # evaluate HR@K / NDCG@K
CANDIDATE_SIZE = 50    # expected candidate pool size


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Load splits and candidate pools

In [15]:
print("Loading splits and candidate pools...")

train_idx = pd.read_parquet(SPLITS / "train_indexed.parquet")         # [uid, iid, ts]
val_idx   = pd.read_parquet(SPLITS / "val_targets_indexed_100.parquet")   # [uid, iid, ts]
test_idx  = pd.read_parquet(SPLITS / "test_targets_indexed_100.parquet")

cand_val  = pd.read_parquet(CANDS / "val_100.parquet")    # [uid, candidates(list of iids)]
cand_test = pd.read_parquet(CANDS / "test_100.parquet")

print("Example candidate lengths (VAL, first 5 rows):",
 [len(c) for c in cand_val["candidates"].head(5)])

n_users_train_raw = train_idx["uid"].nunique()
n_users_val_raw   = val_idx["uid"].nunique()
n_users_test_raw  = test_idx["uid"].nunique()

print("Raw users in TRAIN:", n_users_train_raw)
print("Raw users in VAL targets:", n_users_val_raw)
print("Raw users in TEST targets:", n_users_test_raw)


Loading splits and candidate pools...
Example candidate lengths (VAL, first 5 rows): [50, 50, 50, 50, 50]
Raw users in TRAIN: 4675
Raw users in VAL targets: 100
Raw users in TEST targets: 100
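
The printed lengths cover only the first five rows. A minimal sketch of a check over every row is shown here on a toy frame, assuming the same `[uid, candidates]` layout. `check_pool_sizes` is a hypothetical helper; in the notebook it could be called with `cand_val` and `CANDIDATE_SIZE`.

```python
import pandas as pd

def check_pool_sizes(cands_df: pd.DataFrame, expected: int) -> pd.DataFrame:
    """Return rows whose candidate list has the wrong length or contains duplicates."""
    lengths = cands_df["candidates"].apply(len)
    uniques = cands_df["candidates"].apply(lambda c: len(set(c)))
    bad = (lengths != expected) | (uniques != lengths)
    return cands_df[bad]

# Toy data: uid 2 has a short pool, uid 3 has a repeated item.
toy = pd.DataFrame({
    "uid": [1, 2, 3],
    "candidates": [[10, 11, 12], [10, 11], [10, 10, 12]],
})
bad_rows = check_pool_sizes(toy, expected=3)
print(bad_rows["uid"].tolist())
```

An empty result means every pool has the expected size and no repeated items.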


# Additional validation

In [16]:

# start with checking which users have their true target inside the candidate pool
def mark_coverage(cands_df, targets_df, tgt_col_name: str):
    df = cands_df.merge(
        targets_df[["uid", "iid"]].rename(columns={"iid": tgt_col_name}),
        on="uid",
        how="inner",
    )
    #ensure candidates always a list
    df["candidates"] = df["candidates"].apply(
        lambda x: list(x) if isinstance(x, (list, tuple, np.ndarray, pd.Series)) else []
    )
    #clean up target column
    df[tgt_col_name] = df[tgt_col_name].fillna(-1).astype(int)

    #check if target is in candidate pool
    df["target_in_pool"] = [
        int(t) in set(c) for t, c in zip(df[tgt_col_name], df["candidates"])
    ]
    return df

val_cov  = mark_coverage(cand_val,  val_idx,  "target")
test_cov = mark_coverage(cand_test, test_idx, "target")

covered_val  = val_cov[val_cov["target_in_pool"]].copy()
covered_test = test_cov[test_cov["target_in_pool"]].copy()

val_eval  = covered_val[["uid", "candidates", "target"]].reset_index(drop=True)
test_eval = covered_test[["uid", "candidates", "target"]].reset_index(drop=True)

print(f"Users with target in pool: VAL={len(val_eval)}  TEST={len(test_eval)}")

# after target_in_pool filtering
n_users_val_cov  = val_eval["uid"].nunique()
n_users_test_cov = test_eval["uid"].nunique()

print("Eval users in VAL (with target in pool):", n_users_val_cov)
print("Eval users in TEST (with target in pool):", n_users_test_cov)

val_eval.head()


Users with target in pool: VAL=100  TEST=100
Eval users in VAL (with target in pool): 100
Eval users in TEST (with target in pool): 100


   uid                                         candidates  target
0   20  [366, 353, 2029, 1578, 161, 356, 1816, 0, 180,...       0
1   29  [366, 353, 1578, 161, 356, 1816, 0, 180, 291, ...    1578
2   31  [366, 353, 2029, 1578, 161, 356, 1816, 0, 180,...    2144
3   53  [366, 353, 2029, 1578, 161, 356, 0, 180, 291, ...    1780
4   61  [366, 353, 2029, 1578, 161, 356, 1816, 0, 180,...    1050
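
The coverage filter matters because a user whose target is missing from the pool can never score a hit, whatever the reranker does. The toy example below, with made-up IDs, mirrors the merge-and-membership logic and reports the coverage rate.

```python
import pandas as pd

cands = pd.DataFrame({
    "uid": [1, 2, 3],
    "candidates": [[5, 6, 7], [5, 8, 9], [6, 7, 8]],
})
targets = pd.DataFrame({"uid": [1, 2, 3], "iid": [6, 4, 8]})

# Attach each user's target to their candidate pool.
merged = cands.merge(targets.rename(columns={"iid": "target"}), on="uid", how="inner")
# A user is "covered" when the target appears among the candidates.
merged["target_in_pool"] = [
    t in set(c) for t, c in zip(merged["target"], merged["candidates"])
]

coverage = merged["target_in_pool"].mean()
print(merged[["uid", "target", "target_in_pool"]])
print(f"coverage = {coverage:.2f}")
```

Here user 2 is uncovered, so the coverage is 2/3. In the notebook's data every VAL and TEST user is covered, so the filter removes nobody.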


# Load movie metadata

In [17]:
print("Loading movie metadata...")

# MovieLens 1M format: MovieID::Title::Genres
movies_raw = pd.read_csv(
    ROOT / "movies.dat",
    sep="::",
    engine="python",
    names=["movieId", "title", "genres"],
    encoding="ISO-8859-1",
)

item_map = pd.read_parquet(SPLITS / "item_id_map.parquet")  # [movieId, iid]

movies = movies_raw.merge(item_map, on="movieId", how="inner")
movies = movies.set_index("iid")

print("Movies with iid mapping:", movies.shape)
movies.head()


Loading movie metadata...
Movies with iid mapping: (2233, 3)


     movieId                               title                        genres
iid
0          1                    Toy Story (1995)   Animation|Children's|Comedy
1          2                      Jumanji (1995)  Adventure|Children's|Fantasy
2          3             Grumpier Old Men (1995)                Comedy|Romance
3          5  Father of the Bride Part II (1995)                        Comedy
4          6                         Heat (1995)         Action|Crime|Thriller


# Build user histories

In [18]:
print("Building user histories...")

# Sort by time, then group into full sequences
train_sorted = train_idx.sort_values(["uid", "ts"])
user_history_full = train_sorted.groupby("uid")["iid"].apply(list).to_dict()

# Return last max_len item ids from user's training history
def get_recent_history_iids(uid: int, max_len: int = HISTORY_LEN):
    seq = user_history_full.get(uid, [])
    if not seq:
        return []
    return seq[-max_len:]

# Quick sanity-check for one example user
if len(val_eval) > 0:
    example_uid = int(val_eval.loc[0, "uid"])
    print("Example uid:", example_uid)
    print("Full history iids:", user_history_full.get(example_uid, []))
    print(f"Recent {HISTORY_LEN}:", get_recent_history_iids(example_uid))
else:
    print("val_eval is empty; check splits.")


Building user histories...
Example uid: 20
Full history iids: [496, 318, 492]
Recent 3: [496, 318, 492]
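
The example user happens to have exactly three training items, so the output does not show the truncation. The toy sketch below, with made-up IDs and timestamps, illustrates the same sort, group, and slice steps for a longer history, a short history, and an unseen user.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "uid": [1, 1, 1, 1, 2],
    "iid": [40, 10, 30, 20, 99],
    "ts":  [4, 1, 3, 2, 7],
})

# Sort chronologically within each user, then collect item ids in order.
histories = (
    toy_train.sort_values(["uid", "ts"])
    .groupby("uid")["iid"]
    .apply(list)
    .to_dict()
)

def recent(uid: int, max_len: int = 3):
    # Slicing an empty or short list is safe, so no special case is needed.
    return histories.get(uid, [])[-max_len:]

print(recent(1))   # three most recent items, oldest first
print(recent(2))   # user with a single interaction
print(recent(3))   # user absent from the training data
```

The most recent item ends up last in the list, which is also the order in which the history is written into the prompt.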


# Define metrics

In [19]:
# returns 1 if target_id is in the first k ranked_ids, else 0
def hit_at_k(ranked_ids, target_id, k=TOP_K):
    return 1.0 if target_id in ranked_ids[:k] else 0.0

# NDCG@k with a single relevant item: 1 / log2(rank + 1) if the target is in the top k, else 0
def ndcg_at_k(ranked_ids, target_id, k=TOP_K):
    for rank, iid in enumerate(ranked_ids[:k], start=1):
        if iid == target_id:
            return 1.0 / np.log2(rank + 1)
    return 0.0

print("Metric helpers defined: HR@K and NDCG@K.")

Metric helpers defined: HR@K and NDCG@K.
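
A short worked example of the two metrics. With one relevant item per user the ideal DCG is 1, so NDCG@K reduces to `1 / log2(rank + 1)` when the target is inside the top K. The functions are repeated here only to keep the snippet self-contained, and the ranking is made up.

```python
import numpy as np

def hit_at_k(ranked_ids, target_id, k=10):
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def ndcg_at_k(ranked_ids, target_id, k=10):
    for rank, iid in enumerate(ranked_ids[:k], start=1):
        if iid == target_id:
            return 1.0 / np.log2(rank + 1)
    return 0.0

ranking = [7, 3, 9, 1, 4]
print(hit_at_k(ranking, 9), ndcg_at_k(ranking, 9))            # target at rank 3
print(hit_at_k(ranking, 7), ndcg_at_k(ranking, 7))            # target at rank 1
print(hit_at_k(ranking, 9, k=2), ndcg_at_k(ranking, 9, k=2))  # target outside the top 2
```

A target at rank 3 gives a hit with NDCG 1 / log2(4) = 0.5, a target at rank 1 gives NDCG 1.0, and a target outside the cutoff gives 0 on both metrics.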


# Define and load LLM

In [20]:
from unsloth import FastLanguageModel
import torch

UNSLOTH_MODEL_NAME = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"

max_seq_length = 4096
dtype = None         # let Unsloth pick (bf16/float16)
load_in_4bit = True  # 4-bit quantization for Colab GPU

print("Loading Unsloth model:", UNSLOTH_MODEL_NAME)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=UNSLOTH_MODEL_NAME,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

FastLanguageModel.for_inference(model)
device = model.device

# Make sure padding is defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

print("Model loaded on device:", device)


Loading Unsloth model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model loaded on device: cuda:0


# Build prompt messages

In [21]:
system_prompt = (
    "You are a movie recommendation assistant. "
    "You see a user's watch history and a list of candidate movies, each with an internal ID. "
    "Your job is to choose and rank the candidate movies the user is most likely to enjoy. "
    "Only use the given IDs. "
    "Always follow the requested output format exactly."
)

# Build prompt messages (system + user) for one user by providing user histories and candidates:
# history_entries: list of strings like "Title (Year) | Genres=..."
# candidates: list of dicts: {"iid": int, "title": str, "genres": str}
def build_zero_shot_messages(history_entries, candidates, k=TOP_K):
    if history_entries:
        history_block = "\n".join(f"- {h}" for h in history_entries)
    else:
        history_block = "(no history available)"

    cand_lines = []
    for c in candidates:
        genres = c.get("genres", "Unknown")
        cand_lines.append(
            f"ID={c['iid']} | Title={c['title']} | Genres={genres}"
        )
    candidates_block = "\n".join(cand_lines)

    user_content = f"""
Here is a user and their watch history.

Watch history (movies the user liked):
{history_block}

Here is a list of candidate movies. Each line has an internal ID, the movie title, and its genres.

Candidates:
{candidates_block}

Task:
From the candidate list, select the TOP {k} movies this user is most likely to watch next,
ranking them from most to least likely.

Rules:
- Only choose IDs from the candidate list.
- Do NOT invent new IDs or movies.
- Base your decision on how similar each candidate is to the history, plus your general movie knowledge.

Output format:
Return your answer as a JSON list of IDs only, in order, with no extra text.
For example: [312, 55, 1021, 87, 99]

Now return the JSON list of candidate IDs for this user.
""".strip()

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]
    return messages

# parse the span from the first "[" to the last "]" as a JSON list; return [] on failure
def extract_json_list(text: str):
    first = text.find("[")
    last = text.rfind("]")
    if first == -1 or last == -1 or last <= first:
        return []
    snippet = text[first:last+1]
    try:
        parsed = json.loads(snippet)
        if isinstance(parsed, list):
            return parsed
    except Exception:
        return []
    return []
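
Because `extract_json_list` takes everything from the first `[` to the last `]`, a reply containing two lists, or a stray bracket in trailing prose, fails to parse and yields no ranking. One possible alternative, sketched here and not used for the results in this notebook, scans for the first flat bracketed span that is valid JSON. `extract_first_json_list` is a hypothetical name.

```python
import json
import re

def extract_first_json_list(text: str):
    """Parse the first flat [...] span that is valid JSON; return [] if none is."""
    # [^\[\]]* matches any run without brackets, so nested lists are not supported.
    for match in re.finditer(r"\[[^\[\]]*\]", text):
        try:
            parsed = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, list):
            return parsed
    return []

print(extract_first_json_list("Sure! [312, 55, 87]"))
print(extract_first_json_list("[1, 2] or maybe [3, 4]"))
print(extract_first_json_list("no list here"))
```

The second call returns `[1, 2]`, whereas the first-to-last-bracket approach would try to parse the whole string and return an empty list.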


# Prompting the LLM

In [22]:
# Run the LLM for a single user and return:
#   ranked_ids:  list[int], the LLM's ranking restricted to candidate_iids
#   raw_text:    the raw text response (for debugging)
#   prompt_text: the prompt sent to the model (for logging)

def llm_rerank(history_iids, candidate_iids, k=TOP_K, max_new_tokens=256, temperature=0.0):
    # History: titles + genres
    history_entries = []
    for iid in history_iids:
        if iid in movies.index:
            row = movies.loc[iid]
            title = str(row["title"])
            genres = str(row.get("genres", "Unknown"))
            # e.g. "Antz (1998) | Genres=Animation|Children's|Comedy"
            history_entries.append(f"{title} | Genres={genres}")

    # Candidates: IDs + titles + genres
    cand_list = []
    for iid in candidate_iids:
        iid_int = int(iid)
        if iid in movies.index:
            row = movies.loc[iid]
            title = str(row["title"])
            genres = str(row.get("genres", "Unknown"))
            cand_list.append({
                "iid": iid_int,
                "title": title,
                "genres": genres,
            })
        else:
            cand_list.append({
                "iid": iid_int,
                "title": f"Unknown movie (iid={iid_int})",
                "genres": "Unknown",
            })

    # Build messages (this includes genres in both history + candidates)
    messages = build_zero_shot_messages(history_entries, cand_list, k=k)

    # Build model input using chat template (Unsloth Llama)
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=False,  # greedy decoding for a deterministic ranking; temperature has no effect here
        )

    gen_tokens = outputs[0, input_ids.shape[-1]:]
    text = tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()

    # parse JSON list of IDs
    raw_list = extract_json_list(text)   # returns [] on failure
    cand_set = set(int(x) for x in candidate_iids)

    ranked_ids = []
    for x in raw_list:
        try:
            # handle "362" or 362
            iid_int = int(x)
        except Exception:
            continue
        # keep only candidates
        if iid_int in cand_set:
            ranked_ids.append(iid_int)


    # For logging: the templated prompt, decoded with special tokens stripped
    prompt_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)

    return ranked_ids, text, prompt_text
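
The filtering in `llm_rerank` keeps an ID every time the model repeats it, and a repeat occupies one of the top-K slots. A possible post-processing step, sketched below and not applied to the results in this notebook, drops repeats while preserving order. The optional `pad` flag appends unranked candidates in their original pool order, which is an assumption about a sensible fallback and changes what the metric measures. `clean_ranking` is a hypothetical helper.

```python
def clean_ranking(raw_ids, candidate_iids, pad: bool = False):
    """Keep valid candidate IDs in first-seen order, without duplicates."""
    cand_list = [int(c) for c in candidate_iids]
    cand_set = set(cand_list)
    seen = set()
    ranked = []
    for x in raw_ids:
        try:
            iid = int(x)          # accepts 362 as well as "362"
        except (TypeError, ValueError):
            continue
        if iid in cand_set and iid not in seen:
            seen.add(iid)
            ranked.append(iid)
    if pad:
        # Assumption: unranked candidates keep their original pool order.
        ranked.extend(c for c in cand_list if c not in seen)
    return ranked

raw = [5, "7", 5, 999, "abc", 3]
print(clean_ranking(raw, [3, 5, 7, 9]))
print(clean_ranking(raw, [3, 5, 7, 9], pad=True))
```

Without padding the toy input yields `[5, 7, 3]`; with padding the unranked candidate 9 is appended at the end.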


# Evaluating LLM performance

In [23]:
# Evaluate the zero-shot LLM reranker on a given split.
# df_eval columns: [uid, candidates (list[iid]), target (int)]
def eval_llm(df_eval, split_name: str, max_users=None, log_path=None):
    hits, ndcgs = [], []
    logs = []

    n_total = len(df_eval)
    n_users = n_total if max_users is None else min(n_total, max_users)

    print(f"\nEvaluating LLM on {split_name} for {n_users} users...")

    processed = 0
    # using tqdm for a progress bar over users
    for idx, row in tqdm(df_eval.iloc[:n_users].iterrows(),
                         total=n_users, desc=f"{split_name} users"):
        uid = int(row["uid"])
        candidate_iids = list(row["candidates"])
        target = int(row["target"])

        # grab the user's recent history (last HISTORY_LEN items from the train set)
        history_iids = get_recent_history_iids(uid, max_len=HISTORY_LEN)

        # skip if user has no candidates
        if not candidate_iids:
            continue

        # call the LLM to rerank the candidate list given the user history
        ranked_ids, raw_text, prompt_text = llm_rerank(
            history_iids=history_iids,
            candidate_iids=candidate_iids,
            k=TOP_K,
        )

        if not ranked_ids:
            continue

        # keep only IDs that were in the candidate list
        cand_set = set(int(x) for x in candidate_iids)
        ranked_ids = [int(iid) for iid in ranked_ids if iid in cand_set]

        # skip if empty
        if not ranked_ids:
            continue

        # compute metrics given the ranked list and the true target for each user
        h = hit_at_k(ranked_ids, target, k=TOP_K)
        d = ndcg_at_k(ranked_ids, target, k=TOP_K)
        hits.append(h)
        ndcgs.append(d)

        # store everything for our log files
        logs.append({
            "row_index": int(idx),
            "uid": uid,
            "target": target,
            "history_iids": [int(i) for i in history_iids],
            "candidate_iids": [int(i) for i in candidate_iids],
            "ranked_ids": ranked_ids,
            "raw_output": raw_text,
            "prompt_text": prompt_text,
        })

        #stats to keep track of progress
        processed += 1
        if processed % 20 == 0:
            print(
                f"[{split_name}] Processed {processed}/{n_users} users → "
                f"HR@{TOP_K}={np.mean(hits):.3f}, NDCG@{TOP_K}={np.mean(ndcgs):.3f}"
            )

    # average metrics over scored users only (users skipped above are excluded)
    hr = float(np.mean(hits)) if hits else 0.0
    ndcg = float(np.mean(ndcgs)) if ndcgs else 0.0
    print(f"[{split_name}] Zero-shot LLM HR@{TOP_K}={hr:.3f}  NDCG@{TOP_K}={ndcg:.3f}")

    # save logs
    if log_path is not None:
        log_path = Path(log_path)
        with open(log_path, "w", encoding="utf-8") as f:
            json.dump(logs, f, indent=2)
        print(f"  Saved {len(logs)} per-user logs to: {log_path}")

    return hr, ndcg, logs
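
Note that `eval_llm` skips any user whose reply yields no valid candidate ID, so the averages are taken over scored users only. In the run recorded in this notebook all 100 users per split were scored, so the two conventions coincide. The sketch below, on toy numbers, shows how they diverge when some replies fail to parse; `summarize` is a hypothetical helper.

```python
import numpy as np

def summarize(per_user_hits, n_users_attempted: int):
    """Compare averaging over scored users with counting skipped users as misses."""
    n_scored = len(per_user_hits)
    scored_only = float(np.mean(per_user_hits)) if n_scored else 0.0
    # Skipped users add 0 to the numerator but still count in the denominator.
    all_users = (
        float(np.sum(per_user_hits)) / n_users_attempted if n_users_attempted else 0.0
    )
    return {"scored": n_scored, "hr_scored_only": scored_only, "hr_all_users": all_users}

# Toy numbers: 10 users attempted, 8 produced a parseable ranking, 4 of those were hits.
result = summarize([1, 1, 1, 1, 0, 0, 0, 0], n_users_attempted=10)
print(result)
```

Here the scored-only hit rate is 0.5 while counting the two failures as misses gives 0.4. Reporting the number of scored users alongside the metric makes the difference visible.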


# Run evaluation

In [24]:
# Run evaluation on VAL and TEST splits

# define path to save log files
val_log_path  = ROOT / "llama32_3b_val_logs.json"
test_log_path = ROOT / "llama32_3b_test_logs.json"

# evaluate the LLM reranker on the validation set
val_hr, val_ndcg, val_logs = eval_llm(
    val_eval,
    "VAL",
    max_users=None,
    log_path=val_log_path,
)

# evaluate the LLM reranker on the test set
test_hr, test_ndcg, test_logs = eval_llm(
    test_eval,
    "TEST",
    max_users=None,
    log_path=test_log_path,
)

# print the final metrics summary
print("\n=== SUMMARY ===")
print(f"VAL  HR@{TOP_K}={val_hr:.3f}  NDCG@{TOP_K}={val_ndcg:.3f}")
print(f"TEST HR@{TOP_K}={test_hr:.3f}  NDCG@{TOP_K}={test_ndcg:.3f}")

# print one example from our log files
if val_logs:
    print("\nExample VAL log entry:")
    from pprint import pprint
    pprint(val_logs[0])



Evaluating LLM on VAL for 100 users...


VAL users:  20%|██        | 20/100 [00:47<03:11,  2.39s/it]

[VAL] Processed 20/100 users → HR@10=0.350, NDCG@10=0.221


VAL users:  40%|████      | 40/100 [01:33<02:23,  2.40s/it]

[VAL] Processed 40/100 users → HR@10=0.275, NDCG@10=0.157


VAL users:  60%|██████    | 60/100 [02:19<01:27,  2.19s/it]

[VAL] Processed 60/100 users → HR@10=0.350, NDCG@10=0.173


VAL users:  80%|████████  | 80/100 [03:10<00:47,  2.38s/it]

[VAL] Processed 80/100 users → HR@10=0.325, NDCG@10=0.169


VAL users: 100%|██████████| 100/100 [03:56<00:00,  2.36s/it]


[VAL] Processed 100/100 users → HR@10=0.340, NDCG@10=0.169
[VAL] Zero-shot LLM HR@10=0.340  NDCG@10=0.169
  Saved 100 per-user logs to: /content/drive/MyDrive/deep learning/project 2/llama32_3b_val_logs.json

Evaluating LLM on TEST for 100 users...


TEST users:  20%|██        | 20/100 [00:46<02:59,  2.25s/it]

[TEST] Processed 20/100 users → HR@10=0.100, NDCG@10=0.041


TEST users:  40%|████      | 40/100 [01:31<02:11,  2.19s/it]

[TEST] Processed 40/100 users → HR@10=0.200, NDCG@10=0.093


TEST users:  60%|██████    | 60/100 [02:17<01:38,  2.47s/it]

[TEST] Processed 60/100 users → HR@10=0.317, NDCG@10=0.140


TEST users:  80%|████████  | 80/100 [03:04<00:46,  2.32s/it]

[TEST] Processed 80/100 users → HR@10=0.300, NDCG@10=0.148


TEST users: 100%|██████████| 100/100 [03:49<00:00,  2.30s/it]

[TEST] Processed 100/100 users → HR@10=0.280, NDCG@10=0.136
[TEST] Zero-shot LLM HR@10=0.280  NDCG@10=0.136
  Saved 100 per-user logs to: /content/drive/MyDrive/deep learning/project 2/llama32_3b_test_logs.json

=== SUMMARY ===
VAL  HR@10=0.340  NDCG@10=0.169
TEST HR@10=0.280  NDCG@10=0.136

Example VAL log entry:
{'candidate_iids': [366,
                    353,
                    2029,
                    1578,
                    161,
                    356,
                    1816,
                    0,
                    180,
                    291,
                    730,
                    665,
                    504,
                    1364,
                    1579,
                    2199,
                    2144,
                    735,
                    499,
                    76,
                    3892,
                    1698,
                    3985,
                    3896,
                    2539,
                    3659,
                    2546


