# Evaluation Pipeline Walkthrough for Next-Location Prediction

This notebook provides a **complete, self-contained walkthrough** of the evaluation pipeline used to test the next-location prediction model on the test set. It walks through each step of the evaluation process, from loading data and model to computing all metrics and displaying results.

## Purpose

This notebook replicates the exact evaluation logic from the training script (`train_model.py` and `trainer_v3.py`), specifically the test set evaluation phase. It demonstrates:

1. **Data Loading**: How test data is loaded and prepared
2. **Model Architecture**: The History-Centric model structure and forward pass
3. **Evaluation Loop**: Batch-by-batch inference on the test set
4. **Metrics Calculation**: Computing Acc@k, MRR, NDCG, and F1 scores
5. **Results Display**: Final performance summary

## Key Components

- **Dataset**: GeoLife trajectory data with location sequences, user IDs, temporal features
- **Model**: HistoryCentricModel that combines history-based scoring with learned patterns
- **Metrics**: Top-k accuracy (k=1,3,5,10), Mean Reciprocal Rank, NDCG, and F1 score

Let's begin!

## 1. Setup and Imports

First, we import all necessary libraries. This notebook is self-contained and doesn't import any project-specific modules; all code is included inline.

In [None]:
import pickle
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.metrics import f1_score
from tqdm.notebook import tqdm
import math
import warnings
warnings.filterwarnings('ignore')

# Check device availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

## 2. Configuration

Define all configuration parameters needed for evaluation. These match the settings used during training.

In [None]:
# Paths
DATA_DIR = "data/geolife"
TEST_FILE = "geolife_transformer_7_test.pk"
MODEL_PATH = "trained_models/best_model.pt"  # or use a specific run's checkpoint

# Data configuration
NUM_LOCATIONS = 1187  # 1186 max + 1 for padding (0)
NUM_USERS = 46  # 45 max + 1 for padding (0)
NUM_WEEKDAYS = 7
MAX_SEQ_LEN = 60
BATCH_SIZE = 96

# Model configuration (used for initialization)
LOC_EMB_DIM = 56
USER_EMB_DIM = 12
D_MODEL = 80

print("Configuration loaded:")
print(f"  Data: {DATA_DIR}/{TEST_FILE}")
print(f"  Model: {MODEL_PATH}")
print(f"  Num locations: {NUM_LOCATIONS}")
print(f"  Num users: {NUM_USERS}")
print(f"  Max sequence length: {MAX_SEQ_LEN}")
print(f"  Batch size: {BATCH_SIZE}")

## 3. Dataset Class

Define the `GeoLifeDataset` class that loads and processes trajectory sequences. Each sample contains:

- **loc_seq**: Sequence of location IDs visited
- **user_seq**: User ID for each visit
- **weekday_seq**: Day of week for each visit
- **start_min_seq**: Start time in minutes from midnight
- **dur_seq**: Duration spent at each location
- **diff_seq**: Time gap indicator between visits
- **target**: The next location to predict (ground truth)

The dataset handles variable-length sequences and truncates sequences longer than MAX_SEQ_LEN.
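As an illustration of the record layout described above, the sketch below builds one toy sample and applies keep-the-most-recent truncation to it. The dictionary keys follow those read by the dataset class in this notebook; every value is invented, and `truncate_recent` is a hypothetical helper used only here.

```python
# Toy record; keys follow the preprocessed pickle format, values are made up.
sample = {
    "X": [12, 7, 12, 31, 7],                     # visited location IDs
    "user_X": [3, 3, 3, 3, 3],                   # user ID repeated per visit
    "weekday_X": [0, 0, 1, 1, 2],                # day of week, 0-6
    "start_min_X": [480, 1020, 495, 780, 1035],  # minutes from midnight
    "dur_X": [510.0, 600.0, 240.0, 200.0, 580.0],  # duration at each location
    "diff": [2, 2, 1, 1, 0],                     # time gap indicator
    "Y": 12,                                     # next location (ground truth)
}


def truncate_recent(seq, max_seq_len):
    """Keep only the last max_seq_len items (hypothetical helper)."""
    return seq[-max_seq_len:] if len(seq) > max_seq_len else seq


sequence_keys = ["X", "user_X", "weekday_X", "start_min_X", "dur_X", "diff"]
truncated = {key: truncate_recent(sample[key], 3) for key in sequence_keys}
print(truncated["X"])  # the three most recent locations
```

Truncating from the left keeps the visits closest to the prediction point, which are the ones the recency-based scoring relies on most.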

In [None]:
class GeoLifeDataset(Dataset):
    """
    Dataset for GeoLife trajectory sequences.

    Each sample represents a trajectory sequence ending with a target next location.
    """

    def __init__(self, data_path, max_seq_len=60):
        """
        Args:
            data_path: Path to pickle file containing preprocessed data
            max_seq_len: Maximum sequence length (for truncation)
        """
        print(f"Loading data from {data_path}...")
        with open(data_path, 'rb') as f:
            self.data = pickle.load(f)
        self.max_seq_len = max_seq_len
        print(f"Loaded {len(self.data)} samples")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]

        # Extract features
        loc_seq = sample['X']
        user_seq = sample['user_X']
        weekday_seq = sample['weekday_X']
        start_min_seq = sample['start_min_X']
        dur_seq = sample['dur_X']
        diff_seq = sample['diff']
        target = sample['Y']

        # Truncate if too long (keep most recent history)
        seq_len = len(loc_seq)
        if seq_len > self.max_seq_len:
            loc_seq = loc_seq[-self.max_seq_len:]
            user_seq = user_seq[-self.max_seq_len:]
            weekday_seq = weekday_seq[-self.max_seq_len:]
            start_min_seq = start_min_seq[-self.max_seq_len:]
            dur_seq = dur_seq[-self.max_seq_len:]
            diff_seq = diff_seq[-self.max_seq_len:]
            seq_len = self.max_seq_len

        return {
            'loc_seq': torch.LongTensor(loc_seq),
            'user_seq': torch.LongTensor(user_seq),
            'weekday_seq': torch.LongTensor(weekday_seq),
            'start_min_seq': torch.FloatTensor(start_min_seq),
            'dur_seq': torch.FloatTensor(dur_seq),
            'diff_seq': torch.LongTensor(diff_seq),
            'target': torch.LongTensor([target]),
            'seq_len': seq_len
        }

print("GeoLifeDataset class defined successfully")

## 4. Collate Function

The collate function handles variable-length sequences by padding them to the maximum length in each batch. It creates:

- Padded tensors for all sequence features
- A mask indicating which positions are valid (not padding)

This allows efficient batch processing of sequences with different lengths.
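To make the padding and mask concrete, here is a minimal pure-Python sketch of the same idea on plain lists, without PyTorch. `pad_batch` is a hypothetical helper written for this illustration, not part of the project code.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad variable-length lists and build a validity mask (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        # True marks a real visit, False marks padding.
        mask.append([True] * len(seq) + [False] * n_pad)
    return padded, mask


padded, mask = pad_batch([[4, 9, 2], [7], [5, 5]])
for row, row_mask in zip(padded, mask):
    print(row, row_mask)
```

Padding with 0 is safe here because location ID 0 is reserved for padding, so a padded position can never be mistaken for a real location.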

In [None]:
def collate_fn(batch):
    """
    Custom collate function to handle variable-length sequences.
    Pads sequences to the maximum length in the batch.

    Args:
        batch: List of samples from dataset

    Returns:
        Dictionary with batched tensors and mask
    """
    # Find max length in this batch
    max_len = max(item['seq_len'] for item in batch)
    batch_size = len(batch)

    # Initialize padded tensors
    loc_seqs = torch.zeros(batch_size, max_len, dtype=torch.long)
    user_seqs = torch.zeros(batch_size, max_len, dtype=torch.long)
    weekday_seqs = torch.zeros(batch_size, max_len, dtype=torch.long)
    start_min_seqs = torch.zeros(batch_size, max_len, dtype=torch.float)
    dur_seqs = torch.zeros(batch_size, max_len, dtype=torch.float)
    diff_seqs = torch.zeros(batch_size, max_len, dtype=torch.long)
    targets = torch.zeros(batch_size, dtype=torch.long)
    seq_lens = torch.zeros(batch_size, dtype=torch.long)

    # Fill in the data
    for i, item in enumerate(batch):
        length = item['seq_len']
        loc_seqs[i, :length] = item['loc_seq']
        user_seqs[i, :length] = item['user_seq']
        weekday_seqs[i, :length] = item['weekday_seq']
        start_min_seqs[i, :length] = item['start_min_seq']
        dur_seqs[i, :length] = item['dur_seq']
        diff_seqs[i, :length] = item['diff_seq']
        targets[i] = item['target']
        seq_lens[i] = length

    # Create mask (True for valid positions, False for padding)
    mask = torch.arange(max_len).unsqueeze(0) < seq_lens.unsqueeze(1)

    return {
        'loc_seq': loc_seqs,
        'user_seq': user_seqs,
        'weekday_seq': weekday_seqs,
        'start_min_seq': start_min_seqs,
        'dur_seq': dur_seqs,
        'diff_seq': diff_seqs,
        'target': targets,
        'mask': mask
    }

print("Collate function defined successfully")

## 5. Model Architecture

The **HistoryCentricModel** is the core of our next-location prediction system. It combines two strategies:

### A) History-Based Scoring

- **Recency**: Exponentially decaying weights favoring recent locations
- **Frequency**: Count of how often each location appears in history
- These scores are computed for ALL possible locations based on visit history

### B) Learned Patterns

- **Embeddings**: Location, user, and temporal feature embeddings
- **Transformer**: Single-layer attention to capture sequential patterns
- **Prediction head**: Maps learned representations to location scores

### Final Prediction

The model ensembles both approaches:

```
final_score = history_score + model_weight * learned_score
```

This design reflects the insight that **83.81% of next locations are already in the visit history**.
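The history-based part can be worked through by hand on a single short sequence. The sketch below is a pure-Python approximation under stated assumptions: it handles one unpadded, non-empty history, and its defaults are the initial values of the model's learnable parameters (0.62, 2.2, 11.0), which a trained checkpoint may have moved. `history_scores` is a hypothetical name for this illustration.

```python
from collections import Counter


def history_scores(history, decay=0.62, freq_weight=2.2, scale=11.0):
    """Score each visited location by recency and frequency (illustrative sketch).

    Locations that never appear in `history` implicitly score 0.
    """
    n = len(history)
    recency = {}
    for t, loc in enumerate(history):
        steps_from_end = n - 1 - t
        # Keep the largest weight, i.e. the most recent visit to each location.
        recency[loc] = max(recency.get(loc, 0.0), decay ** steps_from_end)

    counts = Counter(history)
    max_count = max(counts.values())  # normalise frequency to [0, 1]
    return {
        loc: scale * (recency[loc] + freq_weight * counts[loc] / max_count)
        for loc in counts
    }


scores = history_scores([5, 7, 5])
print(scores)
```

With the history `[5, 7, 5]`, location 5 is both the most recent and the most frequent visit, so it scores 11 × (1.0 + 2.2 × 1.0) = 35.2, while location 7 scores 11 × (0.62 + 2.2 × 0.5) = 18.92.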

In [None]:
class HistoryCentricModel(nn.Module):
    """
    History-Centric Next-Location Predictor

    Combines history-based heuristics with learned sequential patterns.
    """

    def __init__(self, num_locations, num_users):
        super().__init__()

        self.num_locations = num_locations
        self.d_model = 80  # Compact hidden dimension

        # === Embeddings ===
        # Location embedding (main feature)
        self.loc_emb = nn.Embedding(num_locations, 56, padding_idx=0)
        # User embedding
        self.user_emb = nn.Embedding(num_users, 12, padding_idx=0)

        # Temporal feature projection (6 features -> 12 dim)
        # Features: sin/cos(time), duration, sin/cos(weekday), time gap
        self.temporal_proj = nn.Linear(6, 12)

        # Input layer norm (56 + 12 + 12 = 80)
        self.input_norm = nn.LayerNorm(80)

        # === Positional Encoding ===
        pe = torch.zeros(60, 80)
        position = torch.arange(0, 60, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, 80, 2).float() * (-math.log(10000.0) / 80))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

        # === Transformer (single layer, 4 heads) ===
        self.attn = nn.MultiheadAttention(80, 4, dropout=0.35, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(80, 160),
            nn.GELU(),
            nn.Dropout(0.35),
            nn.Linear(160, 80)
        )
        self.norm1 = nn.LayerNorm(80)
        self.norm2 = nn.LayerNorm(80)
        self.dropout = nn.Dropout(0.35)

        # === Prediction Head ===
        self.predictor = nn.Sequential(
            nn.Linear(80, 160),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(160, num_locations)
        )

        # === History Scoring Parameters (learnable) ===
        self.recency_decay = nn.Parameter(torch.tensor(0.62))
        self.freq_weight = nn.Parameter(torch.tensor(2.2))
        self.history_scale = nn.Parameter(torch.tensor(11.0))
        self.model_weight = nn.Parameter(torch.tensor(0.22))

        self._init_weights()

    def _init_weights(self):
        """Initialize model weights."""
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Embedding):
                nn.init.normal_(m.weight, mean=0, std=0.01)
                if m.padding_idx is not None:
                    m.weight.data[m.padding_idx].zero_()

    def compute_history_scores(self, loc_seq, mask):
        """
        Compute history-based scores for all locations.

        For each location in the vocabulary:
        - Recency score: exponential decay from last visit
        - Frequency score: normalized count of visits
        - Combined: recency + freq_weight * frequency

        Args:
            loc_seq: (batch_size, seq_len) - sequence of location IDs
            mask: (batch_size, seq_len) - valid position mask

        Returns:
            history_scores: (batch_size, num_locations) - score for each location
        """
        batch_size, seq_len = loc_seq.shape

        # Initialize score matrices
        recency_scores = torch.zeros(batch_size, self.num_locations, device=loc_seq.device)
        frequency_scores = torch.zeros(batch_size, self.num_locations, device=loc_seq.device)

        # Compute recency and frequency scores by iterating through sequence
        for t in range(seq_len):
            locs_t = loc_seq[:, t]  # (batch_size,)
            valid_t = mask[:, t].float()  # (batch_size,)

            # Recency: exponential decay from the end
            time_from_end = seq_len - t - 1
            recency_weight = torch.pow(self.recency_decay, time_from_end)

            # Update recency scores (max over time for each location)
            indices = locs_t.unsqueeze(1)  # (batch_size, 1)
            values = (recency_weight * valid_t).unsqueeze(1)  # (batch_size, 1)

            # For each location, keep the maximum recency (most recent visit)
            current_scores = torch.zeros(batch_size, self.num_locations, device=loc_seq.device)
            current_scores.scatter_(1, indices, values)
            recency_scores = torch.maximum(recency_scores, current_scores)

            # Update frequency scores (sum over time)
            frequency_scores.scatter_add_(1, indices, valid_t.unsqueeze(1))

        # Normalize frequency scores
        max_freq = frequency_scores.max(dim=1, keepdim=True)[0].clamp(min=1.0)
        frequency_scores = frequency_scores / max_freq

        # Combine recency and frequency
        history_scores = recency_scores + self.freq_weight * frequency_scores
        history_scores = self.history_scale * history_scores

        return history_scores

    def forward(self, loc_seq, user_seq, weekday_seq, start_min_seq, dur_seq, diff_seq, mask):
        """
        Forward pass combining history scores and learned model.

        Args:
            loc_seq: (B, L) - location sequence
            user_seq: (B, L) - user IDs
            weekday_seq: (B, L) - weekday indices (0-6)
            start_min_seq: (B, L) - start time in minutes
            dur_seq: (B, L) - duration at each location
            diff_seq: (B, L) - time gap indicator
            mask: (B, L) - valid position mask

        Returns:
            logits: (B, num_locations) - prediction scores
        """
        batch_size, seq_len = loc_seq.shape

        # === STEP 1: Compute history-based scores ===
        history_scores = self.compute_history_scores(loc_seq, mask)

        # === STEP 2: Learned model ===
        # Feature extraction
        loc_emb = self.loc_emb(loc_seq)  # (B, L, 56)
        user_emb = self.user_emb(user_seq)  # (B, L, 12)

        # Temporal features (cyclical encoding + normalization)
        hours = start_min_seq / 60.0
        time_rad = (hours / 24.0) * 2 * math.pi
        time_sin = torch.sin(time_rad)
        time_cos = torch.cos(time_rad)

        dur_norm = torch.log1p(dur_seq) / 8.0  # Log-normalized duration

        wd_rad = (weekday_seq.float() / 7.0) * 2 * math.pi
        wd_sin = torch.sin(wd_rad)
        wd_cos = torch.cos(wd_rad)

        diff_norm = diff_seq.float() / 7.0  # Normalized time gap

        # Stack all temporal features
        temporal_feats = torch.stack([time_sin, time_cos, dur_norm, wd_sin, wd_cos, diff_norm], dim=-1)
        temporal_emb = self.temporal_proj(temporal_feats)  # (B, L, 12)

        # Combine all features
        x = torch.cat([loc_emb, user_emb, temporal_emb], dim=-1)  # (B, L, 80)
        x = self.input_norm(x)

        # Add positional encoding
        x = x + self.pe[:seq_len, :].unsqueeze(0)
        x = self.dropout(x)

        # Transformer layer
        attn_mask = ~mask  # Attention mask (True = ignore)
        attn_out, _ = self.attn(x, x, x, key_padding_mask=attn_mask)
        x = self.norm1(x + self.dropout(attn_out))

        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))

        # Get last valid position for each sequence
        seq_lens = mask.sum(dim=1) - 1  # Last valid index
        indices_gather = seq_lens.unsqueeze(1).unsqueeze(2).expand(batch_size, 1, self.d_model)
        last_hidden = torch.gather(x, 1, indices_gather).squeeze(1)  # (B, 80)

        # Generate learned logits
        learned_logits = self.predictor(last_hidden)  # (B, num_locations)

        # === STEP 3: Ensemble history + learned ===
        # Normalize learned logits to similar scale as history scores
        learned_logits_normalized = F.softmax(learned_logits, dim=1) * self.num_locations

        # Combine with learned weight
        final_logits = history_scores + self.model_weight * learned_logits_normalized

        return final_logits

    def count_parameters(self):
        """Count trainable parameters."""
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

print("HistoryCentricModel class defined successfully")
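The forward pass in the model definition above builds six temporal inputs per visit, using sine and cosine pairs so that times on either side of midnight stay close together. The sketch below reproduces that arithmetic for a single visit in plain Python; `temporal_features` is a hypothetical name, and the input values are invented.

```python
import math


def temporal_features(start_min, duration, weekday, diff):
    """Return the six temporal inputs for one visit (illustrative sketch)."""
    time_rad = (start_min / 60.0) / 24.0 * 2 * math.pi
    wd_rad = (weekday / 7.0) * 2 * math.pi
    return [
        math.sin(time_rad), math.cos(time_rad),  # time of day on a circle
        math.log1p(duration) / 8.0,              # log-scaled duration
        math.sin(wd_rad), math.cos(wd_rad),      # weekday on a circle
        diff / 7.0,                              # scaled time gap indicator
    ]


late = temporal_features(start_min=1439, duration=30.0, weekday=6, diff=1)   # 23:59
early = temporal_features(start_min=1, duration=30.0, weekday=0, diff=1)     # 00:01

# Distance between the two time-of-day encodings, despite a 1438-minute raw gap.
gap = math.hypot(late[0] - early[0], late[1] - early[1])
print(f"{gap:.4f}")
```

Feeding the raw minute count instead would place 23:59 and 00:01 at opposite ends of the input range, even though the two visits are only two minutes apart.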

## 6. Metrics Calculation Functions

Define functions to calculate evaluation metrics:

### Top-k Accuracy

Percentage of predictions where the true location is in the top-k predictions.

### Mean Reciprocal Rank (MRR)

Average of 1/rank where rank is the position of the correct location in predictions.

### Normalized Discounted Cumulative Gain (NDCG@10)

Ranking metric that gives higher weights to correct predictions at higher ranks.

### F1 Score

Weighted F1 score comparing top-1 predictions with ground truth.
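Because each sample has exactly one correct location, all three ranking metrics reduce to simple functions of the target's rank. The following sketch works one sample through by hand in plain Python; the function names and the score values are hypothetical, and tie-breaking may differ from the tensor-based implementation.

```python
import math


def rank_of_target(scores, target):
    """1-indexed position of `target` when locations are sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(target) + 1


def single_sample_metrics(scores, target, ks=(1, 3, 5, 10), ndcg_k=10):
    """Per-sample hit@k flags, reciprocal rank, and NDCG@ndcg_k."""
    rank = rank_of_target(scores, target)
    hits = {k: int(rank <= k) for k in ks}
    reciprocal_rank = 1.0 / rank
    # With a single relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
    ndcg = 1.0 / math.log2(rank + 1) if rank <= ndcg_k else 0.0
    return hits, reciprocal_rank, ndcg


scores = [0.1, 0.5, 0.3, 0.9, 0.2]  # one score per location ID 0..4
hits, rr, ndcg = single_sample_metrics(scores, target=2)
print(hits, rr, ndcg)
```

Here the target (location 2) is ranked third, so it misses Acc@1, counts as a hit for k = 3, 5, and 10, and contributes 1/3 to the reciprocal-rank sum and 1/log2(4) = 0.5 to the NDCG sum.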

In [None]:
def get_mrr(prediction, targets):
    """
    Calculate the Mean Reciprocal Rank score.

    MRR measures how high the correct answer appears in ranked predictions.
    MRR = 1/rank where rank is the position of the correct item.

    Args:
        prediction: (B, num_locations) - model output scores
        targets: (B,) - ground truth location indices

    Returns:
        Sum of reciprocal ranks (to be averaged later)
    """
    # Sort predictions in descending order
    index = torch.argsort(prediction, dim=-1, descending=True)

    # Find where the target appears in sorted predictions
    hits = (targets.unsqueeze(-1).expand_as(index) == index).nonzero()

    # Get ranks (1-indexed)
    ranks = (hits[:, -1] + 1).float()

    # Reciprocal ranks
    rranks = torch.reciprocal(ranks)

    return torch.sum(rranks).cpu().numpy()


def get_ndcg(prediction, targets, k=10):
    """
    Calculate the Normalized Discounted Cumulative Gain@k score.

    NDCG measures ranking quality with logarithmic discount for lower positions.
    Only considers top-k positions.

    Args:
        prediction: (B, num_locations) - model output scores
        targets: (B,) - ground truth location indices
        k: Consider only top-k positions (default: 10)

    Returns:
        Sum of NDCG scores (to be averaged later)
    """
    # Sort predictions in descending order
    index = torch.argsort(prediction, dim=-1, descending=True)

    # Find where the target appears
    hits = (targets.unsqueeze(-1).expand_as(index) == index).nonzero()
    ranks = (hits[:, -1] + 1).float().cpu().numpy()

    # Calculate NDCG with logarithmic discount
    not_considered_idx = ranks > k
    ndcg = 1 / np.log2(ranks + 1)
    ndcg[not_considered_idx] = 0  # Ignore ranks beyond k

    return np.sum(ndcg)


def calculate_correct_total_prediction(logits, true_y):
    """
    Calculate top-k accuracy metrics for predictions.

    For each k in [1, 3, 5, 10], checks if the true location is in top-k predictions.
    Also computes MRR and NDCG.

    Args:
        logits: (B, num_locations) - model output scores
        true_y: (B,) - ground truth location indices

    Returns:
        result_array: [correct@1, correct@3, correct@5, correct@10, rr, ndcg, total]
        true_y_cpu: Ground truth on CPU (for F1 calculation)
        top1: Top-1 predictions (for F1 calculation)
    """
    top1 = []
    result_ls = []

    # Calculate top-k accuracy for k in [1, 3, 5, 10]
    for k in [1, 3, 5, 10]:
        # Handle case where vocab size < k
        if logits.shape[-1] < k:
            k = logits.shape[-1]

        # Get top-k predictions
        prediction = torch.topk(logits, k=k, dim=-1).indices

        # Save top-1 for F1 calculation
        if k == 1:
            top1 = torch.squeeze(prediction).cpu()

        # Count correct predictions (true label in top-k)
        top_k_correct = torch.eq(true_y[:, None], prediction).any(dim=1).sum().cpu().numpy()
        result_ls.append(top_k_correct)

    # Add MRR
    result_ls.append(get_mrr(logits, true_y))

    # Add NDCG@10
    result_ls.append(get_ndcg(logits, true_y))

    # Add total count
    result_ls.append(true_y.shape[0])

    return np.array(result_ls, dtype=np.float32), true_y.cpu(), top1


def get_performance_dict(return_dict):
    """
    Convert raw metric counts to percentages.

    Takes accumulated counts and computes final accuracy percentages.
    Note that the "ndcg" entry is overwritten with its percentage value.

    Args:
        return_dict: Dictionary with accumulated counts

    Returns:
        Dictionary with accuracy percentages
    """
    perf = {
        "correct@1": return_dict["correct@1"],
        "correct@3": return_dict["correct@3"],
        "correct@5": return_dict["correct@5"],
        "correct@10": return_dict["correct@10"],
        "rr": return_dict["rr"],
        "ndcg": return_dict["ndcg"],
        "f1": return_dict.get("f1", 0),
        "total": return_dict["total"],
    }

    # Convert to percentages
    perf["acc@1"] = perf["correct@1"] / perf["total"] * 100
    perf["acc@3"] = perf["correct@3"] / perf["total"] * 100
    perf["acc@5"] = perf["correct@5"] / perf["total"] * 100
    perf["acc@10"] = perf["correct@10"] / perf["total"] * 100
    perf["mrr"] = perf["rr"] / perf["total"] * 100
    perf["ndcg"] = perf["ndcg"] / perf["total"] * 100

    return perf

print("Metric calculation functions defined successfully")

## 7. Load Test Data

Load the test dataset and create a DataLoader for batch processing.

In [None]:
# Create dataset
test_dataset = GeoLifeDataset(
    data_path=f"{DATA_DIR}/{TEST_FILE}",
    max_seq_len=MAX_SEQ_LEN
)

# Create dataloader
test_loader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=0,  # Set to 0 for notebook compatibility
    collate_fn=collate_fn
)

print("\nTest set loaded:")
print(f"  Total samples: {len(test_dataset)}")
print(f"  Number of batches: {len(test_loader)}")
print(f"  Batch size: {BATCH_SIZE}")

# Inspect a sample batch
print("\nInspecting first batch...")
sample_batch = next(iter(test_loader))
print(f"  Batch keys: {list(sample_batch.keys())}")
print(f"  loc_seq shape: {sample_batch['loc_seq'].shape}")
print(f"  target shape: {sample_batch['target'].shape}")
print(f"  mask shape: {sample_batch['mask'].shape}")

## 8. Load Trained Model

Initialize the model architecture and load pre-trained weights from checkpoint.

In [None]:
# Initialize model
model = HistoryCentricModel(
    num_locations=NUM_LOCATIONS,
    num_users=NUM_USERS
)

# Count parameters
num_params = model.count_parameters()
print(f"Model initialized with {num_params:,} parameters")

# Load checkpoint
print(f"\nLoading checkpoint from {MODEL_PATH}...")
checkpoint = torch.load(MODEL_PATH, map_location=device)

# Load state dict
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
model.eval()  # Set to evaluation mode (disables dropout)

print("✓ Model loaded successfully!")
print(f"  Checkpoint epoch: {checkpoint.get('epoch', 'N/A')}")
# Only report validation statistics that the checkpoint actually stores
if 'val_loss' in checkpoint:
    print(f"  Validation loss: {checkpoint['val_loss']:.4f}")
if 'val_acc' in checkpoint:
    print(f"  Validation Acc@1: {checkpoint['val_acc']:.2f}%")

## 9. Evaluation Loop

Run the complete evaluation on the test set. For each batch:

1. **Load batch data** - Get sequences, temporal features, targets, and masks
2. **Move to device** - Transfer tensors to GPU/CPU
3. **Forward pass** - Run model inference (no gradient computation)
4. **Calculate metrics** - Compute top-k accuracies, MRR, NDCG for the batch
5. **Accumulate results** - Add batch results to running totals
6. **Collect predictions** - Store top-1 predictions and ground truth for F1 score

The evaluation runs inside a `with torch.no_grad():` block, which disables gradient tracking for efficiency.
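One detail worth spelling out is why the loop accumulates raw sums and divides once at the end rather than averaging per-batch accuracies: the final batch is usually smaller, so a mean of batch means would over-weight its samples. A small sketch with invented counts shows the difference.

```python
# (correct_at_1, batch_size) for three hypothetical batches; the last one is smaller.
batches = [(3, 4), (2, 4), (1, 1)]

# Pooled accuracy: accumulate counts across batches, divide once at the end.
total_correct = sum(correct for correct, _ in batches)
total_samples = sum(size for _, size in batches)
pooled_acc = 100.0 * total_correct / total_samples

# Naive alternative: average the per-batch accuracies.
mean_of_batch_accs = sum(100.0 * c / n for c, n in batches) / len(batches)

print(f"pooled: {pooled_acc:.2f}%  mean of batch means: {mean_of_batch_accs:.2f}%")
```

The pooled figure (6 correct out of 9) is the true per-sample accuracy; the naive average is inflated by the single-sample batch that happened to be correct.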

In [None]:
# Initialize metric accumulators
metrics = {
    "correct@1": 0,
    "correct@3": 0,
    "correct@5": 0,
    "correct@10": 0,
    "rr": 0,
    "ndcg": 0,
    "f1": 0,
    "total": 0
}

# Lists for F1 score calculation
true_ls = []
top1_ls = []

print("Starting evaluation on test set...")
print("=" * 80)

# Evaluation loop with progress bar
with torch.no_grad():
    for batch_idx, batch in enumerate(tqdm(test_loader, desc="Evaluating", ncols=100)):
        # Move batch to device
        loc_seq = batch['loc_seq'].to(device)
        user_seq = batch['user_seq'].to(device)
        weekday_seq = batch['weekday_seq'].to(device)
        start_min_seq = batch['start_min_seq'].to(device)
        dur_seq = batch['dur_seq'].to(device)
        diff_seq = batch['diff_seq'].to(device)
        target = batch['target'].to(device)
        mask = batch['mask'].to(device)

        # Forward pass - model generates scores for all locations
        logits = model(loc_seq, user_seq, weekday_seq, start_min_seq, dur_seq, diff_seq, mask)

        # Calculate metrics for this batch
        result, batch_true, batch_top1 = calculate_correct_total_prediction(logits, target)

        # Accumulate metrics
        metrics["correct@1"] += result[0]
        metrics["correct@3"] += result[1]
        metrics["correct@5"] += result[2]
        metrics["correct@10"] += result[3]
        metrics["rr"] += result[4]
        metrics["ndcg"] += result[5]
        metrics["total"] += result[6]

        # Collect for F1 score
        true_ls.extend(batch_true.tolist())
        if not batch_top1.shape:  # Handle single-element case
            top1_ls.extend([batch_top1.tolist()])
        else:
            top1_ls.extend(batch_top1.tolist())

print("\n✓ Evaluation complete!")
print(f"Processed {metrics['total']:.0f} samples in {len(test_loader)} batches")

## 10. Calculate Final Metrics

Compute final evaluation metrics from accumulated results:

1. **Top-k Accuracies** - Percentage where true location is in top k predictions
2. **MRR** - Mean Reciprocal Rank across all predictions
3. **NDCG@10** - Normalized Discounted Cumulative Gain for top-10
4. **F1 Score** - Weighted F1 comparing top-1 predictions with ground truth

The metrics are converted from raw counts to percentages.
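For readers unfamiliar with the weighted average, the sketch below computes weighted F1 from its definition in plain Python: a per-class F1, averaged with each class weighted by how often it occurs in the ground truth. `weighted_f1` is a hypothetical helper for illustration and the labels are invented; the notebook itself relies on scikit-learn's `f1_score`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative sketch)."""
    support = Counter(y_true)  # classes absent from y_true get zero weight
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)


score = weighted_f1([1, 1, 2, 3], [1, 2, 2, 3])
print(f"{score:.4f}")
```

In this toy case classes 1 and 2 each have F1 = 2/3 and class 3 has F1 = 1, giving (2 × 2/3 + 1 × 2/3 + 1 × 1) / 4 = 0.75. With over a thousand location classes, frequently visited locations dominate this average.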

In [None]:
# Calculate F1 score
f1 = f1_score(true_ls, top1_ls, average="weighted")
metrics["f1"] = f1

# Convert to percentages
perf = get_performance_dict(metrics)

# Acc@3 is derived here directly from the raw count
acc_at_3 = perf['correct@3'] / perf['total'] * 100

print("=" * 80)
print("FINAL TEST SET RESULTS")
print("=" * 80)
print(f"Total samples evaluated: {int(perf['total'])}")
print()
print("Accuracy Metrics:")
print(f"  Acc@1  (Top-1):  {perf['acc@1']:6.2f}%")
print(f"  Acc@3  (Top-3):  {acc_at_3:6.2f}%")
print(f"  Acc@5  (Top-5):  {perf['acc@5']:6.2f}%")
print(f"  Acc@10 (Top-10): {perf['acc@10']:6.2f}%")
print()
print("Ranking Metrics:")
print(f"  MRR (Mean Reciprocal Rank): {perf['mrr']:6.2f}%")
print(f"  NDCG@10:                    {perf['ndcg']:6.2f}%")
print()
print("Classification Metric:")
print(f"  F1 Score (weighted):        {100 * f1:6.2f}%")
print("=" * 80)

## 11. Detailed Metrics Breakdown

Let's examine the raw counts to understand the metrics better.

In [None]:
print("Raw Metric Counts:")
print("=" * 80)
print(f"Correct predictions in top-1:  {int(perf['correct@1'])} / {int(perf['total'])} = {perf['acc@1']:.2f}%")
print(f"Correct predictions in top-3:  {int(perf['correct@3'])} / {int(perf['total'])} = {(perf['correct@3']/perf['total']*100):.2f}%")
print(f"Correct predictions in top-5:  {int(perf['correct@5'])} / {int(perf['total'])} = {perf['acc@5']:.2f}%")
print(f"Correct predictions in top-10: {int(perf['correct@10'])} / {int(perf['total'])} = {perf['acc@10']:.2f}%")
print()
print(f"Sum of reciprocal ranks: {perf['rr']:.2f}")
# perf['ndcg'] holds the percentage, so read the raw sum from the accumulator
print(f"Sum of NDCG scores:      {metrics['ndcg']:.2f}")
print("=" * 80)

## 12. Summary and Interpretation

### What These Metrics Mean

**Acc@1 (Top-1 Accuracy)**: The percentage of times the model's #1 prediction is correct. This is the most important metric for practical deployment.

**Acc@5 (Top-5 Accuracy)**: The percentage of times the correct location appears in the top 5 predictions. Useful for recommendation systems where multiple options are presented.

**Acc@10 (Top-10 Accuracy)**: Similar to Acc@5 but with more options. Shows the model's ability to narrow down candidates.

**MRR (Mean Reciprocal Rank)**: Rewards predictions where the correct answer appears higher in the ranked list. A prediction at rank 1 contributes 1.0, rank 2 contributes 0.5, rank 3 contributes 0.33, etc.

**NDCG@10 (Normalized Discounted Cumulative Gain)**: A ranking metric that gives logarithmically decreasing weights to lower ranks, considering only the top 10 positions.

**F1 Score**: The harmonic mean of precision and recall for the top-1 predictions, weighted by class frequency.

### Model Strategy

This model leverages the key insight that most next locations (83.81%) are already in the user's visit history. It:

1. **Scores history locations** using recency (exponential decay) and frequency
2. **Learns sequential patterns** using a compact transformer
3. **Ensembles both** with learned weights favoring history

This hybrid approach achieves strong performance while keeping the model compact (under 500K parameters).
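The "under 500K parameters" figure can be checked by hand from the layer shapes. The tally below is a sketch that assumes the dimensions in this notebook's model definition and PyTorch's usual parameter layout for `nn.MultiheadAttention` (a fused 3E × E input projection with bias plus an E × E output projection with bias); the `linear` helper and the dictionary names are hypothetical.

```python
NUM_LOCATIONS, NUM_USERS = 1187, 46
D = 80  # hidden size


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


counts = {
    "loc_emb": NUM_LOCATIONS * 56,
    "user_emb": NUM_USERS * 12,
    "temporal_proj": linear(6, 12),
    "layer_norms": 3 * 2 * D,                        # input_norm, norm1, norm2
    "attention": 3 * D * D + 3 * D + linear(D, D),   # fused QKV + output projection
    "feed_forward": linear(D, 160) + linear(160, D),
    "predictor": linear(D, 160) + linear(160, NUM_LOCATIONS),
    "history_scalars": 4,                            # decay, freq weight, scale, model weight
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:16s} {n:>8,}")
print(f"{'total':16s} {total:>8,}")
```

Under these assumptions the total comes to 323,419, with the final `Linear(160, 1187)` output layer and the location embedding accounting for most of it. The positional encoding is registered as a buffer, so it is not counted.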

## 13. Conclusion

This notebook demonstrated the complete evaluation pipeline for next-location prediction:

1. ✅ **Data Loading**: Loaded and processed test trajectories with variable-length sequences
2. ✅ **Model Architecture**: Defined the HistoryCentricModel combining history heuristics and learned patterns
3. ✅ **Batch Inference**: Ran efficient batch-by-batch evaluation on the test set
4. ✅ **Metrics Calculation**: Computed comprehensive metrics (Acc@k, MRR, NDCG, F1)
5. ✅ **Results Analysis**: Interpreted the performance metrics

The evaluation process is identical to the one used in the training script (`trainer_v3.py`'s `validate()` method), ensuring consistency between training and testing.

### Next Steps

- Analyze prediction errors to understand failure modes
- Visualize history score distributions vs learned scores
- Test on different user segments or trajectory patterns
- Experiment with different history scoring parameters