# XGBoost Training on 10 Users' Trajectories

This notebook:
- Loads 10 users' trajectories (same as HMM/GNN/Fusion/Markov Chain/KNN models)
- Removes consecutive duplicates (AAABCDCCABB → ABCDCAB) for each user
- Creates overlapping sequences of length 50 (step size 25) from all users
- Trains an XGBoost multi-class classifier using feature engineering
- Evaluates four metrics: Accuracy, weighted Precision & Recall, Top-K Accuracy, and Mean Prediction Distance (MPD)


## Section 1 — Imports & Setup


In [17]:
import os
import pandas as pd
import numpy as np
import json
import pickle
from tqdm import tqdm
from haversine import haversine
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_score, recall_score
from collections import defaultdict, Counter
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

# Set random seeds
np.random.seed(42)

# Paths
BASE_PATH = "/home/root495/Inexture/Location Prediction Update"
PROCESSED_PATH = BASE_PATH + "/data/processed/"
SEQUENCES_FILE = PROCESSED_PATH + "place_sequences.json"
GRID_METADATA_FILE = PROCESSED_PATH + "grid_metadata.json"
CLEANED_WITH_PLACES_FILE = PROCESSED_PATH + "cleaned_with_places.csv"
OUTPUT_PATH = BASE_PATH + "/notebooks/"
MODELS_PATH = BASE_PATH + "/models/"
RESULTS_PATH = BASE_PATH + "/results/"
MODEL_SAVE_PATH = MODELS_PATH + "xgboost_model.pkl"
RESULTS_SAVE_PATH = RESULTS_PATH + "xgboost_results.json"

os.makedirs(OUTPUT_PATH, exist_ok=True)
os.makedirs(MODELS_PATH, exist_ok=True)
os.makedirs(RESULTS_PATH, exist_ok=True)

print("Libraries imported successfully!")


Libraries imported successfully!


## Section 2 — Load 10 Users' Trajectories


In [18]:
# Load place sequences
print("Loading place sequences...")
with open(SEQUENCES_FILE, 'r') as f:
    sequences_dict = json.load(f)

print(f"Total users available: {len(sequences_dict)}")

# Select first 10 users (same as other models)
user_ids = list(sequences_dict.keys())
NUM_USERS = 10
selected_users = user_ids[:NUM_USERS]

print(f"\nSelected {NUM_USERS} users: {selected_users}")

# Load sequences for all selected users
user_sequences = {}
total_places = 0
for user_id in selected_users:
    seq = sequences_dict[user_id]
    user_sequences[user_id] = seq
    total_places += len(seq)
    print(f"  User {user_id}: {len(seq)} places")

print(f"\nTotal places across all {NUM_USERS} users: {total_places}")


Loading place sequences...
Total users available: 54

Selected 10 users: ['000', '001', '005', '006', '009', '011', '014', '016', '019', '025']
  User 000: 173817 places
  User 001: 108561 places
  User 005: 108967 places
  User 006: 31809 places
  User 009: 84573 places
  User 011: 90770 places
  User 014: 388051 places
  User 016: 89208 places
  User 019: 47792 places
  User 025: 628816 places

Total places across all 10 users: 1752364


## Section 3 — Preprocess: Remove Consecutive Duplicates

Remove consecutive duplicate locations for each user. Example: AAABCDCCABB → ABCDCAB

Only consecutive duplicates are removed. If a location appears again later (non-consecutive), it is kept.
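
As a cross-check of the rule above, the same collapse can be sketched with the standard library's `itertools.groupby`, which groups runs of equal adjacent items. The helper name `collapse_runs` is illustrative and is not part of the notebook.

```python
from itertools import groupby

def collapse_runs(sequence):
    """Keep one item per run of equal adjacent items (illustrative helper)."""
    return [key for key, _ in groupby(sequence)]

# The example from the text: only adjacent repeats are dropped,
# so the later visits to A, B and C survive.
print("".join(collapse_runs("AAABCDCCABB")))  # ABCDCAB
```

Because `groupby` only merges neighbours, a location that is revisited after a different location is kept as a separate visit.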


In [19]:
def remove_consecutive_duplicates(sequence):
    """
    Remove consecutive duplicates from sequence.
    Example: [A, A, A, B, C, D, C, C, A, B, B] → [A, B, C, D, C, A, B]
    """
    if len(sequence) == 0:
        return sequence
    
    processed = [sequence[0]]  # Always keep first element
    
    for i in range(1, len(sequence)):
        # Only add if different from previous (not consecutive duplicate)
        if sequence[i] != sequence[i-1]:
            processed.append(sequence[i])
    
    return processed

# Apply consecutive duplicate removal to each user
processed_sequences = {}
total_original = 0
total_processed = 0

print("Processing users...")
for user_id in tqdm(selected_users, desc="Removing duplicates"):
    original_seq = user_sequences[user_id]
    processed_seq = remove_consecutive_duplicates(original_seq)
    processed_sequences[user_id] = processed_seq
    
    original_len = len(original_seq)
    processed_len = len(processed_seq)
    total_original += original_len
    total_processed += processed_len
    
    reduction = original_len - processed_len
    reduction_pct = (reduction / original_len * 100) if original_len > 0 else 0
    print(f"  User {user_id}: {original_len} → {processed_len} places ({reduction_pct:.1f}% reduction)")

print(f"\nSummary:")
print(f"  Total original places: {total_original}")
print(f"  Total after processing: {total_processed}")
print(f"  Total duplicates removed: {total_original - total_processed} ({((total_original - total_processed)/total_original*100):.1f}%)")


Processing users...


Removing duplicates:  50%|█████     | 5/10 [00:00<00:00, 42.84it/s]

  User 000: 173817 → 795 places (99.5% reduction)
  User 001: 108561 → 186 places (99.8% reduction)
  User 005: 108967 → 283 places (99.7% reduction)
  User 006: 31809 → 103 places (99.7% reduction)
  User 009: 84573 → 17 places (100.0% reduction)
  User 011: 90770 → 125 places (99.9% reduction)
  User 014: 388051 → 766 places (99.8% reduction)
  User 016: 89208 → 124 places (99.9% reduction)
  User 019: 47792 → 120 places (99.7% reduction)


Removing duplicates: 100%|██████████| 10/10 [00:00<00:00, 29.06it/s]

  User 025: 628816 → 1568 places (99.8% reduction)

Summary:
  Total original places: 1752364
  Total after processing: 4087
  Total duplicates removed: 1748277 (99.8%)





## Section 4 — Create Sequences of Length 50

Slide a window of 50 events over each user's processed sequence with a step size of 25 (50% overlap), keeping only full-length windows.
Combine the windows from all users, then split them 80/20 into training and test sets.
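
The number of sequences each user contributes follows directly from the de-duplicated length. The sketch below assumes a window of 50 and a step of 25, as used in the cell that follows; `count_windows` is a hypothetical helper, and the lengths are the ones reported in the Section 3 output.

```python
def count_windows(n_events, window=50, step=25):
    """Number of full-length windows a sliding window yields (illustrative helper)."""
    if n_events < window:
        return 0
    return (n_events - window) // step + 1

# De-duplicated sequence lengths reported in Section 3.
lengths = {"000": 795, "001": 186, "005": 283, "006": 103, "009": 17,
           "011": 125, "014": 766, "016": 124, "019": 120, "025": 1568}

counts = {user: count_windows(n) for user, n in lengths.items()}
print(counts["000"], counts["009"], sum(counts.values()))  # 30 0 149
```

User 009 has only 17 events after de-duplication, fewer than one window, so it contributes no sequences.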


In [20]:
# Create sequences of fixed length 50
SEQUENCE_LENGTH = 50

# Use sliding windows for more training data (overlap helps with learning)
# Create overlapping sequences with step size of 25 (50% overlap)
all_sequences = []
step_size = 25  # Overlap of 50%

print("Creating sequences from all users...")
for user_id in tqdm(selected_users, desc="Processing users"):
    processed_seq = processed_sequences[user_id]
    user_sequences_list = []
    
    for i in range(0, len(processed_seq) - SEQUENCE_LENGTH + 1, step_size):
        chunk = processed_seq[i:i+SEQUENCE_LENGTH]
        if len(chunk) == SEQUENCE_LENGTH:  # Only full-length sequences
            user_sequences_list.append(chunk)
    
    all_sequences.extend(user_sequences_list)
    print(f"  User {user_id}: {len(user_sequences_list)} sequences")

print(f"\nTotal sequences created: {len(all_sequences)}")
print(f"Total events in sequences: {sum(len(s) for s in all_sequences)}")

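# Split into train/test (80/20)
# Note: sequences are in user order and are not shuffled, so the test set is the
# tail of the list, and adjacent windows on either side of the split share events.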
split_idx = int(len(all_sequences) * 0.8)
train_sequences = all_sequences[:split_idx]
test_sequences = all_sequences[split_idx:]

print(f"\nTraining sequences: {len(train_sequences)}")
print(f"Test sequences: {len(test_sequences)}")

if len(test_sequences) == 0:
    # If no test sequences, use last training sequence for testing
    test_sequences = [train_sequences[-1]]
    train_sequences = train_sequences[:-1]
    print(f"Adjusted: Training={len(train_sequences)}, Test=1 (using last training sequence)")


Creating sequences from all users...


Processing users: 100%|██████████| 10/10 [00:00<00:00, 16690.43it/s]

  User 000: 30 sequences
  User 001: 6 sequences
  User 005: 10 sequences
  User 006: 3 sequences
  User 009: 0 sequences
  User 011: 4 sequences
  User 014: 29 sequences
  User 016: 3 sequences
  User 019: 3 sequences
  User 025: 61 sequences

Total sequences created: 149
Total events in sequences: 7450

Training sequences: 119
Test sequences: 30





## Section 5 — Encode Sequences

Encode place_ids to integers for XGBoost training.


In [21]:
# Encode sequences to integers
print("Encoding sequences...")
le = LabelEncoder()

# Flatten all sequences for encoding
all_places = [place for seq in train_sequences + test_sequences for place in seq]
le.fit(all_places)

n_states = len(le.classes_)
print(f"Unique places across all users: {n_states}")

# Encode training sequences
train_encoded = []
for seq in train_sequences:
    encoded = le.transform(seq).tolist()
    train_encoded.append(encoded)

# Encode test sequences
test_encoded = []
for seq in test_sequences:
    encoded = le.transform(seq).tolist()
    test_encoded.append(encoded)

print(f"Encoded {len(train_encoded)} training sequences")
print(f"Encoded {len(test_encoded)} test sequences")

# Create mapping from encoded ID to original place_id for coordinate lookup
# (LabelEncoder assigns each class its index in le.classes_)
encoded_to_placeid = {idx: place_id for idx, place_id in enumerate(le.classes_)}

print(f"Created mapping for {len(encoded_to_placeid)} place IDs")


Encoding sequences...
Unique places across all users: 303
Encoded 119 training sequences
Encoded 30 test sequences
Created mapping for 303 place IDs


## Section 6 — Feature Engineering

Extract features from history sequences to create feature vectors for XGBoost.

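Features include:
- Last N locations (last_1 through last_5, where last_5 is the most recent; short histories are padded with -1)
- History length
- Number of unique locations in history
- Frequency of the top-K most frequent locations in the current history (location and count for each)
- Most recent bigram (the last two locations, which repeat last_4 and last_5)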
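
To make the slot ordering concrete, here is a small dependency-free sketch of the last-N and frequency features. It follows the same indexing as `extract_features` in the next cell: the first slot holds the oldest of the N locations, the last slot holds the most recent, and missing slots are padded with -1 on the left. The helper names are illustrative only.

```python
from collections import Counter

def last_n_slots(history, n=5, pad=-1):
    """Return the last n items, left-padded with `pad` (illustrative helper)."""
    tail = list(history[-n:]) if n > 0 else []
    return [pad] * (n - len(tail)) + tail

def top_frequencies(history, k=2):
    """Return up to k (location, count) pairs, most frequent first."""
    return Counter(history).most_common(k)

print(last_n_slots([7, 3]))                  # [-1, -1, -1, 7, 3]
print(top_frequencies([7, 3, 7, 9, 3, 7]))   # [(7, 3), (3, 2)]
```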


In [22]:
# Feature Engineering Configuration
MAX_HISTORY = 5  # Number of last locations to include as features
TOP_K_FREQ = 10  # Number of top frequent locations to track

print("Setting up feature engineering...")
print(f"Max history locations: {MAX_HISTORY}")
print(f"Top-K frequent locations: {TOP_K_FREQ}")


def extract_features(history, max_history=MAX_HISTORY, top_k_freq=TOP_K_FREQ):
    """
    Extract features from history sequence.
    
    Args:
        history: List of encoded states (history sequence)
        max_history: Number of last locations to include
        top_k_freq: Number of top frequent locations to track
    
    Returns:
        Dictionary of feature values
    """
    features = {}
    
    # Feature 1-5: Last N locations (padded with -1 if history too short)
    for i in range(max_history):
        idx = len(history) - (max_history - i)
        if idx >= 0:
            features[f'last_{i+1}'] = history[idx]
        else:
            features[f'last_{i+1}'] = -1  # Padding for short histories
    
    # Feature 6: History length
    features['history_length'] = len(history)
    
    # Feature 7-16: Frequency of top-K most frequent locations in history
    if len(history) > 0:
        location_counts = Counter(history)
        # Rank locations by their frequency within the current history
        top_freq = location_counts.most_common(top_k_freq)
        
        # Emit (location, count) pairs; unused slots are padded with (-1, 0)
        for i in range(top_k_freq):
            if i < len(top_freq):
                features[f'top_freq_{i+1}_loc'] = top_freq[i][0]
                features[f'top_freq_{i+1}_count'] = top_freq[i][1]
            else:
                features[f'top_freq_{i+1}_loc'] = -1
                features[f'top_freq_{i+1}_count'] = 0
    else:
        # Empty history
        for i in range(top_k_freq):
            features[f'top_freq_{i+1}_loc'] = -1
            features[f'top_freq_{i+1}_count'] = 0
    
    # Feature: Number of unique locations in history
    features['unique_locations'] = len(set(history)) if len(history) > 0 else 0
    
    # Feature: Most recent bigram (last 2 locations as a transition feature)
    if len(history) >= 2:
        features['bigram_loc1'] = history[-2]
        features['bigram_loc2'] = history[-1]
    else:
        features['bigram_loc1'] = -1
        features['bigram_loc2'] = history[-1] if len(history) == 1 else -1
    
    return features


# Calculate global top-K most frequent locations from training data
# (printed for reference only; extract_features does not use them)
print("\nCalculating global statistics from training data...")
all_training_locations = [loc for seq in train_encoded for loc in seq]
global_location_counts = Counter(all_training_locations)
global_top_freq_locations = [loc for loc, _ in global_location_counts.most_common(TOP_K_FREQ)]

print(f"Global top-{TOP_K_FREQ} most frequent locations: {global_top_freq_locations[:5]}...")

print("\nFeature extraction function defined successfully!")
print(f"Total features: {MAX_HISTORY + 1 + (TOP_K_FREQ * 2) + 3} (last_N, history_length, top_freq_loc*K, top_freq_count*K, unique_locations, bigram_loc1, bigram_loc2)")


Setting up feature engineering...
Max history locations: 5
Top-K frequent locations: 10

Calculating global statistics from training data...
Global top-10 most frequent locations: [220, 219, 213, 221, 212]...

Feature extraction function defined successfully!
Total features: 29 (last_N, history_length, top_freq_loc*K, top_freq_count*K, unique_locations, bigram_loc1, bigram_loc2)


## Section 7 — Prepare Training Data

Extract features from all training sequences and prepare feature matrix and labels.
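
Each sequence is expanded into one training sample per prefix, so a 50-event sequence yields 49 (history, next location) pairs. The sketch below illustrates that expansion and the relabeling of targets to consecutive integers on a toy sequence; the helper names are hypothetical and the toy IDs are arbitrary.

```python
def prefix_pairs(sequence):
    """Yield (history, next_item) for every prefix of length >= 1."""
    for i in range(1, len(sequence)):
        yield sequence[:i], sequence[i]

def to_consecutive(labels):
    """Map arbitrary integer labels to 0..n-1 in sorted order (illustrative helper)."""
    mapping = {label: idx for idx, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping

pairs = list(prefix_pairs([213, 220, 219, 220]))
print(pairs[0])    # ([213], 220)
print(len(pairs))  # 3

targets = [nxt for _, nxt in pairs]
remapped, mapping = to_consecutive(targets)
print(remapped)    # [1, 0, 1]

# Sample count for 119 training sequences of 50 events each.
print(119 * (50 - 1))  # 5831
```

The last line matches the 5831 rows of the feature matrix reported in the cell output.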


In [23]:
# Prepare training data
print("Extracting features from training sequences...")

X_train = []
y_train = []

for seq in tqdm(train_encoded, desc="Processing sequences"):
    for i in range(1, len(seq)):
        history = seq[:i]
        next_location = seq[i]
        
        # Extract features
        features = extract_features(history, max_history=MAX_HISTORY, top_k_freq=TOP_K_FREQ)
        
        # Convert to list in consistent order
        feature_vector = [
            features['last_1'], features['last_2'], features['last_3'], 
            features['last_4'], features['last_5'],
            features['history_length'],
            features['unique_locations'],
            features['bigram_loc1'], features['bigram_loc2']
        ]
        # Add top-K frequency features
        for j in range(TOP_K_FREQ):
            feature_vector.append(features[f'top_freq_{j+1}_loc'])
            feature_vector.append(features[f'top_freq_{j+1}_count'])
        
        X_train.append(feature_vector)
        y_train.append(next_location)

X_train = np.array(X_train, dtype=np.float32)
y_train = np.array(y_train, dtype=np.int32)

# Re-encode labels to be consecutive integers starting from 0 (required by XGBoost)
# Create mapping from original encoded IDs to XGBoost class indices
print("\nRe-encoding labels to consecutive integers for XGBoost...")
unique_labels = np.unique(y_train)
label_mapping = {orig_label: new_idx for new_idx, orig_label in enumerate(unique_labels)}
reverse_label_mapping = {new_idx: orig_label for orig_label, new_idx in label_mapping.items()}

# Convert y_train to consecutive indices
y_train_consecutive = np.array([label_mapping[label] for label in y_train], dtype=np.int32)

print(f"Original label range: {y_train.min()} to {y_train.max()}")
print(f"Original unique labels: {len(unique_labels)}")
print(f"Consecutive label range: {y_train_consecutive.min()} to {y_train_consecutive.max()}")
print(f"Consecutive unique labels: {len(np.unique(y_train_consecutive))}")

# Use consecutive labels for training
y_train = y_train_consecutive

print(f"\nTraining data prepared:")
print(f"  Feature matrix shape: {X_train.shape}")
print(f"  Labels shape: {y_train.shape}")
print(f"  Number of features: {X_train.shape[1]}")
print(f"  Number of training samples: {len(X_train)}")
print(f"  Number of classes (consecutive): {len(label_mapping)}")


Extracting features from training sequences...


Processing sequences: 100%|██████████| 119/119 [00:00<00:00, 372.94it/s]



Re-encoding labels to consecutive integers for XGBoost...
Original label range: 0 to 302
Original unique labels: 286
Consecutive label range: 0 to 285
Consecutive unique labels: 286

Training data prepared:
  Feature matrix shape: (5831, 29)
  Labels shape: (5831,)
  Number of features: 29
  Number of training samples: 5831
  Number of classes (consecutive): 286


## Section 8 — Train XGBoost Model

Train XGBoost multi-class classifier on the feature vectors.


In [24]:
# Configure XGBoost parameters
print("Training XGBoost model...")

# Determine actual number of classes from training data
actual_num_classes = len(np.unique(y_train))
print(f"Number of unique classes in training data: {actual_num_classes}")
print(f"Total possible classes (n_states): {n_states}")

xgb_params = {
    'objective': 'multi:softprob',  # Multi-class classification with probabilities
    # Note: num_class will be automatically inferred from y_train by XGBoost
    'max_depth': 6,
    'learning_rate': 0.1,
    'n_estimators': 100,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'random_state': 42,
    'n_jobs': -1,  # Use all available cores
    'eval_metric': 'mlogloss'
}

print(f"\nXGBoost parameters:")
for key, value in xgb_params.items():
    print(f"  {key}: {value}")

# Train XGBoost model
xgb_model = xgb.XGBClassifier(**xgb_params)

print(f"\nTraining on {len(X_train)} samples...")
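xgb_model.fit(
    X_train, 
    y_train,
    verbose=True  # Only prints progress when an eval_set is supplied
)

print("\nXGBoost model trained successfully!")
# n_estimators counts boosting rounds; a multi-class objective fits one tree per class per round
print(f"Model has {xgb_model.n_estimators} trees")
print(f"Number of classes learned: {len(xgb_model.classes_)}")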


Training XGBoost model...
Number of unique classes in training data: 286
Total possible classes (n_states): 303

XGBoost parameters:
  objective: multi:softprob
  max_depth: 6
  learning_rate: 0.1
  n_estimators: 100
  subsample: 0.8
  colsample_bytree: 0.8
  random_state: 42
  n_jobs: -1
  eval_metric: mlogloss

Training on 5831 samples...

XGBoost model trained successfully!
Model has 100 trees
Number of classes learned: 286


## Section 9 — Prediction Functions

Implement prediction functions using XGBoost probability outputs.
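
The core of top-K prediction is ranking the class probabilities and mapping the winning class indices back to encoded location IDs. A minimal, dependency-free sketch of that step follows; the probability vector and the mapping are made-up values, and `top_k_indices` is a hypothetical helper (the notebook itself uses `np.argsort`, whose tie-breaking may differ).

```python
def top_k_indices(probabilities, k=5):
    """Indices of the k largest probabilities, highest first."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return order[:k]

# Hypothetical 4-class output and class-index -> encoded-ID mapping.
probs = [0.10, 0.55, 0.05, 0.30]
reverse_mapping = {0: 212, 1: 213, 2: 219, 3: 220}

top3 = [reverse_mapping[i] for i in top_k_indices(probs, k=3)]
print(top3)  # [213, 220, 212]
```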


In [25]:
def predict_next_location(history):
    """
    Predict next location using XGBoost model.
    
    Args:
        history: List of encoded states (history sequence)
    
    Returns:
        Predicted next state (encoded integer) or None
    """
    if len(history) == 0:
        return None
    
    # Extract features
    features = extract_features(history, max_history=MAX_HISTORY, top_k_freq=TOP_K_FREQ)
    
    # Convert to feature vector in same order as training
    feature_vector = [
        features['last_1'], features['last_2'], features['last_3'], 
        features['last_4'], features['last_5'],
        features['history_length'],
        features['unique_locations'],
        features['bigram_loc1'], features['bigram_loc2']
    ]
    for j in range(TOP_K_FREQ):
        feature_vector.append(features[f'top_freq_{j+1}_loc'])
        feature_vector.append(features[f'top_freq_{j+1}_count'])
    
    feature_array = np.array([feature_vector], dtype=np.float32)
    
    # Predict class (returns XGBoost class index)
    xgb_class_idx = xgb_model.predict(feature_array)[0]
    
    # Map back to original encoded ID
    if xgb_class_idx in reverse_label_mapping:
        return int(reverse_label_mapping[xgb_class_idx])
    else:
        # No fallback is applied: return None so the caller can skip this case
        return None


def predict_top_k(history, k=5):
    """
    Get top-K most likely next locations using XGBoost model.
    
    Args:
        history: List of encoded states (history sequence)
        k: Number of top predictions to return
    
    Returns:
        List of top-K encoded states sorted by probability
    """
    if len(history) == 0:
        return []
    
    # Extract features
    features = extract_features(history, max_history=MAX_HISTORY, top_k_freq=TOP_K_FREQ)
    
    # Convert to feature vector in same order as training
    feature_vector = [
        features['last_1'], features['last_2'], features['last_3'], 
        features['last_4'], features['last_5'],
        features['history_length'],
        features['unique_locations'],
        features['bigram_loc1'], features['bigram_loc2']
    ]
    for j in range(TOP_K_FREQ):
        feature_vector.append(features[f'top_freq_{j+1}_loc'])
        feature_vector.append(features[f'top_freq_{j+1}_count'])
    
    feature_array = np.array([feature_vector], dtype=np.float32)
    
    # Get probability distribution (returns probabilities for XGBoost class indices)
    probabilities = xgb_model.predict_proba(feature_array)[0]
    
    # Get top-K class indices sorted by probability
    top_k_xgb_indices = np.argsort(probabilities)[-k:][::-1]
    
    # Map back to original encoded IDs
    top_k_locations = []
    for xgb_idx in top_k_xgb_indices:
        if xgb_idx in reverse_label_mapping:
            top_k_locations.append(int(reverse_label_mapping[xgb_idx]))
    
    return top_k_locations


print("Prediction functions defined successfully!")
print("\nFunction summary:")
print("  - predict_next_location(history): Returns single most likely next state")
print("  - predict_top_k(history, k=5): Returns top-K most likely next states")


Prediction functions defined successfully!

Function summary:
  - predict_next_location(history): Returns single most likely next state
  - predict_top_k(history, k=5): Returns top-K most likely next states


## Section 10 — Save Model

Save the trained XGBoost model, encoder, mappings, and feature extraction parameters.


In [26]:
# Save model
print("Saving model...")
model_data = {
    'xgb_model': xgb_model,
    'label_encoder': le,
    'encoded_to_placeid': encoded_to_placeid,
    'label_mapping': label_mapping,  # Original encoded ID -> XGBoost class index
    'reverse_label_mapping': reverse_label_mapping,  # XGBoost class index -> Original encoded ID
    'n_states': n_states,
    'max_history': MAX_HISTORY,
    'top_k_freq': TOP_K_FREQ,
    # Same order as the feature vectors built for training and prediction:
    # each top-K location is immediately followed by its count
    'feature_order': ['last_1', 'last_2', 'last_3', 'last_4', 'last_5', 
                      'history_length', 'unique_locations', 'bigram_loc1', 'bigram_loc2'] + 
                     [name for j in range(TOP_K_FREQ)
                      for name in (f'top_freq_{j+1}_loc', f'top_freq_{j+1}_count')]
}

with open(MODEL_SAVE_PATH, 'wb') as f:
    pickle.dump(model_data, f)

print(f"Model saved to {MODEL_SAVE_PATH}")
print(f"Model contains:")
print(f"  - XGBoost classifier")
print(f"  - LabelEncoder and mappings")
print(f"  - Feature extraction parameters")
print(f"  - Number of features: {X_train.shape[1]}")


Saving model...
Model saved to /home/root495/Inexture/Location Prediction Update/models/xgboost_model.pkl
Model contains:
  - XGBoost classifier
  - LabelEncoder and mappings
  - Feature extraction parameters
  - Number of features: 29


## Section 11 — Evaluation Setup

Prepare test sequences and helper functions for evaluation metrics.
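
When a place has no stored coordinates, the helper in the next cell falls back to the grid: a place ID of the form "row_col" is converted to a latitude and longitude from the grid origin and cell size. The sketch below shows that arithmetic in isolation. The key names follow the notebook's `grid_metadata`, but the numeric values are hypothetical and chosen only to make the result easy to check.

```python
def grid_cell_to_coords(place_id, grid):
    """Convert a "row_col" place ID to (lat, lon); return None if malformed."""
    try:
        row, col = (int(part) for part in str(place_id).split("_"))
    except ValueError:
        return None
    return (grid["min_lat"] + row * grid["deg_lat"],
            grid["min_lon"] + col * grid["deg_lon"])

# Hypothetical grid values, not taken from the real metadata file.
grid = {"min_lat": 39.0, "min_lon": 115.0, "deg_lat": 0.5, "deg_lon": 0.25}

print(grid_cell_to_coords("2_4", grid))      # (40.0, 116.0)
print(grid_cell_to_coords("unknown", grid))  # None
```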


In [27]:
# Use first test sequence for evaluation (same as other notebooks for fair comparison)
test_sequence = test_encoded[0]
print(f"Test sequence length: {len(test_sequence)} events")

# Create test cases: history -> next location
test_cases = []
for i in range(1, len(test_sequence)):
    history = test_sequence[:i]
    true_next = test_sequence[i]
    test_cases.append((history, true_next))

print(f"Created {len(test_cases)} test cases")

# Load grid metadata and coordinates for MPD calculation
with open(GRID_METADATA_FILE, 'r') as f:
    grid_metadata = json.load(f)

df_places = pd.read_csv(CLEANED_WITH_PLACES_FILE)
place_coords = df_places.groupby('place_id')[['lat', 'lon']].first().to_dict('index')

print(f"Loaded coordinates for {len(place_coords)} places")

# Helper function to get coordinates from place_id
def place_id_to_coords(place_id, place_coords, grid_metadata):
    """Get coordinates from place_id"""
    if place_id is None:
        return None, None
    
    # Try to find in place_coords first
    if place_id in place_coords:
        return place_coords[place_id]['lat'], place_coords[place_id]['lon']
    
    # Fallback: calculate from grid if place_id has format "row_col"
    try:
        if "_" in str(place_id):
            row, col = map(int, str(place_id).split("_"))
            lat = grid_metadata['min_lat'] + row * grid_metadata['deg_lat']
            lon = grid_metadata['min_lon'] + col * grid_metadata['deg_lon']
            return lat, lon
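    except (ValueError, KeyError, TypeError):
        # Malformed place_id or incomplete grid metadata: fall through to (None, None)
        pass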
    
    return None, None

print("Evaluation setup complete!")


Test sequence length: 50 events
Created 49 test cases
Loaded coordinates for 2073 places
Evaluation setup complete!


## Section 12 — Metric 1: Accuracy

Calculate accuracy: fraction of predictions that exactly match the true next location.


In [28]:
# Calculate Accuracy
print("Calculating Accuracy...")
predictions = []
true_labels = []

for history, true_next in tqdm(test_cases, desc="Making predictions"):
    pred = predict_next_location(history)
    if pred is not None:
        predictions.append(pred)
        true_labels.append(true_next)

# Calculate accuracy
if len(predictions) == 0:
    print("ERROR: No predictions were made!")
    accuracy = 0
    correct = 0
    total = 0
else:
    correct = sum(1 for p, t in zip(predictions, true_labels) if p == t)
    total = len(predictions)
    accuracy = correct / total if total > 0 else 0
    
    # Debug: Show first few predictions vs true
    print(f"\nDebug - First 5 predictions:")
    for i in range(min(5, len(predictions))):
        pred_place = encoded_to_placeid.get(predictions[i], "Unknown")
        true_place = encoded_to_placeid.get(true_labels[i], "Unknown")
        match = "✓" if predictions[i] == true_labels[i] else "✗"
        print(f"  {match} Pred: {predictions[i]} ({pred_place[:20]}) | True: {true_labels[i]} ({true_place[:20]})")

print(f"\n{'='*60}")
print(f"METRIC 1: ACCURACY")
print(f"{'='*60}")
print(f"Correct predictions: {correct}")
print(f"Total predictions: {total}")
print(f"Accuracy: {accuracy:.12f}")
print(f"{'='*60}")


Calculating Accuracy...


Making predictions: 100%|██████████| 49/49 [00:00<00:00, 146.09it/s]


Debug - First 5 predictions:
  ✗ Pred: 219 (296_2075) | True: 213 (295_2076)
  ✗ Pred: 220 (296_2076) | True: 214 (295_2077)
  ✓ Pred: 213 (295_2076) | True: 213 (295_2076)
  ✓ Pred: 220 (296_2076) | True: 220 (296_2076)
  ✓ Pred: 219 (296_2075) | True: 219 (296_2075)

METRIC 1: ACCURACY
Correct predictions: 17
Total predictions: 49
Accuracy: 0.346938775510





## Section 13 — Metric 2: Precision & Recall

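**Definition**: 
- **Precision**: Of the times a location was predicted, the fraction that were correct, averaged over locations and weighted by how often each location is the true next location
- **Recall**: Of the times a location was the true next location, the fraction that were predicted correctly, weighted the same way

Both scores are computed with scikit-learn's `average='weighted'`. With this weighting, recall equals overall accuracy, which is why the two values match in the outputs.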
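
To show what the weighting does, here is a dependency-free sketch of support-weighted precision and recall on a toy example. It is meant to mirror the behaviour of scikit-learn's `average='weighted'` with `zero_division=0`, but it is an illustration rather than the library's implementation; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Support-weighted precision and recall; undefined precision counts as 0."""
    support = Counter(y_true)      # true occurrences per class
    predicted = Counter(y_pred)    # predicted occurrences per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = sum(
        support[c] / n * (hits[c] / predicted[c] if predicted[c] else 0.0)
        for c in support
    )
    recall = sum(support[c] / n * (hits[c] / support[c]) for c in support)
    return precision, recall

y_true = [1, 1, 2, 3]
y_pred = [1, 2, 2, 2]
precision, recall = weighted_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Weighted recall and accuracy coincide; precision is computed separately.
print(round(precision, 4), recall, accuracy)  # 0.5833 0.5 0.5
```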


In [29]:
# Calculate Precision & Recall (Weighted)
print("Calculating Precision & Recall (Weighted)...")

if len(predictions) > 0:
    precision_weighted = precision_score(true_labels, predictions, average='weighted', zero_division=0)
    recall_weighted = recall_score(true_labels, predictions, average='weighted', zero_division=0)
else:
    precision_weighted = recall_weighted = 0

print(f"\n{'='*60}")
print(f"METRIC 2: PRECISION & RECALL")
print(f"{'='*60}")
print(f"Precision: {precision_weighted:.12f}")
print(f"Recall: {recall_weighted:.12f}")
print(f"{'='*60}")


Calculating Precision & Recall (Weighted)...

METRIC 2: PRECISION & RECALL
Precision: 0.251753008896
Recall: 0.346938775510


## Section 14 — Metric 3: Top-K Accuracy

**Definition**: A prediction counts as correct if the true next location appears among the K most probable predicted locations. Results are reported for K = 1, 3, and 5.
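
As a small illustration of the metric, the sketch below scores ranked prediction lists against true labels. The function name and the toy data are hypothetical; in the notebook the ranked lists come from `predict_top_k`.

```python
def top_k_accuracy(ranked_predictions, truths, k):
    """Fraction of cases whose true label is among the first k ranked predictions."""
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_predictions, truths))
    return hits / len(truths) if truths else 0.0

# Hypothetical ranked predictions (best first) for three test cases.
ranked = [[213, 220, 219], [220, 219, 213], [219, 213, 220]]
truths = [213, 213, 212]

for k in (1, 3):
    print(k, top_k_accuracy(ranked, truths, k))
```

Because the top-K list for a larger K contains the list for a smaller K, the score can only stay the same or rise as K grows.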


In [30]:
# Calculate Top-K Accuracy
print("Calculating Top-K Accuracy...")

k_values = [1, 3, 5]
top_k_results = {}

for k in k_values:
    correct_k = 0
    total_k = 0
    
    for history, true_next in tqdm(test_cases, desc=f"Top-{k}"):
        top_k_preds = predict_top_k(history, k=k)
        if top_k_preds:
            total_k += 1
            if true_next in top_k_preds:
                correct_k += 1
    
    top_k_accuracy = correct_k / total_k if total_k > 0 else 0
    
    top_k_results[k] = {
        'correct': correct_k,
        'total': total_k,
        'accuracy': top_k_accuracy
    }

print(f"\n{'='*60}")
print(f"METRIC 3: TOP-K ACCURACY")
print(f"{'='*60}")
for k in k_values:
    result = top_k_results[k]
    print(f"Top-{k} Accuracy: {result['accuracy']:.12f}")
print(f"{'='*60}")


Calculating Top-K Accuracy...


Top-1: 100%|██████████| 49/49 [00:00<00:00, 84.32it/s]
Top-3: 100%|██████████| 49/49 [00:00<00:00, 142.29it/s]
Top-5: 100%|██████████| 49/49 [00:00<00:00, 162.54it/s]


METRIC 3: TOP-K ACCURACY
Top-1 Accuracy: 0.346938775510
Top-3 Accuracy: 0.448979591837
Top-5 Accuracy: 0.551020408163





## Section 15 — Metric 4: Mean Prediction Distance (MPD)

**Definition**: Average Haversine distance (in meters) between the actual next location and the predicted next location.

Lower is better: a correct prediction contributes 0 m, so MPD shows how far off the predictions are on average.
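
For reference, the Haversine distance can be written out with the standard library alone. This is a sketch under stated assumptions, not the `haversine` package's code: the Earth radius constant is an assumed mean value, the function names are hypothetical, and coordinates are (lat, lon) in degrees.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius for this sketch

def haversine_m(point_a, point_b):
    """Great-circle distance in meters between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h)) * 1000

def mean_prediction_distance(predicted, actual):
    """Average Haversine distance in meters over paired coordinates."""
    distances = [haversine_m(p, a) for p, a in zip(predicted, actual)]
    return sum(distances) / len(distances) if distances else 0.0

# One degree of longitude at the equator is roughly 111 km.
print(round(haversine_m((0.0, 0.0), (0.0, 1.0)) / 1000, 1))  # 111.2
```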


In [31]:
# Calculate Mean Prediction Distance (MPD)
print("Calculating Mean Prediction Distance (MPD)...")

distances = []
failed_conversions = 0

for history, true_next in tqdm(test_cases, desc="Calculating distances"):
    pred = predict_next_location(history)
    
    if pred is not None:
        # Convert encoded IDs back to place_ids
        pred_place_id = encoded_to_placeid.get(pred)
        true_place_id = encoded_to_placeid.get(true_next)
        
        if pred_place_id and true_place_id:
            # Get coordinates
            pred_lat, pred_lon = place_id_to_coords(pred_place_id, place_coords, grid_metadata)
            true_lat, true_lon = place_id_to_coords(true_place_id, place_coords, grid_metadata)
            
            if pred_lat is not None and true_lat is not None:
                # Calculate haversine distance
                try:
                    distance_m = haversine((pred_lat, pred_lon), (true_lat, true_lon)) * 1000
                    # Filter out unrealistic distances (likely coordinate errors)
                    if distance_m < 1000000:  # Less than 1000 km
                        distances.append(distance_m)
                    else:
                        failed_conversions += 1
                except (ValueError, TypeError):
                    failed_conversions += 1
            else:
                failed_conversions += 1
        else:
            failed_conversions += 1
    else:
        failed_conversions += 1

if failed_conversions > 0:
    print(f"Warning: {failed_conversions} distance calculations failed or were filtered")

mpd = np.mean(distances) if len(distances) > 0 else 0
mpd_median = np.median(distances) if len(distances) > 0 else 0
mpd_std = np.std(distances) if len(distances) > 0 else 0

print(f"\n{'='*60}")
print(f"METRIC 4: MEAN PREDICTION DISTANCE (MPD)")
print(f"{'='*60}")
print(f"MPD Distance: {mpd:.12f} meters")
print(f"Valid distance calculations: {len(distances)}/{len(test_cases)}")
print(f"{'='*60}")


Calculating Mean Prediction Distance (MPD)...


Calculating distances: 100%|██████████| 49/49 [00:01<00:00, 47.62it/s]


METRIC 4: MEAN PREDICTION DISTANCE (MPD)
MPD Distance: 15229.258501444867 meters
Valid distance calculations: 49/49





In [32]:
# Compile all results
results = {
    'num_users': NUM_USERS,
    'selected_users': selected_users,
    'preprocessing': {
        'total_original_places': total_original,
        'total_after_duplicate_removal': total_processed,
        'total_duplicates_removed': total_original - total_processed,
        'sequence_length': SEQUENCE_LENGTH,
        'total_sequences': len(all_sequences),
        'training_sequences': len(train_sequences),
        'test_sequences': len(test_sequences)
    },
    'model': {
        'unique_states': n_states,
        'model_type': 'xgboost',
        'num_features': X_train.shape[1],
        'max_history': MAX_HISTORY,
        'top_k_freq': TOP_K_FREQ,
        'n_estimators': xgb_model.n_estimators,
        'max_depth': xgb_model.max_depth
    },
    'accuracy': {
        'value': accuracy,
        'correct': correct,
        'total': total
    },
    'precision_recall': {
        'precision': float(precision_weighted),
        'recall': float(recall_weighted)
    },
    'top_k_accuracy': {
        f'top_{k}_accuracy': float(top_k_results[k]['accuracy']) for k in k_values
    },
    'mpd_distance': {
        'mpd_distance_meters': float(mpd),
        'valid_calculations': len(distances)
    }
}

# Display summary
print(f"\n{'='*60}")
print(f"EVALUATION RESULTS SUMMARY")
print(f"{'='*60}")
print(f"\nNumber of users: {NUM_USERS}")
print(f"Users: {selected_users}")
print(f"Total original places: {total_original}")
print(f"After duplicate removal: {total_processed}")
print(f"Training sequences: {len(train_sequences)}")
print(f"Test sequences: {len(test_sequences)}")

print(f"\n1. ACCURACY")
print(f"   Accuracy: {accuracy:.12f}")

print(f"\n2. PRECISION & RECALL")
print(f"   Precision: {precision_weighted:.12f}")
print(f"   Recall: {recall_weighted:.12f}")

print(f"\n3. TOP-K ACCURACY")
for k in k_values:
    acc = top_k_results[k]['accuracy']
    print(f"   Top-{k} Accuracy: {acc:.12f}")

print(f"\n4. MEAN PREDICTION DISTANCE (MPD)")
print(f"   MPD Distance: {mpd:.12f} meters")

print(f"\n{'='*60}")

# Save results
with open(RESULTS_SAVE_PATH, 'w') as f:
    json.dump(results, f, indent=2)

print(f"\nResults saved to {RESULTS_SAVE_PATH}")

# Create results DataFrame (Top-K rows follow k_values, like the summary above)
results_df = pd.DataFrame({
    'Metric': (
        ['Accuracy', 'Precision', 'Recall']
        + [f'Top-{k} Accuracy' for k in k_values]
        + ['MPD Distance']
    ),
    'Value': (
        [f"{accuracy:.12f}", f"{precision_weighted:.12f}", f"{recall_weighted:.12f}"]
        + [f"{top_k_results[k]['accuracy']:.12f}" for k in k_values]
        + [f"{mpd:.12f}"]
    )
})

print("\nResults Table:")
print(results_df.to_string(index=False))



EVALUATION RESULTS SUMMARY

Number of users: 10
Users: ['000', '001', '005', '006', '009', '011', '014', '016', '019', '025']
Total original places: 1752364
After duplicate removal: 4087
Training sequences: 119
Test sequences: 30

1. ACCURACY
   Accuracy: 0.346938775510

2. PRECISION & RECALL
   Precision: 0.251753008896
   Recall: 0.346938775510

3. TOP-K ACCURACY
   Top-1 Accuracy: 0.346938775510
   Top-3 Accuracy: 0.448979591837
   Top-5 Accuracy: 0.551020408163

4. MEAN PREDICTION DISTANCE (MPD)
   MPD Distance: 15229.258501444867 meters


Results saved to /home/root495/Inexture/Location Prediction Update/results/xgboost_results.json

Results Table:
        Metric              Value
      Accuracy     0.346938775510
     Precision     0.251753008896
        Recall     0.346938775510
Top-1 Accuracy     0.346938775510
Top-3 Accuracy     0.448979591837
Top-5 Accuracy     0.551020408163
  MPD Distance 15229.258501444867
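
Two rows in the table above match Accuracy by construction. Top-1 accuracy counts a hit only when the highest-ranked class is the true one, which is the definition of accuracy. Support-weighted recall averages per-class recall with weights equal to each class's share of the test labels, so the class counts cancel and the sum reduces to total hits over total samples. A minimal sketch on made-up data follows. The labels and rankings are hypothetical, and it assumes `recall_weighted` was computed with support-weighted averaging, as its name suggests.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights n_c / N (support weighting)."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)

def top_k_accuracy(y_true, ranked, k):
    """Fraction of samples whose true label is among the k best-ranked classes."""
    return sum(t in r[:k] for t, r in zip(y_true, ranked)) / len(y_true)

# Hypothetical test labels and per-sample class rankings (best guess first)
y_true = ["home", "work", "gym", "home", "work", "cafe"]
ranked = [
    ["home", "work", "gym"],
    ["home", "work", "cafe"],
    ["gym", "home", "work"],
    ["work", "home", "gym"],
    ["work", "home", "cafe"],
    ["home", "work", "gym"],
]
y_pred = [r[0] for r in ranked]  # the top-ranked class is the prediction

print("accuracy:       ", accuracy(y_true, y_pred))
print("weighted recall:", weighted_recall(y_true, y_pred))
for k in (1, 2, 3):
    print(f"top-{k} accuracy:", top_k_accuracy(y_true, ranked, k))
```

Weighted precision does not collapse in the same way, because each class's hits are divided by its predicted count, not by its support. That is why Precision can differ from Recall in the table.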


In [33]:
# Update models_comparison.csv
comparison_file = RESULTS_PATH + "models_comparison.csv"

# XGBoost metrics, formatted once as fixed-precision strings and reused by
# every branch below
xgb_metrics = {
    'Accuracy': f"{accuracy:.12f}",
    'Precision': f"{precision_weighted:.12f}",
    'Recall': f"{recall_weighted:.12f}",
    'Top-1 Accuracy': f"{top_k_results[1]['accuracy']:.12f}",
    'Top-3 Accuracy': f"{top_k_results[3]['accuracy']:.12f}",
    'Top-5 Accuracy': f"{top_k_results[5]['accuracy']:.12f}",
    'MPD Distance (meters)': f"{mpd:.12f}"
}
xgb_row = pd.DataFrame([{'Model': 'XGBoost', **xgb_metrics}])

try:
    # Read existing comparison file
    comparison_df = pd.read_csv(comparison_file)

    # Check if XGBoost row already exists
    if 'XGBoost' in comparison_df['Model'].values:
        # Update existing row
        mask = comparison_df['Model'] == 'XGBoost'
        for column, value in xgb_metrics.items():
            comparison_df.loc[mask, column] = value
        print("Updated existing XGBoost row in models_comparison.csv")
    else:
        # Add new row
        comparison_df = pd.concat([comparison_df, xgb_row], ignore_index=True)
        print("Added new XGBoost row to models_comparison.csv")

    # Save updated comparison file
    comparison_df.to_csv(comparison_file, index=False)
    print(f"Updated {comparison_file}")

    # Display updated comparison
    print("\nUpdated Models Comparison:")
    print(comparison_df.to_string(index=False))

except FileNotFoundError:
    # Create new comparison file if it doesn't exist
    comparison_df = xgb_row
    comparison_df.to_csv(comparison_file, index=False)
    print(f"Created new {comparison_file}")
except Exception as e:
    print(f"Warning: Could not update models_comparison.csv: {e}")
    print("Results have been saved to JSON file. Please update CSV manually if needed.")


Added new XGBoost row to models_comparison.csv
Updated /home/root495/Inexture/Location Prediction Update/results/models_comparison.csv

Updated Models Comparison:
         Model       Accuracy      Precision         Recall Top-1 Accuracy Top-3 Accuracy Top-5 Accuracy MPD Distance (meters)
           HMM       0.653061       0.605081       0.653061       0.653061       0.897959       0.918367           4364.404451
           GNN       0.504762       0.438886       0.504762       0.504762       0.691837       0.787075           3216.861429
        Fusion       0.498639        0.44425       0.498639       0.498639       0.768027       0.819728           5196.347567
  Markov Chain       0.693878       0.730539       0.693878       0.693878       0.918367       0.918367            3691.02685
KNN Trajectory       0.142857       0.020833       0.142857       0.142857       0.367347        0.44898          17011.943967
       XGBoost 0.346938775510 0.251753008896 0.346938775510 0.346938775510 