# KNN Trajectory Training on 10 Users' Trajectories

This notebook:
- Loads 10 users' trajectories (same as HMM/GNN/Fusion/Markov Chain models)
- Removes consecutive duplicates (AAABCDCCABB → ABCDCAB) for each user
- Creates overlapping sequences of length 50 (sliding window, step 25) from all users
- Trains a K-Nearest Neighbors (KNN) trajectory model using sequence similarity
- Evaluates all 4 metrics: Accuracy, Precision & Recall, Top-K Accuracy, and Mean Prediction Distance (MPD)


## Section 1 — Imports & Setup


In [1]:
import os
import pandas as pd
import numpy as np
import json
import pickle
from tqdm import tqdm
from haversine import haversine
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_score, recall_score
from collections import defaultdict, Counter
import warnings
warnings.filterwarnings('ignore')

# Set random seeds
np.random.seed(42)

# Paths
BASE_PATH = "/home/root495/Inexture/Location Prediction Update"
PROCESSED_PATH = BASE_PATH + "/data/processed/"
SEQUENCES_FILE = PROCESSED_PATH + "place_sequences.json"
GRID_METADATA_FILE = PROCESSED_PATH + "grid_metadata.json"
CLEANED_WITH_PLACES_FILE = PROCESSED_PATH + "cleaned_with_places.csv"
OUTPUT_PATH = BASE_PATH + "/notebooks/"
MODELS_PATH = BASE_PATH + "/models/"
RESULTS_PATH = BASE_PATH + "/results/"
MODEL_SAVE_PATH = MODELS_PATH + "knn_trajectory_model.pkl"
RESULTS_SAVE_PATH = RESULTS_PATH + "knn_trajectory_results.json"

os.makedirs(OUTPUT_PATH, exist_ok=True)
os.makedirs(MODELS_PATH, exist_ok=True)
os.makedirs(RESULTS_PATH, exist_ok=True)

print("Libraries imported successfully!")


Libraries imported successfully!


## Section 2 — Load 10 Users' Trajectories


In [2]:
# Load place sequences
print("Loading place sequences...")
with open(SEQUENCES_FILE, 'r') as f:
    sequences_dict = json.load(f)

print(f"Total users available: {len(sequences_dict)}")

# Select first 10 users (same as HMM/GNN/Fusion/Markov Chain models)
user_ids = list(sequences_dict.keys())
NUM_USERS = 10
selected_users = user_ids[:NUM_USERS]

print(f"\nSelected {NUM_USERS} users: {selected_users}")

# Load sequences for all selected users
user_sequences = {}
total_places = 0
for user_id in selected_users:
    seq = sequences_dict[user_id]
    user_sequences[user_id] = seq
    total_places += len(seq)
    print(f"  User {user_id}: {len(seq)} places")

print(f"\nTotal places across all {NUM_USERS} users: {total_places}")


Loading place sequences...
Total users available: 54

Selected 10 users: ['000', '001', '005', '006', '009', '011', '014', '016', '019', '025']
  User 000: 173817 places
  User 001: 108561 places
  User 005: 108967 places
  User 006: 31809 places
  User 009: 84573 places
  User 011: 90770 places
  User 014: 388051 places
  User 016: 89208 places
  User 019: 47792 places
  User 025: 628816 places

Total places across all 10 users: 1752364


## Section 3 — Preprocess: Remove Consecutive Duplicates

Remove consecutive duplicate locations for each user. Example: AAABCDCCABB → ABCDCAB

Only consecutive duplicates are removed. If a location appears again later (non-consecutive), it is kept.
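
As a compact cross-check of the rule above, the same collapse can be sketched with `itertools.groupby` from the standard library. The helper name `collapse_runs` is illustrative only and is not used elsewhere in this notebook.

```python
from itertools import groupby


def collapse_runs(sequence):
    """Keep one element per run of equal consecutive elements."""
    # groupby yields one (key, run) pair for every run of equal neighbors,
    # so a location that reappears later starts a new run and is kept.
    return [key for key, _run in groupby(sequence)]


example = list("AAABCDCCABB")
collapsed = collapse_runs(example)
print("".join(collapsed))  # ABCDCAB
```

Because only neighboring repeats are merged, `C`, `A`, and `B` each appear twice in the result.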


In [3]:
def remove_consecutive_duplicates(sequence):
    """
    Remove consecutive duplicates from sequence.
    Example: [A, A, A, B, C, D, C, C, A, B, B] → [A, B, C, D, C, A, B]
    """
    if len(sequence) == 0:
        return sequence
    
    processed = [sequence[0]]  # Always keep first element
    
    for i in range(1, len(sequence)):
        # Only add if different from previous (not consecutive duplicate)
        if sequence[i] != sequence[i-1]:
            processed.append(sequence[i])
    
    return processed

# Apply consecutive duplicate removal to each user
processed_sequences = {}
total_original = 0
total_processed = 0

print("Processing users...")
for user_id in tqdm(selected_users, desc="Removing duplicates"):
    original_seq = user_sequences[user_id]
    processed_seq = remove_consecutive_duplicates(original_seq)
    processed_sequences[user_id] = processed_seq
    
    original_len = len(original_seq)
    processed_len = len(processed_seq)
    total_original += original_len
    total_processed += processed_len
    
    reduction = original_len - processed_len
    reduction_pct = (reduction / original_len * 100) if original_len > 0 else 0
    print(f"  User {user_id}: {original_len} → {processed_len} places ({reduction_pct:.1f}% reduction)")

print(f"\nSummary:")
print(f"  Total original places: {total_original}")
print(f"  Total after processing: {total_processed}")
print(f"  Total duplicates removed: {total_original - total_processed} ({((total_original - total_processed)/total_original*100):.1f}%)")


Processing users...

  User 000: 173817 → 795 places (99.5% reduction)
  User 001: 108561 → 186 places (99.8% reduction)
  User 005: 108967 → 283 places (99.7% reduction)
  User 006: 31809 → 103 places (99.7% reduction)
  User 009: 84573 → 17 places (100.0% reduction)
  User 011: 90770 → 125 places (99.9% reduction)
  User 014: 388051 → 766 places (99.8% reduction)
  User 016: 89208 → 124 places (99.9% reduction)
  User 019: 47792 → 120 places (99.7% reduction)
  User 025: 628816 → 1568 places (99.8% reduction)

Removing duplicates: 100%|██████████| 10/10 [00:00<00:00, 41.91it/s]


Summary:
  Total original places: 1752364
  Total after processing: 4087
  Total duplicates removed: 1748277 (99.8%)





## Section 4 — Create Sequences of Length 50

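Slide a window of 50 events over each user's processed sequence with a step of 25 (50% overlap), keeping only full-length windows. Users with fewer than 50 events after duplicate removal contribute no sequences.
Windows from all users are concatenated in user order and split 80/20 by position, without shuffling.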
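
The windowing can be sketched on a toy sequence as follows. This is a minimal illustration that assumes a window length of 50 and a step of 25; the helper name `make_windows` is hypothetical and not part of the notebook.

```python
def make_windows(sequence, length=50, step=25):
    """Return every full-length window, advancing `step` items each time."""
    last_start = len(sequence) - length
    # range() is empty when the sequence is shorter than one window
    return [sequence[i:i + length] for i in range(0, last_start + 1, step)]


toy = list(range(120))  # stand-in for one user's deduplicated places
windows = make_windows(toy)
print(len(windows))                   # 3 (windows start at 0, 25, 50)
print(windows[1][0], windows[1][-1])  # 25 74
```

For a sequence of n >= 50 events this yields (n - 50) // 25 + 1 windows, and none for shorter sequences.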


In [4]:
# Create sequences of fixed length 50
SEQUENCE_LENGTH = 50

# Use sliding windows for more training data (overlap helps with learning)
# Create overlapping sequences with step size of 25 (50% overlap)
all_sequences = []
step_size = 25  # Overlap of 50%

print("Creating sequences from all users...")
for user_id in tqdm(selected_users, desc="Processing users"):
    processed_seq = processed_sequences[user_id]
    user_sequences_list = []
    
    for i in range(0, len(processed_seq) - SEQUENCE_LENGTH + 1, step_size):
        chunk = processed_seq[i:i+SEQUENCE_LENGTH]
        if len(chunk) == SEQUENCE_LENGTH:  # Only full-length sequences
            user_sequences_list.append(chunk)
    
    all_sequences.extend(user_sequences_list)
    print(f"  User {user_id}: {len(user_sequences_list)} sequences")

print(f"\nTotal sequences created: {len(all_sequences)}")
print(f"Total events in sequences: {sum(len(s) for s in all_sequences)}")

# Split into train/test (80/20)
split_idx = int(len(all_sequences) * 0.8)
train_sequences = all_sequences[:split_idx]
test_sequences = all_sequences[split_idx:]

print(f"\nTraining sequences: {len(train_sequences)}")
print(f"Test sequences: {len(test_sequences)}")

if len(test_sequences) == 0:
    # If no test sequences, use last training sequence for testing
    test_sequences = [train_sequences[-1]]
    train_sequences = train_sequences[:-1]
    print(f"Adjusted: Training={len(train_sequences)}, Test=1 (using last training sequence)")


Creating sequences from all users...


Processing users: 100%|██████████| 10/10 [00:00<00:00, 6956.88it/s]

  User 000: 30 sequences
  User 001: 6 sequences
  User 005: 10 sequences
  User 006: 3 sequences
  User 009: 0 sequences
  User 011: 4 sequences
  User 014: 29 sequences
  User 016: 3 sequences
  User 019: 3 sequences
  User 025: 61 sequences

Total sequences created: 149
Total events in sequences: 7450

Training sequences: 119
Test sequences: 30





## Section 5 — Encode Sequences

Encode place_ids to integers for KNN trajectory training.


In [5]:
# Encode sequences to integers
print("Encoding sequences...")
le = LabelEncoder()

# Flatten all sequences for encoding
all_places = [place for seq in train_sequences + test_sequences for place in seq]
le.fit(all_places)

n_states = len(le.classes_)
print(f"Unique places across all users: {n_states}")

# Encode training sequences
train_encoded = []
for seq in train_sequences:
    encoded = le.transform(seq).tolist()
    train_encoded.append(encoded)

# Encode test sequences
test_encoded = []
for seq in test_sequences:
    encoded = le.transform(seq).tolist()
    test_encoded.append(encoded)

print(f"Encoded {len(train_encoded)} training sequences")
print(f"Encoded {len(test_encoded)} test sequences")

# Create mapping from encoded ID to original place_id for coordinate lookup
# (LabelEncoder assigns each class its index in le.classes_)
encoded_to_placeid = {idx: place_id for idx, place_id in enumerate(le.classes_)}

print(f"Created mapping for {len(encoded_to_placeid)} place IDs")


Encoding sequences...
Unique places across all users: 303
Encoded 119 training sequences
Encoded 30 test sequences
Created mapping for 303 place IDs


## Section 6 — Build KNN Trajectory Model

Store all training sequences as reference trajectories and implement similarity functions.

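The KNN model ranks reference sequences against the query history using:
- Longest Common Prefix (LCP): the number of consecutive positions, counted from the start, at which the two sequences hold the same location
- Sequence similarity: each matched prefix position i (1-based) contributes a weight of 1/i, so earlier matches count more; the sum is normalized by the maximum score possible for the query length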
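
A small worked example may make the scoring concrete. The sketch below assumes the score is the harmonic-weighted length of the common prefix divided by the best score attainable for the query length; the names `lcp_length`, `harmonic`, and `prefix_similarity` are illustrative stand-ins, not the notebook's own functions.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def harmonic(n):
    """1/1 + 1/2 + ... + 1/n (0 when n == 0)."""
    return sum(1.0 / (i + 1) for i in range(n))


def prefix_similarity(query, reference):
    """Harmonic-weighted LCP score, normalized by the query length."""
    if not query:
        return 0.0
    return harmonic(lcp_length(query, reference)) / harmonic(len(query))


query = [7, 3, 5, 2]
# LCP = 2, so the score is (1 + 1/2) / (1 + 1/2 + 1/3 + 1/4), about 0.72
print(prefix_similarity(query, [7, 3, 9, 2]))
# A mismatch at the first position gives LCP = 0 and a score of 0,
# even though the remaining three positions agree
print(prefix_similarity(query, [1, 3, 5, 2]))
```

The second call shows a consequence of prefix matching: a reference that differs at the first position scores zero regardless of how much of the rest agrees.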


In [6]:
# KNN Configuration
K_NEIGHBORS = 5  # Number of nearest neighbors to consider

print("Building KNN Trajectory Model...")
print(f"Storing {len(train_encoded)} training sequences as reference trajectories")

# Store all training sequences
reference_sequences = train_encoded.copy()
print(f"Reference trajectories stored: {len(reference_sequences)}")

# Calculate state frequencies for fallback
print("Calculating state frequencies for fallback...")
state_counts = Counter()
for seq in train_encoded:
    for state in seq:
        state_counts[state] += 1

most_frequent_state = state_counts.most_common(1)[0][0] if state_counts else None
print(f"Most frequent state: {most_frequent_state} (count: {state_counts[most_frequent_state]})")


def longest_common_prefix(seq1, seq2):
    """
    Find the length of the longest common prefix between two sequences.
    """
    min_len = min(len(seq1), len(seq2))
    lcp = 0
    for i in range(min_len):
        if seq1[i] == seq2[i]:
            lcp += 1
        else:
            break
    return lcp


def sequence_similarity(query, reference):
    """
    Calculate similarity score between query history and reference sequence.

    Uses longest common prefix (LCP) with position weighting:
    - Earlier matches are more important
    - Raw score = sum of 1 / (i + 1) over the matched prefix positions i (0-based)
    - The raw score is divided by the maximum score possible for the query length

    Args:
        query: List of encoded states (history)
        reference: List of encoded states (reference sequence)

    Returns:
        Similarity score in [0, 1] (higher = more similar)
    """
    if len(query) == 0:
        return 0.0

    # Find longest common prefix
    lcp = longest_common_prefix(query, reference)

    if lcp == 0:
        return 0.0

    # Every position inside the common prefix matches by definition,
    # so the weighted score is a partial harmonic sum
    score = sum(1.0 / (i + 1) for i in range(lcp))

    # Normalize by the best possible score for this query length
    max_possible_score = sum(1.0 / (i + 1) for i in range(len(query)))

    return score / max_possible_score


def find_k_nearest(query_history, k=K_NEIGHBORS):
    """
    Find k nearest neighbor sequences based on similarity.
    
    Args:
        query_history: List of encoded states (query sequence)
        k: Number of neighbors to find
    
    Returns:
        List of tuples: (sequence_index, similarity_score, next_location)
        Sorted by similarity score (descending)
    """
    if len(query_history) == 0:
        return []
    
    similarities = []
    
    for idx, ref_seq in enumerate(reference_sequences):
        # Calculate similarity
        similarity = sequence_similarity(query_history, ref_seq)
        
        if similarity > 0:
            # Find the next location after the matching prefix
            lcp = longest_common_prefix(query_history, ref_seq)
            if lcp < len(ref_seq):
                next_location = ref_seq[lcp]
                similarities.append((idx, similarity, next_location))
    
    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    # Return top-k
    return similarities[:k]


print("\nKNN Trajectory Model built successfully!")
print(f"Similarity functions:")
print(f"  - longest_common_prefix(): Find matching prefix length")
print(f"  - sequence_similarity(): Calculate weighted similarity score")
print(f"  - find_k_nearest(): Find k most similar sequences")


Building KNN Trajectory Model...
Storing 119 training sequences as reference trajectories
Reference trajectories stored: 119
Calculating state frequencies for fallback...
Most frequent state: 220 (count: 1602)

KNN Trajectory Model built successfully!
Similarity functions:
  - longest_common_prefix(): Find matching prefix length
  - sequence_similarity(): Calculate weighted similarity score
  - find_k_nearest(): Find k most similar sequences


## Section 7 — Prediction Functions

Implement prediction functions using KNN voting based on nearest neighbor sequences.
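
The voting step can be illustrated in isolation. In this sketch each neighbor is reduced to a `(similarity, next_location)` pair with made-up values; `weighted_vote` is a hypothetical helper name used only for this example.

```python
from collections import defaultdict


def weighted_vote(neighbors):
    """Rank candidate next locations by the summed similarity of their voters."""
    votes = defaultdict(float)
    for similarity, next_location in neighbors:
        votes[next_location] += similarity
    # Highest accumulated weight first
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)


# Made-up neighbors: two moderate votes for 12 outweigh one stronger vote for 40
neighbors = [(0.50, 40), (0.30, 12), (0.30, 12), (0.10, 7)]
ranking = weighted_vote(neighbors)
print(ranking[0][0])  # 12
```

The first entry of the ranking corresponds to a single-location prediction, and the first K entries correspond to a top-K prediction.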


In [7]:
def predict_next_location(query_history, k=K_NEIGHBORS):
    """
    Predict next location using KNN trajectory model.
    
    Args:
        query_history: List of encoded states (history)
        k: Number of nearest neighbors to consider
    
    Returns:
        Predicted next state (encoded integer) or None
    """
    if len(query_history) == 0:
        # No history to match against: fall back to the most frequent state
        return most_frequent_state

    # Find k nearest neighbors
    nearest = find_k_nearest(query_history, k=k)

    if not nearest:
        # No similar sequences found, use most frequent state
        return most_frequent_state

    # Weighted voting: each neighbor votes for its next location,
    # with its similarity score as the vote weight
    votes = defaultdict(float)
    for _, similarity, next_location in nearest:
        votes[next_location] += similarity

    # Return location with highest weighted vote
    # (ties resolve to the location that received a vote first)
    predicted = max(votes, key=votes.get)
    return int(predicted)


def predict_top_k(query_history, k_neighbors=K_NEIGHBORS, top_k=5):
    """
    Get top-K most likely next locations using KNN trajectory model.
    
    Args:
        query_history: List of encoded states (history)
        k_neighbors: Number of nearest neighbors to consider
        top_k: Number of top predictions to return
    
    Returns:
        List of top-K encoded states sorted by vote weight
    """
    if len(query_history) == 0:
        # Return top-K most frequent states
        top_states = [state for state, _ in state_counts.most_common(top_k)]
        return top_states[:top_k] if top_states else []
    
    # Find k nearest neighbors
    nearest = find_k_nearest(query_history, k=k_neighbors)
    
    if not nearest:
        # No similar sequences found, use most frequent states
        top_states = [state for state, _ in state_counts.most_common(top_k)]
        return top_states[:top_k] if top_states else []
    
    # Weighted voting: count votes weighted by similarity score
    votes = defaultdict(float)
    for _, similarity, next_location in nearest:
        votes[next_location] += similarity
    
    if not votes:
        top_states = [state for state, _ in state_counts.most_common(top_k)]
        return top_states[:top_k] if top_states else []
    
    # Sort by vote weight (descending) and return top-K
    sorted_locations = sorted(votes.items(), key=lambda x: x[1], reverse=True)
    top_k_locations = [int(loc) for loc, weight in sorted_locations[:top_k]]
    
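    # If we have fewer than top_k, fill with most frequent states.
    # Only the top_k most frequent states are considered as fillers, so the
    # result can still hold fewer than top_k entries when those states
    # overlap with locations that already received votes.
    if len(top_k_locations) < top_k:
        existing = set(top_k_locations)
        additional = [state for state, _ in state_counts.most_common(top_k)
                      if state not in existing]
        top_k_locations.extend(additional[:top_k - len(top_k_locations)])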
    
    return top_k_locations[:top_k]


print("Prediction functions defined successfully!")
print("\nFunction summary:")
print(f"  - predict_next_location(query_history, k={K_NEIGHBORS}): Returns single most likely next state")
print(f"  - predict_top_k(query_history, k_neighbors={K_NEIGHBORS}, top_k=5): Returns top-K most likely next states")


Prediction functions defined successfully!

Function summary:
  - predict_next_location(query_history, k=5): Returns single most likely next state
  - predict_top_k(query_history, k_neighbors=5, top_k=5): Returns top-K most likely next states


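## Section 8 — Save Model

Persist the reference trajectories, state frequencies, and encoder mappings to disk.


In [8]: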
# Save model
print("Saving model...")
model_data = {
    'reference_sequences': reference_sequences,
    'state_counts': dict(state_counts),
    'most_frequent_state': most_frequent_state,
    'k_neighbors': K_NEIGHBORS,
    'label_encoder': le,
    'encoded_to_placeid': encoded_to_placeid,
    'n_states': n_states,
    'similarity_metric': 'longest_common_prefix_weighted'
}

with open(MODEL_SAVE_PATH, 'wb') as f:
    pickle.dump(model_data, f)

print(f"Model saved to {MODEL_SAVE_PATH}")
print(f"Model contains:")
print(f"  - Reference trajectories: {len(reference_sequences)} sequences")
print(f"  - State frequencies: {len(state_counts)} states")
print(f"  - K neighbors: {K_NEIGHBORS}")
print(f"  - LabelEncoder and mappings")


Saving model...
Model saved to /home/root495/Inexture/Location Prediction Update/models/knn_trajectory_model.pkl
Model contains:
  - Reference trajectories: 119 sequences
  - State frequencies: 286 states
  - K neighbors: 5
  - LabelEncoder and mappings


## Section 9 — Evaluation Setup

Prepare test sequences and helper functions for evaluation metrics.
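
The evaluation below turns one sequence into many prediction tasks by pairing each prefix with the element that follows it. A minimal sketch of that expansion, using a hypothetical helper name `make_test_cases` and a made-up sequence:

```python
def make_test_cases(sequence):
    """Pair every non-empty proper prefix with the element that follows it."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


cases = make_test_cases([4, 8, 15, 16])
for history, true_next in cases:
    print(history, "->", true_next)
# [4] -> 8
# [4, 8] -> 15
# [4, 8, 15] -> 16
```

A sequence of n events produces n - 1 test cases, so a 50-event sequence yields 49.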


In [9]:
# Use first test sequence for evaluation (same as other notebooks for fair comparison)
test_sequence = test_encoded[0]
print(f"Test sequence length: {len(test_sequence)} events")

# Create test cases: history -> next location
test_cases = []
for i in range(1, len(test_sequence)):
    history = test_sequence[:i]
    true_next = test_sequence[i]
    test_cases.append((history, true_next))

print(f"Created {len(test_cases)} test cases")

# Load grid metadata and coordinates for MPD calculation
with open(GRID_METADATA_FILE, 'r') as f:
    grid_metadata = json.load(f)

df_places = pd.read_csv(CLEANED_WITH_PLACES_FILE)
place_coords = df_places.groupby('place_id')[['lat', 'lon']].first().to_dict('index')

print(f"Loaded coordinates for {len(place_coords)} places")

# Helper function to get coordinates from place_id
def place_id_to_coords(place_id, place_coords, grid_metadata):
    """Get coordinates from place_id"""
    if place_id is None:
        return None, None
    
    # Try to find in place_coords first
    if place_id in place_coords:
        return place_coords[place_id]['lat'], place_coords[place_id]['lon']
    
    # Fallback: calculate from grid if place_id has format "row_col"
    try:
        if "_" in str(place_id):
            row, col = map(int, str(place_id).split("_"))
            lat = grid_metadata['min_lat'] + row * grid_metadata['deg_lat']
            lon = grid_metadata['min_lon'] + col * grid_metadata['deg_lon']
            return lat, lon
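    except (ValueError, KeyError, TypeError):
        # Malformed place_id or incomplete grid metadata: treat as unknown
        pass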
    
    return None, None

print("Evaluation setup complete!")


Test sequence length: 50 events
Created 49 test cases
Loaded coordinates for 2073 places
Evaluation setup complete!


## Section 10 — Metric 1: Accuracy

Calculate accuracy: fraction of predictions that exactly match the true next location.


In [10]:
# Calculate Accuracy
print("Calculating Accuracy...")
predictions = []
true_labels = []

for history, true_next in tqdm(test_cases, desc="Making predictions"):
    pred = predict_next_location(history, k=K_NEIGHBORS)
    if pred is not None:
        predictions.append(pred)
        true_labels.append(true_next)

# Calculate accuracy
if len(predictions) == 0:
    print("ERROR: No predictions were made!")
    accuracy = 0
    correct = 0
    total = 0
else:
    correct = sum(1 for p, t in zip(predictions, true_labels) if p == t)
    total = len(predictions)
    accuracy = correct / total if total > 0 else 0
    
    # Debug: Show first few predictions vs true
    print(f"\nDebug - First 5 predictions:")
    for i in range(min(5, len(predictions))):
        pred_place = encoded_to_placeid.get(predictions[i], "Unknown")
        true_place = encoded_to_placeid.get(true_labels[i], "Unknown")
        match = "✓" if predictions[i] == true_labels[i] else "✗"
        print(f"  {match} Pred: {predictions[i]} ({pred_place[:20]}) | True: {true_labels[i]} ({true_place[:20]})")

print(f"\n{'='*60}")
print(f"METRIC 1: ACCURACY")
print(f"{'='*60}")
print(f"Correct predictions: {correct}")
print(f"Total predictions: {total}")
print(f"Accuracy: {accuracy:.12f}")
print(f"{'='*60}")


Calculating Accuracy...


Making predictions: 100%|██████████| 49/49 [00:00<00:00, 1990.34it/s]


Debug - First 5 predictions:
  ✗ Pred: 219 (296_2075) | True: 213 (295_2076)
  ✗ Pred: 220 (296_2076) | True: 214 (295_2077)
  ✗ Pred: 220 (296_2076) | True: 213 (295_2076)
  ✓ Pred: 220 (296_2076) | True: 220 (296_2076)
  ✗ Pred: 220 (296_2076) | True: 219 (296_2075)

METRIC 1: ACCURACY
Correct predictions: 7
Total predictions: 49
Accuracy: 0.142857142857





## Section 11 — Metric 2: Precision & Recall

**Definition**: 
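- **Precision**: For each location class, the fraction of predictions of that class that were correct, averaged across classes with weights proportional to how often each class is the true next location
- **Recall**: For each location class, the fraction of its true occurrences that were predicted correctly, averaged with the same weights

Both scores use scikit-learn's `average='weighted'`. With this weighting, recall is equal to overall accuracy, so precision is the additional signal here.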
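
To show what support-weighted averaging does, here is a pure-Python sketch meant to mirror `average='weighted'` with `zero_division=0`. The function name `weighted_precision_recall` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def weighted_precision_recall(y_true, y_pred):
    """Support-weighted precision and recall; undefined ratios count as 0."""
    support = Counter(y_true)      # how often each class is the true label
    predicted = Counter(y_pred)    # how often each class is predicted
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = 0.0
    recall = 0.0
    for cls, count in support.items():
        weight = count / n
        # A class that is never predicted has undefined precision; use 0
        cls_precision = hits[cls] / predicted[cls] if predicted[cls] else 0.0
        precision += weight * cls_precision
        recall += weight * hits[cls] / count
    return precision, recall


y_true = [1, 1, 2, 3]
y_pred = [1, 2, 2, 2]
precision, recall = weighted_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, accuracy)
```

In this toy case weighted recall and accuracy are both 0.5, because the support weights cancel the per-class denominators. Precision is lower when the model concentrates its predictions on a few classes.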


In [11]:
# Calculate Precision & Recall (Weighted)
print("Calculating Precision & Recall (Weighted)...")

if len(predictions) > 0:
    precision_weighted = precision_score(true_labels, predictions, average='weighted', zero_division=0)
    recall_weighted = recall_score(true_labels, predictions, average='weighted', zero_division=0)
else:
    precision_weighted = recall_weighted = 0

print(f"\n{'='*60}")
print(f"METRIC 2: PRECISION & RECALL")
print(f"{'='*60}")
print(f"Precision: {precision_weighted:.12f}")
print(f"Recall: {recall_weighted:.12f}")
print(f"{'='*60}")


Calculating Precision & Recall (Weighted)...

METRIC 2: PRECISION & RECALL
Precision: 0.020833333333
Recall: 0.142857142857


## Section 12 — Metric 3: Top-K Accuracy

**Definition**: A prediction counts as correct if the true next location appears among the top K predicted locations. Evaluated here for K = 1, 3, and 5.
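
A minimal sketch of the metric on made-up rankings follows. The helper name `top_k_accuracy` is hypothetical, and each ranking is assumed to be ordered best first.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Share of cases whose true label is among the first k ranked candidates."""
    if not true_labels:
        return 0.0
    hits = sum(true in ranked[:k]
               for ranked, true in zip(ranked_predictions, true_labels))
    return hits / len(true_labels)


ranked = [[5, 2, 9], [1, 5, 3], [7, 8, 2]]  # made-up rankings, best first
truth = [2, 1, 4]
for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, truth, k))
```

Top-K accuracy can only stay the same or rise as K grows, since the first K candidates always include the first K - 1. Top-1 accuracy is plain accuracy when the first candidate is the single predicted location.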


In [12]:
# Calculate Top-K Accuracy
print("Calculating Top-K Accuracy...")

k_values = [1, 3, 5]
top_k_results = {}

for k in k_values:
    correct_k = 0
    total_k = 0
    
    for history, true_next in tqdm(test_cases, desc=f"Top-{k}"):
        top_k_preds = predict_top_k(history, k_neighbors=K_NEIGHBORS, top_k=k)
        if top_k_preds:
            total_k += 1
            if true_next in top_k_preds:
                correct_k += 1
    
    top_k_accuracy = correct_k / total_k if total_k > 0 else 0
    
    top_k_results[k] = {
        'correct': correct_k,
        'total': total_k,
        'accuracy': top_k_accuracy
    }

print(f"\n{'='*60}")
print(f"METRIC 3: TOP-K ACCURACY")
print(f"{'='*60}")
for k in k_values:
    result = top_k_results[k]
    print(f"Top-{k} Accuracy: {result['accuracy']:.12f}")
print(f"{'='*60}")


Calculating Top-K Accuracy...


Top-1: 100%|██████████| 49/49 [00:00<00:00, 2472.46it/s]
Top-3: 100%|██████████| 49/49 [00:00<00:00, 3033.03it/s]
Top-5: 100%|██████████| 49/49 [00:00<00:00, 3281.25it/s]


METRIC 3: TOP-K ACCURACY
Top-1 Accuracy: 0.142857142857
Top-3 Accuracy: 0.367346938776
Top-5 Accuracy: 0.448979591837





## Section 13 — Metric 4: Mean Prediction Distance (MPD)

**Definition**: Average Haversine distance (in meters) between the actual next location and the predicted next location. Lower is better; an exact match contributes a distance of 0.
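
For reference, the Haversine formula and the averaging step can be sketched with the standard library alone. This is a stand-in for the `haversine` package used in the cell below, not a replacement for it; the Earth radius constant, the function names, and the coordinates are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6371008.8  # assumed mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def mean_prediction_distance(pairs):
    """Average distance over (predicted, actual) coordinate pairs."""
    distances = [haversine_m(*pred, *true) for pred, true in pairs]
    return sum(distances) / len(distances) if distances else 0.0


# Made-up coordinates: one exact hit and one miss of 0.01 degrees of latitude
pairs = [
    ((39.90, 116.40), (39.90, 116.40)),
    ((39.90, 116.40), (39.91, 116.40)),
]
mpd_example = mean_prediction_distance(pairs)
print(round(mpd_example, 1))
```

Exact hits add zeros to the average, so MPD reflects both how often the model misses and how far off the misses are.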


In [13]:
# Calculate Mean Prediction Distance (MPD)
print("Calculating Mean Prediction Distance (MPD)...")

distances = []
failed_conversions = 0

for history, true_next in tqdm(test_cases, desc="Calculating distances"):
    pred = predict_next_location(history, k=K_NEIGHBORS)
    
    if pred is not None:
        # Convert encoded IDs back to place_ids
        pred_place_id = encoded_to_placeid.get(pred)
        true_place_id = encoded_to_placeid.get(true_next)
        
        if pred_place_id and true_place_id:
            # Get coordinates
            pred_lat, pred_lon = place_id_to_coords(pred_place_id, place_coords, grid_metadata)
            true_lat, true_lon = place_id_to_coords(true_place_id, place_coords, grid_metadata)
            
            if pred_lat is not None and true_lat is not None:
                # Calculate haversine distance
                try:
                    distance_m = haversine((pred_lat, pred_lon), (true_lat, true_lon)) * 1000
                    # Filter out unrealistic distances (likely coordinate errors)
                    if distance_m < 1000000:  # Less than 1000 km
                        distances.append(distance_m)
                    else:
                        failed_conversions += 1
                except Exception:
                    # Count any failed distance computation as a failed conversion
                    failed_conversions += 1
            else:
                failed_conversions += 1
        else:
            failed_conversions += 1
    else:
        failed_conversions += 1

if failed_conversions > 0:
    print(f"Warning: {failed_conversions} distance calculations failed or were filtered")

mpd = np.mean(distances) if len(distances) > 0 else 0
mpd_median = np.median(distances) if len(distances) > 0 else 0
mpd_std = np.std(distances) if len(distances) > 0 else 0

print(f"\n{'='*60}")
print(f"METRIC 4: MEAN PREDICTION DISTANCE (MPD)")
print(f"{'='*60}")
print(f"MPD Distance: {mpd:.12f} meters")
print(f"Valid distance calculations: {len(distances)}/{len(test_cases)}")
print(f"{'='*60}")


Calculating Mean Prediction Distance (MPD)...


Calculating distances: 100%|██████████| 49/49 [00:00<00:00, 104.53it/s]


METRIC 4: MEAN PREDICTION DISTANCE (MPD)
MPD Distance: 17011.943966723749 meters
Valid distance calculations: 49/49





In [14]:
# Compile all results
results = {
    'num_users': NUM_USERS,
    'selected_users': selected_users,
    'preprocessing': {
        'total_original_places': total_original,
        'total_after_duplicate_removal': total_processed,
        'total_duplicates_removed': total_original - total_processed,
        'sequence_length': SEQUENCE_LENGTH,
        'total_sequences': len(all_sequences),
        'training_sequences': len(train_sequences),
        'test_sequences': len(test_sequences)
    },
    'model': {
        'unique_states': n_states,
        'model_type': 'knn_trajectory',
        'k_neighbors': K_NEIGHBORS,
        'num_reference_sequences': len(reference_sequences),
        'similarity_metric': 'longest_common_prefix_weighted'
    },
    'accuracy': {
        'value': accuracy,
        'correct': correct,
        'total': total
    },
    'precision_recall': {
        'precision': float(precision_weighted),
        'recall': float(recall_weighted)
    },
    'top_k_accuracy': {
        f'top_{k}_accuracy': float(top_k_results[k]['accuracy']) for k in k_values
    },
    'mpd_distance': {
        'mpd_distance_meters': float(mpd),
        'valid_calculations': len(distances)
    }
}

# Display summary
print(f"\n{'='*60}")
print(f"EVALUATION RESULTS SUMMARY")
print(f"{'='*60}")
print(f"\nNumber of users: {NUM_USERS}")
print(f"Users: {selected_users}")
print(f"Total original places: {total_original}")
print(f"After duplicate removal: {total_processed}")
print(f"Training sequences: {len(train_sequences)}")
print(f"Test sequences: {len(test_sequences)}")

print(f"\n1. ACCURACY")
print(f"   Accuracy: {accuracy:.12f}")

print(f"\n2. PRECISION & RECALL")
print(f"   Precision: {precision_weighted:.12f}")
print(f"   Recall: {recall_weighted:.12f}")

print(f"\n3. TOP-K ACCURACY")
for k in k_values:
    acc = top_k_results[k]['accuracy']
    print(f"   Top-{k} Accuracy: {acc:.12f}")

print(f"\n4. MEAN PREDICTION DISTANCE (MPD)")
print(f"   MPD Distance: {mpd:.12f} meters")

print(f"\n{'='*60}")

# Save results
with open(RESULTS_SAVE_PATH, 'w') as f:
    json.dump(results, f, indent=2)

print(f"\nResults saved to {RESULTS_SAVE_PATH}")

# Create results DataFrame
results_df = pd.DataFrame({
    'Metric': [
        'Accuracy',
        'Precision',
        'Recall',
        'Top-1 Accuracy',
        'Top-3 Accuracy',
        'Top-5 Accuracy',
        'MPD Distance'
    ],
    'Value': [
        f"{accuracy:.12f}",
        f"{precision_weighted:.12f}",
        f"{recall_weighted:.12f}",
        f"{top_k_results[1]['accuracy']:.12f}",
        f"{top_k_results[3]['accuracy']:.12f}",
        f"{top_k_results[5]['accuracy']:.12f}",
        f"{mpd:.12f}"
    ]
})

print("\nResults Table:")
print(results_df.to_string(index=False))



EVALUATION RESULTS SUMMARY

Number of users: 10
Users: ['000', '001', '005', '006', '009', '011', '014', '016', '019', '025']
Total original places: 1752364
After duplicate removal: 4087
Training sequences: 119
Test sequences: 30

1. ACCURACY
   Accuracy: 0.142857142857

2. PRECISION & RECALL
   Precision: 0.020833333333
   Recall: 0.142857142857

3. TOP-K ACCURACY
   Top-1 Accuracy: 0.142857142857
   Top-3 Accuracy: 0.367346938776
   Top-5 Accuracy: 0.448979591837

4. MEAN PREDICTION DISTANCE (MPD)
   MPD Distance: 17011.943966723749 meters


Results saved to /home/root495/Inexture/Location Prediction Update/results/knn_trajectory_results.json

Results Table:
        Metric              Value
      Accuracy     0.142857142857
     Precision     0.020833333333
        Recall     0.142857142857
Top-1 Accuracy     0.142857142857
Top-3 Accuracy     0.367346938776
Top-5 Accuracy     0.448979591837
  MPD Distance 17011.943966723749


In [15]:
# Update models_comparison.csv
comparison_file = RESULTS_PATH + "models_comparison.csv"

# Read existing comparison file
try:
    comparison_df = pd.read_csv(comparison_file)
    
    # Check if KNN Trajectory row already exists
    if 'KNN Trajectory' in comparison_df['Model'].values:
        # Update existing row
        mask = comparison_df['Model'] == 'KNN Trajectory'
        comparison_df.loc[mask, 'Accuracy'] = f"{accuracy:.12f}"
        comparison_df.loc[mask, 'Precision'] = f"{precision_weighted:.12f}"
        comparison_df.loc[mask, 'Recall'] = f"{recall_weighted:.12f}"
        comparison_df.loc[mask, 'Top-1 Accuracy'] = f"{top_k_results[1]['accuracy']:.12f}"
        comparison_df.loc[mask, 'Top-3 Accuracy'] = f"{top_k_results[3]['accuracy']:.12f}"
        comparison_df.loc[mask, 'Top-5 Accuracy'] = f"{top_k_results[5]['accuracy']:.12f}"
        comparison_df.loc[mask, 'MPD Distance (meters)'] = f"{mpd:.12f}"
        print("Updated existing KNN Trajectory row in models_comparison.csv")
    else:
        # Add new row
        new_row = pd.DataFrame({
            'Model': ['KNN Trajectory'],
            'Accuracy': [f"{accuracy:.12f}"],
            'Precision': [f"{precision_weighted:.12f}"],
            'Recall': [f"{recall_weighted:.12f}"],
            'Top-1 Accuracy': [f"{top_k_results[1]['accuracy']:.12f}"],
            'Top-3 Accuracy': [f"{top_k_results[3]['accuracy']:.12f}"],
            'Top-5 Accuracy': [f"{top_k_results[5]['accuracy']:.12f}"],
            'MPD Distance (meters)': [f"{mpd:.12f}"]
        })
        comparison_df = pd.concat([comparison_df, new_row], ignore_index=True)
        print("Added new KNN Trajectory row to models_comparison.csv")
    
    # Save updated comparison file
    comparison_df.to_csv(comparison_file, index=False)
    print(f"Updated {comparison_file}")
    
    # Display updated comparison
    print("\nUpdated Models Comparison:")
    print(comparison_df.to_string(index=False))
    
except FileNotFoundError:
    # Create new comparison file if it doesn't exist
    comparison_df = pd.DataFrame({
        'Model': ['KNN Trajectory'],
        'Accuracy': [f"{accuracy:.12f}"],
        'Precision': [f"{precision_weighted:.12f}"],
        'Recall': [f"{recall_weighted:.12f}"],
        'Top-1 Accuracy': [f"{top_k_results[1]['accuracy']:.12f}"],
        'Top-3 Accuracy': [f"{top_k_results[3]['accuracy']:.12f}"],
        'Top-5 Accuracy': [f"{top_k_results[5]['accuracy']:.12f}"],
        'MPD Distance (meters)': [f"{mpd:.12f}"]
    })
    comparison_df.to_csv(comparison_file, index=False)
    print(f"Created new {comparison_file}")
except Exception as e:
    print(f"Warning: Could not update models_comparison.csv: {e}")
    print("Results have been saved to JSON file. Please update CSV manually if needed.")


Added new KNN Trajectory row to models_comparison.csv
Updated /home/root495/Inexture/Location Prediction Update/results/models_comparison.csv

Updated Models Comparison:
         Model       Accuracy      Precision         Recall Top-1 Accuracy Top-3 Accuracy Top-5 Accuracy MPD Distance (meters)
           HMM       0.653061       0.605081       0.653061       0.653061       0.897959       0.918367           4364.404451
           GNN       0.504762       0.438886       0.504762       0.504762       0.691837       0.787075           3216.861429
        Fusion       0.498639        0.44425       0.498639       0.498639       0.768027       0.819728           5196.347567
  Markov Chain       0.693878       0.730539       0.693878       0.693878       0.918367       0.918367            3691.02685
KNN Trajectory 0.142857142857 0.020833333333 0.142857142857 0.142857142857 0.367346938776 0.448979591837    17011.943966723749
