# HMM Training on 10 Users' Trajectories

This notebook:
- Loads the trajectories of 10 users
- Removes consecutive duplicates (AAABCDCCABB → ABCDCAB) for each user
- Builds overlapping windows of 50 events (step 25) from all users
- Trains an HMM (hmmlearn `CategoricalHMM`) on the combined windows
- Evaluates four metrics: Accuracy, Precision & Recall, Top-K Accuracy, and Mean Prediction Distance (MPD)


## Section 1 — Imports & Setup


In [13]:
import os
import pandas as pd
import numpy as np
import json
import pickle
from tqdm import tqdm
from haversine import haversine
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_score, recall_score
from hmmlearn import hmm
import warnings
warnings.filterwarnings('ignore')

# Set random seeds
np.random.seed(42)

# Paths
BASE_PATH = "/home/root495/Inexture/Location Prediction Update"
PROCESSED_PATH = BASE_PATH + "/data/processed/"
SEQUENCES_FILE = PROCESSED_PATH + "place_sequences.json"
GRID_METADATA_FILE = PROCESSED_PATH + "grid_metadata.json"
CLEANED_WITH_PLACES_FILE = PROCESSED_PATH + "cleaned_with_places.csv"
OUTPUT_PATH = BASE_PATH + "/notebooks/"
MODELS_PATH = BASE_PATH + "/models/"
RESULTS_PATH = BASE_PATH + "/results/"
MODEL_SAVE_PATH = MODELS_PATH + "hmm_10users_model.pkl"
RESULTS_SAVE_PATH = RESULTS_PATH + "hmm_10users_results.json"

os.makedirs(OUTPUT_PATH, exist_ok=True)
os.makedirs(MODELS_PATH, exist_ok=True)
os.makedirs(RESULTS_PATH, exist_ok=True)

print("Libraries imported successfully!")


Libraries imported successfully!


## Section 2 — Load 10 Users' Trajectories


In [14]:
# Load place sequences
print("Loading place sequences...")
with open(SEQUENCES_FILE, 'r') as f:
    sequences_dict = json.load(f)

print(f"Total users available: {len(sequences_dict)}")

# Select first 10 users
user_ids = list(sequences_dict.keys())
NUM_USERS = 10
selected_users = user_ids[:NUM_USERS]

print(f"\nSelected {NUM_USERS} users: {selected_users}")

# Load sequences for all selected users
user_sequences = {}
total_places = 0
for user_id in selected_users:
    seq = sequences_dict[user_id]
    user_sequences[user_id] = seq
    total_places += len(seq)
    print(f"  User {user_id}: {len(seq)} places")

print(f"\nTotal places across all {NUM_USERS} users: {total_places}")


Loading place sequences...
Total users available: 54

Selected 10 users: ['000', '001', '005', '006', '009', '011', '014', '016', '019', '025']
  User 000: 173817 places
  User 001: 108561 places
  User 005: 108967 places
  User 006: 31809 places
  User 009: 84573 places
  User 011: 90770 places
  User 014: 388051 places
  User 016: 89208 places
  User 019: 47792 places
  User 025: 628816 places

Total places across all 10 users: 1752364


## Section 3 — Preprocess: Remove Consecutive Duplicates

Remove consecutive duplicate locations for each user. Example: AAABCDCCABB → ABCDCAB

Only consecutive duplicates are removed. If a location appears again later (non-consecutive), it is kept.
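
The same rule can be written with `itertools.groupby`, which groups adjacent equal items. This is a minimal sketch of an equivalent alternative, not the function used in the cell below; `collapse_runs` is a hypothetical name.

```python
from itertools import groupby

def collapse_runs(sequence):
    """Keep one item per run of adjacent equal items."""
    # groupby yields one (key, group) pair per run of consecutive equal values,
    # so non-adjacent repeats of a location are kept.
    return [key for key, _ in groupby(sequence)]

print("".join(collapse_runs("AAABCDCCABB")))  # ABCDCAB
```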


In [15]:
def remove_consecutive_duplicates(sequence):
    """
    Remove consecutive duplicates from sequence.
    Example: [A, A, A, B, C, D, C, C, A, B, B] → [A, B, C, D, C, A, B]
    """
    if len(sequence) == 0:
        return sequence
    
    processed = [sequence[0]]  # Always keep first element
    
    for i in range(1, len(sequence)):
        # Only add if different from previous (not consecutive duplicate)
        if sequence[i] != sequence[i-1]:
            processed.append(sequence[i])
    
    return processed

# Apply consecutive duplicate removal to each user
processed_sequences = {}
total_original = 0
total_processed = 0

print("Processing users...")
for user_id in tqdm(selected_users, desc="Removing duplicates"):
    original_seq = user_sequences[user_id]
    processed_seq = remove_consecutive_duplicates(original_seq)
    processed_sequences[user_id] = processed_seq
    
    original_len = len(original_seq)
    processed_len = len(processed_seq)
    total_original += original_len
    total_processed += processed_len
    
    reduction = original_len - processed_len
    reduction_pct = (reduction / original_len * 100) if original_len > 0 else 0
    print(f"  User {user_id}: {original_len} → {processed_len} places ({reduction_pct:.1f}% reduction)")

print(f"\nSummary:")
print(f"  Total original places: {total_original}")
print(f"  Total after processing: {total_processed}")
print(f"  Total duplicates removed: {total_original - total_processed} ({((total_original - total_processed)/total_original*100):.1f}%)")


Processing users...


Removing duplicates: 100%|██████████| 10/10 [00:00<00:00, 36.00it/s]

  User 000: 173817 → 795 places (99.5% reduction)
  User 001: 108561 → 186 places (99.8% reduction)
  User 005: 108967 → 283 places (99.7% reduction)
  User 006: 31809 → 103 places (99.7% reduction)
  User 009: 84573 → 17 places (100.0% reduction)
  User 011: 90770 → 125 places (99.9% reduction)
  User 014: 388051 → 766 places (99.8% reduction)
  User 016: 89208 → 124 places (99.9% reduction)
  User 019: 47792 → 120 places (99.7% reduction)
  User 025: 628816 → 1568 places (99.8% reduction)

Summary:
  Total original places: 1752364
  Total after processing: 4087
  Total duplicates removed: 1748277 (99.8%)





## Section 4 — Create Sequences of Length 50

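Slide a window of 50 events over each user's processed sequence with a step of 25, so consecutive windows overlap by 50%. Users with fewer than 50 events contribute no windows.
The windows from all users are concatenated in user order and split 80/20 by position. As a result, the test set is the last 30 windows (all from user 025), and the first test window shares 25 events with the last training window.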
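
As a quick sanity check on the windowing, the number of windows per user follows directly from the sequence length. The sketch below uses the per-user lengths printed in Section 3; `count_windows` is a hypothetical helper, not part of the notebook.

```python
def count_windows(n_events, window=50, step=25):
    """Number of full windows produced by range(0, n_events - window + 1, step)."""
    if n_events < window:
        return 0
    return (n_events - window) // step + 1

# Sequence lengths after duplicate removal, copied from the Section 3 output.
lengths_after_dedup = {
    "000": 795, "001": 186, "005": 283, "006": 103, "009": 17,
    "011": 125, "014": 766, "016": 124, "019": 120, "025": 1568,
}

counts = {user: count_windows(n) for user, n in lengths_after_dedup.items()}
total_windows = sum(counts.values())
print(counts["025"], total_windows)  # 61 149
```

Because adjacent windows overlap, the 149 windows hold 7450 events even though only 4087 distinct events exist after duplicate removal.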


In [16]:
# Create sequences of fixed length 50
SEQUENCE_LENGTH = 50

# Use sliding windows for more training data (overlap helps with learning)
# Create overlapping sequences with step size of 25 (50% overlap)
all_sequences = []
step_size = 25  # Overlap of 50%

print("Creating sequences from all users...")
for user_id in tqdm(selected_users, desc="Processing users"):
    processed_seq = processed_sequences[user_id]
    user_sequences_list = []
    
    for i in range(0, len(processed_seq) - SEQUENCE_LENGTH + 1, step_size):
        chunk = processed_seq[i:i+SEQUENCE_LENGTH]
        if len(chunk) == SEQUENCE_LENGTH:  # Only full-length sequences
            user_sequences_list.append(chunk)
    
    all_sequences.extend(user_sequences_list)
    print(f"  User {user_id}: {len(user_sequences_list)} sequences")

print(f"\nTotal sequences created: {len(all_sequences)}")
print(f"Total events in sequences: {sum(len(s) for s in all_sequences)}")

# Split into train/test (80/20)
split_idx = int(len(all_sequences) * 0.8)
train_sequences = all_sequences[:split_idx]
test_sequences = all_sequences[split_idx:]

print(f"\nTraining sequences: {len(train_sequences)}")
print(f"Test sequences: {len(test_sequences)}")

if len(test_sequences) == 0:
    # If no test sequences, use last training sequence for testing
    test_sequences = [train_sequences[-1]]
    train_sequences = train_sequences[:-1]
    print(f"Adjusted: Training={len(train_sequences)}, Test=1 (using last training sequence)")


Creating sequences from all users...


Processing users: 100%|██████████| 10/10 [00:00<00:00, 22298.27it/s]

  User 000: 30 sequences
  User 001: 6 sequences
  User 005: 10 sequences
  User 006: 3 sequences
  User 009: 0 sequences
  User 011: 4 sequences
  User 014: 29 sequences
  User 016: 3 sequences
  User 019: 3 sequences
  User 025: 61 sequences

Total sequences created: 149
Total events in sequences: 7450

Training sequences: 119
Test sequences: 30





## Section 5 — Encode Sequences

Encode place_ids to integers for HMM training.
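
For illustration, the encoding amounts to sorting the unique place IDs and using each ID's position as its integer code. This is a minimal pure-Python sketch of that idea, not the `LabelEncoder` call used below; `fit_label_mapping` is a hypothetical helper and the place IDs are examples.

```python
def fit_label_mapping(sequences):
    """Return (classes, to_index) where classes is the sorted list of unique IDs."""
    classes = sorted({place for seq in sequences for place in seq})
    to_index = {place: idx for idx, place in enumerate(classes)}
    return classes, to_index

sequences = [["296_2075", "295_2076"], ["295_2076", "296_2076"]]
classes, to_index = fit_label_mapping(sequences)

# Encode to integers, then decode back through the class list.
encoded = [[to_index[place] for place in seq] for seq in sequences]
decoded = [[classes[idx] for idx in seq] for seq in encoded]
print(encoded)  # [[1, 0], [0, 2]]
```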


In [17]:
# Encode sequences to integers
print("Encoding sequences...")
le = LabelEncoder()

# Flatten all sequences for encoding
all_places = [place for seq in train_sequences + test_sequences for place in seq]
le.fit(all_places)

print(f"Unique places across all users: {len(le.classes_)}")

# Encode training sequences
train_encoded = []
for seq in train_sequences:
    encoded = le.transform(seq).tolist()
    train_encoded.append(encoded)

# Encode test sequences
test_encoded = []
for seq in test_sequences:
    encoded = le.transform(seq).tolist()
    test_encoded.append(encoded)

print(f"Encoded {len(train_encoded)} training sequences")
print(f"Encoded {len(test_encoded)} test sequences")

# Create mapping from encoded ID to original place_id for coordinate lookup.
# LabelEncoder assigns each class its index in le.classes_, so enumerate suffices.
encoded_to_placeid = {idx: place_id for idx, place_id in enumerate(le.classes_)}

print(f"Created mapping for {len(encoded_to_placeid)} place IDs")


Encoding sequences...
Unique places across all users: 303
Encoded 119 training sequences
Encoded 30 test sequences
Created mapping for 303 place IDs


## Section 6 — Train HMM

Train Hidden Markov Model on the encoded sequences from all 10 users.
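
The training cell ends up with 50 hidden states and 303 observation symbols, which is a large model for 5950 observations. A rough count of free parameters for a categorical HMM can be sketched as follows (`categorical_hmm_free_parameters` is a hypothetical helper); the result matches the figure in the hmmlearn warning shown in the cell output.

```python
def categorical_hmm_free_parameters(n_components, n_features):
    """Free scalar parameters of an HMM with categorical emissions."""
    start = n_components - 1                    # initial distribution sums to 1
    trans = n_components * (n_components - 1)   # each transition row sums to 1
    emit = n_components * (n_features - 1)      # each emission row sums to 1
    return start + trans + emit

n_params = categorical_hmm_free_parameters(n_components=50, n_features=303)
print(n_params)  # 17599
```

With roughly three parameters per observation, fewer hidden states may be worth trying before reading much into the fitted model.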


In [18]:
# Prepare training data
train_numeric = [np.array(seq, dtype=np.int64) for seq in train_encoded]
X = np.concatenate([seq.reshape(-1, 1) for seq in train_numeric])
lengths = [len(seq) for seq in train_numeric]

print(f"Training data prepared:")
print(f"  X shape: {X.shape}")
print(f"  Number of sequences: {len(lengths)}")
print(f"  Total observations: {len(X)}")
print(f"  Average sequence length: {np.mean(lengths):.1f}")

# Determine number of hidden states
n_states = len(le.classes_)
# One third of the unique places, clamped to the range [15, 50]
n_hidden_states = min(50, max(15, n_states // 3))

print(f"\nUnique encoded states: {n_states}")
print(f"Using {n_hidden_states} hidden states for HMM")

# Train HMM with CategoricalHMM (for categorical observations)
# CategoricalHMM is the correct model for sequential categorical data (like place IDs)
print(f"\nTraining HMM model...")
print("Note: Using CategoricalHMM (correct for categorical observations)")
model = hmm.CategoricalHMM(
    n_components=n_hidden_states, 
    n_features=n_states,  # Number of categories (unique place IDs)
    n_iter=100,  # More iterations
    random_state=42, 
    tol=0.01,
    verbose=True
)
model.fit(X, lengths)

print("HMM training completed!")
if hasattr(model.monitor_, 'converged'):
    print(f"Model converged: {model.monitor_.converged}")
    print(f"Iterations: {model.monitor_.iter}")

# Save model
with open(MODEL_SAVE_PATH, 'wb') as f:
    pickle.dump(model, f)
    pickle.dump(le, f)
    pickle.dump(encoded_to_placeid, f)

print(f"Model saved to {MODEL_SAVE_PATH}")


Fitting a model with 17599 free scalar parameters with only 5950 data points will result in a degenerate solution.


Training data prepared:
  X shape: (5950, 1)
  Number of sequences: 119
  Total observations: 5950
  Average sequence length: 50.0

Unique encoded states: 303
Using 50 hidden states for HMM

Training HMM model...
Note: Using CategoricalHMM (correct for categorical observations)


         1  -33852.76221280             +nan
         2  -17052.28737068  +16800.47484211
         3  -13562.14904398   +3490.13832670
         4  -11569.00173274   +1993.14731124
         5  -10978.66570326    +590.33602948
         6  -10704.88698069    +273.77872257
         7  -10579.44322503    +125.44375566
         8  -10492.76697490     +86.67625012
         9  -10424.32197418     +68.44500072
        10  -10368.21611924     +56.10585494
        11  -10324.89692545     +43.31919379
        12  -10299.06892932     +25.82799612
        13  -10266.33931378     +32.72961555
        14  -10241.93658419     +24.40272959
        15  -10211.47964106     +30.45694313
        16  -10171.99500786     +39.48463320
        17  -10138.58756140     +33.40744646
        18  -10112.77745787     +25.81010354
        19  -10095.38787341     +17.38958445
        20  -10078.68031602     +16.70755739
        21  -10055.03142845     +23.64888757
        22  -10027.29983010     +27.73159835
        23

       100   -9248.19960960      +0.35815983


HMM training completed!
Model converged: True
Iterations: 100
Model saved to /home/root495/Inexture/Location Prediction Update/models/hmm_10users_model.pkl
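
The log ends at iteration 100 with a gain of +0.358, which is above `tol=0.01`. hmmlearn's monitor appears to report `converged` as true when the iteration cap is reached as well as when the gain falls below `tol`, so it may be worth checking the last gain directly. A minimal sketch, where `stopped_by_tolerance` is a hypothetical helper:

```python
def stopped_by_tolerance(history, tol):
    """Return True only if the last EM gain fell below tol."""
    if len(history) < 2:
        return False
    return (history[-1] - history[-2]) < tol

# Iteration 100 is taken from the training log; iteration 99 is derived
# from the reported gain (+0.35815983).
history = [-9248.55776943, -9248.19960960]
print(stopped_by_tolerance(history, tol=0.01))  # False
```

In the notebook, `list(model.monitor_.history)` could be passed in place of the hard-coded values, assuming the installed hmmlearn version exposes that attribute.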


## Section 7 — Evaluation Setup

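Prepare the test cases and helper functions for evaluation. Only the first test sequence is evaluated: each of its 49 prefixes becomes a history, and the event that follows is the prediction target. Predictions come from first-order transition frequencies counted on the training sequences; the HMM is used only when the last location has no recorded transition, or to fill up a top-K list that the transition counts leave short.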
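
The HMM fallback in the cell below takes the most likely hidden state at the last position and returns that state's most likely emission, which does not advance the state by one step. A one-step-ahead prediction would usually push the state belief through the transition matrix first. The sketch below shows that computation with plain lists on a made-up two-state model; `next_observation_distribution` is a hypothetical helper and the numbers are not taken from the trained model.

```python
def next_observation_distribution(last_posterior, transmat, emissionprob):
    """P(next observation | history) = last_posterior @ transmat @ emissionprob."""
    n_states = len(transmat)
    n_symbols = len(emissionprob[0])
    # Push the state belief one step forward through the transition matrix.
    next_state = [
        sum(last_posterior[i] * transmat[i][j] for i in range(n_states))
        for j in range(n_states)
    ]
    # Mix the per-state emission rows by the predicted state belief.
    return [
        sum(next_state[j] * emissionprob[j][o] for j in range(n_states))
        for o in range(n_symbols)
    ]

# Toy 2-state, 3-symbol model (illustrative values only).
posterior = [1.0, 0.0]
A = [[0.1, 0.9], [0.5, 0.5]]
B = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]

dist = next_observation_distribution(posterior, A, B)
predicted_symbol = dist.index(max(dist))
print(predicted_symbol)  # 2
```

Here the current state's own most likely emission is symbol 0, while the one-step-ahead distribution favours symbol 2. With the fitted model, the same quantity could be computed as `posteriors[-1] @ model.transmat_ @ model.emissionprob_`.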


In [19]:
# Use first test sequence for evaluation
test_sequence = test_encoded[0]
print(f"Test sequence length: {len(test_sequence)} events")

# Create test cases: history -> next location
test_cases = []
for i in range(1, len(test_sequence)):
    history = test_sequence[:i]
    true_next = test_sequence[i]
    test_cases.append((history, true_next))

print(f"Created {len(test_cases)} test cases")

# Load grid metadata and coordinates for MPD calculation
with open(GRID_METADATA_FILE, 'r') as f:
    grid_metadata = json.load(f)

df_places = pd.read_csv(CLEANED_WITH_PLACES_FILE)
place_coords = df_places.groupby('place_id')[['lat', 'lon']].first().to_dict('index')

print(f"Loaded coordinates for {len(place_coords)} places")

# Helper function to get coordinates from place_id
def place_id_to_coords(place_id, place_coords, grid_metadata):
    """Get coordinates from place_id"""
    if place_id is None:
        return None, None
    
    # Try to find in place_coords first
    if place_id in place_coords:
        return place_coords[place_id]['lat'], place_coords[place_id]['lon']
    
    # Fallback: calculate from grid if place_id has format "row_col"
    try:
        if "_" in str(place_id):
            row, col = map(int, str(place_id).split("_"))
            lat = grid_metadata['min_lat'] + row * grid_metadata['deg_lat']
            lon = grid_metadata['min_lon'] + col * grid_metadata['deg_lon']
            return lat, lon
    except Exception:
        # Malformed place_id or missing grid metadata: fall through to (None, None)
        pass
    
    return None, None

# Build transition frequency matrix from training data for pattern-based prediction
print("Building transition patterns from training data...")
transition_counts = {}
for seq in train_encoded:
    for i in range(len(seq) - 1):
        current = seq[i]
        next_loc = seq[i+1]
        if current not in transition_counts:
            transition_counts[current] = {}
        transition_counts[current][next_loc] = transition_counts[current].get(next_loc, 0) + 1

# Convert to probabilities
transition_probs = {}
for current, next_dict in transition_counts.items():
    total = sum(next_dict.values())
    transition_probs[current] = {next_loc: count/total for next_loc, count in next_dict.items()}

print(f"Built transition patterns for {len(transition_probs)} locations")

# Prediction functions: first-order transition patterns first, HMM as fallback
def predict_next_location(model, history, use_patterns=True):
    """Predict the next location from transition patterns, falling back to the HMM"""
    if len(history) == 0:
        return None
    
    # Try pattern-based prediction first (more reliable for location sequences)
    if use_patterns and len(history) > 0:
        last_obs = history[-1]
        if last_obs in transition_probs:
            next_probs = transition_probs[last_obs]
            if next_probs:
                most_likely = max(next_probs.items(), key=lambda x: x[1])[0]
                return int(most_likely)
    
    # Fallback to HMM model prediction
    try:
        # Ensure history is a list of integers (HMM-encoded IDs)
        if not isinstance(history, (list, np.ndarray)):
            return None
        
        history_array = np.array(history, dtype=np.int64).reshape(-1, 1)
        
        # Check if history is valid
        if len(history_array) == 0:
            return None
        
        # State posteriors for the observed history (forward-backward pass)
        logprob, posteriors = model.score_samples(history_array)

        # Get the most likely hidden state at the last position
        last_hidden_state = np.argmax(posteriors[-1])

        # Get emission probabilities from that state
        emission_probs = model.emissionprob_[last_hidden_state]

        # Use that state's most likely emission as the prediction
        # (note: the transition matrix is not applied here)
        next_obs = np.argmax(emission_probs)
        return int(next_obs)
    except Exception:
        # Fallback 1: Viterbi path (predict) + emission of the last state
        try:
            history_array = np.array(history, dtype=np.int64).reshape(-1, 1)
            if len(history_array) == 0:
                raise ValueError("Empty history")
            hidden_states = model.predict(history_array)
            last_hidden_state = hidden_states[-1]
            emission_probs = model.emissionprob_[last_hidden_state]
            next_obs = np.argmax(emission_probs)
            return int(next_obs)
        except Exception:
            # Fallback 2: average emission over all hidden states
            try:
                emission_probs = model.emissionprob_.mean(axis=0)
                next_obs = np.argmax(emission_probs)
                return int(next_obs)
            except Exception:
                return None

def predict_top_k(model, history, k=5, use_patterns=True):
    """Get top-K most likely next locations using pattern-based with HMM fallback"""
    if len(history) == 0:
        return []
    
    # Try pattern-based prediction first (more reliable)
    if use_patterns and len(history) > 0:
        last_obs = history[-1]
        if last_obs in transition_probs:
            next_probs = transition_probs[last_obs]
            if next_probs:
                sorted_patterns = sorted(next_probs.items(), key=lambda x: x[1], reverse=True)
                pattern_preds = [int(loc) for loc, _ in sorted_patterns[:k]]
                # If we have enough from patterns, return them
                if len(pattern_preds) >= k:
                    return pattern_preds[:k]
                else:
                    # Get remaining from HMM
                    hmm_preds = []
                    try:
                        history_array = np.array(history, dtype=np.int64).reshape(-1, 1)
                        logprob, posteriors = model.score_samples(history_array)
                        last_hidden_state = np.argmax(posteriors[-1])
                        emission_probs = model.emissionprob_[last_hidden_state]
                        top_k_indices = np.argsort(emission_probs)[-k:][::-1]
                        hmm_preds = [int(idx) for idx in top_k_indices if int(idx) not in pattern_preds]
                    except Exception:
                        pass
                    
                    # Combine pattern and HMM predictions
                    combined = pattern_preds + hmm_preds
                    return combined[:k]
    
    # Fallback to HMM only
    try:
        history_array = np.array(history, dtype=np.int64).reshape(-1, 1)
        if len(history_array) == 0:
            return []
        
        # State posteriors for the observed history (forward-backward pass)
        logprob, posteriors = model.score_samples(history_array)
        last_hidden_state = np.argmax(posteriors[-1])
        emission_probs = model.emissionprob_[last_hidden_state]

        # Top-K emissions of that state (the transition matrix is not applied)
        top_k_indices = np.argsort(emission_probs)[-k:][::-1]
        return [int(idx) for idx in top_k_indices]
    except Exception:
        # Fallback 1: Viterbi path (predict) + emission of the last state
        try:
            history_array = np.array(history, dtype=np.int64).reshape(-1, 1)
            if len(history_array) == 0:
                return []
            hidden_states = model.predict(history_array)
            last_hidden_state = hidden_states[-1]
            emission_probs = model.emissionprob_[last_hidden_state]
            top_k_indices = np.argsort(emission_probs)[-k:][::-1]
            return [int(idx) for idx in top_k_indices]
        except Exception:
            # Fallback 2: average emission over all hidden states
            try:
                emission_probs = model.emissionprob_.mean(axis=0)
                top_k_indices = np.argsort(emission_probs)[-k:][::-1]
                return [int(idx) for idx in top_k_indices]
            except Exception:
                return []

print("Evaluation setup complete!")


Test sequence length: 50 events
Created 49 test cases
Loaded coordinates for 2073 places
Building transition patterns from training data...
Built transition patterns for 286 locations
Evaluation setup complete!


## Section 8 — Metric 1: Accuracy

**Definition**: Fraction of test cases in which the predicted next location equals the true next location.


In [25]:
# Calculate Accuracy
print("Calculating Accuracy...")
predictions = []
true_labels = []

for history, true_next in tqdm(test_cases, desc="Making predictions"):
    pred = predict_next_location(model, history, use_patterns=True)
    if pred is not None:
        predictions.append(pred)
        true_labels.append(true_next)

# Calculate accuracy
if len(predictions) == 0:
    print("ERROR: No predictions were made!")
    accuracy = 0
    correct = 0
    total = 0
else:
    correct = sum(1 for p, t in zip(predictions, true_labels) if p == t)
    total = len(predictions)
    accuracy = correct / total if total > 0 else 0
    
    # Debug: Show first few predictions vs true
    print(f"\nDebug - First 5 predictions:")
    for i in range(min(5, len(predictions))):
        pred_place = encoded_to_placeid.get(predictions[i], "Unknown")
        true_place = encoded_to_placeid.get(true_labels[i], "Unknown")
        match = "✓" if predictions[i] == true_labels[i] else "✗"
        print(f"  {match} Pred: {predictions[i]} ({pred_place[:20]}) | True: {true_labels[i]} ({true_place[:20]})")

print(f"\n{'='*60}")
print(f"METRIC 1: ACCURACY")
print(f"{'='*60}")
print(f"Correct predictions: {correct}")
print(f"Total predictions: {total}")
print(f"Accuracy: {accuracy:.12f}")
print(f"{'='*60}")


Calculating Accuracy...


Making predictions: 100%|██████████| 49/49 [00:00<00:00, 135033.44it/s]


Debug - First 5 predictions:
  ✗ Pred: 219 (296_2075) | True: 213 (295_2076)
  ✗ Pred: 220 (296_2076) | True: 214 (295_2077)
  ✓ Pred: 213 (295_2076) | True: 213 (295_2076)
  ✓ Pred: 220 (296_2076) | True: 220 (296_2076)
  ✓ Pred: 219 (296_2075) | True: 219 (296_2075)

METRIC 1: ACCURACY
Correct predictions: 32
Total predictions: 49
Accuracy: 0.653061224490





## Section 9 — Metric 2: Precision & Recall

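**Definition**:
- **Precision**: Of the times a location was predicted, the fraction that were correct. Computed per location, then averaged with each location weighted by how often it is the true next location (its support).
- **Recall**: Of the times a location was the true next location, the fraction that were predicted correctly. Averaged with the same support weights.

Both use scikit-learn's `average='weighted'`. With exactly one prediction per test case, support-weighted recall equals overall accuracy, which is why Recall and Accuracy report the same value.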
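
To make the weighting concrete, the sketch below computes support-weighted precision and recall by hand on a toy example. It is meant to mirror what `average='weighted'` with `zero_division=0` does, under that assumption; `weighted_precision_recall` is a hypothetical helper and the labels are made up.

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Support-weighted precision and recall for single-label predictions."""
    support = Counter(y_true)      # how often each class is the true label
    predicted = Counter(y_pred)    # how often each class is predicted
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    precision = sum(
        support[c] / n * (hits[c] / predicted[c] if predicted[c] else 0.0)
        for c in support
    )
    recall = sum(support[c] / n * (hits[c] / support[c]) for c in support)
    return precision, recall

y_true = [1, 1, 2, 3]
y_pred = [1, 2, 2, 2]
precision, recall = weighted_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(precision, 4), recall, accuracy)  # 0.5833 0.5 0.5
```

The weighted recall and the accuracy agree on the toy data for the same reason they agree in the notebook output.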


In [21]:
# Calculate Precision & Recall (Weighted)
print("Calculating Precision & Recall (Weighted)...")

if len(predictions) > 0:
    precision_weighted = precision_score(true_labels, predictions, average='weighted', zero_division=0)
    recall_weighted = recall_score(true_labels, predictions, average='weighted', zero_division=0)
else:
    precision_weighted = recall_weighted = 0

print(f"\n{'='*60}")
print(f"METRIC 2: PRECISION & RECALL")
print(f"{'='*60}")
print(f"Precision: {precision_weighted:.12f}")
print(f"Recall: {recall_weighted:.12f}")
print(f"{'='*60}")


Calculating Precision & Recall (Weighted)...

METRIC 2: PRECISION & RECALL
Precision: 0.605081426510
Recall: 0.653061224490


## Section 10 — Metric 3: Top-K Accuracy

**Definition**: A prediction counts as correct if the true next location appears among the top K predicted locations. Evaluated for K = 1, 3, 5; with K = 1 this reduces to the accuracy metric.
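
A minimal sketch of the metric on toy data, ranking candidates by transition counts from the last location. The helper names and counts are hypothetical. Unlike the evaluation cell below, this sketch counts a case with no candidates as a miss instead of dropping it from the denominator.

```python
def rank_next_locations(transition_counts, last_location, k):
    """Return up to k next locations, most frequent first."""
    candidates = transition_counts.get(last_location, {})
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [loc for loc, _ in ranked[:k]]

def top_k_accuracy(cases, transition_counts, k):
    """Share of (last_location, true_next) cases whose truth is in the top k."""
    hits = sum(
        1 for last, truth in cases
        if truth in rank_next_locations(transition_counts, last, k)
    )
    return hits / len(cases) if cases else 0.0

# Toy transition counts: location -> {next location: count}.
toy_counts = {0: {1: 5, 2: 3, 3: 1}, 1: {0: 4, 2: 2}}
toy_cases = [(0, 1), (0, 2), (0, 3), (1, 2)]

toy_results = {k: top_k_accuracy(toy_cases, toy_counts, k) for k in (1, 2, 3)}
print(toy_results)  # {1: 0.25, 2: 0.75, 3: 1.0}
```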


In [22]:
# Calculate Top-K Accuracy
print("Calculating Top-K Accuracy...")

k_values = [1, 3, 5]
top_k_results = {}

for k in k_values:
    correct_k = 0
    total_k = 0
    
    for history, true_next in tqdm(test_cases, desc=f"Top-{k}"):
        top_k_preds = predict_top_k(model, history, k=k, use_patterns=True)
        if len(top_k_preds) > 0:
            total_k += 1
            if true_next in top_k_preds:
                correct_k += 1
    
    top_k_accuracy = correct_k / total_k if total_k > 0 else 0
    top_k_results[k] = {
        'correct': correct_k,
        'total': total_k,
        'accuracy': top_k_accuracy
    }

print(f"\n{'='*60}")
print(f"METRIC 3: TOP-K ACCURACY")
print(f"{'='*60}")
for k in k_values:
    result = top_k_results[k]
    print(f"Top-{k} Accuracy: {result['accuracy']:.12f}")
print(f"{'='*60}")


Calculating Top-K Accuracy...


Top-1: 100%|██████████| 49/49 [00:00<00:00, 90777.78it/s]
Top-3: 100%|██████████| 49/49 [00:00<00:00, 758.78it/s]
Top-5: 100%|██████████| 49/49 [00:00<00:00, 595.29it/s]


METRIC 3: TOP-K ACCURACY
Top-1 Accuracy: 0.653061224490
Top-3 Accuracy: 0.897959183673
Top-5 Accuracy: 0.918367346939





## Section 11 — Metric 4: Mean Prediction Distance (MPD)

**Definition**: Mean Prediction Distance (MPD) is the average Haversine distance, in meters, between the predicted next location and the actual next location. Correct predictions contribute a distance of 0, and pairs more than 1000 km apart are treated as coordinate errors and excluded.
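
For reference, the Haversine distance can be computed with the standard library alone. The sketch below is an illustration of the formula, not the `haversine` package call used in the cell below; the Earth radius constant and the helper names are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def mean_prediction_distance(pairs):
    """Average distance over ((pred_lat, pred_lon), (true_lat, true_lon)) pairs."""
    distances = [haversine_m(*pred, *true) for pred, true in pairs]
    return sum(distances) / len(distances) if distances else 0.0

# One exact hit (distance 0) and one miss by one degree of longitude at the equator.
toy_pairs = [((0.0, 0.0), (0.0, 0.0)), ((0.0, 0.0), (0.0, 1.0))]
toy_mpd = mean_prediction_distance(toy_pairs)
```

One degree of longitude at the equator is roughly 111 km, so `toy_mpd` comes out near 55.6 km: the exact hit pulls the mean down, just as correct predictions do in the notebook's MPD.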


In [23]:
# Calculate Mean Prediction Distance (MPD)
print("Calculating Mean Prediction Distance (MPD)...")

distances = []
failed_conversions = 0

for history, true_next in tqdm(test_cases, desc="Calculating distances"):
    pred = predict_next_location(model, history, use_patterns=True)
    
    if pred is not None:
        # Convert encoded IDs back to place_ids
        pred_place_id = encoded_to_placeid.get(pred)
        true_place_id = encoded_to_placeid.get(true_next)
        
        if pred_place_id and true_place_id:
            # Get coordinates
            pred_lat, pred_lon = place_id_to_coords(pred_place_id, place_coords, grid_metadata)
            true_lat, true_lon = place_id_to_coords(true_place_id, place_coords, grid_metadata)
            
            if pred_lat is not None and true_lat is not None:
                # Calculate haversine distance
                try:
                    distance_m = haversine((pred_lat, pred_lon), (true_lat, true_lon)) * 1000
                    # Filter out unrealistic distances (likely coordinate errors)
                    if distance_m < 1000000:  # Less than 1000 km
                        distances.append(distance_m)
                    else:
                        failed_conversions += 1
                except Exception:
                    failed_conversions += 1
            else:
                failed_conversions += 1
        else:
            failed_conversions += 1
    else:
        failed_conversions += 1

if failed_conversions > 0:
    print(f"Warning: {failed_conversions} distance calculations failed or were filtered")

mpd = np.mean(distances) if len(distances) > 0 else 0
mpd_median = np.median(distances) if len(distances) > 0 else 0
mpd_std = np.std(distances) if len(distances) > 0 else 0

print(f"\n{'='*60}")
print(f"METRIC 4: MEAN PREDICTION DISTANCE (MPD)")
print(f"{'='*60}")
print(f"MPD Distance: {mpd:.12f} meters")
print(f"Valid distance calculations: {len(distances)}/{len(test_cases)}")
print(f"{'='*60}")


Calculating Mean Prediction Distance (MPD)...


Calculating distances: 100%|██████████| 49/49 [00:00<00:00, 87158.99it/s]


METRIC 4: MEAN PREDICTION DISTANCE (MPD)
MPD Distance: 4364.404451428954 meters
Valid distance calculations: 49/49





## Section 12 — Results Summary

Summary of all evaluation metrics.


In [24]:
# Compile all results
results = {
    'num_users': NUM_USERS,
    'selected_users': selected_users,
    'preprocessing': {
        'total_original_places': total_original,
        'total_after_duplicate_removal': total_processed,
        'total_duplicates_removed': total_original - total_processed,
        'sequence_length': SEQUENCE_LENGTH,
        'total_sequences': len(all_sequences),
        'training_sequences': len(train_sequences),
        'test_sequences': len(test_sequences)
    },
    'model': {
        'unique_states': n_states,
        'hidden_states': n_hidden_states
    },
    'accuracy': {
        'value': accuracy,
        'correct': correct,
        'total': total
    },
    'precision_recall': {
        'precision': float(precision_weighted),
        'recall': float(recall_weighted)
    },
    'top_k_accuracy': {
        f'top_{k}_accuracy': float(top_k_results[k]['accuracy']) for k in k_values
    },
    'mpd_distance': {
        'mpd_distance_meters': float(mpd),
        'valid_calculations': len(distances)
    }
}

# Display summary
print(f"\n{'='*60}")
print(f"EVALUATION RESULTS SUMMARY")
print(f"{'='*60}")
print(f"\nNumber of users: {NUM_USERS}")
print(f"Users: {selected_users}")
print(f"Total original places: {total_original}")
print(f"After duplicate removal: {total_processed}")
print(f"Training sequences: {len(train_sequences)}")
print(f"Test sequences: {len(test_sequences)}")

print(f"\n1. ACCURACY")
print(f"   Accuracy: {accuracy:.12f}")

print(f"\n2. PRECISION & RECALL")
print(f"   Precision: {precision_weighted:.12f}")
print(f"   Recall: {recall_weighted:.12f}")

print(f"\n3. TOP-K ACCURACY")
for k in k_values:
    acc = top_k_results[k]['accuracy']
    print(f"   Top-{k} Accuracy: {acc:.12f}")

print(f"\n4. MEAN PREDICTION DISTANCE (MPD)")
print(f"   MPD Distance: {mpd:.12f} meters")

print(f"\n{'='*60}")

# Save results
with open(RESULTS_SAVE_PATH, 'w') as f:
    json.dump(results, f, indent=2)

print(f"\nResults saved to {RESULTS_SAVE_PATH}")

# Create results DataFrame
results_df = pd.DataFrame({
    'Metric': [
        'Accuracy',
        'Precision',
        'Recall',
        'Top-1 Accuracy',
        'Top-3 Accuracy',
        'Top-5 Accuracy',
        'MPD Distance'
    ],
    'Value': [
        f"{accuracy:.12f}",
        f"{precision_weighted:.12f}",
        f"{recall_weighted:.12f}",
        f"{top_k_results[1]['accuracy']:.12f}",
        f"{top_k_results[3]['accuracy']:.12f}",
        f"{top_k_results[5]['accuracy']:.12f}",
        f"{mpd:.12f}"
    ]
})

print("\nResults Table:")
print(results_df.to_string(index=False))



EVALUATION RESULTS SUMMARY

Number of users: 10
Users: ['000', '001', '005', '006', '009', '011', '014', '016', '019', '025']
Total original places: 1752364
After duplicate removal: 4087
Training sequences: 119
Test sequences: 30

1. ACCURACY
   Accuracy: 0.653061224490

2. PRECISION & RECALL
   Precision: 0.605081426510
   Recall: 0.653061224490

3. TOP-K ACCURACY
   Top-1 Accuracy: 0.653061224490
   Top-3 Accuracy: 0.897959183673
   Top-5 Accuracy: 0.918367346939

4. MEAN PREDICTION DISTANCE (MPD)
   MPD Distance: 4364.404451428954 meters


Results saved to /home/root495/Inexture/Location Prediction Update/results/hmm_10users_results.json

Results Table:
        Metric             Value
      Accuracy    0.653061224490
     Precision    0.605081426510
        Recall    0.653061224490
Top-1 Accuracy    0.653061224490
Top-3 Accuracy    0.897959183673
Top-5 Accuracy    0.918367346939
  MPD Distance 4364.404451428954
