# BlocksWorld TTM
This notebook uses the **2nd Encoding** format to apply the Granite TTM (TinyTimeMixer) model to the BlocksWorld domain.

Key modifications from standard TTM:
- Input format includes goal state concatenated with current state
- Binary state prediction instead of continuous values
- Custom metrics for planning success
- Sequence padding to handle variable-length plans

In [1]:
import json
import math
import os
from dataclasses import dataclass, asdict
from typing import List, Optional, Dict, Any

import numpy as np
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import Dataset
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments, set_seed
from pprint import pprint, pformat

from tsfm_public import TimeSeriesPreprocessor, TrackingCallback
from tsfm_public.toolkit.get_model import get_model

from BlocksWorld import BlocksWorldGenerator
from BlocksWorldValidator import ValidationResult, BlocksWorldValidator

In [2]:
# Constants
SEED = 13
set_seed(SEED)
TTM_MODEL_PATH = "ibm-granite/granite-timeseries-ttm-r2"

In [3]:
# Determine device
if torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
elif torch.cuda.is_available():
    DEVICE = torch.device("cuda")
else:
    DEVICE = torch.device("cpu")

print(f"Using device: {DEVICE}")

Using device: mps


## Helper Classes

### Configuration and Plan Dataclasses
`ModelConfig` holds the training hyperparameters, and `BlocksWorldSample` stores an individual planning example.

In [4]:
@dataclass
class ModelConfig:
    context_length: int = 512
    prediction_length: int = 96
    learning_rate: float = 1e-4
    batch_size: int = 32
    num_epochs: int = 50
    state_dim: Optional[int] = None  # Will be set during training

In [5]:
@dataclass
class BlocksWorldSample:
    initial_state: List[int]
    goal_state: List[int]
    plan: List[List[int]]
    actions: List[List[str]]
    feature_names: List[str]

### Custom BlocksWorld Dataset Class
The class handles:
  - Loading JSON plan data
  - Padding sequences to match context length
  - Combining state and goal information
  - Converting to appropriate tensor format

In [6]:
class BlocksWorldDataset(Dataset):
    def __init__(self, data_path: str, context_length: int, prediction_length: int):
        self.context_length: int = context_length
        self.prediction_length: int = prediction_length
        self.device = DEVICE

        with open(data_path, 'r') as f:
            raw_data = json.load(f)['plans']

        self.samples: List[BlocksWorldSample] = []
        for item in raw_data:
            sample = BlocksWorldSample(
                initial_state=item['initial_state'],
                goal_state=item['goal_state'],
                plan=item['plan'],
                actions=item['actions'],
                feature_names=item['feature_names']
            )
            self.samples.append(sample)

        # Get dimensionality from first sample
        self.state_dim: int = len(self.samples[0].initial_state)

    def __len__(self):  # Length of the Dataset
        return len(self.samples)
    
    def __getitem__(self, idx):
        sample = self.samples[idx]
        
        # Pad plan sequence to match context_length + prediction_length
        plan_len = len(sample.plan)
        full_seq = sample.plan + [sample.goal_state] * (self.context_length + self.prediction_length - plan_len)
        
        # Split into past and future
        past_seq = full_seq[:self.context_length]
        future_seq = full_seq[self.context_length:self.context_length + self.prediction_length]
        
        # Convert to numpy arrays
        past_values = np.array(past_seq, dtype=np.float32)
        future_values = np.array(future_seq, dtype=np.float32)
        
        # Create masks (1 indicates valid values)
        past_observed_mask = np.ones((self.context_length, self.state_dim), dtype=np.float32)
        future_observed_mask = np.ones((self.prediction_length, self.state_dim), dtype=np.float32)
        
        # Include goal state as static categorical feature
        static_categorical_values = np.array(sample.goal_state, dtype=np.float32)

        return {
            "past_values": torch.tensor(past_values, dtype=torch.float32).to(self.device),
            "future_values": torch.tensor(future_values, dtype=torch.float32).to(self.device),
            "past_observed_mask": torch.tensor(past_observed_mask, dtype=torch.float32).to(self.device),
            "future_observed_mask": torch.tensor(future_observed_mask, dtype=torch.float32).to(self.device),
            "static_categorical_values": torch.tensor(static_categorical_values, dtype=torch.float32).to(self.device),
            "freq_token": torch.zeros(1, dtype=torch.long).to(self.device)  # Placeholder for TTM
        }

**Why do we need padding?**
The model expects every input sequence to be exactly `context_length` timesteps long, followed by `prediction_length` target timesteps, but our plans vary in length (some take 3 steps, others 10). The padding strategy is therefore: (1) when the plan is too short, pad it with the goal state; (2) when the plan is too long, truncate it, so the first `context_length` steps become the past window and the next `prediction_length` steps become the future window.

For example, if we have:
```python
context_length = 5
plan = [[1,0,0], [1,1,0], [0,1,1]]  # 3 steps
goal_state = [0,1,1]
```

After padding:
```python
padded_plan = [
    [1,0,0],    # Original step 1
    [1,1,0],    # Original step 2
    [0,1,1],    # Original step 3
    [0,1,1],    # Padded with goal
    [0,1,1]     # Padded with goal
]
```

Why pad with the goal state instead of zeros?
1. **Semantic Meaning**: Using the goal state maintains the logical meaning - "after reaching the goal, we stay in the goal state"
2. **Learning Signal**: It helps the model understand that reaching the goal is a stable state
3. **Consistency**: Ensures all states in the sequence are valid block configurations
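
The padding and splitting logic described above can be sketched in plain Python. This is a minimal illustration that mirrors the list arithmetic in `BlocksWorldDataset.__getitem__`; the helper name `pad_and_split` is hypothetical and not part of the notebook.

```python
from typing import List, Tuple

State = List[int]


def pad_and_split(
    plan: List[State],
    goal_state: State,
    context_length: int,
    prediction_length: int,
) -> Tuple[List[State], List[State]]:
    """Pad a plan with the goal state, then split it into past and future windows."""
    total = context_length + prediction_length
    # A negative repeat count yields an empty list, so long plans get no padding.
    padded = plan + [goal_state] * (total - len(plan))
    past = padded[:context_length]
    # Slicing also truncates plans longer than the combined window.
    future = padded[context_length:total]
    return past, future


plan = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]  # 3 steps
goal_state = [0, 1, 1]
past, future = pad_and_split(plan, goal_state, context_length=5, prediction_length=2)
print(past)    # 3 original steps followed by 2 goal-state copies
print(future)  # 2 goal-state copies
```

Note that whenever a plan is shorter than `context_length`, the future window consists only of copies of the goal state.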

### BlocksWorld-Based TTM Class
To handle training and prediction.

In [7]:
class BlocksWorldTTM:
    def __init__(
        self,
        context_length: int = 512,
        prediction_length: int = 96,
        learning_rate: float = 1e-4,
        batch_size: int = 32,
        num_epochs: int = 50
    ):
        self.config = ModelConfig(
            context_length=context_length,
            prediction_length=prediction_length,
            learning_rate=learning_rate,
            batch_size=batch_size,
            num_epochs=num_epochs
        )
        self.device = DEVICE
        self.model = None
        self.trainer = None
    
    def train(self, train_dataset: Dataset, val_dataset: Optional[Dataset] = None):
        """Train the model on given datasets"""
        # Store state dimension from training data
        self.config.state_dim = train_dataset.dataset.state_dim if hasattr(train_dataset, 'dataset') else train_dataset.state_dim
        
        # Initialize model
        self.model = get_model(
            TTM_MODEL_PATH,
            context_length=self.config.context_length,
            prediction_length=self.config.prediction_length,
            head_dropout=0.1
        ).to(self.device)
        
        # Training arguments
        training_args = TrainingArguments(
            output_dir="blocks_world_ttm",
            learning_rate=self.config.learning_rate,
            num_train_epochs=self.config.num_epochs,
            per_device_train_batch_size=self.config.batch_size,
            per_device_eval_batch_size=self.config.batch_size,
            evaluation_strategy="epoch" if val_dataset else "no",
            save_strategy="epoch",
            load_best_model_at_end=True if val_dataset else False,
            metric_for_best_model="eval_loss",
            greater_is_better=False,
            seed=SEED,
            report_to="none"
        )
        
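        # Callbacks (early stopping needs a validation set to monitor eval_loss)
        callbacks = [TrackingCallback()]
        if val_dataset is not None:
            callbacks.append(EarlyStoppingCallback(early_stopping_patience=5))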
        
        # Optimizer and scheduler
        optimizer = AdamW(self.model.parameters(), lr=self.config.learning_rate)
        scheduler = OneCycleLR(
            optimizer,
            max_lr=self.config.learning_rate,
            epochs=self.config.num_epochs,
            steps_per_epoch=math.ceil(len(train_dataset) / self.config.batch_size)
        )
        
        # Initialize trainer
        self.trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            callbacks=callbacks,
            optimizers=(optimizer, scheduler)
        )
        
        # Train
        self.trainer.train()

    def predict(self, initial_states: torch.Tensor, goal_states: torch.Tensor) -> torch.Tensor:
        """Generate action sequences to reach goals from given states"""
        if self.model is None:
            raise RuntimeError("Model needs to be trained or loaded before prediction")

        self.model.eval()
        with torch.no_grad():
            batch_size = initial_states.shape[0]
            
            # Create context sequence by repeating initial states
            context_sequence = initial_states.unsqueeze(1).repeat(1, self.config.context_length, 1)
            
            # Prepare inputs
            inputs = {
                "past_values": context_sequence.to(self.device),
                "past_observed_mask": torch.ones_like(context_sequence).to(self.device),
                "static_categorical_values": goal_states.to(self.device),
                "freq_token": torch.zeros(batch_size, dtype=torch.long).to(self.device)
            }
            
            # Generate predictions
            outputs = self.model(**inputs)
            predictions = torch.sigmoid(outputs[0])
            predictions = torch.round(predictions)
            
        return predictions
    
    def save(self, path: str):
        """Save model weights and configuration"""
        if self.model is None:
            raise RuntimeError("No model to save. Train or load a model first.")
            
        # Create directory if it doesn't exist
        os.makedirs(path, exist_ok=True)
        
        # Save model state
        model_path = os.path.join(path, "model.pt")
        torch.save(self.model.state_dict(), model_path)
        
        # Save configuration
        config_path = os.path.join(path, "config.json")
        with open(config_path, 'w') as f:
            json.dump(asdict(self.config), f)
            
        print(f"Model saved to {path}")

    @classmethod
    def load(cls, path: str) -> 'BlocksWorldTTM':
        """Load model weights and configuration"""
        # Load configuration
        config_path = os.path.join(path, "config.json")
        with open(config_path, 'r') as f:
            config_dict = json.load(f)
            
        # Create instance with loaded config
        instance = cls(
            context_length=config_dict['context_length'],
            prediction_length=config_dict['prediction_length'],
            learning_rate=config_dict['learning_rate'],
            batch_size=config_dict['batch_size'],
            num_epochs=config_dict['num_epochs']
        )
        instance.config.state_dim = config_dict['state_dim']
        
        # Initialize and load model
        instance.model = get_model(
            TTM_MODEL_PATH,
            context_length=instance.config.context_length,
            prediction_length=instance.config.prediction_length,
            head_dropout=0.1
        ).to(instance.device)
        
        model_path = os.path.join(path, "model.pt")
        instance.model.load_state_dict(torch.load(model_path, map_location=instance.device))
        instance.model.eval()
        
        print(f"Model loaded from {path}")
        return instance

The model receives these key components for each sample during **training**:
1. Past Values (past_values):
    
    `past_values = torch.tensor(past_seq, dtype=torch.float32)`
    - Shape: [batch_size, context_length, state_dim]
    - These are sequences of states leading up to the current point
    - Each state is a binary vector representing the blocks world predicates
    - Length is padded to context_length (512 by default)
2. Future Values (future_values):
    
    `future_values = torch.tensor(future_seq, dtype=torch.float32)`
    - Shape: [batch_size, prediction_length, state_dim]
    - These are the target sequences of states we want to predict
    - Length is padded to prediction_length (96 by default)
3. Observation Masks:
    
    `past_observed_mask = torch.ones((context_length, state_dim))`

    `future_observed_mask = torch.ones((prediction_length, state_dim))`
    - Binary masks marking which values are observed (1) versus missing (0)
    - In this notebook both masks are all ones: padded timesteps hold the goal state and are treated as valid observations
4. Static Categorical Values:
    
    `static_categorical_values = torch.tensor(sample.goal_state)`
    - Shape: [batch_size, state_dim]
    - The goal state we want to reach
    - Stays constant across the entire sequence
    - Helps guide the prediction towards the goal


During **prediction**, the model takes:
1. Initial States:
    
    `past_values = torch.tensor(initial_states).unsqueeze(1).repeat(1, context_length, 1)`
    - The starting state is repeated to fill the context window
    - This gives the model the current state as context
2. Goal States:

    `static_categorical_values = torch.tensor(goal_states)`
    - Target goal state as static features
    - Guides the generation of the plan

The model outputs:

```
predictions = torch.sigmoid(outputs[0])  # Convert to probabilities
predictions = torch.round(predictions)   # Convert to binary states
```

- Shape: [batch_size, prediction_length, state_dim]
- Sequence of predicted states forming a plan
- Each state is a binary vector matching the input encoding
- The sequence should transition from initial state to goal state
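
The post-processing step above (sigmoid, then round) can be illustrated without any tensors. The sketch below is plain Python, and the helper names `sigmoid` and `binarize` are assumptions for illustration; it treats the raw model outputs as logits, as the notebook's `predict` method does.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Map a raw score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binarize(logits: List[float]) -> List[int]:
    """Apply sigmoid and round each probability to 0 or 1."""
    # Python's round(), like torch.round, rounds exact halves to the nearest even
    # integer, so a probability of exactly 0.5 (logit 0.0) becomes 0.
    return [int(round(sigmoid(v))) for v in logits]


raw_state = [2.3, -1.7, 0.4, -0.2, 0.0]  # made-up scores for one predicted state
print(binarize(raw_state))
```

In effect, a bit is set to 1 exactly when its raw score is positive.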

### Helper Methods

In [8]:
def prepare_datasets(data_path: str, context_length: int, prediction_length: int):
    """Create train/val/test datasets"""
    full_dataset = BlocksWorldDataset(data_path, context_length, prediction_length)
    
    # Split indices
    total_size = len(full_dataset)
    train_size = int(0.7 * total_size)
    val_size = int(0.15 * total_size)
    test_size = total_size - train_size - val_size
    
    train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(
        full_dataset, 
        [train_size, val_size, test_size],
        generator=torch.Generator().manual_seed(SEED)
    )
    
    return train_dataset, val_dataset, test_dataset
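
The split arithmetic in `prepare_datasets` can be checked in isolation. The sketch below is a hypothetical helper (`split_sizes` is not part of the notebook) that reproduces only the size computation, not the random assignment done by `torch.utils.data.random_split`.

```python
from typing import Tuple


def split_sizes(
    total_size: int, train_frac: float = 0.7, val_frac: float = 0.15
) -> Tuple[int, int, int]:
    """Compute train/val/test sizes; the test split absorbs rounding remainders."""
    train_size = int(train_frac * total_size)
    val_size = int(val_frac * total_size)
    test_size = total_size - train_size - val_size
    return train_size, val_size, test_size


print(split_sizes(50))  # sizes for a 50-plan dataset
```

For a dataset of 50 plans this gives 35, 7, and 8 samples, the sizes reported when the notebook loads `dataset_3.json`.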

In [9]:
def evaluate_predictions(predictions, targets):
    """Compute metrics for predicted plans"""
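    # Move to CPU first: .numpy() fails on tensors that live on MPS or CUDA
    predictions = predictions.detach().cpu().numpy()
    targets = targets.detach().cpu().numpy()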
    
    # State prediction accuracy
    state_accuracy = np.mean(predictions == targets)
    
    # Goal achievement rate (exact match of final state)
    goal_achieved = np.all(predictions[:, -1] == targets[:, -1], axis=1)
    goal_achievement_rate = np.mean(goal_achieved)
    
    # Partial goal achievement (percentage of correct final state bits)
    partial_goal = np.mean(predictions[:, -1] == targets[:, -1], axis=1)
    avg_partial_goal = np.mean(partial_goal)
    
    return {
        "state_accuracy": state_accuracy,
        "goal_achievement_rate": goal_achievement_rate,
        "avg_partial_goal": avg_partial_goal
    }

In [10]:
def evaluate_model(model, test_dataset, verbose=True):
    """
    Comprehensive evaluation of the model focusing on goal state prediction
    """
    model.model.eval()
    all_predictions = []
    all_targets = []
    goal_state_predictions = []
    goal_state_targets = []
    
    with torch.no_grad():
        for i in range(len(test_dataset)):
            sample = test_dataset[i]
            
            # Get initial and goal states
            initial_state = sample['past_values'][0]  # First timestep
            goal_state = sample['static_categorical_values']
            target = sample['future_values']
            
            # Create context sequence by repeating initial state
            context_sequence = initial_state.unsqueeze(0).repeat(1, model.config.context_length, 1)
            
            # Prepare inputs
            inputs = {
                "past_values": context_sequence.to(model.device),
                "past_observed_mask": torch.ones_like(context_sequence).to(model.device),
                "static_categorical_values": goal_state.unsqueeze(0).to(model.device),
                "freq_token": torch.zeros(1, dtype=torch.long).to(model.device)
            }
            
            # Get model prediction
            outputs = model.model(**inputs)
            prediction = torch.sigmoid(outputs[0])
            prediction = torch.round(prediction)
            
            # Only keep the relevant part of the prediction 
            # (same length as target and same feature dimension)
            prediction = prediction[:, :target.shape[0], :target.shape[-1]]
            
            all_predictions.append(prediction)
            all_targets.append(target)
            
            # Store just the final states (goal states)
            goal_state_predictions.append(prediction[:, -1])
            goal_state_targets.append(target[-1])
    
    # Convert lists to tensors
    all_predictions = torch.cat(all_predictions, dim=0)
    all_targets = torch.stack(all_targets, dim=0)
    goal_state_predictions = torch.cat(goal_state_predictions, dim=0)
    goal_state_targets = torch.stack(goal_state_targets, dim=0)
    
    # Calculate metrics
    metrics = {
        # Overall sequence metrics
        "sequence_accuracy": torch.mean((all_predictions == all_targets).float()).item(),
        "sequence_hamming_distance": torch.mean(torch.sum((all_predictions != all_targets).float(), dim=(1,2))).item(),
        
        # Goal state specific metrics
        "goal_state_accuracy": torch.mean((goal_state_predictions == goal_state_targets).float()).item(),
        "goal_state_hamming_distance": torch.mean(torch.sum((goal_state_predictions != goal_state_targets).float(), dim=-1)).item(),
        
        # Partial goal achievement (percentage of correct bits in goal state)
        "partial_goal_achievement": torch.mean(torch.mean((goal_state_predictions == goal_state_targets).float(), dim=-1)).item(),
        
        # Perfect goal achievement rate (exact matches)
        "perfect_goal_achievement_rate": torch.mean(torch.all(goal_state_predictions == goal_state_targets, dim=-1).float()).item(),
    }
    
    if verbose:
        print("\nModel Evaluation Metrics:")
        print("-" * 50)
        for name, value in metrics.items():
            print(f"{name}: {value:.4f}")
            
        # Print shapes for debugging
        print("\nTensor shapes:")
        print(f"Predictions shape: {all_predictions.shape}")
        print(f"Targets shape: {all_targets.shape}")
        print(f"Goal state predictions shape: {goal_state_predictions.shape}")
        print(f"Goal state targets shape: {goal_state_targets.shape}")
        
    return metrics

In [11]:
def analyze_error_patterns(model, test_dataset, n_samples=5):
    """
    Analyze specific cases where the model fails or succeeds
    """
    model.model.eval()
    successes = []
    failures = []
    
    with torch.no_grad():
        for i in range(len(test_dataset)):
            sample = test_dataset[i]
            
            # Get initial and goal states
            initial_state = sample['past_values'][0]
            goal_state = sample['static_categorical_values']
            target = sample['future_values'][-1]
            
            # Create context sequence
            context_sequence = initial_state.unsqueeze(0).repeat(1, model.config.context_length, 1)
            
            # Prepare inputs
            inputs = {
                "past_values": context_sequence.to(model.device),
                "past_observed_mask": torch.ones_like(context_sequence).to(model.device),
                "static_categorical_values": goal_state.unsqueeze(0).to(model.device),
                "freq_token": torch.zeros(1, dtype=torch.long).to(model.device)
            }
            
            # Get prediction
            outputs = model.model(**inputs)
            prediction = torch.sigmoid(outputs[0])
            prediction = torch.round(prediction)
            predicted_goal = prediction[0, -1]
            
            # Check if prediction matches target
            is_correct = torch.all(predicted_goal == target)
            
            case = {
                'initial_state': initial_state.cpu().numpy(),
                'goal_state': goal_state.cpu().numpy(),
                'predicted_goal': predicted_goal.cpu().numpy(),
                'target_goal': target.cpu().numpy(),
                'hamming_distance': torch.sum((predicted_goal != target).float()).item()
            }
            
            if is_correct:
                successes.append(case)
            else:
                failures.append(case)
            
            if len(successes) >= n_samples and len(failures) >= n_samples:
                break
    
    return successes[:n_samples], failures[:n_samples]

## Model Training

In [12]:
# Create datasets
dataset_file = "../data/dataset_3.json"
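# Parse the block count from the file name (avoids nested double quotes, which need Python 3.12+)
num_blocks = int(dataset_file.split("_")[-1][0])
print(f"Number of blocks in the dataset: {num_blocks}")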

train_dataset, val_dataset, test_dataset = prepare_datasets(
    dataset_file,
    context_length=512,
    prediction_length=96
)

print(f"Train size: {len(train_dataset)}, Val size: {len(val_dataset)}, Test size: {len(test_dataset)}")

Number of blocks in the dataset: 3
Train size: 35, Val size: 7, Test size: 8


In [13]:
# # Initialize and train model
# ttm = BlocksWorldTTM(
#     context_length=512,
#     prediction_length=96,
#     learning_rate=1e-4,
#     batch_size=32,
#     num_epochs=50
# )

# print("Starting training...")
# ttm.train(train_dataset, val_dataset)

In [14]:
# Save & Load Model
save_path = f"../models/blocks_world_ttm_{num_blocks}"

# ttm.save(save_path)
# print(f"Saved to {save_path}")

In [15]:
ttm = BlocksWorldTTM.load(save_path)
print(f"Loaded from {save_path}")

INFO:p-56849:t-8250786368:get_model.py:get_model:Loading model from: ibm-granite/granite-timeseries-ttm-r2
INFO:p-56849:t-8250786368:get_model.py:get_model:Selected prediction_length = 96


INFO:p-56849:t-8250786368:get_model.py:get_model:Model loaded successfully!
INFO:p-56849:t-8250786368:get_model.py:get_model:[TTM] context_len = 512, forecast_len = 96


Model loaded from ../models/blocks_world_ttm_3
Loaded from ../models/blocks_world_ttm_3




In [16]:
# Evaluate model
print("\nEvaluating model performance...")
metrics = evaluate_model(ttm, test_dataset)

# Analyze error patterns
print("\nAnalyzing error patterns...")
successes, failures = analyze_error_patterns(ttm, test_dataset)

# Examine the initial states, goal states, and actions
gen = BlocksWorldGenerator(num_blocks=num_blocks)

print("\n--------------------------------------------------")
print(f"Length of the test dataset: {len(test_dataset)}")
print("\nExample Successes:")
for i, case in enumerate(successes[:3]):
    print(f"\nCase {i+1}:")
    print(f"Initial State: {gen.decode_vector_to_blocks(case['initial_state'])}")
    print(f"Goal State: {gen.decode_vector_to_blocks(case['goal_state'])}")
    print(f"Predicted Goal: {gen.decode_vector_to_blocks(case['predicted_goal'])}")
    
print("\nExample Failures:")
for i, case in enumerate(failures[:3]):
    print(f"\nCase {i+1}:")
    print(f"Initial State: {gen.decode_vector_to_blocks(case['initial_state'])}")
    print(f"Goal State: {gen.decode_vector_to_blocks(case['goal_state'])}")
    print(f"Predicted Goal: {gen.decode_vector_to_blocks(case['predicted_goal'])}")
    print(f"Target Goal: {gen.decode_vector_to_blocks(case['target_goal'])}")
    print(f"Hamming Distance: {case['hamming_distance']}")


Evaluating model performance...

Model Evaluation Metrics:
--------------------------------------------------
sequence_accuracy: 0.7413
sequence_hamming_distance: 372.5000
goal_state_accuracy: 0.7500
goal_state_hamming_distance: 3.7500
partial_goal_achievement: 0.7500
perfect_goal_achievement_rate: 0.1250

Tensor shapes:
Predictions shape: torch.Size([8, 96, 15])
Targets shape: torch.Size([8, 96, 15])
Goal state predictions shape: torch.Size([8, 15])
Goal state targets shape: torch.Size([8, 15])

Analyzing error patterns...

--------------------------------------------------
Length of the test dataset: 8

Example Successes:

Case 1:
Initial State: BlockState(clear={'B', 'A'}, on_table={'C', 'A'}, on={'B': 'C'}, holding=None)
Goal State: BlockState(clear={'B', 'A'}, on_table={'C', 'A'}, on={'B': 'C'}, holding=None)
Predicted Goal: BlockState(clear={'B', 'A'}, on_table={'C', 'A'}, on={'B': 'C'}, holding=None)

Example Failures:

Case 1:
Initial State: BlockState(clear={'C', 'B'}, on_tab

### Explanation of BlocksWorld Metrics

1. **Sequence Accuracy** (`sequence_accuracy`):

```python
"sequence_accuracy": torch.mean((all_predictions == all_targets).float()).item()
```
- Measures how accurately the model predicts the entire sequence of states
- Compares each predicted state with the target state at each timestep
- Value ranges from 0 to 1, where 1 means perfect prediction of all states in the sequence

2. **Sequence Hamming Distance** (`sequence_hamming_distance`):

```python
"sequence_hamming_distance": torch.mean(torch.sum((all_predictions != all_targets).float(), dim=(1,2))).item()
```
- Measures the average number of bits per sample that differ between predicted and target sequences, summed over all timesteps and state bits
- Higher values indicate more differences between predicted and actual states
- In the run above each sequence has 96 × 15 = 1440 bits, so a distance of 372.5 means roughly 26% of bits are wrong, consistent with a `sequence_accuracy` of 0.7413

3. **Goal State Accuracy** (`goal_state_accuracy`):

```python
"goal_state_accuracy": torch.mean((goal_state_predictions == goal_state_targets).float()).item()
```
- Measures per-bit accuracy of just the final state prediction (ignoring intermediate states)
- Indicates how close the model's final state is to the correct goal state, bit by bit
- Value between 0 and 1, where 1 means every goal-state bit is correct in every sample

4. **Goal State Hamming Distance** (`goal_state_hamming_distance`):

```python
"goal_state_hamming_distance": torch.mean(torch.sum((goal_state_predictions != goal_state_targets).float(), dim=-1)).item()
```
- Number of bits that differ between predicted and target goal states
- Lower values are better
- Helps understand how close the predicted goal state is to the target

5. **Partial Goal Achievement** (`partial_goal_achievement`):

```python
"partial_goal_achievement": torch.mean(torch.mean((goal_state_predictions == goal_state_targets).float(), dim=-1)).item()
```
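- Fraction of correctly predicted bits in the goal state, averaged per sample and then across samples
- Because every goal state has the same number of bits, this equals `goal_state_accuracy` (both are 0.7500 in the run above)
- Example: If 8 out of 10 bits are correct, value would be 0.8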

6. **Perfect Goal Achievement Rate** (`perfect_goal_achievement_rate`):

```python
"perfect_goal_achievement_rate": torch.mean(torch.all(goal_state_predictions == goal_state_targets, dim=-1).float()).item()
```
- Proportion of cases where the goal state is perfectly predicted
- Stricter than partial goal achievement
- Only counts exact matches as successes

These metrics help understand:
- Overall sequence prediction quality (sequence_accuracy, sequence_hamming_distance)
- Goal achievement quality (goal_state_accuracy, goal_state_hamming_distance)
- Partial vs Perfect success rates (partial_goal_achievement, perfect_goal_achievement_rate)

For the blocks world domain:
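- A high sequence accuracy means most predicted bits match the target states, although bitwise agreement alone does not guarantee that every predicted state is a valid block configuration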
- A low goal state hamming distance means predictions are close to desired configurations
- Perfect goal achievement rate shows how often the model reaches exactly the right block configuration
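
To make the six metrics concrete, they can be computed by hand on a tiny example. The sketch below is a plain-Python re-implementation for illustration only; `toy_metrics` and the toy data are hypothetical and are not the tensor code used in `evaluate_model`.

```python
from typing import Dict, List

Sequence = List[List[int]]  # [timesteps][state_dim]


def toy_metrics(preds: List[Sequence], targets: List[Sequence]) -> Dict[str, float]:
    """Compute the six metrics on nested lists of 0/1 values."""
    n = len(preds)
    state_dim = len(preds[0][0])
    bits_per_seq = len(preds[0]) * state_dim
    # Per-sample count of mismatched bits over the whole sequence.
    seq_errors = [
        sum(p != t for ps, ts in zip(pred, tgt) for p, t in zip(ps, ts))
        for pred, tgt in zip(preds, targets)
    ]
    # Per-sample count of mismatched bits in the final state only.
    goal_errors = [
        sum(p != t for p, t in zip(pred[-1], tgt[-1]))
        for pred, tgt in zip(preds, targets)
    ]
    return {
        "sequence_accuracy": 1 - sum(seq_errors) / (n * bits_per_seq),
        "sequence_hamming_distance": sum(seq_errors) / n,
        "goal_state_accuracy": 1 - sum(goal_errors) / (n * state_dim),
        "goal_state_hamming_distance": sum(goal_errors) / n,
        "partial_goal_achievement": sum(1 - e / state_dim for e in goal_errors) / n,
        "perfect_goal_achievement_rate": sum(e == 0 for e in goal_errors) / n,
    }


# Two samples, two timesteps, four state bits; the second sample has one wrong bit per step.
targets = [[[1, 0, 0, 1], [0, 1, 1, 0]], [[1, 1, 0, 0], [0, 0, 1, 1]]]
preds = [[[1, 0, 0, 1], [0, 1, 1, 0]], [[1, 0, 0, 0], [0, 1, 1, 1]]]
for name, value in toy_metrics(preds, targets).items():
    print(f"{name}: {value:.4f}")
```

On this toy data one of the two goal states is matched exactly, so the perfect goal achievement rate is 0.5, while the per-bit goal metrics stay at 0.875. This mirrors the gap between the partial and perfect rates in the evaluation output above.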