# BlocksWorld TTM

This notebook uses the **2nd Encoding** format to apply the Granite TTM model to the BlocksWorld domain.

Key modifications from standard TTM:

- Input format includes goal state concatenated with current state
- Binary state prediction instead of continuous values
- Custom metrics for planning success
- Sequence padding to handle variable-length plans


In [1]:
import json
import math
import os
from dataclasses import dataclass, asdict
from typing import List, Optional, Dict, Any
import numpy as np
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import Dataset
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments, set_seed

from tsfm_public import TrackingCallback
from tsfm_public.toolkit.get_model import get_model

from BlocksWorld import BlocksWorldGenerator
from pprint import pformat

In [2]:
# Constants
SEED = 13
set_seed(SEED)
TTM_MODEL_PATH = "ibm-granite/granite-timeseries-ttm-r2"

# Supported CLs are 52, 90, 180, 360, 520, 1024, 1536
CONTEXT_LENGTH = 360

# Determine device
if torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
elif torch.cuda.is_available():
    DEVICE = torch.device("cuda")
else:
    DEVICE = torch.device("cpu")

print(f"Using: {pformat([SEED, TTM_MODEL_PATH, CONTEXT_LENGTH, DEVICE])}")

Using: [13, 'ibm-granite/granite-timeseries-ttm-r2', 360, device(type='mps')]


## Helper Classes


### Plan Dataclass

`ModelConfig` holds the model and training hyperparameters, and `BlocksWorldSample` stores an individual planning example.


In [3]:
@dataclass
class ModelConfig:
    context_length: int = CONTEXT_LENGTH
    prediction_length: int = 96
    learning_rate: float = 1e-4
    batch_size: int = 32
    num_epochs: int = 50
    state_dim: Optional[int] = None  # Will be set during training

In [4]:
@dataclass
class BlocksWorldSample:
    initial_state: List[int]
    goal_state: List[int]
    plan: List[List[int]]
    actions: List[List[str]]
    feature_names: List[str]

### Custom BlocksWorld Dataset Class

The class handles:

- Loading JSON plan data
- Padding sequences to match context length
- Combining state and goal information
- Converting to appropriate tensor format
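
The dataset class reads a JSON file with a top-level `plans` list whose items carry the five fields of `BlocksWorldSample`. A minimal sketch of that layout follows. The 3-bit states, action strings, feature names, and file name are invented for illustration only; the real encoding comes from the dataset generator.

```python
import json
import os
import tempfile

# Hypothetical toy record: the values are made up, only the keys mirror
# what BlocksWorldDataset reads from each item.
toy_data = {
    "plans": [
        {
            "initial_state": [1, 0, 0],
            "goal_state": [0, 1, 1],
            "plan": [[1, 0, 0], [1, 1, 0], [0, 1, 1]],
            "actions": [["pickup", "A"], ["stack", "A", "B"]],
            "feature_names": ["f0", "f1", "f2"],
        }
    ]
}

# Round-trip through a temporary file, the same way the dataset class loads it.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_dataset.json")
    with open(toy_path, "w") as f:
        json.dump(toy_data, f)
    with open(toy_path, "r") as f:
        loaded_plans = json.load(f)["plans"]

# The state dimension is taken from the first sample's initial state.
state_dim = len(loaded_plans[0]["initial_state"])
print(f"{len(loaded_plans)} plan(s), state_dim={state_dim}")
```

Every state vector in a file is assumed to have the same length, since the class derives `state_dim` from the first sample only.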


In [5]:
class BlocksWorldDataset(Dataset):
    def __init__(self, data_path: str, context_length: int, prediction_length: int):
        self.context_length: int = context_length
        self.prediction_length: int = prediction_length
        self.device = DEVICE

        with open(data_path, "r") as f:
            raw_data = json.load(f)["plans"]

        self.samples: List[BlocksWorldSample] = []
        for item in raw_data:
            sample = BlocksWorldSample(
                initial_state=item["initial_state"],
                goal_state=item["goal_state"],
                plan=item["plan"],
                actions=item["actions"],
                feature_names=item["feature_names"],
            )
            self.samples.append(sample)

        # Get dimensionality from first sample
        self.state_dim: int = len(self.samples[0].initial_state)

    def __len__(self):  # Length of the Dataset
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]

        # Pad plan sequence to match context_length + prediction_length
        plan_len = len(sample.plan)
        full_seq = sample.plan + [sample.goal_state] * (
            self.context_length + self.prediction_length - plan_len
        )

        # Split into past and future
        past_seq = full_seq[: self.context_length]
        future_seq = full_seq[self.context_length : self.context_length + self.prediction_length]

        # Convert to numpy arrays
        past_values = np.array(past_seq, dtype=np.float32)
        future_values = np.array(future_seq, dtype=np.float32)

        # Create masks (1 indicates valid values)
        past_observed_mask = np.ones((self.context_length, self.state_dim), dtype=np.float32)
        future_observed_mask = np.ones((self.prediction_length, self.state_dim), dtype=np.float32)

        # Include goal state as static categorical feature
        static_categorical_values = np.array(sample.goal_state, dtype=np.float32)

        return {
            "past_values": torch.tensor(past_values, dtype=torch.float32).to(self.device),
            "future_values": torch.tensor(future_values, dtype=torch.float32).to(self.device),
            "past_observed_mask": torch.tensor(past_observed_mask, dtype=torch.float32).to(
                self.device
            ),
            "future_observed_mask": torch.tensor(future_observed_mask, dtype=torch.float32).to(
                self.device
            ),
            "static_categorical_values": torch.tensor(
                static_categorical_values, dtype=torch.float32
            ).to(self.device),
            "freq_token": torch.zeros(1, dtype=torch.long).to(self.device),  # Placeholder for TTM
        }

**Why do we need padding?**
The model expects every input sequence to be exactly `context_length` timesteps long, but plans vary in length (some take 3 steps, others 10). The padding strategy is therefore: (1) when a plan is too short, pad it with the goal state; (2) when a plan is too long, truncate it. In the dataset class, the sequence is padded to `context_length + prediction_length` and then split into the past and future windows.

For example, if we have:

```python
context_length = 5
plan = [[1,0,0], [1,1,0], [0,1,1]]  # 3 steps
goal_state = [0,1,1]
```

After padding:

```python
padded_plan = [
    [1,0,0],    # Original step 1
    [1,1,0],    # Original step 2
    [0,1,1],    # Original step 3
    [0,1,1],    # Padded with goal
    [0,1,1]     # Padded with goal
]
```

Why pad with the goal state instead of zeros?

1. **Semantic Meaning**: Using the goal state maintains the logical meaning - "after reaching the goal, we stay in the goal state"
2. **Learning Signal**: It helps the model understand that reaching the goal is a stable state
3. **Consistency**: Ensures all states in the sequence are valid block configurations
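
The padding and truncation rule can be written as a small standalone function. This is a minimal sketch using plain Python lists; `pad_or_truncate` is a hypothetical helper name, not something defined in the notebook, and it reuses the toy plan from the example above.

```python
def pad_or_truncate(plan, goal_state, length):
    """Return exactly `length` states: pad with the goal state or cut the tail."""
    # Number of goal-state copies needed; zero when the plan is already long enough.
    missing = max(0, length - len(plan))
    padding = [list(goal_state) for _ in range(missing)]
    return (plan + padding)[:length]


plan = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]  # 3 steps
goal_state = [0, 1, 1]

padded = pad_or_truncate(plan, goal_state, 5)     # shorter than the window: pad
truncated = pad_or_truncate(plan, goal_state, 2)  # longer than the window: truncate

print(padded)
print(truncated)
```

Each padded step is a fresh copy of the goal state, so later in-place edits to one step would not affect the others.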


### BlocksWorld-Based TTM Class

This class handles training, prediction, saving, and loading.


In [6]:
class BlocksWorldTTM:
    def __init__(
        self,
        context_length: int = CONTEXT_LENGTH,
        prediction_length: int = 96,
        learning_rate: float = 1e-4,
        batch_size: int = 32,
        num_epochs: int = 50,
    ):
        self.config = ModelConfig(
            context_length=context_length,
            prediction_length=prediction_length,
            learning_rate=learning_rate,
            batch_size=batch_size,
            num_epochs=num_epochs,
        )
        self.device = DEVICE
        self.model = None
        self.trainer = None

    def train(self, train_dataset: Dataset, val_dataset: Optional[Dataset] = None):
        """Train the model on given datasets"""
        # Store state dimension from training data
        self.config.state_dim = (
            train_dataset.dataset.state_dim
            if hasattr(train_dataset, "dataset")
            else train_dataset.state_dim
        )

        # Initialize model
        self.model = get_model(
            TTM_MODEL_PATH,
            context_length=self.config.context_length,
            prediction_length=self.config.prediction_length,
            head_dropout=0.1,
        ).to(self.device)

        # Training arguments
        training_args = TrainingArguments(
            output_dir="blocks_world_ttm",
            learning_rate=self.config.learning_rate,
            num_train_epochs=self.config.num_epochs,
            per_device_train_batch_size=self.config.batch_size,
            per_device_eval_batch_size=self.config.batch_size,
            eval_strategy="epoch" if val_dataset else "no",
            save_strategy="epoch",
            load_best_model_at_end=True if val_dataset else False,
            metric_for_best_model="eval_loss",
            greater_is_better=False,
            seed=SEED,
            report_to="none",
            dataloader_pin_memory=False,
        )

        # Callbacks
        callbacks = [
            TrackingCallback(),
            EarlyStoppingCallback(early_stopping_patience=5),
        ]

        # Optimizer and scheduler
        optimizer = AdamW(self.model.parameters(), lr=self.config.learning_rate)
        scheduler = OneCycleLR(
            optimizer,
            max_lr=self.config.learning_rate,
            epochs=self.config.num_epochs,
            steps_per_epoch=math.ceil(len(train_dataset) / self.config.batch_size),
        )

        # Initialize trainer
        self.trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            callbacks=callbacks,
            optimizers=(optimizer, scheduler),
        )

        # Train
        self.trainer.train()

    def predict(self, initial_states: torch.Tensor, goal_states: torch.Tensor) -> torch.Tensor:
        """Predict the state sequence leading from the given initial states toward the goals"""
        if self.model is None:
            raise RuntimeError("Model needs to be trained or loaded before prediction")

        self.model.eval()
        with torch.no_grad():
            batch_size = initial_states.shape[0]

            # Create context sequence by repeating initial states
            context_sequence = initial_states.unsqueeze(1).repeat(1, self.config.context_length, 1)

            # Prepare inputs
            inputs = {
                "past_values": context_sequence.to(self.device),
                "past_observed_mask": torch.ones_like(context_sequence).to(self.device),
                "static_categorical_values": goal_states.to(self.device),
                "freq_token": torch.zeros(batch_size, dtype=torch.long).to(self.device),
            }

            # Generate predictions
            outputs = self.model(**inputs)
            predictions = torch.sigmoid(outputs[0])
            predictions = torch.round(predictions)

        return predictions

    def save(self, path: str):
        """Save model weights and configuration"""
        if self.model is None:
            raise RuntimeError("No model to save. Train or load a model first.")

        # Create directory if it doesn't exist
        os.makedirs(path, exist_ok=True)

        # Save model state
        model_path = os.path.join(path, "model.pt")
        torch.save(self.model.state_dict(), model_path)

        # Save configuration
        config_path = os.path.join(path, "config.json")
        with open(config_path, "w") as f:
            json.dump(asdict(self.config), f)

        print(f"Model saved to {path}")

    @classmethod
    def load(cls, path: str) -> "BlocksWorldTTM":
        """Load model weights and configuration"""
        # Load configuration
        config_path = os.path.join(path, "config.json")
        with open(config_path, "r") as f:
            config_dict = json.load(f)

        # Create instance with loaded config
        instance = cls(
            context_length=config_dict["context_length"],
            prediction_length=config_dict["prediction_length"],
            learning_rate=config_dict["learning_rate"],
            batch_size=config_dict["batch_size"],
            num_epochs=config_dict["num_epochs"],
        )
        instance.config.state_dim = config_dict["state_dim"]

        # Initialize and load model
        instance.model = get_model(
            TTM_MODEL_PATH,
            context_length=instance.config.context_length,
            prediction_length=instance.config.prediction_length,
            head_dropout=0.1,
        ).to(instance.device)

        model_path = os.path.join(path, "model.pt")
        instance.model.load_state_dict(torch.load(model_path, map_location=instance.device))
        instance.model.eval()

        print(f"Model loaded from {path}")
        return instance

The model receives these key components for each sample during **training**:

1. Past Values (past_values):

   `past_values = torch.tensor(past_seq, dtype=torch.float32)`

   - Shape: [batch_size, context_length, state_dim]
   - These are sequences of states leading up to the current point
   - Each state is a binary vector representing the blocks world predicates
   - Length is padded to context_length (`CONTEXT_LENGTH` by default)

2. Future Values (future_values):

   `future_values = torch.tensor(future_seq, dtype=torch.float32)`

   - Shape: [batch_size, prediction_length, state_dim]
   - These are the target sequences of states we want to predict
   - Length is padded to prediction_length (96 by default)

3. Observation Masks:

   `past_observed_mask = torch.ones((context_length, state_dim))`

   `future_observed_mask = torch.ones((prediction_length, state_dim))`

   - Binary masks indicating which values are observed (1) vs missing (0)
   - All entries are 1 in this notebook, because padded steps hold the goal state and are treated as valid

4. Static Categorical Values:

   `static_categorical_values = torch.tensor(sample.goal_state)`

   - Shape: [batch_size, state_dim]
   - The goal state we want to reach
   - Stays constant across the entire sequence
   - Helps guide the prediction towards the goal
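
How one sample is cut into these components can be sketched with plain lists. The window sizes below are toy values chosen for readability, far smaller than the notebook's settings, and the lists stand in for the tensors the dataset class actually returns.

```python
# Toy sizes (assumptions for illustration only).
context_length = 4
prediction_length = 2

plan = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
goal_state = [0, 1, 1]
state_dim = len(goal_state)

# Pad the plan with the goal state up to context_length + prediction_length.
total_length = context_length + prediction_length
full_seq = plan + [goal_state] * (total_length - len(plan))

# Split into the context window and the prediction target.
past_values = full_seq[:context_length]
future_values = full_seq[context_length:total_length]

# Masks are all ones because every step, padded or not, is a valid state.
past_observed_mask = [[1.0] * state_dim for _ in range(context_length)]
future_observed_mask = [[1.0] * state_dim for _ in range(prediction_length)]

# The goal state is passed once per sample as a static feature.
static_categorical_values = goal_state

print(len(past_values), len(future_values))
```

With these toy sizes the plan fits inside the context window, so the future window consists only of repeated goal states.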


During **prediction**, the model takes:

1. Initial States:

   `past_values = torch.tensor(initial_states).unsqueeze(1).repeat(1, context_length, 1)`

   - The starting state is repeated to fill the context window
   - This gives the model the current state as context

2. Goal States:

   `static_categorical_values = torch.tensor(goal_states)`

   - Target goal state as static features
   - Guides the generation of the plan

The model outputs:

```python
predictions = torch.sigmoid(outputs[0])  # Convert to probabilities
predictions = torch.round(predictions)   # Convert to binary states
```

- Shape: [batch_size, prediction_length, state_dim]
- Sequence of predicted states forming a plan
- Each state is a binary vector matching the input encoding
- The sequence should transition from initial state to goal state
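
The sigmoid-then-round step can be traced by hand on a few numbers. This is a minimal pure-Python sketch; the raw output values and the `binarize` helper are hypothetical. A strict `> 0.5` threshold is used here, which sends a probability of exactly 0.5 to 0, the same result `torch.round` gives because it rounds halves to the nearest even integer.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw model output to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binarize(raw_states):
    """Turn raw outputs shaped [step][bit] into binary state vectors."""
    return [[1 if sigmoid(value) > 0.5 else 0 for value in state] for state in raw_states]


# Two predicted steps with three predicates each (made-up numbers).
raw_outputs = [[-2.0, 0.0, 3.0], [1.5, -0.5, 0.25]]
binary_states = binarize(raw_outputs)
print(binary_states)
```

Since the sigmoid is monotonic, thresholding the probability at 0.5 is equivalent to thresholding the raw output at 0.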


### Helper Methods


In [7]:
def prepare_datasets(data_path: str, context_length: int, prediction_length: int):
    """Create train/val/test datasets"""
    full_dataset = BlocksWorldDataset(data_path, context_length, prediction_length)

    # Split indices
    total_size = len(full_dataset)
    train_size = int(0.7 * total_size)
    val_size = int(0.15 * total_size)
    test_size = total_size - train_size - val_size

    train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(
        full_dataset,
        [train_size, val_size, test_size],
        generator=torch.Generator().manual_seed(SEED),
    )

    return train_dataset, val_dataset, test_dataset

In [8]:
def evaluate_predictions(predictions, targets):
    """Compute metrics for predicted plans"""
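    # Move tensors to the CPU first; .numpy() fails on MPS or CUDA tensors
    predictions = predictions.detach().cpu().numpy()
    targets = targets.detach().cpu().numpy()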

    # State prediction accuracy
    state_accuracy = np.mean(predictions == targets)

    # Goal achievement rate (exact match of final state)
    goal_achieved = np.all(predictions[:, -1] == targets[:, -1], axis=1)
    goal_achievement_rate = np.mean(goal_achieved)

    # Partial goal achievement (percentage of correct final state bits)
    partial_goal = np.mean(predictions[:, -1] == targets[:, -1], axis=1)
    avg_partial_goal = np.mean(partial_goal)

    return {
        "state_accuracy": state_accuracy,
        "goal_achievement_rate": goal_achievement_rate,
        "avg_partial_goal": avg_partial_goal,
    }
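
The three metrics returned by `evaluate_predictions` can be checked by hand on a tiny example. This is a minimal pure-Python sketch that assumes nested lists in place of NumPy arrays; `plan_metrics` and the toy plans are hypothetical and not part of the notebook.

```python
def plan_metrics(predictions, targets):
    """Mirror evaluate_predictions on nested lists shaped [sample][step][bit]."""
    correct_bits = 0
    total_bits = 0
    goals_reached = 0
    final_state_fractions = []

    for pred_plan, target_plan in zip(predictions, targets):
        # State accuracy: count matching bits over every step of the plan.
        for pred_state, target_state in zip(pred_plan, target_plan):
            correct_bits += sum(p == t for p, t in zip(pred_state, target_state))
            total_bits += len(target_state)

        # Goal metrics only look at the final state of each plan.
        final_matches = sum(p == t for p, t in zip(pred_plan[-1], target_plan[-1]))
        final_state_fractions.append(final_matches / len(target_plan[-1]))
        if final_matches == len(target_plan[-1]):
            goals_reached += 1

    return {
        "state_accuracy": correct_bits / total_bits,
        "goal_achievement_rate": goals_reached / len(predictions),
        "avg_partial_goal": sum(final_state_fractions) / len(final_state_fractions),
    }


# Two samples, two steps each, two bits per state (made-up data).
toy_predictions = [[[1, 0], [0, 1]], [[1, 1], [1, 0]]]
toy_targets = [[[1, 0], [0, 1]], [[1, 0], [1, 1]]]
toy_metrics = plan_metrics(toy_predictions, toy_targets)
print(toy_metrics)
```

In this toy case the second sample gets half of its final-state bits right, so it counts toward the partial score but not toward the goal achievement rate.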

In [9]:
def evaluate_model(model, test_dataset, verbose=True):
    """
    Comprehensive evaluation of the model with more detailed metrics
    """
    model.model.eval()
    all_predictions = []
    all_targets = []
    goal_state_predictions = []
    goal_state_targets = []

    num_samples = len(test_dataset)
    num_exact_matches = 0
    num_partial_matches = 0
    total_bits_correct = 0
    total_bits = 0

    with torch.no_grad():
        for i in range(num_samples):
            sample = test_dataset[i]

            # Get initial and goal states
            initial_state = sample["past_values"][0]
            goal_state = sample["static_categorical_values"]
            target = sample["future_values"]

            # Create context sequence
            context_sequence = initial_state.unsqueeze(0).repeat(1, model.config.context_length, 1)

            # Prepare inputs
            inputs = {
                "past_values": context_sequence.to(model.device),
                "past_observed_mask": torch.ones_like(context_sequence).to(model.device),
                "static_categorical_values": goal_state.unsqueeze(0).to(model.device),
                "freq_token": torch.zeros(1, dtype=torch.long).to(model.device),
            }

            # Get prediction
            outputs = model.model(**inputs)
            prediction = torch.sigmoid(outputs[0])
            prediction = torch.round(prediction)

            # Store predictions and targets
            all_predictions.append(prediction)
            all_targets.append(target)

            # Focus on goal states (final states)
            pred_goal = prediction[0, -1]
            true_goal = target[-1]

            goal_state_predictions.append(pred_goal)
            goal_state_targets.append(true_goal)

            # Calculate exact matches
            if torch.all(pred_goal == true_goal):
                num_exact_matches += 1

            # Count correct bits; a partial match has more than 50% of bits correct
            num_correct_bits = torch.sum(pred_goal == true_goal).item()
            total_bits_correct += num_correct_bits
            total_bits += len(pred_goal)

            if num_correct_bits > len(pred_goal) / 2:
                num_partial_matches += 1

    # Calculate metrics
    metrics = {
        "num_samples": num_samples,
        "num_exact_matches": num_exact_matches,
        "exact_match_rate": num_exact_matches / num_samples,
        "num_partial_matches": num_partial_matches,
        "partial_match_rate": num_partial_matches / num_samples,
        "bit_accuracy": total_bits_correct / total_bits,
    }

    if verbose:
        print("\nDetailed Model Evaluation Metrics:")
        print("-" * 50)
        print(f"Total number of test samples: {metrics['num_samples']}")
        print(f"Number of exact goal state matches: {metrics['num_exact_matches']}")
        print(f"Exact match rate: {metrics['exact_match_rate']:.4f}")
        print(f"Number of partial matches (>50% correct): {metrics['num_partial_matches']}")
        print(f"Partial match rate: {metrics['partial_match_rate']:.4f}")
        print(f"Bit-level accuracy: {metrics['bit_accuracy']:.4f}")

    return metrics


def analyze_error_patterns(model, test_dataset, verbose=True):
    """
    Enhanced error pattern analysis with more detailed statistics
    """
    model.model.eval()
    successes = []
    failures = []

    bit_error_counts = {}  # Track which bits are most commonly wrong

    with torch.no_grad():
        for i in range(len(test_dataset)):
            sample = test_dataset[i]

            # Get initial and goal states
            initial_state = sample["past_values"][0]
            goal_state = sample["static_categorical_values"]
            target = sample["future_values"][-1]

            # Create context sequence
            context_sequence = initial_state.unsqueeze(0).repeat(1, model.config.context_length, 1)

            # Prepare inputs
            inputs = {
                "past_values": context_sequence.to(model.device),
                "past_observed_mask": torch.ones_like(context_sequence).to(model.device),
                "static_categorical_values": goal_state.unsqueeze(0).to(model.device),
                "freq_token": torch.zeros(1, dtype=torch.long).to(model.device),
            }

            # Get prediction
            outputs = model.model(**inputs)
            prediction = torch.sigmoid(outputs[0])
            prediction = torch.round(prediction)
            predicted_goal = prediction[0, -1]

            # Calculate error statistics
            errors = (predicted_goal != target).nonzero().squeeze(1)
            num_errors = len(errors)

            # Track which bits had errors
            for error_idx in errors:
                if error_idx.item() not in bit_error_counts:
                    bit_error_counts[error_idx.item()] = 0
                bit_error_counts[error_idx.item()] += 1

            case = {
                "initial_state": initial_state.cpu().numpy(),
                "goal_state": goal_state.cpu().numpy(),
                "predicted_goal": predicted_goal.cpu().numpy(),
                "target_goal": target.cpu().numpy(),
                "num_errors": num_errors,
                "error_positions": errors.cpu().numpy(),
            }

            if num_errors == 0:
                successes.append(case)
            else:
                failures.append(case)

    analysis = {
        "num_successes": len(successes),
        "num_failures": len(failures),
        "success_rate": len(successes) / (len(successes) + len(failures)),
        "bit_error_counts": bit_error_counts,
        "successes": successes,
        "failures": failures,
    }

    if verbose:
        print("\nError Pattern Analysis:")
        print("-" * 50)
        print(f"Number of successful predictions: {analysis['num_successes']}")
        print(f"Number of failed predictions: {analysis['num_failures']}")
        print(f"Success rate: {analysis['success_rate']:.4f}")
        print("\nMost common error positions:")
        sorted_errors = sorted(bit_error_counts.items(), key=lambda x: x[1], reverse=True)
        for bit, count in sorted_errors[:5]:
            print(f"Bit {bit}: {count} errors")

    return analysis

## Model Training


In [10]:
def analyze_dataset(data_path: str) -> Dict[str, Any]:
    """Analyze the dataset to determine appropriate parameters"""
    with open(data_path, "r") as f:
        data = json.load(f)["plans"]

    # Get key statistics
    max_plan_length = max(len(item["plan"]) for item in data)
    avg_plan_length = sum(len(item["plan"]) for item in data) / len(data)
    state_dim = len(data[0]["initial_state"])
    num_samples = len(data)

    stats = {
        "max_plan_length": max_plan_length,
        "avg_plan_length": avg_plan_length,
        "state_dim": state_dim,
        "num_samples": num_samples,
        "recommended_prediction_length": max_plan_length + 2,  # Add small buffer
    }

    print("\nDataset Statistics:")
    print(f"Number of samples: {num_samples}")
    print(f"State dimension: {state_dim}")
    print(f"Maximum plan length: {max_plan_length}")
    print(f"Average plan length: {avg_plan_length:.2f}")
    print(f"Recommended prediction length: {stats['recommended_prediction_length']}")

    return stats

In [11]:
# Create datasets
dataset_file = "../data/dataset_4.json"
print(f"Number of blocks in the dataset: {(num_blocks := int(dataset_file.split('_')[-1][0]))}")

# Analyze dataset
stats = analyze_dataset(dataset_file)

Number of blocks in the dataset: 4

Dataset Statistics:
Number of samples: 1170
State dimension: 24
Maximum plan length: 13
Average plan length: 6.74
Recommended prediction length: 15
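
The recommended prediction length is the longest plan plus a buffer of 2 steps, which gives 13 + 2 = 15 for this dataset. The same arithmetic is sketched below on a hypothetical three-plan dataset, where only the `plan` key matters.

```python
# Made-up plans of lengths 2, 4, and 3; states are 2-bit placeholders.
toy_plans = [
    {"plan": [[1, 0], [0, 1]]},
    {"plan": [[1, 0], [1, 1], [0, 1], [0, 1]]},
    {"plan": [[1, 0], [1, 1], [0, 1]]},
]

plan_lengths = [len(item["plan"]) for item in toy_plans]
max_plan_length = max(plan_lengths)
avg_plan_length = sum(plan_lengths) / len(plan_lengths)

# Same rule as analyze_dataset: longest plan plus a small buffer.
recommended_prediction_length = max_plan_length + 2

print(max_plan_length, avg_plan_length, recommended_prediction_length)
```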


In [12]:
train_dataset, val_dataset, test_dataset = prepare_datasets(
    dataset_file,
    context_length=CONTEXT_LENGTH,
    prediction_length=stats["recommended_prediction_length"],
)

print(
    f"Train size: {len(train_dataset)}, Val size: {len(val_dataset)}, Test size: {len(test_dataset)}"
)

Train size: 819, Val size: 175, Test size: 176
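
These sizes follow from the 70/15/15 split in `prepare_datasets`: the train and validation sizes are truncated to integers, and the test set takes whatever remains. A minimal sketch of that arithmetic, where `split_sizes` is a hypothetical helper name:

```python
def split_sizes(total_size, train_frac=0.7, val_frac=0.15):
    """Truncate the train and validation sizes; the test set gets the remainder."""
    train_size = int(train_frac * total_size)
    val_size = int(val_frac * total_size)
    test_size = total_size - train_size - val_size
    return train_size, val_size, test_size


sizes = split_sizes(1170)
print(sizes)
```

For the 1170 samples here, 15% is 175.5, so validation is truncated to 175 and the extra sample lands in the test set.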


In [13]:
# Initialize and train model
ttm = BlocksWorldTTM(
    context_length=CONTEXT_LENGTH,
    prediction_length=stats["recommended_prediction_length"],
    learning_rate=1e-4,
    batch_size=16,
    num_epochs=20,
)

print("Starting training...")
ttm.train(train_dataset, val_dataset)

INFO:p-41987:t-8430149376:get_model.py:get_model:Loading model from: ibm-granite/granite-timeseries-ttm-r2


Starting training...


INFO:p-41987:t-8430149376:get_model.py:get_model:Model loaded successfully from ibm-granite/granite-timeseries-ttm-r2, revision = 360-60-ft-l1-r2.1.
INFO:p-41987:t-8430149376:get_model.py:get_model:[TTM] context_length = 360, prediction_length = 60


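| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.000224 |
| 2 | No log | 0.000232 |
| 3 | No log | 0.000165 |
| 4 | No log | 9.7e-05 |
| 5 | No log | 0.000101 |
| 6 | No log | 6.9e-05 |
| 7 | No log | 9.6e-05 |
| 8 | No log | 0.000269 |
| 9 | No log | 4.1e-05 |
| 10 | 0.000300 | 7.4e-05 |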


[TrackingCallback] Mean Epoch Time = 56.357141852378845 seconds, Total Train Time = 838.2185382843018


In [14]:
# Save & Load Model
save_path = f"../models/blocks_world_ttm_{num_blocks}"

ttm.save(save_path)
print(f"Saved to {save_path}")

Model saved to ../models/blocks_world_ttm_4
Saved to ../models/blocks_world_ttm_4


In [15]:
ttm = BlocksWorldTTM.load(save_path)
print(f"Loaded from {save_path}")

INFO:p-41987:t-8430149376:get_model.py:get_model:Loading model from: ibm-granite/granite-timeseries-ttm-r2
INFO:p-41987:t-8430149376:get_model.py:get_model:Model loaded successfully from ibm-granite/granite-timeseries-ttm-r2, revision = 360-60-ft-l1-r2.1.
INFO:p-41987:t-8430149376:get_model.py:get_model:[TTM] context_length = 360, prediction_length = 60


Model loaded from ../models/blocks_world_ttm_4
Loaded from ../models/blocks_world_ttm_4


In [16]:
# Evaluate model
print("\nEvaluating model performance...")
metrics = evaluate_model(ttm, test_dataset)

# Analyze error patterns
print("\nAnalyzing error patterns...")
analysis = analyze_error_patterns(ttm, test_dataset)
successes, failures = analysis["successes"], analysis["failures"]

# Examine the initial states, goal states, and predicted goals
gen = BlocksWorldGenerator(num_blocks=num_blocks)

print("\n--------------------------------------------------")
print(f"Length of the test dataset: {len(test_dataset)}")
print("\nExample Successes:")
for i, case in enumerate(successes[:3]):
    print(f"\nCase {i + 1}:")
    print(f"Initial State: {gen.decode_vector_to_blocks(case['initial_state'])}")
    print(f"Goal State: {gen.decode_vector_to_blocks(case['goal_state'])}")
    print(f"Predicted Goal: {gen.decode_vector_to_blocks(case['predicted_goal'])}")

print("\nExample Failures:")
for i, case in enumerate(failures[:3]):
    print(f"\nCase {i + 1}:")
    print(f"Initial State: {gen.decode_vector_to_blocks(case['initial_state'])}")
    print(f"Goal State: {gen.decode_vector_to_blocks(case['goal_state'])}")
    print(f"Predicted Goal: {gen.decode_vector_to_blocks(case['predicted_goal'])}")
    print(f"Target Goal: {gen.decode_vector_to_blocks(case['target_goal'])}")
    print(f"Number of Errors: {case['num_errors']}")


Evaluating model performance...

Detailed Model Evaluation Metrics:
--------------------------------------------------
Total number of test samples: 176
Number of exact goal state matches: 7
Exact match rate: 0.0398
Number of partial matches (>50% correct): 176
Partial match rate: 1.0000
Bit-level accuracy: 0.7434

Analyzing error patterns...

Error Pattern Analysis:
--------------------------------------------------
Number of successful predictions: 7
Number of failed predictions: 169
Success rate: 0.0398

Most common error positions:
Bit 17: 95 errors
Bit 19: 94 errors
Bit 3: 85 errors
Bit 1: 80 errors
Bit 16: 80 errors

--------------------------------------------------
Length of the test dataset: 176

Example Successes:

Case 1:
Initial State: BlockState(clear={'B'}, on_table={'A'}, on={'B': 'D', 'C': 'A', 'D': 'C'}, holding=None)
Goal State: BlockState(clear={'B'}, on_table={'A'}, on={'B': 'D', 'C': 'A', 'D': 'C'}, holding=None)
Predicted Goal: BlockState(clear={'B'}, on_table={'