# TTML Classification Examples

This notebook demonstrates using the TTML model for various classification tasks. We'll cover:

1. Adult Income Classification
   - Binary classification predicting income >50K
   - Handling mixed categorical and numerical features
   - Feature importance analysis

2. Titanic Survival Classification
   - Column-type detection and dataset preparation
   - Training a smaller model configuration
   - Local explanations for individual predictions

In [1]:
import sys
import os
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Import TTML modules
from tabular_transformer.models import TabularTransformer
from tabular_transformer.models.task_heads import ClassificationHead
from tabular_transformer.training import Trainer
from tabular_transformer.inference import predict
from tabular_transformer.explainability import global_explanations, local_explanations
from tabular_transformer.utils.config import TransformerConfig
from tabular_transformer.data.dataset import TabularDataset

# Import data utilities
from data_utils import download_adult_dataset, download_titanic_dataset

## Part 1: Adult Income Classification

First, we'll work with the Adult Income dataset to predict whether an individual's income exceeds $50K/year.

In [2]:
# Download Adult dataset
adult_df = download_adult_dataset(save_csv=False)
print("Adult dataset shape:", adult_df.shape)
print("\nFeature types:")
print(adult_df.dtypes)
print("\nClass distribution:")
print(adult_df['class'].value_counts(normalize=True))

Adult dataset shape: (32561, 15)

Feature types:
age               int64
workclass        object
fnlwgt            int64
education        object
education-num     int64
marital-status   object
occupation       object
relationship     object
race             object
sex              object
capital-gain      int64
capital-loss      int64
hours-per-week    int64
native-country   object
class            object
dtype: object

Class distribution:
class
<=50K    0.759281
>50K     0.240719
Name: proportion, dtype: float64
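
With about 76% of rows in the `<=50K` class, a classifier that always predicts the majority class already reaches roughly 0.76 accuracy, so later accuracy figures are best read against that baseline. A minimal sketch of the calculation, using a hypothetical helper `majority_baseline_accuracy` and toy labels rather than the dataset itself:

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_baseline_accuracy(labels: Sequence[Hashable]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the largest class
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Toy labels for illustration only: 3 of 4 rows are '<=50K'
toy_labels = ['<=50K', '<=50K', '>50K', '<=50K']
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

On the real data the same call would be `majority_baseline_accuracy(adult_df['class'].tolist())`.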


In [3]:
# Identify numeric and categorical columns
numeric_features = adult_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = adult_df.select_dtypes(include=['object']).columns.tolist()

# Remove target column from features
target_column = 'class'
if target_column in numeric_features:
    numeric_features.remove(target_column)
if target_column in categorical_features:
    categorical_features.remove(target_column)

# Convert string labels to integers. Run this once: applying the map to
# already-converted values would turn every label into NaN.
print("Original class values:", adult_df['class'].unique())
adult_df['class'] = adult_df['class'].map({'>50K': 1, '<=50K': 0})
print("Converted class values:", adult_df['class'].unique())

# Create train/test datasets
train_dataset_adult, test_dataset_adult, _ = TabularDataset.from_dataframe(
    dataframe=adult_df,
    numeric_columns=numeric_features,
    categorical_columns=categorical_features,
    target_columns={'main': [target_column]},
    validation_split=0.2,
    random_state=42
)

2025-03-17 13:17:54,694 - tabular_transformer.FeaturePreprocessor - INFO - Fitted numeric scaler to 6 columns
2025-03-17 13:17:54,701 - tabular_transformer.FeaturePreprocessor - INFO - Column workclass: 9 categories, embedding dim 4
2025-03-17 13:17:54,703 - tabular_transformer.FeaturePreprocessor - INFO - Column education: 16 categories, embedding dim 8
2025-03-17 13:17:54,705 - tabular_transformer.FeaturePreprocessor - INFO - Column marital-status: 7 categories, embedding dim 4
2025-03-17 13:17:54,707 - tabular_transformer.FeaturePreprocessor - INFO - Column occupation: 15 categories, embedding dim 7
2025-03-17 13:17:54,709 - tabular_transformer.FeaturePreprocessor - INFO - Column relationship: 6 categories, embedding dim 3
2025-03-17 13:17:54,711 - tabular_transformer.FeaturePreprocessor - INFO - Column race: 5 categories, embedding dim 3
2025-03-17 13:17:54,713 - tabular_transformer.FeaturePreprocessor - INFO - Column sex: 2 categories, embedding dim 2
2025-03-17 13:17:54,715 - tab
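
`Series.map` with a dict returns `NaN` for any value that is not a key of the dict, so a label with stray whitespace, or a column that has already been converted, silently becomes `NaN`. A small guard makes that failure loud. This is a sketch on toy data, and `encode_labels` is a hypothetical helper, not part of TTML:

```python
import pandas as pd


def encode_labels(labels: pd.Series, mapping: dict) -> pd.Series:
    """Map string labels to integers, raising on any unmapped value."""
    encoded = labels.map(mapping)
    if encoded.isna().any():
        unmapped = labels[encoded.isna()].unique().tolist()
        raise ValueError(f"Unmapped label values: {unmapped}")
    return encoded.astype(int)


mapping = {'>50K': 1, '<=50K': 0}
toy = pd.Series(['<=50K', '>50K', '<=50K'])  # toy labels, not the dataset
encoded = encode_labels(toy, mapping)
print(encoded.tolist())

# Re-applying the mapping to already-encoded labels raises instead of
# quietly producing a column of NaN.
try:
    encode_labels(encoded, mapping)
except ValueError as err:
    print("Caught:", err)
```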

In [4]:
# Get feature dimensions from preprocessor
feature_dims = train_dataset_adult.preprocessor.get_feature_dimensions()
numeric_dim = feature_dims['numeric_dim']
categorical_dims = feature_dims['categorical_dims']
categorical_embedding_dims = feature_dims['categorical_embedding_dims']

# Model configuration
config = TransformerConfig(
    embed_dim=128,
    num_heads=8,
    num_layers=4,
    dropout=0.2,
    variational=False
)

# Initialize transformer encoder
encoder_adult = TabularTransformer(
    numeric_dim=numeric_dim,
    categorical_dims=categorical_dims,
    categorical_embedding_dims=categorical_embedding_dims,
    config=config
)

# Initialize classification head
task_head_adult = ClassificationHead(
    name="main",  # Task name should match the key in target_columns
    input_dim=128,  # Should match config.embed_dim
    num_classes=2  # Binary classification for income
)
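
In standard multi-head attention the embedding dimension is split evenly across heads, so `embed_dim` must be divisible by `num_heads` (PyTorch's `nn.MultiheadAttention` enforces this). Whether `TransformerConfig` validates the constraint is not shown in this notebook, and the check below assumes the encoder uses standard multi-head attention. `head_dim` is a hypothetical helper:

```python
def head_dim(embed_dim: int, num_heads: int) -> int:
    """Per-head dimension, assuming an even split of embed_dim across heads."""
    if embed_dim % num_heads != 0:
        raise ValueError(
            f"embed_dim={embed_dim} is not divisible by num_heads={num_heads}"
        )
    return embed_dim // num_heads


# Configuration used for the Adult model in this notebook
print(head_dim(embed_dim=128, num_heads=8))
```

With `embed_dim=128` and `num_heads=8`, each head works on a 16-dimensional slice.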

In [5]:
# Create data loaders
train_loader_adult = train_dataset_adult.create_dataloader(batch_size=64, shuffle=True)
test_loader_adult = test_dataset_adult.create_dataloader(batch_size=64, shuffle=False)

# Initialize trainer
trainer_adult = Trainer(
    encoder=encoder_adult,
    task_head={'main': task_head_adult},  # Map task head to task name
    optimizer=None,  # Will be created by trainer
    device=None  # Will use CUDA if available
)

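# Train the model
# Note: the held-out split is also used as the validation set, so early
# stopping is tuned on the same rows that produce the final metrics.
history_adult = trainer_adult.train(
    train_loader=train_loader_adult,
    val_loader=test_loader_adult,
    num_epochs=20,
    early_stopping_patience=3
)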

2025-03-17 13:18:27,694 - tabular_transformer.Trainer - INFO - Starting training for 20 epochs
2025-03-17 13:18:27,696 - tabular_transformer.Trainer - INFO - Using device: cpu
2025-03-17 13:18:27,697 - tabular_transformer.Trainer - INFO - Epoch 1/20
2025-03-17 13:19:14,654 - tabular_transformer.Trainer - INFO - Training loss: 0.6821, Validation loss: 0.6801
2025-03-17 13:19:14,656 - tabular_transformer.Trainer - INFO - Epoch 2/20
2025-03-17 13:20:01,544 - tabular_transformer.Trainer - INFO - Training loss: 0.6611, Validation loss: 0.6593
2025-03-17 13:20:01,546 - tabular_transformer.Trainer - INFO - Epoch 3/20
2025-03-17 13:20:48,345 - tabular_transformer.Trainer - INFO - Training loss: 0.6401, Validation loss: 0.6384
2025-03-17 13:20:48,347 - tabular_transformer.Trainer - INFO - Epoch 4/20
2025-03-17 13:21:35,246 - tabular_transformer.Trainer - INFO - Training loss: 0.6192, Validation loss: 0.6176
2025-03-17 13:21:35,248 - tabular_transformer.Trainer - INFO - Epoch 5/20
2025-03-17 13:
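
The `early_stopping_patience=3` argument presumably stops training once the validation loss has failed to improve for three consecutive epochs. The `Trainer` internals are not shown here, so the following is only a generic sketch of patience-based early stopping, with a hypothetical `stopping_epoch` helper and made-up loss values:

```python
from typing import Optional, Sequence


def stopping_epoch(val_losses: Sequence[float], patience: int) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: reset the patience counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: the best value occurs at epoch 3
losses = [0.68, 0.66, 0.64, 0.65, 0.66, 0.65, 0.63]
print(stopping_epoch(losses, patience=3))
```

In this toy run training stops at epoch 6, before the improvement at epoch 7 is ever seen, which is the usual trade-off of a small patience value.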

In [6]:
# Make predictions
predictions_adult = trainer_adult.predict(test_loader_adult)

# Get predictions for the main task
y_pred_adult = predictions_adult['main']['predicted_class'].numpy()
y_test_adult = test_dataset_adult.targets['main']

# Print metrics
print("Adult Income Classification Results:")
print(f"Accuracy: {accuracy_score(y_test_adult, y_pred_adult):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_adult, y_pred_adult))

Adult Income Classification Results:
Accuracy: 0.8489

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.91      0.90      4945
           1       0.71      0.65      0.68      1568

    accuracy                           0.85      6513
   macro avg       0.80      0.78      0.79      6513
weighted avg       0.85      0.85      0.85      6513
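
The per-class precision and recall in the report derive from the confusion matrix, which `confusion_matrix` (imported at the top of the notebook but not otherwise used) computes directly. A sketch on toy labels, not on the notebook's predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# For binary labels {0, 1}, rows are true classes and columns are predictions:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
print(cm)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Applied to this notebook, the call would be `confusion_matrix(y_test_adult, y_pred_adult)`.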



## Feature Importance Analysis

Let's analyze which features are most important for income prediction.

In [7]:
# Calculate feature importance
feature_importance = global_explanations.calculate_feature_importance(
    encoder=encoder_adult,
    task_head=task_head_adult,
    dataset=test_dataset_adult,
    feature_names=numeric_features + categorical_features
)
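
The return type of `calculate_feature_importance` is not shown in this notebook. Assuming it is a mapping from feature name to a numeric score (an assumption), the features could be ranked as follows. `rank_features` is a hypothetical helper, and the scores are placeholders rather than model output:

```python
from typing import Dict, List, Tuple


def rank_features(importance: Dict[str, float], top_k: int = 5) -> List[Tuple[str, float]]:
    """Return the top_k features sorted by absolute importance, largest first."""
    ranked = sorted(importance.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top_k]


# Placeholder scores for illustration only (NOT results from the model)
placeholder_scores = {'feature_a': 0.12, 'feature_b': 0.30, 'feature_c': -0.25, 'feature_d': 0.05}
top_features = rank_features(placeholder_scores, top_k=3)
for name, score in top_features:
    print(f"{name:<12} {score:+.2f}")
```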

## Part 2: Titanic Survival Classification

Next, we apply the same workflow to the Titanic dataset, using a smaller model configuration and adding local explanations.

In [8]:
# Download Titanic dataset
titanic_df = download_titanic_dataset(save_csv=False)
print("Titanic dataset shape:", titanic_df.shape)
print("\nFeature types:")
print(titanic_df.dtypes)
print("\nSurvival distribution:")
print(titanic_df['survived'].value_counts(normalize=True))
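
Titanic data commonly contains missing values, and this notebook does not show whether `FeaturePreprocessor` handles them. A simple check and imputation pass before building the dataset might look like the sketch below. The toy frame and its column names other than `survived` are assumptions:

```python
import pandas as pd

# Toy frame for illustration only; column names are assumed
toy = pd.DataFrame({
    'age': [22.0, None, 35.0, 58.0],
    'embarked': ['S', 'C', None, 'S'],
    'survived': [0, 1, 1, 0],
})

# Share of missing values per column
print(toy.isna().mean())

cleaned = toy.copy()
for col in cleaned.columns:
    if not cleaned[col].isna().any():
        continue
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        # Numeric columns: fill with the column median
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    else:
        # Categorical columns: treat missingness as its own category
        cleaned[col] = cleaned[col].fillna('missing')

print(cleaned)
```

Median imputation and an explicit `'missing'` category are only one reasonable default; the right choice depends on the column.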

In [9]:
# Identify numeric and categorical columns
numeric_features = titanic_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = titanic_df.select_dtypes(include=['object']).columns.tolist()

# Remove target column from features
target_column = 'survived'
if target_column in numeric_features:
    numeric_features.remove(target_column)
if target_column in categorical_features:
    categorical_features.remove(target_column)

# Create train/test datasets
train_dataset_titanic, test_dataset_titanic, _ = TabularDataset.from_dataframe(
    dataframe=titanic_df,
    numeric_columns=numeric_features,
    categorical_columns=categorical_features,
    target_columns={'main': [target_column]},
    validation_split=0.2,
    random_state=42
)

In [10]:
# Get feature dimensions from preprocessor
feature_dims = train_dataset_titanic.preprocessor.get_feature_dimensions()
numeric_dim = feature_dims['numeric_dim']
categorical_dims = feature_dims['categorical_dims']
categorical_embedding_dims = feature_dims['categorical_embedding_dims']

# Model configuration
config = TransformerConfig(
    embed_dim=64,
    num_heads=4,
    num_layers=2,
    dropout=0.1,
    variational=False
)

# Initialize transformer encoder
encoder_titanic = TabularTransformer(
    numeric_dim=numeric_dim,
    categorical_dims=categorical_dims,
    categorical_embedding_dims=categorical_embedding_dims,
    config=config
)

# Initialize classification head
task_head_titanic = ClassificationHead(
    name="main",  # Task name should match the key in target_columns
    input_dim=64,  # Should match config.embed_dim
    num_classes=2  # Binary classification for survival
)

In [11]:
# Create data loaders
train_loader_titanic = train_dataset_titanic.create_dataloader(batch_size=32, shuffle=True)
test_loader_titanic = test_dataset_titanic.create_dataloader(batch_size=32, shuffle=False)

# Initialize trainer
trainer_titanic = Trainer(
    encoder=encoder_titanic,
    task_head={'main': task_head_titanic},  # Map task head to task name
    optimizer=None,  # Will be created by trainer
    device=None  # Will use CUDA if available
)

# Train the model
# Note: as in Part 1, the held-out split doubles as the validation set.
history_titanic = trainer_titanic.train(
    train_loader=train_loader_titanic,
    val_loader=test_loader_titanic,
    num_epochs=15,
    early_stopping_patience=3
)

In [12]:
# Make predictions
predictions_titanic = trainer_titanic.predict(test_loader_titanic)

# Get predictions for the main task
y_pred_titanic = predictions_titanic['main']['predicted_class'].numpy()
y_test_titanic = test_dataset_titanic.targets['main']

print("Titanic Survival Classification Results:")
print(f"Accuracy: {accuracy_score(y_test_titanic, y_pred_titanic):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_titanic, y_pred_titanic))
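
The notebook reads `predicted_class` from the prediction dict; how the library derives it is not shown. For a classification head that outputs one score per class, the usual approach is a softmax followed by an argmax, sketched here in NumPy with made-up logits:

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up logits for three examples and two classes
logits = np.array([[2.0, 0.5], [0.1, 1.2], [-1.0, 0.0]])
probs = softmax(logits)
predicted_class = probs.argmax(axis=1)
print(probs.round(3))
print(predicted_class)
```

Keeping the probabilities, not just the argmax, allows the decision threshold to be adjusted when the classes are imbalanced.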

## Local Explanations

Let's examine individual predictions to understand the model's decision-making process.

In [13]:
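# Get local explanations for a few randomly chosen test examples
# (seeded so the same examples are selected on every run)
rng = np.random.default_rng(42)
sample_indices = rng.choice(len(test_dataset_titanic), size=3, replace=False)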
for idx in sample_indices:
    explanation = local_explanations.explain_prediction(
        encoder=encoder_titanic,
        task_head=task_head_titanic,
        instance_idx=idx,
        dataset=test_dataset_titanic,
        feature_names=numeric_features + categorical_features
    )
    
    print(f"\nExample (test index {idx}):")
    print(f"True class: {y_test_titanic[idx]}")
    print(f"Predicted class: {y_pred_titanic[idx]}")
    print("\nFeature contributions:")
    for feature, contribution in explanation.items():
        print(f"{feature}: {contribution:.4f}")

## Conclusion

This notebook applied the TTML model to two binary classification datasets:

1. Adult Income Classification
   - Reached 0.85 accuracy on the 6,513-row held-out split, against a majority-class share of about 0.76; recall on the >50K class was lower, at 0.65
   - Computed global feature importance on the held-out split

2. Titanic Survival Classification
   - Trained a smaller configuration (`embed_dim=64`, two layers) with the same workflow
   - Produced local explanations for individual predictions

The same encoder, task-head, and trainer workflow handled both datasets, and the explainability modules provide both global and per-prediction views of the model's behavior.