# Lab 1: Chess Move Evaluation - Knowledge Distillation Training

## Introduction

In this lab, you will train a smaller "student" model to evaluate chess moves using knowledge distillation. This builds on Lab 0, where you generated teacher model logits from the Qwen3-30B-A3B model.

**Task**: Train a student model to classify which chess move is better (MoveA or MoveB)

**Why Knowledge Distillation for Chess?**
- **Cost Reduction**: 50x smaller model (30B → 0.6B parameters)
- **Faster Inference**: ~20-50x faster move evaluation
- **Deployment Flexibility**: Can run on smaller instances or edge devices
- **Maintained Performance**: Retains much of the teacher's chess understanding

**Training Approach:**

The `KnowledgeDistillationTrainer` combines two loss functions:
1. **Hard Loss**: Cross-entropy with true labels (MoveA or MoveB)
2. **Soft Loss**: KL divergence between the temperature-softened teacher and student output distributions

Combined loss: `total_loss = α × soft_loss + (1 - α) × hard_loss`

With α = 0.7, 70% of the loss weight goes to matching the teacher's distribution and 30% to predicting the correct label.
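
To make the arithmetic concrete, here is a minimal pure-Python sketch of the combined loss for a single two-way decision (MoveA vs. MoveB). It is an illustration under stated assumptions, not the `KnowledgeDistillationTrainer` implementation: the function names are hypothetical, the soft loss is taken as KL(teacher || student) on temperature-softened distributions, and no extra temperature scaling is applied.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.7, temperature=4.0):
    """Return (total, soft, hard) for one example. Hypothetical helper for illustration."""
    # Hard loss: cross-entropy of the student against the true label (temperature 1).
    hard = -math.log(softmax(student_logits)[label])
    # Soft loss: KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Combined loss as defined above: alpha * soft + (1 - alpha) * hard.
    total = alpha * soft + (1 - alpha) * hard
    return total, soft, hard

# Index 0 = MoveA, index 1 = MoveB (an assumed label encoding).
total, soft, hard = distillation_loss([0.5, -0.5], [2.0, -2.0], label=0)
print(f"soft={soft:.4f} hard={hard:.4f} total={total:.4f}")
```

When the student's logits equal the teacher's, the soft term is zero and only the weighted hard loss remains. Some distillation implementations additionally multiply the soft term by the squared temperature; whether the lab's trainer does so is not stated here.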

**Models:**
- **Teacher**: Qwen3-30B-A3B (30 billion parameters)
- **Student**: Qwen3-0.6B (600 million parameters)

**Prerequisites:**
- Completed Lab 0 with chess logits saved to `data/chess_output.json`
- AWS Trainium instance (trn1.32xlarge recommended)
- AWS Neuron SDK installed
- Virtual environment: `/opt/aws_neuronx_venv_pytorch_2_8_nxd_inference`

## Download Student Model

Install the training dependencies, then download the Qwen3-0.6B model weights from the Hugging Face Hub.

In [None]:
%pip install -q neuronx-distributed datasets "optimum-neuron[training]==0.4.1"

In [None]:
!hf download Qwen/Qwen3-0.6B

## Environment Setup

Configure environment variables for optimal Neuron performance.

In [None]:
import os

# Neuron compiler and runtime settings
os.environ['NEURON_CC_FLAGS'] = "--model-type transformer --retry_failed_compilation"
os.environ['NEURON_FUSE_SOFTMAX'] = "1"
os.environ['NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS'] = "3"
os.environ['MALLOC_ARENA_MAX'] = "64"
os.environ['WORLD_SIZE'] = "8"
os.environ['WANDB_DISABLED'] = "true"  # Disable wandb logging

## Training Configuration

Define hyperparameters for the distillation training.

In [None]:
# Training parameters
PROCESSES_PER_NODE = 4  # Distributed training processes
NUM_EPOCHS = 3  # Number of training epochs
TP_DEGREE = 2  # Tensor parallelism degree
BS = 1  # Batch size per device
GRADIENT_ACCUMULATION_STEPS = 16  # 16 samples per optimizer step on each data-parallel replica
LOGGING_STEPS = 1  # Log every step
MODEL_NAME = "Qwen/Qwen3-0.6B"
OUTPUT_DIR = "Qwen3-0.6B-chess-finetuned"
DATASET_PATH = "data/chess_output.json"

# Distillation hyperparameters
TEMPERATURE = 4.0  # Softness of probability distributions
ALPHA = 0.7  # Weight for soft loss (0.7 = 70% teacher, 30% labels)

# Set max steps (use -1 for full training)
MAX_STEPS = -1  # Train for full epochs

print(f"Model: {MODEL_NAME}")
print(f"Dataset: {DATASET_PATH}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Temperature: {TEMPERATURE}, Alpha: {ALPHA}")
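
The global batch size per optimizer step depends on how many data-parallel replicas the processes form. The sketch below assumes the usual layout in which the data-parallel degree is the number of processes divided by the product of the tensor- and pipeline-parallel degrees. The helper name is hypothetical and is not part of the training script.

```python
def effective_batch_size(num_processes, tp_degree, pp_degree, per_device_bs, grad_accum):
    """Samples consumed per optimizer step across all data-parallel replicas."""
    model_parallel = tp_degree * pp_degree
    if num_processes % model_parallel != 0:
        raise ValueError("num_processes must be divisible by tp_degree * pp_degree")
    # Each group of `model_parallel` processes holds one copy of the model.
    dp_degree = num_processes // model_parallel
    return dp_degree * per_device_bs * grad_accum

# Values from the configuration above: 4 processes, TP=2, PP=1, BS=1, 16 accumulation steps.
print(effective_batch_size(4, 2, 1, 1, 16))  # 32
```

Under that assumption, the settings above give two data-parallel replicas, each accumulating 16 samples per step.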

## Verify Chess Dataset

Check that the chess logits data from Lab 0 is available.

In [None]:
import json
from pathlib import Path

if not Path(DATASET_PATH).exists():
    print(f"ERROR: {DATASET_PATH} not found!")
    print("Please run Lab0_generate_teacher_logits_chess.ipynb first.")
else:
    with open(DATASET_PATH, 'r') as f:
        chess_data = json.load(f)
    
    valid_samples = [s for s in chess_data if 'error' not in s]
    if not valid_samples:
        # Avoid a division by zero / IndexError below when every record failed
        print(f"ERROR: no valid samples found in {DATASET_PATH}.")
    else:
        avg_positions = sum(len(s['response']['token_logits']) for s in valid_samples) / len(valid_samples)
        print(f"✓ Found {len(valid_samples)} valid chess samples")
        print(f"✓ Average logit positions: {avg_positions:.1f}")

        # Show example
        sample = valid_samples[0]
        print("\nExample:")
        print(f"  Input: {sample['input'][:100]}...")
        print(f"  Expected: {sample['expected_output']}")
        print(f"  Generated: {sample['response']['generated_text']}")
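
Before distilling, it can help to know how often the teacher's output agrees with the ground-truth label, since the soft loss pulls the student toward the teacher. The sketch below is a hypothetical helper that uses only the fields shown above. It assumes `expected_output` is a short string such as `MoveA` that appears verbatim in `generated_text` when the teacher is correct, and the demo records are made up for illustration.

```python
def teacher_agreement(samples):
    """Fraction of valid samples whose teacher output contains the expected label."""
    valid = [s for s in samples if "error" not in s]
    if not valid:
        return 0.0
    hits = sum(
        1
        for s in valid
        if s["expected_output"].strip() in s["response"]["generated_text"]
    )
    return hits / len(valid)

# Made-up records that mimic the structure of data/chess_output.json.
demo = [
    {"input": "...", "expected_output": "MoveA",
     "response": {"generated_text": "MoveA", "token_logits": []}},
    {"input": "...", "expected_output": "MoveB",
     "response": {"generated_text": "MoveA", "token_logits": []}},
    {"error": "timeout"},
]
print(teacher_agreement(demo))  # 0.5
```

On the real data, this would be called as `teacher_agreement(chess_data)`.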

## Run Training

Execute the distributed training using `torchrun`.

**Note**: The first run compiles the model, which takes ~20-30 minutes. Subsequent runs reuse the cached compilation.

**Training Process:**
1. **Compilation** (first run only): Neuron compiler optimizes model for Trainium
2. **Training**: Student learns from teacher logits
3. **Checkpointing**: Training checkpoints are written to `OUTPUT_DIR`, and the final model is saved to `./final_chess_model`

**Expected Time:**
- Compilation: ~20-30 minutes (one-time)
- Training (100 samples, 3 epochs): ~10-15 minutes

In [None]:
# Build the training command
training_cmd = f"""
torchrun  \\
    --nproc_per_node {PROCESSES_PER_NODE} \\
    src/distill_chess_neuron_torchrun.py \\
    --model_id {MODEL_NAME} \\
    --dataset_path {DATASET_PATH} \\
    --output_model_path ./final_chess_model \\
    --temperature {TEMPERATURE} \\
    --alpha {ALPHA} \\
    --num_train_epochs {NUM_EPOCHS} \\
    --do_train \\
    --max_steps {MAX_STEPS} \\
    --per_device_train_batch_size {BS} \\
    --gradient_accumulation_steps {GRADIENT_ACCUMULATION_STEPS} \\
    --learning_rate 1e-4 \\
    --bf16 \\
    --zero_1 False \\
    --tensor_parallel_size {TP_DEGREE} \\
    --warmup_steps 5 \\
    --pipeline_parallel_size 1 \\
    --logging_steps {LOGGING_STEPS} \\
    --output_dir {OUTPUT_DIR} \\
    --overwrite_output_dir
"""

print("Starting training...")
print("This will take ~30-45 minutes on first run (includes compilation)")
print("\nCommand:")
print(training_cmd)

# Run training
!{training_cmd}

## Consolidate the shards

The training script saves the distilled model as a sharded checkpoint, where each model-parallel worker is responsible for saving its own shard of the model weights. To use the model for inference, we need to consolidate these shards into a single checkpoint.

In [None]:
!optimum-cli neuron consolidate ./final_chess_model ./final_chess_model

## Training Results

Check the training output and saved model.

In [None]:
# Check if model was saved
final_model_path = "./final_chess_model"

if Path(final_model_path).exists():
    print(f"✓ Model saved to {final_model_path}")
    print(f"\nModel files:")
    !ls -lh {final_model_path}
else:
    print(f"✗ Model not found at {final_model_path}")
    print("Training may have failed. Check the output above for errors.")
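
As an additional sanity check, the size of the saved model on disk can be compared against a rough estimate from the parameter count. The sketch below assumes weights are stored at 2 bytes per parameter (bf16). Both helpers are hypothetical, and the directory may also hold shard files, tokenizer files, and configs, so treat the comparison as approximate.

```python
from pathlib import Path

def dir_size_gb(path):
    """Total size of all regular files under `path`, in GB (10**9 bytes)."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1e9

def expected_size_gb(num_params, bytes_per_param=2):
    """Rough weight size, assuming a fixed number of bytes per parameter."""
    return num_params * bytes_per_param / 1e9

# 0.6B parameters at 2 bytes each is roughly 1.2 GB of weights.
print(f"Expected weights: ~{expected_size_gb(0.6e9):.1f} GB")
if Path("./final_chess_model").exists():
    print(f"On disk: {dir_size_gb('./final_chess_model'):.2f} GB")
```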

## Summary

You have successfully:
- ✓ Loaded chess move evaluation dataset with teacher logits
- ✓ Configured knowledge distillation training
- ✓ Trained a 0.6B student model from a 30B teacher
- ✓ Saved the trained model for inference

**Next Steps:**
- Proceed to Lab 2 to test the trained model
- Compare student vs teacher predictions
- Measure inference speed improvements

**Model Compression:**
- Teacher: 30B parameters
- Student: 0.6B parameters
- **Reduction**: 50x smaller!

