# SageMaker Sequence-to-Sequence Exercise

This notebook walks you through training Amazon SageMaker's **Sequence-to-Sequence (Seq2Seq)** algorithm for machine translation.

## What You'll Learn
1. How to prepare data in Seq2Seq's required RecordIO-Protobuf format
2. How to configure and train a Seq2Seq model on SageMaker
3. How to deploy and query the model for translations
4. How to evaluate translation quality

## What is Sequence-to-Sequence?
Seq2Seq is a supervised learning algorithm that transforms an input sequence of tokens into an output sequence of tokens. It uses an **encoder-decoder architecture** with attention mechanisms.

**Common Use Cases:**
- **Machine Translation**: Convert text from one language to another
- **Text Summarization**: Condense longer text into shorter summaries
- **Speech-to-Text**: Convert audio sequences to text

**Architecture Options:**
- **RNN-based**: Uses LSTM or GRU cells with attention
- **CNN-based**: Uses convolutional layers (faster training)
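
To make the attention step concrete, here is a minimal NumPy sketch of dot-product attention for a single decoder step. It only illustrates the idea: the function name `dot_attention` and the toy vectors are made up for this example, and the SageMaker container's internal implementation is not shown in this notebook.

```python
import numpy as np

def dot_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step using dot-product attention."""
    # Similarity score between the decoder state and each encoder position
    scores = encoder_states @ decoder_state          # shape: (src_len,)
    # Numerically stable softmax turns scores into weights that sum to 1
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Context vector: attention-weighted average of the encoder states
    context = weights @ encoder_states               # shape: (hidden,)
    return weights, context

# Toy example: 3 source positions, hidden size 2 (values are made up)
encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])
weights, context = dot_attention(decoder_state, encoder_states)
print(weights.round(3), context.round(3))
```

Positions whose encoder state aligns with the decoder state receive more weight, so the context vector is dominated by the source tokens most relevant to the current output step.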

## Prerequisites
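- A SageMaker notebook instance, SageMaker Studio, or a local environment with AWS credentials
- An IAM role with S3 and SageMaker permissions
- **GPU instance required** for training (P2, P3, G4dn, or G5 family)
- Python packages used below: `boto3`, `sagemaker`, `numpy`, `pandas`, `python-dotenv`, and optionally `mxnet`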

---

## Step 1: Setup and Imports

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
import pandas as pd
import numpy as np
import json
import os
import struct
from datetime import datetime
from dotenv import load_dotenv
from collections import Counter

# Load environment variables from .env file
load_dotenv()

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

# Use environment variable for role, or fall back to execution role if running in SageMaker
if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
# Configuration
BUCKET_NAME = sagemaker_session.default_bucket()
PREFIX = "seq2seq-translation"

# Dataset parameters
NUM_SAMPLES = 5000
VAL_RATIO = 0.1
MAX_SEQ_LEN = 50
RANDOM_STATE = 42

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

## Step 2: Generate Synthetic Translation Data

We'll create a synthetic "translation" dataset that maps simple English sentences to a pseudo-Spanish target by replacing each word through a fixed dictionary. This simulates a translation task without requiring a real bilingual corpus.

**Note:** For real applications, you would use actual parallel corpora (e.g., English-German, English-French).

In [None]:
def generate_translation_data(num_samples=5000, seed=42):
    """
    Generate synthetic "translation" pairs.

    We create simple English sentences from fixed word lists and
    transform them into a pseudo-Spanish "target language" by
    replacing each word through a fixed dictionary. Word order and
    sentence length are unchanged.

    This gives the model a consistent mapping to learn without
    requiring a bilingual corpus.
    """
    np.random.seed(seed)
    
    # Vocabulary components
    subjects = ['the cat', 'the dog', 'the bird', 'the man', 'the woman', 
                'the child', 'the teacher', 'the doctor', 'the student', 'the artist']
    verbs = ['sees', 'likes', 'wants', 'finds', 'takes', 
             'helps', 'knows', 'loves', 'needs', 'gives']
    objects = ['the ball', 'the book', 'the food', 'the water', 'the house',
               'the car', 'the flower', 'the key', 'the phone', 'the pen']
    adjectives = ['big', 'small', 'red', 'blue', 'old', 'new', 'fast', 'slow', 'good', 'bad']
    adverbs = ['quickly', 'slowly', 'carefully', 'happily', 'sadly']
    
    # Translation mappings (simulating word-level translation)
    word_map = {
        'the': 'el', 'cat': 'gato', 'dog': 'perro', 'bird': 'pajaro',
        'man': 'hombre', 'woman': 'mujer', 'child': 'nino', 'teacher': 'maestro',
        'doctor': 'medico', 'student': 'estudiante', 'artist': 'artista',
        'sees': 've', 'likes': 'gusta', 'wants': 'quiere', 'finds': 'encuentra',
        'takes': 'toma', 'helps': 'ayuda', 'knows': 'conoce', 'loves': 'ama',
        'needs': 'necesita', 'gives': 'da',
        'ball': 'pelota', 'book': 'libro', 'food': 'comida', 'water': 'agua',
        'house': 'casa', 'car': 'coche', 'flower': 'flor', 'key': 'llave',
        'phone': 'telefono', 'pen': 'pluma',
        'big': 'grande', 'small': 'pequeno', 'red': 'rojo', 'blue': 'azul',
        'old': 'viejo', 'new': 'nuevo', 'fast': 'rapido', 'slow': 'lento',
        'good': 'bueno', 'bad': 'malo',
        'quickly': 'rapidamente', 'slowly': 'lentamente', 'carefully': 'cuidadosamente',
        'happily': 'felizmente', 'sadly': 'tristemente'
    }
    
    source_sentences = []
    target_sentences = []
    
    for _ in range(num_samples):
        # Generate source sentence with random structure
        structure = np.random.choice(['SVO', 'SVO_adj', 'SVO_adv', 'SVO_adj_adv'])
        
        subj = np.random.choice(subjects)
        verb = np.random.choice(verbs)
        obj = np.random.choice(objects)
        adj = np.random.choice(adjectives)
        adv = np.random.choice(adverbs)
        
        if structure == 'SVO':
            source = f"{subj} {verb} {obj}"
        elif structure == 'SVO_adj':
            source = f"{subj} {verb} the {adj} {obj.split()[-1]}"
        elif structure == 'SVO_adv':
            source = f"{subj} {adv} {verb} {obj}"
        else:
            source = f"{subj} {adv} {verb} the {adj} {obj.split()[-1]}"
        
        # Translate to target
        target_words = []
        for word in source.split():
            target_words.append(word_map.get(word, word))
        target = ' '.join(target_words)
        
        source_sentences.append(source)
        target_sentences.append(target)
    
    return source_sentences, target_sentences, word_map

In [None]:
# Generate the dataset
print("Generating synthetic translation data...")
source_sentences, target_sentences, word_map = generate_translation_data(NUM_SAMPLES, RANDOM_STATE)

print(f"\nGenerated {len(source_sentences)} sentence pairs")
print(f"\nSample pairs:")
for i in range(5):
    print(f"  Source: {source_sentences[i]}")
    print(f"  Target: {target_sentences[i]}")
    print()

In [None]:
# Split into train and validation
np.random.seed(RANDOM_STATE)
indices = np.random.permutation(len(source_sentences))
val_size = int(len(source_sentences) * VAL_RATIO)

val_idx = indices[:val_size]
train_idx = indices[val_size:]

train_source = [source_sentences[i] for i in train_idx]
train_target = [target_sentences[i] for i in train_idx]
val_source = [source_sentences[i] for i in val_idx]
val_target = [target_sentences[i] for i in val_idx]

print(f"Training set: {len(train_source)} pairs")
print(f"Validation set: {len(val_source)} pairs")

## Step 3: Build Vocabularies

Seq2Seq requires:
- Source vocabulary (`vocab.src.json`): Maps source language tokens to integers
- Target vocabulary (`vocab.trg.json`): Maps target language tokens to integers

**Special Tokens:**
- `<pad>` (0): Padding token
- `<unk>` (1): Unknown token
- `<s>` (2): Start of sequence
- `</s>` (3): End of sequence
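
As a small illustration of how these IDs behave, the sketch below encodes a sentence containing an out-of-vocabulary word and decodes it back. The toy vocabulary entries and the helper names `encode` and `decode` are made up for this example; only the four special-token IDs come from the list above.

```python
# Toy vocabulary in the same layout as above (the word entries are made up)
vocab = {'<pad>': 0, '<unk>': 1, '<s>': 2, '</s>': 3, 'the': 4, 'cat': 5, 'sees': 6}
inverse_vocab = {idx: token for token, idx in vocab.items()}

def encode(sentence, vocab):
    """Map each whitespace token to its ID, falling back to <unk>."""
    return [vocab.get(word, vocab['<unk>']) for word in sentence.lower().split()]

def decode(token_ids, inverse_vocab):
    """Map IDs back to tokens, skipping padding."""
    return ' '.join(inverse_vocab[i] for i in token_ids if i != 0)

ids = encode("the cat sees the zebra", vocab)
print(ids)                         # [4, 5, 6, 4, 1]
print(decode(ids, inverse_vocab))  # the cat sees the <unk>
```

Keeping an inverse mapping around makes it easy to check tokenized data by eye before it is serialized.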

In [None]:
def build_vocabulary(sentences, min_freq=1):
    """
    Build vocabulary from sentences.
    
    Returns a dictionary mapping tokens to integer IDs.
    Special tokens are reserved at the beginning.
    """
    # Count word frequencies
    word_counts = Counter()
    for sentence in sentences:
        word_counts.update(sentence.lower().split())
    
    # Start with special tokens
    vocab = {
        '<pad>': 0,
        '<unk>': 1,
        '<s>': 2,
        '</s>': 3
    }
    
    # Add words that meet minimum frequency
    idx = 4
    for word, count in word_counts.most_common():
        if count >= min_freq:
            vocab[word] = idx
            idx += 1
    
    return vocab

# Build vocabularies
source_vocab = build_vocabulary(train_source)
target_vocab = build_vocabulary(train_target)

print(f"Source vocabulary size: {len(source_vocab)}")
print(f"Target vocabulary size: {len(target_vocab)}")

print(f"\nSample source vocab: {dict(list(source_vocab.items())[:10])}")
print(f"Sample target vocab: {dict(list(target_vocab.items())[:10])}")

In [None]:
def tokenize_sentence(sentence, vocab, max_len=None):
    """
    Convert sentence to list of integer token IDs.
    """
    tokens = [vocab.get(word, vocab['<unk>']) for word in sentence.lower().split()]
    
    # Truncate if needed
    if max_len and len(tokens) > max_len:
        tokens = tokens[:max_len]
    
    return tokens

# Test tokenization
sample_source = train_source[0]
sample_tokens = tokenize_sentence(sample_source, source_vocab)
print(f"Original: {sample_source}")
print(f"Tokenized: {sample_tokens}")

## Step 4: Prepare Data in RecordIO-Protobuf Format

SageMaker Seq2Seq requires data in RecordIO-Protobuf format with integer-encoded tokens.

Each record contains:
- Source sequence (integer tokens)
- Target sequence (integer tokens)
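
The RecordIO part of the format is a simple framing around each record. Below is a minimal standard-library sketch of that framing with a matching reader, useful for sanity-checking a file. It assumes the layout used in this notebook (magic number `0xCED7230A`, a 4-byte length, zero padding to a 4-byte boundary), little-endian byte order, and single-part records smaller than 2^29 bytes. Payloads are treated as opaque bytes, and the names `write_records` and `read_records` are hypothetical.

```python
import io
import struct

RECORDIO_MAGIC = 0xCED7230A  # magic number that starts every record

def write_records(stream, payloads):
    """Frame each payload as [magic][length][data][zero padding to 4 bytes]."""
    for payload in payloads:
        stream.write(struct.pack('<II', RECORDIO_MAGIC, len(payload)))
        stream.write(payload)
        stream.write(b'\x00' * (-len(payload) % 4))

def read_records(stream):
    """Yield payloads back from a stream written by write_records."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        magic, length = struct.unpack('<II', header)
        if magic != RECORDIO_MAGIC:
            raise ValueError("bad RecordIO magic number")
        payload = stream.read(length)
        stream.read(-length % 4)  # skip the padding bytes
        yield payload

buffer = io.BytesIO()
write_records(buffer, [b'abc', b'hello', b'12345678'])
buffer.seek(0)
print(list(read_records(buffer)))  # [b'abc', b'hello', b'12345678']
```

Reading a file back this way confirms the record count and framing, though it says nothing about whether each payload is encoded the way the training container expects.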

In [None]:
# Reference implementation of raw RecordIO framing using only the standard library
# (the cells below write the actual files with MXNet instead)
import struct

def write_recordio(data, filename):
    """
    Write data to RecordIO format.
    
    Each record is: [4-byte magic number][4-byte length][data][padding]
    """
    # RecordIO magic number
    RECORDIO_MAGIC = 0xCED7230A
    
    with open(filename, 'wb') as f:
        for record in data:
            # Serialize the record
            record_bytes = record.encode('utf-8') if isinstance(record, str) else record
            
            # Write magic number and length as little-endian unsigned 32-bit integers
            f.write(struct.pack('<I', RECORDIO_MAGIC))
            f.write(struct.pack('<I', len(record_bytes)))
            
            # Write data
            f.write(record_bytes)
            
            # Pad to 4-byte boundary
            padding = (4 - len(record_bytes) % 4) % 4
            f.write(b'\x00' * padding)

In [None]:
# Encode every sentence pair as parallel lists of integer token IDs.
# Serialization to RecordIO happens in a later cell.

def create_seq2seq_data(source_sentences, target_sentences, source_vocab, target_vocab, max_len=50):
    """
    Create parallel integer-encoded sequences for Seq2Seq.
    """
    source_data = []
    target_data = []
    
    for src, tgt in zip(source_sentences, target_sentences):
        src_tokens = tokenize_sentence(src, source_vocab, max_len)
        tgt_tokens = tokenize_sentence(tgt, target_vocab, max_len)
        
        source_data.append(src_tokens)
        target_data.append(tgt_tokens)
    
    return source_data, target_data

# Create tokenized data
train_source_tokens, train_target_tokens = create_seq2seq_data(
    train_source, train_target, source_vocab, target_vocab, MAX_SEQ_LEN
)
val_source_tokens, val_target_tokens = create_seq2seq_data(
    val_source, val_target, source_vocab, target_vocab, MAX_SEQ_LEN
)

print(f"Training samples: {len(train_source_tokens)}")
print(f"Validation samples: {len(val_source_tokens)}")
print(f"\nSample tokenized pair:")
print(f"  Source tokens: {train_source_tokens[0]}")
print(f"  Target tokens: {train_target_tokens[0]}")

In [None]:
# Create data directory
os.makedirs('data', exist_ok=True)

# Save vocabularies in JSON format (required by Seq2Seq)
with open('data/vocab.src.json', 'w') as f:
    json.dump(source_vocab, f, indent=2)

with open('data/vocab.trg.json', 'w') as f:
    json.dump(target_vocab, f, indent=2)

print("Vocabularies saved:")
print(f"  - data/vocab.src.json ({len(source_vocab)} tokens)")
print(f"  - data/vocab.trg.json ({len(target_vocab)} tokens)")

In [None]:
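# Serialize the token sequences to RecordIO files with MXNet's recordio writer.
# Seq2Seq training expects RecordIO-Protobuf, so verify that these records
# match the layout the algorithm container reads before starting a long job.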

try:
    import mxnet as mx
    MXNET_AVAILABLE = True
except ImportError:
    MXNET_AVAILABLE = False
    print("MXNet not available. Using alternative format.")

def write_seq2seq_recordio(source_tokens, target_tokens, filename):
    """
    Write Seq2Seq data to RecordIO format using MXNet.
    
    Each record contains source and target sequences as integer arrays.
    """
    if not MXNET_AVAILABLE:
        # Fallback: write JSON Lines for inspection only; training expects RecordIO-Protobuf
        with open(filename.replace('.rec', '.jsonl'), 'w') as f:
            for src, tgt in zip(source_tokens, target_tokens):
                record = {'source': src, 'target': tgt}
                f.write(json.dumps(record) + '\n')
        return
    
    record_writer = mx.recordio.MXRecordIO(filename, 'w')
    
    for src, tgt in zip(source_tokens, target_tokens):
        # Pack source and target as numpy arrays
        src_array = np.array(src, dtype=np.int32)
        tgt_array = np.array(tgt, dtype=np.int32)
        
        # Create header with shape info
        header = mx.recordio.IRHeader(0, [len(src), len(tgt)], 0, 0)
        
        # Pack the record
        packed = mx.recordio.pack(
            header,
            np.concatenate([src_array, tgt_array]).tobytes()
        )
        record_writer.write(packed)
    
    record_writer.close()

# Write training and validation data
write_seq2seq_recordio(train_source_tokens, train_target_tokens, 'data/train.rec')
write_seq2seq_recordio(val_source_tokens, val_target_tokens, 'data/val.rec')

if MXNET_AVAILABLE:
    print("RecordIO files created:")
    print(f"  - data/train.rec ({os.path.getsize('data/train.rec') / 1024:.1f} KB)")
    print(f"  - data/val.rec ({os.path.getsize('data/val.rec') / 1024:.1f} KB)")
else:
    print("JSON Lines files created (MXNet not available):")
    print(f"  - data/train.jsonl")
    print(f"  - data/val.jsonl")

In [None]:
# Upload to S3
s3_client = boto3.client('s3')

# Upload training data
if MXNET_AVAILABLE:
    s3_client.upload_file('data/train.rec', BUCKET_NAME, f"{PREFIX}/train/train.rec")
    s3_client.upload_file('data/val.rec', BUCKET_NAME, f"{PREFIX}/validation/val.rec")
else:
    s3_client.upload_file('data/train.jsonl', BUCKET_NAME, f"{PREFIX}/train/train.jsonl")
    s3_client.upload_file('data/val.jsonl', BUCKET_NAME, f"{PREFIX}/validation/val.jsonl")

# Upload vocabularies
s3_client.upload_file('data/vocab.src.json', BUCKET_NAME, f"{PREFIX}/vocab/vocab.src.json")
s3_client.upload_file('data/vocab.trg.json', BUCKET_NAME, f"{PREFIX}/vocab/vocab.trg.json")

train_s3_uri = f"s3://{BUCKET_NAME}/{PREFIX}/train"
val_s3_uri = f"s3://{BUCKET_NAME}/{PREFIX}/validation"
vocab_s3_uri = f"s3://{BUCKET_NAME}/{PREFIX}/vocab"

print("Data uploaded to S3:")
print(f"  Train: {train_s3_uri}")
print(f"  Validation: {val_s3_uri}")
print(f"  Vocab: {vocab_s3_uri}")

## Step 5: Configure and Train the Seq2Seq Model

### Key Hyperparameters

**Encoder/Decoder Architecture**

| Parameter | Description | Default |
|-----------|-------------|---------|
| `encoder_type` | Encoder architecture: `rnn` or `cnn` | rnn |
| `decoder_type` | Decoder architecture: `rnn` or `cnn` | rnn |
| `num_layers_encoder` | Number of encoder layers | 1 |
| `num_layers_decoder` | Number of decoder layers | 1 |
| `rnn_num_hidden` | Hidden units in RNN layers | 1024 |
| `rnn_cell_type` | RNN cell type: `lstm` or `gru` | lstm |

**Attention Mechanism**

| Parameter | Description | Default |
|-----------|-------------|---------|
| `rnn_attention_type` | Attention type: `dot`, `mlp`, `bilinear`, `fixed` | mlp |
| `rnn_attention_num_hidden` | Hidden units in attention layer | rnn_num_hidden |

**Embedding**

| Parameter | Description | Default |
|-----------|-------------|---------|
| `num_embed_source` | Source embedding dimension | 512 |
| `num_embed_target` | Target embedding dimension | 512 |
| `embed_dropout_source` | Source embedding dropout | 0 |
| `embed_dropout_target` | Target embedding dropout | 0 |

**Training**

| Parameter | Description | Default |
|-----------|-------------|---------|
| `batch_size` | Mini-batch size | 64 |
| `learning_rate` | Initial learning rate | 0.0003 |
| `optimizer_type` | Optimizer: `adam`, `sgd`, `rmsprop` | adam |
| `max_num_epochs` | Maximum training epochs (ignored if not set) | none |
| `clip_gradient` | Gradient clipping threshold | 1 |
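
It helps to translate these settings into batch counts before training. The arithmetic below uses the values configured in this notebook (5,000 samples, a 10% validation split, batch size 64, 10 epochs, and a checkpoint every 500 batches). It assumes a final partial batch still counts as one batch, which may differ from how the container counts.

```python
import math

# Values taken from this notebook's configuration
num_samples = 5000
val_ratio = 0.1
batch_size = 64
max_num_epochs = 10
checkpoint_frequency_num_batches = 500

train_size = num_samples - int(num_samples * val_ratio)
# Assumption: a final partial batch still counts as one batch
batches_per_epoch = math.ceil(train_size / batch_size)
total_batches = batches_per_epoch * max_num_epochs
num_checkpoints = total_batches // checkpoint_frequency_num_batches

print(train_size, batches_per_epoch, total_batches, num_checkpoints)  # 4500 71 710 1
```

Under these assumptions the job reaches only one checkpoint in ten epochs, so a smaller checkpoint frequency may be worth considering on a dataset this small.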

**Sequence Handling**

| Parameter | Description | Default |
|-----------|-------------|---------|
| `max_seq_len_source` | Maximum source sequence length | 100 |
| `max_seq_len_target` | Maximum target sequence length | 100 |
| `bucketing_enabled` | Enable sequence length bucketing | true |
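
Bucketing groups sequences of similar length so that each batch needs little padding. The sketch below shows the idea on source lengths only. The function name `bucket_by_length` and the width of 10 are illustrative assumptions, and the container's own bucketing logic may differ, for example by also considering target lengths.

```python
from collections import defaultdict

def bucket_by_length(sequences, bucket_width=10):
    """Group sequences by length rounded up to the next multiple of bucket_width."""
    buckets = defaultdict(list)
    for seq in sequences:
        # Smallest multiple of bucket_width that fits the sequence
        bucket_len = -(-len(seq) // bucket_width) * bucket_width
        buckets[bucket_len].append(seq)
    return dict(buckets)

# Toy token-ID sequences of lengths 3, 7, 12 and 20
sequences = [[4, 5, 6], [4] * 7, [4] * 12, [4] * 20]
for bucket_len, members in sorted(bucket_by_length(sequences).items()):
    print(bucket_len, [len(m) for m in members])
# 10 [3, 7]
# 20 [12, 20]
```

Each bucket is padded only up to its own length, so short sentences are not padded all the way to `max_seq_len_source`.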

**Inference**

| Parameter | Description | Default |
|-----------|-------------|---------|
| `beam_size` | Beam search width for decoding | 5 |
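
To show what the beam width controls, here is a minimal beam search over a hypothetical toy model. The `TABLE` of probabilities, `toy_model`, and `beam_search` are all made up for this example. The sketch omits refinements such as length normalization and is not the container's decoder.

```python
import math

def beam_search(step_log_probs, beam_size, max_len, start='<s>', end='</s>'):
    """Keep the beam_size best partial sequences at every decoding step.

    step_log_probs(prefix) must return {token: log_prob} for the next token.
    """
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end:
                candidates.append((tokens, score))  # finished hypotheses carry over
                continue
            for token, log_prob in step_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_prob))
        # Prune to the beam_size highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(tokens[-1] == end for tokens, _ in beams):
            break
    return beams

# Hypothetical toy model: next-token probabilities depend only on the last token
TABLE = {
    '<s>': {'a': 0.6, 'b': 0.4},
    'a': {'x': 0.5, 'y': 0.5},
    'b': {'z': 0.9, 'y': 0.1},
    'x': {'</s>': 1.0}, 'y': {'</s>': 1.0}, 'z': {'</s>': 1.0},
}

def toy_model(prefix):
    return {tok: math.log(p) for tok, p in TABLE[prefix[-1]].items()}

greedy = beam_search(toy_model, beam_size=1, max_len=5)[0]
best = beam_search(toy_model, beam_size=2, max_len=5)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))  # ['<s>', 'a', 'x', '</s>'] 0.3
print(best[0], round(math.exp(best[1]), 2))      # ['<s>', 'b', 'z', '</s>'] 0.36
```

With a beam of 1 the search commits to the locally best first token and ends with probability 0.3, while a beam of 2 keeps the alternative alive and finds the sequence with probability 0.36. Wider beams cost more computation per step.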

In [None]:
# Get the Seq2Seq container image
seq2seq_image = retrieve(
    framework='seq2seq',
    region=region,
    version='1'
)

print(f"Seq2Seq Image URI: {seq2seq_image}")

In [None]:
# Define the estimator
# NOTE: Seq2Seq requires GPU instances (P2, P3, G4dn, G5)
seq2seq_estimator = Estimator(
    image_uri=seq2seq_image,
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',  # GPU required!
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
    sagemaker_session=sagemaker_session,
    base_job_name='seq2seq-translation'
)

In [None]:
# Set hyperparameters
hyperparameters = {
    # Encoder configuration
    "encoder_type": "rnn",
    "num_layers_encoder": 2,
    "rnn_num_hidden": 256,
    "rnn_cell_type": "lstm",
    
    # Decoder configuration
    "decoder_type": "rnn",
    "num_layers_decoder": 2,
    "rnn_attention_type": "mlp",
    
    # Embedding
    "num_embed_source": 256,
    "num_embed_target": 256,
    "embed_dropout_source": 0.1,
    "embed_dropout_target": 0.1,
    
    # Sequence handling
    "max_seq_len_source": MAX_SEQ_LEN,
    "max_seq_len_target": MAX_SEQ_LEN,
    "bucketing_enabled": "true",
    
    # Training
    "batch_size": 64,
    "learning_rate": 0.0003,
    "optimizer_type": "adam",
    "max_num_epochs": 10,
    "clip_gradient": 1.0,
    
    # Checkpointing
    "checkpoint_frequency_num_batches": 500,
    "checkpoint_threshold": 3,
    
    # Inference
    "beam_size": 5,
}

seq2seq_estimator.set_hyperparameters(**hyperparameters)

print("Hyperparameters configured:")
for k, v in hyperparameters.items():
    print(f"  {k}: {v}")

In [None]:
# Define data channels - Seq2Seq requires train, validation, and vocab channels
data_channels = {
    "train": train_s3_uri,
    "validation": val_s3_uri,
    "vocab": vocab_s3_uri
}

print("Starting training job...")
print("NOTE: This requires a GPU instance and will take 15-30 minutes.\n")

# Start training
seq2seq_estimator.fit(inputs=data_channels, wait=True, logs=True)

In [None]:
# Get training job info
training_job_name = seq2seq_estimator.latest_training_job.name
print(f"Training job completed: {training_job_name}")
print(f"Model artifacts: {seq2seq_estimator.model_data}")

## Step 6: Deploy the Model

Deploy the trained model to a real-time endpoint so that it can translate sentences on demand.

In [None]:
# Deploy to an endpoint
print("Deploying model to endpoint...")
print("This will take approximately 5-7 minutes.\n")

predictor = seq2seq_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',  # CPU is fine for inference
    endpoint_name=f'seq2seq-translation-{datetime.now().strftime("%Y%m%d%H%M")}'
)

print(f"\nEndpoint deployed: {predictor.endpoint_name}")

## Step 7: Make Predictions

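The endpoint accepts a JSON request whose `instances` list holds one `data` string of space-separated source tokens per sentence. The response contains a `predictions` list with a `target` string for each input.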
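
The request and response shapes can be exercised without an endpoint. The sketch below builds a payload for several sentences and parses a response body, following the `instances`/`data` and `predictions`/`target` structure this notebook uses. The helper names `build_request` and `parse_response` and the sample response body are hypothetical.

```python
import json

def build_request(sentences):
    """Wrap one or more source sentences in the JSON request structure."""
    return json.dumps({"instances": [{"data": s} for s in sentences]})

def parse_response(body):
    """Extract the translated strings from a JSON response body."""
    response = json.loads(body)
    return [p.get("target", "") for p in response.get("predictions", [])]

payload = build_request(["the cat sees the ball", "the dog likes the food"])
print(payload)

# Hypothetical response body, shaped like the one this notebook parses later
fake_body = '{"predictions": [{"target": "el gato ve el pelota"}, {"target": "el perro gusta el comida"}]}'
print(parse_response(fake_body))
```

Sending several sentences in one request reduces the number of round trips when translating a batch.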

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Configure predictor
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

def translate(text):
    """
    Translate text using the deployed model.
    """
    # Seq2Seq expects the source text as space-separated tokens
    request = {"instances": [{"data": text}]}
    response = predictor.predict(request)
    return response

# Test translations
test_sentences = [
    "the cat sees the ball",
    "the dog likes the food",
    "the woman helps the child",
    "the teacher quickly finds the book",
    "the artist loves the big flower"
]

print("Translation Results:")
print("=" * 60)
for sentence in test_sentences:
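    result = translate(sentence)
    # The response is a dict; pull the translated string out of the first prediction
    target = result.get('predictions', [{}])[0].get('target', '')
    print(f"Source:  {sentence}")
    print(f"Target:  {target}")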
    print()

## Step 8: Evaluate Translation Quality

### Understanding Translation Metrics

**BLEU Score (Bilingual Evaluation Understudy)**
- Standard metric for machine translation
- Measures n-gram overlap between prediction and reference
- Range: 0-100 (higher is better)
- Interpretation:
  - < 10: Almost useless
  - 10-19: Hard to understand
  - 20-29: Clear meaning, grammatical errors
  - 30-40: Understandable, good quality
  - 40-50: High quality
  - 50-60: Very high quality
  - > 60: Often better than human quality (rare)

**Exact Match**
- Percentage of predictions exactly matching reference
- Very strict metric
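
The n-gram overlap in BLEU is a clipped (modified) precision: a candidate n-gram counts at most as many times as it occurs in the reference. The short sketch below shows why clipping matters. The function name `clipped_precision` and the example sentences are made up for illustration.

```python
from collections import Counter

def clipped_precision(reference, candidate, n=1):
    """Modified n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    ref = reference.split()
    cand = candidate.split()
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram
    matches = sum((ref_ngrams & cand_ngrams).values())
    return matches / total

reference = "el gato ve el libro"
print(clipped_precision(reference, "el el el el"))              # 0.5
print(clipped_precision(reference, "el gato ve el libro"))      # 1.0
print(clipped_precision(reference, "el gato ve la casa", n=2))  # 0.5
```

Without clipping, the degenerate candidate "el el el el" would score a perfect unigram precision; clipping caps it at the two occurrences of "el" in the reference.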

In [None]:
from collections import Counter
import math

def calculate_bleu(reference, candidate, max_n=4):
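    """
    Calculate an unsmoothed sentence-level BLEU score (0-100).

    Returns 0 whenever any n-gram order has no match, which includes
    every candidate shorter than max_n tokens.
    """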
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    
    # Calculate n-gram precisions
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter([tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens)-n+1)])
        cand_ngrams = Counter([tuple(cand_tokens[i:i+n]) for i in range(len(cand_tokens)-n+1)])
        
        matches = sum((ref_ngrams & cand_ngrams).values())
        total = sum(cand_ngrams.values())
        
        if total > 0:
            precisions.append(matches / total)
        else:
            precisions.append(0)
    
    # Brevity penalty
    if len(cand_tokens) == 0:
        return 0
    bp = min(1, math.exp(1 - len(ref_tokens) / len(cand_tokens)))
    
    # Geometric mean of precisions
    if all(p > 0 for p in precisions):
        log_precisions = [math.log(p) for p in precisions]
        bleu = bp * math.exp(sum(log_precisions) / len(log_precisions))
    else:
        bleu = 0
    
    return bleu * 100  # Return as percentage

# Evaluate on validation set
bleu_scores = []
exact_matches = 0

print("Evaluating on validation set...")
for i, (src, ref) in enumerate(zip(val_source[:100], val_target[:100])):
    try:
        result = translate(src)
        pred = result.get('predictions', [{}])[0].get('target', '')
        
        bleu = calculate_bleu(ref, pred)
        bleu_scores.append(bleu)
        
        if pred.lower().strip() == ref.lower().strip():
            exact_matches += 1
        
        if i < 5:
            print(f"\nSource: {src}")
            print(f"Reference: {ref}")
            print(f"Predicted: {pred}")
            print(f"BLEU: {bleu:.2f}")
    except Exception as e:
        print(f"Error: {e}")
        continue

if bleu_scores:
    print("\n" + "=" * 60)
    print("EVALUATION RESULTS")
    print("=" * 60)
    print(f"Average BLEU: {np.mean(bleu_scores):.2f}")
    print(f"Median BLEU: {np.median(bleu_scores):.2f}")
    print(f"Exact Match: {exact_matches}/{len(bleu_scores)} ({100*exact_matches/len(bleu_scores):.1f}%)")

## Step 9: Clean Up Resources

**IMPORTANT**: Always delete endpoints when done to avoid ongoing charges!

In [None]:
# Delete the endpoint
print(f"Deleting endpoint: {predictor.endpoint_name}")
predictor.delete_endpoint()
print("Endpoint deleted successfully!")

In [None]:
# Optionally, clean up S3 data
# Uncomment the following lines if you want to delete the S3 data

# import boto3
# s3 = boto3.resource('s3')
# bucket = s3.Bucket(BUCKET_NAME)
# bucket.objects.filter(Prefix=PREFIX).delete()
# print(f"Deleted all objects under s3://{BUCKET_NAME}/{PREFIX}")

---

## Summary

In this exercise, you learned:

1. **Data Format**: Seq2Seq requires RecordIO-Protobuf format with integer-encoded tokens, plus JSON vocabulary files

2. **Three Required Channels**:
   - `train`: Training data
   - `validation`: Validation data
   - `vocab`: Source and target vocabulary files (`vocab.src.json`, `vocab.trg.json`)

3. **Key Hyperparameters**:
   - `encoder_type`, `decoder_type`: Architecture choice (rnn or cnn)
   - `rnn_num_hidden`: Hidden units (model capacity)
   - `num_layers_encoder`, `num_layers_decoder`: Depth
   - `rnn_attention_type`: Attention mechanism
   - `beam_size`: Decoding beam width

4. **Hardware Requirements**:
   - **GPU required** for training (P2, P3, G4dn, G5)
   - CPU can be used for inference

5. **Evaluation**: Use BLEU score for translation quality

## Use Cases

| Use Case | Description |
|----------|-------------|
| Machine Translation | Translate text between languages |
| Text Summarization | Generate summaries from documents |
| Question Answering | Generate answers from context |
| Chatbots | Generate responses to user queries |
| Code Generation | Generate code from descriptions |

## Next Steps

- Try different encoder/decoder architectures (CNN typically trains faster)
- Experiment with attention types (dot, mlp, bilinear)
- Use larger hidden dimensions for complex tasks
- Add more training data for better generalization
- Try transfer learning from pre-trained models