# SageMaker Latent Dirichlet Allocation (LDA) Exercise

This notebook demonstrates Amazon SageMaker's **Latent Dirichlet Allocation (LDA)** algorithm for topic modeling.

## What You'll Learn
1. How to prepare document data for LDA
2. How to train an LDA topic model
3. How to interpret per-document topic mixtures and the top words for each topic

## What is LDA?

LDA is an **unsupervised** probabilistic model that discovers hidden topics in document collections. It assumes:
- Each document is a mixture of topics
- Each topic is a distribution over words
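
The two assumptions above can be read as a recipe for generating a document. The following is a minimal sketch of that generative process using only the standard library; the function name `sample_document`, the toy topics, and the prior values are illustrative assumptions, not part of SageMaker's implementation.

```python
import random

def sample_document(topic_word_probs, alpha, doc_length, rng):
    """Draw one document from the LDA generative process."""
    # Dirichlet draw via normalized Gamma variates: the document's topic mixture
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    mixture = [g / total for g in gammas]

    words = []
    for _ in range(doc_length):
        # Pick a topic from the mixture, then a word from that topic
        topic = rng.choices(range(len(mixture)), weights=mixture)[0]
        vocab, probs = zip(*topic_word_probs[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return mixture, words

# Two toy topics over a four-word vocabulary (illustrative values)
toy_topics = [
    {"film": 0.5, "actor": 0.4, "song": 0.05, "band": 0.05},
    {"film": 0.05, "actor": 0.05, "song": 0.5, "band": 0.4},
]

rng = random.Random(0)
mixture, words = sample_document(toy_topics, alpha=[0.5, 0.5], doc_length=12, rng=rng)
print([round(w, 3) for w in mixture], words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topics and each document's mixture.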

**SageMaker's Implementation:**
- Uses tensor spectral decomposition (not Gibbs sampling)
- Provides theoretical guarantees on results
- Highly parallelizable

## LDA vs NTM

| Aspect | LDA | NTM |
|--------|-----|-----|
| Method | Probabilistic | Neural network |
| Instance | CPU only | CPU and GPU |
| Parallelization | Single instance (CPU) | Multi-GPU, distributed |
| Topic coherence | Often better | Good |
| Perplexity | Better | Good |

---

## Step 1: Setup and Imports

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
import pandas as pd
import numpy as np
import json
import os
from datetime import datetime
from dotenv import load_dotenv
from collections import Counter
import matplotlib.pyplot as plt

# Load environment variables from .env file
load_dotenv()

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
# Configuration
BUCKET_NAME = sagemaker_session.default_bucket()
PREFIX = "lda"

# Dataset parameters
NUM_DOCUMENTS = 1000
NUM_TOPICS = 5
RANDOM_STATE = 42

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

## Step 2: Generate Synthetic Document Data

In [None]:
def generate_topic_documents(num_docs=1000, num_topics=5, seed=42):
    """
    Generate synthetic documents with known topic structure.
    """
    np.random.seed(seed)
    
    # Define topic vocabularies
    topic_words = {
        0: ['movie', 'film', 'actor', 'director', 'scene', 'character', 
            'cinema', 'award', 'performance', 'screen', 'drama', 'comedy',
            'hollywood', 'studio', 'premiere', 'cast', 'sequel', 'script', 'review', 'rating'],
        1: ['music', 'song', 'album', 'artist', 'concert', 'band', 
            'guitar', 'drums', 'piano', 'melody', 'lyrics', 'rhythm',
            'singer', 'tour', 'release', 'track', 'genre', 'festival', 'hit', 'record'],
        2: ['food', 'recipe', 'cooking', 'chef', 'restaurant', 'ingredient', 
            'kitchen', 'meal', 'taste', 'flavor', 'dish', 'cuisine',
            'baking', 'dinner', 'lunch', 'breakfast', 'vegetable', 'spice', 'delicious', 'homemade'],
        3: ['travel', 'destination', 'hotel', 'flight', 'vacation', 'tourism', 
            'beach', 'mountain', 'city', 'country', 'adventure', 'explore',
            'passport', 'trip', 'journey', 'sightseeing', 'culture', 'resort', 'booking', 'itinerary'],
        4: ['fashion', 'style', 'clothing', 'designer', 'trend', 'outfit', 
            'collection', 'runway', 'model', 'brand', 'accessories', 'shoes',
            'dress', 'fabric', 'boutique', 'wardrobe', 'elegant', 'casual', 'luxury', 'season']
    }
    
    topic_names = ['Movies', 'Music', 'Food', 'Travel', 'Fashion']
    
    # Build vocabulary
    all_words = []
    for words in topic_words.values():
        all_words.extend(words)
    vocab = sorted(set(all_words))
    word_to_idx = {word: idx for idx, word in enumerate(vocab)}
    
    documents = []
    doc_topics = []
    
    for _ in range(num_docs):
        dominant_topic = np.random.randint(0, num_topics)
        doc_topics.append(dominant_topic)
        
        doc_length = np.random.randint(50, 150)
        doc_words = []
        
        for _ in range(doc_length):
            if np.random.random() < 0.8:
                word = np.random.choice(topic_words[dominant_topic])
            else:
                other_topic = np.random.choice([t for t in range(num_topics) if t != dominant_topic])
                word = np.random.choice(topic_words[other_topic])
            doc_words.append(word)
        
        documents.append(' '.join(doc_words))
    
    return documents, doc_topics, vocab, word_to_idx, topic_names

# Generate documents
documents, doc_topics, vocab, word_to_idx, topic_names = generate_topic_documents(
    NUM_DOCUMENTS, NUM_TOPICS, RANDOM_STATE
)

print(f"Generated {len(documents)} documents")
print(f"Vocabulary size: {len(vocab)}")
print(f"Topics: {topic_names}")
print(f"\nTopic distribution: {Counter(doc_topics)}")

## Step 3: Prepare Data for LDA

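SageMaker LDA expects each document as a **bag-of-words** vector: one word count per vocabulary entry, so every row has length equal to the vocabulary size. The data can be supplied as dense CSV or as RecordIO-protobuf.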
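
As a small illustration of that layout, the sketch below turns one toy sentence into a single CSV row. The helper `to_bow_row` and the three-word vocabulary are hypothetical and exist only for this example.

```python
from collections import Counter

def to_bow_row(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    # One column per vocabulary word, in vocabulary order; unknown words are dropped
    return [counts[word] for word in vocab]

toy_vocab = ["actor", "film", "song"]
row = to_bow_row("film actor film trailer", toy_vocab)
print(",".join(str(c) for c in row))  # one CSV line: 1,2,0
```

Word order is discarded and out-of-vocabulary words such as "trailer" are ignored, which is what "bag" means here.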

In [None]:
def documents_to_bow(documents, word_to_idx):
    """
    Convert documents to bag-of-words matrix.
    """
    vocab_size = len(word_to_idx)
    bow_matrix = np.zeros((len(documents), vocab_size), dtype=np.float32)
    
    for doc_idx, doc in enumerate(documents):
        words = doc.lower().split()
        for word in words:
            if word in word_to_idx:
                bow_matrix[doc_idx, word_to_idx[word]] += 1
    
    return bow_matrix

# Convert to bag-of-words
bow_matrix = documents_to_bow(documents, word_to_idx)

print(f"Bag-of-words matrix shape: {bow_matrix.shape}")
print(f"Non-zero entries: {np.count_nonzero(bow_matrix)}")

In [None]:
# Save as CSV
os.makedirs('data/lda', exist_ok=True)

# Counts are whole numbers, so write them as integers instead of scientific notation
np.savetxt('data/lda/train.csv', bow_matrix, delimiter=',', fmt='%d')

# Save vocabulary
with open('data/lda/vocab.json', 'w') as f:
    json.dump({'vocab': vocab, 'word_to_idx': word_to_idx}, f)

print(f"Saved: data/lda/train.csv ({os.path.getsize('data/lda/train.csv') / 1024:.1f} KB)")

In [None]:
# Upload to S3
s3_client = boto3.client('s3')

train_s3_key = f"{PREFIX}/train/train.csv"
s3_client.upload_file('data/lda/train.csv', BUCKET_NAME, train_s3_key)

train_uri = f"s3://{BUCKET_NAME}/{PREFIX}/train"
print(f"Data uploaded to: {train_uri}")

## Step 4: Train LDA Model

### Key Hyperparameters

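| Parameter | Description | Default |
|-----------|-------------|---------|
| `num_topics` | Number of topics to find | Required |
| `feature_dim` | Vocabulary size | Required |
| `mini_batch_size` | Total number of documents in the training corpus | Required |
| `alpha0` | Initial guess for the Dirichlet concentration parameter (the sum of the prior's elements); smaller values favor sparser topic mixtures | 1.0 |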

**Note:** SageMaker LDA only supports single-instance CPU training.
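
To build intuition for `alpha0`, the sketch below samples topic mixtures from a Dirichlet prior at two concentration values and reports how dominant the largest topic is on average. It assumes a symmetric prior whose elements sum to `alpha0`; the helper name `mean_largest_weight` and the chosen values are illustrative, and this simulates the prior only, not SageMaker's training procedure.

```python
import random

def mean_largest_weight(alpha0, num_topics, num_samples, rng):
    """Average size of the largest topic weight under a symmetric Dirichlet prior."""
    per_topic = alpha0 / num_topics  # assumed symmetric prior; elements sum to alpha0
    total_max = 0.0
    for _ in range(num_samples):
        # Normalized Gamma variates give one Dirichlet sample
        gammas = [rng.gammavariate(per_topic, 1.0) for _ in range(num_topics)]
        total_max += max(gammas) / sum(gammas)
    return total_max / num_samples

rng = random.Random(42)
sparse = mean_largest_weight(alpha0=0.5, num_topics=5, num_samples=2000, rng=rng)
uniform = mean_largest_weight(alpha0=50.0, num_topics=5, num_samples=2000, rng=rng)
print(f"alpha0=0.5  -> mean largest weight {sparse:.2f}")
print(f"alpha0=50.0 -> mean largest weight {uniform:.2f}")
```

A small `alpha0` concentrates most of each document's weight on one topic, while a large value pushes mixtures toward uniform.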

In [None]:
# Get LDA container image
lda_image = retrieve(
    framework='lda',
    region=region,
    version='1'
)

print(f"LDA Image URI: {lda_image}")

In [None]:
# Create LDA estimator
lda_estimator = Estimator(
    image_uri=lda_image,
    role=role,
    instance_count=1,
    instance_type='ml.c5.xlarge',  # CPU only
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
    sagemaker_session=sagemaker_session,
    base_job_name='lda'
)

In [None]:
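# Set hyperparameters
hyperparameters = {
    "num_topics": NUM_TOPICS,
    "feature_dim": len(vocab),
    # For SageMaker LDA this is the total number of training documents,
    # not a gradient-descent batch size
    "mini_batch_size": bow_matrix.shape[0],
    "alpha0": 1.0,
}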

lda_estimator.set_hyperparameters(**hyperparameters)

print("LDA hyperparameters:")
for k, v in hyperparameters.items():
    print(f"  {k}: {v}")

In [None]:
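from sagemaker.inputs import TrainingInput

# Start training
print("Starting LDA training job...")
print("This will take approximately 3-5 minutes.\n")

# CSV input needs an explicit content type. label_size=0 follows the general
# SageMaker guidance for unsupervised built-in algorithms (no label column).
train_input = TrainingInput(train_uri, content_type='text/csv;label_size=0')

lda_estimator.fit(
    {'train': train_input},
    wait=True,
    logs=True
)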

In [None]:
# Get training job info
job_name = lda_estimator.latest_training_job.name
print(f"Training job completed: {job_name}")
print(f"Model artifacts: {lda_estimator.model_data}")

## Step 5: Deploy and Infer Topics

In [None]:
# Deploy the model
print("Deploying LDA model...")
print("This will take approximately 5-7 minutes.\n")

lda_predictor = lda_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name=f'lda-{datetime.now().strftime("%Y%m%d%H%M")}'
)

print(f"\nEndpoint deployed: {lda_predictor.endpoint_name}")

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Configure predictor
lda_predictor.serializer = CSVSerializer()
lda_predictor.deserializer = JSONDeserializer()

def get_topic_mixtures(data, predictor, batch_size=100):
    """
    Get topic mixtures for documents.
    """
    all_mixtures = []
    
    for i in range(0, len(data), batch_size):
        batch = data[i:i+batch_size]
        response = predictor.predict(batch)
        
        for pred in response['predictions']:
            all_mixtures.append(pred['topic_mixture'])
    
    return np.array(all_mixtures)

In [None]:
# Get topic mixtures
print("Getting topic mixtures...")
topic_mixtures = get_topic_mixtures(bow_matrix, lda_predictor)

print(f"Topic mixtures shape: {topic_mixtures.shape}")
print(f"\nSample mixture (first document):")
for i, weight in enumerate(topic_mixtures[0]):
    print(f"  Topic {i}: {weight:.4f}")
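
The helper above relies on the response having a `predictions` list whose items each carry a `topic_mixture`. The sketch below parses a hand-written body of that shape and checks that every mixture sums to roughly 1. The mock payload and the name `parse_topic_mixtures` are assumptions for illustration; no endpoint is called.

```python
import json

# Hypothetical response body, shaped like the structure the helper above reads
mock_body = json.dumps({
    "predictions": [
        {"topic_mixture": [0.7, 0.1, 0.1, 0.05, 0.05]},
        {"topic_mixture": [0.0, 0.2, 0.8, 0.0, 0.0]},
    ]
})

def parse_topic_mixtures(body, tol=1e-3):
    """Extract topic mixtures from a JSON body and check each sums to about 1."""
    mixtures = [p["topic_mixture"] for p in json.loads(body)["predictions"]]
    for row, mixture in enumerate(mixtures):
        if abs(sum(mixture) - 1.0) > tol:
            raise ValueError(f"mixture {row} sums to {sum(mixture):.4f}, expected ~1")
    return mixtures

mixtures = parse_topic_mixtures(mock_body)
# Index of the largest weight in each mixture
dominant = [max(range(len(m)), key=m.__getitem__) for m in mixtures]
print(dominant)  # [0, 2]
```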

## Step 6: Analyze Topics

In [None]:
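# Assign dominant topics. Learned topic indices are arbitrary, so "Topic 0" here
# need not correspond to topic 0 ('Movies') in the generator.
predicted_topics = np.argmax(topic_mixtures, axis=1)

print("Predicted Topic Distribution:")
print(Counter(predicted_topics.tolist()))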

print("\nTrue Topic Distribution:")
print(Counter(doc_topics))
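
Because an unsupervised model numbers its topics arbitrarily, comparing the two distributions directly only shows whether the group sizes look similar. One way to score the match is to search for the relabeling of predicted topics that agrees most often with the known labels. The sketch below does this by brute force on toy labels; `best_topic_mapping` is a hypothetical helper, and the approach is only practical for a handful of topics.

```python
from collections import Counter
from itertools import permutations

def best_topic_mapping(true_labels, predicted_labels, num_topics):
    """Find the predicted->true relabeling that maximizes agreement (brute force)."""
    pair_counts = Counter(zip(predicted_labels, true_labels))
    best_perm, best_hits = None, -1
    # num_topics! candidates: fine for 5 topics, too slow for large topic counts
    for perm in permutations(range(num_topics)):
        hits = sum(pair_counts[(pred, perm[pred])] for pred in range(num_topics))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    mapping = dict(enumerate(best_perm))
    return mapping, best_hits / len(true_labels)

# Toy labels: predicted IDs are a relabeling of the true ones, with one mistake
true_toy = [0, 0, 1, 1, 2, 2]
pred_toy = [2, 2, 0, 0, 1, 0]
mapping, accuracy = best_topic_mapping(true_toy, pred_toy, num_topics=3)
print(mapping, round(accuracy, 3))
```

In this notebook the call would look like `best_topic_mapping(doc_topics, predicted_topics.tolist(), NUM_TOPICS)`.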

In [None]:
# Visualize topic mixtures
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Average topic weights
avg_weights = topic_mixtures.mean(axis=0)
axes[0].bar(range(NUM_TOPICS), avg_weights)
axes[0].set_xlabel('Topic')
axes[0].set_ylabel('Average Weight')
axes[0].set_title('Average Topic Weights')

# Sample document mixtures
sample_docs = np.random.choice(len(topic_mixtures), 20, replace=False)
im = axes[1].imshow(topic_mixtures[sample_docs], aspect='auto', cmap='YlOrRd')
axes[1].set_xlabel('Topic')
axes[1].set_ylabel('Document')
axes[1].set_title('Topic Mixtures for Sample Documents')
plt.colorbar(im, ax=axes[1])

plt.tight_layout()
plt.show()

In [None]:
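# Approximate the top words per topic from the documents assigned to it.
# This is a proxy: it does not read the model's learned topic-word distributions.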
print("Top Words per Discovered Topic:")
print("=" * 60)

for topic_id in range(NUM_TOPICS):
    topic_docs_mask = predicted_topics == topic_id
    topic_docs = bow_matrix[topic_docs_mask]
    
    word_counts = topic_docs.sum(axis=0)
    top_word_indices = np.argsort(word_counts)[::-1][:10]
    top_words = [vocab[idx] for idx in top_word_indices]
    
    print(f"\nTopic {topic_id} ({topic_docs_mask.sum()} documents):")
    print(f"  Top words: {', '.join(top_words)}")

## Step 7: Clean Up Resources

In [None]:
# Delete the endpoint
print(f"Deleting endpoint: {lda_predictor.endpoint_name}")
lda_predictor.delete_endpoint()
print("Endpoint deleted successfully!")

---

## Summary

In this exercise, you learned:

1. **Data Format**: Bag-of-words counts (dense CSV or RecordIO-protobuf)

2. **Key Hyperparameters**:
   - `num_topics`: Number of topics
   - `feature_dim`: Vocabulary size
   - `mini_batch_size`: Total number of training documents
   - `alpha0`: Topic concentration

3. **Output**: Topic mixture (probability distribution over topics per document)

4. **Limitations**:
   - Single-instance CPU only
   - Less flexible than NTM

### When to Use LDA vs NTM

| Scenario | Recommendation |
|----------|----------------|
| Small dataset, need coherent topics | LDA |
| Large dataset, need speed | NTM |
| Need distributed training | NTM |
| Academic/research (interpretability) | LDA |

### Instance Recommendations

| Task | Instance Types |
|------|----------------|
| Training | ml.c5.xlarge, ml.m5.large (CPU only) |
| Inference | ml.m5.large, ml.c5.large |

### Next Steps

- Compare results with NTM on same data
- Experiment with different `num_topics` values
- Use topic mixtures for document similarity
- Apply to real document collections
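
For the document-similarity idea, a minimal sketch is to compare topic mixtures with cosine similarity and rank the closest documents. The helper names and toy mixtures below are illustrative assumptions; other measures, such as Hellinger distance, may suit probability vectors better.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two topic-mixture vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def most_similar(query_idx, mixtures, top_k=3):
    """Return indices of the top_k documents closest to the query document."""
    scores = [
        (cosine_similarity(mixtures[query_idx], m), i)
        for i, m in enumerate(mixtures)
        if i != query_idx
    ]
    return [i for _, i in sorted(scores, reverse=True)[:top_k]]

# Toy mixtures over three topics (illustrative values)
toy_mixtures = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
]
print(most_similar(0, toy_mixtures, top_k=2))  # [1, 3]
```

With the notebook's results, `most_similar(0, topic_mixtures.tolist())` would list the documents whose topic profile is closest to the first document.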