# SageMaker BlazingText Exercise

This notebook walks you through Amazon SageMaker's **BlazingText** algorithm for both **Word2Vec embeddings** and **text classification**.

## What You'll Learn
1. How to train Word2Vec embeddings using BlazingText
2. How to train a text classifier using BlazingText's supervised mode
3. How to prepare data in the required formats
4. How to use the trained models for inference

## What is BlazingText?
BlazingText is a highly optimized implementation of:
- **Word2Vec**: Creates semantic word embeddings (unsupervised)
- **Text Classification**: Fast text classification based on fastText (supervised)

**Key Features:**
- GPU and multi-core CPU acceleration
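- Trains on more than a billion words in a couple of minutes on a multi-core CPU or a GPU
- Subword embeddings that produce vectors for out-of-vocabulary words
- Word vectors compatible with Gensim; classification models compatible with fastText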

## Modes of Operation

| Mode | Type | Description |
|------|------|-------------|
| `cbow` | Word2Vec | Continuous Bag of Words - predicts word from context |
| `skipgram` | Word2Vec | Predicts context from word |
| `batch_skipgram` | Word2Vec | Skip-gram distributed across multiple CPU instances |
| `supervised` | Classification | Text classification (like fastText) |

## Prerequisites
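- SageMaker notebook instance or Studio, or a local environment with AWS credentials
- IAM role with S3 and SageMaker permissions
- Python packages: `sagemaker`, `boto3`, `numpy`, `pandas`, `python-dotenv`, and `scikit-learn`; `gensim` is optional (used in Step 5A)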

---

## Step 1: Setup and Imports

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
import pandas as pd
import numpy as np
import json
import os
from datetime import datetime
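from collections import Counter
import re

# Load environment variables from a .env file if python-dotenv is installed.
# It is optional: on a SageMaker notebook the region and execution role are
# normally available without it.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass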

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

# Use environment variable for role, or fall back to execution role if running in SageMaker
if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
# Configuration
BUCKET_NAME = sagemaker_session.default_bucket()
PREFIX = "blazingtext"

# Dataset parameters
NUM_SAMPLES = 10000
RANDOM_STATE = 42

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

---

# Part A: Word2Vec Embeddings

## Step 2A: Generate Training Corpus

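Word2Vec normally needs a large text corpus. To keep this exercise fast, we'll generate a small synthetic corpus whose sentences group related words by domain, which is enough to see semantic structure in the embeddings.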

In [None]:
def generate_word2vec_corpus(num_sentences=10000, seed=42):
    """
    Generate a synthetic corpus for Word2Vec training.
    
    Creates sentences with semantic relationships:
    - Technology domain
    - Sports domain
    - Food domain
    - Business domain
    """
    np.random.seed(seed)
    
    # Domain vocabularies with semantic relationships
    domains = {
        'technology': {
            'subjects': ['the computer', 'the software', 'the application', 'the system', 'the network',
                        'the database', 'the server', 'the algorithm', 'the program', 'the code'],
            'verbs': ['processes', 'stores', 'analyzes', 'computes', 'transfers',
                     'encrypts', 'optimizes', 'executes', 'compiles', 'debugs'],
            'objects': ['the data', 'the files', 'the information', 'the queries', 'the requests',
                       'the packets', 'the records', 'the transactions', 'the messages', 'the bytes'],
            'adjectives': ['fast', 'secure', 'efficient', 'scalable', 'reliable', 'modern', 'advanced']
        },
        'sports': {
            'subjects': ['the player', 'the team', 'the athlete', 'the coach', 'the champion',
                        'the runner', 'the goalkeeper', 'the striker', 'the defender', 'the referee'],
            'verbs': ['wins', 'scores', 'plays', 'trains', 'competes',
                     'defeats', 'leads', 'practices', 'runs', 'kicks'],
            'objects': ['the game', 'the match', 'the tournament', 'the championship', 'the race',
                       'the ball', 'the goal', 'the medal', 'the trophy', 'the title'],
            'adjectives': ['strong', 'fast', 'skilled', 'talented', 'professional', 'competitive', 'athletic']
        },
        'food': {
            'subjects': ['the chef', 'the restaurant', 'the cook', 'the baker', 'the kitchen',
                        'the waiter', 'the menu', 'the recipe', 'the dish', 'the meal'],
            'verbs': ['prepares', 'serves', 'cooks', 'bakes', 'tastes',
                     'seasons', 'grills', 'fries', 'roasts', 'mixes'],
            'objects': ['the food', 'the ingredients', 'the sauce', 'the vegetables', 'the meat',
                       'the dessert', 'the bread', 'the soup', 'the salad', 'the wine'],
            'adjectives': ['delicious', 'fresh', 'organic', 'healthy', 'tasty', 'gourmet', 'homemade']
        },
        'business': {
            'subjects': ['the company', 'the manager', 'the executive', 'the investor', 'the startup',
                        'the entrepreneur', 'the corporation', 'the firm', 'the director', 'the ceo'],
            'verbs': ['invests', 'manages', 'grows', 'acquires', 'launches',
                     'develops', 'markets', 'sells', 'expands', 'profits'],
            'objects': ['the business', 'the market', 'the product', 'the service', 'the brand',
                       'the revenue', 'the strategy', 'the customers', 'the profit', 'the shares'],
            'adjectives': ['successful', 'innovative', 'profitable', 'global', 'growing', 'competitive', 'strategic']
        }
    }
    
    sentences = []
    domain_names = list(domains.keys())
    
    for _ in range(num_sentences):
        # Pick a random domain
        domain = domains[np.random.choice(domain_names)]
        
        # Generate sentence with random structure
        structure = np.random.choice(['SVO', 'SVO_adj', 'adj_SVO'])
        
        subj = np.random.choice(domain['subjects'])
        verb = np.random.choice(domain['verbs'])
        obj = np.random.choice(domain['objects'])
        adj = np.random.choice(domain['adjectives'])
        
        if structure == 'SVO':
            sentence = f"{subj} {verb} {obj}"
        elif structure == 'SVO_adj':
            sentence = f"{subj} {verb} {adj} {obj.split()[-1]}"
        else:
            sentence = f"{adj} {subj.split()[-1]} {verb} {obj}"
        
        sentences.append(sentence.lower())
    
    return sentences

In [None]:
# Generate Word2Vec corpus
print("Generating Word2Vec training corpus...")
w2v_corpus = generate_word2vec_corpus(NUM_SAMPLES, RANDOM_STATE)

print(f"\nGenerated {len(w2v_corpus)} sentences")
print(f"\nSample sentences:")
for i in range(5):
    print(f"  {w2v_corpus[i]}")

In [None]:
# Analyze vocabulary
all_words = []
for sentence in w2v_corpus:
    all_words.extend(sentence.split())

word_counts = Counter(all_words)
print(f"Total words: {len(all_words)}")
print(f"Unique words: {len(word_counts)}")
print(f"\nMost common words:")
for word, count in word_counts.most_common(15):
    print(f"  {word}: {count}")

## Step 3A: Prepare Data for Word2Vec

BlazingText Word2Vec expects:
- Single text file
- One sentence per line
- Space-separated tokens
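
If you start from raw text instead of the synthetic corpus, something has to turn it into that layout first. The following is a minimal sketch, assuming a naive regex sentence splitter and lowercase word tokens; the helper name `to_blazingtext_lines` is hypothetical and not part of any SageMaker API.

```python
import re

def to_blazingtext_lines(raw_text):
    """Split raw text into lowercase, space-separated token lines (one sentence per line)."""
    # Naive split on ., !, ? followed by whitespace (an assumption; real corpora
    # usually need a proper sentence tokenizer).
    sentences = re.split(r'(?<=[.!?])\s+', raw_text.strip())
    lines = []
    for sentence in sentences:
        # Keep runs of letters, digits, and apostrophes as tokens; drop punctuation.
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        if tokens:
            lines.append(' '.join(tokens))
    return lines

sample = "The server stores the data. The chef grills the meat!"
for line in to_blazingtext_lines(sample):
    print(line)
```

Each returned string can be written to the corpus file followed by a newline, exactly as the next cell does for the synthetic sentences.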

In [None]:
# Create data directory
os.makedirs('data/word2vec', exist_ok=True)

# Save corpus to file
with open('data/word2vec/corpus.txt', 'w') as f:
    for sentence in w2v_corpus:
        f.write(sentence + '\n')

print(f"Corpus saved: data/word2vec/corpus.txt ({os.path.getsize('data/word2vec/corpus.txt') / 1024:.1f} KB)")

# Show sample of file
print("\nFile contents (first 5 lines):")
with open('data/word2vec/corpus.txt', 'r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        print(f"  {line.strip()}")

In [None]:
# Upload to S3
s3_client = boto3.client('s3')

w2v_s3_path = f"{PREFIX}/word2vec/train/corpus.txt"
s3_client.upload_file('data/word2vec/corpus.txt', BUCKET_NAME, w2v_s3_path)

w2v_train_uri = f"s3://{BUCKET_NAME}/{PREFIX}/word2vec/train"
print(f"Data uploaded to: {w2v_train_uri}")

## Step 4A: Train Word2Vec Model

### Key Word2Vec Hyperparameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `mode` | Architecture: `cbow`, `skipgram`, or `batch_skipgram` | Required |
| `vector_dim` | Dimension of word vectors | 100 |
| `window_size` | Context window size | 5 |
| `epochs` | Training passes through data | 5 |
| `learning_rate` | Step size for updates | 0.05 |
| `min_count` | Minimum word frequency | 5 |
| `negative_samples` | Negative samples per word | 5 |
| `evaluation` | Evaluate the trained vectors on the WordSimilarity-353 test | True |
| `subwords` | Learn subword embeddings | False |
| `min_char` / `max_char` | Character n-gram range for subwords | 3 / 6 |
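
To make `min_char` and `max_char` concrete, here is a rough sketch of character n-gram extraction. It follows the fastText convention of wrapping the word in `<` and `>` boundary markers; that convention and the function name `char_ngrams` are assumptions for illustration, and the sketch says nothing about how BlazingText implements subwords internally.

```python
def char_ngrams(word, min_char=3, max_char=6):
    """Return character n-grams of `word` with fastText-style boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_char, max_char + 1):
        # Slide a window of length n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams("chef", min_char=3, max_char=4))
# ['<ch', 'che', 'hef', 'ef>', '<che', 'chef', 'hef>']
```

An unseen word such as `chefs` shares several of these n-grams with `chef`, which is why a subword model can still assign it a reasonable vector.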

### CBOW vs Skip-gram

| Aspect | CBOW | Skip-gram |
|--------|------|----------|
| Predicts | Word from context | Context from word |
| Speed | Faster | Slower |
| Rare words | Less accurate | More accurate |
| Best for | Large datasets | Smaller datasets, rare words |
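
The "Predicts" row can be illustrated by listing the training examples each architecture would see. This is a simplified sketch with a fixed window and no subsampling or negative sampling; the function names are hypothetical.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (center, context) pairs: each center word predicts one neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window_size=2):
    """Return (context_words, center) examples: the neighbors jointly predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples

tokens = "the chef grills the meat".split()
print(skipgram_pairs(tokens, window_size=1)[:3])
# [('the', 'chef'), ('chef', 'the'), ('chef', 'grills')]
print(cbow_examples(tokens, window_size=1)[:2])
# [(['chef'], 'the'), (['the', 'grills'], 'chef')]
```

Skip-gram produces one example per (word, neighbor) pair, so a rare word still contributes several updates; CBOW produces one example per position with the context combined, which is part of why it tends to be faster.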

In [None]:
# Get BlazingText container image
blazingtext_image = retrieve(
    framework='blazingtext',
    region=region,
    version='1'
)

print(f"BlazingText Image URI: {blazingtext_image}")

In [None]:
# Create Word2Vec estimator
w2v_estimator = Estimator(
    image_uri=blazingtext_image,
    role=role,
    instance_count=1,
    instance_type='ml.c5.xlarge',  # CPU is sufficient for small datasets
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/word2vec/output',
    sagemaker_session=sagemaker_session,
    base_job_name='blazingtext-word2vec'
)

In [None]:
# Set Word2Vec hyperparameters
w2v_hyperparameters = {
    "mode": "skipgram",         # skipgram works better for smaller datasets
    "vector_dim": 100,          # Embedding dimension
    "window_size": 5,           # Context window
    "epochs": 10,               # Training passes
    "learning_rate": 0.05,      # Learning rate
    "min_count": 2,             # Minimum word frequency (lower for small corpus)
    "negative_samples": 5,      # Negative samples per word
    "subwords": "True",         # Learn subword embeddings for OOV words
    "min_char": 3,              # Min character n-gram
    "max_char": 6,              # Max character n-gram
    "evaluation": "False",      # Skip WordSim-353 evaluation (our vocab is synthetic)
}

w2v_estimator.set_hyperparameters(**w2v_hyperparameters)

print("Word2Vec hyperparameters:")
for k, v in w2v_hyperparameters.items():
    print(f"  {k}: {v}")

In [None]:
# Start training
print("Starting Word2Vec training job...")
print("This will take approximately 3-5 minutes.\n")

w2v_estimator.fit({'train': w2v_train_uri}, wait=True, logs=True)

In [None]:
# Get training job info
w2v_job_name = w2v_estimator.latest_training_job.name
print(f"Training job completed: {w2v_job_name}")
print(f"Model artifacts: {w2v_estimator.model_data}")

## Step 5A: Download and Explore Word Vectors

BlazingText produces:
- `vectors.txt`: Human-readable word vectors (Gensim compatible)
- `vectors.bin`: Binary vectors for deployment
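
If Gensim is not installed, `vectors.txt` can still be read by hand. The sketch below assumes the file uses the standard word2vec text layout (a `count dim` header line, then one `word v1 v2 ...` line per word), which is the layout Gensim's `load_word2vec_format` reads. The function names and the tiny in-memory example are made up for illustration.

```python
import io
import math

def load_text_vectors(handle):
    """Parse word2vec text format from an open text handle into {word: [floats]}."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dimension>"
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Tiny made-up example in the same layout (values are illustrative only).
demo = io.StringIO("3 2\nchef 1.0 0.0\ncook 0.8 0.6\nserver 0.0 1.0\n")
vectors, dim = load_text_vectors(demo)
print(dim, sorted(vectors))
print(round(cosine(vectors["chef"], vectors["cook"]), 2))
```

For the real artifact, pass an open file instead of the `StringIO` object, for example `open('models/word2vec/vectors.txt')` once the archive has been extracted.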

In [None]:
# Download model artifacts
import tarfile

model_path = w2v_estimator.model_data
model_key = model_path.replace(f's3://{BUCKET_NAME}/', '')

os.makedirs('models/word2vec', exist_ok=True)
s3_client.download_file(BUCKET_NAME, model_key, 'models/word2vec/model.tar.gz')

# Extract
with tarfile.open('models/word2vec/model.tar.gz', 'r:gz') as tar:
    tar.extractall('models/word2vec')

print("Model files:")
for f in os.listdir('models/word2vec'):
    size = os.path.getsize(f'models/word2vec/{f}') / 1024
    print(f"  {f} ({size:.1f} KB)")

In [None]:
# Load vectors using Gensim (if available)
try:
    from gensim.models import KeyedVectors
    GENSIM_AVAILABLE = True
except ImportError:
    GENSIM_AVAILABLE = False
    print("Gensim not available. Install with: pip install gensim")

if GENSIM_AVAILABLE and os.path.exists('models/word2vec/vectors.txt'):
    # Load word vectors
    word_vectors = KeyedVectors.load_word2vec_format('models/word2vec/vectors.txt', binary=False)
    
    print(f"Loaded {len(word_vectors)} word vectors")
    print(f"Vector dimension: {word_vectors.vector_size}")
    
    # Show sample vectors
    print("\nSample words in vocabulary:")
    for word in list(word_vectors.key_to_index.keys())[:10]:
        print(f"  {word}")

In [None]:
if GENSIM_AVAILABLE and os.path.exists('models/word2vec/vectors.txt'):
    # Test semantic relationships
    print("Semantic Similarity Tests:")
    print("=" * 50)
    
    # Similar words
    test_words = ['computer', 'player', 'chef', 'company']
    
    for word in test_words:
        if word in word_vectors:
            similar = word_vectors.most_similar(word, topn=5)
            print(f"\nWords similar to '{word}':")
            for sim_word, score in similar:
                print(f"  {sim_word}: {score:.4f}")
        else:
            print(f"\n'{word}' not in vocabulary")

In [None]:
if GENSIM_AVAILABLE and os.path.exists('models/word2vec/vectors.txt'):
    # Word analogies (if vocabulary supports it)
    print("\nWord Analogy Tests:")
    print("=" * 50)
    
    # Try some analogies
    analogies = [
        ('computer', 'data', 'chef'),      # computer:data :: chef:?
        ('player', 'game', 'chef'),        # player:game :: chef:?
        ('company', 'business', 'team'),   # company:business :: team:?
    ]
    
    for w1, w2, w3 in analogies:
        if all(w in word_vectors for w in [w1, w2, w3]):
            try:
                result = word_vectors.most_similar(positive=[w2, w3], negative=[w1], topn=3)
                print(f"\n{w1}:{w2} :: {w3}:?")
                for word, score in result:
                    print(f"  {word}: {score:.4f}")
            except Exception as e:
                print(f"Error with analogy: {e}")

---

# Part B: Text Classification

## Step 2B: Generate Classification Dataset

We'll create a synthetic three-class sentiment dataset (positive, negative, neutral) of labeled text samples.

In [None]:
def generate_classification_data(num_samples=5000, seed=42):
    """
    Generate synthetic text classification data.
    
    Labels:
    - 1: Positive sentiment
    - 2: Negative sentiment
    - 3: Neutral/Informational
    """
    np.random.seed(seed)
    
    # Templates for each class
    templates = {
        1: [  # Positive
            "this is an excellent {noun} that {verb} perfectly",
            "i absolutely love this {noun} it works great",
            "amazing {noun} highly recommend to everyone",
            "best {noun} i have ever used truly outstanding",
            "fantastic {noun} exceeded all my expectations",
            "wonderful experience with this {noun} very happy",
            "great quality {noun} worth every penny spent",
            "superb {noun} will definitely buy again soon",
            "perfect {noun} exactly what i was looking for",
            "brilliant {noun} makes everything so much easier",
        ],
        2: [  # Negative
            "terrible {noun} completely waste of money",
            "this {noun} is awful do not buy it",
            "worst {noun} i have ever purchased very disappointed",
            "horrible experience with this {noun} avoid at all costs",
            "poor quality {noun} broke after one week",
            "disappointing {noun} does not work as advertised",
            "awful {noun} returning it immediately for refund",
            "bad {noun} very unhappy with this purchase",
            "useless {noun} complete waste of time and money",
            "defective {noun} stopped working after few days",
        ],
        3: [  # Neutral
            "the {noun} arrived yesterday and seems okay",
            "received the {noun} it is as described",
            "the {noun} has standard features nothing special",
            "average {noun} meets basic requirements only",
            "the {noun} works but nothing impressive",
            "got the {noun} it does what it should",
            "standard {noun} no complaints no praises",
            "the {noun} is acceptable for the price",
            "ordinary {noun} serves its purpose adequately",
            "the {noun} functions as expected nothing more",
        ]
    }
    
    nouns = ['product', 'service', 'item', 'device', 'software', 'application',
             'tool', 'equipment', 'gadget', 'purchase', 'order', 'delivery']
    
    data = []
    labels = [1, 2, 3]
    
    for _ in range(num_samples):
        label = np.random.choice(labels, p=[0.4, 0.3, 0.3])  # Slight positive bias
        template = np.random.choice(templates[label])
        noun = np.random.choice(nouns)
        
        text = template.format(noun=noun, verb='works')
        data.append((label, text))
    
    return data

In [None]:
# Generate classification data
print("Generating text classification data...")
classification_data = generate_classification_data(NUM_SAMPLES, RANDOM_STATE)

# Split into train and validation
np.random.seed(RANDOM_STATE)
np.random.shuffle(classification_data)

val_size = int(len(classification_data) * 0.1)
val_data = classification_data[:val_size]
train_data = classification_data[val_size:]

print(f"\nTraining samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")

# Show class distribution
train_labels = [d[0] for d in train_data]
label_counts = Counter(train_labels)
print(f"\nClass distribution (training):")
label_names = {1: 'Positive', 2: 'Negative', 3: 'Neutral'}
for label, count in sorted(label_counts.items()):
    print(f"  {label_names[label]}: {count} ({100*count/len(train_data):.1f}%)")

# Show samples
print(f"\nSample data:")
for label, text in train_data[:6]:
    print(f"  [{label_names[label]}] {text}")

## Step 3B: Prepare Data for Text Classification

BlazingText supervised mode expects:
- One sample per line
- Label prefixed with `__label__`
- Space-separated tokens

Format: `__label__<label> token1 token2 token3 ...`
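
Before uploading, it can help to check that every line really follows this format. A minimal sketch of such a check, with a hypothetical helper name `parse_labeled_line`:

```python
def parse_labeled_line(line, prefix="__label__"):
    """Split one supervised-format line into (labels, tokens)."""
    parts = line.split()
    labels = [p[len(prefix):] for p in parts if p.startswith(prefix)]
    tokens = [p for p in parts if not p.startswith(prefix)]
    if not labels:
        raise ValueError(f"no {prefix} found in line: {line!r}")
    if not tokens:
        raise ValueError(f"no text tokens in line: {line!r}")
    return labels, tokens

labels, tokens = parse_labeled_line("__label__1 amazing product highly recommend to everyone")
print(labels, tokens[:2])
# ['1'] ['amazing', 'product']
```

Running this over each line of `train.txt` and `validation.txt` surfaces missing labels or empty samples locally, before a training job is started.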

In [None]:
# Create data directory
os.makedirs('data/classification', exist_ok=True)

def save_blazingtext_classification(data, filepath):
    """
    Save data in BlazingText classification format.
    Format: __label__<label> token1 token2 ...
    """
    with open(filepath, 'w') as f:
        for label, text in data:
            # Normalize text
            text = text.lower().strip()
            # Write in BlazingText format
            f.write(f"__label__{label} {text}\n")

# Save training and validation data
save_blazingtext_classification(train_data, 'data/classification/train.txt')
save_blazingtext_classification(val_data, 'data/classification/validation.txt')

print("Data saved:")
print(f"  - data/classification/train.txt ({os.path.getsize('data/classification/train.txt') / 1024:.1f} KB)")
print(f"  - data/classification/validation.txt ({os.path.getsize('data/classification/validation.txt') / 1024:.1f} KB)")

# Show sample of file
print("\nFile contents (first 5 lines):")
with open('data/classification/train.txt', 'r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        print(f"  {line.strip()}")

In [None]:
# Upload to S3
clf_train_path = f"{PREFIX}/classification/train/train.txt"
clf_val_path = f"{PREFIX}/classification/validation/validation.txt"

s3_client.upload_file('data/classification/train.txt', BUCKET_NAME, clf_train_path)
s3_client.upload_file('data/classification/validation.txt', BUCKET_NAME, clf_val_path)

clf_train_uri = f"s3://{BUCKET_NAME}/{PREFIX}/classification/train"
clf_val_uri = f"s3://{BUCKET_NAME}/{PREFIX}/classification/validation"

print("Data uploaded to S3:")
print(f"  Train: {clf_train_uri}")
print(f"  Validation: {clf_val_uri}")

## Step 4B: Train Text Classification Model

### Key Classification Hyperparameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `mode` | Must be `supervised` | Required |
| `vector_dim` | Embedding dimension | 100 |
| `epochs` | Training passes | 5 |
| `learning_rate` | Step size | 0.05 |
| `min_count` | Minimum word frequency | 5 |
| `word_ngrams` | N-gram features (1=unigrams, 2=bigrams) | 2 |
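| `early_stopping` | Stop when validation accuracy plateaus (requires a `validation` channel) | False |
| `min_epochs` | Minimum epochs to train before early stopping can trigger | 5 |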
| `patience` | Epochs to wait before early stopping | 4 |
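
To show what `word_ngrams` adds, the sketch below lists the features formed from one sample. It only shows which n-grams exist; fastText-style models hash them into buckets, and how BlazingText stores them is not covered here. The function name is hypothetical.

```python
def word_ngram_features(tokens, word_ngrams=2):
    """Return unigrams plus contiguous word n-grams up to length `word_ngrams`."""
    features = list(tokens)
    for n in range(2, word_ngrams + 1):
        for i in range(len(tokens) - n + 1):
            features.append(' '.join(tokens[i:i + n]))
    return features

print(word_ngram_features("do not buy it".split(), word_ngrams=2))
# ['do', 'not', 'buy', 'it', 'do not', 'not buy', 'buy it']
```

With unigrams alone, `not` and `buy` are independent features; the bigram `not buy` lets the classifier pick up on local word order.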

In [None]:
# Create classification estimator
clf_estimator = Estimator(
    image_uri=blazingtext_image,
    role=role,
    instance_count=1,
    instance_type='ml.c5.xlarge',
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/classification/output',
    sagemaker_session=sagemaker_session,
    base_job_name='blazingtext-classification'
)

In [None]:
# Set classification hyperparameters
clf_hyperparameters = {
    "mode": "supervised",
    "vector_dim": 100,
    "epochs": 15,
    "learning_rate": 0.05,
    "min_count": 2,
    "word_ngrams": 2,          # Use bigrams for better accuracy
    "early_stopping": "True",
    "patience": 4,
    "min_epochs": 5,
}

clf_estimator.set_hyperparameters(**clf_hyperparameters)

print("Classification hyperparameters:")
for k, v in clf_hyperparameters.items():
    print(f"  {k}: {v}")

In [None]:
# Start training with validation channel for early stopping
print("Starting text classification training job...")
print("This will take approximately 3-5 minutes.\n")

clf_estimator.fit(
    {
        'train': clf_train_uri,
        'validation': clf_val_uri
    },
    wait=True,
    logs=True
)

In [None]:
# Get training job info
clf_job_name = clf_estimator.latest_training_job.name
print(f"Training job completed: {clf_job_name}")
print(f"Model artifacts: {clf_estimator.model_data}")

## Step 5B: Deploy and Test Classification Model
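
The cells in this step send a JSON body with an `instances` list and a `configuration` holding `k`, and read back one result per instance with parallel `label` and `prob` lists. The sketch below exercises that shape offline against a made-up response, so it runs without an endpoint; the helper names are hypothetical.

```python
import json

def build_request(texts, k=1):
    """Build the JSON body: normalized instances plus how many labels to return."""
    return json.dumps({
        "instances": [t.lower().strip() for t in texts],
        "configuration": {"k": k},
    })

def top_prediction(result, prefix="__label__"):
    """Return (label, probability) for the highest-ranked entry of one result."""
    return result["label"][0].replace(prefix, ""), result["prob"][0]

# Mock response in the shape this notebook's cells rely on (values are made up).
mock_response = [{"label": ["__label__1", "__label__3"], "prob": [0.91, 0.07]}]

print(build_request(["  Great Product "], k=2))
print(top_prediction(mock_response[0]))
# ('1', 0.91)
```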

In [None]:
# Deploy the model
print("Deploying classification model...")
print("This will take approximately 5-7 minutes.\n")

clf_predictor = clf_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name=f'blazingtext-clf-{datetime.now().strftime("%Y%m%d%H%M")}'
)

print(f"\nEndpoint deployed: {clf_predictor.endpoint_name}")

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Configure predictor
clf_predictor.serializer = JSONSerializer()
clf_predictor.deserializer = JSONDeserializer()

def classify_text(texts, k=1):
    """
    Classify text using the deployed model.
    
    Args:
        texts: List of text strings
        k: Number of top predictions to return
    """
    # Normalize texts
    texts = [t.lower().strip() for t in texts]
    
    payload = {
        "instances": texts,
        "configuration": {"k": k}
    }
    
    response = clf_predictor.predict(payload)
    return response

In [None]:
# Test classification
test_texts = [
    "this product is absolutely amazing i love it",
    "terrible quality waste of money do not buy",
    "the item arrived and it works as expected",
    "best purchase ever highly recommended to everyone",
    "awful experience the product broke immediately",
    "average product nothing special about it",
]

label_names = {1: 'Positive', 2: 'Negative', 3: 'Neutral'}

print("Classification Results:")
print("=" * 70)

results = classify_text(test_texts, k=3)

for text, result in zip(test_texts, results):
    labels = result['label']
    probs = result['prob']
    
    # Get top prediction
    top_label = int(labels[0].replace('__label__', ''))
    top_prob = probs[0]
    
    print(f"\nText: {text}")
    print(f"Prediction: {label_names[top_label]} ({top_prob:.2%})")
    
    # Show all probabilities
    print("  All scores:")
    for lbl, prob in zip(labels, probs):
        lbl_num = int(lbl.replace('__label__', ''))
        print(f"    {label_names[lbl_num]}: {prob:.2%}")

## Step 6B: Evaluate Classification Performance
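
The evaluation in this step uses scikit-learn. As a rough cross-check, or if scikit-learn is unavailable, accuracy and per-class precision and recall can be computed by hand. This is a sketch on made-up toy labels, not a replacement for the full report.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) if y_true else 0.0

def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single class."""
    counts = confusion_counts(y_true, y_pred)
    tp = counts[(label, label)]
    predicted = sum(c for (t, p), c in counts.items() if p == label)
    actual = sum(c for (t, p), c in counts.items() if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Toy labels (made up) using the notebook's 1/2/3 encoding.
y_true = [1, 1, 2, 3, 3, 2]
y_pred = [1, 2, 2, 3, 1, 2]
print(round(accuracy(y_true, y_pred), 3))
print(precision_recall(y_true, y_pred, label=1))
# (0.5, 0.5)
```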

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Evaluate on validation set
val_texts = [d[1] for d in val_data]
val_labels = [d[0] for d in val_data]

# Get predictions in batches
batch_size = 100
all_predictions = []

print("Evaluating on validation set...")
for i in range(0, len(val_texts), batch_size):
    batch = val_texts[i:i+batch_size]
    results = classify_text(batch, k=1)
    
    for result in results:
        pred_label = int(result['label'][0].replace('__label__', ''))
        all_predictions.append(pred_label)

# Calculate metrics
accuracy = accuracy_score(val_labels, all_predictions)

print("\n" + "=" * 60)
print("CLASSIFICATION RESULTS")
print("=" * 60)
print(f"\nAccuracy: {accuracy:.4f}")

print("\nClassification Report:")
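print(classification_report(
    val_labels, 
    all_predictions, 
    labels=[1, 2, 3],  # fix the label order so it matches target_names
    target_names=['Positive', 'Negative', 'Neutral']
))

print("Confusion Matrix:")
# Rows are true labels, columns are predicted labels, both in 1, 2, 3 order
cm = confusion_matrix(val_labels, all_predictions, labels=[1, 2, 3])
print(cm)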

## Step 7: Clean Up Resources

In [None]:
# Delete the classification endpoint
print(f"Deleting endpoint: {clf_predictor.endpoint_name}")
clf_predictor.delete_endpoint()
print("Endpoint deleted successfully!")

In [None]:
# Optionally clean up S3 data
# Uncomment to delete:

# s3 = boto3.resource('s3')
# bucket = s3.Bucket(BUCKET_NAME)
# bucket.objects.filter(Prefix=PREFIX).delete()
# print(f"Deleted all objects under s3://{BUCKET_NAME}/{PREFIX}")

---

## Summary

In this exercise, you learned:

### Part A: Word2Vec

1. **Data Format**: Plain text file, one sentence per line, space-separated tokens

2. **Modes**:
   - `cbow`: Fast, good for large datasets
   - `skipgram`: Better for rare words, smaller datasets
   - `batch_skipgram`: Distributed training across multiple CPU instances

3. **Key Hyperparameters**:
   - `vector_dim`: Embedding dimension (50-300 typical)
   - `window_size`: Context window (5-10 typical)
   - `subwords`: Enable for OOV word handling

4. **Output**: `vectors.txt` (Gensim compatible), `vectors.bin` (deployment)

### Part B: Text Classification

1. **Data Format**: `__label__<label> token1 token2 ...` (one per line)

2. **Mode**: `supervised`

3. **Key Hyperparameters**:
   - `word_ngrams`: N-gram features (2 for bigrams)
   - `early_stopping`: Prevent overfitting
   - `vector_dim`: Embedding dimension

4. **Inference**: Returns top-k labels with probabilities

### Use Cases

| Use Case | Mode | Description |
|----------|------|-------------|
| Word Embeddings | skipgram/cbow | Pre-train embeddings for NLP tasks |
| Sentiment Analysis | supervised | Classify text sentiment |
| Topic Classification | supervised | Categorize documents by topic |
| Intent Detection | supervised | Classify user intents in chatbots |
| Spam Detection | supervised | Binary classification of spam |

### Instance Recommendations

| Task | Data Size | Recommended Instance |
|------|-----------|---------------------|
| Word2Vec | Small | ml.c5.xlarge (CPU) |
| Word2Vec | Large | ml.p3.2xlarge (GPU) |
| Classification | < 2GB | ml.c5.xlarge (CPU) |
| Classification | > 2GB | ml.p3.2xlarge (GPU) |

## Next Steps

- Try different Word2Vec architectures (cbow vs skipgram)
- Experiment with subword embeddings for rare words
- Use pre-trained embeddings for downstream tasks
- Try multi-label classification