# SageMaker Object2Vec Exercise

This notebook demonstrates Amazon SageMaker's **Object2Vec** algorithm for learning embeddings of pairs of objects.

## What You'll Learn
1. How to prepare paired data for Object2Vec
2. How to train embeddings for relationship learning
3. How to use embeddings for similarity search and classification

## What is Object2Vec?

Object2Vec is a general-purpose neural embedding algorithm that learns low-dimensional dense embeddings of high-dimensional objects. It generalizes Word2Vec to arbitrary object pairs.

**Key Features:**
- Learns embeddings from paired objects (e.g., user-item, sentence-sentence)
- Supports discrete tokens and sequences as inputs
- Can handle asymmetric pairs (e.g., query-document)
- Embeddings can be used for downstream tasks

## Use Cases

| Use Case | Pair Type | Example |
|----------|-----------|----------|
| Recommendation | (user, item) | Predict user ratings |
| Sentence similarity | (sentence, sentence) | Semantic similarity |
| Document classification | (document, label) | Multi-class classification |
| Entity resolution | (entity1, entity2) | Match duplicate records |

## Architecture

```
Input Pair: (obj1, obj2)
       ↓         ↓
   Encoder0   Encoder1
       ↓         ↓
   embed0     embed1
       ↘       ↙
       Comparator
           ↓
       Output (label/score)
```
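
The diagram can be made concrete with a toy forward pass. The sketch below is not the SageMaker implementation: the lookup tables are hypothetical stand-ins for the two encoders, the embedding size is arbitrary, and only the comparator step (Hadamard product, concatenation, absolute difference) is shown. In the real algorithm the resulting feature vector would feed further layers that produce the label or score.

```python
import random

random.seed(0)
EMBED_DIM = 4  # toy size chosen for illustration only

def make_table(vocab_size, dim):
    """Hypothetical stand-in for an encoder: one random vector per token ID."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

def compare(e0, e1):
    """Build comparator features: Hadamard product, concatenation, absolute difference."""
    hadamard = [a * b for a, b in zip(e0, e1)]
    concat = e0 + e1
    abs_diff = [abs(a - b) for a, b in zip(e0, e1)]
    return hadamard + concat + abs_diff

enc0_table = make_table(vocab_size=10, dim=EMBED_DIM)  # e.g. users
enc1_table = make_table(vocab_size=20, dim=EMBED_DIM)  # e.g. items

# Compare the embedding of object 3 (encoder 0) with object 7 (encoder 1)
features = compare(enc0_table[3], enc1_table[7])
print(len(features))  # 4 + 8 + 4 = 16
```

Because each encoder has its own table, the two sides of a pair can come from different vocabularies, which is what makes asymmetric pairs such as (user, item) possible.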

---

## Step 1: Setup and Imports

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
import pandas as pd
import numpy as np
import json
import os
from datetime import datetime
from dotenv import load_dotenv
from collections import defaultdict

# Load environment variables from .env file
load_dotenv()

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
# Configuration
BUCKET_NAME = sagemaker_session.default_bucket()
PREFIX = "object2vec"

# Dataset parameters
NUM_USERS = 500
NUM_ITEMS = 200
NUM_INTERACTIONS = 10000
RANDOM_STATE = 42

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

## Step 2: Generate Synthetic Data

We'll create a synthetic user-item interaction dataset for a recommendation system scenario.

In [None]:
def generate_user_item_data(num_users=500, num_items=200, num_interactions=10000, seed=42):
    """
    Generate synthetic user-item interaction data.
    
    Creates users with preferences for certain item categories,
    simulating realistic interaction patterns.
    """
    np.random.seed(seed)
    
    # Define item categories (items belong to categories)
    num_categories = 5
    item_categories = np.random.randint(0, num_categories, num_items)
    
    # Define user preferences (users prefer certain categories)
    user_preferences = np.random.dirichlet(np.ones(num_categories), num_users)
    
    interactions = []
    
    for _ in range(num_interactions):
        # Sample a user
        user_id = np.random.randint(0, num_users)
        
        # Sample a category based on user preference
        preferred_category = np.random.choice(num_categories, p=user_preferences[user_id])
        
        # Split items into the sampled category and everything else
        category_items = np.where(item_categories == preferred_category)[0]
        other_items = np.where(item_categories != preferred_category)[0]
        
        # Draw positives and negatives with equal probability so that the
        # label actually depends on the user-item pair
        if len(category_items) > 0 and (len(other_items) == 0 or np.random.random() < 0.5):
            # Positive interaction (item from the preferred category)
            item_id = np.random.choice(category_items)
            label = 1
        else:
            # Negative interaction (item from outside the preferred category)
            item_id = np.random.choice(other_items)
            label = 0
        
        # Add some noise
        if np.random.random() < 0.1:
            label = 1 - label  # Flip label with 10% probability
        
        interactions.append({
            'user_id': user_id,
            'item_id': item_id,
            'label': label
        })
    
    return pd.DataFrame(interactions), item_categories, user_preferences

In [None]:
# Generate data
print("Generating user-item interaction data...")
df, item_categories, user_preferences = generate_user_item_data(
    NUM_USERS, NUM_ITEMS, NUM_INTERACTIONS, RANDOM_STATE
)

print(f"\nDataset shape: {df.shape}")
print(f"\nLabel distribution:")
print(df['label'].value_counts())

print(f"\nSample interactions:")
print(df.head(10))

In [None]:
# Split into train, validation, and test
np.random.seed(RANDOM_STATE)
indices = np.random.permutation(len(df))

train_size = int(0.8 * len(df))
val_size = int(0.1 * len(df))

train_idx = indices[:train_size]
val_idx = indices[train_size:train_size + val_size]
test_idx = indices[train_size + val_size:]

train_df = df.iloc[train_idx]
val_df = df.iloc[val_idx]
test_df = df.iloc[test_idx]

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Test samples: {len(test_df)}")

## Step 3: Prepare Data for Object2Vec

Object2Vec expects data in JSON Lines format, one JSON object per line, with the following structure:

```json
{"in0": [token_ids], "in1": [token_ids], "label": label_value}
```

**Input Types:**
- **Discrete token**: `[single_id]` - e.g., `[42]` for user_id 42
- **Sequence**: `[id1, id2, id3, ...]` - e.g., word IDs in a sentence
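
As a minimal sketch of the two input types, the hypothetical helper `make_record` below builds one record of each kind and rejects inputs that are not lists of non-negative integers. The helper name and its validation rules are assumptions made for illustration; they are not part of the SageMaker SDK.

```python
import json

def make_record(in0, in1, label):
    """Build one JSON Lines record; in0 and in1 must be non-empty lists of non-negative int IDs."""
    for name, ids in (("in0", in0), ("in1", in1)):
        if not ids or any(not isinstance(i, int) or i < 0 for i in ids):
            raise ValueError(f"{name} must be a non-empty list of non-negative ints")
    return json.dumps({"in0": in0, "in1": in1, "label": label})

discrete = make_record([42], [7], 1)           # discrete tokens: user 42, item 7
sequence = make_record([5, 12, 9], [5, 3], 0)  # sequences: two tokenized sentences
print(discrete)
print(sequence)
```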

In [None]:
def convert_to_object2vec_format(df):
    """
    Convert DataFrame to Object2Vec JSON Lines format.
    
    Format: {"in0": [user_id], "in1": [item_id], "label": label}
    """
    records = []
    # Iterate over the columns directly; this avoids the per-row overhead of iterrows()
    for user_id, item_id, label in zip(df['user_id'], df['item_id'], df['label']):
        record = {
            "in0": [int(user_id)],  # User as discrete token
            "in1": [int(item_id)],  # Item as discrete token
            "label": int(label)
        }
        records.append(json.dumps(record))
    return records

# Convert datasets
train_records = convert_to_object2vec_format(train_df)
val_records = convert_to_object2vec_format(val_df)
test_records = convert_to_object2vec_format(test_df)

print("Sample training records:")
for record in train_records[:5]:
    print(f"  {record}")

In [None]:
# Save to local files
os.makedirs('data/object2vec', exist_ok=True)

with open('data/object2vec/train.jsonl', 'w') as f:
    f.write('\n'.join(train_records))

with open('data/object2vec/validation.jsonl', 'w') as f:
    f.write('\n'.join(val_records))

with open('data/object2vec/test.jsonl', 'w') as f:
    f.write('\n'.join(test_records))

print("Data files created:")
for f in ['train.jsonl', 'validation.jsonl', 'test.jsonl']:
    size = os.path.getsize(f'data/object2vec/{f}') / 1024
    print(f"  data/object2vec/{f} ({size:.1f} KB)")

In [None]:
# Upload to S3
s3_client = boto3.client('s3')

for split in ['train', 'validation', 'test']:
    s3_key = f"{PREFIX}/{split}/{split}.jsonl"
    s3_client.upload_file(f'data/object2vec/{split}.jsonl', BUCKET_NAME, s3_key)
    print(f"Uploaded: s3://{BUCKET_NAME}/{s3_key}")

train_uri = f"s3://{BUCKET_NAME}/{PREFIX}/train"
val_uri = f"s3://{BUCKET_NAME}/{PREFIX}/validation"
test_uri = f"s3://{BUCKET_NAME}/{PREFIX}/test"

## Step 4: Train Object2Vec Model

### Key Hyperparameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `enc0_max_seq_len` | Max sequence length for encoder 0 | Required |
| `enc1_max_seq_len` | Max sequence length for encoder 1 | Conditional |
| `enc0_vocab_size` | Vocabulary size for encoder 0 | Required |
| `enc1_vocab_size` | Vocabulary size for encoder 1 | Conditional |
| `enc_dim` | Dimension of the encoder output (the embedding) | 4096 |
| `output_layer` | Output layer: `softmax` (classification) or `mean_squared_error` (regression) | softmax |
| `epochs` | Training epochs | 30 |
| `learning_rate` | Learning rate | 0.0004 |
| `mini_batch_size` | Batch size | 32 |
| `mlp_layers` | Number of MLP layers after the comparator | 2 |
| `mlp_dim` | Output dimension of the MLP layers | 512 |
| `mlp_activation` | MLP activation function (`tanh`, `relu`, or `linear`) | linear |
| `enc0_token_embedding_dim`, `enc1_token_embedding_dim` | Token embedding dimension for each encoder | 300 |
| `comparator_list` | Comparison operations | hadamard, concat, abs_diff |
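
Token IDs index into an embedding table, so each vocabulary size has to cover the largest ID that appears in the data. The sketch below derives the smallest workable sizes from JSON Lines records, assuming zero-based integer IDs; `required_vocab_sizes` is a hypothetical helper, not an SDK function.

```python
import json

def required_vocab_sizes(jsonl_lines):
    """Return the smallest (enc0, enc1) vocabulary sizes that cover every ID seen."""
    max0 = max1 = -1
    for line in jsonl_lines:
        record = json.loads(line)
        max0 = max(max0, max(record["in0"], default=-1))
        max1 = max(max1, max(record["in1"], default=-1))
    # IDs are assumed to start at 0, so the size is the largest ID plus one
    return max0 + 1, max1 + 1

sample = [
    '{"in0": [3], "in1": [17], "label": 1}',
    '{"in0": [499], "in1": [2], "label": 0}',
]
enc0_vocab, enc1_vocab = required_vocab_sizes(sample)
print(enc0_vocab, enc1_vocab)  # 500 18
```

Running a check like this over the training, validation, and test files before launching a job is a cheap way to catch a vocabulary size that is too small.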

### Encoder Types

| Encoder | Use Case |
|---------|----------|
| `pooled_embedding` | Simple pooling of token embeddings |
| `hcnn` | Hierarchical CNN for sequences |
| `bilstm` | Bidirectional LSTM for sequences |
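
To illustrate what pooling token embeddings means, the sketch below averages per-token vectors into a single fixed-size vector. It assumes the pooling is a plain mean and uses made-up 3-dimensional vectors; the details inside the SageMaker container may differ.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one fixed-size vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

# Hypothetical 3-dimensional embeddings for a 2-token sequence
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(mean_pool(tokens))      # [2.0, 3.0, 4.0]
print(mean_pool(tokens[:1]))  # [1.0, 2.0, 3.0]
```

With a single token, as in the user-item data in this notebook, pooling leaves the token embedding unchanged, which is why `pooled_embedding` suits discrete inputs.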

In [None]:
# Get Object2Vec container image
object2vec_image = retrieve(
    framework='object2vec',
    region=region,
    version='1'
)

print(f"Object2Vec Image URI: {object2vec_image}")

In [None]:
# Create Object2Vec estimator
object2vec_estimator = Estimator(
    image_uri=object2vec_image,
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',  # CPU instance for small datasets
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
    sagemaker_session=sagemaker_session,
    base_job_name='object2vec'
)

In [None]:
# Set hyperparameters
hyperparameters = {
    # Input configuration
    "enc0_max_seq_len": 1,              # Single token (user_id)
    "enc1_max_seq_len": 1,              # Single token (item_id)
    "enc0_vocab_size": NUM_USERS,       # Number of unique users
    "enc1_vocab_size": NUM_ITEMS,       # Number of unique items
    
    # Encoder configuration
    "enc0_network": "pooled_embedding", # Simple embedding for discrete tokens
    "enc1_network": "pooled_embedding",
    "enc0_token_embedding_dim": 64,
    "enc1_token_embedding_dim": 64,
    
    # Output configuration
    "output_layer": "softmax",          # Binary classification
    "num_classes": 2,
    
    # Comparator configuration
    "comparator_list": "hadamard,concat,abs_diff",
    "mlp_layers": 2,                    # Number of MLP layers, not their width
    "mlp_activation": "relu",
    "mlp_dim": 256,                     # Output dimension of the MLP layers
    
    # Training configuration
    "epochs": 20,
    "learning_rate": 0.001,
    "mini_batch_size": 64,
    "early_stopping_patience": 3,
    "early_stopping_tolerance": 0.001,
}

object2vec_estimator.set_hyperparameters(**hyperparameters)

print("Object2Vec hyperparameters:")
for k, v in hyperparameters.items():
    print(f"  {k}: {v}")

In [None]:
# Start training
print("Starting Object2Vec training job...")
print("This will take approximately 5-10 minutes.\n")

object2vec_estimator.fit(
    {
        'train': train_uri,
        'validation': val_uri,
        'test': test_uri
    },
    wait=True,
    logs=True
)

In [None]:
# Get training job info
job_name = object2vec_estimator.latest_training_job.name
print(f"Training job completed: {job_name}")
print(f"Model artifacts: {object2vec_estimator.model_data}")

## Step 5: Deploy and Test Model

In [None]:
# Deploy the model
print("Deploying Object2Vec model...")
print("This will take approximately 5-7 minutes.\n")

predictor = object2vec_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name=f'object2vec-{datetime.now().strftime("%Y%m%d%H%M")}'
)

print(f"\nEndpoint deployed: {predictor.endpoint_name}")

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Configure predictor
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

def predict_interaction(user_ids, item_ids):
    """
    Predict whether users will like items.
    
    Args:
        user_ids: List of user IDs
        item_ids: List of item IDs
    
    Returns:
        Predictions with scores
    """
    instances = []
    for user_id, item_id in zip(user_ids, item_ids):
        instances.append({
            "in0": [int(user_id)],
            "in1": [int(item_id)]
        })
    
    payload = {"instances": instances}
    response = predictor.predict(payload)
    return response

In [None]:
# Test predictions
print("Testing predictions on sample user-item pairs:")
print("=" * 60)

# Get some test samples
test_samples = test_df.head(10)

user_ids = test_samples['user_id'].tolist()
item_ids = test_samples['item_id'].tolist()
true_labels = test_samples['label'].tolist()

predictions = predict_interaction(user_ids, item_ids)

print(f"{'User':<8} {'Item':<8} {'True':<8} {'Pred':<8} {'Score':<10}")
print("-" * 50)

for i, pred in enumerate(predictions['predictions']):
    scores = pred['scores']
    pred_label = 1 if scores[1] > scores[0] else 0
    confidence = max(scores)
    
    status = "correct" if pred_label == true_labels[i] else "WRONG"
    print(f"{user_ids[i]:<8} {item_ids[i]:<8} {true_labels[i]:<8} {pred_label:<8} {confidence:.4f} {status}")

## Step 6: Extract Embeddings

Object2Vec can also return embeddings for individual objects, which is useful for similarity search.
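
Once embeddings are available, similarity search reduces to ranking vectors by cosine similarity. The sketch below uses a small hand-written dictionary of hypothetical item embeddings; in practice the vectors would come from the endpoint, and the helper names are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query_id, embeddings, top_k=2):
    """Rank all other IDs by cosine similarity to the query embedding."""
    query = embeddings[query_id]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in embeddings.items() if other != query_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical item embeddings; real ones would be returned by the endpoint
item_embeddings = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [-1.0, 0.0],
}
print(most_similar(0, item_embeddings))  # item 1 ranks first, then item 2
```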

In [None]:
# Note: The deployed endpoint returns an encoder embedding when a request
# contains only one input (in0 or in1). This cell summarizes the options.

print("Embedding Extraction Options:")
print("="*60)
print("""
To extract embeddings instead of predictions:

1. Provide only one input (in0 OR in1):
   - If only in0 is provided, returns enc0 embedding
   - If only in1 is provided, returns enc1 embedding

2. Optionally set the INFERENCE_PREFERRED_MODE environment variable on the
   model. It is a GPU memory optimization hint, not an output switch:
   - 'embedding' - Optimizes for encoder embedding requests
   - 'classification' - Optimizes for classification/regression requests (default)

Example embedding request:
{"instances": [{"in0": [user_id]}]}  # Get user embedding
{"instances": [{"in1": [item_id]}]}  # Get item embedding

Embeddings can be used for:
- Finding similar users/items via cosine similarity
- Visualizing user/item clusters with t-SNE/UMAP
- Downstream ML tasks as features
""")

## Step 7: Evaluate Model Performance

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Evaluate on full test set
print("Evaluating on test set...")

all_predictions = []
all_scores = []
batch_size = 100

test_users = test_df['user_id'].tolist()
test_items = test_df['item_id'].tolist()
test_labels = test_df['label'].tolist()

for i in range(0, len(test_df), batch_size):
    batch_users = test_users[i:i+batch_size]
    batch_items = test_items[i:i+batch_size]
    
    preds = predict_interaction(batch_users, batch_items)
    
    for pred in preds['predictions']:
        scores = pred['scores']
        pred_label = 1 if scores[1] > scores[0] else 0
        all_predictions.append(pred_label)
        all_scores.append(scores[1])  # Probability of class 1

# Calculate metrics
accuracy = accuracy_score(test_labels, all_predictions)
precision = precision_score(test_labels, all_predictions)
recall = recall_score(test_labels, all_predictions)
f1 = f1_score(test_labels, all_predictions)
auc = roc_auc_score(test_labels, all_scores)

print("\n" + "="*60)
print("MODEL EVALUATION RESULTS")
print("="*60)
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"AUC-ROC:   {auc:.4f}")

## Step 8: Clean Up Resources

In [None]:
# Delete the endpoint
print(f"Deleting endpoint: {predictor.endpoint_name}")
predictor.delete_endpoint()
print("Endpoint deleted successfully!")

In [None]:
# Optionally clean up S3 data
# Uncomment to delete:

# s3 = boto3.resource('s3')
# bucket = s3.Bucket(BUCKET_NAME)
# bucket.objects.filter(Prefix=PREFIX).delete()
# print(f"Deleted all objects under s3://{BUCKET_NAME}/{PREFIX}")

---

## Summary

In this exercise, you learned:

1. **Data Format**: JSON Lines with `{"in0": [...], "in1": [...], "label": ...}`

2. **Input Types**:
   - Discrete tokens: `[single_id]`
   - Sequences: `[id1, id2, ...]`

3. **Encoder Options**:
   - `pooled_embedding`: For discrete tokens
   - `hcnn`: For sequences (CNN-based)
   - `bilstm`: For sequences (RNN-based)

4. **Output Modes**:
   - Classification: `output_layer=softmax`
   - Regression: `output_layer=mean_squared_error`
   - Embeddings: Single input returns encoder embedding

5. **Comparator Operations**: `hadamard`, `concat`, `abs_diff`
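
For the classification output mode, the per-class scores in a response can be turned into labels with an argmax. The sketch below assumes the response shape used earlier in this notebook (`predictions`, each with a `scores` list) and runs on a hand-written response rather than a live endpoint; `scores_to_labels` is a hypothetical helper.

```python
def scores_to_labels(response):
    """Return (predicted_label, confidence) for each prediction in the response."""
    results = []
    for pred in response["predictions"]:
        scores = pred["scores"]
        # Argmax over class scores; ties resolve to the lowest class index
        label = max(range(len(scores)), key=scores.__getitem__)
        results.append((label, scores[label]))
    return results

# Hypothetical response for two user-item pairs
fake_response = {"predictions": [{"scores": [0.2, 0.8]}, {"scores": [0.7, 0.3]}]}
print(scores_to_labels(fake_response))  # [(1, 0.8), (0, 0.7)]
```

Using an argmax instead of comparing `scores[1]` with `scores[0]` keeps the same behavior for two classes and extends unchanged to `num_classes` greater than 2.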

### Use Cases Recap

| Application | in0 | in1 | Label |
|-------------|-----|-----|-------|
| Recommendation | User ID | Item ID | Rating/Click |
| Sentence Similarity | Sentence 1 tokens | Sentence 2 tokens | Similarity score |
| Document Classification | Document tokens | Category ID | Match |
| Entity Resolution | Entity 1 | Entity 2 | Same/Different |

### Instance Recommendations

| Task | Data Size | Instance |
|------|-----------|----------|
| Training (small) | < 1M pairs | ml.m5.xlarge |
| Training (large) | > 1M pairs | ml.p2.xlarge or ml.p3.2xlarge |
| Inference | Any | ml.m5.large or ml.p3.2xlarge |

## Next Steps

- Try sequence inputs (e.g., sentence pairs for similarity)
- Use pre-trained embeddings via auxiliary channel
- Extract embeddings for visualization and similarity search
- Combine with downstream classifiers
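
As a starting point for the sequence-input step above, the sketch below turns sentence pairs into Object2Vec records. It is a simplification under stated assumptions: whitespace tokenization, a single vocabulary shared by both encoders, IDs assigned from 0 in order of first appearance, and truncation to a chosen maximum length. The helper names and example sentences are hypothetical.

```python
import json

def build_vocab(sentences):
    """Map each distinct lowercase word to an integer ID, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab, max_seq_len):
    """Convert a sentence to token IDs, truncated to max_seq_len."""
    return [vocab[w] for w in sentence.lower().split()][:max_seq_len]

# (sentence 1, sentence 2, label) with 1 = similar, 0 = not similar
pairs = [("the cat sat", "a cat sat down", 1), ("the cat sat", "stocks fell", 0)]
vocab = build_vocab(s for a, b, _ in pairs for s in (a, b))
records = [
    json.dumps({"in0": encode(a, vocab, 3), "in1": encode(b, vocab, 3), "label": y})
    for a, b, y in pairs
]
print(records[0])  # {"in0": [0, 1, 2], "in1": [3, 1, 2], "label": 1}
```

For data like this, `enc0_max_seq_len` and `enc1_max_seq_len` would match the truncation length, both vocabulary sizes would be at least `len(vocab)`, and a sequence encoder such as `hcnn` or `bilstm` would replace `pooled_embedding`.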