# SageMaker IP Insights Exercise

This notebook demonstrates Amazon SageMaker's **IP Insights** algorithm for detecting anomalous IP address usage patterns.

## What You'll Learn
1. How to prepare entity-IP pair data
2. How to train an IP Insights model
3. How to detect anomalous IP associations

## What is IP Insights?

IP Insights is an **unsupervised** algorithm that:
- Learns associations between entities (users, accounts) and IP addresses
- Creates embeddings for entities and IPs
- Detects when an entity uses an unusual IP

**Key Concept:**
- The model learns a latent vector (embedding) for each entity and each IP address
- The dot product between an entity vector and an IP vector indicates how likely that association is
- Anomalies are entity-IP pairs with unexpectedly low scores
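
As a rough illustration of the idea above, the sketch below scores entity-IP pairs with a dot product between toy vectors. The names (`entity_vecs`, `ip_vecs`, `association_score`) and the hand-written 3-dimensional vectors are assumptions made for illustration; the real model learns its embeddings during training.

```python
# Toy illustration only: hand-written 3-d vectors stand in for learned embeddings.
entity_vecs = {"user_a": [0.9, 0.1, 0.0]}
ip_vecs = {
    "10.0.0.1": [1.0, 0.0, 0.0],      # points the same way as user_a
    "203.0.113.7": [-1.0, 0.2, 0.0],  # points away from user_a
}

def association_score(entity, ip):
    """Dot product of the entity and IP vectors; higher means a more likely pair."""
    return sum(e * i for e, i in zip(entity_vecs[entity], ip_vecs[ip]))

familiar = association_score("user_a", "10.0.0.1")
unusual = association_score("user_a", "203.0.113.7")
print(familiar > unusual)
```

A pair whose vectors point in similar directions gets a higher score than a pair whose vectors point apart, which is the signal used to flag unusual logins.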

## Use Cases

| Application | Description |
|-------------|-------------|
| Account takeover detection | User logging in from an unusual IP |
| Fraud prevention | Account created from a suspicious IP |
| Bot detection | Automated access patterns |
| Compromised credentials | Login from an attacker's IP |

---

## Step 1: Setup and Imports

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
import pandas as pd
import numpy as np
import json
import os
from datetime import datetime
from dotenv import load_dotenv
import matplotlib.pyplot as plt

# Load environment variables from .env file
load_dotenv()

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
# Configuration
BUCKET_NAME = sagemaker_session.default_bucket()
PREFIX = "ip-insights"

# Dataset parameters
NUM_USERS = 1000
NUM_EVENTS = 50000
ANOMALY_RATE = 0.02
RANDOM_STATE = 42

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

## Step 2: Generate Synthetic Login Data

In [None]:
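def generate_ip():
    """Generate a random IPv4 address (randint's upper bound is exclusive)."""
    # Outer octets span 1-254; middle octets span the full 0-255 range
    return f"{np.random.randint(1, 255)}.{np.random.randint(0, 256)}.{np.random.randint(0, 256)}.{np.random.randint(1, 255)}"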

def generate_login_data(num_users=1000, num_events=50000, anomaly_rate=0.02, seed=42):
    """
    Generate synthetic user login data.
    
    Each user typically logs in from a small set of IPs.
    Anomalies are logins from completely new IPs.
    """
    np.random.seed(seed)
    
    # Generate user IDs
    user_ids = [f"user_{i:05d}" for i in range(num_users)]
    
    # Assign typical IPs to each user (1-3 IPs per user)
    user_ips = {}
    for user_id in user_ids:
        num_ips = np.random.randint(1, 4)
        user_ips[user_id] = [generate_ip() for _ in range(num_ips)]
    
    # Generate login events
    events = []
    anomaly_labels = []
    
    for _ in range(num_events):
        user_id = np.random.choice(user_ids)
        
        # Determine if this is an anomaly
        is_anomaly = np.random.random() < anomaly_rate
        
        if is_anomaly:
            # Use a completely random IP (anomaly)
            ip = generate_ip()
            # Make sure it's not accidentally a normal IP
            while ip in user_ips[user_id]:
                ip = generate_ip()
        else:
            # Use one of the user's typical IPs
            ip = np.random.choice(user_ips[user_id])
        
        events.append({'entity': user_id, 'ip': ip})
        anomaly_labels.append(is_anomaly)
    
    df = pd.DataFrame(events)
    df['is_anomaly'] = anomaly_labels
    
    return df, user_ips

# Generate data
df, user_normal_ips = generate_login_data(NUM_USERS, NUM_EVENTS, ANOMALY_RATE, RANDOM_STATE)

print(f"Dataset shape: {df.shape}")
print(f"Anomaly count: {df['is_anomaly'].sum()} ({100*df['is_anomaly'].mean():.1f}%)")
print(f"\nSample data:")
print(df.head(10))

In [None]:
# Show example user's normal IPs
sample_user = 'user_00000'
print(f"\nSample user '{sample_user}' normal IPs:")
print(f"  {user_normal_ips[sample_user]}")

print(f"\nSample user's events:")
user_events = df[df['entity'] == sample_user].head(10)
print(user_events)

## Step 3: Prepare Data for IP Insights

IP Insights expects headerless CSV input with two columns:
- Column 1: Entity identifier (string)
- Column 2: IPv4 address (dotted-decimal notation)
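
Before uploading real log data, it can help to check each row against this format. The following is a minimal sketch using only the standard library; `is_valid_row` and the sample rows are hypothetical and not part of the SageMaker SDK.

```python
import csv
import io
import ipaddress

def is_valid_row(row):
    """Return True if row is [entity, ipv4] with a non-empty entity and a valid IPv4 address."""
    if len(row) != 2 or not row[0]:
        return False
    try:
        ipaddress.IPv4Address(row[1])
    except ipaddress.AddressValueError:
        return False
    return True

# Hypothetical sample: one good row, one bad address, one row with an extra column
sample = "user_00001,192.168.1.10\nuser_00002,999.1.1.1\nuser_00003,10.0.0.5,extra\n"
rows = list(csv.reader(io.StringIO(sample)))
valid_flags = [is_valid_row(r) for r in rows]
print(valid_flags)
```

Rows that fail the check can be dropped or logged before the CSV is written.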

In [None]:
# Split data into training and validation sets.
# Both splits keep the injected anomalies: IP Insights is unsupervised,
# and in production you wouldn't know which events are anomalous.

# For validation: hold out roughly 10% of the events
np.random.seed(RANDOM_STATE)
val_mask = np.random.random(len(df)) < 0.1
val_df = df.loc[val_mask, ['entity', 'ip']].copy()

# For training: use the remaining events so validation data is truly held out
train_df = df.loc[~val_mask, ['entity', 'ip']].copy()

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")

In [None]:
# Save as CSV (no header)
os.makedirs('data/ip_insights', exist_ok=True)

train_df.to_csv('data/ip_insights/train.csv', index=False, header=False)
val_df.to_csv('data/ip_insights/validation.csv', index=False, header=False)

# Also save labels for later evaluation
df[val_mask][['entity', 'ip', 'is_anomaly']].to_csv('data/ip_insights/val_labels.csv', index=False)

print("Data files created:")
for f in os.listdir('data/ip_insights'):
    size = os.path.getsize(f'data/ip_insights/{f}') / 1024
    print(f"  data/ip_insights/{f} ({size:.1f} KB)")

print("\nFile preview (train.csv):")
with open('data/ip_insights/train.csv', 'r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        print(f"  {line.strip()}")

In [None]:
# Upload to S3
s3_client = boto3.client('s3')

for split in ['train', 'validation']:
    s3_key = f"{PREFIX}/{split}/{split}.csv"
    s3_client.upload_file(f'data/ip_insights/{split}.csv', BUCKET_NAME, s3_key)
    print(f"Uploaded: s3://{BUCKET_NAME}/{s3_key}")

train_uri = f"s3://{BUCKET_NAME}/{PREFIX}/train"
val_uri = f"s3://{BUCKET_NAME}/{PREFIX}/validation"

## Step 4: Train IP Insights Model

### Key Hyperparameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `num_entity_vectors` | Number of entity hash buckets | Required |
| `vector_dim` | Embedding dimension | Required (128 used here) |
| `epochs` | Training epochs | 10 |
| `learning_rate` | Learning rate | 0.001 |
| `mini_batch_size` | Batch size | 10000 |
| `num_ip_encoder_layers` | IP encoder depth | 1 |
| `random_negative_sampling_rate` | Negative samples per positive | 1 |

In [None]:
# Get IP Insights container image
ip_insights_image = retrieve(
    framework='ipinsights',
    region=region,
    version='1'
)

print(f"IP Insights Image URI: {ip_insights_image}")

In [None]:
# Create IP Insights estimator
ip_insights_estimator = Estimator(
    image_uri=ip_insights_image,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
    sagemaker_session=sagemaker_session,
    base_job_name='ip-insights'
)

In [None]:
# Set hyperparameters
hyperparameters = {
    "num_entity_vectors": NUM_USERS * 2,  # Hash space larger than users
    "vector_dim": 128,
    "epochs": 10,
    "learning_rate": 0.001,
    "mini_batch_size": 1000,
    "random_negative_sampling_rate": 1,
}

ip_insights_estimator.set_hyperparameters(**hyperparameters)

print("IP Insights hyperparameters:")
for k, v in hyperparameters.items():
    print(f"  {k}: {v}")

In [None]:
# Start training
print("Starting IP Insights training job...")
print("This will take approximately 5-10 minutes.\n")

ip_insights_estimator.fit(
    {
        'train': train_uri,
        'validation': val_uri
    },
    wait=True,
    logs=True
)

In [None]:
# Get training job info
job_name = ip_insights_estimator.latest_training_job.name
print(f"Training job completed: {job_name}")
print(f"Model artifacts: {ip_insights_estimator.model_data}")

## Step 5: Deploy and Score Associations

In [None]:
# Deploy the model
print("Deploying IP Insights model...")
print("This will take approximately 5-7 minutes.\n")

ip_predictor = ip_insights_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name=f'ip-insights-{datetime.now().strftime("%Y%m%d%H%M")}'
)

print(f"\nEndpoint deployed: {ip_predictor.endpoint_name}")

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Configure predictor
ip_predictor.serializer = CSVSerializer()
ip_predictor.deserializer = JSONDeserializer()

def get_anomaly_scores(entities, ips, predictor, batch_size=500):
    """
    Get anomaly scores for entity-IP pairs.
    
    Higher scores = more anomalous (less likely association)
    """
    scores = []
    
    # Prepare data
    data = [[e, ip] for e, ip in zip(entities, ips)]
    
    for i in range(0, len(data), batch_size):
        batch = data[i:i+batch_size]
        # Convert to CSV format
        csv_batch = '\n'.join([f"{row[0]},{row[1]}" for row in batch])
        
        response = predictor.predict(csv_batch)
        
        for pred in response['predictions']:
            # dot_product: lower = more anomalous
            # We negate to make higher = more anomalous
            scores.append(-pred['dot_product'])
    
    return np.array(scores)

In [None]:
# Load validation labels
val_labels_df = pd.read_csv('data/ip_insights/val_labels.csv')

# Get scores
print("Getting anomaly scores...")
scores = get_anomaly_scores(
    val_labels_df['entity'].tolist(),
    val_labels_df['ip'].tolist(),
    ip_predictor
)

val_labels_df['anomaly_score'] = scores

print(f"\nScore statistics:")
print(f"  Normal events - Mean: {scores[~val_labels_df['is_anomaly']].mean():.4f}")
print(f"  Anomaly events - Mean: {scores[val_labels_df['is_anomaly']].mean():.4f}")

## Step 6: Evaluate Anomaly Detection

In [None]:
from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve

# Calculate AUC
true_labels = val_labels_df['is_anomaly'].astype(int)
auc = roc_auc_score(true_labels, scores)

print(f"AUC-ROC: {auc:.4f}")

# Plot score distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Score distribution
axes[0].hist(scores[~val_labels_df['is_anomaly']], bins=50, alpha=0.5, label='Normal', color='blue')
axes[0].hist(scores[val_labels_df['is_anomaly']], bins=50, alpha=0.5, label='Anomaly', color='red')
axes[0].set_xlabel('Anomaly Score')
axes[0].set_ylabel('Count')
axes[0].set_title('Score Distribution')
axes[0].legend()

# ROC curve
fpr, tpr, _ = roc_curve(true_labels, scores)
axes[1].plot(fpr, tpr, label=f'AUC = {auc:.4f}')
axes[1].plot([0, 1], [0, 1], 'k--')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Show top anomalies
val_labels_df_sorted = val_labels_df.sort_values('anomaly_score', ascending=False)

print("Top 15 Highest Anomaly Scores:")
print("=" * 70)
print(val_labels_df_sorted[['entity', 'ip', 'anomaly_score', 'is_anomaly']].head(15).to_string())

## Step 7: Clean Up Resources

In [None]:
# Delete the endpoint
print(f"Deleting endpoint: {ip_predictor.endpoint_name}")
ip_predictor.delete_endpoint()
print("Endpoint deleted successfully!")

---

## Summary

In this exercise, you learned:

1. **Data Format**: Headerless CSV with entity (string) and IPv4 address (dotted-decimal)

2. **Key Hyperparameters**:
   - `num_entity_vectors`: Hash space for entities
   - `vector_dim`: Embedding dimension
   - `random_negative_sampling_rate`: Negative samples

3. **Output**: `dot_product` score
   - Higher dot_product = more normal
   - Lower dot_product = more anomalous

4. **Threshold Selection**:
   - Based on business requirements
   - Use validation data with known anomalies
   - Consider precision/recall tradeoff
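
One way to turn these points into a concrete threshold is sketched below. It assumes the convention used in this notebook (negated `dot_product`, so higher score = more anomalous) and picks the lowest threshold that still meets a target precision. The helper names and the toy scores and labels are hypothetical.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when flagging every score >= threshold as anomalous."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def pick_threshold(scores, labels, min_precision):
    """Lowest observed score whose precision meets min_precision, or None if none does."""
    for t in sorted(set(scores)):
        precision, _ = precision_recall_at(scores, labels, t)
        if precision >= min_precision:
            return t
    return None

# Toy validation data: 1 = known anomaly, 0 = normal
toy_scores = [-2.0, -1.5, -1.0, 0.5, 1.0, 2.0]
toy_labels = [0, 0, 0, 1, 0, 1]
threshold = pick_threshold(toy_scores, toy_labels, min_precision=0.6)
print(threshold, precision_recall_at(toy_scores, toy_labels, threshold))
```

Because recall can only fall as the threshold rises, the lowest threshold that meets the precision target is also the one with the highest recall among the qualifying thresholds.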

### Instance Recommendations

| Task | Instance Types |
|------|----------------|
| Training | ml.m5.large, ml.p3.2xlarge (GPU for large data) |
| Inference | ml.m5.large, ml.c5.large (CPU recommended) |

### Best Practices

- Set `num_entity_vectors` larger than the number of unique entities (this notebook uses 2x) to limit hash collisions
- Monitor score distribution over time
- Combine with other fraud signals
- Retrain periodically as patterns change
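
Monitoring the score distribution can start very simply. The sketch below takes a threshold from a baseline window and compares how often recent scores exceed it. The function names, the 95th-percentile choice, and the synthetic score lists are all assumptions for illustration.

```python
import math

def baseline_threshold(scores, percentile=95):
    """Score at the given percentile of the baseline window (nearest-rank method)."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

def high_score_rate(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return sum(1 for s in scores if s > threshold) / len(scores)

# Synthetic stand-ins for anomaly scores from two time windows
baseline = [float(i) for i in range(100)]
recent = [float(i) + 20.0 for i in range(100)]  # same shape, shifted upward

t = baseline_threshold(baseline)
baseline_rate = high_score_rate(baseline, t)
recent_rate = high_score_rate(recent, t)
print(t, baseline_rate, recent_rate)
```

A recent rate well above the baseline rate suggests that traffic patterns have shifted and that retraining or a threshold review may be due.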

### Next Steps

- Apply to real login/access log data
- Combine with time-based features
- Set up alerting for high-score events
- Integrate with authentication systems