# PostgreSQL to DocumentDB Migration Strategies

<div style="background-color: #f8f9fa; border: 1px solid #e9ecef; border-radius: 8px; padding: 10px; margin: 10px;">
<strong>📋 Workshop Contents</strong>
<ul style="line-height: 1.2;">
<li><a href="#Overview">Overview</a></li>
<li><a href="#Migration-Patterns">Migration Patterns</a></li>
<li><a href="#Data-Model-Differences-RDBMS-to-Document-Database">Data Model Differences</a></li>
<li><a href="#Migration-Options">Migration Options</a></li>
<li><a href="#Best-Practices">Best Practices</a></li>
<li><a href="#Validation-and-Testing">Validation and Testing</a></li>
<li><a href="#Performance-Comparison">Performance Comparison</a></li>
<li><a href="#Conclusion">Conclusion</a></li>
<li><a href="#Next-Steps">Next Steps</a></li>
<li><a href="#Additional-Resources">Additional Resources 📚</a></li>
</ul>
</div>

## Overview
This guide demonstrates several approaches to migrating data from PostgreSQL to Amazon DocumentDB, with a focus on handling JSON data efficiently. The previous module ([Understanding the JSON in RDBMS Anti-Pattern](../7.2_Understanding-the-JSON-in-RDBMS-Antipattern/README.md)) identified performance limitations of storing JSON data in PostgreSQL. This module explores strategies for migrating that data to DocumentDB.

### Why Consider Migration?

In our previous module, we encountered several challenges with PostgreSQL JSON storage:
- Slow query performance on complex JSON queries
- Limited indexing capabilities for nested JSON structures
- Increasing query times with growing data volume

Amazon DocumentDB offers several advantages for JSON workloads:
- Native JSON document storage
- Better query performance for nested JSON structures
- Horizontal scaling (read replicas, or sharding with elastic clusters)
- Purpose-built for JSON document workloads

> 💡 **Key Consideration:** Before proceeding with migration, evaluate your specific use case. While DocumentDB excels at JSON document storage, it might not be the best choice for all scenarios, especially if your workload involves complex joins or traditional relational data structures.

### Prerequisites
- ✅ Access to source PostgreSQL database
- ✅ Amazon DocumentDB cluster
- ✅ Appropriate network connectivity
- ✅ Required permissions on both systems
- ✅ **Jupyter Notebook**: You can launch a [free tier Amazon SageMaker Jupyter Notebook](../../1_Getting_Started_with_AWS/1.4_Setting_up_Your_Cookbook_Environment/README.MD)

### What We'll Cover
1. Different migration patterns and when to use them
2. Implementation approaches (AWS DMS and custom Python script)
3. Best practices and performance considerations
4. Post-migration validation and monitoring

## Migration Patterns

<table>
    <tr>
        <th>Full Load</th>
        <th>Full Load + CDC</th>
        <th>CDC Only</th>
    </tr>
    <tr>
        <td>One-time complete data copy</td>
        <td>Initial data copy followed by continuous synchronization</td>
        <td>Captures and replicates only data changes</td>
    </tr>
    <tr>
        <td colspan="3" align="center"><strong>Best For:</strong></td>
    </tr>
    <tr>
        <td>
            • Initial migrations<br>
            • Small to medium datasets<br>
            • When downtime is acceptable
        </td>
        <td>
            • Production systems<br>
            • Minimal downtime requirements<br>
            • Large datasets
        </td>
        <td>
            • Systems with existing data sync<br>
            • Continuous replication needs<br>
            • Real-time data synchronization
        </td>
    </tr>
</table>

## Data Model Differences: RDBMS to Document Database

When migrating from PostgreSQL (RDBMS) to DocumentDB (Document Database), we're not just moving data - we're transforming how it's stored and accessed:

### Data Model Comparison

| PostgreSQL (Source) | DocumentDB (Target) |
|-------------------|-------------------|
| Data normalized across multiple tables | Data stored as self-contained documents |
| Relationships maintained through foreign keys | Nested structures replace table relationships |
| JSON stored as a column type | Native JSON support with optimized querying |
| Schema is strictly defined | Flexible, schema-less design |
| Complex joins needed for data retrieval | No joins needed - data embedded in documents |


### Key Benefits
- Improved query performance on JSON data
- Better support for nested data structures
- Simplified data model without joins
- Flexible schema for evolving data structures
- Native indexing for JSON fields

This transformation allows for more efficient handling of document-oriented workloads while eliminating the overhead of JSON parsing and complex joins in PostgreSQL.
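
As a rough illustration of the first two rows of the comparison table, the sketch below folds child rows into their parent so that each parent becomes one self-contained document. The `users` and `orders` rows and the `user_id` foreign key are hypothetical sample data, not part of the workshop schema.

```python
from collections import defaultdict
from typing import Any, Dict, List


def embed_children(
    parents: List[Dict[str, Any]],
    children: List[Dict[str, Any]],
    parent_key: str,
    foreign_key: str,
    embed_as: str,
) -> List[Dict[str, Any]]:
    """Fold child rows into their parent rows as a nested array."""
    grouped: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for child in children:
        # Drop the foreign key: nesting already expresses the relationship
        nested = {k: v for k, v in child.items() if k != foreign_key}
        grouped[child[foreign_key]].append(nested)

    documents = []
    for parent in parents:
        document = dict(parent)  # copy so the source rows stay untouched
        document[embed_as] = grouped.get(parent[parent_key], [])
        documents.append(document)
    return documents


# Hypothetical rows as they might come back from two PostgreSQL tables
users = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Raj"}]
orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    {"order_id": 11, "user_id": 1, "total": 40.0},
]

documents = embed_children(users, orders, "id", "user_id", "orders")
```

Embedding suits data that is read together with its parent. Child rows that grow without bound are usually better kept in their own collection.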

![pg-document.png](../images/7.2-postgresql-document-storage.png)

## Migration Options

### Option 1: AWS Database Migration Service (DMS)

AWS DMS is a fully managed service that enables you to migrate databases to AWS quickly and securely. It supports homogeneous migrations (like PostgreSQL to PostgreSQL) and heterogeneous migrations (like PostgreSQL to DocumentDB).

#### How AWS DMS Works
- Creates a replication instance that performs the migration
- Reads data from source database
- Formats the data for the target database
- Loads the data into the target database
- Maintains data consistency through CDC if configured
- Validates data migration completeness
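
The migration patterns described earlier correspond to the `MigrationType` setting of a DMS replication task. The sketch below only assembles the request for one table; it does not call AWS. The ARN values, the `public` schema, and the helper names are placeholders chosen for illustration.

```python
import json
from typing import Any, Dict

# Migration patterns from this guide mapped to DMS MigrationType values
MIGRATION_TYPES = {
    "Full Load": "full-load",
    "Full Load + CDC": "full-load-and-cdc",
    "CDC Only": "cdc",
}


def build_table_mappings(schema: str, table: str) -> str:
    """Return a DMS table-mapping document that selects a single table."""
    mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "1",
                "object-locator": {"schema-name": schema, "table-name": table},
                "rule-action": "include",
            }
        ]
    }
    return json.dumps(mappings)


def build_task_request(
    pattern: str, arns: Dict[str, str], schema: str, table: str
) -> Dict[str, Any]:
    """Assemble keyword arguments for a DMS create_replication_task call."""
    return {
        # Task identifiers allow letters, digits, and hyphens only
        "ReplicationTaskIdentifier": f"pg-to-docdb-{table}".replace("_", "-"),
        "SourceEndpointArn": arns["source"],
        "TargetEndpointArn": arns["target"],
        "ReplicationInstanceArn": arns["instance"],
        "MigrationType": MIGRATION_TYPES[pattern],
        "TableMappings": build_table_mappings(schema, table),
    }


# Placeholder ARNs: replace with the endpoints and instance you created
arns = {
    "source": "your-source-endpoint-arn",
    "target": "your-target-endpoint-arn",
    "instance": "your-replication-instance-arn",
}
request = build_task_request("Full Load + CDC", arns, "public", "user_profiles")
```

With boto3 installed and credentials configured, the request could then be submitted with `boto3.client("dms").create_replication_task(**request)`.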

#### High-Level Migration Steps
![ERD-Page-12.jpg](../images/7.1-database-selection-erd.jpg)

### Option 2: Custom Python Migration Script

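Below is a Python implementation of a full-load migration. It reads each table into memory before loading it, so it is best suited to small and medium datasets: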

In [None]:
import psycopg2
from pymongo import MongoClient
import logging
import json
from datetime import datetime
from typing import Dict, List, Any

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class PostgreSQLToDocumentDBMigrator:
    def __init__(self, pg_config: Dict, docdb_config: Dict):
        self.pg_config = pg_config
        self.docdb_config = docdb_config
        self.pg_conn = None
        self.docdb_client = None
        
    def connect_postgresql(self):
        """Establish connection to PostgreSQL"""
        try:
            self.pg_conn = psycopg2.connect(**self.pg_config)
            logger.info("Connected to PostgreSQL successfully")
        except Exception as e:
            logger.error(f"Failed to connect to PostgreSQL: {e}")
            raise
    
    def connect_documentdb(self):
        """Establish connection to DocumentDB"""
        try:
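            # URL-encode the credentials so reserved characters such as '@' or ':'
            # in the password do not break the connection string
            from urllib.parse import quote_plus
            username = quote_plus(self.docdb_config['username'])
            password = quote_plus(self.docdb_config['password'])
            connection_string = f"mongodb://{username}:{password}@{self.docdb_config['host']}:{self.docdb_config['port']}/{self.docdb_config['database']}?ssl=true&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
            self.docdb_client = MongoClient(connection_string)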
            # Test connection
            self.docdb_client.admin.command('ping')
            logger.info("Connected to DocumentDB successfully")
        except Exception as e:
            logger.error(f"Failed to connect to DocumentDB: {e}")
            raise
    
    def extract_data_from_postgresql(self, query: str) -> List[Dict]:
        """Extract data from PostgreSQL using provided query"""
        try:
            # The context manager closes the cursor even if the query fails
            with self.pg_conn.cursor() as cursor:
                cursor.execute(query)
                
                # Get column names
                columns = [desc[0] for desc in cursor.description]
                
                # Fetch all rows and convert to dictionaries
                rows = cursor.fetchall()
                data = [dict(zip(columns, row)) for row in rows]
            
            logger.info(f"Extracted {len(data)} records from PostgreSQL")
            return data
            
        except Exception as e:
            logger.error(f"Failed to extract data from PostgreSQL: {e}")
            raise
    
    def transform_data(self, data: List[Dict]) -> List[Dict]:
        """Transform PostgreSQL data for DocumentDB"""
        transformed_data = []
        
        for record in data:
            # Build a new dict rather than mutating the extracted row in place
            transformed = {}
            for key, value in record.items():
                if isinstance(value, datetime):
                    # Convert datetime objects to ISO 8601 strings
                    transformed[key] = value.isoformat()
                elif isinstance(value, str) and self._is_json_string(value):
                    # Parse JSON strings that hold an object or array
                    transformed[key] = json.loads(value)
                else:
                    transformed[key] = value
            
            transformed_data.append(transformed)
        
        logger.info(f"Transformed {len(transformed_data)} records")
        return transformed_data
    
    def _is_json_string(self, value: str) -> bool:
        """Check if string holds a JSON object or array.

        Scalars such as "123" or "true" are also valid JSON, but parsing them
        would silently turn plain text columns into numbers or booleans.
        """
        try:
            return isinstance(json.loads(value), (dict, list))
        except (json.JSONDecodeError, TypeError):
            return False
    
    def load_data_to_documentdb(self, data: List[Dict], collection_name: str):
        """Load transformed data into DocumentDB"""
        try:
            db = self.docdb_client[self.docdb_config['database']]
            collection = db[collection_name]
            
            if data:
                result = collection.insert_many(data)
                logger.info(f"Inserted {len(result.inserted_ids)} documents into {collection_name}")
            else:
                logger.warning("No data to insert")
                
        except Exception as e:
            logger.error(f"Failed to load data to DocumentDB: {e}")
            raise
    
    def migrate_table(self, table_name: str, collection_name: str = None, custom_query: str = None):
        """Migrate a single table from PostgreSQL to DocumentDB"""
        if collection_name is None:
            collection_name = table_name
        
        if custom_query is None:
            query = f"SELECT * FROM {table_name}"
        else:
            query = custom_query
        
        logger.info(f"Starting migration of {table_name} to {collection_name}")
        
        # Extract, Transform, Load
        data = self.extract_data_from_postgresql(query)
        transformed_data = self.transform_data(data)
        self.load_data_to_documentdb(transformed_data, collection_name)
        
        logger.info(f"Completed migration of {table_name}")
    
    def close_connections(self):
        """Close database connections"""
        if self.pg_conn:
            self.pg_conn.close()
            logger.info("PostgreSQL connection closed")
        
        if self.docdb_client:
            self.docdb_client.close()
            logger.info("DocumentDB connection closed")

# Example usage
if __name__ == "__main__":
    # Configuration
    pg_config = {
        'host': 'your-postgresql-host',
        'database': 'your-database',
        'user': 'your-username',
        'password': 'your-password',
        'port': 5432
    }
    
    docdb_config = {
        'host': 'your-documentdb-cluster-endpoint',
        'database': 'your-database',
        'username': 'your-username',
        'password': 'your-password',
        'port': 27017
    }
    
    # Initialize migrator
    migrator = PostgreSQLToDocumentDBMigrator(pg_config, docdb_config)
    
    try:
        # Connect to both databases
        migrator.connect_postgresql()
        migrator.connect_documentdb()
        
        # Migrate specific tables
        migrator.migrate_table('user_profiles', 'users')
        migrator.migrate_table('product_catalog', 'products')
        
        # Custom query example for complex transformations
        custom_query = """
        SELECT 
            id,
            name,
            email,
            preferences::text as preferences,
            created_at
        FROM user_profiles 
        WHERE active = true
        """
        migrator.migrate_table('user_profiles', 'active_users', custom_query)
        
    except Exception as e:
        logger.error(f"Migration failed: {e}")
    finally:
        migrator.close_connections()
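
One detail the script above does not cover: psycopg2 returns `NUMERIC` columns as `decimal.Decimal` and `DATE` columns as `datetime.date`, and PyMongo cannot encode either type by default. The helper below is a minimal sketch of one way to convert them before loading. `to_bson_safe` is a hypothetical name, and converting `Decimal` to `float` assumes that a small loss of precision is acceptable for your data.

```python
from datetime import date, datetime, time
from decimal import Decimal
from typing import Any


def to_bson_safe(value: Any) -> Any:
    """Recursively convert values that PyMongo cannot encode by default."""
    if isinstance(value, Decimal):
        # Assumption: float precision is acceptable. Use a string or
        # bson.Decimal128 instead where exact decimal values matter.
        return float(value)
    if isinstance(value, date) and not isinstance(value, datetime):
        # BSON has a datetime type but no date-only type
        return datetime.combine(value, time.min)
    if isinstance(value, dict):
        return {key: to_bson_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_bson_safe(item) for item in value]
    return value
```

It could be applied to each record before the insert, for example `[to_bson_safe(record) for record in data]`.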

## Best Practices

### 1. Pre-Migration Planning
- **Data Assessment**: Analyze your PostgreSQL schema and identify JSON columns
- **Performance Baseline**: Establish current query performance metrics
- **Schema Design**: Plan your DocumentDB collection structure
- **Index Strategy**: Identify required indexes for DocumentDB
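
For the data assessment step, one possible starting point is to ask PostgreSQL which columns use a JSON type. The sketch below queries `information_schema.columns`. The function name and the default `public` schema are assumptions, and `pg_conn` is expected to be an open psycopg2 connection.

```python
from typing import Any, List, Tuple

JSON_COLUMNS_QUERY = """
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = %s
      AND data_type IN ('json', 'jsonb')
    ORDER BY table_name, column_name
"""


def find_json_columns(pg_conn: Any, schema: str = "public") -> List[Tuple[str, str, str]]:
    """List (table, column, type) for every json/jsonb column in a schema."""
    with pg_conn.cursor() as cursor:
        # The schema name is passed as a bound parameter, not interpolated
        cursor.execute(JSON_COLUMNS_QUERY, (schema,))
        return cursor.fetchall()
```

Tables that appear in the result are the natural candidates for the DocumentDB collection design.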

### 2. Migration Execution
- **Batch Processing**: Process data in manageable chunks
- **Error Handling**: Implement robust error handling and retry logic
- **Progress Monitoring**: Track migration progress and performance
- **Data Validation**: Verify data integrity throughout the process
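
Batch processing and retry logic could be sketched as follows. Both helpers are hypothetical: `iter_batches` works with any executed DB-API cursor, and `insert_with_retry` retries only the exception types the caller names, for example `pymongo.errors.AutoReconnect`. A retry after a partially written batch can raise duplicate-key errors, so treat this as a starting point rather than a complete recovery strategy.

```python
import time
from typing import Any, Callable, Dict, Iterator, List, Tuple, Type


def iter_batches(cursor: Any, batch_size: int = 1000) -> Iterator[List[Dict[str, Any]]]:
    """Yield rows from an executed DB-API cursor as lists of dictionaries."""
    columns = None
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        if columns is None:
            # Read column names after the first fetch, when they are available
            columns = [desc[0] for desc in cursor.description]
        yield [dict(zip(columns, row)) for row in rows]


def insert_with_retry(
    collection: Any,
    batch: List[Dict[str, Any]],
    retry_on: Tuple[Type[Exception], ...],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Insert one batch, backing off exponentially on the given errors."""
    attempt = 1
    while True:
        try:
            result = collection.insert_many(batch)
            return len(result.inserted_ids)
        except retry_on:
            if attempt >= max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1
```

With psycopg2, passing a name to `pg_conn.cursor(name="migration_cursor")` creates a server-side cursor, so `fetchmany` streams rows from the server instead of holding the whole table in memory.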

### 3. Post-Migration Optimization
- **Index Creation**: Create appropriate indexes based on query patterns
- **Query Optimization**: Optimize queries for DocumentDB
- **Performance Testing**: Compare performance with PostgreSQL baseline
- **Monitoring Setup**: Implement comprehensive monitoring
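
Index creation can be scripted once the query patterns are known. The sketch below creates one ascending single-field index per dotted field path using PyMongo's `create_index`. The helper name and the example paths are assumptions, with the paths taken from the sample `preferences` queries in this guide.

```python
from typing import Any, Iterable, List

ASCENDING = 1  # same value as pymongo.ASCENDING


def create_field_indexes(collection: Any, field_paths: Iterable[str]) -> List[str]:
    """Create an ascending single-field index for each dotted field path."""
    index_names = []
    for path in field_paths:
        # create_index returns the name of the index
        index_names.append(collection.create_index([(path, ASCENDING)]))
    return index_names
```

For example, `create_field_indexes(collection, ["preferences.theme", "preferences.notifications.email"])` would cover the two filters used in the performance comparison query.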

### 4. Common Pitfalls to Avoid
- **Over-normalization**: Don't replicate relational patterns in DocumentDB
- **Missing Indexes**: Ensure proper indexing for query performance
- **Large Documents**: DocumentDB rejects documents larger than 16 MB, so keep documents well below that limit
- **Inefficient Queries**: Optimize queries for document structure

## Validation and Testing

After migration, it's crucial to validate that the data has been transferred correctly and that the new system performs as expected.

In [None]:
def validate_migration(migrator, pg_table: str, docdb_collection: str):
    """Validate migration by comparing record counts"""
    
    # Count records in PostgreSQL (the context manager closes the cursor)
    with migrator.pg_conn.cursor() as pg_cursor:
        pg_cursor.execute(f"SELECT COUNT(*) FROM {pg_table}")
        pg_count = pg_cursor.fetchone()[0]
    
    # Count documents in DocumentDB
    db = migrator.docdb_client[migrator.docdb_config['database']]
    collection = db[docdb_collection]
    docdb_count = collection.count_documents({})
    
    logger.info(f"PostgreSQL {pg_table}: {pg_count} records")
    logger.info(f"DocumentDB {docdb_collection}: {docdb_count} documents")
    
    if pg_count == docdb_count:
        logger.info("✅ Record counts match")
        return True
    else:
        logger.error("❌ Record counts don't match")
        return False

# Example validation
# validate_migration(migrator, 'user_profiles', 'users')
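
Matching counts do not prove that the contents match. A complementary check is to compare a sample of rows field by field, as in the sketch below. `compare_samples` is a hypothetical helper: it takes the sampled PostgreSQL rows as dictionaries and a callable that looks up the corresponding document, so it does not depend on a particular driver.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional


def compare_samples(
    pg_rows: Iterable[Dict[str, Any]],
    fetch_document: Callable[[Any], Optional[Dict[str, Any]]],
    key: str,
    fields: Iterable[str],
) -> List[str]:
    """Describe each mismatch between sampled rows and target documents."""
    fields = list(fields)
    problems = []
    for row in pg_rows:
        document = fetch_document(row[key])
        if document is None:
            problems.append(f"{key}={row[key]!r}: missing in target")
            continue
        for field in fields:
            if row.get(field) != document.get(field):
                problems.append(f"{key}={row[key]!r}: field {field!r} differs")
    return problems
```

With PyMongo, the lookup could be `lambda value: collection.find_one({"id": value})`. Rows should pass through the same transformation as the migrated data before comparison, otherwise converted values such as timestamps will be reported as differences.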

## Performance Comparison

Let's compare query performance between PostgreSQL and DocumentDB for JSON operations:

In [None]:
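import time

def benchmark_queries(migrator, collection_name: str = 'users'):
    """Compare query performance between PostgreSQL and DocumentDB"""
    
    # PostgreSQL JSON query
    pg_query = """
    SELECT * FROM user_profiles 
    WHERE preferences->>'theme' = 'dark' 
    AND preferences->'notifications'->>'email' = 'true'
    """
    
    # DocumentDB equivalent query. The ->> operator returns text in PostgreSQL,
    # but a JSON boolean becomes a native boolean once the document is parsed,
    # so match True instead of "true" if the source value was a boolean
    docdb_query = {
        "preferences.theme": "dark",
        "preferences.notifications.email": "true"
    }
    
    # Benchmark PostgreSQL (fetch every row so both sides do comparable work)
    start_time = time.perf_counter()
    with migrator.pg_conn.cursor() as cursor:
        cursor.execute(pg_query)
        cursor.fetchall()
    pg_time = time.perf_counter() - start_time
    
    # Benchmark DocumentDB (list() drains the lazily evaluated cursor)
    collection = migrator.docdb_client[migrator.docdb_config['database']][collection_name]
    start_time = time.perf_counter()
    list(collection.find(docdb_query))
    docdb_time = time.perf_counter() - start_time
    
    logger.info(f"PostgreSQL query time: {pg_time:.3f}s")
    logger.info(f"DocumentDB query time: {docdb_time:.3f}s")
    if pg_time > 0:
        logger.info(f"Performance improvement: {((pg_time - docdb_time) / pg_time * 100):.1f}%")

# benchmark_queries(migrator)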
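
A single timed run is sensitive to cold caches and network jitter. One way to get steadier numbers is to repeat each query and report the median, as in this sketch. `time_query` is a hypothetical helper, and the default run counts are arbitrary.

```python
import statistics
import time
from typing import Callable, List


def time_query(run_query: Callable[[], object], runs: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock time in seconds over several runs."""
    for _ in range(warmup):
        run_query()  # untimed runs so a cold cache does not skew the result
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

For example, `time_query(lambda: list(collection.find(docdb_query)))` would time the DocumentDB side, assuming `collection` is a PyMongo collection.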

## Conclusion

Migrating from PostgreSQL to DocumentDB can provide significant benefits for JSON-heavy workloads:

### Key Takeaways
- **Performance Gains**: Native JSON support eliminates parsing overhead
- **Simplified Queries**: No complex joins needed for nested data
- **Better Scalability**: Horizontal scaling capabilities
- **Flexible Schema**: Easier to evolve data structures

### When to Consider Migration
- Heavy use of JSON data types in PostgreSQL
- Performance issues with JSON queries
- Need for horizontal scaling
- Document-oriented data access patterns

### Migration Action Plan
1. Assess your current PostgreSQL JSON usage
2. Plan your DocumentDB schema design
3. Set up a test migration environment
4. Validate performance improvements
5. Plan your production migration strategy

Remember that migration is not always the answer - evaluate your specific use case and requirements before proceeding.

## Additional Resources 📚

### PostgreSQL JSON Features
- [PostgreSQL JSON Types](https://www.postgresql.org/docs/current/datatype-json.html)
- [JSON Functions and Operators](https://www.postgresql.org/docs/current/functions-json.html)
- [GIN Indexes for JSON](https://www.postgresql.org/docs/current/gin-intro.html)

### Purpose-Built Databases
- [Amazon DocumentDB](https://docs.aws.amazon.com/documentdb/)
- [Amazon DynamoDB](https://docs.aws.amazon.com/dynamodb/)
- [Database Selection Guide](https://aws.amazon.com/products/databases/)

### Performance & Migration
- [Database Migration Best Practices](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_BestPractices.html)
- [Performance Tuning PostgreSQL](https://wiki.postgresql.org/wiki/Performance_Optimization)
- [AWS Database Migration Service](https://docs.aws.amazon.com/dms/)