AWS Database Services Comprehensive Notes

# Amazon RDS (Relational Database Service)

## Overview

Amazon RDS is a managed relational database service on AWS that simplifies database administration tasks. It supports multiple database engines and is optimized for transactional workloads.

## Supported Database Engines

RDS supports the following database engines:

- **Amazon Aurora**: AWS's proprietary database engine, compatible with MySQL and PostgreSQL
- **MySQL**: Open-source relational database
- **PostgreSQL**: Advanced open-source relational database
- **MariaDB**: Community-developed fork of MySQL
- **Oracle Database**: Enterprise-grade commercial database
- **SQL Server**: Microsoft's relational database management system

## Workload Characteristics

### Designed For: OLTP (Online Transaction Processing)

RDS is optimized for OLTP workloads that require:

- **Transactions with ACID properties**: Ensures data integrity and reliability
- **High consistency**: Maintains data accuracy across all operations
- **Low-latency reads and writes**: Fast response times for individual operations
- **Frequent small updates**: Handling many concurrent small transactions

### Not Designed For: OLAP (Online Analytical Processing)

RDS is not suitable for:

- **Large-scale table scans**: Reading entire large tables
- **Heavy aggregations**: Complex analytical queries across massive datasets
- **Big data analytics**: Data warehousing and business intelligence workloads

For analytical workloads, AWS offers Redshift instead.

## ACID Properties

RDS databases provide full ACID compliance, ensuring reliable transaction processing.

### Atomicity

A transaction is treated as a single, indivisible unit of work. Either all operations in the transaction complete successfully, or none of them do. This is an all-or-nothing approach.

**Example**: When transferring money between bank accounts, both the debit and credit must occur together, or neither should happen.
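
The bank-transfer example can be sketched with Python's built-in `sqlite3` module standing in for an RDS engine. The table name `accounts`, the helper names, and the starting balances are assumptions made for illustration; the point is only that a failed transfer leaves no partial debit behind.

```python
import sqlite3

def make_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 1000), ("B", 0)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one unit; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("insufficient funds")
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            if cur.rowcount != 1:
                raise ValueError("unknown destination account")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {account_id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

A transfer to a non-existent account debits the source first, then fails on the credit; the rollback restores the source balance, so neither half of the transfer is visible afterwards.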

### Consistency

A transaction brings the database from one valid state to another valid state. All database rules, constraints, and triggers are maintained.

**Example**: If a database constraint requires account balances to be non-negative, any transaction that would violate this rule will be rejected.
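
As a minimal sketch of the non-negative balance rule, the snippet below again uses `sqlite3` as a local stand-in for a relational engine. The `CHECK` constraint, table name, and helper names are illustrative assumptions, not part of any RDS-specific API.

```python
import sqlite3

def make_accounts():
    """Create a table whose CHECK constraint forbids negative balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES ('A', 100)")
    conn.commit()
    return conn

def withdraw(conn, account_id, amount):
    """Try to withdraw; the database rejects any update that breaks the rule."""
    try:
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balance_of(conn, account_id):
    """Read the current balance of one account."""
    return conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()[0]
```

The rule lives in the schema rather than in application code, so every client that writes to the table is held to it.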

### Isolation

Concurrent transactions execute independently without interfering with each other. Multiple transactions can run simultaneously without causing data corruption.

#### Isolation Levels

Different isolation levels provide varying degrees of protection:

**1. Read Uncommitted (Lowest isolation)**

- Allows dirty reads: Transaction can read uncommitted changes from other transactions
- Example scenario: Balance is 1000, Transaction T1 changes it to 500 but hasn't committed. Transaction T2 can read 500 even though it's not committed yet.
- **Risk**: If T1 rolls back, T2 has read invalid data

**2. Read Committed**

- Prevents dirty reads: Transaction can only read committed data
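- Example scenario:
  - T1: Changes balance from 1000 to 500 and commits
  - T2: Reads the balance and sees 500 (visible only after T1 commits)
  - T3: Transfers from A to C, deducting 200, and commits while T2 is still open
  - T2: Reads the balance again and now sees 300
- **Issue**: Non-repeatable reads can occur: the same read inside one transaction (T2 above) can return different values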

**3. Repeatable Read**

- Provides stable row reads within a transaction
- Example scenario: Balance is 1000. Transaction T1 starts, then Transaction T2 transfers 500 and commits. If T1 checks the balance again, it still sees 1000 (the value when T1 started).
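- Lock-based engines hold shared locks (S-locks) on the rows read until the transaction ends; MVCC-based engines such as PostgreSQL and MySQL (InnoDB) instead read from a consistent snapshot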
- **Issue**: Phantom reads can still occur (new rows can appear)

**4. Serializable (Highest isolation)**

- Strongest isolation level: Full isolation from concurrent transactions
- Transactions execute as if they were running serially (one after another)
- Uses various locking mechanisms:
  - Shared locks (S-lock): Multiple transactions can read the same data
  - Exclusive locks (X-lock): Only one transaction can modify data
  - Range locks: Prevents phantom reads
  - Predicate locks: Locks based on query conditions
- **Trade-off**: Highest consistency but lowest concurrency

### Durability

Once a transaction is committed, the changes are permanent and will survive system failures. Data is persisted to disk and can be recovered even after crashes.

**Example**: After a successful bank transfer commits, the transaction is saved even if the database server crashes immediately afterward.

# Amazon Redshift

## Overview

Amazon Redshift is a fully managed, cloud-based data warehousing service designed for processing and analyzing large volumes of data. It uses a columnar storage architecture optimized for OLAP (Online Analytical Processing) workloads.

## Key Characteristics

- **Fully managed**: AWS handles infrastructure, maintenance, and updates
- **Cloud-based**: Scalable and accessible from anywhere
- **Columnar storage**: Data is stored by columns rather than rows
- **Optimized for OLAP**: Designed for analytical queries, not transactional workloads
- **Cost-effective**: Pay for what you use with various pricing options
- **High performance**: MPP architecture enables fast query processing

## Use Cases

### Primary Use Cases

1. **Accelerate analytics workloads**: Fast processing of complex analytical queries
2. **Unified data warehouses and data lakes**: Combine structured and semi-structured data
3. **Data warehouse modernization**: Migrate from legacy on-premises data warehouses
4. **Analyze global sales data**: Process sales data from multiple regions and sources
5. **Store historical stock trade data**: Maintain long-term financial data for analysis
6. **Analyze ad impressions and clicks**: Process digital advertising metrics
7. **Aggregate gaming data**: Analyze player behavior and game performance
8. **Analyze social trends**: Process social media data for insights

## Architecture

### MPP (Massively Parallel Processing) Architecture

Redshift uses an MPP architecture that distributes data and query processing across multiple nodes.

#### Components

**1. Leader Node**

- Coordinates query execution across compute nodes
- Manages client connections via JDBC/ODBC
- Parses and optimizes SQL queries
- Aggregates results from compute nodes
- Does not store user data

**2. Compute Nodes**

- Execute queries in parallel
- Store actual data in node slices
- Each compute node is divided into multiple slices
- Process their portion of data independently

**3. Node Slices**

- Logical partitions within compute nodes
- Each slice processes a portion of the workload
- Data is distributed across slices for parallel processing

#### How MPP Works with Columnar Storage

Consider this query:

```text
SELECT country, SUM(price)
FROM orders
GROUP BY country
```

With MPP and columnar storage:

1. Query is sent to the leader node
2. Leader node distributes work to compute nodes
3. Each node processes its data slices in parallel
4. Only the `country` and `price` columns are scanned (columnar benefit)
5. Partial aggregations happen on each node
6. Leader node combines results and returns final output

This architecture enables:

- **Parallel processing**: Multiple nodes work simultaneously
- **Column-based scanning**: Only read necessary columns
- **Efficient aggregations**: Distribute computation across nodes
- **High performance**: Dramatically faster than single-server processing
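
The six steps above can be imitated in plain Python to show where the parallelism comes from. This is a toy model, not Redshift's implementation: the function names, the round-robin placement, and the sample `orders` rows are all assumptions made for the sketch.

```python
from collections import defaultdict

def distribute_round_robin(rows, num_slices):
    """Place rows on slices in round-robin order (stand-in for data distribution)."""
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)
    return slices

def partial_sum(slice_rows):
    """Work done on one slice: SUM(price) GROUP BY country over local rows only."""
    totals = defaultdict(int)
    for row in slice_rows:
        # Only the two needed columns are touched, as in a columnar scan.
        totals[row["country"]] += row["price"]
    return dict(totals)

def leader_combine(partials):
    """Work done on the leader: merge the per-slice partial aggregates."""
    final = defaultdict(int)
    for partial in partials:
        for country, subtotal in partial.items():
            final[country] += subtotal
    return dict(final)

orders = [
    {"id": 1, "country": "US", "price": 10},
    {"id": 2, "country": "DE", "price": 25},
    {"id": 3, "country": "US", "price": 30},
    {"id": 4, "country": "IN", "price": 15},
]

slices = distribute_round_robin(orders, num_slices=3)
result = leader_combine(partial_sum(s) for s in slices)
```

Each call to `partial_sum` depends only on its own slice, which is why the real system can run those steps on different nodes at the same time.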

### Columnar Storage Benefits

Traditional row-based storage stores data like:

```text
Row 1: [id, name, country, price, date]
Row 2: [id, name, country, price, date]
```

Columnar storage stores data like:

```text
Column id: [val1, val2, val3, ...]
Column name: [val1, val2, val3, ...]
Column country: [val1, val2, val3, ...]
Column price: [val1, val2, val3, ...]
```

**Advantages for Analytics:**

1. **Reduced I/O**: Only read columns needed for the query
2. **Better compression**: Similar data types compress more efficiently
3. **Faster aggregations**: Process entire columns at once
4. **Cache efficiency**: Better CPU cache utilization
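
To make the first two advantages concrete, the sketch below pivots row-shaped records into per-column lists and applies run-length encoding to one column. Run-length encoding is used here only as a simple stand-in for a column compression scheme; the helper names and sample data are hypothetical.

```python
from itertools import groupby

def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

rows = [
    {"id": 1, "country": "DE", "price": 25},
    {"id": 2, "country": "DE", "price": 5},
    {"id": 3, "country": "US", "price": 10},
    {"id": 4, "country": "US", "price": 30},
    {"id": 5, "country": "US", "price": 15},
]

columns = to_columns(rows)
total_price = sum(columns["price"])            # reads one column, ignores the rest
encoded = run_length_encode(columns["country"])  # repeated values compress well
```

An aggregate over `price` never touches `id` or `country`, and a column with many repeated values shrinks to a handful of pairs.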

## Redshift Spectrum

Redshift Spectrum extends Redshift's querying capabilities to data stored in Amazon S3 without loading it into the cluster.

### Key Features

1. **Query exabytes of structured and semi-structured data in S3**: Access massive datasets directly
2. **No data loading required**: Query data where it lives in S3
3. **Limitless concurrency**: Scale query processing independently
4. **Horizontal scaling**: Add processing capacity as needed
5. **Separate storage and compute**: Scale each resource independently
6. **Wide variety of data formats**: Support for:
   - CSV
   - JSON
   - Parquet
   - ORC
   - Avro
7. **Compression support**: Gzip and Snappy compression

### Architecture

```text
Redshift Cluster → Spectrum Layer → Amazon S3
```

Spectrum nodes handle the processing of S3 data separately from the main cluster, allowing you to analyze data in S3 without moving it.

## Durability and Scaling

### Data Durability

**1. Replication within Cluster**

- Data is automatically replicated across nodes within the cluster
- Provides redundancy in case of node failure
- Enables quick recovery

**2. Backup to S3**

- Automated backups stored in Amazon S3
- Backups can optionally be copied asynchronously to another AWS Region (cross-region snapshot copy must be enabled)
- Provides disaster recovery capabilities
- When enabled, cross-region copies protect against regional failures

**3. Automated Snapshots for Disaster Recovery**

- Automatic snapshots taken periodically
- Manual snapshots can be created on demand
- Snapshots stored in S3 for durability
- Can restore cluster from any snapshot

### Scaling Options

**1. Vertical Scaling**

- Increase or decrease node size (compute power and memory)
- Change node types (e.g., from dc2.large to dc2.8xlarge)

**2. Horizontal Scaling**

- Add or remove compute nodes
- Increase storage capacity and processing power
- Distribute workload across more nodes

**Scaling Process:**

When a resize copies data to a new cluster (a classic resize), the process is:

1. New cluster is provisioned with desired configuration
2. Data is copied to new cluster (the original cluster is read-only during the copy)
3. CNAME (DNS record) is flipped to point to new cluster
4. Old cluster is terminated

Downtime is limited to the brief cutover in step 3.

## Data Distribution Strategies

Data distribution determines how data is distributed across compute nodes and slices. Choosing the right distribution strategy is critical for performance.

### Distribution Styles

**1. AUTO Distribution**

- Redshift automatically determines the best distribution strategy
- Based on table size and query patterns
- Good default choice when unsure
- May change over time as data grows

**2. EVEN Distribution**

- Rows distributed across slices in round-robin fashion
- Each slice gets approximately the same number of rows
- Good for tables not joined with other tables
- Simple and balanced distribution

**When to use EVEN:**

- Tables that are not frequently joined
- Tables with no clear join column to use as a distribution key
- Temporary staging tables

**3. KEY Distribution**

- Rows distributed based on values in one column (distribution key)
- Rows with the same key value are stored on the same slice
- Enables collocated joins (joins without data movement)
- Critical for large fact tables

**When to use KEY:**

- Large fact tables joined with dimension tables
- Tables frequently joined on a specific column
- Example: Distribute both `orders` and `order_items` by `order_id`

**4. ALL Distribution**

- Entire table is copied to every compute node
- Each node has a complete copy of the table
- Typically used for small dimension tables
- Eliminates need to broadcast data during joins

**When to use ALL:**

- Small dimension tables (under a few million rows)
- Tables frequently joined with large tables
- Read-heavy tables where storage duplication is acceptable
- Example: Country codes, product categories

### Visual Comparison

**Distribution Key (KEY):**

- Data distributed by key value to specific locations
- Same keys go to same slices
- Enables efficient joins

**ALL Distribution:**

- Complete table copy on every node
- No data movement needed for joins
- Higher storage cost

**EVEN Distribution:**

- Round-robin distribution
- Balanced across all slices
- No optimization for joins
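
The three styles can be compared with a small simulation. This is a sketch under stated assumptions: `zlib.crc32` stands in for whatever hash Redshift uses internally, and the function names and sample tables are invented for the example.

```python
import zlib

def slice_for_key(value, num_slices):
    """Map a distribution key value to a slice (crc32 is an assumed stand-in hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_slices

def distribute_key(rows, key, num_slices):
    """KEY style: rows with the same key value always land on the same slice."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[slice_for_key(row[key], num_slices)].append(row)
    return slices

def distribute_even(rows, num_slices):
    """EVEN style: round-robin placement, balanced but blind to join columns."""
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)
    return slices

def distribute_all(rows, num_nodes):
    """ALL style: every node receives a full copy of the table."""
    return [list(rows) for _ in range(num_nodes)]

orders = [{"order_id": i} for i in range(1, 9)]
order_items = [{"order_id": i, "line": n} for i in range(1, 9) for n in (1, 2)]

orders_by_key = distribute_key(orders, "order_id", 4)
items_by_key = distribute_key(order_items, "order_id", 4)
```

With both tables distributed on `order_id`, every order item sits on the same slice as its parent order, so a join on that column needs no data movement between slices.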

## Sort Keys

Sort keys determine the physical order in which data is stored in Redshift. Properly chosen sort keys dramatically improve query performance.

### Compound Sort Key

A compound sort key consists of multiple columns defined in a specific order. Data is sorted by the first column, then by the second column within each value of the first column, and so on.

**Syntax:**

```text
SORTKEY(customer_id, country_id)
```

**How it works:**

1. Data is first sorted by `customer_id`
2. Within each `customer_id`, data is sorted by `country_id`
3. Order matters: First column has highest priority

**When to use:**

- Queries frequently filter on the first column
- Range queries on the first column
- Queries that filter on multiple columns in order
- Example: `WHERE customer_id = 123 AND country_id = 'US'`

**Best practices:**

- Put most frequently filtered column first
- Use date columns if queries often filter by date ranges
- Include columns that appear frequently in ORDER BY clauses

**Example query that benefits:**

```text
SELECT customer_id, country_id, COUNT(*) AS order_count
FROM orders
WHERE customer_id = 123
  AND country_id = 'US'
GROUP BY customer_id, country_id;
```
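
Why column order matters can be seen with a sorted Python list standing in for sorted storage. The tuple layout `(customer_id, country_id, amount)` and the helper names are assumptions for this sketch; Redshift's actual block-skipping mechanics are not modeled.

```python
from bisect import bisect_left

def find_customer(sorted_rows, customer_id):
    """Leading-column filter: binary search finds one contiguous run of rows."""
    lo = bisect_left(sorted_rows, (customer_id,))
    hi = bisect_left(sorted_rows, (customer_id + 1,))
    return sorted_rows[lo:hi]

def find_country(sorted_rows, country_id):
    """Second-column filter alone: matches are scattered, so every row is checked."""
    return [row for row in sorted_rows if row[1] == country_id]

# Sorting tuples orders by customer_id first, then country_id: a compound order.
orders = sorted([
    (124, "US", 5),
    (123, "US", 10),
    (123, "DE", 7),
    (122, "IN", 3),
])

customer_rows = find_customer(orders, 123)
us_rows = find_country(orders, "US")
```

Rows for customer 123 are adjacent and can be located without reading the rest, while rows for country `'US'` are spread across customers and require a full pass.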

### Interleaved Sort Key

An interleaved sort key includes multiple columns where all columns are treated with equal importance. Redshift interleaves the data based on all columns simultaneously.

**Syntax:**

```text
INTERLEAVED SORTKEY(customer_id, country_id, order_date)
```

**How it works:**

- All columns in the sort key are weighted equally
- No single column dominates the sort order, so a filter on any one of the sort key columns can still skip irrelevant data
- Creates a balanced distribution suitable for multiple query patterns

**When to use:**

- Queries filter on different combinations of columns
- No single column is always used in queries
- Wide range of filter conditions
- Multiple query patterns with different predicates

**Trade-offs:**

- More expensive to maintain than compound sort keys
- VACUUM operations take longer
- Best for read-heavy workloads with diverse query patterns

**Example queries that benefit:**

```text
-- Query 1: Filter by customer
SELECT * FROM orders WHERE customer_id = 123

-- Query 2: Filter by country
SELECT * FROM orders WHERE country_id = 'US'

-- Query 3: Filter by date
SELECT * FROM orders WHERE order_date > '2024-01-01'
```

With interleaved sort keys, all three queries perform well.

### Compound vs Interleaved Comparison

| Aspect | Compound Sort Key | Interleaved Sort Key |
|--------|------------------|---------------------|
| Column priority | First column most important | All columns equally important |
| Query patterns | Predictable, consistent filters | Varied, unpredictable filters |
| Maintenance cost | Lower | Higher |
| VACUUM time | Faster | Slower |
| Best for | Single access pattern | Multiple access patterns |

## Data Import and Export

### COPY Command

The COPY command is the primary and most efficient way to load data into Redshift.

**Key Features:**

1. **Parallelized loading**: Distributes load across all nodes
2. **Highly efficient**: Optimized for bulk data loading
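3. **Compression support**: Loads compressed text files (gzip, bzip2, and others) when the matching option is specified; columnar formats such as Parquet and ORC carry their own compression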
4. **Error handling**: Can skip errors and log bad records

**Data Sources:**

- **Amazon S3**: Most common source
- **Amazon EMR**: Load from Hadoop clusters
- **Amazon DynamoDB**: Import NoSQL data
- **Remote hosts**: SSH connection to external servers

**S3-Specific Requirements:**

- Requires IAM Role with appropriate S3 permissions
- Manifest file recommended for large-scale loads
- Can load from multiple files in parallel

**Example COPY command:**

```text
COPY table_name
FROM 's3://bucket-name/prefix/'
IAM_ROLE 'arn:aws:iam::account-id:role/role-name'
FORMAT AS PARQUET;
```

**Best Practices:**

- Split data into multiple files for parallel loading
- Use compressed files (gzip, bzip2)
- Use columnar formats (Parquet, ORC) when possible
- Specify proper delimiters and escape characters

### INSERT INTO ... SELECT

Load data from existing Redshift tables or query results.

**Characteristics:**

- Useful for transforming data already in Redshift
- Less efficient than COPY for large datasets
- Good for small data movements or transformations

**Example:**

```text
INSERT INTO target_table (col1, col2, col3)
SELECT col1, col2, col3
FROM source_table
WHERE condition;
```

### UNLOAD Command

Export data from Redshift to Amazon S3.

**Key Features:**

1. **Parallel unload**: Uses all nodes for fast export
2. **Compression support**: Can compress output files
3. **Encryption**: Supports encryption at rest
4. **Partitioning**: Split data into multiple files

**Example:**

```text
UNLOAD ('SELECT * FROM table_name')
TO 's3://bucket-name/prefix/'
IAM_ROLE 'arn:aws:iam::account-id:role/role-name'
PARALLEL ON
GZIP;
```

**Use Cases:**

- Archiving data
- Sharing data with other systems
- Creating backups
- Exporting for machine learning pipelines

### Auto-Copy from Amazon S3

Automatically load new files as they arrive in S3.

**Key Features:**

- Event-driven ingestion
- No manual intervention required
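- Commonly used with continuous file-based ingestion pipelines
- Available natively as a Redshift COPY job; the same effect can also be built with an S3 event that triggers AWS Lambda

**Typical Architecture (Lambda-triggered variant):**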

```text
Data Source → S3 → S3 Event → Lambda → Redshift COPY
```

### Amazon Aurora Zero-ETL Integration

Seamlessly replicate data from Aurora to Redshift without custom ETL code.

**Data Flow:**

```text
Aurora → Zero-ETL Integration (managed replication) → Redshift
```

**Alternative Path (export-based, not zero-ETL):**

```text
Aurora → S3 → Glue Crawler → Glue Data Catalog → Athena
```

**Benefits:**

- No ETL code to maintain
- Near real-time data availability
- Automatic schema synchronization
- Reduced operational overhead

### Redshift Streaming Ingestion

Ingest data in near real-time from streaming sources.

**Supported Sources:**

1. **Amazon Kinesis Data Streams**: High-throughput data streaming
2. **Amazon MSK (Managed Streaming for Apache Kafka)**: Kafka-compatible streaming

**Use Cases:**

- Real-time analytics
- IoT data processing
- Log analysis
- Clickstream analysis

**Architecture:**

```text
Application → Kinesis/MSK → Redshift Streaming Ingestion → Redshift Tables
```

## DBLink - PostgreSQL Connection

`dblink` and `postgres_fdw` are PostgreSQL extensions, so they are installed on the PostgreSQL side (for example Amazon RDS for PostgreSQL), which then connects to Redshift over its PostgreSQL-compatible protocol. Redshift itself does not support `CREATE EXTENSION`; to query PostgreSQL from inside Redshift, use Redshift federated queries instead.

**Common Use Cases:**

1. **Incremental data sync**: Sync only changed data
2. **Small-to-medium reference tables**: Copy dimension tables
3. **Cross-database querying**: Query Redshift from PostgreSQL

### Setup Steps

**1. Enable Required Extensions (on the PostgreSQL instance):**

```text
CREATE EXTENSION postgres_fdw;
CREATE EXTENSION dblink;
```

**2. Create Foreign Server (pointing at the Redshift endpoint):**

```text
CREATE SERVER foreign_server
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (
    host '<redshift_endpoint>',
    port '<port>',
    dbname '<database_name>',
    sslmode 'require'
);
```

**3. Create User Mapping:**

```text
CREATE USER MAPPING FOR <postgres_user>
SERVER foreign_server
OPTIONS (
    user '<redshift_username>',
    password '<password>'
);
```

**4. Query Remote Data:**

```text
SELECT * FROM dblink(
    'foreign_server',
    'SELECT * FROM remote_table'
) AS remote_data(col1 int, col2 text);
```

**Use Cases:**

- Syncing reference data from operational databases
- Validating data between systems
- Temporary data access without full ETL

## VACUUM

VACUUM is a maintenance operation that reclaims disk space and reorganizes data for better performance.

### Purpose

1. **Reclaim disk space**: Remove space from deleted or updated rows
2. **Reorganize data**: Improve physical data layout
3. **Prepare for fresh statistics**: VACUUM does not update statistics itself; run `ANALYZE` afterwards to refresh table metadata for query optimization
4. **Improve query performance**: Maintain optimal data organization

### VACUUM Types

**1. VACUUM FULL**

- Most comprehensive vacuum operation
- Reclaims space and resorts data
- Can be time-consuming
- Tables remain available for queries and writes while the vacuum runs, although concurrent operations may slow down

**Syntax:**

```text
VACUUM FULL table_name;
```

**2. VACUUM DELETE ONLY**

- Only reclaims space from deleted rows
- Does not resort data
- Faster than VACUUM FULL
- Good for tables with frequent deletes

**Syntax:**

```text
VACUUM DELETE ONLY table_name;
```

**3. VACUUM SORT ONLY**

- Only resorts data according to sort key
- Does not reclaim space
- Improves query performance
- Good after bulk loads

**Syntax:**

```text
VACUUM SORT ONLY table_name;
```

**4. VACUUM REINDEX**

- Rebuilds interleaved sort key indexes
- Necessary for tables with interleaved sort keys
- Maintains query performance over time

**Syntax:**

```text
VACUUM REINDEX table_name;
```

### Best Practices

- Run VACUUM during low-traffic periods
- Use VACUUM DELETE ONLY for frequent delete operations
- Run VACUUM SORT ONLY after large data loads
- Schedule automatic vacuum operations
- Monitor vacuum progress and completion

## When NOT to Use Redshift

### 1. Small Datasets

**Issue**: Redshift is optimized for large-scale analytics

**Better Alternative**: Amazon RDS (PostgreSQL, MySQL, etc.)

**Reasoning:**

- Redshift has overhead for cluster management
- RDS is simpler and more cost-effective for small data
- RDS provides better performance for small datasets

### 2. OLTP Workloads

**Issue**: Redshift is designed for OLAP, not OLTP

**Better Alternatives**:

- Amazon RDS for relational OLTP
- Amazon DynamoDB for NoSQL OLTP

**Reasoning:**

- Redshift lacks optimizations for transactional workloads
- Not designed for high-frequency individual updates
- Better suited for batch processing and analytics

### 3. Unstructured Data

**Issue**: Redshift requires structured or semi-structured data

**Better Approach**: Perform ETL first

**ETL Tools**:

- Amazon EMR
- Apache Spark
- Apache Airflow

**Process:**

```text
Raw Unstructured Data → ETL (EMR/Spark) → Structured Data → Redshift
```

### 4. BLOB (Binary Large Object) Data

**Issue**: Storing large binary files in Redshift is inefficient

**Better Approach**: Store in S3, reference in Redshift

**Recommended Pattern:**

- Store files in Amazon S3
- Store only metadata in Redshift (file paths, IDs, descriptions)
- Query metadata in Redshift, retrieve files from S3

**Example Schema:**

```text
CREATE TABLE documents (
    document_id INT,
    document_name VARCHAR(255),
    s3_path VARCHAR(500),  -- S3 location
    file_size BIGINT,
    upload_date DATE
);
```

## Redshift Serverless

Redshift Serverless eliminates the need to manage clusters manually.

### Pricing Model

**Redshift Processing Units (RPUs)**:

- Unit of compute capacity in Redshift Serverless
- Billing based on RPU-hours
- Charged per second of usage
- Plus storage costs
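
The compute part of the bill follows directly from the RPU-hour model. The sketch below shows only the arithmetic; the price per RPU-hour is a made-up placeholder, not an AWS price, and storage charges and any minimum billing period are deliberately left out.

```python
def rpu_hours(rpus, seconds):
    """Convert a workload running at `rpus` for `seconds` into RPU-hours."""
    return rpus * seconds / 3600

def compute_cost(rpus, seconds, price_per_rpu_hour):
    """Compute charge for one workload; price_per_rpu_hour is supplied by the caller."""
    return rpu_hours(rpus, seconds) * price_per_rpu_hour

# Hypothetical example: 32 RPUs busy for 30 minutes at a placeholder price of 0.5.
example_hours = rpu_hours(32, 1800)
example_cost = compute_cost(32, 1800, price_per_rpu_hour=0.5)
```

Because billing is per second, halving the run time at the same capacity halves the compute charge.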

### Base RPUs Configuration

**Capacity Range**: 8 to 512 RPUs, in increments of 8 (older documentation lists a 32 RPU minimum)

**Factors to Consider:**

1. **Query Complexity**: More complex queries need more RPUs
2. **Concurrency Requirements**: More concurrent users need more RPUs
3. **Performance Requirements**: Lower latency needs more RPUs
4. **Cost Optimization**: Balance performance with cost

**Scaling Behavior:**

- Automatically scales up during high demand
- Scales down during low demand
- Base RPUs set the minimum capacity
- Maximum capacity can be configured

**Benefits:**

- No cluster management
- Automatic scaling
- Pay only for what you use
- Simplified operations

# Amazon DynamoDB

## Overview

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.

### Key Characteristics

1. **Serverless**: No servers to provision or manage
2. **Non-relational**: NoSQL database model
3. **Fully distributed**: Data replicated across multiple AZs
4. **Low cost**: Pay-per-request pricing with automatic scaling
5. **High performance**: Single-digit millisecond latency
6. **Flexible schema**: No rigid schema requirements

### Table Classes

**1. Standard Table Class**

- For frequently accessed data
- Lower per-request (read/write) costs than the Infrequent Access class; performance is the same for both classes
- Standard pricing

**2. Infrequent Access (IA) Table Class**

- For rarely accessed data
- Lower storage costs
- Slightly higher per-request costs
- Cost-effective for archival data

## Data Model

### Database Structure

DynamoDB has a flat structure: **Account/Region → Table** (no database or schema layer)

**Key Differences from Relational Databases:**

- No database/schema hierarchy like RDS
- Tables exist at the top level
- No foreign keys or relationships enforced by DynamoDB
- Schema-on-read rather than schema-on-write

### Table Components

**Tables:**

- Collection of items
- Each table requires a primary key definition
- The number of tables per account is capped by a default per-Region quota, which can be raised on request

**Items (Rows):**

- Individual records in a table
- Equivalent to rows in relational databases
- Each table can have unlimited number of items
- Maximum size per item: 400 KB

**Attributes (Columns):**

- Individual data fields within an item
- Equivalent to columns in relational databases
- Each item can have different attributes (flexible schema)
- Only primary key attributes are required

### Data Types

**1. Scalar Types (Single Value)**

- **String**: Text data (UTF-8 encoded)
- **Number**: Numeric data (integers, decimals)
- **Binary**: Binary data (images, compressed data)
- **Boolean**: True or false
- **Null**: Represents absence of value

**2. Document Types (Nested Structures)**

- **List**: Ordered collection of values (like arrays)
  - Example: `["apple", "banana", "orange"]`
- **Map**: Unordered collection of key-value pairs (like JSON objects)
  - Example: `{"name": "John", "age": 30}`

**3. Set Types (Unordered Collections)**

- **String Set**: Collection of unique strings
  - Example: `{"red", "blue", "green"}`
- **Number Set**: Collection of unique numbers
  - Example: `{1, 2, 3, 5, 8}`
- **Binary Set**: Collection of unique binary values

### Example Item Structure

```text
{
    "user_id": "123",              // String (Partition Key)
    "username": "john_doe",        // String
    "age": 30,                     // Number
    "email": "john@example.com",   // String
    "active": true,                // Boolean
    "tags": ["developer", "aws"],  // List
    "preferences": {               // Map
        "theme": "dark",
        "language": "en"
    },
    "skills": {"Python", "SQL"}    // String Set
}
```

## Primary Keys

Primary keys uniquely identify items in a DynamoDB table. There are two types of primary keys.

### 1. Partition Key (HASH Key)

A simple primary key consisting of a single attribute.

**Characteristics:**

- Single attribute serves as unique identifier
- Every item must have a unique partition key value
- No two items can have the same partition key
- DynamoDB uses partition key to determine physical storage location

**Example:**

```text
Table: Users
Partition Key: user_id

Items:
- user_id: "123" (unique)
- user_id: "456" (unique)
- user_id: "789" (unique)
```

**When to use:**

- Simple access patterns (get item by ID)
- Items naturally have a unique identifier
- No need to query ranges of items

### 2. Composite Primary Key (Partition Key + Sort Key)

A primary key composed of two attributes: partition key and sort key.

**Characteristics:**

- Partition key determines which partition stores the item
- Sort key determines order within the partition
- Multiple items can share the same partition key
- Items with same partition key must have different sort key values
- Enables range queries on sort key

**Example:**

```text
Table: Orders
Partition Key: customer_id
Sort Key: order_date

Items:
- customer_id: "123", order_date: "2024-01-01"
- customer_id: "123", order_date: "2024-01-15"
- customer_id: "456", order_date: "2024-01-10"
```

**When to use:**

- One-to-many relationships (one customer, many orders)
- Need to query ranges (all orders for a customer in a date range)
- Need to sort results by a specific attribute
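
The behaviour of a composite key can be modeled in a few lines of Python. `ToyTable` is a hypothetical in-memory model, not the DynamoDB API: it only illustrates that items are grouped by partition key, ordered by sort key, and can be fetched by sort-key range.

```python
from bisect import bisect_left, bisect_right

class ToyTable:
    """In-memory model of a table with a partition key and a sort key."""

    def __init__(self):
        self._partitions = {}  # partition key -> list of (sort_key, item), kept sorted

    def put_item(self, partition_key, sort_key, item):
        """Insert an item; an item with the same full key is overwritten."""
        entries = self._partitions.setdefault(partition_key, [])
        keys = [k for k, _ in entries]
        i = bisect_left(keys, sort_key)
        if i < len(entries) and entries[i][0] == sort_key:
            entries[i] = (sort_key, item)
        else:
            entries.insert(i, (sort_key, item))

    def query(self, partition_key, sort_from, sort_to):
        """Return items in one partition whose sort key is within [sort_from, sort_to]."""
        entries = self._partitions.get(partition_key, [])
        keys = [k for k, _ in entries]
        lo = bisect_left(keys, sort_from)
        hi = bisect_right(keys, sort_to)
        return [item for _, item in entries[lo:hi]]

table = ToyTable()
table.put_item("123", "2024-01-15", {"total": 40})
table.put_item("123", "2024-01-01", {"total": 25})
table.put_item("456", "2024-01-10", {"total": 99})
table.put_item("123", "2024-02-03", {"total": 12})

january_orders = table.query("123", "2024-01-01", "2024-01-31")
```

The query touches only the partition for customer `"123"` and returns that customer's January orders already in date order.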

### Primary Key Design Considerations

**Partition Key Selection:**

- Choose high-cardinality attributes (many unique values)
- Ensure even distribution of data
- Avoid hot partitions (one partition getting too much traffic)
- Consider access patterns

**Sort Key Selection:**

- Choose attributes frequently used in range queries
- Consider how data should be ordered
- Often timestamps, dates, or sequential IDs

## Partitioning

DynamoDB automatically partitions data across multiple physical storage locations based on the partition key.

### How Partitioning Works

**Hash Function:**

1. DynamoDB applies a hash function to the partition key
2. Hash determines which partition stores the item
3. Hash ensures even distribution across partitions

**Partition Distribution:**

```text
Item with partition key "user_123"
    ↓ (hash function)
Partition 2

Item with partition key "user_456"
    ↓ (hash function)
Partition 1

Item with partition key "user_789"
    ↓ (hash function)
Partition 3
```
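
A rough sketch of hash-based placement follows. DynamoDB's real hash function and partition count are internal details, so `hashlib.md5` and the fixed `num_partitions` below are assumptions chosen only to show the idea: the same key always maps to the same partition, and many distinct keys spread out evenly.

```python
import hashlib
from collections import Counter

def partition_for(partition_key, num_partitions):
    """Map a partition key to a partition index via a stable hash (md5 is a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Spread 3000 hypothetical user keys over 3 partitions and count items per partition.
counts = Counter(partition_for(f"user_{i}", 3) for i in range(3000))
```

With a high-cardinality key such as a user ID, each partition ends up holding roughly a third of the items.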

### Hot Partitions

A **hot partition** occurs when one partition receives disproportionately high traffic compared to others.

**Causes:**

1. **Poor partition key design**: Using low-cardinality attributes
   - Example: Using "country" as partition key when 90% of users are from one country
2. **Time-series data without proper design**: All recent data goes to same partition
   - Example: Using current date as partition key for real-time data
3. **Celebrity/popular item problem**: One item accessed much more than others
   - Example: Viral post in social media application

**Problems:**

- Performance degradation
- Request throttling
- Uneven resource utilization
- Increased latency

**Solutions:**

1. **Use high-cardinality partition keys**: Ensure many unique values
2. **Add randomness**: Append random suffix to partition key
3. **Composite keys**: Combine multiple attributes
4. **Write sharding**: Distribute writes across multiple partition key values
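
Solutions 2 and 4 are often combined by appending a shard suffix to a hot key. The sketch below assumes a `#` separator and a caller-chosen shard count; both are illustrative conventions, not DynamoDB requirements.

```python
import random

def sharded_key(base_key, num_shards, rng=random):
    """Pick one of num_shards suffixed keys at random for a write."""
    return f"{base_key}#{rng.randrange(num_shards)}"

def all_shard_keys(base_key, num_shards):
    """List every suffixed key; a reader must query all of them and merge results."""
    return [f"{base_key}#{i}" for i in range(num_shards)]

write_key = sharded_key("2024-01-15", 4)
read_keys = all_shard_keys("2024-01-15", 4)
```

The trade-off is on the read side: writes for one logical key spread across four partition key values, but reading that key back means querying all four and combining the results.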

## When NOT to Use DynamoDB

### 1. Pre-written Applications Tied to Relational Databases

**Issue**: Application designed for SQL and relational model

**Better Alternative**: Amazon RDS

**Reasoning:**

- Application expects SQL query language
- Relies on joins, foreign keys, transactions
- Schema already defined in relational model
- Migration cost too high

### 2. Complex Joins

**Issue**: DynamoDB does not support joins between tables

**Better Alternative**: Amazon RDS

**How DynamoDB Handles Relationships:**

- Denormalize data (embed related data in single item)
- Multiple queries (application-level joins)
- Use adjacency list pattern
- Store related data in single table

**When RDS is Better:**

- Many-to-many relationships
- Complex multi-table joins
- Normalized data model required
- Ad-hoc analytical queries

### 3. Complex Transactions

**Issue**: Limited transaction support

**Better Alternative**: Amazon RDS

**DynamoDB Transaction Limitations:**

- Maximum 100 items per transaction
- All items must be in same region
- Performance impact for large transactions
- Additional cost per transaction

**When RDS is Better:**

- Multi-step workflows requiring ACID
- Complex business logic in transactions
- Need for isolation levels
- Rollback requirements across many tables

### 4. BLOB Data

**Issue**: 400 KB item size limit

**Better Solution**: Store in S3, metadata in DynamoDB

**Recommended Pattern:**

```text
DynamoDB Table: Files
- file_id (partition key)
- file_name
- s3_bucket
- s3_key
- file_size
- upload_date

Actual file stored in S3: s3://bucket/path/to/file
```

### 5. Large Data with Low I/O Rate

**Issue**: DynamoDB optimized for high-throughput access

**Better Alternative**: Amazon S3

**When to Use S3:**

- Large files (videos, backups, archives)
- Infrequent access
- Data lake scenarios
- Long-term archival

**Cost Comparison:**

- S3: Much cheaper storage, but higher latency
- DynamoDB: More expensive storage, low latency access

## Capacity Units

DynamoDB uses capacity units to measure and bill for throughput.

### Write Capacity Unit (WCU)

**Definition**: One write per second for an item up to 1 KB in size

**Calculation:**

- Item size ≤ 1 KB = 1 WCU
- Item size > 1 KB = Round up to the next whole KB, 1 WCU per KB

**Examples:**

**Example 1: Write 10 items per second, item size 2 KB**

```text
Item size = 2 KB → 2 WCUs per item
10 items/second × 2 WCUs = 20 WCUs required
```

**Example 2: Write 6 items per second, item size 4.5 KB**

```text
Item size = 4.5 KB → Round up to 5 KB → 5 WCUs per item
6 items/second × 5 WCUs = 30 WCUs required
```
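
The same arithmetic as a small Python helper. This is a sketch for standard (non-transactional) writes only, and the function name is hypothetical.

```python
import math

def wcus_required(writes_per_second, item_size_kb):
    """WCUs for standard writes: 1 WCU per KB per write, rounded up per item."""
    wcus_per_item = max(1, math.ceil(item_size_kb))
    return writes_per_second * wcus_per_item

print(wcus_required(10, 2))    # 20
print(wcus_required(6, 4.5))   # 30
```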

### Read Capacity Unit (RCU)

**Definition**: For an item up to 4 KB in size, one RCU represents:

- 1 strongly consistent read per second, OR
- 2 eventually consistent reads per second

**Read Consistency:**

**1. Strongly Consistent Read**

- Returns most up-to-date data
- Reflects all successful writes
- Higher RCU cost (1 RCU per 4 KB)
- Slightly higher latency

**2. Eventually Consistent Read (Default)**

- May return stale data
- Eventually reflects all writes
- Lower RCU cost (0.5 RCU per 4 KB)
- Better performance

**Calculation:**

For **Strongly Consistent Reads**:

- Item size ≤ 4 KB = 1 RCU
- Item size > 4 KB = Round up to the next multiple of 4 KB, 1 RCU per 4 KB

For **Eventually Consistent Reads**:

- Item size ≤ 4 KB = 0.5 RCU
- Item size > 4 KB = Round up to the next multiple of 4 KB, count 1 RCU per 4 KB, then divide by 2

**Examples:**

**Example 1: Read 10 items per second, item size 3 KB, strongly consistent**

```text
Item size = 3 KB → 1 RCU per item
10 items/second × 1 RCU = 10 RCUs required
```

**Example 2: Read 10 items per second, item size 3 KB, eventually consistent**

```text
Item size = 3 KB → 0.5 RCU per item
10 items/second × 0.5 RCU = 5 RCUs required
```

**Example 3: Read 5 items per second, item size 10 KB, strongly consistent**

```text
Item size = 10 KB → Round up to 12 KB (nearest 4 KB) → 3 RCUs per item
5 items/second × 3 RCUs = 15 RCUs required
```
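
The three examples can be checked with a short Python helper. This is a sketch with a hypothetical function name; it rounds the eventually consistent total up to a whole RCU, which is an assumption about how capacity would be provisioned.

```python
import math

def rcus_required(reads_per_second, item_size_kb, strongly_consistent=True):
    """RCUs needed: 1 RCU per 4 KB per strongly consistent read, half for eventual."""
    units_per_read = max(1, math.ceil(item_size_kb / 4))
    total = reads_per_second * units_per_read
    if strongly_consistent:
        return total
    # Eventually consistent reads cost half; round up to a whole RCU.
    return math.ceil(total / 2)

print(rcus_required(10, 3))                              # 10
print(rcus_required(10, 3, strongly_consistent=False))   # 5
print(rcus_required(5, 10))                              # 15
```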

# Interview Questions and Coding Exercise

## Q1: When Should You NOT Use DynamoDB?

**Detailed Answer:**

You should NOT use DynamoDB in the following scenarios:

**1. Pre-written Applications Tied to Relational Databases**

When your application is already built around SQL queries, joins, and a relational schema, migrating to DynamoDB requires a complete application rewrite. Use RDS instead to maintain compatibility.

**2. Complex Joins Required**

If your data model requires frequent multi-table joins, DynamoDB's lack of join support becomes a significant limitation. While you can denormalize data or perform application-level joins, this adds complexity. RDS with SQL is better suited for such use cases.

**3. Complex Transactions**

DynamoDB supports transactions but with limitations (100 items max, single region, performance overhead). Applications requiring complex multi-step ACID transactions across many entities should use RDS.

**4. BLOB Data Storage**

DynamoDB has a 400 KB item size limit, making it unsuitable for storing large binary objects like images, videos, or documents. Instead, store BLOBs in S3 and keep only metadata (S3 paths, file names) in DynamoDB.

**5. Large Data with Low I/O Rate**

For infrequently accessed large datasets (archives, backups, data lakes), S3 provides much lower storage costs than DynamoDB. DynamoDB is optimized for high-throughput, low-latency access patterns.

**6. Ad-hoc Analytical Queries**

DynamoDB requires predefined access patterns and doesn't support flexible SQL queries. For business intelligence and ad-hoc analytics, use Redshift or Athena with S3.

## Q2: What Causes a Hot Partition in DynamoDB?

**Detailed Answer:**

A hot partition occurs when one partition receives significantly more traffic than others, causing performance bottlenecks.

**Primary Causes:**

**1. Poor Partition Key Design**

Using low-cardinality attributes that create uneven data distribution:

- **Example**: Using "country" as partition key when 80% of users are from one country
- **Example**: Using "status" (active/inactive) as partition key
- **Impact**: Most requests go to same partition, causing throttling

**2. Time-Series Data Without Proper Design**

Using current timestamp or date as partition key:

- **Example**: Partition key = current_date, all today's data goes to one partition
- **Example**: IoT sensor data with timestamp as partition key
- **Impact**: All writes concentrated on single partition

**3. Celebrity/Popular Item Problem**

One item receiving disproportionate access:

- **Example**: Viral social media post getting millions of views
- **Example**: Popular product during flash sale
- **Impact**: Single partition overwhelmed while others idle

**Solutions:**

**1. Use High-Cardinality Partition Keys**

Choose attributes with many unique values that distribute evenly:

```text
Good: user_id, order_id, transaction_id
Bad: country, status, category
```

**2. Add Write Sharding**

Append a random suffix to the partition key; reads must then query every suffix and merge the results:

```text
Original: date = "2024-01-01"
Sharded: date = "2024-01-01#0", "2024-01-01#1", "2024-01-01#2"
```
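
A minimal Python sketch of random write sharding. The shard count and function names are hypothetical; the point is that writes pick one suffix at random while reads have to fan out over all of them.

```python
import random

NUM_SHARDS = 3  # hypothetical shard count, matching the suffixes #0..#2 above

def sharded_write_key(date, num_shards=NUM_SHARDS):
    """Pick a random shard so writes for one date spread over several keys."""
    return f"{date}#{random.randrange(num_shards)}"

def all_shard_keys(date, num_shards=NUM_SHARDS):
    """Every key a reader must query to see all items for the date."""
    return [f"{date}#{n}" for n in range(num_shards)]

print(all_shard_keys("2024-01-01"))
# ['2024-01-01#0', '2024-01-01#1', '2024-01-01#2']
```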

**3. Use Composite Keys**

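Use a high-cardinality attribute as the partition key and move the time component into the sort key (only the partition key determines which partition an item lands on):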

```text
Partition Key: customer_id
Sort Key: timestamp
```

**4. Calculate Shard Number**

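Derive the suffix from a hash of another item attribute, so a reader who knows that attribute can recompute the shard instead of querying every shard:

```text
shard = hash(order_id) % num_shards
partition_key = f"{date}#{shard}"
```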
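
A minimal Python sketch of a calculated suffix, assuming the suffix is derived from an item attribute (a hypothetical order ID) and appended to a low-cardinality base key such as a date. It uses `hashlib` rather than the built-in `hash()`, because Python randomizes string hashes per process and the shard must be stable across writers and readers.

```python
import hashlib

def shard_for(value, num_shards):
    """Stable shard number for a string value (same result in every process)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def calculated_key(base_key, item_id, num_shards=10):
    """Partition key of the form '<base_key>#<shard>'."""
    return f"{base_key}#{shard_for(item_id, num_shards)}"

# A reader that knows the order ID recomputes the same key and queries one shard.
write_key = calculated_key("2024-01-01", "order-42")
read_key = calculated_key("2024-01-01", "order-42")
print(write_key == read_key)  # True
```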

## Q3: How Does DynamoDB Scale Automatically?

**Detailed Answer:**

DynamoDB provides two scaling modes: provisioned and on-demand.

**1. Provisioned Capacity with Auto Scaling**

**How it works:**

- Set target utilization (e.g., 70% of provisioned capacity)
- Application Auto Scaling monitors consumed capacity through CloudWatch
- When utilization stays above the target, provisioned capacity is increased automatically
- Scales down during low traffic periods

**Scaling Triggers:**

- CloudWatch alarms based on consumed capacity
- Application Auto Scaling adjusts capacity
- Gradual scaling to prevent sudden changes

**Configuration:**

```text
Minimum RCU: 5
Maximum RCU: 100
Target Utilization: 70%
```

**Example:**

```text
Current: 50 RCUs provisioned
Usage: 40 RCUs consumed (80% utilization)
Action: Scale up to roughly 58 RCUs (40 / 0.70 ≈ 57.1) to bring utilization back to about 70%
```
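
The target-tracking idea can be approximated in Python. This is a simplified model with a hypothetical function name: it assumes the new capacity is consumed capacity divided by the target utilization, clamped to the configured minimum and maximum. The real policy also applies cooldowns and alarm evaluation periods, so actual values can differ.

```python
import math

def target_capacity(consumed, target_utilization, minimum, maximum):
    """Capacity that would put utilization at the target, within min/max bounds."""
    desired = math.ceil(consumed / target_utilization)
    return max(minimum, min(maximum, desired))

# Numbers from the configuration and example above.
print(target_capacity(40, 0.70, 5, 100))  # 58
```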

**2. On-Demand Capacity Mode**

**How it works:**

- No capacity planning required
- DynamoDB scales automatically with traffic, within table and account quotas
- Accommodates workload spikes without any capacity changes on your side
- Pay per request rather than provisioned capacity

**Automatic Scaling:**

- Instantly accommodates up to double the table's previous peak traffic; throttling can occur if traffic exceeds double the previous peak within 30 minutes
- No throttling for gradual traffic increases
- Handles unpredictable workloads

**3. Partition Splitting**

**How it works:**

- DynamoDB automatically splits partitions when:
  - Storage exceeds 10 GB per partition
  - Throughput exceeds partition limits (about 3,000 RCUs or 1,000 WCUs per partition)
- New partitions created automatically
- Data redistributed across partitions
- Completely transparent to applications

**Partition Lifecycle:**

```text
Single Partition (10 GB reached)
    ↓
Split into 2 Partitions (5 GB each)
    ↓
Continue splitting as needed
```
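
A rough way to reason about partition counts is the rule of thumb below. It is only an estimate: DynamoDB does not expose partition counts, the function name is hypothetical, and the per-partition figures (10 GB, 3,000 RCUs, 1,000 WCUs) are assumptions taken from commonly cited limits.

```python
import math

def estimate_partitions(table_size_gb, rcus, wcus):
    """Estimate partitions as the larger of the size-based and throughput-based needs."""
    by_size = math.ceil(table_size_gb / 10)               # 10 GB per partition
    by_throughput = math.ceil(rcus / 3000 + wcus / 1000)  # per-partition throughput
    return max(1, by_size, by_throughput)

print(estimate_partitions(25, 1000, 500))   # 3 (driven by storage)
print(estimate_partitions(5, 6000, 2000))   # 4 (driven by throughput)
```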

**4. Adaptive Capacity**

**How it works:**

- Responds to uneven workload patterns
- Isolates frequently accessed items
- Boosts capacity for hot partitions
- Prevents throttling from hot partitions

**Benefits:**

- Handles temporary spikes
- Protects against hot key issues
- Automatic without configuration

## Q4: Difference Between RDS and DynamoDB?

**Comprehensive Comparison:**

### Data Model

**RDS:**

- Relational (SQL)
- Fixed schema with tables, rows, columns
- Relationships enforced via foreign keys
- Normalization encouraged

**DynamoDB:**

- Non-relational (NoSQL)
- Flexible schema
- No foreign keys
- Denormalization encouraged

### Query Language

**RDS:**

- SQL (Structured Query Language)
- Complex joins supported
- Rich query capabilities
- Aggregations, subqueries, window functions

**DynamoDB:**

- Key-value API
- No joins (application-level required)
- Limited query capabilities
- Query and Scan operations

### Scaling

**RDS:**

- Vertical scaling (larger instance)
- Read replicas for read scaling
- Manual intervention often required
- Limited by instance size

**DynamoDB:**

- Horizontal scaling (automatic partitioning)
- Virtually unlimited throughput and storage
- Automatic with no intervention
- Distributed architecture

### Transactions

**RDS:**

- Full ACID compliance
- Complex multi-table transactions
- Multiple isolation levels
- Rollback capabilities

**DynamoDB:**

- Limited transactions (up to 100 items)
- ACID for supported operations
- Single-region transactions
- Higher latency for transactions

### Use Cases

**RDS:**

- OLTP workloads
- Complex queries and joins
- Traditional applications
- Light reporting on transactional data (heavy BI and data warehousing belong in Redshift)

**DynamoDB:**

- High-scale web applications
- Gaming leaderboards
- IoT data storage
- Mobile backends
- Session management

### Performance

**RDS:**

- Millisecond to second latency
- Depends on query complexity
- Limited by instance resources

**DynamoDB:**

- Single-digit millisecond latency
- Consistent performance at scale
- Predictable performance

### Pricing

**RDS:**

- Pay for instance hours
- Storage separately charged
- Backup storage
- Data transfer

**DynamoDB:**

- Pay per request (on-demand)
- Or provisioned capacity
- Storage per GB
- Can be cost-effective at high scale for key-based access patterns

### Management

**RDS:**

- Managed service, but more operational choices remain with you
- Automated backups, patches, and upgrades run in windows you configure
- Performance tuning
- Replication configuration

**DynamoDB:**

- Fully managed serverless
- On-demand backups and optional point-in-time recovery
- No maintenance windows
- Hands-off operation

### Consistency

**RDS:**

- Strongly consistent by default
- Eventual consistency for read replicas

**DynamoDB:**

- Eventually consistent by default
- Strongly consistent reads available
- Choice per read operation

## Q5: Explain How the COPY Command Works in Redshift

**Detailed Answer:**

The COPY command is the most efficient method for loading large amounts of data into Amazon Redshift.

### How COPY Works

**1. Parallel Loading**

COPY distributes the workload across all compute nodes:

```text
S3 (Multiple Files)
    ↓
Leader Node (coordinates)
    ↓
Compute Node 1, Compute Node 2, Compute Node 3, ...
(Each node loads different files in parallel)
```

**Process:**

- Leader node identifies all files to load
- Distributes file list across compute nodes
- Each compute node reads and loads its assigned files
- All nodes work simultaneously

**2. Compressed Input Files**

- Compression is not detected automatically; name it in the COPY options
- Supports `GZIP`, `BZIP2`, `LZOP`, and `ZSTD` for text-based files
- Decompresses during load
- Parquet and ORC files carry their own internal compression

**3. Data Type Conversion**

- Automatically converts data types
- Validates data against table schema
- Handles NULL values
- Parses delimited files

**4. Error Handling**

- Can continue loading despite errors (`MAXERROR`)
- Logs failed records in the `stl_load_errors` system table
- Configurable error thresholds
- Allows investigation of bad data

### COPY Command Syntax

**Basic Syntax:**

```text
COPY table_name
FROM 's3://bucket/prefix/'
IAM_ROLE 'arn:aws:iam::account-id:role/role-name'
[FORMAT]
[OPTIONS];
```

**Common Options:**

```text
COPY orders
FROM 's3://my-bucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
FORMAT AS CSV
DELIMITER ','
IGNOREHEADER 1
REGION 'us-east-1'
MAXERROR 10;
```

### Data Sources

**1. Amazon S3:**

```text
COPY table_name
FROM 's3://bucket/prefix/'
IAM_ROLE 'role-arn'
FORMAT AS PARQUET;
```

**2. DynamoDB:**

```text
COPY table_name
FROM 'dynamodb://table-name'
IAM_ROLE 'role-arn'
READRATIO 50;
```

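**3. Remote Host (SSH):**

COPY reads an SSH manifest file stored in S3 that lists each host endpoint and the command to run on it:

```text
COPY table_name
FROM 's3://bucket/ssh_manifest'
IAM_ROLE 'role-arn'
SSH;
```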

### File Formats

**Supported Formats:**

- CSV (Comma-Separated Values)
- TSV (Tab-Separated Values)
- Fixed-width
- JSON
- Avro
- Parquet
- ORC

**Optimized Formats:**

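Parquet and ORC are usually good choices for large loads:

- Columnar, typed layout aligns with Redshift
- Built-in compression means fewer bytes read from S3
- Predicate pushdown applies when the files are queried in place through Redshift Spectrum, not during COPY
- Once loaded, query speed depends on the table design, not the source format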

### Manifest Files

A manifest file lists S3 objects to load, providing precise control.

**Why Use Manifest:**

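- Load specific files
- Load exactly the listed objects, rather than everything that matches a prefix
- Load files from different buckets or prefixes in one COPY
- Fail the load when an object marked `mandatory` is missing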

**Example Manifest:**

```text
{
  "entries": [
    {"url": "s3://bucket/data/file1.csv", "mandatory": true},
    {"url": "s3://bucket/data/file2.csv", "mandatory": true}
  ]
}
```

**COPY with Manifest:**

```text
COPY table_name
FROM 's3://bucket/manifest.json'
IAM_ROLE 'role-arn'
MANIFEST
FORMAT AS CSV;
```
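
A manifest in the format shown above can be generated with a few lines of Python. This is a minimal sketch with a hypothetical helper name; uploading the resulting JSON to S3 is left out.

```python
import json

def build_manifest(urls, mandatory=True):
    """Build a COPY manifest dict listing each S3 object to load."""
    return {"entries": [{"url": url, "mandatory": mandatory} for url in urls]}

manifest = build_manifest([
    "s3://bucket/data/file1.csv",
    "s3://bucket/data/file2.csv",
])
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```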

### Best Practices

**1. Split Data into Multiple Files**

- Enables parallel loading
- Optimal file size: 1-125 MB after compression
- Number of files = multiple of number of slices
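
One way to apply these guidelines is to pick the smallest multiple of the slice count that keeps each file under the size ceiling. The sketch below assumes evenly sized files and a hypothetical function name; the slice count comes from the cluster configuration.

```python
import math

def plan_file_count(total_compressed_mb, num_slices, max_file_mb=125):
    """Smallest multiple of num_slices that keeps each file at or below max_file_mb."""
    min_files = math.ceil(total_compressed_mb / max_file_mb)
    batches = max(1, math.ceil(min_files / num_slices))
    return batches * num_slices

files = plan_file_count(10_000, 16)
print(files, 10_000 / files)   # 80 125.0
print(plan_file_count(1_000, 16))  # 16
```

For very small datasets this can push files below 1 MB, in which case fewer files may be the better trade-off.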

**2. Compress Files**

- Reduces data transfer time
- Saves S3 storage costs
- Gzip provides good balance

**3. Use Columnar Formats**

- Parquet or ORC preferred
- Faster loads
- Better compression

**4. Sort Data Before Loading**

- Pre-sort by table's sort key
- Reduces need for VACUUM
- Improves query performance immediately

**5. Use Appropriate Distribution Key**

- Distribute data evenly
- Minimize data movement during queries
- Choose based on join patterns

**6. Error Handling**

```text
COPY table_name
FROM 's3://bucket/data/'
IAM_ROLE 'role-arn'
MAXERROR 100
ACCEPTINVCHARS;
```

**7. Monitor Loads**

Query the `stl_load_errors` system table for rows rejected by the most recent COPY in the session:

```text
SELECT * FROM stl_load_errors
WHERE query = pg_last_copy_id();
```

## Coding Exercise: Latest Event Per User

### Problem Statement

**Table Schema:**

```text
CREATE TABLE events (
    event_id INT,
    user_id INT,
    event_type VARCHAR(50),
    event_time TIMESTAMP
);
```

**Task:**

Write a SQL query to get the latest event for each user.

### Solution

**Method 1: Using Window Function (Preferred)**

```text
SELECT 
    event_id,
    user_id,
    event_type,
    event_time
FROM (
    SELECT 
        event_id,
        user_id,
        event_type,
        event_time,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) as rn
    FROM events
) ranked
WHERE rn = 1;
```

**How it works:**

1. `ROW_NUMBER()` assigns a sequential number to each row within each user's partition
2. `PARTITION BY user_id` creates separate row number sequences for each user
3. `ORDER BY event_time DESC` orders events newest first
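4. Filter `rn = 1` keeps only the latest event for each user; if two events share the same `event_time`, add a tiebreaker such as `event_id DESC` to the `ORDER BY` to make the choice deterministic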
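
The four steps can be mirrored in plain Python to see what the window function does. This is an illustration only: the sample rows are hypothetical, and timestamps are ISO strings so they sort correctly as text.

```python
from itertools import groupby
from operator import itemgetter

# (event_id, user_id, event_type, event_time) -- hypothetical sample rows
events = [
    (1, 10, "login", "2024-01-01 09:00:00"),
    (2, 10, "purchase", "2024-01-01 09:30:00"),
    (3, 20, "login", "2024-01-02 08:00:00"),
    (4, 20, "logout", "2024-01-02 08:45:00"),
    (5, 30, "login", "2024-01-03 12:00:00"),
]

def latest_event_per_user(rows):
    """Emulate ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) = 1."""
    by_user = sorted(rows, key=itemgetter(1))  # PARTITION BY user_id
    latest = []
    for _, group in groupby(by_user, key=itemgetter(1)):
        ranked = sorted(group, key=itemgetter(3), reverse=True)  # ORDER BY event_time DESC
        latest.append(ranked[0])  # WHERE rn = 1
    return latest

print([row[0] for row in latest_event_per_user(events)])  # [2, 4, 5]
```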

**Method 2: Using Subquery with MAX**

```text
SELECT e1.*
FROM events e1
INNER JOIN (
    SELECT 
        user_id,
        MAX(event_time) as max_time
    FROM events
    GROUP BY user_id
) e2 ON e1.user_id = e2.user_id 
    AND e1.event_time = e2.max_time;
```

**Method 3: Using Correlated Subquery**

```text
SELECT e1.*
FROM events e1
WHERE e1.event_time = (
    SELECT MAX(e2.event_time)
    FROM events e2
    WHERE e2.user_id = e1.user_id
);
```

**Note**: Methods 2 and 3 return every row that ties for a user's maximum `event_time`, so a user can appear more than once. Method 1 always returns exactly one row per user.

### Follow-up: Why Expensive in Redshift & Optimization

**Why This Query Can Be Expensive:**

**1. Full Table Scan**

- Query must read all rows from the `events` table
- No `WHERE` predicate, so no blocks can be skipped
- Scanning across all compute nodes

**2. Data Distribution Issues**

- If table not distributed by `user_id`, data shuffling required
- Events for same user may be on different nodes
- Network transfer between nodes

**3. Lack of Sort Key**

- If `event_time` or `user_id` not in sort key, no zone maps to skip blocks
- Can't eliminate blocks during scan
- More disk I/O required

**4. Window Function Overhead**

- `ROW_NUMBER()` requires sorting within each partition
- All user events must be collected and sorted
- Temporary storage needed

### Optimization Strategies

**1. Use Appropriate Distribution Key**

Distribute by `user_id` to colocate user events:

```text
CREATE TABLE events (
    event_id INT,
    user_id INT,
    event_type VARCHAR(50),
    event_time TIMESTAMP
)
DISTKEY(user_id)
SORTKEY(user_id, event_time);
```

**Benefits:**

- All events for a user on same node
- No data shuffling required
- Faster window function computation

**2. Add Compound Sort Key**

Sort by `user_id` and `event_time`:

```text
SORTKEY(user_id, event_time)
```

**Benefits:**

- Zone maps help skip irrelevant blocks
- Data physically ordered for better scans
- Window function benefits from pre-sorted data
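
Block skipping with zone maps can be illustrated with a toy Python model. It assumes each block records only the minimum and maximum `user_id` it contains; the class and function names are hypothetical, and real Redshift blocks track min/max per column.

```python
from dataclasses import dataclass

@dataclass
class Block:
    min_user_id: int
    max_user_id: int

def blocks_to_scan(blocks, user_id):
    """Keep only blocks whose min/max range could contain the requested user_id."""
    return [b for b in blocks if b.min_user_id <= user_id <= b.max_user_id]

# Sorted by user_id: each block covers a narrow, non-overlapping range.
sorted_blocks = [Block(1, 100), Block(101, 200), Block(201, 300)]
# Unsorted: every block spans almost the whole range, so nothing can be skipped.
unsorted_blocks = [Block(1, 290), Block(5, 300), Block(2, 280)]

print(len(blocks_to_scan(sorted_blocks, 150)))    # 1
print(len(blocks_to_scan(unsorted_blocks, 150)))  # 3
```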

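**3. Restrict the Scan by Date**

Redshift tables have no native partitions. If the table is very large, filter on `event_time` (or split history into time-based tables) so that only recent data is scanned:

```text
-- Scan only recent events
WHERE event_time >= DATEADD(day, -30, GETDATE())
```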

**4. Materialized View**

Create and maintain latest events:

```text
CREATE MATERIALIZED VIEW latest_user_events AS
SELECT 
    event_id,
    user_id,
    event_type,
    event_time
FROM (
    SELECT 
        event_id,
        user_id,
        event_type,
        event_time,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) as rn
    FROM events
) ranked
WHERE rn = 1;
```

**Benefits:**

- Pre-computed results
- Query reads view instead of scanning full table
- Refresh periodically or on demand (window functions rule out incremental refresh, so each refresh recomputes the view)

**5. Use Result Caching**

Result caching is enabled by default in Redshift:

- Identical queries return cached results while the underlying data is unchanged
- Dramatically faster for repeated queries
- Automatic, no code changes needed

**6. Column Compression**

Ensure proper encoding:

```text
ANALYZE COMPRESSION events;
```

Then apply recommended encodings:

```text
ALTER TABLE events ALTER COLUMN event_type ENCODE LZO;
ALTER TABLE events ALTER COLUMN event_time ENCODE AZ64;
```

**7. Query Pattern Optimization**

If only recent events are needed, add a time filter (users with no events inside the window drop out of the result):

```text
SELECT 
    event_id,
    user_id,
    event_type,
    event_time
FROM (
    SELECT 
        event_id,
        user_id,
        event_type,
        event_time,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) as rn
    FROM events
    WHERE event_time >= DATEADD(day, -90, GETDATE())  -- Only last 90 days
) ranked
WHERE rn = 1;
```

**Performance Comparison (illustrative figures):**

**Before Optimization:**

- Full table scan: 100M rows
- Execution time: 45 seconds
- Data shuffling: 20 GB transferred

**After Optimization:**

- Distributed by user_id, sorted by (user_id, event_time)
- Zone maps eliminate 80% of blocks
- Execution time: 5 seconds
- No data shuffling