# Understanding the JSON Antipattern with an Example

<div style="background-color: #f8f9fa; border: 1px solid #e9ecef; border-radius: 8px; padding: 10px; margin: 10px;">
<strong>📋 Workshop Contents</strong>
<ul style="line-height: 1.2;">
<li><a href="#1-Initial-Setup">1. Initial Setup</a>
  <ul>
    <li><a href="#Creating-the-Table-Structure">Creating the Table Structure</a></li>
    <li><a href="#Whats-Happening-Here">What's Happening Here?</a></li>
    <li><a href="#GIN-Generalized-Inverted-Index">GIN (Generalized Inverted Index)</a></li>
    <li><a href="#Field-Specific-Indexing">Field-Specific Indexing</a></li>
    <li><a href="#Create-Supporting-Functions">Create Supporting Functions</a></li>
  </ul>
</li>
<li><a href="#2-Data-Generation">2. Data Generation</a>
  <ul>
    <li><a href="#Executing-the-Data-Generation-Procedure">Executing the Data Generation Procedure</a></li>
    <li><a href="#Data-Visualization">Data Visualization</a></li>
  </ul>
</li>
<li><a href="#3-Performance-Analysis">3. Performance Analysis</a></li>
<li><a href="#Conclusion-The-Case-for-Purpose-Built-Databases">Conclusion: The Case for Purpose-Built Databases</a></li>
<li><a href="#Next-Steps">Next Steps</a></li>
<li><a href="#Additional-Resources">Additional Resources 📚</a></li>
</ul>
</div>

## 1. Initial Setup

Let's walk through an example to demonstrate how storing and querying JSON data in PostgreSQL can lead to performance challenges. We'll create a table that stores customer activity data as JSON, populate it with test data, and analyze query performance.

**Important:** Open your preferred SQL editor and connect to a PostgreSQL database (RDS/Aurora) where you have permission to run the code examples below.

⚠️ **Precaution:** Do not use production databases for this exercise. Use development databases and schemas that you are responsible for to avoid any disruptions to live systems.

### Creating the Table Structure
First, let's create a table that stores customer activities as JSON documents:

The table has a primary key, a customer identifier, a JSONB column for the activity document, and a creation timestamp. Two indexes support the queries that follow: a GIN index on the whole document and an expression index on the `activity_type` field.

In [None]:
CREATE TABLE customer_activities (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER,
    activity_data JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create indexes for better performance
CREATE INDEX idx_activity_data_gin ON customer_activities USING GIN (activity_data);
CREATE INDEX idx_activity_type ON customer_activities ((activity_data->>'activity_type'));

<img src="../images/7.2-postgresql-table-structure.png" alt="pg-table" width="500"/>

![JSON Document Structure](../images/7.2-json-data-example.png)

### What's Happening Here?

### Table Structure
| Column | Type | Description |
|--------|------|-------------|
| `id` | SERIAL | Auto-incrementing primary key |
| `customer_id` | INTEGER | Field to identify customers |
| `activity_data` | JSONB | Field to store activity information |
| `created_at` | TIMESTAMP | When the record was created |
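
To make the column roles concrete, here is a minimal sketch that inserts one document and reads fields back out of it. The customer ID `42` and the document values are hypothetical, and the whole thing runs inside a transaction that is rolled back, so no row is kept.

```sql
-- Illustrative only: rolled back at the end, so the table stays unchanged.
BEGIN;

INSERT INTO customer_activities (customer_id, activity_data)
VALUES (
    42,  -- hypothetical customer id
    '{"activity_type": "login", "device": "mobile", "value": 12.5}'::jsonb
);

SELECT
    id,                                                   -- assigned by SERIAL
    customer_id,
    activity_data->>'device'           AS device,         -- ->> returns text
    activity_data->'value'             AS value_jsonb,    -- ->  returns jsonb
    (activity_data->>'value')::numeric AS value_numeric,  -- cast text to a number
    created_at                                            -- filled by the column default
FROM customer_activities
WHERE customer_id = 42;

ROLLBACK;
```

Note that `id` and `created_at` are filled in automatically, while every field inside `activity_data` has to be extracted (and, for numbers, cast) at query time.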

### GIN (Generalized Inverted Index)

GIN indexes are ideal for JSONB columns containing complex JSON documents that need to be queried.

In [None]:
-- Already created above; IF NOT EXISTS keeps this cell safe to re-run
CREATE INDEX IF NOT EXISTS idx_activity_data_gin ON customer_activities USING GIN (activity_data);

This index type:
- Enables efficient searches for keys and key/value pairs within the document
- Supports the containment operator (`@>`) and the key-existence operators (`?`, `?|`, `?&`)
- Works well with JSONB but adds overhead to inserts and updates
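
As a rough illustration of which predicates the GIN index can serve, the sketch below compares three filters with `EXPLAIN`. The plans you see depend on table size and statistics; on an empty or very small table the planner may choose a sequential scan for all three.

```sql
-- Containment: rows whose document includes this key/value pair.
-- The default GIN operator class for jsonb supports @>.
EXPLAIN
SELECT count(*)
FROM customer_activities
WHERE activity_data @> '{"device": "mobile"}';

-- Key existence: rows whose document has a top-level "device" key.
EXPLAIN
SELECT count(*)
FROM customer_activities
WHERE activity_data ? 'device';

-- Text extraction with = is NOT an operator the GIN index supports,
-- so this predicate cannot use idx_activity_data_gin.
EXPLAIN
SELECT count(*)
FROM customer_activities
WHERE activity_data->>'device' = 'mobile';
```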

### Field-Specific Indexing

For frequently queried specific JSON fields, create targeted indexes:

In [None]:
-- Already created above; IF NOT EXISTS keeps this cell safe to re-run
CREATE INDEX IF NOT EXISTS idx_activity_type ON customer_activities ((activity_data->>'activity_type'));

Benefits:
- Faster queries on specific JSON fields
- Smaller index size compared to full GIN indexes
- Better performance for equality and range queries on the indexed field
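
An expression index is only considered when the query uses the same expression that was indexed. The following sketch contrasts a predicate that matches `idx_activity_type` with one that does not. Whether the planner actually uses the index depends on how selective the value is.

```sql
-- Matches the indexed expression (activity_data->>'activity_type'),
-- so the planner can consider idx_activity_type.
EXPLAIN
SELECT count(*)
FROM customer_activities
WHERE activity_data->>'activity_type' = 'purchase';

-- Uses -> (jsonb result) instead of ->> (text result): a different
-- expression, so idx_activity_type does not apply here.
EXPLAIN
SELECT count(*)
FROM customer_activities
WHERE activity_data->'activity_type' = '"purchase"'::jsonb;
```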

⚠️ **Warning**: While PostgreSQL's JSON capabilities are powerful, using JSON/JSONB for large-scale document storage may impact performance.

### Create Supporting Functions
Now we'll create a helper function and a procedure that together generate a large volume of sample JSON records to simulate a production environment. This will help us demonstrate how performance degrades as data volume increases.

The function below builds one JSON activity document with a randomly chosen activity type and device, a random value between 0 and 1000, and the timestamp returned by `now()`. It uses PostgreSQL's `jsonb_build_object`, so every document has the same four keys.

In [None]:
CREATE OR REPLACE FUNCTION generate_random_activity_data()
RETURNS JSONB AS $$
DECLARE
    activities TEXT[] := ARRAY['login', 'purchase', 'view_product'];
    devices TEXT[] := ARRAY['mobile', 'desktop', 'tablet'];
BEGIN
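    -- random() is in [0, 1), so floor(random() * 3 + 1) yields 1, 2, or 3 (arrays are 1-based)
    RETURN jsonb_build_object(
        'activity_type', activities[floor(random() * 3 + 1)::int],
        'device', devices[floor(random() * 3 + 1)::int],
        'timestamp', now(),  -- transaction start time: identical for all rows in one transaction
        'value', random() * 1000
    );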
END;
$$ LANGUAGE plpgsql;
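
Before generating data in bulk, it can help to look at a few documents the function produces. This is a small sketch using the built-in `jsonb_pretty` and `generate_series`. The sample size of 1000 in the second query is an arbitrary choice.

```sql
-- Print three sample documents in a readable layout.
SELECT jsonb_pretty(generate_random_activity_data()) AS sample_document
FROM generate_series(1, 3);

-- Rough check that the three activity types come out about equally often.
SELECT doc->>'activity_type' AS activity_type,
       count(*)              AS occurrences
FROM (
    SELECT generate_random_activity_data() AS doc
    FROM generate_series(1, 1000)
) AS samples
GROUP BY 1
ORDER BY 1;
```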

The procedure below inserts the requested number of rows in a loop, assigning each one a random customer ID from 0 to 999 (1,000 distinct customers) and a freshly generated activity document.

In [None]:
-- Create procedure for data generation
CREATE OR REPLACE PROCEDURE generate_test_data(p_records INTEGER)
LANGUAGE plpgsql AS $$
BEGIN
    FOR i IN 1..p_records LOOP
        INSERT INTO customer_activities (customer_id, activity_data)
        VALUES (
            floor(random() * 1000),
            generate_random_activity_data()
        );
    END LOOP;
END;
$$;
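
A quick smoke test can confirm the procedure works before committing to a long run. The sketch below calls it for ten rows inside a transaction and rolls back afterwards, so the table is left empty. The sequence behind `id` still advances, which is harmless here.

```sql
BEGIN;

-- Insert a handful of rows to confirm the procedure runs.
CALL generate_test_data(10);

-- Inspect what was generated.
SELECT id, customer_id, activity_data
FROM customer_activities
ORDER BY id DESC
LIMIT 10;

-- Discard the test rows.
ROLLBACK;
```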

## 2. Data Generation

### Executing the Data Generation Procedure

With the function and procedure in place, we can now call the procedure to populate the table. We'll insert one million records to simulate a dataset large enough for performance issues to emerge. A million records is modest compared to many production systems, but it is enough to highlight performance characteristics. Depending on your hardware, this operation may take several minutes to complete.

In [None]:
-- Execute procedure to insert 1 million records
CALL generate_test_data(1000000);
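
Once the load finishes, a few checks help confirm the dataset looks as expected. The sketch below counts rows, shows the spread across activity types and devices, reports table and index sizes, and refreshes planner statistics so later `EXPLAIN` output reflects the loaded data.

```sql
-- Total number of rows loaded.
SELECT count(*) AS total_rows
FROM customer_activities;

-- Distribution across activity types and devices (expect nine groups).
SELECT activity_data->>'activity_type' AS activity_type,
       activity_data->>'device'        AS device,
       count(*)                        AS row_count
FROM customer_activities
GROUP BY 1, 2
ORDER BY 1, 2;

-- On-disk size of the table and of each secondary index.
SELECT pg_size_pretty(pg_table_size('customer_activities'))      AS table_size,
       pg_size_pretty(pg_relation_size('idx_activity_data_gin')) AS gin_index_size,
       pg_size_pretty(pg_relation_size('idx_activity_type'))     AS expression_index_size;

-- Refresh planner statistics after the bulk load.
ANALYZE customer_activities;
```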

### Data Visualization

Below is a screenshot showing how our test data appears in PostgreSQL. Notice how the `activity_data` column stores JSON documents within a traditional relational table structure:

<img src="../images/7.2-json-data-example.png" alt="JSON Document Structure" width="1000"/>

As you can see, each row contains a JSON document with multiple fields. While PostgreSQL displays this data neatly in the query results, internally it processes this semi-structured data differently than traditional relational columns: each field must be extracted from the document at query time, which adds overhead.

## 3. Performance Analysis

Now that we have a substantial dataset, let's run some analytical queries to demonstrate how PostgreSQL handles complex JSON data operations. As the complexity of queries increases, we'll see performance degradation that would be less pronounced in purpose-built document databases.

### Query Example: User Activity Analysis by Device Type

The following query analyzes customer activities by finding all mobile device activities, grouping them by customer and activity type, and counting occurrences. This represents a common analytical scenario in real-world applications:

`EXPLAIN ANALYZE` runs the query and reports the chosen plan, the estimated costs, the actual execution time, and whether any index was used. Note that the filter on `activity_data->>'device'` matches neither of the indexes created earlier, so expect a sequential (possibly parallel) scan of the whole table.

In [None]:
-- Example of slow-performing query with execution plan
EXPLAIN ANALYZE
SELECT 
    customer_id,
    activity_data->>'activity_type' as activity_type,
    COUNT(*) as activity_count
FROM customer_activities
WHERE activity_data->>'device' = 'mobile'
GROUP BY customer_id, activity_data->>'activity_type';
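
For comparison, here is a sketch of the same aggregation with the filter written as a containment test, which is a form the GIN index supports. This is an experiment rather than a guaranteed fix: roughly a third of the rows match `mobile`, so the planner may still prefer a sequential scan, and the timings will vary with hardware and configuration.

```sql
-- Same grouping as above, but filtering with @> instead of ->> ... =
-- BUFFERS adds page-access counts to the plan output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    customer_id,
    activity_data->>'activity_type' AS activity_type,
    COUNT(*) AS activity_count
FROM customer_activities
WHERE activity_data @> '{"device": "mobile"}'
GROUP BY customer_id, activity_data->>'activity_type';
```

Comparing the two plans side by side shows how much of the cost comes from the filter and how much from extracting `activity_type` for every matching row during grouping.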

## Conclusion: The Case for Purpose-Built Databases

As our example demonstrates, while PostgreSQL can store and query JSON data, this approach shows clear limitations when dealing with:

- Large volumes of document data
- Complex nested JSON structures
- Analytical queries with multiple JSON operations
- High-throughput applications

The performance issues we've observed aren't due to PostgreSQL being a poor database system—quite the contrary. The challenge stems from using a relational database for workloads it wasn't optimized to handle. This is the essence of the antipattern we're addressing.

### Looking Forward: Choosing the Right Database

In the next section, we'll explore how to select appropriate purpose-built databases for different data workloads. We'll cover migration strategies, performance comparisons, and architectural best practices to help you implement a multi-database strategy that aligns with modern application needs.

> 💡 **Key Takeaway**: The "one database for everything" approach often leads to performance and scalability issues. Modern applications benefit from using the right database for the right workload, even if it means managing multiple database systems.

Stay tuned as we dive deeper into database selection criteria and migration pathways to move from antipatterns to optimized architectures.

## Next Steps

Now that you've experienced the JSON antipattern firsthand, proceed to explore purpose-built database solutions:

- **[7.3 Choosing the Right Database for Your Workload](../7.3_Migration-Strategies/README.md)** - Learn database selection criteria and decision frameworks
- **[7.3 Migration Strategies: PostgreSQL to DocumentDB](../7.3_Migration-Strategies/migrate-pg-docdb.ipynb)** - Learn practical migration approaches and implementation strategies

## Additional Resources 📚

### PostgreSQL JSON Features
- [PostgreSQL JSON Types](https://www.postgresql.org/docs/current/datatype-json.html)
- [JSON Functions and Operators](https://www.postgresql.org/docs/current/functions-json.html)
- [GIN Indexes for JSON](https://www.postgresql.org/docs/current/gin-intro.html)

### Purpose-Built Databases
- [Amazon DocumentDB](https://docs.aws.amazon.com/documentdb/)
- [Amazon DynamoDB](https://docs.aws.amazon.com/dynamodb/)
- [Database Selection Guide](https://aws.amazon.com/products/databases/)

### Performance & Migration
- [Database Migration Best Practices](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_BestPractices.html)
- [Performance Tuning PostgreSQL](https://wiki.postgresql.org/wiki/Performance_Optimization)
- [AWS Database Migration Service](https://docs.aws.amazon.com/dms/)