# Synthetic Campaign Data Generator

This notebook demonstrates how to generate realistic synthetic advertising campaign data with the `dbldatagen` library. The generated data mimics a real campaign schema, using weighted categorical values and fixed-length numeric identifiers.

## What this notebook does:
1. **Setup**: Installs required dependencies and imports libraries
2. **Reference**: Displays an image showing the target data schema
3. **Generation**: Creates 10,000 rows of synthetic campaign data with 24 columns
4. **Verification**: Checks the row count and schema of the generated data
5. **Storage**: Writes data to a Unity Catalog volume as Parquet files and reads it back to confirm the write
6. **Alternative Storage**: Demonstrates saving as Delta tables

## Use Cases:
- Testing ETL pipelines with realistic data
- Demo environments without exposing real customer data
- Performance testing with large datasets
- Development and staging environments

## 1. Environment Setup

First, we'll install the required `dbldatagen` library if it's not already available:

In [None]:
# Install dbldatagen if not already installed
%pip install dbldatagen

Now we'll import the necessary libraries for data generation and Spark operations:

## 2. Schema Reference

Display the reference image that shows the target data schema we want to replicate:

## 3. Synthetic Data Generation

Now we'll create a comprehensive data generator specification that produces realistic campaign data. This includes:

- **Campaign identifiers**: Account IDs, campaign IDs, creative IDs (15- or 20-digit numeric strings)
- **Campaign metadata**: Types, reach estimates, run status, bid types
- **Timestamps**: Creation and update times (Unix epoch seconds)
- **Configuration flags**: Donation settings, duplication scenarios
- **Creative properties**: Media types, backend creative types, locations
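To make the timestamp range concrete, the sketch below converts the epoch bounds used for `time_created` and `time_updated` in this notebook's generator spec into UTC datetimes. The helper name `epoch_to_utc` is hypothetical and only serves this illustration.

```python
from datetime import datetime, timezone

# Bounds used for time_created and time_updated (Unix epoch seconds).
MIN_EPOCH = 1_750_000_000
MAX_EPOCH = 1_760_000_000


def epoch_to_utc(seconds: int) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


window_start = epoch_to_utc(MIN_EPOCH)
window_end = epoch_to_utc(MAX_EPOCH)

print(window_start.isoformat())  # 2025-06-15T15:06:40+00:00
print(window_end.isoformat())    # 2025-10-09T08:53:20+00:00
```

Both columns draw from the same window independently, so a generated row can have `time_updated` earlier than `time_created`. Downstream tests that rely on that ordering should account for it.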

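Columns such as `type`, `run_status`, `ad_duplication_scenario`, and `is_donation_enabled` use weighted value lists so that some values occur more often than others. The remaining categorical columns pick values at random without weights.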
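As a rough guide to what the weights mean, the sketch below turns a weight list into expected row shares. It assumes the weights act as relative frequencies, and `weights_to_shares` is a hypothetical helper rather than part of `dbldatagen`.

```python
def weights_to_shares(values, weights):
    """Map each value to its expected fraction of rows under relative weights."""
    total = sum(weights)
    return {value: weight / total for value, weight in zip(values, weights)}


# Values and weights of the `type` column in this notebook's generator spec.
type_shares = weights_to_shares(["33", "7", "12"], [5, 3, 2])
print(type_shares)  # {'33': 0.5, '7': 0.3, '12': 0.2}

# Expected row counts for the 10,000-row dataset.
expected_rows = {value: round(share * 10_000) for value, share in type_shares.items()}
print(expected_rows)  # {'33': 5000, '7': 3000, '12': 2000}
```

Observed counts in the generated data will vary around these figures because the values are sampled at random.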

## 4. Data Quality Verification

Let's examine the generated data to ensure it meets our expectations:
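Beyond row counts and schema, it can help to confirm that identifier columns have the intended shape. The following is a minimal sketch, assuming the 15- and 20-digit formats described for this dataset. `ID_PATTERNS` and `find_malformed_ids` are hypothetical names, and only a few of the ID columns are listed.

```python
import re

# Expected ID shapes, assumed from the column descriptions in this notebook.
ID_PATTERNS = {
    "account_id": re.compile(r"\d{15}"),
    "creative_id": re.compile(r"\d{15}"),
    "parent_campaign_id": re.compile(r"\d{20}"),
    "delivery_id": re.compile(r"\d{20}"),
}


def find_malformed_ids(row):
    """Return the names of ID columns whose value does not match its pattern."""
    return [
        name
        for name, pattern in ID_PATTERNS.items()
        if not pattern.fullmatch(str(row.get(name, "")))
    ]


sample_row = {
    "account_id": "123456789012345",
    "creative_id": "543210987654321",
    "parent_campaign_id": "12345678901234567890",
    "delivery_id": "d{20}",  # deliberately malformed value
}
print(find_malformed_ids(sample_row))  # ['delivery_id']
```

In the notebook, the same function could be applied to a small sample, for example `[find_malformed_ids(r.asDict()) for r in df_synthetic.limit(100).collect()]`.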

## 5. Unity Catalog Volume Storage

Configure the Unity Catalog destination where we'll store the synthetic data. Update these paths according to your environment:
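Unity Catalog volume paths follow the pattern `/Volumes/<catalog>/<schema>/<volume>/<path>`. The sketch below assembles such a path and rejects empty or slash-containing parts. `build_volume_path` is a hypothetical helper, and the validation rules are an assumption rather than the full Unity Catalog naming rules.

```python
from pathlib import PurePosixPath


def build_volume_path(catalog, schema, volume, *subdirs):
    """Build a /Volumes/<catalog>/<schema>/<volume>/<subdirs...> path string."""
    for part in (catalog, schema, volume, *subdirs):
        # A slash inside a part would silently change the directory layout.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *subdirs))


print(build_volume_path("main", "default", "synthetic_data", "campaign_data"))
# /Volumes/main/default/synthetic_data/campaign_data
```

Failing early on a blank catalog, schema, or volume name avoids writing to an unintended location.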

Write the synthetic data to Unity Catalog as Parquet files (efficient columnar format):

## 6. Alternative Storage: Delta Tables

Delta Lake stores data as Parquet files plus a transaction log, which adds:
- **ACID transactions**: Readers and writers see a consistent snapshot of the table
- **Time travel**: Query earlier versions of the table
- **Schema evolution**: Handle schema changes, such as added columns, gracefully
- **Query performance**: File-level statistics let queries skip files that cannot match a filter
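To illustrate time travel, the sketch below builds the two Delta SQL statements typically used to list a table's versions and to read one of them. The helper names are hypothetical, and the table name is assumed to be a trusted three-part Unity Catalog identifier.

```python
def history_query(table_name):
    """SQL that lists the versions recorded in the table's transaction log."""
    return f"DESCRIBE HISTORY {table_name}"


def version_query(table_name, version):
    """SQL that reads the table as of a specific version number."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"SELECT * FROM {table_name} VERSION AS OF {version}"


example_table = "main.default.campaign_synthetic_data"
print(history_query(example_table))
# DESCRIBE HISTORY main.default.campaign_synthetic_data
print(version_query(example_table, 0))
# SELECT * FROM main.default.campaign_synthetic_data VERSION AS OF 0
```

In a Databricks notebook, the resulting strings can be passed to `spark.sql(...)` once the Delta table exists.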

This approach is recommended for production workloads:

In [None]:
import dbldatagen as dg
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType, TimestampType
from datetime import datetime
import json

Create a managed Delta table in Unity Catalog:

Query the Delta table to verify it was created successfully:

In [None]:
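# Read the image file (for reference).
# NOTE: this is a local workstation path; update it to wherever the reference
# image is stored in your environment.
image_path = '/Users/ashwin.srikant/Downloads/554207516_1143861547080971_5209848696419928616_n (2).png'

# Import IPython's display under an alias so it does not shadow the notebook's
# built-in display(), which later cells use to render DataFrames.
from IPython.display import Image, display as ipy_display

# Display the image
ipy_display(Image(filename=image_path))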

In [None]:
# Generate synthetic campaign data based on the schema seen in the image
num_rows = 10000
seed = 42

# dbldatagen templates are not regular expressions. An unescaped `d` emits one
# random digit, and a backslash makes the next character literal, so r"\d{20}"
# would emit the text "d{20}" instead of 20 digits.
digits_15 = "d" * 15
digits_20 = "d" * 20

# Literal prefix "AJX7_VksS" (A, X, and k are escaped because they are template
# characters), then 10 characters alternating `k` (lowercase alphanumeric) and
# `K` (uppercase alphanumeric) to approximate the intended [A-Za-z0-9]{10}.
request_id_template = r"\AJ\X7_V\ksS" + "kK" * 5

# Create synthetic data generator.
# randomSeedMethod and randomSeed are the documented DataGenerator argument names.
data_spec = (
    dg.DataGenerator(spark, name="campaign_data", rows=num_rows,
                     randomSeedMethod="hash_fieldname", randomSeed=seed, random=True)
    # Weighted categorical columns: larger weights make a value more frequent
    .withColumn("type", "string", values=["33", "7", "12"], weights=[5, 3, 2])
    .withColumn("campaign_reach_estimate", "integer", minValue=0, maxValue=1000000)
    .withColumn("ad_duplication_scenario", "integer", values=[0, 1], weights=[7, 3])
    .withColumn("bid_type", "string", values=["7", "3", "5"], random=True)
    # Unix epoch seconds; the two timestamps are generated independently
    .withColumn("time_updated", "long", minValue=1750000000, maxValue=1760000000)
    .withColumn("time_created", "long", minValue=1750000000, maxValue=1760000000)
    .withColumn("parent_campaign_id", "string", template=digits_20, random=True)
    .withColumn("account_id", "string", template=digits_15, random=True)
    .withColumn("oid", "string", template=digits_20, random=True)
    .withColumn("account_group_id", "string", template=digits_20, random=True)
    .withColumn("creative_media_type", "string", values=["3", "1", "2"], random=True)
    .withColumn("creative_id", "string", template=digits_15, random=True)
    .withColumn("account_admarket_id", "string", template=digits_20, random=True)
    .withColumn("is_donation_enabled", "string", values=["0", "1"], weights=[8, 2])
    .withColumn("www_request_id", "string", template=request_id_template, random=True)
    .withColumn("run_status", "string", values=["17", "8", "4"], weights=[5, 3, 2])
    .withColumn("audit_version", "string", values=["4", "5", "3"], random=True)
    .withColumn("target_spec_id", "string", template=digits_20, random=True)
    .withColumn("backend_creative_type", "string", values=["0", "1", "2"], random=True)
    .withColumn("location", "string", values=["3", "1", "2", "4"], random=True)
    .withColumn("delivery_id", "string", template=digits_20, random=True)
    .withColumn("creator_id", "string", template=digits_15, random=True)
    .withColumn("parent_campaign_group_id", "string", template=digits_20, random=True)
    .withColumn("parent_adgroup_id", "string", template=digits_20, random=True)
)

# Build the dataframe
df_synthetic = data_spec.build()

# Show sample data
display(df_synthetic.limit(10))

In [None]:
# Show data statistics
print(f"Total rows generated: {df_synthetic.count()}")
df_synthetic.printSchema()

In [None]:
# Define Unity Catalog volume path
# Update these values according to your Unity Catalog setup
catalog_name = "main"  # Update with your catalog name
schema_name = "default"  # Update with your schema name
volume_name = "synthetic_data"  # Update with your volume name

# Full volume path
volume_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/campaign_data"

print(f"Writing data to: {volume_path}")

In [None]:
# Write data to Unity Catalog volume as parquet
df_synthetic.write \
    .format("parquet") \
    .mode("overwrite") \
    .save(volume_path)

print(f"Successfully wrote {num_rows} rows to {volume_path}")

In [None]:
# Verify the data was written correctly
df_read = spark.read.format("parquet").load(volume_path)
print(f"Rows read from volume: {df_read.count()}")
display(df_read.limit(5))

## Optional: Write as Delta Table

You can also write the data as a managed Delta table in Unity Catalog, which adds ACID transactions, time travel, and schema evolution on top of Parquet storage.

In [None]:
# Alternative: Write as a Delta table in Unity Catalog
table_name = f"{catalog_name}.{schema_name}.campaign_synthetic_data"

df_synthetic.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable(table_name)

print(f"Successfully wrote data to table: {table_name}")

In [None]:
# Query the table
df_table = spark.table(table_name)
display(df_table.limit(10))