# Synthetic Marketing Data Generation
**Project:** Marketing BI Dashboard  
**Role:** Data Engineering & ETL Simulation  
**Output:** Star Schema (Fact + Dimensions)

This notebook generates a realistic dataset for marketing analytics. It simulates daily performance across Programmatic, Paid Search, Paid Social, and Organic channels, applying weekday/weekend seasonality and two injected "insight spikes" so the dashboard has known patterns to surface.

In [6]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os

# CONFIGURATION
# Set the output directory here. Use '.' for current directory.
OUTPUT_DIR = '../docs' 

print("Libraries loaded. Ready to generate schema.")

Libraries loaded. Ready to generate schema.


## 1. Generate Dimensions (Metadata)
We establish the "Nouns" of our analysis: **Time, Source, and Campaign**. 
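* **Dim_Date:** A continuous 15-month range (1 Jan 2023 to 31 Mar 2024, 456 days).
* **Dim_Source:** 8 unique platforms mapped to 4 channels (Programmatic, Paid Search, Paid Social, Organic).
* **Dim_Campaign:** 8 campaigns with specific objectives, each split into 2 ad sets (16 rows).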
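
As a quick check on the date range, the sketch below rebuilds the same span with `pd.date_range` and counts the calendar months it touches. It assumes the range starts on 2023-01-01 and covers 456 consecutive days, as in the cell that follows; the variable names are illustrative only.

```python
import pandas as pd

# Assumption: 456 consecutive days starting 2023-01-01 (matches range(456) in the notebook).
date_index = pd.date_range(start="2023-01-01", periods=456, freq="D")

first_day = date_index[0]
last_day = date_index[-1]

# Count distinct calendar months covered by the range.
n_months = date_index.to_period("M").nunique()

print(first_day.date(), last_day.date(), n_months)
```

The 456 days break down as 365 days of 2023 plus 91 days for January through March 2024 (a leap year).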

In [7]:
print("Generating Dimensions...")

# --- DIM_DATE (15 Months: Jan 2023 - Mar 2024) ---
start_date = datetime(2023, 1, 1)
# 456 days = 365 (all of 2023) + 91 (Jan-Mar 2024, a leap year)
dates = [start_date + timedelta(days=x) for x in range(456)]
dim_date = pd.DataFrame({'date': dates})

# Creating an Integer Date Key (YYYYMMDD) is standard practice for Fact tables
dim_date['date_key'] = dim_date['date'].dt.strftime('%Y%m%d').astype(int)
dim_date['year'] = dim_date['date'].dt.year
dim_date['month'] = dim_date['date'].dt.month
dim_date['month_name'] = dim_date['date'].dt.strftime('%b')
dim_date['quarter'] = dim_date['date'].dt.quarter
dim_date['is_weekend'] = dim_date['date'].dt.dayofweek >= 5

# --- DIM_SOURCE (8 Sources, Mapped to Channels) ---
sources_data = [
    (1, 'Amazon Ad Server', 'Programmatic'), 
    (2, 'StackAdapt', 'Programmatic'), 
    (3, 'DV360', 'Programmatic'), 
    (4, 'Search Ads 360', 'Paid Search'), 
    (5, 'Bing Ads', 'Paid Search'), 
    (6, 'Facebook', 'Paid Social'), 
    (7, 'LinkedIn Ads', 'Paid Social'), 
    (8, 'Organic Search', 'Organic')
]
dim_source = pd.DataFrame(sources_data, columns=['source_id', 'source_name', 'channel'])

# --- DIM_CAMPAIGN (Mapped to Channel & Objective) ---
campaign_config = [
    ("Business-focused zero tolerance", "Programmatic", "Brand Awareness"), 
    ("Profound intangible policy", "Programmatic", "Brand Awareness"),
    ("Networked value-added time-frame", "Programmatic", "Consideration"), 
    ("Persistent 24/7 attitude", "Paid Social", "Lead Gen"), 
    ("Centralized modular throughput", "Paid Social", "Conversion"), 
    ("Integrated dedicated contingency", "Paid Search", "Conversion"), 
    ("Automated uniform software", "Paid Search", "Lead Gen"), 
    ("Cross-platform static hierarchy", "Organic", "Traffic")
]

campaign_rows = []
id_counter = 1

for name, scope, objective in campaign_config:
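    # Create 2 Ad Sets per Campaign to enable hierarchy testing.
    # Note: campaign_id increments per ad set, so it uniquely identifies
    # each (campaign, ad set) row rather than each campaign name.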
    for tier in ['_Tier1', '_Tier2']:
        campaign_rows.append({
            'campaign_id': id_counter,
            'campaign_name': name, 
            'ad_set_name': f"{name}{tier}", 
            'channel_scope': scope, 
            'objective': objective
        })
        id_counter += 1
dim_campaign = pd.DataFrame(campaign_rows)

Generating Dimensions...


## 2. Generate Fact Table
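We build the skeleton of the fact table in two steps: campaigns are first joined to sources on their channel, and the resulting pairs are then cross-joined with the date dimension. This yields 16,416 rows of daily data (36 ad set/source pairs × 456 days).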
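
The skeleton size can be worked out by hand from the dimension definitions above. The sketch below does that arithmetic; the per-channel counts are read off `campaign_config` and `sources_data`, and the dictionary and constant names are hypothetical.

```python
# Campaigns and sources per channel, counted from the dimension definitions.
campaigns_per_channel = {"Programmatic": 3, "Paid Search": 2, "Paid Social": 2, "Organic": 1}
sources_per_channel = {"Programmatic": 3, "Paid Search": 2, "Paid Social": 2, "Organic": 1}

AD_SETS_PER_CAMPAIGN = 2   # '_Tier1' and '_Tier2'
N_DAYS = 456               # length of Dim_Date

# Each ad set is paired with every source that shares its channel.
links = sum(
    campaigns_per_channel[ch] * AD_SETS_PER_CAMPAIGN * sources_per_channel[ch]
    for ch in campaigns_per_channel
)
expected_rows = links * N_DAYS

print(links, expected_rows)
```

Programmatic contributes 18 pairs, Paid Search and Paid Social 8 each, and Organic 2, for 36 pairs in total.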

In [8]:
# Step A: Link Campaigns to Sources (Logical Join)
# This prevents "Search" campaigns from erroneously appearing on "Facebook"
schema_link = dim_campaign.merge(dim_source, left_on='channel_scope', right_on='channel')

# Step B: Create the Skeleton (Cross Join with Date)
df_fact = schema_link.merge(dim_date[['date_key', 'is_weekend', 'year', 'month']], how='cross')

N = len(df_fact)
print(f"Fact Table Skeleton Created: {N} rows.")

Fact Table Skeleton Created: 16416 rows.
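
To see why the two-step join keeps campaigns on their own channels, here is a minimal sketch on toy frames. The frames `campaigns`, `sources`, and `days` are made up for illustration and are much smaller than the real dimensions; `how="cross"` assumes pandas 1.2 or later.

```python
import pandas as pd

# Toy dimensions (hypothetical subset of the real ones).
campaigns = pd.DataFrame({
    "campaign_id": [1, 2],
    "channel_scope": ["Paid Search", "Paid Social"],
})
sources = pd.DataFrame({
    "source_id": [4, 5, 6],
    "source_name": ["Search Ads 360", "Bing Ads", "Facebook"],
    "channel": ["Paid Search", "Paid Search", "Paid Social"],
})
days = pd.DataFrame({"date_key": [20230101, 20230102]})

# Step A: inner join on channel, so each campaign only meets sources in its own channel.
links = campaigns.merge(sources, left_on="channel_scope", right_on="channel")

# Step B: cross join repeats every campaign/source pair once per day.
skeleton = links.merge(days, how="cross")

print(links[["campaign_id", "source_name"]])
print(len(skeleton))
```

The Paid Search campaign pairs with Search Ads 360 and Bing Ads but never with Facebook, and each pair is then repeated once per day.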


## 3. Apply Vectorized Metric Logic
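Here we inject "realism" into the data. Rather than drawing every metric from one random distribution, we layer three rules:
1.  **Seasonality:** Weekend impressions are scaled by 0.7 and weekday impressions by 1.1.
2.  **Channel Distributions:** Each channel draws from its own ranges. "Search" has low volume with high CPC/CTR, "Programmatic" has high volume with low CTR/CPC, "Social" sits in between, and "Organic" has no cost.
3.  **Storytelling:** We inject two specific anomalies (Programmatic impressions triple in August 2023; Paid Search CPC drops 30% in December 2023) so the dashboard has insights to find.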
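
The seasonality and spike rules can be illustrated on a three-row toy frame before applying them to the full table. The frame and the fixed base of 10,000 impressions are assumptions made for this sketch; the multipliers (0.7, 1.1, 3.0) are the ones used in the cell below.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical): a March weekday, a March weekend day, an August weekday.
toy = pd.DataFrame({
    "is_weekend": [False, True, False],
    "year": [2023, 2023, 2023],
    "month": [3, 3, 8],
    "channel": ["Programmatic", "Programmatic", "Programmatic"],
})
base = np.full(len(toy), 10_000.0)  # fixed base so the multipliers are easy to see

# Rule 1: weekend dip / weekday lift.
weekend_factor = np.where(toy["is_weekend"], 0.7, 1.1)
imps = base * weekend_factor

# Rule 3: triple Programmatic impressions in August 2023.
spike = (
    (toy["year"] == 2023) & (toy["month"] == 8) & (toy["channel"] == "Programmatic")
).to_numpy()
imps[spike] *= 3.0

rounded = imps.round().astype(int).tolist()
print(rounded)
```

Because the multipliers compound, the August weekday row ends up at 3.3 times the base rather than exactly 3 times.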

In [9]:
# 1. Seasonality Mask (Lower on weekends)
seasonality = np.where(df_fact['is_weekend'], 0.7, 1.1)

# 2. Initialize Arrays
base_imps = np.zeros(N)
ctrs = np.zeros(N)
cpcs = np.zeros(N)

# 3. Apply Channel-Specific Distributions
# Programmatic (High Volume, Low CTR/CPC)
mask_prog = df_fact['channel'] == 'Programmatic'
s_prog = mask_prog.sum()
base_imps[mask_prog] = np.random.randint(5000, 15000, s_prog)
ctrs[mask_prog] = np.random.uniform(0.003, 0.007, s_prog) 
cpcs[mask_prog] = np.random.uniform(0.30, 0.90, s_prog)

# Search (Low Volume, High CTR/CPC)
mask_search = df_fact['channel'] == 'Paid Search'
s_search = mask_search.sum()
base_imps[mask_search] = np.random.randint(300, 1200, s_search)
ctrs[mask_search] = np.random.uniform(0.08, 0.12, s_search) 
cpcs[mask_search] = np.random.uniform(2.50, 6.00, s_search)

# Social (Mid Volume, Mid CTR/CPC)
mask_social = df_fact['channel'] == 'Paid Social'
s_social = mask_social.sum()
base_imps[mask_social] = np.random.randint(1000, 4000, s_social)
ctrs[mask_social] = np.random.uniform(0.015, 0.035, s_social) 
cpcs[mask_social] = np.random.uniform(1.50, 3.50, s_social)

# Organic (No Cost)
mask_org = df_fact['channel'] == 'Organic'
s_org = mask_org.sum()
base_imps[mask_org] = np.random.randint(1000, 3000, s_org)
ctrs[mask_org] = np.random.uniform(0.05, 0.08, s_org)
cpcs[mask_org] = 0.0

# 4. Inject Strategic Insights (The "Story" Layer)
final_imps = base_imps * seasonality

# INSIGHT 1: "The August Spike" - Programmatic impressions triple in Aug 2023
mask_spike = (df_fact['year'] == 2023) & (df_fact['month'] == 8) & (df_fact['channel'] == 'Programmatic')
final_imps[mask_spike] *= 3.0

# INSIGHT 2: "December Efficiency" - Search CPC drops by 30% in Dec 2023
mask_effic = (df_fact['year'] == 2023) & (df_fact['month'] == 12) & (df_fact['channel'] == 'Paid Search')
cpcs[mask_effic] *= 0.7

# 5. Final Calculations
df_fact['impressions'] = final_imps.astype(int)
df_fact['clicks'] = (df_fact['impressions'] * ctrs).astype(int)
df_fact['spend'] = (df_fact['clicks'] * cpcs).round(2)

# Conversion rate: drawn uniformly between 5% and 15% of clicks, per row
conv_rates = np.random.uniform(0.05, 0.15, N)
df_fact['conversions'] = (df_fact['clicks'] * conv_rates).astype(int)

# Video views (only populated for Programmatic and Paid Social; ~40% of impressions, +/-20%)
df_fact['video_views'] = 0
mask_video = df_fact['channel'].isin(['Programmatic', 'Paid Social'])
df_fact.loc[mask_video, 'video_views'] = (
    df_fact.loc[mask_video, 'impressions'] * 0.40 * np.random.uniform(0.8, 1.2, mask_video.sum())
).astype(int)

print("Metric Logic Applied.")

Metric Logic Applied.
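
Since the metrics are derived from one another, a few funnel invariants should hold on every row. The helper below is a hypothetical sketch, not part of the original notebook: it checks that clicks never exceed impressions, conversions never exceed clicks, and spend is never negative. It is shown on a small made-up frame, but could be called on `df_fact` in the same way.

```python
import pandas as pd

def check_fact_metrics(fact: pd.DataFrame) -> dict:
    """Return named boolean checks on the funnel metrics (hypothetical helper)."""
    return {
        "clicks_le_impressions": bool((fact["clicks"] <= fact["impressions"]).all()),
        "conversions_le_clicks": bool((fact["conversions"] <= fact["clicks"]).all()),
        "no_negative_spend": bool((fact["spend"] >= 0).all()),
    }

# Made-up sample rows using the same column names as the fact table.
sample = pd.DataFrame({
    "impressions": [1000, 500],
    "clicks": [50, 40],
    "conversions": [5, 4],
    "spend": [25.0, 0.0],
})

results = check_fact_metrics(sample)
print(results)
```

These invariants follow from the construction: CTR and conversion rate are both below 1, and CPC is never negative.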


## 4. Export Data
We select the relevant columns for our Fact table and export the Star Schema files.

In [10]:
# Select only keys and metrics for the Fact Table
fact_columns = ['date_key', 'source_id', 'campaign_id', 'impressions', 'clicks', 'spend', 'conversions', 'video_views']
fact_final = df_fact[fact_columns]

# Clean up dimensions (remove helper columns)
dim_campaign_final = dim_campaign.drop(columns=['channel_scope'])

# Export
# Create the output directory if it does not already exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

dim_date.to_csv(os.path.join(OUTPUT_DIR, 'dim_date.csv'), index=False)
dim_source.to_csv(os.path.join(OUTPUT_DIR, 'dim_source.csv'), index=False)
dim_campaign_final.to_csv(os.path.join(OUTPUT_DIR, 'dim_campaign.csv'), index=False)
fact_final.to_csv(os.path.join(OUTPUT_DIR, 'fact_performance.csv'), index=False)

print("SUCCESS: 4 Star Schema files generated!")
print(f"Files ready in: {os.path.abspath(OUTPUT_DIR)}")

SUCCESS: 4 Star Schema files generated!
Files ready in: d:\VSCode\synthetic_data_gen\docs
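
Before loading the files into a BI tool, it can help to confirm that every key in the fact table resolves to a dimension row. The sketch below shows one way to do that; `orphan_keys` is a hypothetical helper, and the toy frames stand in for the exported CSVs (which could be loaded with `pd.read_csv` instead).

```python
import pandas as pd

def orphan_keys(fact: pd.DataFrame, dim: pd.DataFrame, key: str) -> list:
    """Return fact key values that have no matching row in the dimension."""
    return sorted(set(fact[key].tolist()) - set(dim[key].tolist()))

# Toy stand-ins for dim_source.csv and fact_performance.csv (made-up rows).
dim_source_toy = pd.DataFrame({"source_id": [1, 2, 3]})
fact_ok = pd.DataFrame({"source_id": [1, 1, 3], "clicks": [10, 12, 7]})
fact_broken = pd.DataFrame({"source_id": [1, 9], "clicks": [10, 3]})

missing_ok = orphan_keys(fact_ok, dim_source_toy, "source_id")
missing_broken = orphan_keys(fact_broken, dim_source_toy, "source_id")

print(missing_ok, missing_broken)
```

An empty list means the relationship is safe to define in the dashboard model; the same check applies to `date_key` and `campaign_id`.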
