# Payer Medallion Data Pipeline
**Clean, Modular Healthcare Data Pipeline**

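This notebook builds a medallion architecture (Bronze → Silver → Gold) for healthcare payer data. Raw CSV files are loaded into bronze tables, cleaned into silver tables, and then joined and aggregated into gold tables. File extraction is delegated to a utility module under `src`.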


## Setup Parameters


In [5]:
# Setup catalog and database parameters
dbutils.widgets.text("catalog", "my_catalog", "Catalog")
dbutils.widgets.text("bronze_db", "payer_bronze", "Bronze DB")
dbutils.widgets.text("silver_db", "payer_silver", "Silver DB")
dbutils.widgets.text("gold_db", "payer_gold", "Gold DB")

catalog = dbutils.widgets.get("catalog")
bronze_db = dbutils.widgets.get("bronze_db")
silver_db = dbutils.widgets.get("silver_db")
gold_db = dbutils.widgets.get("gold_db")

print(f"Catalog: {catalog}")
print(f"Bronze DB: {bronze_db}")
print(f"Silver DB: {silver_db}")
print(f"Gold DB: {gold_db}")


Catalog: my_catalog
Bronze DB: payer_bronze
Silver DB: payer_silver
Gold DB: payer_gold
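
The widget values above are interpolated directly into SQL strings in later cells. Below is a minimal sketch of validating them first, assuming simple unquoted identifiers are sufficient. `validate_identifier` and `qualified` are hypothetical helpers, not part of the notebook's `src` utilities, and the pattern is deliberately stricter than the real catalog naming rules.

```python
import re

# Conservative pattern for unquoted identifiers: letters, digits, underscores,
# not starting with a digit. This is an assumption for illustration only.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_identifier(name: str, label: str) -> str:
    """Return the stripped name if it is a simple identifier, else raise ValueError."""
    cleaned = name.strip()
    if not _IDENTIFIER.fullmatch(cleaned):
        raise ValueError(f"{label} {name!r} is not a simple SQL identifier")
    return cleaned


def qualified(catalog: str, schema: str, table: str) -> str:
    """Build a three-level name such as my_catalog.payer_bronze.claims_raw."""
    parts = [
        validate_identifier(catalog, "catalog"),
        validate_identifier(schema, "schema"),
        validate_identifier(table, "table"),
    ]
    return ".".join(parts)
```

In the notebook this could wrap the widget reads, for example `catalog = validate_identifier(dbutils.widgets.get("catalog"), "catalog")`.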


## Initialize Catalogs and Schemas


In [6]:
# Create catalog and databases
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
spark.sql(f"USE CATALOG {catalog}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {bronze_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {silver_db}")
spark.sql(f"CREATE DATABASE IF NOT EXISTS {gold_db}")

print("✅ Catalog and databases initialized")


✅ Catalog and databases initialized


## Step 1: Extract Data Files
**File extraction with a single call to `extract_payer_data`**


In [7]:
# Import and execute file extraction with a single function call
import sys
import os

# Add src directory to Python path
src_path = os.path.join(os.getcwd(), 'src')
if src_path not in sys.path:
    sys.path.append(src_path)

# Import and execute the file extraction utility
from file_utils import extract_payer_data

extraction_results = extract_payer_data(
    spark=spark,
    catalog=catalog,
    bronze_db=bronze_db
)

# Simple success check
if extraction_results['success']:
    print("🎉 File extraction completed successfully!")
else:
    print("❌ File extraction failed. Check logs above.")


🚀 Starting payer data extraction...
📁 Creating volume directories...
✅ Created 6 directories

📦 Extracting files using UDF...


UDF Results:
Successfully extracted ZIP to temporary directory
Successfully copied claims.csv to /Volumes/my_catalog/payer_bronze/payer/files/claims/claims.csv
Successfully copied diagnoses.csv to /Volumes/my_catalog/payer_bronze/payer/files/diagnosis/diagnosis.csv
Successfully copied procedures.csv to /Volumes/my_catalog/payer_bronze/payer/files/procedures/procedures.csv
Successfully copied member.csv to /Volumes/my_catalog/payer_bronze/payer/files/members/members.csv
Successfully copied providers.csv to /Volumes/my_catalog/payer_bronze/payer/files/providers/providers.csv
🎉 File extraction completed successfully!
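
The source of `extract_payer_data` is not shown in this notebook. Judging only from the log output above, it unpacks a ZIP archive to a temporary directory and copies each CSV into its own folder under the volume. The following is a hypothetical sketch of that pattern using only the standard library. `extract_zip_to_dirs`, the `routing` argument, and the `copied` and `missing` keys are assumptions for illustration. The real utility also runs as a UDF, which this sketch does not attempt.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_zip_to_dirs(zip_path, target_root, routing):
    """Unpack zip_path and copy each routed member to target_root/<subdir>/<filename>.

    routing maps an archive member name to a (subdir, filename) pair.
    Returns {'success': bool, 'copied': [paths], 'missing': [member names]}.
    """
    target_root = Path(target_root)
    copied, missing = [], []
    with tempfile.TemporaryDirectory() as tmp:
        # Extract everything once, then copy only the members we route.
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        for member, (subdir, filename) in routing.items():
            source = Path(tmp) / member
            if not source.is_file():
                missing.append(member)
                continue
            destination = target_root / subdir / filename
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(source, destination)
            copied.append(str(destination))
    # Success means every routed member was found and copied.
    return {"success": not missing, "copied": copied, "missing": missing}
```

A routing entry such as `{"member.csv": ("members", "members.csv")}` would reproduce the rename visible in the log, where the archive member and the destination file name differ.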


## Step 2: Bronze Layer - Raw Data Ingestion
**Create bronze tables from extracted files**


In [9]:
print("Creating bronze tables using DBSQL COPY INTO...")


Creating bronze tables using DBSQL COPY INTO...


In [None]:
%sql
-- Create Claims Bronze Table
CREATE TABLE IF NOT EXISTS payer_bronze.claims_raw;
COPY INTO payer_bronze.claims_raw FROM '/Volumes/my_catalog/payer_bronze/payer/files/claims/claims.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');


| num_affected_rows | num_inserted_rows | num_skipped_corrupt_files |
|---|---|---|
| 8 | 8 | 0 |


In [12]:
%sql
-- Create Members Bronze Table
CREATE TABLE IF NOT EXISTS payer_bronze.members_raw;
COPY INTO payer_bronze.members_raw FROM '/Volumes/my_catalog/payer_bronze/payer/files/members/members.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');


| num_affected_rows | num_inserted_rows | num_skipped_corrupt_files |
|---|---|---|
| 15 | 15 | 0 |


In [13]:
%sql
-- Create Providers Bronze Table
CREATE TABLE IF NOT EXISTS payer_bronze.providers_raw;
COPY INTO payer_bronze.providers_raw FROM '/Volumes/my_catalog/payer_bronze/payer/files/providers/providers.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');


| num_affected_rows | num_inserted_rows | num_skipped_corrupt_files |
|---|---|---|
| 5 | 5 | 0 |


In [14]:
%sql
-- Create Diagnosis Bronze Table  
CREATE TABLE IF NOT EXISTS payer_bronze.diagnosis_raw;
COPY INTO payer_bronze.diagnosis_raw FROM '/Volumes/my_catalog/payer_bronze/payer/files/diagnosis/diagnosis.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');


| num_affected_rows | num_inserted_rows | num_skipped_corrupt_files |
|---|---|---|
| 9 | 9 | 0 |


In [15]:
%sql
-- Create Procedures Bronze Table
CREATE TABLE IF NOT EXISTS payer_bronze.procedures_raw;
COPY INTO payer_bronze.procedures_raw FROM '/Volumes/my_catalog/payer_bronze/payer/files/procedures/procedures.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')
COPY_OPTIONS ('mergeSchema' = 'true');


| num_affected_rows | num_inserted_rows | num_skipped_corrupt_files |
|---|---|---|
| 7 | 7 | 0 |
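
The five bronze cells differ only in table name and file path, and they hardcode `my_catalog` and `payer_bronze` even though both are notebook parameters. As a sketch, the same statements could be generated from those parameters. `BRONZE_SOURCES`, `copy_into_statements`, and `load_bronze` are hypothetical names, and the sketch assumes the `payer/files` volume layout shown in the extraction log.

```python
# Bronze table name -> file path relative to the volume's files directory,
# copied from the COPY INTO cells above.
BRONZE_SOURCES = {
    "claims_raw": "claims/claims.csv",
    "members_raw": "members/members.csv",
    "providers_raw": "providers/providers.csv",
    "diagnosis_raw": "diagnosis/diagnosis.csv",
    "procedures_raw": "procedures/procedures.csv",
}


def copy_into_statements(catalog, bronze_db, table, relative_path):
    """Return the CREATE TABLE and COPY INTO statements for one bronze table."""
    volume_root = f"/Volumes/{catalog}/{bronze_db}/payer/files"
    target = f"{catalog}.{bronze_db}.{table}"
    create = f"CREATE TABLE IF NOT EXISTS {target}"
    copy = (
        f"COPY INTO {target} FROM '{volume_root}/{relative_path}'\n"
        "FILEFORMAT = CSV\n"
        "FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true', 'delimiter' = ',')\n"
        "COPY_OPTIONS ('mergeSchema' = 'true')"
    )
    return [create, copy]


def load_bronze(run_sql, catalog, bronze_db):
    """Run both statements for every bronze table via the supplied callable."""
    for table, relative_path in BRONZE_SOURCES.items():
        for statement in copy_into_statements(catalog, bronze_db, table, relative_path):
            run_sql(statement)
```

In the notebook, `load_bronze(spark.sql, catalog, bronze_db)` would issue the same ten statements as the cells above. Passing the executor as an argument keeps the SQL generation testable without a Spark session.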


## Step 3: Silver Layer - Cleaned and Conformed Data
**Transform the bronze members, claims, and providers tables into clean, analytics-ready tables**


In [16]:
%sql
-- Clean and transform members data
-- DISTINCT applies to the whole selected row, not only to member_id
CREATE OR REPLACE TABLE payer_silver.members AS
SELECT DISTINCT
  CAST(member_id AS STRING) AS member_id,
  TRIM(first_name) AS first_name,
  TRIM(last_name) AS last_name,
  CAST(birth_date AS DATE) AS birth_date,
  gender,
  plan_id,
  CAST(effective_date AS DATE) AS effective_date
FROM payer_bronze.members_raw
WHERE member_id IS NOT NULL




In [17]:
%sql
-- Clean and transform claims data
-- DISTINCT applies to the whole selected row, not only to claim_id
CREATE OR REPLACE TABLE payer_silver.claims AS
SELECT DISTINCT
  claim_id,
  member_id,
  provider_id,
  CAST(claim_date AS DATE) AS claim_date,
  ROUND(total_charge, 2) AS total_charge,
  LOWER(claim_status) AS claim_status
FROM payer_bronze.claims_raw
WHERE claim_id IS NOT NULL AND total_charge > 0




In [18]:
%sql
-- Clean and transform providers data
-- DISTINCT applies to the whole selected row, not only to provider_id
CREATE OR REPLACE TABLE payer_silver.providers AS
SELECT DISTINCT
  provider_id,
  npi,
  provider_name,
  specialty,
  address,
  city,
  state
FROM payer_bronze.providers_raw
WHERE provider_id IS NOT NULL
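
The silver cells rely on `SELECT DISTINCT`, which removes rows that are identical in every selected column. Two rows that share a key but differ elsewhere both survive. The pure-Python sketch below illustrates the difference on made-up sample rows. `distinct_rows` and `duplicate_keys` are hypothetical helpers, and the sample values are not taken from the payer files.

```python
from collections import Counter


def distinct_rows(rows):
    """Row-level dedup, analogous to SELECT DISTINCT over all selected columns."""
    seen, result = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(row)
    return result


def duplicate_keys(rows, key):
    """Return the key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Illustrative rows only.
sample = [
    {"claim_id": "C1", "claim_status": "paid"},
    {"claim_id": "C1", "claim_status": "paid"},    # exact duplicate: removed
    {"claim_id": "C2", "claim_status": "paid"},
    {"claim_id": "C2", "claim_status": "denied"},  # same key, different row: kept
]

deduped = distinct_rows(sample)
repeated = duplicate_keys(deduped, "claim_id")
```

If key uniqueness matters downstream, a check along these lines (or an equivalent `GROUP BY ... HAVING COUNT(*) > 1` query) could run after each silver table is built.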




## Step 4: Gold Layer - Business-Ready Analytics Tables
**Create enriched, aggregated tables for analytics and reporting**


In [19]:
%sql
-- Create enriched claims table with member and provider details
CREATE OR REPLACE TABLE payer_gold.claims_enriched AS
SELECT
  c.claim_id,
  c.claim_date,
  c.total_charge,
  c.claim_status,
  m.member_id,
  m.first_name,
  m.last_name,
  m.gender,
  m.plan_id,
  p.provider_id,
  p.provider_name,
  p.specialty,
  p.city,
  p.state
FROM payer_silver.claims c
INNER JOIN payer_silver.members m ON c.member_id = m.member_id
INNER JOIN payer_silver.providers p ON c.provider_id = p.provider_id
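
Because both joins are `INNER JOIN`s, any claim whose `member_id` or `provider_id` has no match in the silver dimension tables is silently dropped from `claims_enriched`. The sketch below shows the equivalent anti-join logic in plain Python. `orphaned_claims` is a hypothetical helper and the sample rows are made up for illustration.

```python
def orphaned_claims(claims, member_ids, provider_ids):
    """Return the claims that an inner join on member_id and provider_id would drop."""
    members, providers = set(member_ids), set(provider_ids)
    return [
        claim
        for claim in claims
        if claim["member_id"] not in members or claim["provider_id"] not in providers
    ]


# Illustrative rows only.
claims = [
    {"claim_id": "C1", "member_id": "M1", "provider_id": "P1"},
    {"claim_id": "C2", "member_id": "M9", "provider_id": "P1"},  # unknown member
    {"claim_id": "C3", "member_id": "M1", "provider_id": None},  # missing provider
]

dropped = orphaned_claims(claims, member_ids=["M1"], provider_ids=["P1"])
dropped_ids = [claim["claim_id"] for claim in dropped]
```

A claim with a null key is also dropped, since null never matches in a join condition. Reporting these rows separately keeps the gold row count explainable.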




In [20]:
%sql
-- Create member summary table for analytics
CREATE OR REPLACE TABLE payer_gold.member_claim_summary AS
SELECT
  member_id,
  COUNT(DISTINCT claim_id) AS total_claims,
  SUM(total_charge) AS sum_claims,
  AVG(total_charge) AS avg_claim_amount,
  MAX(total_charge) AS max_claim,
  MIN(total_charge) AS min_claim
FROM payer_silver.claims
GROUP BY member_id




## Step 5: Verify Pipeline Results
**Quick validation of the medallion pipeline**


In [21]:
# Display pipeline summary
bronze_count = spark.sql(f"SELECT COUNT(*) as count FROM {bronze_db}.claims_raw").collect()[0]['count']
silver_count = spark.sql(f"SELECT COUNT(*) as count FROM {silver_db}.claims").collect()[0]['count']
gold_count = spark.sql(f"SELECT COUNT(*) as count FROM {gold_db}.claims_enriched").collect()[0]['count']

print("📊 MEDALLION PIPELINE SUMMARY")
print("=" * 40)
print(f"🥉 Bronze Claims: {bronze_count:,} records")
print(f"🥈 Silver Claims: {silver_count:,} records")
print(f"🥇 Gold Claims: {gold_count:,} records")
print("\n🎉 Medallion pipeline completed successfully!")


📊 MEDALLION PIPELINE SUMMARY
🥉 Bronze Claims: 8 records
🥈 Silver Claims: 8 records
🥇 Gold Claims: 8 records

🎉 Medallion pipeline completed successfully!
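
The summary above prints the three counts but does not compare them. As a sketch, the relationships implied by the silver and gold queries could be checked explicitly. `check_layer_counts` is a hypothetical helper, and its warnings are heuristics rather than proofs of a specific fault.

```python
def check_layer_counts(bronze, silver, gold):
    """Return a list of warnings; an empty list means no issue was detected."""
    warnings = []
    if bronze == 0:
        warnings.append("bronze is empty: nothing was ingested")
    if silver > bronze:
        # Silver is a filter plus DISTINCT over bronze, so it should never grow.
        warnings.append("silver has more rows than bronze")
    if gold > silver:
        # An inner join only adds rows when a join key repeats on the other side.
        warnings.append(
            "gold has more rows than silver: "
            "a join key is probably duplicated in members or providers"
        )
    if gold < silver:
        warnings.append(
            f"{silver - gold} silver claims are missing from gold: "
            "unmatched member_id or provider_id"
        )
    return warnings
```

With the counts from this run, `check_layer_counts(8, 8, 8)` returns an empty list. In the notebook the call would be `check_layer_counts(bronze_count, silver_count, gold_count)`.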


## Step 6: Data Analysis and Visualization
**Explore and analyze the gold layer data**


In [None]:
# Display the enriched claims table
display(spark.table("payer_gold.claims_enriched"))


### Analysis by Claim Status
**Analyze total charges by claim status**


In [None]:
display(spark.table("payer_gold.claims_enriched").groupBy("claim_status").sum("total_charge"))


### Analysis by Gender
**Count claims by member gender**


In [None]:
display(spark.table("payer_gold.claims_enriched").groupBy("gender").count())


### Time Series Analysis
**Total charges by claim date**


In [None]:
display(spark.table("payer_gold.claims_enriched").groupBy("claim_date").sum("total_charge").orderBy("claim_date"))


### Geographic Analysis
**Claims distribution by city**


In [None]:
display(spark.table("payer_gold.claims_enriched").groupBy("city").count())


### Charge Distribution Analysis
**Analyze the distribution of claim charges**


In [None]:
display(spark.table("payer_gold.claims_enriched").select("total_charge"))


### Date vs Charge Analysis
**Relationship between claim date and total charge**


In [None]:
display(spark.table("payer_gold.claims_enriched").select("claim_date", "total_charge"))


## AI/BI Integration

Intelligent analytics for everyone!

Databricks AI/BI is a new type of business intelligence product designed to provide a deep understanding of your data's semantics, enabling self-service data analysis for everyone in your organization. AI/BI is built on a compound AI system that draws insights from the full lifecycle of your data across the Databricks platform, including ETL pipelines, lineage, and other queries.

![AI/BI](https://www.databricks.com/sites/default/files/2025-05/hero-image-ai-bi-v2-2x.png)


## Genie - Talk with Your Data

Now everyone can get insights from data simply by asking questions in natural language.

![Genie](https://www.databricks.com/sites/default/files/2025-06/ai-bi-genie-hero.png)

### AI Assistant Prompts
Try asking the Databricks AI Assistant:
* What kinds of aggregations can I do with the table "payer_gold.claims_enriched"?
* What does the "spark.table" command do?
* Show me the top 5 most expensive claims
* Which provider specialty has the highest average claim amount?
